Screen Scraping using HttpWebRequest

Asked By Mike Smith
12-May-02 06:39 PM
I'm trying to write a screen scraping routine in C# that works with any website but am getting an exception with some URLs. The code is as follows:

    ...
    FileInfo MyFile = new FileInfo(@"c:\source.txt");
    StreamWriter sw = MyFile.CreateText();
    string url;
    url = String.Format(@"https://investing.schwab.com/trading/start");
    Uri uri = new Uri(url);
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    string s;
    Stream stream = resp.GetResponseStream();
    StreamReader sr = new StreamReader(stream);
    while ((s = sr.ReadLine()) != null)
    {
        sw.WriteLine(s);
    }
    ...

This works with most URLs but with the one in the code above it gives the following exception:

    System.Net.WebException: The underlying connection was closed: Could not establish secure channel for SSL/TLS. ---> System.ComponentModel.Win32Exception: The function completed successfully, but must be called again to complete the context
       --- End of inner exception stack trace ---
       at System.Net.HttpWebRequest.CheckFinalStatus()
       at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
       at System.Net.HttpWebRequest.GetResponse()
       at ReadFromSchwabMozilla.Form1..ctor() in c:\documents and settings\mike\my documents\visual studio projects\mikes examples\read from schwab mozilla\form1.cs:line 49

Does anyone know what I'm doing wrong? Thanks!

  Off topic but on target

Asked By Robbe Morris
12-May-02 06:42 PM
This article is on threading, but the HttpWebRequest code at the bottom of the article is what I think you need. http://www.eggheadcafe.com/articles/20020224.asp

  Get the same exception

Asked By Mike Smith
12-May-02 07:04 PM
when I plug the offending URL into your threading example.

  Hmmm

Asked By Robbe Morris
12-May-02 09:15 PM
URL1 and URL2 work for me but not URL3. My guess is that URL3 may be redirecting your request elsewhere and the HttpWebRequest can't handle it. I think a friend of mine dealt with this in classic ASP and the XMLHttp object. Perhaps he has done the same with .NET.

    <%@ Import Namespace="System" %>
    <%@ Import Namespace="System.IO" %>
    <%@ Import Namespace="System.Net" %>
    <%@ Import Namespace="System.Web.Services" %>
    <script Language="C#" runat="server">
    protected void Page_Load(object sender, EventArgs e)
    {
        string sURL1 = "http://www.google.com";
        string sURL2 = "https://qa-ncp.gartner.com";
        string sURL3 = "https://investing.schwab.com/trading/start";
        string sResp = "";
        try
        {
            HttpWebRequest oWebReq = (HttpWebRequest)WebRequest.Create(sURL3);
            HttpWebResponse oWebResp = (HttpWebResponse)oWebReq.GetResponse();
            StreamReader oStream = new StreamReader(oWebResp.GetResponseStream(), System.Text.Encoding.ASCII);
            sResp = oStream.ReadToEnd();
            if (sResp.Length > 0) { Response.Write(sResp); }
        }
        catch (Exception HttpEx)
        {
            Response.Write(HttpEx.Message);
        }
    }
    </script>
  Try something like this on your https page:
Asked By Peter Bromberg
13-May-02 07:45 PM
    // Requires: using System; using System.IO; using System.Net; using System.Text;
    public static string getPage(String url)
    {
        try
        {
            WebRequest req = WebRequest.Create(url);
            WebResponse result = req.GetResponse();
            Stream ReceiveStream = result.GetResponseStream();
            Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
            StreamReader sr = new StreamReader(ReceiveStream, encode);
            StringBuilder sb = new StringBuilder();
            Char[] read = new Char[256];

            // Read the response in 256-character chunks until the stream is exhausted.
            int count = sr.Read(read, 0, 256);
            while (count > 0)
            {
                String str = new String(read, 0, count);
                sb.Append(str);
                count = sr.Read(read, 0, 256);
            }
            return sb.ToString();
        }
        catch (Exception ex)
        {
            // On failure, return the error text instead of the page.
            return ex.Message;
        }
    }

Also, if there is a lot of redirection at the site, you may need to set the MaximumAutomaticRedirections (int) property to an arbitrarily large number, say, 30. Note that this property is defined on HttpWebRequest, so cast the WebRequest instance to HttpWebRequest before setting it.
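A minimal sketch of the redirect setting, in case it helps. The class name `RedirectSketch` and the helper `CreateRequest` are made up for illustration. It has not been verified against the Schwab URL from the original question, and raising the redirect limit is not guaranteed to cure the SSL/TLS exception. It only shows where the property lives and how to see where the request finally landed. On recent .NET versions `WebRequest.Create` is marked obsolete, so expect a compiler warning there.

```csharp
using System;
using System.Net;

public static class RedirectSketch
{
    // Hypothetical helper: builds (but does not send) a request that will
    // follow up to maxRedirects automatic redirects.
    public static HttpWebRequest CreateRequest(string url, int maxRedirects)
    {
        // For http/https URLs, WebRequest.Create returns an HttpWebRequest.
        // The redirect properties exist only on HttpWebRequest, hence the cast.
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = true;
        req.MaximumAutomaticRedirections = maxRedirects;
        return req;
    }

    public static void Main()
    {
        // URL taken from the original question; 30 is the value suggested above.
        HttpWebRequest req = CreateRequest("https://investing.schwab.com/trading/start", 30);
        try
        {
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                // ResponseUri is the URI that actually answered, after any redirects.
                Console.WriteLine("Final URI: " + resp.ResponseUri);
                Console.WriteLine("Status:    " + (int)resp.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // ex.Status separates redirect or protocol failures from
            // secure-channel failures such as the one reported in this thread.
            Console.WriteLine(ex.Status + ": " + ex.Message);
        }
    }
}
```

If the exception status still points at the secure channel rather than at redirection, the redirect limit is probably not the cause.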
  Screen Scraping over SSL (https)
Asked By J. Michael Terenin
17-Dec-05 11:58 PM
I am experiencing similar problems accessing a redirected https web page using the HttpWebRequest/HttpWebResponse classes, and am using Fiddler to debug my problem. I can see that I need to supply some sort of authentication certificate along with the rest of my header info, but I can't seem to find my answer, either through experimentation or in examples on the internet. Have you resolved your problem yet? Please explain if you have. Thanks in advance - Mike
  Screen Scraping over SSL (https)
Asked By J. Michael Terenin
18-Dec-05 12:02 AM
Robbe, it's possible he is experiencing what I am right now: a certificate authentication problem because the initial page is redirecting to an https URL. I'm using Fiddler to debug my problem, but I'm stuck on how to resolve it. Do you know of any strong reading material on this topic, or a site I could go to, to resolve this SSL encryption issue? Thanks in advance - Mike