Screen Scraping using HttpWebRequest
Asked By Mike Smith
12-May-02 06:39 PM

I'm trying to write a screen scraping routine in C# that works with any website but am getting an exception with some URLs. The code is as follows:
...
FileInfo MyFile = new FileInfo(@"c:\source.txt");
StreamWriter sw = MyFile.CreateText();
string url;
url = String.Format(@"https://investing.schwab.com/trading/start");
Uri uri = new Uri(url);
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
string s;
Stream stream = resp.GetResponseStream();
StreamReader sr = new StreamReader(stream);
while ((s = sr.ReadLine()) != null)
{
    sw.WriteLine(s);
}
...
This works with most URLs but with the one in the code above it gives the following exception:
System.Net.WebException: The underlying connection was closed: Could not establish secure channel
for SSL/TLS. ---> System.ComponentModel.Win32Exception: The function completed successfully, but
must be called again to complete the context
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.CheckFinalStatus()
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
at System.Net.HttpWebRequest.GetResponse()
at ReadFromSchwabMozilla.Form1..ctor() in c:\documents and settings\mike\my documents\visual
studio projects\mikes examples\read from schwab mozilla\form1.cs:line 49
Does anyone know what I'm doing wrong?
Thanks!
Off topic but on target
Asked By Robbe Morris
12-May-02 06:42 PM
This article is on threading, but the HttpWebRequest code at the bottom of the article is what I think you need.
http://www.eggheadcafe.com/articles/20020224.asp
Get the same exception
Asked By Mike Smith
12-May-02 07:04 PM
when I plug the offending URL into your threading example.
Hmmm
Asked By Robbe Morris
12-May-02 09:15 PM

URL1 and URL2 work for me but not URL3. My guess is that URL3 may be redirecting your request elsewhere and the HttpWebRequest can't handle it. I think a friend of mine dealt with this in classic ASP and the XMLHttp object. Perhaps he has done the same with .NET.
<%@ Import Namespace="System" %>
<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Web.Services" %>
<script Language="C#" runat="server">
protected void Page_Load(object sender, EventArgs e)
{
    string sURL1 = "http://www.google.com";
    string sURL2 = "https://qa-ncp.gartner.com";
    string sURL3 = "https://investing.schwab.com/trading/start";
    string sResp = "";
    try
    {
        HttpWebRequest oWebReq = (HttpWebRequest)WebRequest.Create(sURL3);
        HttpWebResponse oWebResp = (HttpWebResponse)oWebReq.GetResponse();
        StreamReader oStream = new StreamReader(oWebResp.GetResponseStream(), System.Text.Encoding.ASCII);
        sResp = oStream.ReadToEnd();
        if (sResp.Length > 0) { Response.Write(sResp); }
    }
    catch (Exception HttpEx) { Response.Write(HttpEx.Message); }
}
</script>
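One way to test the redirect guess above is to switch off automatic redirects and look at what the server sends back first. The following is only a sketch: the class and method names (RedirectProbe, CreateProbe, Describe) are made up for illustration, and it assumes the .NET Framework behavior in which a 3xx response is returned to the caller when AllowAutoRedirect is false.

```csharp
using System;
using System.Net;

public static class RedirectProbe
{
    // Builds a request that will NOT follow redirects on its own.
    public static HttpWebRequest CreateProbe(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = false;
        return req;
    }

    // Returns the status code and the Location header (empty if none was sent).
    public static string Describe(string url)
    {
        HttpWebRequest req = CreateProbe(url);
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            return String.Format("{0} {1} Location: {2}",
                (int)resp.StatusCode, resp.StatusCode, resp.Headers["Location"]);
        }
    }
}
```

If Describe reports a 301 or 302, the Location value shows where the site is sending the request, which can then be fetched directly. Note that an SSL/TLS handshake failure would still throw before any status code is available.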
Try something like this on your https page:
Asked By Peter Bromberg
13-May-02 07:45 PM
// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string getPage(String url) {
    try {
        WebRequest req = WebRequest.Create(url);
        WebResponse result = req.GetResponse();
        Stream ReceiveStream = result.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader sr = new StreamReader(ReceiveStream, encode);
        StringBuilder sb = new StringBuilder();
        // Read the response in 256-character chunks until the stream is exhausted.
        Char[] read = new Char[256];
        int count = sr.Read(read, 0, 256);
        while (count > 0) {
            String str = new String(read, 0, count);
            sb.Append(str);
            count = sr.Read(read, 0, 256);
        }
        return sb.ToString();
    } catch (Exception ex) {
        // On failure the caller gets the exception message instead of the page.
        return ex.Message;
    }
}
Also, if there is a lot of redirection at the site, you may need to set the
MaximumAutomaticRedirections (int) property to an arbitrarily large number,
say, 30. The property belongs to HttpWebRequest, so cast the WebRequest
returned by WebRequest.Create to HttpWebRequest first.
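A minimal sketch of that setting, for anyone who wants to try it. The class and method names (RedirectExample, CreateRequest) are hypothetical; only AllowAutoRedirect and MaximumAutomaticRedirections are actual HttpWebRequest properties.

```csharp
using System;
using System.Net;

public static class RedirectExample
{
    // Creates a request that follows up to maxRedirects automatic redirects.
    public static HttpWebRequest CreateRequest(string url, int maxRedirects)
    {
        // MaximumAutomaticRedirections lives on HttpWebRequest, hence the cast.
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = true;
        req.MaximumAutomaticRedirections = maxRedirects;
        return req;
    }
}
```

Usage would be something like RedirectExample.CreateRequest(url, 30), followed by the usual GetResponse() call.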
Screen Scraping over SSL (https)
Asked By J. Michael Terenin
17-Dec-05 11:58 PM
I am experiencing similar problems accessing a redirected https web page using the HttpWebRequest/HttpWebResponse classes, and am using Fiddler to debug my problem. I can see that I need to supply some sort of authentication certificate along with the rest of my header info, but I can't seem to find my answer, either through experimentation or in examples on the internet. Have you resolved your problem yet? Please explain if you have. Thanks in advance - Mike
Screen Scraping over SSL (https)
Asked By J. Michael Terenin
18-Dec-05 12:02 AM
Robbe,
It's possible he is experiencing what I am right now: a certificate authentication problem, because the initial page is redirecting to an https URL. I'm using Fiddler to debug my problem, but I'm stuck on how to resolve it. Do you know of any strong reading material on this topic, or a site that I could go to in order to resolve this SSL encryption issue? Thanks in advance - Mike
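For narrowing down a certificate problem like this, one option is to log what the SSL/TLS layer reports about the server certificate. This is a diagnostic sketch only, assuming .NET Framework 2.0 or later; the class and method names (CertDiagnostics, LogAndValidate, Install) are invented for the example. It does not supply a client certificate, and it does not loosen validation, since it still rejects any certificate with policy errors.

```csharp
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static class CertDiagnostics
{
    // Prints the certificate subject and any policy errors, then accepts
    // the certificate only when there are no errors at all.
    public static bool LogAndValidate(object sender, X509Certificate certificate,
                                      X509Chain chain, SslPolicyErrors errors)
    {
        string subject = certificate != null ? certificate.Subject : "(no certificate)";
        Console.WriteLine("Certificate: {0}; policy errors: {1}", subject, errors);
        return errors == SslPolicyErrors.None;
    }

    // Call once before issuing the https request.
    public static void Install()
    {
        ServicePointManager.ServerCertificateValidationCallback = LogAndValidate;
    }
}
```

After CertDiagnostics.Install(), a failing request should print whether the problem is a name mismatch, a chain error, or a missing certificate, which is a more specific starting point than the generic WebException message.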

