ScreenScraping with ServerXMLHttp
By Peter A. Bromberg, Ph.D.
ScreenScraping, the process of grabbing content from
another site, stripping out only what you want, and then displaying it
on your own site, is a frowned-upon although often-used practice.
I don't want to get into the ethical and copyright issues involved here;
that's something each individual will have to take into consideration.
With the arrival of MSXML3, Microsoft has provided the
web developer with a server-safe, high-load tool called the ServerXMLHttp
object. Its use is relatively simple. In this short article I'll demonstrate
how to grab weather information from a public weather site based on a
specific zip code, strip out a small portion of it (eliminating all the
ads and other extraneous content), and then display it in your own page.
Basically, ServerXMLHttp is an HTTP GET/POST component
with one major added advantage: unlike XMLHTTP, it does not rely
on WinInet for HTTP. ServerXMLHTTP uses a new HTTP client stack. Designed
for server applications, this "server-safe" subset of WinInet
offers the following advantages:
Reliability
The HTTP client stack offers longer uptimes. WinInet features that are
not critical for server applications, such as URL caching, auto-discovery
of proxy servers, HTTP/1.1 chunking, offline support, and support for
the Gopher and FTP (File Transfer Protocol) protocols, are not included in
the new HTTP subset.
Security
The HTTP client stack does not allow user-specific state to be shared
with another user's session. Note that ServerXMLHTTP does not provide
support for certificates.
The maximum number of instances that can exist simultaneously within a
single process is 5,460. A similar limitation applies to the XMLHTTP component.
However, other factors, such as available memory, CPU processing capacity,
or available socket connections can further limit the number of instances
that can be active simultaneously. Developers can partition the server
application into multiple processes if this limit becomes a bottleneck.
The open method initializes the request (the HTTP method, the URL, and
whether the call is asynchronous), and the send method sends it to the server.
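You can read the response using one of four properties.
responseBody (the raw response as an array of unsigned bytes)
responseStream (the response as an IStream)
responseText (the response as a string)
responseXML (the response parsed into an XML DOM document)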
With ServerXMLHTTP, the usual sequence is to call open, set any custom
header information through setRequestHeader, send, and then check one
of the four response properties.
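As a rough illustration of that sequence, here is a minimal sketch meant to sit inside an ASP <% ... %> script block. The URL and the Accept header value are placeholders chosen for the example, not something any particular site requires.

```vbscript
' Minimal sketch of the open / setRequestHeader / send / read sequence.
' The URL and header value are placeholders for illustration only.
Dim http, body
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP.3.0")

' 1. open: HTTP method, URL, and False for a synchronous call
http.open "GET", "http://www.example.com/", False

' 2. optional custom headers go between open and send
http.setRequestHeader "Accept", "text/html"

' 3. send the request (a GET needs no request body)
http.send

' 4. read one of the response properties once the call returns
If http.status = 200 Then
    body = http.responseText
    ' show only the first 200 characters, HTML-encoded so the markup is visible
    Response.Write Server.HTMLEncode(Left(body, 200))
Else
    Response.Write "Request failed with HTTP status " & http.status
End If

Set http = Nothing
```

Because the third argument to open is False, send does not return until the response has arrived, so the response properties can be read on the very next line.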
Let's say that you have a web application that is customized
to the user, and one of the items you retrieve, either by reading
a client cookie or by looking up user information in your database, is the
customer's zip code. You'll store the zip code in a variable, "zip",
for use in the page. Here is how you would grab zip-code-specific weather
information and display it somewhere in one of your pages:
<%
If Request.Form("SEND") = "" Then
%>
<form action="weather.asp" method="post">
<input type="text" name="zip">Enter your Zipcode to see local weather<BR>
<input type="submit" name="SEND" value="GET IT!">
</form>
<%
Else
    Dim zip
    If Request.Form("zip") = "" Then
        zip = "32801" ' if it's blank, we just show them somebody else's weather..
    Else
        zip = Request.Form("zip")
    End If

    Dim srvXmlHttp
    Dim result
    Dim URL
    Dim beginpos, endpos
    ' If this doesn't work (because you don't have MSXML3 installed / configured)
    ' you can revert to the commented line:
    'Set srvXmlHttp = Server.CreateObject("Microsoft.XMLHTTP")
    Set srvXmlHttp = Server.CreateObject("MSXML2.ServerXMLHTTP.3.0")
    ' This site is easy to strip weather info from ...
    ' The zip comes straight from user input, so URL-encode it.
    URL = "http://www.wunderground.com/cgi-bin/findweather/getForecast?query=" & Server.URLEncode(zip)
    srvXmlHttp.open "GET", URL, False
    srvXmlHttp.send
    'on error resume next
    If srvXmlHttp.status = 200 Then
        result = srvXmlHttp.responseText
        ' Keep only the markup from <form name="airport"> through its closing </form>.
        beginpos = InStr(result, "<form name=""airport"">")
        If beginpos > 0 Then
            result = Mid(result, beginpos)
            endpos = InStr(result, "</form>")
            ' "</form>" is 7 characters long, so the tag ends at endpos + 6.
            If endpos > 0 Then result = Left(result, endpos + 6)
            Response.Write "<BASEFONT FACE=Verdana>"
            Response.Write "<CENTER><h1>ScreenScraping 101</h1><BR>"
            Response.Write result
            Response.Write "</CENTER>"
        End If
    End If
    Response.Write "<A HREF=""" & Server.HTMLEncode(URL) & """>Your Weather</a>"
    Set srvXmlHttp = Nothing
End If
%>
(Please be aware - the above is the code that was on their
site at the time this article was written. It's very likely to have changed
since then!)
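Since the markers you search for will change whenever the target site changes its markup, it can be handy to pull the stripping step into a small function. The following is only a sketch: ExtractBetween is a hypothetical helper name of my own, not part of ASP or MSXML, and it assumes you want everything from a start marker through an end marker, inclusive.

```vbscript
' Returns the text from startMarker through endMarker (both inclusive),
' or "" when either marker is missing. ExtractBetween is a hypothetical
' helper name, not part of ASP or MSXML.
Function ExtractBetween(html, startMarker, endMarker)
    Dim startPos, endPos
    ExtractBetween = ""

    ' vbTextCompare makes the search case-insensitive (<FORM> vs <form>)
    startPos = InStr(1, html, startMarker, vbTextCompare)
    If startPos = 0 Then Exit Function

    ' look for the end marker only after the start marker
    endPos = InStr(startPos + Len(startMarker), html, endMarker, vbTextCompare)
    If endPos = 0 Then Exit Function

    ExtractBetween = Mid(html, startPos, endPos + Len(endMarker) - startPos)
End Function

' Possible use with the response text from the weather page:
' Response.Write ExtractBetween(srvXmlHttp.responseText, "<form name=""airport"">", "</form>")
```

Returning an empty string when a marker is missing means the page simply shows nothing for that section instead of raising a run-time error from Mid.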
One more note: I've read a number of posts, and even articles
by professional developers, claiming that they can't get ServerXMLHttp
to work. Microsoft has a proxycfg.exe tool that you can download separately
from the MSXML3 release distribution. Just run "proxycfg -d"
to set up the registry entries that make ServerXMLHttp connect directly
to URLs, and your problems should disappear. You must use this utility
even if your server does not use proxy connections.
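If a request still fails, it can help to see the actual error rather than a generic ASP error page. The following is a minimal diagnostic sketch for use inside an ASP <% ... %> block, assuming MSXML3 is installed; the URL is a placeholder, so substitute the page you are trying to reach.

```vbscript
' Diagnostic sketch: trap the error raised by send and print it.
' The URL is a placeholder for illustration only.
Dim http
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP.3.0")
http.open "GET", "http://www.example.com/", False

On Error Resume Next      ' trap run-time errors instead of aborting the page
http.send
If Err.Number <> 0 Then
    ' send raises a run-time error when the server cannot be reached
    Response.Write "send failed: 0x" & Hex(Err.Number) & " " & Server.HTMLEncode(Err.Description)
    Err.Clear
Else
    Response.Write "HTTP status: " & http.status & " " & Server.HTMLEncode(http.statusText)
End If
On Error GoTo 0           ' restore normal error handling

Set http = Nothing
```

Seeing the error number and description makes it easier to tell a connection problem on the server from a problem with the page you are requesting.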
Peter Bromberg is an independent consultant specializing in distributed .NET solutions
in Orlando and a co-developer of the EggheadCafe.com developer
website. He can be reached at pbromberg@yahoo.com