ASP.NET Request Logger and Crawler Killer


By Peter Bromberg

Shows a simple way to log requests and deny those that come from <enter annoying bot name here>. It can easily be turned on or off with a database entry, without causing an app recycle.


If you have ever had a web site hit in the middle of peak hours by a nasty crawler or bot that doesn't fully observe the robots exclusion standard (robots.txt), tying up lots of your pages and causing humongous database access, then you know that you absolutely have to have good metrics to help identify the problem.

This is a simple logging class that:

1) Grabs key information from each request and logs it into a SQL Server table.
2) Can be programmed to identify certain "nastybots" via their User-Agent string and reply with a 401 Access Denied.
3) Can easily be turned on and off by simply updating a row in a SQL Server Database table, which will NOT cause an application restart.

The basic concept here is to intercept a request before Page processing and any database access have begun. The easiest way to do that is to handle the application's PreRequestHandlerExecute event. This is most easily done in Global.asax, where you can simply make a static class method call, like so:
protected void Application_PreRequestHandlerExecute(object sender, EventArgs e)
{
    RequestLogger.Logger.LogRequest(sender as HttpApplication);
}
When this call is made to the LogRequest method, it checks two private fields, _loggingOn, and _denyBots, and behaves accordingly. If _loggingOn is true, it grabs the items we want from the Request object and writes a row into your Requests SQL Table. The list I have is short, but you can add many more items if your needs differ.

If _denyBots is true, it performs an "IsCrawler" check using Regex test strings of your choosing and issues a 401 Access Denied response, which stops the bot dead in its tracks before it can do any damage. The requested page is never executed, so none of its code or database access runs.
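As a rough illustration of the "Regex test strings of your choosing" idea, here is a minimal, framework-free sketch of a deny list. It is not part of the article's Logger class: the BotDenyList and IsDenied names and the two bot names in the pattern are hypothetical placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate User-Agent matching.
public static class BotDenyList
{
    // Placeholder fragments; replace them with the User-Agent fragments
    // you actually see in your own request log.
    private static readonly Regex Denied =
        new Regex("BadBot|GreedySpider", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns true when the User-Agent contains one of the denied fragments.
    // A missing User-Agent header (null or empty) is treated as "not denied".
    public static bool IsDenied(string userAgent)
    {
        return !string.IsNullOrEmpty(userAgent) && Denied.IsMatch(userAgent);
    }
}
```

Building the Regex once in a static field avoids re-parsing the pattern on every request, and RegexOptions.IgnoreCase removes the need to list each name in several capitalizations.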

The class populates the two state variables itself, through a method that checks the Cache and reloads the values from the database once the cached copies expire, ten minutes after they were loaded. So you can change the state in the database and know that within about ten minutes the class will pick up the change, without recycling your app as rewriting web.config would.

Here's the code for the logging class:

using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Caching;

namespace RequestLogger
{
    public static class Logger
    {
        private static bool _loggingOn = true;
        private static bool _denyBots = false;
        private static string _connectionString = ConfigurationManager.AppSettings["connectionString"];

        public static void LogRequest(HttpApplication app)
        {
            HttpRequest request = app.Request;
            EnsureSwitches(app);
            // Nothing to do when both switches are off.
            if (!_loggingOn && !_denyBots) return;

            bool isCrawler = IsCrawler(request);

            if (_loggingOn)
            {
                // UserAgent is null when the client sends no User-Agent header;
                // AddWithValue would then omit the parameter and the proc call would fail.
                string userAgent = request.UserAgent ?? "";
                string requestPath = request.Url.AbsolutePath;
                string referer = request.UrlReferrer != null ? request.UrlReferrer.AbsolutePath : "";
                string userIp = request.UserHostAddress;
                string isCrawlerStr = isCrawler.ToString();

                // The using blocks close the connection and dispose the command,
                // even when an exception is thrown.
                using (SqlConnection cn = new SqlConnection(_connectionString))
                using (SqlCommand cmd = new SqlCommand("dbo.insertRequest", cn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    try
                    {
                        cmd.Parameters.AddWithValue("@UserAgent", userAgent);
                        cmd.Parameters.AddWithValue("@RequestPath", requestPath);
                        cmd.Parameters.AddWithValue("@Referer", referer);
                        cmd.Parameters.AddWithValue("@RemoteIp", userIp);
                        cmd.Parameters.AddWithValue("@IsCrawler", isCrawlerStr);
                        cn.Open();
                        cmd.ExecuteNonQuery();
                    }
                    catch (SqlException ex)
                    {
                        // this is just for quick debugging; comment it out in production
                        // so SQL error text is not sent to the client:
                        app.Response.Write(ex.Message);
                    }
                }
            }

            // Deny the bot whether or not logging is switched on.
            if (isCrawler && _denyBots)
                DenyAccess(app);
        }

        private static void EnsureSwitches(HttpApplication app)
        {
            Cache cache = app.Context.Cache;
            object loggingOn = cache["_loggingOn"];
            object denyBots = cache["_denyBots"];
            // Reload from the database when either entry has expired or been evicted.
            if (loggingOn == null || denyBots == null)
            {
                using (SqlConnection cn = new SqlConnection(_connectionString))
                using (SqlCommand cmd = new SqlCommand("dbo.GetRequestLogState", cn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    cn.Open();
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        // Read() returns false when there are no rows; the current values are then kept.
                        if (rdr.Read())
                        {
                            _loggingOn = rdr.GetBoolean(0);
                            _denyBots = rdr.GetBoolean(1);
                        }
                    }
                }
                // Absolute expiration: both entries drop out of the cache together after 10 minutes.
                DateTime expires = DateTime.Now.AddMinutes(10);
                cache.Insert("_loggingOn", _loggingOn, null, expires, Cache.NoSlidingExpiration);
                cache.Insert("_denyBots", _denyBots, null, expires, Cache.NoSlidingExpiration);
            }
            else
            {
                _loggingOn = (bool) loggingOn;
                _denyBots = (bool) denyBots;
            }
        }

        private static void DenyAccess(HttpApplication app)
        {
            app.Response.StatusCode = 401;
            app.Response.StatusDescription = "Access Denied";
            app.Response.Write("401 Access Denied");
            app.CompleteRequest();
        }


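        // Additional known crawlers that request.Browser.Crawler does not flag.
        // Built once and reused; IgnoreCase covers "Slurp"/"slurp" and similar variants.
        // Note that "ask" matches anywhere in the User-Agent, so tighten it if you see false positives.
        private static readonly Regex _crawlerRegex =
            new Regex("Slurp|ask|Teoma", RegexOptions.IgnoreCase | RegexOptions.Compiled);

        public static bool IsCrawler(HttpRequest request)
        {
            // set next line to "bool isCrawler = false;" to use this to deny certain bots
            bool isCrawler = request.Browser.Crawler;
            // Microsoft doesn't properly detect several crawlers
            if (!isCrawler)
            {
                // put any additional known crawlers in the Regex above
                // you can also use this list to deny certain bots instead, if desired:
                // just set bool isCrawler = false; for first line in method
                // and only have the ones you want to deny in the Regex list
                // UserAgent is null when no User-Agent header was sent, and Regex throws on null input.
                string userAgent = request.UserAgent ?? "";
                isCrawler = _crawlerRegex.IsMatch(userAgent);
            }
            return isCrawler;
        }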
    }
}
The above code should be pretty much self-explanatory. I threw this together pretty fast for a specific need, but I bet you can think of plenty of ways to extend it, so please, have fun!    
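One possible extension, sketched under assumptions rather than taken from the download: the same hook can be wired from an HTTP module instead of Global.asax, which keeps the logger out of the application class. The module name RequestLoggerModule is hypothetical; IHttpModule and the PreRequestHandlerExecute event are standard ASP.NET, and Logger.LogRequest is the method shown above.

```csharp
using System;
using System.Web;

namespace RequestLogger
{
    // Hypothetical module; it forwards each request to Logger.LogRequest.
    public class RequestLoggerModule : IHttpModule
    {
        public void Init(HttpApplication context)
        {
            // Same pipeline event that the Global.asax handler uses.
            context.PreRequestHandlerExecute += OnPreRequestHandlerExecute;
        }

        private static void OnPreRequestHandlerExecute(object sender, EventArgs e)
        {
            // For application pipeline events the sender is the HttpApplication instance.
            Logger.LogRequest((HttpApplication) sender);
        }

        public void Dispose()
        {
            // Nothing to release.
        }
    }
}
```

A module has to be registered in web.config (under httpModules for the classic pipeline, or under system.webServer/modules for the IIS 7 integrated pipeline). That one-time edit does restart the application; after that, the on/off switches still live in the database.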

You can download the Visual Studio 2005 solution, which includes a Web Application Project "test harness", and the SQL script to create your two tables and stored procs. After running the SQL, you can use the test project "out of the box" by simply ensuring that the connection string is correct in the appSettings section of web.config.

Biography - Peter Bromberg
Peter Bromberg is a C# MVP, MCP, and .NET expert who has worked in banking, finance, and telephony for over 20 years. Pete focuses exclusively on the .NET Platform, and currently develops SOA and other .NET applications for a Fortune 500 clientele. Peter enjoys producing digital photo collages with Maya, playing jazz flute, the beach, and fine wines. You can view Peter's UnBlog and IttyUrl sites.

Article Discussion: ASP.NET Request Logger and Crawler Killer
Peter Bromberg posted at Monday, June 18, 2007 1:27 PM

Crawler Killer - Detector
Jerome Vernon replied to Peter Bromberg at Friday, August 10, 2007 12:37 PM
Can the process of testing for crawlers be used to prevent unwanted hit counter increments in the Session_Start event within global.asax? It seems that my hit counter is getting more hits than I think it should (in the middle of the night). Do crawlers cause Session_Start to fire? If so, then perhaps testing for crawlers prior to incrementing the hit counter would be the way to go.
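
A minimal sketch of the idea raised in the question above, not a reply from the author. Crawlers usually do not send back the session cookie, so each of their requests can start a new session and fire Session_Start. Assuming the hit counter lives in Application state under a hypothetical "HitCount" key, the article's public Logger.IsCrawler method could gate the increment:

```csharp
using System;
using System.Web;

// Sketch of a Global.asax code-behind class; "HitCount" is a hypothetical key.
public class Global : HttpApplication
{
    protected void Session_Start(object sender, EventArgs e)
    {
        // Skip sessions started by anything the logger classifies as a crawler.
        if (RequestLogger.Logger.IsCrawler(Request))
            return;

        // Application state is shared across requests, so lock around the update.
        Application.Lock();
        try
        {
            object current = Application["HitCount"];
            int hits = current == null ? 0 : (int) current;
            Application["HitCount"] = hits + 1;
        }
        finally
        {
            Application.UnLock();
        }
    }
}
```

IsCrawler only looks at the browser capabilities and the User-Agent string, so a bot that impersonates a regular browser would still be counted.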