Saturday, January 12, 2008

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET

I just found this great article on using ASP.NET to stop bots you don't want on your site.

Here's an idea for doing it with C# and the application cache. The blacklist is temporary, so someone scraping your site from a dynamic IP is less likely to spoil it for the next person unfortunate enough to share an ISP with them.

To use it, add the following to your code-behind files on the page you've set up to catch the unwanted scrapers...
GrokkingCode.ClientTrap.BadClients.Instance.AddClient();

... and the next line of code gets added on the pages you don't want scraped...
GrokkingCode.ClientTrap.BadClients.Instance.TestClient();


This is the code saved as App_Code/BadClients.cs
using System;
using System.Collections.Specialized;
using System.Web;
using System.Web.Caching;

namespace GrokkingCode.ClientTrap {
    /// <summary>
    /// Handle clients forbidden from accessing site
    /// </summary>
    public sealed class BadClients {
        private const string CacheKey = "badclients";
        // Guards the blacklist, which is shared by every request through the application cache.
        private static readonly object sync = new object();

        private HttpContext http = null;
        private HybridDictionary dictBadClients = null;

        // One instance per request, kept in HttpContext.Items.
        public static BadClients Instance {
            get {
                try { return (HttpContext.Current.Items["oBadClients"] ?? (HttpContext.Current.Items["oBadClients"] = new BadClients())) as BadClients; }
                catch (Exception ex) { throw new Exception("Failed to instantiate BadClients.", ex); }
            }
        }

        private BadClients() {
            // HttpContext.Current is null (it does not throw) outside a web request.
            http = HttpContext.Current;
            if (http == null) { throw new InvalidOperationException("Web environment only."); }

            // Reuse the cached blacklist if there is one; otherwise start with an empty, case-sensitive one.
            dictBadClients = (http.Cache[CacheKey] as HybridDictionary) ?? new HybridDictionary(false);
        }

        public void AddClient() {
            lock (sync) {
                // Another request may have cached a blacklist since this instance was created.
                dictBadClients = (http.Cache[CacheKey] as HybridDictionary) ?? dictBadClients;
                // The indexer overwrites an existing entry; Add() would throw on a repeat visit.
                dictBadClients[http.Request.UserHostAddress] = http.Request.UserAgent;
                // Insert adds or replaces, and restates the 5-minute sliding expiration every time.
                http.Cache.Insert(CacheKey, dictBadClients, null, Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(5));
            }
        }

        public void TestClient() {
            bool blocked;
            lock (sync) {
                string address = http.Request.UserHostAddress;
                // Check Contains first so an unlisted client with no User-Agent header
                // (null == null) is not blocked by accident.
                blocked = dictBadClients.Contains(address)
                    && (string)dictBadClients[address] == http.Request.UserAgent;
            }

            if (blocked) {
                http.Trace.Write("Blocked client.");
                try { http.Response.Clear(); }
                catch { http.Trace.Write("Could not clear buffer."); }
                http.Response.End();
            }
        }
    }
}

Follow Colin's article on setting up the hidden link and your robots.txt. The biggest difference between his approach and mine is that his lets you review a log file and decide what to do, while mine creates an immediate blacklist entry for the offending client system. The blacklist itself resets after 5 minutes of inactivity. Because the expiration is sliding, any request that reads or updates the list counts as activity. Feel free to change that to suit your site's specific needs. I may refine the system in the future to time out individual entries.
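For orientation only, here is a minimal sketch of the robots.txt side of the trap. The /trap/ path is a made-up placeholder, not something taken from Colin's article; substitute the folder that holds the page calling AddClient(). Well-behaved crawlers honor the Disallow rule and never request the trap page, so only clients that ignore robots.txt and follow the hidden link end up on the blacklist.

```text
# Hypothetical path: point this at the folder containing your trap page.
User-agent: *
Disallow: /trap/
```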
