I'm so glad that he's doing that, and talking about it publicly. I've been a web developer for a number of years now, and from the beginning I knew about some basic caching issues and about the HTTP 304 (Not Modified) response -- but it took me a while to figure out that it's my own scripts' responsibility to send that response.
Request caching on this level is simply something most people don't think about when they develop web applications, and I'm glad that, thanks to Sam, this may change.
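For anyone who hasn't had to do this themselves: the web server won't send the 304 for you when a script generates the page. Here's a minimal sketch in Python of what such a script has to do -- the file name and the ETag scheme are my own illustrative choices, not any particular framework's API:

```python
# Minimal sketch of conditional GET handling with Python's standard
# library. FEED_PATH and the mtime/size ETag are illustrative assumptions.
import hashlib
import os
from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

FEED_PATH = "comments.xml"  # hypothetical feed file served by this script

class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stat = os.stat(FEED_PATH)
        last_modified = formatdate(stat.st_mtime, usegmt=True)
        # A simple validator derived from mtime and size; any stable
        # fingerprint of the content would do.
        etag = '"%s"' % hashlib.md5(
            f"{stat.st_mtime}-{stat.st_size}".encode()).hexdigest()

        # If the client's validators still match, answer 304 Not Modified
        # and skip the body entirely.
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)
            self.end_headers()
            return
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                # HTTP dates have one-second resolution, hence int().
                if parsedate_to_datetime(ims).timestamp() >= int(stat.st_mtime):
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # unparseable date: fall through to a full response

        with open(FEED_PATH, "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("ETag", etag)
        self.send_header("Last-Modified", last_modified)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), FeedHandler).serve_forever()
```

The point is in those early returns: a well-behaved client sends the validators back, and an unchanged feed then costs a couple hundred bytes of headers instead of the whole file.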
Yesterday at 19:00 local time I sent out this email:
From: Martin Dittus
Subject: Your crawler is _very_ impolite
I'm the owner and webmaster of the domain dekstop.de.
Since yesterday morning I've been getting thousands of hits from your "Virtual Reach Newsclip Collector" aggregator at sp.virtualreach.com. These were requests for virtually all the comment feeds of blog articles I offer on my domain. Apparently someone imported the OPML file I offer that links to all those feeds.
Which is all fine and dandy.
But when I looked at my logfiles today I thought you guys must be kidding... You can't just request thousands of feeds per day and then not support ETags/If-Modified-Since, which means every request results in a full download of the respective file! And you don't even seem to request robots.txt, which would give webmasters some control over such requests.
The result: today alone (it's 7 pm local time) there was ca. 30 MB of traffic from my domain to sp.virtualreach.com, which means you will use up nearly 1 GB of _my_ traffic per month. For reading feeds that virtually never change.
Fix this ASAP; it's just not polite to waste other people's bandwidth that way.
And by fixing it I don't mean 'remove my site from your aggregator'; I mean that you:
1. Implement ETags/request caching
2. Start respecting robots.txt files
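For the curious, this is roughly what those two points amount to on the client side. A minimal sketch using Python's standard library; the `fetch_feed` function, the user agent string, and the in-memory cache dict are my own illustrative stand-ins for whatever a real aggregator would persist:

```python
# Sketch of a polite feed fetch: check robots.txt, send cached
# validators, and treat 304 as "reuse what you already have".
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

def fetch_feed(url, cache, user_agent="ExampleAggregator/1.0"):
    """Conditionally fetch url; cache maps url -> (etag, last_modified, body)."""
    # 1. Check robots.txt before touching anything else on the host.
    parts = urlsplit(url)
    robots = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(user_agent, url):
        return None  # the webmaster said no

    # 2. Send the validators saved on the previous fetch, if any.
    etag, last_modified, body = cache.get(url, (None, None, None))
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)

    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            cache[url] = (resp.headers.get("ETag"),
                          resp.headers.get("Last-Modified"),
                          body)
            return body
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return body  # unchanged: headers only, no download
        raise
```

With the validators cached, polling a feed that hasn't changed costs next to nothing, and the robots.txt check gives webmasters a way to opt out entirely -- which is all I was asking for.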
This morning, just before 8:00, their requests to my domain stopped. There has been no reply to my email yet.
I wonder what their developers are doing right now.