ETags Support in Aggregators

Martin Dittus · 2006-06-22 · commentary, drop culture, web services · 3 comments

Did you notice Sam Ruby's new preoccupation with ETags? When he's talking with founders about their new web services, "the first thing I ask is; 'do you support ETags?'"

I'm so glad that he's doing that, and talking about it publicly. I've been a web developer for a number of years now, and from the beginning I knew about some basic caching issues and about the HTTP 304 (Not Modified) response -- but it took me a while to figure out that in my scripts it's my responsibility to send this response.

Request caching on this level is simply something most people don't think about when they develop web applications, and I'm glad that, thanks to Sam, this may change.
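To make the server side of this concrete, here is a minimal sketch of what "sending the 304 yourself" could look like in a script. It is an illustration under assumptions, not code from this site: the function names `make_etag` and `respond` are hypothetical, the ETag is simply a hash of the response body, and the comma splitting of If-None-Match is a simplification that works for hash-based tags.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    # Assumption: derive the ETag from a hash of the body.
    # The quotes are part of the ETag value.
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET of `body`.

    `if_none_match` is the raw If-None-Match request header, or None.
    """
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            # If-None-Match uses weak comparison, so ignore a W/ prefix.
            if tag.startswith("W/"):
                tag = tag[2:]
            candidates.append(tag)
        if "*" in candidates or etag in candidates:
            # The client already has this version: send no body.
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Content-Length": str(len(body))}, body
```

A script would call `respond` with the generated feed and the incoming header, then emit the returned status and headers. Note that the script still generates the body in order to hash it, so this saves bandwidth rather than CPU time.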

Yesterday at 19:00 local time I sent out this email:

From: Martin Dittus
To: info@(domain)
Subject: Your crawler is _very_ impolite

I'm the owner and webmaster of the domain dekstop.de.

Since yesterday morning I've been getting thousands of hits by your "Virtual Reach Newsclip Collector" aggregator from sp.virtualreach.com. These were requests to virtually all comment feeds from blog articles I offer on my domain. Apparently someone imported the OPML file I offer that links to all those feeds.

Which is all fine and dandy.

But when I looked at my logfiles today I thought you guys must be kidding... You can't just request thousands of feeds per day, and then not support ETags/If-Modified-Since, which means every request results in a full download of the respective file! And you don't even seem to request robots.txt to allow webmasters control over such requests.

The result: today alone (it's 7pm local time) there were ca. 30MB traffic from my domain to sp.virtualreach.com, which means you will use up nearly 1 GB of _my_ traffic per month. For reading feeds that virtually never change.

Fix this ASAP, it's just not polite to waste other people's bandwidth that way.

And by fixing it I don't mean 'remove my site from your aggregator', but I mean that you:
1. Implement ETags/request caching
2. start to respect robots.txt files

Regards,
Martin Dittus
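For the second point in the email, a crawler written in Python could consult robots.txt with the standard library before fetching anything. This is only a sketch under assumptions: the helper name `allowed_by_robots`, the user agent `ExampleCrawler` and the example.org URLs are made up for illustration, and the robots.txt text is passed in as a string so the example does not need network access.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # /robots.txt first (for example via set_url() and read()).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules a webmaster might publish to keep a crawler
# away from comment feeds.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /comments/
"""

if __name__ == "__main__":
    print(allowed_by_robots(ROBOTS_TXT, "ExampleCrawler",
                            "http://example.org/comments/feed.xml"))
    print(allowed_by_robots(ROBOTS_TXT, "ExampleCrawler",
                            "http://example.org/index.html"))
```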

This morning, just before 8:00, their requests to my domain stopped. There has been no reply to my email yet.

I wonder what their developers are doing right now.


Comments

Martin,

Unfortunately, polite informative emails to developers don't always work. Here's the email I sent--on February 3 of this year--to the CTO of the very same company whose product has been causing you grief:

Jay,

Here's some background info on one issue I'd like to discuss, which is support for ETags to reduce bandwidth consumption:

I noticed that NewsClip doesn't seem to support ETag/If-None-Match caching, which is super-easy to implement and is a HUGE bandwidth saver, so (as someone pumping out lots of the same data to NewsClip users) I'd like to request that feature.

There's a good discussion of it here:

http://www.kbcafe.com/rss/rssfeedstate.html#entitytags

but here's the super-simple version:

The only thing NewsClip needs to do is look for an HTTP header in the response from the server that looks like this:

ETag: "574671cf42ca4f3beee74d05c0ddff75a"

and store the quoted value, then include a header on subsequent requests that looks like this:

If-None-Match: "574671cf42ca4f3beee74d05c0ddff75a"

(where the long hex number is whatever was returned in the ETag.)

In many cases, this is a 99.9% reduction in bandwidth utilization for the feed, which is obviously good. :-)

Regards,
Charlie
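The exchange described in the email above can be sketched from the aggregator's side as follows. This is a minimal illustration using Python's standard `urllib`, not NewsClip's code: the function name `fetch_feed` and its `opener` parameter are hypothetical, and a real client would persist the ETag and body per feed URL between runs.

```python
import urllib.error
import urllib.request


def fetch_feed(url, cached_etag=None, cached_body=None,
               opener=urllib.request.urlopen):
    """Fetch `url`, reusing the cached copy if the server answers 304.

    Returns (etag, body). `opener` is injectable so the function can be
    exercised without a network connection.
    """
    request = urllib.request.Request(url)
    if cached_etag:
        # Send back the stored value exactly as received, quotes included.
        request.add_header("If-None-Match", cached_etag)
    try:
        with opener(request) as response:
            # 200 OK: the feed changed (or nothing was cached yet).
            return response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: no body was transferred, keep the cached copy.
            return cached_etag, cached_body
        raise
```

The first call passes no ETag and stores whatever comes back; every later call passes the stored pair, so an unchanged feed costs only a few hundred bytes of headers.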

I still get lots of requests from Virtual Reach Newsclip, none of which support ETag/If-None-Match.

-Charlie

Charlie Wood, 2006-06-22 19:55 CET (+0100)

Yeah, I too found that I spoke too soon -- I'm still seeing the same requests.

My initial reaction after sending the mail yesterday was to make it really obvious to them and block their IP -- but after hearing of your experience I wouldn't be surprised if they don't pay attention to HTTP replies either.

Martin Dittus, 2006-06-22 20:03 CET (+0100)

What there really needs to be is an HTTP library that supports everything like this out of the box. It should cache into its own set of temporary files, which can be shared across all apps.

Porges, 2006-06-23 00:17 CET (+0100)
