The Wonderful World of Logfile Analysis, Part One: Search Engine Referers

Martin Dittus · 2005-10-30 · data mining

One of the things I like to do in my spare time is analyze web server logfiles. They don't even have to be those of my own domain, but it helps if it is a site that I know and use. I started writing an article about some recent findings back in August and, as these things go, left it in a draft state for months. I've now decided to post the first segment of an ongoing series about my habit of logfile analysis, and maybe I'll go into some of the techniques I use later on. There are just so many things you can learn from logfiles besides vanity statistics!

I'm presenting you here with an unsorted list of Very Good Reasons Why Having a Domain of Your Own Pays Off in the long run. Reason number one: you have access to the logfiles. Reason number two: people use search engines. Reason number three: you can watch them search!

Tools used: the command line (grep, wc, sed, pbcopy etc). Webalizer (which sucks for user agent analysis unless you compile it yourself, which I didn't). A custom Perl CGI script that I use to browse my daily stats. TextMate's excellent automation capabilities. Etc.

Note that all numbers given here are inherently fuzzy; logfile analysis is almost by definition a very inexact science.

Search Engine Referers

The most common search word that leads people to this site is "dekstop". Before registering this domain I already knew that this was a common typo from looking at search engine results, but I'm still surprised that it is the most common search keyword in my referers by far, with about 2500 hits since the site's beginning in late 2003.
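For readers who want to try this at home, here is a minimal Python sketch of how a search string can be pulled out of a referer. It is not the Perl script mentioned above. It assumes the server writes the common "combined" log format, and that the search engine puts the query into a URL parameter named `q`, `p` or `query`; both the parameter list and the function name `search_terms` are illustrative assumptions. Note that it will also pick up any non-search site that happens to use one of those parameters, which fits the fuzziness warning above.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Tail of a "combined" format log line (assumed format):
#   "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')

# Assumed names of the URL parameter that carries the search string.
QUERY_PARAMS = ("q", "p", "query")


def search_terms(log_line):
    """Return the lowercased search string from a log line's referer, or None."""
    match = LOG_LINE.search(log_line.rstrip("\n"))
    if match is None:
        return None
    # parse_qs decodes %xx escapes and turns '+' into spaces.
    query = parse_qs(urlsplit(match.group("referer")).query)
    for name in QUERY_PARAMS:
        if name in query:
            # Collapse repeated whitespace so equal searches compare equal.
            return " ".join(query[name][0].lower().split())
    return None
```

Feeding every line of a logfile through `search_terms` and dropping the `None` results gives a plain list of search strings that can then be counted.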

Ok, so much for the vanity part. Here's the interesting trivia: strangely, most of the people searching for "dekstop" seem to come from Italian search engines and IP addresses. Apparently (this is my current theory) the series of consonants "skt" in the proper term "desktop" is so uncommon in the Italian language, or so uncomfortable for native Italian speakers to pronounce, that their brains replace it with the consonant series "kst", as in "dekstop". I am not familiar with Italian phonology, nor have I spoken to a linguist, so maybe someone with better knowledge of the subject matter can help explain this phenomenon.

But of course that's only one example in a long list of searches. People also love to do image searches via Google, Yahoo, MSN and others (at least 1350 hits). For some reason that I can't figure out, the most popular image-related search term that leads people to this site seems to be "jeans" (ca. 270 hits).

And people love to combine virtually any search term with the words "bikini" (ca. 10 hits), "sex" (ca. 10 hits), "girl" (ca. 90 hits), "playboy" (ca. 380 hits), and "centerfold" (ca. 400 hits). No surprises here. These types of searches include "sound of music playboy", "bikini blemish photoshop", "scrollbar playboy bunny", "sex in don delillo's underworld", "photographs of girls in white dresses with blue satin sashes", and of course all of those terms combined with the all-time favorite misspelling, "dekstop".

Some other search queries:

Amazing what happens when you take random words out of context, and occasionally combine them with other random words.

Amazing what happens when this mess is then analyzed by a script and printed in a harmless list, thus creating another meta-level of context. From the October 2004 stats:

Hits
77        29.06%  dekstop
...
1          0.38%  judith butler quote gender troubles
1          0.38%  maxim centerfolds
1          0.38%  packages tied up with strings these are a few of my 
...
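A listing like this is easy to produce once the search strings are in a list. The following Python sketch is an assumption about how such a report could be built, not the script that generated the stats above; the name `format_report` is made up for illustration. It treats the percentage column as each query's share of all search hits, which matches the excerpt: 77 hits at 29.06% and 1 hit at 0.38% both point to roughly 265 search hits in total.

```python
from collections import Counter


def format_report(queries):
    """Format search strings as 'hits  percent  query' lines, most frequent first."""
    counts = Counter(queries)
    total = sum(counts.values())
    lines = ["Hits"]
    for query, hits in counts.most_common():
        # Percentage is the query's share of all search hits in the list.
        lines.append(f"{hits:<10}{100 * hits / total:5.2f}%  {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, not taken from the real logs.
    print(format_report(["dekstop", "dekstop", "dekstop", "jeans"]))
```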

And now tell me that logfiles are not entertaining.

