The Wonderful World of Logfile Analysis, Part One: Search Engine Referers

Martin Dittus · 2005-10-30 · data mining · write a comment

One of the things I like to do in my spare time is analyze web server logfiles. They don't even have to be those of my own domain, but it helps if it is a site that I know and use. I started writing an article about some recent findings back in August and, as these things go, left it in a draft state for months. I've now decided to post the first segment of an ongoing series about my logfile analysis habit, and maybe I'll go into some of the techniques I use later on. There are just so many things you can learn from logfiles besides vanity statistics!

I'm presenting you here with an unsorted list of Very Good Reasons Why Having a Domain of Your Own Pays Off in the long run. Reason number one: you have access to the logfiles. Reason number two: people use search engines. Reason number three: you can watch them search!
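To give an idea of what "watching them search" can look like in practice, here is a minimal Python sketch that pulls search phrases out of referer URLs. It is not the tooling behind this series; it assumes Apache's "combined" log format, and the list of query parameter names (q, p, query) is my own guess at what common search engines use.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumption: Apache "combined" log format, i.e. each line ends with
# "request" status bytes "referer" "user-agent".
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: parameter names that search engines use for the search phrase.
QUERY_PARAMS = ("q", "p", "query")


def search_terms(referer):
    """Return the search phrase carried in a referer URL, or None."""
    parts = urlsplit(referer)
    if not parts.netloc:
        # Covers the "-" placeholder that is logged when no referer was sent.
        return None
    params = parse_qs(parts.query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0].strip().lower()
    return None


def count_searches(lines):
    """Count how often each search phrase shows up in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        phrase = search_terms(match.group("referer"))
        if phrase:
            counts[phrase] += 1
    return counts
```

Feeding an open logfile to count_searches() and calling most_common(20) on the result gives a quick top list of phrases. Any referer that happens to carry one of those parameters is counted, search engine or not, so a real script would also want to check the host name.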

Read on for part one of this series: my search engine referers.

Rhinola: JavaScript for the Server!

Martin Dittus · 2005-10-29 · links, software, tools · 4 comments

Chris Zumbrunn has news in the comments of my article "RFE: Server-Side Javascript?": There is a new JavaScript execution framework called Rhinola which looks like what I asked for: a framework that enables server-side JavaScript web development!

A quick search leads to an enthusiastic article ("whoa! rhinola rocks!") on the haboglabobloggin' blog (which just went offline while I was writing this...), but the author mentions that the current incarnation of the software requires quite some Linux admin-fu to get it working; I assume this will change over the next year as the product matures.

Rhinola is currently based on the Servlet-container mod_gcj, but apparently could easily be ported to other environments (someone called hns comments on haboglabobloggin': "possibly, rhinola might become bigger than mod_gcj, so it will eventually start a life of its own". And he adds: "basic performance is very fast.") Rhinola is as close to mod_js as it can get.

So my earlier prediction has come true: JavaScript is gaining popularity as an all-purpose development language. One thing's for sure: there are more and more JavaScript-based frameworks popping up for all levels of application development. Cf. my other article on the topic, "TrimJunction: JavaScript on Rails", and the frameworks Chris mentions in his comment, Helma and OpenMocha (which apparently is Chris' baby).

Just in case you wonder about why this is such exciting news to me, I'll repeat an excerpt from the first article I wrote on this topic:

I'm not actually in a position to compare the technical merits of each of the popular scripting languages available, but I couldn't care less which of them has briefer syntaxes for string operations, better database support, better libraries etc. What I do care about is the amount of time it takes me to get into the zone with the tools I'm using. I'm changing contexts on a regular basis, which means I have to be fluent in a lot of tools, not necessarily in-depth, and the amount of time it takes me to get re-acquainted with a software is directly related to my productivity.

Be sure to also read oriba san's comments on JavaScript as a development language to put my excitement about this into perspective -- he made me realize that I wasn't really aware of what I was asking for. JavaScript might be a popular and easy-to-learn language, but that doesn't necessarily mean it's well suited for this job. So I will have to check out Rhinola myself at some point to find out if server-side JavaScript is actually as cool as I thought it was.

SearchFox RSS's "Topics I Like"

Martin Dittus · 2005-10-27 · data mining, recommendation engines, tools, web services · write a comment

For the last two weeks I've observed SearchFox RSS's list of "Topics I like" to both find out how it's working and to see if it accurately reflects my taste. See my earlier article "SearchFox Rocks. But Where Are the Web Services?" for a little context.

Random observations:

The keywords are indeed ordered by rank, the first keyword being the most highly ranked. You can deduce this by comparing the keyword movements at the start and end of the list over time: lots of movement at the end of the list. Easy come, easy go.

The list actually reflects what I'm reading, although it doesn't necessarily reflect what I would tell you I'm interested in, or, even more so, what I am actually interested in. Maybe I read the wrong feeds. Should never have subscribed to that stupid Ning feed...

And while there are some irrelevant stop words ("not", "bad", "much", ...), the keywords in general can serve as a nice measure of content. Lots of 'Web two-point-oh' (argh, stop it!)-related technologies, and even the '2.0' manages to get on the list.

Which also makes it a nice measure of the "Zeitgeist": watch e.g. the sudden appearance of Flock, its quick rise, and after some days it's already off the radar again (I decided it's not worth it as long as Firefox on Macs still sucks, so I stopped reading about it).

Even more reason for a web service API to let us access this data.

In other news: we now have OPML export and easier subscriptions via bookmarklet (which will definitely influence my subscription behavior...)

So if you are looking for a feed reader, give SearchFox RSS a try. It's getting better and better.

Read on for the full log of my "Topics I like".

Ah, Now I Get It... (An Interview with Joshua Schachter)

Martin Dittus · 2005-10-27 · commentary, links, recommendation engines · 1 comment

On David Weinberger's blog: transcript of a talk and Q+A by Joshua Schachter of del.icio.us. It's a bit sketchy, but has some interesting bits nevertheless.

I was especially delighted by the discussion after Joshua introduces the upcoming "network" and "group" features, where groups are opt-in collaborations and networks more like the current inbox feature, in that users won't be told that you have included them in your network.

Excerpt:

I point out that flickr tells you. Joshua says that every time he gets a notice from some random person that he's been added as a contact "I want to rip my face off."

Joshua: "I'm not trying to build up the delicious community. There are plenty of communities."

That definitely resonates with me, and now I understand why inbox-subscriptions on del.icio.us are one-way streets. I've been planning to filter out comments from Flickr pages for months now, but haven't found a decent Greasemonkey-alternative for Safari yet. Well, wait and see, MouseHole is looking more and more interesting...

A bit further down:

Q: Are you building systems to monitor the trends of what people are doing?
A: Right now it's not hard to identify the outliers. It's not our focus. But my background is in analyzing bulk data.
Q: How about letting your users see that data?
A: I'm generally wary of this. If I publish the most clicked-on list, then it becomes a high score list that people will try to get on.

Other interesting bits:

...and after using Rails for a while I can definitely see how more lightweight frameworks like Mason make sense for high-traffic applications.

This Week in Tech with Larry Lessig

Martin Dittus · 2005-10-25 · a new world, intellectual property, links · write a comment

The new episode of This Week in Tech features copyright lawyer and Creative Commons-guru Lawrence Lessig, and it's a great show with lots of great clarifications and anecdotes about America's current state of copyright law. Probably the best way to learn about the subject and still be entertained.

The cast of this show is a pretty diverse list of characters coming from different backgrounds (most are journalists of some kind, some are also publishers, hobby musicians, or software developers, and Lessig obviously is a lawyer), and it's great to see how they interact. It seems to me that this is a rare TWiT show where all participants are equally committed and emotionally invested in the show's topic.

And yet they still can easily find a consensus -- which just shows how the existing system is only backed by those who have a financial interest in the distribution, and not by those who actually produce content.

It's also interesting to compare the situation in the US with our own problems in Germany; see "PopKomm 2005 - Business as Usual" for some pointers.

It's Coming, It's Coming!

Martin Dittus · 2005-10-13 · commentary, web services · write a comment

This is an update on my ongoing series on web service client authorization ([1], [2]): exciting times ahead!

To reiterate: there are more and more sites appearing that provide means to connect to them to read and manipulate data you have stored on their servers via web service APIs. This enables the creation of third-party applications and services that build upon these sites and enhance their services. E.g., there is an iPhoto plugin for Flickr, lots of alternative interfaces to your del.icio.us bookmarks, etc.

The problem: nearly every one of these third-party applications requires you to hand over the username and password of the site it connects to. I've outlined in the two articles referenced above why this is a bad trend, and why we need alternative mechanisms to authenticate third-party services when they connect to your user account.

Earlier today I posted a comment on an article by Pascal Van Hecke which touched upon this topic. I have just received a mail from him where he points me to an ongoing discussion on the del.icio.us mailing list about "remote application authorization", ignited by Joshua Schachter's announcement that he will be implementing such a token-based scheme for del.icio.us. Josh starts off with a short description of what he's trying to accomplish, and it's looking great.

This means we will finally be able to safely use other people's services and software tools to get more value from our del.icio.us accounts without having to reveal passwords to strangers, and we might even be presented with a way to revoke access rights as well. Because it's del.icio.us, I'm confident that other service providers will adopt similar schemes soon; or as I wrote in the comments to Pascal's article: "Joshua, I love it that you are the first to do it, because you have the freedom to do it right, and the exposure and popularity to motivate others to follow suit." Yay for small teams!
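To make the idea a bit more concrete, here is a rough Python sketch of what token-based authorization could look like on the service side. This is not Joshua's design (the mailing list thread is the place to look for that); the TokenStore class and its method names are hypothetical and only illustrate the two properties I care about: the third-party application never sees your password, and you can revoke its access later.

```python
import secrets


class TokenStore:
    """Hypothetical in-memory registry of per-application access tokens."""

    def __init__(self):
        # Maps token -> (username, app_name). A real service would persist this.
        self._tokens = {}

    def grant(self, username, app_name):
        """Issue a random token that app_name can present instead of a password."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (username, app_name)
        return token

    def authenticate(self, token):
        """Return the username a token belongs to, or None if it is unknown."""
        entry = self._tokens.get(token)
        return entry[0] if entry else None

    def revoke(self, username, app_name):
        """Invalidate all tokens the user granted to one application."""
        stale = [t for t, owner in self._tokens.items()
                 if owner == (username, app_name)]
        for token in stale:
            del self._tokens[token]
        return len(stale)
```

The user would call grant() while logged in to the site itself and hand only the resulting token to the third-party tool. Revoking that token cuts off one application without touching the password or any other application.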

(update: I was traveling when I first posted this, so this post may seem a bit sketchy. I've now edited it a bit to make it clearer what I'm writing about.)

SearchFox Rocks. But Where Are the Web Services?

Martin Dittus · 2005-10-12 · commentary, data mining, recommendation engines, tools, web services · 2 comments

SearchFox is really great. It's a web-based feed reader (currently in beta) that watches you reading feeds, and which uses this attention data to improve your reading experience. After you have used it for a while SearchFox develops an understanding of the things you care about, and presents these accordingly (feed articles are sorted by ranking, not time).

I just thought that it would be great to get access to the attention data captured with SearchFox, as it would allow us to develop applications based on the service. E.g. something like Nick Bradbury's Feed Reports in FeedDemon. As a satisfied user and a developer I would embrace such a web service instantly.

Read on for an explanation.

So it is a Corporate Internet Boom after all.

Martin Dittus · 2005-10-11 · commentary · write a comment

The last time I wrote about it, this was little more than an extrapolation of what was happening, but now (only weeks later) we're getting beaten over the head with it.

This is what I see when I search my memory and feeds archive of the last couple of weeks for the term "buys":

... and most of these businesses were traded for unholy sums of cash. Note how the left hand side of all transactions is comprised of a very limited set of companies. Note how the right hand side of all transactions points to some of the same trends everybody's already talking about. Note how nobody wants to be left behind (it'll be interesting to see how Google integrates Meetroduction, and how Yahoo integrates Upcoming.org).

To prevent any confusion between Weblogs, Inc. and Weblogs.com:

I won't introduce the other companies, but close with some random notes:

The eBay/Skype transaction probably was the biggest surprise. It's easy to see why search engines are interested in "location-aware social networking software" (think Google Maps), why media companies are interested in content publishing communities (both infrastructure and actual content), and why Fox is interested in the apparently most popular meeting ground for American teenagers on the web.

Let's see how this changes the customer experience over the next year. I'm still fearing for Flickr.

IRC Bots on Web Services

Martin Dittus · 2005-10-09 · data mining, stuff, web services · 7 comments

Take a look at this very strange del.icio.us account: http://del.icio.us/cuthu -- I stumbled upon this user while datamining my own del.icio.us account with simple Ruby scripts and an SQLite database. His account shares three bookmarks with mine (covering three very distinct and arbitrary topics), and it only caught my eye because of the very strange appearance of its bookmarks. So I took a look at the user's del.icio.us page.
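For the curious, the "who shares bookmarks with me" question boils down to a self-join. The sketch below shows the idea in Python rather than the Ruby I actually used, and the bookmarks table, its columns, and the sample rows are made up for illustration; my real schema looks different.

```python
import sqlite3


def users_sharing_bookmarks(conn, me, min_shared=2):
    """List (username, shared_count) for users who bookmarked the same URLs as me."""
    rows = conn.execute(
        """
        SELECT other.username, COUNT(*) AS shared
        FROM bookmarks AS mine
        JOIN bookmarks AS other
          ON other.url = mine.url AND other.username <> mine.username
        WHERE mine.username = ?
        GROUP BY other.username
        HAVING COUNT(*) >= ?
        ORDER BY shared DESC, other.username
        """,
        (me, min_shared),
    )
    return rows.fetchall()


# Hypothetical sample data: one row per (user, bookmarked URL) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (username TEXT, url TEXT, UNIQUE (username, url))")
conn.executemany(
    "INSERT INTO bookmarks VALUES (?, ?)",
    [
        ("me", "http://example.org/a"),
        ("me", "http://example.org/b"),
        ("me", "http://example.org/c"),
        ("me", "http://example.org/d"),
        ("alice", "http://example.org/a"),
        ("alice", "http://example.org/b"),
        ("alice", "http://example.org/c"),
        ("bob", "http://example.org/a"),
    ],
)

print(users_sharing_bookmarks(conn, "me"))
```

Raising min_shared filters out the one-off coincidences, which is what makes an account with several overlapping but unrelated bookmarks stand out.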

Read on for an analysis of the output of what looks like an exotic IRC bot/del.icio.us API mashup.

On Language

Martin Dittus · 2005-09-29 · stuff · write a comment

Surprisingly language constructs emitted by the same entity seem mostly repetitive, even if covering different subjects. "Interesting".

Popkomm Panel: Management, the new Majors? - dekstop weblog
"Surprisingly the most interesting aspects of the panel were the insights into DJ Bobo's business life; take a closer look at his numbers quoted below."
to dekstop.de popkomm conferences music pop_culture distribution ... on 2005-09-29 ...

Popkomm Panel: A&R in a Digital Environment - dekstop weblog
"Surprisingly the panel was mostly about ringtones -- it turned out interesting nevertheless, even if the participants enthusiastically painted a picture of a brave new world that to me looked rather devastating."
to dekstop.de popkomm pop_culture ringtones jamba longtail music conferences ... on 2005-09-29 ...