IRC Bots on Web Services

Martin Dittus · 2005-10-09 · data mining, stuff, web services · 7 comments

Take a look at this very strange del.icio.us account: http://del.icio.us/cuthu -- I stumbled upon this user while data mining my own del.icio.us account with simple Ruby scripts and an SQLite database. The account shares three bookmarks with mine (covering three distinct and unrelated topics), and it caught my eye only because its bookmarks look so odd. So I took a look at the user's del.icio.us page.

Excerpt from the page (sans formatting):

http://www.netfunny.com/rhf/jokes/05/Sep/fema4.html
[nitrogen:#geeks] http://www.netfunny.com/rhf/jokes/05/Sep/fema4.html
to nitrogen #geeks ... on 2005-09-27 ... copy this item

http://qdb.us/48067
[prj:#geeks] n2: http://qdb.us/48067
to prj #geeks ... on 2005-09-27 ... copy this item

http://bash.org/?543436
[kreaturr:#geeks] bastards, why'd someone mention bash? http://bash.org/?543436
to kreaturr #geeks ... and 2 other people ... on 2005-09-27 ... copy this item

http://qdb.us/48628
[myself:#geeks] Whoah, spooky: http://qdb.us/48628
to myself #geeks ... on 2005-09-27 ... copy this item

qdb.us
[myself:#geeks] qdb.us moderates quicker.
to myself #geeks ... and 17 other people ... on 2005-09-27 ... copy this item

My first thought: what the hell is that? Not only does the corresponding tag cloud have a very strange frequency distribution, distinctly different from the usual "organic" distributions (which becomes most visible when you set del.icio.us to show tags as an alpha-sorted tag cloud), but the bookmark descriptions also follow a strange scheme. Initially I only noticed the surprisingly frequent hash characters, but then found that the note accompanying each bookmark also repeats the bookmark's target URL, occasionally wrapped in a sentence.

After a while it became clear: this is the output of an IRC bot, one that watches IRC channels for messages containing URLs and then creates a del.icio.us bookmark for each URL. Each bookmark's generated description starts with a prefix specifying a username and IRC channel (e.g., "[beth:#geeks]"), followed by the actual IRC message (e.g., "anyone seen this: http://www.ning.com/"). And each bookmark always gets two tags: one for the IRC channel in which the conversation took place, and one for the username whose message contained the URL. (Which explains the strange frequency distribution of tags.)
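The behaviour described above can be sketched in a few lines. This is only a guess at what such a bot does, written in Python rather than taken from the bot's real source, which I haven't seen. The names `Bookmark`, `bookmark_from_message` and `URL_PATTERN` are made up for illustration, and the sketch only recognises URLs that start with http:// or https://, whereas the real bot apparently also picks up bare domains such as "qdb.us".

```python
import re
from typing import NamedTuple, Optional

# Rough URL matcher: good enough for a sketch, not a full URL grammar.
# Assumption: only http:// and https:// links are recognised.
URL_PATTERN = re.compile(r"https?://\S+")


class Bookmark(NamedTuple):
    """Hypothetical record for what the bot would post to del.icio.us."""
    url: str
    description: str
    tags: tuple


def bookmark_from_message(nick: str, channel: str, message: str) -> Optional[Bookmark]:
    """Build a bookmark for the first URL in an IRC message, or return None."""
    match = URL_PATTERN.search(message)
    if match is None:
        return None  # no URL in this message, nothing to bookmark
    # Description: "[nick:#channel]" prefix followed by the unmodified message.
    description = f"[{nick}:{channel}] {message}"
    # Exactly two tags: the speaker's nick and the channel.
    return Bookmark(url=match.group(0), description=description, tags=(nick, channel))
```

Feeding it the nick "kreaturr", the channel "#geeks" and the bash.org message from the excerpt should give a bookmark for http://bash.org/?543436 with the tags "kreaturr" and "#geeks".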

With this knowledge you can read the bookmarks and actually make sense of their content. E.g. on 2005-09-27, kreaturr said in the #geeks channel: "bastards, why'd someone mention bash? http://bash.org/?543436" -- and the bot promptly generated a del.icio.us bookmark for http://bash.org/?543436.
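Reading the bookmarks this way is easy to automate. The following Python sketch splits a description back into nick, channel and message; the pattern is inferred from the excerpt above and may not cover every description the bot produces, and `IrcLine` and `parse_description` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Inferred format: "[nick:#channel] message". Assumes nicks contain no ":" or "]".
DESCRIPTION_PATTERN = re.compile(
    r"^\[(?P<nick>[^:\]]+):(?P<channel>#[^\]]+)\]\s*(?P<message>.*)$"
)


class IrcLine(NamedTuple):
    nick: str
    channel: str
    message: str


def parse_description(description: str) -> Optional[IrcLine]:
    """Recover (nick, channel, message) from a bookmark description, or None."""
    match = DESCRIPTION_PATTERN.match(description)
    if match is None:
        return None  # description does not follow the bot's scheme
    return IrcLine(match.group("nick"), match.group("channel"), match.group("message"))
```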

I think it's a neat experiment, and although I know nothing of the people or the motivation behind it, it's fun to look at the generated data.

As I'm writing this there are about 7700 bookmarks in the profile; the first bookmark was set on 2004-10-17.

The bot seems to watch only two channels, #geeks and #notacon, but one should note that the bookmarks tagged #notacon only have dates between 2005-10-03 and 2005-10-05. Originally I thought that Notacon was probably some kind of convention and that the channel #notacon was only active while the convention took place, but the Notacon everybody points to already took place in April, so there must be another reason for the time-limited activity. Maybe there simply is more than one Notacon and I just didn't find the one that took place in October.
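A date range like the one for #notacon is simple to compute once the bookmarks are available as (date, tags) pairs. This is a Python sketch under that assumption; the input shape and the name `date_span_by_tag` are invented for illustration.

```python
from datetime import date


def date_span_by_tag(bookmarks):
    """Map each tag to the (earliest, latest) date on which it was used.

    `bookmarks` is assumed to be an iterable of (date, tags) pairs.
    """
    spans = {}
    for day, tags in bookmarks:
        for tag in tags:
            first, last = spans.get(tag, (day, day))
            spans[tag] = (min(first, day), max(last, day))
    return spans
```

A tag whose span covers only a few days, while the rest of the account spans a year, stands out immediately in the result.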

The rest of the tags are about 150 distinct usernames, although some of those are actually variations of the same name, so the number of people participating in this experiment (willingly or not) is probably somewhere between 100 and 130.
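A rough way to collapse name variations is to normalise each nick before counting. The rules in this Python sketch are assumptions based on common IRC habits (case differences, trailing "_" or "`", suffixes like "|away"), not something derived from cuthu's actual tag list, so the result is an estimate at best. `canonical_nick` and `group_nicks` are hypothetical names.

```python
import re
from collections import defaultdict


def canonical_nick(nick: str) -> str:
    """Reduce a nick to a guessed base form (assumed conventions, see above)."""
    base = nick.lower()
    # Drop status suffixes such as "|away" or "[work]".
    base = re.split(r"[|\[]", base, maxsplit=1)[0]
    # Drop trailing "_", "`" and "-", which often mark an alternate nick.
    base = base.rstrip("_`-")
    # Trailing digits are kept on purpose: nicks like "n2" are real names.
    return base or nick.lower()


def group_nicks(nicks):
    """Group the given nicks by canonical form; the group count estimates people."""
    groups = defaultdict(set)
    for nick in nicks:
        groups[canonical_nick(nick)].add(nick)
    return dict(groups)
```

The number of groups is then a lower-bound style guess at the number of people: two different users with similar nicks would wrongly be merged, while one user with unrelated nicks would still be counted twice.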

There are no tags describing the actual link content, which would be rather hard for a bot to autogenerate -- but which is a pity, since it could provide us with a simple way to find out what these people are talking about.

While I found all this I was playing around with small Ruby scripts that, among other things, can generate a list of the tags that other people use for URLs you have bookmarked. I would like to run them over cuthu's profile data, but there seems to be no easy way to export all bookmarks from another person's account, and I certainly don't have cuthu's login data, so the scripts aren't of much help here.
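To give an idea of what such a script does, here is a sketch of the "tags other people use for my URLs" query. It is written in Python with the standard sqlite3 module rather than Ruby, and the single-table schema `bookmarks(username, url, tag)` is an assumption made for this example, not the layout of my actual database.

```python
import sqlite3

# Hypothetical schema: one row per (user, URL, tag) combination.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bookmarks (
    username TEXT NOT NULL,
    url      TEXT NOT NULL,
    tag      TEXT NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def tags_others_use(conn, me):
    """Return (tag, count) pairs for tags other users put on URLs `me` bookmarked."""
    query = """
        SELECT other.tag, COUNT(*) AS uses
        FROM bookmarks AS other
        WHERE other.username != ?
          AND other.url IN (SELECT url FROM bookmarks WHERE username = ?)
        GROUP BY other.tag
        ORDER BY uses DESC, other.tag ASC
    """
    return conn.execute(query, (me, me)).fetchall()
```

The catch is the same as described above: the query only works once other people's bookmarks for those URLs are already in the table, and getting a whole foreign account into it is the hard part.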

To compensate, I skimmed over the bookmarks manually and found links to sites about web development, contemporary scripting technologies, mapping applications, mashups, and various joke and prank sites. That's not very surprising, as this probably describes most of the bookmarks in the huge del.icio.us database.

Anyways. Always surprising what you find when you least expect it.


Comments

Odd that I would stumble upon this (came to your page from the /. post about broken Sony cameras).

Anyhow, I recognize the channels listed. Those channels are on irc.cwru.edu, composed (as you might guess) mainly of former and current Case Western Reserve University students. As the name #geeks suggests, Comp Sci and Engineering sorts, mostly. I was never much of a regular there, but I knew a decent number of the ones who were. Notacon is a fairly new but continuing annual event started by Froggy, a former Case student and now a Case employee.

pimlottc, 2005-10-11 15:54 CET (+0100)


Join us on our irc server if you have additional questions. It's interesting that you found us. Meet the people behind the mysterious links :) irc.cwru.edu

Froggy, 2005-10-11 16:13 CET (+0100)


Notacon is an annual convention in Cleveland, Ohio, in April. Lots of info at Notacon.org. I idle on that IRC network. I don't know who's running the bot, but I am guessing that they abandoned running it in #notacon simply because there is nowhere near as much traffic there as in the main channel, #geeks, and therefore fewer URLs to snarf.

P.S. come to Notacon ;) This year is the third year, and I had a blast at both of the last two :)

omal, 2005-10-11 16:18 CET (+0100)


Thanks guys ;)

martin, 2005-10-11 18:37 CET (+0100)


I am one of the founders of Notacon, and another user of irc.cwru.edu. I would like to hypothesize that the reason you don't see as many links from #notacon is that we generally only discuss con business there, and really don't share links. :)

The bot is indeed active on that channel.

Tyger, 2005-10-12 05:25 CET (+0100)


Aaah.. I didn't even think of that!

martin, 2005-10-12 11:50 CET (+0100)


Hey, I just found that this article has found its way into the cuthu account, because moebius posted it in #geeks -- of course the surrounding conversation isn't visible from the del.icio.us listing.

How very cool. We have now created our own Heisenberg joke (the professor yells out "No fair! By observing the results you've changed them!"): http://en.wikipedia.org/wiki/Uncertainty_principle

martin, 2005-10-13 00:00 CET (+0100)


Comments are closed. You can contact me instead.