Spotlight Helps Fight Comment Spam!

Martin Dittus · 2006-07-07 · code, data mining, osx, tools

I'm using a combination of fairly primitive methods to cope with blog spam. As this blog doesn't get too many comments, the amount of manual work is relatively limited; my main line of defense is an old-fashioned and relatively short blacklist. I'm notified of incoming comments, and in the rare event that a spam comment gets through I'll inspect it for new keywords.

Over the last couple of months it has become apparent that some posts attract more spam than others.

I just thought that it would be great to have statistics on this phenomenon -- so that I could make an informed decision whether it makes sense to close comments for particular old articles.

Unfortunately the tools I use create no such statistics. I have a log of successfully filtered keywords, but this log doesn't map blocked attempts to individual articles.

But I still have all comment notification emails of spam that got through the filter -- and these are a relatively accessible source of the required information:

Comment notifications use a distinct mail subject, and mention the article URL in the email body.
When I delete a spam comment I also delete the respective email notification.
I never empty my email trash folder.

... which means I still have all notifications for successful spam comments, stored away in a separate folder from 'legit' comment notifications. Combine that with Spotlight and a short Ruby script, and creating a 'successfully spammed articles' ranking is a piece of cake.

Taming Spotlight

The mdls command line tool lists all Spotlight metadata for a specific file, so if you point it at an email file (~/Library/Mail/.../Messages/*.emlx) you get a list of the metadata attributes available for filtering email messages. I was only interested in the mail subject, kMDItemTitle; but there are also kMDItemAuthors, kMDItemAuthorEmailAddresses, kMDItemContentCreationDate, and others.

The mdfind command line tool then lets you search the Spotlight database. I'm using a query like this:

mdfind -onlyin ~/Library/Mail/folder_name/ "kMDItemTitle == '*New Comment*'"

This results in a list of files:

/Users/.../Messages/11579.emlx
/Users/.../Messages/5077.emlx
/Users/.../Messages/5171.emlx
/Users/.../Messages/5218.emlx
/Users/.../Messages/5279.emlx
/Users/.../Messages/5280.emlx
/Users/.../Messages/5282.emlx
/Users/.../Messages/11550.emlx
/Users/.../Messages/11551.emlx
...etc
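
From a script, the same query can be run and its output split into a list of paths. This is only a minimal sketch using Ruby's standard Open3 library, not necessarily what the downloadable script does; the helper names mdfind_command, parse_mdfind_output and find_notifications are made up for illustration, and folder_name is the same placeholder as in the command above.

```ruby
require 'open3'

# Build the argument list for mdfind. Passing separate arguments (instead
# of one shell string) sidesteps shell quoting of the query.
def mdfind_command(folder, subject_fragment)
  query = "kMDItemTitle == '*#{subject_fragment}*'"
  ['mdfind', '-onlyin', File.expand_path(folder), query]
end

# mdfind prints one path per line; strip whitespace and drop blank lines.
def parse_mdfind_output(output)
  output.each_line.map(&:strip).reject(&:empty?)
end

# Run the query and return the matching file paths (works on Mac OS X only).
def find_notifications(folder, subject_fragment)
  output, status = Open3.capture2(*mdfind_command(folder, subject_fragment))
  raise 'mdfind failed' unless status.success?
  parse_mdfind_output(output)
end
```

On a Mac, calling find_notifications('~/Library/Mail/folder_name/', 'New Comment') should give an array of .emlx paths like the list above.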

And Then Count

The .emlx file format is afaik proprietary, but when you open it in a text editor you'll see that it's a plaintext format and includes, among other things, the raw email message with full headers and body text. In my case I was only interested in extracting an article URL from each of those emails, so I didn't even bother to look at the file format but simply used a regular expression against the complete file content.
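
The extraction step might look roughly like this. The pattern is an assumption based on the article paths in my ranking (/weblog/year/month/slug/), and the names ARTICLE_PATH, extract_article_path and article_path_in_file are hypothetical; the regular expression in the actual script may differ.

```ruby
# Hypothetical pattern: matches paths such as /weblog/2006/01/some_slug/.
ARTICLE_PATH = %r{/weblog/\d{4}/\d{2}/[a-z0-9_]+/}

# Return the first article path found in the text, or nil if there is none.
def extract_article_path(text)
  text[ARTICLE_PATH]
end

# Apply the pattern to the complete content of one .emlx file.
# Reading in binary mode avoids encoding errors on odd bytes in the mail.
def article_path_in_file(path)
  extract_article_path(File.binread(path))
end
```

Only the first match per file is used, which is enough as long as each notification mentions a single article.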

Then count each captured string in a hash, and finally print the ranking:

Searching... found 94 emails.
Parsing.........................................................
..................................... done.
Ranking:
11 times /weblog/2006/01/flip4mac_has_a_strange_eula/
9 times /weblog/2006/03/added_article_feeds_with_comments/
7 times /weblog/2005/08/screw_objectivism/
7 times /weblog/2005/12/mailfeedorg_has_launched/
7 times /weblog/2005/12/osx_10_4_3_phoning_home/
5 times /weblog/2006/06/notice_the_information_bar/
5 times /weblog/2006/03/jabber_server_space_starts_boiling/
4 times /weblog/2005/11/new_comment_spam/
4 times /weblog/2006/03/yarv_benchmark_with_rexml/
4 times /weblog/2005/11/team_ramrod/
4 times /weblog/2005/08/a_first_look_at_pandora/
3 times /weblog/2005/08/securely_connect_web_services/
3 times /weblog/2005/12/contact_form_spam_bots/
2 times /weblog/2005/12/qre_recommendations_from_your_database/
2 times /weblog/2005/08/small_feed_changes/
2 times /weblog/2006/01/gmail_html_view_as_default_view/
2 times /weblog/2005/07/ballmer_creeps_me_out/
1 times /weblog/2005/10/rhinola_javascript_for_the_server/
1 times /weblog/2006/06/midnightbot/
1 times /weblog/2005/12/printable_schedule_for_22c3/
1 times /weblog/2006/03/feed_readers_a_commodity/
1 times /weblog/2006/01/visualization_of_numeric_data/
1 times /weblog/2005/09/marc_mcdonald/
1 times /weblog/2005/11/100_dollar_laptop/
1 times /weblog/2005/11/type_managers_replacing_file_managers/
1 times /weblog/2005/07/backup_hell/
1 times /weblog/2006/02/delicious_404/
1 times /weblog/2005/10/webservice_auth_finally_coming/
1 times /weblog/2005/12/rails_inflector_in_ruby_scripts/
1 times /weblog/2005/11/22c3_schedule/
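
The counting and ranking behind this output can be sketched in a few lines of Ruby. Again this is an illustration, not the script itself: count_paths and ranking_lines are hypothetical names, and ties are broken alphabetically here to make the order predictable, which the listing above evidently doesn't do.

```ruby
# Count how often each article path occurs. Hash.new(0) makes missing
# keys start at zero, so no existence check is needed before incrementing.
def count_paths(paths)
  counts = Hash.new(0)
  paths.each { |path| counts[path] += 1 }
  counts
end

# Sort by count (highest first), then by path, and format one line each.
def ranking_lines(counts)
  counts.sort_by { |path, n| [-n, path] }
        .map { |path, n| "#{n} times #{path}" }
end
```

Something like puts ranking_lines(count_paths(paths)) then prints the ranking, given an array of extracted article paths.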

I was a little surprised to see the result. The ranking of total spam attacks, as opposed to successful attacks as shown here, probably looks a little different. But to get such a ranking I might have to parse Apache logfiles and correlate timestamps with the blacklist log -- much less fun.

Download

comment_spam_stats.rb.txt, Ruby script
