I'm using a combination of fairly primitive methods to cope with blog spam. As this blog doesn't get too many comments the amount of manual work is relatively limited; the main line of defense is an old-fashioned and relatively short blacklist. I'm notified of incoming comments, and in the rare event that a spam comment gets through I'll inspect it for new keywords.
For a couple of months now it's become apparent that specific posts seem to attract more spam than others.
I just thought that it would be great to have statistics on this phenomenon -- so that I could make an informed decision whether it makes sense to close comments for particular old articles.
Unfortunately the tools I use create no such statistics. I have a log of successfully filtered keywords, but this log doesn't map blocked attempts to individual articles.
But I still have all comment notification emails of spam that got through the filter -- and these are a relatively accessible source of the required information:
- Comment notifications use a distinct mail subject, and mention the article URL in the email body.
- If I delete comment spam I will then also delete the respective email notification.
- I never empty my email trash folder.
... which means I still have all notifications for successful spam comments, stored away in a separate folder from 'legit' comment notifications. Combine that with Spotlight and a short Ruby script, and creating a 'successfully spammed articles' ranking is a piece of cake.
Taming Spotlight
The mdls command line tool lists all Spotlight metadata for a specific file, so if you point it to an email file (~/Library/Mail/.../Messages/*.emlx) you can get a list of the available metadata for filtering email messages. I was only interested in the mail subject, kMDItemTitle; but there are also kMDItemAuthors, kMDItemAuthorEmailAddresses, kMDItemContentCreationDate, and others.
The mdfind command line tool then lets you search the Spotlight database. I'm using a query like this:
mdfind -onlyin ~/Library/Mail/folder_name/ "kMDItemTitle == '*New Comment*'"
This results in a list of files:
/Users/.../Messages/11579.emlx
/Users/.../Messages/5077.emlx
/Users/.../Messages/5171.emlx
/Users/.../Messages/5218.emlx
/Users/.../Messages/5279.emlx
/Users/.../Messages/5280.emlx
/Users/.../Messages/5282.emlx
/Users/.../Messages/11550.emlx
/Users/.../Messages/11551.emlx
...etc
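To get that file list into a Ruby script, something along these lines should do. This is only a sketch and not necessarily what the downloadable script does; the method names mdfind_args and find_notifications are made up for illustration, and the folder and subject fragment are whatever applies to your own mail setup.

```ruby
# Build the argument list for mdfind. Kept separate so the query can be
# inspected without touching Spotlight.
# (Hypothetical helper name, not taken from the original script.)
def mdfind_args(folder, subject_fragment)
  query = "kMDItemTitle == '*#{subject_fragment}*'"
  ["mdfind", "-onlyin", File.expand_path(folder), query]
end

# Run the query and return the matching .emlx paths as an array of strings.
# Passing an array to IO.popen bypasses the shell, so the quotes inside the
# query need no extra escaping. This only works where mdfind exists (OS X).
def find_notifications(folder, subject_fragment)
  IO.popen(mdfind_args(folder, subject_fragment), &:readlines).map(&:chomp)
end
```

Calling find_notifications("~/Library/Mail/folder_name/", "New Comment") would then return one path per matching email.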
And Then Count
The .emlx file format is afaik proprietary, but when you open it in a text editor you'll see that it's a plaintext format and includes, among other things, the raw email message with full headers and body text. In my case I was only interested in extracting an article URL from each of those emails, so I didn't even bother to look at the file format but simply used a regular expression against the complete file content.
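A rough sketch of that extraction step follows. The pattern is an assumption on my part: it is inferred from the /weblog/YYYY/MM/slug/ shape of this blog's article paths, and the regular expression in the actual script may well differ. The names ARTICLE_PATH, article_path and article_path_from_file are illustrative only.

```ruby
# Assumed URL shape: /weblog/YYYY/MM/slug/ with a lowercase slug made of
# letters, digits and underscores. Adjust for your own blog's URLs.
ARTICLE_PATH = %r{/weblog/\d{4}/\d{2}/[a-z0-9_]+/}

# Return the first article path found in the raw content, or nil if none.
def article_path(raw)
  raw[ARTICLE_PATH]
end

# Read the whole .emlx file as bytes and search it. Binary mode avoids
# encoding errors, since mail files can mix character sets.
def article_path_from_file(path)
  article_path(File.binread(path))
end
```

This takes the first match only, which assumes each notification mentions exactly one article.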
Then count each captured string in a hash, and finally print the ranking:
Searching... found 94 emails.
Parsing......................................................... ..................................... done.
Ranking:
11 times /weblog/2006/01/flip4mac_has_a_strange_eula/
9 times /weblog/2006/03/added_article_feeds_with_comments/
7 times /weblog/2005/08/screw_objectivism/
7 times /weblog/2005/12/mailfeedorg_has_launched/
7 times /weblog/2005/12/osx_10_4_3_phoning_home/
5 times /weblog/2006/06/notice_the_information_bar/
5 times /weblog/2006/03/jabber_server_space_starts_boiling/
4 times /weblog/2005/11/new_comment_spam/
4 times /weblog/2006/03/yarv_benchmark_with_rexml/
4 times /weblog/2005/11/team_ramrod/
4 times /weblog/2005/08/a_first_look_at_pandora/
3 times /weblog/2005/08/securely_connect_web_services/
3 times /weblog/2005/12/contact_form_spam_bots/
2 times /weblog/2005/12/qre_recommendations_from_your_database/
2 times /weblog/2005/08/small_feed_changes/
2 times /weblog/2006/01/gmail_html_view_as_default_view/
2 times /weblog/2005/07/ballmer_creeps_me_out/
1 times /weblog/2005/10/rhinola_javascript_for_the_server/
1 times /weblog/2006/06/midnightbot/
1 times /weblog/2005/12/printable_schedule_for_22c3/
1 times /weblog/2006/03/feed_readers_a_commodity/
1 times /weblog/2006/01/visualization_of_numeric_data/
1 times /weblog/2005/09/marc_mcdonald/
1 times /weblog/2005/11/100_dollar_laptop/
1 times /weblog/2005/11/type_managers_replacing_file_managers/
1 times /weblog/2005/07/backup_hell/
1 times /weblog/2006/02/delicious_404/
1 times /weblog/2005/10/webservice_auth_finally_coming/
1 times /weblog/2005/12/rails_inflector_in_ruby_scripts/
1 times /weblog/2005/11/22c3_schedule/
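The counting and ranking step could be sketched roughly like this. Again this is an illustration rather than the script itself: rank and print_ranking are made-up names, and breaking ties alphabetically is my own choice to keep the result reproducible (the output above does not order ties that way).

```ruby
# Count how often each article path occurs and return [path, count] pairs,
# highest count first. Ties are broken alphabetically by path (an assumption
# made for this sketch).
def rank(paths)
  counts = Hash.new(0)                 # missing keys start at zero
  paths.each { |path| counts[path] += 1 }
  counts.sort_by { |path, n| [-n, path] }
end

# Print the ranking in the "N times /path/" form shown above.
def print_ranking(ranking, out = $stdout)
  out.puts "Ranking:"
  ranking.each { |path, n| out.puts "#{n} times #{path}" }
end
```

Feed rank the list of extracted article paths (with any nil entries removed) and pass the result to print_ranking.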
I was a little surprised to see the result. The ranking of total spam attacks, as opposed to successful attacks as shown here, probably looks a little different. But to get such a ranking I might have to parse Apache logfiles and correlate timestamps with the blacklist log -- much less fun.
Download
- comment_spam_stats.rb.txt, Ruby script
Related Articles
- Contact Form Spam Bots Ahead
- New Wave of Comment Spam: PageRank Indirection
- Critique and Countercritique