I'm using a combination of fairly primitive methods to cope with blog spam. As this blog doesn't get too many comments, the amount of manual work is relatively limited; the main line of defense is an old-fashioned and relatively short blacklist. I'm notified of incoming comments, and in the rare event that a spam comment gets through I inspect it for new keywords.
Over the past couple of months it has become apparent that specific posts attract more spam than others.
I just thought that it would be great to have statistics on this phenomenon -- so that I could make an informed decision whether it makes sense to close comments for particular old articles.
Unfortunately the tools I use create no such statistics. I have a log of successfully filtered keywords, but this log doesn't map blocked attempts to individual articles.
But I still have all comment notification emails of spam that got through the filter -- and these are a relatively accessible source of the required information:
- Comment notifications use a distinct mail subject, and mention the article URL in the email body.
- Whenever I delete a spam comment, I also delete the corresponding email notification.
- I never empty my email trash folder.
... which means I still have all notifications for successful spam comments, stored away in a separate folder from 'legit' comment notifications. Combine that with Spotlight and a short Ruby script, and creating a 'successfully spammed articles' ranking is a piece of cake.
The mdls command line tool lists all Spotlight metadata for a specific file, so if you point it to an email file (~/Library/Mail/.../Messages/*.emlx) you can get a list of the available metadata for filtering email messages. I was only interested in the mail subject, kMDItemTitle; but there are also kMDItemAuthors, kMDItemAuthorEmailAddresses, kMDItemContentCreationDate, and others.
The mdfind command line tool then lets you search the Spotlight database. I'm using a query like this:
mdfind -onlyin ~/Library/Mail/folder_name/ "kMDItemTitle == '*New Comment*'"
This results in a list of files:
/Users/.../Messages/11579.emlx
/Users/.../Messages/5077.emlx
/Users/.../Messages/5171.emlx
/Users/.../Messages/5218.emlx
/Users/.../Messages/5279.emlx
/Users/.../Messages/5280.emlx
/Users/.../Messages/5282.emlx
/Users/.../Messages/11550.emlx
/Users/.../Messages/11551.emlx
...etc
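For scripting, the same query can be run from Ruby. The following is a minimal sketch, not necessarily how the original script did it: the helper names parse_mdfind_output, subject_query and find_notifications are made up for illustration, and the folder name is the same placeholder as in the query above.

```ruby
# Sketch: run the mdfind query from Ruby and collect the matching file paths.
# All helper names here are hypothetical; folder and subject are placeholders.

# Turn mdfind's output (one path per line) into an array of paths.
def parse_mdfind_output(output)
  output.each_line.map(&:strip).reject(&:empty?)
end

# Build the query string for a substring match on the mail subject.
def subject_query(subject)
  "kMDItemTitle == '*#{subject}*'"
end

# Run mdfind (macOS only) and return the list of matching files.
# The array form of IO.popen bypasses the shell, so "~" is expanded here.
def find_notifications(folder, subject)
  command = ["mdfind", "-onlyin", File.expand_path(folder), subject_query(subject)]
  parse_mdfind_output(IO.popen(command, &:read))
end

# Example call, using the folder and subject from the query above:
#   find_notifications("~/Library/Mail/folder_name/", "New Comment")
```

Keeping the output parsing separate from the mdfind call makes it possible to try the parsing on a pasted file list without having Spotlight available.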
And Then Count
The .emlx file format is, as far as I know, proprietary, but when you open it in a text editor you'll see that it's a plaintext format and includes, among other things, the raw email message with full headers and body text. In my case I was only interested in extracting an article URL from each of those emails, so I didn't even bother to look at the file format but simply ran a regular expression against the complete file content.
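A minimal sketch of that extraction step might look like the following. The host blog.example.com is a stand-in (an assumption, replace it with the blog's real host), and the constant and method names are hypothetical too.

```ruby
# Hypothetical pattern: blog.example.com stands in for the blog's real host.
# The URL ends at the first whitespace, quote, or angle bracket.
ARTICLE_URL = %r{https?://blog\.example\.com/[^\s"<>]+}

# Return the first article URL found in the text, or nil if there is none.
def extract_article_url(text)
  text[ARTICLE_URL]
end

# Read an .emlx file as raw bytes, so that messages in unexpected
# character encodings cannot make the regex match raise an error.
def article_url_in_file(path)
  extract_article_url(File.binread(path))
end
```

Only the first match per file is used here, on the assumption that each notification mentions exactly one article.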
Then count each captured string in a hash, and finally print the ranking:
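One way this counting step could be written is sketched below. The method names rank_urls and print_ranking are hypothetical, and the tie-breaking by URL is an added assumption to keep the output order stable.

```ruby
# Count how often each URL occurs and return [url, count] pairs,
# most frequent first (ties are sorted by URL for a stable order).
def rank_urls(urls)
  counts = Hash.new(0) # missing keys start at zero
  urls.each { |url| counts[url] += 1 }
  counts.sort_by { |url, count| [-count, url] }
end

# Print one "count  url" line per article.
def print_ranking(ranking, out = $stdout)
  ranking.each { |url, count| out.puts format("%4d  %s", count, url) }
end
```

Fed with one captured URL per notification file (skipping files where nothing matched), print_ranking(rank_urls(urls)) would list the most frequently spammed articles first.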
I was a little surprised to see the result. The ranking of total spam attacks, as opposed to successful attacks as shown here, probably looks a little different. But to get such a ranking I might have to parse Apache logfiles and correlate timestamps with the blacklist log -- much less fun.