2013-Apr-03 - When the metadata matters more than the data itself - Comment spam detection

I have been thinking about and dealing with comment spam a lot lately. In the past, most comment (or forum) spam was easy to detect (at least by the human eye). It either sold something completely unrelated to the site in question:

Buy our cheap purses online: http://spammylink
Buy our cheap shoes online: http://sappmylink2

Or it linked to the common type of pharma/viagra spam. Writing filters to detect those automatically was not always simple, but a human could easily tell a spam comment from a valid one.

New types of spam

The new type of spam we are seeing is very hard to tell apart from legitimate comments. For example, I published a post on my company blog ( http://blog.sucuri.net ) a few weeks ago and it had two comments:

Thanks guys, good as always.


This is useful information, i like it.

That was it: no pharma and no suspicious links. Both had a valid name and a link to a Twitter account in the comment URL (one for a male and one for a female, both with followers). Was that spam? Can you guess? ... Yes, both were spam. But it took me a while going through their profiles and the IP address that sent the comments to confirm it.

If you go through our analysis of 100,000 comments ( http://blog.sucuri.net/2012/05/blog-comments-analysing-100000-comments-and-spammers.html ) you will see how common this is becoming.

This is comment (and forum) spam that actually looks like a simple "thank you" comment. I guess spammers learned that we all like to be flattered.

Why they do it is a discussion for another post.

When the Metadata matters more than the data itself

This is an interesting area where the metadata matters more than the data itself for our comment spam analysis. In fact, in our WordPress plugin and proxy service (where we block millions of spam comments per month), we completely ignore the comment message and focus only on the metadata.

And what is the comment metadata? We use a few:

  • User agent (browser) that sent the comment
  • Referer (where it came from)
  • Cookies
  • Passive OS fingerprint
  • IP address reputation
  • IP address location
  • Comment email (when available)
  • Comment URL (when available) reputation
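
To make the idea concrete, here is a minimal Python sketch of scoring a comment from metadata like the fields above, without ever reading the comment text. Everything in it is an assumption made for illustration: the CommentMeta type, the spam_score function, the 0.0 to 1.0 reputation scale and the weights are hypothetical, not the rules our plugin or proxy actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommentMeta:
    """Metadata of one submitted comment. The comment text is deliberately absent."""
    user_agent: str
    referer: str
    has_cookies: bool
    os_fingerprint: str                      # OS guessed by a passive fingerprinting tool
    ip_reputation: float                     # hypothetical scale: 0.0 clean .. 1.0 known bad
    ip_country: str                          # carried along, not scored in this sketch
    email: Optional[str] = None              # carried along, not scored in this sketch
    url_reputation: Optional[float] = None   # same hypothetical scale as ip_reputation


def spam_score(meta: CommentMeta, post_url: str) -> float:
    """Return a score in [0, 1]; higher means more likely spam.

    The weights are invented for the example and would need tuning on real data.
    """
    score = 0.0
    if not meta.user_agent:
        score += 0.2   # real browsers always send a user agent
    if not meta.referer.startswith(post_url):
        score += 0.2   # the comment was not submitted from the post page
    if not meta.has_cookies:
        score += 0.1   # the client never accepted a cookie from the site
    # The user agent claims Windows but the network fingerprint disagrees.
    if "Windows" in meta.user_agent and "windows" not in meta.os_fingerprint.lower():
        score += 0.2
    score += 0.2 * meta.ip_reputation
    if meta.url_reputation is not None:
        score += 0.1 * meta.url_reputation
    return min(score, 1.0)
```

A caller would build a CommentMeta from the HTTP request and its own reputation lookups, then compare the score against a threshold chosen for the site.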

Those fields alone generally allow us to tell whether a comment is good or not. When we add our network/anomaly behaviour analysis, which includes:

  • How often this IP is sending comments
  • How many sites (of the thousands we monitor) received comments from this IP
  • If this IP sent more than one comment: has the email, name or URL changed, has the browser changed, etc.
  • And a few others ...

We can reduce the number of false positives and keep false negatives to a minimum.

And it allows us to do that without worrying about the comment content or creating signatures for every type of spam (a race that is hard to win).
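
The behaviour checks listed above could be sketched along these lines. This is only an illustration under stated assumptions: the BehaviourTracker class, its in-memory storage, the one-hour window and the thresholds are all hypothetical, and a real deployment across thousands of sites would need shared, persistent storage.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class IpHistory:
    timestamps: List[float] = field(default_factory=list)
    sites: Set[str] = field(default_factory=set)
    identities: Set[Tuple[str, str, str, str]] = field(default_factory=set)


class BehaviourTracker:
    """Tracks per-IP comment behaviour. All thresholds are illustrative guesses."""

    def __init__(self, window_seconds: float = 3600.0, max_comments: int = 5,
                 max_sites: int = 3, max_identities: int = 1) -> None:
        self.window_seconds = window_seconds
        self.max_comments = max_comments
        self.max_sites = max_sites
        self.max_identities = max_identities
        self._history: Dict[str, IpHistory] = defaultdict(IpHistory)

    def record(self, ip: str, site: str, name: str, email: str, url: str,
               user_agent: str, now: float) -> List[str]:
        """Record one comment and return the anomaly flags it triggers."""
        h = self._history[ip]
        # Only comments inside the time window count towards the rate.
        h.timestamps = [t for t in h.timestamps if now - t <= self.window_seconds]
        h.timestamps.append(now)
        # Sites and identities are kept for the lifetime of the tracker.
        h.sites.add(site)
        h.identities.add((name, email, url, user_agent))

        flags: List[str] = []
        if len(h.timestamps) > self.max_comments:
            flags.append("high_rate")
        if len(h.sites) > self.max_sites:
            flags.append("many_sites")
        if len(h.identities) > self.max_identities:
            flags.append("identity_changed")
        return flags
```

The returned flags could then be combined with the per-comment metadata to decide whether to accept, hold or block a comment. Again, the comment text never enters the decision.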

I would love to hear about other areas where metadata can be used to decide whether a specific piece of data is malicious or not. Anyone?

By Daniel B. Cid - Tags: sec - Notes index.
