Blocking image spam with FuzzyOCR

September 10th, 2006 Posted by D Webber

Image spam has been growing lately. Spam where the scumbags hide their advertising for bogus products and scams inside an image file is nothing new, but it’s been steadily rising for the past several months, especially for pump-and-dump stock scams.

To reduce it for one of our clients, we added the FuzzyOCR plugin to SpamAssassin on their mail servers. The servers were already well defended against spam using the usual mix of SMTP sanity checks, blocklists and SpamAssassin rules, but too much image spam was still getting in.

When a message contains just an image file, the FuzzyOCR plugin runs the image through the open source GoCR optical character recognition utility, then uses fuzzy string matching techniques on any words that pop out.

Although legitimate mail will occasionally contain only images (such as when someone emails photos), it’s rare for those images to contain lots of text. When the words are "mortgage", "invest", and "enhancement", it’s likely that it’s spam.

GoCR is an impressive open-source project, but it isn’t the world’s most accurate OCR. The words it finds usually come out mangled to some degree, and spammers also obfuscate words on purpose. Fuzzy string matching attempts to compensate for both, so that, for example, an OCR result of "en-hnEmnt" still has a chance of matching the spam keyword "enhancement".
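
To make the idea of fuzzy matching concrete, here is a minimal Python sketch. It is not FuzzyOCR's actual code: it assumes Levenshtein edit distance as the similarity measure, and the function names (normalize, edit_distance, fuzzy_match) and the 0.3 threshold are made up for illustration.

```python
import re


def normalize(word: str) -> str:
    """Lowercase the word and drop everything that is not a letter."""
    return re.sub(r"[^a-z]", "", word.lower())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(ocr_word: str, keyword: str, max_ratio: float = 0.3) -> bool:
    """True if the OCR output is within max_ratio edits per keyword letter.
    The 0.3 default is an assumed value, not FuzzyOCR's setting."""
    target = normalize(keyword)
    if not target:
        return False
    distance = edit_distance(normalize(ocr_word), target)
    return distance / len(target) <= max_ratio


if __name__ == "__main__":
    print(fuzzy_match("en-hnEmnt", "enhancement"))
    print(fuzzy_match("holiday", "enhancement"))
```

After normalization, "en-hnEmnt" becomes "enhnemnt", which is three missing letters away from the 11-letter "enhancement", so it falls under the assumed threshold. An unrelated word such as "holiday" does not. A looser threshold catches more mangled words but also raises the risk of matching innocent text.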

The FuzzyOCR plugin adds new scores to messages based on whether words in images match a list of common spam words. Here’s an original spam image and here are the results from the plugin:

  *  6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
  *    Words found:
  *    "news" in 7 lines
  *    "breaking" in 1 lines
  *    "symbol" in 1 lines
  *    "investor" in 1 lines
  *    "company" in 2 lines
  *    "money" in 1 lines
  *    "thousand" in 1 lines
  *    "buy" in 1 lines
  *    "trade" in 1 lines
  *    "target" in 1 lines
  *    "banking" in 1 lines
  *    (18 word occurrences found)

Most of the words listed above are present in the image spam, though a few, like "money" and "thousand", are not. For this particular message, FuzzyOCR added 6.0 to the overall spam score, while the other SpamAssassin rules gave it only 3.6, bringing the total to 9.6. The mail server is configured using our "Fighting malware and spam with Postfix" setup with the sideline filter script. It labels but still delivers messages when the total score is 4.0 or higher, and quarantines messages scoring 6.0 or higher. Without FuzzyOCR, this message would have scored below even the labeling threshold, so in this case the additional scoring from FuzzyOCR kept the spam away from the end users.
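
The threshold logic can be sketched in a few lines of Python. This is only an illustration of the policy described here, not the sideline filter script itself. The 4.0 and 6.0 thresholds and the 3.6 and 6.0 scores come from this article; the function and constant names are made up.

```python
LABEL_THRESHOLD = 4.0       # label the message as spam but still deliver it
QUARANTINE_THRESHOLD = 6.0  # hold the message in quarantine


def disposition(total_score: float) -> str:
    """Map a total SpamAssassin score to what happens to the message."""
    if total_score >= QUARANTINE_THRESHOLD:
        return "quarantine"
    if total_score >= LABEL_THRESHOLD:
        return "label and deliver"
    return "deliver"


if __name__ == "__main__":
    other_rules = 3.6  # score from the non-OCR SpamAssassin rules
    fuzzy_ocr = 6.0    # score added by the FUZZY_OCR rule
    print("without FuzzyOCR:", disposition(other_rules))
    print("with FuzzyOCR:", disposition(other_rules + fuzzy_ocr))
```

With the other rules alone the message would have been delivered without even a label. With the FuzzyOCR score added it lands well past the quarantine threshold.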

Taking a quick look at the quarantine on the server, we find that 28% of the image spam received exceeded the quarantine limit on the normal SpamAssassin rules alone, while the other 72% was quarantined only because of the additional scoring from FuzzyOCR. There were no false positives among the quarantined image messages. Very good results so far.

Of course, spam defense is an arms race… every countermeasure is attacked immediately by those diligent little scumbag spammers. In the spam image above the spammer has added noise to the image and anti-aliased some words to reduce the accuracy of OCR. Others use animated GIFs to further confuse OCR software. Two messages received were in PNG format, which seems to be a new tactic. The effectiveness of OCR will decrease, but for right now it seems to be worth having.

Recently Google released Tesseract OCR as open source. Originally developed as a commercial product by Hewlett-Packard, a decade ago it was considered one of the most accurate OCR products. If it’s better than GoCR, hopefully soon it can be used as an alternative OCR engine for SpamAssassin plugins and boost accuracy for obfuscated image spam even more.

Keeping up with spammer tactics is a never-ending chore. OCR is just one of many identification techniques. Interestingly, while every single image-based pump-and-dump spam has been caught by this new filter, traditional spam with deliberately misspelled words has been scoring low enough to get through. Oh well… time to adjust the SpamAssassin scores again.
