Spam Filtering Techniques

Submitted by anonymous 2006-05-22 Internet 13 Comments

“The problem of unsolicited e-mail has been increasing for years, but help has arrived. In this article, David discusses and compares several broad approaches to the automatic elimination of unwanted e-mail while introducing and testing some popular tools that follow these approaches.”

About The Author

Thom Holwerda


13 Comments

2006-05-22 5:59 pm

WorknMan
The Pharmacy Express spam has been the bane of my existence for the past few months. Not only do they intentionally misspell words like ‘Viagra’, which makes them hard to filter, but they also use a different domain in their URL every time they spam (How the hell do they do that anyway? Buy a new domain every day? I’ve viewed the raw headers for these messages and the domains really ARE different .. these are not embedded links.) Even regular expression filters only go so far.

You would think somebody would be able to take down this website. But I guess the jerks that do DoS attacks and the spammers are really cut from the same cloth, so whadya want?

I say we adopt a new anti-spam policy .. find the bastards who are responsible and break both of their legs! Then we can film it and upload it to the Internet so everybody can watch.

Edited 2006-05-22 17:59

2006-05-22 7:50 pm

cfrankb
Regular expression posts sorted by date:

http://www.vamsoft.com/orf/rgxexpr.asp?t=3&p=1&s=date_desc

Webpage to easily create a regex that checks for spamvertized variations of a word:

http://public.kvalley.com/regex/regex.asp
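
As a rough illustration of what such a regex generator might produce, here is a minimal Python sketch. The `LOOKALIKES` table and the `obfuscation_regex` helper are hypothetical names invented for this example, not taken from either linked page, and the substitution list is only a guess at common tricks.

```python
import re

# Hypothetical table of characters spammers might swap in for a letter.
LOOKALIKES = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!|l",
    "o": "o0",
    "s": "s$5",
}

# Up to two punctuation characters (or underscores) between letters.
# Whitespace is deliberately excluded so "via grand" does not match.
FILLER = r"(?:[^\w\s]|_){0,2}"

def obfuscation_regex(word):
    """Compile a case-insensitive pattern that tolerates look-alike
    characters, repeated letters, and punctuation between letters."""
    parts = []
    for ch in word.lower():
        variants = LOOKALIKES.get(ch, ch)
        # One or more characters from the set of variants for this letter.
        parts.append("[" + re.escape(variants) + "]+")
    return re.compile(FILLER.join(parts), re.IGNORECASE)

pattern = obfuscation_regex("viagra")
for sample in ["Cheap V1agra here", "v.i.a.g.r.a", "V!@gra", "via grand central"]:
    print(sample, "->", bool(pattern.search(sample)))
```

A pattern like this only covers character-level tricks. As the first comment points out, it does nothing about rotating domains, which is part of why regex filters only go so far.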

2006-05-22 6:20 pm

casantos
This article is from 01 Sep 2002!…

It is so old that it doesn’t even talk about greylisting. Why put it here? Why give a 2002 perspective?…
2006-05-22 6:29 pm

elsewhere
Blue Frog had an interesting approach because it combined a human element with the technology, and it targeted the companies using spam rather than the spammers themselves, who are pretty much above the law anyways.

You report a spam; Blue Frog evaluates it, determines the company being promoted in the spam, contacts them privately about removing unwanted mail recipients and, without a satisfactory resolution within 10 days, proceeds to send an automated email requesting an opt-out for each spam email received by its subscribers. Although it was frequently accused of being one, it wasn’t some sort of half-baked DDoS attack response because a) actual people evaluated the reported spam emails to confirm the companies involved rather than targeting the bot machines sending them and b) one opt-out email was sent for each unsolicited email received, and the emails were throttled to avoid an unintended DoS.

The net result was that spammers that were reckless with harvesting email lists wound up frustrating their “sponsors” with the deluge of unwanted opt-out requests being sent directly to them. It was simply bad business for the spammers, which is why they became so determined to take out Blue Frog.

Blue Frog’s flaw, similar to Napster, was being centrally managed. With the death of Blue Frog, there are now a couple of projects surfacing that will be P2P based, similar to what happened with Napster. There will be some work involved in making it work efficiently, but it will effectively eliminate a single point of failure for the network. The other issue is that spammers operate within different jurisdictions and under different legal obligations, so any method attempting to counter them has to operate similarly. Anti-spam legislation in the US means nothing to a spammer in Russia, or Canada for that matter. Similarly, mail order operations outside of a particular jurisdiction have nothing to fear from legal repercussions either. The only thing that works is to choke off their supply of customers.

Blue Frog succeeded because it ignored the spammers and went right for the advertisers using their services. Ultimately that’s the only method that will succeed. Companies certainly have a right to advertise via email but in doing so should be responsible about it, and will eventually start facing repercussions otherwise.
2006-05-22 6:34 pm

DeadFishMan
But it was kinda light on details. For a few years I have been using an OSS anti-spam tool called POPFile, and it is amazing. It is even better than the spam filters built into Thunderbird, Outlook 2003(?) and the like.

POPFile is a Perl app – surprisingly snappy, I might add – that runs in the background and keeps checking your e-mail using the Bayesian method mentioned in the article, tags each message as SPAM or not, and then lets your MUA filters file it based on a tag in the message headers. After running it for two weeks or so, I found that it was accurately classifying over 99% of my e-mails (the program itself gives you detailed stats about the filtering). And since my native language is not English, I was quite surprised at how great it really is.

For those of you who still filter your inbox manually and don’t know it, I’d recommend giving it a try.
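
For readers who have not seen the Bayesian approach in code, the following is a minimal naive Bayes sketch in Python. It is not POPFile's implementation; the class name, the tokenizer, and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

class NaiveBayesFilter:
    """Toy two-class (spam/ham) naive Bayes text classifier."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase runs of letters, digits, "$" and "'" are tokens.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        self.counts[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.messages.values())
        log_p = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so an untrained class is not impossible.
            score = math.log((self.messages[label] + 1) / (total + 2))
            n = sum(self.counts[label].values())
            for tok in self.tokenize(text):
                # Add-one smoothing keeps unseen tokens from zeroing the product.
                score += math.log((self.counts[label][tok] + 1) / (n + vocab + 1))
            log_p[label] = score
        # Turn the log-odds into a probability; clamp to avoid overflow in exp().
        diff = max(-700.0, min(700.0, log_p["ham"] - log_p["spam"]))
        return 1.0 / (1.0 + math.exp(diff))

f = NaiveBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("buy cheap meds online", "spam")
f.train("meeting agenda for monday", "ham")
f.train("lunch on monday?", "ham")
print(f.spam_probability("buy cheap pills"))
print(f.spam_probability("agenda for monday meeting"))
```

Because the classifier learns only from the user's own mail, it does not care what language that mail is in, which fits the observation above about non-English mailboxes.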
2006-05-22 6:46 pm

MikeGA
I have personally found Apple Mail’s spam filter to be very effective, certainly more so than Outlook Express that I was using before.

I believe it uses a quite advanced technique that differs quite a lot from most other filters. I think MacDevCenter might have had an article on it a while back.
2006-05-22 6:58 pm

SteveB
I am currently testing DSPAM CVS (aka 3.8.0) with various tokenizers. In my experience (I have been using DSPAM and other tools for years), I can say that tagging spam is easy. With my DSPAM CVS version I currently have a hit rate of 100% against the complete spamarchive.org files (which is about 1/8 million spam mails). But this is only one part of the game. Getting a 100% hit rate against spam is very easy (IMHO). But getting a 100% hit rate on ham as well is very, very difficult.

I am somewhere at 99.9x% accuracy (ham/spam combined) on my own mailbox. But others on my server have less (okay… but still above 98%).

In the article, the author has 0 False Positives and 0 False Negatives when using a Challenge & Response system. This is certainly one way of handling the problem, but it is definitely not a very business-friendly way of doing it.

Personally, I (mostly) never respond to C&R requests. I don’t want to accept that I need to work for others just because they cannot configure an Anti-Spam filter.

As for the other solutions outlined in the article:

Hash based Anti-Spam (distributed or not): It is certainly okay, but not dynamic/personal enough to handle my needs or the needs of a server holding many domains with many users. The individual requirements for Anti-Spam are way too diverse to be handled well by a hash based Anti-Spam solution.

Heuristic based Anti-Spam: Well… I am completely against heuristic based Anti-Spam solutions. They may be okay for filtering 80% to 90% of spam, but they are still too inflexible to be deployed on a server with many different users and many different needs (for example: I am from Switzerland, where we have 4 national languages and business mail is often in English. Most of the developers of those rules focus only on the English language and completely miss the other languages). I also have a hard time accepting that handwritten rules (for example the roughly 3’000 rules in SA) are the best way of handling spam. A developer can never ever capture the diverse needs of so many users in a bunch of rules. No way. Mail is too dynamic and too individual to be captured in rules.

Blacklists: Personally I could live with blacklists. But imagine hosting a company trading in steel (I have such a customer) that communicates with Asian countries, Russia and some other countries. You can imagine that a lot of their customers are often found on servers which are on a blacklist. Now I can tell my customer whatever I like, but they don’t care! They need communication, because a delay of several hours could cost them a lot of money.

On the other hand: blacklists (DNSBL or RHSBL) do filter a lot of spam. But it would be wiser not to block the sender outright, and instead use the result of the blacklist query as one input for deciding whether to tag the message as spam or move it into quarantine.
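
The "score instead of block" idea can be sketched roughly as follows in Python. The zone names, point values, and thresholds are made up for illustration. The only convention relied on is the usual DNSBL one: reverse the IPv4 octets, append the list's zone, and treat a name that resolves as listed.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name for an IPv4 DNSBL lookup: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def listed_via_dns(name):
    """True if the name resolves (i.e. the address is listed).
    Needs network access; not exercised in the demo below."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def blacklist_score(ip, zone_points, is_listed=listed_via_dns):
    """Add up the points of every list the sending IP appears on."""
    return sum(points for zone, points in zone_points.items()
               if is_listed(dnsbl_query_name(ip, zone)))

def disposition(score, tag_at=2.0, quarantine_at=4.0):
    """Map a score to an action; nothing is ever rejected outright."""
    if score >= quarantine_at:
        return "quarantine"
    if score >= tag_at:
        return "tag"
    return "deliver"

# Hypothetical zones and weights; a stub stands in for real DNS here.
zones = {"bl-a.example": 2.5, "bl-b.example": 2.0}
stub = lambda name: name == "99.2.0.192.bl-a.example"
print(disposition(blacklist_score("192.0.2.99", zones, is_listed=stub)))
```

With this shape, a customer whose mail server sits on one list still gets the mail delivered (tagged), and only mail from hosts on several lists lands in quarantine.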

What I absolutely miss from the article is: Greylisting

Greylisting is such an easy-to-deploy tool and it filters a lot of spam. A good solution (for example SQLGrey) is very user and admin friendly and does not require me to babysit it all the time (in fact I don’t do anything. It just works), and combined with a good frontend (for example a Web frontend), the maintenance is easy as 1-2-3.
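
For anyone unfamiliar with the mechanism: greylisting temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and accepts a retry after a short delay, on the theory that real mail servers retry and much spamware does not. Here is a minimal in-memory Python sketch. The `Greylist` class and its 300-second default are assumptions for illustration and say nothing about how SQLGrey itself is implemented.

```python
import time

class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay   # seconds a new triplet must wait
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, now=None):
        """Return "defer" (answer with a temporary 4xx failure) or "accept"."""
        now = time.time() if now is None else now
        triplet = (client_ip, sender.lower(), recipient.lower())
        seen = self.first_seen.get(triplet)
        if seen is None:
            # First contact: remember it and ask the sender to retry later.
            self.first_seen[triplet] = now
            return "defer"
        if now - seen < self.min_delay:
            # Retried too quickly; keep deferring.
            return "defer"
        return "accept"

g = Greylist(min_delay=300)
print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=0))
print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=60))
print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=400))
```

A production setup would also expire old triplets and keep the table in a database rather than in memory.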

The two tools I personally like the most for Anti-Spam are DSPAM and CRM114.

DSPAM is, in my opinion, one of the best tools. It offers a lot of functionality and flexibility, and once set up it just works and works and works.

cheers

SteveB
2006-05-22 7:23 pm

linuxh8r
Sorry if someone already mentioned this, but I use Spamgourmet.

It basically creates temporary email addresses for you, and they’re only good for a certain number of uses. For example, suppose you only want to allow a website to send you 3 emails. Then you would give them this email address:

[email protected]

Go to: spamgourmet.com for more info. I find it to be very effective.
2006-05-22 8:15 pm

tspears
If you are running GroupWise, check out GWAVA. I’ve found it to be the best email protection suite out there.

http://www.gwava.com
2006-05-22 8:40 pm

wylde342
We offer an offsite spam/virus removal service to our clients. I’ve noticed an increase in the body of the emails being an actual image; these are MUCH harder for programs to decipher.

It’s non stop…kinda like radar detectors. 6 months we’re on top, then it’s their turn.
2006-05-22 10:37 pm

moleskine
Spamassassin here. Like any weapon, though, out of the box it does not give of its best. You need to assemble it, tweak it and load it well. Then it will work OK.

I reinforce Spamassassin with the Rules du Jour ruleset which is updated automatically, and I turn on all the online checking options. Without online checking, Spamassassin’s accuracy is indeed poor in my experience. I also have a Bayes db which I’ve been building up for a couple of years now with both spam and ham (you need both, not just spam). That seems to make a big difference.

There may well be far better systems out there now, but I know how this one works on my home PC and its accuracy is very high. So much so that I can tell procmail to delete unread any spam with a score of 15 or more without worrying that I will be throwing out a genuine mail.
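
The "delete at 15 or more" rule can be sketched outside procmail too. The Python below assumes the message has already passed through SpamAssassin and carries an `X-Spam-Status` header with a `score=` field (older releases wrote `hits=` instead, so the sketch accepts both). The function names and the two-outcome decision are assumptions for illustration, not the poster's actual recipe.

```python
import re
from email import message_from_string

# Matches "score=17.2" or the older "hits=17.2" inside X-Spam-Status.
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")

def spam_score(raw_message):
    """Return the score from X-Spam-Status, or None if it is absent."""
    msg = message_from_string(raw_message)
    match = SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None

def action(raw_message, delete_at=15.0):
    """Delete only very high scorers; everything else is kept for review."""
    score = spam_score(raw_message)
    if score is not None and score >= delete_at:
        return "delete"
    return "keep"

sample = (
    "From: someone@example.com\n"
    "X-Spam-Status: Yes, score=17.2 required=5.0 tests=EXAMPLE_RULE\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
print(spam_score(sample), action(sample))
```

Keeping the delete threshold well above the tagging threshold leaves a band of tagged-but-kept mail that can still be checked by hand.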

Still, some concerted action against spammers by industry and governments would be a better way.
2006-05-23 1:17 am

SteveB
This was my accuracy some days ago with DSPAM 3.6.6 in my environment:

nautilus / # dspam_stats -H globaluser

globaluser:

TP True Positives: 1769983

TN True Negatives: 2951556

FP False Positives: 8185

FN False Negatives: 403

SC Spam Corpusfed: 447

NC Nonspam Corpusfed: 3605

TL Training Left: 0

SHR Spam Hit Rate 99.98%

HSR Ham Strike Rate: 0.28%

OCA Overall Accuracy: 99.82%

nautilus / #

I removed that data because I am now playing around with DSPAM CVS and the various new tokenizers. As far as I can tell, the new tokenizers are extremely good and learn extremely quickly (and most of the new ones produce a gazillion tokens. It’s crazy!)

With the new tokenizers, a small amount of data is already enough to get excellent results:

nautilus / # dspam_stats -H globaluser

globaluser:

TP True Positives: 2039

TN True Negatives: 2867

FP False Positives: 1

FN False Negatives: 5

SC Spam Corpusfed: 0

NC Nonspam Corpusfed: 0

TL Training Left: 0

SHR Spam Hit Rate 99.76%

HSR Ham Strike Rate: 0.03%

OCA Overall Accuracy: 99.88%

nautilus / #
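
The three percentages in both listings can be reproduced from the four raw counts. The formulas in this Python sketch were inferred by checking them against the numbers posted above, not taken from DSPAM's source, and `dspam_rates` is a name invented for the example.

```python
def dspam_rates(tp, tn, fp, fn):
    """Return (spam hit rate, ham strike rate, overall accuracy) in percent."""
    shr = 100.0 * tp / (tp + fn)                    # spam caught / all spam
    hsr = 100.0 * fp / (fp + tn)                    # ham misfiled / all ham
    oca = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # correct / everything
    return round(shr, 2), round(hsr, 2), round(oca, 2)

# Counts from the DSPAM 3.6.6 listing, then from the DSPAM CVS listing.
print(dspam_rates(tp=1769983, tn=2951556, fp=8185, fn=403))
print(dspam_rates(tp=2039, tn=2867, fp=1, fn=5))
```

The ham strike rate is the figure to watch when comparing setups, since it measures legitimate mail wrongly tagged as spam.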
2006-05-23 3:00 pm

rcsteiner
Not sure what techniques POSTINI (http://www.postini.com/) uses, but their filters seem to work very well. You can blacklist or whitelist single addresses, allow mailing lists to pass through based on either TO or FROM headers, adjust various filtering settings, etc.

I’ve been quite happy with it…