Analysis of 55 000 Spam Mails

I currently handle mail for about 40 domains on my servers. For some of them I am the primary MX and for others the secondary, and they all get spam.

I have been keeping close track of all email in and out of my machine. I keep lots of meta information about these emails, including to, from, sender hostname, subject, attachments, time spent processing, whether it is spam or not, and so on. I do this partly because there are certain legal requirements for this to be done in the EU, and partly because I like the kind of stats I can pull out of it.

It has now been a year since I started keeping these stats, and my SpamAssassin has tagged 55 000 emails as spam. I religiously check my own spam folder for false positives and do not find many, but I am aware of some HTML newsletters that get tagged as spam when they shouldn’t be. Overall, though, I believe that my tagging is fairly accurate.

What follows in the extended entry is a bit of analysis I did on this data to find out which ISPs, countries and so forth are to blame for this plague.

It is important to note that I am not setting out to take a hugely scientific approach to this, or even a highly accurate one. If there were a very accurate way to identify spam we would not have a problem with it. These are merely interesting observations made on a small system.

All of the data is kept in a SQL database for ease of querying. I put the data there using iScan and a plugin I wrote that dumps its memory state into SQL statements once processing of an email is complete.

First, some general interesting bits about my mail volumes. I am by no means a big carrier of email; in fact it’s a very modest mail installation.
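To give an idea of what such a log can look like, here is a minimal sketch using Python and SQLite. The table name, column names and sample rows are all made up for illustration; they are not the schema my plugin actually writes.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mail_log (
        id          INTEGER PRIMARY KEY,
        received    TEXT,     -- ISO 8601 timestamp
        sender_host TEXT,
        mail_from   TEXT,
        rcpt_to     TEXT,
        subject     TEXT,
        size_bytes  INTEGER,
        cpu_seconds REAL,
        is_spam     INTEGER   -- 0 = not spam, 1 = tagged as spam
    )
    """
)

# Made-up sample rows, not real data.
rows = [
    ("2000-01-05T10:00:00", "mail.example.org", "a@example.org",
     "me@example.net", "hello", 2000, 0.25, 0),
    ("2000-01-05T11:00:00", "mail.example.org", "b@example.org",
     "me@example.net", "minutes", 1000, 0.25, 0),
    ("2000-01-06T03:00:00", "dsl-1.example.com", "x@example.com",
     "me@example.net", "buy now", 500, 0.25, 1),
]
conn.executemany(
    "INSERT INTO mail_log (received, sender_host, mail_from, rcpt_to,"
    " subject, size_bytes, cpu_seconds, is_spam)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Message count, total size and total CPU time, split by spam flag.
summary = conn.execute(
    """
    SELECT is_spam, COUNT(*), SUM(size_bytes), SUM(cpu_seconds)
    FROM mail_log
    GROUP BY is_spam
    ORDER BY is_spam
    """
).fetchall()

for is_spam, count, size, cpu in summary:
    label = "spam" if is_spam else "not spam"
    print(f"{label}: {count} messages, {size} bytes, {cpu:.2f}s CPU")
```

With one row per message, most of the totals in this post come down to a single GROUP BY of this kind.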

Total amount of email handled	366 395
Total size of email handled	13.70 Gb
Amount tagged as spam	54 936
Size of email tagged as spam	336.79 Mb
Total CPU time spent analyzing email	90786.703 seconds
Time spent that resulted in spam being tagged	10526.617 seconds

As you can see, this is not a huge amount of email for a whole year. It is already clear that spam accounts for almost 15% of my email by message count but, luckily, only about 2.4% by size (336.79 Mb out of 13.70 Gb), so I guess there is something to be glad about.
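For anyone who wants to check the arithmetic, here is a small sketch that works out both shares from the table above. It assumes 1 Gb means 1024 Mb. The size ratio comes out at about 0.024, which is roughly 2.4 percent.

```python
# Figures copied from the table above.
total_count = 366_395
spam_count = 54_936
total_size_mb = 13.70 * 1024  # assumption: 1 Gb = 1024 Mb
spam_size_mb = 336.79


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


count_share = share(spam_count, total_count)
size_share = share(spam_size_mb, total_size_mb)

print(f"spam by message count: {count_share:.1f}%")
print(f"spam by size:          {size_share:.1f}%")
```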

A note about the times: they are user time spent by iScan. iScan is by no means the fastest or most optimized system, and I have modest hardware, so this is only a rough indication of the kind of CPU investment you should consider when planning to roll out spam or virus checking. Still, these figures are scary: spending about 25 hours of user time checking 13.7 Gb of mail is no good. I have since purchased much faster IO and CPU systems, and this should improve things a lot. CPU and IO are, however, not the only time considerations when using SpamAssassin, since many of its checks involve talking to hosts on the net. I have these disabled, apart from an RBL check against bl.spamcop.net, but I imagine even this introduces a lot of waiting time.
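The waiting time comes from the fact that a DNS blocklist check is a DNS query per sending host. The usual convention is to reverse the octets of the IPv4 address, append the list's zone and look up the resulting name; an answer means the address is listed. The sketch below shows that convention in Python. It is not how SpamAssassin implements the check, the function names are my own, and treating every resolution failure as "not listed" is a simplification.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="bl.spamcop.net"):
    """Build the DNS name to query for an IPv4 address in a blocklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="bl.spamcop.net", resolver=socket.gethostbyname):
    """Return True if the blocklist answers for this address.

    Simplification: any resolution failure counts as "not listed",
    including temporary DNS errors.
    """
    try:
        resolver(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


# 192.0.2.99 is a documentation address, used here only to show the name.
print(dnsbl_query_name("192.0.2.99"))
```

Every call to `is_listed` blocks until the resolver answers or gives up, which is why even a single blocklist can add noticeable wall-clock time per message.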

During the year a few significant events affected my spam counts. I purchased a domain that was previously used by a company for general email, and this brought with it a massive influx of spam. I countered this by rejecting email to those old accounts at SMTP time, and things returned to normal.

I also noticed that spammers started sending more and more spam to the secondary MX rather than the primary. I can only guess they think that secondaries don’t usually run spam checkers and that the primaries trust their secondaries. As I host a lot of secondary domains here, I get a lot of spam for them.

An overview of my spam count and spam size by month can be seen in these graphs; the sizes are in Mb.

It is very interesting that while my spam count has gone up dramatically, the size of spam per month has not grown at the same rate. Maybe the spammers are learning to tighten up their emails to waste less of their bandwidth, no doubt so they can send more.

The next obvious question is where it all comes from. A lot has been said on various online forums about where people think it comes from, so I decided to find out which countries and ISPs the spam comes from.

First the countries. I record the sender hostname of all my mail. Unfortunately I record the hostname when it is available rather than just the IP address, which means that a lot of my older records are useless because the domains do not hang around for very long. I will be amending my records so I do not have this problem next year. The following graph shows the top offending countries; no surprises there for me.

Spam Hosts per Country

I did the lookup using MaxMind GeoIP. It took roughly 14 hours to run through the 30 000 distinct sending hosts; I could have done the lookups in parallel, but I was being lazy.

The next interesting thing is to look at the top domains that are sending mail. I did not expect to see the names of a few big spamming companies here, and what I did find shows that the big ADSL providers are the ones that should be making a plan, such as blocking their customers from talking on port 25 to anything but their own relay hosts and investing in technology at that point to block the spam.
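Had I been less lazy, something like the following sketch would have spread the lookups over a pool of threads. It assumes the time is mostly spent waiting (for example on name resolution) rather than computing, and that the lookup function is safe to call from several threads. `lookup_country` is a stand-in for whatever function maps a host to a country code; it is not the GeoIP API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_countries(hosts, lookup_country, workers=16):
    """Look up every host on a thread pool and tally hosts per country.

    lookup_country is a caller-supplied function (hypothetical here)
    that takes a host and returns a country code, or None if unknown.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        countries = pool.map(lookup_country, hosts)
        return Counter(code if code else "unknown" for code in countries)


# Usage with a fake lookup table instead of a real GeoIP database.
fake_db = {"a.example.net": "US", "b.example.net": "US", "c.example.org": "KR"}
counts = count_countries(
    ["a.example.net", "b.example.net", "c.example.org", "d.example.com"],
    fake_db.get,
)
print(counts.most_common())
```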

Spam Hosts per Domain
It is impossible for me to say whether these are spammers who have an ADSL connection at home and send out mail from there, or whether these are machines that have somehow been compromised and turned into spammer drones, but it is clear that this is where the blocking tactics should be focusing. I will be making some adjustments myself based on this.
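Grouping sending hosts by domain can be sketched as below. This version simply keeps the last two labels of each hostname, which is a deliberate simplification: it gets names under suffixes such as co.uk wrong, and a proper version would need a public suffix list. The function names and sample hostnames are made up.

```python
from collections import Counter


def naive_domain(hostname, labels=2):
    """Keep the last `labels` labels of a hostname (naive, see note above)."""
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def hosts_per_domain(hostnames):
    """Count distinct sending hosts under each naive domain."""
    unique = {h.lower().rstrip(".") for h in hostnames}
    return Counter(naive_domain(h) for h in unique)


# Made-up sample: the repeated host is only counted once.
sample = [
    "a.dsl.example.net",
    "b.dsl.example.net",
    "A.dsl.example.net",
    "mail.example.org",
]
per_domain = hosts_per_domain(sample)
print(per_domain.most_common())
```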

While on the subject of sending hostnames, I had a look at the total number of hosts that sent me mail, spam or not, and counted 37 818. If I do the same count on hosts that sent me spam I get 30 345. That is quite interesting on its own. Maybe I should be creating a ‘white list’ of hosts that I trust and not waste time checking whether the mail they send me is spam; this should improve the performance of my system quite a bit. With only about 7 000 legit hosts out there that sent me mail (I find even this number hard to believe, as I know my spam tagging has quite a few false negatives), it should be easy to pick out the ones that send me the most mail that is not spam.
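Picking out white list candidates could look something like this sketch: count the clean mail per host, drop any host that has ever sent something tagged as spam, and keep the ones above a threshold. The function name, the threshold of 10 and the sample data are all made up for illustration.

```python
from collections import Counter


def whitelist_candidates(messages, min_clean=10):
    """Return (host, clean_count) pairs, busiest first.

    messages is an iterable of (host, is_spam) pairs. A host qualifies
    only if it never sent tagged spam and sent at least min_clean
    messages that were not tagged.
    """
    clean = Counter()
    spam_hosts = set()
    for host, is_spam in messages:
        if is_spam:
            spam_hosts.add(host)
        else:
            clean[host] += 1
    return [
        (host, n)
        for host, n in clean.most_common()
        if n >= min_clean and host not in spam_hosts
    ]


# Made-up sample data.
messages = (
    [("good.example.org", False)] * 12
    + [("mixed.example.net", False)] * 20
    + [("mixed.example.net", True)]
    + [("rare.example.com", False)] * 2
)
candidates = whitelist_candidates(messages)
print(candidates)
```

Excluding every host that ever sent tagged spam is the cautious choice; given the false positives in my tagging, a real version would probably need a tolerance instead of a hard cut.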

There are a number of blacklists out there. The DUL is one of them, containing a list of all dynamic IP addresses, so if you are a modem user chances are you are in it. As can be seen from the above, these dynamic users are a problem, but what about people who use their ADSL connections to run small-time mail operations? I know many of them, and none of them send spam. This is a hotly debated topic that I won’t get into. I have, however, run my 30 000 unique hosts through a DUL found at SORBS, and the results are very interesting. I am presenting the same per-ISP and per-country graphs as before, but this time on the DUL-filtered hosts.

In the original data set there were 30 345 unique sending hosts; after filtering them through the DUL only 13 464 remain, which means the DUL matched 16 881 hosts, roughly 56%. This is very impressive indeed, although I cannot say how accurate the filtering was.

It is also important to note that when I took a look at the entries just below the ones listed here, yahoo.com and hotmail.com, there were a lot of false positives from SpamAssassin due to people forwarding spam and newsletters to a certain mailing list I host.

Spam Hosts per Domain – DUL filtered

Spam Hosts per Country – DUL filtered

Other blocklists would be interesting to see in a similar fashion, but due to the age of my data I would not be able to run it through them reliably, as no doubt many users who were compromised have been cleaned up or the IP space has been re-allocated.

What conclusion does all this lead me to? I am not sure. I know I have to do something about the dynamic blocks to prevent their mail from entering my systems, but other than that there isn’t much you can do; my spam tagging is already effective thanks to SpamAssassin, and it is keeping my Inbox clean. The short-term improvements I will make are to upgrade to a newer version of SpamAssassin, improve the logging in iScan, and spend more time investigating DULs and other blocklists. Clearly they are controversial but will have a dramatic effect; the question is just choosing the right one(s) to support and trust.

