Bayes Host Classification

I run a small anti-spam service and often try out different strategies to combat spam. At present I have a custom nameserver that I wrote; it runs lots of regex checks against hostnames to work out whether a host has a dynamic or a static IP. I use the server in standard RBL lookups.
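As a rough illustration of this kind of check, here is a minimal sketch. The patterns and the `dynamic_host?` name are assumptions made up for this example; they are not the rules the nameserver actually uses.

```ruby
# Hypothetical patterns for spotting dynamic-looking hostnames.
DYNAMIC_PATTERNS = [
  /\b(adsl|dsl|dialup|dyn|dynamic|pool|ppp|dhcp|cable)\b/i, # access-technology labels
  /\d{1,3}[-.]\d{1,3}[-.]\d{1,3}[-.]\d{1,3}/,               # embedded dotted or dashed IPv4 address
  /\A[0-9a-f]{8}\./i                                        # first label is 8 hex digits (possibly an encoded IP)
].freeze

# True if any pattern matches the hostname.
def dynamic_host?(hostname)
  DYNAMIC_PATTERNS.any? { |re| re.match?(hostname) }
end

puts dynamic_host?("3e70dcb2.adsl.enternet.hu")
puts dynamic_host?("mail193.messagelabs.com")
```

Every new naming scheme needs another pattern, which is where the maintenance burden of this approach comes from.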

The theory is that dynamic hosts are suspicious, so they get a greylist penalty. Running lots of regular expressions is not the best option though, and I often have to fiddle with them to keep them effective. I thought I'd try a Bayesian approach using Ruby Classifier.

I pulled 400 known dynamic hostnames and 400 good ones from my stats and used them to train the classifier:

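require 'rubygems'
require 'stemmer'
require 'classifier'

# Two categories: 'bad' for dynamic hosts, 'good' for known good hosts
classifier = Classifier::Bayes.new('bad', 'good')

# Train on the known dynamic hostnames
classifier.train_bad("3e70dcb2.adsl.enternet.hu")
.
.

# Train on the known good hostnames
classifier.train_good("mail193.messagelabs.com")
.
.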
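For intuition about what a Bayesian classifier does with hostnames, the following is a small hand-rolled naive Bayes sketch. It is not the classifier gem's implementation: the `HostBayes` class, the tokenisation on dots and hyphens, and the add-one smoothing are all assumptions made for illustration.

```ruby
# Minimal multinomial naive Bayes over hostname tokens (illustrative only).
class HostBayes
  def initialize(*labels)
    @counts = {}                                  # label => token => count
    labels.each { |l| @counts[l] = Hash.new(0) }
    @totals = Hash.new(0)                         # label => total tokens seen
    @docs   = Hash.new(0)                         # label => training examples seen
  end

  # Split on dots and hyphens and collapse digit runs to "N", so that
  # "mail193" and "mail7" become the same token. This scheme is an assumption.
  def tokens(hostname)
    hostname.downcase.split(/[.\-]/).reject(&:empty?).map { |t| t.gsub(/\d+/, "N") }
  end

  def train(label, hostname)
    @docs[label] += 1
    tokens(hostname).each do |t|
      @counts[label][t] += 1
      @totals[label] += 1
    end
  end

  # Pick the label with the highest log-probability, using add-one smoothing
  # so unseen tokens do not zero out a label.
  def classify(hostname)
    vocab  = @counts.values.flat_map(&:keys).uniq.size
    n_docs = @docs.values.sum
    toks   = tokens(hostname)
    @counts.keys.max_by do |label|
      prior = Math.log((@docs[label] + 1.0) / (n_docs + @counts.size))
      toks.sum(prior) do |t|
        Math.log((@counts[label][t] + 1.0) / (@totals[label] + vocab + 1))
      end
    end
  end
end

nb = HostBayes.new(:bad, :good)
nb.train(:bad,  "3e70dcb2.adsl.enternet.hu")
nb.train(:good, "mail193.messagelabs.com")
puts nb.classify("mail7.messagelabs.com")
```

The appeal over hand-written patterns is that the token weights come from the training data, so adapting to new naming schemes means adding examples rather than editing rules.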

I then fed 100 known good and 100 known bad hostnames, none of them in the training set, through the trained classifier. All 100 good names were classified correctly, and only 5 of the bad hosts were classified as good.
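The figures above can be turned into the usual summary rates with a few lines of arithmetic. The variable names are chosen for this sketch; the counts are the ones reported in the test run.

```ruby
# Counts from the held-out test: 100 good and 100 bad hostnames.
good_total   = 100
good_correct = 100
bad_total    = 100
bad_correct  = 95   # 5 bad hosts were classified as good

accuracy            = (good_correct + bad_correct).to_f / (good_total + bad_total)
bad_detection_rate  = bad_correct.to_f / bad_total                   # share of dynamic hosts caught
false_positive_rate = (good_total - good_correct).to_f / good_total  # good hosts wrongly penalised

puts format("accuracy: %.1f%%", accuracy * 100)
puts format("dynamic hosts caught: %.1f%%", bad_detection_rate * 100)
puts format("good hosts penalised: %.1f%%", false_positive_rate * 100)
```

For a greylisting penalty the last figure matters most, since a missed dynamic host only loses a penalty while a misclassified good host delays legitimate mail.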

This is very impressive and more than acceptable for my needs. Now if only there were a good port of Net::DNS to Ruby that also included the Nameserver classes.
