{"id":450,"date":"2009-05-30T00:43:36","date_gmt":"2009-05-29T23:43:36","guid":{"rendered":"http:\/\/wp.devco.net\/?p=450"},"modified":"2009-10-09T12:22:24","modified_gmt":"2009-10-09T11:22:24","slug":"bayes_host_classification","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2009\/05\/30\/bayes_host_classification.php","title":{"rendered":"Bayes Host Classification"},"content":{"rendered":"
I run a little anti-spam service and often try out different strategies to combat spam. At present I have a custom nameserver, which I wrote, that runs lots of regex checks against hostnames and tries to determine whether a host is on a dynamic or a static IP; I query it through standard RBL lookups.<\/p>\n
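The regex side can be sketched roughly like this. Note these patterns are illustrative guesses at what such checks look like, not the rules my nameserver actually uses:

```ruby
# Illustrative dynamic-hostname heuristics (assumed examples, not the
# real ruleset). Consumer/dynamic reverse DNS names often embed the IP
# or words like "adsl", "dyn", "ppp", "cable".
DYNAMIC_PATTERNS = [
  /\b(?:a?dsl|dyn(?:amic)?|ppp|pool|cable|dial(?:up)?)\b/i, # provider keywords
  /\d{1,3}[.-]\d{1,3}[.-]\d{1,3}[.-]\d{1,3}/,               # embedded IP octets
  /^[0-9a-f]{8}\./i                                         # hex-encoded IP prefix
]

def dynamic_hostname?(hostname)
  DYNAMIC_PATTERNS.any? { |re| hostname =~ re }
end
```

With these patterns, `dynamic_hostname?('3e70dcb2.adsl.enternet.hu')` matches (hex prefix and the `adsl` keyword), while `mail193.messagelabs.com` matches nothing. The maintenance problem described above is that every new ISP naming scheme needs another pattern.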
The theory is that dynamic hosts are suspicious and so they get a greylist penalty. Doing lots of regular expressions is not the best option though, and I often have to fiddle with these things to keep them effective, so I thought I’d try a Bayesian approach using Ruby Classifier<\/a>.<\/p>\n
I pulled 400 known dynamic hostnames and 400 good ones out of my stats and used them to train the classifier:<\/p>\n
require 'rubygems'
require 'stemmer'
require 'classifier'<\/p>\n
classifier = Classifier::Bayes.new('bad', 'good')<\/p>\n
classifier.train_bad("3e70dcb2.adsl.enternet.hu")
.
.
classifier.train_good("mail193.messagelabs.com")
.
.<\/p>\n<\/blockquote>\n
I then fed 100 known good and 100 known bad hostnames – ones not in the initial training data – through it and got a 100% hit rate on the good names, with only 5 bad hosts classified as good.<\/p>\n
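For the curious, here is a minimal sketch of what a Bayes classifier does with hostnames. This is plain Ruby, not the Ruby Classifier gem: I assume tokenizing on dots and dashes with digit runs collapsed so that throwaway hex/IP fragments share a feature, and classification picks the category with the best Laplace-smoothed log score.

```ruby
# Toy naive Bayes hostname classifier (illustrative sketch only).
class TinyBayes
  def initialize(*categories)
    @counts = categories.map { |c| [c, Hash.new(0)] }.to_h # token counts per category
    @totals = Hash.new(0)                                  # total tokens per category
  end

  # Split on dots/dashes; collapse digit runs so "3e70dcb2" and
  # "adsl-12-34" end up sharing features.
  def tokenize(hostname)
    hostname.downcase.gsub(/\d+/, '#').split(/[.\-]/).reject(&:empty?)
  end

  def train(category, hostname)
    tokenize(hostname).each do |tok|
      @counts[category][tok] += 1
      @totals[category] += 1
    end
  end

  # Pick the category with the highest summed log-probability,
  # with add-one (Laplace) smoothing for unseen tokens.
  def classify(hostname)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    @counts.keys.max_by do |cat|
      tokenize(hostname).sum do |tok|
        Math.log((@counts[cat][tok] + 1.0) / (@totals[cat] + vocab))
      end
    end
  end
end

bayes = TinyBayes.new('bad', 'good')
bayes.train('bad', '3e70dcb2.adsl.enternet.hu')
bayes.train('good', 'mail193.messagelabs.com')
bayes.classify('12ab34cd.adsl.enternet.hu') # => "bad"
```

The appeal over the regex approach is exactly what the numbers above suggest: the token statistics generalise to hostnames never seen in training, so new ISP naming schemes do not each need a hand-written pattern.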