Why Bayesian Filtering is the Most Effective Anti-Spam Technology

Achieving a 98%+ spam detection rate using a mathematical approach.

This white paper describes how Bayesian filtering works and explains why it is the best way to combat spam.

Introduction

This white paper describes how Bayesian mathematics can be applied to the spam problem, resulting in an adaptive, ‘statistical intelligence’ technique that achieves very high spam detection rates. It also explains why the Bayesian approach is the best way to tackle spam once and for all, as it overcomes the obstacles faced by more static technologies such as blacklist checking, comparison against databases of known spam, and keyword checking. These technologies are not obsolete, but they cannot be relied upon without a Bayesian filter.

Current spam detection techniques

Spam is an ever-increasing problem. The number of spam mails is increasing daily: studies show that over 50% of all current email is spam, and the Radicati Group predicts this will reach 70% by 2007. In addition, spammers are becoming more sophisticated and constantly manage to outsmart 'static' methods of fighting spam.

The techniques currently used by most anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little. To do this, spammers simply examine the latest anti-spam techniques and find ways to dodge them. To combat spam effectively, a new, adaptive technique is needed. This method must keep up with spammers' tactics as they change over time. It must also be able to adapt to the particular organization that it is protecting from spam. The answer lies in Bayesian mathematics.

How the Bayesian spam filter works

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event.
(More information about the mathematical basis of Bayesian filtering is available in Bayesian Parameter Estimation, http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html, and in An Introduction to Bayesian Networks and their Contemporary Applications, http://www.niedermayer.ca/papers/bayesian/bayes.html.)
This same technique can be used to classify spam. If some piece of text occurs often in spam but not in legitimate mail, it is reasonable to assume that a message containing it is probably spam.

Creating a tailor-made Bayesian word database

Before mail can be filtered using this method, the user needs to generate a database of words and tokens (such as the $ sign, IP addresses, domains, and so on), collected from a sample of spam mail and valid mail (referred to as ‘ham’).
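As a rough illustration of the word database described above, the following Python sketch counts, for each token, how many messages in a spam sample and a ham sample contain it. The token pattern, the function names `tokenize` and `build_counts`, and the sample messages are all assumptions made for this example; the white paper does not specify how tokens are extracted.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not taken from the paper):
# words, the $ sign, and dotted names such as IP addresses and domains.
TOKEN_RE = re.compile(r"[A-Za-z0-9$][A-Za-z0-9$.'-]*")


def tokenize(message):
    """Return the set of distinct, lower-cased tokens in one message."""
    # Trailing periods are stripped so "spam." and "spam" count as one token.
    return {tok.lower().rstrip(".") for tok in TOKEN_RE.findall(message)}


def build_counts(messages):
    """Count, for each token, how many messages contain it at least once."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Made-up sample mail, for illustration only.
spam_sample = [
    "Low mortgage rates! Save $ now",
    "Cheap mortgage offer from 192.168.0.1",
]
ham_sample = [
    "Meeting notes attached",
    "Please review the mortgage report",
]

spam_counts = build_counts(spam_sample)
ham_counts = build_counts(ham_sample)
```

Counting messages rather than raw occurrences matches the way the paper phrases its figures (a word occurring "in 400 of 3,000 spam mails"); a real filter might well count differently.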
(Figure: Creating a word database for the filter)

A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in spam as opposed to legitimate mail (ham). This is done by analyzing the users' outbound mail and by analyzing known spam: all the words and tokens in both pools of mail are analyzed to generate the probability that a particular word points to the mail being spam.

This word probability is calculated as follows: if the word "mortgage" occurs in 400 of 3,000 spam mails and in 5 of 300 legitimate emails, for example, then its spam probability would be 0.8889 (that is, [400/3000] divided by [5/300 + 400/3000]).

Creating the ham database (tailored to your company)

It is important to note that the analysis of ham mail is performed on the organization's own mail, and is therefore tailored to that particular organization. For example, a financial institution might use the word "mortgage" many times over and would get a lot of false positives if using a general anti-spam rule set. On the other hand, the Bayesian filter, if tailored to your company ...

(this publication continues)
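The "mortgage" calculation given earlier in this section can be checked with a short Python sketch. The function name `spam_probability` and the neutral value of 0.5 for a token seen in neither pool are assumptions made for this example; the paper only gives the ratio itself.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Word probability as described in the text:
    (spam_hits / spam_total) / (ham_hits / ham_total + spam_hits / spam_total).
    """
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("both mail pools must contain at least one message")
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a token seen in neither pool is treated as neutral.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


# The paper's example: "mortgage" in 400 of 3,000 spam mails
# and in 5 of 300 legitimate mails.
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```

Because each count is divided by the size of its own pool, the result does not depend on the spam sample being ten times larger than the ham sample.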
© 2005 S&A Consulting Group LLP • Cleveland Ohio