
Reviewed by Major Keary
If Omar Khayyám were penning verse today he might have mused,
"I wonder what the spammers receive
One half so obnoxious
As the crap they weave".
Spam irritates most of us, which has resulted in quite a body of literature about spam and its leading perpetrators; however, much of it is historical rather than technical. There are books about (or that mention) how to apply spam detection software, but they tend to be sparse on the methodology used by spammers and anti-spammers.
Some remedies involve time and effort that are just as irksome as the spam, and present-generation filters are rapidly losing their effectiveness as spammers resort to tricks that "obfuscate their text so that it's human readable, but not very machine readable" [Ending Spam].
The goal of spam-filter developers is to find a 'scientific' (and fast) method of getting rid of all spam without triggering false positives. Bayesian filtering is all Greek to most of us. Heuristic methods sound scientific, but they boil down to 'suck it and see'. The most impressive sounding is Fifth-Order Markovian Discrimination, which may sound like something out of science fiction, or even a mysterious division of martial arts.
Machine learning has been around for a very long time and has been applied to spam filtering using language classification. The filter has to be trained by presenting it with examples of what is and is not spam. One of the approaches to language classification is statistical filtering, which is where the Reverend Thomas Bayes (c. 1701-1761) comes into the picture. He was the first person to pursue the concept of statistical inference, and Bayes' Theorem has become a standard component of statistical theory.
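To make the idea of training concrete, here is a minimal sketch of a Bayesian token filter in Python. It is not taken from Ending Spam or from DSPAM; the function names, the tiny training set, the add-one smoothing, and the 0.5 prior are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count token occurrences per class from (text, is_spam) examples."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        for token in text.lower().split():
            counts[is_spam][token] += 1
            totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals, prior=0.5):
    """Apply Bayes' theorem, treating tokens as independent events."""
    vocab = len(set(counts[True]) | set(counts[False]))
    log_spam = math.log(prior)
    log_ham = math.log(1 - prior)
    for token in text.lower().split():
        # Add-one smoothing (an assumption) so unseen tokens do not zero the product.
        log_spam += math.log((counts[True][token] + 1) / (totals[True] + vocab))
        log_ham += math.log((counts[False][token] + 1) / (totals[False] + vocab))
    # Convert the two log scores back to P(spam | text).
    return 1 / (1 + math.exp(log_ham - log_spam))

# Hypothetical training examples: two spam messages, two legitimate ones.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting agenda for tuesday", False),
    ("notes from the tuesday meeting", False),
]
counts, totals = train(training)
```

Calling spam_probability("buy cheap pills", counts, totals) scores above one half and spam_probability("tuesday meeting notes", counts, totals) scores below it. A real filter would need a far larger corpus and a more careful tokenizer.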
It is not just Bayes' work that has emerged from the past in the anti-spam effort. Andrei Andreyevich Markov (1856-1922) is another mathematician whose research, which had no application at the time, has become a part (Markov chains) of statistical theory, and has more recently been applied to an advanced language classification system. The best summary of the difference between Markov and Bayes is in Zdziarski's Ending Spam:
"The central idea of Markov's work was that some things in nature are more complex than Bayes' independent event statistics can describe. Markov came up with a very simple yet powerful description of nonindependent, related events that accurately models many natural processes and natural languages."
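The contrast with Bayes can be sketched in a few lines of Python. This is a first-order (one previous word) Markov chain, far simpler than the fifth-order discrimination mentioned earlier, and it is not the algorithm used by any particular filter; the function names and the two training phrases are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train_chain(sentences):
    """Count token-to-token transitions, with <s> marking the start."""
    transitions = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(tokens, tokens[1:]):
            transitions[prev][cur] += 1
    return transitions

def sequence_probability(sentence, transitions):
    """P(sentence) as a product of P(token | previous token)."""
    tokens = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        total = sum(transitions[prev].values())
        if total == 0:
            # The previous token was never followed by anything in training.
            return 0.0
        prob *= transitions[prev][cur] / total
    return prob

# Hypothetical training phrases.
chain = train_chain(["click here now", "click here to unsubscribe"])
in_order = sequence_probability("click here now", chain)
shuffled = sequence_probability("now here click", chain)
```

Here in_order and shuffled contain exactly the same words, so a model that treats words as independent events would score them identically. The chain scores them differently because each word is judged in the light of the word before it.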
The book explains the concepts and the way they are applied to modern spam filtering systems.
Recently published by No Starch Press, Ending Spam is the only book I have seen that discusses the underlying science of spam filtering, offers an insight into the techniques used in next-generation spam filters, and covers the filter-avoidance techniques being used by spammers. The book explains how spam works and how it can be countered using modern methods. It is not for the digitally faint-hearted: there is no hand-holding or 'dumbing down' to cater for a popular readership. However, ordinary readers with an interest in the technical side of the subject should have no trouble in following the discussions.
The author is a researcher in the fields of algorithmic theory and neural networking, and has developed an open source spam filter, DSPAM. He is also an excellent technical communicator; in Ending Spam he has been able to describe current and next-generation filtering methodology in language that is comprehensible without compromising the book's academic integrity. The book is not written in a formal academic style, but treats its subject matter in the academic spirit. A formal text would hardly name a chapter The Low-Down Dirty Tricks of Spammers; even so, the tone of the content is quite sober without ever being boring or stodgy.
The introduction identifies the book's audience as those who want to design and implement their own spam filters using 'best practices'; those who want to understand the different approaches to filtering and what the various systems actually do; those who seek a general understanding of how filters work and the tactics of spammers; and—of course—spammers.
The first two chapters provide a history of spam and an overview of the approaches that have been taken to developing filters. They also describe various aspects of filtering, such as throttling, challenge/response, and spammer fingerprinting.
Chapters that introduce language classification concepts and statistical filtering fundamentals complete the introductory part of the book. The next two parts become more technical, but remain eminently readable, and deal with Fundamentals of Statistical Filtering and Advanced Concepts of Statistical Filtering. Even if the reader is not interested in (or equipped to comprehend) topics such as Karnaugh mapping, the advanced parts of the book provide an explanation of just what goes into developing an effective filter, the problems that have to be solved, and the tricks used by spammers to fool filters.
If you are a programmer, application developer, or software engineer, Ending Spam is the best specific resource that I have seen. If you want an in-depth understanding of filter methodology, this is the definitive text. For the rest of us it is a remarkably lucid explanation of concepts and tools that many of us may never have heard of before.
An appendix, Shining Examples of Filtering, describes "several best-of-breed filters" that are all open source software. Five are listed, each with its pros and cons and a URL for downloading. One of them is the CRM114 Discriminator, which is well worth reading about.
Jonathan Zdziarski: Ending Spam
ISBN 1-59327-052-6
Published by No Starch Press, 287 pp., RRP AU$74.95 incl. GST