Spam and statistics

By Joel Snyder
Network World, 09/15/03

Say false positive, and you immediately dive into a tough world: the statistics of diagnostic tests. The terms false positive and false negative (and their cousins, true positive and true negative) are fairly easy to define. But turning counts of false positives and false negatives into easy-to-digest statistics is trickier, because the anti-spam community has not agreed on which numbers to use across products.

A spam filter is a diagnostic test. For some set of thresholds, it will say "this is spam" or "this is not spam." In our testing, we didn't set those thresholds ourselves. Instead, we asked the vendors to pick thresholds such that the false-positive rate would be kept to less than 1%. Interestingly enough, none of the vendors asked what we meant by false-positive rate. Depending on your tolerance for false negatives (spam in your mailbox) or false positives (mail mismarked as spam, lost or delayed), you might want to set these thresholds differently.

Four main statistics are used to describe diagnostic tests. Positive predictive value (PPV) and negative predictive value (NPV) go together. They measure how likely the test is to be correct. PPV measures the probability that a message actually is spam, given that the test says that it is. PPV is computed by dividing the number of true positives by the sum of true positives and false positives. NPV is the mirror image: it measures the probability that a message actually is legitimate, given that the test passes it. However, PPV doesn't say how much spam will be filtered out: The number of missed spam messages doesn't figure into that statistic at all.
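
For readers who want to check the arithmetic themselves, here is a minimal sketch of the two predictive values in Python; the function and variable names are ours, for illustration only, and not anything the products expose.

    def ppv(true_pos, false_pos):
        # Of all the messages the filter marked as spam, what fraction really was spam?
        return true_pos / (true_pos + false_pos)

    def npv(true_neg, false_neg):
        # Of all the messages the filter passed, what fraction really was legitimate?
        return true_neg / (true_neg + false_neg)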

Sensitivity and specificity are the other two statistics, sometimes called the true-positive rate and true-negative rate. They measure how likely a test is to catch whatever is being tested. Sensitivity measures the probability that a message will test as spam, given that it actually is spam. Sensitivity is computed by dividing the number of true positives by the sum of true positives and false negatives. Specificity is the mirror image: the probability that a legitimate message will test as non-spam. Most research on diagnostic tests uses PPV and NPV, or sensitivity and specificity, to describe how well a test works, because these are well-defined statistics.
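
The other pair looks much the same in a sketch, again with names of our own choosing:

    def sensitivity(true_pos, false_neg):
        # Of all the spam that arrived, what fraction did the filter catch?
        return true_pos / (true_pos + false_neg)

    def specificity(true_neg, false_pos):
        # Of all the legitimate mail that arrived, what fraction passed untouched?
        return true_neg / (true_neg + false_pos)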

The term false-positive rate, unfortunately, has no single agreed-on definition. For some people, the false-positive rate is the proportion of messages that test positive as spam but are actually not spam. That is, it's the complement of the PPV. For others, the false-positive rate is the proportion of the legitimate (non-spam) mail that nonetheless tests positive as spam. That is, it's the complement of the specificity. Rather than pick an ambiguous definition, we focused on statistics that made sense in the world of spam and didn't overlap one another in definition.
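
To see how far apart the two readings can drift, consider a deliberately invented confusion matrix; these counts do not come from our test and are chosen only to make the gap obvious.

    # Invented counts for illustration only; nothing here comes from the test.
    true_pos, false_pos = 80, 100      # messages the filter marked as spam
    true_neg, false_neg = 900, 20      # messages the filter let through

    rate_as_1_minus_ppv = false_pos / (true_pos + false_pos)            # 100 / 180, about 56%
    rate_as_1_minus_specificity = false_pos / (false_pos + true_neg)    # 100 / 1,000, exactly 10%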

In thinking about anti-spam software, network managers will be concerned with two main questions. The first is "How much spam will this filter out?" The sensitivity statistic best answers that question. It tells us what percentage of the incoming spam the filter will identify. A perfect score would be 100%. In our sample, there were 7,840 spam messages. MailFrontier's Anti-Spam Gateway (ASG) caught 7,005 of those and missed the rest. Setting aside the false positives (because that's a different question), ASG therefore gives us an 89.3% reduction in spam: About 9 out of 10 spam messages are blocked.
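
That figure falls straight out of the sensitivity formula once the article's counts are plugged in; the missed-message count is simply the difference between the two numbers above.

    spam_total = 7840                         # spam messages in the test corpus
    spam_caught = 7005                        # spam that MailFrontier ASG marked as spam
    spam_missed = spam_total - spam_caught    # 835 spam messages that got through

    asg_sensitivity = spam_caught / (spam_caught + spam_missed)   # about 0.893, the 89.3% quoted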

The second question is "How accurate is this filter?" The PPV statistic best answers that question. It tells us what percentage of the messages the filter marks as spam really are spam. Again, a perfect score would be 100%, meaning that when the filter says something is spam, it's right 100% of the time. Because people like to talk about a false-positive rate, and want a perfect score there to be 0%, we've subtracted the PPV from 1 to calculate a false-positive rate. In our test, MailFrontier ASG said that 7,053 messages were spam. It was right in all but 48 cases, giving a PPV of 99.3%, or a false-positive rate of 0.7%.
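
The accuracy numbers come from the PPV formula the same way, using only the counts quoted in the text:

    flagged_as_spam = 7053                                     # everything ASG marked as spam
    wrongly_flagged = 48                                       # legitimate mail caught by mistake
    correctly_flagged = flagged_as_spam - wrongly_flagged      # the 7,005 true positives

    asg_ppv = correctly_flagged / flagged_as_spam              # about 0.993, or 99.3%
    asg_false_positive_rate = 1 - asg_ppv                      # about 0.007, or 0.7%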

Some researchers define the false-positive rate by subtracting the specificity from 1, or equivalently by dividing false positives by the sum of false positives and true negatives (the two come out to the same value). For most products, the two versions of the false-positive rate land within a few percentage points of each other, but there are some dramatic exceptions. By our measure, GFI MailEssentials has a false-positive rate of 56.3%, meaning that of the messages it said were spam, less than half actually were. But the specificity tells another story: Measured against the actual number of non-spam messages, MailEssentials was wrong only 10.4% of the time.

Of course, you can use whatever statistics you wish to compare products, as long as you understand what the statistic is telling you and compute it identically across all products. For most network managers, our statistics will give a strong feel for how good these products are at filtering spam.