If the individual feature probabilities were independent, then it’d be mathematically sound to multiply them together to get a combined probability. But it’s unlikely they actually are independent—certain features are likely to appear together, while others never do.11

    Robinson proposed using a method for combining probabilities invented by the statistician R. A. Fisher. Without going into the details of exactly why the technique works, it’s this: First you combine the probabilities by multiplying them together. This gives you a number nearer to 0 the more low probabilities there were in the original set. Then take the log of that number and multiply by -2. Fisher showed in 1950 that if the individual probabilities were independent and drawn from a uniform distribution between 0 and 1, then the resulting value would follow a chi-square distribution. This value and twice the number of probabilities can be fed into an inverse chi-square function, which returns the probability of obtaining a value that large or larger by combining the same number of randomly selected probabilities. When the inverse chi-square function returns a low probability, it means there was a disproportionate number of low probabilities (either a lot of relatively low probabilities or a few very low probabilities) among the individual values.
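    Restated as a pair of formulas, with p₁ through pₙ standing for the n individual probabilities:

        combined = -2 ln(p₁ × p₂ × ⋯ × pₙ)
        result   = inverse-chi-square(combined, 2n)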

    To use this probability in determining whether a given message is a spam, you start with a null hypothesis, a straw man you hope to knock down. The null hypothesis is that the message being classified is in fact just a random collection of features. If it were, then the individual probabilities—the likelihood that each feature would appear in a spam—would also be random. That is, a random selection of features would usually contain some features with a high probability of appearing in spam and other features with a low probability of appearing in spam. If you were to combine these randomly selected probabilities according to Fisher’s method, you should get a middling combined value, which the inverse chi-square function will tell you is quite likely to arise just by chance, as, in fact, it would have. But if the inverse chi-square function returns a very low probability, it means it’s unlikely the probabilities that went into the combined value were selected at random; there were too many low probabilities for that to be likely. So you can reject the null hypothesis and instead adopt the alternative hypothesis that the features involved were drawn from a biased sample—one with few high spam probability features and many low spam probability features. In other words, it must be a ham message.

    To get a final score, you need to combine those two measures (a hamminess score and a spamminess score, each produced by a separate application of Fisher’s method, as described below) into a single number ranging from 0 to 1. The method recommended by Robinson is to add half the difference between the spamminess and hamminess scores to 1/2; in other words, to average the spamminess and 1 minus the hamminess. This has the nice effect that when the two scores agree (high spamminess and low hamminess, or vice versa), you’ll end up with a strong indicator near either 0 or 1. But when the spamminess and hamminess scores are both high or both low, you’ll end up with a final value near 1/2, which you can treat as an “uncertain” classification.
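    Written out, with S for the spamminess score and H for the hamminess score, both descriptions reduce to the same formula:

        final score = 1/2 + (S - H)/2 = (S + (1 - H))/2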

    The score function that implements this scheme looks like this:
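    A sketch of such a function, assuming the bayesian-spam-probability function developed earlier in the chapter supplies each feature’s probability of appearing in a spam, and using the fisher function defined below, might be:

        (defun score (features)
          (let ((spam-probs ()) (ham-probs ()) (number-of-probs 0))
            (dolist (feature features)
              ;; Each feature contributes the probability that a message containing
              ;; it is a spam, and 1 minus that as the probability that it's a ham.
              ;; Coerce to a double-float for precision.
              (let ((spam-prob (float (bayesian-spam-probability feature) 0.0d0)))
                (push spam-prob spam-probs)
                (push (- 1.0d0 spam-prob) ham-probs)
                (incf number-of-probs)))
            (let ((h (- 1 (fisher spam-probs number-of-probs)))
                  (s (- 1 (fisher ham-probs number-of-probs))))
              ;; Robinson's combination: average the spamminess with 1 minus the hamminess.
              (/ (+ (- 1 h) s) 2.0d0))))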

    You take a list of features and loop over them, building up two lists of probabilities, one listing the probabilities that a message containing each feature is a spam and the other that a message containing each feature is a ham. As an optimization, you can also count the number of probabilities while looping over them and pass the count to fisher to avoid having to count them again in fisher itself. The value returned by fisher will be low if the individual probabilities contained too many low values to have come from random text. Thus, a low fisher score for the spam probabilities means there were many hammy features; subtracting that score from 1 gives you a probability that the message is a ham. Conversely, subtracting from 1 the fisher score for the ham probabilities gives you the probability that the message is a spam. Combining those two probabilities gives you an overall spamminess score between 0 and 1.

    The only other new function is fisher itself. Assuming you already have an inverse-chi-square function, fisher is conceptually simple.
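    A sketch of that function, assuming an inverse-chi-square function that takes the combined value and a number of degrees of freedom, might look like this:

        (defun fisher (probs number-of-probs)
          ;; -2 times the log of the product of the probabilities, looked up in the
          ;; inverse chi-square distribution with twice as many degrees of freedom
          ;; as there are probabilities.
          (inverse-chi-square
           (* -2 (log (reduce #'* probs)))
           (* 2 number-of-probs)))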

    Unfortunately, there’s a small problem with this straightforward implementation. While using **REDUCE** is a concise and idiomatic way of multiplying a list of numbers, in this particular application there’s a danger that the product will be too small to be represented as a floating-point number. In that case, the result will underflow to zero. And if the product of the probabilities underflows, all bets are off because taking the **LOG** of zero will either signal an error or, in some implementations, result in a special negative-infinity value, which will render all subsequent calculations essentially meaningless. This is particularly unfortunate in this function because the Fisher method is most sensitive when the input probabilities are low—near zero—and therefore in the most danger of causing the multiplication to underflow.
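    To get a feel for how easily this can happen, consider a few hundred smallish probabilities (whether you get a silent zero or a signaled underflow depends on your Lisp implementation):

        ;; 1d-5 multiplied by itself 200 times is 1d-1000, far below the smallest
        ;; representable double-float, so the product underflows.
        (reduce #'* (make-list 200 :initial-element 1d-5)) ; => 0.0d0, or an underflow error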

    Luckily, you can use a bit of high-school math to avoid this problem. Recall that the log of a product is the same as the sum of the logs of the factors. So instead of multiplying all the probabilities and then taking the log, you can sum the logs of the individual probabilities. And since **REDUCE** takes a :key keyword parameter, you can use it to take the log of each probability and sum the results in a single call. Instead of this:
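        (* -2 (log (reduce #'* probs)))

    you can write this:

        (* -2 (reduce #'+ probs :key #'log))

    Since each log is a modestly sized negative number, the sum stays comfortably within the range of a double float even when the product itself would underflow.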