Anyway, Language Log has some interesting textual analysis of the recent presidential debate. They seem to possibly be doing "the Laplace correction" (see slide 21 here) of adding 1 to every word count to make sure you don't get divide-by-zero problems. This is the only thing I can think of to explain why the 20:2 "states" ended up in the middle of a bunch of 6:0 words like "forces", "union", "two", and, appropriately enough, "sort". If the metric is (kerry-count + 1) divided by the sum (kerry-count + 1) + (bush-count + 1) (which my sketchy memory of machine learning jargon suggests has the name TFIDF, "term frequency-inverse document frequency") then 20:2 comes out 21/24 and 6:0 gives 7/8, exactly equal! So a stable sort ought indeed to leave "states" mixed in with the rest.
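Here's a quick sketch of that arithmetic, assuming the metric really is the add-one ratio I'm guessing at. The function name `smoothed_share` is my own invention, not anything from Language Log, and exact fractions are used so the tie isn't hidden by floating-point rounding.

```python
from fractions import Fraction

def smoothed_share(kerry_count, bush_count):
    """Add-one (Laplace) smoothed share of a word's uses that are Kerry's.

    Hypothetical helper: this is a guess at the metric, not a confirmed one.
    """
    k = kerry_count + 1
    b = bush_count + 1
    return Fraction(k, k + b)

states = smoothed_share(20, 2)  # 21/24, which reduces to 7/8
forces = smoothed_share(6, 0)   # 7/8
print(states, forces, states == forces)  # 7/8 7/8 True
```

Since the two scores tie exactly, where "states" lands among the 6:0 words would depend only on the sort's tie-breaking, which for a stable sort is the input order.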
I'm not sure I agree that the fact that Bush's top words have bigger ratios of this kind than Kerry's is great evidence for "the greater repetitiveness of Bush's language", especially since Kerry hammered on "president" 87 times to Bush's 12. I'd like to just work out the entropy of one Kerry word vs. the entropy of one Bush word and see which candidate is delivering me more bits per word. Do you hear that, you Washington fat-cats? The people demand bits! Random White Noise party, 2008!
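For the bits-per-word idea, a minimal sketch might look like the following. It assumes each candidate's side of the transcript is available as a plain string, tokenizes crudely on whitespace, and measures only the entropy of the unigram word distribution (so word order is ignored). The function name `bits_per_word` and the two toy strings are made up for illustration; they are not real debate text.

```python
import math
from collections import Counter

def bits_per_word(text):
    """Shannon entropy, in bits, of the empirical unigram word distribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up toy strings, NOT real debate transcripts.
repetitive = "hard work hard work hard work hard work"
varied = "one two three four five six seven eight"

print(bits_per_word(repetitive))  # 1.0
print(bits_per_word(varied))      # 3.0
```

One caveat if you try this on the real thing: the estimate depends on how much text each speaker produced, so comparing speakers with very different word totals would probably need equal-sized samples.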
No, wait, my theory above isn't consistent with the placement of "president" between the 7:0s and the 8:0s. Not sure exactly what's going on.
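To see the mismatch concretely, here is a small check that ranks the 87:12 "president" against stand-ins for the 6:0, 7:0 and 8:0 groups, again under my guessed add-one ratio. The helper name `add_one_share` and the placeholder labels are hypothetical.

```python
from fractions import Fraction

# (kerry_count, bush_count); the N:0 rows stand in for whole groups of words.
counts = {
    "president": (87, 12),
    "a 6:0 word": (6, 0),
    "a 7:0 word": (7, 0),
    "an 8:0 word": (8, 0),
}

def add_one_share(kerry, bush):
    # Guessed metric, not confirmed: (kerry + 1) / ((kerry + 1) + (bush + 1))
    return Fraction(kerry + 1, kerry + bush + 2)

ranked = sorted(counts, key=lambda w: add_one_share(*counts[w]), reverse=True)
for word in ranked:
    print(word, add_one_share(*counts[word]))
```

Under this guess "president" scores 88/101, which is below even the 6:0 words at 7/8, so it would not land between the 7:0s and the 8:0s.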
Whoops, my example is crap. Kerry's repetition of the word "president" is probably mostly to be expected, since he uses the word in reference to Bush now, and also hypothetically in reference to himself in the future. I wonder if Lakoff would chide Kerry for referring to his opponent in the debate as anything but "my opponent".