December 15th, 2012

beartato phd

(no subject)

Reading some of Nate Silver's "The Signal and the Noise", finding it kind of underwhelming. My fault for having big expectations, I guess.

Here's a simple empirical question for the NLP nerdoscenti out there: I've got an English text, considered as a sequence of tokens. I can look at some prefix of that, say the first k tokens, and ask what the entropy H_k is of the probability distribution where you uniformly pick one of those k words. This is of course going to be somewhat less than log(k), because we reuse words in a text. I guess maybe in the very long limit it's going to level off at a constant, because you just arrive at the entropy of the probability distribution of all English words. But I'm kind of curious how fast it does, or how fast we should expect it to reach that limit, so:
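To pin down what I mean by H_k, here's a rough sketch in Python. The function names (`prefix_entropy`, `entropy_curve`) are just made up for illustration, and I'm assuming entropy in bits and that the text has already been split into tokens somehow. The second function uses the identity H_k = lg(k) - (1/k) * sum of c*lg(c) over word counts c, so the whole curve can be computed in one pass.

```python
import math
from collections import Counter

def prefix_entropy(tokens, k):
    """Entropy (in bits) of picking one of the first k tokens uniformly."""
    counts = Counter(tokens[:k])
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_curve(tokens):
    """Return [H_1, H_2, ..., H_n] in one pass over the tokens.

    Uses H_k = lg(k) - (1/k) * sum_w c_w * lg(c_w), where c_w is the
    count of word w among the first k tokens.
    """
    counts = Counter()
    s = 0.0  # running sum of c * lg(c) over distinct words
    curve = []
    for k, tok in enumerate(tokens, start=1):
        c = counts[tok]
        if c > 0:
            s -= c * math.log2(c)  # remove this word's old contribution
        counts[tok] = c + 1
        s += (c + 1) * math.log2(c + 1)  # add its new contribution
        curve.append(math.log2(k) - s / k)
    return curve
```

Something like `entropy_curve(text.lower().split())` would give the whole curve for a text, with the caveat that lowercasing and splitting on whitespace is a crude tokenizer and a different choice will shift the numbers.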

Question: in actual, real, realistic, pragmatic English texts in the wild, what's the relationship between H_k and k? I downloaded a few Project Gutenberg texts and made some log plots, and squinted a lot, and it looked very roughly like H_k is (2^(1/3))(lg k)(k^(-1/15)). But I haven't a clue what might be determining these arbitrary constants 2^(1/3) and -1/15 that I am just eyeballing off of vaguely linear-looking log-log plots* without even doing proper regressions.

*essentially all log-log plots are vaguely linear.
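For what it's worth, a proper regression for that functional form wouldn't be much work. If I assume the model H_k = c * lg(k) * k^a, then lg(H_k / lg k) = lg(c) + a * lg(k), which is a straight line in lg(k), so ordinary least squares gives c and a. This is only a sketch: `fit_power_law` is a made-up name, the model shape is my eyeballed guess rather than anything principled, and it needs k >= 2 and H_k > 0 for the logs to make sense.

```python
import math

def fit_power_law(ks, hs):
    """Least-squares fit of the assumed model H_k = c * lg(k) * k**a.

    ks: prefix lengths (each >= 2); hs: matching entropies (each > 0).
    Returns (c, a).
    """
    xs = [math.log2(k) for k in ks]
    ys = [math.log2(h / math.log2(k)) for k, h in zip(ks, hs)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                      # slope is the exponent
    c = 2 ** (mean_y - a * mean_x)     # intercept is lg(c)
    return c, a
```

Sampling k at roughly evenly spaced points in lg(k), say powers of two, keeps the long tail of large k from dominating the fit. A good-looking fit still wouldn't mean much, per the footnote.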