Jason (jcreed) wrote,

Reading some of Nate Silver's "The Signal and the Noise", finding it kind of underwhelming. My fault for having big expectations, I guess.

Here's a simple empirical question for the NLP nerdoscenti out there: I've got an English text, considered as a sequence of tokens t1...tn. I can look at some prefix of that, say t1...tk, and ask what the entropy H_k is of the probability distribution you get by picking one of those k tokens uniformly at random (so each distinct word gets probability proportional to how often it shows up in the prefix). This is of course going to be somewhat less than log(k), because we reuse words in a text. I guess maybe in the very long limit it's going to level off at a constant, because you just arrive at the entropy of the probability distribution of all English words. But I'm kind of curious how fast it does that, or how fast we should expect it to reach that limit, so:
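To pin down what H_k means here, a minimal Python sketch of the computation follows. The names `tokenize` and `prefix_entropies` are made up for illustration, and the lowercase-letters-and-apostrophes tokenizer is an assumption rather than anything the question depends on. It uses the identity H_k = lg(k) - (1/k) * sum over words of c*lg(c), where c is the word's count in the prefix, so a whole text can be swept in one pass.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is a run of lowercase letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def prefix_entropies(tokens, ks):
    """Return {k: H_k} in bits for each requested prefix length k.

    H_k is the entropy of the empirical word distribution over tokens[:k].
    """
    wanted = sorted({k for k in ks if 1 <= k <= len(tokens)})
    counts = Counter()
    s = 0.0  # running value of sum_w c_w * log2(c_w)
    out = {}
    idx = 0
    for i, tok in enumerate(tokens, start=1):
        c = counts[tok]
        if c > 0:
            s -= c * math.log2(c)  # remove this word's old contribution
        c += 1
        counts[tok] = c
        s += c * math.log2(c)      # add its new contribution
        if idx < len(wanted) and wanted[idx] == i:
            out[i] = math.log2(i) - s / i
            idx += 1
    return out
```

Something like `prefix_entropies(tokenize(text), [2**j for j in range(1, 20)])` would give H_k at power-of-two prefix lengths, which is convenient for log plots.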

Question: in actual, real, realistic, pragmatic English texts in the wild, what's the relationship between H_k and k? I downloaded a few Project Gutenberg texts and made some logplots, and squinted a lot, and it looked very roughly like H_k is (2^(1/3))(lg k)(k^(-1/15)). But I haven't a clue what might be determining these arbitrary constants 2^(1/3) and -1/15 that I am just eyeballing off of vaguely linear-looking log-log plots* without even doing proper regressions.

*essentially all log-log plots are vaguely linear.
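For what a slightly-more-proper regression might look like: if H_k is roughly a * lg(k) * k^b, then lg(H_k / lg k) = lg(a) + b * lg(k), which is a straight line in lg(k). The sketch below does ordinary least squares on that line. The function name `fit_entropy_curve` is hypothetical, and the model form is just the eyeballed guess above, not something known to be right.

```python
import math


def fit_entropy_curve(hk):
    """Fit H_k ~ a * lg(k) * k**b by least squares in log-log space.

    hk: dict mapping prefix length k (>= 2) to H_k (> 0), in bits.
    Returns (a, b).
    """
    # x = lg(k), y = lg(H_k / lg(k)); points with k < 2 or H_k <= 0 are skipped.
    pts = [
        (math.log2(k), math.log2(h / math.log2(k)))
        for k, h in hk.items()
        if k >= 2 and h > 0
    ]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two usable (k, H_k) points")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    b = sxy / sxx               # slope: the exponent on k
    a = 2 ** (my - b * mx)      # intercept, mapped back out of log space
    return a, b
```

Fed measured (k, H_k) pairs from a real text, this returns the constants to compare against 2^(1/3) and -1/15. A good fit here still wouldn't say the functional form is correct, per the footnote.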
Tags: books, language, math, statistics