November 1st, 2011


(no subject)

Had a couple of ideas over lunch for funny things to do to big lists of natural-language tokens. I am not any kind of NLP sophisticate; I just thought it would be interesting to find out what happened if I turned certain cranks.

Anyway, I coded up one of them just now, and here's the result of applying it to the text of Jane Eyre:

It's a kind of clustering algorithm for the tokens. It goes back and forth in an EM-like way between (1) estimating the probability that a token gets a certain label, given that it occurs in contexts next to tokens with other labels, and (2) finding the max-likelihood labeling of each token. Except, to keep the algorithm from always converging to something like "EVERY TOKEN IS LABEL 3 LOLOLOL", I had to abuse the probability calculation with a rather ad hoc regularization scheme. At this point I'm pretty sure it's no longer really the probability of anything. The dumb script is here, anyway, if you want to look:
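To give a rough idea of the back-and-forth described above, here's a minimal sketch of that kind of hard-EM labeling loop. This is not the actual script; the function name, the choice of adjacent-word contexts, and the add-`reg` smoothing (standing in for the ad hoc regularization) are all my own assumptions.

```python
import math
import random
from collections import Counter, defaultdict

def cluster_tokens(tokens, k=10, iters=20, reg=1.0, seed=0):
    """Sketch of an EM-like hard clustering of token types.

    Alternates between (1) estimating how likely each label is to
    appear next to each other label under the current labeling, and
    (2) reassigning every token type the single label that maximizes
    the likelihood of its neighbors' labels.  `reg` is an add-one-style
    smoothing term, a stand-in for the post's ad hoc regularization.
    """
    rng = random.Random(seed)
    types = sorted(set(tokens))
    label = {w: rng.randrange(k) for w in types}  # random initial labeling

    for _ in range(iters):
        # (1) count label-label adjacencies under the current labeling
        pair = defaultdict(Counter)
        for a, b in zip(tokens, tokens[1:]):
            pair[label[a]][label[b]] += 1
            pair[label[b]][label[a]] += 1
        total = {l: sum(c.values()) for l, c in pair.items()}

        def p(l, m):
            # smoothed estimate of P(neighbor has label m | token has label l)
            return (pair[l][m] + reg) / (total.get(l, 0) + reg * k)

        # neighbor-label counts for each token type, under the old labeling
        ctx = defaultdict(Counter)
        for a, b in zip(tokens, tokens[1:]):
            ctx[a][label[b]] += 1
            ctx[b][label[a]] += 1

        # (2) hard max-likelihood relabeling of each token type
        label = {
            w: max(range(k),
                   key=lambda l: sum(n * math.log(p(l, m))
                                     for m, n in ctx[w].items()))
            for w in types
        }
    return label
```

Without the smoothing term, a loop like this is exactly the sort of thing that happily collapses every token into one label, which is presumably what the regularization hack was fighting.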

At least "he" and "she" ended up in the same class, as well as "the", "a", and "an".