Jason (jcreed) wrote,

Over lunch I had a couple of ideas for funny things to do to big lists of natural-language tokens. I am not any kind of NLP sophisticate; I just thought it would be interesting to find out what happened if I turned certain cranks.

Anyway, I coded up one of them just now, and here's the result of it applied to the text of Jane Eyre:
http://jcreed.org/text/jane/jane.html

It's a kind of clustering algorithm for the tokens. It goes back and forth in an EM-like way between (1) estimating the probability that a token has a certain label, given the labels of the tokens that occur next to it, and (2) finding the max-likelihood labeling of each token. Except, to keep the algorithm from always converging to something like "EVERY TOKEN IS LABEL 3 LOLOLOL", I had to abuse the probability calculation with a rather ad hoc regularization scheme. At this point I'm pretty sure it's no longer really the probability of anything. The dumb script is here, anyway, if you want to look:
http://jcreed.org/text/jane/jane.pl.txt

At least "he" and "she" ended up in the same class, as well as "the", "a", and "an".
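
A rough sketch of the back-and-forth described above, in Python rather than the Perl of the linked script. It is not that script and makes no claim to match it. The function name `cluster_tokens`, its parameters, the choice of one label per distinct word, the add-one smoothing, and especially the size penalty standing in for the "ad hoc regularization" are all assumptions made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def cluster_tokens(tokens, num_labels=4, rounds=10, penalty=1.0, seed=0):
    """Hard-EM-style clustering sketch: one label per distinct word (assumption)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    freq = Counter(tokens)
    # Start from a random labeling of the word types.
    label_of = {w: rng.randrange(num_labels) for w in vocab}
    interior = range(1, len(tokens) - 1)  # positions with both neighbours

    for _ in range(rounds):
        # (1) Count how often each label shows up in each context, where a
        # context is the pair (label of left neighbour, label of right neighbour).
        counts = defaultdict(Counter)
        for i in interior:
            ctx = (label_of[tokens[i - 1]], label_of[tokens[i + 1]])
            counts[ctx][label_of[tokens[i]]] += 1
        totals = {ctx: sum(c.values()) for ctx, c in counts.items()}

        # How many token occurrences each label currently covers.
        mass = Counter()
        for w in vocab:
            mass[label_of[w]] += freq[w]

        # (2) For every word, add up log P(label | context) over its
        # occurrences, using add-one smoothing (an assumption).
        scores = {w: [0.0] * num_labels for w in vocab}
        for i in interior:
            ctx = (label_of[tokens[i - 1]], label_of[tokens[i + 1]])
            for k in range(num_labels):
                p = (counts[ctx][k] + 1) / (totals[ctx] + num_labels)
                scores[tokens[i]][k] += math.log(p)

        # Hypothetical regularizer: dock labels that already cover a lot of
        # text, so that one label does not swallow every word.
        new_label_of = {}
        for w in vocab:
            new_label_of[w] = max(
                range(num_labels),
                key=lambda k: scores[w][k] - penalty * freq[w] * math.log(1 + mass[k]),
            )

        if new_label_of == label_of:  # labeling stopped changing
            break
        label_of = new_label_of

    return label_of
```

Calling `cluster_tokens(text.lower().split())` on a long text returns a dict from each distinct word to a label number. As with the original, once the penalty is subtracted the score is probably no longer the probability of anything, and which words end up together will depend heavily on `penalty` and `num_labels`.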
Tags: text