June 16th, 2008

beartato phd

(no subject)

I remember seeing a couple of folks from IBM's Many Eyes give a talk here a while ago — neat stuff. LanguageLog linked to this visualization in particular, with a challenge to guess what "ne" might mean just by looking at this corpus. My gut suggests some kind of discourse particle, since it seems to occur at the beginning and end of sentences so often, but I think I am being unconsciously biased by it being one in Japanese.

It got me thinking about various things, which I'm sure (especially corpus-based) NLP people have actually thought pretty hard about. I mean, just what can you figure out just from co-occurrence patterns of tokens? The fact that visualization is so helpful suggests that we have reached a point where the algorithms we know how to write (for visualization or for massive statistical analysis) and the algorithms in our brains that we know how to use (i.e. stare at something and see if any patterns reveal themselves) are no longer in a relationship where one just beats the other, but are complementarily useful.

A specific open question to those of you who know a lot of NLP/machine learning stuff: how hard/already pretty well solved is the following problem? You are given some random typical Western European text from which all the spaces have been removed, and all letters have been permuted - that is, a random cipher has been applied, just to keep you from cheating and using dictionaries. Just from statistical properties of letter co-occurrence alone, figure out where spaces should go.
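A crude baseline for that problem, just to make it concrete: estimate character-bigram transition probabilities from the unsegmented text itself, and guess a word boundary wherever the transition probability hits a local minimum (the old Harris-style idea that predictability drops at word edges). This is a hypothetical sketch, not a claim about what the state of the art does; `segment` and the local-minimum threshold are my own inventions here, and being cipher-invariant is the one property it does have by construction, since it uses no dictionary.

```python
from collections import Counter

def segment(text):
    """Insert guessed word boundaries into space-free text using only
    character co-occurrence statistics from the text itself."""
    if len(text) < 3:
        return text
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)

    # Estimated transition probability P(next char | current char).
    def p(a, b):
        return bigrams[(a, b)] / unigrams[a]

    probs = [p(a, b) for a, b in zip(text, text[1:])]

    out = [text[0]]
    for i in range(1, len(text)):
        here = probs[i - 1]                      # transition into text[i]
        left = probs[i - 2] if i >= 2 else float('inf')
        right = probs[i] if i < len(probs) else float('inf')
        # Strict local minimum in transition probability -> guess a boundary.
        if here < left and here < right:
            out.append(' ')
        out.append(text[i])
    return ''.join(out)
```

Since the method only looks at co-occurrence counts, applying a substitution cipher to the input permutes the rows and columns of the bigram table but leaves every probability unchanged, so the guessed boundaries land in exactly the same places.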