Jason (jcreed) wrote,

Noodled around with that same Python script over lunch. I did greedy agglomerative clustering of tokens, but after every addition of an item to a cluster, I re-ran the whole estimation of which tokens were closest, after patching the original data so that tokens in the same cluster were forced to be identical. The hope being that, as the model got a better sense of, like, "what a preposition is", it would be better able to guess other things.
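The loop above can be sketched roughly like so. The actual similarity model from my script isn't shown here, so this sketch substitutes an assumed stand-in: context-count vectors (counts of immediate left/right neighbors) compared by cosine similarity. The function names and the representative-token bookkeeping are all illustrative, not the original code.

```python
# Sketch of greedy agglomerative clustering with re-estimation after
# every merge. The similarity model (neighbor-count vectors + cosine)
# is an assumed stand-in for whatever the real script used.
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens):
    """Map each token to counts of its immediate left/right neighbors."""
    vecs = defaultdict(Counter)
    for i, t in enumerate(tokens):
        if i > 0:
            vecs[t]['L:' + tokens[i - 1]] += 1
        if i < len(tokens) - 1:
            vecs[t]['R:' + tokens[i + 1]] += 1
    return vecs

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a if k in b)
    den = (sqrt(sum(v * v for v in a.values()))
           * sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cluster(tokens, n_merges):
    # canon maps each surface token to its cluster's representative.
    canon = {t: t for t in tokens}
    for _ in range(n_merges):
        # Patch the corpus: tokens in the same cluster become identical,
        # then re-estimate all similarities from scratch.
        patched = [canon[t] for t in tokens]
        vecs = context_vectors(patched)
        reps = list(vecs)
        best, pair = -1.0, None
        for i in range(len(reps)):
            for j in range(i + 1, len(reps)):
                s = cosine(vecs[reps[i]], vecs[reps[j]])
                if s > best:
                    best, pair = s, (reps[i], reps[j])
        if pair is None:
            break
        a, b = pair
        # Merge b's cluster into a's.
        for t in canon:
            if canon[t] == b:
                canon[t] = a
    clusters = defaultdict(set)
    for t, r in canon.items():
        clusters[r].add(t)
    return clusters
```

On a toy corpus like `"the cat sat on the mat a cat sat on a mat".split()`, a single merge groups "the" with "a", since they occur in near-identical contexts.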

Here's an extract of the clusters it inferred, each headed by the most commonly occurring element of the cluster in my corpus (corpus = an arbitrarily selected Project Gutenberg ebook, apparently about whaling):

electronic: broken certain [ed. !!! ahahaha]

may: can might must will would

of: about above after against all among and around as at before below
  besides between beyond but by during especially even find for from
  having here him home how if illustration in indeed into leave like
  make meet near off on or over perhaps reached round see since such
  than that them though through to towards under until upon us when
  where whether while with within without yet

so: quite still therefore too

were: are could did ever felt go had have knew never passed saw should

not: also become been cast now remain soon usually well wholly

the: a almost an another any board clothing death each every finally
  five four god he her his its least making many march mr no october
  once our painful possible several sight six some take ten their then
  these thirty this those thus twelve twenty two walrus what which
  whom whose you your

was: has is tm

ship: agreement air american appearance arctic being best blubber boat
  boats body business captain case casks character circumstances
  citizen coast cold common company condition country course crew dead
  deck deep different direction distance dog earth efforts enterprise
  eyes fact fate few fire first fish fisher following food form
  foundation friends full future general good great greatest ground
  half hands head high hut huts ice islands kind land language large
  last length line little lives long manner mate means men minds more
  most name native natives new next north northern number o ocean
  officers oil one order ordinary other others own party place places
  polar port present provisions public purpose region relatives report
  right river sad sail sailor same sea seaman season seasons second
  settlement ships shore short skin small snow south species sperm
  spermaceti strong surface terms th thing thousand three time
  traveling true united very vessel water way weather whale whaling
  whole wind winter work wreck year young

out: account again along appeared arrival arrived articles ashore away
  back barrels bay began both bound brought c came carried come cut
  day days desire distant done down e either end engaged escape fall
  feet fitted floating found frozen gave given heard himself history
  hope however intelligence interest just killed known left living
  lost made miles months nearly obtained part pieces placed position
  put received remained return s said seemed seen set side skins
  something state states struck success supposed taken themselves
  together unable up use value view went works years 

I like how it nailed prepositions pretty well (in fact, "and", "or", and "but" were off in their own cluster earlier in the iteration, before they merged in), modal verbs all landed together, and a lot of numbers ended up nicely in the same bucket as determiners.

I can't quite tell what the difference between the two big content-word clusters is supposed to be.
Tags: language
