
Had another brief spurt of NLP-amateur-hour times.

The topic this time was "Zipf's law", which is the empirically observed regularity that the nth most common word in a corpus occurs with frequency roughly proportional to 1/n. Ish. Maybe with a slightly different exponent on the 1/n, and maybe offset by a bit. The whole business starts smelling a little sketchy and "hey, pretty much anything looks linear on a log-log plot" if you stare at it too hard, but hey, natural language definitely does have more common and less common tokens, so what's the deal with that, huh.
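
If you want to sanity-check that claim on an actual corpus, something like this does it (a minimal Python sketch; "corpus.txt" is a placeholder for whatever text file you have lying around):

import numpy as np
from collections import Counter

# Count word frequencies and sort them from most to least common.
with open("corpus.txt") as f:
    words = f.read().lower().split()
counts = sorted(Counter(words).values(), reverse=True)
ranks = np.arange(1, len(counts) + 1)

# Fit log(count) = a + s * log(rank); classic Zipf predicts s near -1.
s, a = np.polyfit(np.log(ranks), np.log(counts), 1)
print("fitted exponent:", s)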

I can tell from very brief literature-reading that plenty of people have tried to make models to generate Zipfy effects, but instead of carefully reading about any of them, I tried playing with making my own.

A preposterously simple thing that seems to work well is:

Start with an array A of numbers 1..N.
For each iteration, pick two random indices i, j in the range 1..N.
Let m = max(A[i], A[j])
Set A[i] := m
Set A[j] := m
Repeat for a lot of iterations.
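
In code, that's roughly the following (a minimal Python sketch of the procedure above; the names are mine):

import random

def run(N, iterations):
    # The array A starts out holding the distinct values 1..N.
    A = list(range(1, N + 1))
    for _ in range(iterations):
        # Pick two random cells and overwrite both with the larger value.
        i = random.randrange(N)
        j = random.randrange(N)
        m = max(A[i], A[j])
        A[i] = m
        A[j] = m
    return A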

The important parameter setting is that the number of iterations is about 5 times as big as N. Using N = 50,000 and 250,000 iterations I get a nice-looking graph: a roughly straight line on a log-log plot of rank against count.
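
To make that kind of plot yourself (assuming run() from the sketch above, and my reading that the graph is rank vs. count on log-log axes): treat each distinct surviving value in A as a "word", whose frequency is the number of cells that ended up holding it.

from collections import Counter
import matplotlib.pyplot as plt

N = 50_000
A = run(N, 5 * N)  # run() as defined in the sketch above

# Frequency of each "word" = how many cells of A hold that value.
counts = sorted(Counter(A).values(), reverse=True)
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts, marker=".", linestyle="none")
plt.xlabel("rank")
plt.ylabel("count")
plt.show()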

Tags: linguistics, math, zipf