Jason (jcreed) wrote,

Had another brief spurt of NLP-amateur-hour times.

The topic this time was "Zipf's law", the empirically observed regularity that the nth most common word in a corpus occurs with frequency roughly proportional to 1/n. Ish. Maybe with a slightly different exponent on the 1/n, and maybe offset by a bit. The whole business starts smelling a little sketchy and "hey, pretty much anything looks linear on a log-log plot" if you stare at it too hard, but hey, natural language definitely does have more common and less common tokens, so what's the deal with that, huh.
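(Spelled out, the "different exponent, maybe offset" version is the Zipf-Mandelbrot form people usually fit: f(n) ∝ 1/(n + b)^s, with s near 1 and b a smallish offset; plain Zipf is s = 1, b = 0.)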

I can tell from very brief literature-reading that plenty of people have tried to make models to generate Zipfy effects, but instead of carefully reading about any of them, I tried playing with making my own.

A preposterously simple thing that seems to work well (there's a runnable sketch after the steps below) is:

Start with an array A of the numbers 1..N (so A[i] = i to begin with).
For each iteration, pick two random indices i, j in the range 1..N.
Let m = max(A[i], A[j]).
Set A[i] := m.
Set A[j] := m.
Repeat for a lot of iterations.
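
Here's a minimal Python sketch of that procedure. The function name zipf_sim is made up, and reading off "word frequencies" as the multiplicities of the distinct values left in A is my own interpretation of how a rank-frequency distribution falls out of it.

import random
from collections import Counter

def zipf_sim(n=50_000, iterations=None, seed=None):
    # A starts out as 1..N, so every value begins as its own "word type".
    if iterations is None:
        iterations = 5 * n  # the rule of thumb from the post: ~5N iterations
    rng = random.Random(seed)
    a = list(range(1, n + 1))
    for _ in range(iterations):
        i = rng.randrange(n)
        j = rng.randrange(n)
        m = max(a[i], a[j])  # the larger value wins and overwrites both cells
        a[i] = m
        a[j] = m
    # Each distinct value left in A is a "word"; its count is how many
    # cells ended up holding it, i.e. its token frequency.
    return Counter(a)

Presumably the reason this works is a rich-get-richer dynamic: whenever a cell holding a large value meets one holding a smaller value, the large value takes over both, so a handful of values end up occupying many cells while most survive in only one or two.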

The important parameter setting is that the number of iterations is about 5 times as big as N. Using N = 50,000 and 250,000 iterations I get the following nice-looking graph:
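The graph is presumably a rank-frequency plot on log-log axes. Here's a sketch of how one might draw that kind of plot, assuming matplotlib and the zipf_sim helper above; the labels, title, and marker size are my own choices.

import matplotlib.pyplot as plt

counts = zipf_sim(n=50_000, iterations=250_000)
freqs = sorted(counts.values(), reverse=True)  # most common value first
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, '.', markersize=2)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.title('max-coalescing model, N = 50,000, 250,000 iterations')
plt.show()

A roughly straight line with slope near -1 on this plot is the Zipfy signature, since f ∝ 1/n means log f = -log n + constant.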

Tags: linguistics, math, zipf