Jason (jcreed) wrote,

Had another brief spurt of NLP-amateur-hour times.

The topic this time was "Zipf's law", which is an empirically observed regularity that the nth most common word in a corpus occurs with frequency roughly proportional to 1/n. Ish. Maybe with a slightly different exponent on the 1/n, and maybe offset by a bit. The whole business starts smelling a little sketchy and "hey, pretty much anything looks linear on a log-log plot" if you stare at it too hard, but hey, natural language definitely does have more common and less common tokens, so what's the deal with that, huh.
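(For concreteness, here's the rank/frequency computation Zipf's law is about, sketched in Python on a made-up toy corpus — the text itself is just an illustrative assumption, and a real corpus would show the 1/n shape much more convincingly:)

```python
from collections import Counter

# Toy corpus (purely illustrative); a real corpus would be far larger.
text = ("the cat sat on the mat and the dog sat on the log "
        "and the cat saw the dog").split()

counts = Counter(text)
# Rank words by descending frequency: rank 1 = most common word.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    # Zipf's law says freq is roughly proportional to 1/rank.
    print(rank, word, freq)
```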

I can tell from very brief literature-reading that plenty of people have tried to make models to generate Zipfy effects, but instead of carefully reading about any of them, I tried playing with making my own.

A preposterously simple thing that seems to work well is:

Start with an array A of numbers 1..N.
For each iteration, pick two random numbers i,j in the range 1..N
Let m = max(A[i], A[j])
Set A[i] := m
Set A[j] := m
Repeat for a lot of iterations.
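The steps above can be sketched in a few lines of Python (the function name and the fixed seed are my own choices, not anything canonical):

```python
import random
from collections import Counter

def max_coalesce(N, iters, seed=0):
    """Repeatedly overwrite two random cells of [1..N] with their max."""
    rng = random.Random(seed)
    A = list(range(1, N + 1))
    for _ in range(iters):
        # Pick two random indices i, j (possibly equal).
        i = rng.randrange(N)
        j = rng.randrange(N)
        m = max(A[i], A[j])
        A[i] = m
        A[j] = m
    return A

A = max_coalesce(N=50_000, iters=250_000)
# "Frequency" of each surviving value = how many cells it occupies;
# sort descending to get the rank/frequency curve to plot log-log.
freqs = sorted(Counter(A).values(), reverse=True)
print(freqs[:5])
```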

The important parameter setting is that the number of iterations is about 5 times as big as N. Using N = 50,000 and 250,000 iterations I get the following nice-looking graph:

Tags: linguistics, math, zipf
