Playing around with SVG a bit. Here is a visualization of all the words in "The Hacker Crackdown", "Huckleberry Finn", and "Middlemarch":

The x-axis is the average position at which the word occurs in a sentence (so words that typically begin sentences are farther to the left, and words that occur late, and in long sentences, are farther to the right), and the y-axis is log frequency.
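For anyone who wants to reproduce the two coordinates, here is a rough sketch in Python. It is not the Perl script mentioned in this post. The sentence splitting, the tokenizing regex, the 0-based positions, and the use of the natural log of the raw count are all assumptions, and `word_stats` is a made-up name.

```python
import math
import re
from collections import defaultdict

def word_stats(text):
    """Map each word to (average position in its sentences, log of its count)."""
    # Assumption: sentences end at '.', '!' or '?'. The real script may differ.
    sentences = re.split(r"[.!?]+", text)
    position_sum = defaultdict(int)
    count = defaultdict(int)
    for sentence in sentences:
        # Assumption: a word is a run of lowercase letters and apostrophes.
        words = re.findall(r"[a-z']+", sentence.lower())
        for index, word in enumerate(words):
            position_sum[word] += index  # 0-based position within the sentence
            count[word] += 1
    return {
        word: (position_sum[word] / n, math.log(n))  # (x, y) for the plot
        for word, n in count.items()
    }

stats = word_stats("The cat sat. The dog sat down.")
print(stats["the"])   # x is 0.0 because "the" always opens the sentence
print(stats["down"])  # x is 3.0 because "down" only occurs late
```

Each word then becomes one point in the plot at those (x, y) coordinates.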

Full SVG files, larger PNGs, and the Perl (sorry!) script that made them all are in this directory.

From: altamira16
2007-09-16 03:08 pm (UTC)
The top two look a little stripy towards the bottom, and the bottom one does not. Would using ln vs. log10 change that or not?
From: jcreed
2007-09-16 07:15 pm (UTC)
Ultimately it's just because Middlemarch is a longer book.

I cut off at the bottom all words that occur less often than 1 in 20,000 times; in the top two graphs, the total word count of the book is something like 100,000 words, so the rarest words you're seeing occur about 5 times. The log of 5 is significantly different from the log of, say, 6 or 7, enough that you see banding. With Middlemarch the 1/20,000 mark is up around 16 I think, so the bands blur together.
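The arithmetic behind the banding can be checked with a few lines of Python. This is only a sketch: `log_gaps` is a hypothetical helper, and the cutoff counts 5 and 16 are the rough figures given above. Switching between ln and log10 multiplies every gap by the same constant, so it would not change how the bands look relative to each other.

```python
import math

def log_gaps(min_count, steps=3):
    """Vertical gaps between consecutive word counts, starting at min_count."""
    return [math.log(n + 1) - math.log(n)
            for n in range(min_count, min_count + steps)]

short_book = log_gaps(5)   # cutoff count for a book of about 100,000 words
long_book = log_gaps(16)   # rough cutoff count quoted for Middlemarch

print([round(g, 3) for g in short_book])
print([round(g, 3) for g in long_book])
```

The lowest gap in the shorter books, log(6/5), is roughly three times the lowest gap for Middlemarch, log(17/16), which is why the rows of rare words separate visibly in the first two plots and merge in the third.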
From: _wirehead_
2007-09-16 04:08 pm (UTC)
ooh, pretty.
From: chrsjxn
2007-09-16 05:41 pm (UTC)
Why are you apologizing for using a Perl script?

It's probably the best sort of tool that I could think of for this.
From: jcreed
2007-09-16 07:16 pm (UTC)
It's just because I am a PL wanker, and we fancy ourselves above such things usually.
From: nolacoaster
2007-09-17 01:26 pm (UTC)
Are these books all public domain?
From: jcreed
2007-09-17 01:29 pm (UTC)
"The Hacker Crackdown" has some other license IIRC, but they are all available at Project Gutenberg.
From: ex_thousand816
2007-09-18 06:35 pm (UTC)
very cool!!!
