Unique Word Occurrences

Among documents of the same length, how is the number of unique words per document distributed? Check out what we found for document set 1 and document set 2. The curves look roughly bell-shaped.
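As a rough illustration of the measurement described here, the following sketch groups documents by length and collects the unique-word count of each. The function name `unique_counts_by_length`, the sample documents, and the tokenization (lowercased, split on whitespace) are all assumptions made for the example, not necessarily how the original counts were produced.

```python
from collections import defaultdict

def unique_counts_by_length(docs):
    """Map each document length (in tokens) to the unique-word counts seen at that length."""
    groups = defaultdict(list)
    for text in docs:
        # Assumption: whitespace tokenization with case folding.
        tokens = text.lower().split()
        groups[len(tokens)].append(len(set(tokens)))
    return dict(groups)

# Hypothetical sample documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "a dog and a cat and",
    "one two three four five six",
    "to be or not",
]

for length, counts in sorted(unique_counts_by_length(docs).items()):
    print(length, sorted(counts))
```

Plotting a histogram of each list for one fixed length would give the kind of curve discussed above.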

What about the overall relationship between the number of unique words and document length? Initial tests seemed to indicate a linear relationship (unique_words.old.gif, unique_words-comparison.old.gif). I believe these tests were done only on the WSJ documents. More extensive tests, however, seem to show some sort of curve, especially at the longer document lengths (uniqueWords.low.gif, uniqueWords.high.gif). Note that there seem to be two separate tails here. Could this be data from two different document sets, each with its own linear relationship?
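One way to probe the two-document-set idea would be to fit a separate line to each set and compare the slopes. The sketch below uses an ordinary least-squares fit. The helper `fit_line` and the (length, unique words) pairs are invented for illustration and are not measurements from the tests above.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy (document length, unique words) pairs, made up purely for illustration.
set_a = [(100, 60), (200, 120), (300, 180)]
set_b = [(100, 40), (200, 80), (300, 120)]

slope_a, intercept_a = fit_line(set_a)
slope_b, intercept_b = fit_line(set_b)
print(slope_a, slope_b)
```

If the per-set slopes differ clearly while each set on its own fits a line well, pooling the sets would produce two diverging tails like those in the plots. A poor per-set fit would instead point to a genuinely curved relationship.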