Unique word occurrences
Among documents of the same length, how is the number of unique words
per document distributed? See what we found for document set 1 and
document set 2. The curves look roughly bell-shaped.
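A minimal sketch of how such a per-length distribution could be tabulated.
The function name unique_counts_by_length and the lowercase,
whitespace-based tokenization are assumptions made for illustration; the
original tests may have tokenized differently.

```python
from collections import defaultdict


def unique_counts_by_length(documents):
    """Group documents by length (in tokens) and collect, for each length,
    the number of unique words seen in every document of that length.

    Tokenization here is a plain lowercase whitespace split, which is an
    assumption and not necessarily what the original tests used.
    """
    groups = defaultdict(list)
    for text in documents:
        tokens = text.lower().split()
        groups[len(tokens)].append(len(set(tokens)))
    return dict(groups)
```

For any one document length, a histogram of the returned list would give
the kind of curve described above.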
What about the overall relationship between the number of unique words
and document length? Initial tests seemed to indicate a linear relationship
(unique_words.old.gif,
unique_words-comparison.old.gif).
I believe these tests were done only on the WSJ documents.
However, more extensive tests seem to show some sort of curve, especially
in the high document length ranges
(uniqueWords.low.gif,
uniqueWords.high.gif). Note that there
seem to be two separate tails here. Could this be data from two different
document sets with different linear relationships?
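One way to probe the two-tails question would be to fit a straight line to
each document set separately and compare the slopes. The sketch below is an
ordinary least-squares fit in plain Python; the name fit_line and the
(document length, unique words) point format are assumptions made for
illustration, not the method used for the plots above.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points is a list of (document_length, unique_word_count) pairs and
    must contain at least two distinct document lengths.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Calling fit_line once per document set and getting clearly different slopes
would be consistent with the two tails coming from two sets; similar slopes
would suggest that the curvature has some other cause.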