The Effect of Length on log(1+x)
Let's look at the sum of the log(1+x) for each term in documents of two
different lengths [gif]. Since there is a lot of
noise, let's bin the terms [gif].
The result is a sort-of linear relationship.
NEW: Instead of two different document lengths, two different sum( log(1+x) )s
[gif].
Now let's look at how document length affects the average term's
log(1+x) in documents of that length [by source,
all].
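For anyone wanting to reproduce these numbers, here is a minimal sketch of how the two per-document quantities could be computed. It rests on assumptions these notes do not confirm: x is taken to be the number of times a term occurs in a document, document length is the token count, and the "average term" is averaged over distinct terms. The names log1p_stats and mean_by_length are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def log1p_stats(tokens):
    """Return (length, sum of log(1+x), average log(1+x)) for one document.

    Assumption: x is the count of each distinct term in the document.
    """
    counts = Counter(tokens)
    values = [math.log1p(c) for c in counts.values()]
    total = sum(values)
    return len(tokens), total, total / len(values)


def mean_by_length(docs):
    """Average the per-document mean log(1+x) over documents of equal length."""
    grouped = defaultdict(list)
    for tokens in docs:
        if not tokens:
            continue  # an empty document has no terms to average
        length, _, avg = log1p_stats(tokens)
        grouped[length].append(avg)
    return {length: sum(v) / len(v) for length, v in sorted(grouped.items())}
```

Keying the dictionary on the sum of log(1+x) (rounded or binned) instead of on length would give the second family of graphs.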
We also have the same graphs for the effect that a document's sum of
log(1+x)'s has on the average term's log(1+x) in documents with that sum
[by source, all].
There's a weird hump, so I looked at the bin size to see whether there were
enough documents in those high-end bins to warrant drawing conclusions from
them [gif]. The bin size drops into the low hundreds by
the time we're looking at documents with a sum of log(1+x)'s above 500,
and remains that way until the end.
That still seems to be a reasonable number of documents.
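The bin-size check can be sketched roughly as follows. The bin width of 50 and the 100-document threshold are placeholders, not the values used for the graphs, and bin_sizes and sparse_bins are hypothetical names.

```python
from collections import Counter


def bin_sizes(sums, width=50.0):
    """Count documents per bin of sum(log(1+x)); keys are bin lower edges."""
    counts = Counter(int(s // width) for s in sums)
    return {i * width: counts[i] for i in sorted(counts)}


def sparse_bins(sizes, min_docs=100):
    """Return the lower edges of bins holding fewer than min_docs documents."""
    return [start for start, n in sizes.items() if n < min_docs]
```

Running sparse_bins on the output of bin_sizes gives a quick list of the bins where the average is least trustworthy.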
Anyway, I also made some graphs focusing on the pre-hump portion of the
graph [log(1+x), bin size].