Term Distribution
We want to look at the probability distribution of how often a term
occurs in a document. To eliminate document length as a variable, we
binned the documents by length.
So far we have looked primarily at the group of documents in
document length bin #2. There are 10,530 documents in this bin (d),
with an average length of 431.5 (n). Some results are also available
for data set #1, for which there are 25,295 documents in this bin (d),
with an average length of 125.4 (n).
n: Document length
d: Number of documents
f: Frequency of a term in the bin over the corpus (= pnd)
p: Term's probability of occurrence (= f/nd)
k: Number of terms that occur f times (terms in the frequency bin)
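As a quick check on how these quantities relate, the sketch below computes p from f, n and d using the bin #2 figures. The value f = 50 is a made-up example, not a measured one, and the function name is only illustrative.

```python
def term_probability(f, n, d):
    """Per-position probability of a term: p = f / (n * d)."""
    return f / (n * d)

# n and d are the bin #2 figures quoted above; f = 50 is a hypothetical
# corpus frequency chosen only for illustration.
n = 431.5
d = 10530
f = 50

p = term_probability(f, n, d)
print(p)            # per-position probability of the term
print(p * n * d)    # recovers f, since f = p * n * d
```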
To get a probability distribution over the number of times a term occurs
in a document, we made the following graph (which can be found here:
gif, set #1, ps, set #1,
gif, set #2, ps, set #2):
On the x axis: the term occurs exactly x times.
On the y axis: the number of times a term occurs exactly x times / k
=> This is equal to the probability that the term occurs exactly x
times, multiplied by d.
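The y values described above could be computed roughly as follows. This is a minimal sketch: the input layout (a dict mapping each term in the frequency bin to its nonzero per-document counts), the toy data, and the function name are all assumptions, not the code used to produce the graphs.

```python
from collections import Counter

def occurrence_distribution(counts_by_term, d):
    """Return {x: y} where y = (number of (term, document) pairs in which
    the term occurs exactly x times) / k, and k is the number of terms.

    counts_by_term maps each term to its nonzero per-document counts;
    documents where the term is absent are inferred from d.
    """
    k = len(counts_by_term)
    totals = Counter()
    for counts in counts_by_term.values():
        totals.update(counts)              # tally x = 1, 2, ...
        totals[0] += d - len(counts)       # documents with zero occurrences
    return {x: totals[x] / k for x in sorted(totals)}

# Toy data: 2 terms, 5 documents (hypothetical numbers).
toy = {"alpha": [1, 1, 2], "beta": [1, 3]}
dist = occurrence_distribution(toy, d=5)
print(dist)
# Dividing each y by d gives the estimated probability of exactly x occurrences.
print({x: y / 5 for x, y in dist.items()})
```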
Notice that the graph has a much heavier tail than would be predicted by
our original Bayesian model. You can see here that
the graphs cross between 0 and 1 occurrences in a document.
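The original model is not spelled out on this page. As a stand-in, the sketch below assumes a simple binomial model in which each of the n word positions in a document is the term with probability p = f/(nd), and prints the expected number of documents with exactly x occurrences. Both the model choice and the value f = 50 are assumptions for illustration, and n is rounded to a whole number.

```python
from math import comb

def binomial_expected_docs(x, f, n, d):
    """Expected number of documents (out of d) in which a term occurs
    exactly x times, if each of n positions is the term with p = f/(n*d)."""
    p = f / (n * d)
    return d * comb(n, x) * p**x * (1 - p)**(n - x)

# Bin #2 figures with n rounded to an integer; f = 50 is hypothetical.
n, d, f = 432, 10530, 50
for x in range(5):
    print(x, binomial_expected_docs(x, f, n, d))
```

Under this assumed model the expected counts fall off very quickly as x grows, which is the kind of thin tail that the observed curves exceed.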
On a log-log scale, the lines appear almost linear, at least for low values
(gif, set #1, gif, set #2).
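One way to quantify "almost linear on a log-log scale" is to fit a straight line to (log x, log y) by least squares; the slope is then the exponent of a power law y = c * x^slope. The sketch below uses made-up points, not the measured curves, and the function name is illustrative.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept).
    Points with x <= 0 or y <= 0 must be dropped first, since log is undefined."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data following y = 100 * x**-2 exactly.
xs = [1, 2, 3, 4, 5]
ys = [100 * x ** -2 for x in xs]
slope, intercept = loglog_fit(xs, ys)
print(slope, math.exp(intercept))
```

Note that the x = 0 point cannot appear on a log-log plot, so a fit like this only describes the curve from one occurrence upward.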
We then looked at the effect that frequency has on the number of occurrences.
We did this by making these graphs: dist2.occ.0.ps,
dist2.occ.all-0.ps and
dist2.occ.all.log.ps. I'll write up
a little more on this later . . . Note that the x axis has one point
for each bin (a bin of terms with similar f). Since the bins
grow exponentially, the x axis is in effect already on a log scale.
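To illustrate why exponentially growing bins already act like a log scale: if the bin boundaries grow by a constant factor, the bin index is just the logarithm of f, rounded down. The growth factor used here (2) is an assumption; the actual bin boundaries are not given on this page.

```python
import math

def bin_index(f, base=2.0):
    """Index of the exponential bin containing corpus frequency f >= 1.
    Bin i covers base**i <= f < base**(i + 1). base=2 is a hypothetical choice."""
    if f < 1:
        raise ValueError("f must be at least 1")
    i = int(math.floor(math.log(f, base)))
    # Guard against floating-point error at exact bin boundaries.
    if base ** (i + 1) <= f:
        i += 1
    elif base ** i > f:
        i -= 1
    return i

for f in [1, 2, 3, 4, 7, 8, 100, 1000]:
    print(f, bin_index(f))
```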
(Also available: dist1.occ.gif,
dist1.occ.ps,
dist2.occ.gif,
dist2.occ.ps)
The parallel lines here seem to imply that the frequency of a term within
a document isn't important. To investigate this further, see the graphs in
this directory, and specifically
probOnce.probTwiceOverProbOnce.1.gif and
probOnce.probTwiceOverProbOnce.2.gif. Note that there does
seem to be a lot of variation within bins (error bars are standard deviation).
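The filenames suggest that these graphs relate P(term occurs once) to P(twice)/P(once), summarized within each frequency bin. That reading, the input layout, and the function names below are assumptions; the sketch only shows how such per-bin means and standard deviations might be computed.

```python
from statistics import mean, stdev

def once_and_ratio(counts, d):
    """For one term: (P(exactly 1 occurrence), P(exactly 2) / P(exactly 1)).
    counts lists the term's nonzero per-document counts; d is the number
    of documents."""
    once = sum(1 for c in counts if c == 1)
    twice = sum(1 for c in counts if c == 2)
    if once == 0:
        return None          # ratio undefined; skip this term
    return once / d, twice / once

def bin_summary(terms_counts, d):
    """Mean and sample standard deviation of both quantities over a bin."""
    pairs = [r for r in (once_and_ratio(c, d) for c in terms_counts) if r]
    p_once = [p for p, _ in pairs]
    ratios = [r for _, r in pairs]
    return (mean(p_once), stdev(p_once)), (mean(ratios), stdev(ratios))

# Hypothetical bin of three terms in a 10-document collection.
toy_bin = [[1, 1, 2], [1, 1, 1, 1, 2, 2], [1, 2, 3]]
print(bin_summary(toy_bin, d=10))
```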
Now, there seems to be an inconsistency between these graphs and the
parallel line graphs, namely that there is a rise at the end in these graphs.
We have extended the parallel line graphs to include the area with the tail.
Go here to see those graphs.
We want to get an idea of what makes terms unique. For this reason, I have
graphed the term distribution for
every term in the 49-57 corpus frequency range in data set 2.
Can we see any patterns here? Graphs can be found
here. Every file in that directory is a
graph of the same data; they differ only in zoom level (small is the
most zoomed in) and in format (PostScript or GIF).
Corpus frequency versus Document frequency
The relationship seems to be pretty linear.
cf_v_df.1.gif,
cf_v_df.2.gif
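A simple way to check "pretty linear" is to compute the Pearson correlation between corpus frequency (cf) and document frequency (df), along with a least-squares slope. The data below are invented for illustration and the function name is hypothetical.

```python
import math

def pearson_and_slope(cf, df):
    """Return (Pearson r, least-squares slope of df on cf)."""
    n = len(cf)
    mc, md = sum(cf) / n, sum(df) / n
    scc = sum((c - mc) ** 2 for c in cf)
    sdd = sum((x - md) ** 2 for x in df)
    scd = sum((c - mc) * (x - md) for c, x in zip(cf, df))
    return scd / math.sqrt(scc * sdd), scd / scc

# Hypothetical (cf, df) pairs. df never exceeds cf, since every document
# containing the term contributes at least one occurrence.
cf = [10, 20, 40, 80, 160]
df = [9, 17, 33, 65, 129]
r, slope = pearson_and_slope(cf, df)
print(r, slope)
```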
Average and document frequency
AVG as a variable to consider along with DF
AVG considered without DF
Standard deviation for AVG compared with DF