Term Distribution

We want to look at the probability distribution of how often a term occurs in a document. To eliminate document length as a variable, we binned the documents by length.
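As a rough illustration of the binning step, here is a minimal sketch. The bin edges and the function names (length_bin, bin_documents, bin_summary) are made up for the example; the edges actually used for these notes are not recorded here.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical lower edges (in tokens) separating the length bins.
# The edges actually used for these notes are not stated.
BIN_EDGES = [50, 200, 800, 3200]


def length_bin(doc_length, edges=BIN_EDGES):
    """Return the index of the length bin that doc_length falls into."""
    return bisect_right(edges, doc_length)


def bin_documents(docs, edges=BIN_EDGES):
    """Group tokenized documents (lists of tokens) by length bin."""
    bins = defaultdict(list)
    for doc in docs:
        bins[length_bin(len(doc), edges)].append(doc)
    return dict(bins)


def bin_summary(docs_in_bin):
    """Return (d, n): the number of documents and their average length."""
    d = len(docs_in_bin)
    n = sum(len(doc) for doc in docs_in_bin) / d
    return d, n
```

bin_summary gives the two numbers quoted for each bin below: d and n.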

Initially we've been looking primarily at the group of documents in document length bin #2. There are 10,530 documents in this bin (d), with an average length of 431.5 (n). Some results are also available for data set #1, for which there are 25,295 documents in this bin (d), with an average length of 125.4 (n).

n: Document length
d: Number of documents
f: Frequency of terms in bin over corpus (=pnd)
p: Term's probability of occurrence (=f/nd)
k: Number of terms that occur f number of times (terms in freq. bin)

To get a probability distribution over the number of times a term occurs in a document, we made the following graph (which can be found here: gif, set #1; ps, set #1; gif, set #2; ps, set #2):

On the x axis: Term occurs exactly x times.
On the y axis: The number of times a term in the bin occurs exactly x times in a document, divided by k
   => This is equal to the probability that the term occurs exactly x times, multiplied by d
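The y values can be computed along these lines. This is only a sketch of one way to do it, not the script that produced the graphs; occurrence_histogram is a hypothetical name, docs stands for the tokenized documents of one length bin, and terms for the k terms of one frequency bin.

```python
from collections import Counter


def occurrence_histogram(docs, terms):
    """Map x to (number of term/document pairs with exactly x occurrences) / k.

    docs:  list of token lists (the d documents of one length bin)
    terms: the k terms of one frequency bin
    """
    terms = set(terms)
    k = len(terms)
    d = len(docs)
    pairs = Counter()
    for doc in docs:
        # Within-document counts for the terms of interest only.
        counts = Counter(tok for tok in doc if tok in terms)
        for x in counts.values():
            pairs[x] += 1
    # Every (term, document) pair that was not seen has x = 0.
    pairs[0] = k * d - sum(pairs.values())
    return {x: count / k for x, count in sorted(pairs.items())}
```

The y values sum to d, so dividing each one by d gives the probability that a term from the bin occurs exactly x times in a document.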

Notice that the graph has a much larger tail than would be predicted by our original Bayesian model. You can see here that the graphs cross between 0 and 1 occurrences in a document.
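The original model is not spelled out on this page. Purely as a stand-in baseline (an assumption, not the model itself), the sketch below treats every token of a document as an independent draw that is the term with probability p = f/(nd), which gives a binomial distribution over x. The function name and the rounding of n to an integer are choices made for the example.

```python
from math import comb


def binomial_expected_docs(x, f, n, d):
    """Expected number of documents with exactly x occurrences of a term.

    Assumed stand-in baseline: each of the n tokens in a document is the
    term independently with probability p = f / (n * d).
    """
    n_int = round(n)      # the average length n is not an integer
    p = f / (n * d)
    return d * comb(n_int, x) * p ** x * (1 - p) ** (n_int - x)
```

Under this baseline the expected counts drop off very quickly as x grows, so a curve that stays well above them for larger x has a heavier tail than independent draws would give.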

On a log-log scale, the lines appear almost linear, at least for low values (gif, set #1, gif, set #2).
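One way to check how close to linear the log-log curves are is a least-squares line through (log x, log y). This is a sketch only; loglog_slope is a hypothetical helper, and points with x = 0 or y = 0 are skipped because their logarithm is undefined.

```python
from math import log


def loglog_slope(points):
    """Least-squares slope and intercept of log(y) against log(x).

    points: iterable of (x, y) pairs; pairs with x <= 0 or y <= 0 are skipped.
    """
    logs = [(log(x), log(y)) for x, y in points if x > 0 and y > 0]
    m = len(logs)
    mean_x = sum(lx for lx, _ in logs) / m
    mean_y = sum(ly for _, ly in logs) / m
    sxx = sum((lx - mean_x) ** 2 for lx, _ in logs)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A straight line on the log-log plot corresponds to y being proportional to x raised to the fitted slope.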

We then looked at the effect that frequency has on the number of occurrences. We did this by making these graphs: dist2.occ.0.ps, dist2.occ.all-0.ps and dist2.occ.all.log.ps. I'll write up a little more on this later. Note that the x axis has one point for each bin (bin of documents with similar f). Since the bins grow exponentially, the x axis can be seen as already being on a log scale. (Also available: dist1.occ.gif, dist1.occ.ps, dist2.occ.gif, dist2.occ.ps)
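For reference, exponentially growing frequency bins can be assigned as in the sketch below. The growth factor base is a placeholder (the factor actually used is not given here), and frequency_bin and bin_terms are hypothetical names.

```python
from collections import defaultdict


def frequency_bin(f, base=2.0):
    """Index i of the bin with base**i <= f < base**(i + 1), for f >= 1."""
    if f < 1:
        raise ValueError("corpus frequency must be at least 1")
    index = 0
    upper = base
    while f >= upper:
        upper *= base
        index += 1
    return index


def bin_terms(corpus_freq, base=2.0):
    """Group terms by frequency bin; corpus_freq maps term -> f."""
    bins = defaultdict(list)
    for term, f in corpus_freq.items():
        bins[frequency_bin(f, base)].append(term)
    return dict(bins)
```

Because each bin covers a range base times wider than the previous one, plotting one point per bin spaces the points evenly in log f.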

The parallel lines here seem to imply that the frequency of a term within a document isn't important. To investigate this further, see the graphs in this directory, and specifically probOnce.probTwiceOverProbOnce.1.gif and probOnce.probTwiceOverProbOnce.2.gif. Note that there does seem to be a lot of variation within bins (error bars are standard deviation). Now, there seems to be an inconsistency between these graphs and the parallel line graphs, namely that there is a rise at the end in these graphs. We have extended the parallel line graphs to include the area with the tail. Go here to see those graphs.
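Going by the file names, these graphs compare the probability that a term occurs once with the ratio of the probability that it occurs twice to the probability that it occurs once. A minimal sketch of those per-term quantities and of the per-bin mean and standard deviation follows; the function names are hypothetical, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def once_and_ratio(docs, term):
    """Return (P(once), P(twice) / P(once)) for one term.

    docs: list of token lists. The ratio is None when the term never
    occurs exactly once, since it is undefined in that case.
    """
    d = len(docs)
    once = sum(1 for doc in docs if doc.count(term) == 1)
    twice = sum(1 for doc in docs if doc.count(term) == 2)
    ratio = twice / once if once else None
    return once / d, ratio


def bin_mean_and_spread(values):
    """Mean and sample standard deviation of the defined values in a bin."""
    values = [v for v in values if v is not None]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

The spread returned by bin_mean_and_spread is what an error bar on such a graph would show for one bin.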

We want to get an idea of what makes terms unique. For this reason, I have graphed the term distribution for every term in the 49-57 corpus frequency range in data set 2. Can we see any patterns here? Graphs can be found here. Everything in that directory is a graph of the same thing -- some are zoomed in on the graph (small is the furthest zoomed in), and they come in both postscript and gif format.

Corpus frequency versus Document frequency

The relationship seems to be pretty linear. cf_v_df.1.gif, cf_v_df.2.gif
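To put a number on "pretty linear", one option is the Pearson correlation between corpus frequency (total occurrences of a term) and document frequency (number of documents containing it). The sketch below assumes tokenized documents; cf_and_df and pearson are hypothetical helper names, and no correlation value was computed for these notes.

```python
from collections import Counter
from math import sqrt


def cf_and_df(docs):
    """Return (cf, df) Counters over the terms of the given documents."""
    cf, df = Counter(), Counter()
    for doc in docs:
        cf.update(doc)        # every occurrence counts toward cf
        df.update(set(doc))   # each document counts once toward df
    return cf, df


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Usage: build two lists over the same term order, for example [cf[t] for t in terms] and [df[t] for t in terms], and pass them to pearson.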

Average and document frequency

AVG as a variable to consider along with DF
AVG considered without DF
Standard deviation for AVG compared with DF