Standard Deviation of Probability Distribution: DF and AVG
Here we compare the standard deviation of the probability distribution
for terms that are binned together based on either their document
frequency (DF) or their AVG.
The bin make-up:
Probability distribution graphs that vary the document frequency while
holding the average constant (see graphs in this directory) show some
odd behavior for terms with a very low document frequency, so in this
analysis I ignore all terms with a document frequency of 1 or 2. The
bins grow by 10% each time. I wanted the bins to be of a similar size,
so I played around with the starting bin size and came up with the
following bins:
AVG: 1.0 1.1 1.21 1.33 1.46 1.61 1.77
DF: 3 6 9 12 16 21 26
Note that each bin is labeled by the lowest value in that bin.
You can see in this graph that these bins ended up being reasonably
similar in size.
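One way to read "bins that grow by 10%" is that each bin is 10% wider than the one before it. The sketch below assumes that reading, with a starting width of 0.1 for AVG and 3 for DF, edges rounded to two decimals for AVG and truncated to integers for DF. The function name and these starting widths are illustrative guesses, chosen because they reproduce the bin values listed above; the procedure actually used may have differed.

```python
import math


def growing_bin_edges(start, first_width, n_bins, growth=1.10):
    """Return the lower edge of each bin, where every bin is `growth`
    times as wide as the previous one (hypothetical helper)."""
    edges = []
    edge, width = start, first_width
    for _ in range(n_bins):
        edges.append(edge)
        edge += width      # the next bin starts where this one ends
        width *= growth    # and is 10% wider
    return edges


# AVG: first bin starts at 1.0 with width 0.1 (assumed), edges rounded to 2 decimals.
avg_edges = [round(e, 2) for e in growing_bin_edges(1.0, 0.1, 7)]

# DF: first bin starts at 3 with width 3 (assumed), edges truncated to integers.
df_edges = [math.floor(e) for e in growing_bin_edges(3, 3, 7)]

print("AVG:", avg_edges)
print("DF: ", df_edges)
```

Under these assumptions the same routine yields both rows, differing only in the starting width and in how the edges are rounded.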
What I found:
As we have seen before, the probability distribution is not affected by
document frequency, so I was able to graph the probability distribution for
each of the document frequency bins, along with their standard deviation,
in one graph [gif].
On the other hand, the probability distributions are very different for
different averages, so I made a couple of example graphs for the AVG bins
[1.0, 1.33, 1.61].
To be able to compare all of the standard deviations, and their behavior
as a function of AVG or DF, I graphed only the standard deviation (not as
error bars). [AVG, DF]
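A minimal sketch of the per-bin computation, assuming each term carries a document frequency, an AVG, and a probability distribution stored as a list of equal length. The field names (`df`, `avg`, `dist`), the toy data, and the use of the population standard deviation are all assumptions made for illustration, not details taken from the analysis. Terms with a DF of 1 or 2 are dropped first, and the last bin is treated as open-ended.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, pstdev

# Lower edges of the bins, as listed in the bin make-up.
AVG_EDGES = [1.0, 1.1, 1.21, 1.33, 1.46, 1.61, 1.77]
DF_EDGES = [3, 6, 9, 12, 16, 21, 26]


def bin_label(value, lower_edges):
    """Lowest value of the bin containing `value`, or None if it lies below the first bin."""
    i = bisect_right(lower_edges, value) - 1
    return lower_edges[i] if i >= 0 else None


def spread_per_bin(terms, key, lower_edges):
    """Map bin label -> (pointwise mean, pointwise standard deviation)
    of the distributions of all terms that fall in that bin."""
    groups = defaultdict(list)
    for term in terms:
        label = bin_label(term[key], lower_edges)
        if label is not None:
            groups[label].append(term["dist"])
    result = {}
    for label, dists in sorted(groups.items()):
        columns = list(zip(*dists))  # one column per point of the distribution
        result[label] = ([mean(c) for c in columns],
                         [pstdev(c) for c in columns])
    return result


# Toy data, invented purely to exercise the code.
terms = [
    {"df": 2, "avg": 1.00, "dist": [0.9, 0.1]},
    {"df": 3, "avg": 1.05, "dist": [0.8, 0.2]},
    {"df": 5, "avg": 1.25, "dist": [0.6, 0.4]},
    {"df": 7, "avg": 1.02, "dist": [0.7, 0.3]},
]
kept = [t for t in terms if t["df"] >= 3]  # ignore DF of 1 or 2

by_df = spread_per_bin(kept, "df", DF_EDGES)
by_avg = spread_per_bin(kept, "avg", AVG_EDGES)
```

Plotting the second element of each entry against the bin label would give a standard-deviation-only curve for DF and for AVG.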
Things noticed from these graphs:
- Overall, the error is smaller for AVG than for DF.
- DF: Error greatest at 0, and decreases. AVG: Error greatest at 1.
- Error gets smaller as DF goes up, but larger as AVG goes up.