Standard Deviation of Probability Distribution: DF and AVG

Here we compare the standard deviation of the probability distribution for groups of similar terms, binned either by their document frequency (DF) or by their AVG.

The bin make-up:

Probability distribution graphs that vary the document frequency while holding the average constant (see the graphs in this directory) show some weird behavior for terms with a very low document frequency, so in this analysis I ignore all terms with a document frequency of 1 or 2. I use bins that grow by 10% each time. I wanted the bins to be of a similar size, so I played around with the starting bin size and came up with the following bins:

AVG: 1.0 1.1 1.21 1.33 1.46 1.61 1.77
DF: 3 6 9 12 16 21 26

Note that the number labeling each bin is the lowest value in that bin.
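
A minimal sketch of how bins like these can be generated, assuming it is the bin width (not the lower edge) that grows by 10% per bin. The starting widths of 0.1 for AVG and 3 for DF, the rounding choices, and the function name bin_edges are assumptions made for illustration; they are simply values that reproduce the bins listed above.

```python
import math


def bin_edges(start, first_width, growth=1.10, count=7):
    """Return the lower edges of `count` bins whose width grows by `growth` each time."""
    edges = []
    edge, width = start, first_width
    for _ in range(count):
        edges.append(edge)
        edge += width      # the next bin starts where this one ends
        width *= growth    # each bin is 10% wider than the previous one
    return edges


# Assumed starting widths: 0.1 for AVG, 3 for DF.
avg_edges = [round(e, 2) for e in bin_edges(1.0, 0.1)]   # AVG shown to two decimals
df_edges = [math.floor(e) for e in bin_edges(3, 3)]      # DF is an integer count

print("AVG:", avg_edges)
print("DF: ", df_edges)
```

Under these assumptions the AVG edges are just powers of 1.1, while the DF edges come from rounding the running total down to a whole number.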

You can see in this graph that these bins ended up being reasonably similar in size.
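
To check bin sizes, each term has to be assigned to the bin whose lowest value is the largest one not exceeding the term's DF (or AVG). The sketch below shows one way to do that lookup and count the terms per bin. The function name bin_for, the sample terms, and the choice to treat the last listed bin as open-ended are all hypothetical, not taken from the actual analysis.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of the bins, as listed above.
AVG_EDGES = [1.0, 1.1, 1.21, 1.33, 1.46, 1.61, 1.77]
DF_EDGES = [3, 6, 9, 12, 16, 21, 26]


def bin_for(value, edges):
    """Return the lower edge of the bin containing `value`.

    Returns None if `value` is below the first bin (e.g. DF of 1 or 2,
    which the analysis ignores). Assumption: the last bin is open-ended.
    """
    i = bisect_right(edges, value)
    return edges[i - 1] if i > 0 else None


# Hypothetical document frequencies, for illustration only.
term_df = {"alpha": 1, "beta": 4, "gamma": 5, "delta": 9, "epsilon": 30}

bin_sizes = Counter()
for term, df in term_df.items():
    b = bin_for(df, DF_EDGES)
    if b is not None:          # skip terms with DF below the first bin
        bin_sizes[b] += 1

print(dict(bin_sizes))
```

The same function works for the AVG bins by passing AVG_EDGES instead of DF_EDGES.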

What I found:

As we have seen before, the probability distribution is not affected by document frequency, so I was able to graph the probability distribution for each of the document frequency bins, along with their standard deviation, in one graph [gif].

On the other hand, the probability distributions are very different for different averages, so I made a couple of example graphs for the AVG bins [1.0, 1.33, 1.61].

To be able to compare all of the standard deviations, and their behavior as a function of AVG/DF, I graphed only the standard deviation (not as error bars). [AVG, DF]
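
One plausible reading of the standard deviation here, given that it was first drawn as error bars, is the spread across the terms in a bin at each point of the probability distribution. The sketch below computes that per-point mean and standard deviation under this assumption. The function name, the toy input, and the use of the population (rather than sample) standard deviation are illustrative choices, not the method actually used for the graphs.

```python
from statistics import mean, pstdev


def bin_mean_and_std(distributions):
    """Per-point mean and standard deviation across the terms in one bin.

    `distributions` is a list of equal-length probability lists,
    one list per term (an assumed data layout).
    """
    columns = list(zip(*distributions))   # regroup values point by point
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


# Two made-up terms, each with a two-point probability distribution.
means, stds = bin_mean_and_std([[0.5, 0.5], [0.7, 0.3]])
print(means, stds)
```

Plotting stds alone against the bin's lowest AVG or DF value would give a standard-deviation-only graph of the kind described above.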

Things noticed from these graphs: