Correlations

Here you can find the correlation coefficients between the various values we have been looking at. The following definitions are useful:
cf
The corpus frequency. The total number of times the term appears in the corpus (sum_i{i * Exactly i}, where Exactly i is the number of documents in which the term occurs exactly i times).
df
The document frequency. The number of distinct documents in which the term occurs (= At least 1).
avg
The average number of times the term appears in a document that contains it (cf/df).
std
Represents how heavy the term's tail is (sum_i{i * i * Exactly i}, where Exactly i is the number of documents in which the term occurs exactly i times).
Exactly X
The number of documents in which a term occurs exactly X times.
At least X
The number of documents in which a term occurs at least X times.
P(exactly X)
The probability that a term will occur exactly X times in a document, given that it has occurred (= Exactly X/df).
P(at least X)
The probability that a term will occur at least X times in a document, given that it has occurred (= At least X/df).
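The definitions above can be made concrete with a short Python sketch. It assumes the input is a list of per-document occurrence counts for a single term; the function name term_statistics and that input layout are illustrative choices, not something taken from the original data files.

```python
from collections import Counter


def term_statistics(counts_per_document, x):
    """Compute the statistics defined above for one term.

    counts_per_document: hypothetical input, one occurrence count per
    document (zeros allowed for documents that lack the term).
    x: the threshold X used by Exactly X and At least X.
    """
    # histogram[i] = number of documents in which the term occurs exactly i times
    histogram = Counter(c for c in counts_per_document if c > 0)

    df = sum(histogram.values())
    if df == 0:
        raise ValueError("the term does not occur in any document")

    cf = sum(i * n for i, n in histogram.items())
    std = sum(i * i * n for i, n in histogram.items())  # the tail measure called std above
    exactly = histogram.get(x, 0)
    at_least = sum(n for i, n in histogram.items() if i >= x)

    return {
        "cf": cf,
        "df": df,
        "avg": cf / df,
        "std": std,
        "exactly": exactly,
        "at_least": at_least,
        "p_exactly": exactly / df,
        "p_at_least": at_least / df,
    }


# A term that occurs 3, 1, 1 and 2 times in four of six documents.
stats = term_statistics([3, 0, 1, 1, 2, 0], x=1)
```

With x=1, at_least equals df, so p_at_least is 1 by construction.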
The values were arrived at by:
- Creating a file with one line per term, listing each of the statistics of interest for that term
- Loading this file into MATLAB
- Calling the function corrcoef on the resulting matrix
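For readers without MATLAB, the steps above can be approximated in plain Python. This is a minimal sketch, not the original procedure: it assumes each line holds the term followed by whitespace-separated numeric statistics (the actual file layout is not specified here), and it computes Pearson correlations between columns by hand as a stand-in for corrcoef. The names parse_statistics and correlation_matrix, and the sample values, are made up for illustration.

```python
import math


def parse_statistics(text):
    """Parse one line per term: the term, then its statistics (assumed layout)."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append([float(v) for v in fields[1:]])  # skip the term itself
    return rows


def correlation_matrix(rows):
    """Return the Pearson correlation between every pair of columns."""
    n = len(rows)
    columns = list(zip(*rows))  # one tuple per statistic
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]

    k = len(columns)
    matrix = []
    for i in range(k):
        row = []
        for j in range(k):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            denom = norms[i] * norms[j]
            # The correlation is undefined for a constant column.
            row.append(dot / denom if denom else float("nan"))
        matrix.append(row)
    return matrix


# Made-up sample: term, cf, df, avg
sample = """
apple 7 4 1.75
pear 12 6 2.0
plum 3 3 1.0
"""
matrix = correlation_matrix(parse_statistics(sample))
```

Entry matrix[i][j] is the correlation between statistic i and statistic j across all terms, so the matrix is symmetric.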