Correlations
You can find the correlation coefficients
between the various values we have been looking at
here. The following definitions are
useful:
- cf
- The corpus frequency. Represents the total number of times the
term appears (sum_i{i*# times occurs exactly i times}).
- df
- The document frequency. The number of distinct documents that a
term occurs in (= At least 1).
- avg
- The average number of times a term appears (cf/df).
- std
- Represents how heavy the term's tail is (sum_i{i*i*# times occurs exactly
i times}).
- Exactly X
- The number of documents in which a term occurs exactly X times.
- At least X
- The number of documents in which a term occurs at least X
times.
- P(exactly X)
- The probability that the term will occur exactly X times given
that it has occurred (= Exactly X/df).
- P(at least X)
- The probability that a term will occur at least X times given
that it has occurred (= At least X/df).
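As a rough illustration of the definitions above, the following Python sketch computes each statistic for a single term. The function name term_stats, the input format (a list holding the term's count in each document) and the max_x parameter are assumptions made for this example, not part of the original analysis.

```python
from collections import Counter


def term_stats(counts_per_doc, max_x=3):
    """Compute the statistics defined above for one term.

    counts_per_doc is a hypothetical input format: the number of times
    the term occurs in each document (zeros are ignored).
    """
    # hist[i] = number of documents in which the term occurs exactly i times
    hist = Counter(c for c in counts_per_doc if c > 0)
    df = sum(hist.values())
    if df == 0:
        raise ValueError("the term must occur in at least one document")

    cf = sum(i * n for i, n in hist.items())
    stats = {
        "cf": cf,
        "df": df,
        "avg": cf / df,
        # "std" follows the definition above: a sum of squared counts,
        # not a textbook standard deviation
        "std": sum(i * i * n for i, n in hist.items()),
    }
    for x in range(1, max_x + 1):
        exactly = hist.get(x, 0)
        at_least = sum(n for i, n in hist.items() if i >= x)
        stats[f"exactly_{x}"] = exactly
        stats[f"at_least_{x}"] = at_least
        stats[f"p_exactly_{x}"] = exactly / df
        stats[f"p_at_least_{x}"] = at_least / df
    return stats
```

For example, term_stats([1, 1, 2, 5]) describes a term that occurs in four documents, so df is 4 and At least 1 equals df.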
The values were arrived at by:
- Creating a file with one line per term, listing each of the statistics of interest
for that term on that line
- Loading this file into MATLAB
- Calling the function corrcoef on this matrix
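The steps above can be approximated outside MATLAB. The sketch below is a plain-Python stand-in, not the original script: it assumes a whitespace-separated file layout (one term per line, one numeric statistic per column), and the helper names parse_rows, pearson and corr_matrix are hypothetical. Like corrcoef, it treats each column as a variable and each row as an observation.

```python
import math


def parse_rows(lines):
    """Parse whitespace-separated numeric lines into rows of floats.

    The file layout is an assumption for this sketch.
    """
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant column has zero variance and makes this division fail
    return cov / math.sqrt(var_x * var_y)


def corr_matrix(rows):
    """Correlation matrix with one row and column per statistic."""
    cols = list(zip(*rows))  # transpose: one tuple per statistic
    return [[pearson(a, b) for b in cols] for a in cols]
```

Calling corr_matrix(parse_rows(open(path))) on such a file, where path is whatever file was produced in the first step, gives a symmetric matrix with ones on the diagonal.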