Clustering Algorithms
Many clustering algorithms: greedy agglomerative
- start with documents in singleton “clusters”
- repeatedly merge “nearest” two
- similarity measure between clusters
- min-min distance between points (SLINK/MST)
- min-max distance (CLINK)
- average distance between points (our choice)
Problems:
- quadratic time (or worse)
- “giant component” absorbs everything else