Identifying Hierarchical Structure in Sequences: A Linear-Time Algorithm
Nevill-Manning and Witten
Jonathan Ledlie (jonathan@eecs)
October 16, 2001

The authors use two fairly simple rules to turn an input text into a context-free grammar (at least, I think it is a CFG). Their two rules are:

1) no pair of adjacent symbols appears more than once in the grammar
2) every rule is used more than once

The first rule ensures that the resulting grammar is unambiguous: the output of the compression phase is sufficient to decompress into the original text. The second makes sure that each rule usefully compresses the original text: if a rule were not used more than once, it would waste space in the compressed output.

They show fairly convincingly that their mechanism captures grammatical structure in human-language inputs and in music; I imagine it would perform even better on more structured inputs such as computer programs. It captures structure because repeated items become rules in the grammar.

Although they do have a section on implementation, I question how well this would work on long inputs. While they argue that their use of an open-addressing hash table for symbol storage would be efficient, I think the doubly-linked list holding the grammar itself might become slow on large inputs. Another issue they discuss is that the memory requirements of the algorithm are linear in the size of the input. If the input were broken into pieces to cap memory use, however, structure spanning the pieces could not be found.
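
To make the two rules concrete for myself, here is a small Python sketch written from the paper's description rather than from the authors' code. It works offline: it repeatedly pulls the most frequent repeated digram out into a new rule, then inlines any rule that ends up used only once. The real algorithm does this incrementally in linear time with the doubly-linked list and digram hash table mentioned above, so this is only an illustration of the two constraints, and all names in it are my own.

# Minimal, non-incremental sketch of the two constraints (not the authors'
# linear-time on-line algorithm). Terminals are characters; nonterminals
# are integer ids.

from collections import Counter


def most_common_digram(seq):
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, count = counts.most_common(1)[0]
    return pair if count > 1 else None


def replace_digram(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair with the new symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_grammar(text):
    """Return (top_level_sequence, rules) for the input string."""
    seq = list(text)
    rules = {}          # nonterminal id -> list of symbols
    next_id = 0

    # Rule 1: keep introducing rules until no adjacent pair appears twice.
    while (pair := most_common_digram(seq)) is not None:
        rules[next_id] = list(pair)
        seq = replace_digram(seq, pair, next_id)
        next_id += 1

    # Rule 2: inline any rule that ends up used only once.
    changed = True
    while changed:
        changed = False
        uses = Counter(s for s in seq if s in rules)
        for body in rules.values():
            uses.update(s for s in body if s in rules)
        for nt, n in uses.items():
            if n == 1:
                body = rules.pop(nt)

                def expand(symbols):
                    out = []
                    for s in symbols:
                        out.extend(body if s == nt else [s])
                    return out

                seq = expand(seq)
                rules = {k: expand(v) for k, v in rules.items()}
                changed = True
                break
    return seq, rules


if __name__ == "__main__":
    seq, rules = build_grammar("abcdbcabcdbc")
    print(seq)    # top-level sequence with nonterminals as integers
    print(rules)  # e.g. {0: ['b', 'c'], ...} after inlining single-use rules

On a string like "abcdbcabcdbc" this ends up with a rule for the repeated "bc" nested inside a rule for the repeated "a?d?" phrase, which is the kind of hierarchy the paper is after; the point of their on-line version is to get the same effect in a single left-to-right pass.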