N-gram Music
Generative music inspired by linguistics.
A final project for 6.S083: Computation, Cognitive Science, and Language
taught by Prof. Bob Berwick.
paper, code
In this project I developed a relatively simple method that takes one or more input audio files and outputs new music "in the style" of the inputs. It is based on n-grams, the Markov-style generators used in linguistics. Results are at the end of this page.
A brief introduction to n-grams
An n-gram model is used in linguistics to generate novel sentences from a large corpus of text. An n-gram is any contiguous sequence of n words. For example, the previous sentence contains the 3-grams "An n-gram is", "n-gram is any", "is any contiguous", and so on. If we assume that each word depends only on the preceding n − 1 words, then we can generate the next word $x_i$ in a stream by randomly sampling from the distribution $P(x_i | x_{i - (n-1)}\ldots x_{i-1})$.
For example, suppose our corpus is composed of the sentences:
[START] to be or not to be [STOP]
[START] to die to sleep to sleep perchance to dream [STOP]
Some sentences generated by a 2-gram model could be:
[START] to be [STOP]
[START] to be or to sleep to sleep to sleep perchance to be [STOP]
[START] to die or not to die to be [STOP]
The results are passable but not great: if n is too small the output is nonsensical; if n is too large it simply reproduces the input verbatim. But what would happen if we applied this to music?
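To make the sampling procedure concrete, here is a minimal bigram (n = 2) generator over the toy corpus above. This is a sketch rather than the project's actual code, and the function names are my own:

```python
import random
from collections import defaultdict

def build_bigrams(sentences):
    """Count bigram continuations: counts[w1][w2] = frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["[START]"] + sent.split() + ["[STOP]"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return counts

def generate(counts):
    """Walk from [START] to [STOP], sampling each next word from
    the empirical distribution P(x_i | x_{i-1})."""
    word, out = "[START]", []
    while True:
        nxt = counts[word]
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "[STOP]":
            return " ".join(out)
        out.append(word)

corpus = ["to be or not to be",
          "to die to sleep to sleep perchance to dream"]
print(generate(build_bigrams(corpus)))
```

Because every next word is drawn only from continuations seen in the corpus, the generator can recombine the two sentences but never invent a word pair that was not observed.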
Applying n-grams to music
My goal was to generate music using n-grams with raw audio (not MIDI). To do that I first perform transient detection on the input audio to break a song up into individual "words" via [1]. The detected split points between these words (called onsets) are plotted below for the Aria of Bach's Goldberg Variations:
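As a rough illustration of this step (not the transient detector of [1] that the project actually uses), a naive energy-based onset detector can be sketched in a few lines: mark an onset wherever a frame's energy jumps well above the previous frame's, then cut the signal at those points.

```python
import numpy as np

def detect_onsets(y, sr, frame=1024, hop=512, k=1.5):
    """Naive energy-jump transient detection (a stand-in for [1]).

    Flags an onset wherever frame energy exceeds k times the previous
    frame's energy plus a small floor that ignores silence/noise.
    """
    n = (len(y) - frame) // hop + 1
    energy = np.array([np.sum(y[i*hop:i*hop+frame]**2) for i in range(n)])
    floor = 1e-4 * energy.max()
    return np.array([i * hop for i in range(1, n)
                     if energy[i] > k * energy[i-1] + floor])

def split_into_words(y, sr):
    """Cut the signal at detected onsets into 'word' segments."""
    bounds = np.concatenate([[0], detect_onsets(y, sr), [len(y)]])
    return [y[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```

The frame size, hop, and threshold k are hypothetical defaults; a real transient detector is considerably more robust to vibrato and noise.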
Then I perform harmonic analysis on each word to determine a 12-dimensional vector describing the prevalence of each of the 12 pitch classes in that word. To do this I perform frequency estimation [2] on the spectrum of the word and then wrap those frequencies logarithmically onto the 12 pitch classes. The two chords below contain the same notes C-E-G and have the same harmonic profile when played on a piano, despite having different voicings:
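The wrapping step can be sketched as follows. This version substitutes raw FFT magnitudes for the frequency estimation of [2], and the range mask is an assumption of mine; the idea is that a frequency f lands in pitch class round(12 · log2(f / 440)) mod 12:

```python
import numpy as np

def harmonic_profile(word, sr):
    """12-dim vector of pitch-class strengths for one 'word'.

    A sketch: uses plain FFT magnitudes instead of the frequency
    estimation cited as [2].
    """
    spectrum = np.abs(np.fft.rfft(word))
    freqs = np.fft.rfftfreq(len(word), d=1.0 / sr)
    audible = (freqs > 27.5) & (freqs < 4200)  # rough piano range
    # Wrap each frequency logarithmically onto the 12 pitch classes,
    # measured in semitones relative to A440.
    pcs = np.round(12 * np.log2(freqs[audible] / 440.0)).astype(int) % 12
    profile = np.zeros(12)
    np.add.at(profile, pcs, spectrum[audible])
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile
```

Because octaves differ by exactly 12 semitones, any voicing of C-E-G concentrates its energy in the same three bins of this vector.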
These words are then clustered by cosine distance so that words with similar harmonic content are treated as identical. Now we can simply apply the n-gram streaming algorithm described above to generate music! Some care needs to be taken when stitching words together to prevent clipping, and I run an independent n-gram model over volume to avoid sharp dynamic changes.
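One simple way to do this (the exact clustering scheme here, including the greedy assignment and the threshold, is my own assumption, not necessarily the paper's) is to assign each word to the nearest existing cluster by cosine distance, creating a new cluster when nothing is close enough:

```python
import numpy as np

def cluster_words(profiles, threshold=0.15):
    """Greedily cluster harmonic profiles by cosine distance.

    A sketch with a hypothetical threshold: each profile joins the
    nearest cluster if its cosine distance is below the threshold,
    otherwise it seeds a new cluster. Returns one label per word.
    """
    centroids, labels = [], []
    for p in profiles:
        dists = [1 - np.dot(p, c) / (np.linalg.norm(p) * np.linalg.norm(c))
                 for c in centroids]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            labels.append(len(centroids))
            centroids.append(p)
    return labels
```

The resulting label sequence then plays the role of the word stream in the text example above, so generating new music reduces to the same n-gram sampling, just over cluster labels instead of words.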
The code can be found on GitHub and a more in-depth description can be found in the paper.
Results
For songs containing a single instrument, the resulting music is actually pretty good. If you're not listening closely, this process could generate an infinite stream of impressionist music:
But other times the result is just abysmal (this one is particularly funny):
And just for fun here is a variation of every Goldberg variation: