Audio Transport
A generalized portamento between any two audio streams.
This was a final project for 6.838: Shape Analysis, taught by Justin Solomon in 2017, and it eventually turned into a paper presented at DAFx19. It won "Best Student Paper" at the conference and shortly after was featured on the front page of MIT News.
paper, code
Here is a video demonstrating the effect:
Here are some more results:
Portamento
As shown in the examples above, a portamento is a continuous glide in pitch from one note to another. A classic example is the clarinet portamento at the beginning of Rhapsody in Blue. The effect can be used to create effortless transitions or graceful slurs, but it can typically only be achieved on monophonic instruments: instruments that produce a single note at a time, like the human voice, but not the piano.
There are some exceptions, of course. Lap steel guitars, carefully played polyphonic synthesizers, and offline pitch-manipulation software can achieve the effect in certain situations. But a general solution that can produce a portamento between any two audio sources has been out of the question...
A Brief Introduction to Optimal Transport
The secret to the audio transport effect lies in the optimal transport problem. This problem asks how to move mass from one place to another in a way that requires a minimum amount of work. Historically, this problem was developed to streamline resource allocation: if you have a number of factories in various locations and settlements in other locations, where should each factory send its resources so that the total cost spent on transportation is minimized? The problem also has connections to a number of other branches of mathematics that are explored by Cédric Villani in what is perhaps my favorite lecture of all time.
Alternatively, the solution to the optimal transport problem can be viewed as the laziest way to build a sand castle. For example, if we have the pile of sand on the left and want to transform it into the sand castle on the right, where should each grain of sand go so that the least amount of work is spent moving sand?
More formally, an optimal plan $\pi^*$ is one that minimizes the total cost of the plan: the mass moved from position $x$ to position $y$, $\pi(x, y)$, weighted by the distance between $x$ and $y$ raised to the $p$th power, $\|x - y\|^p$.
$$ \begin{align*} \pi^* = \text{arg min}_{\pi} \int \|x - y\|^p \, d\pi(x, y) \end{align*} $$
(In many applications, as in this one, $p = 2$. Among other nice properties, this makes the $p$th root of the optimal transport cost a metric on distributions: the 2-Wasserstein distance.)
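To make the objective concrete, here is a tiny discrete version of it (a sketch: the positions, masses, and plan are made up for illustration, and in the discrete setting the integral becomes a sum):

```python
import numpy as np

# Discrete version of the transport objective with p = 2.
# pi[i, j] is the mass moved from position x[i] to position y[j].
x = np.array([0.0, 1.0])       # source positions
y = np.array([2.0, 3.0])       # target positions
pi = np.array([[0.5, 0.0],     # a candidate (not necessarily optimal) plan
               [0.0, 0.5]])

cost = sum(pi[i, j] * abs(x[i] - y[j]) ** 2
           for i in range(len(x))
           for j in range(len(y)))
print(cost)  # 0.5 * |0-2|^2 + 0.5 * |1-3|^2 = 4.0
```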
Once a plan is found, moving sand along the minimizing paths produces a smooth "displacement interpolation" like the one shown below:
We can contrast this with a linear interpolation, shown in this second animation:
Now imagine that the height of sand over position $x$ represents the volume of sound at pitch $x$ --- i.e. the sand is the sound's frequency spectrum. The linear interpolation is "fading" between two sounds, while the displacement interpolation is "gliding" between them. A portamento!
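Here is a toy sketch of that distinction, using a pure tone at 220 Hz and one at 440 Hz (the bin spacing is made up for illustration):

```python
import numpy as np

freqs = np.arange(0.0, 800.0, 10.0)   # toy frequency bins
a = (freqs == 220).astype(float)      # source spectrum: one peak
b = (freqs == 440).astype(float)      # target spectrum: one peak

t = 0.5  # interpolation parameter, 0 = source, 1 = target

# Linear interpolation: both peaks present at reduced volume (a fade).
linear = (1 - t) * a + t * b

# Displacement interpolation: one peak whose position moves (a glide).
# With all the mass in a single bin, the optimal plan just slides it.
displaced = (freqs == (1 - t) * 220 + t * 440).astype(float)

print(freqs[linear > 0])     # [220. 440.]
print(freqs[displaced > 0])  # [330.]
```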
I'm leaving out the details of solving the transport problem, but it's not too hard in one dimension: a greedy assignment is optimal, as sketched below. Solving the problem in higher dimensions for things like images or meshes becomes much harder, and doing it efficiently is still an open area of research. There are some cool examples of what those solutions look like in this paper.
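Here is a minimal sketch of that greedy sweep, assuming both spectra are nonnegative histograms with equal total mass (the function name and interface are mine):

```python
import numpy as np

def transport_1d(source, target):
    """Greedily match mass between two 1-D histograms.

    Returns (i, j, mass) triples meaning "move `mass` from bin i of
    `source` to bin j of `target`". In one dimension this greedy
    sweep yields an optimal plan for any convex cost.
    """
    plan = []
    i, j = 0, 0
    src = source.astype(float).copy()
    tgt = target.astype(float).copy()
    while i < len(src) and j < len(tgt):
        mass = min(src[i], tgt[j])
        if mass > 0:
            plan.append((i, j, mass))
        src[i] -= mass
        tgt[j] -= mass
        if src[i] <= 0:
            i += 1
        if tgt[j] <= 0:
            j += 1
    return plan

# All mass in bin 0 slides to bin 2:
print(transport_1d(np.array([1.0, 0.0, 0.0]),
                   np.array([0.0, 0.0, 1.0])))  # [(0, 2, 1.0)]
```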
Audio Transport

The essence of the audio transport effect is simply to use an optimal transport plan to perform displacement interpolation between the spectra of two incoming audio signals. However, there are some caveats, and they have to do with how audio signals are converted to spectra and back.
A stream of audio can be turned into a stream of frequency spectra via a phase vocoder. This breaks the audio up into overlapping chunks, or windows, each of which is analyzed using a fast Fourier transform to get its frequency components. Manipulating the frequencies in these windows and then turning them back into audio is the basis of many modern effects. For example, "time stretching", or the ability to change a song's tempo without changing its pitch, and autotune both rely on this technology.
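For a sense of what the analysis side looks like, here is a bare-bones sketch of the windowing step (the window and hop sizes are arbitrary choices, not the paper's settings):

```python
import numpy as np

def analyze(audio, window_size=2048, hop=512):
    """Slice audio into overlapping windows and take each one's FFT."""
    window = np.hanning(window_size)
    spectra = []
    for start in range(0, len(audio) - window_size + 1, hop):
        chunk = audio[start:start + window_size] * window
        spectra.append(np.fft.rfft(chunk))  # complex bins: magnitude + phase
    return np.array(spectra)

# Resynthesis applies the inverse FFT to each (possibly modified)
# spectrum and overlap-adds the windows back together.
```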
Without any additional modifications, if you apply displacement interpolation to the spectra and resynthesize, you get this:
Yikes! That's pretty bad --- it should sound like a 220 Hz sine wave sweeping up to a 440 Hz sine wave. From the spectrogram you can kind of see an upward trend, but several things are going very wrong.
The first problem is known as "vertical incoherence": distortion occurring within each resynthesized window. Whenever you analyze frequency content there will always be some amount of blurring due to the uncertainty principle and quantization. This makes even pure tones, like the example sine wave, spread over multiple frequency "bins". During transport, these spread regions can be split apart, which causes self-interference. For illustration:
In the paper, we introduced a new technique that uses frequency reassignment to group together bins that have been smeared. Performing the transport over these groups rather than over individual bins resolves the vertical incoherence:
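I won't reproduce the reassignment math here, but a much simpler stand-in conveys the flavor of the grouping step: split the magnitude spectrum at its local minima so that each group holds one smeared peak, then transport groups instead of bins. (This splitting heuristic is my simplification, not the paper's method.)

```python
import numpy as np

def group_bins(magnitudes):
    """Split bin indices into groups, one smeared peak per group."""
    groups, current = [], [0]
    for k in range(1, len(magnitudes) - 1):
        # Start a new group at each local minimum of the spectrum.
        if magnitudes[k] < magnitudes[k - 1] and magnitudes[k] <= magnitudes[k + 1]:
            groups.append(current)
            current = []
        current.append(k)
    if len(magnitudes) > 1:
        current.append(len(magnitudes) - 1)
    groups.append(current)
    return groups

# A spectrum with two smeared peaks splits into two groups:
print(group_bins(np.array([0.1, 1.0, 0.2, 0.05, 0.3, 1.2, 0.4])))
# [[0, 1, 2], [3, 4, 5, 6]]
```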
Without vertical incoherence, the sound is much closer to an ideal sine sweep. However, there is still something strange going on --- this time it is "horizontal incoherence", distortion that occurs between adjacent windows. When a sine wave is transposed, its phase rotates at a different rate, so the phases in adjacent windows no longer line up.
The change in phase (in radians) can be computed from the group angular frequencies $\omega$ (in radians per second) and the spacing between windows $\Delta$ (in seconds). More precisely, for group $i$ in window $t$ the phase $\varphi_i^t$ is:
$$ \varphi_i^t = \varphi_i^{t-1} + \frac{\omega_i^t + \omega_i^{t-1}}{2}\cdot\Delta $$
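In code, that update is a straightforward accumulation (a direct transcription of the formula above; the function name and interface are mine):

```python
import numpy as np

def accumulate_phase(omegas, delta, initial_phase=0.0):
    """Integrate a group's phase across windows.

    omegas[t] is the group's angular frequency in window t (rad/s)
    and delta is the spacing between windows in seconds.
    """
    phases = np.empty(len(omegas))
    phases[0] = initial_phase
    for t in range(1, len(omegas)):
        # Trapezoidal step: average the old and new frequencies.
        phases[t] = phases[t - 1] + 0.5 * (omegas[t] + omegas[t - 1]) * delta
    return phases
```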
Accounting for these phase updates makes the transformation sound indistinguishable from an ideal sweep: