Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics

Vivek Jayaram*   John Thickstun*

University of Washington

38th International Conference on Machine Learning (ICML 2021)

(* Equal contribution)

Demo: Audio Super-Resolution

Demo: Audio Restoration

Demo: Source Separation


[Paper]   [Code]   [Download Demos]

Abstract

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling, using an autoregressive model as a Bayesian prior. This allows us to steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.

Audio Demos + Comparisons

We present audio demos that show the quality of samples generated with PnF. We also compare with task specific baselines when applicable




Spectrogram Conditioned Generation

We show that PnF sampling is indistinguishable from autoregressive sampling and can faithfully reproduce the ground truth.



Voice

GT Autoregressive Wavenet PnF Wavenet - 2048 iterations



Piano

GT Autoregressive Wavenet PnF Wavenet - 2048 iterations



Varying Number of Iterations

The quality of samples from our method is dependent on the number of Langevin iterations at each noise level. We show this effect below.

PnF Wavenet - 64 iterations PnF Wavenet - 128 iterations PnF Wavenet - 512 iterations PnF Wavenet - 2048 iterations






Audio Super-Resolution: Piano


Ground Truth (22kHz)

4x Super-Resolution

Input (4x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]



8x Super-Resolution

Input (8x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]



16x Super-Resolution

Input (16x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]



32x Super-Resolution

Input (32x Downsampled) PnF (WaveNet) Upsampled

Sample spectrograms showing the effect of 12x super-resolution. This example is from the top of the page



[1] Kuleshov, V., Enam, S. Z., and Ermon, S. Audio super-resolution using neural nets. ICLR (Workshop Track), 2017.



Audio Super-Resolution: Voice


Ground Truth (22kHz)


4x Super-Resolution

Input (4x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]



8x Super-Resolution

Input (8x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]



16x Super-Resolution

Input (16x Downsampled) PnF (WaveNet) Upsampled Spline KEE Network [1]





Source Separation


Input Mixture


GT PnF (WaveNet) Conv-TasNet [2] Demucs [3]


[2] Luo, Y. and Mesgarani, N. Conv-tasnet: Surpassing idealtime–frequency magnitude masking for speech separa-tion.IEEE/ACM Transactions on Audio, Speech, andLanguage Processing, 2019.
[3] Defossez, A., Usunier, N., Bottou, L., and Bach, F. Musicsource separation in the waveform domain



Audio Inpainting

Piano

Ground Truth Input (Missing 200ms) PnF (WaveNet) Inpainted Input (Missing 400ms) PnF (WaveNet) Inpainted



Voice

Ground Truth Input (Missing 100ms) PnF (WaveNet) Inpainted Input (Missing 200ms) PnF (WaveNet) Inpainted


Sample spectrograms showing the effect of 400ms piano inpainting.







Sample PixelCNN++ Results




We show results using PnF to sample from a PixelCNN++ model on the CIFAR10 dataset. We show results for three different tasks: source separation (left), 2x super-resolution (top right), and inpainting (bottom right)