Abstract

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling, using an autoregressive model as a Bayesian prior. This allows us to steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.
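The procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy Gaussian AR(1) model (coefficient `a` and noise scale `s` are hypothetical) in place of a neural autoregressive model, and it uses plain unadjusted Langevin dynamics with a fixed step size `eta`. What it shows is that the gradient of the global log-likelihood touches every position at once, so each update moves the whole sequence in parallel rather than one position at a time.

```python
import math
import random


def log_p(x, a=0.9, s=1.0):
    """Toy AR(1) log-likelihood, up to an additive constant.

    x[t] ~ N(a * x[t-1], s^2), with the value before x[0] taken to be 0.
    """
    total = 0.0
    for t in range(len(x)):
        prev = x[t - 1] if t > 0 else 0.0
        total -= (x[t] - a * prev) ** 2 / (2 * s**2)
    return total


def grad_log_p(x, a=0.9, s=1.0):
    """Gradient of log_p with respect to every position of the sequence."""
    n = len(x)
    g = [0.0] * n
    for t in range(n):
        prev = x[t - 1] if t > 0 else 0.0
        # Contribution of the factor in which x[t] is the predicted value.
        g[t] -= (x[t] - a * prev) / s**2
        # Contribution of the factor in which x[t] conditions x[t+1].
        if t + 1 < n:
            g[t] += a * (x[t + 1] - a * x[t]) / s**2
    return g


def langevin_sample(n, steps, eta, rng):
    """Unadjusted Langevin dynamics on the whole sequence at once."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # white-noise initialization
    noise_scale = math.sqrt(2 * eta)
    for _ in range(steps):
        g = grad_log_p(x)
        # Every position is updated from the same gradient evaluation.
        x = [xi + eta * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x


sample = langevin_sample(n=16, steps=500, eta=0.01, rng=random.Random(0))
print(len(sample))
```

In this toy setting the gradient is computed with a Python loop, but each position's entry depends only on its neighbors, which is what makes the update parallelizable in principle.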

Audio Demos + Comparisons

We present audio demos that show the quality of samples generated with PnF (parallel and flexible) sampling. We also compare with task-specific baselines where applicable.

Spectrogram Conditioned Generation

We show that PnF samples are indistinguishable from autoregressive samples and faithfully reproduce the ground truth (GT).

Voice

GT	Autoregressive WaveNet	PnF WaveNet - 2048 iterations

Piano

GT	Autoregressive WaveNet	PnF WaveNet - 2048 iterations

Varying Number of Iterations

The quality of samples from our method depends on the number of Langevin iterations run at each noise level. The samples below illustrate this effect.
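To make the role of the iteration count concrete, the following is a minimal sketch of an annealed Langevin loop. It is a sketch under stated assumptions, not the authors' code: the geometric noise schedule, the step size `step_scale * sigma**2`, and the toy target (a standard normal, whose noise-smoothed score is known in closed form) are all illustrative choices. `iters_per_level` plays the role of the iteration counts compared in the demos.

```python
import math
import random


def smoothed_score(x, sigma):
    """Score of the toy target N(0, 1) after adding N(0, sigma^2) noise.

    The smoothed density is N(0, 1 + sigma^2), so its score is -x / (1 + sigma^2).
    """
    return [-xi / (1.0 + sigma**2) for xi in x]


def geometric_schedule(sigma_max, sigma_min, levels):
    """Noise levels decreasing geometrically from sigma_max to sigma_min (assumed schedule)."""
    if levels == 1:
        return [sigma_max]
    ratio = (sigma_min / sigma_max) ** (1.0 / (levels - 1))
    return [sigma_max * ratio**i for i in range(levels)]


def annealed_langevin(n, sigmas, iters_per_level, step_scale, rng):
    """Run iters_per_level Langevin updates at each noise level, largest noise first."""
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(n)]
    for sigma in sigmas:
        eta = step_scale * sigma**2  # assumed: step size shrinks with the noise level
        noise_scale = math.sqrt(2 * eta)
        for _ in range(iters_per_level):
            g = smoothed_score(x, sigma)
            x = [xi + eta * gi + noise_scale * rng.gauss(0.0, 1.0)
                 for xi, gi in zip(x, g)]
    return x


def gradient_evaluations(sigmas, iters_per_level):
    """Total number of model gradient evaluations for one sample."""
    return len(sigmas) * iters_per_level


sigmas = geometric_schedule(sigma_max=10.0, sigma_min=0.1, levels=5)
for iters in (64, 128, 512, 2048):
    x = annealed_langevin(4, sigmas, iters, step_scale=0.1, rng=random.Random(0))
    print(iters, gradient_evaluations(sigmas, iters))
```

Under these assumptions the step size at the smallest noise level is tiny, so the chain needs many iterations there to mix; that is one plausible reading of why fewer iterations per level give lower-quality samples.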

PnF WaveNet - 64 iterations	PnF WaveNet - 128 iterations	PnF WaveNet - 512 iterations	PnF WaveNet - 2048 iterations

Audio Super-Resolution: Piano

Ground Truth (22 kHz)

4x Super-Resolution

Input (4x Downsampled)	PnF (WaveNet) Upsampled	Spline	KEE Network [1]

8x Super-Resolution

Input (8x Downsampled)	PnF (WaveNet) Upsampled	Spline	KEE Network [1]

16x Super-Resolution

Input (16x Downsampled)	PnF (WaveNet) Upsampled	Spline	KEE Network [1]

32x Super-Resolution

Input (32x Downsampled)	PnF (WaveNet) Upsampled
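For conditional tasks such as super-resolution, the Langevin update follows the gradient of a posterior rather than of the prior alone. The sketch below assumes a simple observation model that is not taken from the source: the low-resolution input `y` keeps every `r`-th sample of `x`, observed under Gaussian noise of scale `sigma_obs`. `prior_grad` is a hypothetical stand-in for the gradient of the autoregressive model's log-likelihood; here it is replaced by the score of a standard normal.

```python
def downsample(x, r):
    """Assumed observation operator: keep every r-th sample."""
    return x[::r]


def likelihood_grad(x, y, r, sigma_obs):
    """Gradient in x of log N(y; downsample(x, r), sigma_obs^2).

    Only the positions that survive downsampling receive a nonzero gradient.
    """
    g = [0.0] * len(x)
    for k, yk in enumerate(y):
        g[k * r] = (yk - x[k * r]) / sigma_obs**2
    return g


def posterior_grad(x, y, r, sigma_obs, prior_grad):
    """Bayes' rule in gradient form: grad log p(x|y) = grad log p(x) + grad log p(y|x)."""
    gp = prior_grad(x)
    gl = likelihood_grad(x, y, r, sigma_obs)
    return [a + b for a, b in zip(gp, gl)]


def toy_prior_grad(x):
    """Placeholder prior: score of a standard normal (hypothetical stand-in)."""
    return [-xi for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 5.0]  # a 4x-downsampled observation
g = posterior_grad(x, y, r=4, sigma_obs=1.0, prior_grad=toy_prior_grad)
print(g)
```

The likelihood term pulls the observed positions toward `y`, while the prior term is left to fill in the positions the observation does not constrain.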

[1] Kuleshov, V., Enam, S. Z., and Ermon, S. Audio super-resolution using neural nets. ICLR (Workshop Track), 2017.

Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics

Vivek Jayaram* John Thickstun*

University of Washington

38th International Conference on Machine Learning (ICML 2021)

(* Equal contribution)

GT	PnF (WaveNet)	Conv-TasNet [2]	Demucs [3]