Synthesizing Obama: Learning Lip Sync from Audio

SIGGRAPH 2017

Supasorn Suwajanakorn, Steven M. Seitz, Ira Kemelmacher-Shlizerman

Given audio of President Barack Obama, we synthesize a high-quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high-quality mouth texture and composite it with proper 3D pose matching, so that what he appears to be saying in the target video matches the input audio track. Our approach produces photorealistic results.
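As a rough illustration of the audio-to-mouth-shape step described above, the following is a minimal sketch of a recurrent mapping from per-frame audio feature vectors to per-frame mouth shape vectors. It is not the network from the paper: the class name `TinyRecurrentMapper`, the plain tanh recurrence, the feature and output sizes (13, 8, 4), and the random untrained weights are all assumptions chosen only to keep the example small and runnable.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Build a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyRecurrentMapper:
    """Hypothetical simple RNN: audio feature frames -> mouth shape vectors."""

    def __init__(self, n_audio, n_hidden, n_mouth, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_in = make_matrix(n_hidden, n_audio, rng)    # input -> hidden
        self.w_rec = make_matrix(n_hidden, n_hidden, rng)  # hidden -> hidden
        self.w_out = make_matrix(n_mouth, n_hidden, rng)   # hidden -> output

    def forward(self, frames):
        """Return one mouth shape vector per input audio frame."""
        hidden = [0.0] * self.n_hidden
        outputs = []
        for frame in frames:
            # The hidden state carries context from earlier frames forward.
            pre = [a + b for a, b in zip(matvec(self.w_in, frame),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]
            outputs.append(matvec(self.w_out, hidden))
        return outputs


# Stand-in audio features: 5 frames of 13 values each (sizes are assumptions).
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(5)]

model = TinyRecurrentMapper(n_audio=13, n_hidden=8, n_mouth=4)
shapes = model.forward(frames)
print(len(shapes), len(shapes[0]))  # prints "5 4"
```

Because the hidden state is carried from frame to frame, the same audio frame can yield a different mouth shape depending on what preceded it, which is the property that makes a recurrent model a natural fit for speech. A real system would train the weights on aligned audio and video rather than leaving them random.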

Supplementary Video

Publication

SIGGRAPH 2017 Paper

Training Videos

A list of YouTube videos used to train our recurrent neural network: obama_addresses.txt

Video A - Teaser

Input Audio: nIxM8rL5GVE (0:10 - 1:16)
Target Video: 3vPdtajOJfw

Video B - Comparison to Face2Face [Thies et al. 2016]

Input Audio: nIxM8rL5GVE (0:10 - 0:25)
Target Video: k4OZOTaf3lk

Video C - Method Pipeline

Input Audio: deF-f0OqvQ4 (1:37 - 2:14)
Target Video: 25GOnaY8ZCY

Video D - Target Video Retiming

Input Audio: nIxM8rL5GVE (2:02 - 2:23)
Target Video: 25GOnaY8ZCY

Video E - Weekly Address Speech (4-Obama)

Input Audio 1: nIxM8rL5GVE (3:53 - 4:20)
Input Audio 2: WtOhZ--YeFY (0:58 - 1:31)
Target Videos:
Top-Left: k4OZOTaf3lk
Top-Right: E3gfMumXCjI
Bottom-Left: 3vPdtajOJfw
Bottom-Right: 25GOnaY8ZCY

Video F - Non-address Speech

Input Audio 1: Steve Harvey's (0:45 - 1:14)
Target Video 1: k4OZOTaf3lk
Input Audio 2: 60 Minutes Interview (1:12 - 1:31)
Target Video 2: 3vPdtajOJfw
Input Audio 3: The View (15:48 - 16:08)
Target Video 3: k4OZOTaf3lk
Input Audio 4: Obama in 1990 (0:00 - 0:21)
Target Video 4: 3vPdtajOJfw
Input Audio 5: Impressionist, Ryan Goldsher (1:18 - 1:36)
Target Video 5: EAZIHIiuhrc

Video G - Speech Summarization

Input Audio: deF-f0OqvQ4
Target Video: 25GOnaY8ZCY
