Abstract
We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate realistic object-to-object interactions and separate the dynamics and extents of multiple objects despite observing them only under correlated motion. The key components of our method are a randomized conditioning scheme, the encoding of the input motion control, and randomized and sparse sampling to break correlations. Our model, which we call YODA, can move objects without physically touching them. We show both qualitatively and quantitatively that YODA accurately follows the user control while yielding video quality on par with or better than prior state-of-the-art video generation methods on several datasets.
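As an illustration of how a sparse motion control could be fed to such a model, the sketch below encodes a handful of user-specified displacement arrows into a dense conditioning tensor. This is a minimal, hypothetical encoding for illustration only; the function name, the channel layout, and the validity mask are our assumptions, not the encoding actually used by YODA.

```python
import numpy as np

def encode_motion_control(height, width, arrows):
    """Encode sparse user arrows into a dense conditioning tensor.

    Each arrow is (row, col, dy, dx): a displacement (dy, dx) anchored
    at pixel (row, col). Returns an array of shape (height, width, 3):
    channels 0-1 hold the displacement, channel 2 is a validity mask
    marking the (few) locations where the user gave a control.
    """
    control = np.zeros((height, width, 3), dtype=np.float32)
    for row, col, dy, dx in arrows:
        control[row, col, 0] = dy
        control[row, col, 1] = dx
        control[row, col, 2] = 1.0  # this location carries a user control
    return control

# Example: a single arrow pushing the content at pixel (16, 20) to the right
ctrl = encode_motion_control(64, 64, [(16, 20, 0.0, 5.0)])
```

Because the mask channel distinguishes "zero motion requested here" from "no control here", the model can learn to move uncontrolled objects freely while following the controlled ones.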
CLEVRER (samples)
Different samples generated by YODA with different controls.
Notice YODA's ability to predict the long-range consequences
of the inputs and to correctly model the physics of the scene.
YODA was designed so that it is possible to intervene in the
generation process at any timestep, which allows giving the
objects new impulses on the fly.
Demo
To ease access to our model, we have designed an interactive demo that will be released together with the code.
Given a frame from the dataset, the user can manipulate the objects by drawing arrows at chosen locations.
YODA then generates the next frame, taking into account the user input and the previously generated frames.
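The interaction loop described above can be sketched as follows. This is a schematic of the autoregressive, intervene-at-any-step usage, not YODA's actual interface: the `predict_next_frame` method and the `DummyModel` stub are placeholders we invented for illustration.

```python
import numpy as np

class DummyModel:
    """Stand-in for a trained YODA-like generator (illustrative only)."""

    def predict_next_frame(self, frames, control):
        # Placeholder: a real model would run a neural network over the
        # previously generated frames and the motion-control tensor.
        return frames[-1] + control[..., :2].mean()

def generate(model, first_frame, controls):
    """Autoregressively generate one frame per control input.

    `controls` is a sequence of conditioning tensors, one per step;
    an all-zero tensor means "no user intervention at this step",
    so new impulses can be injected at any timestep.
    """
    frames = [first_frame]
    for control in controls:
        frames.append(model.predict_next_frame(frames, control))
    return frames

# Example: three generation steps with no user intervention
first_frame = np.zeros((64, 64))
controls = [np.zeros((64, 64, 3)) for _ in range(3)]
video = generate(DummyModel(), first_frame, controls)
```

Feeding all previously generated frames back into the model at each step is what lets the demo react to a new arrow drawn mid-generation.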