Learn the Force We Can:
Enabling Sparse Motion Control in Multi-Object Video Generation
Aram Davtyan, Paolo Favaro
Computer Vision Group, Institute of Informatics, University of Bern, Bern, Switzerland
at AAAI 2024

We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate realistic object-to-object interactions and separate the dynamics and the extents of multiple objects despite only observing them under correlated motion activities. Key components of our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling to break correlations. Our model, which we call YODA, has the ability to move objects without physically touching them. We show both qualitatively and quantitatively that YODA accurately follows the user control, while yielding a video quality that is on par with or better than prior state-of-the-art video generation methods on several datasets.
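The randomized and sparse sampling of the motion input can be illustrated with a small sketch. This is only one plausible reading of the idea, not the released implementation: it assumes the training signal is a dense flow field of shape (H, W, 2), and the function name `sample_sparse_control` and the parameter `max_vectors` are hypothetical.

```python
import numpy as np

def sample_sparse_control(flow, max_vectors, rng):
    """Keep a random, small subset of the vectors of a dense flow field.

    Assumption (not from the paper): `flow` has shape (H, W, 2) and the
    number of kept vectors is drawn uniformly from 0..max_vectors.
    """
    h, w, _ = flow.shape
    # Randomize how many control vectors are kept for this sample.
    k = int(rng.integers(0, max_vectors + 1))
    # Pick k distinct pixel locations uniformly at random.
    idx = rng.permutation(h * w)[:k]
    mask = np.zeros(h * w, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(h, w)
    # Zero out the flow everywhere except at the sampled locations.
    sparse = np.where(mask[..., None], flow, 0.0)
    return sparse, mask

rng = np.random.default_rng(0)
dense_flow = np.ones((8, 8, 2), dtype=np.float32)
sparse_flow, kept = sample_sparse_control(dense_flow, max_vectors=5, rng=rng)
```

Under this reading, the model sees only a handful of motion vectors per step and never the full correlated motion field, so it must infer how the rest of the scene responds.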


BAIR (different control scenarios)

Number of control vectors ablation

With too many control vectors we obtain moderate control over background objects, but limited interactions.

With too few control vectors we get good interactions, but limited control over background objects.

With the optimal number of control vectors we get the best of both worlds.

CLEVRER (samples)
Different samples generated by YODA with different controls. Notice the ability of YODA to predict long-range consequences of the inputs and to correctly model the physics of the scene. YODA was designed to make it possible to intervene in the generation process at any timestep, which allows giving new impulses to the objects on the fly.

iPER (samples)
Generated videos on the iPER dataset.

CLEVRER (controllability)
Starting from the same frame, we apply different control inputs and observe that YODA generates the correct responses.

To make our model easier to access, we have designed an interactive demo that will be released with the code. Given a frame from the dataset, the user can manipulate the objects by drawing arrows at certain locations. YODA will then generate the next frame by taking into account the user input and the previously generated frames.
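The interaction loop described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the demo's actual code: arrows are assumed to be pairs of integer pixel coordinates `((x0, y0), (x1, y1))`, and `arrows_to_control`, `rollout`, and the placeholder predictor `copy_last_frame` are hypothetical names that stand in for the real model.

```python
import numpy as np

def arrows_to_control(arrows, height, width):
    """Rasterize user arrows into a sparse displacement map and a mask.

    Assumption: each arrow is ((x0, y0), (x1, y1)) in integer pixels and
    contributes one displacement vector at its starting point.
    """
    control = np.zeros((height, width, 2), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)
    for (x0, y0), (x1, y1) in arrows:
        if 0 <= x0 < width and 0 <= y0 < height:
            control[y0, x0] = (x1 - x0, y1 - y0)
            mask[y0, x0] = True
    return control, mask

def rollout(first_frame, controls_per_step, predict_next):
    """Generate frames one at a time from past frames and new controls."""
    frames = [first_frame]
    for control, mask in controls_per_step:
        # The predictor sees all previously generated frames, so the user
        # can supply a new control (or an empty one) at every step.
        frames.append(predict_next(frames, control, mask))
    return frames

def copy_last_frame(frames, control, mask):
    # Placeholder for the trained model: simply repeats the last frame.
    return frames[-1].copy()

frame0 = np.zeros((8, 8, 3), dtype=np.float32)
push_right = arrows_to_control([((2, 3), (5, 3))], height=8, width=8)
no_input = arrows_to_control([], height=8, width=8)
video = rollout(frame0, [push_right, no_input], copy_last_frame)
```

Passing an empty arrow list at a step corresponds to letting the scene evolve without new user input.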

