Enabling Visual Composition and Animation in Unsupervised Video Generation

¹Computer Vision Group, Institute of Informatics, University of Bern, Switzerland · ²CompVis @ LMU Munich and MCML, Germany

Our model, CAGE, is designed to both compose and animate scenes from a sparse set of visual features. The model is trained in an unsupervised way from a dataset of unannotated videos.

The features are extracted from the regions shown as overlaid red patches, then rearranged and pasted into the corresponding locations in the control layout. Blue patches in the controls mark the intended future locations of the objects. Note the model's ability to carefully adjust the appearance of the objects (e.g. size, shadows, and lighting) to their location. Due to the stochastic nature of the model, the motion of uncontrolled objects is random.


In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes from predefined object parts and animating them in a plausible and controlled way.

This is achieved by conditioning video generation, during training, on a randomly selected sparse subset of local pre-trained self-supervised features. We call our model CAGE, for visual Composition and Animation for video GEneration.
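The sparse conditioning described above can be sketched as a simple Bernoulli subsampling of a per-patch feature grid. This is an illustrative NumPy sketch, not the paper's actual implementation; the function name, `keep_prob` parameter, and grid layout are assumptions.

```python
import numpy as np

def sample_sparse_controls(feature_grid, keep_prob=0.1, rng=None):
    """Select a random sparse subset of patch features as controls (sketch).

    feature_grid: (H, W, D) array of per-patch features (e.g. DINOv2 tokens).
    Returns the kept features together with their (row, col) patch positions,
    so they can later be pasted into a control layout.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = feature_grid.shape
    mask = rng.random((H, W)) < keep_prob   # independent Bernoulli mask over patches
    positions = np.argwhere(mask)           # (K, 2) coordinates of kept tokens
    features = feature_grid[mask]           # (K, D) the selected tokens
    return features, positions
```

At inference, the same interface lets a user hand-pick tokens (and target positions) instead of sampling them randomly.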

We conduct a series of experiments to demonstrate the capabilities of CAGE in various settings.

Training Pipeline

The overall pipeline of CAGE. The model processes all colored frames identically and in parallel; the pipeline for a single frame (in red) is illustrated. CAGE is trained in the Conditional Flow Matching framework to predict the denoising direction for the future frames, conditioned on the past frames (context and reference) and on sparse random sets of DINOv2 features. The frames communicate with each other via the Temporal Blocks while being processed separately by the Spatial Blocks. The controls are incorporated through cross-attention.
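The training objective described above can be sketched in a few lines. In (conditional) flow matching with a linear path, a sample is interpolated between noise and data, and the network regresses the constant velocity between the two endpoints. This NumPy sketch is illustrative only; the function name, the `cond` argument packing the context frames and DINOv2 controls, and the model signature are assumptions, not the paper's API.

```python
import numpy as np

def cfm_training_step(velocity_model, x1, cond, rng):
    """One conditional flow-matching loss evaluation (sketch).

    x1: batch of future frames to be generated, shape (B, C, H, W).
    cond: conditioning (past frames, sparse DINOv2 tokens); passed through.
    velocity_model(xt, t, cond) is assumed to predict the velocity field.
    """
    x0 = rng.standard_normal(x1.shape)                   # noise endpoint
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))  # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                           # linear interpolation path
    target = x1 - x0                                     # constant velocity target
    pred = velocity_model(xt, t, cond)
    return np.mean((pred - target) ** 2)                 # MSE regression loss
```

At sampling time, the learned velocity field is integrated from noise to data with an ODE solver, starting from the conditioning frames.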

To prevent overfitting to the positional information present in DINOv2 features, and to impose some scale invariance, the features are computed on random crops and then pasted back to their original locations.
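The crop-and-paste augmentation can be sketched as follows: sample a patch-aligned crop, run the feature extractor on the crop alone, and write the resulting tokens back into a full-frame token grid at the crop's original position. This is a minimal NumPy sketch under the assumption that `extract_fn` (e.g. a wrapper around DINOv2 that handles resizing internally) returns one token per 14×14 patch of its input; all names are illustrative.

```python
import numpy as np

def crop_features_and_paste(image, extract_fn, patch=14, rng=None):
    """Compute patch features on a random crop and paste them back at the
    crop's original location in a full-size token grid (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch                 # full-frame token grid size
    # Sample a random crop aligned to the patch grid.
    ch = int(rng.integers(1, gh + 1))
    cw = int(rng.integers(1, gw + 1))
    top = int(rng.integers(0, gh - ch + 1))
    left = int(rng.integers(0, gw - cw + 1))
    crop = image[top * patch:(top + ch) * patch,
                 left * patch:(left + cw) * patch]
    feats = extract_fn(crop)                        # (ch, cw, D) tokens for the crop
    grid = np.zeros((gh, gw, feats.shape[-1]))
    grid[top:top + ch, left:left + cw] = feats      # paste at original location
    return grid
```

Because the extractor only ever sees the crop, its positional cues no longer correlate with where the tokens land in the full frame.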

Out of Distribution Controls

Here we demonstrate the generalization of CAGE to out-of-distribution controls, including curved trajectories and sudden velocity changes in CLEVRER, and moving background objects in BAIR. Notice that our model sometimes manages to map the out-of-distribution controls to in-distribution ones (e.g. by introducing new objects in CLEVRER that collide with the controlled object, as in the second example of the first row). This hints at emergent planning capabilities of CAGE.

Cross-domain Transfer

With CAGE, one can copy features from images in domains other than the training data and still compose and animate scenes based on them. This is possible because our model is conditioned on data-agnostic abstract features extracted with DINOv2.

Comparison to Prior Work

This section compares CAGE (column 1: controls; column 2: generated videos) to the prior work YODA (column 3).

Long Video Generation

Although our controllability evaluation mostly shows short videos, CAGE's autoregressive nature allows it to generate longer videos at inference. Here are a few examples of 512-frame videos generated with CAGE.
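The autoregressive rollout can be sketched as repeatedly conditioning on the most recent frames and appending a newly sampled chunk. This is an illustrative sketch: `sample_chunk`, the chunk length, and the context length are assumptions, not the paper's actual settings.

```python
import numpy as np

def generate_long_video(sample_chunk, first_frames, total_frames, ctx=2):
    """Autoregressive long-video generation (sketch).

    sample_chunk(context) is assumed to sample a fixed-length array of new
    frames, shape (chunk, H, W, C), given the most recent `ctx` frames.
    """
    frames = list(first_frames)
    while len(frames) < total_frames:
        context = np.stack(frames[-ctx:])   # condition on the latest frames only
        new = sample_chunk(context)         # sample the next chunk of frames
        frames.extend(list(new))
    return np.stack(frames[:total_frames])  # trim any overshoot from the last chunk
```

Since each chunk only ever sees a bounded context window, memory stays constant no matter how long the rollout is.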

Robustness to the Number of Controls

Here we show sequences generated with only one vs. many DINOv2 tokens provided per object. While CAGE is fairly robust to the number of controls and can inpaint the missing information, sequences conditioned on more tokens are more consistent with the source image (e.g. in the second example on the second slide, the blue cube does not rotate when more tokens are provided, i.e. the pose of the object is fixed by the features).


Coming soon...