In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model can both compose scenes from predefined object parts and animate them in a plausible and controlled way.
This is achieved by conditioning video generation during training on a randomly selected subset of local features from a pre-trained self-supervised model. We call our model CAGE, for visual Composition and Animation for video GEneration.
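To make the conditioning concrete, below is a minimal sketch of how a sparse, randomly selected subset of DINOv2 patch tokens could be drawn to serve as control signals during training. The function name, the `keep_ratio` value, and the tensor shapes are illustrative assumptions, not the released CAGE code.

```python
import torch

def sample_sparse_controls(patch_tokens, keep_ratio=0.05):
    """Randomly keep a small subset of DINOv2 patch tokens to use as controls.

    patch_tokens: (B, N, D) grid of patch features from a frozen DINOv2 encoder.
    Returns the kept tokens plus their grid indices, so the generator knows
    *where* each control token lives in the frame.
    """
    B, N, D = patch_tokens.shape
    n_keep = max(1, int(keep_ratio * N))
    # Independent random subset per sample in the batch.
    idx = torch.stack([torch.randperm(N)[:n_keep] for _ in range(B)])
    idx = idx.to(patch_tokens.device)                               # (B, n_keep)
    kept = torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx
```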
We conduct a series of experiments to demonstrate the capabilities of CAGE in various settings.
The overall pipeline of CAGE. The model takes all the colored frames and processes them equally and in parallel; the pipeline for a single frame (in red) is illustrated. CAGE is trained in the Conditional Flow Matching framework to predict the denoising direction for the future frames, conditioned on the past frames (context and reference) and on sparse random sets of DINOv2 features. The frames communicate with each other via the Temporal Blocks while being processed separately by the Spatial Blocks. The controls are incorporated through Cross-Attention.
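As a rough illustration of the training objective described above, the snippet below sketches a single Conditional Flow Matching step with a linear interpolation path, where the network regresses the denoising direction for future frames given the past frames and the sparse DINOv2 controls. The `model` signature and tensor layout are assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(model, future, context, controls):
    """One Conditional Flow Matching step (linear interpolation path), as a sketch.

    future:   (B, T, C, H, W) clean future frames (the target of generation).
    context:  past frames the model is conditioned on (context + reference).
    controls: sparse DINOv2 tokens injected via cross-attention inside `model`.
    """
    noise = torch.randn_like(future)                       # x_0 ~ N(0, I)
    t = torch.rand(future.shape[0], 1, 1, 1, 1, device=future.device)
    x_t = (1 - t) * noise + t * future                     # point on the straight path
    target_velocity = future - noise                       # denoising direction
    pred = model(x_t, t.flatten(), context=context, controls=controls)
    return F.mse_loss(pred, target_velocity)
```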
To prevent overfitting to the positional information present in DINOv2 features, and to impose some scale invariance, the features are computed on random crops and then pasted back to their original locations.
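A possible implementation of this augmentation is sketched below: features are computed on a randomly sized crop (resized to the encoder's native resolution) and then pasted back into the full-frame token grid at their original location. The `encoder` interface, the 14-pixel patch size, and the 16x16 native grid are assumptions made for illustration.

```python
import random
import torch
import torch.nn.functional as F

def crop_augmented_features(frame, encoder, patch=14, min_scale=0.5):
    """Compute patch features on a random crop and paste them back into the
    full-frame token grid (a sketch; `encoder` is assumed to map a
    (1, 3, 16*patch, 16*patch) image to a (1, 16*16, D) tensor of patch tokens).
    """
    _, H, W = frame.shape
    GH, GW = H // patch, W // patch                 # full-frame token grid size
    # Pick a random sub-grid (the crop), expressed directly in token units.
    gh = random.randint(max(1, int(GH * min_scale)), GH)
    gw = random.randint(max(1, int(GW * min_scale)), GW)
    gt, gl = random.randint(0, GH - gh), random.randint(0, GW - gw)

    # Corresponding pixel crop, resized to the encoder's native input size.
    crop = frame[:, gt*patch:(gt+gh)*patch, gl*patch:(gl+gw)*patch]
    crop = F.interpolate(crop[None], size=(16 * patch, 16 * patch), mode="bilinear")
    tokens = encoder(crop)                          # assumed shape (1, 16*16, D)
    tokens = tokens.reshape(1, 16, 16, -1).permute(0, 3, 1, 2)  # (1, D, 16, 16)

    # Paste the crop's tokens back at their original grid location.
    full = torch.zeros(1, tokens.shape[1], GH, GW, device=frame.device)
    full[:, :, gt:gt+gh, gl:gl+gw] = F.interpolate(tokens, size=(gh, gw), mode="bilinear")
    return full
```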
Here we demonstrate some generalization of CAGE to out-of-distribution controls, including curved trajectories and sudden changes in velocity in CLEVRER, and moving background objects in BAIR. Notice that our model sometimes manages to cast the out-of-distribution controls into in-distribution ones (e.g. by introducing new objects in CLEVRER that collide with the controlled object; see the second example in the first row). This demonstrates the emergent planning capabilities of CAGE.
This section shows some comparisons between CAGE (column 1: controls, column 2: generated videos) and the prior work YODA (column 3).
Although our controllability evaluation mostly shows short videos, thanks to its autoregressive nature CAGE can generate much longer videos at inference. Here are a few examples of 512-frame videos generated with CAGE.
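A minimal sketch of such an autoregressive rollout is shown below: frames are generated chunk by chunk, with the most recent frames fed back as context for the next chunk. The `sample_chunk` function, as well as the chunk and context sizes, are hypothetical stand-ins for the actual sampling interface.

```python
import torch

@torch.no_grad()
def rollout(sample_chunk, first_frames, controls, total_frames=512, chunk=8, context=2):
    """Autoregressive long-video generation (a sketch).

    `sample_chunk(context_frames, controls)` is assumed to run the full
    flow-matching sampler and return the next `chunk` frames, each of shape
    (C, H, W), conditioned on the given context frames.
    """
    video = list(first_frames)                    # seed with real context frames
    while len(video) < total_frames:
        ctx = torch.stack(video[-context:])       # condition on the most recent frames
        new_frames = sample_chunk(ctx, controls)  # (chunk, C, H, W)
        video.extend(new_frames)
    return torch.stack(video[:total_frames])
```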
Here we show some generated sequences with only one vs. many DINOv2 tokens provided per object. While CAGE is fairly robust to the number of controls provided and can inpaint the missing information, the sequences conditioned on more tokens are more consistent with the source image (e.g. in the second example on the second slide, the blue cube does not rotate when more tokens are provided, i.e. the object's pose is fixed by the features).
Coming soon...