Paper:
[arXiv]

Supplementary:
[arXiv]

## Abstract

We present GLASS, a method for Global and Local Action-driven Sequence Synthesis. GLASS is a generative model that is trained on video sequences in an unsupervised manner and that can animate an input image at test time. The method learns to segment frames into foreground-background layers and to generate transitions of the foregrounds over time through a global and local action representation. Global actions are explicitly related to 2D shifts, while local actions are instead related to (both geometric and photometric) local deformations. GLASS uses a recurrent neural network to transition between frames and is trained through a reconstruction loss. We also introduce W-Sprites (Walking Sprites), a novel synthetic dataset with a predefined action space. We evaluate our method on both W-Sprites and real datasets, and find that GLASS is able to generate realistic video sequences from a single input image and to successfully learn a more advanced action space than in prior work.

## Method

GLASS consists of two stages: Global Motion Analysis (GMA) and Local Motion Analysis (LMA). GMA separates the foreground agent from the background and regresses the 2D shifts of the foregrounds and backgrounds between frames. LMA learns a representation for local actions that describes deformations other than 2D shifts. To this end, it uses a Recurrent Neural Network (RNN) that takes as input a feature encoding of a frame together with the global and local actions. The GMA and LMA stages are trained jointly in an unsupervised manner.

Global Motion Analysis

Two input frames $I_t$ and $I_{t+1}$ are fed separately to a segmentation network that outputs the foreground masks $m_t$ and $m_{t+1}$, respectively. The masks are used to separate the foregrounds $f_t$ and $f_{t+1}$ from the backgrounds $b_t$ and $b_{t+1}$. The concatenated foregrounds are fed to a network $P_F$ that predicts their relative shift $\Delta_F$. We use $\Delta_F$ to shift $f_t$ and match it to $f_{t+1}$ via an $L_2$ loss (the foregrounds may not match exactly, and this loss does not penalize small errors). For the backgrounds, we additionally train an inpainting network before shifting them by the predicted $\Delta_B$ and matching them with an $L_1$ loss (unlike the foregrounds, we can expect the backgrounds to match).
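The shift-and-match idea can be illustrated with a minimal numpy sketch. All names here are illustrative, not the paper's code, and `np.roll` with an integer shift stands in for the differentiable warping used in training:

```python
import numpy as np

def shift_image(img, dy, dx):
    # Integer 2D shift; a stand-in for the differentiable warp used in training.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(0)
f_t = rng.random((16, 16))           # toy foreground at time t
delta_f = (2, 3)                     # predicted relative shift Delta_F
f_next = shift_image(f_t, *delta_f)  # toy f_{t+1}: exactly the shifted f_t

# Foreground matching: L2 between the shifted f_t and f_{t+1}.
l2_loss = np.mean((shift_image(f_t, *delta_f) - f_next) ** 2)

# Background matching uses an L1 loss instead (after inpainting).
l1_loss = np.mean(np.abs(shift_image(f_t, *delta_f) - f_next))
```

For a perfectly predicted shift both losses vanish; in practice only the backgrounds are expected to match closely, which is why the foreground loss is the more forgiving $L_2$.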

Local Motion Analysis

We feed the segmented foreground $f_t$, its shifted version, and $f_{t+1}$ separately to an encoder network to obtain the features $\phi_t$, $\tilde\phi_t$ and $\phi_{t+1}$, respectively. The latter two features are then mapped to a local action $a_t$ by the action network. A further encoding $e_t$ of $\phi_t$, the previous state $s_t$, the local action $a_t$ and the global action $\Delta_F$ are fed to the RNN to predict the next state $s_{t+1}$. Finally, a decoder maps the state $s_{t+1}$ to the next foreground $\hat{f}_{t+1}$, which is matched to the original foreground $f_{t+1}$ via a reconstruction loss.
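A single LMA transition can be sketched as follows. This is a minimal numpy sketch with random weights: the dimensions, the tanh nonlinearities, and the choice $e_t = \phi_t$ are all illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, A = 16, 8, 4  # foreground, feature/state and action dims (illustrative)
W_enc = rng.normal(size=(F, D)) * 0.1
W_act = rng.normal(size=(2 * D, A)) * 0.1
W_rnn = rng.normal(size=(2 * D + A + 2, D)) * 0.1
W_dec = rng.normal(size=(D, F)) * 0.1

def encoder(f):                          # foreground -> feature phi
    return np.tanh(f @ W_enc)

def action_net(phi_shifted, phi_next):   # feature pair -> local action a_t
    return np.tanh(np.concatenate([phi_shifted, phi_next]) @ W_act)

def rnn_step(e_t, s_t, a_t, delta_f):    # one recurrent transition
    x = np.concatenate([e_t, s_t, a_t, delta_f])
    return np.tanh(x @ W_rnn)

def decoder(s):                          # state -> foreground
    return s @ W_dec

# One LMA step on toy (flattened) foregrounds.
f_t, f_t_shifted, f_next = rng.random((3, F))
phi_t = encoder(f_t)
phi_shifted, phi_next = encoder(f_t_shifted), encoder(f_next)
a_t = action_net(phi_shifted, phi_next)      # local action
delta_f = np.array([2.0, 3.0])               # global action (2D shift)
s_t = np.zeros(D)                            # previous RNN state
s_next = rnn_step(phi_t, s_t, a_t, delta_f)  # here e_t = phi_t for simplicity
f_hat = decoder(s_next)                      # predicted foreground \hat f_{t+1}
recon_loss = np.mean((f_hat - f_next) ** 2)  # reconstruction loss
```

At test time the same loop generates a sequence: the user (or a policy) supplies $\Delta_F$ and $a_t$ at each step, and the decoder renders each predicted state.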

## W-Sprites

In order to assess and ablate the components of GLASS, we build a synthetic video dataset of cartoon characters acting on a moving background. We call the dataset W-Sprites (for Walking Sprites).

Here we provide some sample videos from the W-Sprites dataset:

Please check the paper for details and the official GitHub repository for instructions on generating the data.

## Results

GLASS automatically separates the foreground from the background in video sequences and discovers the most relevant global and local actions, which can be used at inference time to generate diverse videos. A trained GLASS model supports a variety of applications, from controllable generation to motion transfer. We test our model on the W-Sprites, Tennis and BAIR datasets.

Global Actions

W-Sprites

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.

Tennis

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.

BAIR

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.

Local Actions

W-Sprites

The local actions learnt by the model can be interpreted as *turn front*, *slash front*, *spellcast*, *slash left*, *turn right* and *turn left*.

Tennis

The actions capture some small variations of the pose of the tennis player, such as rotation and the distance between the legs.

BAIR

The actions capture some local deformations of the robot arm, i.e. the state of the manipulator (open / close).

Motion Transfer

Tennis

Row by row: Original videos, reconstruction and motion transfer examples on the Tennis dataset. Note the ability of GLASS to generate very diverse videos from the same initial frame.

BAIR

Row by row: Original videos, reconstruction and motion transfer examples on the BAIR dataset. Note the ability of GLASS to generate very diverse videos from the same initial frame.

## Citation

The paper is to appear in the Proceedings of the 17th European Conference on Computer Vision (ECCV 2022). In the meantime we suggest citing the arXiv preprint:

Davtyan, A. & Favaro, P. (2022). Controllable Video Generation through Global and Local Motion Dynamics. arXiv preprint arXiv:2204.06558.

@misc{https://doi.org/10.48550/arxiv.2204.06558,
  doi = {10.48550/ARXIV.2204.06558},
  url = {https://arxiv.org/abs/2204.06558},
  author = {Davtyan, Aram and Favaro, Paolo},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Controllable Video Generation through Global and Local Motion Dynamics},
  publisher = {arXiv},
  year = {2022},
}