Abstract
We present GLASS, a method for Global and Local Action-driven Sequence Synthesis. GLASS is a generative model that is trained on video sequences in an unsupervised manner and that can animate an input image at test time. The method learns to segment frames into foreground-background layers and to generate transitions of the foregrounds over time through a global and local action representation. Global actions are explicitly related to 2D shifts, while local actions are instead related to (both geometric and photometric) local deformations. GLASS uses a recurrent neural network to transition between frames and is trained through a reconstruction loss. We also introduce W-Sprites (Walking Sprites), a novel synthetic dataset with a predefined action space. We evaluate our method on both W-Sprites and real datasets, and find that GLASS is able to generate realistic video sequences from a single input image and to successfully learn a more advanced action space than in prior work.
Method
GLASS consists of two stages: Global Motion Analysis (GMA) and Local Motion Analysis (LMA). GMA aims to separate the foreground agent from the background and to regress the 2D shifts of the foregrounds and backgrounds over time. LMA aims to learn a representation for local actions that describes deformations other than 2D shifts. To this end, it uses a Recurrent Neural Network (RNN) that takes as input a feature encoding of a frame together with the global and local actions. The GMA and LMA stages are trained jointly in an unsupervised manner.
  Global Motion Analysis
  
Two input frames x_t and x_{t+1} are fed (separately) to a segmentation network to output the foreground masks m_t and m_{t+1}, respectively. The masks are used to separate the foregrounds f_t and f_{t+1} from the backgrounds b_t and b_{t+1}. The concatenated foregrounds are fed to the network Pf to predict their relative shift Δ_t. We use Δ_t to shift f_t and match it to f_{t+1} via a loss that does not penalize small errors (the foregrounds may not match exactly). In the case of the backgrounds, we also train an inpainting network before shifting them with the predicted shift and matching them with a reconstruction loss (unlike the foregrounds, we can expect the backgrounds to match).
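For concreteness, the following is a minimal PyTorch sketch of the GMA computation described above. The stand-in networks (TinySegmenter, TinyShiftNet), the frame resolution and the L1 matching loss are illustrative assumptions rather than the paper's actual architecture and losses, and the background inpainting step is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmenter(nn.Module):
    # Predicts a soft foreground mask in [0, 1] for each frame.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, x):
        return torch.sigmoid(self.net(x))

class TinyShiftNet(nn.Module):
    # Regresses the relative 2D shift between two concatenated foregrounds.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
    def forward(self, f_t, f_tp1):
        return self.net(torch.cat([f_t, f_tp1], dim=1))

def shift_image(img, delta):
    # Translates a batch of images by the 2D shifts `delta`
    # (given in normalized image coordinates).
    n = img.shape[0]
    theta = torch.zeros(n, 2, 3, device=img.device)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, 0, 2] = -delta[:, 0]   # horizontal component
    theta[:, 1, 2] = -delta[:, 1]   # vertical component
    grid = F.affine_grid(theta, list(img.shape), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

segmenter, shift_net = TinySegmenter(), TinyShiftNet()
x_t, x_tp1 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)   # two input frames

m_t, m_tp1 = segmenter(x_t), segmenter(x_tp1)                     # foreground masks
f_t, f_tp1 = m_t * x_t, m_tp1 * x_tp1                             # foregrounds
b_t, b_tp1 = (1 - m_t) * x_t, (1 - m_tp1) * x_tp1                 # backgrounds

delta = shift_net(f_t, f_tp1)                                     # predicted relative shift
fg_loss = F.l1_loss(shift_image(f_t, delta), f_tp1)               # foreground matching loss
print(fg_loss.item())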
  Local Motion Analysis
  
We feed the segmented foreground f_t, its shifted version and f_{t+1} separately as inputs to an encoder network to obtain the features e_t, e'_t and e_{t+1}, respectively. The latter two features are then mapped to a local action a_t by the action network. A further encoding of e_t, together with the previous state h_t, the local action a_t and the global action Δ_t, is fed as input to the RNN to predict the next state h_{t+1}. Finally, a decoder maps the state h_{t+1} to the next foreground f̂_{t+1}, which is matched to the original foreground f_{t+1} via the reconstruction loss.
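Below is a minimal PyTorch sketch of one LMA step under these definitions. The encoder, action network, GRU cell and decoder are small stand-ins, and the feature dimensions, frame resolution and L1 reconstruction loss are assumptions made for readability rather than the paper's actual choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, ACT, STATE = 32, 8, 64   # illustrative feature, action and state sizes

# Stand-in modules for the encoder, action network, RNN and decoder.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, FEAT))
action_net = nn.Sequential(nn.Linear(2 * FEAT, 32), nn.ReLU(), nn.Linear(32, ACT))
rnn = nn.GRUCell(input_size=FEAT + ACT + 2, hidden_size=STATE)   # +2 for the global shift
decoder = nn.Sequential(nn.Linear(STATE, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))

f_t, f_t_shifted, f_tp1 = (torch.rand(2, 3, 64, 64) for _ in range(3))  # foregrounds
delta_t = torch.rand(2, 2)          # global shift (action) from the GMA stage
h_t = torch.zeros(2, STATE)         # previous recurrent state

e_shifted, e_tp1 = encoder(f_t_shifted), encoder(f_tp1)
a_t = action_net(torch.cat([e_shifted, e_tp1], dim=1))       # local action a_t
e_t = encoder(f_t)                                            # encoding of f_t
h_tp1 = rnn(torch.cat([e_t, a_t, delta_t], dim=1), h_t)      # next state h_{t+1}
f_hat_tp1 = torch.sigmoid(decoder(h_tp1))                    # predicted next foreground
rec_loss = F.l1_loss(f_hat_tp1, f_tp1)                       # reconstruction loss
print(rec_loss.item())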
W-Sprites
In order to assess and ablate the components of GLASS, we build a synthetic video dataset of cartoon characters acting on a moving background. We call the dataset W-Sprites (for Walking Sprites).
Here we provide some sample videos from the W-Sprites dataset:
Please check the paper for details and the official GitHub repository for instructions on generating the data.
Results
GLASS automatically separates the foreground from the background in video sequences and discovers the most relevant global and local actions, which can then be used at inference time to generate diverse videos. A trained GLASS model supports a variety of applications, from controllable generation to motion transfer. We test our model on the W-Sprites, Tennis and BAIR datasets.
  Global Actions
W-Sprites

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.
Tennis

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.
BAIR

Each row starts with the same frame. Each column corresponds to one of the global actions, from left to right: right, left, down, up and stay.
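To illustrate how the discrete global actions above could drive generation at inference time, here is a hedged sketch of a rollout loop. The mapping from action names to 2D shift directions and the generate_step placeholder (which simply translates the image) are assumptions made for readability; in GLASS the transition would be produced by the trained RNN and decoder.

import torch

# Illustrative mapping from the five global actions to 2D shift directions.
GLOBAL_ACTIONS = {
    "right": ( 1.0,  0.0),
    "left":  (-1.0,  0.0),
    "down":  ( 0.0,  1.0),
    "up":    ( 0.0, -1.0),
    "stay":  ( 0.0,  0.0),
}

def generate_step(frame, shift, local_action):
    # Placeholder for one GLASS transition: here we simply translate the
    # image by a few pixels so that the rollout loop is runnable.
    dx, dy = int(4 * shift[0]), int(4 * shift[1])
    return torch.roll(frame, shifts=(dy, dx), dims=(1, 2))

frame = torch.rand(3, 64, 64)              # the single input frame to animate
local_action = torch.zeros(8)              # a fixed (hypothetical) local action code
video = [frame]
for name in ["right", "right", "up", "stay", "left"]:   # user-chosen action script
    frame = generate_step(frame, torch.tensor(GLOBAL_ACTIONS[name]), local_action)
    video.append(frame)
video = torch.stack(video)                 # (T, 3, 64, 64) generated sequence
print(video.shape)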
  Local Actions
W-Sprites

The local actions learnt by the model can be interpreted as: turn front, slash front, spellcast, slash left, turn right and turn left.
Tennis

The actions capture some small variations of the pose of the tennis player, such as rotation and the distance between the legs.
BAIR

The actions capture some local deformations of the robot arm, i.e., the state of the manipulator (open/close).
  Motion Transfer
Tennis
Row by row: original videos, reconstructions, and motion transfer examples on the Tennis dataset. Note the ability of GLASS to generate very diverse videos from the same initial frame.
BAIR
Row by row: original videos, reconstructions, and motion transfer examples on the BAIR dataset. Note the ability of GLASS to generate very diverse videos from the same initial frame.
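The motion transfer shown above can be summarized by the following minimal sketch, which reuses the same kind of stand-in modules as in the LMA sketch: per-step actions are extracted from a driving video and then replayed from a different initial frame. For brevity it operates on whole frames rather than segmented foregrounds and omits the global-shift conditioning; all module names and sizes are illustrative assumptions, not the trained GLASS components.

import torch
import torch.nn as nn

FEAT, ACT, STATE = 32, 8, 64   # illustrative sizes, as in the LMA sketch
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, FEAT))
action_net = nn.Sequential(nn.Linear(2 * FEAT, 32), nn.ReLU(), nn.Linear(32, ACT))
rnn = nn.GRUCell(FEAT + ACT, STATE)
decoder = nn.Sequential(nn.Linear(STATE, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))

driver = torch.rand(6, 1, 3, 64, 64)       # driving video (T frames, batch of 1)
target = torch.rand(1, 3, 64, 64)          # initial frame of the target video

# 1) Extract the action sequence from consecutive frames of the driving video.
actions = [action_net(torch.cat([encoder(driver[t]), encoder(driver[t + 1])], dim=1))
           for t in range(driver.shape[0] - 1)]

# 2) Replay the extracted actions starting from the new initial frame.
h = torch.zeros(1, STATE)
frame, video = target, [target]
for a in actions:
    h = rnn(torch.cat([encoder(frame), a], dim=1), h)
    frame = torch.sigmoid(decoder(h))
    video.append(frame)
video = torch.cat(video)                    # (T, 3, 64, 64) transferred sequence
print(video.shape)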
Citation
Davtyan, A., Favaro, P. (2022). Controllable Video Generation Through Global and Local Motion Dynamics. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13677. Springer, Cham. https://doi.org/10.1007/978-3-031-19790-1_5
@InProceedings{10.1007/978-3-031-19790-1_5,
	author="Davtyan, Aram and Favaro, Paolo",
	editor="Avidan, Shai and Brostow, Gabriel and Ciss{\'e}, Moustapha and Farinella, Giovanni Maria and Hassner, Tal",
	title="Controllable Video Generation Through Global and Local Motion Dynamics",
	booktitle="Computer Vision -- ECCV 2022",
	year="2022",
	publisher="Springer Nature Switzerland",
	address="Cham",
	pages="68--84",
	isbn="978-3-031-19790-1"
}