Customizing Motion in Text-to-Video Diffusion Models

Joanna Materzyńska1, Josef Sivic2,3, Eli Shechtman2, Antonio Torralba1, Richard Zhang2, Bryan Russell2
1Massachusetts Institute of Technology 2Adobe
3Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University
Paper | Code

Input videos

"A person doing the V* dance."

Generated videos

A futuristic robot mimicking human movements in the V* dance.
An older lady doing the V* dance while jumping up and down.
Nurses dancing the V* dance in a hospital.
A toddler giggling while attempting the V* dance in the living room.
A chef dancing the V* dance.
A content cat enjoying the V* dance on a sunny windowsill.
A man and a woman doing the V* dance in New York.
Elderly people doing the V* dance.
Given a few examples of the "Carlton dance", our customization method incorporates the depicted motion into a pretrained text-to-video diffusion model using a new motion identifier "V* dance". We generate the depicted motion across a variety of novel contexts, including with a non-humanoid subject (robot), different subject scale (toddler), and multiple subjects (a group of nurses).

Abstract

We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. Given a few video samples demonstrating a specific motion as input, our method learns the input motion pattern and generalizes it to diverse, text-specified scenarios. Our contributions are threefold.

First, we fine-tune an existing text-to-video model to learn a mapping from the motion depicted in the input examples to a new, unique text token. To avoid overfitting to the new custom motion, we introduce a regularization approach over videos.

Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions.

Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.

Method Overview


Given a small set of exemplar videos, our approach fine-tunes the U-Net of a text-to-video model with a reconstruction objective. The motion is associated with a unique motion identifier that can be used at test time to synthesize novel subjects performing the motion. To represent the added motion while preserving information from the pretrained model, we tune only a subset of the weights: the temporal convolution and attention layers, along with the key and value projections in the spatial attention layers. A set of related videos is used to regularize the tuning process.
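As a rough illustration, the sketch below shows how such selective fine-tuning could be wired up in PyTorch with a diffusers-style video U-Net and noise scheduler. The module names (e.g. "temp_", "attn2.to_k") and the training loop are assumptions about a typical text-to-video architecture, not the authors' released code.

import torch
import torch.nn.functional as F

def select_trainable_params(unet):
    """Freeze the U-Net, then re-enable gradients only for the temporal layers
    and the key/value projections of the spatial cross-attention blocks."""
    for p in unet.parameters():
        p.requires_grad = False
    trainable = []
    for name, p in unet.named_parameters():
        if ("temp_" in name                # temporal convolution / temporal attention blocks (assumed naming)
                or "attn2.to_k" in name    # key projection, spatial cross-attention
                or "attn2.to_v" in name):  # value projection, spatial cross-attention
            p.requires_grad = True
            trainable.append(p)
    return trainable

def reconstruction_step(unet, scheduler, text_encoder, batch):
    """One denoising-reconstruction step on exemplar (or regularization) video latents.
    Prompts for the exemplar videos contain the new motion identifier "V*"."""
    latents, prompt_ids = batch["latents"], batch["prompt_ids"]
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    text_emb = text_encoder(prompt_ids)[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    return F.mse_loss(noise_pred, noise)   # standard epsilon-prediction reconstruction loss

Under this sketch, the same loss would also be averaged over the regularization videos so that tuning the motion token does not drift from the pretrained prior.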



Motion: "Airquotes"

Input videos

"A person doing the V*."

Generated videos

A chef doing V* while describing a secret recipe.
A close-up of a nun doing the V*.
A close-up of a pirate doing the V*.
A DJ in headphones doing the V* in a nightclub.
A farmer in overalls doing the V* in a field.
A gardener in a sun hat doing V* in a garden.


Motion: "Appear"

Input videos

"A person V*."

Generated videos

A colorful parrot V* in the dense canopy of a rainforest.
A kitty V* on a yellow couch.
A gentle deer V* in a sun-dappled forest clearing.
A green tractor V* in a wide, golden wheat field.
A purple umbrella V* on a crowded, rainy sidewalk.
Queen Elizabeth I V* in a Tudor palace.


Motion: "Dab"

Input videos

"A person V*."

Generated videos

A chef in a white apron doing the V* in a kitchen.
A gorilla doing the V*.
Babies doing the V*.
A pack of cyclists doing the V* at the finish line of a race.
A farmer in overalls doing the V* in a field.
A company of soldiers doing the V* at a military base.


Motion: "3D rotation"

Input videos

"A camera V* around [OBJECT]."

Generated videos

A camera V* around a banana.
A camera V* around an elephant.
A camera V* around a collection of rare coins.
A camera V* around a teddy bear in a toy shop.
A camera V* around a tree in a lush forest.
A camera V* around a colorful mural.



Comparison with Image Customization Methods

Examples of learning a customized motion, "Sliding Two Fingers Up", from the Jester dataset, with the prompt "A female firefighter doing the V* sign". Image personalization methods (first three columns) fail to capture the motion and do not produce temporally coherent videos.
A female firefighter doing the V* sign.
A hiker doing the V* sign.
A blond woman doing the V* sign.
A doctor doing the V* sign.
Custom Diffusion
DreamBooth
Textual Inversion
Ours

Comparison with Tune-A-Video

Our method seamlessly renders a custom motion in novel scenarios. Although the training videos show only a single actor performing a single motion in the same way, our method can generate the custom motion in conjunction with a different motion ("doing the gesture while eating a burger with the other hand"), vary the timing of the motion ("doing the gesture very slowly and precisely"), and involve multiple people ("children"). In contrast, Tune-A-Video fails to generalize to these novel scenarios and does not produce temporally coherent videos.
A clown doing the V* gesture while eating a burger with the other hand.
An elderly woman doing the V* gesture very slowly and precisely.
Children doing the V* gesture in the classroom in front of a blackboard.
Tune-A-Video
Ours


Comparison with the Original Pretrained Text-to-Video Model


Original Model
A close-up of a pirate doing the airquotes.

Our Method
A close-up of a pirate doing the V*.

Original Model
A kitty appears on a yellow couch.

Our Method
A kitty V* on a yellow couch.

Original Model
A gorilla doing the dab.

Our Method
A gorilla doing the V*.

Original Model
A content cat enjoying the Carlton dance on a sunny windowsill.

Our Method
A content cat enjoying the V* dance on a sunny windowsill.

Original Model
A camera rotating around a banana.

Our Method
A camera V* around a banana.

Customizing Appearance and Motion

Input Appearance Images
A photo of a X* man.
Input Motion Videos
A person doing the V* sign.

The X* man doing the V* sign.

Our method can be extended to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions.
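For instance, once both identifiers have been learned, generation could look like the following minimal sketch. A diffusers-style text-to-video pipeline is assumed here; the pipeline class, checkpoint path, and call arguments are illustrative assumptions, not the authors' released code.

import torch
from diffusers import TextToVideoSDPipeline

# Hypothetical checkpoint fine-tuned with both the appearance identifier "X*"
# and the motion identifier "V*".
pipe = TextToVideoSDPipeline.from_pretrained(
    "path/to/customized-text-to-video-checkpoint",
    torch_dtype=torch.float16,
).to("cuda")

# Both learned identifiers can be combined in a single prompt.
prompt = "The X* man doing the V* sign."
frames = pipe(prompt, num_frames=16, num_inference_steps=50).frames[0]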




BibTeX

@inproceedings{materzynska2024newmove,
  title={NewMove: Customizing text-to-video models with novel motions},
  author={Materzy{\'n}ska, Joanna and Sivic, Josef and Shechtman, Eli and Torralba, Antonio and Zhang, Richard and Russell, Bryan},
  booktitle={Proceedings of the Asian Conference on Computer Vision},
  pages={1634--1651},
  year={2024}
}

We would like to thank Kabir Swain for helpful discussions.

Website source based on this source code.