Customizing Motion in Text-to-Video Diffusion Models

Joanna Materzyńska1, Josef Sivic2,3, Eli Shechtman2, Antonio Torralba1, Richard Zhang2, Bryan Russell2
1Massachusetts Institute of Technology 2Adobe
3Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University
Paper | Code

Input videos

"A person doing the V* dance."

Generated videos

A futuristic robot mimicking human movements in the V* dance.
An older lady doing the V* dance while jumping up and down.
Nurses dancing the V* dance in a hospital.
A toddler giggling while attempting the V* dance in the living room.
A chef dancing the V* dance.
A content cat enjoying the V* dance on a sunny windowsill.
A man and a woman doing the V* dance in New York.
Elderly people doing the V* dance.
Given a few examples of the "Carlton dance", our customization method incorporates the depicted motion into a pretrained text-to-video diffusion model using a new motion identifier "V* dance". We generate the depicted motion across a variety of novel contexts, including with a non-humanoid subject (robot), different subject scale (toddler), and multiple subjects (a group of nurses).

Abstract

We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. Given a few video samples demonstrating a specific motion as input, our method learns the input motion pattern and generalizes it to diverse, text-specified scenarios. Our contributions are threefold.

First, we fine-tune an existing text-to-video model to learn a mapping from the motion depicted in the input examples to a new, unique text token. To avoid overfitting to the new custom motion, we introduce a regularization approach over videos.

Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions.

Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.

Method Overview


Given a small set of exemplar videos, our approach fine-tunes the U-Net of a text-to-video model with a reconstruction objective. The motion is associated with a unique motion identifier that can be used at test time to synthesize novel subjects performing the motion. To represent the added motion while preserving information from the pretrained model, we tune only a subset of the weights: the temporal convolution and attention layers, along with the key and value projections in the spatial attention layers. A set of related videos is used to regularize the tuning process.
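As a rough illustration, the sketch below shows how such selective fine-tuning could be wired up in PyTorch with a diffusers-style video U-Net and noise scheduler. The module names (e.g. "temp_", "attn2.to_k") and the training loop are assumptions about a typical text-to-video architecture, not the authors' released code.

import torch
import torch.nn.functional as F

def select_trainable_params(unet):
    """Freeze the U-Net, then re-enable gradients only for the temporal layers
    and the key/value projections of the spatial cross-attention blocks."""
    for p in unet.parameters():
        p.requires_grad = False
    trainable = []
    for name, p in unet.named_parameters():
        if ("temp_" in name                # temporal convolution / temporal attention blocks (assumed naming)
                or "attn2.to_k" in name    # key projection, spatial cross-attention
                or "attn2.to_v" in name):  # value projection, spatial cross-attention
            p.requires_grad = True
            trainable.append(p)
    return trainable

def reconstruction_step(unet, scheduler, text_encoder, batch):
    """One denoising-reconstruction step on exemplar (or regularization) video latents.
    Prompts for the exemplar videos contain the new motion identifier "V*"."""
    latents, prompt_ids = batch["latents"], batch["prompt_ids"]
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    text_emb = text_encoder(prompt_ids)[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    return F.mse_loss(noise_pred, noise)   # standard epsilon-prediction reconstruction loss

Under this sketch, the same loss would also be averaged over the regularization videos so that tuning the motion token does not drift from the pretrained prior.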



Motion: "Airquotes"

Input videos

"A person doing the V*."

Generated videos

A chef doing V* while describing a secret recipe.
A close-up of a nun doing the V*.
A close-up of a pirate doing the V*.
A DJ in headphones doing the V* in a nightclub.
A farmer in overalls doing the V* in a field.
A gardener in a sun hat doing V* in a garden.


Motion: "Appear"

Input videos

"A person V*."

Generated videos

A colorful parrot V* in the dense canopy of a rainforest.
A kitty V* on a yellow couch.
A gentle deer V* in a sun-dappled forest clearing.
A green tractor V* in a wide, golden wheat field.
A purple umbrella V* on a crowded, rainy sidewalk.
Queen Elizabeth I V* in a Tudor palace.


Motion: "Dab"

Input videos

"A person V*."

Generated videos

A chef in a white apron doing the V* in a kitchen.
A gorilla doing the V*.
Babies doing the V*.
A pack of cyclists doing the V* at the finish line of a race.
A farmer in overalls doing the V* in a field.
A company of soldiers doing the V* at a military base.


Motion: "3D rotation"

Input videos

"A camera V* around [OBJECT]."

Generated videos

A camera V* around a banana.
A camera V* around an elephant.
A camera V* around a collection of rare coins.
A camera V* around a teddy bear in a toy shop.
A camera V* around a tree in a lush forest.
A camera V* around a colorful mural.



Comparison with Image Customization Methods

Examples of learning a customized motion, "Sliding Two Fingers Up", from the Jester dataset, with the prompt "A female firefighter doing the V* sign". Image personalization methods (first three columns) fail to capture the motion and do not produce temporally coherent videos.
A female firefighter doing the V* sign.
A hiker doing the V* sign.
A blond woman doing the V* sign.
A doctor doing the V* sign.
Custom Diffusion
DreamBooth
Textual Inversion
Ours

Comparison with Tune-A-Video

Our method seamlessly renders a custom motion in novel scenarios. Although the training videos show only a single actor performing a single motion in the same way, our method can generate the custom motion in conjunction with a different motion ("doing the gesture while eating a burger with the other hand"), vary the timing of the motion ("doing the gesture very slowly and precisely"), and involve multiple people ("children"). In contrast, Tune-A-Video fails to generalize to these novel scenarios and does not produce temporally coherent videos.
A clown doing the V* gesture while eating a burger with the other hand.
An elderly woman doing the V* gesture very slowly and precisely.
Children doing the V* gesture in the classroom in front of a blackboard.
Tune-A-Video
Ours


Comparison with the Original Pretrained Text-to-Video Model


Original Model
A close-up of a pirate doing the airquotes.

Our Method
A close-up of a pirate doing the V*.

Original Model
A kitty appears on a yellow couch.

Our Method
A kitty V* on a yellow couch.

Original Model
A gorilla doing the dab.

Our Method
A gorilla doing the V*.

Original Model
A content cat enjoying the Carlton dance on a sunny windowsill.

Our Method
A content cat enjoying the V* dance on a sunny windowsill.

Original Model
A camera rotating around a banana.

Our Method
A camera V* around a banana.

Customizing Appearance and Motion

Input Appearance Images
A photo of a X* man.
Input Motion Videos
A person doing the V* sign.

The X* man doing the V* sign.

Our method can be extended to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions.
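For instance, once both identifiers have been learned, generation could look like the following minimal sketch. A diffusers-style text-to-video pipeline is assumed here; the pipeline class, checkpoint path, and call arguments are illustrative assumptions, not the authors' released code.

import torch
from diffusers import TextToVideoSDPipeline

# Hypothetical checkpoint fine-tuned with both the appearance identifier "X*"
# and the motion identifier "V*".
pipe = TextToVideoSDPipeline.from_pretrained(
    "path/to/customized-text-to-video-checkpoint",
    torch_dtype=torch.float16,
).to("cuda")

# Both learned identifiers can be combined in a single prompt.
prompt = "The X* man doing the V* sign."
frames = pipe(prompt, num_frames=16, num_inference_steps=50).frames[0]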




BibTeX

@inproceedings{materzynska2024newmove,
  title={NewMove: Customizing text-to-video models with novel motions},
  author={Materzy{\'n}ska, Joanna and Sivic, Josef and Shechtman, Eli and Torralba, Antonio and Zhang, Richard and Russell, Bryan},
  booktitle={Proceedings of the Asian Conference on Computer Vision},
  pages={1634--1651},
  year={2024}
}

We would like to thank Kabir Swain for helpful discussions.

Website source based on this source code.