Less is More: Data-Efficient Adaptation for
Controllable Text-to-Video Generation

Shihan Cheng1    Nilesh Kulkarni2    David Hyde1    Dmitriy Smirnov2

1Vanderbilt University    2Netflix

CVPR 2026

Teaser figure showing controllable generation results

Our method enables fine-grained, continuous control over physical camera parameters — shutter speed, aperture, and color temperature — in text-to-video generation, learned from sparse synthetic data alone.


Abstract

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, but it also yields superior results to models fine-tuned on photorealistic “real” data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.


Method

Pipeline overview figure

Overview of our controllable generation pipeline. To achieve decoupled control, we encode the scalar condition separately from the text guidance via a parallel cross-attention module. During training, we optimize the conditional adapter while actively updating the backbone by injecting LoRA layers into all DiT blocks. During inference, we discard the LoRA weights from the shallow two-thirds of the transformer blocks, retaining only the conditional adapter and backbone LoRA in the deepest third of the blocks. This selective retention enables high-fidelity physical control while minimizing semantic corruption of the backbone.
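The selective LoRA retention described above can be sketched as a simple index rule: at inference, only the deepest third of the transformer blocks keep their LoRA weights, while shallower blocks revert to the frozen backbone. The helper below is an illustrative sketch, not the actual implementation; the block count and the retention fraction are parameters, with 1/3 following the "deepest third" rule stated above.

```python
def blocks_retaining_lora(num_blocks: int, keep_fraction: float = 1 / 3) -> list[int]:
    """Return indices of the deepest transformer blocks that keep their
    LoRA weights at inference; all shallower blocks fall back to the
    frozen backbone weights. keep_fraction = 1/3 mirrors the paper's
    "deepest third" rule; num_blocks depends on the DiT backbone used.
    """
    num_keep = round(num_blocks * keep_fraction)
    return list(range(num_blocks - num_keep, num_blocks))
```

For a hypothetical 30-block DiT, this keeps LoRA in blocks 20 through 29 and discards it everywhere else, which is what enables physical control while minimizing semantic corruption of the backbone.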

Training Data

Each property is trained on only 150 synthetic samples — low-fidelity videos or images spanning the full control range [−1, 1]. The examples below are a few representative samples illustrating what this training data looks like. Despite the minimal and low-quality nature of this data, the model learns robust, generalizable control.
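Before training, each physical parameter must be mapped onto the normalized control range [−1, 1]. The exact mapping is not specified here, so the sketch below is one plausible choice: a log-scale normalization, which suits parameters such as shutter speed whose perceptual effect is roughly multiplicative. The bounds `lo` and `hi` are assumed sampling limits, not values from the paper.

```python
import math

def normalize_log_scale(value: float, lo: float, hi: float) -> float:
    """Map a physical camera parameter to the [-1, 1] control range
    on a log scale. `lo` and `hi` are the assumed bounds of the
    sampled range; values at the bounds map to -1 and 1 exactly.
    """
    t = (math.log(value) - math.log(lo)) / (math.log(hi) - math.log(lo))
    return 2.0 * t - 1.0
```

With `lo=1` and `hi=100`, a value of 10 (the geometric midpoint) maps to 0.0, and the endpoints map to −1.0 and 1.0.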

Representative training samples for each property — Shutter, Aperture, and Temperature — shown in order from the Low to the High end of the control range.

Results Gallery


Citation

@inproceedings{cheng2026lessismore,
  title     = {Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation},
  author    = {Cheng, Shihan and Kulkarni, Nilesh and Hyde, David and Smirnov, Dmitriy},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

This work was conducted during an internship at Netflix. It builds on diffusion-pipe by tdrussell and the Wan2.1 backbone by the Wan team.