With recent advances in video prediction, controllable video generation has attracted increasing attention. Generating high-fidelity videos from simple, flexible conditioning signals is of particular interest. To this end, we propose a controllable video generation model that uses pixel-level renderings of 2D or 3D bounding boxes as conditioning. We also introduce a bounding-box predictor that, given the bounding boxes of the initial and final frames, predicts up to 15 bounding boxes per frame for every frame in a 25-frame clip. We evaluate our approach on three well-known autonomous-vehicle (AV) video datasets: KITTI, Virtual KITTI 2, and BDD100K.
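As a concrete illustration of the pixel-level conditioning, the sketch below rasterizes a set of 2D boxes into an RGB bounding-box frame, giving each track ID a fixed colour. The colour-per-ID scheme, the 320×512 frame size, and the function names are illustrative assumptions, not the paper's exact encoding.

```python
# Illustrative sketch (not the paper's exact encoding): render per-object
# 2D bounding boxes into an RGB "bounding-box frame" used as pixel-level
# conditioning for the video generator.
import numpy as np

def render_bbox_frame(boxes, ids, h=320, w=512, max_ids=15):
    """boxes: list of (x1, y1, x2, y2) in pixels; ids: matching track IDs."""
    frame = np.zeros((h, w, 3), dtype=np.uint8)
    rng = np.random.default_rng(0)                     # fixed seed -> same palette every frame
    palette = rng.integers(64, 256, size=(max_ids, 3), dtype=np.uint8)
    for (x1, y1, x2, y2), tid in zip(boxes, ids):
        x1, y1 = max(0, int(x1)), max(0, int(y1))      # clip box to the frame
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        frame[y1:y2, x1:x2] = palette[tid % max_ids]   # filled box in the ID's colour
    return frame

# two tracked objects rendered into one conditioning frame
frame = render_bbox_frame([(40, 120, 160, 240), (300, 100, 420, 200)], [0, 1])
```

Keeping the palette fixed across frames lets the generator associate a colour with one object identity throughout the clip.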
We showcase a range of BDD100K generation results produced by our model across diverse scenes, including urban streets, highways, busy intersections, and nighttime driving. Each visualization displays a 25-frame clip at 5 fps. The left column shows the ground-truth clip, while the right column shows Ctrl-V generations, produced by Box2Video from the BBox predictor's predicted bounding-box frames. Inputs: ONE initial GT frame + THREE initial GT 2D-bounding-box frames + ONE last GT 2D-bounding-box frame.
Our model can predict, and condition on, both 2D and 3D bounding boxes. Below are our 3D-bounding-box frame predictions and the video generations conditioned on them, for the KITTI and vKITTI2 datasets. Inputs: ONE initial GT frame + THREE initial GT 3D-bounding-box frames + ONE last GT 3D-bounding-box frame. Left: bounding-box frame predictions. Right: video generated from the predicted bounding-box frames. These examples are resized from 375×1242 to 320×512.
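Rendering a 3D bounding box as a pixel-level conditioning frame requires projecting its corners onto the image plane first. The sketch below shows a standard pinhole projection of the eight corners of a camera-frame box; the intrinsics values and the cube placement are illustrative assumptions, not the dataset's actual calibration.

```python
# Illustrative sketch: project the 8 corners of a camera-frame 3D box
# onto the image plane with a pinhole camera model, the first step in
# rasterizing a 3D bounding box into a 2D conditioning frame.
import numpy as np

def project_corners(corners_3d, K):
    """corners_3d: (8, 3) points in the camera frame; K: (3, 3) intrinsics."""
    pts = corners_3d @ K.T               # homogeneous image coordinates, shape (8, 3)
    return pts[:, :2] / pts[:, 2:3]      # perspective divide -> (8, 2) pixel coords

# KITTI-like intrinsics (illustrative values, not real calibration)
K = np.array([[721.5,   0.0, 621.0],
              [  0.0, 721.5, 187.5],
              [  0.0,   0.0,   1.0]])

# axis-aligned 2 m cube centred 10 m in front of the camera
x, y, z = np.meshgrid([-1.0, 1.0], [-1.0, 1.0], [9.0, 11.0])
corners = np.stack([x.ravel(), y.ravel(), z.ravel()], axis=1)
uv = project_corners(corners, K)
```

Connecting the projected corners with the box's 12 edges then yields the wireframe (or filled faces) drawn into the 3D-bounding-box frame.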
A frame-by-frame visualization of Box2Video generation, conditioned on a ground-truth 2D-bounding-box frame sequence from the BDD100K dataset. The ground-truth bounding boxes are overlaid on the frames.
Generation visualization on BDD100K when our BBox predictor receives a trajectory frame as the final conditioning frame instead of a 2D-bounding-box frame.
@misc{luo2024ctrlv,
title={Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion},
author={Ge Ya Luo and Zhi Hao Luo and Anthony Gosselin and Alexia Jolicoeur-Martineau and Christopher Pal},
year={2024},
eprint={2406.05630},
archivePrefix={arXiv},
primaryClass={cs.CV}
}