With recent advances in video prediction, controllable video generation has attracted increasing attention. Generating high-fidelity videos from simple, flexible conditioning signals is of particular interest. To this end, we propose a controllable video generation model that uses pixel-level renderings of 2D or 3D bounding boxes as conditioning. We also introduce a bounding-box predictor that, given the bounding boxes of the initial and final frames, predicts up to 15 bounding boxes per frame for every frame in a 25-frame clip. We evaluate our approach on three well-known autonomous-vehicle (AV) video datasets: KITTI, Virtual KITTI 2, and BDD100K.
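As a concrete illustration of the pixel-level conditioning, the sketch below rasterizes a set of 2D boxes into an RGB bounding-box frame, giving each track ID a fixed colour. The colour-per-ID scheme, the 320×512 frame size, and the function names are illustrative assumptions, not the paper's exact encoding.

```python
# Illustrative sketch (not the paper's exact encoding): render per-object
# 2D bounding boxes into an RGB "bounding-box frame" used as pixel-level
# conditioning for the video generator.
import numpy as np

def render_bbox_frame(boxes, ids, h=320, w=512, max_ids=15):
    """boxes: list of (x1, y1, x2, y2) in pixels; ids: matching track IDs."""
    frame = np.zeros((h, w, 3), dtype=np.uint8)
    rng = np.random.default_rng(0)                     # fixed seed -> same palette every frame
    palette = rng.integers(64, 256, size=(max_ids, 3), dtype=np.uint8)
    for (x1, y1, x2, y2), tid in zip(boxes, ids):
        x1, y1 = max(0, int(x1)), max(0, int(y1))      # clip box to the frame
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        frame[y1:y2, x1:x2] = palette[tid % max_ids]   # filled box in the ID's colour
    return frame

# two tracked objects rendered into one conditioning frame
frame = render_bbox_frame([(40, 120, 160, 240), (300, 100, 420, 200)], [0, 1])
```

Keeping the palette fixed across frames lets the generator associate a colour with one object identity throughout the clip.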
We showcase a range of BDD100K generation results produced by our model across diverse scenes, including urban streets, highways, busy intersections, and nighttime driving. Each visualization displays a 25-frame clip at 5 fps. The left column shows the ground-truth clip, while the right column shows Ctrl-V generations, produced by Box2Video from the BBox predictor's predicted bounding-box frames. Inputs: ONE initial GT frame + THREE initial GT 2D-bounding-box frames + ONE last GT 2D-bounding-box frame.
Our model can predict, and condition on, both 2D and 3D bounding boxes. Below are our 3D-bounding-box frame predictions and the video generations conditioned on them, for the KITTI and vKITTI2 datasets. Inputs: ONE initial GT frame + THREE initial GT 3D-bounding-box frames + ONE last GT 3D-bounding-box frame. Left: bounding-box frame predictions. Right: video generated from the predicted bounding-box frames. These examples are resized from 375×1242 to 320×512.
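Rendering a 3D bounding box as a pixel-level conditioning frame requires projecting its corners onto the image plane first. The sketch below shows a standard pinhole projection of the eight corners of a camera-frame box; the intrinsics values and the cube placement are illustrative assumptions, not the dataset's actual calibration.

```python
# Illustrative sketch: project the 8 corners of a camera-frame 3D box
# onto the image plane with a pinhole camera model, the first step in
# rasterizing a 3D bounding box into a 2D conditioning frame.
import numpy as np

def project_corners(corners_3d, K):
    """corners_3d: (8, 3) points in the camera frame; K: (3, 3) intrinsics."""
    pts = corners_3d @ K.T               # homogeneous image coordinates, shape (8, 3)
    return pts[:, :2] / pts[:, 2:3]      # perspective divide -> (8, 2) pixel coords

# KITTI-like intrinsics (illustrative values, not real calibration)
K = np.array([[721.5,   0.0, 621.0],
              [  0.0, 721.5, 187.5],
              [  0.0,   0.0,   1.0]])

# axis-aligned 2 m cube centred 10 m in front of the camera
x, y, z = np.meshgrid([-1.0, 1.0], [-1.0, 1.0], [9.0, 11.0])
corners = np.stack([x.ravel(), y.ravel(), z.ravel()], axis=1)
uv = project_corners(corners, K)
```

Connecting the projected corners with the box's 12 edges then yields the wireframe (or filled faces) drawn into the 3D-bounding-box frame.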
A frame-by-frame visualization of Box2Video generation, conditioned on a ground-truth 2D-bounding-box frame sequence from the BDD100K dataset. The ground-truth bounding boxes are overlaid on the frames.
Generation visualization on BDD100K when our BBox predictor receives a trajectory frame as the final conditioning frame instead of a 2D-bounding-box frame.
@misc{luo2024ctrlv,
title={Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion},
author={Ge Ya Luo and Zhi Hao Luo and Anthony Gosselin and Alexia Jolicoeur-Martineau and Christopher Pal},
year={2024},
eprint={2406.05630},
archivePrefix={arXiv},
primaryClass={cs.CV}
}