Yuqing Wen1*†,
Yucheng Zhao2*,
Yingfei Liu2*,
Fan Jia2,
Yanhui Wang1,
Chong Luo1,
Chi Zhang3,
Tiancai Wang2‡,
Xiaoyan Sun1‡,
Xiangyu Zhang2
1University of Science and Technology of China, 2MEGVII Technology, 3Mach Drive
*Equal Contribution, †This work was done during the internship at MEGVII, ‡Corresponding Author.
Overview of Panacea. (a) The diffusion training process of Panacea, enabled by a diffusion encoder and decoder with the decomposed 4D attention module. (b) The decomposed 4D attention module comprises three components: intra-view attention for spatial processing within individual views, cross-view attention to engage with adjacent views, and cross-frame attention for temporal processing. (c) The controllable module integrates diverse signals: image conditions are derived from a frozen VAE encoder and combined with the diffused noise, text prompts are processed by a frozen CLIP encoder, and BEV sequences are handled via ControlNet. (d) Details of the BEV layout sequences, including projected bounding boxes, object depths, road maps, and camera poses.
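To make the three components of the decomposed 4D attention more concrete, the following Python sketch lists, for a query token in frame `t` and view `v`, which (frame, view) images each component would read keys and values from. It is an illustration only, not the authors' implementation: the function name `attention_targets` is hypothetical, and two details are assumptions not stated above, namely that the cameras form a closed ring (so view indices wrap around) and that cross-frame attention covers the same view in every other frame.

```python
from typing import Dict, List, Tuple

FrameView = Tuple[int, int]  # (frame index, view index)


def attention_targets(
    t: int, v: int, num_frames: int, num_views: int
) -> Dict[str, List[FrameView]]:
    """Return the (frame, view) images a query at (t, v) attends to.

    Hypothetical helper for illustration; it only tracks which images
    are involved, not the attention computation itself.
    """
    # Intra-view: spatial attention among tokens of the same image.
    intra_view = [(t, v)]

    # Cross-view: same frame, left and right neighbouring views.
    # Assumption: cameras form a ring, so indices wrap around.
    neighbours = {(v - 1) % num_views, (v + 1) % num_views} - {v}
    cross_view = [(t, u) for u in sorted(neighbours)]

    # Cross-frame: same view at other timesteps.
    # Assumption: every other frame of the clip is visible.
    cross_frame = [(s, v) for s in range(num_frames) if s != t]

    return {
        "intra_view": intra_view,
        "cross_view": cross_view,
        "cross_frame": cross_frame,
    }


if __name__ == "__main__":
    # Example: 3 frames, 6 surround-view cameras, query at frame 0, view 0.
    for name, targets in attention_targets(0, 0, 3, 6).items():
        print(name, targets)
```

Under these assumptions, each component touches only a handful of images per query rather than all frames and views jointly, which is the practical appeal of decomposing the attention.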
The two-stage inference pipeline of Panacea. The first stage creates multi-view images from BEV layouts; the second stage uses these images, along with the subsequent BEV layouts, to generate the following frames.
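The control flow of the two stages can be sketched in a few lines of Python. This is a minimal sketch, not Panacea's actual API: `two_stage_inference`, `image_stage`, and `video_stage` are hypothetical names, the two stages are passed in as plain callables, and the assumption that the first stage consumes only the first BEV layout is ours.

```python
from typing import Any, Callable, List, Sequence


def two_stage_inference(
    bev_layouts: Sequence[Any],
    image_stage: Callable[[Any], Any],
    video_stage: Callable[[Any, Sequence[Any]], Sequence[Any]],
) -> List[Any]:
    """Run a two-stage generation loop over a BEV layout sequence.

    image_stage: maps one BEV layout to a set of multi-view images.
    video_stage: maps (conditioning images, later BEV layouts) to the
                 multi-view images of the following frames.
    Both callables are placeholders for the real generative models.
    """
    if not bev_layouts:
        raise ValueError("at least one BEV layout is required")

    # Stage 1: multi-view images from the first BEV layout (assumption).
    first_images = image_stage(bev_layouts[0])

    # Stage 2: following frames, conditioned on the stage-1 images
    # and on the subsequent BEV layouts.
    following = video_stage(first_images, bev_layouts[1:])

    return [first_images] + list(following)


if __name__ == "__main__":
    # Stub stages that only record what they were conditioned on.
    frames = two_stage_inference(
        ["bev0", "bev1", "bev2"],
        image_stage=lambda layout: f"images({layout})",
        video_stage=lambda imgs, layouts: [f"{imgs}+{l}" for l in layouts],
    )
    print(frames)
```

Passing the stages as callables keeps the sketch independent of any particular model code; in practice each would wrap a diffusion sampling loop.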
Controllable multi-view video generation. Panacea generates realistic, controllable videos with good temporal and cross-view consistency.
Video generation with variable attribute controls, such as weather, time, and scene, which allows Panacea to simulate a variety of rare driving scenarios, including extreme weather conditions such as rain and snow, thereby greatly enhancing the diversity of the data.
(a) Panoramic video generation from a BEV (Bird's-Eye-View) layout sequence makes it possible to build a synthetic video dataset, which benefits perception tasks. (b) Generating panoramic videos from conditional images and BEV layouts can elevate image-only datasets to video datasets, enabling the advancement of video-based perception techniques.
@misc{wen2023panacea,
title={Panacea: Panoramic and Controllable Video Generation for Autonomous Driving},
author={Yuqing Wen and Yucheng Zhao and Yingfei Liu and Fan Jia and Yanhui Wang and Chong Luo and Chi Zhang and Tiancai Wang and Xiaoyan Sun and Xiangyu Zhang},
year={2023},
eprint={2311.16813},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Feel free to contact us at wenyuqing AT mail.ustc.edu.cn or wangtiancai AT megvii.com