Sony Group Corporation
Paper (arXiv) Paper (GitHub) Code (coming soon) Supplementary
We propose Instruct 3D-to-3D, a high-quality 3D-to-3D conversion method. It is designed for a novel task: converting a given 3D scene into another scene according to text instructions. Instruct 3D-to-3D applies pretrained image-to-image diffusion models to 3D-to-3D conversion, which enables likelihood maximization for each viewpoint image and high-quality 3D generation. In addition, our method explicitly takes the source 3D scene as a condition, which enhances 3D consistency and gives control over how much of the source scene's structure is retained. We also propose dynamic scaling, which allows the intensity of the geometry transformation to be adjusted. In quantitative and qualitative evaluations, our method achieves higher-quality 3D-to-3D conversions than baseline methods.
First, the target model is initialized with the source model (i). Next, the target image is rendered from a random camera viewpoint (ii), and noise is added to it before it is input to InstructPix2Pix. The source image is rendered from the same viewpoint (iii) and input to InstructPix2Pix as a condition, along with the text instruction (iv). The gradient of the loss function is calculated from these inputs (v), and the target model is updated with it. By repeating this procedure from various camera viewpoints, we convert the target model according to the text instruction.
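The loop in steps (i) to (v) can be sketched in a toy setting. This is only an illustration under strong assumptions, not the authors' implementation: the scene is a small parameter vector, `render` is a hypothetical linear projection standing in for a NeRF renderer, `toy_noise_predictor` is a hypothetical stand-in for InstructPix2Pix, and the text instruction is replaced by a hypothetical `edit` function on images. The update assumes a score-distillation-style gradient (predicted noise minus injected noise, pushed back through the renderer), which the description above does not spell out.

```python
import numpy as np


def render(scene, view):
    """Toy renderer (assumption): a linear projection of scene parameters to pixels."""
    return view @ scene


def toy_noise_predictor(noisy, source_image, edit, alpha):
    """Hypothetical stand-in for InstructPix2Pix.

    It assumes the clean edited image is edit(source_image) and returns the
    noise that would explain `noisy` under that assumption.
    """
    goal = edit(source_image)
    return (noisy - np.sqrt(alpha) * goal) / np.sqrt(1.0 - alpha)


def convert(source_scene, edit, n_pixels=8, steps=500, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # (i) Initialize the target model with the source model.
    target_scene = source_scene.copy()
    for _ in range(steps):
        # Random camera viewpoint and noise level for this iteration.
        view = rng.normal(size=(n_pixels, source_scene.size)) / np.sqrt(n_pixels)
        alpha = rng.uniform(0.2, 0.8)
        # (ii) Render the target image and add noise to it.
        target_image = render(target_scene, view)
        noise = rng.normal(size=n_pixels)
        noisy = np.sqrt(alpha) * target_image + np.sqrt(1.0 - alpha) * noise
        # (iii) Render the source image from the same viewpoint.
        source_image = render(source_scene, view)
        # (iv) Condition the predictor on the source image and the instruction.
        predicted = toy_noise_predictor(noisy, source_image, edit, alpha)
        # (v) Push the noise residual back through the linear renderer and update.
        grad = view.T @ (predicted - noise)
        target_scene = target_scene - lr * grad
    return target_scene


source = np.array([1.0, -2.0, 0.5, 3.0])
converted = convert(source, edit=lambda image: 0.5 * image)
```

In this toy setup the "instruction" halves every rendered image, so the converted scene should approach half of the source scene while the source itself stays untouched. A real pipeline would replace the linear renderer with a differentiable 3D representation and the stand-in predictor with the pretrained diffusion model.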
Instruct 3D-to-3D accurately preserves the information of the source scene during 3D-to-3D conversion. The following are comparisons of scenes before and after conversion.