Creating realistic 3D models for applications such as virtual reality, filmmaking, and engineering design can be a tedious process that requires a lot of manual trial and error.
Generative artificial intelligence models can streamline the artistic process by letting creators produce lifelike 2D images from text prompts, but these models are not designed to generate 3D shapes. To bridge this gap, a recently developed technique called score distillation leverages 2D image generation models to create 3D shapes, but its output often ends up blurry or cartoonish.
MIT researchers investigated the relationships and differences between the algorithms used to generate 2D images and 3D shapes, identifying the root cause of lower-quality 3D models. From there, they crafted a simple fix to score distillation that enables the generation of sharp, high-quality 3D shapes closer in quality to the best model-generated 2D images.
Some other methods attempt to solve this problem by retraining or fine-tuning the generative AI model, but this can be expensive and time-consuming.
In contrast, the MIT researchers’ method achieves 3D shape quality comparable to or better than these approaches without the need for additional training or complex post-processing.
Additionally, by identifying the cause of the problem, the researchers have improved the mathematical understanding of score distillation and related techniques, enabling future work to further boost performance.
“Now we know where to head, which lets us find more efficient solutions that are faster and higher quality,” says Artem Lukoianov, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on the technique. “In the long run, our work can help facilitate the process so designers have a co-pilot, making it easier to create more realistic 3D shapes.”
Lukoianov’s co-authors are Haitz Sáez de Ocáriz Borde, a graduate student at the University of Oxford; Kristjan Greenewald, a research scientist at the MIT-IBM Watson AI Lab; Vitor Campagnolo Guizilini, a scientist at the Toyota Research Institute; Timur Bagautdinov, a researcher at Meta; and senior authors Vincent Sitzmann, an MIT EECS assistant professor who leads the Scene Representation Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Justin Solomon, an EECS associate professor and leader of the CSAIL Geometric Data Processing Group. The research will be presented at the Conference on Neural Information Processing Systems.
From 2D images to 3D shapes
Diffusion models, such as DALL-E, are a type of generative AI model that can generate lifelike images from random noise. To train these models, researchers add noise to images and teach the models to remove the noise by reversing the process. The model uses this learned “denoising” process to create images based on the user’s text prompts.
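The add-noise-then-learn-to-remove-it idea can be sketched in a few lines. Everything below is an illustrative assumption (a toy noise schedule and random arrays standing in for images), not any model's actual code; it only shows the forward noising step that a denoiser network would be trained to reverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noise schedule (an illustrative assumption): alphas_cumprod falls
# from ~1 (almost clean) to ~0 (almost pure noise) over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(image, noise, t):
    """Forward diffusion: mix a clean image with Gaussian noise at step t."""
    return (np.sqrt(alphas_cumprod[t]) * image
            + np.sqrt(1.0 - alphas_cumprod[t]) * noise)

image = rng.standard_normal((8, 8))   # stand-in for a real training image
noise = rng.standard_normal((8, 8))
noisy = add_noise(image, noise, t=500)

# A denoiser network eps_theta(noisy, t) would be trained to predict
# `noise` from `noisy`, typically by minimizing ||eps_theta(noisy, t) - noise||^2;
# generation then runs that learned denoising in reverse, from pure noise.
```

Running the denoising chain backward from a random array, conditioned on a text prompt, is what produces the final image.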
However, diffusion models perform poorly when asked to directly generate realistic 3D shapes, because there is not enough 3D data to train them on. To get around this problem, researchers in 2022 developed a technique called score distillation sampling (SDS) that uses a pretrained 2D diffusion model to assemble 2D images into a 3D representation.
The technique starts with a random 3D representation, renders a 2D view of the desired object from a random camera angle, adds noise to that image, denoises it with the diffusion model, and then optimizes the random 3D representation so it matches the denoised image. These steps repeat until the desired 3D object is generated.
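The loop described above can be sketched with toy stand-ins. Everything here is an assumption for illustration: the "3D representation" is a flat vector, the "renderer" is a single multiplication, and the "diffusion model" is a dummy denoiser that pulls noisy images toward a fixed target. A real system would use a NeRF-style representation and a pretrained 2D diffusion model.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = rng.standard_normal(16)     # random initial "3D representation"
target = np.ones(16)                # image the toy "diffusion prior" prefers

def render(theta, camera):
    return theta * camera           # placeholder for a differentiable renderer

def predict_noise(noisy, t):
    return noisy - target           # dummy denoiser; a real model depends on t

for step in range(200):             # the SDS optimization loop
    camera = 1.0                    # a real setup samples a random camera pose
    image = render(theta, camera)   # render a 2D view of the current shape
    t = rng.integers(1, 1000)       # random diffusion timestep
    eps = rng.standard_normal(16)   # noise added to the rendered view
    noisy = image + eps             # (noise-schedule scaling omitted for clarity)
    eps_pred = predict_noise(noisy, t)
    grad = eps_pred - eps           # SDS update: no backprop through the denoiser
    theta -= 0.05 * grad * camera   # chain rule through the toy renderer
```

In this toy, the updates steer `theta` toward the image the dummy prior prefers; in the real technique, the same loop run over many random camera views carves out a full 3D shape.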
However, 3D shapes created this way tend to look blurry or oversaturated.
“This has been a bottleneck for a while. We knew the underlying model could perform better, but people didn’t know why 3D shapes were coming out this way,” says Lukoianov.
The MIT researchers examined the steps of SDS and identified a mismatch between a formula that forms a key part of the process and its counterpart in 2D diffusion models. The formula tells the model how to update the random representation by adding and removing noise, one step at a time, so it looks more like the desired image.
Part of this formula involves an equation that is too complex to solve efficiently, so SDS replaces it with randomly sampled noise at each step. The MIT researchers found that this noise is what causes 3D shapes to look blurry or cartoonish.
An approximate answer
Instead of trying to solve this cumbersome formula exactly, the researchers tested approximation techniques until they identified the best one. Rather than randomly sampling the noise term, their technique infers the missing term from the current rendering of the 3D shape.
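The difference between sampling the noise term and inferring it can be seen in a toy comparison. The denoiser and the inference rule below are placeholders (the real method recovers the term by inverting the diffusion model's sampling process): random noise makes each gradient estimate scatter, and averaging scattered estimates over the optimization washes out detail, while an inferred term yields one consistent gradient.

```python
import numpy as np

rng = np.random.default_rng(2)

target = np.ones(16)                     # image the toy "prior" prefers
image = rng.standard_normal(16)          # current render of the 3D shape

def predict_noise(noisy):
    return 0.5 * (noisy - target)        # dummy denoiser (placeholder)

# Standard SDS: a fresh random noise term at every step, so the
# gradient estimates scatter around their mean.
grads_random = []
for _ in range(500):
    eps = rng.standard_normal(16)
    grads_random.append(predict_noise(image + eps) - eps)
grads_random = np.stack(grads_random)

# The researchers' fix, sketched: infer the noise term from the current
# rendering instead of sampling it, giving one deterministic gradient.
eps_inferred = predict_noise(image)      # placeholder inference rule
grad_fixed = predict_noise(image + eps_inferred) - eps_inferred
```

Recomputing `grad_fixed` always gives the same answer, while `grads_random` has substantial per-step spread; the consistent signal is what lets the optimization preserve sharp detail.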
“Doing this produces sharp, realistic-looking 3D shapes, as the paper’s analysis predicts,” he says.
Additionally, the researchers increased the image rendering resolution and adjusted several model parameters to further improve the quality of the 3D shapes.
Ultimately, they were able to create smooth, realistic-looking 3D shapes using an off-the-shelf, pretrained image diffusion model, without costly retraining. The resulting 3D objects are as sharp as those created with other methods that rely on ad hoc solutions.
“If you blindly try different parameters, sometimes it works and sometimes it doesn’t, but you don’t know why. We know this is the equation we need to solve, so now we can think of more efficient ways to solve it,” he says.
Because their method relies on a pre-trained diffusion model, it inherits that model’s biases and shortcomings and is prone to hallucinations and other failures. Improving the underlying diffusion model would enhance the process.
In addition to studying the formula and considering how to solve it more effectively, the researchers are interested in exploring how these insights can improve image editing techniques.
Funding for this research was provided, in part, by the Toyota Research Institute, the U.S. National Science Foundation, the Singapore Defense Science and Technology Agency, the U.S. Intelligence Advanced Research Projects Activity, the Amazon Science Hub, IBM, the U.S. Army Research Office, the CSAIL Future of Data program, the Wistron Corporation, and the MIT-IBM Watson AI Laboratory.