In the current era of artificial intelligence, computers can generate their own “art” through diffusion models, which repeatedly add structure to a noisy initial state until a clear image or video emerges. Diffusion models have suddenly earned a seat at everyone’s table: type in a few words and you get a dreamlike, dopamine-spiking moment where reality and fantasy intersect. Behind the scenes, however, a complex and time-consuming process takes place, requiring the algorithm to iterate over and over again to perfect the image.
Researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a new framework that simplifies the multi-step process of traditional diffusion models into a single step, addressing previous limitations. This is done through a kind of teacher-student model: a new computer model is taught to mimic the behavior of the more complex original model that produces the image. The approach, known as distribution matching distillation (DMD), preserves the quality of the generated images while allowing much faster generation.
“Our work is a new way to accelerate current diffusion models, such as Stable Diffusion and DALL-E 3, by a factor of 30,” says Tianwei Yin, an MIT doctoral student in electrical engineering and computer science, CSAIL affiliate, and lead researcher on the DMD framework. “This advancement not only significantly reduces computation time, but also retains, if not surpasses, the quality of the generated visual content. In theory, the approach marries the principles of generative adversarial networks (GANs) with those of diffusion models, achieving visual content generation in a single step, as opposed to the hundred steps of iterative refinement required by current diffusion models. It could potentially be a new generative modeling method with greater speed and quality.”
This single-step diffusion model could enhance design tools, enabling quicker content creation and potentially supporting advances in drug discovery and 3D modeling, where promptness and efficacy are key.
Distribution dreams
DMD cleverly has two components. First, it uses a regression loss, which anchors the mapping and coarsely organizes the image space, making training more stable. Second, it uses a distribution matching loss, which ensures that the probability of generating a particular image with the student model corresponds to its frequency of occurrence in the real world. To do this, it leverages two diffusion models that act as guides, helping the system understand the difference between real and generated images and making it possible to train the fast one-step generator.
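To make the two components concrete, here is a minimal, illustrative PyTorch sketch of how such a training step might be wired together. The names (OneStepGenerator, dmd_training_step, dist_matching_grad) are hypothetical stand-ins rather than the authors’ code, and the tiny network is only a placeholder for a real generator backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneStepGenerator(nn.Module):
    """Toy stand-in for the one-step student: maps noise directly to an image."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, noise):
        return self.net(noise)

def dmd_training_step(student, noise, teacher_sample, dist_matching_grad):
    """One student update combining the two DMD components described above."""
    fake_image = student(noise)

    # (1) Regression loss: tie the student's noise-to-image mapping to
    #     precomputed teacher samples, keeping the image space coarsely
    #     organized and training stable.
    regression_loss = F.mse_loss(fake_image, teacher_sample)

    # (2) Distribution matching loss: apply an externally estimated gradient
    #     (computed with the two guiding diffusion models) so that images
    #     common in the real world become common in the student's outputs.
    dist_matching_loss = (fake_image * dist_matching_grad.detach()).mean()

    return regression_loss + dist_matching_loss

# Example usage with random tensors standing in for real data:
student = OneStepGenerator()
noise = torch.randn(4, 3, 32, 32)
teacher_sample = torch.randn(4, 3, 32, 32)      # would come from the teacher
dist_matching_grad = torch.randn(4, 3, 32, 32)  # would come from the two guides
loss = dmd_training_step(student, noise, teacher_sample, dist_matching_grad)
loss.backward()
```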
The system achieves faster generation by training a new network to minimize the distributional divergence between its generated images and the images in the training dataset used by traditional diffusion models. “Our key insight is to approximate gradients that guide the improvement of the new model using two diffusion models,” says Yin. “In this way, we distill the knowledge of the original, more complex model into the simpler, faster one, while avoiding the notorious instability and mode-collapse problems of GANs.”
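The gradient approximation Yin describes might look roughly like the sketch below: one diffusion model trained on real data and one that tracks the student’s own outputs are queried on a noised version of a generated image, and the difference between their denoising directions supplies the update signal. The function name, the (noisy_image, noise_level) calling convention, and the fixed noise level are all assumptions made for illustration, not details from the paper.

```python
import torch

@torch.no_grad()
def estimate_dist_matching_grad(fake_image, real_denoiser, fake_denoiser,
                                noise_level=0.5):
    """Approximate the gradient that pulls generated images toward real ones.

    real_denoiser: diffusion model trained on real images (the frozen teacher).
    fake_denoiser: diffusion model continually fine-tuned on the student's
                   own outputs, so it tracks the generated distribution.
    Both are assumed to take (noisy_image, noise_level) and return a denoised
    estimate of the clean image.
    """
    # Diffuse the generated image so both guides see the kind of noisy input
    # they were trained on.
    noisy = fake_image + noise_level * torch.randn_like(fake_image)

    denoised_real = real_denoiser(noisy, noise_level)
    denoised_fake = fake_denoiser(noisy, noise_level)

    # The difference between the two denoising directions approximates the
    # gradient of the divergence between generated and real distributions;
    # descending it nudges the student's outputs toward the real manifold.
    return denoised_fake - denoised_real
```

Plugged into the earlier sketch, the returned tensor would be passed in as dist_matching_grad for the student’s update.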
Yin and colleagues simplified the process by initializing the new student model with the parameters of a pre-trained network. Copying those parameters from the original model and fine-tuning them allowed the team to reach fast training convergence, and the new model can generate high-quality images on the same architectural foundation as the original. “This allows us to combine other optimizations based on the original architecture to further accelerate the creation process,” Yin adds.
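A minimal sketch of that initialization strategy, assuming the student shares the teacher’s architecture; DenoiserBackbone is a hypothetical placeholder, not the actual Stable Diffusion network.

```python
import copy
import torch.nn as nn

class DenoiserBackbone(nn.Module):
    """Hypothetical placeholder for the teacher's pretrained architecture."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x):
        return self.layers(x)

teacher = DenoiserBackbone()       # in practice, load pretrained weights here
student = copy.deepcopy(teacher)   # same architecture, same starting parameters

# Equivalent explicit copy if the student is constructed separately:
student.load_state_dict(teacher.state_dict())

# Only the student is fine-tuned into the one-step generator; the teacher is
# frozen and reused as the guide over real images during distillation.
for p in teacher.parameters():
    p.requires_grad_(False)
```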
When tested against conventional methods on a wide range of benchmarks, DMD showed consistent performance. On the common benchmark of generating images for specific ImageNet classes, DMD is the first one-step diffusion technique to produce images nearly on par with those from the original, more complex model, racking up a super-close Fréchet inception distance (FID) score of just 0.3, which is impressive since FID is all about judging the quality and diversity of generated images. Additionally, DMD excels at industrial-scale text-to-image generation, achieving state-of-the-art one-step performance. There is still a slight quality gap on trickier text-to-image applications, suggesting room for improvement down the line.
Moreover, the quality of the images generated with DMD is intrinsically linked to the capabilities of the teacher model used during the distillation process. In its current form, which uses Stable Diffusion v1.5 as the teacher, the student inherits limitations such as rendering detailed text and small faces, suggesting that DMD-generated images could be further enhanced by a more advanced teacher model.
“Reducing the number of iterations has been the holy grail of diffusion models since their inception,” says Fredo Durand, MIT professor of electrical engineering and computer science, CSAIL principal investigator, and a lead author of the paper. “We are very excited to finally enable single-step image generation, which will dramatically reduce compute costs and accelerate the process.”
“Finally, a paper that successfully combines the versatility and high visual quality of diffusion models with the real-time performance of GANs,” says Alexei Efros, a professor of electrical engineering and computer science at the University of California at Berkeley who was not involved in this study. “I look forward to this work opening up exciting possibilities for high-quality, real-time visual editing.”
Yin and Durand’s co-authors include William T. Freeman, MIT professor of electrical engineering and computer science and CSAIL principal investigator; Michaël Gharbi SM ’15, PhD ’18, a research scientist at Adobe; Richard Zhang; Eli Shechtman; and Taesung Park. Their research was supported, in part, by U.S. National Science Foundation grants (including one for the Institute for Artificial Intelligence and Fundamental Interactions), the Singapore Defense Science and Technology Agency, and funding from the Gwangju Institute of Science and Technology and Amazon. The work will be presented at the Conference on Computer Vision and Pattern Recognition in June.