On Tuesday, researchers from Google and Tel Aviv University unveiled GameNGen, a new AI model that can interactively simulate the classic 1993 first-person shooter game Doom. It generates frames in real time using AI image generation techniques borrowed from Stable Diffusion, with the neural network acting as a limited game engine, potentially opening up new possibilities for real-time video game synthesis in the future.
For example, instead of drawing video frames with traditional rendering techniques, future games might use an AI engine to “imagine” or hallucinate graphics in real time as a predictive task.
“There’s tremendous potential here,” app developer Nick Dobos wrote in response to the news. “Why hand-write complex rules for software when an AI can do the thinking for you, pixel by pixel?”
GameNGen is reportedly capable of generating new frames of Doom gameplay on a single Tensor Processing Unit (TPU), a type of specialized GPU-like processor optimized for machine learning tasks, at speeds exceeding 20 frames per second.
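For context, that speed implies a tight sampling budget per frame; a back-of-the-envelope calculation (ours, not from the paper) looks like this:

```python
# Back-of-the-envelope latency math (not from the paper): at 20+ frames per
# second, the entire diffusion sampling pass for one frame has to finish in
# roughly 50 milliseconds or less.
fps = 20
budget_ms = 1000 / fps
print(f"Per-frame budget at {fps} fps: {budget_ms:.0f} ms")  # 50 ms
```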
The researchers say that in tests, 10 human raters were sometimes unable to distinguish between short clips (1.6 seconds and 3.2 seconds long) of real Doom game footage and output generated by GameNGen, identifying the actual gameplay footage only 58 to 60 percent of the time.
Real-time video game synthesis using something called “neural rendering” isn’t an entirely new idea: In an interview in March, Nvidia CEO Jensen Huang predicted, perhaps boldly, that within five to 10 years, most video game graphics could be generated in real time by AI.
GameNGen also builds on previous work in the field, including World Models in 2018, GameGAN in 2020, and Google’s own Genie in March, all of which are cited in the GameNGen paper. And earlier this year, a group of university researchers trained an AI model (called “DIAMOND”) that uses a diffusion model to simulate vintage Atari video games.
And ongoing research into “world models” or “world simulators,” the kind often associated with AI video synthesis models like Runway’s Gen-3 Alpha and OpenAI’s Sora, leans in a similar direction. For example, when Sora debuted, OpenAI released demo videos of its AI generator simulating Minecraft.
Diffusion is key
In a preprint research paper titled “Diffusion Models Are Real-Time Game Engines,” authors Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter explain how GameNGen works: Their system uses a modified version of Stable Diffusion 1.4, an image synthesis diffusion model released in 2022 that is used to create AI-generated imagery.
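For readers unfamiliar with that base model, here is a minimal sketch of loading the unmodified Stable Diffusion 1.4 checkpoint with Hugging Face’s diffusers library. GameNGen replaces the text conditioning with game-specific conditioning, so this only illustrates the off-the-shelf starting point, not their engine.

```python
# Minimal sketch: the off-the-shelf Stable Diffusion 1.4 base model loaded
# via Hugging Face's diffusers library. GameNGen modifies this model, so this
# only shows the standard text-to-image pipeline the work starts from.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# The stock pipeline is conditioned on a text prompt; GameNGen swaps this
# kind of conditioning for past frames and player actions.
image = pipe("a corridor in a retro first-person shooter").images[0]
image.save("sample.png")
```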
“The answer to the question ‘Can it run Doom?’ is yes for diffusion models,” wrote Tanishq Mathew Abraham, research director at Stability AI, who was not involved in the research project.
Guided by player input, the diffusion model predicts the next game state from previous ones after being trained on extensive footage of Doom in action.
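Conceptually, the interactive loop looks something like the sketch below, where `DiffusionNextFramePredictor` is a hypothetical stand-in for the modified model, and the frame size and context length are arbitrary illustrative values rather than the paper’s settings.

```python
# Schematic sketch (not GameNGen's code) of the interactive loop: the model is
# conditioned on a window of previous frames plus the player's actions and
# samples the next frame. DiffusionNextFramePredictor is a hypothetical stub.
from collections import deque
import numpy as np

class DiffusionNextFramePredictor:
    """Placeholder for an action-conditioned frame diffusion model."""
    def sample_next_frame(self, past_frames, past_actions):
        # A real model would run reverse diffusion conditioned on this
        # context; here we just return a blank frame of the right shape.
        return np.zeros((240, 320, 3), dtype=np.uint8)

model = DiffusionNextFramePredictor()
context_frames = deque(maxlen=64)   # sliding window of recent frames
context_actions = deque(maxlen=64)  # matching player inputs

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # initial frame
for step in range(200):
    action = "MOVE_FORWARD"          # would come from the player in real time
    context_frames.append(frame)
    context_actions.append(action)
    # Predict the next game state purely from what came before.
    frame = model.sample_next_frame(list(context_frames), list(context_actions))
```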
The development of GameNGen involved a two-phase training process. The researchers first trained a reinforcement learning agent to play Doom, recording its gameplay sessions to create an automatically generated training dataset (the footage mentioned above), which they then used to train the custom Stable Diffusion model.
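A schematic version of that two-phase recipe, with stub classes standing in for the real agent, game, and model (all names here are illustrative, not GameNGen’s code), might look like this:

```python
# Phase 1: an RL agent plays the game while every (frame, action) pair is
# logged; Phase 2: that logged dataset becomes training data for the
# generative model. All classes below are illustrative stubs.
import random
import numpy as np

class StubDoomEnv:
    """Placeholder game environment returning random frames."""
    def reset(self):
        return np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
    def step(self, action):
        frame = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
        return frame, random.random(), random.random() < 0.01  # obs, reward, done

class StubAgent:
    """Placeholder reinforcement learning policy."""
    ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "FIRE"]
    def act(self, frame):
        return random.choice(self.ACTIONS)

def collect_dataset(env, agent, episodes=3):
    """Phase 1: record gameplay into (frame, action) pairs."""
    dataset = []
    for _ in range(episodes):
        frame, done = env.reset(), False
        while not done:
            action = agent.act(frame)
            next_frame, _, done = env.step(action)
            dataset.append((frame, action))
            frame = next_frame
    return dataset

dataset = collect_dataset(StubDoomEnv(), StubAgent())
# Phase 2 (not shown): train the diffusion model on `dataset` so it learns to
# predict the next frame from previous frames and actions.
print(f"Recorded {len(dataset)} frame/action pairs")
```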
However, using Stable Diffusion does introduce some graphical glitches: “The pre-trained autoencoder of Stable Diffusion v1.4, which compresses 8×8 pixel patches into four latent channels, produces meaningful artifacts when predicting game frames, affecting fine details, especially the bottom bar HUD,” the researchers write in their paper.
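That compression can be seen directly with the standard Stable Diffusion 1.4 VAE: each 8×8 pixel patch becomes a single spatial position with four latent channels, so fine details like the HUD must survive that bottleneck. The sketch below uses a 320×240 frame as an assumed Doom-like resolution, not the paper’s exact setting.

```python
# The standard Stable Diffusion 1.4 VAE downsamples each spatial dimension by
# 8x and outputs 4 latent channels, so an assumed 320x240 frame becomes a
# 4x30x40 latent. Small HUD details must be reconstructed from this bottleneck.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
frame = torch.rand(1, 3, 240, 320) * 2 - 1  # dummy frame scaled to [-1, 1]

with torch.no_grad():
    latents = vae.encode(frame).latent_dist.sample()

print(latents.shape)  # torch.Size([1, 4, 30, 40]): 8x downsampling, 4 channels
```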
The challenges don’t end there: keeping imagery visually sharp and consistent over time (often called “temporal consistency” in the AI video field) can be tricky. “Interactive world simulation is much more than very fast video generation,” the GameNGen researchers write in their paper. “The requirement to be conditioned on a stream of input actions available only during generation breaks several assumptions of existing diffusion model architectures.” Among other things, it requires repeatedly generating new frames based on previous ones (known as “autoregression”), which can lead to instability and a rapid degradation in the quality of the generated world over time.
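A toy calculation (ours, not from the paper) shows why that compounding matters: if every generated frame inherits even a small error from the frame before it, fidelity falls off quickly.

```python
# Toy illustration of autoregressive drift: each generated frame becomes the
# input for the next prediction, so small per-step errors compound instead of
# being corrected by ground truth. The 1% per-step error is arbitrary.
error_per_step = 0.01
quality = 1.0
for step in range(1, 301):
    quality *= (1 - error_per_step)  # each frame inherits the last frame's drift
    if step in (30, 100, 300):
        print(f"after {step} frames: ~{quality:.0%} of original fidelity")
# after 30 frames: ~74%; after 100 frames: ~37%; after 300 frames: ~5%
```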