Last month, Google's GameNGen AI model demonstrated that generalized image diffusion techniques can be used to generate a passable, playable version of Doom. Now, researchers are using similar techniques with a model called MarioVGG to see whether AI can generate plausible video of Super Mario Bros. gameplay in response to user input.
The results of the MarioVGG model, available as a preprint paper published by the crypto-adjacent AI company Virtuals Protocol, still show many obvious glitches, and the model is far too slow to approach real-time gameplay. Even so, the results show that a limited model can infer impressive physics and gameplay dynamics from studying a bit of video and input data.
The researchers hope this represents a first step toward "producing and demonstrating a reliable and controllable video game generator," or perhaps even "replacing game development and game engines completely using video generation models" someday.
Watching 737,000 frames of Mario
The MarioVGG researchers (GitHub users erniechew and Brian Lim are listed as contributors) started with a public dataset of Super Mario Bros. gameplay containing 280 "levels" worth of input and image data arranged for machine learning (level 1-1 was removed from the training data so its images could be used for evaluation). The more than 737,000 individual frames in that dataset were "preprocessed" into chunks of 35 frames so the model could learn what the immediate results of various inputs generally looked like.
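As a rough illustration of that chunking step, here's a minimal Python sketch; the function names and data layout are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch of the chunking step: splitting one level's gameplay
# recording into fixed 35-frame windows paired with the controller inputs
# that produced them.

CHUNK_LEN = 35  # frames per training chunk, per the paper

def chunk_gameplay(frames, actions, chunk_len=CHUNK_LEN):
    """Yield (frame_window, action_window) pairs of fixed length.

    frames:  list of video frames from one playthrough
    actions: list of controller inputs, one per frame
    """
    assert len(frames) == len(actions)
    for start in range(0, len(frames) - chunk_len + 1, chunk_len):
        yield (frames[start:start + chunk_len],
               actions[start:start + chunk_len])
```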
To "simplify the gameplay situation," the researchers decided to focus on just two potential inputs in the dataset: "run right" and "run right and jump." Even this limited set of movements presented some challenges for the machine learning system, though, since the preprocessor had to look backward a few frames before each jump to determine when the "run" began. Jumps that included mid-air adjustments (i.e., the "left" button) also had to be thrown out because they "introduced noise to the training dataset," the researchers write.
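A labeling heuristic along these lines could implement that look-back logic; the button names and rules below are guesses at the idea the paper describes, not the authors' code.

```python
def label_chunk(actions):
    """Label one 35-frame action window, or return None to exclude it.

    `actions` is a list of per-frame button sets, e.g. {"right", "A"}.
    Button names and thresholds are assumptions for illustration.
    """
    jump_frames = [i for i, a in enumerate(actions) if "A" in a]
    if not jump_frames:
        # No jump: keep the chunk only if Mario runs right throughout.
        return "run right" if all("right" in a for a in actions) else None
    # Jumps with mid-air "left" presses "introduced noise to the training
    # dataset," per the researchers, so those chunks are thrown out.
    if any("left" in actions[i] for i in jump_frames):
        return None
    # Look backward from the first jump frame to find where the run began.
    first_jump = jump_frames[0]
    if any("right" in a for a in actions[:first_jump]):
        return "run right and jump"
    return None
```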
After that preprocessing (and roughly 48 hours of training on a single RTX 4090 graphics card), the researchers used a standard convolution and denoising process to generate new video frames from a static game-start image and a text input ("run" or "jump" in this limited case). While these generated sequences last for only a handful of frames, the last frame of one sequence can be used as the first frame of a new one, feasibly creating gameplay videos of any length that still show "coherent and consistent gameplay," according to the researchers.
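That chaining trick works roughly like this; `sample_clip` stands in for the diffusion model's sampling call, and its interface is an assumption, not the paper's API.

```python
def sample_clip(model, seed_frame, action_text):
    """Placeholder for one diffusion sampling pass (assumed interface)."""
    return model.sample(seed_frame, action_text)  # returns a few frames

def generate_gameplay(model, start_frame, action_texts):
    """Chain short generated clips into one arbitrarily long video."""
    video = [start_frame]
    frame = start_frame
    for action in action_texts:      # e.g. ["run", "run", "jump", ...]
        clip = sample_clip(model, frame, action)
        video.extend(clip)
        frame = clip[-1]             # last frame seeds the next clip
    return video
```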
Super Mario 0.5
Even with all that preparation, MarioVGG doesn't produce smooth video that's indistinguishable from the real NES game. For efficiency, the researchers downscaled the output frames from the NES's 256×240 resolution to a much muddier 64×48. They also condensed 35 frames' worth of video time into just seven generated frames distributed "at uniform intervals," creating "gameplay" video that looks much rougher than the real game output.
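In concrete terms, those reductions look something like the sketch below. The nearest-neighbor striding used for the downscale is an assumption; the paper presumably uses a proper image resize.

```python
import numpy as np

SRC_H, SRC_W = 240, 256    # native NES resolution
OUT_H, OUT_W = 48, 64      # MarioVGG's downscaled output
CHUNK_LEN, OUT_FRAMES = 35, 7

def downscale(frame):
    """Crude nearest-neighbor downscale of a (240, 256) frame to (48, 64)."""
    return frame[::SRC_H // OUT_H, ::SRC_W // OUT_W]

def subsample(frames):
    """Keep 7 of the 35 frames, spaced at uniform intervals."""
    idx = np.linspace(0, len(frames) - 1, OUT_FRAMES).astype(int)
    return [downscale(frames[i]) for i in idx]
```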
Even within those limitations, the MarioVGG model still doesn't come close to real-time video generation. On the single RTX 4090 the researchers used, it took six full seconds to generate a six-frame video sequence representing just over half a second of video, even at the model's extremely limited frame rate. The researchers acknowledge that this is "impractical and difficult to use for interactive video games," but they hope future optimizations in weight quantization (and perhaps more computing power) could improve this rate.
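Back-of-the-envelope, that works out to roughly a tenfold gap from real time, assuming each generated clip represents 35 frames of 60 fps NES video (the frame rate is an assumption based on NES hardware):

```python
GEN_SECONDS = 6.0        # wall-clock time to generate one clip
CLIP_SECONDS = 35 / 60   # ~0.58 s of game time per clip (assumed 60 fps)

print(f"~{GEN_SECONDS / CLIP_SECONDS:.0f}x slower than real time")  # ~10x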
Within those constraints, though, MarioVGG can create passably convincing videos of Mario running and jumping from a static starting image, much like Google's Genie game maker. The model was even able to "learn game physics solely from the video frames in the training data, without any explicit hard-coded rules," the researchers write. That includes inferring behaviors like Mario falling when he runs off the edge of a cliff (with believable gravity) and (usually) halting his forward progress when he's adjacent to an obstacle.
While MarioVGG was focused on simulating Mario's movements, the researchers found that the system could effectively hallucinate new obstacles for Mario as the video scrolls through an imagined level. Those obstacles are "consistent with the game's graphical language," the researchers write, but they can't currently be influenced by user prompts (e.g., placing a pit in front of Mario and making him jump over it).
Just make it up
Like all probabilistic AI models, though, MarioVGG has a frustrating tendency to produce completely useless results at times. Sometimes that means simply ignoring user input prompts ("It has been observed that input action text is not always followed," the researchers write). Other times, it means hallucinating obvious visual glitches: Mario sometimes lands inside obstacles, runs through obstacles and enemies, flashes different colors, shrinks or grows from frame to frame, or disappears entirely for multiple frames before popping back into existence.
One particularly absurd video shared by the researchers shows Mario falling through a bridge, turning into a Cheep Cheep, then floating back up through the bridge and transforming back into Mario. That's the kind of thing you'd expect from Super Mario Bros. Wonder's Wonder Flower, not from an AI video of the original Super Mario Bros.
The researchers speculate that training for longer on "more diverse gameplay data" could address these significant problems and let the model simulate more than just running right and jumping. Still, MarioVGG stands as a fun proof of concept that even limited training data and algorithms can produce a decent starting model of a basic game.
This story originally appeared on Ars Technica.