January 22, 2025
Making AI Worlds
Spline is introducing Spell, an AI model to generate 3D worlds.
Spell is designed to generate entire 3D scenes or “Worlds” from an image, in just a few minutes. The worlds are consistent with the initial image input and are represented as a volume that can be rendered using Gaussian Splatting (or other methods, like NeRFs).
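For readers unfamiliar with how a volume becomes pixels: Gaussian Splatting and NeRF-style renderers both blend depth-sorted samples along each camera ray, front to back. The snippet below is a minimal sketch of that generic compositing step, not Spell's renderer. The function name `composite_front_to_back` and the sample values are illustrative assumptions.

```python
def composite_front_to_back(samples):
    """Blend (color, alpha) samples sorted near-to-far along one ray.

    Each color is an (r, g, b) tuple in [0, 1]; alpha is opacity in [0, 1].
    Returns the blended pixel color and the remaining transmittance.
    Hypothetical helper for illustration only.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer samples
    for color, alpha in samples:
        weight = transmittance * alpha  # contribution of this sample
        for i in range(3):
            pixel[i] += weight * color[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # ray is effectively opaque, stop early
            break
    return tuple(pixel), transmittance


# A half-transparent red sample in front of an opaque blue one.
pixel, remaining = composite_front_to_back([
    ((1.0, 0.0, 0.0), 0.5),
    ((0.0, 0.0, 1.0), 1.0),
])
```

In a splatting renderer the samples would be the projected Gaussians covering a pixel, sorted by depth. In a NeRF-style renderer they would be points sampled along the ray.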
Capabilities
At its core, Spell is a type of diffusion model that can generate 3D worlds with realistic multi-view consistency across a wide range of categories, including people, objects, environments, 3D characters, and more.
The model can render images from multiple angles of a particular subject with high accuracy and detail, as well as generate controlled camera paths, all while staying consistent with the 3D scene.
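To make "controlled camera path" concrete, here is a minimal sketch of one common kind of path: an orbit around a subject, expressed as a list of camera positions and viewing directions. How Spell actually accepts camera paths is not described here, so the function `orbit_camera_path`, its parameters, and the y-up convention are all assumptions made for illustration.

```python
import math


def orbit_camera_path(target, radius, height, num_frames):
    """Return (position, forward) pairs for a camera circling `target`.

    Hypothetical helper: assumes a y-up coordinate system, with the camera
    always looking at the target point.
    """
    if radius <= 0 or num_frames <= 0:
        raise ValueError("radius and num_frames must be positive")
    path = []
    for frame in range(num_frames):
        angle = 2.0 * math.pi * frame / num_frames
        position = (
            target[0] + radius * math.cos(angle),
            target[1] + height,
            target[2] + radius * math.sin(angle),
        )
        # Unit vector pointing from the camera toward the target.
        delta = tuple(t - p for t, p in zip(target, position))
        length = math.sqrt(sum(d * d for d in delta))
        forward = tuple(d / length for d in delta)
        path.append((position, forward))
    return path


# Four evenly spaced views around the origin at eye level.
views = orbit_camera_path(target=(0.0, 0.0, 0.0), radius=2.0, height=0.0, num_frames=4)
```

Each returned pair could be turned into a view matrix by whatever renderer consumes the path.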
Volumes generated by Spell
It can also visually simulate physical material properties such as reflections, refractions, and surface roughness, some camera properties such as depth of field, and even camera/object intersections when the camera attempts to go inside surfaces.
Spell prioritizes physical consistency and aims to stay rooted in reality by simulating camera intersections with objects instead of interpolating or morphing to maintain visual flow. For example, if the camera moves inside a wall, Spell simulates an actual intersection with the wall instead of converting the wall into something else.
We found that the model extrapolates knowledge very well across categories, in some cases producing good results even in situations far from the initial data distribution.
However, there is significant room for improvement in terms of quality and consistency, and we are already working on it. Our initial focus was on discovering the best approach for training the model and achieving consistency across multiple angles for a wide set of categories.
Videos generated by Spell
Training
Spell has been trained primarily on real data (captured from real life), complemented by synthetic data (digitally rendered 3D data).
For the real data, we built our own extensive dataset by manually capturing real-world data over a long period in different countries around the world.
For the synthetic data, we rendered 3D objects using multiple techniques. In some cases, we developed our own rendering pipelines and used internal tools we had previously built while developing Spline’s real-time editor. We also licensed 3D models from a trusted 3D marketplace, ensuring they are approved for use in ML training.
We did not use any data from Spline users for training.
Gaussian splats generated by Spell
Volumes
At the moment, Spell's final outputs or exports can be a video, a sequence of images, or a volume (Gaussian Splatting).
However, Spell is not dependent on any specific volume rendering technique, and it is also possible to convert the internal volume representation into a mesh using any reconstruction technique (or using a reconstruction model).
Gaussian splats generated by Spell
A new era for 3D
We believe this represents a significant leap towards a new era of AI-driven graphics, with interactivity and granular control of the output. This is the first version of the model, and we expect that both quality and consistency will improve throughout the year.
Spell training is ongoing, and we plan to release newer model checkpoints frequently.
We’re excited to see what you create with it!
– Spline Team