Geometry-Consistent World Modeling via Unified Video and 3D Prediction
High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite their strong imaginative priors, current video foundation models lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, yielding consistent and generalizable 3D-aware video representations. Notably, the latents produced by the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
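As a rough illustration of how cross-branch supervision of this kind might be combined during training, the sketch below pairs a standard diffusion denoising loss on the video branch with a point-map regression loss on the geometry branch; the choice of loss terms, the `lambda_geo` weight, and the function name are illustrative assumptions, not the exact FantasyWorld objective.

```python
import torch.nn.functional as F

def joint_loss(pred_noise, target_noise, pred_pointmap, target_pointmap, lambda_geo=0.5):
    """Illustrative joint objective: a diffusion denoising loss on the video
    branch plus a geometry regression loss on the 3D branch. The actual losses
    and weighting used by FantasyWorld may differ."""
    loss_video = F.mse_loss(pred_noise, target_noise)      # video branch (noise prediction)
    loss_geo = F.l1_loss(pred_pointmap, target_pointmap)   # geometry branch (e.g., point maps)
    return loss_video + lambda_geo * loss_geo
```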
FantasyWorld is a unified feed-forward model for joint video and 3D scene generation. The front end employs Preconditioning Blocks (PCBs) that reuse the frozen WanDiT denoiser to supply partially denoised latents, ensuring the geometry pathway operates on meaningful features rather than pure noise. The backbone then consists of stacked Integrated Reconstruction and Generation (IRG) Blocks, which iteratively refine video latents and geometry features under multimodal conditioning. Each IRG block contains an asymmetric dual-branch structure: an Imagination Prior Branch for appearance synthesis and a Geometry-Consistent Branch for explicit 3D reasoning, coupled through lightweight adapters and cross-attention.
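The sketch below gives a minimal PyTorch rendering of one such IRG block, with a frozen transformer layer standing in for the Imagination Prior Branch and a trainable layer for the Geometry-Consistent Branch, coupled by cross-attention and a bottleneck adapter; all module names, dimensions, and wiring details are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class IRGBlock(nn.Module):
    """Illustrative sketch of an Integrated Reconstruction and Generation block:
    a frozen Imagination Prior Branch (video) coupled to a trainable
    Geometry-Consistent Branch via cross-attention and a lightweight adapter."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Imagination Prior Branch: stands in for a frozen video-transformer block.
        self.video_block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        for p in self.video_block.parameters():
            p.requires_grad = False
        # Geometry-Consistent Branch: trainable block over 3D-aware features.
        self.geo_block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        # Cross-attention lets geometry features attend to video priors.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Lightweight adapter injecting geometry cues back into the video pathway.
        self.adapter = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                     nn.Linear(dim // 4, dim))

    def forward(self, video_tokens, geo_tokens):
        # Refine video latents with the frozen imagination prior.
        video_out = self.video_block(video_tokens)
        # Video priors regularize 3D prediction: geometry queries video features.
        attn_out, _ = self.cross_attn(geo_tokens, video_out, video_out)
        geo_out = self.geo_block(geo_tokens + attn_out)
        # Geometry cues guide video generation via a residual adapter path.
        video_out = video_out + self.adapter(geo_out)
        return video_out, geo_out


# Usage: one refinement step over token sequences of shape (batch, tokens, dim).
block = IRGBlock(dim=512)
v, g = torch.randn(2, 128, 512), torch.randn(2, 128, 512)
v_next, g_next = block(v, g)
```

In a stacked backbone, blocks of this form would be applied repeatedly so that the two branches progressively exchange information while the video pathway remains frozen.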
Our experiments show that the generated videos not only maintain strong visual realism but also achieve higher multi-view coherence and improved geometric fidelity compared to existing methods.
[Qualitative comparison videos of AETHER, Voyager, Uni3C, WonderWorld, and FantasyWorld across multiple scenes.]
FantasyWorld maintains multi-view consistency across significant viewpoint variations (e.g., 180-degree rotations).
@article{dai2025fantasyworld,
title={FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction},
author={Dai, Yixiang and Jiang, Fan and Wang, Chiyu and Xu, Mu and Qi, Yonggang},
journal={arXiv preprint arXiv:2509.21657},
year={2025}
}