Fantasy AIGC Family

Unifying Human, World, and Interaction with Generative AI

Overview

Fantasy AIGC Family is built on Wan as a unified video-generation foundation. From this single generative core it radiates into spatial intelligence, world modeling, embodied intelligence, and AI creativity, forming a tri-unified family of models and system interfaces across Human, World, and Interaction:

- Human: controllable expression and consistent representations for trustworthy avatars.
- World: explorable scene representations with verifiable consistency for usable world models.
- Interaction: action-driven closed-loop control and coordination mechanisms for scalable interactive systems.

All three axes share representations, data, and engineering pipelines, and evolve together as a capability flywheel with continuous feedback.

Latest News & Milestones

📒 Jan 2026 – We released the training and inference code and model weights of FantasyVLN.
πŸ† Dec 2025 - FantasyWorld ranked 1st on the WorldScore Leaderboard (by Stanford Prof. Fei-Fei Li's Team), validating our approach against global state-of-the-art models.
πŸ› Nov 2025 – Two papers from our family, FantasyTalking2 and FantasyHSI, have been accepted to AAAI 2026.
πŸ› Nov 2025 – Two papers from our family, FantasyTalking2 and FantasyHSI, have been accepted to AAAI 2026.
πŸ› Jul 2025 – FantasyTalking is accepted by ACM MM 2025.
📒 Apr 2025 – We released the inference code and model weights of FantasyTalking and FantasyID.
FantasyVLN
A unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation from natural language instructions and visual observations.
FantasyWorld
Corresponds to the "World" dimension. A unified world model integrating video priors and geometric grounding to synthesize explorable, geometrically consistent 3D scenes. It emphasizes action-driven spatiotemporal consistency and serves as a verifiable structural anchor for spatial intelligence.
FantasyTalking2
A novel Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) method that enhances audio-driven avatar quality along three dimensions: lip-sync accuracy, motion naturalness, and visual quality.
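The core idea of TLPO can be illustrated with a minimal sketch: preference losses from several reward "experts" (one per quality dimension) are combined with weights that adapt to the diffusion timestep and network layer. The weighting scheme, function names, and three-expert setup below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal TLPO-style sketch (hypothetical weighting scheme, not the paper's code):
# combine per-expert preference losses with weights that adapt to the
# diffusion timestep and the network layer being optimized.
import math

EXPERTS = ["lip_sync", "motion", "visual"]  # the three quality dimensions

def expert_weight(expert: str, timestep: float, layer: int, num_layers: int) -> float:
    """Toy adaptive weight: noisy timesteps favor motion, clean timesteps
    favor visual detail, and deeper layers lean toward lip-sync."""
    depth = layer / max(num_layers - 1, 1)          # 0 = shallow, 1 = deep
    base = {
        "lip_sync": depth,
        "motion": timestep,                          # timestep in [0, 1]; 1 = pure noise
        "visual": 1.0 - timestep,
    }[expert]
    return math.exp(base)                            # keep weights positive

def tlpo_loss(expert_losses: dict, timestep: float, layer: int, num_layers: int) -> float:
    """Normalized weighted sum of the per-expert preference losses."""
    weights = {e: expert_weight(e, timestep, layer, num_layers) for e in EXPERTS}
    total = sum(weights.values())
    return sum(weights[e] / total * expert_losses[e] for e in EXPERTS)

# At a late (nearly clean) timestep, the "visual" expert dominates the mix.
losses = {"lip_sync": 0.5, "motion": 0.3, "visual": 0.8}
print(tlpo_loss(losses, timestep=0.1, layer=2, num_layers=4))
```

Because the weights are softmax-normalized, the combined loss always stays within the range of the individual expert losses; the adaptive part is only in how the mixture shifts across timesteps and layers.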
FantasyPortrait
A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.
FantasyHSI
Corresponds to the "Interaction" dimension. A graph-based multi-agent framework that grounds video generation within 3D world dynamics. It unifies the action space with a broader interaction loop, transforming video generation from a content endpoint into a control channel for interactive systems.
FantasyID
A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.