FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework


Lingzhou Mu*,1,2     Qiang Wang*,1     Fan Jiang†,1     Mengchao Wang1     Yaqi Fan3     Mu Xu1     Kai Zhang✉,2    
1AMAP, Alibaba Group     2Tsinghua University     3Beijing University of Posts and Telecommunications
*Equal contribution     †Project lead     ✉Corresponding author

Human-Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long-horizon, high-level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi-agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi-agent system. This system comprises a scene navigator agent for environmental perception and high-level path planning, and a planning agent that decomposes long-horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed-loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long-term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot-sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long-horizon task completion, and physical realism.
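The closed-loop agent collaboration described above can be sketched as a minimal control loop. All names, interfaces, and the drift metric below are illustrative stand-ins, not the paper's actual implementation: a navigator proposes waypoints, a planner decomposes them into atomic actions, a generator (here a placeholder) executes each action, and a critic flags deviation from the planned path so the trajectory can be corrected.

```python
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    """One state node in the dynamic directed interaction graph."""
    action: str
    position: tuple
    children: list = field(default_factory=list)

def navigator(scene, goal):
    """Hypothetical scene navigator: coarse waypoint path toward the goal."""
    return [(0, 0), (2, 0), (2, 3)]  # placeholder 2D waypoints

def planner(waypoints):
    """Hypothetical planning agent: decompose waypoints into atomic actions."""
    return [f"walk_to{wp}" for wp in waypoints]

def critic(executed, planned, tolerance=1.0):
    """Hypothetical critic: accept the step iff drift stays within tolerance."""
    drift = abs(executed[0] - planned[0]) + abs(executed[1] - planned[1])
    return drift <= tolerance

def run_episode(scene, goal):
    waypoints = navigator(scene, goal)
    actions = planner(waypoints)
    trajectory = []
    for wp, act in zip(waypoints, actions):
        executed = wp  # stand-in for the stochastic video generator's output
        if not critic(executed, wp):
            executed = wp  # a corrective re-generation would happen here
        trajectory.append(GraphNode(action=act, position=executed))
    return trajectory
```

In the real system the generator's output state is stochastic, so the critic's accept/re-plan branch is what keeps the long-horizon trajectory logically consistent.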

Overview


In this work, we presented FantasyHSI, a novel framework for synthesizing expressive and physically plausible human-scene interactions in complex 3D environments. By reformulating HSI as a dynamic directed graph, we established an interpretable structure for modeling long-horizon interactions. The integrated VLM-based multi-agent collaboration comprises scene understanding, hierarchical planning, and trajectory correction. Furthermore, our reinforcement learning-based optimization of video diffusion models ensures that synthesized motions adhere to physical laws, eliminating artifacts such as foot sliding and body-scene penetration. Experiments show that FantasyHSI surpasses existing methods in generalization to unseen scenes and long-horizon tasks while maintaining motion realism and logical coherence.
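The preference optimization mentioned above follows the standard DPO objective, which scores a (preferred, rejected) pair against a frozen reference model. The sketch below shows that generic per-pair loss; how FantasyHSI adapts it to the per-timestep diffusion formulation of the video action generator is not reproduced here, and the argument names are ours:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred (e.g. no
    foot-sliding) and rejected samples; ref_logp_* come from a frozen
    reference model. beta controls how hard the policy is pushed away
    from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy assigns the preferred sample a higher relative likelihood than the reference does, the margin grows and the loss falls, steering generation toward physically plausible motion.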

Method


We introduce FantasyHSI, a novel framework that generates dynamic 4D sequences of humans interacting with their 3D environment. As illustrated on the left, FantasyHSI operates based on high-level task instruction, enabling it to autonomously plan paths, traverse obstacles, and execute a variety of complex motions, such as climbing a ladder. Moreover, the right side of the figure illustrates FantasyHSI's ability to generalize to arbitrary scenes and a variety of actions.

Comparison with SOTA methods

Ablation Results

Character Binding

BibTeX

@misc{mu2025fantasyhsivideogenerationcentric4dhuman,
      title={FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework}, 
      author={Lingzhou Mu and Qiang Wang and Fan Jiang and Mengchao Wang and Yaqi Fan and Mu Xu and Kai Zhang},
      year={2025},
      eprint={2509.01232},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01232}, 
}