FantasyPortrait

Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Qiang Wang1*         Mengchao Wang1*         Fan Jiang1†        
Yaqi Fan2           Yonggang Qi2‡   Mu Xu1
1AMAP,   Alibaba Group
2Beijing University of Posts and Telecommunications
*Equal contribution   †Project leader   ‡Corresponding author

Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross-reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another. To address these challenges, we propose FantasyPortrait, a diffusion transformer (DiT)-based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model's ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we introduce the Multi-Expr dataset and the ExprBench benchmark, designed specifically for training and evaluating multi-character portrait animation. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross-reenactment and multi-character contexts.

Multi-Character Portrait Animations

FantasyPortrait supports driving multiple characters from either multiple single-person videos or a single multi-person video, generating detailed expressions and realistic portrait animations.

Diverse Character Styles

FantasyPortrait can animate characters in various styles, generating dynamic, expressive, and naturally realistic stylized videos.

Comparison with Other Methods

Animal Animation

FantasyPortrait demonstrates strong generalization to animal animation tasks, despite not being explicitly trained on animal datasets.

Audio-Driven Portrait Animation

We can readily extend our video-driven model to an audio-driven portrait animation framework. Specifically, we employ Whisper for audio encoding, followed by a small transformer-based network that maps the audio features to latent driving representations. Notably, compared to other DiT-based audio-driven approaches, FantasyPortrait achieves remarkable audio-visual alignment with just thousands of training samples. More significantly, while existing mainstream methods and datasets primarily focus on English, adapting them to other languages typically incurs substantial data-collection costs and computational overhead. In contrast, our method requires only a few hundred samples and approximately 1 GPU hour of fine-tuning the transformer-based mapping network to support a new language or dialect. This significantly lowers the barrier to entry and promotes technological inclusivity. A minimal sketch of this pipeline follows; below we demonstrate our results on Chinese, Japanese, and Arabic.
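The sketch below illustrates the audio-driven extension described above, assuming Whisper encoder features from HuggingFace transformers. The AudioToDrivingMapper module and all of its hyperparameters are illustrative stand-ins, not the exact architecture used in FantasyPortrait.

import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class AudioToDrivingMapper(nn.Module):
    """Small transformer that maps Whisper audio features to latent driving
    representations consumed by the portrait-animation backbone."""
    def __init__(self, audio_dim=512, latent_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        self.proj_in = nn.Linear(audio_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, audio_feats):            # (B, T, audio_dim)
        x = self.encoder(self.proj_in(audio_feats))
        return self.proj_out(x)                # (B, T, latent_dim)

# Frozen Whisper encoder (whisper-base has a 512-dim hidden state).
extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").encoder.eval()
mapper = AudioToDrivingMapper()

waveform = torch.randn(16000 * 3)              # stand-in for 3 s of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = whisper(inputs.input_features).last_hidden_state  # (1, T, 512)
driving = mapper(audio_feats)                  # latent driving representations

In this setup only the lightweight mapper is trained, which is consistent with adapting to a new language by fine-tuning the mapping network alone.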

Architecture Overview

Full Body

In this work, we present FantasyPortrait, a novel DiT-based framework for generating expressive and well-aligned multi-character portrait animations. Our method leverages implicit facial expression representations to achieve identity-agnostic motion transfer while preserving fine-grained affective details. Additionally, we introduce a masked cross-attention mechanism to enable synchronized yet independent control of multiple characters, effectively mitigating expression leakage. To support research in this field, we contribute ExprBench, a comprehensive evaluation benchmark, along with Multi-Expr, a multi-character facial expression dataset. Extensive experiments demonstrate that FantasyPortrait outperforms existing methods in both single- and multi-character animation scenarios, particularly in handling cross-identity reenactment and complex emotional expressions.
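As a concrete illustration of the masked cross-attention idea, the sketch below restricts each video-latent token to attend only to the driving features of the character occupying its region. The module, the single-head formulation, and all shapes are simplified assumptions for exposition, not the exact FantasyPortrait implementation.

import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Cross-attention where each video-latent token may attend only to the
    driving features of the character covering its spatial region, so the
    expression features of different characters do not interfere."""
    def __init__(self, dim=512):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, video_tokens, expr_tokens, token_char, expr_char):
        # video_tokens: (B, N, dim)  latent video tokens
        # expr_tokens:  (B, M, dim)  driving features of all characters
        # token_char:   (B, N) long  character index per token (-1 = background)
        # expr_char:    (B, M) long  character index per driving feature
        q = self.to_q(video_tokens)
        k = self.to_k(expr_tokens)
        v = self.to_v(expr_tokens)
        allow = token_char.unsqueeze(-1) == expr_char.unsqueeze(1)   # (B, N, M)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        scores = scores.masked_fill(~allow, float("-inf"))
        # Background rows are fully masked; softmax yields NaN there, which
        # nan_to_num zeroes so those tokens receive no expression signal.
        attn = torch.nan_to_num(scores.softmax(dim=-1))
        return attn @ v

# Toy usage: 2 characters, 6 video tokens, 4 driving tokens.
mca = MaskedCrossAttention(dim=512)
video = torch.randn(1, 6, 512)
expr = torch.randn(1, 4, 512)
token_char = torch.tensor([[0, 0, 1, 1, -1, -1]])  # last two tokens: background
expr_char = torch.tensor([[0, 0, 1, 1]])
out = mca(video, expr, token_char, expr_char)      # (1, 6, 512)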

BibTeX

@misc{wang2025fantasyportraitenhancingmulticharacterportrait,
      title={FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers}, 
      author={Qiang Wang and Mengchao Wang and Fan Jiang and Yaqi Fan and Yonggang Qi and Mu Xu},
      year={2025},
      eprint={2507.12956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.12956}, 
}