FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation


Mengchao Wang*         Qiang Wang*         Fan Jiang†         Mu Xu  
AMAP,   Alibaba Group
*Equal contribution   †Project lead

Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations.
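The abstract describes Talking-Critic as a reward model that scores generated videos along multiple preference dimensions, and Talking-NSQ as a dataset of preference pairs. A standard way to train such a reward model on pairwise annotations is a Bradley-Terry style objective; the sketch below is a minimal, hypothetical illustration of that objective (the paper does not publish this code, and the function and tensor layout here are our assumptions):

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss, averaged over preference dimensions.

    chosen_scores / rejected_scores: (batch, num_dims) reward-model scores
    for the preferred and dispreferred video of each pair, one column per
    dimension (e.g. lip-sync accuracy, motion naturalness, visual quality).
    This is an illustrative assumption, not the paper's actual loss.
    """
    # P(chosen beats rejected) = sigmoid(score_chosen - score_rejected);
    # minimizing -log of that probability widens the score margin.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# toy example: 2 preference pairs, 3 preference dimensions
chosen = torch.tensor([[2.0, 1.5, 0.8], [1.0, 2.2, 1.1]])
rejected = torch.tensor([[0.5, 0.2, 0.1], [0.3, 1.0, 0.4]])
loss = preference_loss(chosen, rejected)
```

Because each dimension contributes its own margin term, a single scalar loss can supervise all preference dimensions of the reward model at once.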

Generated Videos

FantasyTalking2 can generate diverse character videos with natural movements, accurate lip-sync, and high visual quality through Timestep-Layer adaptive multi-expert Preference Optimization (TLPO).

Timestep-Layer Adaptive Preference Optimization Comparison

Our proposed multi-preference collaborative optimization method for diffusion models enhances the original baseline's performance across multiple dimensions.

Comparison with SOTA methods

We compare against the latest public SOTA methods, including FantasyTalking, HunyuanAvatar, OmniAvatar, and MultiTalk. Our method achieves more natural motion variations, significantly improved lip synchronization, and higher overall video quality.

Architecture Overview

Full Body

In this work, we address the challenge of balancing motion naturalness, visual fidelity, and lip synchronization in audio-driven human animation through TLPO, a novel multi-objective preference optimization framework for diffusion models. Our solution decouples competing preferences into specialized expert modules for precise single-dimension alignment, while a timestep-layer dual-aware fusion mechanism dynamically adapts knowledge injection throughout the denoising process. This effectively resolves multi-preference competition, enabling simultaneous optimization of all objectives without trade-offs and achieving comprehensive alignment. Qualitative and quantitative experiments demonstrate that FantasyTalking2 surpasses existing SOTA methods across key metrics: character motion naturalness, lip-sync accuracy, and visual quality. Our work establishes the critical importance of granular preference fusion in diffusion-based models and delivers a robust solution for highly expressive and photorealistic human animation.
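The timestep-layer dual-aware fusion described above can be pictured as a small gating network that weights the residual contributions of the per-preference expert modules, conditioned on the diffusion timestep and the network layer index. The sketch below is a minimal illustration under our own assumptions; the module name, embedding sizes, and softmax gating are hypothetical and not taken from the paper:

```python
import torch
import torch.nn as nn

class TimestepLayerGate(nn.Module):
    """Hypothetical gate: mixes per-preference expert residuals with
    weights conditioned on diffusion timestep and network layer index."""

    def __init__(self, num_experts: int, num_layers: int, hidden: int = 64):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU())
        self.layer_embed = nn.Embedding(num_layers, hidden)
        self.to_weights = nn.Linear(hidden, num_experts)

    def forward(self, t: torch.Tensor, layer_idx: torch.Tensor,
                expert_deltas: torch.Tensor) -> torch.Tensor:
        # t: (batch,) normalized timesteps; layer_idx: scalar long tensor
        # expert_deltas: (num_experts, batch, dim) residuals from experts
        h = self.t_embed(t.view(-1, 1)) + self.layer_embed(layer_idx)
        w = torch.softmax(self.to_weights(h), dim=-1)  # (batch, num_experts)
        # weighted sum over experts -> fused residual of shape (batch, dim)
        return torch.einsum('be,ebd->bd', w, expert_deltas)

# usage: 3 preference experts, a 30-layer backbone, feature dim 128
gate = TimestepLayerGate(num_experts=3, num_layers=30)
deltas = torch.randn(3, 2, 128)                       # per-expert residuals
fused = gate(torch.tensor([0.1, 0.9]), torch.tensor(5), deltas)
```

Since the gate sees both the timestep and the layer index, each expert's influence can vary over the denoising trajectory and across network depth, which is the intuition behind fusing experts "across timesteps and network layers" without mutual interference.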

BibTeX

@misc{wang2025fantasytalking2timesteplayeradaptivepreference,
      title={FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation}, 
      author={MengChao Wang and Qiang Wang and Fan Jiang and Mu Xu},
      year={2025},
      eprint={2508.11255},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.11255}, 
}