ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

Abstract

Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.

Architecture

Stage 1. Temporal-aware Vector Quantization

Temporal-aware VQ-VAE reduces temporal information loss caused by fixed-window quantization. It first applies Local Temporal Enhancement (LTE), where an MLP-based weighted sum preserves important short-term motion cues within each window. It then applies Global Temporal Enhancement (GTE), where a GCN updates group-level features to capture long-range temporal dependencies. The enhanced features are mapped to codebook tokens, and the same design is used for both full-body and part (arms/legs) streams to provide stable motion tokens for next stage.

Stage 2. Part-Guided Motion Synthesis

Next, we first applies Part-aware Text Grounding (PTG): a CLIP text embedding is transformed by multiple part-specific MLPs, and a part gate selects the most suitable embedding for each body part. To keep both diversity and semantic consistency, PTG is trained with a contrastive diversity objective and an auxiliary part-text alignment loss. Then the Part-Guided Network generates arm/leg tokens autoregressively and fuses them into Part Guidance over short cycles. The holistic transformer uses this guidance to generate full-body tokens, instead of predicting holistic motion alone. During generation, Holistic-Part Fusion (HPF) continuously injects part tokens into the holistic stream via attention, improving whole-body coordination while preserving fine-grained part expressiveness.

Visualizations

Part-level evaluation. The qualitative analysis shows that ParTY better matches fine-grained part instructions (e.g., specific arm/leg actions) than both holistic (MoMask) and prior part-wise baseline (ParCo). This observation is consistent with the part-level metrics: ParTY improves part-wise text-motion alignment and motion quality for both arms and legs, indicating that part semantics are preserved during generation rather than diluted in full-body synthesis.

Coherence-level evaluation. We also highlights that strong part control alone is insufficient if global coordination collapses. Prior part-wise generation can show artifacts such as neck distortion or mismatched upper/lower body orientation, which lowers temporal and spatial consistency. ParTY maintains synchronized full-body dynamics while executing part-specific motions, reflected by stronger Temporal Coherence (TC) and Spatial Coherence (SC) scores and visually stable poses across frames.