DFM: Deep Fourier Mimic for
Expressive Dance Motion Learning

¹ETH Zurich, ²Sony Group Corporation
ICRA 2025

Abstract

As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dance motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency-domain motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, resulting in reduced tracking accuracy and motion expressiveness, a critical aspect for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy supports simultaneous base activities, such as locomotion and gaze control, allowing entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, pre-designed dance routines.


Method

The expressive dance motion learning system is composed of four key components: motion design, motion representation, motion learning, and hardware inference. In the motion design phase, artists create motion references using specialized design software. The representation of these diverse motions is then learned with a Periodic Autoencoder (PAE). Reinforcement learning (RL) is employed to enable the robot to perform auxiliary tasks, such as walking and head orientation control, while accurately tracking the designed dance references. During inference, the learned policy is deployed on the actual hardware, allowing real-time execution of dance motions and dynamic, interactive behavior by tracking auxiliary task commands.
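As a rough illustration of a PAE-style frequency-domain parameterization, the sketch below extracts the amplitude, frequency, offset, and phase of a single latent channel with an FFT and reconstructs the channel as one sinusoid. This is a simplified stand-in, assuming a single dominant frequency per channel; it is not the learned encoder/decoder of DFM, and the function names are hypothetical.

```python
import numpy as np

def periodic_params(x, dt):
    """Extract (amplitude, frequency, offset, phase) of a latent channel
    via FFT, in the spirit of a Periodic Autoencoder parameterization.
    Illustrative sketch: assumes one dominant sinusoid per channel."""
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=dt)
    power = np.abs(spec[1:]) ** 2        # ignore the DC bin
    k = 1 + np.argmax(power)             # dominant non-DC frequency bin
    amp = 2.0 * np.abs(spec[k]) / n      # amplitude of the dominant sinusoid
    freq = freqs[k]                      # dominant frequency in Hz
    offset = np.real(spec[0]) / n        # DC offset of the channel
    phase = np.angle(spec[k])            # phase of the dominant sinusoid
    return amp, freq, offset, phase

def reconstruct(amp, freq, offset, phase, t):
    """Rebuild the channel as a single sinusoid from its periodic parameters."""
    return amp * np.cos(2.0 * np.pi * freq * t + phase) + offset
```

A purely periodic reconstruction like this is exactly the kind of locally rigid assumption the paper argues against; DFM relaxes it to recover non-periodic motion detail.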


Tracking Accuracy

The tracking accuracy of DFM is demonstrated by conditioning the reference dance on a rear-leg-lifting motion. We compare our method with Fourier Latent Dynamics (FLD) as a baseline. Due to strong periodic assumptions in both motion representation and reinforcement learning, FLD overly smooths out the reference motion. DFM, which relaxes this strong periodic assumption, lifts the rear leg as intended by tracking the reference motion details more accurately.

Natural Transition

The transitions between different types of dance motions are shown. DeepMimic yields high tracking performance on single trajectories but lacks the capability to deal with diverse motions. The resulting hard switches lead to jerky changes between motion types. In contrast, the motion representation employed by DFM achieves smooth transitions.
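A hard switch replaces one motion's parameters with another's in a single step. A minimal way to sketch the smooth alternative is to blend two parameter sets under an easing ramp. This toy example is only an analogy for DFM's transitions in a learned latent space; the function names and the (amplitude, frequency, offset) tuple are assumptions for illustration.

```python
import numpy as np

def blend_params(p_a, p_b, alpha):
    """Linearly blend two tuples of motion parameters, e.g. (amplitude,
    frequency, offset). alpha=0 gives motion A, alpha=1 gives motion B.
    Ramping alpha over time replaces a jerky hard switch."""
    return tuple((1.0 - alpha) * a + alpha * b for a, b in zip(p_a, p_b))

def smooth_transition(t, t0, duration):
    """Cosine ease ramp from 0 to 1 over the window [t0, t0 + duration],
    so the blend weight has zero slope at both ends of the transition."""
    s = np.clip((t - t0) / duration, 0.0, 1.0)
    return 0.5 - 0.5 * np.cos(np.pi * s)
```

At each control step one would evaluate `smooth_transition` for the current time and feed the blended parameters to the decoder, so motion A fades into motion B instead of snapping.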

Frequency Interpolation

Frequency modulation from faster to slower motion is shown by conditioning on a predominantly head-moving dance motion. Even though the training dataset consists of discrete frequency types, the motion representation allows for continuous frequency interpolation.
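One way such continuous frequency interpolation can stay smooth is to integrate the instantaneous frequency into a running phase, so the generated signal has no discontinuity even while the commanded frequency sweeps between values never seen discretely in training. The sketch below shows this idea in isolation; it is an assumption-laden toy, not the paper's implementation.

```python
import numpy as np

def modulated_signal(freqs, dt, amp=1.0):
    """Generate a sinusoid whose frequency varies per sample. Integrating
    the instantaneous frequency (cumulative sum) into the phase keeps the
    output continuous as the commanded frequency is interpolated."""
    phase = 2.0 * np.pi * np.cumsum(freqs) * dt  # running phase integral
    return amp * np.sin(phase)
```

Naively computing `sin(2*pi*f(t)*t)` instead would jump whenever `f` changes; the phase-integral form is what makes a continuous fast-to-slow sweep possible.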