Constraints as Rewards: Reinforcement Learning for Robots without Reward Functions

Yu Ishihara1, Noriaki Takasugi1, Kotaro Kawakami2, Masaya Kinoshita1, Kazumi Aoyama1
1Sony Group Corporation, 2Sony Global Manufacturing and Operation Corporation

Standing-up task performed by our Six-Wheeled-Telescopic-Legged robot: Tachyon 3.
Our new approach enables the robot to learn a task without tuning the reward weights manually.

Abstract

Reinforcement learning has become an essential algorithm for generating complex robotic behaviors. However, to learn such behaviors, it is necessary to design a reward function that describes the task, which often consists of multiple objectives that need to be balanced. This tuning process is known as reward engineering and typically involves extensive trial and error. In this paper, to avoid this trial-and-error process, we propose the concept of Constraints as Rewards (CaR). CaR formulates the task objective using multiple constraint functions instead of a reward function and solves a reinforcement learning problem with constraints using the Lagrangian method. With this approach, the different objectives are balanced automatically, because the Lagrange multipliers serve as the weights among the objectives. In addition, we demonstrate that constraints, expressed as inequalities, provide an intuitive interpretation of the optimization target designed for the task. We apply the proposed method to the standing-up motion generation task of a six-wheeled-telescopic-legged robot and demonstrate that it successfully acquires the target behavior, even though this behavior is challenging to learn with manually designed reward functions.

Method

[Equations: Conventional reinforcement learning (left); Constraints as Rewards (CaR) (right)]

In many practical applications, the reward function is designed as a weighted sum of multiple functions, so conventional reinforcement learning solves the expression on the left. Tuning the weights in that expression typically involves extensive trial and error. To avoid this process, we propose solving the expression on the right, where g_m(s,a) are constraint functions and λ_m are the Lagrange multipliers. Because the Lagrange multipliers take the place of the weights and are adjusted during training, the right expression suggests that designing the learning objective in terms of constraints yields the desired policy without manually tuning the weights among the objectives.
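The two expressions are shown only as images on the page, so the following is a hedged sketch of what they plausibly look like, not a transcription. It assumes a discount factor \gamma, M objectives, hand-tuned weights w_m on reward terms r_m (symbols introduced here for illustration), and constraints that count as satisfied when the expected discounted sum of g_m(s,a) is at most zero. The paper's exact form and sign convention may differ.

```latex
% Left (conventional RL): the weights w_m are fixed by hand before training.
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \sum_{m=1}^{M} w_m \, r_m(s_t, a_t) \right]

% Right (CaR, Lagrangian form): the multipliers \lambda_m are optimized jointly with the policy.
\max_{\pi} \; \min_{\lambda_1, \dots, \lambda_M \ge 0} \; \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \left( - \sum_{m=1}^{M} \lambda_m \, g_m(s_t, a_t) \right) \right]
```

Under these assumptions, each \lambda_m occupies the position that w_m occupies on the left, but it is driven up while its constraint is violated and is free to shrink once the constraint holds.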

Comparisons with manually designed rewards

[Videos: Reward design 1-5 and Constraints as Rewards (CaR), each shown from two initial poses (Pose 1 and Pose 2)]

The videos above show the training results of the robot when using manually designed rewards (Reward design 1-5) and our proposed Constraints as Rewards. The initial pose is set randomly, and the robot is required to transition safely to the upright pose. We do not claim that no reward function can achieve this task. However, the videos confirm that designing a reward function is not straightforward, and that the proposed method is effective in such situations.

Learning curve

[Figures: Algorithm learning curves (left); Weight parameter learning curves (right)]

The left figure shows the learning curves of the proposed algorithm and the comparison algorithms on the task. The proposed CaR with QRSAC-Lagrangian converges faster and more robustly than the comparison algorithms. The right figure shows how CaR adjusted the weight parameters: they were tuned dynamically during training rather than fixed in advance.
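The dynamic adjustment of the weights can be illustrated with a toy example. The sketch below is not QRSAC-Lagrangian and not the authors' implementation: a single scalar `theta` stands in for the policy, the two constraint functions and all step sizes are hypothetical, and the update is plain projected gradient descent-ascent on the Lagrangian sum of lambda_m * g_m(theta), with each constraint treated as satisfied when g_m <= 0.

```python
def constraints(theta):
    """Two hypothetical constraints, each satisfied when its value is <= 0."""
    return [1.0 - theta,   # g_1: theta must be at least 1
            theta - 3.0]   # g_2: theta must be at most 3


def train(steps=200, lr_theta=0.05, lr_lambda=0.1):
    theta = 0.0              # starts infeasible: g_1 is violated
    lambdas = [0.0, 0.0]     # one Lagrange multiplier per constraint
    history = []             # multiplier values after each step
    for _ in range(steps):
        g = constraints(theta)
        # Gradient of sum_m lambda_m * g_m(theta) with respect to theta:
        # dg_1/dtheta = -1 and dg_2/dtheta = +1.
        grad_theta = -lambdas[0] + lambdas[1]
        theta -= lr_theta * grad_theta  # primal step: descend the Lagrangian
        # Dual step: ascend in lambda, then project onto lambda >= 0.
        lambdas = [max(0.0, lam + lr_lambda * g_m)
                   for lam, g_m in zip(lambdas, g)]
        history.append(tuple(lambdas))
    return theta, lambdas, history


if __name__ == "__main__":
    theta, lambdas, history = train()
    print("theta:", round(theta, 3))
    print("final multipliers:", lambdas)
    print("peak lambda_1:", round(max(h[0] for h in history), 3))
```

In this toy setting, the multiplier of the violated constraint grows until `theta` moves into the feasible interval and then decays, while the multiplier of the constraint that is never violated stays at zero. That is the same qualitative behavior as weights that change over the course of training instead of being chosen by hand.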

Real robot experiments

BibTeX


@article{ishihara2025car,
  title={Constraints as Rewards: Reinforcement Learning for Robots without Reward Functions},
  author={Yu Ishihara and Noriaki Takasugi and Kotaro Kawakami and Masaya Kinoshita and Kazumi Aoyama},
  journal={arXiv preprint arXiv:2501.04228},
  year={2025}
}