It's another day in the lab, and a student walks up to you asking for advice on how to control a robot arm with an RL agent. The position can be handled with deltas in Cartesian space, but what about the orientation? They thought about using Euler angles, but these have gimbal lock issues. Also, surely smooth representations are better for learning, so maybe quaternions? Then again, quaternions double-cover SO(3) and thus are not unique, which could lead to conflicting gradients. Maybe rotation matrices, since they are unique? But isn't that a lot of dimensions to learn for just 3 degrees of freedom? And wasn't somebody saying tangent vectors (i.e. axis-angles) are best?

Each choice comes with different trade-offs, and while there are excellent papers on these representations for supervised learning tasks [1], nobody has systematically investigated which one works best for actions in reinforcement learning. This matters because your choice doesn't just affect how the network outputs rotations; it fundamentally shapes how the agent explores, how entropy regularization behaves, and ultimately how well your policy learns. We ran experiments with PPO, SAC, and TD3 to figure out what actually works, and link the performance to theoretical intuitions often mentioned when reasoning about rotation representations.

An Idealized Experiment

To really understand how rotation representations affect learning, we need to isolate the problem. So we built the simplest possible environment: an agent that only controls orientation, with orientation as the only state. Think of it as a floating gyroscope that just needs to rotate into target orientations. Each episode, the agent starts at some random orientation and gets assigned a random goal orientation. It can rotate at most αmax radians per step, taking the shortest path toward whatever orientation it commands.
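The dynamics above can be sketched in a few lines. This is a minimal illustration using SciPy's `Rotation` class; the function name, `ALPHA_MAX` value, and overall structure are assumptions for exposition, not the paper's actual environment code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

ALPHA_MAX = 0.3  # hypothetical per-step rotation limit (radians)

def step(R_t: R, R_a: R, alpha_max: float = ALPHA_MAX) -> R:
    """Rotate the current orientation R_t towards the commanded
    orientation R_a along the shortest path, by at most alpha_max."""
    # Relative rotation taking R_t to R_a, as an axis-angle vector.
    delta = (R_t.inv() * R_a).as_rotvec()
    angle = np.linalg.norm(delta)
    if angle > alpha_max:
        delta = delta / angle * alpha_max  # clip to the step limit
    return R_t * R.from_rotvec(delta)

# Example: the agent commands a 90-degree yaw; a single step only
# advances by alpha_max radians along the shortest path.
R_t = R.identity()
R_a = R.from_euler("z", 90, degrees=True)
R_next = step(R_t, R_a)
moved = np.linalg.norm((R_t.inv() * R_next).as_rotvec())
```

Each episode then repeats this step until the goal orientation Rg is reached or the episode times out.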

This setup let us run a lot of experiments. We tested all major representations (rotation matrices, quaternions, Euler angles, and tangent vectors) across three popular RL algorithms (PPO, SAC, and TD3), using both global and delta action formulations, and each with dense and sparse rewards. That's a lot of combinations, and the idealized environment made it feasible to thoroughly explore the design space.

Environment dynamics showing rotation from current state to goal
The agent rotates from Rt towards Ra, with max step αmax, aiming to reach goal Rg. Conceptual illustration in 3D of the 4D rotation space.

Key Findings and Recommendations

TLDR: Use tangent vectors (i.e. axis-angles) in the local frame

  • Default choice: Delta tangent vectors in the local frame. Scale outputs to the range of permissible rotations.
  • Dense rewards help: Continuous feedback can mask representation issues. Sparse rewards amplify differences.
  • For unstable systems (e.g. drones): Tangent vectors remain the best choice. If using matrices/quaternions, use delta actions and unit-centering. For limited operation ranges, Euler angles can be viable.
  • Fixed target poses: If your task involves reaching fixed target poses (not relative positioning), matrices or quaternions in the global frame may match or beat deltas.
  • Avoid Euler angles for general tasks: Delta Euler angles work for small rotations but degrade as coverage of SO(3) increases.
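The default recommendation above can be sketched concretely. This is an illustrative Python/SciPy snippet, assuming a tanh-bounded policy head; the names and the αmax value are ours, not from the paper's code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

ALPHA_MAX = 0.3  # hypothetical per-step rotation limit (radians)

def action_to_delta(action: np.ndarray) -> R:
    """Map a bounded policy output in [-1, 1]^3 (e.g. after tanh)
    to a delta rotation of at most ~ALPHA_MAX radians."""
    return R.from_rotvec(ALPHA_MAX * np.asarray(action))

def apply_delta(R_t: R, action: np.ndarray) -> R:
    # Right-multiplication applies the delta in the local (body) frame.
    return R_t * action_to_delta(action)

a = np.array([0.0, 0.0, 1.0])               # full-magnitude yaw action
R_next = apply_delta(R.identity(), a)
angle = np.linalg.norm(R_next.as_rotvec())  # equals ALPHA_MAX here
```

The key property is that scaling the raw action by αmax directly scales the rotation magnitude, which none of the other representations offer.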

Example Training Curves

We show a few representative training curves from our idealized rotation environment to illustrate the performance differences between representations. Note that we only vary the action representation here; everything else (algorithms, dynamics, etc.) is kept constant. Hyperparameters are optimized per representation to ensure a fair comparison. The full comparisons are available in the paper.
PPO training curves with dense rewards
PPO with dense rewards
TD3 training curves with sparse rewards
TD3 with sparse rewards

Explaining the Results

Distribution warping

Most RL algorithms rely on sampling exploration noise from a simple distribution (e.g. Gaussian) in the action space. Algorithms like PPO rely on small initial log standard deviations to encourage local exploration. However, Gaussian noise in the raw action space gets heavily distorted when projected onto SO(3).

Here we show how the same Gaussian noise, squashed to [-1, 1] with a tanh activation and projected onto SO(3), leads to very different distributions of rotations for different action spaces. This explains, for example, why quaternions and rotation matrices perform poorly at initialization: the noise is nearly uniform across the representation space, leading to chaotic exploration.

Interactive figure (σ = 0.3): resulting sample distributions on SO(3) for tangent vectors, rotation matrices, quaternions, and Euler angles.

Note: Ideally, we would visualize quaternions on a 4D sphere (S³), but since this is impossible to directly visualize, we instead show the resulting SO(3) samples in axis-angle (tangent) space for all representations.
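A quick Monte Carlo sketch makes the warping tangible. We assume a policy head whose raw output is near zero at initialization, so the sampled action is essentially pure Gaussian noise; this is an illustration of the effect, not the paper's exact setup.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)
sigma, n = 0.3, 100_000

# Tangent-vector head: the Gaussian noise in R^3 IS the axis-angle
# action, so a small sigma directly means small rotations.
angles_tangent = np.linalg.norm(rng.normal(0.0, sigma, size=(n, 3)), axis=1)

# Quaternion head at initialization: output near zero plus Gaussian
# noise, normalized onto the unit sphere -> nearly uniform over SO(3).
quat = rng.normal(0.0, sigma, size=(n, 4))
quat /= np.linalg.norm(quat, axis=1, keepdims=True)
angles_quat = R.from_quat(quat).magnitude()

mean_tangent = angles_tangent.mean()  # small, on the order of sigma
mean_quat = angles_quat.mean()        # large, near the uniform-SO(3) mean
```

The same σ yields tiny local perturbations for tangent vectors but near-uniform rotations for quaternions, exactly the chaotic exploration described above.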

Conflicting gradients for multi-covers

One intuition behind avoiding quaternions is their double-cover property: each rotation corresponds to two antipodal points on the 4D unit sphere. Because the two covers are antiparallel in Euclidean space, a gradient pulling the network output towards one of them pushes it directly away from the other, which can lead to conflicting gradients during learning.

But does this actually happen in practice? If the critic doesn't learn that both quaternions are valid solutions, we would never see conflicting gradients. To investigate this, we visualize the Q-values predicted by a trained SAC critic for actions interpolating from the optimal quaternion action to its double-cover and back. As can be seen, the critic correctly learns to assign both quaternions high Q-values with a dip in between, which indeed leads to conflicting gradients for the actor and explains why quaternions are a suboptimal action representation.

SAC critic Q-values for quaternion double-cover
Actual Q-values predicted by the critic for quaternion actions around the manifold. The critic indeed learns the double cover, producing conflicting gradients.
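The dip can also be reproduced geometrically, without a critic: walking a great circle on S³ from a quaternion q to its double cover -q passes through quaternions that represent rotations far from the goal. This is a hedged sketch of that geometry; the choice of orthogonal direction p is arbitrary, and any other choice gives the same distance profile.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

R_goal = R.from_euler("z", 60, degrees=True)
q = R_goal.as_quat()  # one of the two unit-quaternion covers

# Pick any unit quaternion orthogonal to q and walk the great circle
# on S^3 from q (theta = 0) to its double cover -q (theta = pi).
p = np.zeros(4)
p[np.argmin(np.abs(q))] = 1.0
p -= p.dot(q) * q
p /= np.linalg.norm(p)

thetas = np.linspace(0.0, np.pi, 181)
dists = np.array([
    (R.from_quat(np.cos(t) * q + np.sin(t) * p).inv() * R_goal).magnitude()
    for t in thetas
])
# Both endpoints represent the goal exactly; the midpoint of the path
# is a rotation a full 180 degrees away from it.
```

Both endpoints have zero geodesic distance to the goal while the path between them peaks at π, mirroring the Q-value dip the critic learns.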

Misguided entropy regularization

The action noise distribution shown above directly leads to another problem. While we regularize the entropy of the action distribution in Euclidean space, the resulting distribution on SO(3) can look very different. Distributions with more entropy in Euclidean space (large σ) can actually have less entropy on SO(3) than more concentrated distributions (small σ). For many representations, the entropy bonus therefore drives the policy towards actions with larger magnitudes, which causes jittery behavior and poorer exploration.
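A crude Monte Carlo illustration of this mismatch, using the spread of the resulting rotation angle as a simple stand-in for entropy on SO(3). The tanh-squashed tangent-vector head and the σ values are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def angle_spread(sigma: float) -> float:
    """Std of the rotation angle when Gaussian noise with std sigma is
    squashed by tanh and read as a tangent-vector (axis-angle) action."""
    a = np.tanh(rng.normal(0.0, sigma, size=(n, 3)))
    return np.linalg.norm(a, axis=1).std()

# Larger sigma means more entropy in Euclidean space, but the squashed
# actions pile up at the corners of [-1, 1]^3, so the distribution of
# actual rotation angles becomes much narrower, not wider.
spread_moderate = angle_spread(0.5)
spread_large = angle_spread(50.0)
```

The high-σ policy looks maximally exploratory to the entropy bonus while producing nearly constant, large-magnitude rotations.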

Limiting actions

Physical systems cannot rotate arbitrarily fast. Thus, we can often limit the action space to a maximum rotation angle per timestep (αmax, red circle). But how do we scale different representations to this limit? We cannot remap quaternions or rotation matrices without losing their smoothness properties. Euler angles can be scaled, but the scaling depends non-linearly on the current state.

Tangent vectors (axis-angles), on the other hand, naturally scale with rotation magnitude, making it straightforward to limit actions. You can see the idea in the figure below. Multiplying your bounded axis-angle actions (outer gray plane) by αmax keeps the maximum rotation close to the true action limit of αmax (inner gray plane). While some edges remain outside the limit (compare the corners of the inner gray plane to the gray circle representing the action limit projected onto the tangent space), these do not cause issues in practice.
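The residual overshoot at the corners is easy to quantify. A small sketch; the αmax value is an arbitrary example, and the √3 factor follows purely from the geometry of the cube [-1, 1]³.

```python
import numpy as np

ALPHA_MAX = 0.3  # hypothetical per-step rotation limit (radians)

# Bounded tangent-vector actions live in the cube [-1, 1]^3; scaling by
# ALPHA_MAX keeps each axis within the limit, but a corner action such
# as (1, 1, 1) overshoots the spherical limit by a factor of sqrt(3).
corner = ALPHA_MAX * np.ones(3)
overshoot = np.linalg.norm(corner) / ALPHA_MAX  # = sqrt(3), about 1.73
```

So the worst-case rotation is roughly 1.73 · αmax, and only for actions saturating all three axes at once, which is why this rarely matters in practice.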

Action limiting comparison across representations
Action limiting effectiveness varies across representations. Tangent vectors naturally scale with rotation magnitude.

Benchmarks

The idealized environment isolates rotation representations, but real robots deal with much messier scenarios. We tested the representations on three actual robotics benchmarks to see if our findings hold up when orientation control is mixed with position control, contact dynamics, and physical constraints.

Trajectory tracking

Tracking a figure-8 trajectory is a typical control benchmark for drones. In our case, the agent controls collective thrust and drone attitude (orientation), a common control interface for drones. We train a PPO agent in Crazyflow, a JAX-powered, massively parallel drone simulator. For systems with unstable dynamics, such as drones, having a representation centered around the unit rotation (tangents or Euler angles) is especially impactful.

Drone trajectory tracking results
Trajectory tracking results

Drone racing

Drone racing is interesting because agents operate at the limits of the drones' capabilities, using a larger range of attitudes. We adopt the setup of the IROS 2022 Safe Robot Learning Competition, modified to work with our parallelized simulation (GitHub). While the effect is less clear due to the task complexity, we see the same trends as in the trajectory tracking benchmark with delta actions in the tangent space outperforming all other representations.

Drone racing results
Drone racing results

ReachOrient

ReachOrient modifies the popular Fetch environments from OpenAI to include position and orientation goals, and switches out the arm for an FR3. We use a setup with Hindsight Experience Replay (HER) and train with sparse rewards. Because the arm cannot become unstable, the matrix representation performs much better than in the drone benchmarks and is on par with the tangent (i.e. axis-angle) representation.

ReachOrient task results
ReachOrient results

PickAndPlaceOrient

PickAndPlaceOrient similarly modifies the PickAndPlace Fetch environment with orientation goals. The robot must pick up a cube and place it into a target position with the correct orientation. The handling of the cube makes reasoning about rotations more complex, and we see a larger gap between the representations, with tangent vectors outperforming all other representations.

PickAndPlaceOrient task results
PickAndPlaceOrient results

RoboSuite

We evaluate the representations across nine manipulation tasks from the RoboSuite benchmark, testing performance on a variety of contact-rich manipulation scenarios. The performance is dominated by the task choice, not representation choice. We attribute this to agents failing to learn meaningful policies in several of the tasks, or only finding a partial solution, irrespective of their representation.

RoboSuite benchmark results
Performance across nine manipulation tasks