Cross-Embodiment Transfer
via Behavior-Aligned Representations

Ajay Sridhar*1, Jensen Gao*1, Jonathan Yang1, Jean Mercat2, Suneel Belkhale1, Dorsa Sadigh1

1Stanford University    2Toyota Research Institute (TRI)

*Equal contribution

Left: We consider cross-embodiment datasets of both simulated and real-world robots. We then extract behavior-aligned representations: language motions via heuristics on end-effector pose deltas, end-effector traces via segmentation models, and bounding boxes via Grounding DINO. Right: We train VLAs to predict actions with or without first predicting representations. This enhances transfer from cross-embodiment datasets, without needing representations during inference.
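To make the language-motion heuristic concrete, here is a minimal sketch of how an end-effector pose delta could be mapped to a short motion label. The thresholds, axis conventions, and label strings are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np

# Hypothetical axis-to-label mapping (positive direction, negative direction).
# The actual axis conventions depend on the robot's coordinate frame.
AXIS_LABELS = {
    0: ("move right", "move left"),
    1: ("move forward", "move backward"),
    2: ("move up", "move down"),
}

def language_motion(delta_pos, delta_grip, pos_thresh=0.01, grip_thresh=0.5):
    """Map one end-effector position delta (x, y, z) and a gripper delta
    to a natural-language motion label. Thresholds are illustrative."""
    # Gripper changes take priority over translation.
    if abs(delta_grip) > grip_thresh:
        return "close gripper" if delta_grip < 0 else "open gripper"
    # Label by the dominant translation axis, if it exceeds the threshold.
    axis = int(np.argmax(np.abs(delta_pos)))
    if abs(delta_pos[axis]) < pos_thresh:
        return "stay still"
    positive, negative = AXIS_LABELS[axis]
    return positive if delta_pos[axis] > 0 else negative

# Example: a pose delta dominated by +z translation.
print(language_motion(np.array([0.002, -0.001, 0.03]), 0.0))  # move up
```

In practice, such per-step labels would be smoothed over short windows of the trajectory so the resulting language motions describe coherent segments rather than single timesteps.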


Abstract

Recent progress in large-scale imitation learning for robot manipulation has been driven by leveraging datasets across a wide range of robot embodiments. However, achieving significant cross-embodiment transfer often remains challenging. In this work, we study the role of behavior-aligned representations (e.g., object bounding boxes, language motions, 2D end-effector traces) in vision-language-action (VLA) models in promoting cross-embodiment transfer. We hypothesize that because these representations are invariant across different embodiments while remaining predictive of robot actions, they can help unify diverse cross-embodiment data and enhance transfer in a scalable manner. We test this hypothesis by developing a simulation-based benchmark designed to evaluate transfer from diverse cross-embodiment data to new embodiments. Using this benchmark, we compare different representations and ways of incorporating them. Through our experiments, we find that end-effector traces can be particularly beneficial for transfer, that representations are generally more useful with larger cross-embodiment datasets, and that they can be used to benefit from action-free data. We also demonstrate that they can enhance sim-to-real cross-embodiment transfer, improving the task completion progress of real robot policies pre-trained on simulation data by 28%.



Visualizations of Behavior-Aligned Representations

These visualizations demonstrate predicted behavior-aligned representations for real and simulated robots: language motions, 2D end-effector traces, and target object bounding boxes.

Simulation Rollouts

Real-World Rollouts


Adaptation to New Embodiments

RoboCasa Embodiments

We train models on X-Prior-300 and X-Prior-1000, medium- and large-scale cross-embodiment simulation datasets with three robots in RoboCasa: the Kinova, UR5e, and IIWA. We then adapt the models to new robots using small amounts of embodiment-specific data. In these experiments, we find that training with both cross-embodiment priors and behavior-aligned representations yields the best results. We train and test all models on the PnP Counter to Sink, PnP Sink to Counter, Turn On Sink Faucet, and Flip Mug Upright tasks.

Simulation Adaptation Results

Simulation Results

For our simulation experiments, we adapt our models to the Jaco, the Panda with the Robotiq gripper, and the Panda with the original Panda gripper. We found that representations are generally more impactful when scaling up prior data (except for the Turn On Sink Faucet task). Specifically, across all tasks except Turn On Sink Faucet, the overall improvement from training with representations is +5.0% with no prior data, +15.0% when pre-training on X-Prior-300, and +19.0% when pre-training on X-Prior-1000. We hypothesize this is because the Turn On Sink Faucet task has less variation than our other tasks: the other tasks involve manipulating a variety of objects in different poses, whereas there is comparatively little variation in the poses and types of faucets. The baselines make progress on the tasks, such as moving the gripper to the target object, but often fail to complete more precise motions such as grasping the object in the correct orientation.

Comparison Rollouts for PnP Tasks

Comparison Rollouts for Flip Mug Upright

Comparison Rollouts for Turn On Sink Faucet

Real-World Adaptation Results

Real Results

For our real-world experiments, we adapt our models to the ViperX 300 S and the Franka Research 3. This setting is especially challenging because we must bridge the sim-to-real gap in addition to the cross-embodiment gap. We found that training with representations was particularly beneficial for the PnP Sink to Counter task: the target object was difficult to see in certain locations, so representations such as bounding boxes helped the model locate it. Training with representations improves task progress over the cross-embodiment baseline by 28.3% on average.


Representation Inference

Representation Inference

We compare inference with Joint Reps models by either predicting actions directly, or first predicting a representation and then predicting actions. Predicting representations can help in some situations, but the impact is usually not substantial.


Action-Free Transfer

Action-Free Transfer

We find that behavior-aligned representations can leverage action-free prior datasets to improve both over learning with no prior dataset and over using the full prior dataset with actions but without representations. However, performance is slightly worse than using representations together with the prior dataset with actions.



The website design was adapted from Nerfies.