UniT: Unified Tactile Representation for Robot Learning

1Purdue University, 2University of Arkansas

Abstract


UniT is a novel approach to tactile representation learning that uses a VQ-VAE to learn a compact latent space serving as the tactile representation. The representation is trained on tactile images obtained from a single simple object, yet it is transferable and generalizable: it can be zero-shot transferred to various downstream tasks, including perception tasks and manipulation policy learning. Our benchmarking on an in-hand 3D pose estimation task shows that UniT outperforms existing visual and tactile representation learning methods. Additionally, UniT's effectiveness in policy learning is demonstrated across three real-world tasks involving diverse manipulated objects and complex robot-object-environment interactions. Through extensive experimentation, UniT is shown to be a simple-to-train, plug-and-play, yet widely effective method for tactile representation learning.
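As a rough illustration of the idea, here is a minimal VQ-VAE-style tactile autoencoder sketched in PyTorch. The architecture, layer sizes, and class names are illustrative assumptions for exposition, not the released UniT implementation.

```python
# Minimal sketch of a VQ-VAE-style tactile autoencoder (illustrative, not the
# official UniT code). Assumes RGB tactile images from a GelSight-like sensor;
# all layer sizes and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps encoder features to the nearest entry of a learned codebook."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                          # z: (B, C, H, W)
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        # Squared distance from each feature vector to every codebook entry.
        d = (z_flat.pow(2).sum(1, keepdim=True)
             - 2 * z_flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(1)
        z_q = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[3], -1)
        z_q = z_q.permute(0, 3, 1, 2)
        # Codebook + commitment losses; straight-through gradient estimator.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, loss

class TactileVQVAE(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, code_dim, 3, 1, 1),
        )
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1),
        )

    def forward(self, x):
        z_q, vq_loss = self.quantizer(self.encoder(x))
        recon = self.decoder(z_q)
        return recon, F.mse_loss(recon, x) + vq_loss

# Training on tactile images of one simple object, e.g. an Allen key:
# model = TactileVQVAE(); _, loss = model(batch); loss.backward()
```

Training such a model on tactile images of one simple object yields an encoder whose quantized latents can then be reused, frozen, by downstream tasks.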

Transferability and Generalizability

Train on a Single Simple Object

With UniT, tactile images obtained from a single simple object, such as an Allen key or a small ball, suffice to train a representation with robust transferability and generalizability.

Although UniT is trained on only a single simple object, the tactile representation it learns generalizes effectively to unseen objects with diverse shapes, sizes, and textures. The representation reconstructs images that preserve most of the critical information in the original, such as contact geometry and configuration. Compared with UniT, the masked autoencoder (MAE) performs less well.

Autoencoders Trained on Allen Key

Autoencoders Trained on Small Ball

Generalize to Multiple Sensors

Moreover, we demonstrate that tactile representations learned using UniT can effectively generalize across different sensors, despite the training data originating from a single sensor.

Dynamic Marker Motion Reconstruction

Both UniT and MAE effectively capture the dynamic motion of markers. This capability is essential for applying these tactile representations in robot manipulation tasks with complex force interactions.

In-hand 3D Pose Estimation Experiment

We show the effectiveness of UniT on a USB plug 3D pose estimation task, where UniT outperforms a ResNet trained from scratch, the existing representation learning methods BYOL and MAE, and the state-of-the-art tactile representation framework T3. The data collection process is depicted in the video. For more details on this experiment, please refer to our paper.
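To make the benchmarking setup concrete, the sketch below shows one way a frozen pretrained tactile encoder (e.g. the `TactileVQVAE` sketched above) can be paired with a small trainable head for pose regression. The pose parameterization and layer sizes here are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: zero-shot transfer of a frozen tactile encoder to in-hand
# pose estimation. Only the small regression head is trained.
import torch
import torch.nn as nn

class TactilePoseEstimator(nn.Module):
    def __init__(self, pretrained, code_dim=64, pose_dim=3):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():   # frozen: zero-shot transfer
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),             # pool the quantized feature map
            nn.Flatten(),
            nn.Linear(code_dim * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),            # assumed 3-DoF pose output
        )

    def forward(self, tactile_img):
        with torch.no_grad():
            z_q, _ = self.pretrained.quantizer(self.pretrained.encoder(tactile_img))
        return self.head(z_q)

# estimator = TactilePoseEstimator(TactileVQVAE())
# pose_hat = estimator(tactile_batch)  # supervise with an MSE loss on labels
```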

Policy Learning Experiments

Tactile information provides rich feedback for manipulation tasks involving extensive robot-environment-object interactions and can enhance manipulation performance. We demonstrate the effectiveness of UniT in tactile-involved policy learning through experiments across multiple tasks.

We use diffusion policy as the policy backbone and integrate UniT into it. We benchmark this against a vision-only diffusion policy and a visual-tactile diffusion policy whose tactile encoder is trained from scratch. We evaluate on three tasks: Allen key insertion, chicken legs hanging, and chips grasping.
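As a concrete picture of the integration, the sketch below shows one plausible visual-tactile observation encoder for a diffusion policy. Here `vision_encoder` and `tactile_encoder` are placeholders (e.g. a ResNet and a frozen UniT-style encoder); this is not the authors' exact implementation.

```python
# Hedged sketch of a visual-tactile observation encoder for a diffusion policy.
import torch
import torch.nn as nn

class VisualTactileObs(nn.Module):
    def __init__(self, vision_encoder, tactile_encoder, freeze_tactile=True):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.tactile_encoder = tactile_encoder
        if freeze_tactile:                       # keep the pretrained tactile
            for p in self.tactile_encoder.parameters():  # representation fixed
                p.requires_grad = False

    def forward(self, camera_img, tactile_img, robot_state):
        v = self.vision_encoder(camera_img)      # (B, Dv) visual features
        t = self.tactile_encoder(tactile_img)    # (B, Dt) tactile features
        return torch.cat([v, t, robot_state], dim=-1)
```

The concatenated observation features then condition the policy's denoising network, just as image features do in the vision-only baseline.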

Chicken Legs Hanging

The precision required for this task is very high, because the width of the chicken legs and the opening of the slots are almost a perfect match. Moreover, this task involves rich force interactions with the objects being manipulated: the left arm applies force on the rack and stabilizes it, while the right arm exerts pressure to insert the chicken leg into the slot on the rack.

Real-Time Reactiveness of Visual-Tactile Policy with UniT

Visualizations of Visual-Tactile Policy with UniT

Overfitting Issue of Visual-Tactile Policy Trained from Scratch

Training a visual-tactile policy from scratch on this task tends to overfit to joint states, as is evident when the right arm moves toward the rack without grasping the chicken leg. This issue likely arises from the high variance in the tactile images caused by the surface texture of the artificial chicken leg. Consequently, incorporating the tactile modality with a tactile encoder trained from scratch may degrade the policy's performance.

Typical Failure Scenarios of Vision-Only Policy

The vision-only policy's ability to capture interaction information is compromised, and its performance suffers as a result: often the chicken leg appears to be roughly inserted into the slot, but for lack of interaction feedback the gripper releases too early and the chicken leg ultimately falls out.

Chips Grasping

The challenge of this task lies in the fragility of the chips, which necessitates precise control of the gripper width. The gripper must adjust accurately to the shape and size of each chip; inadequate adjustment leads to either missing the grasp or crushing the chip. Real-time tactile feedback is therefore crucial. The visual-tactile policy with UniT handles chips of different shapes and sizes robustly.

Continuous Rollout of Visual-Tactile Policy with UniT

Visualizations of Visual-Tactile Policy with UniT

Typical Failure Scenarios of Vision-Only Policy

Relying solely on visual feedback makes it difficult to accurately determine the state of the grasp, which easily leads to missing the grasp or crushing the chips.

Allen Key Insertion

We collected training data on two different types of racks. During policy rollout, we tested three different types of racks, including one that was completely unseen in the training set, representing out-of-distribution data. The measurements 65/70 mm, 35/70 mm, and 50/70 mm refer to the heights of the left and right brackets of each rack. Different combinations of these heights vary the in-hand pose of the Allen key when it is grasped by the gripper.

Data Collection

Differences in the in-hand angle can be captured by the tactile image. The videos below demonstrate that the visual-tactile policy with UniT adapts effectively to racks of various sizes. Moreover, aligning the Allen key with the nut on the first attempt is often challenging and frequently results in slight deviations; in such cases, tactile feedback helps the policy interactively adjust and accurately guide the Allen key into the nut.

65/70 mm Autonomous Rollout

35/70 mm Autonomous Rollout

50/70 mm Autonomous Rollout

50/70 mm Continuous Rollout

Here, we also present a video featuring multiple rollouts to demonstrate the effectiveness of the policy with UniT in this task. The purple rack shown in the video is the unseen one.

Typical Failure Scenarios of Vision-Only Policy and Visual-Tactile Policy Trained from Scratch

For both the vision-only policy and the visual-tactile policy trained from scratch, missed insertions often occur because the robot's end effector is not adjusted to the correct orientation. This issue is more pronounced with the vision-only policy, where there is a noticeable angular deviation of the Allen key during insertion.

Related Work

In our previous work, LeTac-MPC, we introduced a reactive grasping policy that generalizes to objects with diverse stiffness, shapes, and surface textures by learning a physics-informed tactile representation on simple objects.

LeTac-MPC demonstrates robust generalizability similar to UniT's: it is trained on simple objects yet adapts effectively to a broader range of scenarios.

BibTeX

@misc{xu2024unit,
      title={{UniT}: Unified Tactile Representation for Robot Learning}, 
      author={Zhengtong Xu and Raghava Uppuluri and Xinwei Zhang and Cael Fitch and Philip Glen Crandall and Wan Shou and Dongyi Wang and Yu She},
      year={2024},
      eprint={2408.06481},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2408.06481}, 
}