RE3SIM: Generating High-Fidelity Simulation Data via
3D-Photorealistic Real-to-Sim for Robotic Manipulation


We introduce RE3SIM, a novel real-to-sim-to-real pipeline that integrates Gaussian splatting with NVIDIA Isaac Sim's PhysX engine, improving scene reconstruction fidelity and sim-to-real transfer for robotic manipulation tasks.

Key Observations:

  • Scaling law: increasing the amount of simulation data improves the success rate until it converges at a high level.
  • Mixing sim and real: co-training with real-world data integrates the characteristics of both datasets.
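The co-training observation above can be sketched as a simple batch mixer that draws each sample from the simulated or real dataset with a fixed probability. The `sim_ratio` value, batch size, and placeholder datasets are illustrative assumptions, not RE3SIM's actual training configuration.

```python
import random

def cotrain_batches(sim_data, real_data, sim_ratio=0.9, batch_size=8, seed=0):
    """Yield batches where each sample is drawn from the simulated
    dataset with probability sim_ratio, otherwise from the real one."""
    rng = random.Random(seed)
    batch = []
    while True:
        pool = sim_data if rng.random() < sim_ratio else real_data
        batch.append(rng.choice(pool))
        if len(batch) == batch_size:
            yield batch
            batch = []

# Placeholder datasets: tagged items standing in for trajectories.
sim = [("sim", i) for i in range(1000)]
real = [("real", i) for i in range(100)]
first = next(cotrain_batches(sim, real, sim_ratio=0.9, batch_size=8))
```

Sampling per item (rather than alternating whole batches) keeps every gradient step exposed to both data sources in expectation.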


Shanghai Jiao Tong University   Shanghai AI Lab  The University of Hong Kong

^Project Lead     Corresponding author



➤ Real-to-Sim-to-Real for Diverse Robotic Manipulation Tasks

[Figure: zero-shot-sim-to-real]

Note: Four tasks with individual policies are used to validate the effectiveness of RE3SIM.

Visual Comparison: Low Vision Gap

Background Rendering   PSNR           SSIM
Polycam                11.52 ± 1.40   0.34 ± 0.04
OpenMVS                13.40 ± 0.96   0.27 ± 0.03
3DGS                   13.29 ± 1.11   0.37 ± 0.04

Note: We manually aligned the real objects with those in the simulation, but noticeable pixel-level discrepancies remain, and the background alignment also deviates by a few pixels. These factors collectively lead to relatively low PSNR and SSIM values for all methods, especially in the texture-rich scene.
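For reference, the PSNR values in the table follow directly from the per-pixel mean squared error between a rendered and a real image. Below is a minimal NumPy sketch (SSIM is more involved and is typically computed with a library such as scikit-image rather than by hand).

```python
import numpy as np

def psnr(render, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((render.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01 -> 20 dB.
a = np.zeros((32, 32, 3))
b = np.full((32, 32, 3), 0.1)
print(round(psnr(a, b), 2))  # -> 20.0
```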

[Figure: visual-stack]

Note: 3DGS outperforms Polycam in both PSNR and SSIM. Its PSNR is comparable to OpenMVS's, but its SSIM is notably higher; OpenMVS's reconstruction contains cracks, causing an obvious sim-to-real gap. Together, the qualitative and quantitative results demonstrate that RE3SIM produces high-quality, well-aligned reconstructions, making zero-shot sim-to-real transfer possible.

Zero-Shot Sim-to-Real

Note: RE3SIM can generate high-quality simulation data for training generalizable robotic policies via zero-shot sim-to-real transfer. Below are videos of real-world experiments for the tasks pick and drop a bottle into the basket, place a vegetable on the board, stack blocks, and clear objects on the table. All videos play at normal speed.


Pick and drop a bottle into the basket

Place a vegetable on the board

Stack blocks

Clear objects on the table

Real-to-Sim-to-Real Efficiency

Note: Human effort in reconstruction. The table presents estimated reconstruction times at the table level. We additionally report the human effort required to reconstruct a single object with ARCode.

Input Type         Video   Images   ARCode
Human Effort (s)   51.5    84.5     60.5

Note: Time cost for simulation data collection. Time needed to collect 100 episodes of simulation data for each task, using a machine equipped with 8 RTX 4090 GPUs.

Task                                     Time Cost (minutes)
Pick and drop a bottle into the basket   12.35
Place a vegetable on the board           13.78
Stack blocks                             6.45
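The per-episode wall-clock cost follows directly from the table (100 episodes per entry, on the 8-GPU machine stated above); a trivial sketch:

```python
def seconds_per_episode(minutes_per_100_episodes):
    """Convert the table's per-100-episode times to seconds per episode."""
    return minutes_per_100_episodes * 60.0 / 100.0

# Times (minutes per 100 episodes) taken from the table above.
times = {
    "Pick and drop a bottle into the basket": 12.35,
    "Place a vegetable on the board": 13.78,
    "Stack blocks": 6.45,
}
per_ep = {task: seconds_per_episode(t) for task, t in times.items()}
# e.g. stack blocks: 6.45 * 60 / 100 = 3.87 s of wall-clock time per episode
```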

Large-Scale Sim-to-Real

Note: To push the limits of using synthetic data for real-world manipulation, we choose the clear objects on the table task and evaluate the generalizability of a policy trained on a large-scale simulation dataset.


Note: Doubling the data size typically yields a substantial improvement in success rate until performance converges.
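One simple way to picture this scaling behavior is a saturating model in which each doubling of the dataset closes a fixed fraction of the remaining gap to a performance plateau. The model and all constants below are illustrative assumptions, not values fitted to the paper's results.

```python
def success_rate(doublings, s_max=0.9, s0=0.2, decay=0.5):
    """Saturating scaling model: each doubling of the dataset closes
    a fixed fraction (1 - decay) of the remaining gap to s_max."""
    return s_max - (s_max - s0) * decay ** doublings

# Success rate after 0..7 dataset doublings: gains shrink toward zero.
rates = [round(success_rate(d), 3) for d in range(8)]
```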

[Figure: large-scale-sim-to-real]

Note: A large dataset enables the policy to exhibit some robustness to variations in objects or lighting.

[Figure: large-scale-sim-to-real]

➤ Comparison over Simulated and Real Data

Note: Real-world and simulation data often differ in both distribution and quality, owing to differences in scene initialization and in trajectory preferences between human operators and the rule-based policy.

Object Location

[Figure: data-distribution]

Note: Despite efforts to randomize object positions, data distributions differ slightly due to the challenge of achieving true randomness in real-world settings.
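In simulation, by contrast, position randomization is trivial: sample a tabletop pose uniformly from a workspace region at the start of each episode. The function and ranges below are hypothetical placeholders, not RE3SIM's actual sampler.

```python
import random

def sample_object_pose(rng, x_range=(0.3, 0.6), y_range=(-0.2, 0.2)):
    """Uniformly sample a tabletop pose (x, y, yaw); ranges are placeholders."""
    x = rng.uniform(*x_range)
    y = rng.uniform(*y_range)
    yaw = rng.uniform(-3.14159, 3.14159)  # orientation about the vertical axis
    return x, y, yaw

# One hundred episode initializations from a seeded generator.
rng = random.Random(42)
poses = [sample_object_pose(rng) for _ in range(100)]
```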

Data Quality

[Figure: data-quality]
  • In simulation, the motion planner tends to take the shortest path, resulting in shorter trajectories but with larger angular variations.
  • Longer trajectories may include more pauses, which can hurt model training due to reduced action continuity; this is observed more often in real-world data.
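The step-distance and pause statistics described above can be computed directly from end-effector positions; a minimal sketch, where the `eps` threshold for calling a step a pause is an assumed value:

```python
import math

def step_distances(traj):
    """Euclidean distance traveled by the end effector between adjacent steps."""
    return [math.dist(a, b) for a, b in zip(traj, traj[1:])]

def count_pauses(traj, eps=1e-3):
    """Steps where the end effector barely moves (a simple proxy for pauses)."""
    return sum(1 for d in step_distances(traj) if d < eps)

# Toy trajectory of (x, y, z) positions with one zero-motion step.
traj = [(0, 0, 0), (0.01, 0, 0), (0.01, 0, 0), (0.02, 0, 0)]
print(count_pauses(traj))  # -> 1
```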

➤ Co-training and Fine-tuning

Note: Left: Kernel Density Estimate (KDE) of the Euclidean distance traveled by the robotic arm's end effector between adjacent time steps. Right: The number of time steps taken by the robotic arm from the start of movement to the first closure of the gripper. "Sim" and "Real" indicate models trained on simulated and real data, respectively, while "Co-train" refers to models trained on a mix of simulated and real data, and "Fine-tune" to models pre-trained on simulated data and fine-tuned on real data.

[Figure: joint-train]
[Figure: close-gripper-index]

Note: The distributions of simulation and real data are generally similar. Data generated by our method can be combined with real data through pretraining or co-training, introducing new features without causing the training process to collapse.


➤ More Details

Framework

RE3SIM combines 3D reconstruction with a physics-based simulator, yielding a small real-to-sim gap that enables large-scale simulation data generation for learning manipulation skills via sim-to-real transfer. We first reconstruct the background and the objects of the scene separately, then align them with the robot in the real world. High-quality simulation data can then be generated in the reconstructed simulator and used to train a policy that transfers to the real world.
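The three stages described above can be sketched as a skeleton; every function name and data structure here is a placeholder for illustration, not the actual RE3SIM API.

```python
# Illustrative skeleton of the real-to-sim-to-real flow described above.

def reconstruct(images):
    """Stage 1: separately reconstruct the background and the objects."""
    background = {"type": "3dgs", "frames": len(images)}   # splat background
    objects = [{"type": "mesh", "id": 0}]                  # per-object meshes
    return background, objects

def align_with_robot(background, objects, robot_base_pose):
    """Stage 2: place the reconstructions in the simulator's robot frame."""
    return {"background": background, "objects": objects,
            "robot_base": robot_base_pose}

def collect_episodes(scene, n_episodes):
    """Stage 3: roll out a rule-based policy to generate training data."""
    return [{"scene": scene["robot_base"], "episode": i}
            for i in range(n_episodes)]

scene = align_with_robot(*reconstruct(["img"] * 50), robot_base_pose=(0, 0, 0))
data = collect_episodes(scene, n_episodes=100)
```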

[Figure: pipeline]

More Visual Results in Simulation

Rendering results of the place a vegetable on the board task.

[Figure: visual-place]

Rendering results of the stack blocks task.

[Figure: visual-stack]

Rendering results of the clear objects on the table task.

[Figure: visual-multi-item]