Sim-to-Real: Bringing What Was Learned in Simulation into Reality

Introduction

Training a robot with reinforcement learning can require hundreds of thousands to millions of trials. Accumulating that much experience on a real robot makes the cost in time, hardware wear, and safety risk unmanageable. This is why much research instead gathers vast experience quickly and safely inside simulation, then transfers the learned policy to a real robot. This is sim-to-real transfer.

But there is a fundamental difficulty here. No matter how sophisticated, a simulation is not a perfect replica of reality. Friction, contact, sensor noise, actuator delay, and similar factors all differ slightly. These minute differences accumulate, and it is common for a policy that works perfectly in simulation to fail miserably in reality. This gap is called the reality gap.

This article examines why the reality gap arises and the representative techniques for narrowing it: domain randomization, domain adaptation, system identification, and digital twins. It also looks at physics that is especially hard to reproduce, such as tactile sensing. For accuracy, we mention details of specific physics engines or systems only when they are well established, and speak in general terms where things are uncertain.

What Is the Reality Gap

The reality gap collectively refers to every mismatch between simulation and reality. This gap stems from several causes.

Causes of the Reality Gap

──────────────────────────

┌───────────────────────┐           ┌───────────────────────┐
│      Simulation       │           │        Reality        │
│                       │           │                       │
│ - approximate physics │     ≠     │ - real physics        │
│ - ideal friction      │ ──gap──▶  │ - complex friction    │
│ - perfect sensors     │           │ - noisy sensors       │
│ - instant actuators   │           │ - delay, backlash     │
│ - simplified contact  │           │ - deformation, slip   │
└───────────────────────┘           └───────────────────────┘

If the policy learns to exploit simulation's "loopholes," those loopholes vanish in reality and the policy collapses.

The causes of the reality gap can be organized as follows.

- Dynamics mismatch: The physics approximation the simulator uses differs subtly from real objects' mass, inertia, and friction.

- Contact and friction: The forces, slippage, and deformation at the moment objects touch are the parts physics engines reproduce least accurately.

- Sensor differences: Simulated camera images or depth values do not perfectly capture real sensors' noise, lens distortion, and lighting changes.

- Actuator differences: Real motors have delay, backlash (gear play), torque limits, and wear, but simulations tend to idealize these away.

Domain Randomization

The most widely known technique for narrowing the reality gap is domain randomization. The idea is counterintuitive: instead of making the simulation closer to reality, it randomly and heavily perturbs many simulation parameters each episode.

The core intuition is this. If you train while widely varying factors like friction, mass, lighting, color, and sensor noise, the policy does not overfit to any single setting; it learns that "variation of about this range is normal." To a policy trained this way, the real world looks like just one more random variation.

The Intuition of Domain Randomization

──────────────────────────────────────

[ Single simulation ]

● ← optimized to one narrow setting → fails in reality (○)

reality ○ (outside training distribution)

[ Domain randomization ]

● ● ● ● ● ● ● ● ● ← robust across a wide range of variations

reality ○ (reality lies within training distribution)

Randomization targets fall into two broad categories.

- Visual randomization: Perturbs textures, colors, lighting, camera position, background, and so on to make the visual policy robust.

- Dynamics randomization: Perturbs physical parameters like mass, friction coefficient, actuator delay, and joint damping to make the control policy robust.

Example dynamics randomization config (conceptual)

domain_randomization:
  friction:
    distribution: uniform
    low: 0.5
    high: 1.5
  mass_scale:
    distribution: uniform
    low: 0.8
    high: 1.2
  actuator_delay_ms:
    distribution: uniform
    low: 0
    high: 20
  observation_noise_std: 0.02
  gravity_scale:
    low: 0.98
    high: 1.02

Domain randomization has clear limits too. If the randomization range is too wide, the task itself can become so hard that the policy learns nothing. If it is too narrow, it fails to cover reality. Choosing an appropriate range takes considerable tuning effort.
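As a rough illustration of per-episode dynamics randomization, the sketch below draws one set of parameters at the start of each episode. The ranges mirror the conceptual config above; the names `RANGES`, `sample_episode_params`, and `add_observation_noise` are hypothetical, and applying the sampled values to a simulator is left out because it depends on the tool in use.

```python
import random

# Hypothetical ranges mirroring the conceptual config above.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "actuator_delay_ms": (0.0, 20.0),
    "gravity_scale": (0.98, 1.02),
}
OBSERVATION_NOISE_STD = 0.02


def sample_episode_params(rng: random.Random) -> dict:
    """Draw one set of physics parameters, uniformly within each range."""
    return {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}


def add_observation_noise(observation: list, rng: random.Random) -> list:
    """Add zero-mean Gaussian noise to every observation component."""
    return [x + rng.gauss(0.0, OBSERVATION_NOISE_STD) for x in observation]


rng = random.Random(0)
for episode in range(3):
    # Resample at the start of every episode so no single setting dominates.
    params = sample_episode_params(rng)
    print(episode, {name: round(value, 3) for name, value in params.items()})
```

Widening or narrowing the tuples in `RANGES` is where the tuning effort described above would go.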

System Identification

System identification is complementary to domain randomization. Instead of random perturbation, it calibrates the simulator's parameters as closely as possible to reality using data measured from the real robot.

For example, you move a real robot arm through various trajectories, measure the joints' responses, and optimize parameters like friction coefficient, inertia, and delay so that the measurements match the simulation results. A simulator calibrated this way resembles reality more closely, narrowing the gap.

The System Identification Loop

───────────────────────────────

real robot ──▶ collect measurement data
                  │
                  ▼
           compare with simulation results
                  │
                  ▼
           adjust sim parameters to reduce error
                  │
                  └────▶ (repeat to bring sim near reality)

In practice, domain randomization and system identification are used together. System identification brings the simulator near reality, and domain randomization perturbs widely around it to absorb the remaining gap as robustness.
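The calibration loop can be sketched on a toy problem. The example below assumes a hypothetical 1-DoF joint with a single unknown viscous damping coefficient and picks the value that best reproduces a measured velocity trace. The "measured" data is generated synthetically here as a stand-in for real robot logs, and the grid search is only one simple choice of optimizer.

```python
def simulate_velocity(damping, commands, dt=0.01, inertia=1.0):
    """Roll out a toy 1-DoF joint: inertia * dv/dt = command - damping * v."""
    v, trajectory = 0.0, []
    for u in commands:
        v += dt * (u - damping * v) / inertia
        trajectory.append(v)
    return trajectory


def trajectory_error(damping, commands, measured):
    """Mean squared error between simulated and measured velocities."""
    simulated = simulate_velocity(damping, commands)
    return sum((s - m) ** 2 for s, m in zip(simulated, measured)) / len(measured)


def identify_damping(commands, measured, low=0.0, high=5.0, steps=501):
    """Pick the damping value on a grid that best reproduces the measurement."""
    candidates = [low + (high - low) * i / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda b: trajectory_error(b, commands, measured))


# Stand-in for real robot logs: generated here with a "true" damping of 1.7.
commands = [1.0] * 200 + [-1.0] * 200
measured = simulate_velocity(1.7, commands)

print(identify_damping(commands, measured))
```

With several unknown parameters, the same error function would typically be handed to a proper optimizer rather than a grid, and the identified value would become the center of the randomization range.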

Observation History and Implicit Adaptation

One interesting mechanism by which a policy trained with domain randomization transfers well to reality is that the policy implicitly estimates the current environment's characteristics from the flow of past observations.

Looking at a single moment's observation, you cannot tell whether friction is high or low right now. But by looking at how the robot actually moved over the last few steps given the commands it issued, you can infer the environment's physical characteristics to some degree. In other words, if the policy takes an observation history as input, there is room for it to figure out "what kind of environment this is now" on its own and adjust its behavior accordingly.

Implicitly Estimating the Environment from Observation History

──────────────────────────────────────────────────────────────

single frame: observation o_t ──▶ environment traits unknown

history: o_(t-k) ... o_t ──▶ from "response vs command"

(flow of the past) implicitly estimate friction, mass, etc.

──▶ adjust behavior accordingly

From this viewpoint, broad domain randomization is not merely making the policy insensitive; it can be understood as training the ability to distinguish various environments and adapt to each. This is why structures that handle temporal information, such as recurrent neural networks, are advantageous for this implicit adaptation.
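One simple alternative to a recurrent network is to feed the policy a fixed window of recent steps. The sketch below is a minimal history buffer under that assumption; the class name `HistoryBuffer` and its layout (observation followed by action, oldest step first) are illustrative choices, not a standard interface.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the last `length` (observation, action) pairs as one flat vector."""

    def __init__(self, length, obs_dim, act_dim):
        zero_step = [0.0] * (obs_dim + act_dim)
        # Pre-fill with zeros so the policy input has a fixed size from step 0.
        self.steps = deque([list(zero_step) for _ in range(length)], maxlen=length)

    def push(self, observation, action):
        # With maxlen set, appending silently drops the oldest step.
        self.steps.append(list(observation) + list(action))

    def as_policy_input(self):
        # Oldest step first, newest last.
        return [value for step in self.steps for value in step]


history = HistoryBuffer(length=3, obs_dim=2, act_dim=1)
history.push([0.10, 0.00], [0.5])
history.push([0.12, 0.02], [0.5])
print(history.as_policy_input())
```

Storing the action next to the observation matters: it is the pairing of command and response that lets the policy infer properties such as friction.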

Domain Adaptation

Domain adaptation focuses on porting a policy or perception model from the simulation domain to the reality domain. A typical use is visual perception: the model is trained so that the feature distributions of simulated and real images align, which lets representations learned in simulation work on real images too.

The approaches vary: transforming simulated images to look real, aligning the feature space so the two domains become indistinguishable, or fine-tuning the last stage with a small amount of real data. The core goal is the same: transfer knowledge learned in simulation to reality while minimizing the amount of real data required.
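To make "aligning the feature space" concrete, here is about the simplest possible version: shifting and scaling each simulated feature dimension so its mean and standard deviation match those of real features. This is only a moment-matching sketch under the assumption that features are plain lists of floats; it is far cruder than the image-translation or learned alignment methods mentioned above, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_alignment(sim_features, real_features, eps=1e-8):
    """Per-dimension statistics needed to map sim features onto real ones."""
    stats = []
    for sim_col, real_col in zip(zip(*sim_features), zip(*real_features)):
        # eps guards against division by zero for constant sim features.
        stats.append((mean(sim_col), pstdev(sim_col) + eps,
                      mean(real_col), pstdev(real_col)))
    return stats


def align(feature, stats):
    """Shift and scale one sim feature vector to match real mean and spread."""
    return [(x - ms) / ss * sr + mr for x, (ms, ss, mr, sr) in zip(feature, stats)]


sim = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]]
real = [[5.0, 1.0], [7.0, 2.0], [9.0, 3.0]]
stats = fit_alignment(sim, real)
aligned = [align(f, stats) for f in sim]
print(aligned)
```

Note that only unlabeled real features are needed to compute the statistics, which is in line with the goal of minimizing the amount of real data required.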

Digital Twins and Physics Engines

The foundation of sim-to-real is ultimately the quality of the simulator. Recently the concept of a digital twin, a precise virtual replica of the real robot and work environment, appears frequently. The idea is that learning a policy in a virtual environment that reflects the real cell's geometry, objects, and lighting as much as possible can narrow the reality gap.

On the physics-engine side, tools that use the GPU to simulate thousands of environments in parallel have greatly accelerated robot reinforcement learning. NVIDIA's Isaac-family simulation tools are commonly cited, and various other physics engines are used in robotics research as well. That said, the exact features and performance of each tool may vary by version, so it is safest to consult the official documentation for details.

The Benefit of GPU Parallel Simulation

───────────────────────────────────────

Single env:    [env] ──▶ slow data collection
Parallel env:  [env][env][env] ... [env]   (thousands at once)
                 │    │    │         │
                 └────┴────┴─────────┘
                 collect vast experience in short time
                 ──▶ reinforcement learning becomes practical

Hard-to-Reproduce Physics — Touch and Friction

An especially tricky area in sim-to-real is contact-rich manipulation. Grasping, inserting, and wiping motions depend heavily on subtle friction and contact forces, which are precisely what physics engines reproduce least accurately.

- Friction: The transition between static and kinetic friction, and changes with surface state, are hard to capture with a single simple coefficient.

- Deformation: Soft objects or cloth that change shape are not represented by rigid-body assumptions.

- Tactile sensors: Tactile sensors that detect fingertip pressure distribution or slippage are particularly hard to reproduce in simulation.

For these reasons, tasks where touch matters, such as precise assembly, have significant limits with pure simulation learning alone, and tend to require more real data or real-world fine-tuning.

The Promise and Limits of Zero-Shot Transfer

The ideal goal is zero-shot transfer: a policy learned in simulation works directly in reality without any additional learning or adjustment. For some tasks, such as robot locomotion trained with broad domain randomization, impressive zero-shot results have been reported.

But not all tasks are like this. For tasks with a large reality gap, such as contact-rich precise manipulation, few-shot fine-tuning in the real world is still often needed. So in practice it is more realistic to view transfer as a spectrum.

The Transfer Spectrum

──────────────────────

zero-shot            few-shot               full real training
(no real tuning)     (small real data)      (lots of real learning)
    │                    │                       │
small-gap tasks      medium-gap tasks       very-large-gap tasks
(e.g. some gaits)    (many manipulations)   (extreme precision, soft)

Observation Gap and Dynamics Gap

When dealing with the reality gap, it helps to think of it as two kinds: the observation gap and the dynamics gap.

The observation gap is the difference in how the robot sees the world. This is when the simulation's camera image looks different from the real image, or the depth sensor's noise characteristics differ. The dynamics gap is the difference in how the world actually moves. This is when an object responds differently in sim and reality given the same force.

Two Kinds of Gap

──────────────────

  Observation gap            Dynamics gap
┌──────────────────┐     ┌──────────────────┐
│ difference in    │     │ difference in    │
│ seeing the world │     │ how the world    │
│                  │     │ moves            │
│ e.g. texture,    │     │ e.g. friction,   │
│ sensor noise     │     │ inertia, delay   │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         ▼                        ▼
 mainly addressed by      mainly addressed by
 visual randomization,    dynamics randomization,
 domain adaptation        system identification

This distinction is useful because the prescription differs depending on which gap is the problem. If visual perception is shaky, you need visual randomization or domain adaptation; if control is shaky, you should focus on dynamics randomization or system identification. Lumping the two together makes it easy to pour effort into the wrong place.

Privileged Information and Teacher-Student Learning

A powerful technique often used in sim-to-real is teacher-student learning with privileged information.

The idea is this. Inside simulation you can access perfect information unobtainable in reality: an object's exact mass, the terrain's exact height, the friction coefficient, and so on. A teacher policy that receives such privileged information directly learns very easily and very well. The problem is that the real robot does not have this information.

So we split it into two stages. First, learn a teacher policy with privileged information in simulation. Then learn a student policy that mimics the teacher's actions using only the observations actually obtainable in reality (cameras, joint sensors, and so on). The student learns to behave like the teacher without privileged information, and this student policy is deployed to reality.

Teacher-Student Learning (using privileged information)

─────────────────────────────────────────────────────

[Teacher] privileged info (perfect physics) ──▶ easily learns good policy

│ teacher's actions as target

[Student] uses only real-obtainable observations ──▶ learns to imitate teacher

Real deployment: student policy (no privileged info needed)

This approach is especially effective for tasks where terrain information matters, such as legged locomotion. In simulation you know the terrain height exactly, but in reality you cannot, so the student is trained to implicitly estimate the terrain from the flow of past observations.
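The second stage, where the student imitates the teacher, amounts to supervised regression on the teacher's actions. The toy sketch below assumes a hand-written teacher that reads the true friction value, and a linear student that only sees position plus a noisy friction-related cue. `slip_proxy` is a hypothetical stand-in for whatever the student could infer from real sensors or observation history.

```python
import random


def teacher_action(position, friction):
    """Toy teacher: sees the privileged friction value directly."""
    return -2.0 * position + 0.5 * friction


def student_action(weights, observation):
    """Linear student acting on real-obtainable observations only."""
    return sum(w * o for w, o in zip(weights, observation))


def distill(samples, epochs=200, lr=0.05):
    """Fit student weights to the teacher's actions by SGD on squared error."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for observation, target in samples:
            error = student_action(weights, observation) - target
            weights = [w - lr * error * o for w, o in zip(weights, observation)]
    return weights


rng = random.Random(0)
samples = []
for _ in range(200):
    position = rng.uniform(-1.0, 1.0)
    friction = rng.uniform(0.5, 1.5)              # privileged, teacher-only
    slip_proxy = friction + rng.gauss(0.0, 0.05)  # hypothetical noisy observable cue
    observation = [position, slip_proxy, 1.0]     # last entry is a bias term
    samples.append((observation, teacher_action(position, friction)))

weights = distill(samples)
print([round(w, 2) for w in weights])
```

In a real system the teacher would itself be trained with reinforcement learning and the student would usually be a neural network over an observation history, but the imitation loss has the same shape.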

The Size of the Gap Seen Through Cases

Let us look at the transfer spectrum mentioned earlier a bit more concretely. Where a task lands is largely determined by two factors: contact complexity and precision requirement.

- Tasks with little contact and much slack: Locomotion in open space or roughly pushing large objects have a relatively small reality gap, and good transfer is often possible with broad domain randomization alone.

- Tasks with much contact but some slack: General manipulation of picking up and moving objects has a medium gap, and a small amount of real-data fine-tuning on top of randomization helps.

- Tasks with much contact and extreme precision: Inserting parts with fine clearance or handling cloth has a very large gap, requiring a large increase in the proportion of real data.

Two Axes That Place a Task

────────────────────────────

Precision
requirement
 high ┤  precise assembly      extreme soft objects
      │  (much real data)      (mostly real training)
      │
      │  rough pushing         general manipulation
 low  ┤  (near zero-shot)      (few-shot)
      └──────────────────────────────────────────────▶
         low        contact complexity          high

This map is not absolute and every task has exceptions. But it gives a rough intuition for gauging "how hard a transfer my task is."

The Practical Workflow

The rough flow when actually applying sim-to-real is as follows.

sim-to-real Practical Pipeline

───────────────────────────────

[1] Digital twin / environment modeling

[2] Calibrate sim parameters via system identification

[3] Set domain randomization range, then train in sim

[4] Small-scale evaluation in reality

├── performance sufficient ──▶ deploy

└── gap found ──▶ readjust randomization / identification (back to [2]/[3])

Actuator and Latency Modeling

An unexpectedly large share of the reality gap comes from the non-idealities of actuators. Simulation often assumes "the commanded torque comes out instantly and exactly," but real motors do not behave that way.

- Latency: It takes time from issuing a command to the actual force coming out. The faster the control cycle, the greater the relative impact of this delay.

- Backlash: Gears have play, so there is a brief interval where force is not transmitted when reversing direction.

- Torque limit and saturation: There is an upper bound on the force a motor can produce, and near it the response does not follow the command.

- Friction and wear: Joint friction changes over time and differs subtly from robot to robot.

Ideal vs Real Actuator

────────────────────────

Ideal: commanded torque ──instant──▶ output torque (perfect match)

Real: commanded torque ──delay──▶ ──backlash──▶ ──saturation──▶ output

(time) (play) (upper bound) (imperfect)

▶ If the policy relies on the ideal assumption, it oscillates or fails in reality.

Two common remedies are to model these non-idealities in simulation, or to replace the actuator's response itself with a separate model learned from data. Including latency among the randomization targets in particular can greatly improve the robustness of the control policy.
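As a minimal sketch of reflecting actuator non-idealities in simulation, the class below inserts a fixed command delay and a torque limit between the policy and the simulated joint. The class name and parameters are hypothetical, and backlash and wear are deliberately left out to keep the example short.

```python
from collections import deque


class DelayedSaturatedActuator:
    """Toy actuator: fixed command delay followed by torque clipping."""

    def __init__(self, delay_steps, torque_limit):
        # Queue pre-filled with zeros: nothing comes out until the delay elapses.
        self.pending = deque([0.0] * delay_steps)
        self.torque_limit = torque_limit

    def step(self, commanded_torque):
        self.pending.append(commanded_torque)
        delayed = self.pending.popleft()
        # Saturation: output cannot exceed the limit in either direction.
        return max(-self.torque_limit, min(self.torque_limit, delayed))


actuator = DelayedSaturatedActuator(delay_steps=2, torque_limit=1.0)
outputs = [actuator.step(u) for u in [0.5, 3.0, -0.2, 0.0, 0.0]]
print(outputs)  # [0.0, 0.0, 0.5, 1.0, -0.2]
```

Drawing `delay_steps` from a range at the start of each episode is one way to include latency among the randomization targets.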

Real-to-Sim — The Bridge in the Opposite Direction

So far we have covered the direction from simulation to reality, but the reverse flow matters too: real-to-sim, observing reality to build or improve the simulation.

If you transfer a real scene into a virtual environment via 3D scanning or reconstruction techniques, you obtain a simulation that looks exactly like the actual workspace. A policy learned in a simulation built this way transfers back to that very space more easily. Repeating real-to-sim and sim-to-real cyclically creates a virtuous cycle where sim and reality improve each other.

real-to-sim ↔ sim-to-real cycle

─────────────────────────────────

real observation ──(real-to-sim)──▶ build precise sim

▲ │

│ ▼

│ learn policy in sim

│ │

└──(sim-to-real)◀── validate/collect data in reality

Sim-to-Real Seen Through Cases

For a concrete feel, let us contrast two types of tasks. Details of actual systems vary by implementation, so here we describe only general tendencies.

First, locomotion of quadruped and biped robots. This task has several relatively successful sim-to-real cases reported, combining domain randomization with privileged-information-based teacher-student learning. The approach learns a policy in simulation that is robust to varied terrain and disturbances, then applies it to real walking.

Second, fingertip precise manipulation. Tasks of rotating or reorienting an object with multiple fingers have had impressive results reported through extensive domain randomization. That said, such tasks involve very complex contact, requiring considerable engineering effort in setting the randomization range and in training.

The common lesson of both cases is that successful sim-to-real is not one magical technique but the product of a combination of many techniques (randomization, identification, teacher-student, careful evaluation) and iterative tuning.

Safety Measures for Real-World Deployment

No matter how well validated in simulation, unexpected behavior can emerge when a learned policy is first put on a real robot. So layered safety measures at the actual deployment stage are standard.

- Torque and velocity limits: No matter how large a command the policy issues, block it at the hardware level from exceeding safety limits.

- Workspace boundaries: If the robot arm leaves a designated safe region, stop it immediately.

- Emergency stop: Provide physical and software switches by which a human can halt the robot at any time.

- Gradual deployment: Start with low speed and a narrow range, and gradually widen the conditions as trust builds.

Safety Layers of Real-World Deployment

────────────────────────────────────────

policy output
     │
     ▼
[torque/velocity limit]  ──▶ block exceeding hardware limits
     │
     ▼
[workspace boundary]     ──▶ stop if leaving the safe region
     │
     ▼
[emergency-stop standby] ──▶ human can intervene anytime
     │
     ▼
actual robot motion (within safe bounds)

These measures do not raise sim-to-real performance itself, but they are the last line of defense preventing unexpected failures from the gap from turning into accidents. They are essential especially for robots operating alongside people.
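The layering can be sketched as a filter that sits between the policy and the robot. Everything here is illustrative: the function name, the limits, and the box-shaped workspace are assumptions, and returning zero torque is a simplification of what a real stop procedure does. Hardware-level limits and physical emergency stops live in the drive electronics and cannot be replaced by software like this.

```python
def safety_filter(torques, position, estop_pressed,
                  torque_limit=2.0,
                  workspace=((-0.5, 0.5), (-0.5, 0.5), (0.0, 0.8))):
    """Return safe torques, or all zeros if the robot must stop."""
    # Layer 3: a human-triggered stop overrides everything else.
    if estop_pressed:
        return [0.0] * len(torques)
    # Layer 2: stop if the end effector leaves the axis-aligned safe box.
    inside = all(low <= p <= high for p, (low, high) in zip(position, workspace))
    if not inside:
        return [0.0] * len(torques)
    # Layer 1: clip whatever the policy asked for to the torque limit.
    return [max(-torque_limit, min(torque_limit, t)) for t in torques]


print(safety_filter([5.0, -0.3], position=[0.0, 0.0, 0.4], estop_pressed=False))
print(safety_filter([5.0, -0.3], position=[0.9, 0.0, 0.4], estop_pressed=False))
```

Gradual deployment then corresponds to starting with a small `torque_limit` and a tight `workspace` and relaxing them as trust builds.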

Quantifying the Gap

In sim-to-real research and practice, gauging "how large the gap is" matters. To this end, one commonly runs the same policy in simulation and reality and compares how far their performance or trajectories diverge.

- Performance gap: The difference between the success rate in sim and the success rate in reality. A large difference indicates that transfer worked poorly.

- Trajectory gap: For the same command, how differently the robot's state evolves over time in sim versus reality.

Sim-Reality Trajectory Gap

────────────────────────────

sim: s0 ─▶ s1 ─▶ s2 ─▶ s3 ─▶ ...

reality: s0 ─▶ s1'─▶ s2'─▶ s3'─▶ ...

↑ ↑ ↑

progressively widening gap

▶ If it already widens early, suspect the dynamics gap;

if it widens only in specific situations, suspect that situation's physics.

Such quantification tells you where to fix. If the trajectory diverges from the start, basic dynamics calibration is needed; if it diverges only at a specific moment (for example, when contact occurs), you must improve the reproduction of that physics.
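Both measures are straightforward to compute once sim and real rollouts are logged in the same format. The sketch below assumes success outcomes recorded as 0/1 lists and trajectories recorded as per-step state vectors; the function names and the divergence threshold are hypothetical.

```python
def performance_gap(sim_successes, real_successes):
    """Difference between sim and real success rates (lists of 0/1 outcomes)."""
    sim_rate = sum(sim_successes) / len(sim_successes)
    real_rate = sum(real_successes) / len(real_successes)
    return sim_rate - real_rate


def trajectory_gap(sim_states, real_states):
    """Per-step Euclidean distance between sim and real state vectors."""
    return [sum((s - r) ** 2 for s, r in zip(sim, real)) ** 0.5
            for sim, real in zip(sim_states, real_states)]


def first_divergence(gaps, threshold):
    """Index of the first step whose gap exceeds the threshold, or None."""
    return next((i for i, g in enumerate(gaps) if g > threshold), None)


sim_states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
real_states = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.5]]
gaps = trajectory_gap(sim_states, real_states)
print(performance_gap([1, 1, 1, 0], [1, 0, 1, 0]), gaps, first_divergence(gaps, 0.1))
```

A small index from `first_divergence` points toward basic dynamics calibration, while a divergence that lines up with a contact event points toward that contact's physics.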

Pitfalls and Caveats

- Excessive randomization: Blindly widening the randomization range can keep learning from converging at all, or produce an overly conservative policy.

- Exploiting sim loopholes: A policy can exploit the simulator's physical errors to obtain reward in unrealistic ways. Such sim-only strategies collapse in reality.

- Evaluation bias: Do not be optimistic about real performance just because simulation performance is good. You must validate thoroughly in reality.

- Sensor alignment: If the sensor coordinate frames, delays, or scales of sim and reality diverge, the policy can fail as soon as it is deployed. This alignment is often overlooked but very important.

Conclusion

Sim-to-real transfer is the key bridge that makes robot learning practical. Domain randomization absorbs the gap through robustness, system identification and digital twins pull the simulator closer to reality, and domain adaptation ports learned knowledge into reality. These techniques are not mutually exclusive; in practice they are combined.

That said, in areas where physics is hard to reproduce, such as touch and precise contact, real data is still needed, and zero-shot transfer is not a cure-all. Sim-to-real is less a finished solution than an ongoing engineering process of narrowing the reality gap little by little.

References

- NVIDIA Isaac / robotics: https://developer.nvidia.com/isaac

- NVIDIA Isaac Lab docs: https://isaac-sim.github.io/IsaacLab/

- OpenAI Spinning Up (intro to RL): https://spinningup.openai.com/

- Domain Randomization paper (Tobin et al., arXiv): https://arxiv.org/abs/1703.06907

- Learning Dexterous In-Hand Manipulation (OpenAI Dactyl, arXiv): https://arxiv.org/abs/1808.00177

- Open X-Embodiment paper (arXiv): https://arxiv.org/abs/2310.08864

- MuJoCo physics engine: https://mujoco.org/

- Gymnasium (RL environments): https://gymnasium.farama.org/
