How Robots Learn — Imitation Learning and Reinforcement Learning

Introduction
Four Approaches to Robot Learning
Imitation Learning — Watch and Copy
Reinforcement Learning — Learning by Trial and Error
Inverse Reinforcement Learning — Inferring Reward from Demonstrations
Comparing the Two Approaches
Combining the Two Approaches
Practical Decision Criteria
Understanding Data Efficiency Numerically
Policy Evaluation Methodology
The Whole Flow Seen Through One Task
Pitfalls and Caveats
Conclusion
References

Introduction

Telling a robot to "pick up the coffee cup and move it" is easy. But behind that one sentence lie countless low-level control problems: modulating fingertip force, estimating the cup's position, planning the arm's trajectory, detecting slippage. Humans acquire this sensorimotor intuition through years of falling and spilling, but a robot must either learn all of it from scratch or learn it from someone.

The ways a robot can acquire skills fall into roughly four categories: imitation learning, where a human demonstrates and the robot copies; reinforcement learning, where the robot finds high-reward behavior through trial and error; simulation learning, where vast experience is accumulated in a virtual environment; and pre-programming, where a human hand-codes rules and trajectories.

This article focuses on the two approaches at the center of recent robot learning research: imitation learning and reinforcement learning. We examine how each works, when each is strong or weak, how much data each demands, and why the latest research tries to combine them. For accuracy, we cite specific numbers or system details only when they are well established, and generalize where things are uncertain.

Four Approaches to Robot Learning

Let us first survey the landscape. The diagram below shows the four paths by which a robot obtains a behavior policy (a function that decides which action to take in which state).

                   Paths to a Robot Policy
                   ────────────────────────

  (1) Pre-programming            (2) Imitation Learning
  ┌───────────────────┐      ┌───────────────────────┐
  │ Human writes rules │      │ Human demonstrates,    │
  │ and trajectories   │      │ robot learns from data │
  │ directly in code   │      │                       │
  │                   │      │ e.g. manipulation      │
  │ e.g. fixed weld    │      │ demos gathered via     │
  │      path          │      │ teleoperation          │
  └─────────┬─────────┘      └───────────┬───────────┘
            │                            │
            ▼                            ▼
       ┌──────────────────────────────────────────┐
       │         Behavior Policy π(a | s)         │
       │    state s ──▶ policy picks action a     │
       └──────────────────────────────────────────┘
            ▲                            ▲
            │                            │
  ┌─────────┴─────────┐      ┌───────────┴───────────┐
  │ Train policy on    │      │ Improve policy by      │
  │ vast virtual       │      │ trial and error to     │
  │ experience         │      │ maximize reward        │
  │                   │      │                       │
  │ e.g. thousands of  │      │ e.g. +reward if it     │
  │  parallel hours    │      │  stays up, -if it falls│
  └───────────────────┘      └───────────────────────┘
  (3) Simulation Learning        (4) Reinforcement Learning

These four approaches are not mutually exclusive. Real systems often combine them: build a rough policy with reinforcement learning in simulation, then refine it with imitation learning from real human demonstrations. The following sections examine imitation learning and reinforcement learning in turn.

Imitation Learning — Watch and Copy

The premise of imitation learning is simple. An expert (usually a human) performs a task, the performance is recorded as data, and the robot learns the mapping "in this state, take this action" from that data. The key is that no explicit reward function needs to be designed. Instead of defining "what makes a good action" as a formula, you show examples of good actions.

Ways to Collect Demonstration Data

The success of imitation learning hinges on the quality and quantity of demonstration data. The main collection methods are as follows.

Teleoperation: A human remotely operates the robot with a joystick, VR controller, or a master device shaped like the robot arm. The robot's joint angles, gripper state, and camera images are recorded together, forming state-action pairs. Because data is collected through the robot's own body, embodiment-mismatch problems are minimized.
Motion Capture: A human's motion is tracked with markers or cameras to obtain trajectories. This yields natural motion at scale, but because human and robot bodies differ, the motion must be retargeted to the robot's joints and proportions before it can be used.
Kinesthetic Teaching: A human physically grabs the robot arm and moves it through the desired trajectory. Intuitive, but ill-suited to large-scale collection.

Behavioral Cloning — The Most Basic Form

The simplest form of imitation learning is behavioral cloning (BC). It is essentially supervised learning. The state is the input, the expert's action is the target label, and the policy network is trained to mimic that mapping.

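# Minimal behavioral cloning training loop (runnable sketch)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PolicyNet(nn.Module):
    """MLP that maps an observation vector to a continuous action vector."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

obs_dim, act_dim = 32, 7  # illustrative sizes

# demo_states, demo_actions: state-action pairs from expert demos.
# Random placeholders so the loop runs; replace with recorded demonstrations.
demo_states = torch.randn(1024, obs_dim)
demo_actions = torch.randn(1024, act_dim)
dataloader = DataLoader(
    TensorDataset(demo_states, demo_actions), batch_size=64, shuffle=True
)

policy = PolicyNet(obs_dim, act_dim)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # regression loss for continuous actions

for epoch in range(10):
    for states, expert_actions in dataloader:
        predicted = policy(states)
        loss = loss_fn(predicted, expert_actions)  # minimize gap to expert action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()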

Behavioral cloning is easy to implement and trains stably. But it has one serious weakness: distribution shift.

Distribution Shift and Compounding Error

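A policy trained by behavioral cloning works well only in the states seen during demonstrations. As the robot actually moves, tiny errors accumulate, and those errors carry it into states never seen in the demonstrations. In those states the policy does not know what to do, so it produces even larger errors. This snowballing of errors is called compounding error. The analysis in the DAgger paper (Ross et al., 2011) makes this concrete: if the cloned policy errs with probability ε per step, the expected cost over a T-step task can grow on the order of εT² rather than εT.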

   Demo distribution (states the expert passed through)
   ●●●●●●●●●●●●●●●●●●●●●●▶ goal
        │
        │ small error occurs
        ▼
   ○ ── state not in the demos (policy is inexperienced)
        │
        │ larger error
        ▼
   ○ ─────── drift into ever more unfamiliar states
             (task fails from compounding error)
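
The drift sketched above can be reproduced in a toy simulation. This is only an illustrative sketch under strong assumptions: the state is a single number (deviation from the demonstrated path), the hypothetical function mean_deviation and all of its parameters are invented for this example, and the cloned policy is assumed to output no correction at all outside the demonstrated region.

```python
import numpy as np

def mean_deviation(correct_outside_support, support=0.15, noise_std=0.1,
                   horizon=200, n_rollouts=500, seed=0):
    """Average |x| over rollouts of the 1-D system x <- x + a + noise.

    Inside the demonstrated region (|x| <= support) the policy copies the
    expert and applies a = -x. Outside it, the cloned policy has no data and
    (in this toy) applies a = 0, while the expert would still apply a = -x.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_rollouts)
    total = 0.0
    for _ in range(horizon):
        in_support = np.abs(x) <= support
        action = np.where(in_support | correct_outside_support, -x, 0.0)
        x = x + action + rng.normal(0.0, noise_std, size=n_rollouts)
        total += np.abs(x).mean()
    return total / horizon

expert_dev = mean_deviation(correct_outside_support=True)
cloned_dev = mean_deviation(correct_outside_support=False)
print(f"expert: {expert_dev:.3f}  cloned: {cloned_dev:.3f}")
```

Both policies behave identically on demonstrated states, yet the cloned one ends up with a larger average deviation because every excursion outside the demonstrated region goes uncorrected.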

DAgger — Preparing for Unfamiliar States

The canonical technique for mitigating this is DAgger (Dataset Aggregation). The core idea is to additionally ask "what would the expert have done in the states the robot actually visits."

The procedure is as follows. First, train a policy from the initial demonstrations. Roll out that policy, and the robot visits new states not present in the demonstrations. Record those states, then have the expert label "what is the correct action in this state." Merge the newly obtained state-action pairs into the existing data and retrain the policy. Repeating this gradually aligns the distribution of states the policy visits with the distribution of the training data.

   The DAgger Loop
   ────────────────

   [1] Run the robot with the current policy
            │
            ▼
   [2] Collect the states the robot visited
            │
            ▼
   [3] Expert labels the correct action for those states
            │
            ▼
   [4] Add to the dataset and retrain the policy
            │
            └────────▶ (back to [1])

The price of DAgger is the expert's repeated involvement. A human must keep labeling, which is costly. Even so, it is widely used because it substantially reduces the compounding-error problem.
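
The loop above can be sketched on a toy problem. Everything here is an assumption made for illustration: the one-dimensional dynamics, the hypothetical helpers expert_action, rollout, and fit_gain, and the use of a scalar linear policy fitted by least squares. The initial demonstration states are also simply sampled from a narrow Gaussian rather than recorded from expert rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(states):
    # Hypothetical nonlinear expert that steers the state toward zero.
    return -np.tanh(states)

def rollout(gain, horizon=30):
    """Run the linear policy a = gain * s and return the states it visits."""
    state, visited = rng.normal(0.0, 2.0), []
    for _ in range(horizon):
        visited.append(state)
        state = state + gain * state + rng.normal(0.0, 0.1)
    return np.array(visited)

def fit_gain(states, actions):
    """Least-squares fit of the scalar gain in a = gain * s."""
    return float(states @ actions / (states @ states))

# Round 0: plain behavioral cloning on a narrow set of demonstrated states.
data_states = rng.normal(0.0, 0.3, size=50)
data_actions = expert_action(data_states)
gain = fit_gain(data_states, data_actions)

dataset_sizes = [len(data_states)]
for _ in range(5):
    visited = rollout(gain)                    # [1]-[2] run policy, record states
    labels = expert_action(visited)            # [3] expert labels those states
    data_states = np.concatenate([data_states, visited])    # [4] aggregate
    data_actions = np.concatenate([data_actions, labels])
    gain = fit_gain(data_states, data_actions)               # [4] retrain
    dataset_sizes.append(len(data_states))

print(dataset_sizes, round(gain, 3))
```

The point of the sketch is the data flow: the states come from the learner's own rollouts, while the labels always come from the expert.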

Reinforcement Learning — Learning by Trial and Error

Reinforcement learning (RL) starts from an entirely different philosophy. Instead of showing demonstrations, it defines "what is good" as a scalar signal called reward. The robot (agent) interacts with the environment, takes actions, and receives reward as a result. The goal is to find a policy that maximizes cumulative reward (the total reward received over the long run).

Core Components

The interaction structure of reinforcement learning can be summarized as follows.

        The Reinforcement Learning Loop
        ────────────────────────────────

        ┌────────────────────────┐
        │        Agent            │
        │   (policy π: s ──▶ a)    │
        └───────┬────────────────┘
                │ action a
                ▼
        ┌────────────────────────┐
        │      Environment        │
        │  (robot + physical world)│
        └───────┬────────────────┘
                │ next state s', reward r
                ▼
        ┌────────────────────────┐
        │ Accumulate (s, a, r, s') │
        │  ──▶ improve policy π     │
        └────────────────────────┘

State: The current situation of the robot and its environment, such as joint angles, object positions, and camera images.
Action: The command the robot issues at each step, such as joint torques or target velocities.
Reward: The immediate evaluative signal at each step. For example, positive reward for getting closer to the goal, negative for dropping an object.
Policy: The function mapping states to actions. This is what reinforcement learning seeks to improve.
Exploration vs Exploitation: The balance between repeating actions already known to be good (exploitation) and trying new actions to find something better (exploration). This balance is one of the central challenges of reinforcement learning.
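
To make these components concrete, here is a minimal sketch of the agent-environment loop using tabular Q-learning with epsilon-greedy exploration. The five-cell corridor environment, the function step, and the hyperparameters are all hypothetical choices for illustration; real robot tasks have continuous states and actions and need function approximation.

```python
import numpy as np

N_STATES, GOAL = 5, 4            # corridor cells 0..4, reward only at cell 4
LEFT, RIGHT = 0, 1

def step(state, action):
    """Environment: move one cell, return (next_state, reward, done)."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, 2))      # action-value table Q[state, action]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(300):
    state = 0
    for _ in range(100):
        # Explore with probability epsilon (or while nothing is known), else exploit.
        if rng.random() < epsilon or q[state, LEFT] == q[state, RIGHT]:
            action = int(rng.integers(2))
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update from the transition (s, a, r, s').
        target = reward if done else reward + gamma * q[next_state].max()
        q[state, action] += alpha * (target - q[state, action])
        state = next_state
        if done:
            break

greedy = q[:GOAL].argmax(axis=1)   # learned greedy action in each non-goal cell
print(greedy)
```

After training, the greedy action in every non-goal cell is RIGHT, even though the agent was never told which direction the goal lies in.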

The Difficulty of Reward Design

RL performance depends heavily on how well the reward function is designed. If the reward is too sparse (for example, +1 only on full task success), the agent receives no learning signal until it succeeds by chance. Conversely, densely designed rewards (reward shaping) speed learning up but can lead to reward hacking, where the agent maximizes reward in ways the designer never intended.

   Sparse reward vs Dense reward
   ─────────────────────────────

   Sparse: ......................●   (+1 only at the moment of success)
           learning signal is rare, early exploration is very hard

   Dense:  ▁▂▃▄▅▆▇█             (increases as the goal nears)
           learns fast, but poor design risks reward hacking
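
As a rough sketch of the difference, the two reward styles for a reaching task might look like this. The function names, the 2 cm tolerance, and the example positions are assumptions for illustration only.

```python
import math

def sparse_reward(gripper_pos, goal_pos, tolerance=0.02):
    """+1 only when the gripper is within `tolerance` metres of the goal."""
    return 1.0 if math.dist(gripper_pos, goal_pos) <= tolerance else 0.0

def dense_reward(gripper_pos, goal_pos):
    """Negative distance: rises smoothly toward 0 as the goal nears."""
    return -math.dist(gripper_pos, goal_pos)

goal = (0.5, 0.0, 0.2)
for pos in [(0.0, 0.0, 0.2), (0.3, 0.0, 0.2), (0.5, 0.0, 0.2)]:
    print(pos, sparse_reward(pos, goal), round(dense_reward(pos, goal), 3))
```

The sparse version gives no hint of progress until the goal is reached, while the dense version ranks every position. The dense version also encodes a designer assumption (straight-line distance is what matters), which is exactly where reward hacking can creep in.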

Data Efficiency and Safety Problems

Running reinforcement learning directly on a physical robot faces two large barriers.

First, data efficiency. RL typically needs hundreds of thousands to millions of interactions. Performing that many trials with a real robot incurs enormous time and wear. This is why much research chooses the sim-to-real approach: learn in simulation, then transfer to reality.

Second, safety. During exploration the robot tries near-random actions, which can damage itself or its surroundings. Real-world learning requires safety constraints or human-in-the-loop safeguards.

Inverse Reinforcement Learning — Inferring Reward from Demonstrations

There is an interesting middle ground between imitation learning and reinforcement learning: inverse reinforcement learning (IRL).

Ordinary reinforcement learning finds an optimal policy given a reward function. Inverse RL is the reverse. Given expert demonstrations, it infers the hidden reward function: "what reward was this expert trying to maximize." It then runs reinforcement learning on that inferred reward to obtain a policy.

   Ordinary RL vs Inverse RL (IRL)
   ─────────────────────────────────

   Ordinary RL:  reward function ──▶ (RL) ──▶ policy

   IRL:          expert demos ──▶ (reward inference) ──▶ reward function
                                                          │
                                                          ▼
                                                    (RL) ──▶ policy

Inverse RL is appealing because, instead of merely mimicking actions, it tries to recover a reward that captures the "intent" of the behavior. Since a reward function is a concise evaluation over states, there is room for it to generalize better to novel situations absent from the demonstrations. That said, it has a limitation: inference is fundamentally hard because there can be many reward functions explaining a single demonstration (ambiguity).

This approach can be seen as a bridge connecting imitation learning (using demonstrations) and reinforcement learning (reward-based optimization), and is especially useful for tasks where a reward is hard to hand-design.
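
The ambiguity mentioned above can be seen in a tiny example. The three candidate trajectories and their per-step rewards are hypothetical, and the comparison assumes fixed-length trajectories with undiscounted returns.

```python
def trajectory_return(rewards):
    """Undiscounted return of a fixed-length trajectory."""
    return sum(rewards)

# Per-step rewards of three hypothetical candidate trajectories (same length).
candidates = [[0.0, 0.0, 1.0], [0.2, 0.2, 0.2], [0.0, 0.5, 0.0]]

def best_index(reward_transform):
    """Index of the best trajectory under a transformed reward, plus all returns."""
    returns = [
        trajectory_return([reward_transform(r) for r in traj])
        for traj in candidates
    ]
    return returns.index(max(returns)), returns

original_best, _ = best_index(lambda r: r)
rescaled_best, _ = best_index(lambda r: 2.0 * r + 5.0)   # scaled and shifted
_, zero_returns = best_index(lambda r: 0.0)              # degenerate reward

print(original_best, rescaled_best, zero_returns)
```

The original and the rescaled reward prefer the same trajectory, and the all-zero reward makes every trajectory equally good, so a demonstration of that trajectory is consistent with all three. Practical IRL methods add extra assumptions or regularization to pick among such candidates.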

Comparing the Two Approaches

The table below summarizes what we have covered.

Item	Imitation Learning	Reinforcement Learning
Learning signal	Expert demonstrations	Reward function
Reward design	Not needed	Needed (a core challenge)
Data source	Human demonstrations	Interaction with environment
Data efficiency	Relatively high	Relatively low
Early performance	Reaches demo level quickly	Very low at first
Performance ceiling	Roughly expert level	Can surpass the expert
Main weakness	Distribution shift, compounding error	Exploration, reward hacking, safety
Representative methods	Behavioral cloning, DAgger	Policy gradient, actor-critic

The core contrast in one sentence: imitation learning is data-efficient but capped by demonstration quality, while reinforcement learning can in principle surpass demonstrations but demands vast experience and careful reward design.

Combining the Two Approaches

A major thread in recent research is fusing the strengths of both methods. The representative combinations are as follows.

Bootstrapping — Start with Imitation, Refine with RL

The most intuitive combination is to initialize the policy with imitation learning, then improve it with reinforcement learning. Once imitation places the policy at a plausible starting point, RL skips the daunting random exploration of the early phase and raises performance from there. The demonstrations effectively steer the direction of exploration.

   Imitation → RL bootstrapping
   ─────────────────────────────

   random policy ──(imitation)──▶ demo-level policy ──(RL)──▶ above-demo policy

   [ Performance ]
     high ┤                              ╭──────  further gain from RL
          │                    ╭─────────╯
          │          ╭─────────╯  ← reaches quickly via imitation
     low  ┤──────────╯
          └──────────────────────────────────▶ [ Training time ]
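
A toy sketch of the two stages follows. It assumes a one-dimensional system, a hypothetical suboptimal expert with gain -0.6, and a reward of minus the squared state. Simple random-search hill climbing stands in for a real RL algorithm here, so this shows the shape of the pipeline rather than a practical method.

```python
import numpy as np

def episode_return(gain, horizon=10):
    """Return of the policy a = gain * s on s <- s + a, with reward -s^2."""
    state, total = 1.0, 0.0
    for _ in range(horizon):
        total -= state ** 2
        state = state + gain * state
    return total

rng = np.random.default_rng(0)

# Stage 1: behavioral cloning from a suboptimal expert (a = -0.6 * s).
demo_states = rng.normal(0.0, 1.0, size=100)
demo_actions = -0.6 * demo_states
bc_gain = float(demo_states @ demo_actions / (demo_states @ demo_states))

# Stage 2: improve the return, starting from the cloned policy.
# Hill climbing: keep a random perturbation only if the return improves.
rl_gain = bc_gain
for _ in range(200):
    candidate = rl_gain + rng.normal(0.0, 0.1)
    if episode_return(candidate) > episode_return(rl_gain):
        rl_gain = candidate

print(f"BC gain {bc_gain:.3f}, return {episode_return(bc_gain):.3f}")
print(f"RL gain {rl_gain:.3f}, return {episode_return(rl_gain):.3f}")
```

The cloned policy reproduces the expert's gain, and the second stage moves it toward the gain of -1 that is optimal for this toy system, which the expert never demonstrated.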

Offline RL — Treating Demonstration Data with RL

Offline RL learns a policy purely from a fixed, pre-collected dataset. Without new interaction, it exploits the state-action-reward information contained in demonstrations or past logs. Like imitation learning it needs no further environment interaction, yet like RL it uses reward information to distinguish the good and bad behaviors contained in the data. It does, however, require dedicated conservative techniques, such as pessimistic value estimates or constraints that keep the policy close to the data, because the value of actions absent from the dataset tends to be overestimated.

VLA — Generalist Policies Trained on Large-Scale Demonstrations

A major trend of the mid-2020s is the Vision-Language-Action (VLA) model, which jointly leverages large-scale robot demonstration data and web-scale vision-language data. These are fundamentally rooted in large-scale imitation learning, but inherit language understanding and visual recognition from pretrained large models to aim at generalization to new tasks.

RT-2 (Google DeepMind): Fine-tunes a vision-language model to output discretized action tokens.
Open X-Embodiment / RT-X: Open X-Embodiment is a cross-robot dataset pooled from different robots at many institutions, and RT-X denotes the models trained on it, attempting generalization across robot embodiments.
OpenVLA: A 7B-scale open model trained on roughly 970k demonstrations, combining vision encoders (DINOv2, SigLIP) with a language model (Llama 2).
π0 (Physical Intelligence): Takes an approach that generates continuous, high-frequency actions with flow matching / diffusion-family techniques.
GR00T N1 (NVIDIA): Uses a dual-system design that combines a fast-reacting System 1 (diffusion-based action generation) with a slower, deliberative System 2 built on a vision-language model.

Several of these models, RT-2 among them, use co-fine-tuning, which trains on web vision-language data and robot trajectories together, and some can be adapted efficiently to new robots and tasks with parameter-efficient techniques like LoRA. Specific details may vary by version and implementation, so it is best to consult the official sources for the latest information.
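
The "discretized action tokens" idea mentioned for RT-2 can be sketched as uniform binning of each action dimension. The function names, the bin count of 256, and the action range of [-1, 1] are assumptions for this illustration, and real systems differ in how they map bins to vocabulary tokens.

```python
import numpy as np

def tokenize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)            # now in [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize(tokens, low, high, n_bins=256):
    """Map bin indices back to the centre of each bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)

action = np.array([-1.0, -0.25, 0.0, 0.4, 1.0, 0.73, -0.9])  # e.g. a 7-D command
tokens = tokenize(action, low=-1.0, high=1.0)
recovered = detokenize(tokens, low=-1.0, high=1.0)

print(tokens)
print(np.abs(recovered - action).max())
```

The round trip loses at most half a bin width per dimension. That quantization error is one reason other approaches in the list generate continuous actions instead.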

Practical Decision Criteria

Which method to choose in a real project depends on the situation. Rough criteria are as follows.

If good demonstrations are easy to obtain and the task is relatively structured, imitation learning (especially behavioral cloning + DAgger) is a fast, stable starting point.
If the simulator is accurate enough and the reward can be clearly defined, simulation-based RL can build a policy without any demonstrations.
If you need performance beyond expert level or adaptation to novel situations, the combined approach of initializing with imitation and improving with RL is a strong candidate.
If generalization across many tasks and understanding of language instructions matter, the large-scale demonstration-based VLA approach is worth considering.

Understanding Data Efficiency Numerically

The phrase "imitation learning is data-efficient" comes up often, but its meaning deserves to be pinned down a bit more concretely. Data efficiency can be understood as the amount of interaction needed to reach a desired level of performance.

Behavioral cloning is supervised learning, so if you have N demonstrations, it suffices to train on those N over several epochs. No additional environment interaction is needed. Online reinforcement learning, by contrast, must collect fresh experience from the environment each time it improves the policy, and that collection runs into hundreds of thousands to millions of steps.

   Interaction Needed vs Reached Performance (conceptual scale)
   ─────────────────────────────────────────────────────────

   Behavioral cloning : no extra interaction after collecting demos
                        (needs: N good demonstrations)

   DAgger             : demos + repeated expert labeling
                        (needs: N + several rounds of expert involvement)

   Online RL          : a large number of environment steps
                        (needs: hundreds of thousands to millions of trials)

   ▶ For the same performance, "what" is required differs:
     imitation demands human demos, RL demands environment steps.

The key insight here is that data efficiency is hard to compare on a single axis. Imitation learning consumes the expensive resource of "human time," while reinforcement learning consumes "environment steps" (a relatively cheap resource if in simulation). Which is favorable depends on how easily demonstrations can be obtained and how accurate the simulator is.
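
A back-of-the-envelope calculation shows why the resources are hard to compare. Every number below is an illustrative assumption, not a measurement, and the simulation line further assumes each simulated environment runs at real-time speed.

```python
# All numbers below are illustrative assumptions, not measurements.
rl_steps = 1_000_000        # environment steps for online RL
control_hz = 10             # control frequency of the robot
parallel_sim_envs = 1_000   # simulated environments stepped in parallel
n_demos = 100               # teleoperated demonstrations
demo_seconds = 30           # duration of one demonstration

# Online RL on one real robot: every step costs wall-clock time.
real_robot_hours = rl_steps / control_hz / 3600

# Same number of steps spread across parallel simulated environments.
sim_minutes = rl_steps / control_hz / parallel_sim_envs / 60

# Imitation learning: the cost is the operator's time at the controls.
demo_minutes = n_demos * demo_seconds / 60

print(f"real robot RL : {real_robot_hours:.1f} hours")
print(f"simulated RL  : {sim_minutes:.1f} minutes")
print(f"demonstrations: {demo_minutes:.0f} minutes of human time")
```

Under these assumptions the same million steps cost more than a day on one physical robot but only minutes in parallel simulation, while the demonstrations cost under an hour of skilled human time. Changing any assumption shifts the balance.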

Policy Evaluation Methodology

Evaluating a robot policy properly is as hard as learning it. A single benchmark number is not enough to judge a policy's true ability.

Success rate: The fraction of attempts that complete the task. The most basic metric, but easily overestimated if the initial conditions are not varied enough.
Generalization evaluation: The success rate on new objects, new backgrounds, and new lighting unseen during training. This distinguishes a policy that memorized its training scenes from one that learned something transferable.
Robustness evaluation: How well it holds up when disturbances or sensor noise are deliberately introduced.

   Layers of Policy Evaluation
   ─────────────────────────────

   [1] In-distribution eval    ──▶ basic success rate
        │  (same objects/env)
        ▼
   [2] Out-of-distribution     ──▶ new objects, new env
        │  generalization eval
        ▼
   [3] Robustness eval         ──▶ inject disturbances/noise
        │
        ▼
   [4] Long-term reliability   ──▶ maintains performance on repeats

   ▶ Higher layers better reflect fitness for actual deployment.

A common pitfall in evaluation is mistaking a handful of successful scenes for a finished policy. Robot policies are highly sensitive to initial conditions, so you must repeat a statistically meaningful number of attempts across varied conditions to reach a trustworthy conclusion.
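
One way to see what "statistically meaningful" means is to attach a confidence interval to the success rate. The sketch below uses the Wilson score interval, which is one common choice among several. The trial counts are made-up examples.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z = 1.96 gives about 95%)."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

few = wilson_interval(9, 10)      # 9 successes in 10 trials
many = wilson_interval(90, 100)   # 90 successes in 100 trials

print(f"9/10   -> {few[0]:.3f} to {few[1]:.3f}")
print(f"90/100 -> {many[0]:.3f} to {many[1]:.3f}")
```

Both runs report a 90% success rate, but with 10 trials the interval spans roughly 0.60 to 0.98, so the true rate could plausibly be far lower. The interval also assumes independent trials drawn from the conditions of interest, which is why varying the initial conditions matters as much as the count.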

The Whole Flow Seen Through One Task

Let us tie the material together with an imagined manipulation task: "putting objects on a desk into a box."

First, a human operates the robot via teleoperation to collect 50 to 100 demonstrations. A behavioral cloning policy trained on these succeeds reasonably often in layouts similar to the demonstrations. But when objects are placed in positions absent from the demonstrations, failures increase due to compounding error.

Here we apply DAgger. We roll out the policy, collect the situations where it fails, and have an expert label the correct actions in those situations to add to the data. After a few rounds, the success rate rises even in unfamiliar layouts.

If higher performance is needed, we use this imitation policy as the initialization and refine it with reinforcement learning in simulation. If the simulator is inaccurate, the sim-to-real gap becomes a problem, so we also apply techniques like domain randomization. Finally, only after repeated evaluation across varied objects and lighting to confirm generalization and robustness do we consider deployment.

   Whole-Task Pipeline Summary
   ─────────────────────────────

   Collect teleoperation demos
        │
        ▼
   Initial policy via behavioral cloning
        │
        ▼
   Cover unfamiliar situations with DAgger
        │
        ▼
   (optional) Improve performance with RL
        │
        ▼
   Repeated evaluation across varied conditions
        │
        └──▶ deploy once sufficiently validated
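
The domain randomization mentioned in the walkthrough can be sketched as resampling simulator parameters at the start of every training episode. The SimParams fields and all ranges below are hypothetical, and how the sampled values are applied depends on the simulator in use.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float          # contact friction coefficient
    object_mass_kg: float    # mass of the manipulated object
    light_intensity: float   # relative scene brightness
    camera_offset_m: float   # lateral camera mounting error

def sample_sim_params(rng):
    """Draw one set of simulator parameters; all ranges are hypothetical."""
    return SimParams(
        friction=rng.uniform(0.5, 1.2),
        object_mass_kg=rng.uniform(0.05, 0.5),
        light_intensity=rng.uniform(0.6, 1.4),
        camera_offset_m=rng.uniform(-0.01, 0.01),
    )

rng = random.Random(0)
episode_params = [sample_sim_params(rng) for _ in range(1000)]
print(episode_params[0])
```

Training across many such draws pushes the policy to work over the whole range rather than one exact simulator setting, in the hope that the real world falls inside that range.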

Pitfalls and Caveats

Demonstration bias: Imitation learning inherits the habits and biases embedded in the demonstrations. If the demonstrator avoided certain situations, the robot will be weak in those situations too.
Reward hacking: In RL, when the reward fails to perfectly capture the true goal, the agent finds loopholes that maximize the reward number rather than the goal.
Simulation gap: Policies learned in simulation commonly fail when transferred to reality. This problem is important enough to warrant its own topic: sim-to-real transfer.
Difficulty of evaluation: Robot policy performance varies greatly with environment, objects, and initial conditions. Generalization performance should not be inferred from a handful of successes.

Conclusion

In robot learning, imitation learning and reinforcement learning are less two opposing camps than complementary tools. Imitation learning transfers human knowledge to the robot quickly and data-efficiently, while reinforcement learning enables improvement beyond that knowledge and adaptation to novel situations. Recent VLA models, offline RL, and bootstrapping approaches all attempt to build more powerful policies by dissolving the boundary between the two.

In the next article we take up in earnest the gap between simulation and reality that recurred throughout this piece: the sim-to-real transfer problem.

References

RT-2 paper (arXiv): https://arxiv.org/abs/2307.15818
OpenVLA paper (arXiv): https://arxiv.org/abs/2406.09246
Open X-Embodiment paper (arXiv): https://arxiv.org/abs/2310.08864
Physical Intelligence (π0 and more): https://www.physicalintelligence.company/
Open X-Embodiment project: https://robotics-transformer-x.github.io/
NVIDIA Isaac / robotics: https://developer.nvidia.com/isaac
OpenAI Spinning Up (intro to RL): https://spinningup.openai.com/
DAgger paper (Ross et al., AISTATS 2011): https://proceedings.mlr.press/v15/ross11a.html