Introduction

The humanoid robot is an old dream. A machine that stands and walks on two legs like a person, picks up and handles objects with its hands, and uses the very spaces and tools that humans use. In the past few years this dream has started to walk out of the research lab and, step by step, onto real work sites.

But "walking on two legs" is not as simple as the phrase makes it sound. A person who walks is falling at every moment, and catching that fall with the next foot is something they repeat unconsciously. For a robot, this task means coordinating dozens of joints in real time, managing the force distribution across the soles that touch the ground, and simultaneously accounting for the torso's posture and the arms' motion. It is a high-dimensional control problem.

This article looks at humanoid control along two main axes. One is **bipedal locomotion control**, and the other is **whole-body control**, moving every joint of the body together under a single objective. We then follow how these two integrate with manipulation, and what recent learning-based approaches and the foundation-model trend are changing.

For accuracy, a caveat up front. Specific specs or performance figures from robot manufacturers vary greatly by announcement date and hardware generation, so this article focuses on widely known concepts and published methodology, and generalizes concrete numbers carefully.

The Terrain of the Problem: Why Two-Legged Walking Is Hard

A wheeled robot is stable. Its support polygon is wide, and it does not tip over even at rest. A two-legged robot is different.

- **Narrow support**: The moment it stands on one foot, the support region shrinks to a single sole.

- **Inherent instability**: The human body is close to an inverted pendulum standing atop the ankle. Left alone, it falls.

- **Discontinuous contact**: With each step the foot touches and leaves the ground, and the contact state changes moment to moment.

- **High dimensionality**: A full-body humanoid typically has dozens of joint degrees of freedom.

A common simple model for reasoning about this terrain is the **Linear Inverted Pendulum Model (LIPM)**. It lumps the robot's mass into a single point (the center of mass, CoM) held at a constant height, and approximates the leg below it as a massless rod connecting that point to the ground, forming an inverted pendulum.

(CoM) ● ── center of mass

│

│ inverted pendulum: it tries to stand

│ but falls if left alone

│

──────────┴────────── ground

foot (ZMP)

Thanks to this simplification, before dealing with the full body dynamics we can first address the question: "where should the center of mass be so it does not fall?"
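To make "left alone, it falls" concrete, here is a minimal sketch of the LIPM in one horizontal direction. It assumes the standard LIPM equation x'' = (g / z) * (x - p), where x is the CoM position, z the constant CoM height, and p the ZMP. The function name, the 0.9 m height, and the time step are illustrative choices, not values from any particular robot.

```python
G = 9.81  # gravity, m/s^2

def simulate_lipm(x0, v0, zmp, com_height=0.9, dt=0.001, duration=1.0):
    """Integrate x'' = (g / z) * (x - zmp) with semi-implicit Euler; return final (x, v)."""
    omega_sq = G / com_height
    x, v = x0, v0
    for _ in range(int(round(duration / dt))):
        acc = omega_sq * (x - zmp)   # the farther the CoM is from the ZMP, the harder it is pushed away
        v += acc * dt
        x += v * dt
    return x, v

balanced = simulate_lipm(x0=0.0, v0=0.0, zmp=0.0)   # CoM exactly above the ZMP
offset = simulate_lipm(x0=0.01, v0=0.0, zmp=0.0)    # CoM 1 cm ahead of the ZMP
print(balanced, offset)
```

In this sketch the perfectly balanced pendulum stays put, while a 1 cm offset grows to more than 10 cm within one second. That exponential divergence is what every controller in the rest of the article has to fight.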

ZMP: The Reference Point That Keeps You Standing

One of the oldest and most important concepts in bipedal control is the **ZMP (Zero Moment Point)**.

The ZMP is the point on the ground about which the horizontal components of the moment produced by the ground reaction forces are zero. On flat ground it coincides with the center of pressure, so intuitively it is the "center" of the pressure with which the sole presses the ground. The key rule is this:

> As long as the ZMP stays within the support region (the sole, or the polygon formed by both feet), the foot maintains stable contact without lifting or rotating off the ground.

In other words, when planning a walk, if you design the CoM trajectory so that the ZMP at every instant does not leave the support polygon, you can produce a "dynamically stable" gait.

support polygon (both feet) single-foot support phase

┌───────────────┐ ┌────────┐

│ ● ZMP │ stable │ │ ● if ZMP goes

│ (inside) │ │ ●────┼──▶ outside, unstable

└───────────────┘ └────────┘
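The stability rule can be sketched as a geometric check. The snippet below assumes the cart-table form of the LIPM, where the ZMP follows from the CoM position and acceleration as p = x - (z / g) * x'', and it tests whether a point lies inside a convex support polygon. The foot dimensions and function names are hypothetical.

```python
def zmp_from_com(com_x, com_acc_x, com_height=0.9, g=9.81):
    """Cart-table relation between CoM motion and ZMP: p = x - (z / g) * x_ddot."""
    return com_x - (com_height / g) * com_acc_x

def inside_convex_polygon(point, vertices):
    """True if point is strictly inside a convex polygon given counter-clockwise vertices."""
    px, py = point
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i]
        bx, by = vertices[(i + 1) % n]
        # Cross product sign tells which side of edge a->b the point is on.
        cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        if cross <= 0.0:
            return False
    return True

# Hypothetical single-foot support region (meters), counter-clockwise.
foot = [(-0.05, -0.04), (0.15, -0.04), (0.15, 0.04), (-0.05, 0.04)]

print(inside_convex_polygon((0.05, 0.0), foot))                      # ZMP under the sole
print(inside_convex_polygon((zmp_from_com(0.0, 1.0), 0.0), foot))    # hard forward acceleration
```

In the second call, accelerating the CoM forward at 1 m/s^2 pushes the ZMP about 9 cm behind the CoM, which is already behind the heel of this hypothetical foot.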

A traditional ZMP-based gait generator generally follows this flow:

1. Decide the sequence of footstep positions.

2. Determine the allowable ZMP trajectory at each foot position.

3. Solve backward for the center-of-mass (CoM) trajectory that satisfies that ZMP.

4. Solve joint angles by inverse kinematics to satisfy the CoM and foot trajectories.

This approach is predictable and stable, but it strongly follows a predetermined trajectory, so it can be relatively fragile on rough terrain or under unexpected disturbances.
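Steps 1 to 3 of this flow can be sketched in one dimension. Note that this is one possible way to do step 3, not the method of any specific gait generator: it uses a backward recursion on the quantity xi = x + v / omega (often called the divergent component of motion), whereas classic ZMP generators often use preview control instead. Step 4 (inverse kinematics) is omitted, and all names and numbers are illustrative assumptions.

```python
import math

def plan_com_from_footsteps(footsteps_x, step_time=0.6, com_height=0.9, dt=0.005, g=9.81):
    """Return a sampled 1-D CoM trajectory for a list of footstep positions (step 1)."""
    omega = math.sqrt(g / com_height)
    zmp_ref = list(footsteps_x)              # step 2: hold the ZMP at each foot center
    # Step 3a: walk backward from the last foot to find xi at the start of each step.
    decay = math.exp(-omega * step_time)
    xi_start = [0.0] * len(zmp_ref)
    xi_end = zmp_ref[-1]                     # end at rest above the last foot
    for i in reversed(range(len(zmp_ref))):
        xi_start[i] = zmp_ref[i] + (xi_end - zmp_ref[i]) * decay
        xi_end = xi_start[i]
    # Step 3b: the CoM follows xi through the stable dynamics x' = omega * (xi - x).
    com = xi_start[0]                        # assumption: the robot starts at rest here
    trajectory = [com]
    samples_per_step = int(round(step_time / dt))
    for i, zmp in enumerate(zmp_ref):
        for k in range(samples_per_step):
            xi = zmp + (xi_start[i] - zmp) * math.exp(omega * k * dt)
            com += omega * (xi - com) * dt
            trajectory.append(com)
    return trajectory

com_path = plan_com_from_footsteps([0.0, 0.2, 0.4, 0.6])
print(round(com_path[0], 3), round(com_path[-1], 3))
```

The resulting CoM path moves forward monotonically and settles close to the last footstep. Because the whole trajectory is fixed in advance, a push that the plan did not anticipate is not handled, which is the fragility described above.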

The Anatomy of a Step: The Gait Cycle

Dissecting a single step, we find a **stance phase**, in which the foot is on the ground, alternating with a **swing phase**, in which the foot is in the air. This is also where the decisive difference between walking and running lies.

- **Walking**: At least one foot is always on the ground. A short **double support** phase exists in which both feet touch.

- **Running**: A **flight phase** arises in which both feet are in the air. Managing the impact at the moment of landing becomes far more important.

one cycle of walking (right foot basis)

┌────────────┬──────────┬────────────┬──────────┐

│ right │ double │ right │ double │

│ stance │ support │ swing │ support │

│ (on ground)│ │ (in air) │ │

└────────────┴──────────┴────────────┴──────────┘

bears load shift wt reach fwd land, swap

The double support phase is short but very important. In this moment the center of mass must move toward the next support foot, and the ZMP must move smoothly too. If this transition goes wrong, the gait becomes jerky or loses balance. A large part of a gait generator's effort goes into making this transition smooth.
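The cycle in the diagram can be sketched as a function of time. The cycle duration and the share of the cycle spent in double support are hypothetical parameters, and the phase names follow the diagram's right-foot labeling.

```python
def gait_phase(t, cycle_time=1.2, double_support_fraction=0.2):
    """Return the walking phase at time t, seen from the right foot."""
    ds = double_support_fraction / 2.0      # there are two double-support windows per cycle
    ss = 0.5 - ds                           # each single-support window
    s = (t % cycle_time) / cycle_time       # normalized position in the cycle, in [0, 1)
    if s < ss:
        return "right_stance"               # right foot bears the load, left foot swings
    if s < 0.5:
        return "double_support"             # weight shifts toward the left foot
    if s < 0.5 + ss:
        return "right_swing"                # right foot is in the air
    return "double_support"                 # right foot has landed, weight shifts back

for t in (0.0, 0.5, 0.7, 1.15):
    print(t, gait_phase(t))
```

A real gait scheduler would also adapt these durations to speed and terrain. The point here is only that the contact state is a discrete variable the controller must track.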

Hardware: What Moves the Robot

Before talking about control, we must touch on the hardware that actually executes those commands. However good a policy is, it is useless if the body cannot realize it.

| Component | Role | Key consideration |
| --- | --- | --- |
| Actuator | The muscle that moves the joint | Torque, speed, back-drivability |
| Reducer | Amplifies motor force | Efficiency, backlash, stiffness |
| Inertial sensor (IMU) | Measures torso pose, angular rate | Drift, noise |
| Joint encoder | Angle and velocity of each joint | Resolution, latency |
| Force/torque sensor | Measures contact force at foot/hand | Precision, durability |

**Back-drivability** is especially important. It means how readily a joint yields when pushed from the outside. A traditional joint with a very high reduction ratio is strong but stiff, and does not absorb unexpected impacts, transmitting them directly. Conversely, an appropriately back-drivable joint absorbs impacts and makes force control smooth, which is safer next to people. This is why recent dynamic humanoids adopt actuators favorable to force control.
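One reason a high reduction ratio feels stiff from the outside is that the motor's rotor inertia, seen at the joint output, scales with the square of the gear ratio. The short sketch below illustrates that relation with a hypothetical rotor inertia. It ignores friction, which also grows with the reduction.

```python
def reflected_inertia(rotor_inertia, gear_ratio):
    """Rotor inertia as felt at the joint output: J_out = N^2 * J_rotor."""
    return rotor_inertia * gear_ratio ** 2

rotor = 1e-4  # kg*m^2, hypothetical motor rotor inertia
high_ratio = reflected_inertia(rotor, 100)   # traditional high-reduction joint
low_ratio = reflected_inertia(rotor, 10)     # lower reduction, more back-drivable joint
print(high_ratio, low_ratio)
```

With these numbers a 10 times higher ratio makes the joint feel 100 times heavier to an external push, which is why impacts are transmitted rather than absorbed.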

Layers and Time Scales of Control

The various elements seen so far actually operate at different **time scales**. Organizing this into one picture shows how the whole system meshes.

slow ◀──────────────────────────────────────────▶ fast

task planning gait/footstep plan whole-body ctrl low-level motor

(a few Hz or less)(a few~tens of Hz) (hundreds of Hz) (kHz)

│ │ │ │

"what to do" "where to step" "how the body" "joint current"

│ │ │ │

└────────▶────────┴────────▶──────────┴────────▶────────┘

goals passed top → bottom

state feedback bottom → top

This layered structure matters because each layer solves only its own problem at its own speed. A slow planning layer need not run at kHz, and fast motor control need not know the whole task. This design of separation of concerns is exactly the same philosophy as layering in software engineering.
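The idea that each layer runs at its own speed can be sketched as a single loop with rate dividers. The specific rates below are hypothetical picks within the ranges in the diagram, and in a real system each layer would typically run in its own thread or process rather than one loop.

```python
def run_layers(duration_s=1.0, base_rate_hz=1000, rates_hz=None):
    """Count how often each layer runs when scheduled off one fast base tick."""
    rates_hz = rates_hz or {
        "task_planning": 1,
        "footstep_planning": 20,
        "whole_body_control": 500,
        "motor_control": 1000,
    }
    counts = {name: 0 for name in rates_hz}
    for tick in range(int(duration_s * base_rate_hz)):
        for name, rate in rates_hz.items():
            if tick % (base_rate_hz // rate) == 0:   # slower layers skip most ticks
                counts[name] += 1                    # a real system would call the layer here
    return counts

print(run_layers())
```

Over one simulated second, the motor loop runs a thousand times while the task planner runs once, and neither needs to know the other's internals.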

MPC: Walking While Looking Ahead

**Model Predictive Control (MPC)** advances the ZMP idea a step further. At every control cycle, MPC repeats the following:

1. Observe the current state.

2. Use a dynamics model to predict motion over a fixed future window (the prediction horizon).

3. Solve an optimization for the control-input sequence that minimizes a cost (balance deviation, energy, goal error, etc.) over that window.

4. Actually execute only the first step, then repeat from the beginning in the next cycle.

current state ─┐

▼

┌──────────────────────────────────┐

│ optimize future trajectory │

│ over the horizon │

│ t ── t+1 ── t+2 ── ... ── t+N │

└───────────────┬──────────────────┘

│ execute only first input

▼

apply command to robot

│

▼ (next cycle: observe again → re-optimize)

repeat (receding horizon)

MPC's strength is that it "looks ahead." Even if a choice looks slightly costly right now, it may be favorable for balance a few steps later, and when a disturbance arrives, MPC updates the plan immediately in the next cycle. MPC that directly optimizes contact forces on top of a simplified rigid-body model (e.g., single rigid body dynamics, SRBD) is widely used for dynamic motions such as walking, running, and stair climbing.

The cost is computation. The longer the horizon and the more detailed the model, the heavier real-time optimization (hundreds of Hz to several kHz) becomes. In practice, then, the model is simplified appropriately and the fast low-level control and the slower planning are split into layers.
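The receding-horizon loop can be sketched on a deliberately tiny model. The state here is xi = x + v / omega, the unstable part of the linear inverted pendulum, which obeys xi' = omega * (xi - p) with the ZMP p as input. As a simplification, the finite-horizon problem is solved without constraints by a backward Riccati recursion and the result is then clipped to the foot size. A real MPC would put the limit inside a constrained QP. All weights and limits are illustrative assumptions.

```python
import math

def finite_horizon_gain(a, b, q, r, horizon):
    """First-step feedback gain for xi_next = a*xi + b*p, minimizing sum(q*xi^2 + r*p^2)."""
    P, K = q, 0.0
    for _ in range(horizon):                 # sweep backward from the end of the horizon
        K = a * b * P / (r + b * b * P)
        P = q + a * a * P - a * b * P * K
    return K

def run_mpc(xi0, steps=200, dt=0.01, com_height=0.9, zmp_limit=0.1, horizon=50):
    omega = math.sqrt(9.81 / com_height)
    a = math.exp(omega * dt)                 # exact discretization for a ZMP held over dt
    b = 1.0 - a
    xi = xi0
    for _ in range(steps):
        # Observe, re-optimize over the horizon, apply only the first input, repeat.
        K = finite_horizon_gain(a, b, q=1.0, r=0.1, horizon=horizon)
        p = max(-zmp_limit, min(zmp_limit, -K * xi))
        xi = a * xi + b * p
    return xi

print(run_mpc(0.05))   # push that stays within reach of the foot
print(run_mpc(0.15))   # push beyond what the ZMP limit can recover
```

For this time-invariant toy the re-solved gain is the same every cycle, so the loop mainly mirrors the structure of the four steps above. The second call shows the limit of looking ahead: once xi leaves the region the foot can cover, no ZMP choice inside the foot brings it back, and a step is needed.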

Whole-Body Control: The Whole Body Under One Objective

So far we have focused mostly on the "center of mass and the feet." But a humanoid uses dozens of joints at once, including two arms, the torso, and the neck. Carrying an object while walking, pushing a door, or throwing out an arm to brace at the instant of a fall cannot be done with legs alone.

**Whole-Body Control (WBC)** bundles all these joints into a single optimization problem. It satisfies multiple objectives (tasks) simultaneously while computing joint torques or accelerations so that balance is maintained within physical constraints (joint limits, contact forces, friction).

┌────────── high-level planning (slow) ──────────┐

│ footstep planning · gait pattern · target pose │

└────────────────────┬───────────────────────────┘

│ pass tasks

▼

┌────────── whole-body control (WBC) ────────────┐

│ prioritized multiple tasks: │

│ 1) balance (keep CoM/ZMP) ← top priority │

│ 2) foot/hand trajectory tracking │

│ 3) posture, gaze, other secondary tasks │

│ constraints: joint limits · contact friction │

│ · torque limits │

└────────────────────┬───────────────────────────┘

│ joint torques/accelerations

▼

┌────────── low-level actuation (fast) ──────────┐

│ per-joint motor current/torque control (high │

│ frequency) │

└─────────────────────────────────────────────────┘

The core idea of WBC is **priority**. Objectives that can never be compromised, such as maintaining balance, sit at the top, while objectives like tracking a hand trajectory sit below them. If lower objectives are pursued only within the range (the null space) that does not harm higher objectives, you can prevent falling over while reaching out an arm.
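The null-space idea can be sketched at the velocity level for a primary task with a single row Jacobian, where the pseudo-inverse has a closed form. The Jacobian values and the three-joint setup are hypothetical, and real controllers usually work with accelerations or torques and multi-row tasks.

```python
def prioritized_velocity(primary_jacobian, primary_target, secondary_velocity):
    """Joint velocities that meet a 1-D primary task exactly and follow the
    secondary request only inside the primary task's null space."""
    j = primary_jacobian
    jj = sum(v * v for v in j)                    # J J^T, a scalar for a single-row task
    pinv = [v / jj for v in j]                    # J^+ = J^T / (J J^T)
    j_dot_secondary = sum(a * b for a, b in zip(j, secondary_velocity))
    # qd = J^+ * target + (I - J^+ J) * qd_secondary
    return [
        pinv[i] * primary_target + secondary_velocity[i] - pinv[i] * j_dot_secondary
        for i in range(len(j))
    ]

com_jacobian = [1.0, 0.5, 0.2]     # hypothetical: how each joint moves the CoM forward
reach_request = [0.0, 0.0, 1.0]    # hypothetical: swing the arm joint
qd = prioritized_velocity(com_jacobian, 0.0, reach_request)
print([round(v, 3) for v in qd])
print(sum(j * v for j, v in zip(com_jacobian, qd)))   # CoM velocity stays at zero
```

The arm still moves almost as requested, while the other two joints lean back slightly so that the CoM does not move at all.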

Implementation is usually done as a constrained optimization (e.g., a quadratic program, QP). A heavily simplified conceptual pseudocode looks like this:

Conceptual whole-body control QP (simplified pseudocode)

Variables: joint accelerations qdd, joint torques tau, contact forces f

Objective: minimize several task errors; constraints are physics and limits

minimize sum_i w_i * || J_i @ qdd + dJ_i @ qd - a_desired_i ||^2

subject to

M @ qdd + h == S.T @ tau + Jc.T @ f # full-body dynamics equation

friction_cone(f) # contact friction cone constraint

tau_min <= tau <= tau_max # torque limits

qdd within joint_limits # joint limits

Here each `J_i` is the Jacobian for a particular task (CoM, foot, hand, etc.), and `w_i` is a priority weight. In practice, a strict hierarchy (hierarchical QP) is sometimes used instead of weights. The important thing is the perspective: "move the whole body by coordinating several objectives within the laws of physics."
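The `friction_cone(f)` line in the pseudocode can be sketched as a simple check on one contact force. The friction coefficient is a hypothetical value. The second function shows the linearized pyramid that is commonly used inside a QP, because a QP needs linear constraints. It is inscribed in the cone and therefore slightly conservative.

```python
import math

def in_friction_cone(force, mu=0.6):
    """True if the foot pushes on the ground and the tangential force stays within mu * normal."""
    fx, fy, fz = force
    return fz > 0.0 and math.hypot(fx, fy) <= mu * fz

def in_friction_pyramid(force, mu=0.6):
    """Linear inner approximation of the cone: |fx|, |fy| <= (mu / sqrt(2)) * fz."""
    fx, fy, fz = force
    bound = mu / math.sqrt(2.0) * fz
    return fz > 0.0 and abs(fx) <= bound and abs(fy) <= bound

print(in_friction_cone((10.0, 0.0, 100.0)))      # gentle push: foot sticks
print(in_friction_cone((70.0, 0.0, 100.0)))      # too much sideways force: foot would slip
print(in_friction_pyramid((50.0, 0.0, 100.0)))   # allowed by the cone, rejected by the pyramid
```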

Balance and Fall Recovery

However well you walk, moments of being pushed, slipping, and missteps will come. Balance recovery strategies generally split into three stages, much like a person's.

| Strategy | Description | Human example |
| --- | --- | --- |
| Ankle strategy | Fine-tune CoM with ankle torque | Holding small sway with the ankles |
| Hip strategy | Bend the torso to shift CoM quickly | Bending at the waist when pushed hard |
| Step strategy | Take a new step to shift the support | Stepping out when shoved strongly |

If the disturbance is small, the ankle handles it; if larger, the hip; if larger still, a step. A concept especially important in the step strategy is the **capture point** (closely related to the divergent component of motion, DCM). Roughly, it is the point where "if I place my foot here now, the center of mass will come to rest above it," and it is computed in real time to decide the next foot placement.
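Under the linear inverted pendulum assumption the capture point has a closed form, x + v * sqrt(z / g). The sketch below computes it and then picks a strategy from where it falls relative to the foot. The threshold logic is an illustrative heuristic of this sketch, not an established rule, and the foot dimensions are hypothetical.

```python
import math

def capture_point(com_pos, com_vel, com_height=0.9, g=9.81):
    """Point to step on so the CoM comes to rest above it (LIPM assumption)."""
    return com_pos + com_vel * math.sqrt(com_height / g)

def choose_strategy(cp, foot_min, foot_max, ankle_margin=0.02):
    """Heuristic: ankle if well inside the foot, hip if near the edge, step if outside."""
    if foot_min + ankle_margin <= cp <= foot_max - ankle_margin:
        return "ankle"
    if foot_min <= cp <= foot_max:
        return "hip"
    return "step"

for push_velocity in (0.1, 0.45, 0.8):   # CoM velocity right after a push, m/s
    cp = capture_point(0.0, push_velocity)
    print(push_velocity, round(cp, 3), choose_strategy(cp, foot_min=-0.05, foot_max=0.15))
```

When the result is "step", the capture point itself is a natural candidate for the next foot placement.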

Even so, there are times you fall. Recently, research on **managing the fall itself** is active. When falling is inevitable, the robot takes a posture that reduces impact and learns a get-up motion to stand again from lying on the floor. Protecting expensive hardware and being able to recover on its own after going down are very important for real-world deployment.

Learning-Based Locomotion: The Rise of RL

Traditional model-based control (ZMP, MPC, WBC) handles physics explicitly, so it is interpretable and stable. But it hits limits when the model is inaccurate or the terrain is unpredictable. This is where **reinforcement-learning (RL)** locomotion drew attention.

The idea is simple. Inside simulation, give the robot "reward for walking well forward, penalty for falling," and through countless trial and error, train a walking policy. The policy is usually a neural network that takes observations (joint angles and velocities, torso posture, commanded velocity, etc.) and outputs joint targets (torques or target angles).

┌──────────────── simulation training loop ─────────────────┐

│ │

│ observation s_t ──▶ [policy network] ──▶ action a_t │

│ ▲ │ │

│ │ ▼ │

│ simulator (thousands of parallel envs) ◀── apply targets │

│ │ │ │

│ └── reward r_t (forward · stable · energy) ◀───────────┘

│ │

│ update the policy over billions of trial-and-error steps │

└────────────────────────────────────────────────────────────┘

The strength of RL locomotion is **robustness to rough terrain and disturbances**. If the policy experiences varied terrain and interference randomly in simulation, it learns recovery motions on its own that were never explicitly programmed. Quadruped and biped policies that walk without falling on stairs, gravel, and slippery floors have been built this way.
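The trial-and-error loop can be illustrated on a toy problem. To be clear about the assumptions: this sketch uses simple random-search hill climbing on a one-parameter policy, not a real RL algorithm such as a policy-gradient method, and the "simulator" is a scalar balance model (xi_next = a * xi + (1 - a) * p) rather than a physics engine. The reward, limits, and names are all hypothetical.

```python
import math
import random

def episode_return(gain, xi0=0.03, steps=200, dt=0.01, com_height=0.9,
                   zmp_limit=0.1, fall_limit=0.1):
    """Roll out the policy p = clip(gain * xi); reward is survival minus sway."""
    a = math.exp(math.sqrt(9.81 / com_height) * dt)
    xi, total = xi0, 0.0
    for _ in range(steps):
        p = max(-zmp_limit, min(zmp_limit, gain * xi))   # policy: observation -> action
        xi = a * xi + (1.0 - a) * p                      # toy simulator step
        if abs(xi) > fall_limit:
            break                                        # "fell": the episode ends early
        total += 1.0 - 100.0 * xi * xi
    return total

def train(iterations=200, noise=0.5, seed=0):
    """Perturb the policy parameter at random and keep changes that earn more reward."""
    rng = random.Random(seed)
    gain, best = 0.0, episode_return(0.0)
    for _ in range(iterations):
        candidate = gain + rng.gauss(0.0, noise)
        score = episode_return(candidate)
        if score > best:
            gain, best = candidate, score
    return gain, best

untrained = episode_return(0.0)
gain, best = train()
print(round(untrained, 1), round(gain, 2), round(best, 1))
```

The untrained policy falls after a few dozen steps, while the trained one survives the whole episode. Nobody told the policy that the ZMP must move past xi to push it back. That rule was found by trial and error, which is the core of the idea above at a vastly smaller scale.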

Reward Design as an Art

The success of RL depends substantially on **reward function design**. To translate the single sentence "reward for walking well" into an actual formula, you must carefully combine several terms. Conceptually, you sum terms like these.

total reward =

+ forward velocity tracking (+ closer to commanded speed)

+ survival (+ if alive without falling)

- energy consumption (- as joint torque grows)

- torso sway (- if posture tilts greatly)

- foot slip (- if the foot slips in contact)

- joint-limit proximity (- if sticking to a limit)

Here the **weight** of each term determines the policy's character. Weighting energy heavily yields a frugal policy; weighting speed heavily yields an aggressive one. Set the weights wrong and the robot may learn unexpected behavior that "games the reward" (reward hacking). For example, if you give only large forward reward, it may learn a strange gait that pitches forward ignoring balance. So reward design is both a science and an art of experience.
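The weighted sum above can be sketched as a function. Every weight, every term shape (for example the exponential velocity-tracking term), and every key name in the observation dictionary is a hypothetical choice for illustration. Real reward functions differ by project.

```python
import math

def locomotion_reward(obs, weights=None):
    """Weighted sum of reward terms; positive terms encourage, negative terms penalize."""
    w = {"velocity": 1.0, "alive": 0.5, "energy": 0.001,
         "tilt": 0.5, "slip": 0.2, "joint_limit": 0.1}
    if weights:
        w.update(weights)
    velocity_error = obs["commanded_speed"] - obs["forward_speed"]
    terms = {
        "velocity": math.exp(-4.0 * velocity_error ** 2),        # 1.0 when tracking perfectly
        "alive": 0.0 if obs["fallen"] else 1.0,
        "energy": -sum(t * t for t in obs["joint_torques"]),
        "tilt": -(obs["roll"] ** 2 + obs["pitch"] ** 2),
        "slip": -obs["foot_slip_speed"],
        "joint_limit": -float(obs["joints_near_limit"]),
    }
    return sum(w[name] * terms[name] for name in terms)

upright = {"commanded_speed": 1.0, "forward_speed": 1.0, "fallen": False,
           "joint_torques": [10.0, 10.0], "roll": 0.0, "pitch": 0.0,
           "foot_slip_speed": 0.0, "joints_near_limit": 0}
pitching = dict(upright, pitch=0.6)

speed_only = {"alive": 0.0, "energy": 0.0, "tilt": 0.0, "slip": 0.0, "joint_limit": 0.0}
print(locomotion_reward(upright), locomotion_reward(pitching))
print(locomotion_reward(upright, speed_only), locomotion_reward(pitching, speed_only))
```

With the default weights the pitching posture scores lower. With only the speed term left, both postures score the same, so nothing discourages the forward-pitching gait described above.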

Curriculum Learning

Give stairs or gravel from the start and the policy learns nothing, just keeps falling. So we use a **curriculum**. At first we give an easy task of walking slowly on flat ground, and once the policy starts to succeed, we gradually raise terrain roughness and command speed. Just as a person learns from first steps, a robot learns well only when difficulty rises in stages.
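A curriculum can be as simple as a difficulty level that moves with the recent success rate. The thresholds and the mapping from level to terrain and speed below are hypothetical.

```python
def update_level(level, success_rate, max_level=10, promote_at=0.8, demote_at=0.4):
    """Raise difficulty when the policy mostly succeeds, lower it when it mostly fails."""
    if success_rate >= promote_at:
        return min(level + 1, max_level)
    if success_rate < demote_at:
        return max(level - 1, 0)
    return level

def task_for_level(level):
    """Map a difficulty level to terrain roughness and commanded speed."""
    return {"step_height_m": 0.02 * level, "command_speed_mps": 0.3 + 0.1 * level}

level = 0
for recent_success in (0.9, 0.85, 0.3, 0.6, 0.95):
    level = update_level(level, recent_success)
print(level, task_for_level(level))
```

Level 0 is slow walking on flat ground. Stairs and higher speeds appear only after the policy has earned them, and a drop in success sends it back to easier terrain.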

Evaluation: How Do You Measure "Doing Well"

To fairly compare "this robot walks well," you need metrics. The commonly used ones are these.

- **Success rate**: Out of so many tries, how many succeeded at a set task (e.g., climbing 10 stairs).

- **Disturbance robustness**: How hard can it be pushed from the side without falling.

- **Cost of transport (CoT)**: The energy spent to travel a unit distance. Lower is more efficient.

- **Speed/terrain range**: How fast, and over how varied terrain, can it operate.

The problem is that the experimental setups for measuring these differ by study, making it hard to directly compare numbers across papers. The absence of a standardized evaluation protocol is a long-standing open problem in this field. Precisely because it is easy to claim "we are the best" from a single demo video, the importance of reproducible, fair evaluation is only growing.
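Two of these metrics have simple definitions that are worth writing down. Cost of transport is the dimensionless ratio E / (m * g * d). The trial data and numbers below are made up for illustration.

```python
def success_rate(outcomes):
    """Fraction of trials that succeeded."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)

def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    """Dimensionless cost of transport: energy used per unit weight per unit distance."""
    return energy_j / (mass_kg * g * distance_m)

print(success_rate([True, True, False, True]))
print(cost_of_transport(energy_j=5886.0, mass_kg=60.0, distance_m=10.0))
```

Because the ratio is dimensionless, it allows robots of different sizes to be compared, provided the energy is measured the same way (for example, at the battery versus at the joints).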

Sim2Real: From Simulation to Reality

The decisive obstacle for RL is the **sim-to-real gap**. Simulator physics are not perfect, real motors have latency, friction, and backlash, and sensors have noise. A policy that was perfect in simulation often falls on hardware.

A key technique for narrowing this is **domain randomization**. During training, physics parameters (mass, friction, motor stiffness, latency, etc.) are shaken randomly so the policy does not overfit to particular values and works over a wide range. Whatever value reality takes, if it falls within the training distribution, the policy treats it as "one of the situations I have already experienced."

simulation (one perfect physics) reality (one uncertain physics)

● ← optimized only here ? ← risk of failure here

after domain randomization:

● ● ● ● ● ● ← trained on many widely distributed physics

└──────────▶ if reality (?) falls inside this distribution, robust
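Domain randomization amounts to drawing a fresh set of physics parameters for each training episode. The parameter names and ranges below are hypothetical. In practice they would be the settings exposed by whatever simulator is in use.

```python
import random

# Hypothetical randomization ranges: (low, high) per physics parameter.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "friction": (0.3, 1.2),
    "motor_stiffness_scale": (0.7, 1.3),
    "latency_s": (0.0, 0.02),
}

def sample_physics(rng, ranges=RANGES):
    """Draw one randomized physics configuration for a training episode."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

def covered(measured, ranges=RANGES):
    """True if measured real-world values fall inside the training distribution."""
    return all(ranges[name][0] <= value <= ranges[name][1] for name, value in measured.items())

rng = random.Random(0)
episodes = [sample_physics(rng) for _ in range(1000)]
print(episodes[0])
print(covered({"friction": 0.6, "latency_s": 0.01}))   # inside the trained range
print(covered({"friction": 0.1}))                      # icy floor, outside the range
```

The last line is the catch: randomization only helps for values inside the ranges, so a floor more slippery than anything sampled is still unfamiliar territory for the policy.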

Beyond this, system identification to calibrate the simulator with measured data, methods that fine-tune with a small amount of data on the real robot, and designs that compose observations mainly from values obtainable directly from the body (proprioception) to reduce sensor dependence are used together.

Integrating Walking and Manipulation: Loco-Manipulation

Walking well is not enough. Real work is "walking somewhere, picking something up, and placing it elsewhere." The problem of handling locomotion and manipulation together is called **loco-manipulation**.

The two interfere with each other. Holding a heavy object in one hand shifts the center of mass and rocks balance; leaning the body to push a door changes the foot's force distribution. So a well-built system solves the arm's manipulation objective and the legs' balance objective together **within a single whole-body controller**.
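The effect of a held object on balance can be estimated with the mass-weighted average that defines the combined center of mass. The masses, positions, and the 0.15 m toe position below are hypothetical.

```python
def combined_com(robot_mass, robot_com_x, payload_mass, payload_x):
    """Horizontal CoM of robot plus payload (mass-weighted average)."""
    return (robot_mass * robot_com_x + payload_mass * payload_x) / (robot_mass + payload_mass)

toe_x = 0.15  # hypothetical front edge of the support region, meters

light = combined_com(robot_mass=60.0, robot_com_x=0.0, payload_mass=10.0, payload_x=0.35)
heavy = combined_com(robot_mass=60.0, robot_com_x=0.0, payload_mass=30.0, payload_x=0.50)
print(round(light, 3), light < toe_x)
print(round(heavy, 3), heavy < toe_x)
```

With the light object the combined CoM shifts forward by 5 cm and stays over the foot. With the heavy one held far out it moves past the toes, so the body must lean back or step, which is exactly why the arm and the legs have to be planned together.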

┌── manipulation task ─┐ ┌── locomotion/balance task ─┐

│ hand position · force │ │ center of mass · footstep │

└──────┬───────────────┘ └───────────┬─────────────────┘

│ │

└──────────────┬──────────────────┘

▼

jointly optimized in WBC

│

▼

"carry while walking" · "push while bracing" as one motion

The Layering of Learned Policies: Toward Behavior Foundation Models

The recent trend is to stack control into layers. At the bottom are fast, robust low-level policies for walking and balance (mostly learned with RL), and above them are slower high-level policies that decide "what to do."

A concept drawing attention at this upper layer is the **behavior foundation model** trend, an attempt to train one large policy on large-scale data collected across varied tasks and embodiments to cover a wide range of behaviors. In robotics in particular, **VLA (Vision-Language-Action)** models that handle vision, language, and action together are advancing rapidly.

- **RT-2** (Google DeepMind, arXiv 2307.15818): an approach that fine-tunes a vision-language model (VLM) on robot data so it outputs actions as discretized tokens.

- **OpenVLA** (arXiv 2406.09246): a 7B-scale open VLA model trained on roughly 970k real robot demonstrations, combining DINOv2 and SigLIP vision encoders with a Llama 2 language model.

- **π0** (Physical Intelligence): a policy that generates continuous high-frequency actions in a flow-matching / diffusion direction.

- **GR00T N1** (NVIDIA): claims a dual structure combining System 1 (diffusion family) for fast reaction with System 2 for planning.

- **Helix** (Figure AI): cited as an example of the generalist VLA trend aimed at humanoids.

When such an upper model unfolds a goal like "pick up the red cup and put it in the drawer" into hand and foot objectives, the lower-layer locomotion and WBC policies physically realize that goal, producing a natural division of labor. Note that this field is changing very fast, so specific performance and structure may differ by announcement and generation.
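The "actions as discretized tokens" idea mentioned for RT-2 can be sketched as uniform binning of each action dimension. The bin count of 256, the normalized range of -1 to 1, and the function names are assumptions of this sketch rather than a description of any model's exact tokenizer.

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value in action:
        clipped = max(low, min(high, value))
        index = int((clipped - low) / (high - low) * bins)
        tokens.append(min(bins - 1, index))        # the top edge belongs to the last bin
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, bins=256):
    """Decode bin indices back to the center value of each bin."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]

action = [-1.0, 0.0, 0.3, 1.0]
tokens = action_to_tokens(action)
decoded = tokens_to_action(tokens)
print(tokens)
print([round(v, 4) for v in decoded])
```

Discretization turns control into next-token prediction, at the cost of a quantization error of at most half a bin width. Approaches that generate continuous actions, such as the flow-matching direction noted for π0, avoid that error in exchange for a different output head.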

Real Humanoid Robots

Sticking to facts, we touch on widely known examples only at the conceptual level. Since details vary greatly by generation and announcement date, we mention only the general direction here.

| Robot | Developer | Known characteristics (conceptual) |
| --- | --- | --- |
| Atlas | Boston Dynamics | Widely known for dynamic whole-body motion and mobility demos |
| Figure | Figure AI | Humanoid aimed at commercial work, cited alongside the VLA trend |
| Unitree humanoid | Unitree Robotics | Known as a relatively accessible bipedal platform |
| Digit | Agility Robotics | Introduced as a humanoid aimed at logistics and warehouse work |

Each company's latest model, exact joint count, speed, and payload are safest to confirm in official materials.

Pitfalls and Limits

- **Over-trusting simulation**: Success in simulation does not guarantee success on hardware. The sim2real gap is still a large wall.

- **Safety**: As long as a heavy robot moves next to people, safety design for falls, collisions, and malfunctions must be a prerequisite.

- **Energy and endurance**: Standing and moving on two legs consumes a lot of energy. Battery endurance is a large practical constraint.

- **The illusion of generalization**: Demo videos are often optimized for specific conditions. Generalizing to unfamiliar environments and objects is a separate, hard problem.

- **Difficulty of evaluation**: Standard metrics to fairly compare "walks well" and "handles well" are still maturing.

- **Hardware reliability**: Making dozens of high-power actuators endure long and repeated use is an engineering problem as hard as the software.

Comparing Control Paradigms

Organizing the approaches seen so far at a glance makes each one's place clear.

| --- | --- | --- | --- |

The important thing is that these approaches are less competitors than occupants of **different layers** of the hierarchy. A well-built system combines several of them: a foundation model on top, MPC/WBC in the middle, and a learned low-level policy at the bottom. Rather than seeking "one silver bullet," the engineering sense of placing the right tool at each layer matters more in practice.

Teleoperation and Data Collection

The bottleneck of learning-based manipulation and locomotion is ultimately **data**. One of the most direct ways to teach a humanoid a new job is for a person to teleoperate the robot and show demonstrations.

human operator ──▶ [teleoperation device]

│ (motion mapping)

▼

humanoid robot ──▶ perform the actual motion

│

▼

(record observation-action pairs as data)

│

▼

learn policy by imitation ──▶ later autonomous execution

Teleoperation comes in several forms. A person's arm motion may be mapped to the robot arm by motion capture, or a VR controller may command the hand's target position. Either way, the goal is to "transfer human intent to the robot's body and leave that process as data." Demonstrations gathered this way become fuel for imitation learning or foundation model training later.

But teleoperation has difficulties. If the human and robot bodies differ (arm length, joint layout), motion does not transfer as-is, and latency degrades the feel of control. Still, being data from a real robot, it fills gaps hard to fill with simulation alone.
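The pipeline of motion mapping followed by recording can be sketched in a few lines. Scaling the hand position by the ratio of arm lengths is only the crudest form of retargeting, chosen here to show the embodiment mismatch. The arm lengths, the observation contents, and the function names are hypothetical.

```python
def retarget(human_hand_xyz, human_arm_length, robot_arm_length):
    """Scale a human hand position (relative to the shoulder) to the robot's reach."""
    scale = robot_arm_length / human_arm_length
    return tuple(scale * c for c in human_hand_xyz)

def record_demo(observations, human_hand_path, human_arm_length=0.70, robot_arm_length=0.56):
    """Pair each robot observation with the retargeted command as one training sample."""
    return [
        {"observation": obs, "action": retarget(hand, human_arm_length, robot_arm_length)}
        for obs, hand in zip(observations, human_hand_path)
    ]

demo = record_demo(
    observations=[{"t": 0.0}, {"t": 0.1}],                      # placeholder sensor readings
    human_hand_path=[(0.70, 0.0, 0.0), (0.35, 0.0, 0.35)],      # operator hand, meters
)
print(demo[0]["action"], demo[1]["action"])
```

A fully stretched human arm maps to a fully stretched robot arm, so the operator never commands a point the robot cannot reach. The recorded observation-action pairs are what an imitation learner would later consume.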

The Road Ahead

Humanoid control sits where several trends converge at once.

- **Fusion of model-based and learning**: Hybrid approaches that combine ZMP/MPC stability with RL robustness are growing. One example is a division of labor with learning at the low level and model-based methods for high-level planning.

- **Deeper integration of manipulation and locomotion**: Attempts to handle loco-manipulation with one policy continue.

- **Top-down penetration of foundation models**: VLA and behavior foundation models increasingly replace the high-level planning layer, and instructing by language is becoming natural.

- **Maturing hardware**: Actuators favorable to force control, durable and light materials, and long-lasting batteries are the keys to practicality.

Where these trends will meet is still open. What is certain is that no single technology completes the humanoid. Physics, learning, hardware, and safety must mature together.

Closing

Humanoid control is the meeting point of physics and learning. Model-based methods such as ZMP, MPC, and whole-body control handle physics explicitly, giving stability and interpretability. RL and sim2real add robustness to rough reality. And above them, the VLA and behavior foundation model trend tries to cover "what to do" broadly.

Walking on two legs and handling objects with two hands within a single body: that integration is the hottest front in this field right now. The road is still long, but the future in which robots walk through human spaces like humans and do work is, frame by frame, moving from demo videos toward real tasks.

References

- RT-2: Vision-Language-Action Models (arXiv): [https://arxiv.org/abs/2307.15818](https://arxiv.org/abs/2307.15818)

- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv): [https://arxiv.org/abs/2406.09246](https://arxiv.org/abs/2406.09246)

- Open X-Embodiment (arXiv): [https://arxiv.org/abs/2310.08864](https://arxiv.org/abs/2310.08864)

- Physical Intelligence (π0): [https://www.physicalintelligence.company/](https://www.physicalintelligence.company/)

- Boston Dynamics Atlas: [https://bostondynamics.com/atlas/](https://bostondynamics.com/atlas/)

- Agility Robotics Digit: [https://www.agilityrobotics.com/](https://www.agilityrobotics.com/)

- Unitree Robotics: [https://www.unitree.com/](https://www.unitree.com/)

- NVIDIA Isaac (robot simulation/learning): [https://developer.nvidia.com/isaac](https://developer.nvidia.com/isaac)