Two Brains for a Humanoid — GR00T N1 and Helix

Introduction: Why Humanoids Are Hard

The vision-language-action (VLA) models we have looked at so far mostly controlled a single-arm manipulator on a tabletop. A humanoid is a challenge of a different order: with two arms and two hands, a head, and legs that must walk while keeping balance, the number of degrees of freedom to control explodes. On top of that, humanoids are meant to work in environments built for humans (door handles, stairs, dishes), so precision and stability are demanded at the same time.

Trying to handle this complexity with a single monolithic neural network runs into a dilemma. To react fast, the model must be small and light; to reason well, it must be large and heavy. Balance control for the legs must update on the order of tens to hundreds of times per second, whereas understanding an instruction such as "take a drink from the fridge and pour it into the cup" does not need anywhere near that update rate.

The proposed solution is the dual-system architecture. Inspired by the analogy of fast intuition (System 1) and slow deliberation (System 2) in human cognition, the robot's brain is split into two parts. This post examines that structure through NVIDIA GR00T N1 and Figure AI Helix. We stick to what is publicly well established; since detailed specs vary by version, we describe the architecture in general terms.

The Dual-System Architecture

Fast Brain and Slow Brain

The core idea of the dual system is a division of labor.

┌────────────────────────────────────────────────────────────┐
│         Humanoid Dual-System Architecture (concept)        │
└────────────────────────────────────────────────────────────┘

   cameras · language instruction
  ┌─────────────────────────┐
  │  System 2 (slow brain)   │   low frequency (e.g., a few Hz)
  │  - vision-language       │
  │  - scene parse · planning│   "what to do"
  │  - semantic reasoning    │
  └───────────┬─────────────┘
              │  latent representation (intent/goal) ──▶
  ┌─────────────────────────┐
  │  System 1 (fast brain)   │   high frequency (e.g., tens-hundreds Hz)
  │  - low-level motor control│
  │  - diffusion/continuous   │   "how to move"
  │  - balance · precise grasp│
  └───────────┬─────────────┘
       joint torque/position commands  ──▶  robot executes

  • System 2 (slow brain): A large vision-language model understands the scene and plans what to do. It takes natural-language instructions and reasons semantically. Being heavy, it runs at a relatively low frequency.
  • System 1 (fast brain): A small, fast policy controls the actual joints. Conditioned on the intent (latent representation) handed down by System 2, it generates smooth, continuous action at high frequency. It handles tasks needing immediate reaction, such as balance maintenance and precise grasping.

The advantage of this split is clear. You secure both cleverness (slow but rich reasoning) and agility (fast but simple reflexes) within one system.

Connecting the Two Systems

The two systems are connected by a latent representation. System 2 conveys the intent "reach toward this cup and grasp it" not as explicit coordinates but as a continuous latent vector, and System 1 produces concrete joint commands conditioned on that vector. This lets System 2 express high-level intent without worrying about the fine control of every moment.

Time scales and data flow of the two systems

  t (time) ─────────────────────────────────────────▶

  System 2:   [plan]            [replan]            [replan]
              │                  │                   │
              ▼ (latent intent)  ▼                   ▼
  System 1:   ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮
              (fast control steps packed densely)

  → the slow plan updates the "steering" of the fast control
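
The timing diagram above can be sketched as a single loop in which the slow planner runs only every few control steps while the fast policy runs on every step. This is a minimal sketch under stated assumptions: `LatentIntent`, `slow_planner`, `fast_policy`, the 100 Hz and 5 Hz rates, and the one-dimensional "joint" are all hypothetical stand-ins, not the interfaces of GR00T N1 or Helix.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LatentIntent:
    """Hypothetical stand-in for the latent vector System 2 hands to System 1."""
    target: float          # toy 1-D "intent": a desired joint position
    created_at_step: int   # control step at which the plan was produced


def slow_planner(step: int, steps_per_plan: int) -> LatentIntent:
    """Placeholder for a large vision-language model (assumption).
    Each replan simply moves the goal 0.5 further along."""
    plan_index = step // steps_per_plan
    return LatentIntent(target=0.5 * (plan_index + 1), created_at_step=step)


def fast_policy(joint_position: float, intent: LatentIntent, gain: float = 0.2) -> float:
    """Placeholder for the small fast policy: one proportional step toward the intent."""
    return joint_position + gain * (intent.target - joint_position)


def run_dual_rate_loop(control_hz: int = 100, plan_hz: int = 5,
                       total_steps: int = 60) -> Tuple[List[int], float]:
    steps_per_plan = control_hz // plan_hz  # fast steps executed per slow plan
    joint_position = 0.0
    intent = slow_planner(0, steps_per_plan)
    plan_steps = [0]
    for step in range(total_steps):
        if step > 0 and step % steps_per_plan == 0:
            intent = slow_planner(step, steps_per_plan)  # low-frequency replan
            plan_steps.append(step)
        joint_position = fast_policy(joint_position, intent)  # every control step
    return plan_steps, joint_position


plan_steps, final_position = run_dual_rate_loop()
```

With these illustrative rates, each plan is followed by 20 fast control steps before the next replan. A real system would more likely run the two parts asynchronously on separate threads or devices; the single loop here only illustrates the ratio of update rates.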

NVIDIA GR00T N1

GR00T N1 is an open model NVIDIA put forward as a foundation model for humanoids. Broadly, it follows the dual-system philosophy described above: it combines a slow module responsible for vision-language understanding with a fast, diffusion-based module that generates smooth low-level actions.

┌────────────────────────────────────────────────────────────┐
│                  GR00T N1 (conceptual layout)               │
└────────────────────────────────────────────────────────────┘

   multi-camera · language instruction
  ┌──────────────────────────┐
  │  vision-language (slow)   │  scene understanding, instruction parse
  └────────────┬─────────────┘
               │  latent context
  ┌──────────────────────────┐
  │  diffusion action (fast)  │  generate continuous action by denoising
  └────────────┬─────────────┘
        humanoid joint control

An important point in GR00T N1's training is data diversity. Since real humanoid demonstrations alone are far too scarce, the strategy is to use data from multiple sources together.

┌────────────────────────────────────────────────────────────┐
│        Training that combines diverse data sources         │
└────────────────────────────────────────────────────────────┘

  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌──────────┐
  │ real robot  │  │ simulation  │  │ human video │  │ web VL   │
  │ demos       │  │ (synthetic) │  │ (observed)  │  │ data     │
  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └────┬─────┘
        └───────────────┴────┬─────────┴──────────────┘
                  ┌────────────────────┐
                  │  joint training     │
                  │ (co-fine-tuning)    │
                  └─────────┬──────────┘
                  humanoid policy (GR00T N1)

  • Real robot demos: Most accurate but most expensive.
  • Simulation: Cheaply generates large amounts of synthetic trajectories. The challenge is reducing the gap with reality (sim-to-real gap).
  • Human motion video: Observes human hand/body movement to gain rich behavioral priors.
  • Web vision-language data: Provides common sense for semantic generalization.

Training on such differently sourced data together (co-fine-tuning) compensates for the small volume of expensive real demos while boosting generalization. One reason NVIDIA offered GR00T openly is to provide a common foundation for humanoid research and grow the ecosystem.
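
One simple way to combine sources during training is to draw each training example from a weighted mixture. The sketch below assumes this approach purely for illustration: the source names mirror the list above, but the weights are hypothetical (the post gives no actual ratios) and `sample_batch_sources` is not part of any GR00T tooling.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical mixture weights, chosen only for illustration.
MIXTURE_WEIGHTS: Dict[str, float] = {
    "real_robot_demos": 0.4,
    "simulation": 0.3,
    "human_video": 0.2,
    "web_vision_language": 0.1,
}


def sample_batch_sources(batch_size: int, weights: Dict[str, float],
                         seed: int = 0) -> List[str]:
    """Pick the data source for each example in a batch, proportional to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


batch = sample_batch_sources(1000, MIXTURE_WEIGHTS)
counts = Counter(batch)  # how many examples each source contributed
```

Sampling by weight rather than by raw dataset size lets a small but accurate source, such as real robot demos, appear more often than its volume alone would allow.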

Figure AI Helix

Helix is a generalist VLA that Figure AI put forward for humanoids. It likewise takes a dual-system approach that combines fast control with slow reasoning. Helix emphasizes broad generalization of humanoid upper-body control, and acting on natural-language instructions even for objects and instructions not seen during training.

┌────────────────────────────────────────────────────────────┐
│                  Helix (conceptual flow)                    │
└────────────────────────────────────────────────────────────┘

   onboard cameras + voice/text instruction
  ┌──────────────────────────┐
  │  slow system (understand) │  parse scene/instruction semantically
  └────────────┬─────────────┘
               │  latent intent
  ┌──────────────────────────┐
  │  fast system (motor)      │  high-frequency continuous control of upper body/hands
  └────────────┬─────────────┘
        humanoid upper-body and two-hand motion

The significance of the Helix-family approach:

  • Natural-language generalization: Aims at a direction where a person can instruct verbally and the robot attempts even new tasks.
  • Onboard operation: Targets inference using resources mounted on the robot, aiming at autonomous operation in real home/work environments.
  • Precise upper-body control: Tries to handle tasks needing two-hand coordination (carrying an object together, tidying up).

Detailed specs such as exact model size, control frequency, and evaluation numbers vary by source and version, so here we mainly explain the structural ideas in general terms.

Comparing GR00T N1 and Helix

| Item | GR00T N1 (NVIDIA) | Helix (Figure AI) |
|---|---|---|
| Character | Humanoid foundation model (open-oriented) | Generalist VLA for humanoid products |
| Shared structure | Dual system (slow understand + fast control) | Dual system (slow understand + fast control) |
| Fast brain | Diffusion-based continuous action | High-frequency continuous control |
| Training data | Real + sim + human video + web VL combined | Generalization training centered on robot demos |
| Emphasis | Common foundation, reproducibility | Natural-language generalization, onboard autonomy |

Both approaches share the same core: separating heavy semantic understanding from light, fast control, while stitching the two together smoothly with a latent representation. Details in the table can vary by source and version.

Deeper: The Handoff Between the Two Systems

Conveying Intent via a Latent Representation

The beauty of the dual system lies in how the slow brain conveys "what it wants" to the fast brain. It could convey explicit coordinates (e.g., "move the hand to (0.3, 0.1, 0.5)"), but a more flexible approach is to express intent as a continuous latent vector. The fast brain takes this vector as a condition and produces concrete joint commands.

┌──────────────────────────────────────────────────────────┐
│            System 2 → System 1 Handoff (concept)         │
└──────────────────────────────────────────────────────────┘

  System 2 (slow brain)
     │  scene understanding + task planning
  ┌──────────────────┐
  │ latent intent z   │   "approach this cup smoothly and grasp it"
  └────────┬─────────┘     (abstract intent, not coordinates)
           │  updated periodically (low frequency)
  System 1 (fast brain)
     │  generate continuous action conditioned on z (high frequency)
  ┌──────────────────┐
  │ joint torque/pos  │  ──▶  robot executes
  └──────────────────┘

The advantage of this approach is that the fast brain can decide "how" to realize the intent on its own. If an obstacle suddenly appears or an object slips, the fast brain immediately adjusts the path while keeping the same intent. The slow brain need not dictate every detail at every moment.

Aligning Time Scales

Since the two systems run at different frequencies, aligning their time scales matters.

Frequency alignment (conceptual numbers; actual varies by implementation)

  System 2:  a few Hz            (planning / replanning)
  System 1:  tens to hundreds Hz (low-level control)

  → during one slow plan, fast control executes many times
  → if the environment changes fast, the fast brain reacts first,
     and the slow brain updates the big picture at the next plan

Exact control frequencies can vary by hardware, model size, and task nature. The key is the division of labor: "the fast brain for fast reaction, the slow brain for big decisions."

Integrating Locomotion and Manipulation

A Difficulty Unique to Humanoids

Unlike a single-arm manipulator, a humanoid must keep balance while simultaneously working with its hands. Leg control (locomotion/balance) and arm control (manipulation) influence each other. For example, lifting a heavy object with one hand shifts the center of mass, so balance control must react immediately.

┌──────────────────────────────────────────────────────────┐
│          Locomotion-Manipulation Interaction (concept)   │
└──────────────────────────────────────────────────────────┘

   upper body (arms/hands)   lower body (legs/balance)
   manipulation task         support · locomotion
       │                        │
       └────── CoM shift ────────┘
        ┌──────────────────┐
        │  whole-body coord. │  reaching the arm makes legs correct balance
        └──────────────────┘

  → the fast brain (System 1) must consider the whole body together for stability
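
The center-of-mass shift can be made concrete with a one-dimensional static calculation: the combined center of mass is the mass-weighted average of positions, and a standing robot is statically stable only while that point stays over its feet. The numbers below (a 60 kg robot, a foot spanning -0.05 m to 0.15 m, and the two loads) are hypothetical, and the sketch ignores dynamics entirely.

```python
from typing import List, Tuple

PointMass = Tuple[float, float]  # (mass in kg, forward position x in m)


def center_of_mass_x(parts: List[PointMass]) -> float:
    """Mass-weighted average position along the forward axis."""
    total_mass = sum(mass for mass, _ in parts)
    return sum(mass * x for mass, x in parts) / total_mass


def is_statically_stable(com_x: float, heel_x: float, toe_x: float) -> bool:
    """Static criterion: the center of mass must project inside the foot support."""
    return heel_x <= com_x <= toe_x


ROBOT: List[PointMass] = [(60.0, 0.0)]  # hypothetical robot, CoM directly over the ankle
HEEL_X, TOE_X = -0.05, 0.15             # hypothetical support range in metres

light_load = center_of_mass_x(ROBOT + [(5.0, 0.5)])   # 5 kg held 0.5 m forward
heavy_load = center_of_mass_x(ROBOT + [(20.0, 0.7)])  # 20 kg held 0.7 m forward

light_ok = is_statically_stable(light_load, HEEL_X, TOE_X)
heavy_ok = is_statically_stable(heavy_load, HEEL_X, TOE_X)
```

Under these assumptions the light load moves the center of mass forward by about 3.8 cm, which stays over the foot, while the heavy load moves it to 17.5 cm, past the toe. In the second case the lower body has to compensate, for example by leaning back or stepping, which is why arm and leg control cannot be treated as independent.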

Many humanoid systems still handle locomotion and manipulation separately, but the field is increasingly moving toward controlling the whole body together. In a dual system, it is natural for the fast brain to handle such whole-body coordination at high frequency.

Simulation and Sim-to-Real

Why Simulation Is Needed

Real humanoid demonstrations are difficult and sometimes dangerous to collect, so large amounts of synthetic data are generated in simulation to supplement training. But simulation differs from the real world in physics, sensors, and appearance (the sim-to-real gap), so a policy that worked well in simulation can fail in reality.

Domain Randomization

A representative technique to shrink this gap is domain randomization. Training in simulation while randomly varying lighting, textures, and physical parameters (friction, mass, etc.) keeps the model from overfitting to specific conditions and makes it robust to diverse variations.

┌──────────────────────────────────────────────────────────┐
│            Domain Randomization (concept)                │
└──────────────────────────────────────────────────────────┘

  randomly vary for each training episode in simulation:
    - lighting · color · texture
    - friction · mass · inertia
    - camera pose · noise
  ┌──────────────────┐
  │  policy robust to  │   does not overfit to specific conditions
  │  diverse variation │   ──▶  easier transfer to the real environment
  └──────────────────┘
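
In code, domain randomization usually amounts to sampling a fresh set of simulator parameters before each episode. The following is a minimal sketch, assuming uniform sampling. The `SimParams` fields and the ranges in `PARAM_RANGES` are hypothetical and not tied to any particular simulator.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SimParams:
    friction: float
    mass_scale: float
    light_intensity: float
    camera_yaw_deg: float
    sensor_noise_std: float


# Hypothetical ranges; real ones would be tuned against the target hardware.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "friction": (0.4, 1.2),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.5, 1.5),
    "camera_yaw_deg": (-5.0, 5.0),
    "sensor_noise_std": (0.0, 0.02),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized configuration, uniformly within each range."""
    return SimParams(**{name: rng.uniform(lo, hi)
                        for name, (lo, hi) in PARAM_RANGES.items()})


def randomized_episodes(num_episodes: int, seed: int = 0) -> List[SimParams]:
    """One independently sampled configuration per training episode."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(num_episodes)]


episodes = randomized_episodes(100)
```

Each entry in `episodes` would configure the simulator for one rollout, so the policy never sees the same lighting, friction, and camera pose twice.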

Domain randomization is usually combined with other methods: calibrating the simulator against real data, training on real and simulated data together, or additionally using human motion video. The multi-source data combination GR00T emphasized is part of this larger trend.

Challenges

Humanoid VLA is appealing, but many challenges remain.

  • Safety: A heavy robot moves in the same space as people, so collision/fall risks must be strictly managed. Failures in out-of-distribution situations can cause physical harm.
  • Latency: If the slow brain's plan is too slow, the fast brain ends up acting on stale intent. The time scales of the two systems must be aligned well.
  • Data scarcity: Real humanoid demonstrations are very hard to collect. The usual remedy is to supplement them with simulation and human video while reducing the sim-to-real gap.
  • Balancing generalization and reliability: A balance is needed between the ability to attempt new tasks and the ability to perform known tasks stably.
  • Difficulty of evaluation: Humanoids have large environment/hardware differences, making result reproduction and fair comparison hard.

Latency problem in a dual system (concept)

  When System 2's planning cycle is too slow:

   t ───────────────────────────────────▶
   plan (stale) ─────────────── next plan
   System 1:  ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮  ← controls on stale intent meanwhile
                                    drifts when the environment changes

  → must co-design plan frequency, intent update, and safety guards
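
One possible safety guard against stale intent is a timestamp check on the fast side: if the latest plan is older than some budget, the fast controller stops tracking it and falls back to a conservative behavior. This is a minimal sketch under stated assumptions. The `Intent` type, the mode names, and the 0.5 s budget are hypothetical and do not describe how GR00T N1 or Helix handle latency.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Intent:
    latent: List[float]  # hypothetical latent vector from the slow planner
    timestamp_s: float   # time at which the plan was produced


def select_control_mode(intent: Optional[Intent], now_s: float,
                        max_age_s: float = 0.5) -> str:
    """Return "track_intent" while the plan is fresh, otherwise "safe_hold"."""
    if intent is None:
        return "safe_hold"  # no plan received yet
    age_s = now_s - intent.timestamp_s
    if age_s > max_age_s:
        return "safe_hold"  # plan is stale: stop acting on it
    return "track_intent"


fresh = select_control_mode(Intent([0.1, 0.2], timestamp_s=1.0), now_s=1.2)
stale = select_control_mode(Intent([0.1, 0.2], timestamp_s=1.0), now_s=2.0)
```

The age budget ties the two time scales together: it has to be longer than the normal planning period, so that routine replanning never triggers the guard, yet short enough that the robot does not keep acting on an outdated picture of the scene.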

Why Humanoids, Why Now

The Appeal of a Generalized Form

One reason humanoids draw attention is that their form "fits human environments as is." Our homes, factories, and offices are designed around human hands, height, and gait. A human-shaped robot has the potential to open doors, climb stairs, and handle tools people use, without changing the environment.

The logic of a generalized form (concept)

  special-purpose robot:  a different machine per task
                          + optimal for a single task
                          - new hardware per new task

  humanoid:               many tasks with one form
                          + reuse human environments/tools
                          - high control difficulty (DoF, balance)

  → the point where "hardware generalization" meets "software generalization (VLA)"

Of course, this is both an opportunity and a difficulty. The human form is versatile, but it carries many degrees of freedom to control and the added challenge of balance. That is why progress in dual systems and VLA is seen as a way to address this difficulty.

Foundation Models and the Data Flywheel

As the language and vision fields did, robotics anticipates a virtuous cycle (data flywheel): more data trains better policies, and better policies lead to more deployments that gather data again.

Data flywheel (concept)

   more data ──▶ better policy
        ▲              │
        │              ▼
   more deploy  ◀── wider task coverage
        (collect new data in the field)

  * Safety and reliability must be met for this cycle to actually turn.

Offering GR00T openly or sharing a pooled dataset (Open X-Embodiment) can be seen as an attempt to turn this flywheel at the community level. But for the cycle to actually turn, the premises of safety and reliability must be met.

What Tasks Are Targeted

The tasks humanoid VLA targets are generally the labor-intensive ones in people's daily and work environments.

Representative target task areas (concept)

  ┌─ tidying/transport: move, sort, and arrange objects
  ├─ two-hand coordination: tasks needing both hands together
  ├─ soft manipulation: handling cloth, cords, flexible objects
  └─ tool use: use the tools people use, as is

  → the more "diverse and variable" and hard-to-formalize a task is,
     the greater the value of VLA generalization and language instruction

Traditional industrial robots excel at highly formalized, repetitive tasks. The area humanoid VLA targets, by contrast, is hard-to-formalize tasks where objects, layouts, and instructions change each time. The greater the variability of such tasks, the more semantic generalization and natural-language instruction stand out. Of course, until reliability and safety are sufficiently secured, it is realistic to gradually widen the scope of application under human supervision.

Revisiting the Cognitive Analogy

The dual system started from the analogy of human cognition's "fast intuition (System 1) and slow deliberation (System 2)." But this analogy is only a starting point for inspiration, not an accurate model of the human brain.

The gap between analogy and engineering (concept)

  cognitive analogy:  fast intuition  ↔  slow deliberation
                          │                  │
  engineering:        high-freq control  ↔  low-freq planning
                      (small policy)        (large VLM)

  → the analogy only gives the intuition of "division of labor"
     the actual design is decided by latency, frequency, safety, and data

What matters in engineering is not the fidelity of the analogy but how to efficiently divide fast reaction and slow reasoning and stitch them smoothly. Design choices like the boundary between the two systems, the form of the latent representation, the update cycle, and the safety layer govern actual performance.

Outlook

The dual-system architecture is a design that fits the essential demand of a humanoid — "fast reflexes and slow thinking at once." It can secure the high-frequency control needed for balance and precise manipulation without losing the semantic understanding of a large vision-language model. The stream that offers a common foundation openly (like NVIDIA GR00T N1) and the stream that targets autonomous operation in real products (like Figure AI Helix) are advancing together.

The direction ahead seems clear: joint training that combines more diverse data (sim, human video, web), techniques that shrink the sim-to-real gap, guardrails that guarantee safety, and designs that smoothly stitch the two systems' time scales. Many problems remain until humanoids broadly generalize in human environments, but the idea of "two brains" will be a solid foundation for that road.

Finally, here is the big picture running through all three posts, summarized in one figure.

┌──────────────────────────────────────────────────────────┐
│         The big picture of robot VLA (three posts)       │
└──────────────────────────────────────────────────────────┘

  Post 1: VLM as policy → discrete action tokens (RT-2, OpenVLA)
        │  (semantic generalization, but discretization/frequency limits)
  Post 2: continuous action gen → smoothness/high-freq (Diffusion Policy, π0)
        │  (multimodality/smoothness, fast control via flow-matching)
  Post 3: extend to humanoids → dual system (GR00T N1, Helix)
        (slow understanding + fast control, whole-body coord., sim-to-real)

  Shared foundation: diverse data (Open X-Embodiment), efficient adaptation (LoRA),
                     safety guardrails, and the integration of perception and action

This trajectory shows a consistent direction: "bridging perception and action into a single learning system." Each stage supplements the limits of the previous one, advancing toward more general, smoother, and more complex embodiments. If the premises of safety and reliability are faithfully upheld, a future where robots understand human speech and work broadly in human environments will come one step closer.
