Modern Reinforcement Learning Ecosystems 2026 Deep Dive - RLlib (Anyscale) · Stable-Baselines3 · Tianshou · CleanRL · OpenSpiel (DeepMind) · Gymnasium (Farama) · Acme · PufferLib · Pearl (Meta) · TorchRL

Intro — May 2026, RL Has Entered Its Second Golden Age

If 2018-2020 (AlphaStar, OpenAI Five, MuZero) was RL's first golden age, 2024-2026 is the second. Three triggers stand out. First, DeepSeek-R1's use of GRPO and OpenAI's o1/o3 line of test-time compute put RL back at the center of LLM reasoning. Second, NVIDIA GR00T, Isaac Lab, and Cosmos (covered in iter69) pushed robotics simulation and sim-to-real into commercial deployment. Third, Wayve GAIA-2 (iter97) and Tesla FSD v13 brought RL-based evaluation and policy learning back into autonomous driving.

This article is not a marketing catalog. We survey the libraries, environments, algorithms, and benchmarks that are actually used in RL production and research as of May 2026. Code snippets target the library APIs current at the time of writing; several of these libraries change quickly, so pin versions before copying them. We compare RLlib, Stable-Baselines3, Tianshou, CleanRL, TorchRL, OpenSpiel, Acme, Pearl, PufferLib, Gymnasium, PettingZoo, MuJoCo, Brax, and Isaac Lab in one place.

The 2026 RL Landscape — Four Axes

Start with the big picture. The 2026 RL ecosystem decomposes along four axes.

  1. Algorithm library: policy/value networks, trainers, replay buffers.
  2. Environment API: the abstraction that exposes state, action, reward to libraries.
  3. Simulator: physics, games, robots, cities, autonomous driving — the domain simulation.
  4. Distributed runtime: actor-learner topology, actor pools, replay sharding.

In 2018, a single library tried to do all four (own baselines, own Gym envs, own Atari wrappers, own distributed runtime). In 2026 these axes are cleanly separated. The typical stack is Gymnasium (environment API) + PufferLib (env compatibility shim) + RLlib/SB3/Tianshou (algorithms) + Ray/Slurm (distributed).

General-Purpose RL Library Market — Two Leaders Plus Four Challengers

The algorithm library market in May 2026 has two clear leaders.

  • RLlib (Anyscale, on Ray): number one in distributed training and production adoption. PPO, IMPALA, APPO, DQN, SAC, plus MARL. Tight HPO integration with Ray Tune.
  • Stable-Baselines3 (SB3): the de facto research baseline. PyTorch-based, readability first, ideal for single-machine training.

The challenger pack has matured.

  • TorchRL (Meta PyTorch team): modular PyTorch-native RL since 2023. The TensorDict abstraction unifies multi-agent, offline, and online RL behind a single API.
  • Tianshou (Tsinghua University): fast performance plus modular design. CN/EN docs; often cited for training stability.
  • CleanRL: one file per algorithm. Dominant for research reproducibility and education. W&B tracking is built in (opt-in via --track).
  • JAX line: JAXRL, RejaxRL, DeepMind Acme + Haiku/RLax. Combined with compiled envs (Brax etc.), throughput is unbeatable.

Each has its place. Production → RLlib. Paper baselines → SB3. Fast experimentation / paper reproduction → CleanRL. PyTorch-native multi-agent → TorchRL. Training stability + performance → Tianshou. TPU and JAX → Acme.

RLlib — Industrial RL on Ray

RLlib is the RL submodule of Ray, maintained by Anyscale. Its biggest strength in May 2026 is that distributed training actually works. The same API scales from a lightweight single machine to 1000+ actors.

A typical RLlib script:

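import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init()

# Written against the Ray 2.x "new API stack". Older releases use
# .rollouts(num_rollout_workers=...) and .resources(num_gpus=...) instead.
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .framework("torch")
    .training(gamma=0.99, lr=3e-4, train_batch_size=4000)
    .env_runners(num_env_runners=8)      # parallel rollout workers
    .learners(num_gpus_per_learner=1)    # set to 0 on a CPU-only machine
)

algo = config.build()
for i in range(100):
    result = algo.train()
    # The new stack nests rollout metrics under "env_runners";
    # older releases expose result["episode_reward_mean"] at the top level.
    mean_return = result["env_runners"]["episode_return_mean"]
    print(f"iter={i} reward={mean_return:.2f}")

algo.save(checkpoint_dir="/tmp/ppo_cartpole")
ray.shutdown()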

RLlib has the deepest algorithm catalog: PPO, IMPALA, APPO, DQN, Rainbow, SAC, DDPG, TD3, MARWIL, BC, CQL, MARL via PettingZoo wrappers, and RLlib Offline for RLHF. The downside is deep abstractions — the initial learning curve is steep.

Stable-Baselines3 — The De Facto Research Baseline

SB3 is a PyTorch-based RL library maintained by the DLR-RM team (the robotics institute of the German Aerospace Center). It leads on usability and readability, and it is one of the most common sources of baseline numbers in new RL papers.

A complete SB3 PPO training run is four lines:

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v3")  # v2 was retired in Gymnasium 1.0; requires gymnasium[box2d]
model = PPO("MlpPolicy", env, verbose=1, learning_rate=3e-4, n_steps=2048)
model.learn(total_timesteps=1_000_000)
model.save("ppo_lunar")

SB3 officially ships:

  • On-policy: PPO, A2C, TRPO (Contrib).
  • Off-policy: DQN, DDPG, TD3, SAC, HER (goal-conditioned).
  • Imitation / offline: BC, GAIL, AIRL live in the separate Imitation package.

SB3 is the standard recommendation for single-machine, medium-scale training. For distributed training, RLlib or TorchRL is the better fit.

Tianshou — Fast, Modular, Known for Training Stability

Tianshou is a PyTorch-based RL library started by the RL group at Tsinghua University. It has grown rapidly since 2020, and citations as a baseline at NeurIPS/ICLR have climbed sharply over 2024-2026. Its strengths are fast convergence and stable hyperparameters.

Tianshou's core abstractions split into Collector, Policy, and Trainer.

import gymnasium as gym
import tianshou as ts
import torch
from tianshou.utils.net.common import ActorCritic, Net
from tianshou.utils.net.discrete import Actor, Critic

# Written against the Tianshou 0.5-style API; the PPOPolicy constructor
# differs in newer releases, so pin the version you install.
env = gym.make("CartPole-v1")
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.n

# Actor and critic share one feature extractor.
net = Net(state_shape, hidden_sizes=[64, 64])
actor = Actor(net, action_shape)
critic = Critic(net)
# ActorCritic yields each shared parameter once; torch optimizers reject a Python set.
optim = torch.optim.Adam(ActorCritic(actor, critic).parameters(), lr=3e-4)

policy = ts.policy.PPOPolicy(actor, critic, optim, dist_fn=torch.distributions.Categorical)
train_envs = ts.env.DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
buf = ts.data.VectorReplayBuffer(20000, 8)  # total size, number of parallel envs
collector = ts.data.Collector(policy, train_envs, buf)
collector.collect(n_step=4000)

Tianshou is often cited first for algorithmic correctness. The downside compared to SB3 is shorter documentation.

CleanRL — One File Per Algorithm, Peak Research Reproducibility

CleanRL was started by Costa Huang (then a PhD student at Drexel University, later at Hugging Face) and implements each algorithm in a single file. The PPO implementation lives entirely in ppo.py with almost no abstraction layer, which is what makes it so easy to read and reproduce.

As of May 2026, CleanRL provides these algorithms as single-file scripts:

  • Online: PPO (11 variants — Atari, MuJoCo, Procgen, multi-agent, LSTM, continuous/discrete), DQN, C51, SAC, TD3, DDPG, A2C.
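  • Offline: CQL, IQL, AWAC, DT (Decision Transformer). Single-file versions of these are best known from CORL, a separate project written in the CleanRL style, so check which repository a given script comes from.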
  • Research: PPG, PPL, RLHF variants, single-file GRPO.

W&B integration is built in: passing the --track flag logs metrics to the cloud, and without it the scripts log locally to TensorBoard. It is one of the first repos researchers clone when reproducing a paper or building a baseline.

TorchRL — Modern RL with PyTorch as a First-Class Citizen

TorchRL is an RL library built directly by Meta's PyTorch team. It stabilized in 2023 and accelerated through 2024-2026. PyTorch tensors and the TensorDict abstraction are first-class citizens, which makes it natural for PyTorch developers.

The core abstractions:

  • TensorDict: a single container for every piece of data (obs, action, reward, mask, hidden state).
  • Environment Transforms: torchvision-Transform-style environment wrappers.
  • Replay Buffer: single, prioritized, sequence, and offline buffers behind one API.
  • Loss Modules: standalone losses for PPO, DQN, SAC, DDPG, IQL, CQL, and more.

A short TorchRL snippet:

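import torch
from tensordict.nn import TensorDictModule
from torchrl.envs import GymEnv, ObservationNorm, TransformedEnv
from torchrl.modules import MLP, OneHotCategorical, ProbabilisticActor, ValueOperator
from torchrl.objectives import ClipPPOLoss

obs_norm = ObservationNorm(in_keys=["observation"])
env = TransformedEnv(GymEnv("CartPole-v1"), obs_norm)
obs_norm.init_stats(num_iter=1000)  # estimate loc/scale from random rollouts before first use
obs_dim = env.observation_spec["observation"].shape[-1]
n_actions = env.action_spec.shape[-1]  # GymEnv one-hot encodes discrete actions by default

# The MLP writes logits into the TensorDict; ProbabilisticActor samples the action from them.
logits_net = TensorDictModule(MLP(in_features=obs_dim, out_features=n_actions, num_cells=[64, 64]), in_keys=["observation"], out_keys=["logits"])
actor = ProbabilisticActor(module=logits_net, in_keys=["logits"], out_keys=["action"], distribution_class=OneHotCategorical, return_log_prob=True)
critic = ValueOperator(MLP(in_features=obs_dim, out_features=1, num_cells=[64, 64]), in_keys=["observation"])

loss_module = ClipPPOLoss(actor_network=actor, critic_network=critic, clip_epsilon=0.2, entropy_bonus=True)
optim = torch.optim.Adam(loss_module.parameters(), lr=3e-4)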

The strength of TorchRL is multi-agent, offline, and meta-RL under one API. The downsides are deep abstractions and a still-evolving surface.

PFRL — Preferred Networks' Japanese PyTorch RL

PFRL is the PyTorch RL library from Preferred Networks (PFN, Tokyo). Its predecessor was the Chainer-based ChainerRL. As of May 2026 it backs many baseline runs at ICML and NeurIPS by Japanese teams.

PFRL's strengths are algorithmic breadth and reliable reproductions. DQN-family variants like Rainbow, IQN, R2D2, and NoisyNet are well covered, and Atari 50M training is validated end to end. The trade-off is that English-language documentation is shallower than SB3's.

TF-Agents — Google's TensorFlow RL Library

TF-Agents is a TensorFlow RL library built by Google. It remains active in May 2026 but has lost share to the PyTorch ecosystem. Inside Google and on TPUs, however, it is still the first choice. Several follow-up projects to AlphaGo / AlphaStar, plus the RL components inside Vertex AI Pipelines, sit on TF-Agents.

OpenSpiel — DeepMind's Game-Theoretic Multi-Agent Bundle

OpenSpiel is DeepMind's standard toolkit for game theory and multi-agent RL. It bundles 60+ games (chess, Go, poker, hex, Liar's Dice, Goofspiel, Hanabi, Catch the Cat, and more) with equilibrium-learning algorithms like PSRO, CFR, NFSP, and MMD.

import pyspiel

game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()
while not state.is_terminal():
    legal_actions = state.legal_actions()
    action = legal_actions[0]
    state.apply_action(action)
print(state.returns())

OpenSpiel is one of the canonical environments for multi-agent RL research. It covers card games (Hanabi), board games (chess, Go, backgammon), auctions (first-price sealed-bid auction), and Liar's Dice in a single API.
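The equilibrium-learning algorithms mentioned above, CFR in particular, are built on regret matching. The following is a minimal standard-library sketch of regret matching against a fixed opponent in rock-paper-scissors. It does not use the OpenSpiel API; the payoff table, the function names, and the opponent's mixed strategy are illustrative assumptions.

```python
# Regret matching against a fixed opponent in rock-paper-scissors.
# This is a toy illustration, not OpenSpiel code.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[a][b]: payoff to the player choosing a against an opponent choosing b.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def strategy_from_regrets(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total == 0.0:
        # No action has positive regret yet: fall back to uniform play.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def regret_matching(opponent, iterations=100):
    """Run regret matching using exact expected payoffs against `opponent`."""
    cum_regret = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strategy = strategy_from_regrets(cum_regret)
        # Expected payoff of each pure action against the opponent's mix.
        action_values = [
            sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)
        ]
        expected = sum(strategy[a] * action_values[a] for a in range(3))
        # Regret: how much better each action would have done than the current mix.
        for a in range(3):
            cum_regret[a] += action_values[a] - expected
    return strategy_from_regrets(cum_regret)


# Hypothetical opponent that over-plays rock; the best response is paper.
final_strategy = regret_matching(opponent=[0.6, 0.2, 0.2])
print(dict(zip(ACTIONS, final_strategy)))
```

CFR applies the same update at every information set of an extensive-form game, weighted by counterfactual reach probabilities; this sketch covers only the single-decision case.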

Acme — DeepMind's Modular Researcher-Facing Framework

Acme is DeepMind's internal research framework, released to the public. The core abstraction is a clean separation of Actor + Learner + Replay. It supports both JAX and TF.

Acme's strength is expressing distributed topologies cleanly (R2D2, IMPALA, Ape-X). Combined with DeepMind's Reverb replay service, scaling out to thousands of actors becomes a normal API call.
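To make the Actor + Learner + Replay split concrete, here is a minimal single-process sketch in plain Python. It is not Acme's API: the class names, the toy two-armed bandit, and the hyperparameters are assumptions chosen only to show which component owns which responsibility.

```python
import random
from collections import deque


class Replay:
    """Bounded FIFO buffer: actors write to it, the learner reads from it."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))


class Learner:
    """Owns the parameters: one value estimate per action of a toy bandit."""

    def __init__(self, n_actions, lr=0.1):
        self.q = [0.0] * n_actions
        self.lr = lr

    def step(self, batch):
        for action, reward in batch:
            self.q[action] += self.lr * (reward - self.q[action])

    def get_params(self):
        return list(self.q)


class Actor:
    """Acts epsilon-greedily with a local copy of the learner's parameters."""

    def __init__(self, replay, n_actions, epsilon, rng):
        self.replay = replay
        self.params = [0.0] * n_actions
        self.epsilon = epsilon
        self.rng = rng

    def update(self, params):
        self.params = params

    def act(self, reward_fn):
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(len(self.params))
        else:
            action = max(range(len(self.params)), key=self.params.__getitem__)
        self.replay.add((action, reward_fn(action)))


def reward_fn(action):
    # Hypothetical environment: arm 1 always pays 1.0, arm 0 pays nothing.
    return 1.0 if action == 1 else 0.0


rng = random.Random(0)
replay = Replay(capacity=1000)
learner = Learner(n_actions=2)
actor = Actor(replay, n_actions=2, epsilon=0.2, rng=rng)

for _ in range(200):
    actor.act(reward_fn)                                # actor writes experience
    learner.step(replay.sample(batch_size=8, rng=rng))  # learner reads and updates
    actor.update(learner.get_params())                  # parameters flow back

print(learner.q)
```

In a distributed setup the three objects live in different processes: the replay becomes a service such as Reverb, and the parameter hand-off becomes a network call, while the loop itself keeps the same shape.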

Sister libraries in the same DeepMind stack:

  • DM-Haiku: a neural-network module library on JAX (Flax is now the more popular sibling).
  • RLax: RL loss functions / building blocks on JAX.
  • Distrax: a JAX-based distribution library (a TFP alternative).
  • Reverb: a distributed replay-buffer service.

Pearl — Meta's Production Decision-System RL Library

Pearl is an RL library released by Meta's Applied Reinforcement Learning team in late 2023. The project describes itself as a Production-ready Reinforcement Learning AI Agent Library. As of May 2026 it is squarely focused on online decision systems: ad bidding, content recommendation, notification timing, and similar.

Pearl emphasizes:

  • A single API for contextual bandits and RL: useful when only partial reward signals are observed.
  • Off-policy evaluation (OPE): compare production policies without running experiments.
  • Safe exploration: protect business KPIs while exploring.
  • Decoupled training and serving at scale: training runs in PyTorch; serving runs in a separate runtime.
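Off-policy evaluation is the least intuitive item in this list, so here is a minimal sketch of one classic estimator, inverse propensity scoring (IPS). This is not Pearl's API, and Pearl may use different estimators; the log format, the policy function, and the data below are all made up for illustration.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's average reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave that action.
    target_policy: function (context, action) -> probability under the new policy.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Hypothetical logs from a uniform-random logging policy over two actions.
logs = [
    ("new_user", 0, 1.0, 0.5),
    ("new_user", 1, 0.0, 0.5),
    ("returning_user", 0, 0.0, 0.5),
    ("returning_user", 1, 1.0, 0.5),
]


def target_policy(context, action):
    # Deterministic candidate policy: action 0 for new users, action 1 otherwise.
    preferred = 0 if context == "new_user" else 1
    return 1.0 if action == preferred else 0.0


logged_average = sum(reward for _, _, reward, _ in logs) / len(logs)
estimate = ips_estimate(logs, target_policy)
print(f"logging policy average reward: {logged_average:.2f}")
print(f"IPS estimate for target policy: {estimate:.2f}")
```

The estimate is only trustworthy when the logging policy gave non-zero probability to every action the target policy would take; production systems typically add weight clipping or doubly robust corrections to control variance.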

Inside Meta, parts of the ads, recommendations, and notifications stacks run on Pearl. The OSS release exposes the same abstractions to outside users.

Gymnasium — Farama Foundation's API Standard

OpenAI Gym went unmaintained after 2021, and the Farama Foundation forked it as Gymnasium, which is now the de facto standard. Every major RL library in May 2026 (SB3, RLlib, Tianshou, CleanRL, TorchRL) targets the Gymnasium API first.

The differences between Gym and Gymnasium are small but consequential.

  • env.reset() returns (obs, info).
  • env.step(action) returns the 5-tuple (obs, reward, terminated, truncated, info), splitting terminated (episode finished) from truncated (timeout).
  • Standardized seed handling via env.reset(seed=42).
  • A standardized gym.vector vectorization API.

import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, info = env.reset(seed=42)
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
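The terminated / truncated split matters most when computing bootstrapped value targets. The helper below is a minimal sketch under the usual one-step TD convention; the function name and the numbers are illustrative and not part of Gymnasium.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target that respects the terminated / truncated split.

    Only `terminated` stops bootstrapping. A truncated episode (time limit)
    was cut off from outside, so the next state still has value and the
    target keeps the gamma * next_value term.
    """
    if terminated:
        return reward
    return reward + gamma * next_value


# Same transition, two different episode endings (made-up numbers).
on_termination = td_target(reward=1.0, next_value=10.0, terminated=True)
on_truncation = td_target(reward=1.0, next_value=10.0, terminated=False)
print(on_termination, on_truncation)
```

With the old single `done` flag, both cases collapsed into the first branch, which silently biased value estimates in time-limited tasks.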

Under the same Farama umbrella sit PettingZoo (multi-agent), MiniGrid (grid-world), MiniWorld (3D mini), Procgen (procedurally generated), and Highway-env (mini autonomous driving).

PettingZoo + MARL — The Multi-Agent RL API Standard

PettingZoo is Gymnasium's multi-agent sibling. It supports both the AEC (Agent Environment Cycle) API and the Parallel API. RLlib, Tianshou, and TorchRL all consume PettingZoo environments first-class.

Categories shipped with PettingZoo:

  • Atari Multiplayer: 2P Atari games like Pong, Boxing.
  • Classic: chess, Go, card games.
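  • Butterfly: cooperative graphical games such as Pistonball and Knights Archers Zombies.
  • MPE (Multi-Particle Environments): cooperative / competitive particle simulations (originally from OpenAI).
  • SISL: cooperative continuous-control tasks (Multiwalker, Pursuit, Waterworld).
  • MAgent2: large-scale (1000+ agents) battle / cooperation; maintained as a separate Farama package.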

Sample code:

from pettingzoo.classic import chess_v6

env = chess_v6.env(render_mode="human")
env.reset(seed=42)
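for agent in env.agent_iter():
    obs, reward, term, trunc, info = env.last()
    if term or trunc:
        action = None  # finished agents must step with None
    else:
        # Sample only legal moves; an unmasked sample is almost always illegal in chess.
        action = env.action_space(agent).sample(obs["action_mask"])
    env.step(action)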
env.close()

Multi-agent RL algorithms include MAPPO, IPPO, QMIX, MADDPG, and COMA, and libraries like RLlib MARL, MARLlib, and EPyMARL run on top of PettingZoo.

Atari, MuJoCo, DeepMind Control Suite — Classic Benchmarks Today

The standard RL benchmarks remain strong.

  • Arcade Learning Environment (ALE): 50+ Atari games, standard since DQN. ALE 0.10 (2024) is integrated with Gymnasium.
  • MuJoCo: acquired by DeepMind in 2021 and open-sourced in 2022. MuJoCo 3.x adds GPU acceleration via MJX.
  • DeepMind Control Suite (dm_control): a MuJoCo-based continuous-control benchmark with Walker, Cheetah, Humanoid, Quadruped, and more.

MuJoCo 3 has shipped the MJX JAX backend since its first release in late 2023. The same model can run thousands of parallel rollouts on a single GPU, which is heavily cited in sim-to-real work.

import gymnasium as gym

env = gym.make("HalfCheetah-v5")
obs, info = env.reset(seed=0)
for _ in range(200):
    obs, r, term, trunc, info = env.step(env.action_space.sample())

In the same category: PyBullet (open-source physics engine, an alternative to MuJoCo) and Gazebo (ROS-integrated robotics simulator).

Brax — JAX-Based Differentiable Physics Simulator

Brax is a JAX-based physics simulator from Google. It is differentiable physics + GPU parallelism + JIT compilation, so RL throughput can be 100-1000x higher than CPU-based MuJoCo in some setups.

Brax environments are written to be MuJoCo-compatible: Ant, HalfCheetah, Humanoid, Walker2d, Hopper, Pusher, Reacher are all available under the same names.

import jax
import brax.envs
from brax.training.agents.ppo import train as ppo

env = brax.envs.create(env_name="ant", backend="positional")
make_inference_fn, params, _ = ppo.train(
    environment=env,
    num_timesteps=50_000_000,
    num_evals=10,
    reward_scaling=10,
    episode_length=1000,
    normalize_observations=True,
    action_repeat=1,
    unroll_length=5,
    num_minibatches=32,
    num_updates_per_batch=4,
    discounting=0.97,
    learning_rate=3e-4,
    entropy_cost=1e-2,
    num_envs=4096,
    batch_size=2048,
    seed=0,
)

The trick is running thousands of envs on the GPU at once — note num_envs=4096. Brax + Acme + RLax can finish 1B+ training steps in under a day on a single GPU.

NVIDIA Isaac Lab + Cosmos — The Industry Standard for Robotics Sim-to-Real

In 2023-2024 NVIDIA moved from IsaacGym to Isaac Lab. As of May 2026, Isaac Lab (absorbing the older OmniIsaac / IsaacGymEnvs) is the industry standard for robotics RL simulation. Cosmos (covered in iter69) provides a generative world model layer for sim-to-real on top.

Isaac Lab features:

  • Everything runs on the GPU: physics, observation synthesis, reward computation.
  • Thousands to tens of thousands of parallel envs: a single A100/H100 can train 4096 robot policies in parallel.
  • NVIDIA Omniverse + USD standard: assets are shared in standard USD.
  • Built-in Domain Randomization: automated material / physics noise for sim-to-real.
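Domain randomization is simple at its core: resample physics parameters at every episode reset so the policy cannot overfit to one simulator configuration. The sketch below is plain Python and does not use the Isaac Lab API; the parameter names and ranges are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    friction: float
    mass_scale: float
    motor_latency_ms: float


# Hypothetical randomization ranges as (low, high); real values are robot-specific.
RANGES = {
    "friction": (0.5, 1.25),
    "mass_scale": (0.8, 1.2),
    "motor_latency_ms": (0.0, 20.0),
}


def sample_physics(rng):
    """Draw one set of physics parameters uniformly from the configured ranges."""
    return PhysicsParams(
        friction=rng.uniform(*RANGES["friction"]),
        mass_scale=rng.uniform(*RANGES["mass_scale"]),
        motor_latency_ms=rng.uniform(*RANGES["motor_latency_ms"]),
    )


rng = random.Random(0)
# One fresh draw per episode reset; a GPU simulator does this per parallel env.
episode_params = [sample_physics(rng) for _ in range(4)]
for params in episode_params:
    print(params)
```

A policy trained across these draws has to work for the whole range, which is the property that helps it survive the gap between simulated and real hardware.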

NVIDIA's announcements name several humanoid companies, including Boston Dynamics, Agility Robotics, Figure AI, and 1X, among Isaac Lab adopters for RL policy training. Common pairings are Isaac Lab + RSL-RL (ETH Zurich) and Isaac Lab + Stable-Baselines3 or in-house PPO.

VizDoom, MineRL, MineDojo, Crafter, NetHack — The Procedural-Env Comeback

Procedural and open-world RL benchmarks regained attention in 2024-2026.

  • VizDoom: a first-person-shooter env on Doom 1. Returned at NeurIPS 2024 Open-Ended Learning.
  • MineRL + MineDojo: Minecraft-based. MineDojo (NVIDIA) ships an internet-video dataset + task spec + environment together.
  • Crafter: a mini-Minecraft with 22 achievements. Quick to evaluate on a single GPU.
  • NetHack Learning Environment (NLE): Meta's NetHack roguelike. Procedural dungeons and an enormous action space; popular for benchmarking LLM agents.
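
import gym  # MineRL registers its envs with the legacy gym package, not Gymnasium
import minerl  # noqa: F401  (importing registers the MineRL environments)

env = gym.make("MineRLObtainDiamond-v0")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    # Legacy Gym API: step() returns a 4-tuple with a single done flag.
    obs, r, done, info = env.step(action)
env.close()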

These environments are also heavily used for LLM-based agent evaluation. Voyager (NVIDIA, a GPT-4 Minecraft agent), DEPS, and JARVIS-1 all come out of the Minecraft agent line of work around MineRL and MineDojo.

MetaWorld, RoboCasa, LIBERO — Robotics Task Suites

Robotics RL is evaluated not against a single simulator but against task suites paired with simulators.

  • MetaWorld: the standard 50-task manipulation benchmark. Meta-learning a single policy across 50 tasks is the canonical evaluation.
  • RoboCasa: 100+ kitchen tasks with photoreal rendering. From UT Austin + NVIDIA.
  • LIBERO: a lifelong-learning manipulation benchmark — task distributions shift over time.
  • Habitat 3.0 (Meta): human-robot collaboration evaluation, with humanoid simulation integrated.

MetaWorld usage:

import metaworld
import random

mt10 = metaworld.MT10()
training_envs = []
for name, env_cls in mt10.train_classes.items():
    env = env_cls()
    task = random.choice([t for t in mt10.train_tasks if t.env_name == name])
    env.set_task(task)
    training_envs.append(env)

PufferLib — Env Compatibility and a Throughput Boost

PufferLib is a library that unifies environment APIs and adds fast vectorization on top. It has grown fast since 2024. The core claim: keep your CPU envs, then run 100-1000 of them at max throughput on a single GPU host.

PufferLib delivers all of these in one place:

  • Env adapters: Gym, Gymnasium, PettingZoo, NetHack (NLE), MineRL, Crafter, Atari, Procgen behind one interface.
  • Vectorization: shared-memory + multiprocessing vector envs.
  • Native Puffer envs: NMMO, Pong, plus other fast C envs.
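
import gymnasium as gym
import pufferlib.emulation
import pufferlib.vectorization

# Module names follow the early pufferlib.vectorization API used in this article;
# newer releases expose vectorization under pufferlib.vector, so check your version.
def env_creator():
    # GymnasiumPufferEnv wraps an env instance rather than an env id string.
    return pufferlib.emulation.GymnasiumPufferEnv(env=gym.make("CartPole-v1"))

vecenv = pufferlib.vectorization.Multiprocessing(env_creator, num_envs=64)
obs = vecenv.reset()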

PufferLib + CleanRL, or PufferLib + RLlib, commonly delivers a 2-10x speedup in practice.

LLM + RL — From PPO to GRPO, plus TRL and the Training Infrastructure

In January 2025, DeepSeek-R1 was released. It was trained with GRPO (Group Relative Policy Optimization), a PPO variant introduced earlier in the DeepSeekMath work, and GRPO variants have since become the standard for LLM reasoning model training.

A short summary of PPO vs. GRPO:

  • PPO: policy + value (critic) — two networks. The critic provides a baseline.
  • GRPO: drops the critic. For each prompt, several responses are sampled as a group, and rewards are normalized within the group to estimate advantage. Saves memory and compute.

Aspect | PPO | GRPO
critic network | required | not used
advantage estimation | GAE + value baseline | group-normalized rewards
memory | actor + critic | actor only
stability | well-known stable regions | group size g is the key knob
LLM fit | standard but critic is heavy | de facto standard for reasoning RL
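The "group-normalized rewards" row can be sketched in a few lines. The function below is an illustration rather than the TRL or DeepSeek implementation: the name, the epsilon, and the use of the population standard deviation are assumptions (some implementations use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt.

    rewards: scalar rewards for the G responses sampled for the same prompt.
    Each response is scored relative to its own group, so no critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled answers: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every response gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

The second case shows why group size matters: with too few samples per prompt, many groups are all-correct or all-wrong and contribute zero gradient.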

TRL (Hugging Face's Transformer Reinforcement Learning library) is the LLM-side RL library (covered in iter62 LLM fine-tuning). As of May 2026 it supports PPO, DPO, GRPO, KTO, ORPO, and reward-model training.

from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

config = GRPOConfig(
    learning_rate=1e-6,
    num_generations=8,
    max_prompt_length=512,
    max_completion_length=1024,
    beta=0.04,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    reward_funcs=[lambda completions, **kw: [len(c) for c in completions]],
    train_dataset=...,
    processing_class=tokenizer,
)
trainer.train()

The LLM RL stack has a different shape than classic RL. Rollouts run on vLLM or SGLang, training runs on DeepSpeed / FSDP, and algorithms are implemented in TRL, OpenRLHF, or verl. A training cluster of 16-128 GPUs is typical.

The 2026 Algorithm Landscape — PPO Still Rules, with New Companions

As of May 2026, here is the standing per algorithm.

  • PPO: still number one. Simple, stable, broad env compatibility. The default first baseline.
  • GRPO: de facto standard for LLM reasoning training, slowly spreading to general RL as a PPO alternative.
  • SAC: the standard for continuous control (MuJoCo, robots). Entropy regularization + double critic.
  • TD3: SAC alternative. Twin critics + delayed policy updates.
  • DQN family: Rainbow (Double + Dueling + Noisy + Prioritized + Multi-step + Distributional/C51) is still strong. IQN, QR-DQN are standards for distributional RL.
  • Decision Transformer (DT): reframes RL as sequence modeling. Aligns well with offline RL and LLMs; common in 2023-2025 research.
  • Diffusion Policy: BC + diffusion for robot manipulation. Driven by Toyota Research, Stanford, and NVIDIA.
  • MuZero / EfficientZero / Stochastic MuZero: the kings of model-based RL. The successors to AlphaGo and AlphaZero.
  • AlphaTensor / AlphaCode / AlphaProof: domain-specific RL applications from DeepMind.
  • Q-Transformer: DeepMind's RT-X line. A transformer + Q-learning hybrid for robot learning.
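Since PPO is still the default baseline, it helps to see how small its core idea is. Below is a minimal per-sample sketch of the clipped surrogate objective in plain Python; real implementations average it over a batch, negate it to obtain a loss, and add value and entropy terms. The function name and example numbers are illustrative.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for a single (state, action) sample."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective pessimistic: the policy gets
    # no extra credit for moving the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


logp_old = 0.0
logp_up = math.log(1.5)    # new policy makes the action 1.5x more likely
logp_down = math.log(0.5)  # new policy makes the action half as likely

gain_capped = ppo_clip_objective(logp_up, logp_old, advantage=1.0)
penalty_kept = ppo_clip_objective(logp_up, logp_old, advantage=-1.0)
inside_range = ppo_clip_objective(logp_down, logp_old, advantage=1.0)
print(gain_capped, penalty_kept, inside_range)
```

The first call is capped at 1.2 times the advantage, while the second keeps the full penalty of roughly -1.5: clipping limits how much the policy can gain from a large step but never hides how much it can lose.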

Benchmark Landscape — MuJoCo / Atari / DM Control / NetHack / Crafter

Standard benchmarks for algorithm comparison:

  • Atari 100k: human-normalized score across 26 Atari games with a budget of 100k env steps per game. The data-efficiency standard.
  • MuJoCo Locomotion: continuous-control scores on HalfCheetah, Walker, Ant, Humanoid.
  • DeepMind Control Suite: average across 28+ tasks. Pixel and state inputs are reported separately.
  • MineRL Diamond: diamond-mining success rate — a very long-horizon task.
  • NetHack Challenge: NetHack score. Huge action space and procedural dungeons.
  • BabyAI: a natural-language grid-world. Evaluates language + RL.
  • Crafter: a 22-achievement evaluation, fast on a single GPU.
  • Procgen: 16 procedural games. The standard for generalization evaluation.
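Atari-style benchmarks are usually aggregated with human-normalized scores, where 0 corresponds to a random policy and 1 to the human reference. The sketch below shows the arithmetic; the game names and all scores are made-up placeholders, not published numbers.

```python
from statistics import mean, median


def human_normalized(agent, random_score, human):
    """Human-normalized score: 0.0 equals random play, 1.0 equals the human reference."""
    return (agent - random_score) / (human - random_score)


# Hypothetical raw scores per game: (agent, random baseline, human reference).
raw_scores = {
    "game_a": (50.0, 10.0, 110.0),
    "game_b": (300.0, 100.0, 200.0),
    "game_c": (5.0, 0.0, 50.0),
}

normalized = {
    game: human_normalized(agent, rand, human)
    for game, (agent, rand, human) in raw_scores.items()
}
mean_hns = mean(normalized.values())
median_hns = median(normalized.values())
print(normalized)
print(f"mean={mean_hns:.3f} median={median_hns:.3f}")
```

The mean is pulled up by the one game where the agent is far above human level, which is why papers typically report the median or an interquartile mean next to it.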

Korean RL Research — KAIST, SNU, NCSOFT, Krafton, NAVER

Korean RL research has exploded since the 2020s. Key groups:

  • KAIST AI: groups led by Eunho Yang, Juho Lee, and Sungju Hwang have published heavily on meta RL, offline RL, and model-based RL at NeurIPS/ICLR/ICML.
  • Seoul National University AI Institute: groups led by Gunhee Kim and Byoung-Tak Zhang work on multi-agent RL and language-RL combinations.
  • POSTECH AI: groups led by Sungsoo Ahn and Seunghyun Lee focus on stability / theoretical RL.
  • NCSOFT AI Center: RL boss AIs for Lineage and Blade & Soul — industrial game-RL deployments.
  • Krafton AI: bots and NPC AI for PUBG. Workshop presentations at ICLR/NeurIPS 2024-2025.
  • NAVER Search Engineering: RL for search ranking and ad bidding.
  • Kakao Enterprise AI: RL for recommendation.

Korean RL is strong on game RL and industrial decision systems, and academically active on theory and meta-RL.

Japanese RL Research — Preferred Networks, DeepMind Tokyo, Sony AI, NTT

Japanese RL research is strong in games, robotics, and industrial applications.

  • Preferred Networks (PFN): an RL contributor since the Chainer days. Maintains PFRL OSS. Robotics collaborations with Toyota and FANUC.
  • DeepMind Tokyo: DeepMind's Tokyo office. Extensive collaborations with Japanese academia.
  • Sony AI: Gran Turismo Sophy — the RL racing agent in Gran Turismo 7. Cover of Nature 2022; integrated into the main series by 2026.
  • NTT CS Labs: telecom / network RL — SDN routing, power-grid control.
  • Riken AIP: meta-learning, continual learning.
  • OMRON SINIC X: robot manipulation.

Gran Turismo Sophy is one of the few cases that combines a high-fidelity simulator with a comparable human baseline — one of the most visible demonstrations of what industrial RL can do.

Real-World Deployments — From AlphaGo to AlphaChip and Boston Dynamics

Notable RL deployments:

  • AlphaGo → AlphaZero → MuZero (DeepMind): Go, chess, shogi, Atari. The starting point for model-based RL.
  • AlphaStar (DeepMind): grandmaster-level StarCraft II — peer to top human players.
  • OpenAI Five: world-champion Dota 2 — the first giant distributed RL infrastructure case.
  • AlphaFold 2 (DeepMind): often listed alongside these systems, but it is a supervised deep-learning model for protein structure prediction, not an RL deployment.
  • AlphaTensor (DeepMind, 2022): discovering matrix-multiplication algorithms with RL.
  • AlphaChip (DeepMind, 2024 — formerly Chip Placement): RL for TPU floorplanning. Covered in iter96.
  • Gran Turismo Sophy (Sony AI, Nature 2022): RL racing.
  • Boston Dynamics + RL fine-tuning: motion-optimization RL for Atlas / Spot.
  • NVIDIA Eureka (2023): GPT-4 writes RL reward functions and trains in Isaac Gym.
  • Loon balloon control (Google Research with Loon, Nature 2020): RL navigation for stratospheric balloons.
  • Amazon SageMaker RL, Microsoft Bonsai, Vertex AI Vizier: managed RL for industrial use.

OpenAI Spinning Up and Educational Resources

Core resources for getting started:

  • OpenAI Spinning Up: an educational RL library — simple implementations of PPO, SAC, TD3, DDPG, VPG with theory notes. Released in 2018, still a student starting point.
  • CleanRL single-file algorithms: see above.
  • Hugging Face Deep RL Course: 8 units, free. SB3 + Unity ML-Agents + Gymnasium.
  • DeepMind x UCL RL Course (David Silver): the classic; theoretical foundation.
  • Sergey Levine CS 285 (UC Berkeley): the most influential recent RL course.
  • Pieter Abbeel CS 287 (Berkeley): robotics RL.

A natural path: read the Spinning Up code → walk through CleanRL's ppo.py line by line → experiment with SB3 / Tianshou → scale up with RLlib / TorchRL to distributed and multi-agent.

Tool Selection Guide — Recommendations by Scenario

A short cheat sheet:

  • Solo, fast baseline → SB3 (LunarLander, MuJoCo, single machine).
  • Reproducibility / paper writing → CleanRL (single file + W&B).
  • Training stability + module extension → Tianshou or TorchRL.
  • Thousands of actors distributed → RLlib (Ray).
  • TPU / JAX / Brax → Acme + RLax + Haiku (or Flax).
  • Robotics sim-to-real → Isaac Lab + RSL-RL / SB3 or in-house PPO.
  • Multi-agent → PettingZoo + RLlib MARL or MARLlib, EPyMARL.
  • Game theory → OpenSpiel.
  • Ads / recommendation decision systems → Pearl or your own contextual bandit.
  • LLM RL → TRL + vLLM / SGLang; algorithm: GRPO or DPO.

The most important question when picking a tool is "where does environment throughput bottleneck?" CPU envs are fine with SB3 / Tianshou. GPU sims fit Isaac Lab / Brax. Distributed jobs go to RLlib / Acme. LLM training uses TRL.

Wrap-Up — May 2026, RL Is a Stack, Not a Single Tool

The conclusion of this article is simple. The single-tool era of RL is over. You assemble a stack of environment + library + simulator + distributed runtime that fits your domain.

Two big trends to close on. First, the split between LLM-RL and classic RL. LLM-RL has become its own area — TRL + vLLM + GRPO/DPO. Classic RL continues with SB3 / RLlib / Tianshou + Gymnasium + simulators. Second, robotics RL has crossed into commercial deployment. Isaac Lab + Cosmos + Diffusion Policy + Q-Transformer is now the default stack for several humanoid and manipulator companies.

Do not overspend time picking tools. As long as you nail a consistent environment interface, a stable distributed runtime, and good experiment tracking, 90% of the problem is solved. The remainder is domain-specific tuning.

References