- Author: Youngju Kim (@fjvbn20031)

## Overview
Navigating web browsers to find information, fill out forms, and click buttons is natural for humans but extremely challenging for machines. Web pages are dynamic with diverse layouts, and the same task must be performed differently across different websites.
By applying reinforcement learning to web navigation, agents can learn appropriate actions (clicking, typing, etc.) by observing the visual information or DOM structure of web pages.
## Challenges of Web Navigation

### Why Is It Difficult?
Web environments differ significantly from traditional RL environments:
- Enormous state space: The rendered result (pixels) of a web page has millions of dimensions
- Enormous action space: Combinations of mouse position (x, y) + click/drag + keyboard input
- Delayed reward: Task completion can only be confirmed after multiple clicks and inputs
- Partial observability: Elements visible only after scrolling, popups, dynamic loading, etc.
- Environment non-determinism: The same page can show different states depending on loading time
### State Representation Methods
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Pixel-based | Use screenshots directly as input | General purpose, includes visual info | High-dimensional, slow learning |
| DOM-based | Parse HTML DOM tree for use | Rich structural information | Different structure per site |
| Hybrid | Combine pixel + DOM information | Richest information | Complex preprocessing needed |
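To make the DOM-based row concrete, here is a minimal sketch that uses Python's standard `html.parser` to collect interactive elements into a flat state list. The element set and feature layout are illustrative assumptions, not a fixed standard:

```python
from html.parser import HTMLParser

class DOMStateExtractor(HTMLParser):
    """Collect interactive DOM elements as a flat list of (tag, attrs) features."""
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Keep only elements an agent could plausibly click or type into
        if tag in self.INTERACTIVE:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})

extractor = DOMStateExtractor()
extractor.feed('<div><button id="ok">OK</button><input type="text" name="q"></div>')
print(extractor.elements)
```

Each entry in such a list can then be embedded (e.g. tag one-hot plus attribute text embedding) and fed to the policy alongside, or instead of, pixels.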
## Browser Automation and Reinforcement Learning

### Environment Interface

We wrap the web browser as a Gym-compatible environment:
```python
import numpy as np

class WebEnvironment:
    """Gym-compatible web browser environment"""

    def __init__(self, task_config, screen_width=160, screen_height=210):
        self.screen_width = screen_width
        self.screen_height = screen_height
        self.task = task_config
        # Action space: grid clicks + keyboard input
        self.grid_size = 16  # 16x16 grid
        self.num_click_actions = self.grid_size * self.grid_size
        self.num_type_actions = 128  # ASCII characters
        self.total_actions = self.num_click_actions + self.num_type_actions

    def reset(self):
        """Start new episode: load web page"""
        self._load_page(self.task['url'])
        screenshot = self._get_screenshot()
        return self._preprocess(screenshot)

    def step(self, action):
        """Execute action and return result"""
        if action < self.num_click_actions:
            # Click action: convert to grid coordinates
            row = action // self.grid_size
            col = action % self.grid_size
            x = col * (self.screen_width // self.grid_size)
            y = row * (self.screen_height // self.grid_size)
            self._click(x, y)
        else:
            # Typing action
            char_idx = action - self.num_click_actions
            self._type_char(chr(char_idx))
        screenshot = self._get_screenshot()
        obs = self._preprocess(screenshot)
        reward = self._compute_reward()
        done = self._check_done()
        return obs, reward, done, {}

    def _preprocess(self, screenshot):
        """Convert screenshot to model input"""
        # Resize and normalize
        resized = np.array(screenshot.resize((self.screen_width,
                                              self.screen_height)))
        return resized.astype(np.float32) / 255.0
```
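The action decoding inside `step` is easy to check in isolation. A stand-alone sketch using the constructor defaults above (16x16 grid, 160x210 screen):

```python
# Stand-alone check of the flat-action decoding used in WebEnvironment.step.
GRID_SIZE = 16
SCREEN_W, SCREEN_H = 160, 210
NUM_CLICK_ACTIONS = GRID_SIZE * GRID_SIZE  # 256 click actions, then 128 typing actions

def decode_action(action):
    """Map a flat action index to ('click', x, y) or ('type', char)."""
    if action < NUM_CLICK_ACTIONS:
        row, col = divmod(action, GRID_SIZE)
        x = col * (SCREEN_W // GRID_SIZE)   # cell width = 10 px
        y = row * (SCREEN_H // GRID_SIZE)   # cell height = 13 px (integer division)
        return ("click", x, y)
    return ("type", chr(action - NUM_CLICK_ACTIONS))

print(decode_action(0))                      # top-left click
print(decode_action(17))                     # row 1, col 1
print(decode_action(NUM_CLICK_ACTIONS + 65)) # typing action for 'A'
```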
## Mini World of Bits Benchmark

Mini World of Bits (MiniWoB), developed by OpenAI and Stanford researchers, is the standard benchmark for web navigation RL. The goal is to perform specific tasks on simple HTML widgets.

### Task Examples
- click-button: Click a button with specific text
- click-checkboxes: Select designated checkboxes
- enter-text: Enter a specified string in a text field
- navigate-tree: Navigate a tree structure menu to select a specific item
- email-inbox: Find and perform actions on emails matching specific conditions
Each task is a small 160x210 pixel web page that the agent must complete within 10 seconds.
```python
class MiniWoBTask:
    """MiniWoB task definition"""

    def __init__(self, task_name):
        self.task_name = task_name
        self.time_limit = 10.0  # 10 seconds
        self.reward_range = (-1.0, 1.0)

    def get_reward(self, page_state, time_elapsed):
        """Reward based on task completion"""
        if self._is_task_complete(page_state):
            # Higher reward for faster completion
            time_bonus = 1.0 - (time_elapsed / self.time_limit)
            return max(time_bonus, 0.1)
        elif time_elapsed >= self.time_limit:
            return -1.0  # Timeout penalty
        else:
            return 0.0  # In progress
```
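The reward shaping above can be exercised without a browser by restating it as a free function:

```python
# Stand-alone restatement of MiniWoBTask.get_reward's shaping logic.
TIME_LIMIT = 10.0

def shaped_reward(completed, time_elapsed):
    """Time-scaled reward: faster completions earn more, floor of 0.1 on success."""
    if completed:
        return max(1.0 - time_elapsed / TIME_LIMIT, 0.1)
    if time_elapsed >= TIME_LIMIT:
        return -1.0  # timeout penalty
    return 0.0      # episode still in progress

print(shaped_reward(True, 2.0))   # fast success -> large bonus
print(shaped_reward(True, 9.9))   # slow success -> clipped to the 0.1 floor
print(shaped_reward(False, 10.0)) # timeout -> -1.0
```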
## OpenAI Universe

OpenAI Universe was a platform that provided various environments (games, web browsers, etc.) through a standard interface. It observed browser screens via VNC (Virtual Network Computing) and sent mouse/keyboard events.

### Features of VNC-Based Environments
- Runs actual browsers (Chrome, Firefox) in Docker containers
- The agent receives screen pixels via VNC protocol
- Actions are sent as mouse coordinates and keyboard events
- Network latency and rendering delays exist
```python
class VNCActionSpace:
    """VNC-based action space"""

    def __init__(self, screen_width, screen_height):
        self.screen_width = screen_width
        self.screen_height = screen_height

    def click(self, x, y):
        """Generate mouse click event"""
        return {
            'type': 'pointer',
            'x': int(x),
            'y': int(y),
            'button': 1,  # Left click
        }

    def type_text(self, text):
        """Generate keyboard input events"""
        events = []
        for char in text:
            events.append({
                'type': 'key',
                'key': char,
                'action': 'press',
            })
        return events

    def scroll(self, x, y, direction='down'):
        """Generate scroll event"""
        delta = -3 if direction == 'down' else 3
        return {
            'type': 'scroll',
            'x': int(x),
            'y': int(y),
            'delta': delta,
        }
```
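One simplification in `type_text`: the RFB protocol behind VNC transmits key-down and key-up as separate events, while the class emits only presses. A stand-alone sketch of a press/release pair generator (the event dict shape mirrors the class above and is illustrative, not the wire format):

```python
def key_events(text):
    """One press and one release event per character, in order."""
    events = []
    for ch in text:
        for action in ("press", "release"):
            events.append({"type": "key", "key": ch, "action": action})
    return events

events = key_events("hi")
print(events)  # press/release for 'h', then press/release for 'i'
```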
## Simple Click Approach

The most basic web navigation agent transforms the problem into a classification task by dividing the screen into a grid and deciding which cell to click.

### Grid Action Space
```python
class GridActionSpace:
    """Divide screen into NxN grid to determine click position"""

    def __init__(self, screen_w, screen_h, grid_n=16):
        self.screen_w = screen_w
        self.screen_h = screen_h
        self.grid_n = grid_n
        self.cell_w = screen_w / grid_n
        self.cell_h = screen_h / grid_n
        self.n_actions = grid_n * grid_n

    def action_to_coordinate(self, action_idx):
        """Convert action index to screen coordinates"""
        row = action_idx // self.grid_n
        col = action_idx % self.grid_n
        # Cell center coordinates
        x = (col + 0.5) * self.cell_w
        y = (row + 0.5) * self.cell_h
        return int(x), int(y)

    def coordinate_to_action(self, x, y):
        """Convert screen coordinates to action index"""
        col = min(int(x / self.cell_w), self.grid_n - 1)
        row = min(int(y / self.cell_h), self.grid_n - 1)
        return row * self.grid_n + col
```
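A quick consistency check of this mapping, restated as free functions with the MiniWoB screen size so it runs stand-alone: every action index should survive a round trip through screen coordinates.

```python
# Same formulas as GridActionSpace, specialized to a 160x210 screen, 16x16 grid.
GRID_N = 16
CELL_W, CELL_H = 160 / GRID_N, 210 / GRID_N  # 10.0 and 13.125 px

def action_to_coordinate(a):
    row, col = divmod(a, GRID_N)
    return int((col + 0.5) * CELL_W), int((row + 0.5) * CELL_H)

def coordinate_to_action(x, y):
    col = min(int(x / CELL_W), GRID_N - 1)
    row = min(int(y / CELL_H), GRID_N - 1)
    return row * GRID_N + col

# Clicking a cell's center must map back to the same action index.
assert all(coordinate_to_action(*action_to_coordinate(a)) == a
           for a in range(GRID_N * GRID_N))
print(action_to_coordinate(0))  # center of the top-left cell
```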
### CNN-Based Model

A model that takes screenshots as input and outputs click probabilities for grid cells:
```python
import torch
import torch.nn as nn

class WebNavigationModel(nn.Module):
    """CNN-based web navigation agent"""

    def __init__(self, grid_size=16):
        super().__init__()
        self.grid_size = grid_size
        n_actions = grid_size * grid_size
        # CNN for screen feature extraction
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
        )
        # Compute feature vector size
        self._feature_size = self._get_conv_output_size((3, 210, 160))
        # Actor: click position policy
        self.policy = nn.Sequential(
            nn.Linear(self._feature_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )
        # Critic: state value
        self.value = nn.Sequential(
            nn.Linear(self._feature_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def _get_conv_output_size(self, shape):
        with torch.no_grad():
            dummy = torch.zeros(1, *shape)
            return self.conv(dummy).view(1, -1).shape[1]

    def forward(self, screen):
        features = self.conv(screen).view(screen.size(0), -1)
        policy_logits = self.policy(features)
        value = self.value(features)
        return policy_logits, value
```
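The flattened size that `_get_conv_output_size` computes with a dummy forward pass can also be derived by hand from the standard valid-convolution formula, floor((n - k) / s) + 1 per layer:

```python
def conv_out(n, k, s):
    """Output length of a valid (no-padding) convolution with kernel k, stride s."""
    return (n - k) // s + 1

# Apply the three conv layers above to a 210x160 input.
h, w = 210, 160
for k, s in [(8, 4), (4, 2), (3, 1)]:
    h, w = conv_out(h, k, s), conv_out(w, k, s)

feature_size = 64 * h * w  # 64 channels after the last conv
print(h, w, feature_size)
```

This matches what the dummy forward pass returns, so the two `nn.Linear(self._feature_size, 512)` heads receive a 22528-dimensional input.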
### Training Loop
```python
def train_web_agent(model, env, num_episodes=10000, gamma=0.99):
    """Train web navigation agent with A2C"""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        log_probs = []
        values = []
        rewards = []
        while not done:
            state_t = torch.FloatTensor(state).permute(2, 0, 1).unsqueeze(0)
            logits, value = model(state_t)
            probs = torch.softmax(logits, dim=-1)
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()
            next_state, reward, done, info = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            values.append(value.squeeze())
            rewards.append(reward)
            state = next_state
            episode_reward += reward
        # A2C update
        returns = compute_returns(rewards, gamma)
        returns_t = torch.FloatTensor(returns)
        values_t = torch.stack(values)
        log_probs_t = torch.stack(log_probs)
        advantages = returns_t - values_t.detach()
        policy_loss = -(log_probs_t * advantages).mean()
        value_loss = (returns_t - values_t).pow(2).mean()
        loss = policy_loss + 0.5 * value_loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 40.0)
        optimizer.step()
        if episode % 100 == 0:
            print(f"Episode {episode}: Reward={episode_reward:.2f}")

def compute_returns(rewards, gamma):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns
```
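A numeric sanity check of the discounted-return recursion (the function is restated here so the snippet runs stand-alone):

```python
def compute_returns(rewards, gamma):
    """Backward pass: R_t = r_t + gamma * R_{t+1}, with R after the episode = 0."""
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

# With gamma = 0.5 and a single terminal reward, each earlier step
# receives the reward discounted once more per step.
print(compute_returns([0.0, 0.0, 1.0], 0.5))  # [0.25, 0.5, 1.0]
```

This is exactly the sparse-reward pattern of MiniWoB tasks: the terminal success signal is propagated backward to the clicks that led to it.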
## Learning from Human Demonstrations

Pure RL learns web navigation very slowly; human demonstrations can significantly accelerate training.

### Behavioral Cloning

First perform supervised learning to imitate human behavior, then fine-tune with RL:
```python
def pretrain_with_demonstrations(model, demos, num_epochs=50):
    """Pretrain with human demonstration data"""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0
        for screen, action in demos:
            screen_t = torch.FloatTensor(screen).permute(2, 0, 1).unsqueeze(0)
            action_t = torch.tensor([action], dtype=torch.long)
            logits, _ = model(screen_t)
            loss = criterion(logits, action_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            predicted = logits.argmax(dim=-1)
            correct += (predicted == action_t).sum().item()
            total += 1
        accuracy = correct / total
        print(f"Epoch {epoch}: Loss={total_loss/len(demos):.4f}, "
              f"Accuracy={accuracy:.2%}")
    return model
```
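Demonstrations for this loop are (screenshot, action_index) pairs, so a recorded human click at pixel (x, y) must first be mapped to a grid action index. A stand-alone sketch using the same 16x16 grid over a 160x210 screen as earlier:

```python
# Convert a logged human click into the flat action index the model is trained on.
GRID_N = 16
CELL_W, CELL_H = 160 / GRID_N, 210 / GRID_N

def click_to_action(x, y):
    """Map a recorded click at pixel (x, y) to its grid-cell action index."""
    col = min(int(x / CELL_W), GRID_N - 1)
    row = min(int(y / CELL_H), GRID_N - 1)
    return row * GRID_N + col

# A click logged at (85, 100) falls in column 8, row 7 -> action index 120.
print(click_to_action(85, 100))
```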
### Demonstration-Weighted Learning

Human demonstration experiences are kept in a separate buffer and mixed into every training batch at a fixed ratio, so they are continuously referenced throughout RL training:
```python
import random

class DemonstrationReplayBuffer:
    """Replay buffer that mixes demonstration data with agent experience"""

    def __init__(self, capacity, demo_ratio=0.25):
        self.capacity = capacity
        self.demo_ratio = demo_ratio
        self.demo_buffer = []
        self.agent_buffer = []

    def add_demonstration(self, transition):
        self.demo_buffer.append(transition)

    def add_agent_experience(self, transition):
        # Evict the oldest agent transition once capacity is reached
        if len(self.agent_buffer) >= self.capacity:
            self.agent_buffer.pop(0)
        self.agent_buffer.append(transition)

    def sample(self, batch_size):
        """Sample a mix of demonstrations and agent experiences"""
        n_demo = int(batch_size * self.demo_ratio)
        n_agent = batch_size - n_demo
        demo_samples = random.sample(
            self.demo_buffer,
            min(n_demo, len(self.demo_buffer))
        )
        agent_samples = random.sample(
            self.agent_buffer,
            min(n_agent, len(self.agent_buffer))
        )
        return demo_samples + agent_samples
```
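The demo-to-agent mixing ratio can be verified with a minimal stand-alone version of `sample` (same logic, plain lists):

```python
import random

def mixed_sample(demo_buffer, agent_buffer, batch_size, demo_ratio=0.25):
    """Draw a batch containing a fixed fraction of demonstration transitions."""
    n_demo = int(batch_size * demo_ratio)
    demo = random.sample(demo_buffer, min(n_demo, len(demo_buffer)))
    agent = random.sample(agent_buffer, min(batch_size - n_demo, len(agent_buffer)))
    return demo + agent

# Demo transitions are 0..99, agent transitions are 100..199.
batch = mixed_sample(list(range(100)), list(range(100, 200)), 32)
print(sum(1 for t in batch if t < 100))  # 8 demo transitions at the default 25% ratio
```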
## Practical Considerations and Limitations

### Current Technical Limitations
- Even on the MiniWoB benchmark, complex tasks (email-inbox, social-media, etc.) do not reach human-level performance
- It is difficult to handle the diversity and complexity of real websites
- Vulnerable to dynamic changes in DOM structure
### Recent Research Directions
- LLM-based web agents: Input HTML as text to generate actions
- Multimodal models: Agents that understand both screen images and text
- Hierarchical RL: Separating high-level planning (which elements to interact with) from low-level execution (precise coordinate clicking)
## Key Takeaways
- Web navigation is a challenging RL problem characterized by enormous state/action spaces and delayed rewards
- A basic agent can be implemented with CNN + A2C by simplifying the action space to a grid-based approach
- Pretraining with human demonstrations (behavioral cloning) significantly accelerates RL learning
- Recent research is moving toward general-purpose web agents using LLMs and multimodal models
In the next post, we will explore continuous action spaces with DDPG and distributional policy gradients.