- Author: Youngju Kim (@fjvbn20031)

## Overview
Navigating web browsers to find information, fill out forms, and click buttons is natural for humans but extremely challenging for machines. Web pages are dynamic with diverse layouts, and the same task must be performed differently across different websites.
By applying reinforcement learning to web navigation, agents can learn appropriate actions (clicking, typing, etc.) by observing the visual information or DOM structure of web pages.
## Challenges of Web Navigation

### Why Is It Difficult?
Web environments differ significantly from traditional RL environments:
- Enormous state space: The rendered result (pixels) of a web page has millions of dimensions
- Enormous action space: Combinations of mouse position (x, y) + click/drag + keyboard input
- Delayed reward: Task completion can only be confirmed after multiple clicks and inputs
- Partial observability: Elements visible only after scrolling, popups, dynamic loading, etc.
- Environment non-determinism: The same page can show different states depending on loading time
### State Representation Methods
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Pixel-based | Use screenshots directly as input | General purpose, includes visual info | High-dimensional, slow learning |
| DOM-based | Parse HTML DOM tree for use | Rich structural information | Different structure per site |
| Hybrid | Combine pixel + DOM information | Richest information | Complex preprocessing needed |
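To make the DOM-based row concrete, here is a minimal sketch that uses Python's standard `html.parser` to collect interactive elements into a flat state list. The element set and feature layout are illustrative assumptions, not a fixed standard:

```python
from html.parser import HTMLParser

class DOMStateExtractor(HTMLParser):
    """Collect interactive DOM elements as a flat list of (tag, attrs) features."""
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Keep only elements an agent could plausibly click or type into
        if tag in self.INTERACTIVE:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})

extractor = DOMStateExtractor()
extractor.feed('<div><button id="ok">OK</button><input type="text" name="q"></div>')
print(extractor.elements)
```

Each entry in such a list can then be embedded (e.g. tag one-hot plus attribute text embedding) and fed to the policy alongside, or instead of, pixels.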
## Browser Automation and Reinforcement Learning

### Environment Interface

We wrap the web browser as a Gym-compatible environment:
```python
import numpy as np

class WebEnvironment:
    """Gym-compatible web browser environment"""

    def __init__(self, task_config, screen_width=160, screen_height=210):
        self.screen_width = screen_width
        self.screen_height = screen_height
        self.task = task_config
        # Action space: grid clicks + keyboard input
        self.grid_size = 16  # 16x16 grid
        self.num_click_actions = self.grid_size * self.grid_size
        self.num_type_actions = 128  # ASCII characters
        self.total_actions = self.num_click_actions + self.num_type_actions

    def reset(self):
        """Start new episode: load web page"""
        self._load_page(self.task['url'])
        screenshot = self._get_screenshot()
        return self._preprocess(screenshot)

    def step(self, action):
        """Execute action and return result"""
        if action < self.num_click_actions:
            # Click action: convert to grid coordinates
            row = action // self.grid_size
            col = action % self.grid_size
            x = col * (self.screen_width // self.grid_size)
            y = row * (self.screen_height // self.grid_size)
            self._click(x, y)
        else:
            # Typing action
            char_idx = action - self.num_click_actions
            self._type_char(chr(char_idx))
        screenshot = self._get_screenshot()
        obs = self._preprocess(screenshot)
        reward = self._compute_reward()
        done = self._check_done()
        return obs, reward, done, {}

    def _preprocess(self, screenshot):
        """Convert screenshot to model input"""
        # Resize and normalize
        resized = np.array(screenshot.resize((self.screen_width,
                                              self.screen_height)))
        return resized.astype(np.float32) / 255.0
```
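The action decoding inside `step` is easy to check in isolation. A stand-alone sketch using the constructor defaults above (16x16 grid, 160x210 screen):

```python
# Stand-alone check of the flat-action decoding used in WebEnvironment.step.
GRID_SIZE = 16
SCREEN_W, SCREEN_H = 160, 210
NUM_CLICK_ACTIONS = GRID_SIZE * GRID_SIZE  # 256 click actions, then 128 typing actions

def decode_action(action):
    """Map a flat action index to ('click', x, y) or ('type', char)."""
    if action < NUM_CLICK_ACTIONS:
        row, col = divmod(action, GRID_SIZE)
        x = col * (SCREEN_W // GRID_SIZE)   # cell width = 10 px
        y = row * (SCREEN_H // GRID_SIZE)   # cell height = 13 px (integer division)
        return ("click", x, y)
    return ("type", chr(action - NUM_CLICK_ACTIONS))

print(decode_action(0))                      # top-left click
print(decode_action(17))                     # row 1, col 1
print(decode_action(NUM_CLICK_ACTIONS + 65)) # typing action for 'A'
```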
## Mini World of Bits Benchmark

Mini World of Bits (MiniWoB), developed by OpenAI and Stanford researchers, is the standard benchmark for web navigation RL. The goal is to perform specific tasks on simple HTML widgets.

### Task Examples
- click-button: Click a button with specific text
- click-checkboxes: Select designated checkboxes
- enter-text: Enter a specified string in a text field
- navigate-tree: Navigate a tree structure menu to select a specific item
- email-inbox: Find and perform actions on emails matching specific conditions
Each task is a small 160x210 pixel web page that the agent must complete within 10 seconds.
```python
class MiniWoBTask:
    """MiniWoB task definition"""

    def __init__(self, task_name):
        self.task_name = task_name
        self.time_limit = 10.0  # 10 seconds
        self.reward_range = (-1.0, 1.0)

    def get_reward(self, page_state, time_elapsed):
        """Reward based on task completion"""
        if self._is_task_complete(page_state):
            # Higher reward for faster completion
            time_bonus = 1.0 - (time_elapsed / self.time_limit)
            return max(time_bonus, 0.1)
        elif time_elapsed >= self.time_limit:
            return -1.0  # Timeout penalty
        else:
            return 0.0  # In progress
```
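The reward shaping above can be exercised without a browser by restating it as a free function:

```python
# Stand-alone restatement of MiniWoBTask.get_reward's shaping logic.
TIME_LIMIT = 10.0

def shaped_reward(completed, time_elapsed):
    """Time-scaled reward: faster completions earn more, floor of 0.1 on success."""
    if completed:
        return max(1.0 - time_elapsed / TIME_LIMIT, 0.1)
    if time_elapsed >= TIME_LIMIT:
        return -1.0  # timeout penalty
    return 0.0      # episode still in progress

print(shaped_reward(True, 2.0))   # fast success -> large bonus
print(shaped_reward(True, 9.9))   # slow success -> clipped to the 0.1 floor
print(shaped_reward(False, 10.0)) # timeout -> -1.0
```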
## OpenAI Universe

OpenAI Universe was a platform that provided various environments (games, web browsers, etc.) through a standard interface. It observed browser screens via VNC (Virtual Network Computing) and sent mouse/keyboard events.

### Features of VNC-Based Environments
- Runs actual browsers (Chrome, Firefox) in Docker containers
- The agent receives screen pixels via VNC protocol
- Actions are sent as mouse coordinates and keyboard events
- Network latency and rendering delays exist
```python
class VNCActionSpace:
    """VNC-based action space"""

    def __init__(self, screen_width, screen_height):
        self.screen_width = screen_width
        self.screen_height = screen_height

    def click(self, x, y):
        """Generate mouse click event"""
        return {
            'type': 'pointer',
            'x': int(x),
            'y': int(y),
            'button': 1,  # Left click
        }

    def type_text(self, text):
        """Generate keyboard input events"""
        events = []
        for char in text:
            events.append({
                'type': 'key',
                'key': char,
                'action': 'press',
            })
        return events

    def scroll(self, x, y, direction='down'):
        """Generate scroll event"""
        delta = -3 if direction == 'down' else 3
        return {
            'type': 'scroll',
            'x': int(x),
            'y': int(y),
            'delta': delta,
        }
```
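One simplification in `type_text`: the RFB protocol behind VNC transmits key-down and key-up as separate events, while the class emits only presses. A stand-alone sketch of a press/release pair generator (the event dict shape mirrors the class above and is illustrative, not the wire format):

```python
def key_events(text):
    """One press and one release event per character, in order."""
    events = []
    for ch in text:
        for action in ("press", "release"):
            events.append({"type": "key", "key": ch, "action": action})
    return events

events = key_events("hi")
print(events)  # press/release for 'h', then press/release for 'i'
```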
## Simple Click Approach

The most basic web navigation agent transforms the problem into a classification task by dividing the screen into a grid and deciding which cell to click.

### Grid Action Space
```python
class GridActionSpace:
    """Divide screen into NxN grid to determine click position"""

    def __init__(self, screen_w, screen_h, grid_n=16):
        self.screen_w = screen_w
        self.screen_h = screen_h
        self.grid_n = grid_n
        self.cell_w = screen_w / grid_n
        self.cell_h = screen_h / grid_n
        self.n_actions = grid_n * grid_n

    def action_to_coordinate(self, action_idx):
        """Convert action index to screen coordinates"""
        row = action_idx // self.grid_n
        col = action_idx % self.grid_n
        # Cell center coordinates
        x = (col + 0.5) * self.cell_w
        y = (row + 0.5) * self.cell_h
        return int(x), int(y)

    def coordinate_to_action(self, x, y):
        """Convert screen coordinates to action index"""
        col = min(int(x / self.cell_w), self.grid_n - 1)
        row = min(int(y / self.cell_h), self.grid_n - 1)
        return row * self.grid_n + col
```
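A quick consistency check of this mapping, restated as free functions with the MiniWoB screen size so it runs stand-alone: every action index should survive a round trip through screen coordinates.

```python
# Same formulas as GridActionSpace, specialized to a 160x210 screen, 16x16 grid.
GRID_N = 16
CELL_W, CELL_H = 160 / GRID_N, 210 / GRID_N  # 10.0 and 13.125 px

def action_to_coordinate(a):
    row, col = divmod(a, GRID_N)
    return int((col + 0.5) * CELL_W), int((row + 0.5) * CELL_H)

def coordinate_to_action(x, y):
    col = min(int(x / CELL_W), GRID_N - 1)
    row = min(int(y / CELL_H), GRID_N - 1)
    return row * GRID_N + col

# Clicking a cell's center must map back to the same action index.
assert all(coordinate_to_action(*action_to_coordinate(a)) == a
           for a in range(GRID_N * GRID_N))
print(action_to_coordinate(0))  # center of the top-left cell
```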
### CNN-Based Model

A model that takes screenshots as input and outputs click probabilities for grid cells:
```python
import torch
import torch.nn as nn

class WebNavigationModel(nn.Module):
    """CNN-based web navigation agent"""

    def __init__(self, grid_size=16):
        super().__init__()
        self.grid_size = grid_size
        n_actions = grid_size * grid_size
        # CNN for screen feature extraction
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
        )
        # Compute feature vector size
        self._feature_size = self._get_conv_output_size((3, 210, 160))
        # Actor: click position policy
        self.policy = nn.Sequential(
            nn.Linear(self._feature_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )
        # Critic: state value
        self.value = nn.Sequential(
            nn.Linear(self._feature_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def _get_conv_output_size(self, shape):
        with torch.no_grad():
            dummy = torch.zeros(1, *shape)
            return self.conv(dummy).view(1, -1).shape[1]

    def forward(self, screen):
        features = self.conv(screen).view(screen.size(0), -1)
        policy_logits = self.policy(features)
        value = self.value(features)
        return policy_logits, value
```
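The flattened size that `_get_conv_output_size` computes with a dummy forward pass can also be derived by hand from the standard valid-convolution formula, floor((n - k) / s) + 1 per layer:

```python
def conv_out(n, k, s):
    """Output length of a valid (no-padding) convolution with kernel k, stride s."""
    return (n - k) // s + 1

# Apply the three conv layers above to a 210x160 input.
h, w = 210, 160
for k, s in [(8, 4), (4, 2), (3, 1)]:
    h, w = conv_out(h, k, s), conv_out(w, k, s)

feature_size = 64 * h * w  # 64 channels after the last conv
print(h, w, feature_size)
```

This matches what the dummy forward pass returns, so the two `nn.Linear(self._feature_size, 512)` heads receive a 22528-dimensional input.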
### Training Loop
```python
def train_web_agent(model, env, num_episodes=10000, gamma=0.99):
    """Train web navigation agent with A2C"""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        log_probs = []
        values = []
        rewards = []
        while not done:
            state_t = torch.FloatTensor(state).permute(2, 0, 1).unsqueeze(0)
            logits, value = model(state_t)
            probs = torch.softmax(logits, dim=-1)
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()
            next_state, reward, done, info = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            values.append(value.squeeze())
            rewards.append(reward)
            state = next_state
            episode_reward += reward
        # A2C update
        returns = compute_returns(rewards, gamma)
        returns_t = torch.FloatTensor(returns)
        values_t = torch.stack(values)
        log_probs_t = torch.stack(log_probs)
        advantages = returns_t - values_t.detach()
        policy_loss = -(log_probs_t * advantages).mean()
        value_loss = (returns_t - values_t).pow(2).mean()
        loss = policy_loss + 0.5 * value_loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 40.0)
        optimizer.step()
        if episode % 100 == 0:
            print(f"Episode {episode}: Reward={episode_reward:.2f}")

def compute_returns(rewards, gamma):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns
```
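A numeric sanity check of the discounted-return recursion (the function is restated here so the snippet runs stand-alone):

```python
def compute_returns(rewards, gamma):
    """Backward pass: R_t = r_t + gamma * R_{t+1}, with R after the episode = 0."""
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

# With gamma = 0.5 and a single terminal reward, each earlier step
# receives the reward discounted once more per step.
print(compute_returns([0.0, 0.0, 1.0], 0.5))  # [0.25, 0.5, 1.0]
```

This is exactly the sparse-reward pattern of MiniWoB tasks: the terminal success signal is propagated backward to the clicks that led to it.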
## Learning from Human Demonstrations

Pure RL learns web navigation very slowly; human demonstrations can significantly accelerate training.

### Behavioral Cloning

First perform supervised learning to imitate human behavior, then fine-tune with RL:
```python
def pretrain_with_demonstrations(model, demos, num_epochs=50):
    """Pretrain with human demonstration data"""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        total_loss = 0
        correct = 0
        total = 0
        for screen, action in demos:
            screen_t = torch.FloatTensor(screen).permute(2, 0, 1).unsqueeze(0)
            action_t = torch.tensor([action], dtype=torch.long)
            logits, _ = model(screen_t)
            loss = criterion(logits, action_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            predicted = logits.argmax(dim=-1)
            correct += (predicted == action_t).sum().item()
            total += 1
        accuracy = correct / total
        print(f"Epoch {epoch}: Loss={total_loss/len(demos):.4f}, "
              f"Accuracy={accuracy:.2%}")
    return model
```
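Demonstrations for this loop are (screenshot, action_index) pairs, so a recorded human click at pixel (x, y) must first be mapped to a grid action index. A stand-alone sketch using the same 16x16 grid over a 160x210 screen as earlier:

```python
# Convert a logged human click into the flat action index the model is trained on.
GRID_N = 16
CELL_W, CELL_H = 160 / GRID_N, 210 / GRID_N

def click_to_action(x, y):
    """Map a recorded click at pixel (x, y) to its grid-cell action index."""
    col = min(int(x / CELL_W), GRID_N - 1)
    row = min(int(y / CELL_H), GRID_N - 1)
    return row * GRID_N + col

# A click logged at (85, 100) falls in column 8, row 7 -> action index 120.
print(click_to_action(85, 100))
```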
### Demonstration-Weighted Learning

Human demonstration experiences are kept in a separate buffer and mixed into every training batch at a fixed ratio, so they are continuously referenced throughout RL training:
```python
import random

class DemonstrationReplayBuffer:
    """Replay buffer that mixes demonstration data with agent experience"""

    def __init__(self, capacity, demo_ratio=0.25):
        self.capacity = capacity
        self.demo_ratio = demo_ratio
        self.demo_buffer = []
        self.agent_buffer = []

    def add_demonstration(self, transition):
        self.demo_buffer.append(transition)

    def add_agent_experience(self, transition):
        # Evict the oldest agent transition once capacity is reached
        if len(self.agent_buffer) >= self.capacity:
            self.agent_buffer.pop(0)
        self.agent_buffer.append(transition)

    def sample(self, batch_size):
        """Sample a mix of demonstrations and agent experiences"""
        n_demo = int(batch_size * self.demo_ratio)
        n_agent = batch_size - n_demo
        demo_samples = random.sample(
            self.demo_buffer,
            min(n_demo, len(self.demo_buffer))
        )
        agent_samples = random.sample(
            self.agent_buffer,
            min(n_agent, len(self.agent_buffer))
        )
        return demo_samples + agent_samples
```
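The demo-to-agent mixing ratio can be verified with a minimal stand-alone version of `sample` (same logic, plain lists):

```python
import random

def mixed_sample(demo_buffer, agent_buffer, batch_size, demo_ratio=0.25):
    """Draw a batch containing a fixed fraction of demonstration transitions."""
    n_demo = int(batch_size * demo_ratio)
    demo = random.sample(demo_buffer, min(n_demo, len(demo_buffer)))
    agent = random.sample(agent_buffer, min(batch_size - n_demo, len(agent_buffer)))
    return demo + agent

# Demo transitions are 0..99, agent transitions are 100..199.
batch = mixed_sample(list(range(100)), list(range(100, 200)), 32)
print(sum(1 for t in batch if t < 100))  # 8 demo transitions at the default 25% ratio
```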
## Practical Considerations and Limitations

### Current Technical Limitations
- Even on the MiniWoB benchmark, complex tasks (email-inbox, social-media, etc.) do not reach human-level performance
- It is difficult to handle the diversity and complexity of real websites
- Vulnerable to dynamic changes in DOM structure
### Recent Research Directions
- LLM-based web agents: Input HTML as text to generate actions
- Multimodal models: Agents that understand both screen images and text
- Hierarchical RL: Separating high-level planning (which elements to interact with) from low-level execution (precise coordinate clicking)
## Key Takeaways
- Web navigation is a challenging RL problem characterized by enormous state/action spaces and delayed rewards
- A basic agent can be implemented with CNN + A2C by simplifying the action space to a grid-based approach
- Pretraining with human demonstrations (behavioral cloning) significantly accelerates RL learning
- Recent research is moving toward general-purpose web agents using LLMs and multimodal models
In the next post, we will explore continuous action spaces with DDPG and distributional policy gradients.