WEKA High-Performance Storage Complete Guide 2025: Parallel File System for AI/HPC Workloads

1. Why High-Performance Storage for AI/HPC?

1.1 The GPU Utilization Problem

In modern AI training, the most expensive resource is the GPU. With a single NVIDIA H100 costing over $30,000, having GPUs sit idle while waiting for data is an enormous waste of money.

Typical AI Training Pipeline:

[Storage] --read--> [CPU: Preprocessing] --transfer--> [GPU: Training]
    ^                    ^                                ^
    |                    |                                |
  Bottleneck 1:       Bottleneck 2:                   Actual compute:
  I/O Wait            Data Transform                  GPU Utilization
  (30-50% time)       (10-20% time)                   (30-60% time)

Real-world GPU Utilization:

| Scenario | GPU Utilization | Primary Bottleneck |
|---|---|---|
| NFS + HDD storage | 20-30% | Storage I/O |
| NFS + SSD storage | 40-50% | Network, metadata |
| Lustre parallel FS | 60-70% | Small file performance |
| WEKA + NVMe | 80-95% | Depends on model complexity |
| WEKA + GDS | 85-98% | GPU compute-bound |
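
The cost of low utilization can be put in dollar terms with simple amortization arithmetic. A minimal sketch, using the $30,000 H100 price above and a hypothetical 3-year amortization window:

```python
# Effective cost per *useful* GPU-hour: the lower the utilization, the more
# each hour of real compute costs. Assumes $30,000/GPU amortized over 3 years
# of continuous operation (illustrative assumptions, not vendor figures).
def effective_cost_per_useful_hour(gpu_price=30_000.0,
                                   amortization_years=3,
                                   utilization=0.30):
    wall_clock_hours = amortization_years * 365 * 24
    hourly_cost = gpu_price / wall_clock_hours
    return hourly_cost / utilization

for util in (0.30, 0.60, 0.95):
    cost = effective_cost_per_useful_hour(utilization=util)
    print(f"utilization {util:.0%}: ${cost:.2f} per useful GPU-hour")
```

At 30% utilization each useful GPU-hour costs roughly three times what it does at 95%, before power and networking are even counted.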

1.2 AI Storage Requirements

Training Data Characteristics:

  • Image classification: Millions of small files (JPEG, 10-500KB)
  • NLP: Large tokenized datasets (multiple TB)
  • Autonomous driving: Video/LiDAR sensor data (multiple PB)
  • Genomics: Billions of short sequence reads

Storage Performance Requirements:

| Requirement | Description | Target |
|---|---|---|
| Throughput | Large sequential reads/writes | 100+ GB/s |
| IOPS | Small random I/O | Millions of IOPS |
| Latency | Time to first byte | Sub-200 µs |
| Metadata Performance | File listing, stat calls | Hundreds of thousands of ops/s |
| Concurrency | Parallel access from thousands of GPUs | Linear scaling |

1.3 Limitations of Traditional Storage

NFS (Network File System) Limitations:
- Single server bottleneck (scale-up only)
- Limited metadata performance
- Inefficient small file I/O
- Struggles scaling beyond hundreds of clients

SAN/Block Storage Limitations:
- Not a shared file system (single mount)
- Management complexity
- High cost

General Distributed Storage Limitations:
- High latency (Ceph: 1-5ms)
- Not optimized for AI workloads
- No GPU Direct Storage support

2. Parallel File Systems Overview

2.1 What Is a Parallel File System?

A parallel file system distributes (stripes) data across multiple storage nodes, enabling simultaneous read/write operations.

Traditional NFS:
Client -> [NFS Server] -> [Disk]
                          (single path)

Parallel File System:
Client -> [Node 1] -> [Disk 1]  (concurrent access)
       -> [Node 2] -> [Disk 2]  (concurrent access)
       -> [Node 3] -> [Disk 3]  (concurrent access)
       -> [Node N] -> [Disk N]  (concurrent access)

  Throughput = Number of Nodes x Per-Node Throughput
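
The mechanics behind that formula can be illustrated with a toy striping function. This is a deliberate simplification (fixed 1 MiB chunks, round-robin placement), not any particular file system's layout:

```python
# Toy data striping: map a file byte offset to (node, chunk). Round-robin
# placement means consecutive chunks live on different nodes, so a large
# sequential read can be serviced by all nodes in parallel.
STRIPE_SIZE = 1 << 20   # 1 MiB chunks (illustrative)
NUM_NODES = 4

def locate(offset, stripe_size=STRIPE_SIZE, num_nodes=NUM_NODES):
    chunk = offset // stripe_size
    return chunk % num_nodes, chunk   # (node index, global chunk index)

# A 4 MiB sequential read touches every node exactly once:
nodes = {locate(off)[0] for off in range(0, 4 * STRIPE_SIZE, STRIPE_SIZE)}
print(sorted(nodes))  # [0, 1, 2, 3]
```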

2.2 Parallel File System Comparison

| Feature | WEKA | Lustre | GPFS (Spectrum Scale) | BeeGFS | CephFS |
|---|---|---|---|---|---|
| Architecture | Software-defined | Kernel module | Kernel module | Userspace | Userspace |
| Metadata | Distributed | MDS server | Distributed | Distributed | MDS daemon |
| NVMe Optimization | Native (DPDK) | Limited | Good | Limited | Limited |
| POSIX Compliance | Full | Full | Full | Full | Nearly full |
| GPU Direct | Supported (GDS) | Not supported | Limited | Not supported | Not supported |
| Small File Performance | Excellent | Average | Good | Good | Average |
| Cloud Integration | AWS/Azure/GCP | Limited | Limited | Limited | Good |
| Auto-Tiering | S3/Blob/GCS | HSM | Policy-based | BeeOND | Not supported |
| Installation Difficulty | Easy | Hard | Hard | Medium | Medium |
| Inline Dedup | Supported | Not supported | Not supported | Not supported | Not supported |
| Inline Compression | Supported | Not supported | Supported | Not supported | Supported |
| License | Commercial | Open Source (GPL) | Commercial | Open Source | Open Source |
| Primary Use Cases | AI/HPC/Finance | HPC/Research | Enterprise/HPC | HPC/Research | General purpose |
| Max Throughput | 2+ TB/s | 1+ TB/s | 1+ TB/s | 500+ GB/s | 200+ GB/s |

2.3 Why Choose WEKA?

Key Differentiators:

  1. NVMe-First Design: Kernel bypass (DPDK) to fully utilize NVMe performance
  2. Small File Performance: Exceptional at handling millions of small files essential for AI training
  3. GPU Direct Storage: CPU bypass for direct GPU-to-storage data transfer
  4. Auto-Tiering: Automatic data movement from NVMe to S3 (hot/cold separation)
  5. Cloud Native: Native deployment on AWS, Azure, GCP
  6. Simple Management: Easy installation and operations via GUI/CLI

3. WEKA Architecture Deep Dive

3.1 Software-Defined Architecture

WEKA is software-defined storage that runs on commodity x86 servers.

+------------------------------------------------------------+
|                    WEKA Cluster                              |
|                                                              |
|  +------------------+  +------------------+                  |
|  | Backend Server 1 |  | Backend Server 2 |  ...            |
|  | +----+ +----+    |  | +----+ +----+    |                 |
|  | |NVMe| |NVMe|    |  | |NVMe| |NVMe|    |                 |
|  | +----+ +----+    |  | +----+ +----+    |                 |
|  | DPDK Networking   |  | DPDK Networking   |                |
|  +------------------+  +------------------+                  |
|                                                              |
|  Distributed Metadata (across all nodes)                     |
|  Erasure Coding Engine (N+2 or N+4)                          |
|  Inline Dedup + Compression                                  |
|                                                              |
|  +------------------+  +------------------+                  |
|  | Frontend Client 1|  | Frontend Client 2|  ...            |
|  | (Compute Node)   |  | (Compute Node)   |                 |
|  | GPU GPU GPU GPU   |  | GPU GPU GPU GPU   |                |
|  +------------------+  +------------------+                  |
+------------------------------------------------------------+

3.2 WekaFS Core Features

Distributed Metadata:

Traditional parallel file systems (Lustre, GPFS) use separate metadata servers (MDS), but WEKA distributes metadata across all nodes. This eliminates the metadata bottleneck and maximizes performance for ls and stat calls on millions of files.

Erasure Coding:

Replication (3-way):
  Data -> Copy1, Copy2, Copy3
  Overhead: 200% (1TB data = 3TB consumed)

Erasure Coding (N+2):
  Data -> Stripe1, Stripe2, ..., StripeN, Parity1, Parity2
  Overhead: ~29% (when N=7, 7+2=9, 2/7=28.6%)
  Tolerates 2 simultaneous node failures

Erasure Coding (N+4):
  Overhead: ~57% (when N=7, 7+4=11, 4/7=57.1%)
  Tolerates 4 simultaneous node failures
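
These overhead numbers fall straight out of the stripe geometry; a small helper makes the arithmetic explicit (illustrative code, not WEKA's implementation):

```python
# Capacity overhead of N+K erasure coding vs. M-way replication.
def ec_overhead(n_data, k_parity):
    """Extra capacity consumed, as a fraction of user data."""
    return k_parity / n_data

def replication_overhead(copies):
    return copies - 1   # 3-way replication stores 2 extra full copies

print(f"7+2 erasure coding: {ec_overhead(7, 2):.1%} overhead, survives 2 failures")
print(f"7+4 erasure coding: {ec_overhead(7, 4):.1%} overhead, survives 4 failures")
print(f"3-way replication : {replication_overhead(3):.0%} overhead, survives 2 failures")
```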

NVMe-First Design (Kernel Bypass):

Traditional Storage I/O Path:
Application -> VFS -> Filesystem -> Block Layer -> Device Driver -> NVMe
  (numerous kernel context switches, interrupts, copies)

WEKA I/O Path (DPDK):
Application -> WEKA Client (userspace) -> DPDK -> NVMe
  (kernel bypass, polling mode, zero-copy)

Inline Deduplication and Compression:

Data is deduplicated and compressed in real-time as it is written. In AI workloads, similar images and datasets often share significant overlap, enabling 20-50% space savings.

Snap-to-Object (S3 Tiering):

+------------------------------------------------------------+
|  Hot Tier (NVMe)        Warm Tier         Cold Tier         |
|  +-----------------+  +-----------+  +------------------+   |
|  | Active Training |  | Recent    |  | S3 / Azure Blob  |   |
|  | Data            |  | Experiments|  | / GCS            |   |
|  | (Fast Access)   |  | (SSD)     |  | (Archived)       |   |
|  +-----------------+  +-----------+  +------------------+   |
|         ^                  ^                   ^            |
|         |                  |                   |            |
|    Auto-tiering policy: access time, size, age              |
+------------------------------------------------------------+

4. WEKA Components

4.1 Backend Servers (Storage Nodes)

Backend servers are the nodes that actually store data.

# Check backend server status
weka cluster host
weka status

# Detailed node information
weka cluster host -v

Key Responsibilities:

  • NVMe disk management
  • Data striping and erasure coding
  • Metadata processing
  • Serving client requests

4.2 Frontend Clients (Compute Nodes)

Compute nodes are GPU-equipped servers that access the file system through the WEKA client.

# Mount WEKA filesystem
mount -t wekafs backend-host/fs-name /mnt/weka

# Or POSIX mount
weka local mount fs-name /mnt/weka

# Check mount status
weka local status

Client Modes:

| Mode | Description | Performance | Use Case |
|---|---|---|---|
| DPDK (Stateless) | Kernel bypass, dedicated NIC | Highest | AI training, HPC |
| UDP | Standard network stack | Good | General workloads |
| NFS/SMB | Protocol gateway | Moderate | Legacy applications |

4.3 Management Cluster

# WEKA CLI essential commands
weka status                        # Cluster status
weka cluster host                  # Host list
weka fs                            # Filesystem list
weka fs group                      # Filesystem groups
weka alerts                        # Check alerts

# Create filesystem
weka fs create myfs default 10TiB

# Resize filesystem
weka fs update myfs --total-capacity 20TiB

# Snapshots
weka fs snapshot create myfs snap-before-training
weka fs snapshot list myfs
weka fs snapshot restore myfs snap-before-training

4.4 Organizations and Filesystem Groups

WEKA provides an Organization concept for multi-tenancy.

# Create organization
weka org create research-team

# Filesystem groups (categorize by purpose)
weka fs group create ai-training --org research-team
weka fs group create inference --org research-team

# Quota configuration
weka fs quota set myfs --hard-limit 50TiB --path /projects/team-a

5. GPU Direct Storage (GDS) Integration

5.1 NVIDIA Magnum IO Architecture

GPU Direct Storage is part of the NVIDIA Magnum IO ecosystem, enabling direct data transfer between GPU memory and storage.

Traditional I/O Path:
Storage -> CPU Memory (bounce buffer) -> GPU Memory
           ^^^^^^^^^^^^^^^^^^^^^^^^^
           CPU involvement required, extra memory copy

GDS I/O Path:
Storage -> GPU Memory (direct DMA)
           ^^^^^^^^^^^^^^^^^^^^^^^^
           CPU bypass, zero-copy

5.2 cuFile API

// File read example using GDS (CUDA C)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>      // open, O_RDONLY, O_DIRECT
#include <unistd.h>     // close

size_t file_size = 1UL << 30;  // example: 1 GiB; use the actual file size

// Allocate GPU memory
void* gpu_buffer;
cudaMalloc(&gpu_buffer, file_size);

// Open cuFile handle
CUfileDescr_t cf_desc = {0};
CUfileHandle_t cf_handle;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

int fd = open("/mnt/weka/training_data.bin", O_RDONLY | O_DIRECT);
cf_desc.handle.fd = fd;

cuFileHandleRegister(&cf_handle, &cf_desc);

// Register GPU buffer (enables DMA directly into it)
cuFileBufRegister(gpu_buffer, file_size, 0);

// Direct read from storage to GPU
// (args: handle, buffer, size, file offset, buffer offset)
cuFileRead(cf_handle, gpu_buffer, file_size, 0, 0);
// DMA transfer directly from storage to GPU memory, bypassing CPU

// Cleanup
cuFileBufDeregister(gpu_buffer);
cuFileHandleDeregister(cf_handle);
close(fd);
cudaFree(gpu_buffer);

5.3 GDS Performance Improvements

| Workload | Traditional (bounce buffer) | With GDS | Improvement |
|---|---|---|---|
| Large sequential read | 12 GB/s/GPU | 25 GB/s/GPU | 2.1x |
| Small random read | 500K IOPS/GPU | 1.2M IOPS/GPU | 2.4x |
| Checkpoint write | 8 GB/s/GPU | 20 GB/s/GPU | 2.5x |
| CPU utilization | 30-50% | 5-10% | 4-6x reduction |

5.4 Configuring GDS with WEKA

# 1. Verify NVIDIA driver and CUDA installation
nvidia-smi
nvcc --version

# 2. Install GDS package
# Both NVIDIA GDS and WEKA client required

# 3. Mount WEKA with GDS enabled
mount -t wekafs -o gds backend-host/fs-name /mnt/weka

# 4. Check GDS status
/usr/local/cuda/gds/tools/gds_stats

# 5. Performance test
/usr/local/cuda/gds/tools/cufile_sample_001

6. AI/ML Workload Optimization

6.1 Training Data Pipeline

PyTorch DataLoader Optimization:

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.io import read_image  # provides read_image used in __getitem__

class WEKAImageDataset(Dataset):
    def __init__(self, root_dir="/mnt/weka/imagenet"):
        self.root_dir = root_dir
        self.file_list = self._build_file_list()

    def _build_file_list(self):
        # WEKA's fast metadata processing builds
        # file list of millions of files within seconds
        import os
        files = []
        for root, dirs, fnames in os.walk(self.root_dir):
            for fname in fnames:
                if fname.endswith(('.jpg', '.jpeg', '.png')):
                    files.append(os.path.join(root, fname))
        return files

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        img_path = self.file_list[idx]
        # Fast file access with WEKA's low latency
        image = read_image(img_path)
        return image

# Optimized DataLoader configuration
dataloader = DataLoader(
    WEKAImageDataset(),
    batch_size=256,
    num_workers=16,        # WEKA supports high concurrency
    pin_memory=True,       # GPU transfer optimization
    prefetch_factor=4,     # Prefetch ahead
    persistent_workers=True # Reuse workers
)

NVIDIA DALI Pipeline:

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=256, num_threads=12, device_id=0)
def training_pipeline():
    # Direct read from WEKA (GDS capable)
    jpegs, labels = fn.readers.file(
        file_root="/mnt/weka/imagenet/train",
        random_shuffle=True,
        name="Reader"
    )
    # Decode on GPU (bypass CPU)
    images = fn.decoders.image_random_crop(
        jpegs, device="mixed",
        output_type=types.RGB
    )
    images = fn.resize(images, device="gpu", size=224)
    images = fn.crop_mirror_normalize(
        images, device="gpu",
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255]
    )
    return images, labels

6.2 Checkpoint Storage

Model checkpointing during AI training directly depends on storage performance.

import torch

# Checkpoint save (leveraging WEKA's high write throughput)
def save_checkpoint(model, optimizer, epoch, path="/mnt/weka/checkpoints"):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }
    # LLM checkpoints can be several GB to tens of GB
    torch.save(checkpoint, f"{path}/checkpoint_epoch_{epoch}.pt")

# Checkpoint restore
def load_checkpoint(path, model, optimizer):
    # map_location keeps loading robust when restoring on a different device
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch']

Checkpoint Performance Comparison:

| Storage | 10GB Checkpoint Save | 10GB Checkpoint Load |
|---|---|---|
| NFS (1GbE) | 80s | 80s |
| NFS (10GbE) | 8s | 8s |
| Lustre | 3s | 2s |
| WEKA (NVMe) | 0.5s | 0.4s |
| WEKA + GDS | 0.3s | 0.25s |
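
To a first approximation, checkpoint save time is just size divided by sustained write bandwidth (serialization and metadata overhead ignored). A quick sanity check with illustrative bandwidth figures:

```python
# First-order checkpoint timing: time = size / write bandwidth.
def save_time_s(size_gb, write_bw_gb_s):
    return size_gb / write_bw_gb_s

# Illustrative sustained-write bandwidths (GB/s) for a 10 GB checkpoint:
for name, bw in [("NFS 10GbE", 1.25), ("Lustre", 3.3), ("WEKA NVMe", 20.0)]:
    print(f"{name:>10}: {save_time_s(10, bw):.1f}s")
```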

6.3 Small File Performance

Image classification training requires reading millions of small JPEG files.

ImageNet Dataset Example:
- Total files: ~14 million
- Average file size: ~100KB
- Full data read required per epoch

Time Per ImageNet Epoch by Filesystem (8x A100):
- NFS:    45 min (I/O bound)
- Lustre: 15 min (metadata bottleneck)
- WEKA:   3 min  (GPU bound)
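
A simple way to feel this on any mount is a threaded small-file read loop. The sketch below is illustrative (the dataset path is an assumption; point it at a real directory); it reports files/s and MB/s, the two numbers that dominate epoch time:

```python
# Threaded small-file read benchmark sketch: measures aggregate files/s and
# MB/s for the many-small-reads pattern of image-classification training.
import os
import time
from concurrent.futures import ThreadPoolExecutor

def read_all(paths, workers=32):
    def read_one(path):
        with open(path, "rb") as f:
            return len(f.read())
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(read_one, paths))
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed

DATA_DIR = "/mnt/weka/imagenet/train"   # assumed mount point; adjust as needed
if os.path.isdir(DATA_DIR):
    paths = [os.path.join(r, f) for r, _, fs in os.walk(DATA_DIR) for f in fs]
    nbytes, dt = read_all(paths)
    print(f"{len(paths)} files, {nbytes / dt / 1e6:.1f} MB/s, {len(paths) / dt:.0f} files/s")
```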

6.4 Multi-GPU Multi-Node Training

Distributed Training Architecture (NCCL + WEKA):

+----------+  +----------+  +----------+  +----------+
| Node 1   |  | Node 2   |  | Node 3   |  | Node 4   |
| 8x H100  |  | 8x H100  |  | 8x H100  |  | 8x H100  |
+-----+----+  +-----+----+  +-----+----+  +-----+----+
      |              |              |              |
      +-------+------+------+------+------+-------+
              |      NCCL All-Reduce       |
              +----------------------------+
              |
      +-------+-------+
      |  WEKA Cluster  |
      | (shared data)  |
      +----------------+

Each node reads data from WEKA independently
NCCL handles gradient synchronization
WEKA provides linear scaling up to 32+ nodes

6.5 Inference Model Serving

Inference Workload Characteristics:
- Model loading: Single large file read (GB to tens of GB)
- Inference input: Repeated small file reads (images, text)
- Latency sensitive: Real-time response needed (ms range)

Why WEKA is suitable for inference:
- Fast model file loading (NVMe caching)
- Low first-byte latency (sub 200us)
- Concurrent model access from multiple inference servers
- Model version management (via snapshots)

7. Tiered Storage Architecture

7.1 Three-Tier Architecture

+------------------------------------------------------------+
|              WEKA Tiered Storage                             |
|                                                              |
|  Tier 1: Hot (NVMe)                                         |
|  +------------------------------------------------------+   |
|  | Active training data, current experiment datasets     |   |
|  | Latency: Sub 0.1ms                                    |   |
|  | Throughput: 100+ GB/s                                 |   |
|  +------------------------------------------------------+   |
|                      |                                       |
|           Auto-tiering (access pattern based)                |
|                      |                                       |
|  Tier 2: Warm (SSD, optional)                                |
|  +------------------------------------------------------+   |
|  | Recent experiment data, frequently accessed archives  |   |
|  | Latency: Sub 0.5ms                                    |   |
|  +------------------------------------------------------+   |
|                      |                                       |
|           Snap-to-Object tiering                             |
|                      |                                       |
|  Tier 3: Cold (Object Storage)                               |
|  +------------------------------------------------------+   |
|  | S3 / Azure Blob / GCS                                   |   |
|  | Archived datasets, old checkpoints                      |   |
|  | Latency: Tens of ms                                     |   |
|  | Cost: 1/10 to 1/50 of NVMe                              |   |
|  +------------------------------------------------------+   |
+------------------------------------------------------------+

7.2 Auto-Tiering Policies

# Configure tiering policy
weka fs tier s3 add myfs \
  --obs-name my-s3-bucket \
  --obs-type s3 \
  --hostname s3.amazonaws.com \
  --port 443 \
  --bucket weka-tier-data \
  --access-key-id AKIAIOSFODNN7EXAMPLE \
  --secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Tiering policy: Move data to S3 after 7 days without access
weka fs tier s3 update myfs --tiering-cue 7d

# Manual tiering (specific paths)
weka fs tier fetch myfs /datasets/imagenet  # Restore from S3 to NVMe
weka fs tier release myfs /datasets/old_experiment  # Move from NVMe to S3

7.3 Cost Optimization

| Storage Tier | Cost (GB/month, approx.) | 1PB Monthly Cost |
|---|---|---|
| NVMe (local) | ~$0.15 | ~$150K |
| SSD (local) | ~$0.08 | ~$80K |
| S3 Standard | ~$0.023 | ~$23K |
| S3 Glacier | ~$0.004 | ~$4K |

~76% Cost Reduction with Auto-Tiering:

Total data: 5PB
- Hot (NVMe): 500TB (active training) = ~$75K/month
- Cold (S3):  4.5PB (archive)         = ~$103K/month
- Total: ~$178K/month

All on NVMe: ~$750K/month
Savings: ~76%
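
The arithmetic behind this breakdown, as a small model (per-GB prices are the approximate figures from the table above):

```python
# Tiered-capacity cost model (illustrative per-GB monthly prices; real
# pricing varies by region, vendor, and commitment level).
PRICE_PER_GB_MONTH = {"nvme": 0.15, "ssd": 0.08, "s3": 0.023, "glacier": 0.004}

def monthly_cost(tb_by_tier):
    # 1 TB treated as 1,000 GB for round numbers
    return sum(tb * 1000 * PRICE_PER_GB_MONTH[tier]
               for tier, tb in tb_by_tier.items())

tiered   = monthly_cost({"nvme": 500, "s3": 4500})   # 5 PB total
all_nvme = monthly_cost({"nvme": 5000})
print(f"tiered:   ${tiered:,.0f}/month")
print(f"all-NVMe: ${all_nvme:,.0f}/month")
print(f"savings:  {1 - tiered / all_nvme:.0%}")
```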

8. Cloud Deployment

8.1 WEKA on AWS

AWS Deployment Architecture:

+------------------------------------------------------------+
|  VPC                                                        |
|  +------------------+  +------------------+                  |
|  | WEKA Backend     |  | WEKA Backend     |  ...            |
|  | i3en.24xlarge    |  | i3en.24xlarge    |                  |
|  | (8x 7.5TB NVMe) |  | (8x 7.5TB NVMe) |                  |
|  | 100 Gbps ENA     |  | 100 Gbps ENA     |                  |
|  +------------------+  +------------------+                  |
|                                                              |
|  +------------------+  +------------------+                  |
|  | Compute Client   |  | Compute Client   |  ...            |
|  | p5.48xlarge      |  | p5.48xlarge      |                  |
|  | (8x H100 GPU)    |  | (8x H100 GPU)    |                  |
|  | 3200 Gbps EFA    |  | 3200 Gbps EFA    |                  |
|  +------------------+  +------------------+                  |
|                                                              |
|  S3 Bucket (Cold Tier)                                       |
+------------------------------------------------------------+

# Deploy WEKA via AWS CloudFormation
aws cloudformation create-stack \
  --stack-name weka-cluster \
  --template-url https://weka-deploy-templates.s3.amazonaws.com/weka-latest.yaml \
  --parameters \
    ParameterKey=ClusterSize,ParameterValue=6 \
    ParameterKey=InstanceType,ParameterValue=i3en.24xlarge \
    ParameterKey=VpcId,ParameterValue=vpc-xxxxx

8.2 Hybrid Cloud

On-premises + Cloud Hybrid:

[On-premises WEKA Cluster]
  - Always-on workloads
  - Sensitive data
  - NVMe Hot Tier
         |
    S3 Tiering (automatic)
         |
[AWS S3]
         |
    Cloud Burst (on demand)
         |
[AWS WEKA Cluster (temporary)]
  - Large-scale training jobs
  - On-demand GPU (p5 instances)
  - Tear down cluster after job completes

8.3 Cloud Burst Workflow

# 1. Tier dataset from on-premises to S3
weka fs tier release myfs /datasets/large-training-set

# 2. Create temporary WEKA cluster on AWS
# (CloudFormation/Terraform)

# 3. Fetch S3 data into AWS WEKA
weka fs tier fetch myfs /datasets/large-training-set

# 4. Run training
# (distributed training job)

# 5. Save results to S3
weka fs tier release myfs /results/experiment-42

# 6. Tear down AWS cluster (cost savings)

9. Installation and Configuration

9.1 Hardware Requirements

Backend Servers (minimum 6 recommended):
  CPU: Minimum 19 cores (16 for WEKA + 3 for OS)
  RAM: Minimum 72GB (31GB for WEKA + OS)
  NVMe: Minimum 1 (recommended 4-8, each 1TB+)
  Network: 25GbE+ (recommended 100GbE)

Frontend Clients:
  CPU: Minimum 2 cores (for WEKA client)
  RAM: Minimum 4GB (for WEKA client)
  Network: 25GbE+

Network Requirements:
  - Dedicated DPDK NICs (Mellanox ConnectX-5/6 recommended)
  - Jumbo Frames (MTU 9000)
  - Lossless network (RoCE v2 or InfiniBand)

9.2 Network Configuration

# Check DPDK NIC status
dpdk-devbind.py --status

# Mellanox NIC tuning
mlnx_tune -p THROUGHPUT

# MTU configuration
ip link set dev enp3s0f0 mtu 9000

9.3 Cluster Installation

# 1. Download and install WEKA package
curl -o weka.tar https://get.weka.io/dist/v4.3/weka-4.3.tar
tar xf weka.tar
cd weka-4.3
./install.sh

# 2. Create cluster
weka cluster create backend1 backend2 backend3 backend4 backend5 backend6

# 3. Add drives
weka cluster drive add --host backend1 /dev/nvme0n1 /dev/nvme1n1

# 4. Start cluster
weka cluster start

# 5. Create filesystem
weka fs create training-data default 100TiB

# 6. Mount on client
mount -t wekafs backend1/training-data /mnt/weka

10. Performance Benchmarking

10.1 FIO Benchmarks

# Sequential read test
fio --name=seq-read \
  --directory=/mnt/weka/fio-test \
  --rw=read \
  --bs=1M \
  --numjobs=16 \
  --iodepth=32 \
  --size=10G \
  --direct=1

# Random read IOPS test
fio --name=rand-read \
  --directory=/mnt/weka/fio-test \
  --rw=randread \
  --bs=4K \
  --numjobs=32 \
  --iodepth=64 \
  --size=1G \
  --direct=1

# Mixed workload (70% read / 30% write)
fio --name=mixed \
  --directory=/mnt/weka/fio-test \
  --rw=randrw \
  --rwmixread=70 \
  --bs=64K \
  --numjobs=16 \
  --iodepth=32 \
  --size=5G \
  --direct=1

10.2 Benchmark Results (Example: 6-Node Cluster)

| Test | WEKA (6 nodes) | NFS (single) | Lustre (6 nodes) |
|---|---|---|---|
| Sequential Read | 80 GB/s | 1.2 GB/s | 30 GB/s |
| Sequential Write | 50 GB/s | 1.0 GB/s | 20 GB/s |
| 4K Random Read | 6M IOPS | 50K IOPS | 800K IOPS |
| 4K Random Write | 3M IOPS | 30K IOPS | 400K IOPS |
| Metadata ops | 2M ops/s | 20K ops/s | 100K ops/s |
| Latency (4K read) | 0.15ms | 2ms | 0.5ms |

10.3 MLPerf Storage Benchmark

MLPerf Storage is a benchmark specifically designed for AI training storage workloads.

# Run MLPerf Storage benchmark
git clone https://github.com/mlcommons/storage.git
cd storage

# Configuration file
# benchmark_config.yaml
benchmark:
  model: resnet50
  accelerator: h100
  num_accelerators: 8
  dataset_path: /mnt/weka/mlperf/imagenet
  results_dir: /mnt/weka/mlperf/results

11. Operations and Monitoring

11.1 WEKA GUI Dashboard

WEKA provides a web-based GUI for monitoring cluster status.

Key Dashboard Items:
- Cluster health status (green/yellow/red)
- Capacity utilization and trends
- Real-time throughput/IOPS graphs
- Per-node performance distribution
- Tiering status (NVMe vs S3 usage)
- Alert history

11.2 Alerts and Health Monitoring

# Check alerts
weka alerts
weka alerts --severity critical

# Cluster status
weka status
weka cluster host -v

# Performance statistics
weka stats --category ops
weka stats --category throughput
weka stats --category latency

# Capacity information
weka fs --name training-data -v

11.3 Capacity Planning

# Current capacity usage
weka fs --name training-data
# Total: 100 TiB, Used: 67 TiB, Available: 33 TiB

# Track daily growth
weka events --category capacity --start-time "7 days ago"

# Calculate estimated exhaustion (manual)
# Daily average growth: 500GB
# Remaining capacity: 33TB
# Estimated exhaustion: ~66 days
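
The manual estimate above is easy to script; a minimal sketch (a local helper, not a WEKA command):

```python
# Linear runway estimate: remaining capacity / average daily growth.
# Ignores tiering and assumes the growth rate stays constant.
def days_until_full(remaining_tb, daily_growth_gb):
    return remaining_tb * 1000 / daily_growth_gb

print(f"~{days_until_full(33, 500):.0f} days until capacity is exhausted")
```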

11.4 Upgrade Process

# Rolling upgrade (zero downtime)
# 1. Download new version
weka cluster update download --url https://get.weka.io/dist/v4.4/weka-4.4.tar

# 2. Start upgrade (sequential per node)
weka cluster update start

# 3. Check progress
weka cluster update status

# 4. Verify completion
weka status

11.5 Common Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| Mount failure | Network connectivity issue | Check DPDK NIC status, verify MTU |
| Performance degradation | Disk failure or network congestion | Run weka diags for diagnosis |
| Capacity shortage | Tiering not configured or data growth | Configure S3 tiering or add disks |
| Node down | Hardware failure | Auto-recovery via erasure coding, replace node |
| High latency | Client overload | Adjust client count or I/O depth |

12. Use Cases

12.1 LLM Training

Large Language Model training processes petabytes of text data across thousands of GPUs.

GPT-4 Scale Training Example:
- Training data: ~13TB (tokenized text)
- Checkpoints: ~2TB each (thousands)
- GPU cluster: 10,000+ H100
- Storage requirement: 200+ GB/s read throughput

WEKA Configuration:
- 12 nodes x 8 NVMe = ~600TB Hot Tier
- Auto-archive checkpoints to S3
- GDS maintains GPU utilization at 95%+

12.2 Autonomous Driving

Autonomous Driving Data Scale:
- Per-vehicle daily data: ~20TB (camera, LiDAR, radar)
- Total dataset: PB to tens of PB range
- Training cycle: Continuous (new data + retraining)

WEKA's Role:
- High-speed ingestion of collected sensor data
- Support labeling/preprocessing pipelines
- Provide training data to models
- Store simulation results

12.3 Life Sciences

Genomics Workloads:
- WGS (Whole Genome Sequencing): ~100GB per sample
- Large cohorts: Tens of thousands of samples = multiple PB
- Analysis pipeline: Generates millions of small files

Drug Discovery:
- Molecular simulations: High I/O throughput required
- AI-based drug screening: GPU intensive
- Data sharing: Multiple research teams concurrent access

12.4 Financial Services

Risk Modeling:
- Large-scale Monte Carlo simulations
- Millisecond-level latency requirements
- Trading data analysis (exchange and OTC)

WEKA Advantages:
- Ultra-low latency (sub 0.1ms)
- Deterministic performance (minimal jitter)
- Data retention compliance for regulatory requirements

12.5 Media and Entertainment

VFX Rendering:
- GB of textures/assets per frame
- Hundreds of render nodes with concurrent access
- High sequential read throughput required

8K/16K Video Editing:
- Real-time streaming: Playback without frame drops
- Multiple editors with concurrent access
- Large project file management

13. Quiz

Q1. What core technology does WEKA use to fully leverage NVMe performance?

Answer: Kernel bypass using DPDK (Data Plane Development Kit)

WEKA uses DPDK to completely bypass the Linux kernel I/O stack. Traditional storage traverses VFS, filesystem, block layer, and device driver with context switching and memory copy overhead. DPDK directly controls NVMe from userspace (polling mode, zero-copy) to achieve NVMe's microsecond-level latency.

Q2. What core problem does GPU Direct Storage (GDS) solve?

Answer: Eliminates unnecessary data copying through CPU (bounce buffer) by enabling direct DMA transfer between GPU and storage

In the traditional approach, data from storage is first copied to CPU memory (bounce buffer) then re-copied to GPU memory. GDS removes this process and performs direct DMA transfer from storage to GPU memory. This improves throughput by 2-3x and reduces CPU utilization by 4-6x.

Q3. What advantage does WEKA's erasure coding (N+2) have over 3-way replication?

Answer: Provides the same fault tolerance (2 simultaneous node failures) while dramatically reducing storage overhead from ~200% to ~29%

3-way replication creates 1 data copy + 2 replicas for 200% overhead. N+2 erasure coding (e.g., 7+2) uses 7 data stripes and 2 parity stripes, achieving the same 2-node fault tolerance with only ~29% overhead.

Q4. What is the biggest benefit of WEKA's Snap-to-Object feature for AI teams?

Answer: Cost optimization - automatically moves infrequently accessed data to low-cost object storage (S3), saving up to 90%+ compared to NVMe costs

AI teams keep only active training data (10-20% of total) on the high-performance NVMe tier while past experiment data and archived datasets automatically move to S3. Data can be restored to NVMe when needed, maintaining accessibility while dramatically reducing costs.

Q5. What is the biggest architectural difference between WEKA and Lustre?

Answer: Metadata distribution - WEKA distributes metadata across all nodes, while Lustre uses a separate MDS (Metadata Server)

Lustre concentrates metadata on dedicated MDS servers, which becomes a bottleneck when processing millions of small files. WEKA distributes metadata across all backend nodes, allowing metadata operations to scale linearly. This difference is the core reason WEKA delivers superior performance for AI workloads (millions of image files).


14. References

  1. WEKA Official Documentation - docs.weka.io
  2. WEKA Architecture Whitepaper - weka.io/resources/white-papers
  3. NVIDIA GPU Direct Storage - developer.nvidia.com/gpudirect-storage
  4. NVIDIA Magnum IO - developer.nvidia.com/magnum-io
  5. MLPerf Storage Benchmark - mlcommons.org/benchmarks/storage
  6. Lustre Official Documentation - lustre.org/documentation
  7. IBM Spectrum Scale (GPFS) - ibm.com/docs/en/spectrum-scale
  8. BeeGFS Official Documentation - beegfs.io/docs
  9. Ceph Official Documentation - docs.ceph.com
  10. DPDK Documentation - doc.dpdk.org
  11. NVIDIA DALI - docs.nvidia.com/deeplearning/dali
  12. PyTorch DataLoader - pytorch.org/docs/stable/data.html
  13. WEKA on AWS - aws.amazon.com/marketplace/pp/prodview-weka
  14. FIO Benchmark - fio.readthedocs.io