AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, Docker

Overview

Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.

You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.


1. AI Development Environment Requirements

1.1 Hardware Recommendations

GPU (Top Priority)

In AI research, a GPU is not optional — it is essential.

  • Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
  • Shared team server: NVIDIA A100 40GB or 80GB
  • Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
  • Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs
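A quick way to sanity-check these VRAM tiers is a weights-only estimate. The sketch below assumes fp16/bf16 weights (2 bytes per parameter) and deliberately ignores activations, optimizer state, and KV caches, so real training needs considerably more:

```python
def weights_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weights-only VRAM estimate; fp16/bf16 = 2 bytes per parameter."""
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter model needs ~13 GB just to hold fp16 weights, which
# already crowds a 16GB entry-level card before any training state.
print(f"{weights_vram_gb(7e9):.1f} GB")
```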

CPU

During GPU training, the CPU is primarily responsible for data preprocessing and loading.

  • Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)
  • Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
  • Common starting point: DataLoader workers ≈ half the CPU core count (tune empirically)
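The half-the-cores rule of thumb can be computed directly. This is a heuristic starting value, not an optimum; benchmark your own input pipeline:

```python
import os

# Heuristic: use about half the logical cores for data loading,
# leaving the rest for the training loop and the OS. Always benchmark.
num_workers = max(1, (os.cpu_count() or 2) // 2)
print(num_workers)
```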

RAM

  • Minimum: 32GB
  • Recommended: 64GB (ideally 2x or more of total GPU VRAM)
  • Large-scale NLP: 128GB+ (tokenization, data loading)

Storage

/home/user/          -> NVMe SSD (OS, code, environments)
/data/               -> NVMe SSD or fast HDD (training data)
/models/             -> HDD or NAS (model checkpoints)

  • OS and packages: NVMe SSD 500GB+
  • Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
  • Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)
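A quick programmatic check of the headroom on each mount in the layout above; the extra paths are illustrative and only `/` is guaranteed to exist:

```python
import shutil

# Report free space per mount point; extend the list with "/data" and
# "/models" on machines where those mounts exist.
for mount in ["/"]:
    total, used, free = shutil.disk_usage(mount)
    print(f"{mount}: {free / 1024**3:.0f} GB free of {total / 1024**3:.0f} GB")
```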

1.2 OS Selection

Ubuntu 22.04 LTS — Strongly Recommended

# Check Ubuntu version
lsb_release -a

# Example output:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)

Reasons to choose Ubuntu:

  • NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)
  • Largest AI/ML community documentation and package ecosystem
  • Best compatibility with Docker and Kubernetes
  • 5-year LTS support for stable server operation

macOS (M1/M2/M3)

Apple Silicon supports GPU acceleration via Metal Performance Shaders.

# Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"

1.3 Cloud vs. Local Development

| Aspect | Local Server | Cloud |
| --- | --- | --- |
| Upfront cost | High (hardware) | Low |
| Ongoing cost | Low (electricity) | High (hourly billing) |
| Latency | None | Instance startup time |
| Scalability | Limited | Unlimited |
| Data security | High | Policy-dependent |
| Best for | Continuous research | One-off large experiments |
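The cost trade-off in the table reduces to a break-even calculation. The prices below are hypothetical placeholders for illustration; substitute your actual quotes:

```python
# Hypothetical prices for illustration only
local_upfront_usd = 30000.0   # e.g. a multi-GPU workstation, all-in
cloud_hourly_usd = 4.10       # e.g. one on-demand A100 instance
breakeven_hours = local_upfront_usd / cloud_hourly_usd
print(f"Break-even after ~{breakeven_hours:,.0f} GPU-hours")
```

If your expected utilization exceeds the break-even hours over the hardware's useful life, local wins; otherwise cloud does.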

2. NVIDIA GPU Drivers and CUDA Installation

2.1 Installing nvidia-driver (Ubuntu)

# 1. Remove existing NVIDIA packages
sudo apt-get purge 'nvidia*'   # quote the glob so the shell does not expand it
sudo apt-get autoremove

# 2. Check available drivers
ubuntu-drivers devices

# 3. Auto-install recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific version
sudo apt install nvidia-driver-535

# 4. Reboot
sudo reboot

# 5. Verify installation
nvidia-smi

Example nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB  Off| 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    56W / 400W |   1024MiB / 40960MiB |      0%      Default |
+-----------------------------------------------------------------------------+

2.2 Installing the CUDA Toolkit

Use the official NVIDIA website to generate the installation command for your platform.

# CUDA 12.2 installation (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

# Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA version
nvcc --version

2.3 Installing cuDNN

# Requires an NVIDIA Developer account
# After downloading cuDNN:

# For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples

# Verify cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
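The grep above prints `#define` lines from the header. Parsing them into a version string can be handy in setup scripts; the sample values below match cuDNN 8.9.7 and stand in for the real header contents:

```python
import re

# Sample lines as they appear in cudnn_version.h (illustrative values)
header = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
"""
parts = dict(re.findall(r"#define CUDNN_(MAJOR|MINOR|PATCHLEVEL) (\d+)", header))
version = f"{parts['MAJOR']}.{parts['MINOR']}.{parts['PATCHLEVEL']}"
print(version)  # → 8.9.7
```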

2.4 Verifying the Installation

# verify_gpu.py
import subprocess
import sys

def check_nvidia_smi():
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    if result.returncode == 0:
        print("nvidia-smi is working")
        lines = result.stdout.split('\n')
        for line in lines:
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")

            # Quick tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor operation test: passed (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"Detected GPUs: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()
Run the script:

python verify_gpu.py

3. Python Environment Management

3.1 Python Version Management with pyenv

# Install pyenv
curl https://pyenv.run | bash

# Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# List available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"

# Install Python
pyenv install 3.11.7
pyenv install 3.10.13

# Set global version
pyenv global 3.11.7

# Set project-specific version (in that directory)
cd my-project
pyenv local 3.10.13

# Check current Python version
python --version
pyenv versions
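Under the hood, `pyenv local` simply writes the pinned version to a `.python-version` file in the project directory, which pyenv's shims consult on every invocation. A minimal sketch of that mechanism:

```python
import tempfile
from pathlib import Path

# Simulate what `pyenv local 3.10.13` does: write the pin file.
project = Path(tempfile.mkdtemp())
(project / ".python-version").write_text("3.10.13\n")

# pyenv reads this file (walking up parent directories) to pick the interpreter.
pinned = (project / ".python-version").read_text().strip()
print(pinned)  # → 3.10.13
```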

3.2 conda Environments (for GPU Libraries)

conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init

# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict

# Create AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml

# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0

# Export and import environments
conda env export > environment.yml
conda env create -f environment.yml

# List environments
conda env list

environment.yml example:

name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0

3.3 Dependency Management with poetry

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create project
poetry new ml-research
cd ml-research

# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter

# Install with extras (note: torch has no "cuda" extra; CUDA-specific
# wheels come from the PyTorch package index instead)
poetry add "transformers[torch]>=4.35"

# Install all dependencies
poetry install

# Check virtual environment info
poetry env info
poetry env list

3.4 uv (Ultra-fast Package Manager)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install packages (10-100x faster than pip)
uv pip install torch torchvision numpy pandas

# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate

# Install from requirements.txt (very fast)
uv pip install -r requirements.txt

# Manage Python versions
uv python install 3.11 3.12
uv python list

4. Advanced JupyterLab Setup

4.1 JupyterLab Installation and Extensions

# Install JupyterLab
pip install jupyterlab

# Install useful extensions
pip install jupyterlab-git           # Git integration
pip install jupyterlab-lsp           # Language server protocol (autocomplete)
pip install python-lsp-server        # Python LSP
pip install jupyterlab-code-formatter  # Code formatter
pip install black isort              # Formatters
pip install jupyterlab-vim           # Vim key bindings
pip install ipywidgets               # Interactive widgets

# List installed extensions
jupyter labextension list

# Start JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

4.2 Kernel Management

# List available kernels
jupyter kernelspec list

# Register current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"

# Remove a kernel
jupyter kernelspec remove old-kernel

Custom kernel.json example:

{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}
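Since kernel.json is plain JSON, kernel specs like the one above can also be generated programmatically, which is useful when provisioning many environments. The interpreter path and env values here are placeholders:

```python
import json

# Build a kernel spec equivalent to the hand-written kernel.json above.
spec = {
    "argv": [
        "/home/user/miniconda3/envs/aiml/bin/python",  # placeholder path
        "-m", "ipykernel_launcher", "-f", "{connection_file}",
    ],
    "display_name": "Python (aiml GPU)",
    "language": "python",
    "env": {"CUDA_VISIBLE_DEVICES": "0"},
}
text = json.dumps(spec, indent=2)
reparsed = json.loads(text)
print(reparsed["display_name"])
```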

4.3 Remote Jupyter via SSH Tunneling

# Start Jupyter on the server (bound to localhost only; a token is still required)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1

# Create SSH tunnel from local machine
ssh -N -L 8888:localhost:8888 user@your-server.com

# Access in browser
# http://localhost:8888

# Set a password for security
jupyter lab password

Systemd service for auto-start:

cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter

4.4 Jupyter Magic Commands

# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function()  # 100 repetitions

# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)

# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()

# Write cell contents to file
%%writefile my_script.py
import numpy as np
# ...

# Load file into cell
%load my_script.py

# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py

# Set environment variables
%env CUDA_VISIBLE_DEVICES=0

# Auto-reload (reflect code changes immediately)
%load_ext autoreload
%autoreload 2

# Inline matplotlib plots
%matplotlib inline

# List current variables
%who
%whos

# Access previous outputs
print(_)   # Last result
print(__)  # Two results ago

4.5 nbconvert (Notebook Conversion)

# Convert notebook to script
jupyter nbconvert --to script notebook.ipynb

# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb

# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb

# Execute notebook in-place from CLI
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb

# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100

5. VS Code for AI

5.1 Installing Essential Extensions

# Install extensions from the command line
code --install-extension ms-python.python           # Python
code --install-extension ms-toolsai.jupyter         # Jupyter
code --install-extension ms-python.pylance          # Pylance (LSP)
code --install-extension charliermarsh.ruff         # Ruff linter
code --install-extension github.copilot             # GitHub Copilot
code --install-extension github.copilot-chat        # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh  # Remote SSH
code --install-extension ms-vscode-remote.remote-containers  # Dev Containers
code --install-extension eamodio.gitlens            # GitLens
code --install-extension njpwerner.autodocstring     # Auto-docstring

5.2 VS Code Settings (settings.json)

{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "pylance.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}

5.3 Remote SSH Setup

# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22

Using Remote SSH in VS Code:

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
  2. Select "Remote-SSH: Connect to Host"
  3. Choose your configured host

Auto-install extensions on the remote server:

{
  "remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}

5.4 Dev Containers

.devcontainer/devcontainer.json example:

{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}

6. GPU Docker Containers

6.1 Installing NVIDIA Container Toolkit

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

6.2 Dockerfile for AI Projects

# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1

# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*

# Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

# Upgrade pip
RUN pip install --upgrade pip setuptools wheel

# Working directory
WORKDIR /workspace

# Python dependencies (optimize layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt

# Copy source code
COPY . .

# Install as editable package
RUN pip install -e ".[dev]"

# Create non-root user (security best practice)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher

# Expose ports
EXPOSE 8888 6006

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

6.3 Docker Compose for a Service Stack

# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:

# Start the full stack
docker-compose up -d

# Follow logs
docker-compose logs -f trainer

# Restart a specific service
docker-compose restart notebook

# Check GPU usage inside container
docker-compose exec trainer nvidia-smi

# Tear down
docker-compose down --volumes

6.4 GPU Sharing Strategies

# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage  # Use GPUs 0 and 1
docker run --gpus '"device=2"' myimage     # Use only GPU 2

# MIG (Multi-Instance GPU) for A100/H100
sudo nvidia-smi -i 0 -mig 1                 # Enable MIG mode first (may require GPU reset)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb -C    # Create GPU instance + compute instance

# Assign MIG instance to a container
docker run --gpus '"MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage

7. Remote Development Environment

7.1 SSH-Based Remote Development

# Generate SSH key (on local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com

# Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Optimized SSH config (~/.ssh/config)
Host ml-server
    HostName server.example.com
    User researcher
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/cm_%r@%h:%p
    ControlPersist 10m
    ServerAliveInterval 30
    Compression yes

7.2 tmux (Session Persistence)

# Install tmux
sudo apt install tmux

# Start a new session
tmux new-session -s training

# Reattach to a session
tmux attach-session -t training

# List sessions
tmux list-sessions

# Key shortcuts (Prefix: Ctrl+B)
# Ctrl+B d    : Detach session (keeps running after SSH disconnect)
# Ctrl+B c    : Create new window
# Ctrl+B n/p  : Next/previous window
# Ctrl+B %    : Vertical split
# Ctrl+B "    : Horizontal split
# Ctrl+B z    : Toggle pane zoom

~/.tmux.conf configuration:

# ~/.tmux.conf

# Enable mouse support
set -g mouse on

# Increase scroll buffer
set -g history-limit 50000

# Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1

# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"

# Customize status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '

# Change prefix to Ctrl+A (screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix

# Fast pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D

7.3 File Synchronization (rsync, sshfs)

# Sync code local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
    ./my-project/ user@server:/home/user/my-project/

# Sync results server -> local
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/

# Auto-sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz \
    --exclude '__pycache__' \
    --exclude '.git' \
    --exclude '.venv' \
    --exclude '*.pyc' \
    --delete \
    ./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh

# Mount remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15

# Unmount
fusermount -u ~/remote-server
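The same sync logic as sync.sh can be driven from Python, for example to trigger a push before launching a remote job. The destination string is a placeholder; only the command construction is shown, not its execution:

```python
import shlex

# Build the rsync command with the same excludes as sync.sh.
excludes = ["__pycache__", ".git", ".venv", "*.pyc"]
cmd = ["rsync", "-avz", "--delete"]
for pattern in excludes:
    cmd += ["--exclude", pattern]
cmd += ["./", "user@ml-server:/home/user/project/"]  # placeholder destination

# Pass `cmd` to subprocess.run(cmd, check=True) to actually sync.
print(shlex.join(cmd))
```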

8. Monitoring Tools

8.1 nvidia-smi watch

# Refresh GPU status every second
watch -n 1 nvidia-smi

# Show only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 1

# Record to CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
    --format=csv -l 1 > gpu_log.csv
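The CSV log produced above is easy to analyze afterwards. This sketch parses a sample row shaped like nvidia-smi's `--format=csv` output; the header names and values here are illustrative stand-ins for a real log:

```python
import csv
import io

# Sample mimicking gpu_log.csv contents (illustrative values)
sample = (
    "timestamp, index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n"
    "2024/01/01 12:00:01.000, 0, 87, 32120, 64\n"
)
rows = list(csv.reader(io.StringIO(sample)))
header = [h.strip() for h in rows[0]]
values = [v.strip() for v in rows[1]]
util = float(values[header.index("utilization.gpu [%]")])
print(util)  # → 87.0
```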

8.2 nvtop

# Install nvtop (htop for GPUs)
sudo apt install nvtop

# Run
nvtop

8.3 GPU Monitoring from Python

# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU stats via nvidia-smi"""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    gpus = []
    for line in result.stdout.strip().split('\n'):
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=float(parts[2]),
            utilization=float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Monitor GPU during training"""
    print(f"GPU monitoring started (interval: {interval}s)")
    history = []

    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time

        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })

        time.sleep(interval)

    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} measurements")

8.4 System Monitoring

# Install and use htop
sudo apt install htop
htop

# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o  # Show only processes with active I/O

# Disk I/O stats
iostat -x 1 5

# Memory usage
free -h
vmstat 1 10

# Network monitoring
sudo apt install nethogs
sudo nethogs

# Comprehensive monitoring dashboard (glances)
pip install glances
glances

9. Code Quality Tools

9.1 black + ruff Configuration

# Install
pip install black ruff pre-commit

# Run black
black .
black --line-length 88 src/

# Run ruff
ruff check .
ruff check --fix .  # Auto-fix
ruff format .       # Format (can replace black)

pyproject.toml configuration:

[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'

[tool.ruff]
line-length = 88
target-version = "py310"

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # pyflakes
    "I",   # isort
    "N",   # pep8-naming
    "UP",  # pyupgrade
    "B",   # flake8-bugbear
]
ignore = ["E501", "B008"]

[tool.ruff.lint.isort]
known-first-party = ["my_package"]

9.2 pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']

  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic

# Install and activate pre-commit
pre-commit install
pre-commit install

# Run on all files
pre-commit run --all-files

# Run a specific hook
pre-commit run black --all-files

9.3 pytest with Coverage

# Install
pip install pytest pytest-cov pytest-xdist

# Basic run
pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing

# Parallel execution
pytest tests/ -n auto  # As many workers as CPU cores

# Show slowest tests
pytest tests/ --durations=10

# Run only fast tests
pytest tests/ -m "not slow"

pytest configuration in pyproject.toml:

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
    "slow: marks tests as slow",
    "gpu: marks tests requiring GPU",
    "integration: integration tests"
]
addopts = [
    "-v",
    "--tb=short",
    "--cov=src",
    "--cov-report=html",
    "--cov-fail-under=80"
]

10. Weights & Biases Setup

10.1 Installation and Initialization

# Install
pip install wandb

# Log in (sets API key)
wandb login
# Or set via environment variable
export WANDB_API_KEY=your-api-key-here

10.2 Basic W&B Usage

import wandb
import numpy as np

# Initialize experiment
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

# Log a visualization
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='Predicted')
ax.plot(x, np.cos(x), label='Actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

wandb.finish()

10.3 Artifact Management

import wandb
import os

def save_model_artifact(model_path: str, run, metadata: dict = None):
    """Save model checkpoint as a W&B artifact"""
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

def save_dataset_artifact(data_dir: str, run):
    """Save dataset as a W&B artifact"""
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

def load_model_artifact(artifact_name: str, version: str = "latest"):
    """Download a model artifact"""
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")

with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")

save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)

wandb.finish()

10.4 Hyperparameter Optimization Sweeps

import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, or bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function called by each sweep run"""
    run = wandb.init()
    config = wandb.config

    best_val_acc = 0.0

    for epoch in range(10):
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))

        if val_acc > best_val_acc:
            best_val_acc = val_acc

        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

    wandb.finish()

# Create and launch sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run agent for N trials
wandb.agent(sweep_id, function=train_sweep, count=20)

To run a sweep in parallel across multiple servers:

# Server 1
wandb agent username/project/sweep-id

# Server 2 (simultaneously)
wandb agent username/project/sweep-id

Conclusion

Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.

Essential priorities:

  1. CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.
  2. Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.
  3. JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.
  4. Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.
  5. W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.

Areas worth investing in long-term:

  • Automate code quality with pre-commit hooks and CI/CD.
  • Master remote development (Remote SSH + tmux) to work productively from anywhere.
  • Use GPU monitoring to catch training issues early.
