AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, Docker
Author: Youngju Kim (@fjvbn20031)
Overview
Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.
You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.
1. AI Development Environment Requirements
1.1 Hardware Recommendations
GPU (Top Priority)
In AI research, a GPU is not optional — it is essential.
- Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
- Shared team server: NVIDIA A100 40GB or 80GB
- Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
- Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs
CPU
During GPU training, the CPU is primarily responsible for data preprocessing and loading.
- Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)
- Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
- DataLoader workers: a common heuristic is roughly half the number of CPU cores
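The worker heuristic above can be computed at runtime instead of hard-coded; a minimal stdlib sketch (the `DataLoader` call in the comment is illustrative, not executed here):

```python
import os

def suggested_num_workers() -> int:
    # Heuristic from the text: roughly half the CPU cores, but never less than 1.
    cores = os.cpu_count() or 1
    return max(1, cores // 2)

workers = suggested_num_workers()
print(f"num_workers = {workers}")
# In PyTorch you would pass it along, e.g.:
# DataLoader(dataset, batch_size=32, num_workers=workers, pin_memory=True)
```

Treat the result as a starting point; profile your input pipeline before settling on a value.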
RAM
- Minimum: 32GB
- Recommended: 64GB (at least 4x GPU VRAM)
- Large-scale NLP: 128GB+ (tokenization, data loading)
Storage
/home/user/ -> NVMe SSD (OS, code, environments)
/data/ -> NVMe SSD or fast HDD (training data)
/models/ -> HDD or NAS (model checkpoints)
- OS and packages: NVMe SSD 500GB+
- Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
- Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)
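A quick way to sanity-check a layout like this is to report free space per mount point; a small stdlib sketch (the mount list is an assumption based on the example layout above; adjust it for your server):

```python
import shutil

def free_gb(path: str) -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3

# Replace with your actual mounts, e.g. ["/", "/data", "/models"].
for mount in ["/"]:
    print(f"{mount}: {free_gb(mount):.1f} GiB free")
```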
1.2 OS Selection
Ubuntu 22.04 LTS — Strongly Recommended
# Check Ubuntu version
lsb_release -a
# Example output:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Reasons to choose Ubuntu:
- NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)
- Largest AI/ML community documentation and package ecosystem
- Best compatibility with Docker and Kubernetes
- 5-year LTS support for stable server operation
macOS (M1/M2/M3)
Apple Silicon supports GPU acceleration via Metal Performance Shaders.
# Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"
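Building on the check above, training scripts usually pick the best available backend at startup. A minimal sketch that prefers CUDA, then MPS, then CPU (and degrades gracefully when PyTorch is not installed):

```python
def pick_device() -> str:
    """Return the best available PyTorch device string: cuda > mps > cpu."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

A call like `model.to(pick_device())` then works unchanged on a Linux GPU server and an Apple Silicon laptop.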
1.3 Cloud vs. Local Development
| Aspect | Local Server | Cloud |
|---|---|---|
| Upfront cost | High (hardware) | Low |
| Ongoing cost | Low (electricity) | High (hourly billing) |
| Latency | None | Instance startup time |
| Scalability | Limited | Unlimited |
| Data security | High | Policy-dependent |
| Best for | Continuous research | One-off large experiments |
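One way to read this table is as a break-even calculation: a local server's upfront cost is amortized over every GPU-hour it saves you in cloud rent. A rough sketch with hypothetical prices (plug in your own quotes):

```python
def breakeven_hours(local_capex: float, cloud_rate: float, local_hourly_power: float) -> float:
    """Hours of GPU use after which a local server becomes cheaper than renting.

    local_capex: upfront hardware cost (USD)
    cloud_rate: cloud price per GPU-hour (USD)
    local_hourly_power: electricity/maintenance cost per hour of local use (USD)
    """
    assert cloud_rate > local_hourly_power, "cloud must cost more per hour for a break-even to exist"
    return local_capex / (cloud_rate - local_hourly_power)

# Hypothetical: $15,000 server vs. a $2.50/h cloud GPU, ~$0.10/h electricity
print(round(breakeven_hours(15000, 2.50, 0.10)))  # → 6250
```

At full-time use (~720 h/month), that example breaks even in well under a year, which is why continuous research favors local hardware while one-off experiments favor the cloud.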
2. NVIDIA GPU Drivers and CUDA Installation
2.1 Installing nvidia-driver (Ubuntu)
# 1. Remove existing NVIDIA packages
sudo apt-get purge 'nvidia*'   # quote the glob so the shell doesn't expand it
sudo apt autoremove
# 2. Check available drivers
ubuntu-drivers devices
# 3. Auto-install recommended driver
sudo ubuntu-drivers autoinstall
# Or install a specific version
sudo apt install nvidia-driver-535
# 4. Reboot
sudo reboot
# 5. Verify installation
nvidia-smi
Example nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off | 0 |
| N/A 34C P0 56W / 400W | 1024MiB / 40960MiB | 0% Default |
+-----------------------------------------------------------------------------+
2.2 Installing the CUDA Toolkit
Use the official NVIDIA download page to generate the installation commands for your platform; the example below is for Ubuntu 22.04.
# CUDA 12.2 installation (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2
# Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify CUDA version
nvcc --version
2.3 Installing cuDNN
# Requires an NVIDIA Developer account
# After downloading cuDNN:
# For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples
# Verify cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
2.4 Verifying the Installation
# verify_gpu.py
import subprocess

def check_nvidia_smi():
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    if result.returncode == 0:
        print("nvidia-smi is working")
        for line in result.stdout.split('\n'):
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")
            # Quick tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor operation test: passed (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"Detected GPUs: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()
python verify_gpu.py
3. Python Environment Management
3.1 Python Version Management with pyenv
# Install pyenv
curl https://pyenv.run | bash
# Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
# List available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"
# Install Python
pyenv install 3.11.7
pyenv install 3.10.13
# Set global version
pyenv global 3.11.7
# Set project-specific version (in that directory)
cd my-project
pyenv local 3.10.13
# Check current Python version
python --version
pyenv versions
3.2 conda Environments (for GPU Libraries)
conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.
# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init
# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict
# Create AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml
# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0
# Export and import environments
conda env export > environment.yml
conda env create -f environment.yml
# List environments
conda env list
environment.yml example:
name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0
3.3 Dependency Management with poetry
# Install poetry
curl -sSL https://install.python-poetry.org | python3 -
# Create project
poetry new ml-research
cd ml-research
# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter
# Install with extras
poetry add "transformers[torch]>=4.35"
# Note: PyPI torch has no "cuda" extra — the default Linux wheels already bundle CUDA.
# To pin a specific CUDA build, add the PyTorch index as a source instead:
# poetry source add --priority explicit pytorch https://download.pytorch.org/whl/cu121
# poetry add --source pytorch torch torchvision
# Install all dependencies
poetry install
# Check virtual environment info
poetry env info
poetry env list
3.4 uv (Ultra-fast Package Manager)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install packages (10-100x faster than pip)
uv pip install torch torchvision numpy pandas
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
# Install from requirements.txt (very fast)
uv pip install -r requirements.txt
# Manage Python versions
uv python install 3.11 3.12
uv python list
4. Advanced JupyterLab Setup
4.1 JupyterLab Installation and Extensions
# Install JupyterLab
pip install jupyterlab
# Install useful extensions
pip install jupyterlab-git # Git integration
pip install jupyterlab-lsp # Language server protocol (autocomplete)
pip install python-lsp-server # Python LSP
pip install jupyterlab-code-formatter # Code formatter
pip install black isort # Formatters
pip install jupyterlab-vim # Vim key bindings
pip install ipywidgets # Interactive widgets
# List installed extensions
jupyter labextension list
# Start JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
4.2 Kernel Management
# List available kernels
jupyter kernelspec list
# Register current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"
# Remove a kernel
jupyter kernelspec remove old-kernel
Custom kernel.json example:
{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}
4.3 Remote Jupyter via SSH Tunneling
# Start Jupyter on the server (bound to localhost only, reachable through the tunnel)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1
# Create SSH tunnel from local machine
ssh -N -L 8888:localhost:8888 user@your-server.com
# Access in browser
# http://localhost:8888
# Set a password for security
jupyter lab password
Systemd service for auto-start:
cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter
4.4 Jupyter Magic Commands
# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function() # 100 loops per measurement
# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)
# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()
# Write cell contents to file
%%writefile my_script.py
import numpy as np
# ...
# Load file into cell
%load my_script.py
# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py
# Set environment variables
%env CUDA_VISIBLE_DEVICES=0
# Auto-reload (reflect code changes immediately)
%load_ext autoreload
%autoreload 2
# Inline matplotlib plots
%matplotlib inline
# List current variables
%who
%whos
# Access previous outputs
print(_) # Last result
print(__) # Two results ago
4.5 nbconvert (Notebook Conversion)
# Convert notebook to script
jupyter nbconvert --to script notebook.ipynb
# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb
# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb
# Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb
# Execute notebook in-place from CLI
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb
# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100
5. VS Code for AI
5.1 Installing Essential Extensions
# Install extensions from the command line
code --install-extension ms-python.python # Python
code --install-extension ms-toolsai.jupyter # Jupyter
code --install-extension ms-python.vscode-pylance # Pylance (LSP)
code --install-extension charliermarsh.ruff # Ruff linter
code --install-extension github.copilot # GitHub Copilot
code --install-extension github.copilot-chat # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh # Remote SSH
code --install-extension ms-vscode-remote.remote-containers # Dev Containers
code --install-extension eamodio.gitlens # GitLens
code --install-extension njpwerner.autodocstring # Auto-docstring
5.2 VS Code Settings (settings.json)
{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "python.analysis.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}
5.3 Remote SSH Setup
# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22
Using Remote SSH in VS Code:
- Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
- Select "Remote-SSH: Connect to Host"
- Choose your configured host
Auto-install extensions on the remote server:
{
"remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}
5.4 Dev Containers
.devcontainer/devcontainer.json example:
{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}
6. GPU Docker Containers
6.1 Installing NVIDIA Container Toolkit
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
6.2 Dockerfile for AI Projects
# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*
# Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
# Upgrade pip
RUN pip install --upgrade pip setuptools wheel
# Working directory
WORKDIR /workspace
# Python dependencies (optimize layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt
# Copy source code
COPY . .
# Install as editable package
RUN pip install -e ".[dev]"
# Create non-root user (security best practice)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher
# Expose ports
EXPOSE 8888 6006
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
6.3 Docker Compose for a Service Stack
# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:
# Start the full stack
docker-compose up -d
# Follow logs
docker-compose logs -f trainer
# Restart a specific service
docker-compose restart notebook
# Check GPU usage inside container
docker-compose exec trainer nvidia-smi
# Tear down
docker-compose down --volumes
6.4 GPU Sharing Strategies
# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage # Use GPUs 0 and 1
docker run --gpus '"device=2"' myimage # Use only GPU 2
# MIG (Multi-Instance GPU) for A100/H100
sudo nvidia-smi -i 0 -mig 1                # Enable MIG mode on GPU 0 (may require reset)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb -C   # Create a GPU instance and its compute instance
# Assign a MIG instance to a container (UUIDs come from `nvidia-smi -L`)
docker run --gpus '"device=MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage
7. Remote Development Environment
7.1 SSH-Based Remote Development
# Generate SSH key (on local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"
# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com
# Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# Optimized SSH config (~/.ssh/config)
Host ml-server
HostName server.example.com
User researcher
IdentityFile ~/.ssh/id_ed25519
ControlMaster auto
ControlPath ~/.ssh/cm_%r@%h:%p
ControlPersist 10m
ServerAliveInterval 30
Compression yes
7.2 tmux (Session Persistence)
# Install tmux
sudo apt install tmux
# Start a new session
tmux new-session -s training
# Reattach to a session
tmux attach-session -t training
# List sessions
tmux list-sessions
# Key shortcuts (Prefix: Ctrl+B)
# Ctrl+B d : Detach session (keeps running after SSH disconnect)
# Ctrl+B c : Create new window
# Ctrl+B n/p : Next/previous window
# Ctrl+B % : Vertical split
# Ctrl+B " : Horizontal split
# Ctrl+B z : Toggle pane zoom
~/.tmux.conf configuration:
# ~/.tmux.conf
# Enable mouse support
set -g mouse on
# Increase scroll buffer
set -g history-limit 50000
# Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1
# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"
# Customize status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '
# Change prefix to Ctrl+A (screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix
# Fast pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D
7.3 File Synchronization (rsync, sshfs)
# Sync code local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
./my-project/ user@server:/home/user/my-project/
# Sync results server -> local
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/
# Auto-sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz \
--exclude '__pycache__' \
--exclude '.git' \
--exclude '.venv' \
--exclude '*.pyc' \
--delete \
./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh
# Mount remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15
# Unmount
fusermount -u ~/remote-server
8. Monitoring Tools
8.1 nvidia-smi watch
# Refresh GPU status every second
watch -n 1 nvidia-smi
# Show only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
--format=csv -l 1
# Record to CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
--format=csv -l 1 > gpu_log.csv
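The CSV log produced above can be summarized later with the stdlib `csv` module; a small sketch that assumes the column order from the logging command (timestamp, index, utilization, memory, temperature):

```python
import csv
import io

def mean_utilization(csv_text: str) -> float:
    """Average GPU utilization across all rows of an nvidia-smi CSV log."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row nvidia-smi emits with --format=csv
    utils = [float(row[2].replace("%", "").strip()) for row in reader if row]
    return sum(utils) / len(utils) if utils else 0.0

# Two synthetic rows in the format of the command above
sample = (
    "timestamp, index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n"
    "2024/01/01 10:00:00.000, 0, 87 %, 31000 MiB, 61\n"
    "2024/01/01 10:00:01.000, 0, 93 %, 31050 MiB, 62\n"
)
print(mean_utilization(sample))  # → 90.0
```

In practice you would pass `open("gpu_log.csv").read()` instead of the synthetic sample.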
8.2 nvtop
# Install nvtop (htop for GPUs)
sudo apt install nvtop
# Run
nvtop
8.3 GPU Monitoring from Python
# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU stats via nvidia-smi"""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    gpus = []
    for line in result.stdout.strip().split('\n'):
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=float(parts[2]),
            utilization=float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Monitor GPU during training"""
    print(f"GPU monitoring started (interval: {interval}s)")
    history = []
    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time
        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })
        time.sleep(interval)
    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} measurements")
8.4 System Monitoring
# Install and use htop
sudo apt install htop
htop
# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o # Show only processes with active I/O
# Disk I/O stats
iostat -x 1 5
# Memory usage
free -h
vmstat 1 10
# Network monitoring
sudo apt install nethogs
sudo nethogs
# Comprehensive monitoring dashboard (glances)
pip install glances
glances
9. Code Quality Tools
9.1 black + ruff Configuration
# Install
pip install black ruff pre-commit
# Run black
black .
black --line-length 88 src/
# Run ruff
ruff check .
ruff check --fix . # Auto-fix
ruff format . # Format (can replace black)
pyproject.toml configuration:
[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'
[tool.ruff]
line-length = 88
target-version = "py310"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"N", # pep8-naming
"UP", # pyupgrade
"B", # flake8-bugbear
]
ignore = ["E501", "B008"]
[tool.ruff.lint.isort]
known-first-party = ["my_package"]
9.2 pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']
  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic
# Install and activate pre-commit
pre-commit install
# Run on all files
pre-commit run --all-files
# Run a specific hook
pre-commit run black --all-files
9.3 pytest with Coverage
# Install
pip install pytest pytest-cov pytest-xdist
# Basic run
pytest tests/ -v
# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing
# Parallel execution
pytest tests/ -n auto # As many workers as CPU cores
# Show slowest tests
pytest tests/ --durations=10
# Run only fast tests
pytest tests/ -m "not slow"
pytest configuration in pyproject.toml:
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
"slow: marks tests as slow",
"gpu: marks tests requiring GPU",
"integration: integration tests"
]
addopts = [
"-v",
"--tb=short",
"--cov=src",
"--cov-report=html",
"--cov-fail-under=80"
]
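The `gpu` marker above is most useful together with a collection hook that skips those tests on machines without a GPU. A sketch of a `conftest.py` (the `nvidia-smi` lookup is a cheap heuristic for GPU presence, not a full CUDA check):

```python
# conftest.py (sketch): auto-skip tests marked "gpu" when no GPU is visible.
import shutil

def gpu_present() -> bool:
    # Heuristic: nvidia-smi on PATH implies an NVIDIA driver is installed.
    # Avoids importing torch at collection time, which can be slow.
    return shutil.which("nvidia-smi") is not None

def pytest_collection_modifyitems(config, items):
    import pytest  # available whenever this hook runs under pytest
    if gpu_present():
        return
    skip_gpu = pytest.mark.skip(reason="no GPU detected")
    for item in items:
        if "gpu" in item.keywords:
            item.add_marker(skip_gpu)
```

With this in place, `pytest tests/` runs everywhere, and GPU tests are reported as skipped rather than failing on CPU-only CI runners.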
10. Weights & Biases Setup
10.1 Installation and Initialization
# Install
pip install wandb
# Log in (sets API key)
wandb login
# Or set via environment variable
export WANDB_API_KEY=your-api-key-here
10.2 Basic W&B Usage
import wandb
import numpy as np
import matplotlib.pyplot as plt

# Initialize experiment
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

# Log a visualization
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='Predicted')
ax.plot(x, np.cos(x), label='Actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

wandb.finish()
10.3 Artifact Management
import wandb

def save_model_artifact(model_path: str, run, metadata: dict = None):
    """Save model checkpoint as a W&B artifact"""
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

def save_dataset_artifact(data_dir: str, run):
    """Save dataset as a W&B artifact"""
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

def load_model_artifact(artifact_name: str, version: str = "latest"):
    """Download a model artifact"""
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")
with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")
save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)
wandb.finish()
10.4 Hyperparameter Optimization Sweeps
import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, or bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function called by each sweep run"""
    run = wandb.init()
    config = wandb.config
    best_val_acc = 0.0
    for epoch in range(10):
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))
        if val_acc > best_val_acc:
            best_val_acc = val_acc
        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })
    wandb.finish()

# Create and launch sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run agent for N trials
wandb.agent(sweep_id, function=train_sweep, count=20)
To run a sweep in parallel across multiple servers:
# Server 1
wandb agent username/project/sweep-id
# Server 2 (simultaneously)
wandb agent username/project/sweep-id
Conclusion
Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.
Essential priorities:
- CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.
- Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.
- JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.
- Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.
- W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.
Areas worth investing in long-term:
- Automate code quality with pre-commit hooks and CI/CD.
- Master remote development (Remote SSH + tmux) to work productively from anywhere.
- Use GPU monitoring to catch training issues early.