AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, Docker
Author: Youngju Kim (@fjvbn20031)
Overview
Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.
You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.
1. AI Development Environment Requirements
1.1 Hardware Recommendations
GPU (Top Priority)
In AI research, a GPU is not optional — it is essential.
- Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
- Shared team server: NVIDIA A100 40GB or 80GB
- Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
- Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs
CPU
During GPU training, the CPU is primarily responsible for data preprocessing and loading.
- Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)
- Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
- DataLoader workers: a common heuristic is roughly half the number of CPU cores
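The worker heuristic above can be computed at runtime instead of hard-coded; a minimal stdlib sketch (the `DataLoader` call in the comment is illustrative, not executed here):

```python
import os

def suggested_num_workers() -> int:
    # Heuristic from the text: roughly half the CPU cores, but never less than 1.
    cores = os.cpu_count() or 1
    return max(1, cores // 2)

workers = suggested_num_workers()
print(f"num_workers = {workers}")
# In PyTorch you would pass it along, e.g.:
# DataLoader(dataset, batch_size=32, num_workers=workers, pin_memory=True)
```

Treat the result as a starting point; profile your input pipeline before settling on a value.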
RAM
- Minimum: 32GB
- Recommended: 64GB (at least 4x GPU VRAM)
- Large-scale NLP: 128GB+ (tokenization, data loading)
Storage
/home/user/ -> NVMe SSD (OS, code, environments)
/data/ -> NVMe SSD or fast HDD (training data)
/models/ -> HDD or NAS (model checkpoints)
- OS and packages: NVMe SSD 500GB+
- Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
- Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)
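A quick way to sanity-check a layout like this is to report free space per mount point; a small stdlib sketch (the mount list is an assumption based on the example layout above; adjust it for your server):

```python
import shutil

def free_gb(path: str) -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3

# Replace with your actual mounts, e.g. ["/", "/data", "/models"].
for mount in ["/"]:
    print(f"{mount}: {free_gb(mount):.1f} GiB free")
```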
1.2 OS Selection
Ubuntu 22.04 LTS — Strongly Recommended
# Check Ubuntu version
lsb_release -a
# Example output:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Reasons to choose Ubuntu:
- NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)
- Largest AI/ML community documentation and package ecosystem
- Best compatibility with Docker and Kubernetes
- 5-year LTS support for stable server operation
macOS (M1/M2/M3)
Apple Silicon supports GPU acceleration via Metal Performance Shaders.
# Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"
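Building on the check above, training scripts usually pick the best available backend at startup. A minimal sketch that prefers CUDA, then MPS, then CPU (and degrades gracefully when PyTorch is not installed):

```python
def pick_device() -> str:
    """Return the best available PyTorch device string: cuda > mps > cpu."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

A call like `model.to(pick_device())` then works unchanged on a Linux GPU server and an Apple Silicon laptop.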
1.3 Cloud vs. Local Development
| Aspect | Local Server | Cloud |
|---|---|---|
| Upfront cost | High (hardware) | Low |
| Ongoing cost | Low (electricity) | High (hourly billing) |
| Latency | None | Instance startup time |
| Scalability | Limited | Unlimited |
| Data security | High | Policy-dependent |
| Best for | Continuous research | One-off large experiments |
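One way to read this table is as a break-even calculation: a local server's upfront cost is amortized over every GPU-hour it saves you in cloud rent. A rough sketch with hypothetical prices (plug in your own quotes):

```python
def breakeven_hours(local_capex: float, cloud_rate: float, local_hourly_power: float) -> float:
    """Hours of GPU use after which a local server becomes cheaper than renting.

    local_capex: upfront hardware cost (USD)
    cloud_rate: cloud price per GPU-hour (USD)
    local_hourly_power: electricity/maintenance cost per hour of local use (USD)
    """
    assert cloud_rate > local_hourly_power, "cloud must cost more per hour for a break-even to exist"
    return local_capex / (cloud_rate - local_hourly_power)

# Hypothetical: $15,000 server vs. a $2.50/h cloud GPU, ~$0.10/h electricity
print(round(breakeven_hours(15000, 2.50, 0.10)))  # → 6250
```

At full-time use (~720 h/month), that example breaks even in well under a year, which is why continuous research favors local hardware while one-off experiments favor the cloud.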
2. NVIDIA GPU Drivers and CUDA Installation
2.1 Installing nvidia-driver (Ubuntu)
# 1. Remove existing NVIDIA packages
sudo apt-get purge 'nvidia*'   # quote the glob so the shell doesn't expand it
sudo apt autoremove
# 2. Check available drivers
ubuntu-drivers devices
# 3. Auto-install recommended driver
sudo ubuntu-drivers autoinstall
# Or install a specific version
sudo apt install nvidia-driver-535
# 4. Reboot
sudo reboot
# 5. Verify installation
nvidia-smi
Example nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off | 0 |
| N/A 34C P0 56W / 400W | 1024MiB / 40960MiB | 0% Default |
+-----------------------------------------------------------------------------+
2.2 Installing the CUDA Toolkit
Use the official NVIDIA download page to generate the installation commands for your platform; the example below is for Ubuntu 22.04.
# CUDA 12.2 installation (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2
# Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify CUDA version
nvcc --version
2.3 Installing cuDNN
# Requires an NVIDIA Developer account
# After downloading cuDNN:
# For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples
# Verify cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
2.4 Verifying the Installation
# verify_gpu.py
import subprocess

def check_nvidia_smi():
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    if result.returncode == 0:
        print("nvidia-smi is working")
        for line in result.stdout.split('\n'):
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")
            # Quick tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor operation test: passed (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"Detected GPUs: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()
python verify_gpu.py
3. Python Environment Management
3.1 Python Version Management with pyenv
# Install pyenv
curl https://pyenv.run | bash
# Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
# List available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"
# Install Python
pyenv install 3.11.7
pyenv install 3.10.13
# Set global version
pyenv global 3.11.7
# Set project-specific version (in that directory)
cd my-project
pyenv local 3.10.13
# Check current Python version
python --version
pyenv versions
3.2 conda Environments (for GPU Libraries)
conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.
# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init
# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict
# Create AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml
# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0
# Export and import environments
conda env export > environment.yml
conda env create -f environment.yml
# List environments
conda env list
environment.yml example:
name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0
3.3 Dependency Management with poetry
# Install poetry
curl -sSL https://install.python-poetry.org | python3 -
# Create project
poetry new ml-research
cd ml-research
# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter
# Install with extras
poetry add "transformers[torch]>=4.35"
# Note: PyPI torch has no "cuda" extra — the default Linux wheels already bundle CUDA.
# To pin a specific CUDA build, add the PyTorch index as a source instead:
# poetry source add --priority explicit pytorch https://download.pytorch.org/whl/cu121
# poetry add --source pytorch torch torchvision
# Install all dependencies
poetry install
# Check virtual environment info
poetry env info
poetry env list
3.4 uv (Ultra-fast Package Manager)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install packages (10-100x faster than pip)
uv pip install torch torchvision numpy pandas
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
# Install from requirements.txt (very fast)
uv pip install -r requirements.txt
# Manage Python versions
uv python install 3.11 3.12
uv python list
4. Advanced JupyterLab Setup
4.1 JupyterLab Installation and Extensions
# Install JupyterLab
pip install jupyterlab
# Install useful extensions
pip install jupyterlab-git # Git integration
pip install jupyterlab-lsp # Language server protocol (autocomplete)
pip install python-lsp-server # Python LSP
pip install jupyterlab-code-formatter # Code formatter
pip install black isort # Formatters
pip install jupyterlab-vim # Vim key bindings
pip install ipywidgets # Interactive widgets
# List installed extensions
jupyter labextension list
# Start JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
4.2 Kernel Management
# List available kernels
jupyter kernelspec list
# Register current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"
# Remove a kernel
jupyter kernelspec remove old-kernel
Custom kernel.json example:
{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}
4.3 Remote Jupyter via SSH Tunneling
# Start Jupyter on the server (bound to localhost only, reachable through the tunnel)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1
# Create SSH tunnel from local machine
ssh -N -L 8888:localhost:8888 user@your-server.com
# Access in browser
# http://localhost:8888
# Set a password for security
jupyter lab password
Systemd service for auto-start:
cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter
4.4 Jupyter Magic Commands
# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function() # 100 loops per measurement
# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)
# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()
# Write cell contents to file
%%writefile my_script.py
import numpy as np
# ...
# Load file into cell
%load my_script.py
# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py
# Set environment variables
%env CUDA_VISIBLE_DEVICES=0
# Auto-reload (reflect code changes immediately)
%load_ext autoreload
%autoreload 2
# Inline matplotlib plots
%matplotlib inline
# List current variables
%who
%whos
# Access previous outputs
print(_) # Last result
print(__) # Two results ago
4.5 nbconvert (Notebook Conversion)
# Convert notebook to script
jupyter nbconvert --to script notebook.ipynb
# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb
# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb
# Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb
# Execute notebook in-place from CLI
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb
# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100
5. VS Code for AI
5.1 Installing Essential Extensions
# Install extensions from the command line
code --install-extension ms-python.python # Python
code --install-extension ms-toolsai.jupyter # Jupyter
code --install-extension ms-python.vscode-pylance # Pylance (LSP)
code --install-extension charliermarsh.ruff # Ruff linter
code --install-extension github.copilot # GitHub Copilot
code --install-extension github.copilot-chat # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh # Remote SSH
code --install-extension ms-vscode-remote.remote-containers # Dev Containers
code --install-extension eamodio.gitlens # GitLens
code --install-extension njpwerner.autodocstring # Auto-docstring
5.2 VS Code Settings (settings.json)
{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "python.analysis.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}
5.3 Remote SSH Setup
# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22
Using Remote SSH in VS Code:
- Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
- Select "Remote-SSH: Connect to Host"
- Choose your configured host
Auto-install extensions on the remote server:
{
"remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}
5.4 Dev Containers
.devcontainer/devcontainer.json example:
{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}
6. GPU Docker Containers
6.1 Installing NVIDIA Container Toolkit
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
6.2 Dockerfile for AI Projects
# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*
# Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
# Upgrade pip
RUN pip install --upgrade pip setuptools wheel
# Working directory
WORKDIR /workspace
# Python dependencies (optimize layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt
# Copy source code
COPY . .
# Install as editable package
RUN pip install -e ".[dev]"
# Create non-root user (security best practice)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher
# Expose ports
EXPOSE 8888 6006
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
6.3 Docker Compose for a Service Stack
# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:
# Start the full stack
docker-compose up -d
# Follow logs
docker-compose logs -f trainer
# Restart a specific service
docker-compose restart notebook
# Check GPU usage inside container
docker-compose exec trainer nvidia-smi
# Tear down
docker-compose down --volumes
6.4 GPU Sharing Strategies
# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage # Use GPUs 0 and 1
docker run --gpus '"device=2"' myimage # Use only GPU 2
# MIG (Multi-Instance GPU) for A100/H100
sudo nvidia-smi -i 0 -mig 1                # Enable MIG mode on GPU 0 (may require reset)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb -C   # Create a GPU instance and its compute instance
# Assign a MIG instance to a container (UUIDs come from `nvidia-smi -L`)
docker run --gpus '"device=MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage
7. Remote Development Environment
7.1 SSH-Based Remote Development
# Generate SSH key (on local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"
# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com
# Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# Optimized SSH config (~/.ssh/config)
Host ml-server
HostName server.example.com
User researcher
IdentityFile ~/.ssh/id_ed25519
ControlMaster auto
ControlPath ~/.ssh/cm_%r@%h:%p
ControlPersist 10m
ServerAliveInterval 30
Compression yes
7.2 tmux (Session Persistence)
# Install tmux
sudo apt install tmux
# Start a new session
tmux new-session -s training
# Reattach to a session
tmux attach-session -t training
# List sessions
tmux list-sessions
# Key shortcuts (Prefix: Ctrl+B)
# Ctrl+B d : Detach session (keeps running after SSH disconnect)
# Ctrl+B c : Create new window
# Ctrl+B n/p : Next/previous window
# Ctrl+B % : Vertical split
# Ctrl+B " : Horizontal split
# Ctrl+B z : Toggle pane zoom
~/.tmux.conf configuration:
# ~/.tmux.conf
# Enable mouse support
set -g mouse on
# Increase scroll buffer
set -g history-limit 50000
# Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1
# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"
# Customize status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '
# Change prefix to Ctrl+A (screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix
# Fast pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D
7.3 File Synchronization (rsync, sshfs)
# Sync code local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
./my-project/ user@server:/home/user/my-project/
# Sync results server -> local
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/
# Auto-sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz \
--exclude '__pycache__' \
--exclude '.git' \
--exclude '.venv' \
--exclude '*.pyc' \
--delete \
./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh
# Mount remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15
# Unmount
fusermount -u ~/remote-server
8. Monitoring Tools
8.1 nvidia-smi watch
# Refresh GPU status every second
watch -n 1 nvidia-smi
# Show only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
--format=csv -l 1
# Record to CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
--format=csv -l 1 > gpu_log.csv
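The CSV log produced above can be summarized later with the stdlib `csv` module; a small sketch that assumes the column order from the logging command (timestamp, index, utilization, memory, temperature):

```python
import csv
import io

def mean_utilization(csv_text: str) -> float:
    """Average GPU utilization across all rows of an nvidia-smi CSV log."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row nvidia-smi emits with --format=csv
    utils = [float(row[2].replace("%", "").strip()) for row in reader if row]
    return sum(utils) / len(utils) if utils else 0.0

# Two synthetic rows in the format of the command above
sample = (
    "timestamp, index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n"
    "2024/01/01 10:00:00.000, 0, 87 %, 31000 MiB, 61\n"
    "2024/01/01 10:00:01.000, 0, 93 %, 31050 MiB, 62\n"
)
print(mean_utilization(sample))  # → 90.0
```

In practice you would pass `open("gpu_log.csv").read()` instead of the synthetic sample.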
8.2 nvtop
# Install nvtop (htop for GPUs)
sudo apt install nvtop
# Run
nvtop
8.3 GPU Monitoring from Python
# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU stats via nvidia-smi"""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    gpus = []
    for line in result.stdout.strip().split('\n'):
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=float(parts[2]),
            utilization=float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Monitor GPU during training"""
    print(f"GPU monitoring started (interval: {interval}s)")
    history = []
    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time
        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })
        time.sleep(interval)
    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} measurements")
8.4 System Monitoring
# Install and use htop
sudo apt install htop
htop
# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o # Show only processes with active I/O
# Disk I/O stats
iostat -x 1 5
# Memory usage
free -h
vmstat 1 10
# Network monitoring
sudo apt install nethogs
sudo nethogs
# Comprehensive monitoring dashboard (glances)
pip install glances
glances
9. Code Quality Tools
9.1 black + ruff Configuration
# Install
pip install black ruff pre-commit
# Run black
black .
black --line-length 88 src/
# Run ruff
ruff check .
ruff check --fix . # Auto-fix
ruff format . # Format (can replace black)
pyproject.toml configuration:
[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'
[tool.ruff]
line-length = 88
target-version = "py310"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"N", # pep8-naming
"UP", # pyupgrade
"B", # flake8-bugbear
]
ignore = ["E501", "B008"]
[tool.ruff.lint.isort]
known-first-party = ["my_package"]
9.2 pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']
  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic
# Install and activate pre-commit
pre-commit install
# Run on all files
pre-commit run --all-files
# Run a specific hook
pre-commit run black --all-files
9.3 pytest with Coverage
# Install
pip install pytest pytest-cov pytest-xdist
# Basic run
pytest tests/ -v
# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing
# Parallel execution
pytest tests/ -n auto # As many workers as CPU cores
# Show slowest tests
pytest tests/ --durations=10
# Run only fast tests
pytest tests/ -m "not slow"
pytest configuration in pyproject.toml:
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
"slow: marks tests as slow",
"gpu: marks tests requiring GPU",
"integration: integration tests"
]
addopts = [
"-v",
"--tb=short",
"--cov=src",
"--cov-report=html",
"--cov-fail-under=80"
]
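The `gpu` marker above is most useful together with a collection hook that skips those tests on machines without a GPU. A sketch of a `conftest.py` (the `nvidia-smi` lookup is a cheap heuristic for GPU presence, not a full CUDA check):

```python
# conftest.py (sketch): auto-skip tests marked "gpu" when no GPU is visible.
import shutil

def gpu_present() -> bool:
    # Heuristic: nvidia-smi on PATH implies an NVIDIA driver is installed.
    # Avoids importing torch at collection time, which can be slow.
    return shutil.which("nvidia-smi") is not None

def pytest_collection_modifyitems(config, items):
    import pytest  # available whenever this hook runs under pytest
    if gpu_present():
        return
    skip_gpu = pytest.mark.skip(reason="no GPU detected")
    for item in items:
        if "gpu" in item.keywords:
            item.add_marker(skip_gpu)
```

With this in place, `pytest tests/` runs everywhere, and GPU tests are reported as skipped rather than failing on CPU-only CI runners.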
10. Weights & Biases Setup
10.1 Installation and Initialization
# Install
pip install wandb
# Log in (sets API key)
wandb login
# Or set via environment variable
export WANDB_API_KEY=your-api-key-here
10.2 Basic W&B Usage
import wandb
import numpy as np
import matplotlib.pyplot as plt

# Initialize experiment
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

# Log a visualization
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='Predicted')
ax.plot(x, np.cos(x), label='Actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

wandb.finish()
10.3 Artifact Management
import wandb

def save_model_artifact(model_path: str, run, metadata: dict = None):
    """Save model checkpoint as a W&B artifact"""
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

def save_dataset_artifact(data_dir: str, run):
    """Save dataset as a W&B artifact"""
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

def load_model_artifact(artifact_name: str, version: str = "latest"):
    """Download a model artifact"""
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")
with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")
save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)
wandb.finish()
10.4 Hyperparameter Optimization Sweeps
import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, or bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function called by each sweep run"""
    run = wandb.init()
    config = wandb.config
    best_val_acc = 0.0
    for epoch in range(10):
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))
        if val_acc > best_val_acc:
            best_val_acc = val_acc
        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })
    wandb.finish()

# Create and launch sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run agent for N trials
wandb.agent(sweep_id, function=train_sweep, count=20)
To run a sweep in parallel across multiple servers:
# Server 1
wandb agent username/project/sweep-id
# Server 2 (simultaneously)
wandb agent username/project/sweep-id
Conclusion
Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.
Essential priorities:
- CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.
- Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.
- JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.
- Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.
- W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.
Areas worth investing in long-term:
- Automate code quality with pre-commit hooks and CI/CD.
- Master remote development (Remote SSH + tmux) to work productively from anywhere.
- Use GPU monitoring to catch training issues early.