💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Overview

Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.

You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.

1. AI Development Environment Requirements

1.1 Hardware Recommendations

**GPU (Top Priority)**

In AI research, a GPU is not optional — it is essential.

- Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)

- Shared team server: NVIDIA A100 40GB or 80GB

- Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)

- Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs

**CPU**

During GPU training, the CPU is primarily responsible for data preprocessing and loading.

- Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)

- Recommended: 16+ cores (AMD Threadripper, Intel Xeon)

- Optimal DataLoader workers = half the number of CPU cores

**RAM**

- Minimum: 32GB

- Recommended: 64GB (at least 4x GPU VRAM)

- Large-scale NLP: 128GB+ (tokenization, data loading)

**Storage**

/home/user/ -> NVMe SSD (OS, code, environments)

/data/ -> NVMe SSD or fast HDD (training data)

/models/ -> HDD or NAS (model checkpoints)

- OS and packages: NVMe SSD 500GB+

- Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD

- Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)

1.2 OS Selection

**Ubuntu 22.04 LTS — Strongly Recommended**

Check Ubuntu version

lsb_release -a

Example output:

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

Reasons to choose Ubuntu:

- NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)

- Largest AI/ML community documentation and package ecosystem

- Best compatibility with Docker and Kubernetes

- 5-year LTS support for stable server operation

**macOS (M1/M2/M3)**

Apple Silicon supports GPU acceleration via Metal Performance Shaders.

Check MPS availability (PyTorch)

python -c "import torch; print(torch.backends.mps.is_available())"

1.3 Cloud vs. Local Development

| Aspect | Local Server | Cloud |

| ------------- | ------------------- | ------------------------- |

| Upfront cost | High (hardware) | Low |

| Ongoing cost | Low (electricity) | High (hourly billing) |

| Latency | None | Instance startup time |

| Scalability | Limited | Unlimited |

| Data security | High | Policy-dependent |

| Best for | Continuous research | One-off large experiments |

2. NVIDIA GPU Drivers and CUDA Installation

2.1 Installing nvidia-driver (Ubuntu)

1. Remove existing NVIDIA packages

sudo apt-get purge nvidia*

sudo apt autoremove

2. Check available drivers

ubuntu-drivers devices

3. Auto-install recommended driver

sudo ubuntu-drivers autoinstall

Or install a specific version

sudo apt install nvidia-driver-535

4. Reboot

sudo reboot

5. Verify installation

nvidia-smi

Example `nvidia-smi` output:

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off | 0 |

| N/A 34C P0 56W / 400W | 1024MiB / 40960MiB | 0% Default |

+-----------------------------------------------------------------------------+

2.2 Installing the CUDA Toolkit

Generate the appropriate installation command on the official NVIDIA website for your platform.

CUDA 12.2 installation (Ubuntu 22.04 example)

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt-get update

sudo apt-get -y install cuda-toolkit-12-2

Set environment variables (add to ~/.bashrc or ~/.zshrc)

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc

echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

source ~/.bashrc

Verify CUDA version

nvcc --version

2.3 Installing cuDNN

Requires an NVIDIA Developer account

After downloading cuDNN:

For Ubuntu 22.04 with CUDA 12.x

sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb

sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/

sudo apt-get update

sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples

Verify cuDNN version

cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

2.4 Verifying the Installation

verify_gpu.py

def check_nvidia_smi():

result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)

if result.returncode == 0:

print("nvidia-smi is working")

lines = result.stdout.split('\n')

for line in lines:

if 'NVIDIA' in line and 'Driver' in line:

print(f" {line.strip()}")

else:

print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():

try:

print(f"\nPyTorch version: {torch.__version__}")

print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():

print(f"CUDA version: {torch.version.cuda}")

print(f"cuDNN version: {torch.backends.cudnn.version()}")

print(f"GPU count: {torch.cuda.device_count()}")

for i in range(torch.cuda.device_count()):

props = torch.cuda.get_device_properties(i)

print(f" GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")

Quick tensor operation test

x = torch.randn(1000, 1000).cuda()

y = torch.randn(1000, 1000).cuda()

z = x @ y

print(f"GPU tensor operation test: passed (shape: {z.shape})")

except ImportError:

print("PyTorch is not installed")

def check_tensorflow_gpu():

try:

print(f"\nTensorFlow version: {tf.__version__}")

gpus = tf.config.list_physical_devices('GPU')

print(f"Detected GPUs: {len(gpus)}")

for gpu in gpus:

print(f" {gpu}")

except ImportError:

print("TensorFlow is not installed")

if __name__ == "__main__":

check_nvidia_smi()

check_pytorch_cuda()

check_tensorflow_gpu()

python verify_gpu.py

3. Python Environment Management

3.1 Python Version Management with pyenv

Install pyenv

curl https://pyenv.run | bash

Add to ~/.bashrc or ~/.zshrc

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc

echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc

echo 'eval "$(pyenv init -)"' >> ~/.bashrc

source ~/.bashrc

List available Python versions

pyenv install --list | grep -E "^\s+3\.(10|11|12)"

Install Python

pyenv install 3.11.7

pyenv install 3.10.13

Set global version

pyenv global 3.11.7

Set project-specific version (in that directory)

cd my-project

pyenv local 3.10.13

Check current Python version

python --version

pyenv versions

3.2 conda Environments (for GPU Libraries)

conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.

Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh -b

eval "$($HOME/miniconda3/bin/conda shell.bash hook)"

conda init

Configure channels

conda config --add channels conda-forge

conda config --add channels nvidia

conda config --set channel_priority strict

Create AI/ML environment

conda create -n aiml python=3.11 -y

conda activate aiml

PyTorch with CUDA

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

RAPIDS (GPU-accelerated data science)

conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0

Export and import environments

conda env export > environment.yml

conda env create -f environment.yml

List environments

conda env list

`environment.yml` example:

name: aiml

channels:

- pytorch

- nvidia

- conda-forge

- defaults

dependencies:

- python=3.11

- pytorch>=2.1

- torchvision

- cudatoolkit=12.1

- numpy>=1.24

- pandas>=2.0

- scikit-learn>=1.3

- pip:

- transformers>=4.35

- wandb>=0.16

- pydantic>=2.0

3.3 Dependency Management with poetry

Install poetry

curl -sSL https://install.python-poetry.org | python3 -

Create project

poetry new ml-research

cd ml-research

Add dependencies

poetry add torch torchvision numpy pandas transformers

poetry add --group dev pytest black ruff mypy jupyter

Install with extras

poetry add "torch[cuda]>=2.1"

poetry add "transformers[torch]>=4.35"

Install all dependencies

poetry install

Check virtual environment info

poetry env info

poetry env list

3.4 uv (Ultra-fast Package Manager)

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Install packages (10-100x faster than pip)

uv pip install torch torchvision numpy pandas

Create virtual environment

uv venv .venv --python 3.11

source .venv/bin/activate

Install from requirements.txt (very fast)

uv pip install -r requirements.txt

Manage Python versions

uv python install 3.11 3.12

uv python list

4. Advanced JupyterLab Setup

4.1 JupyterLab Installation and Extensions

Install JupyterLab

pip install jupyterlab

Install useful extensions

pip install jupyterlab-git # Git integration

pip install jupyterlab-lsp # Language server protocol (autocomplete)

pip install python-lsp-server # Python LSP

pip install jupyterlab-code-formatter # Code formatter

pip install black isort # Formatters

pip install jupyterlab-vim # Vim key bindings

pip install ipywidgets # Interactive widgets

List installed extensions

jupyter labextension list

Start JupyterLab

jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

4.2 Kernel Management

List available kernels

jupyter kernelspec list

Register current conda/venv environment as a kernel

pip install ipykernel

python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

Register a specific conda environment as a kernel

conda activate aiml

conda install ipykernel

python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"

Remove a kernel

jupyter kernelspec remove old-kernel

Custom `kernel.json` example:

{

"argv": [

"/home/user/miniconda3/envs/aiml/bin/python",

"-m",

"ipykernel_launcher",

"-f",

"{connection_file}"

"display_name": "Python (aiml GPU)",

"language": "python",

"env": {

"CUDA_VISIBLE_DEVICES": "0",

"PYTHONPATH": "/home/user/projects"

}

4.3 Remote Jupyter via SSH Tunneling

Start Jupyter on the server (no token)

jupyter lab --no-browser --port=8888 --ip=127.0.0.1

Create SSH tunnel from local machine

ssh -N -L 8888:localhost:8888 user@your-server.com

Access in browser

http://localhost:8888

Set a password for security

jupyter lab password

Systemd service for auto-start:

cat > /tmp/jupyter.service << 'EOF'

[Unit]

Description=Jupyter Lab

After=network.target

[Service]

Type=simple

User=your-username

WorkingDirectory=/home/your-username

ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888

Restart=on-failure

[Install]

WantedBy=multi-user.target

EOF

sudo mv /tmp/jupyter.service /etc/systemd/system/

sudo systemctl daemon-reload

sudo systemctl enable jupyter

sudo systemctl start jupyter

4.4 Jupyter Magic Commands

Timing

%time result = expensive_function()

%timeit -n 100 result = fast_function() # 100 repetitions

Line-by-line profiling

%load_ext line_profiler

%lprun -f my_function my_function(data)

Memory profiling

%load_ext memory_profiler

%memit result = memory_heavy_function()

Write cell contents to file

%%writefile my_script.py

...

Load file into cell

%load my_script.py

Run shell commands

!nvidia-smi

!pip list | grep torch

files = !ls -la *.py

Set environment variables

%env CUDA_VISIBLE_DEVICES=0

Auto-reload (reflect code changes immediately)

%load_ext autoreload

%autoreload 2

Inline matplotlib plots

%matplotlib inline

List current variables

%who

%whos

Access previous outputs

print(_) # Last result

print(__) # Two results ago

4.5 nbconvert (Notebook Conversion)

Convert notebook to script

jupyter nbconvert --to script notebook.ipynb

Convert to HTML (for sharing)

jupyter nbconvert --to html notebook.ipynb

Convert to PDF (requires LaTeX)

jupyter nbconvert --to pdf notebook.ipynb

Execute and convert to HTML

jupyter nbconvert --to html --execute notebook.ipynb

Execute notebook in-place from CLI

jupyter nbconvert --to notebook --execute --inplace notebook.ipynb

Parameterized execution (papermill)

pip install papermill

papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100

5. VS Code for AI

5.1 Installing Essential Extensions

Install extensions from the command line

code --install-extension ms-python.python # Python

code --install-extension ms-toolsai.jupyter # Jupyter

code --install-extension ms-python.pylance # Pylance (LSP)

code --install-extension charliermarsh.ruff # Ruff linter

code --install-extension github.copilot # GitHub Copilot

code --install-extension github.copilot-chat # Copilot Chat

code --install-extension ms-vscode-remote.remote-ssh # Remote SSH

code --install-extension ms-vscode-remote.remote-containers # Dev Containers

code --install-extension eamodio.gitlens # GitLens

code --install-extension njpwerner.autodocstring # Auto-docstring

5.2 VS Code Settings (settings.json)

{

"python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",

"python.formatting.provider": "none",

"[python]": {

"editor.defaultFormatter": "charliermarsh.ruff",

"editor.formatOnSave": true,

"editor.codeActionsOnSave": {

"source.fixAll.ruff": true,

"source.organizeImports.ruff": true

}

"pylance.typeCheckingMode": "basic",

"python.analysis.autoImportCompletions": true,

"python.analysis.indexing": true,

"python.analysis.packageIndexDepths": [

{ "name": "torch", "depth": 5 },

{ "name": "transformers", "depth": 5 }

"jupyter.askForKernelRestart": false,

"jupyter.interactiveWindow.creationMode": "perFile",

"editor.inlineSuggest.enabled": true,

"github.copilot.enable": {

"*": true,

"yaml": true,

"plaintext": false

"files.exclude": {

"**/__pycache__": true,

"**/*.pyc": true,

".mypy_cache": true,

".ruff_cache": true

"terminal.integrated.env.linux": {

"CUDA_VISIBLE_DEVICES": "0"

}

5.3 Remote SSH Setup

Local ~/.ssh/config

Host ml-server

HostName 192.168.1.100

User your-username

IdentityFile ~/.ssh/id_rsa

ForwardAgent yes

ServerAliveInterval 60

ServerAliveCountMax 3

Host gpu-cloud

HostName gpu-server.example.com

User ubuntu

IdentityFile ~/.ssh/cloud-key.pem

Port 22

Using Remote SSH in VS Code:

1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)

2. Select "Remote-SSH: Connect to Host"

3. Choose your configured host

Auto-install extensions on the remote server:

{

"remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]

}

5.4 Dev Containers

`.devcontainer/devcontainer.json` example:

{

"name": "AI Development",

"build": {

"dockerfile": "Dockerfile",

"args": {

"CUDA_VERSION": "12.1.1",

"PYTHON_VERSION": "3.11"

}

"runArgs": ["--gpus", "all", "--shm-size", "8g"],

"mounts": [

"source=/data,target=/data,type=bind",

"source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"

"customizations": {

"vscode": {

"extensions": [

"ms-python.python",

"ms-toolsai.jupyter",

"charliermarsh.ruff",

"github.copilot"

"settings": {

"python.defaultInterpreterPath": "/usr/local/bin/python"

}

"postCreateCommand": "pip install -e '.[dev]'",

"remoteUser": "root"

}

6. GPU Docker Containers

6.1 Installing NVIDIA Container Toolkit

Install NVIDIA Container Toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \

| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \

| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \

| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

Test GPU access

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

6.2 Dockerfile for AI Projects

Dockerfile

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

Environment variables

ENV DEBIAN_FRONTEND=noninteractive

ENV PYTHONUNBUFFERED=1

ENV PYTHONDONTWRITEBYTECODE=1

ENV PIP_NO_CACHE_DIR=1

System packages

RUN apt-get update && apt-get install -y \

python3.11 \

python3.11-dev \

python3.11-venv \

python3-pip \

git \

wget \

curl \

vim \

htop \

tmux \

&& rm -rf /var/lib/apt/lists/*

Set Python defaults

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

Upgrade pip

RUN pip install --upgrade pip setuptools wheel

Working directory

WORKDIR /workspace

Python dependencies (optimize layer caching)

COPY requirements.txt .

RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

RUN pip install -r requirements.txt

Copy source code

COPY . .

Install as editable package

RUN pip install -e ".[dev]"

Create non-root user (security best practice)

RUN useradd -m -u 1000 researcher

RUN chown -R researcher:researcher /workspace

USER researcher

Expose ports

EXPOSE 8888 6006

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

6.3 Docker Compose for a Service Stack

docker-compose.yml

version: '3.8'

services:

Training container

trainer:

build:

context: .

dockerfile: Dockerfile

runtime: nvidia

environment:

- NVIDIA_VISIBLE_DEVICES=0

- CUDA_VISIBLE_DEVICES=0

- WANDB_API_KEY=${WANDB_API_KEY}

volumes:

- ./src:/workspace/src

- ./configs:/workspace/configs

- /data:/data:ro

- model-checkpoints:/workspace/checkpoints

- ~/.cache/huggingface:/root/.cache/huggingface

shm_size: '8g'

command: python train.py

Jupyter notebook server

notebook:

build:

context: .

dockerfile: Dockerfile

runtime: nvidia

environment:

- NVIDIA_VISIBLE_DEVICES=1

ports:

- '8888:8888'

volumes:

- ./notebooks:/workspace/notebooks

- ./src:/workspace/src

- /data:/data:ro

command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

TensorBoard

tensorboard:

image: tensorflow/tensorflow:latest

ports:

- '6006:6006'

volumes:

- model-checkpoints:/workspace/checkpoints:ro

command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

MLflow tracking server

mlflow:

image: ghcr.io/mlflow/mlflow:v2.8.0

ports:

- '5000:5000'

volumes:

- mlflow-data:/mlflow

command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:

model-checkpoints:

mlflow-data:

Start the full stack

docker-compose up -d

Follow logs

docker-compose logs -f trainer

Restart a specific service

docker-compose restart notebook

Check GPU usage inside container

docker-compose exec trainer nvidia-smi

Tear down

docker-compose down --volumes

6.4 GPU Sharing Strategies

Assign specific GPUs

docker run --gpus '"device=0,1"' myimage # Use GPUs 0 and 1

docker run --gpus '"device=2"' myimage # Use only GPU 2

MIG (Multi-Instance GPU) for A100/H100

sudo nvidia-smi mig -i 0 --create-gpu-instance 3g.20gb

sudo nvidia-smi mig -i 0 --create-compute-instance 3g.20gb

Assign MIG instance to a container

docker run --gpus '"MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage

7. Remote Development Environment

7.1 SSH-Based Remote Development

Generate SSH key (on local machine)

ssh-keygen -t ed25519 -C "your-email@example.com"

Copy public key to server

ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com

Or manually

cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

Optimized SSH config (~/.ssh/config)

Host ml-server

HostName server.example.com

User researcher

IdentityFile ~/.ssh/id_ed25519

ControlMaster auto

ControlPath ~/.ssh/cm_%r@%h:%p

ControlPersist 10m

ServerAliveInterval 30

Compression yes

7.2 tmux (Session Persistence)

Install tmux

sudo apt install tmux

Start a new session

tmux new-session -s training

Reattach to a session

tmux attach-session -t training

List sessions

tmux list-sessions

Key shortcuts (Prefix: Ctrl+B)

Ctrl+B d : Detach session (keeps running after SSH disconnect)

Ctrl+B c : Create new window

Ctrl+B n/p : Next/previous window

Ctrl+B % : Vertical split

Ctrl+B " : Horizontal split

Ctrl+B z : Toggle pane zoom

`~/.tmux.conf` configuration:

~/.tmux.conf

Enable mouse support

set -g mouse on

Increase scroll buffer

set -g history-limit 50000

Start window numbering at 1

set -g base-index 1

setw -g pane-base-index 1

Color support

set -g default-terminal "screen-256color"

set -ga terminal-overrides ",xterm-256color:Tc"

Customize status bar

set -g status-bg colour235

set -g status-fg colour136

set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '

Change prefix to Ctrl+A (screen style)

set -g prefix C-a

unbind C-b

bind C-a send-prefix

Fast pane navigation

bind -n M-Left select-pane -L

bind -n M-Right select-pane -R

bind -n M-Up select-pane -U

bind -n M-Down select-pane -D

7.3 File Synchronization (rsync, sshfs)

Sync code local -> server

rsync -avz --exclude '__pycache__' --exclude '.git' \

./my-project/ user@server:/home/user/my-project/

Sync results server -> local

rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/

Auto-sync script

cat > sync.sh << 'EOF'

#!/bin/bash

rsync -avz \

--exclude '__pycache__' \

--exclude '.git' \

--exclude '.venv' \

--exclude '*.pyc' \

--delete \

./ user@ml-server:/home/user/project/

echo "Sync complete: $(date)"

EOF

chmod +x sync.sh

Mount remote filesystem with sshfs

sudo apt install sshfs

mkdir -p ~/remote-server

sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15

Unmount

fusermount -u ~/remote-server

8. Monitoring Tools

8.1 nvidia-smi watch

Refresh GPU status every second

watch -n 1 nvidia-smi

Show only GPU memory usage

nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \

--format=csv -l 1

Record to CSV log

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \

--format=csv -l 1 > gpu_log.csv

8.2 nvtop

Install nvtop (htop for GPUs)

sudo apt install nvtop

Run

nvtop

8.3 GPU Monitoring from Python

gpu_monitor.py

from dataclasses import dataclass

from typing import List

@dataclass

class GPUStats:

index: int

name: str

temperature: float

utilization: float

memory_used: int

memory_total: int

power_draw: float

def get_gpu_stats() -> List[GPUStats]:

"""Query GPU stats via nvidia-smi"""

cmd = [

'nvidia-smi',

'--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',

'--format=csv,noheader,nounits'

]

result = subprocess.run(cmd, capture_output=True, text=True)

gpus = []

for line in result.stdout.strip().split('\n'):

parts = [p.strip() for p in line.split(',')]

gpus.append(GPUStats(

index=int(parts[0]),

name=parts[1],

temperature=float(parts[2]),

utilization=float(parts[3]),

memory_used=int(parts[4]),

memory_total=int(parts[5]),

power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0

))

return gpus

def monitor_training(interval: int = 10, duration: int = 3600):

"""Monitor GPU during training"""

print(f"GPU monitoring started (interval: {interval}s)")

history = []

start_time = time.time()

while time.time() - start_time < duration:

stats = get_gpu_stats()

timestamp = time.time() - start_time

for gpu in stats:

mem_pct = gpu.memory_used / gpu.memory_total * 100

print(

f"[{timestamp:.0f}s] GPU {gpu.index}: "

f"util={gpu.utilization:.0f}% "

f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "

f"temp={gpu.temperature:.0f}C "

f"power={gpu.power_draw:.0f}W"

)

history.append({

'timestamp': timestamp,

'gpu_index': gpu.index,

'utilization': gpu.utilization,

'memory_used': gpu.memory_used,

'temperature': gpu.temperature

})

time.sleep(interval)

return history

if __name__ == "__main__":

history = monitor_training(interval=5, duration=60)

print(f"\nCollected {len(history)} measurements")

8.4 System Monitoring

Install and use htop

sudo apt install htop

htop

Monitor I/O with iotop

sudo apt install iotop

sudo iotop -o # Show only processes with active I/O

Disk I/O stats

iostat -x 1 5

Memory usage

free -h

vmstat 1 10

Network monitoring

sudo apt install nethogs

sudo nethogs

Comprehensive monitoring dashboard (glances)

pip install glances

glances

9. Code Quality Tools

9.1 black + ruff Configuration

Install

pip install black ruff pre-commit

Run black

black .

black --line-length 88 src/

Run ruff

ruff check .

ruff check --fix . # Auto-fix

ruff format . # Format (can replace black)

`pyproject.toml` configuration:

[tool.black]

line-length = 88

target-version = ['py310', 'py311']

include = '\.pyi?$'

[tool.ruff]

line-length = 88

target-version = "py310"

[tool.ruff.lint]

select = [

"E", # pycodestyle errors

"W", # pycodestyle warnings

"F", # pyflakes

"I", # isort

"N", # pep8-naming

"UP", # pyupgrade

"B", # flake8-bugbear

]

ignore = ["E501", "B008"]

[tool.ruff.lint.isort]

known-first-party = ["my_package"]

9.2 pre-commit Hooks

.pre-commit-config.yaml

repos:

- repo: https://github.com/pre-commit/pre-commit-hooks

rev: v4.5.0

hooks:

- id: trailing-whitespace

- id: end-of-file-fixer

- id: check-yaml

- id: check-json

- id: check-merge-conflict

- id: detect-private-key

- id: check-added-large-files

args: ['--maxkb=10000']

- repo: https://github.com/psf/black

rev: 23.11.0

hooks:

- id: black

- repo: https://github.com/astral-sh/ruff-pre-commit

rev: v0.1.6

hooks:

- id: ruff

args: [--fix]

- repo: https://github.com/pre-commit/mirrors-mypy

rev: v1.7.1

hooks:

- id: mypy

additional_dependencies:

- types-requests

- pydantic

Install and activate pre-commit

pre-commit install

Run on all files

pre-commit run --all-files

Run a specific hook

pre-commit run black --all-files

9.3 pytest with Coverage

Install

pip install pytest pytest-cov pytest-xdist

Basic run

pytest tests/ -v

With coverage

pytest tests/ --cov=src --cov-report=html --cov-report=term-missing

Parallel execution

pytest tests/ -n auto # As many workers as CPU cores

Show slowest tests

pytest tests/ --durations=10

Run only fast tests

pytest tests/ -m "not slow"

`pytest` configuration in `pyproject.toml`:

[tool.pytest.ini_options]

testpaths = ["tests"]

python_files = ["test_*.py", "*_test.py"]

python_classes = ["Test*"]

python_functions = ["test_*"]

markers = [

"slow: marks tests as slow",

"gpu: marks tests requiring GPU",

"integration: integration tests"

]

addopts = [

"-v",

"--tb=short",

"--cov=src",

"--cov-report=html",

"--cov-fail-under=80"

]

10. Weights & Biases Setup

10.1 Installation and Initialization

Install

pip install wandb

Log in (sets API key)

wandb login

Or set via environment variable

export WANDB_API_KEY=your-api-key-here

10.2 Basic W&B Usage

Initialize experiment

run = wandb.init(

project="bert-sentiment-analysis",

name="run-001-baseline",

config={

"learning_rate": 2e-5,

"batch_size": 32,

"epochs": 10,

"model": "bert-base-uncased",

"optimizer": "adamw"

tags=["baseline", "bert", "nlp"],

notes="Baseline BERT fine-tuning experiment"

)

config = wandb.config

print(f"Learning rate: {config.learning_rate}")

Log metrics

for epoch in range(config.epochs):

train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1

val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1

val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01

wandb.log({

"epoch": epoch,

"train/loss": train_loss,

"val/loss": val_loss,

"val/accuracy": val_acc,

"learning_rate": config.learning_rate * (0.95 ** epoch)

})

Log a visualization

fig, ax = plt.subplots()

x = np.linspace(0, 10, 100)

ax.plot(x, np.sin(x), label='Predicted')

ax.plot(x, np.cos(x), label='Actual')

ax.legend()

wandb.log({"prediction_plot": wandb.Image(fig)})

plt.close()

wandb.finish()

10.3 Artifact Management

def save_model_artifact(model_path: str, run, metadata: dict = None):

"""Save model checkpoint as a W&B artifact"""

artifact = wandb.Artifact(

name="model-checkpoint",

type="model",

metadata=metadata or {}

)

artifact.add_file(model_path)

run.log_artifact(artifact)

print(f"Model artifact saved: {model_path}")

def save_dataset_artifact(data_dir: str, run):

"""Save dataset as a W&B artifact"""

artifact = wandb.Artifact(

name="training-dataset",

type="dataset",

description="Preprocessed training dataset"

)

artifact.add_dir(data_dir)

run.log_artifact(artifact)

def load_model_artifact(artifact_name: str, version: str = "latest"):

"""Download a model artifact"""

run = wandb.init(project="my-project", job_type="inference")

artifact = run.use_artifact(f"{artifact_name}:{version}")

artifact_dir = artifact.download()

print(f"Artifact downloaded to: {artifact_dir}")

return artifact_dir

Usage example

run = wandb.init(project="bert-training")

with open("/tmp/model.pt", "w") as f:

f.write("model_weights")

save_model_artifact(

"/tmp/model.pt",

run,

metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}

)

wandb.finish()

10.4 Hyperparameter Optimization Sweeps

Sweep configuration

sweep_config = {

"method": "bayes", # random, grid, or bayes

"metric": {

"name": "val/accuracy",

"goal": "maximize"

"parameters": {

"learning_rate": {

"distribution": "log_uniform_values",

"min": 1e-5,

"max": 1e-3

"batch_size": {

"values": [16, 32, 64, 128]

"hidden_size": {

"values": [256, 512, 768, 1024]

"dropout": {

"distribution": "uniform",

"min": 0.1,

"max": 0.5

"num_layers": {

"values": [2, 4, 6, 8]

}

"early_terminate": {

"type": "hyperband",

"min_iter": 3

}

def train_sweep():

"""Training function called by each sweep run"""

run = wandb.init()

config = wandb.config

best_val_acc = 0.0

for epoch in range(10):

train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)

val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))

if val_acc > best_val_acc:

best_val_acc = val_acc

wandb.log({

"epoch": epoch,

"train/loss": train_loss,

"val/accuracy": val_acc

})

wandb.finish()

Create and launch sweep

sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")

print(f"Sweep ID: {sweep_id}")

Run agent for N trials

wandb.agent(sweep_id, function=train_sweep, count=20)

To run a sweep in parallel across multiple servers:

Server 1

wandb agent username/project/sweep-id

Server 2 (simultaneously)

wandb agent username/project/sweep-id

Conclusion

Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.

**Essential priorities:**

1. CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.

2. Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.

3. JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.

4. Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.

5. W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.

**Areas worth investing in long-term:**

- Automate code quality with pre-commit hooks and CI/CD.

- Master remote development (Remote SSH + tmux) to work productively from anywhere.

- Use GPU monitoring to catch training issues early.

References

- [CUDA Installation Guide (Linux)](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)

- [Docker GPU Support Documentation](https://docs.docker.com/config/containers/resource_constraints/#gpu)

- [Jupyter Official Documentation](https://docs.jupyter.org/)

- [VS Code Remote SSH](https://code.visualstudio.com/docs/remote/ssh)

- [Weights & Biases Documentation](https://docs.wandb.ai/)

- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html)