
AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, and Docker

Overview

Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration are all impossible. This guide covers everything an AI/ML developer needs to configure, from initial GPU server setup to daily development workflows.

It walks step by step through installing CUDA on an Ubuntu server, managing Python environments, using JupyterLab and VS Code like a professional, and building reproducible experiment environments with Docker containers.


1. AI Development Environment Requirements

1.1 Hardware Recommendations

GPU (Top Priority)

In AI research, a GPU is essential, not optional. Recommendations by use case:

  • Entry-level / personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
  • Shared team server: NVIDIA A100 40GB or 80GB
  • Large-scale LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
  • Cloud: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs

CPU

During GPU training, the CPU is primarily responsible for data preprocessing and loading.

  • Minimum: 8 cores / 16 threads (Intel Core i9 or AMD Ryzen 9)
  • Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
  • A good starting point for DataLoader workers is half the number of CPU cores
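The worker heuristic above can be sketched in Python. The helper name is ours, and the right value ultimately depends on storage speed and preprocessing cost, so treat it as a starting point rather than a rule:

```python
import os

def suggested_num_workers() -> int:
    # Heuristic from above: roughly half the CPU cores for DataLoader workers,
    # leaving the rest for the training process and the OS.
    cores = os.cpu_count() or 2
    return max(1, cores // 2)

# e.g. DataLoader(dataset, batch_size=32, num_workers=suggested_num_workers())
print(suggested_num_workers())
```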

RAM

  • Minimum: 32GB
  • Recommended: 64GB (at least 4x GPU VRAM)
  • Large-scale NLP: 128GB+ (tokenization, data loading)

Storage

/home/user/   -> NVMe SSD (OS, code, environments)
/data/        -> NVMe SSD or fast HDD (training data)
/models/      -> HDD or NAS (model checkpoints)

  • OS and packages: NVMe SSD, 500GB+
  • Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
  • Checkpoints: HDD, 4TB+ (cost-effective high-capacity storage)
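On an existing server, you can check how mount points like the ones above map to physical devices:

```shell
# Show block devices, their sizes, and where they are mounted
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Free space per filesystem, human-readable
df -h
```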

1.2 OS Selection

Ubuntu 22.04 LTS (Strongly Recommended)

# Check the Ubuntu version
lsb_release -a

# Example output:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)

Reasons to choose Ubuntu:

  • Official NVIDIA support (latest drivers, CUDA, and cuDNN available immediately)
  • Rich AI/ML community documentation and package ecosystem
  • Best compatibility with Docker and Kubernetes
  • 5-year LTS support for stable server operation

macOS (M1/M2/M3)

Apple Silicon supports GPU acceleration via Metal Performance Shaders.

# Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"

1.3 Cloud vs. Local Development

Aspect           Local server           Cloud
Upfront cost     High (hardware)        Low
Ongoing cost     Low (electricity)      High (hourly billing)
Wait time        None                   Instance startup time
Scalability      Limited                Unlimited
Data security    High                   Policy-dependent
Best for         Continuous research    One-off large experiments
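A rough way to weigh the two cost rows is a break-even calculation. The figures below are placeholders, not quotes:

```python
def break_even_hours(hardware_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    # Hours of on-demand cloud GPU time that cost as much as buying the hardware
    # outright (ignores electricity, depreciation, and reserved/spot discounts).
    return hardware_cost_usd / cloud_rate_usd_per_hour

# Hypothetical: a $30,000 local A100 workstation vs. $4/hour for a cloud A100
hours = break_even_hours(30_000, 4.0)
print(f"{hours:.0f} hours (~{hours / 24:.0f} days of continuous use)")
```

If your GPUs would run near-continuously past that point, local hardware wins; for occasional bursts, the cloud does.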

2. NVIDIA GPU Drivers and CUDA Installation

2.1 Installing nvidia-driver (Ubuntu)

# 1. Remove existing NVIDIA packages
sudo apt-get purge nvidia*
sudo apt autoremove

# 2. Check available drivers
ubuntu-drivers devices

# 3. Auto-install the recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific version
sudo apt install nvidia-driver-535

# 4. Reboot
sudo reboot

# 5. Verify the installation
nvidia-smi

Example nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB  Off| 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    56W / 400W |   1024MiB / 40960MiB |      0%      Default |
+-----------------------------------------------------------------------------+

2.2 Installing the CUDA Toolkit

Generate the installation commands for your environment on the official NVIDIA site.

# Install CUDA 12.2 (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

# Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Check the CUDA version
nvcc --version

2.3 Installing cuDNN

# Requires an NVIDIA Developer account
# After downloading cuDNN:

# For Ubuntu 22.04 and CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples

# Check the cuDNN version
grep CUDNN_MAJOR -A 2 /usr/include/cudnn_version.h

2.4 Verifying the Installation

# verify_gpu.py
import subprocess

def check_nvidia_smi():
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    except FileNotFoundError:
        print("nvidia-smi not found (is the driver installed?)")
        return
    if result.returncode == 0:
        print("nvidia-smi is working")
        # Print the driver/CUDA summary line
        for line in result.stdout.split('\n'):
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")

            # Quick tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor test: success (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"GPUs detected: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()

Run it with:

python verify_gpu.py

3. Python Environment Management

3.1 Managing Python Versions with pyenv

# Install pyenv
curl https://pyenv.run | bash

# Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# List installable Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"

# Install Python
pyenv install 3.11.7
pyenv install 3.10.13

# Set the global version
pyenv global 3.11.7

# Set a per-project version (inside the project directory)
cd my-project
pyenv local 3.10.13

# Check the current Python version
python --version
pyenv versions

3.2 conda Environments (GPU Libraries)

conda is especially useful for installing GPU-accelerated libraries such as RAPIDS and cuML.

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init

# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict

# Create an AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml

# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0

# Export and recreate environments
conda env export > environment.yml
conda env create -f environment.yml

# List environments
conda env list

Example environment.yml:

name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0

3.3 Dependency Management with poetry

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create a project
poetry new ml-research
cd ml-research

# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter

# Pin versions or include extras
poetry add "torch>=2.1"                 # for CUDA-specific builds, add the PyTorch wheel index as a source
poetry add "transformers[torch]>=4.35"

# Install
poetry install

# Virtual environment info
poetry env info
poetry env list

3.4 uv (Ultra-Fast Package Manager)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Package installs 10-100x faster than pip
uv pip install torch torchvision numpy pandas

# Create a virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate

# Install from requirements.txt (very fast)
uv pip install -r requirements.txt

# Manage Python versions
uv python install 3.11 3.12
uv python list

4. Advanced JupyterLab Setup

4.1 Installing JupyterLab and Extensions

# Install JupyterLab
pip install jupyterlab

# Install useful extensions
pip install jupyterlab-git           # Git integration
pip install jupyterlab-lsp           # Language Server Protocol (autocompletion)
pip install python-lsp-server        # Python LSP
pip install jupyterlab-code-formatter  # Code formatter
pip install black isort              # Formatters
pip install jupyterlab-vim           # Vim keybindings
pip install ipywidgets               # Interactive widgets

# List installed extensions
jupyter labextension list

# Launch JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

4.2 Kernel Management

# List kernels
jupyter kernelspec list

# Register the current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"

# Remove a kernel
jupyter kernelspec remove old-kernel

# Edit a custom kernel spec
# ~/.local/share/jupyter/kernels/aiml/kernel.json

Example kernel.json:

{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}

4.3 Remote Jupyter (SSH Tunneling)

# Start Jupyter on the server, bound to localhost only
jupyter lab --no-browser --port=8888 --ip=127.0.0.1

# Create an SSH tunnel from your local machine
ssh -N -L 8888:localhost:8888 user@your-server.com

# Open in your browser
# http://localhost:8888

# Set a password (safer and more convenient than pasting tokens)
jupyter lab password
# Then log in at localhost:8888 with the password

A systemd service for automatic startup:

# /etc/systemd/system/jupyter.service
cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter

4.4 Jupyter Magic Commands

# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function()  # repeat 100 times

# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)

# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()

# Write the cell contents to a file
%%writefile my_script.py
import numpy as np
# ...

# Load a file into the cell
%load my_script.py

# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py

# Environment variables
%env CUDA_VISIBLE_DEVICES=0

# Autoreload (pick up code edits immediately)
%load_ext autoreload
%autoreload 2

# Inline matplotlib figures
%matplotlib inline

# List current variables
%who
%whos

# Inspect previous outputs
print(_)   # last result
print(__)  # second-to-last result

4.5 nbconvert (Notebook Conversion)

# Convert a notebook to a script
jupyter nbconvert --to script notebook.ipynb

# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb

# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute while converting to HTML
jupyter nbconvert --to html --execute notebook.ipynb

# Execute a notebook from the command line
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb

# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100

5. VS Code for AI

5.1 Installing Essential Extensions

# Install extensions from the VS Code command line
code --install-extension ms-python.python           # Python
code --install-extension ms-toolsai.jupyter         # Jupyter
code --install-extension ms-python.vscode-pylance   # Pylance (LSP)
code --install-extension charliermarsh.ruff         # Ruff linter
code --install-extension github.copilot             # GitHub Copilot
code --install-extension github.copilot-chat        # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh  # Remote SSH
code --install-extension ms-vscode-remote.remote-containers  # Dev Containers
code --install-extension eamodio.gitlens            # GitLens
code --install-extension njpwerner.autodocstring     # Docstring generation
code --install-extension tabnine.tabnine-vscode     # Tabnine AI

5.2 VS Code Settings (settings.json)

{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "pylance.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}

5.3 Remote SSH Setup

# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22

Using Remote SSH in VS Code:

  1. Ctrl+Shift+P (Cmd+Shift+P on macOS)
  2. Select "Remote-SSH: Connect to Host"
  3. Pick the host you configured

Install these extensions automatically on remote hosts:

{
  "remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}

5.4 Dev Containers

Example .devcontainer/devcontainer.json:

{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}

6. GPU Docker Containers

6.1 Installing the NVIDIA Container Toolkit

# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

6.2 Dockerfile for AI Projects

# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1

# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*

# Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

# Upgrade pip
RUN pip install --upgrade pip setuptools wheel

# Working directory
WORKDIR /workspace

# Python dependencies (optimizes layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt

# Copy source code
COPY . .

# Install the package
RUN pip install -e ".[dev]"

# Create a non-root user (security)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher

# Expose ports
EXPOSE 8888 6006

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

6.3 A Service Stack with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:

# Start the whole stack
docker-compose up -d

# Tail logs
docker-compose logs -f trainer

# Restart a single service
docker-compose restart notebook

# Check GPU usage inside a container
docker-compose exec trainer nvidia-smi

# Tear down
docker-compose down --volumes

6.4 GPU Sharing Strategies

# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage  # use GPUs 0 and 1
docker run --gpus '"device=2"' myimage    # use GPU 2 only

# MIG (Multi-Instance GPU) setup (A100/H100)
sudo nvidia-smi -i 0 -mig 1                 # enable MIG mode first (may require a GPU reset)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb -C    # create a GPU instance and its compute instance

# Assign a MIG instance to a container
docker run --gpus '"device=MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage

# Fractional GPU sharing (time-slicing)
# configured in /etc/nvidia-container-runtime/config.toml
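For Docker Compose, the equivalent of the `--gpus` flag is a device reservation. This fragment requests one NVIDIA GPU for a service (the service name is an example):

```yaml
services:
  trainer:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1            # or: device_ids: ['0'] for a specific GPU
              capabilities: [gpu]
```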

7. Remote Development Environments

7.1 SSH-Based Remote Development

# Generate an SSH key (local)
ssh-keygen -t ed25519 -C "your-email@example.com"

# Copy the public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com

# Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# SSH connection tuning (~/.ssh/config)
Host ml-server
    HostName server.example.com
    User researcher
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/cm_%r@%h:%p
    ControlPersist 10m
    ServerAliveInterval 30
    Compression yes

7.2 tmux (Persistent Sessions)

# Install tmux
sudo apt install tmux

# Start a new session
tmux new-session -s training

# Reattach to a session
tmux attach-session -t training

# List sessions
tmux list-sessions

# Useful shortcuts (prefix: Ctrl+B)
# Ctrl+B d    : detach (training keeps running after SSH disconnects)
# Ctrl+B c    : new window
# Ctrl+B n/p  : next/previous window
# Ctrl+B %    : vertical split
# Ctrl+B "    : horizontal split
# Ctrl+B z    : maximize/restore pane

Example ~/.tmux.conf:

# ~/.tmux.conf

# Enable mouse support
set -g mouse on

# Larger scrollback buffer
set -g history-limit 50000

# Number windows and panes from 1
set -g base-index 1
setw -g pane-base-index 1

# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"

# Customize the status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '

# Change the prefix key (Ctrl+A, screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix

# Quick pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D

7.3 File Synchronization (rsync, sshfs)

# Sync code with rsync
# local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
    ./my-project/ user@server:/home/user/my-project/

# server -> local (download results)
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/

# Sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz --exclude '__pycache__' \
    --exclude '.git' \
    --exclude '.venv' \
    --exclude '*.pyc' \
    --delete \
    ./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh

# Mount a remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15

# Unmount
fusermount -u ~/remote-server

8. Monitoring Tools

8.1 Watching nvidia-smi

# Refresh GPU status every second
watch -n 1 nvidia-smi

# Query only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 1

# Log to CSV
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
    --format=csv -l 1 > gpu_log.csv

8.2 nvtop

# Install nvtop (htop for GPUs)
sudo apt install nvtop

# Run
nvtop
8.3 GPU Monitoring from Python

# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU statistics via nvidia-smi."""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    gpus = []
    for line in result.stdout.strip().split('\n'):
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=float(parts[2]),
            utilization=float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Monitor GPUs while training runs."""
    print(f"GPU monitoring started (interval: {interval}s)")
    history = []

    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time

        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })

        time.sleep(interval)

    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} samples")

8.4 System Monitoring

# Install and run htop
sudo apt install htop
htop

# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o  # show only processes doing I/O

# Disk I/O statistics
iostat -x 1 5

# Memory usage
free -h
vmstat 1 10

# Network monitoring
sudo apt install nethogs
sudo nethogs

# All-in-one monitoring dashboard (glances)
pip install glances
glances

9. Code Quality Tools

9.1 black + ruff Setup

# Install
pip install black ruff pre-commit

# Run black
black .
black --line-length 88 src/

# Run ruff
ruff check .
ruff check --fix .  # auto-fix
ruff format .       # formatting (can replace black)

Example pyproject.toml:

[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'

[tool.ruff]
line-length = 88
target-version = "py310"

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # pyflakes
    "I",   # isort
    "N",   # pep8-naming
    "UP",  # pyupgrade
    "B",   # flake8-bugbear
]
ignore = ["E501", "B008"]

[tool.ruff.lint.isort]
known-first-party = ["my_package"]

9.2 pre-commit hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']

  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic

# Install and activate pre-commit
pre-commit install

# Run on all files
pre-commit run --all-files

# Run a specific hook
pre-commit run black --all-files

9.3 pytest with Coverage

# Install
pip install pytest pytest-cov pytest-xdist

# Basic run
pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing

# Parallel execution
pytest tests/ -n auto  # one worker per CPU

# Show the slowest tests
pytest tests/ --durations=10

# Run only tests matching a marker expression
pytest tests/ -m "not slow"

In pytest.ini or pyproject.toml:

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
    "slow: marks tests as slow",
    "gpu: marks tests requiring GPU",
    "integration: integration tests"
]
addopts = [
    "-v",
    "--tb=short",
    "--cov=src",
    "--cov-report=html",
    "--cov-fail-under=80"
]
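The `slow` and `gpu` markers declared above are applied in test files like this. The `nvidia-smi` check is our own heuristic for skipping GPU tests on machines without a driver, not part of pytest itself:

```python
import shutil

import pytest

# Skip GPU tests when no NVIDIA driver is visible on this machine.
requires_gpu = pytest.mark.skipif(
    shutil.which("nvidia-smi") is None,
    reason="nvidia-smi not found; no GPU available",
)

@pytest.mark.slow
def test_full_training_loop():
    assert 2 + 2 == 4  # placeholder for a long-running test

@requires_gpu
@pytest.mark.gpu
def test_cuda_matmul():
    import torch  # imported lazily so CPU-only machines can still collect tests
    x = torch.randn(8, 8, device="cuda")
    assert (x @ x).shape == (8, 8)
```

With this in place, `pytest -m "not slow"` skips the long tests and the GPU tests auto-skip on CPU-only machines.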

10. Weights & Biases Setup

10.1 Installing and Initializing W&B

# Install
pip install wandb

# Log in (set the API key)
wandb login
# Or via an environment variable
export WANDB_API_KEY=your-api-key-here

10.2 Basic W&B Usage

import wandb
import numpy as np

# Initialize the run
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

# Access the config
config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

    # Gradient histograms (PyTorch)
    # wandb.log({"gradients": wandb.Histogram(model.fc.weight.grad)})

# Log visualizations
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='predicted')
ax.plot(x, np.cos(x), label='actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

# Finish the run
wandb.finish()

10.3 Artifact Management

import wandb

# Save an artifact (model checkpoint)
def save_model_artifact(model_path: str, run, metadata: dict = None):
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

# Save an artifact (dataset)
def save_dataset_artifact(data_dir: str, run):
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

# Load an artifact
def load_model_artifact(artifact_name: str, version: str = "latest"):
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")

# Simulate saving a model file
with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")

save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)

wandb.finish()

10.4 Sweeps for Hyperparameter Optimization

import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, or bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function invoked by the sweep agent."""
    run = wandb.init()
    config = wandb.config

    # Simulate training a model with this config
    best_val_acc = 0.0

    for epoch in range(10):
        # In a real run, train the model here
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))

        if val_acc > best_val_acc:
            best_val_acc = val_acc

        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

    wandb.finish()

# Create and launch the sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run an agent (20 trials)
wandb.agent(sweep_id, function=train_sweep, count=20)

To distribute a sweep across multiple servers, start an agent on each:

# Server 1
wandb agent username/project/sweep-id

# Server 2 (at the same time)
wandb agent username/project/sweep-id

Conclusion

Time invested in your AI development environment up front pays off many times over in later productivity. The key points from this guide:

Top priorities:

  1. CUDA setup: correctly installed GPU drivers and the CUDA Toolkit are the foundation.
  2. Python environment isolation: combine pyenv with conda or poetry to keep environments separate per project.
  3. JupyterLab + VS Code: use Jupyter for exploratory analysis and VS Code for serious development.
  4. Docker containerization: guarantees reproducible experiment environments and simplifies team collaboration.
  5. W&B experiment tracking: tracking every experiment buys you reproducibility and insight.

Areas worth long-term investment:

  • Automate code quality with pre-commit hooks and CI/CD.
  • Master remote development (Remote SSH + tmux) to work productively from anywhere.
  • Use GPU monitoring to catch training problems early.

AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, Docker

Overview

Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.

You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.


1. AI Development Environment Requirements

1.1 Hardware Recommendations

GPU (Top Priority)

In AI research, a GPU is not optional — it is essential.

  • Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
  • Shared team server: NVIDIA A100 40GB or 80GB
  • Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
  • Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs

CPU

During GPU training, the CPU is primarily responsible for data preprocessing and loading.

  • Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)
  • Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
  • Optimal DataLoader workers = half the number of CPU cores
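The half-the-cores rule can be encoded as a tiny helper (the floor of 1 for small machines is an assumption):

```python
import os

def dataloader_workers() -> int:
    """Half the CPU cores, per the heuristic above, with a floor of 1."""
    return max(1, (os.cpu_count() or 2) // 2)

# Typical PyTorch usage (sketch, assuming a DataLoader is being constructed):
#   DataLoader(dataset, batch_size=32, num_workers=dataloader_workers(),
#              pin_memory=True)
print(dataloader_workers())
```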

RAM

  • Minimum: 32GB
  • Recommended: 64GB (at least 4x GPU VRAM)
  • Large-scale NLP: 128GB+ (tokenization, data loading)

Storage

/home/user/          -> NVMe SSD (OS, code, environments)
/data/               -> NVMe SSD or fast HDD (training data)
/models/             -> HDD or NAS (model checkpoints)

  • OS and packages: NVMe SSD 500GB+
  • Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
  • Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)
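A small stdlib check that the mounts above have the recommended headroom; mount points and thresholds are illustrative:

```python
import shutil

def free_gib(path: str) -> float:
    """Free space in GiB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3

# Thresholds mirror the layout above (illustrative values)
for mount, want in [("/", 500), ("/data", 1000), ("/models", 4000)]:
    try:
        status = "OK" if free_gib(mount) >= want else "LOW"
        print(f"{mount:10s} {free_gib(mount):8.0f} GiB free  [{status}]")
    except (FileNotFoundError, OSError):
        print(f"{mount:10s} not mounted")
```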

1.2 OS Selection

Ubuntu 22.04 LTS — Strongly Recommended

# Check Ubuntu version
lsb_release -a

# Example output:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)

Reasons to choose Ubuntu:

  • NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)
  • Largest AI/ML community documentation and package ecosystem
  • Best compatibility with Docker and Kubernetes
  • 5-year LTS support for stable server operation

macOS (M1/M2/M3)

Apple Silicon supports GPU acceleration via Metal Performance Shaders.

# Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"

1.3 Cloud vs. Local Development

Aspect          Local Server         Cloud
Upfront cost    High (hardware)      Low
Ongoing cost    Low (electricity)    High (hourly billing)
Latency         None                 Instance startup time
Scalability     Limited              Unlimited
Data security   High                 Policy-dependent
Best for        Continuous research  One-off large experiments
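A back-of-the-envelope way to choose between the two columns is a break-even calculation; all prices below are illustrative assumptions, not quotes:

```python
def break_even_hours(server_cost_usd: float, cloud_usd_per_hr: float,
                     power_usd_per_hr: float = 0.05) -> float:
    """GPU-hours at which an owned server beats hourly cloud rental.

    A toy model: ignores depreciation, maintenance, and spot pricing.
    """
    return server_cost_usd / (cloud_usd_per_hr - power_usd_per_hr)

# e.g. a hypothetical $15k A100 workstation vs ~$2/hr cloud A100
hours = break_even_hours(15_000, 2.00)
print(f"Break-even after ~{hours:,.0f} GPU-hours (~{hours / (24 * 30):.0f} months at 24/7)")
```

Under these assumptions, ownership pays off within a year of continuous use, which matches the "continuous research vs. one-off experiments" split in the table.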

2. NVIDIA GPU Drivers and CUDA Installation

2.1 Installing nvidia-driver (Ubuntu)

# 1. Remove existing NVIDIA packages
sudo apt-get purge 'nvidia*'
sudo apt-get autoremove

# 2. Check available drivers
ubuntu-drivers devices

# 3. Auto-install recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific version
sudo apt install nvidia-driver-535

# 4. Reboot
sudo reboot

# 5. Verify installation
nvidia-smi

Example nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB  Off| 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    56W / 400W |   1024MiB / 40960MiB |      0%      Default |
+-----------------------------------------------------------------------------+

2.2 Installing the CUDA Toolkit

Generate the installation command for your platform on the official NVIDIA CUDA downloads page.

# CUDA 12.2 installation (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

# Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA version
nvcc --version

2.3 Installing cuDNN

# Requires an NVIDIA Developer account
# After downloading cuDNN:

# For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples

# Verify cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

2.4 Verifying the Installation

# verify_gpu.py
import subprocess
import sys

def check_nvidia_smi():
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    if result.returncode == 0:
        print("nvidia-smi is working")
        lines = result.stdout.split('\n')
        for line in lines:
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")

            # Quick tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor operation test: passed (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"Detected GPUs: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()

# Run the verification script
python verify_gpu.py

3. Python Environment Management

3.1 Python Version Management with pyenv

# Install pyenv
curl https://pyenv.run | bash

# Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# List available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"

# Install Python
pyenv install 3.11.7
pyenv install 3.10.13

# Set global version
pyenv global 3.11.7

# Set project-specific version (in that directory)
cd my-project
pyenv local 3.10.13

# Check current Python version
python --version
pyenv versions

3.2 conda Environments (for GPU Libraries)

conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init

# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict

# Create AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml

# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0

# Export and import environments
conda env export > environment.yml
conda env create -f environment.yml

# List environments
conda env list

environment.yml example:

name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0

3.3 Dependency Management with poetry

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create project
poetry new ml-research
cd ml-research

# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter

# Install with extras (note: the PyPI "torch" package has no "cuda" extra;
# CUDA builds come from the PyTorch index instead)
poetry source add --priority=explicit pytorch-gpu https://download.pytorch.org/whl/cu121
poetry add torch --source pytorch-gpu
poetry add "transformers[torch]>=4.35"

# Install all dependencies
poetry install

# Check virtual environment info
poetry env info
poetry env list

3.4 uv (Ultra-fast Package Manager)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install packages (10-100x faster than pip)
uv pip install torch torchvision numpy pandas

# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate

# Install from requirements.txt (very fast)
uv pip install -r requirements.txt

# Manage Python versions
uv python install 3.11 3.12
uv python list

4. Advanced JupyterLab Setup

4.1 JupyterLab Installation and Extensions

# Install JupyterLab
pip install jupyterlab

# Install useful extensions
pip install jupyterlab-git           # Git integration
pip install jupyterlab-lsp           # Language server protocol (autocomplete)
pip install python-lsp-server        # Python LSP
pip install jupyterlab-code-formatter  # Code formatter
pip install black isort              # Formatters
pip install jupyterlab-vim           # Vim key bindings
pip install ipywidgets               # Interactive widgets

# List installed extensions
jupyter labextension list

# Start JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

4.2 Kernel Management

# List available kernels
jupyter kernelspec list

# Register current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"

# Remove a kernel
jupyter kernelspec remove old-kernel

Custom kernel.json example:

{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}

4.3 Remote Jupyter via SSH Tunneling

# Start Jupyter on the server (bound to localhost only; reach it via the tunnel)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1

# Create SSH tunnel from local machine
ssh -N -L 8888:localhost:8888 user@your-server.com

# Access in browser
# http://localhost:8888

# Set a password for security
jupyter server password

Systemd service for auto-start:

cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter

4.4 Jupyter Magic Commands

# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function()  # 100 loops per timing run

# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)

# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()

# Write cell contents to file
%%writefile my_script.py
import numpy as np
# ...

# Load file into cell
%load my_script.py

# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py

# Set environment variables
%env CUDA_VISIBLE_DEVICES=0

# Auto-reload (reflect code changes immediately)
%load_ext autoreload
%autoreload 2

# Inline matplotlib plots
%matplotlib inline

# List current variables
%who
%whos

# Access previous outputs
print(_)   # Last result
print(__)  # Two results ago

4.5 nbconvert (Notebook Conversion)

# Convert notebook to script
jupyter nbconvert --to script notebook.ipynb

# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb

# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb

# Execute notebook in-place from CLI
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb

# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100

5. VS Code for AI

5.1 Installing Essential Extensions

# Install extensions from the command line
code --install-extension ms-python.python           # Python
code --install-extension ms-toolsai.jupyter         # Jupyter
code --install-extension ms-python.pylance          # Pylance (LSP)
code --install-extension charliermarsh.ruff         # Ruff linter
code --install-extension github.copilot             # GitHub Copilot
code --install-extension github.copilot-chat        # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh  # Remote SSH
code --install-extension ms-vscode-remote.remote-containers  # Dev Containers
code --install-extension eamodio.gitlens            # GitLens
code --install-extension njpwerner.autodocstring     # Auto-docstring

5.2 VS Code Settings (settings.json)

{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "python.analysis.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}

5.3 Remote SSH Setup

# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22

Using Remote SSH in VS Code:

  1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
  2. Select "Remote-SSH: Connect to Host"
  3. Choose your configured host

Auto-install extensions on the remote server:

{
  "remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}

5.4 Dev Containers

.devcontainer/devcontainer.json example:

{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}

6. GPU Docker Containers

6.1 Installing NVIDIA Container Toolkit

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

6.2 Dockerfile for AI Projects

# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1

# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*

# Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

# Upgrade pip
RUN pip install --upgrade pip setuptools wheel

# Working directory
WORKDIR /workspace

# Python dependencies (optimize layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt

# Copy source code
COPY . .

# Install as editable package
RUN pip install -e ".[dev]"

# Create non-root user (security best practice)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher

# Expose ports
EXPOSE 8888 6006

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

6.3 Docker Compose for a Service Stack

# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:

# Start the full stack
docker-compose up -d

# Follow logs
docker-compose logs -f trainer

# Restart a specific service
docker-compose restart notebook

# Check GPU usage inside container
docker-compose exec trainer nvidia-smi

# Tear down
docker-compose down --volumes

6.4 GPU Sharing Strategies

# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage  # Use GPUs 0 and 1
docker run --gpus '"device=2"' myimage     # Use only GPU 2

# MIG (Multi-Instance GPU) on A100/H100
sudo nvidia-smi -i 0 -mig 1                # enable MIG mode first (may require reboot)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb -C   # create GPU instance + its compute instance

# Assign MIG instance to a container
docker run --gpus '"MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage

7. Remote Development Environment

7.1 SSH-Based Remote Development

# Generate SSH key (on local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com

# Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Optimized SSH config (~/.ssh/config)
Host ml-server
    HostName server.example.com
    User researcher
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/cm_%r@%h:%p
    ControlPersist 10m
    ServerAliveInterval 30
    Compression yes

7.2 tmux (Session Persistence)

# Install tmux
sudo apt install tmux

# Start a new session
tmux new-session -s training

# Reattach to a session
tmux attach-session -t training

# List sessions
tmux list-sessions

# Key shortcuts (Prefix: Ctrl+B)
# Ctrl+B d    : Detach session (keeps running after SSH disconnect)
# Ctrl+B c    : Create new window
# Ctrl+B n/p  : Next/previous window
# Ctrl+B %    : Vertical split
# Ctrl+B "    : Horizontal split
# Ctrl+B z    : Toggle pane zoom

~/.tmux.conf configuration:

# ~/.tmux.conf

# Enable mouse support
set -g mouse on

# Increase scroll buffer
set -g history-limit 50000

# Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1

# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"

# Customize status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '

# Change prefix to Ctrl+A (screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix

# Fast pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D

7.3 File Synchronization (rsync, sshfs)

# Sync code local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
    ./my-project/ user@server:/home/user/my-project/

# Sync results server -> local
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/

# Auto-sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz \
    --exclude '__pycache__' \
    --exclude '.git' \
    --exclude '.venv' \
    --exclude '*.pyc' \
    --delete \
    ./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh

# Mount remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15

# Unmount
fusermount -u ~/remote-server

8. Monitoring Tools

8.1 nvidia-smi watch

# Refresh GPU status every second
watch -n 1 nvidia-smi

# Show only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 1

# Record to CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
    --format=csv -l 1 > gpu_log.csv
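The recorded gpu_log.csv can be summarized with the standard library. A minimal sketch — the column names follow nvidia-smi's CSV header and may differ slightly across driver versions:

```python
import csv
from statistics import mean

def summarize(path: str) -> dict:
    """Average utilization and peak memory from an nvidia-smi CSV log."""
    utils, mems = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            utils.append(float(row["utilization.gpu [%]"].rstrip(" %")))
            mems.append(float(row["memory.used [MiB]"].rstrip(" MiB")))
    return {"avg_util": mean(utils), "peak_mem": max(mems)}

# Usage over the log produced by the -l 1 command above:
# stats = summarize("gpu_log.csv")
```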

8.2 nvtop

# Install nvtop (htop for GPUs)
sudo apt install nvtop

# Run
nvtop

8.3 GPU Monitoring from Python

# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU stats via nvidia-smi"""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    gpus = []
    for line in result.stdout.strip().split('\n'):
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=float(parts[2]),
            utilization=float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=float(parts[6]) if 'N/A' not in parts[6] else 0.0
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Monitor GPU during training"""
    print(f"GPU monitoring started (interval: {interval}s)")
    history = []

    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time

        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })

        time.sleep(interval)

    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} measurements")

8.4 System Monitoring

# Install and use htop
sudo apt install htop
htop

# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o  # Show only processes with active I/O

# Disk I/O stats
iostat -x 1 5

# Memory usage
free -h
vmstat 1 10

# Network monitoring
sudo apt install nethogs
sudo nethogs

# Comprehensive monitoring dashboard (glances)
pip install glances
glances

9. Code Quality Tools

9.1 black + ruff Configuration

# Install
pip install black ruff pre-commit

# Run black
black .
black --line-length 88 src/

# Run ruff
ruff check .
ruff check --fix .  # Auto-fix
ruff format .       # Format (can replace black)

pyproject.toml configuration:

[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'

[tool.ruff]
line-length = 88
target-version = "py310"

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # pyflakes
    "I",   # isort
    "N",   # pep8-naming
    "UP",  # pyupgrade
    "B",   # flake8-bugbear
]
ignore = ["E501", "B008"]

[tool.ruff.lint.isort]
known-first-party = ["my_package"]

9.2 pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']

  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic

# Install and activate pre-commit
pre-commit install

# Run on all files
pre-commit run --all-files

# Run a specific hook
pre-commit run black --all-files

9.3 pytest with Coverage

# Install
pip install pytest pytest-cov pytest-xdist

# Basic run
pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing

# Parallel execution
pytest tests/ -n auto  # As many workers as CPU cores

# Show slowest tests
pytest tests/ --durations=10

# Run only fast tests
pytest tests/ -m "not slow"

pytest configuration in pyproject.toml:

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
    "slow: marks tests as slow",
    "gpu: marks tests requiring GPU",
    "integration: integration tests"
]
addopts = [
    "-v",
    "--tb=short",
    "--cov=src",
    "--cov-report=html",
    "--cov-fail-under=80"
]
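As a sketch of how the markers registered above are used in practice (the test names and bodies are hypothetical; the GPU test skips itself when PyTorch or a CUDA device is absent):

```python
# tests/test_markers.py -- hypothetical tests exercising the markers above
import pytest

@pytest.mark.gpu
def test_matmul_on_gpu():
    torch = pytest.importorskip("torch")   # skip cleanly when torch is absent
    if not torch.cuda.is_available():
        pytest.skip("no CUDA device available")
    x = torch.randn(8, 8, device="cuda")
    assert (x @ x).shape == (8, 8)

@pytest.mark.slow
def test_exhaustive_search():
    assert sum(range(1_000)) == 499_500
```

On a laptop without a GPU, run only the cheap tests with `pytest -m "not slow and not gpu"`.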

10. Weights & Biases Setup

10.1 Installation and Initialization

# Install
pip install wandb

# Log in (sets API key)
wandb login
# Or set via environment variable
export WANDB_API_KEY=your-api-key-here
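On air-gapped or flaky-network GPU nodes, W&B also supports an offline mode: runs are written locally and uploaded later with `wandb sync`. A minimal sketch — the storage directory is an arbitrary choice:

```python
import os

# Record runs locally instead of streaming to wandb.ai; upload later with:
#   wandb sync <run-directory>
os.environ.setdefault("WANDB_MODE", "offline")
os.environ.setdefault("WANDB_DIR", os.path.expanduser("~/wandb-offline"))
print(os.environ["WANDB_MODE"])
```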

10.2 Basic W&B Usage

import wandb
import numpy as np

# Initialize experiment
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

# Log a visualization
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='Predicted')
ax.plot(x, np.cos(x), label='Actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

wandb.finish()

10.3 Artifact Management

import wandb
import os

def save_model_artifact(model_path: str, run, metadata: dict = None):
    """Save model checkpoint as a W&B artifact"""
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

def save_dataset_artifact(data_dir: str, run):
    """Save dataset as a W&B artifact"""
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

def load_model_artifact(artifact_name: str, version: str = "latest"):
    """Download a model artifact"""
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")

# Write a dummy checkpoint file for demonstration purposes
with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")

save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)

wandb.finish()

10.4 Hyperparameter Optimization Sweeps

import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, or bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function called by each sweep run"""
    run = wandb.init()
    config = wandb.config

    best_val_acc = 0.0

    for epoch in range(10):
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))

        if val_acc > best_val_acc:
            best_val_acc = val_acc

        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

    wandb.finish()

# Create and launch sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run agent for N trials
wandb.agent(sweep_id, function=train_sweep, count=20)

To run a sweep in parallel across multiple servers:

# Server 1
wandb agent username/project/sweep-id

# Server 2 (simultaneously)
wandb agent username/project/sweep-id

Conclusion

Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.

Essential priorities:

  1. CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.
  2. Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.
  3. JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.
  4. Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.
  5. W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.

Areas worth investing in long-term:

  • Automate code quality with pre-commit hooks and CI/CD.
  • Master remote development (Remote SSH + tmux) to work productively from anywhere.
  • Use GPU monitoring to catch training issues early.

References