Chaos and Order


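Overview

Setting up the AI development environment is half of a project's success. Without a proper environment, reproducible experiments, collaboration, and fast iteration are all out of reach. This guide covers the environment setup an AI/ML developer needs, from initial GPU server configuration to the day-to-day development workflow.

It explains, step by step, how to install CUDA on an Ubuntu server, manage Python environments, use JupyterLab and VS Code effectively, and build reproducible experiment environments with Docker containers.

1. AI Development Environment Requirements

1.1 Hardware Recommendations

GPU (top priority)

For AI research, a GPU is a necessity, not an option. Recommendations by use case:

Entry-level / individual research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
Shared team server: NVIDIA A100 40GB or 80GB
Large-scale LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
Cloud: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs

CPU

During GPU training, the CPU is used mainly for data preprocessing and loading.

Minimum: 8 cores / 16 threads (Intel Core i9 or AMD Ryzen 9)
Recommended: 16 cores or more (AMD Threadripper, Intel Xeon)
Data loader workers: about half the number of CPU cores is a reasonable starting point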
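
The worker-count rule of thumb above can be sketched in a few lines. This is only a starting point under stated assumptions: the function name `suggested_num_workers` and the `fraction` parameter are made up for illustration, and `os.cpu_count()` reports logical CPUs, so the best value for a given workload still has to be found by measurement.

```python
import os


def suggested_num_workers(cpu_count=None, fraction=0.5):
    """Return a starting value for the number of data-loader workers.

    Hypothetical helper: applies the "half of the CPU cores" rule of thumb.
    """
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms
        cpu_count = os.cpu_count() or 1
    # Never go below one worker
    return max(1, int(cpu_count * fraction))


print("suggested workers on this machine:", suggested_num_workers())
```

The result can be passed as the `num_workers` argument of a data loader and then tuned up or down while watching GPU utilization.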

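RAM

Minimum: 32GB
Recommended: 64GB (at least 4x the GPU VRAM)
Large-scale NLP: 128GB or more (tokenization, data loading)

Storage

/home/user/          → NVMe SSD (OS, code, environments)
/data/               → NVMe SSD or fast HDD (training data)
/models/             → HDD or NAS (model checkpoints)

OS and packages: NVMe SSD, 500GB or more
Datasets: NVMe SSD (for I/O-intensive training) or high-performance HDD
Checkpoints: HDD, 4TB or more (cost-effective bulk storage)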
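
As a rough check of the layout above, free space on each mount point can be read with the standard library. This is a minimal sketch: the helper names are hypothetical, and the three paths are the example directories from the layout, which may not exist on a given machine (missing paths are reported rather than treated as errors).

```python
import shutil
from pathlib import Path


def free_space_gb(path):
    """Return free space in GiB for the filesystem holding `path`, or None if it is missing."""
    p = Path(path)
    if not p.exists():
        return None
    return shutil.disk_usage(p).free / 1024**3


def report(paths):
    """Build one human-readable line per path."""
    lines = []
    for path in paths:
        free = free_space_gb(path)
        status = "missing" if free is None else f"{free:.1f} GiB free"
        lines.append(f"{path}: {status}")
    return lines


# Example directories from the storage layout above
for line in report(["/home", "/data", "/models"]):
    print(line)
```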

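1.2 Choosing an OS

Ubuntu 22.04 LTS (strongly recommended)

# Check the Ubuntu version
lsb_release -a

# Example output (abridged):
# Description:    Ubuntu 22.04.3 LTS
# Codename:       jammy

Reasons to choose Ubuntu:

Official NVIDIA support (driver, CUDA, and cuDNN packages are published for Ubuntu LTS releases)
Extensive AI/ML community documentation and packages
Strong compatibility with Docker and Kubernetes
Five years of LTS support for stable server operation

macOS (M1/M2/M3)

Apple Silicon supports GPU acceleration through Metal Performance Shaders (the MPS backend in PyTorch).

# Check MPS acceleration (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"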

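1.3 Cloud vs. Local Development

Item	Local server	Cloud
Upfront cost	High (hardware)	Low
Operating cost	Low (electricity)	High (billed per hour)
Wait time	None	Instance startup time
Scalability	Limited	Effectively unlimited (subject to quota)
Data security	High	Depends on provider policy
Recommended use	Ongoing research	One-off large-scale experiments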
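
The cost trade-off in the table can be made concrete with a break-even estimate. The sketch below is illustrative only: the function name and every figure in the example call (hardware price, hourly electricity cost, hourly cloud rate) are hypothetical placeholders, not quotes from any provider.

```python
def break_even_hours(hardware_cost, local_hourly_cost, cloud_hourly_rate):
    """Hours of GPU use after which buying a local server is cheaper than renting."""
    saving_per_hour = cloud_hourly_rate - local_hourly_cost
    if saving_per_hour <= 0:
        # Renting is never more expensive per hour, so there is no break-even point
        return float("inf")
    return hardware_cost / saving_per_hour


# Hypothetical figures for illustration only
hours = break_even_hours(hardware_cost=10_000, local_hourly_cost=0.5, cloud_hourly_rate=3.0)
print(f"break-even after about {hours:.0f} GPU-hours")
```

If the expected usage over the hardware's lifetime is well above the break-even point, the "ongoing research" row of the table applies; well below it, the cloud is likely the cheaper choice.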

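2. Installing the NVIDIA GPU Driver and CUDA

2.1 Installing nvidia-driver (Ubuntu)

# 1. Remove existing NVIDIA packages (quote the pattern so the shell does not expand it)
sudo apt-get purge 'nvidia*'
sudo apt autoremove

# 2. List the available drivers
ubuntu-drivers devices

# 3. Install the recommended driver automatically
sudo ubuntu-drivers autoinstall

# Or install a specific version
sudo apt install nvidia-driver-535

# 4. Reboot
sudo reboot

# 5. Verify the installation
nvidia-smi

Example nvidia-smi output: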

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB  Off| 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    56W / 400W |   1024MiB / 40960MiB |      0%      Default |
+-----------------------------------------------------------------------------+

2.2 Installing the CUDA Toolkit

Generate the install commands that match your environment on the official NVIDIA site.

# Install CUDA 12.2 (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

# Set environment variables (append to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Check the CUDA version
nvcc --version
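
When the version check needs to run inside a script, the release number can be pulled out of the `nvcc --version` text. The sketch below assumes the output contains a fragment of the form `release 12.2,` as current toolkits print; the helper names are hypothetical, and the live query returns `None` when `nvcc` is not on `PATH`.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release():
    """Query nvcc on this machine; returns None when nvcc is not installed."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


# Sample line in the assumed format, so the parser can be tried without a GPU machine
sample = "Cuda compilation tools, release 12.2, V12.2.140"
print(parse_nvcc_release(sample))
```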

2.3 Installing cuDNN

# Requires an NVIDIA Developer account
# After downloading cuDNN:

# For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples

# Check the cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

2.4 Verifying the Installation

# verify_gpu.py
import subprocess

def check_nvidia_smi():
    """Run nvidia-smi and print the line that carries the driver and CUDA versions."""
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    except FileNotFoundError:
        print("nvidia-smi not found (is the driver installed?)")
        return
    if result.returncode == 0:
        print("nvidia-smi is working")
        # Extract the GPU/driver header line
        lines = result.stdout.split('\n')
        for line in lines:
            if 'NVIDIA' in line and 'Driver' in line:
                print(f"  {line.strip()}")
    else:
        print("nvidia-smi error:", result.stderr)

def check_pytorch_cuda():
    """Report PyTorch's view of CUDA and run a small matrix multiplication on the GPU."""
    try:
        import torch
        print(f"\nPyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"cuDNN version: {torch.backends.cudnn.version()}")
            print(f"GPU count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")

            # Simple tensor operation test
            x = torch.randn(1000, 1000).cuda()
            y = torch.randn(1000, 1000).cuda()
            z = x @ y
            print(f"GPU tensor operation test: OK (shape: {z.shape})")
    except ImportError:
        print("PyTorch is not installed")

def check_tensorflow_gpu():
    """List the GPUs that TensorFlow can see."""
    try:
        import tensorflow as tf
        print(f"\nTensorFlow version: {tf.__version__}")
        gpus = tf.config.list_physical_devices('GPU')
        print(f"GPUs detected: {len(gpus)}")
        for gpu in gpus:
            print(f"  {gpu}")
    except ImportError:
        print("TensorFlow is not installed")

if __name__ == "__main__":
    check_nvidia_smi()
    check_pytorch_cuda()
    check_tensorflow_gpu()

python verify_gpu.py

3. Managing Python Environments

3.1 Managing Python Versions with pyenv

# Install pyenv
curl https://pyenv.run | bash

# Append to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# List the available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"

# Install Python
pyenv install 3.11.7
pyenv install 3.10.13

# Set the global version
pyenv global 3.11.7

# Set a per-project version (run inside the project directory)
cd my-project
pyenv local 3.10.13

# Check the current Python version
python --version
pyenv versions
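
`pyenv local` records the chosen version in a `.python-version` file in the project directory. A script can compare that pin with the interpreter it is actually running under. The sketch below is one way to do this with the standard library; the helper names are hypothetical, and it only handles plain version strings such as `3.10.13` or `3.10`.

```python
import sys
from pathlib import Path


def pinned_version(project_dir="."):
    """Return the first version listed in .python-version, or None if there is no pin."""
    pin_file = Path(project_dir) / ".python-version"
    if not pin_file.is_file():
        return None
    lines = pin_file.read_text().split()
    return lines[0] if lines else None


def matches(pinned, running=None):
    """True when the running interpreter satisfies the pin (exact or prefix match)."""
    if running is None:
        running = ".".join(str(part) for part in sys.version_info[:3])
    return running == pinned or running.startswith(pinned + ".")


pin = pinned_version()
if pin is None:
    print("no .python-version file in this directory")
else:
    print(f"pinned {pin}, interpreter matches: {matches(pin)}")
```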

3.2 conda Environments (GPU Libraries)

conda is especially useful for installing GPU-accelerated libraries such as RAPIDS and cuML.

# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init

# Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict

# Create an AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml

# PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0

# Export and recreate an environment
conda env export > environment.yml
conda env create -f environment.yml

# List environments
conda env list

Example environment.yml:

name: aiml
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch>=2.1
  - torchvision
  - pytorch-cuda=12.1
  - numpy>=1.24
  - pandas>=2.0
  - scikit-learn>=1.3
  - pip:
      - transformers>=4.35
      - wandb>=0.16
      - pydantic>=2.0

3.3 Dependency Management with poetry

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

# Create a project
poetry new ml-research
cd ml-research

# Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter

# Pin a minimum version or include extras
# (torch defines no "cuda" extra; the CUDA build is selected by the wheel index, not by an extra)
poetry add "torch>=2.1"
poetry add "transformers[torch]>=4.35"

# Install
poetry install

# Virtual environment information
poetry env info
poetry env list

3.4 uv (A Very Fast Package Manager)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment first (uv pip installs into the active environment)
uv venv .venv --python 3.11
source .venv/bin/activate

# Install packages (uv advertises 10-100x faster installs than pip)
uv pip install torch torchvision numpy pandas

# Install from requirements.txt
uv pip install -r requirements.txt

# Manage Python versions
uv python install 3.11 3.12
uv python list

4. Advanced JupyterLab Setup

4.1 Installing JupyterLab and Extensions

# Install JupyterLab
pip install jupyterlab

# Install useful extensions
pip install jupyterlab-git           # Git integration
pip install jupyterlab-lsp           # Language Server Protocol (autocompletion)
pip install python-lsp-server        # Python LSP
pip install jupyterlab-code-formatter  # Code formatter
pip install black isort              # Formatters
pip install jupyterlab-vim           # Vim keybindings
pip install ipywidgets               # Interactive widgets

# List installed extensions
jupyter labextension list

# Start JupyterLab (--ip=0.0.0.0 listens on every interface; use it only on a trusted network)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

4.2 Managing Kernels

# List kernels
jupyter kernelspec list

# Register the current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

# Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"

# Remove a kernel
jupyter kernelspec remove old-kernel

# Edit a custom kernel spec
# ~/.local/share/jupyter/kernels/aiml/kernel.json

Example kernel.json:

{
  "argv": [
    "/home/user/miniconda3/envs/aiml/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python (aiml GPU)",
  "language": "python",
  "env": {
    "CUDA_VISIBLE_DEVICES": "0",
    "PYTHONPATH": "/home/user/projects"
  }
}
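
A kernel spec like the one above is plain JSON, so variants (for example one kernel per GPU) can be generated rather than edited by hand. The following is a minimal sketch; `build_kernel_spec` is a hypothetical helper, and the interpreter path and environment values are the example values from the listing, not requirements.

```python
import json


def build_kernel_spec(python_path, display_name, env=None):
    """Return a kernel.json structure with the same layout as the example above."""
    return {
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
        "env": dict(env or {}),
    }


# A second kernel pinned to GPU 1 (illustrative values)
spec = build_kernel_spec(
    "/home/user/miniconda3/envs/aiml/bin/python",
    "Python (aiml GPU 1)",
    env={"CUDA_VISIBLE_DEVICES": "1"},
)
print(json.dumps(spec, indent=2))
```

The printed JSON would be saved as `kernel.json` in its own directory under the kernels folder shown above, after which `jupyter kernelspec list` should show the new entry.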

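4.3 Remote Jupyter (SSH Tunneling)

# Start Jupyter on the server, bound to localhost only
# (token authentication stays enabled; the login URL with the token is printed in the log)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1

# Create an SSH tunnel from the local machine
ssh -N -L 8888:localhost:8888 user@your-server.com

# Open in a browser
# http://localhost:8888

# Set a password (replaces token login)
jupyter lab password
# Afterwards, log in at localhost:8888 with the password

systemd service for automatic startup: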

# /etc/systemd/system/jupyter.service
cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter

4.4 Jupyter Magic Commands

# Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function()  # 100 loops per measurement

# Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)

# Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()

# Save cell contents to a file (%%writefile must be the first line of its cell)
%%writefile my_script.py
import numpy as np
# ...

# Load file contents into a cell
%load my_script.py

# Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py

# Environment variables
%env CUDA_VISIBLE_DEVICES=0

# Automatic reload (code changes take effect immediately)
%load_ext autoreload
%autoreload 2

# Show matplotlib figures inline
%matplotlib inline

# List current variables
%who
%whos

# Access previous outputs
print(_)   # last output
print(__)  # second-to-last output

4.5 nbconvert (Notebook Conversion)

# Convert a notebook to a script
jupyter nbconvert --to script notebook.ipynb

# Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb

# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb

# Execute a notebook from the command line
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb

# Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100
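
When several parameter combinations need to be run, the papermill command lines can be generated instead of typed. The sketch below only builds the command strings using the `-p name value` form shown above; it does not call papermill itself. The function name and the output file naming scheme are assumptions made for illustration.

```python
import itertools
import shlex


def papermill_commands(input_nb, grid):
    """Build one papermill command line per combination in `grid` (dict of name -> list)."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        # Encode the parameter values in the output file name (hypothetical naming scheme)
        tag = "_".join(f"{name}-{value}" for name, value in zip(names, values))
        cmd = ["papermill", input_nb, f"output_{tag}.ipynb"]
        for name, value in zip(names, values):
            cmd += ["-p", name, str(value)]
        commands.append(shlex.join(cmd))
    return commands


for command in papermill_commands("input.ipynb", {"lr": [0.001, 0.01], "epochs": [10, 100]}):
    print(command)
```

Each printed line can be run as-is in a shell, or the list can be handed to a job scheduler.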

5. VS Code for AI

5.1 Installing Essential Extensions

# Install extensions from the VS Code command line
code --install-extension ms-python.python           # Python
code --install-extension ms-toolsai.jupyter         # Jupyter
code --install-extension ms-python.pylance          # Pylance (LSP)
code --install-extension charliermarsh.ruff         # Ruff linter
code --install-extension github.copilot             # GitHub Copilot
code --install-extension github.copilot-chat        # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh  # Remote SSH
code --install-extension ms-vscode-remote.remote-containers  # Dev Containers
code --install-extension eamodio.gitlens            # GitLens
code --install-extension njpwerner.autodocstring     # Docstring generation
code --install-extension tabnine.tabnine-vscode     # Tabnine AI

5.2 VS Code Settings (settings.json)

{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.formatting.provider": "none",
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll.ruff": true,
      "source.organizeImports.ruff": true
    }
  },
  "python.analysis.typeCheckingMode": "basic",
  "python.analysis.autoImportCompletions": true,
  "python.analysis.indexing": true,
  "python.analysis.packageIndexDepths": [
    { "name": "torch", "depth": 5 },
    { "name": "transformers", "depth": 5 }
  ],
  "jupyter.askForKernelRestart": false,
  "jupyter.interactiveWindow.creationMode": "perFile",
  "editor.inlineSuggest.enabled": true,
  "github.copilot.enable": {
    "*": true,
    "yaml": true,
    "plaintext": false
  },
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.pyc": true,
    ".mypy_cache": true,
    ".ruff_cache": true
  },
  "terminal.integrated.env.linux": {
    "CUDA_VISIBLE_DEVICES": "0"
  }
}

5.3 Remote SSH Setup

# Local ~/.ssh/config
Host ml-server
    HostName 192.168.1.100
    User your-username
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 3

Host gpu-cloud
    HostName gpu-server.example.com
    User ubuntu
    IdentityFile ~/.ssh/cloud-key.pem
    Port 22

Using Remote SSH in VS Code:

Press Ctrl+Shift+P (Cmd+Shift+P on Mac)
Select "Remote-SSH: Connect to Host"
Choose the host name you configured

Automatically install extensions on the remote server:

{
  "remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}

5.4 Dev Containers

Example .devcontainer/devcontainer.json:

{
  "name": "AI Development",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "CUDA_VERSION": "12.1.1",
      "PYTHON_VERSION": "3.11"
    }
  },
  "runArgs": ["--gpus", "all", "--shm-size", "8g"],
  "mounts": [
    "source=/data,target=/data,type=bind",
    "source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "charliermarsh.ruff",
        "github.copilot"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python"
      }
    }
  },
  "postCreateCommand": "pip install -e '.[dev]'",
  "remoteUser": "root"
}

6. GPU Docker Containers

6.1 Installing the NVIDIA Container Toolkit

# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

6.2 Dockerfile for AI Projects

# Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1

# System packages
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-venv \
    python3-pip \
    git \
    wget \
    curl \
    vim \
    htop \
    tmux \
    && rm -rf /var/lib/apt/lists/*

# Use a Python 3.11 virtual environment so that `python` and `pip` refer to the same interpreter
# (the system pip3 on Ubuntu 22.04 belongs to Python 3.10)
RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip
RUN pip install --upgrade pip setuptools wheel

# Working directory
WORKDIR /workspace

# Python dependencies (ordered to make use of layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt

# Copy the source code
COPY . .

# Install the package
RUN pip install -e ".[dev]"

# Create a non-root user (security)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher

# Expose ports (Jupyter, TensorBoard)
EXPOSE 8888 6006

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

6.3 Service Stack with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  # Training container
  trainer:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
    volumes:
      - ./src:/workspace/src
      - ./configs:/workspace/configs
      - /data:/data:ro
      - model-checkpoints:/workspace/checkpoints
      - ~/.cache/huggingface:/home/researcher/.cache/huggingface
    shm_size: '8g'
    command: python train.py

  # Jupyter notebook server
  notebook:
    build:
      context: .
      dockerfile: Dockerfile
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - '8888:8888'
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./src:/workspace/src
      - /data:/data:ro
    command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root

  # TensorBoard
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - '6006:6006'
    volumes:
      - model-checkpoints:/workspace/checkpoints:ro
    command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0

  # MLflow tracking server
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.0
    ports:
      - '5000:5000'
    volumes:
      - mlflow-data:/mlflow
    command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts

volumes:
  model-checkpoints:
  mlflow-data:

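# Start the whole stack
docker-compose up -d

# Follow the logs
docker-compose logs -f trainer

# Restart a specific service
docker-compose restart notebook

# GPU usage (inside the container)
docker-compose exec trainer nvidia-smi

# Tear down (--volumes also deletes the named volumes, including saved checkpoints)
docker-compose down --volumes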

6.4 GPU Sharing Strategies

# Assign specific GPUs
docker run --gpus '"device=0,1"' myimage  # use GPUs 0 and 1
docker run --gpus '"device=2"' myimage     # use GPU 2 only

# MIG (Multi-Instance GPU) setup (A100/H100); MIG mode must be enabled on the GPU first
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 --create-gpu-instance 3g.20gb
sudo nvidia-smi mig -i 0 --create-compute-instance 3g.20gb

# Assign a MIG instance to a container (nvidia-smi -L lists the MIG device identifiers)
docker run --gpus '"device=MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage

# Fractional GPU sharing (time-slicing)
# Configured in /etc/nvidia-container-runtime/config.toml
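
On a shared server, a simple way to choose which GPU to pass to `--gpus` or `CUDA_VISIBLE_DEVICES` is to pick the device with the least memory in use. The sketch below parses `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` output; the helper names are hypothetical, the sample text is hand-written, and the live query returns an empty result when `nvidia-smi` is unavailable.

```python
import shutil
import subprocess


def parse_memory_used(csv_text):
    """Parse 'index, memory_used' lines into {gpu_index: MiB used}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (part.strip() for part in line.split(","))
        usage[int(index)] = int(used)
    return usage


def least_used_gpu(usage):
    """Return the GPU index with the lowest memory use, or None when no GPU is listed."""
    return min(usage, key=usage.get) if usage else None


def query_memory_used():
    """Query nvidia-smi on this machine; returns {} when it is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    return parse_memory_used(result.stdout) if result.returncode == 0 else {}


# Hand-written sample: GPU 1 has the least memory in use
sample = "0, 20480\n1, 512\n2, 8192\n"
print("least used GPU in sample:", least_used_gpu(parse_memory_used(sample)))
```

This is a heuristic only: memory in use says nothing about who owns a GPU, so on a team server it complements, rather than replaces, an agreed allocation scheme.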

7. Remote Development Environment

7.1 SSH-Based Remote Development

# Generate an SSH key (local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"

# Copy the public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com

# Or do it manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Optimize SSH connections (~/.ssh/config)
Host ml-server
    HostName server.example.com
    User researcher
    IdentityFile ~/.ssh/id_ed25519
    ControlMaster auto
    ControlPath ~/.ssh/cm_%r@%h:%p
    ControlPersist 10m
    ServerAliveInterval 30
    Compression yes

7.2 tmux (Persistent Sessions)

# Install tmux
sudo apt install tmux

# Start a new session
tmux new-session -s training

# Reattach to a session
tmux attach-session -t training

# List sessions
tmux list-sessions

# Useful shortcuts (default prefix: Ctrl+B)
# Ctrl+B d    : detach the session (it keeps running after SSH disconnects)
# Ctrl+B c    : create a new window
# Ctrl+B n/p  : next/previous window
# Ctrl+B %    : split into left and right panes
# Ctrl+B "    : split into top and bottom panes
# Ctrl+B z    : maximize/restore the pane

~/.tmux.conf configuration:

# ~/.tmux.conf

# Enable mouse support
set -g mouse on

# Increase the scrollback buffer
set -g history-limit 50000

# Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1

# Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"

# Customize the status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '

# Change the prefix key to Ctrl+A (screen style);
# with this setting, the shortcuts listed above start with Ctrl+A instead of Ctrl+B
set -g prefix C-a
unbind C-b
bind C-a send-prefix

# Quick pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D

7.3 File Synchronization (rsync, sshfs)

# Synchronize code with rsync
# Local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
    ./my-project/ user@server:/home/user/my-project/

# Server -> local (download results)
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/

# Sync script (--delete removes remote files that no longer exist locally)
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz --exclude '__pycache__' \
    --exclude '.git' \
    --exclude '.venv' \
    --exclude '*.pyc' \
    --delete \
    ./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh

# Mount a remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15

# Unmount
fusermount -u ~/remote-server

8. Monitoring Tools

8.1 nvidia-smi watch

# Refresh the GPU status every second
watch -n 1 nvidia-smi

# Show only memory use and utilization
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 1

# Record a CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
    --format=csv -l 1 > gpu_log.csv
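
A log recorded this way can be summarized afterwards with the standard `csv` module. The sketch below assumes the header and value format that `--format=csv` usually produces, with units attached to values (for example `45 %` and `1024 MiB`); the sample text is hand-written for illustration, and `average_utilization` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def average_utilization(csv_text):
    """Return {gpu_index: mean utilization in percent} from nvidia-smi CSV log text."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    index_col = header.index("index")
    # The utilization column is named like "utilization.gpu [%]"
    util_col = next(i for i, name in enumerate(header) if name.startswith("utilization.gpu"))
    samples = defaultdict(list)
    for row in reader:
        if not row:
            continue
        # Values carry a unit suffix ("45 %"), so keep only the number
        samples[int(row[index_col])].append(float(row[util_col].split()[0]))
    return {gpu: sum(values) / len(values) for gpu, values in samples.items()}


# Hand-written sample in the assumed log format
sample_log = (
    "timestamp, index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n"
    "2024/01/15 10:00:00.000, 0, 40 %, 1024 MiB, 34\n"
    "2024/01/15 10:00:01.000, 0, 60 %, 2048 MiB, 36\n"
)
print(average_utilization(sample_log))
```

For a real log, the same function can be applied to the text of `gpu_log.csv`.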

8.2 nvtop

# Install nvtop (an htop-like viewer for GPUs)
sudo apt install nvtop

# Run
nvtop

8.3 Monitoring GPUs from Python

# gpu_monitor.py
import subprocess
import time
from dataclasses import dataclass
from typing import List

@dataclass
class GPUStats:
    index: int
    name: str
    temperature: float
    utilization: float
    memory_used: int
    memory_total: int
    power_draw: float

def _to_float(value: str) -> float:
    """Convert an nvidia-smi field to float; unsupported fields such as '[N/A]' become 0.0."""
    try:
        return float(value)
    except ValueError:
        return 0.0

def get_gpu_stats() -> List[GPUStats]:
    """Query GPU statistics through nvidia-smi."""
    cmd = [
        'nvidia-smi',
        '--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
        '--format=csv,noheader,nounits'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)

    gpus = []
    for line in result.stdout.strip().split('\n'):
        if not line.strip():
            continue  # no output, for example when nvidia-smi failed
        parts = [p.strip() for p in line.split(',')]
        gpus.append(GPUStats(
            index=int(parts[0]),
            name=parts[1],
            temperature=_to_float(parts[2]),
            utilization=_to_float(parts[3]),
            memory_used=int(parts[4]),
            memory_total=int(parts[5]),
            power_draw=_to_float(parts[6])
        ))
    return gpus

def monitor_training(interval: int = 10, duration: int = 3600):
    """Poll the GPUs every `interval` seconds for `duration` seconds during training."""
    print(f"Starting GPU monitoring (interval: {interval}s)")
    history = []

    start_time = time.time()
    while time.time() - start_time < duration:
        stats = get_gpu_stats()
        timestamp = time.time() - start_time

        for gpu in stats:
            mem_pct = gpu.memory_used / gpu.memory_total * 100
            print(
                f"[{timestamp:.0f}s] GPU {gpu.index}: "
                f"util={gpu.utilization:.0f}% "
                f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
                f"temp={gpu.temperature:.0f}C "
                f"power={gpu.power_draw:.0f}W"
            )
            history.append({
                'timestamp': timestamp,
                'gpu_index': gpu.index,
                'utilization': gpu.utilization,
                'memory_used': gpu.memory_used,
                'temperature': gpu.temperature
            })

        time.sleep(interval)

    return history

if __name__ == "__main__":
    history = monitor_training(interval=5, duration=60)
    print(f"\nCollected {len(history)} measurements in total")
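
The history list returned by `monitor_training` is made of plain dictionaries, so it can be written to disk and summarized without extra dependencies. The sketch below is a possible follow-up under that assumption; `save_history` and `peak_memory_by_gpu` are hypothetical helpers, and the sample data is hand-written in the same shape as the history entries above.

```python
import json
from pathlib import Path


def save_history(history, path):
    """Write the list of measurement dicts to `path` as JSON."""
    Path(path).write_text(json.dumps(history, indent=2))


def peak_memory_by_gpu(history):
    """Return {gpu_index: highest memory_used value seen} for a history list."""
    peaks = {}
    for sample in history:
        index = sample["gpu_index"]
        peaks[index] = max(peaks.get(index, 0), sample["memory_used"])
    return peaks


# Hand-written sample in the same shape as the entries built by monitor_training()
sample_history = [
    {"timestamp": 0.0, "gpu_index": 0, "utilization": 35.0, "memory_used": 1024, "temperature": 40.0},
    {"timestamp": 5.0, "gpu_index": 0, "utilization": 90.0, "memory_used": 8192, "temperature": 62.0},
]
print(peak_memory_by_gpu(sample_history))
```

Peak memory per GPU is a useful figure when deciding whether a larger batch size would still fit.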

8.4 System Monitoring

# Install and run htop
sudo apt install htop
htop

# Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o  # show only processes that are doing I/O

# Check disk I/O (iostat is part of the sysstat package)
iostat -x 1 5

# Memory usage
free -h
vmstat 1 10

# Network monitoring
sudo apt install nethogs
sudo nethogs

# All-in-one monitoring dashboard (glances)
pip install glances
glances

9. Code Quality Tools

9.1 Configuring black + ruff

# Install
pip install black ruff pre-commit

# Run black
black .
black --line-length 88 src/

# Run ruff
ruff check .
ruff check --fix .  # apply automatic fixes
ruff format .       # formatting (can replace black)

pyproject.toml configuration:

[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'

[tool.ruff]
line-length = 88
target-version = "py310"

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # pyflakes
    "I",   # isort
    "N",   # pep8-naming
    "UP",  # pyupgrade
    "B",   # flake8-bugbear
]
ignore = ["E501", "B008"]

[tool.ruff.lint.isort]
known-first-party = ["my_package"]

9.2 pre-commit hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: check-added-large-files
        args: ['--maxkb=10000']

  - repo: https://github.com/psf/black
    rev: 23.11.0
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies:
          - types-requests
          - pydantic

# Install the git hook and enable pre-commit
pre-commit install

# Run on all files
pre-commit run --all-files

# Run a specific hook
pre-commit run black --all-files

9.3 pytest with coverage

# Install
pip install pytest pytest-cov pytest-xdist

# Basic run
pytest tests/ -v

# With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing

# Parallel run
pytest tests/ -n auto  # one worker per CPU

# Show the slowest tests
pytest tests/ --durations=10

# Run only tests matching a marker expression
pytest tests/ -m "not slow"

pyproject.toml configuration (pytest.ini uses a [pytest] section with INI syntax instead):

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
    "slow: marks tests as slow",
    "gpu: marks tests requiring GPU",
    "integration: integration tests"
]
addopts = [
    "-v",
    "--tb=short",
    "--cov=src",
    "--cov-report=html",
    "--cov-fail-under=80"
]

10. Setting Up Weights & Biases

10.1 Installing and Initializing W&B

# Install
pip install wandb

# Log in (sets the API key)
wandb login
# Or set it through an environment variable
export WANDB_API_KEY=your-api-key-here

10.2 Basic W&B Usage

import wandb
import numpy as np
import matplotlib.pyplot as plt

# Initialize the experiment
run = wandb.init(
    project="bert-sentiment-analysis",
    name="run-001-baseline",
    config={
        "learning_rate": 2e-5,
        "batch_size": 32,
        "epochs": 10,
        "model": "bert-base-uncased",
        "optimizer": "adamw"
    },
    tags=["baseline", "bert", "nlp"],
    notes="Baseline BERT fine-tuning experiment"
)

# Access the configuration
config = wandb.config
print(f"Learning rate: {config.learning_rate}")

# Log metrics (the values below are simulated, not real training results)
for epoch in range(config.epochs):
    train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
    val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
    val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": config.learning_rate * (0.95 ** epoch)
    })

    # Gradient histogram (PyTorch)
    # wandb.log({"gradients": wandb.Histogram(model.fc.weight.grad)})

# Log a visualization (placeholder curves standing in for predictions and targets)
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='predicted')
ax.plot(x, np.cos(x), label='actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()

# Finish the experiment
wandb.finish()

10.3 Managing Artifacts

import wandb
from typing import Optional

# Save an artifact (model checkpoint)
def save_model_artifact(model_path: str, run, metadata: Optional[dict] = None):
    artifact = wandb.Artifact(
        name="model-checkpoint",
        type="model",
        metadata=metadata or {}
    )
    artifact.add_file(model_path)
    run.log_artifact(artifact)
    print(f"Model artifact saved: {model_path}")

# Save an artifact (dataset)
def save_dataset_artifact(data_dir: str, run):
    artifact = wandb.Artifact(
        name="training-dataset",
        type="dataset",
        description="Preprocessed training dataset"
    )
    artifact.add_dir(data_dir)
    run.log_artifact(artifact)

# Load an artifact
def load_model_artifact(artifact_name: str, version: str = "latest"):
    run = wandb.init(project="my-project", job_type="inference")
    artifact = run.use_artifact(f"{artifact_name}:{version}")
    artifact_dir = artifact.download()
    print(f"Artifact downloaded to: {artifact_dir}")
    return artifact_dir

# Usage example
run = wandb.init(project="bert-training")

# Simulate saving a model file (placeholder contents, not real weights)
with open("/tmp/model.pt", "w") as f:
    f.write("model_weights")

save_model_artifact(
    "/tmp/model.pt",
    run,
    metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)

wandb.finish()

10.4 Sweeps for Hyperparameter Optimization

import wandb
import numpy as np

# Sweep configuration
sweep_config = {
    "method": "bayes",  # random, grid, bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "hidden_size": {
            "values": [256, 512, 768, 1024]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "num_layers": {
            "values": [2, 4, 6, 8]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3
    }
}

def train_sweep():
    """Training function invoked by the sweep agent."""
    run = wandb.init()
    config = wandb.config

    # Simulated training driven by the sweep configuration
    best_val_acc = 0.0

    for epoch in range(10):
        # A real run would train the model here
        train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
        val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))

        if val_acc > best_val_acc:
            best_val_acc = val_acc

        wandb.log({
            "epoch": epoch,
            "train/loss": train_loss,
            "val/accuracy": val_acc
        })

    # Record the best validation accuracy in the run summary
    run.summary["best_val_accuracy"] = best_val_acc
    wandb.finish()

# Create and run the sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")

# Run the agent (count = number of trials)
wandb.agent(sweep_id, function=train_sweep, count=20)

To distribute a sweep across several servers, start an agent on each one:

# Server 1
wandb agent username/project/sweep-id

# Server 2 (at the same time)
wandb agent username/project/sweep-id
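
The sweep configuration above samples `learning_rate` with `log_uniform_values` between 1e-5 and 1e-3. To illustrate what sampling uniformly on a log scale means, the sketch below reproduces the idea with the standard library only. It is not W&B's implementation, and the function name is hypothetical: the point is that each decade (1e-5 to 1e-4, 1e-4 to 1e-3) receives about the same share of samples, which a plain uniform distribution would not give.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# Share of samples in the lower decade (1e-5 to 1e-4); close to one half on a log scale
lower_decade_share = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share below 1e-4: {lower_decade_share:.2f}")
```

This is why a log-scale distribution is the usual choice for learning rates: small values are explored as thoroughly as large ones.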

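Conclusion

Time invested early in the AI development environment pays back as higher productivity later. The key points of this guide are summarized below.

Essential priorities:

CUDA environment: a correct installation of the GPU driver and CUDA Toolkit is the foundation.
Python environment isolation: separate environments per project with a combination of pyenv and conda/poetry.
JupyterLab + VS Code: use Jupyter for exploratory analysis and VS Code for full-scale development.
Docker containerization: keep experiment environments reproducible and simplify team collaboration.
W&B experiment tracking: tracking every experiment provides reproducibility and insight.

Areas to invest in over the long term:

Automate code quality with pre-commit hooks and CI/CD.
Becoming fluent with a remote development setup (Remote SSH + tmux) lets you work productively from anywhere.
Use GPU monitoring to catch problems in the training process early.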
