필사 모드: AI Development Environment Complete Guide: From GPU Server Setup to Jupyter, VS Code, Docker
EnglishOverview
Setting up a proper AI development environment is half the battle. Without the right foundation, reproducible experiments, team collaboration, and fast iteration cycles are all impossible. This guide covers everything an AI/ML developer needs to know — from the initial GPU server setup to daily development workflows.
You will learn how to install CUDA on Ubuntu, manage Python environments, use JupyterLab and VS Code like a professional, and build reproducible experiment environments with Docker containers.
1. AI Development Environment Requirements
1.1 Hardware Recommendations
**GPU (Top Priority)**
In AI research, a GPU is not optional — it is essential.
- Entry-level / Personal research: NVIDIA RTX 4080/4090 (16-24GB VRAM)
- Shared team server: NVIDIA A100 40GB or 80GB
- Large LLM training: NVIDIA H100 or H200 (80GB+ VRAM)
- Cloud options: AWS p3/p4d/p5, GCP A100/H100, Lambda Labs
**CPU**
During GPU training, the CPU is primarily responsible for data preprocessing and loading.
- Minimum: 8-core 16-thread (Intel Core i9 or AMD Ryzen 9)
- Recommended: 16+ cores (AMD Threadripper, Intel Xeon)
- Optimal DataLoader workers = half the number of CPU cores
**RAM**
- Minimum: 32GB
- Recommended: 64GB (at least 4x GPU VRAM)
- Large-scale NLP: 128GB+ (tokenization, data loading)
**Storage**
/home/user/ -> NVMe SSD (OS, code, environments)
/data/ -> NVMe SSD or fast HDD (training data)
/models/ -> HDD or NAS (model checkpoints)
- OS and packages: NVMe SSD 500GB+
- Datasets: NVMe SSD (I/O-intensive training) or high-performance HDD
- Checkpoints: HDD 4TB+ (cost-effective high-capacity storage)
1.2 OS Selection
**Ubuntu 22.04 LTS — Strongly Recommended**
Check Ubuntu version
lsb_release -a
Example output:
Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Reasons to choose Ubuntu:
- NVIDIA's primary supported platform (latest drivers, CUDA, cuDNN available immediately)
- Largest AI/ML community documentation and package ecosystem
- Best compatibility with Docker and Kubernetes
- 5-year LTS support for stable server operation
**macOS (M1/M2/M3)**
Apple Silicon supports GPU acceleration via Metal Performance Shaders.
Check MPS availability (PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"
1.3 Cloud vs. Local Development
| Aspect | Local Server | Cloud |
| ------------- | ------------------- | ------------------------- |
| Upfront cost | High (hardware) | Low |
| Ongoing cost | Low (electricity) | High (hourly billing) |
| Latency | None | Instance startup time |
| Scalability | Limited | Unlimited |
| Data security | High | Policy-dependent |
| Best for | Continuous research | One-off large experiments |
2. NVIDIA GPU Drivers and CUDA Installation
2.1 Installing nvidia-driver (Ubuntu)
1. Remove existing NVIDIA packages
sudo apt-get purge nvidia*
sudo apt autoremove
2. Check available drivers
ubuntu-drivers devices
3. Auto-install recommended driver
sudo ubuntu-drivers autoinstall
Or install a specific version
sudo apt install nvidia-driver-535
4. Reboot
sudo reboot
5. Verify installation
nvidia-smi
Example `nvidia-smi` output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off| 00000000:00:04.0 Off | 0 |
| N/A 34C P0 56W / 400W | 1024MiB / 40960MiB | 0% Default |
+-----------------------------------------------------------------------------+
2.2 Installing the CUDA Toolkit
Generate the appropriate installation command on the official NVIDIA website for your platform.
CUDA 12.2 installation (Ubuntu 22.04 example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2
Set environment variables (add to ~/.bashrc or ~/.zshrc)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify CUDA version
nvcc --version
2.3 Installing cuDNN
Requires an NVIDIA Developer account
After downloading cuDNN:
For Ubuntu 22.04 with CUDA 12.x
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples
Verify cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
2.4 Verifying the Installation
verify_gpu.py
def check_nvidia_smi():
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
if result.returncode == 0:
print("nvidia-smi is working")
lines = result.stdout.split('\n')
for line in lines:
if 'NVIDIA' in line and 'Driver' in line:
print(f" {line.strip()}")
else:
print("nvidia-smi error:", result.stderr)
def check_pytorch_cuda():
try:
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
props = torch.cuda.get_device_properties(i)
print(f" GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f}GB)")
Quick tensor operation test
x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()
z = x @ y
print(f"GPU tensor operation test: passed (shape: {z.shape})")
except ImportError:
print("PyTorch is not installed")
def check_tensorflow_gpu():
try:
print(f"\nTensorFlow version: {tf.__version__}")
gpus = tf.config.list_physical_devices('GPU')
print(f"Detected GPUs: {len(gpus)}")
for gpu in gpus:
print(f" {gpu}")
except ImportError:
print("TensorFlow is not installed")
if __name__ == "__main__":
check_nvidia_smi()
check_pytorch_cuda()
check_tensorflow_gpu()
python verify_gpu.py
3. Python Environment Management
3.1 Python Version Management with pyenv
Install pyenv
curl https://pyenv.run | bash
Add to ~/.bashrc or ~/.zshrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
List available Python versions
pyenv install --list | grep -E "^\s+3\.(10|11|12)"
Install Python
pyenv install 3.11.7
pyenv install 3.10.13
Set global version
pyenv global 3.11.7
Set project-specific version (in that directory)
cd my-project
pyenv local 3.10.13
Check current Python version
python --version
pyenv versions
3.2 conda Environments (for GPU Libraries)
conda is particularly useful for installing GPU-accelerated libraries like RAPIDS and cuML.
Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init
Configure channels
conda config --add channels conda-forge
conda config --add channels nvidia
conda config --set channel_priority strict
Create AI/ML environment
conda create -n aiml python=3.11 -y
conda activate aiml
PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
RAPIDS (GPU-accelerated data science)
conda install -c rapidsai -c conda-forge -c nvidia rapids=23.10 cuda-version=12.0
Export and import environments
conda env export > environment.yml
conda env create -f environment.yml
List environments
conda env list
`environment.yml` example:
name: aiml
channels:
- pytorch
- nvidia
- conda-forge
- defaults
dependencies:
- python=3.11
- pytorch>=2.1
- torchvision
- cudatoolkit=12.1
- numpy>=1.24
- pandas>=2.0
- scikit-learn>=1.3
- pip:
- transformers>=4.35
- wandb>=0.16
- pydantic>=2.0
3.3 Dependency Management with poetry
Install poetry
curl -sSL https://install.python-poetry.org | python3 -
Create project
poetry new ml-research
cd ml-research
Add dependencies
poetry add torch torchvision numpy pandas transformers
poetry add --group dev pytest black ruff mypy jupyter
Install with extras
poetry add "torch[cuda]>=2.1"
poetry add "transformers[torch]>=4.35"
Install all dependencies
poetry install
Check virtual environment info
poetry env info
poetry env list
3.4 uv (Ultra-fast Package Manager)
Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
Install packages (10-100x faster than pip)
uv pip install torch torchvision numpy pandas
Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate
Install from requirements.txt (very fast)
uv pip install -r requirements.txt
Manage Python versions
uv python install 3.11 3.12
uv python list
4. Advanced JupyterLab Setup
4.1 JupyterLab Installation and Extensions
Install JupyterLab
pip install jupyterlab
Install useful extensions
pip install jupyterlab-git # Git integration
pip install jupyterlab-lsp # Language server protocol (autocomplete)
pip install python-lsp-server # Python LSP
pip install jupyterlab-code-formatter # Code formatter
pip install black isort # Formatters
pip install jupyterlab-vim # Vim key bindings
pip install ipywidgets # Interactive widgets
List installed extensions
jupyter labextension list
Start JupyterLab
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
4.2 Kernel Management
List available kernels
jupyter kernelspec list
Register current conda/venv environment as a kernel
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
Register a specific conda environment as a kernel
conda activate aiml
conda install ipykernel
python -m ipykernel install --user --name aiml --display-name "Python (aiml GPU)"
Remove a kernel
jupyter kernelspec remove old-kernel
Custom `kernel.json` example:
{
"argv": [
"/home/user/miniconda3/envs/aiml/bin/python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "Python (aiml GPU)",
"language": "python",
"env": {
"CUDA_VISIBLE_DEVICES": "0",
"PYTHONPATH": "/home/user/projects"
}
}
4.3 Remote Jupyter via SSH Tunneling
Start Jupyter on the server (no token)
jupyter lab --no-browser --port=8888 --ip=127.0.0.1
Create SSH tunnel from local machine
ssh -N -L 8888:localhost:8888 user@your-server.com
Access in browser
http://localhost:8888
Set a password for security
jupyter lab password
Systemd service for auto-start:
cat > /tmp/jupyter.service << 'EOF'
[Unit]
Description=Jupyter Lab
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
ExecStart=/home/your-username/miniconda3/envs/aiml/bin/jupyter lab --no-browser --port=8888
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable jupyter
sudo systemctl start jupyter
4.4 Jupyter Magic Commands
Timing
%time result = expensive_function()
%timeit -n 100 result = fast_function() # 100 repetitions
Line-by-line profiling
%load_ext line_profiler
%lprun -f my_function my_function(data)
Memory profiling
%load_ext memory_profiler
%memit result = memory_heavy_function()
Write cell contents to file
%%writefile my_script.py
...
Load file into cell
%load my_script.py
Run shell commands
!nvidia-smi
!pip list | grep torch
files = !ls -la *.py
Set environment variables
%env CUDA_VISIBLE_DEVICES=0
Auto-reload (reflect code changes immediately)
%load_ext autoreload
%autoreload 2
Inline matplotlib plots
%matplotlib inline
List current variables
%who
%whos
Access previous outputs
print(_) # Last result
print(__) # Two results ago
4.5 nbconvert (Notebook Conversion)
Convert notebook to script
jupyter nbconvert --to script notebook.ipynb
Convert to HTML (for sharing)
jupyter nbconvert --to html notebook.ipynb
Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb
Execute and convert to HTML
jupyter nbconvert --to html --execute notebook.ipynb
Execute notebook in-place from CLI
jupyter nbconvert --to notebook --execute --inplace notebook.ipynb
Parameterized execution (papermill)
pip install papermill
papermill input.ipynb output.ipynb -p lr 0.001 -p epochs 100
5. VS Code for AI
5.1 Installing Essential Extensions
Install extensions from the command line
code --install-extension ms-python.python # Python
code --install-extension ms-toolsai.jupyter # Jupyter
code --install-extension ms-python.pylance # Pylance (LSP)
code --install-extension charliermarsh.ruff # Ruff linter
code --install-extension github.copilot # GitHub Copilot
code --install-extension github.copilot-chat # Copilot Chat
code --install-extension ms-vscode-remote.remote-ssh # Remote SSH
code --install-extension ms-vscode-remote.remote-containers # Dev Containers
code --install-extension eamodio.gitlens # GitLens
code --install-extension njpwerner.autodocstring # Auto-docstring
5.2 VS Code Settings (settings.json)
{
"python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
"python.formatting.provider": "none",
"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.fixAll.ruff": true,
"source.organizeImports.ruff": true
}
},
"pylance.typeCheckingMode": "basic",
"python.analysis.autoImportCompletions": true,
"python.analysis.indexing": true,
"python.analysis.packageIndexDepths": [
{ "name": "torch", "depth": 5 },
{ "name": "transformers", "depth": 5 }
],
"jupyter.askForKernelRestart": false,
"jupyter.interactiveWindow.creationMode": "perFile",
"editor.inlineSuggest.enabled": true,
"github.copilot.enable": {
"*": true,
"yaml": true,
"plaintext": false
},
"files.exclude": {
"**/__pycache__": true,
"**/*.pyc": true,
".mypy_cache": true,
".ruff_cache": true
},
"terminal.integrated.env.linux": {
"CUDA_VISIBLE_DEVICES": "0"
}
}
5.3 Remote SSH Setup
Local ~/.ssh/config
Host ml-server
HostName 192.168.1.100
User your-username
IdentityFile ~/.ssh/id_rsa
ForwardAgent yes
ServerAliveInterval 60
ServerAliveCountMax 3
Host gpu-cloud
HostName gpu-server.example.com
User ubuntu
IdentityFile ~/.ssh/cloud-key.pem
Port 22
Using Remote SSH in VS Code:
1. Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
2. Select "Remote-SSH: Connect to Host"
3. Choose your configured host
Auto-install extensions on the remote server:
{
"remote.SSH.defaultExtensions": ["ms-python.python", "ms-toolsai.jupyter", "charliermarsh.ruff"]
}
5.4 Dev Containers
`.devcontainer/devcontainer.json` example:
{
"name": "AI Development",
"build": {
"dockerfile": "Dockerfile",
"args": {
"CUDA_VERSION": "12.1.1",
"PYTHON_VERSION": "3.11"
}
},
"runArgs": ["--gpus", "all", "--shm-size", "8g"],
"mounts": [
"source=/data,target=/data,type=bind",
"source=${localEnv:HOME}/.cache,target=/root/.cache,type=bind"
],
"customizations": {
"vscode": {
"extensions": [
"ms-python.python",
"ms-toolsai.jupyter",
"charliermarsh.ruff",
"github.copilot"
],
"settings": {
"python.defaultInterpreterPath": "/usr/local/bin/python"
}
}
},
"postCreateCommand": "pip install -e '.[dev]'",
"remoteUser": "root"
}
6. GPU Docker Containers
6.1 Installing NVIDIA Container Toolkit
Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Test GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
6.2 Dockerfile for AI Projects
Dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
Environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
System packages
RUN apt-get update && apt-get install -y \
python3.11 \
python3.11-dev \
python3.11-venv \
python3-pip \
git \
wget \
curl \
vim \
htop \
tmux \
&& rm -rf /var/lib/apt/lists/*
Set Python defaults
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
RUN update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
Upgrade pip
RUN pip install --upgrade pip setuptools wheel
Working directory
WORKDIR /workspace
Python dependencies (optimize layer caching)
COPY requirements.txt .
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install -r requirements.txt
Copy source code
COPY . .
Install as editable package
RUN pip install -e ".[dev]"
Create non-root user (security best practice)
RUN useradd -m -u 1000 researcher
RUN chown -R researcher:researcher /workspace
USER researcher
Expose ports
EXPOSE 8888 6006
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
6.3 Docker Compose for a Service Stack
docker-compose.yml
version: '3.8'
services:
Training container
trainer:
build:
context: .
dockerfile: Dockerfile
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=0
- CUDA_VISIBLE_DEVICES=0
- WANDB_API_KEY=${WANDB_API_KEY}
volumes:
- ./src:/workspace/src
- ./configs:/workspace/configs
- /data:/data:ro
- model-checkpoints:/workspace/checkpoints
- ~/.cache/huggingface:/root/.cache/huggingface
shm_size: '8g'
command: python train.py
Jupyter notebook server
notebook:
build:
context: .
dockerfile: Dockerfile
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=1
ports:
- '8888:8888'
volumes:
- ./notebooks:/workspace/notebooks
- ./src:/workspace/src
- /data:/data:ro
command: jupyter lab --ip=0.0.0.0 --no-browser --allow-root
TensorBoard
tensorboard:
image: tensorflow/tensorflow:latest
ports:
- '6006:6006'
volumes:
- model-checkpoints:/workspace/checkpoints:ro
command: tensorboard --logdir=/workspace/checkpoints --host=0.0.0.0
MLflow tracking server
mlflow:
image: ghcr.io/mlflow/mlflow:v2.8.0
ports:
- '5000:5000'
volumes:
- mlflow-data:/mlflow
command: mlflow server --host=0.0.0.0 --port=5000 --default-artifact-root=/mlflow/artifacts
volumes:
model-checkpoints:
mlflow-data:
Start the full stack
docker-compose up -d
Follow logs
docker-compose logs -f trainer
Restart a specific service
docker-compose restart notebook
Check GPU usage inside container
docker-compose exec trainer nvidia-smi
Tear down
docker-compose down --volumes
6.4 GPU Sharing Strategies
Assign specific GPUs
docker run --gpus '"device=0,1"' myimage # Use GPUs 0 and 1
docker run --gpus '"device=2"' myimage # Use only GPU 2
MIG (Multi-Instance GPU) for A100/H100
sudo nvidia-smi mig -i 0 --create-gpu-instance 3g.20gb
sudo nvidia-smi mig -i 0 --create-compute-instance 3g.20gb
Assign MIG instance to a container
docker run --gpus '"MIG-GPU-xxxxxxxx-xx-xx-xxxx-xxxxxxxxxxxx/x/x"' myimage
7. Remote Development Environment
7.1 SSH-Based Remote Development
Generate SSH key (on local machine)
ssh-keygen -t ed25519 -C "your-email@example.com"
Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.com
Or manually
cat ~/.ssh/id_ed25519.pub | ssh user@server.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Optimized SSH config (~/.ssh/config)
Host ml-server
HostName server.example.com
User researcher
IdentityFile ~/.ssh/id_ed25519
ControlMaster auto
ControlPath ~/.ssh/cm_%r@%h:%p
ControlPersist 10m
ServerAliveInterval 30
Compression yes
7.2 tmux (Session Persistence)
Install tmux
sudo apt install tmux
Start a new session
tmux new-session -s training
Reattach to a session
tmux attach-session -t training
List sessions
tmux list-sessions
Key shortcuts (Prefix: Ctrl+B)
Ctrl+B d : Detach session (keeps running after SSH disconnect)
Ctrl+B c : Create new window
Ctrl+B n/p : Next/previous window
Ctrl+B % : Vertical split
Ctrl+B " : Horizontal split
Ctrl+B z : Toggle pane zoom
`~/.tmux.conf` configuration:
~/.tmux.conf
Enable mouse support
set -g mouse on
Increase scroll buffer
set -g history-limit 50000
Start window numbering at 1
set -g base-index 1
setw -g pane-base-index 1
Color support
set -g default-terminal "screen-256color"
set -ga terminal-overrides ",xterm-256color:Tc"
Customize status bar
set -g status-bg colour235
set -g status-fg colour136
set -g status-right '#[fg=colour166]%d %b #[fg=colour136]%H:%M '
Change prefix to Ctrl+A (screen style)
set -g prefix C-a
unbind C-b
bind C-a send-prefix
Fast pane navigation
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D
7.3 File Synchronization (rsync, sshfs)
Sync code local -> server
rsync -avz --exclude '__pycache__' --exclude '.git' \
./my-project/ user@server:/home/user/my-project/
Sync results server -> local
rsync -avz user@server:/home/user/my-project/outputs/ ./outputs/
Auto-sync script
cat > sync.sh << 'EOF'
#!/bin/bash
rsync -avz \
--exclude '__pycache__' \
--exclude '.git' \
--exclude '.venv' \
--exclude '*.pyc' \
--delete \
./ user@ml-server:/home/user/project/
echo "Sync complete: $(date)"
EOF
chmod +x sync.sh
Mount remote filesystem with sshfs
sudo apt install sshfs
mkdir -p ~/remote-server
sshfs user@server:/home/user ~/remote-server -o reconnect,ServerAliveInterval=15
Unmount
fusermount -u ~/remote-server
8. Monitoring Tools
8.1 nvidia-smi watch
Refresh GPU status every second
watch -n 1 nvidia-smi
Show only GPU memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
--format=csv -l 1
Record to CSV log
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
--format=csv -l 1 > gpu_log.csv
8.2 nvtop
Install nvtop (htop for GPUs)
sudo apt install nvtop
Run
nvtop
8.3 GPU Monitoring from Python
gpu_monitor.py
from dataclasses import dataclass
from typing import List
@dataclass
class GPUStats:
index: int
name: str
temperature: float
utilization: float
memory_used: int
memory_total: int
power_draw: float
def get_gpu_stats() -> List[GPUStats]:
"""Query GPU stats via nvidia-smi"""
cmd = [
'nvidia-smi',
'--query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw',
'--format=csv,noheader,nounits'
]
result = subprocess.run(cmd, capture_output=True, text=True)
gpus = []
for line in result.stdout.strip().split('\n'):
parts = [p.strip() for p in line.split(',')]
gpus.append(GPUStats(
index=int(parts[0]),
name=parts[1],
temperature=float(parts[2]),
utilization=float(parts[3]),
memory_used=int(parts[4]),
memory_total=int(parts[5]),
power_draw=float(parts[6]) if parts[6] != 'N/A' else 0.0
))
return gpus
def monitor_training(interval: int = 10, duration: int = 3600):
"""Monitor GPU during training"""
print(f"GPU monitoring started (interval: {interval}s)")
history = []
start_time = time.time()
while time.time() - start_time < duration:
stats = get_gpu_stats()
timestamp = time.time() - start_time
for gpu in stats:
mem_pct = gpu.memory_used / gpu.memory_total * 100
print(
f"[{timestamp:.0f}s] GPU {gpu.index}: "
f"util={gpu.utilization:.0f}% "
f"mem={gpu.memory_used}/{gpu.memory_total}MB ({mem_pct:.0f}%) "
f"temp={gpu.temperature:.0f}C "
f"power={gpu.power_draw:.0f}W"
)
history.append({
'timestamp': timestamp,
'gpu_index': gpu.index,
'utilization': gpu.utilization,
'memory_used': gpu.memory_used,
'temperature': gpu.temperature
})
time.sleep(interval)
return history
if __name__ == "__main__":
history = monitor_training(interval=5, duration=60)
print(f"\nCollected {len(history)} measurements")
8.4 System Monitoring
Install and use htop
sudo apt install htop
htop
Monitor I/O with iotop
sudo apt install iotop
sudo iotop -o # Show only processes with active I/O
Disk I/O stats
iostat -x 1 5
Memory usage
free -h
vmstat 1 10
Network monitoring
sudo apt install nethogs
sudo nethogs
Comprehensive monitoring dashboard (glances)
pip install glances
glances
9. Code Quality Tools
9.1 black + ruff Configuration
Install
pip install black ruff pre-commit
Run black
black .
black --line-length 88 src/
Run ruff
ruff check .
ruff check --fix . # Auto-fix
ruff format . # Format (can replace black)
`pyproject.toml` configuration:
[tool.black]
line-length = 88
target-version = ['py310', 'py311']
include = '\.pyi?$'
[tool.ruff]
line-length = 88
target-version = "py310"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"N", # pep8-naming
"UP", # pyupgrade
"B", # flake8-bugbear
]
ignore = ["E501", "B008"]
[tool.ruff.lint.isort]
known-first-party = ["my_package"]
9.2 pre-commit Hooks
.pre-commit-config.yaml
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
- id: check-json
- id: check-merge-conflict
- id: detect-private-key
- id: check-added-large-files
args: ['--maxkb=10000']
- repo: https://github.com/psf/black
rev: 23.11.0
hooks:
- id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.1.6
hooks:
- id: ruff
args: [--fix]
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.7.1
hooks:
- id: mypy
additional_dependencies:
- types-requests
- pydantic
Install and activate pre-commit
pre-commit install
Run on all files
pre-commit run --all-files
Run a specific hook
pre-commit run black --all-files
9.3 pytest with Coverage
Install
pip install pytest pytest-cov pytest-xdist
Basic run
pytest tests/ -v
With coverage
pytest tests/ --cov=src --cov-report=html --cov-report=term-missing
Parallel execution
pytest tests/ -n auto # As many workers as CPU cores
Show slowest tests
pytest tests/ --durations=10
Run only fast tests
pytest tests/ -m "not slow"
`pytest` configuration in `pyproject.toml`:
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py", "*_test.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
"slow: marks tests as slow",
"gpu: marks tests requiring GPU",
"integration: integration tests"
]
addopts = [
"-v",
"--tb=short",
"--cov=src",
"--cov-report=html",
"--cov-fail-under=80"
]
10. Weights & Biases Setup
10.1 Installation and Initialization
Install
pip install wandb
Log in (sets API key)
wandb login
Or set via environment variable
export WANDB_API_KEY=your-api-key-here
10.2 Basic W&B Usage
Initialize experiment
run = wandb.init(
project="bert-sentiment-analysis",
name="run-001-baseline",
config={
"learning_rate": 2e-5,
"batch_size": 32,
"epochs": 10,
"model": "bert-base-uncased",
"optimizer": "adamw"
},
tags=["baseline", "bert", "nlp"],
notes="Baseline BERT fine-tuning experiment"
)
config = wandb.config
print(f"Learning rate: {config.learning_rate}")
Log metrics
for epoch in range(config.epochs):
train_loss = 2.0 - epoch * 0.15 + np.random.randn() * 0.1
val_loss = 2.2 - epoch * 0.12 + np.random.randn() * 0.1
val_acc = 0.5 + epoch * 0.04 + np.random.randn() * 0.01
wandb.log({
"epoch": epoch,
"train/loss": train_loss,
"val/loss": val_loss,
"val/accuracy": val_acc,
"learning_rate": config.learning_rate * (0.95 ** epoch)
})
Log a visualization
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='Predicted')
ax.plot(x, np.cos(x), label='Actual')
ax.legend()
wandb.log({"prediction_plot": wandb.Image(fig)})
plt.close()
wandb.finish()
10.3 Artifact Management
def save_model_artifact(model_path: str, run, metadata: dict = None):
"""Save model checkpoint as a W&B artifact"""
artifact = wandb.Artifact(
name="model-checkpoint",
type="model",
metadata=metadata or {}
)
artifact.add_file(model_path)
run.log_artifact(artifact)
print(f"Model artifact saved: {model_path}")
def save_dataset_artifact(data_dir: str, run):
"""Save dataset as a W&B artifact"""
artifact = wandb.Artifact(
name="training-dataset",
type="dataset",
description="Preprocessed training dataset"
)
artifact.add_dir(data_dir)
run.log_artifact(artifact)
def load_model_artifact(artifact_name: str, version: str = "latest"):
"""Download a model artifact"""
run = wandb.init(project="my-project", job_type="inference")
artifact = run.use_artifact(f"{artifact_name}:{version}")
artifact_dir = artifact.download()
print(f"Artifact downloaded to: {artifact_dir}")
return artifact_dir
Usage example
run = wandb.init(project="bert-training")
with open("/tmp/model.pt", "w") as f:
f.write("model_weights")
save_model_artifact(
"/tmp/model.pt",
run,
metadata={"accuracy": 0.94, "epoch": 10, "val_loss": 0.28}
)
wandb.finish()
10.4 Hyperparameter Optimization Sweeps
Sweep configuration
sweep_config = {
"method": "bayes", # random, grid, or bayes
"metric": {
"name": "val/accuracy",
"goal": "maximize"
},
"parameters": {
"learning_rate": {
"distribution": "log_uniform_values",
"min": 1e-5,
"max": 1e-3
},
"batch_size": {
"values": [16, 32, 64, 128]
},
"hidden_size": {
"values": [256, 512, 768, 1024]
},
"dropout": {
"distribution": "uniform",
"min": 0.1,
"max": 0.5
},
"num_layers": {
"values": [2, 4, 6, 8]
}
},
"early_terminate": {
"type": "hyperband",
"min_iter": 3
}
}
def train_sweep():
"""Training function called by each sweep run"""
run = wandb.init()
config = wandb.config
best_val_acc = 0.0
for epoch in range(10):
train_loss = 1.0 / (config.learning_rate * 1000) * np.random.uniform(0.8, 1.2)
val_acc = min(0.99, config.hidden_size / 2000 + np.random.uniform(0, 0.1))
if val_acc > best_val_acc:
best_val_acc = val_acc
wandb.log({
"epoch": epoch,
"train/loss": train_loss,
"val/accuracy": val_acc
})
wandb.finish()
Create and launch sweep
sweep_id = wandb.sweep(sweep_config, project="hyperparameter-search")
print(f"Sweep ID: {sweep_id}")
Run agent for N trials
wandb.agent(sweep_id, function=train_sweep, count=20)
To run a sweep in parallel across multiple servers:
Server 1
wandb agent username/project/sweep-id
Server 2 (simultaneously)
wandb agent username/project/sweep-id
Conclusion
Investing time in setting up your AI development environment upfront pays dividends throughout the life of your project. Here is a summary of the key priorities covered in this guide.
**Essential priorities:**
1. CUDA environment: Correct GPU driver and CUDA Toolkit installation is the foundation of everything.
2. Python environment isolation: Use pyenv + conda/poetry to create separate environments per project.
3. JupyterLab + VS Code: Use Jupyter for exploratory analysis and VS Code for production development.
4. Docker containerization: Guarantees reproducible experiment environments and simplifies team collaboration.
5. W&B experiment tracking: Tracking every experiment gives you reproducibility and actionable insights.
**Areas worth investing in long-term:**
- Automate code quality with pre-commit hooks and CI/CD.
- Master remote development (Remote SSH + tmux) to work productively from anywhere.
- Use GPU monitoring to catch training issues early.
References
- [CUDA Installation Guide (Linux)](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
- [Docker GPU Support Documentation](https://docs.docker.com/config/containers/resource_constraints/#gpu)
- [Jupyter Official Documentation](https://docs.jupyter.org/)
- [VS Code Remote SSH](https://code.visualstudio.com/docs/remote/ssh)
- [Weights & Biases Documentation](https://docs.wandb.ai/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html)
현재 단락 (1/812)
Setting up a proper AI development environment is half the battle. Without the right foundation, rep...