
Complete Guide to Building a Linux GPU Server for Deep Learning

1. Deep Learning GPU Server Hardware Selection Guide

Before building a deep learning server, the first decision to make is the hardware configuration. Options vary depending on workload scale and budget.

1.1 GPU Selection

The GPU is the core component of a deep learning server. When choosing one, focus primarily on VRAM capacity, Tensor Core generation, and memory bandwidth.

| GPU | VRAM | Use Case | Approx. Price |
| --- | --- | --- | --- |
| RTX 4090 | 24GB GDDR6X | Personal research, small-to-medium training | ~$1,600 |
| RTX 5090 | 32GB GDDR7 | Personal research, large-model fine-tuning | ~$2,000 |
| RTX A6000 / RTX 6000 Ada | 48GB GDDR6 | Production, ECC memory support | ~$4,000+ |
| A100 (80GB PCIe/SXM) | 80GB HBM2e | Large-scale training, NVLink support | ~$10,000+ |
| H100 (80GB SXM) | 80GB HBM3 | Maximum-scale LLM training | ~$25,000+ |
| H200 | 141GB HBM3e | Long-context LLMs, relieving memory bottlenecks | ~$30,000+ |

Practical recommendations:

  • Personal/small labs: RTX 4090 or RTX 5090. With 24-32GB of VRAM, fine-tuning 7B-13B parameter models is feasible.
  • Enterprise/research labs: A100 80GB or H100. Data-center-grade GPUs are necessary where multi-GPU training over NVLink is essential.
  • If considering multi-GPU: be sure to check PCIe slot spacing, NVLink support, and power supply capacity.
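As a sanity check on these VRAM figures, a back-of-the-envelope estimate of full fine-tuning memory can be sketched in Python. The constants below (bytes per weight/gradient, Adam state size, overhead factor) are illustrative assumptions, not measured values:

```python
def estimate_finetune_vram_gb(n_params_billion, weight_bytes=2, grad_bytes=2,
                              adam_state_bytes=8, master_bytes=4,
                              overhead_factor=1.2):
    """Rough VRAM estimate (GB) for full fine-tuning with mixed precision:
    bf16 weights and gradients, fp32 Adam moments (m, v), fp32 master
    weights, plus a flat overhead factor for activations and workspace.
    All constants are illustrative assumptions, not measurements."""
    n = n_params_billion * 1e9
    per_param = weight_bytes + grad_bytes + adam_state_bytes + master_bytes
    return n * per_param * overhead_factor / 1e9

print(f"7B full fine-tune: ~{estimate_finetune_vram_gb(7):.0f} GB")
```

Under this heuristic, full fine-tuning of a 7B model far exceeds a 24-32GB card, which is why parameter-efficient methods such as LoRA/QLoRA are commonly used at that scale.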

1.2 CPU, RAM, Storage

| Component | Recommended Spec | Reason |
| --- | --- | --- |
| CPU | AMD EPYC or Intel Xeon (16+ cores) | Avoid data-loading bottlenecks; enough PCIe lanes |
| RAM | At least 2x total GPU VRAM (min 64GB, 128GB+ recommended) | Large-dataset preprocessing, DataLoader worker memory |
| OS storage | NVMe SSD, 500GB+ | Fast boot, package cache |
| Data storage | NVMe SSD, 2TB+ or RAID | Avoid training-data I/O bottlenecks |
| PSU | 1200W+ (1600W+ for multi-GPU) | A single RTX 4090 draws 450W; headroom is essential |

The CPU matters less than the GPU, but PCIe lane count can become a bottleneck with multiple GPUs, so a server-grade platform (AMD EPYC, Intel Xeon) is recommended. Desktop platforms (AM5, LGA1700) are sufficient for single-GPU setups.
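The PSU row in the table above can be turned into a quick sizing sketch. The CPU/base draw figures and the headroom factor are assumed placeholder values:

```python
def recommended_psu_watts(gpu_tdp_w, n_gpus, cpu_tdp_w=280, base_w=150,
                          headroom=1.3):
    """Sum component TDPs and multiply by a headroom factor to absorb
    transient power spikes. cpu_tdp_w, base_w, and headroom are assumed
    illustrative values, not vendor specifications."""
    return int((gpu_tdp_w * n_gpus + cpu_tdp_w + base_w) * headroom)

print(recommended_psu_watts(450, 1))  # single RTX 4090
print(recommended_psu_watts(450, 2))  # dual RTX 4090
```

The results land near the table's 1200W+/1600W+ guidance for one and two RTX 4090s.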


2. Ubuntu 22.04/24.04 Installation and Initial Setup

Ubuntu 22.04 LTS and Ubuntu 24.04 LTS are the most widely used operating systems for deep learning servers, because NVIDIA drivers, CUDA, and the major deep learning frameworks support them first.

2.1 Post-Installation Basic Setup

# System update
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential gcc g++ make cmake

# Install kernel headers (required to build the NVIDIA driver)
sudo apt install -y linux-headers-$(uname -r)

# Network tools
sudo apt install -y net-tools curl wget git vim htop

# Set the timezone
sudo timedatectl set-timezone Asia/Seoul

2.2 Disable Secure Boot

Since the NVIDIA driver needs to load kernel modules, disabling Secure Boot in the BIOS is recommended. It is possible to keep Secure Boot enabled by enrolling a MOK (Machine Owner Key), but that makes setup more complex.

# Check Secure Boot status
mokutil --sb-state

3. NVIDIA Driver Installation

The official NVIDIA documentation describes two main installation methods: the package manager (apt) method and the .run file method. For server environments, the apt repository method is recommended for ease of management.

3.1 Method 1: apt Repository Method (Recommended)

According to the official NVIDIA driver installation guide (https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html), installation on Ubuntu proceeds as follows.

Remove Existing Drivers

# Completely remove existing NVIDIA-related packages
sudo apt-get purge 'nvidia*' -y
sudo apt-get autoremove -y
sudo apt-get autoclean

Automatic Installation with ubuntu-drivers

On Ubuntu, the ubuntu-drivers tool can automatically install a driver suitable for the system.

# List available drivers
sudo ubuntu-drivers devices

# Auto-install the recommended driver
sudo ubuntu-drivers autoinstall

Manual Installation of a Specific Version

If a specific driver version is needed, add a driver repository — here the graphics-drivers PPA — and install from it.

# Add the graphics-drivers PPA
sudo apt install -y software-properties-common
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update

# Install a specific version (e.g., the 550 series)
sudo apt install -y nvidia-driver-550

# Reboot
sudo reboot

Verify the Installation

nvidia-smi

If the installation succeeded, the output shows the GPU name, driver version, supported CUDA version, and other details.

3.2 Method 2: .run File Method

This method downloads a .run file directly from the NVIDIA website and installs it. Use it when a special kernel environment or a custom build is required.

# Disable the Nouveau driver
sudo bash -c "echo 'blacklist nouveau' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot

# Stop the GUI (if a desktop environment is installed on the server)
sudo systemctl stop gdm3
# or
sudo systemctl stop lightdm

# Run the .run file
chmod +x NVIDIA-Linux-x86_64-550.120.run
sudo ./NVIDIA-Linux-x86_64-550.120.run

Drawbacks of the .run file method: the driver can break on kernel updates, and because it is not managed by the package manager, updating and removal are inconvenient. The apt method is strongly recommended for production servers.


4. Driver Version Selection Strategy

NVIDIA maintains two branches for data center/server drivers.

4.1 Production Branch vs New Feature Branch

| Category | Production Branch (PB) | New Feature Branch (NFB) |
| --- | --- | --- |
| Stability | High (extensively tested) | Relatively lower |
| New features | None after release | Latest GPU support, new features |
| Update policy | Bug fixes and security patches only | Periodic releases with new features |
| Recommended for | Production servers, stability first | Development/testing, newest GPUs |

Practical guidelines:

  • Production environments: use the Production Branch. For example, once the 550.xx series is designated as PB, only bug-fix updates are provided within that series.
  • Latest GPUs (e.g., the Blackwell architecture): the New Feature Branch is often required, since support for the newest hardware lands in NFB first.
  • Use -server packages on servers: packages with the -server suffix, such as nvidia-driver-550-server, are Enterprise Ready Drivers (ERD) optimized for server environments.

# Check installed drivers and available server variants
apt list --installed 2>/dev/null | grep nvidia-driver
apt-cache search nvidia-driver | grep server

5. CUDA Toolkit Installation

The CUDA Toolkit is the development environment for parallel computing on NVIDIA GPUs. Install it following the official NVIDIA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

5.1 Pre-Installation Checks

# Verify a CUDA-capable GPU is present
lspci | grep -i nvidia

# Verify gcc is installed
gcc --version

# Verify kernel headers
uname -r
sudo apt install -y linux-headers-$(uname -r)

5.2 Network Repository Installation (Recommended)

# Install the CUDA keyring package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install the CUDA Toolkit (latest version)
sudo apt-get install -y cuda-toolkit

# Or install a specific version
sudo apt-get install -y cuda-toolkit-12-6

5.3 Environment Variable Setup

# Add to ~/.bashrc or ~/.zshrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Apply
source ~/.bashrc

5.4 Managing Multiple CUDA Versions

In practice, different projects frequently require different CUDA versions. You can install multiple CUDA versions under /usr/local/ and manage them with update-alternatives.

# Install multiple versions (e.g., 12.4 and 12.6)
sudo apt-get install -y cuda-toolkit-12-4
sudo apt-get install -y cuda-toolkit-12-6

# Verify the installations under /usr/local/
ls /usr/local/ | grep cuda
# cuda -> cuda-12.6 (symbolic link)
# cuda-12.4
# cuda-12.6

# Register the versions with update-alternatives
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.4 10
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.6 20

# Switch versions
sudo update-alternatives --config cuda

Running update-alternatives --config cuda presents an interactive menu where you select the desired CUDA version. The /usr/local/cuda symbolic link is then repointed at the selected version.

# Verify the installation
nvcc --version
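What update-alternatives does here can be illustrated with a small, self-contained Python sketch that repoints a symlink inside a temporary sandbox. The paths below are throwaway, not the real /usr/local:

```python
import os
import tempfile

# Simulate the /usr/local/cuda mechanism: the "active" CUDA is just a
# symlink that gets repointed when a different version is selected.
selections = []
with tempfile.TemporaryDirectory() as root:
    for ver in ("12.4", "12.6"):
        os.mkdir(os.path.join(root, f"cuda-{ver}"))
    link = os.path.join(root, "cuda")

    os.symlink(os.path.join(root, "cuda-12.6"), link)  # higher priority wins
    selections.append(os.path.basename(os.path.realpath(link)))

    os.remove(link)                                    # switch the selection
    os.symlink(os.path.join(root, "cuda-12.4"), link)
    selections.append(os.path.basename(os.path.realpath(link)))

print(selections)  # ['cuda-12.6', 'cuda-12.4']
```

Because only the symlink changes, the PATH/LD_LIBRARY_PATH entries from section 5.3 keep working across version switches.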

6. CUDA Version and Driver Compatibility Matrix

Each CUDA Toolkit release has a minimum required NVIDIA driver version. The compatibility table is published in the official CUDA Toolkit Release Notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

6.1 Key Compatibility Table (Linux x86_64)

| CUDA Toolkit Version | Minimum Driver Version (Linux) |
| --- | --- |
| CUDA 12.0 | >= 525.60.13 |
| CUDA 12.1 | >= 530.30.02 |
| CUDA 12.2 | >= 535.54.03 |
| CUDA 12.3 | >= 545.23.06 |
| CUDA 12.4 | >= 550.54.14 |
| CUDA 12.5 | >= 555.42.02 |
| CUDA 12.6 | >= 560.28.03 |
| CUDA 13.0 | >= 570.86.15 |
| CUDA 13.1 | >= 575.51.03 |

Key point: the "CUDA Version" shown by nvidia-smi is the maximum CUDA Runtime version that the installed driver supports. It can differ from the CUDA Toolkit version actually installed; check the actual Toolkit version with nvcc --version.
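To illustrate, the two numbers can be pulled apart by parsing the header line of nvidia-smi. The header text below is a hand-written sample with illustrative values, not live output:

```python
import re

# Hand-written sample of nvidia-smi's header line (values illustrative).
header = ("| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14"
          "      CUDA Version: 12.4     |")

driver_version = re.search(r"Driver Version:\s*([\d.]+)", header).group(1)
max_runtime = re.search(r"CUDA Version:\s*([\d.]+)", header).group(1)

# max_runtime is the newest CUDA runtime this driver supports; the toolkit
# actually installed may be older (check that with `nvcc --version`).
print(driver_version, max_runtime)
```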

6.2 CUDA Forward/Minor Compatibility

According to NVIDIA's CUDA Compatibility documentation (https://docs.nvidia.com/deploy/cuda-compatibility/), Minor Version Compatibility is supported from CUDA 12.x onward: even with a driver shipped for CUDA 12.0, applications compiled against CUDA 12.6 can run, with some limitations.

# Check the CUDA version reported by the driver
nvidia-smi | head -3

# Check the actually installed CUDA Toolkit version
nvcc --version

7. cuDNN Installation and Configuration

cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library optimized for deep learning. It provides highly optimized implementations of operations such as convolution, pooling, normalization, and activation, and is used internally by frameworks like PyTorch and TensorFlow.

Install it following the official NVIDIA cuDNN installation guide (https://docs.nvidia.com/deeplearning/cudnn/installation/latest/linux.html).

7.1 apt Repository Installation (Recommended)

# If the CUDA keyring is already installed, the cuDNN package can be installed directly
sudo apt-get install -y cudnn

# Install cuDNN for a specific CUDA version
sudo apt-get install -y cudnn-cuda-12

7.2 Tarball Installation

Use a tarball when you need a specific version or prefer not to use the system package manager.

# After downloading from the NVIDIA Developer site
tar -xvf cudnn-linux-x86_64-9.x.x.x_cudaXX-archive.tar.xz

# Copy the files
sudo cp cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

7.3 Verify the Installation

# Check the cuDNN version (apt installation)
dpkg -l | grep cudnn

# Or check directly in the header file
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn_version.h
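The header check can also be done programmatically. The excerpt below mimics the cudnn_version.h define format with illustrative version numbers:

```python
import re

# Excerpt mimicking the format of cudnn_version.h (numbers illustrative).
header = """
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 1
"""

# Collect the #define name/value pairs into a dict of ints.
version = {k: int(v) for k, v in re.findall(r"#define CUDNN_(\w+) (\d+)", header)}
print(f"cuDNN {version['MAJOR']}.{version['MINOR']}.{version['PATCHLEVEL']}")
```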

8. Conda/Mamba Environment Setup

Separate from the system-level CUDA Toolkit, installing PyTorch and TensorFlow inside Conda/Mamba virtual environments prevents CUDA version conflicts. Recent PyTorch and TensorFlow builds bundle their own CUDA runtime, so they work independently of the system CUDA version.

8.1 Miniforge (Mamba) Installation

# Install Miniforge (includes Mamba)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3

# Initialize the shell
$HOME/miniforge3/bin/mamba init bash
source ~/.bashrc

8.2 PyTorch Installation

# Create a deep learning environment
mamba create -n dl python=3.11 -y
mamba activate dl

# Install PyTorch (CUDA 12.4 build - check the official site for the latest command)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Verify the installation (runs the snippet in a Python interpreter)
python - <<'EOF'
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
EOF

8.3 TensorFlow Installation

# Install TensorFlow (with GPU support); quote the extras to avoid shell globbing
pip install "tensorflow[and-cuda]"

# Verify the installation (runs the snippet in a Python interpreter)
python - <<'EOF'
import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")
EOF

Note: installing PyTorch and TensorFlow in the same environment can cause CUDA library version conflicts. Use separate Conda environments whenever possible.


9. Docker + NVIDIA Container Toolkit Setup

Docker lets you isolate and manage deep learning environments as containers. Installing the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enables GPU access inside Docker containers.

9.1 Docker Installation

# Set up the official Docker repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Add the current user to the docker group (use Docker without sudo)
sudo usermod -aG docker $USER
newgrp docker

9.2 NVIDIA Container Toolkit Installation

Installation steps per the official NVIDIA documentation.

# Set up the NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the package
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure the Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

9.3 GPU Docker Test

# GPU access test
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Use a specific GPU only
docker run --rm --gpus '"device=0"' nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Run a PyTorch container using all GPUs
docker run --rm --gpus all -it pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime python -c \
  "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

9.4 Using GPUs in Docker Compose

# docker-compose.yml
services:
  training:
    image: pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # or a specific number: 1, 2, etc.
              capabilities: [gpu]
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
    shm_size: '8g' # PyTorch DataLoader shared memory

Caution: if shm_size is not set, PyTorch's DataLoader can fail with shared-memory errors when num_workers > 0.
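A rough way to size shm_size is to bound the memory held by in-flight DataLoader batches. The formula and sample numbers below are a sizing heuristic with assumed values, not an exact accounting of DataLoader internals:

```python
def estimate_shm_gb(batch_size, sample_mb, num_workers, prefetch_factor=2):
    """Upper-bound estimate of /dev/shm held by in-flight DataLoader
    batches: each worker keeps up to `prefetch_factor` batches queued.
    A heuristic for picking shm_size, not an exact accounting."""
    return batch_size * sample_mb * 1e6 * num_workers * prefetch_factor / 1e9

# e.g. batches of 64 decoded images at ~3 MB each, 8 workers
print(f"~{estimate_shm_gb(64, 3, 8):.1f} GB of /dev/shm")
```

An estimate near 3 GB leaves the '8g' setting above with comfortable margin.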


10. Monitoring Tools: nvidia-smi and nvtop

Monitoring is essential for GPU server operation. You need real-time visibility into GPU utilization, memory usage, temperature, and more.

10.1 nvidia-smi

nvidia-smi is the basic monitoring tool installed along with the NVIDIA driver.

# Basic status check
nvidia-smi

# Real-time monitoring (1-second interval)
watch -n 1 nvidia-smi

# Query specific fields only
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Per-process GPU usage
nvidia-smi pmon -s um -d 1

# Enable Persistence Mode (recommended for servers)
sudo nvidia-smi -pm 1

Enabling Persistence Mode keeps the GPU driver loaded at all times, eliminating the initialization delay (up to several seconds) otherwise incurred on the first GPU call. Always enable it on servers.
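The CSV output of the --query-gpu command above is easy to post-process. The sample below is hand-written illustrative output, not captured from a live system:

```python
import csv
import io

# Hand-written sample of `nvidia-smi --query-gpu=... --format=csv` output.
sample = """name, temperature.gpu, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
NVIDIA GeForce RTX 4090, 62, 97 %, 21504 MiB, 24564 MiB
NVIDIA GeForce RTX 4090, 58, 3 %, 1024 MiB, 24564 MiB
"""

# skipinitialspace handles the ", "-separated fields nvidia-smi emits.
rows = list(csv.DictReader(io.StringIO(sample), skipinitialspace=True))
for gpu in rows:
    used = int(gpu["memory.used [MiB]"].split()[0])
    total = int(gpu["memory.total [MiB]"].split()[0])
    print(f"{gpu['name']}: {used}/{total} MiB, util {gpu['utilization.gpu [%]']}")
```

The same parsing works on live output piped from the command, e.g. via subprocess.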

10.2 nvtop

nvtop is a GPU monitoring tool with an htop-like interface. It shows utilization, memory, temperature, and process information for multiple GPUs on one screen in real time.

# Install
sudo apt install -y nvtop

# Run
nvtop

10.3 gpustat

If you prefer concise output, gpustat is also useful.

pip install gpustat

# Real-time monitoring
gpustat -i 1 --color

11. Remote Development over SSH

GPU servers are usually accessed remotely. Set up an efficient remote development environment.

11.1 SSH Server Setup

# Install the OpenSSH server
sudo apt install -y openssh-server

# Start the SSH service and enable it at boot
sudo systemctl enable ssh
sudo systemctl start ssh

11.2 VS Code Remote - SSH

With the VS Code Remote - SSH extension, you can edit files on the remote server directly from your local VS Code and use a remote terminal.

Client-side (local) SSH config:

# ~/.ssh/config
Host gpu-server
    HostName 192.168.1.100
    User username
    Port 22
    IdentityFile ~/.ssh/id_ed25519
    ForwardAgent yes

11.3 Keeping Sessions Alive with tmux

Use tmux so that training keeps running even if the SSH connection drops.

# Install tmux
sudo apt install -y tmux

# Create a new session
tmux new -s training

# Detach the session: Ctrl+b, d

# Reattach to the session
tmux attach -t training

# List sessions
tmux ls

Essential tmux shortcuts:

  • Ctrl+b, d : detach the session (training keeps running)
  • Ctrl+b, c : new window
  • Ctrl+b, n/p : next/previous window
  • Ctrl+b, % : split left/right
  • Ctrl+b, " : split top/bottom

11.4 Port Forwarding

SSH port-forwarding setup for accessing Jupyter Notebook, TensorBoard, and similar tools from a local browser.

# Jupyter Notebook port forwarding (remote 8888 -> local 8888)
ssh -L 8888:localhost:8888 gpu-server

# TensorBoard port forwarding (remote 6006 -> local 6006)
ssh -L 6006:localhost:6006 gpu-server

# Forward multiple ports at once
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 gpu-server

12. Security Setup

If the GPU server is exposed to a network, basic security hardening is essential.

12.1 SSH Key Authentication Setup

# Generate a key on the client
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy the public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@gpu-server

# Disable password authentication on the server
sudo vim /etc/ssh/sshd_config

Edit the following entries in /etc/ssh/sshd_config:

PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no

# Restart the SSH service
sudo systemctl restart sshd

12.2 UFW Firewall Setup

# Install UFW
sudo apt install -y ufw

# Default policy: deny incoming traffic, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH
sudo ufw allow ssh

# Allow SSH from a specific subnet only (safer)
sudo ufw allow from 192.168.1.0/24 to any port 22

# Enable the firewall
sudo ufw enable

# Check status
sudo ufw status verbose

12.3 fail2ban Setup

Install fail2ban to defend against SSH brute-force attacks.

# Install
sudo apt install -y fail2ban

# Copy the default configuration
sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

# Edit the [sshd] section in /etc/fail2ban/jail.local
sudo vim /etc/fail2ban/jail.local

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# Start fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# Check status
sudo fail2ban-client status sshd

13. Automation: Ansible Playbook Example

If you have multiple servers or need to set up the environment repeatedly, Ansible can automate the entire configuration process.

13.1 Ansible Inventory

# inventory.ini
[gpu_servers]
gpu-server-01 ansible_host=192.168.1.101
gpu-server-02 ansible_host=192.168.1.102

[gpu_servers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/id_ed25519

13.2 GPU Server Setup Playbook

# gpu_server_setup.yml
---
- name: GPU Server Initial Setup
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: '550'
    cuda_keyring_url: 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb'

  tasks:
    # ===== Basic System Setup =====
    - name: Update and upgrade apt packages
      apt:
        update_cache: yes
        upgrade: dist
        cache_valid_time: 3600

    - name: Install essential packages
      apt:
        name:
          - build-essential
          - gcc
          - g++
          - make
          - cmake
          - linux-headers-{{ ansible_kernel }}
          - curl
          - wget
          - git
          - vim
          - htop
          - tmux
          - nvtop
          - net-tools
          - software-properties-common
        state: present

    - name: Set timezone to Asia/Seoul
      community.general.timezone:
        name: Asia/Seoul

    # ===== NVIDIA Driver Installation =====
    - name: Purge existing NVIDIA drivers
      apt:
        name: 'nvidia*'
        state: absent
        purge: yes
      ignore_errors: yes

    - name: Add NVIDIA PPA
      apt_repository:
        repo: ppa:graphics-drivers/ppa
        state: present

    - name: Install NVIDIA driver
      apt:
        name: 'nvidia-driver-{{ nvidia_driver_version }}'
        state: present
        update_cache: yes

    # ===== CUDA Toolkit Installation =====
    - name: Download CUDA keyring
      get_url:
        url: '{{ cuda_keyring_url }}'
        dest: /tmp/cuda-keyring.deb

    - name: Install CUDA keyring
      apt:
        deb: /tmp/cuda-keyring.deb

    - name: Install CUDA Toolkit
      apt:
        name: cuda-toolkit
        state: present
        update_cache: yes

    # ===== cuDNN Installation =====
    - name: Install cuDNN
      apt:
        name: cudnn-cuda-12
        state: present

    # ===== Docker Installation =====
    - name: Install Docker prerequisites
      apt:
        name:
          - ca-certificates
          - curl
          - gnupg
        state: present

    - name: Add Docker GPG key
      shell: |
        install -m 0755 -d /etc/apt/keyrings
        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
        chmod a+r /etc/apt/keyrings/docker.gpg
      args:
        creates: /etc/apt/keyrings/docker.gpg

    - name: Add Docker repository
      shell: |
        echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
      args:
        creates: /etc/apt/sources.list.d/docker.list

    - name: Install Docker
      apt:
        name:
          - docker-ce
          - docker-ce-cli
          - containerd.io
          - docker-buildx-plugin
          - docker-compose-plugin
        state: present
        update_cache: yes

    - name: Add user to docker group
      user:
        name: '{{ ansible_user }}'
        groups: docker
        append: yes

    # ===== NVIDIA Container Toolkit Installation =====
    - name: Add NVIDIA Container Toolkit GPG key
      shell: |
        curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      args:
        creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    - name: Add NVIDIA Container Toolkit repository
      shell: |
        curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      args:
        creates: /etc/apt/sources.list.d/nvidia-container-toolkit.list

    - name: Install NVIDIA Container Toolkit
      apt:
        name: nvidia-container-toolkit
        state: present
        update_cache: yes

    - name: Configure Docker runtime for NVIDIA
      command: nvidia-ctk runtime configure --runtime=docker

    - name: Restart Docker
      systemd:
        name: docker
        state: restarted

    # ===== Security Setup =====
    - name: Install security packages
      apt:
        name:
          - ufw
          - fail2ban
        state: present

    - name: Configure UFW - deny incoming
      ufw:
        direction: incoming
        policy: deny

    - name: Configure UFW - allow outgoing
      ufw:
        direction: outgoing
        policy: allow

    - name: Configure UFW - allow SSH
      ufw:
        rule: allow
        name: OpenSSH

    - name: Enable UFW
      ufw:
        state: enabled

    - name: Enable fail2ban
      systemd:
        name: fail2ban
        enabled: yes
        state: started

    # ===== GPU Persistence Mode =====
    - name: Enable NVIDIA Persistence Mode
      command: nvidia-smi -pm 1
      ignore_errors: yes

  handlers:
    - name: Reboot server
      reboot:
        reboot_timeout: 300

13.3 Running the Playbook

# Install Ansible
pip install ansible

# Run the playbook
ansible-playbook -i inventory.ini gpu_server_setup.yml

# Dry run (check without making changes)
ansible-playbook -i inventory.ini gpu_server_setup.yml --check

14. Summary: Overall Installation Order

The overall flow of building a deep learning GPU server is as follows.

1. Assemble hardware and configure the BIOS (disable Secure Boot)
    |
2. Install Ubuntu 22.04/24.04 and base packages
    |
3. Install the NVIDIA driver (apt method recommended)
    |
4. Install the CUDA Toolkit (multiple versions can be installed as needed)
    |
5. Install cuDNN
    |
6. Set up Conda/Mamba environments -> install PyTorch/TensorFlow
    |
7. Install Docker + the NVIDIA Container Toolkit
    |
8. Security setup (SSH keys, UFW, fail2ban)
    |
9. Install monitoring tools (nvtop, gpustat)
    |
10. Set up the remote development environment (VS Code Remote, tmux)

Following this guide in order gives you a complete deep learning development server, from the NVIDIA driver through a Docker GPU environment. If problems arise at any step, consult the official NVIDIA documentation for troubleshooting.


References

Complete Guide to Building a Linux GPU Server for Deep Learning

1. Deep Learning GPU Server Hardware Selection Guide

Before building a deep learning server, the first decision to make is the hardware configuration. Options vary depending on workload scale and budget.

1.1 GPU Selection

The GPU is the core component of a deep learning server. When choosing one, focus primarily on VRAM capacity, Tensor Core generation, and memory bandwidth.

GPUVRAMUse CaseApproximate Price
RTX 409024GB GDDR6XPersonal research, small-to-medium training~$1,600
RTX 509032GB GDDR7Personal research, large model fine-tuning~$2,000
RTX A6000 / RTX 6000 Ada48GB GDDR6Production, ECC memory support~$4,000+
A100 (80GB PCIe/SXM)80GB HBM2eLarge-scale training, NVLink support~$10,000+
H100 (80GB SXM)80GB HBM3Maximum-scale LLM training~$25,000+
H200141GB HBM3eLong-context LLM, memory bottleneck relief~$30,000+

Practical Recommendations:

  • Personal/Small Labs: RTX 4090 or RTX 5090. With 24-32GB VRAM, fine-tuning 7B-13B parameter models is feasible.
  • Enterprise/Research Labs: A100 80GB or H100. Data center-grade GPUs are necessary for environments where multi-GPU training via NVLink is essential.
  • If considering Multi-GPU: Be sure to check PCIe slot spacing, NVLink support, and power supply capacity.

1.2 CPU, RAM, Storage

ComponentRecommended SpecsReason
CPUAMD EPYC or Intel Xeon (16+ cores)Prevent data loading bottlenecks, sufficient PCIe lanes
RAMAt least 2x GPU VRAM (minimum 64GB, recommended 128GB+)Large dataset preprocessing, DataLoader worker memory
OS StorageNVMe SSD 500GB+Fast boot, package cache
Data StorageNVMe SSD 2TB+ or RAID configurationPrevent training data I/O bottlenecks
PSU1200W+ (1600W+ for multi-GPU)A single RTX 4090 draws 450W, adequate headroom is essential

The CPU is relatively less critical compared to the GPU, but the number of PCIe lanes can become a bottleneck in multi-GPU environments, so server-grade platforms (AMD EPYC, Intel Xeon) are recommended. Desktop platforms (AM5, LGA1700) are sufficient for single-GPU setups.


2. Ubuntu 22.04/24.04 Installation and Initial Setup

For deep learning server operating systems, Ubuntu 22.04 LTS or Ubuntu 24.04 LTS are the most widely used. This is because NVIDIA driver, CUDA, and various deep learning framework support is prioritized for these distributions.

2.1 Post-Installation Basic Setup

# System update
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential gcc g++ make cmake

# Install kernel headers (required for NVIDIA driver build)
sudo apt install -y linux-headers-$(uname -r)

# Network tools
sudo apt install -y net-tools curl wget git vim htop

# Set timezone
sudo timedatectl set-timezone Asia/Seoul

2.2 Disable Secure Boot

Since NVIDIA drivers need to load kernel modules, it is recommended to disable Secure Boot in the BIOS. While it is possible to use MOK (Machine Owner Key) registration with Secure Boot enabled, the setup can become complex.

# Check Secure Boot status
mokutil --sb-state

3. NVIDIA Driver Installation

The official NVIDIA documentation describes two main installation methods: the package manager (apt) method and the .run file method. For server environments, the apt repository method is recommended for ease of management.

According to the official NVIDIA driver installation guide (https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html), the installation on Ubuntu proceeds as follows.

Remove Existing Drivers

# Completely remove existing NVIDIA-related packages
sudo apt-get purge nvidia* -y
sudo apt-get autoremove -y
sudo apt-get autoclean

Automatic Installation Using ubuntu-drivers

On Ubuntu, you can use the ubuntu-drivers tool to automatically install the appropriate driver for your system.

# Check available driver list
sudo ubuntu-drivers devices

# Auto-install recommended driver
sudo ubuntu-drivers autoinstall

Manual Installation Using NVIDIA Official Repository

If you need a specific driver version, add NVIDIA's official apt repository and install.

# Add NVIDIA official repository key and source
sudo apt install -y software-properties-common
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update

# Install specific version (e.g., version 550)
sudo apt install -y nvidia-driver-550

# Reboot
sudo reboot

Verify Installation

nvidia-smi

If installed correctly, the GPU name, driver version, supported CUDA version, and other details will be displayed.

3.2 Method 2: .run File Method

This method involves downloading the .run file directly from the NVIDIA website. It is used when a special kernel environment or custom build is required.

# Disable Nouveau driver
sudo bash -c "echo 'blacklist nouveau' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot

# Stop GUI environment (if desktop environment is installed on the server)
sudo systemctl stop gdm3
# or
sudo systemctl stop lightdm

# Run .run file
chmod +x NVIDIA-Linux-x86_64-550.120.run
sudo ./NVIDIA-Linux-x86_64-550.120.run

Drawbacks of the .run file method: The driver may break during kernel updates, and since it is not managed by the package manager, updates and removal are inconvenient. The apt method is strongly recommended for production server environments.


4. Driver Version Selection Strategy

NVIDIA operates two branches for data center/server drivers.

4.1 Production Branch vs New Feature Branch

CategoryProduction Branch (PB)New Feature Branch (NFB)
StabilityHigh (extensively tested)Relatively lower
New FeaturesNo new features after releaseIncludes latest GPU support, new features
Update PolicyBug fixes and security patches onlyPeriodic releases with new features
Recommended ForProduction servers, stability-firstDevelopment/testing, latest GPUs

Practical Guidelines:

  • Production environments: Use the Production Branch. For example, if the 550.xx series is designated as PB, only bug-fix updates within that series will be provided.
  • Latest GPUs (e.g., Blackwell architecture): You may need to use the New Feature Branch. Support for the latest hardware is provided first through NFB.
  • Use -server packages on servers: Packages with the -server suffix such as nvidia-driver-550-server are Enterprise Ready Drivers (ERD), optimized for server environments.
# Check available server drivers
apt list --installed 2>/dev/null | grep nvidia-driver
apt-cache search nvidia-driver | grep server

5. CUDA Toolkit Installation

The CUDA Toolkit is a development environment for performing parallel computing on NVIDIA GPUs. Install it following the official NVIDIA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

5.1 Pre-Installation Checks

# Verify CUDA-capable GPU
lspci | grep -i nvidia

# Verify gcc installation
gcc --version

# Verify kernel headers
uname -r
sudo apt install -y linux-headers-$(uname -r)
# Install CUDA keyring package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA Toolkit (latest version)
sudo apt-get install -y cuda-toolkit

# Or install a specific version
sudo apt-get install -y cuda-toolkit-12-6

5.3 Environment Variable Setup

# Add to ~/.bashrc or ~/.zshrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Apply
source ~/.bashrc

5.4 Managing Multiple CUDA Versions

In practice, different projects frequently require different CUDA versions. You can install multiple CUDA versions under /usr/local/ and manage them with update-alternatives.

# Install multiple versions (e.g., 12.4 and 12.6)
sudo apt-get install -y cuda-toolkit-12-4
sudo apt-get install -y cuda-toolkit-12-6

# Verify installations under /usr/local/
ls /usr/local/ | grep cuda
# cuda -> cuda-12.6 (symbolic link)
# cuda-12.4
# cuda-12.6

# Register version switching with update-alternatives
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.4 10
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.6 20

# Switch versions
sudo update-alternatives --config cuda

Running update-alternatives --config cuda presents an interactive menu where you can select the desired CUDA version. This approach changes the /usr/local/cuda symbolic link to point to the selected version.

# Verify installation
nvcc --version

6. CUDA Version and Driver Compatibility Matrix

There are minimum driver version requirements between the CUDA Toolkit and NVIDIA drivers. The compatibility table can be found in the official NVIDIA CUDA Toolkit Release Notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

6.1 Key Compatibility Table (Linux x86_64)

CUDA Toolkit VersionMinimum Driver Version (Linux)
CUDA 12.0>= 525.60.13
CUDA 12.1>= 530.30.02
CUDA 12.2>= 535.54.03
CUDA 12.3>= 545.23.06
CUDA 12.4>= 550.54.14
CUDA 12.5>= 555.42.02
CUDA 12.6>= 560.28.03
CUDA 13.0>= 570.86.15
CUDA 13.1>= 575.51.03

Key point: The "CUDA Version" shown in nvidia-smi represents the maximum CUDA Runtime version supported by that driver. It may differ from the actually installed CUDA Toolkit version. Verify the actual CUDA Toolkit version with nvcc --version.

6.2 CUDA Forward/Minor Compatibility

According to NVIDIA's CUDA Compatibility documentation (https://docs.nvidia.com/deploy/cuda-compatibility/), from CUDA 12.x onward, Minor Version Compatibility is supported. This means that even with a driver installed for CUDA 12.0, applications compiled with CUDA 12.6 can be run with some limitations.

# Check the CUDA version supported by the driver
nvidia-smi | head -3

# Check the actually installed CUDA Toolkit version
nvcc --version

7. cuDNN Installation and Configuration

cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library optimized for deep learning. It provides highly optimized implementations of operations such as Convolution, Pooling, Normalization, and Activation, and is used internally by frameworks like PyTorch and TensorFlow.

Install it based on the official NVIDIA cuDNN installation guide (https://docs.nvidia.com/deeplearning/cudnn/installation/latest/linux.html).

# If CUDA keyring is already installed, you can install the cuDNN package directly
sudo apt-get install -y cudnn

# Install cuDNN for a specific CUDA version
sudo apt-get install -y cudnn-cuda-12

7.2 Tarball Installation

Use a tarball when you need a specific version or prefer not to use the system package manager.

# After downloading from the NVIDIA Developer site
tar -xvf cudnn-linux-x86_64-9.x.x.x_cudaXX-archive.tar.xz

# Copy files
sudo cp cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

7.3 Verify Installation

# Check cuDNN version (when installed via apt)
dpkg -l | grep cudnn

# Or check directly from the header file
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
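The `grep` above just reads version macros out of `cudnn_version.h`; the same extraction can be done in Python when you need the version as a single string, e.g. for logging. A sketch with an illustrative excerpt of the header (macro values are examples, not a claim about your installation):

```python
import re

# Illustrative excerpt of /usr/local/cuda/include/cudnn_version.h;
# the actual numbers depend on the cuDNN release you installed.
HEADER = """
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 1
"""

def cudnn_version(header_text):
    """Assemble 'major.minor.patch' from the cuDNN version macros."""
    nums = [re.search(rf"#define\s+{name}\s+(\d+)", header_text).group(1)
            for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL")]
    return ".".join(nums)

print(cudnn_version(HEADER))  # 9.5.1
```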

8. Conda/Mamba Environment Setup

Separate from the system-level CUDA Toolkit, installing PyTorch and TensorFlow within Conda/Mamba virtual environments can prevent CUDA version conflicts. Recent versions of PyTorch and TensorFlow bundle their own CUDA Runtime, operating independently of the system CUDA version.

8.1 Miniforge (Mamba) Installation

# Install Miniforge (includes Mamba)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3

# Set environment variables
$HOME/miniforge3/bin/mamba init bash
source ~/.bashrc

8.2 PyTorch Installation

# Create deep learning environment
mamba create -n dl python=3.11 -y
mamba activate dl

# Install PyTorch (for CUDA 12.4 - check the official site for the latest command)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Verify installation (run inside Python)
python - <<'EOF'
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
EOF

8.3 TensorFlow Installation

# Install TensorFlow (with GPU support)
pip install tensorflow[and-cuda]
# Verify installation (run inside Python)
python - <<'EOF'
import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")
EOF

Tip: Installing PyTorch and TensorFlow in the same environment may cause CUDA library version conflicts. It is recommended to use separate Conda environments whenever possible.


9. Docker + NVIDIA Container Toolkit Setup

Docker allows you to manage deep learning environments isolated at the container level. Installing the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enables GPU access within Docker containers.

9.1 Docker Installation

# Set up Docker official repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Add current user to docker group (to use without sudo)
sudo usermod -aG docker $USER
newgrp docker

9.2 NVIDIA Container Toolkit Installation

The installation process follows the official NVIDIA documentation.

# Set up NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install packages
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

9.3 GPU Docker Testing

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Use specific GPU only
docker run --rm --gpus '"device=0"' nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Run PyTorch container with all GPUs
docker run --rm --gpus all -it pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime python -c \
  "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

9.4 Using GPUs in Docker Compose

# docker-compose.yml
services:
  training:
    image: pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # or a specific number: 1, 2, etc.
              capabilities: [gpu]
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
    shm_size: '8g' # PyTorch DataLoader shared memory

Note: If shm_size is not set, PyTorch's DataLoader may encounter shared memory shortage errors when num_workers > 0.
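Before launching a container, it can be useful to check how much shared memory is actually available, since that is what DataLoader workers use for inter-process tensor transfer. A small sketch using only the standard library (`shm_free_gib` is an illustrative helper):

```python
import os

def shm_free_gib(path="/dev/shm"):
    """Return free space at `path` in GiB, via statvfs (POSIX only)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1 << 30)

# Fall back to /tmp on systems without a /dev/shm mount.
target = "/dev/shm" if os.path.isdir("/dev/shm") else "/tmp"
print(f"{target} free: {shm_free_gib(target):.1f} GiB")
```

Inside a container, this reports the `shm_size` you configured (minus whatever is in use), which makes it easy to confirm the Compose setting took effect.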


10. nvidia-smi and nvtop Monitoring Tools

Monitoring is essential in GPU server operations. You need to be able to check GPU utilization, memory usage, temperature, and other metrics in real time.

10.1 nvidia-smi

nvidia-smi is the default monitoring tool installed with the NVIDIA driver.

# Basic status check
nvidia-smi

# Real-time monitoring (1-second interval)
watch -n 1 nvidia-smi

# Query specific information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Per-process GPU usage
nvidia-smi pmon -s um -d 1

# Enable Persistence Mode (recommended for server environments)
sudo nvidia-smi -pm 1

Enabling Persistence Mode keeps the GPU driver loaded at all times, eliminating the initialization delay (~several seconds) that occurs when the GPU is first called. This should always be enabled in server environments.
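The `--query-gpu ... --format=csv` output shown above is easy to consume from scripts, e.g. for a cron-based logger. A sketch that parses it with the standard `csv` module (the sample row is illustrative, not real telemetry):

```python
import csv
import io

# Illustrative output of:
#   nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
SAMPLE = """name, temperature.gpu, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
NVIDIA GeForce RTX 4090, 45, 87 %, 18432 MiB, 24564 MiB"""

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)

rows = parse_gpu_csv(SAMPLE)
print(rows[0]["name"])             # NVIDIA GeForce RTX 4090
print(rows[0]["temperature.gpu"])  # 45
```

In a real script you would feed `subprocess.run([...]).stdout` into `parse_gpu_csv` instead of the sample string.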

10.2 nvtop

nvtop is a GPU monitoring tool with an interface similar to htop. It displays utilization, memory, temperature, and process information for multiple GPUs on a single screen in real time.

# Install
sudo apt install -y nvtop

# Run
nvtop

10.3 gpustat

If you prefer concise output, gpustat is also useful.

pip install gpustat

# Real-time monitoring
gpustat -i 1 --color

11. SSH Remote Development Environment

GPU servers are typically accessed remotely. Set up an efficient remote development environment.

11.1 SSH Server Setup

# Install OpenSSH server
sudo apt install -y openssh-server

# Start SSH service and enable auto-start
sudo systemctl enable ssh
sudo systemctl start ssh

11.2 VS Code Remote - SSH

Using VS Code's Remote - SSH extension, you can directly edit files on the remote server from your local VS Code and use the remote terminal.

Client (local) SSH config setup:

# ~/.ssh/config
Host gpu-server
    HostName 192.168.1.100
    User username
    Port 22
    IdentityFile ~/.ssh/id_ed25519
    ForwardAgent yes

11.3 Session Persistence with tmux

Use tmux to keep training running even if the SSH connection is lost.

# Install tmux
sudo apt install -y tmux

# Create new session
tmux new -s training

# Detach session: Ctrl+b, d

# Reattach to session
tmux attach -t training

# List sessions
tmux ls

Essential tmux shortcuts:

  • Ctrl+b, d : Detach session (training continues running)
  • Ctrl+b, c : New window
  • Ctrl+b, n/p : Next/previous window
  • Ctrl+b, % : Horizontal split
  • Ctrl+b, " : Vertical split

11.4 Port Forwarding

SSH port forwarding configuration for accessing Jupyter Notebook, TensorBoard, and similar tools from your local browser.

# Jupyter Notebook port forwarding (remote 8888 -> local 8888)
ssh -L 8888:localhost:8888 gpu-server

# TensorBoard port forwarding (remote 6006 -> local 6006)
ssh -L 6006:localhost:6006 gpu-server

# Forward multiple ports simultaneously
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 gpu-server

12. Security Settings

If your GPU server is exposed to the network, basic security settings are essential.

12.1 SSH Key Authentication Setup

# Generate key on client
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@gpu-server

# Disable password authentication on server
sudo vim /etc/ssh/sshd_config

Modify the following entries in /etc/ssh/sshd_config:

PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no

# Restart SSH service
sudo systemctl restart sshd

12.2 UFW Firewall Setup

# Install and enable UFW
sudo apt install -y ufw

# Default policy: deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH
sudo ufw allow ssh

# Allow SSH from specific IP only (more secure)
sudo ufw allow from 192.168.1.0/24 to any port 22

# Enable firewall
sudo ufw enable

# Check status
sudo ufw status verbose
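The `allow from 192.168.1.0/24` rule above is a CIDR match; Python's standard `ipaddress` module implements the same subnet logic, which is handy for verifying which clients a rule will admit before you apply it:

```python
import ipaddress

# The subnet used in the UFW rule above.
allowed = ipaddress.ip_network("192.168.1.0/24")

# Check a few illustrative client addresses against the rule.
for client in ["192.168.1.42", "192.168.2.7", "10.0.0.1"]:
    print(client, ipaddress.ip_address(client) in allowed)
```

Only `192.168.1.42` falls inside the /24, so only that client would reach port 22 under the stricter rule.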

12.3 fail2ban Setup

Install fail2ban to defend against SSH brute force attacks.

# Install
sudo apt install -y fail2ban

# Copy configuration file
sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

# Modify the [sshd] section in /etc/fail2ban/jail.local
sudo vim /etc/fail2ban/jail.local

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# Start fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# Check status
sudo fail2ban-client status sshd
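To make the `maxretry` / `findtime` / `bantime` parameters concrete, here is a minimal sketch of the counting logic they configure: ban a client after `maxretry` failures within a `findtime` window, for `bantime` seconds. This is an illustration of the policy only, not fail2ban's actual implementation:

```python
from collections import defaultdict

# Values from the [sshd] jail configured above.
MAXRETRY, FINDTIME, BANTIME = 3, 600, 3600

class BanTracker:
    """Toy model of fail2ban's failure window and ban timer."""
    def __init__(self):
        self.failures = defaultdict(list)  # ip -> failure timestamps
        self.banned_until = {}             # ip -> unban timestamp

    def record_failure(self, ip, now):
        """Record a failed login; return True if the IP is now banned."""
        window = [t for t in self.failures[ip] if now - t < FINDTIME]
        window.append(now)
        self.failures[ip] = window
        if len(window) >= MAXRETRY:
            self.banned_until[ip] = now + BANTIME
            return True
        return False

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

tracker = BanTracker()
tracker.record_failure("10.0.0.5", 0)
tracker.record_failure("10.0.0.5", 100)
print(tracker.record_failure("10.0.0.5", 200))  # True: 3rd failure within 600 s
print(tracker.is_banned("10.0.0.5", 500))       # True: ban lasts 3600 s
```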

13. Automation: Ansible Playbook Example

When managing multiple servers or repeatedly configuring environments, Ansible can be used to automate the entire setup process.

13.1 Ansible Inventory

# inventory.ini
[gpu_servers]
gpu-server-01 ansible_host=192.168.1.101
gpu-server-02 ansible_host=192.168.1.102

[gpu_servers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/id_ed25519

13.2 GPU Server Setup Playbook

# gpu_server_setup.yml
---
- name: GPU Server Initial Setup
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: '550'
    cuda_keyring_url: 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb'

  tasks:
    # ===== System Basic Setup =====
    - name: Update and upgrade apt packages
      apt:
        update_cache: yes
        upgrade: dist
        cache_valid_time: 3600

    - name: Install essential packages
      apt:
        name:
          - build-essential
          - gcc
          - g++
          - make
          - cmake
          - linux-headers-{{ ansible_kernel }}
          - curl
          - wget
          - git
          - vim
          - htop
          - tmux
          - nvtop
          - net-tools
          - software-properties-common
        state: present

    - name: Set timezone to Asia/Seoul
      community.general.timezone:
        name: Asia/Seoul

    # ===== NVIDIA Driver Installation =====
    - name: Purge existing NVIDIA drivers
      apt:
        name: 'nvidia*'
        state: absent
        purge: yes
      ignore_errors: yes

    - name: Add NVIDIA PPA
      apt_repository:
        repo: ppa:graphics-drivers/ppa
        state: present

    - name: Install NVIDIA driver
      apt:
        name: 'nvidia-driver-{{ nvidia_driver_version }}'
        state: present
        update_cache: yes

    # ===== CUDA Toolkit Installation =====
    - name: Download CUDA keyring
      get_url:
        url: '{{ cuda_keyring_url }}'
        dest: /tmp/cuda-keyring.deb

    - name: Install CUDA keyring
      apt:
        deb: /tmp/cuda-keyring.deb

    - name: Install CUDA Toolkit
      apt:
        name: cuda-toolkit
        state: present
        update_cache: yes

    # ===== cuDNN Installation =====
    - name: Install cuDNN
      apt:
        name: cudnn-cuda-12
        state: present

    # ===== Docker Installation =====
    - name: Install Docker prerequisites
      apt:
        name:
          - ca-certificates
          - curl
          - gnupg
        state: present

    - name: Add Docker GPG key
      shell: |
        install -m 0755 -d /etc/apt/keyrings
        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
        chmod a+r /etc/apt/keyrings/docker.gpg
      args:
        creates: /etc/apt/keyrings/docker.gpg

    - name: Add Docker repository
      shell: |
        echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
      args:
        creates: /etc/apt/sources.list.d/docker.list

    - name: Install Docker
      apt:
        name:
          - docker-ce
          - docker-ce-cli
          - containerd.io
          - docker-buildx-plugin
          - docker-compose-plugin
        state: present
        update_cache: yes

    - name: Add user to docker group
      user:
        name: '{{ ansible_user }}'
        groups: docker
        append: yes

    # ===== NVIDIA Container Toolkit Installation =====
    - name: Add NVIDIA Container Toolkit GPG key
      shell: |
        curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      args:
        creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    - name: Add NVIDIA Container Toolkit repository
      shell: |
        curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      args:
        creates: /etc/apt/sources.list.d/nvidia-container-toolkit.list

    - name: Install NVIDIA Container Toolkit
      apt:
        name: nvidia-container-toolkit
        state: present
        update_cache: yes

    - name: Configure Docker runtime for NVIDIA
      command: nvidia-ctk runtime configure --runtime=docker

    - name: Restart Docker
      systemd:
        name: docker
        state: restarted

    # ===== Security Settings =====
    - name: Install security packages
      apt:
        name:
          - ufw
          - fail2ban
        state: present

    - name: Configure UFW - deny incoming
      ufw:
        direction: incoming
        policy: deny

    - name: Configure UFW - allow outgoing
      ufw:
        direction: outgoing
        policy: allow

    - name: Configure UFW - allow SSH
      ufw:
        rule: allow
        name: OpenSSH

    - name: Enable UFW
      ufw:
        state: enabled

    - name: Enable fail2ban
      systemd:
        name: fail2ban
        enabled: yes
        state: started

    # ===== GPU Persistence Mode =====
    - name: Enable NVIDIA Persistence Mode
      command: nvidia-smi -pm 1
      ignore_errors: yes

  handlers:
    - name: Reboot server
      reboot:
        reboot_timeout: 300

13.3 Running the Playbook

# Install Ansible
pip install ansible

# Run Playbook
ansible-playbook -i inventory.ini gpu_server_setup.yml

# Dry run (check without actually executing)
ansible-playbook -i inventory.ini gpu_server_setup.yml --check

14. Summary: Complete Installation Order

The complete flow for building a deep learning GPU server is as follows.

1. Hardware assembly and BIOS setup (disable Secure Boot)
    |
2. Ubuntu 22.04/24.04 installation and basic package setup
    |
3. NVIDIA driver installation (apt method recommended)
    |
4. CUDA Toolkit installation (multiple versions can be installed)
    |
5. cuDNN installation
    |
6. Conda/Mamba environment setup -> PyTorch/TensorFlow installation
    |
7. Docker + NVIDIA Container Toolkit installation
    |
8. Security settings (SSH Key, UFW, fail2ban)
    |
9. Monitoring tools installation (nvtop, gpustat)
    |
10. Remote development environment setup (VS Code Remote, tmux)
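After running through the steps above, a quick way to spot a missed step is to check which of the installed tools are actually on PATH. A purely illustrative checklist script (`check_tools` is a hypothetical helper, and PATH presence is only a rough proxy for a working install):

```python
import shutil

# One representative command-line tool per major installation step above.
CHECKS = ["nvidia-smi", "nvcc", "docker", "nvtop", "tmux"]

def check_tools(tools):
    """Map each tool name to whether it is found on PATH."""
    return {t: shutil.which(t) is not None for t in tools}

for tool, ok in check_tools(CHECKS).items():
    print(f"{tool:12s} {'OK' if ok else 'MISSING'}")
```

A `MISSING` entry points you back at the corresponding section; for a deeper check, follow up with the verification commands given in each section (e.g. `nvidia-smi`, `nvcc --version`, `docker run --gpus all ...`).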

By following the content covered in this guide step by step, you can build a complete deep learning development server from NVIDIA drivers to Docker GPU environments. If issues arise at any stage, it is recommended to consult the official NVIDIA documentation for troubleshooting.

