
Complete Guide to Building a Linux GPU Server for Deep Learning

1. Deep Learning GPU Server Hardware Selection Guide

Before building a deep learning server, the first decision to make is the hardware configuration. Options vary depending on workload scale and budget.

1.1 GPU Selection

The GPU is the core component of a deep learning server. When choosing one, focus primarily on VRAM capacity, Tensor Core generation, and memory bandwidth.

| GPU | VRAM | Use Case | Approximate Price |
| --- | --- | --- | --- |
| RTX 4090 | 24GB GDDR6X | Personal research, small-to-medium training | ~$1,600 |
| RTX 5090 | 32GB GDDR7 | Personal research, large model fine-tuning | ~$2,000 |
| RTX A6000 / RTX 6000 Ada | 48GB GDDR6 | Production, ECC memory support | ~$4,000+ |
| A100 (80GB PCIe/SXM) | 80GB HBM2e | Large-scale training, NVLink support | ~$10,000+ |
| H100 (80GB SXM) | 80GB HBM3 | Maximum-scale LLM training | ~$25,000+ |
| H200 | 141GB HBM3e | Long-context LLM, memory bottleneck relief | ~$30,000+ |

Practical Recommendations:

  • Personal/Small Labs: RTX 4090 or RTX 5090. With 24-32GB VRAM, fine-tuning 7B-13B parameter models is feasible using parameter-efficient methods such as LoRA/QLoRA (full fine-tuning at that scale requires far more memory).
  • Enterprise/Research Labs: A100 80GB or H100. Data center-grade GPUs are necessary for environments where multi-GPU training via NVLink is essential.
  • If considering Multi-GPU: Be sure to check PCIe slot spacing, NVLink support, and power supply capacity.
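As a rough cross-check on the VRAM figures above, memory needs can be estimated from parameter count. The sketch below uses the common mixed-precision rule of thumb (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 Adam master weights and moments); it ignores activations and is back-of-envelope only, which is also why parameter-efficient methods are what make 7B-13B fine-tuning fit on 24-32GB cards.

```python
def inference_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only estimate for loading a model at a given precision."""
    return n_params * bytes_per_param / 1024**3

def full_finetune_vram_gb(n_params: float) -> float:
    """Rough full fine-tuning estimate with Adam (activations excluded).

    Rule-of-thumb assumptions: 2 B/param fp16 weights + 2 B/param fp16
    gradients + 12 B/param fp32 optimizer state = 16 bytes per parameter.
    """
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / 1024**3

if __name__ == "__main__":
    for n in (7e9, 13e9):
        print(f"{n / 1e9:.0f}B params: ~{inference_vram_gb(n):.0f} GB to load (fp16), "
              f"~{full_finetune_vram_gb(n):.0f} GB for full fine-tuning")
```

A 7B model already exceeds a single 24GB card for full fine-tuning under these assumptions, which motivates the LoRA/QLoRA recommendation above.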

1.2 CPU, RAM, Storage

| Component | Recommended Specs | Reason |
| --- | --- | --- |
| CPU | AMD EPYC or Intel Xeon (16+ cores) | Prevent data-loading bottlenecks; sufficient PCIe lanes |
| RAM | At least 2x GPU VRAM (minimum 64GB, recommended 128GB+) | Large-dataset preprocessing; DataLoader worker memory |
| OS Storage | NVMe SSD, 500GB+ | Fast boot, package cache |
| Data Storage | NVMe SSD, 2TB+ or RAID configuration | Prevent training-data I/O bottlenecks |
| PSU | 1200W+ (1600W+ for multi-GPU) | A single RTX 4090 draws 450W; adequate headroom is essential |

The CPU is relatively less critical compared to the GPU, but the number of PCIe lanes can become a bottleneck in multi-GPU environments, so server-grade platforms (AMD EPYC, Intel Xeon) are recommended. Desktop platforms (AM5, LGA1700) are sufficient for single-GPU setups.
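To make the lane arithmetic concrete, here is an illustrative budget. The x16-per-GPU and x4-per-NVMe figures are standard PCIe slot widths; the platform lane totals below are approximate and vary by exact SKU.

```python
# Illustrative PCIe lane budget for a multi-GPU box (approximate figures).
GPU_LANES = 16   # full-bandwidth x16 slot per GPU
NVME_LANES = 4   # x4 per NVMe drive

def lanes_needed(n_gpus: int, n_nvme: int) -> int:
    """Total PCIe lanes required to run every device at full width."""
    return n_gpus * GPU_LANES + n_nvme * NVME_LANES

# Approximate usable CPU lane counts (check your exact CPU's spec sheet):
PLATFORMS = {"AM5 desktop": 24, "Intel Xeon": 64, "AMD EPYC": 128}

if __name__ == "__main__":
    need = lanes_needed(n_gpus=4, n_nvme=2)  # 4*16 + 2*4 = 72 lanes
    for name, lanes in PLATFORMS.items():
        verdict = "fits" if lanes >= need else "falls short"
        print(f"{name}: {lanes} lanes vs {need} needed -> {verdict}")
```

A 4-GPU build needs ~72 lanes, which is why desktop platforms force GPUs down to x8/x4 links while server platforms do not.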


2. Ubuntu 22.04/24.04 Installation and Initial Setup

Ubuntu 22.04 LTS and Ubuntu 24.04 LTS are the most widely used operating systems for deep learning servers, because NVIDIA driver, CUDA, and deep learning framework support is prioritized for these distributions.

2.1 Post-Installation Basic Setup

# System update
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential gcc g++ make cmake

# Install kernel headers (required for NVIDIA driver build)
sudo apt install -y linux-headers-$(uname -r)

# Network tools
sudo apt install -y net-tools curl wget git vim htop

# Set timezone
sudo timedatectl set-timezone Asia/Seoul

2.2 Disable Secure Boot

Since NVIDIA drivers need to load kernel modules, it is recommended to disable Secure Boot in the BIOS. While it is possible to use MOK (Machine Owner Key) registration with Secure Boot enabled, the setup can become complex.

# Check Secure Boot status
mokutil --sb-state

3. NVIDIA Driver Installation

The official NVIDIA documentation describes two main installation methods: the package manager (apt) method and the .run file method. For server environments, the apt repository method is recommended for ease of management.

According to the official NVIDIA driver installation guide (https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html), the installation on Ubuntu proceeds as follows.

3.1 Method 1: Package Manager (apt)

Remove Existing Drivers

# Completely remove existing NVIDIA-related packages
sudo apt-get purge 'nvidia*' -y
sudo apt-get autoremove -y
sudo apt-get autoclean

Automatic Installation Using ubuntu-drivers

On Ubuntu, you can use the ubuntu-drivers tool to automatically install the appropriate driver for your system.

# Check available driver list
sudo ubuntu-drivers devices

# Auto-install recommended driver
sudo ubuntu-drivers autoinstall

Manual Installation Using the graphics-drivers PPA

If you need a specific driver version, add the graphics-drivers PPA and install from it. Note that this PPA is community-maintained, not an official NVIDIA repository; NVIDIA's own apt repository is the CUDA repository covered in section 5.

# Add the graphics-drivers PPA
sudo apt install -y software-properties-common
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update

# Install specific version (e.g., version 550)
sudo apt install -y nvidia-driver-550

# Reboot
sudo reboot

Verify Installation

nvidia-smi

If installed correctly, the GPU name, driver version, supported CUDA version, and other details will be displayed.

3.2 Method 2: .run File Method

This method involves downloading the .run file directly from the NVIDIA website. It is used when a special kernel environment or custom build is required.

# Disable Nouveau driver
sudo bash -c "echo 'blacklist nouveau' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot

# Stop GUI environment (if desktop environment is installed on the server)
sudo systemctl stop gdm3
# or
sudo systemctl stop lightdm

# Run .run file
chmod +x NVIDIA-Linux-x86_64-550.120.run
sudo ./NVIDIA-Linux-x86_64-550.120.run

Drawbacks of the .run file method: The driver may break during kernel updates, and since it is not managed by the package manager, updates and removal are inconvenient. The apt method is strongly recommended for production server environments.


4. Driver Version Selection Strategy

NVIDIA operates two branches for data center/server drivers.

4.1 Production Branch vs New Feature Branch

| Category | Production Branch (PB) | New Feature Branch (NFB) |
| --- | --- | --- |
| Stability | High (extensively tested) | Relatively lower |
| New Features | No new features after release | Includes latest GPU support, new features |
| Update Policy | Bug fixes and security patches only | Periodic releases with new features |
| Recommended For | Production servers, stability-first | Development/testing, latest GPUs |

Practical Guidelines:

  • Production environments: Use the Production Branch. For example, if the 550.xx series is designated as PB, only bug-fix updates within that series will be provided.
  • Latest GPUs (e.g., Blackwell architecture): You may need to use the New Feature Branch. Support for the latest hardware is provided first through NFB.
  • Use -server packages on servers: Packages with the -server suffix such as nvidia-driver-550-server are Enterprise Ready Drivers (ERD), optimized for server environments.

# Check the currently installed driver packages
apt list --installed 2>/dev/null | grep nvidia-driver

# Search for available -server driver packages
apt-cache search nvidia-driver | grep server

5. CUDA Toolkit Installation

The CUDA Toolkit is a development environment for performing parallel computing on NVIDIA GPUs. Install it following the official NVIDIA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

5.1 Pre-Installation Checks

# Verify CUDA-capable GPU
lspci | grep -i nvidia

# Verify gcc installation
gcc --version

# Verify kernel headers
uname -r
sudo apt install -y linux-headers-$(uname -r)

5.2 apt Repository Installation

# Install CUDA keyring package (URL below is for Ubuntu 22.04; use the ubuntu2404 repo path on 24.04)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA Toolkit (latest version)
sudo apt-get install -y cuda-toolkit

# Or install a specific version
sudo apt-get install -y cuda-toolkit-12-6

5.3 Environment Variable Setup

# Add to ~/.bashrc or ~/.zshrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Apply
source ~/.bashrc

5.4 Managing Multiple CUDA Versions

In practice, different projects frequently require different CUDA versions. You can install multiple CUDA versions under /usr/local/ and manage them with update-alternatives.

# Install multiple versions (e.g., 12.4 and 12.6)
sudo apt-get install -y cuda-toolkit-12-4
sudo apt-get install -y cuda-toolkit-12-6

# Verify installations under /usr/local/
ls /usr/local/ | grep cuda
# cuda -> cuda-12.6 (symbolic link)
# cuda-12.4
# cuda-12.6

# Register version switching with update-alternatives
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.4 10
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.6 20

# Switch versions
sudo update-alternatives --config cuda

Running update-alternatives --config cuda presents an interactive menu where you can select the desired CUDA version. This approach changes the /usr/local/cuda symbolic link to point to the selected version.
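The mechanism is nothing more than a repointed symlink, which the short simulation below illustrates in a throwaway directory (illustrative only; on a real server, update-alternatives manages /usr/local/cuda for you).

```python
import os
import tempfile

def active_cuda(link_path: str) -> str:
    """Return the directory name a cuda symlink currently resolves to."""
    return os.path.basename(os.path.realpath(link_path))

if __name__ == "__main__":
    # Simulate /usr/local in a temporary directory instead of the real system.
    with tempfile.TemporaryDirectory() as root:
        for ver in ("12.4", "12.6"):
            os.mkdir(os.path.join(root, f"cuda-{ver}"))
        link = os.path.join(root, "cuda")

        os.symlink(os.path.join(root, "cuda-12.6"), link)  # what --config cuda does
        print(active_cuda(link))                           # -> cuda-12.6

        os.remove(link)                                    # "switching" = repointing the link
        os.symlink(os.path.join(root, "cuda-12.4"), link)
        print(active_cuda(link))                           # -> cuda-12.4
```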

# Verify installation
nvcc --version

6. CUDA Version and Driver Compatibility Matrix

There are minimum driver version requirements between the CUDA Toolkit and NVIDIA drivers. The compatibility table can be found in the official NVIDIA CUDA Toolkit Release Notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

6.1 Key Compatibility Table (Linux x86_64)

| CUDA Toolkit Version | Minimum Driver Version (Linux) |
| --- | --- |
| CUDA 12.0 | >= 525.60.13 |
| CUDA 12.1 | >= 530.30.02 |
| CUDA 12.2 | >= 535.54.03 |
| CUDA 12.3 | >= 545.23.06 |
| CUDA 12.4 | >= 550.54.14 |
| CUDA 12.5 | >= 555.42.02 |
| CUDA 12.6 | >= 560.28.03 |
| CUDA 12.8 | >= 570.86.15 |
| CUDA 12.9 | >= 575.51.03 |

Key point: The "CUDA Version" shown in nvidia-smi represents the maximum CUDA Runtime version supported by that driver. It may differ from the actually installed CUDA Toolkit version. Verify the actual CUDA Toolkit version with nvcc --version.

6.2 CUDA Forward/Minor Compatibility

According to NVIDIA's CUDA Compatibility documentation (https://docs.nvidia.com/deploy/cuda-compatibility/), Minor Version Compatibility applies within a major release family such as CUDA 12.x. This means that even with a driver that only meets the CUDA 12.0 minimum, applications compiled with CUDA 12.6 can be run, with some limitations.

# Check the CUDA version supported by the driver
nvidia-smi | head -3

# Check the actually installed CUDA Toolkit version
nvcc --version
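The table in section 6.1 can be turned into a quick pre-flight check. The sketch below hard-codes those minimums and compares dotted version strings; treat the official release notes, not this snippet, as authoritative.

```python
# Minimum Linux driver per CUDA Toolkit release (from the table in section 6.1).
MIN_DRIVER = {
    "12.0": "525.60.13", "12.1": "530.30.02", "12.2": "535.54.03",
    "12.3": "545.23.06", "12.4": "550.54.14", "12.5": "555.42.02",
    "12.6": "560.28.03",
}

def _ver(s: str) -> tuple:
    """Turn '550.54.14' into (550, 54, 14) for numeric comparison."""
    return tuple(int(p) for p in s.split("."))

def driver_supports(cuda_version: str, driver_version: str) -> bool:
    """True if the installed driver meets the toolkit's minimum requirement."""
    return _ver(driver_version) >= _ver(MIN_DRIVER[cuda_version])

if __name__ == "__main__":
    print(driver_supports("12.4", "550.54.15"))  # True: 550.54.15 >= 550.54.14
    print(driver_supports("12.6", "550.54.15"))  # False: 550.x < 560.28.03
```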

7. cuDNN Installation and Configuration

cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library optimized for deep learning. It provides highly optimized implementations of operations such as Convolution, Pooling, Normalization, and Activation, and is used internally by frameworks like PyTorch and TensorFlow.

Install it based on the official NVIDIA cuDNN installation guide (https://docs.nvidia.com/deeplearning/cudnn/installation/latest/linux.html).

7.1 Package Manager (apt) Installation

# If the CUDA keyring is already installed, you can install the cuDNN package directly
sudo apt-get install -y cudnn

# Install cuDNN for a specific CUDA version
sudo apt-get install -y cudnn-cuda-12

7.2 Tarball Installation

Use a tarball when you need a specific version or prefer not to use the system package manager.

# After downloading from the NVIDIA Developer site
tar -xvf cudnn-linux-x86_64-9.x.x.x_cudaXX-archive.tar.xz

# Copy files
sudo cp cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

7.3 Verify Installation

# Check cuDNN version (when installed via apt)
dpkg -l | grep cudnn

# Or check directly from the header file
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
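The header check can be automated. The sketch below pulls the version out of cudnn_version.h text using the CUDNN_MAJOR/CUDNN_MINOR/CUDNN_PATCHLEVEL macros that the header defines; the header path may differ on your system.

```python
import re

def cudnn_version_from_header(header_text: str) -> str:
    """Extract 'MAJOR.MINOR.PATCH' from the contents of cudnn_version.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(rf"#define\s+{name}\s+(\d+)", header_text)
        if m is None:
            raise ValueError(f"{name} not found in header")
        parts.append(m.group(1))
    return ".".join(parts)

if __name__ == "__main__":
    # On a real server: open("/usr/local/cuda/include/cudnn_version.h").read()
    sample = ("#define CUDNN_MAJOR 9\n"
              "#define CUDNN_MINOR 1\n"
              "#define CUDNN_PATCHLEVEL 0\n")
    print(cudnn_version_from_header(sample))  # -> 9.1.0
```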

8. Conda/Mamba Environment Setup

Separate from the system-level CUDA Toolkit, installing PyTorch and TensorFlow within Conda/Mamba virtual environments can prevent CUDA version conflicts. Recent versions of PyTorch and TensorFlow bundle their own CUDA Runtime, operating independently of the system CUDA version.

8.1 Miniforge (Mamba) Installation

# Install Miniforge (includes Mamba)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3

# Set environment variables
$HOME/miniforge3/bin/mamba init bash
source ~/.bashrc

8.2 PyTorch Installation

# Create deep learning environment
mamba create -n dl python=3.11 -y
mamba activate dl

# Install PyTorch (for CUDA 12.4 - check the official site for the latest command)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Verify the installation from Python:

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")

8.3 TensorFlow Installation

# Install TensorFlow (with GPU support)
pip install tensorflow[and-cuda]

Verify the installation from Python:

import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")

Tip: Installing PyTorch and TensorFlow in the same environment may cause CUDA library version conflicts. It is recommended to use separate Conda environments whenever possible.


9. Docker + NVIDIA Container Toolkit Setup

Docker allows you to manage deep learning environments isolated at the container level. Installing the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enables GPU access within Docker containers.

9.1 Docker Installation

# Set up Docker official repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Add current user to docker group (to use without sudo)
sudo usermod -aG docker $USER
newgrp docker

9.2 NVIDIA Container Toolkit Installation

The installation process follows the official NVIDIA documentation.

# Set up NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install packages
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

9.3 GPU Docker Testing

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Use specific GPU only
docker run --rm --gpus '"device=0"' nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Run PyTorch container with all GPUs
docker run --rm --gpus all -it pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime python -c \
  "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

9.4 Using GPUs in Docker Compose

# docker-compose.yml
services:
  training:
    image: pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # or a specific number: 1, 2, etc.
              capabilities: [gpu]
    volumes:
      - ./data:/workspace/data
      - ./models:/workspace/models
    shm_size: '8g' # PyTorch DataLoader shared memory

Note: If shm_size is not set, PyTorch's DataLoader may encounter shared memory shortage errors when num_workers > 0.
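A rough way to size shm_size: each worker keeps prefetched batches in shared memory, so usage scales with batch size, per-sample bytes, worker count, and prefetch_factor (PyTorch's default is 2). The estimate below is a heuristic sketch, not an exact accounting of DataLoader internals.

```python
def dataloader_shm_gb(batch_size: int, sample_bytes: int,
                      num_workers: int, prefetch_factor: int = 2) -> float:
    """Heuristic upper bound on shared memory a PyTorch DataLoader may use."""
    batches_in_flight = num_workers * prefetch_factor
    return batch_size * sample_bytes * batches_in_flight / 1024**3

if __name__ == "__main__":
    # Example: 224x224 RGB float32 tensors, batch 256, 8 workers
    sample = 224 * 224 * 3 * 4  # ~588 KiB per image tensor
    est = dataloader_shm_gb(batch_size=256, sample_bytes=sample, num_workers=8)
    print(f"~{est:.1f} GB of /dev/shm")  # -> ~2.3 GB, so the 8g above has headroom
```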


10. nvidia-smi and nvtop Monitoring Tools

Monitoring is essential in GPU server operations. You need to be able to check GPU utilization, memory usage, temperature, and other metrics in real time.

10.1 nvidia-smi

nvidia-smi is the default monitoring tool installed with the NVIDIA driver.

# Basic status check
nvidia-smi

# Real-time monitoring (1-second interval)
watch -n 1 nvidia-smi

# Query specific information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Per-process GPU usage
nvidia-smi pmon -s um -d 1

# Enable Persistence Mode (recommended for server environments)
sudo nvidia-smi -pm 1

Enabling Persistence Mode keeps the GPU driver loaded at all times, eliminating the initialization delay (~several seconds) that occurs when the GPU is first called. This should always be enabled in server environments.
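The --query-gpu CSV format is convenient to consume from scripts. The sketch below parses it into dictionaries; query_gpus() assumes nvidia-smi is on the PATH, so the demo exercises only the parser with sample output.

```python
import subprocess

QUERY = "name,temperature.gpu,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in text.strip().splitlines():
        name, temp, util, used, total = [f.strip() for f in line.split(",")]
        rows.append({"name": name, "temp_c": int(temp), "util_pct": int(util),
                     "mem_used_mib": int(used), "mem_total_mib": int(total)})
    return rows

def query_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU (requires the driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)

if __name__ == "__main__":
    sample = "NVIDIA GeForce RTX 4090, 54, 97, 20480, 24564\n"
    print(parse_gpu_csv(sample))
```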

10.2 nvtop

nvtop is a GPU monitoring tool with an interface similar to htop. It displays utilization, memory, temperature, and process information for multiple GPUs on a single screen in real time.

# Install
sudo apt install -y nvtop

# Run
nvtop

10.3 gpustat

If you prefer concise output, gpustat is also useful.

pip install gpustat

# Real-time monitoring
gpustat -i 1 --color

11. SSH Remote Development Environment

GPU servers are typically accessed remotely. Set up an efficient remote development environment.

11.1 SSH Server Setup

# Install OpenSSH server
sudo apt install -y openssh-server

# Start SSH service and enable auto-start
sudo systemctl enable ssh
sudo systemctl start ssh

11.2 VS Code Remote - SSH

Using VS Code's Remote - SSH extension, you can directly edit files on the remote server from your local VS Code and use the remote terminal.

Client (local) SSH config setup:

# ~/.ssh/config
Host gpu-server
    HostName 192.168.1.100
    User username
    Port 22
    IdentityFile ~/.ssh/id_ed25519
    ForwardAgent yes

11.3 Session Persistence with tmux

Use tmux to keep training running even if the SSH connection is lost.

# Install tmux
sudo apt install -y tmux

# Create new session
tmux new -s training

# Detach session: Ctrl+b, d

# Reattach to session
tmux attach -t training

# List sessions
tmux ls

Essential tmux shortcuts:

  • Ctrl+b, d : Detach session (training continues running)
  • Ctrl+b, c : New window
  • Ctrl+b, n/p : Next/previous window
  • Ctrl+b, % : Split pane left/right
  • Ctrl+b, " : Split pane top/bottom

11.4 Port Forwarding

SSH port forwarding configuration for accessing Jupyter Notebook, TensorBoard, and similar tools from your local browser.

# Jupyter Notebook port forwarding (local 8888 -> remote 8888)
ssh -L 8888:localhost:8888 gpu-server

# TensorBoard port forwarding (local 6006 -> remote 6006)
ssh -L 6006:localhost:6006 gpu-server

# Forward multiple ports simultaneously
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 gpu-server

12. Security Settings

If your GPU server is exposed to the network, basic security settings are essential.

12.1 SSH Key Authentication Setup

# Generate key on client
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@gpu-server

# Disable password authentication on server
sudo vim /etc/ssh/sshd_config

Modify the following entries in /etc/ssh/sshd_config:

PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no

# Restart SSH service (the unit is named "ssh" on Ubuntu)
sudo systemctl restart ssh

12.2 UFW Firewall Setup

# Install and enable UFW
sudo apt install -y ufw

# Default policy: deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Allow SSH
sudo ufw allow ssh

# Allow SSH from specific IP only (more secure)
sudo ufw allow from 192.168.1.0/24 to any port 22

# Enable firewall
sudo ufw enable

# Check status
sudo ufw status verbose

12.3 fail2ban Setup

Install fail2ban to defend against SSH brute force attacks.

# Install
sudo apt install -y fail2ban

# Copy configuration file
sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

# Edit the [sshd] section
sudo vim /etc/fail2ban/jail.local

Example [sshd] settings in /etc/fail2ban/jail.local:

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# Enable and start fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# Check status
sudo fail2ban-client status sshd
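The maxretry/findtime/bantime settings interact as a sliding window: a ban fires when maxretry failures occur within findtime seconds, and lasts bantime seconds. The simulation below illustrates that windowing logic only; it is not fail2ban's actual implementation.

```python
def should_ban(failure_times, maxretry: int = 3, findtime: int = 600) -> bool:
    """True if any window of `findtime` seconds contains >= `maxretry` failures.

    Illustrative only -- mirrors the jail settings above, not fail2ban internals.
    """
    times = sorted(failure_times)
    for i in range(len(times)):
        # count failures in the findtime-second window starting at times[i]
        window = [t for t in times[i:] if t - times[i] <= findtime]
        if len(window) >= maxretry:
            return True
    return False

if __name__ == "__main__":
    print(should_ban([0, 100, 200]))        # 3 failures within 200s -> True
    print(should_ban([0, 400, 800, 1200]))  # never 3 within 600s -> False
```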

13. Automation: Ansible Playbook Example

When managing multiple servers or repeatedly configuring environments, Ansible can be used to automate the entire setup process.

13.1 Ansible Inventory

# inventory.ini
[gpu_servers]
gpu-server-01 ansible_host=192.168.1.101
gpu-server-02 ansible_host=192.168.1.102

[gpu_servers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/id_ed25519

13.2 GPU Server Setup Playbook

# gpu_server_setup.yml
---
- name: GPU Server Initial Setup
  hosts: gpu_servers
  become: true
  vars:
    nvidia_driver_version: '550'
    cuda_keyring_url: 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb'

  tasks:
    # ===== System Basic Setup =====
    - name: Update and upgrade apt packages
      apt:
        update_cache: yes
        upgrade: dist
        cache_valid_time: 3600

    - name: Install essential packages
      apt:
        name:
          - build-essential
          - gcc
          - g++
          - make
          - cmake
          - linux-headers-{{ ansible_kernel }}
          - curl
          - wget
          - git
          - vim
          - htop
          - tmux
          - nvtop
          - net-tools
          - software-properties-common
        state: present

    - name: Set timezone to Asia/Seoul
      community.general.timezone:
        name: Asia/Seoul

    # ===== NVIDIA Driver Installation =====
    - name: Purge existing NVIDIA drivers
      apt:
        name: 'nvidia*'
        state: absent
        purge: yes
      ignore_errors: yes

    - name: Add NVIDIA PPA
      apt_repository:
        repo: ppa:graphics-drivers/ppa
        state: present

    - name: Install NVIDIA driver
      apt:
        name: 'nvidia-driver-{{ nvidia_driver_version }}'
        state: present
        update_cache: yes

    # ===== CUDA Toolkit Installation =====
    - name: Download CUDA keyring
      get_url:
        url: '{{ cuda_keyring_url }}'
        dest: /tmp/cuda-keyring.deb

    - name: Install CUDA keyring
      apt:
        deb: /tmp/cuda-keyring.deb

    - name: Install CUDA Toolkit
      apt:
        name: cuda-toolkit
        state: present
        update_cache: yes

    # ===== cuDNN Installation =====
    - name: Install cuDNN
      apt:
        name: cudnn-cuda-12
        state: present

    # ===== Docker Installation =====
    - name: Install Docker prerequisites
      apt:
        name:
          - ca-certificates
          - curl
          - gnupg
        state: present

    - name: Add Docker GPG key
      shell: |
        install -m 0755 -d /etc/apt/keyrings
        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
        chmod a+r /etc/apt/keyrings/docker.gpg
      args:
        creates: /etc/apt/keyrings/docker.gpg

    - name: Add Docker repository
      shell: |
        echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
      args:
        creates: /etc/apt/sources.list.d/docker.list

    - name: Install Docker
      apt:
        name:
          - docker-ce
          - docker-ce-cli
          - containerd.io
          - docker-buildx-plugin
          - docker-compose-plugin
        state: present
        update_cache: yes

    - name: Add user to docker group
      user:
        name: '{{ ansible_user }}'
        groups: docker
        append: yes

    # ===== NVIDIA Container Toolkit Installation =====
    - name: Add NVIDIA Container Toolkit GPG key
      shell: |
        curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      args:
        creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    - name: Add NVIDIA Container Toolkit repository
      shell: |
        curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      args:
        creates: /etc/apt/sources.list.d/nvidia-container-toolkit.list

    - name: Install NVIDIA Container Toolkit
      apt:
        name: nvidia-container-toolkit
        state: present
        update_cache: yes

    - name: Configure Docker runtime for NVIDIA
      command: nvidia-ctk runtime configure --runtime=docker

    - name: Restart Docker
      systemd:
        name: docker
        state: restarted

    # ===== Security Settings =====
    - name: Install security packages
      apt:
        name:
          - ufw
          - fail2ban
        state: present

    - name: Configure UFW - deny incoming
      ufw:
        direction: incoming
        policy: deny

    - name: Configure UFW - allow outgoing
      ufw:
        direction: outgoing
        policy: allow

    - name: Configure UFW - allow SSH
      ufw:
        rule: allow
        name: OpenSSH

    - name: Enable UFW
      ufw:
        state: enabled

    - name: Enable fail2ban
      systemd:
        name: fail2ban
        enabled: yes
        state: started

    # ===== GPU Persistence Mode =====
    - name: Enable NVIDIA Persistence Mode
      command: nvidia-smi -pm 1
      ignore_errors: yes

  handlers:
    - name: Reboot server
      reboot:
        reboot_timeout: 300

13.3 Running the Playbook

# Install Ansible
pip install ansible

# Run Playbook
ansible-playbook -i inventory.ini gpu_server_setup.yml

# Dry run (check without actually executing)
ansible-playbook -i inventory.ini gpu_server_setup.yml --check

14. Summary: Complete Installation Order

The complete flow for building a deep learning GPU server is as follows.

1. Hardware assembly and BIOS setup (disable Secure Boot)
    |
2. Ubuntu 22.04/24.04 installation and basic package setup
    |
3. NVIDIA driver installation (apt method recommended)
    |
4. CUDA Toolkit installation (multiple versions can be installed)
    |
5. cuDNN installation
    |
6. Conda/Mamba environment setup -> PyTorch/TensorFlow installation
    |
7. Docker + NVIDIA Container Toolkit installation
    |
8. Security settings (SSH Key, UFW, fail2ban)
    |
9. Monitoring tools installation (nvtop, gpustat)
    |
10. Remote development environment setup (VS Code Remote, tmux)

By following the content covered in this guide step by step, you can build a complete deep learning development server from NVIDIA drivers to Docker GPU environments. If issues arise at any stage, it is recommended to consult the official NVIDIA documentation for troubleshooting.

