- 1. Deep Learning GPU Server Hardware Selection Guide
- 2. Ubuntu 22.04/24.04 Installation and Initial Setup
- 3. NVIDIA Driver Installation
- 4. Driver Version Selection Strategy
- 5. CUDA Toolkit Installation
- 6. CUDA Version and Driver Compatibility Matrix
- 7. cuDNN Installation and Configuration
- 8. Conda/Mamba Environment Setup
- 9. Docker + NVIDIA Container Toolkit Setup
- 10. nvidia-smi and nvtop Monitoring Tools
- 11. SSH Remote Development Environment
- 12. Security Settings
- 13. Automation: Ansible Playbook Example
- 14. Summary: Complete Installation Order
- References
1. Deep Learning GPU Server Hardware Selection Guide
Before building a deep learning server, the first decision to make is the hardware configuration. Options vary depending on workload scale and budget.
1.1 GPU Selection
The GPU is the core component of a deep learning server. When choosing one, focus primarily on VRAM capacity, Tensor Core generation, and memory bandwidth.
| GPU | VRAM | Use Case | Approximate Price |
|---|---|---|---|
| RTX 4090 | 24GB GDDR6X | Personal research, small-to-medium training | ~$1,600 |
| RTX 5090 | 32GB GDDR7 | Personal research, large model fine-tuning | ~$2,000 |
| RTX A6000 / RTX 6000 Ada | 48GB GDDR6 | Production, ECC memory support | ~$4,000+ |
| A100 (80GB PCIe/SXM) | 80GB HBM2e | Large-scale training, NVLink support | ~$10,000+ |
| H100 (80GB SXM) | 80GB HBM3 | Maximum-scale LLM training | ~$25,000+ |
| H200 | 141GB HBM3e | Long-context LLM, memory bottleneck relief | ~$30,000+ |
Practical Recommendations:
- Personal/Small Labs: RTX 4090 or RTX 5090. With 24-32GB VRAM, fine-tuning 7B-13B parameter models is feasible.
- Enterprise/Research Labs: A100 80GB or H100. Data center-grade GPUs are necessary for environments where multi-GPU training via NVLink is essential.
- If considering Multi-GPU: Be sure to check PCIe slot spacing, NVLink support, and power supply capacity.
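To sanity-check the VRAM recommendations above, here is a rough back-of-the-envelope estimate for full fine-tuning. The multipliers are assumptions (fp16/bf16 weights, Adam optimizer states at roughly 4x the weight memory), and activations are ignored, so treat the result as a lower bound rather than a sizing rule:

```python
def estimate_finetune_vram_gb(params_billions: float,
                              bytes_per_param: int = 2,        # fp16/bf16 weights
                              optimizer_multiplier: float = 4.0) -> float:
    """Rough VRAM lower bound for full fine-tuning: weights + grads + Adam states.

    Assumption: gradients cost ~1x the weight memory and fp32 Adam moments
    ~2-3x more, so optimizer_multiplier=4 approximates everything on top of
    the weights themselves. Activation memory is NOT included.
    """
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb * (1 + optimizer_multiplier)

# A 7B model in bf16: ~13 GB of weights alone, ~65 GB with optimizer state --
# which is why 24-32GB cards typically rely on LoRA/QLoRA for 7B+ models.
print(f"{estimate_finetune_vram_gb(7):.0f} GB")
```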
1.2 CPU, RAM, Storage
| Component | Recommended Specs | Reason |
|---|---|---|
| CPU | AMD EPYC or Intel Xeon (16+ cores) | Prevent data loading bottlenecks, sufficient PCIe lanes |
| RAM | At least 2x GPU VRAM (minimum 64GB, recommended 128GB+) | Large dataset preprocessing, DataLoader worker memory |
| OS Storage | NVMe SSD 500GB+ | Fast boot, package cache |
| Data Storage | NVMe SSD 2TB+ or RAID configuration | Prevent training data I/O bottlenecks |
| PSU | 1200W+ (1600W+ for multi-GPU) | A single RTX 4090 draws 450W, adequate headroom is essential |
The CPU is relatively less critical compared to the GPU, but the number of PCIe lanes can become a bottleneck in multi-GPU environments, so server-grade platforms (AMD EPYC, Intel Xeon) are recommended. Desktop platforms (AM5, LGA1700) are sufficient for single-GPU setups.
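The PCIe-lane argument can be made concrete with simple arithmetic. The lane budgets in the example below are illustrative round numbers, not spec-sheet values for any specific CPU:

```python
def lanes_needed(num_gpus: int, lanes_per_gpu: int = 16,
                 nvme_drives: int = 1, lanes_per_nvme: int = 4) -> int:
    """Total PCIe lanes consumed by GPUs at full width plus NVMe storage."""
    return num_gpus * lanes_per_gpu + nvme_drives * lanes_per_nvme

def platform_fits(platform_lanes: int, num_gpus: int, **kw) -> bool:
    """True if the platform's usable lane budget covers the configuration."""
    return lanes_needed(num_gpus, **kw) <= platform_lanes

# Illustrative budgets: a desktop CPU exposes on the order of ~24 usable
# lanes, while server platforms like AMD EPYC expose 128.
print(platform_fits(24, 1))    # one x16 GPU fits a desktop platform
print(platform_fits(24, 4))    # four x16 GPUs do not
print(platform_fits(128, 4))   # a server platform handles four with headroom
```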
2. Ubuntu 22.04/24.04 Installation and Initial Setup
Ubuntu 22.04 LTS and 24.04 LTS are the most widely used operating systems for deep learning servers, because NVIDIA drivers, CUDA, and the major deep learning frameworks prioritize support for these distributions.
2.1 Post-Installation Basic Setup
# System update
sudo apt update && sudo apt upgrade -y
# Install essential build tools
sudo apt install -y build-essential gcc g++ make cmake
# Install kernel headers (required for NVIDIA driver build)
sudo apt install -y linux-headers-$(uname -r)
# Network tools
sudo apt install -y net-tools curl wget git vim htop
# Set timezone
sudo timedatectl set-timezone Asia/Seoul
2.2 Disable Secure Boot
Since NVIDIA drivers need to load kernel modules, it is recommended to disable Secure Boot in the BIOS. While it is possible to use MOK (Machine Owner Key) registration with Secure Boot enabled, the setup can become complex.
# Check Secure Boot status
mokutil --sb-state
3. NVIDIA Driver Installation
The official NVIDIA documentation describes two main installation methods: the package manager (apt) method and the .run file method. For server environments, the apt repository method is recommended for ease of management.
3.1 Method 1: apt Repository Method (Recommended)
According to the official NVIDIA driver installation guide (https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html), the installation on Ubuntu proceeds as follows.
Remove Existing Drivers
# Completely remove existing NVIDIA-related packages
sudo apt-get purge 'nvidia*' -y
sudo apt-get autoremove -y
sudo apt-get autoclean
Automatic Installation Using ubuntu-drivers
On Ubuntu, you can use the ubuntu-drivers tool to automatically install the appropriate driver for your system.
# Check available driver list
sudo ubuntu-drivers devices
# Auto-install recommended driver
sudo ubuntu-drivers autoinstall
Manual Installation Using the graphics-drivers PPA
If you need a specific driver version, add the community-maintained graphics-drivers PPA and install from it. (NVIDIA's own CUDA apt repository, set up in Section 5, also provides driver packages.)
# Add the graphics-drivers PPA
sudo apt install -y software-properties-common
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update
# Install specific version (e.g., version 550)
sudo apt install -y nvidia-driver-550
# Reboot
sudo reboot
Verify Installation
nvidia-smi
If installed correctly, the GPU name, driver version, supported CUDA version, and other details will be displayed.
3.2 Method 2: .run File Method
This method involves downloading the .run file directly from the NVIDIA website. It is used when a special kernel environment or custom build is required.
# Disable Nouveau driver
sudo bash -c "echo 'blacklist nouveau' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot
# Stop GUI environment (if desktop environment is installed on the server)
sudo systemctl stop gdm3
# or
sudo systemctl stop lightdm
# Run .run file
chmod +x NVIDIA-Linux-x86_64-550.120.run
sudo ./NVIDIA-Linux-x86_64-550.120.run
Drawbacks of the .run file method: The driver may break during kernel updates, and since it is not managed by the package manager, updates and removal are inconvenient. The apt method is strongly recommended for production server environments.
4. Driver Version Selection Strategy
NVIDIA operates two branches for data center/server drivers.
4.1 Production Branch vs New Feature Branch
| Category | Production Branch (PB) | New Feature Branch (NFB) |
|---|---|---|
| Stability | High (extensively tested) | Relatively lower |
| New Features | No new features after release | Includes latest GPU support, new features |
| Update Policy | Bug fixes and security patches only | Periodic releases with new features |
| Recommended For | Production servers, stability-first | Development/testing, latest GPUs |
Practical Guidelines:
- Production environments: Use the Production Branch. For example, if the 550.xx series is designated as PB, only bug-fix updates within that series will be provided.
- Latest GPUs (e.g., Blackwell architecture): You may need to use the New Feature Branch. Support for the latest hardware arrives in NFB first.
- Use -server packages on servers: Packages with the -server suffix, such as nvidia-driver-550-server, are Enterprise Ready Drivers (ERD) optimized for server environments.
# Check currently installed driver packages
apt list --installed 2>/dev/null | grep nvidia-driver
# Search available -server driver packages
apt-cache search nvidia-driver | grep server
5. CUDA Toolkit Installation
The CUDA Toolkit is a development environment for performing parallel computing on NVIDIA GPUs. Install it following the official NVIDIA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
5.1 Pre-Installation Checks
# Verify CUDA-capable GPU
lspci | grep -i nvidia
# Verify gcc installation
gcc --version
# Verify kernel headers
uname -r
sudo apt install -y linux-headers-$(uname -r)
5.2 Network Repository Installation (Recommended)
# Install CUDA keyring package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA Toolkit (latest version)
sudo apt-get install -y cuda-toolkit
# Or install a specific version
sudo apt-get install -y cuda-toolkit-12-6
5.3 Environment Variable Setup
# Add to ~/.bashrc or ~/.zshrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Apply
source ~/.bashrc
5.4 Managing Multiple CUDA Versions
In practice, different projects frequently require different CUDA versions. You can install multiple CUDA versions under /usr/local/ and manage them with update-alternatives.
# Install multiple versions (e.g., 12.4 and 12.6)
sudo apt-get install -y cuda-toolkit-12-4
sudo apt-get install -y cuda-toolkit-12-6
# Verify installations under /usr/local/
ls /usr/local/ | grep cuda
# cuda -> cuda-12.6 (symbolic link)
# cuda-12.4
# cuda-12.6
# Register version switching with update-alternatives
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.4 10
sudo update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-12.6 20
# Switch versions
sudo update-alternatives --config cuda
Running update-alternatives --config cuda presents an interactive menu where you can select the desired CUDA version. This approach changes the /usr/local/cuda symbolic link to point to the selected version.
# Verify installation
nvcc --version
6. CUDA Version and Driver Compatibility Matrix
There are minimum driver version requirements between the CUDA Toolkit and NVIDIA drivers. The compatibility table can be found in the official NVIDIA CUDA Toolkit Release Notes (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).
6.1 Key Compatibility Table (Linux x86_64)
| CUDA Toolkit Version | Minimum Driver Version (Linux) |
|---|---|
| CUDA 12.0 | >= 525.60.13 |
| CUDA 12.1 | >= 530.30.02 |
| CUDA 12.2 | >= 535.54.03 |
| CUDA 12.3 | >= 545.23.06 |
| CUDA 12.4 | >= 550.54.14 |
| CUDA 12.5 | >= 555.42.02 |
| CUDA 12.6 | >= 560.28.03 |
| CUDA 13.0 | >= 570.86.15 |
| CUDA 13.1 | >= 575.51.03 |
Key point: The "CUDA Version" shown in nvidia-smi is the maximum CUDA Runtime version supported by that driver. It may differ from the actually installed CUDA Toolkit version; verify the installed toolkit with nvcc --version.
6.2 CUDA Forward/Minor Compatibility
According to NVIDIA's CUDA Compatibility documentation (https://docs.nvidia.com/deploy/cuda-compatibility/), from CUDA 12.x onward, Minor Version Compatibility is supported. This means that even with a driver installed for CUDA 12.0, applications compiled with CUDA 12.6 can be run with some limitations.
# Check the CUDA version supported by the driver
nvidia-smi | head -3
# Check the actually installed CUDA Toolkit version
nvcc --version
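The minimum-driver requirements above can also be checked programmatically. A minimal sketch; the version pairs are copied from the table in Section 6.1 and should be re-verified against the current CUDA Toolkit Release Notes:

```python
# Minimum Linux driver per CUDA Toolkit release (subset of the table above).
MIN_DRIVER = {
    (12, 0): (525, 60, 13),
    (12, 4): (550, 54, 14),
    (12, 6): (560, 28, 3),
}

def parse_driver(version: str) -> tuple:
    """Turn a driver string like '550.120' into a comparable tuple of ints."""
    return tuple(int(p) for p in version.split("."))

def driver_supports(driver: str, cuda: tuple) -> bool:
    """True if the installed driver meets the toolkit's minimum requirement."""
    return parse_driver(driver) >= MIN_DRIVER[cuda]

print(driver_supports("550.120", (12, 4)))  # True: 550.120 >= 550.54.14
print(driver_supports("550.120", (12, 6)))  # False: CUDA 12.6 needs 560.28.03
```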
7. cuDNN Installation and Configuration
cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library optimized for deep learning. It provides highly optimized implementations of operations such as Convolution, Pooling, Normalization, and Activation, and is used internally by frameworks like PyTorch and TensorFlow.
Install it based on the official NVIDIA cuDNN installation guide (https://docs.nvidia.com/deeplearning/cudnn/installation/latest/linux.html).
7.1 apt Repository Installation (Recommended)
# If CUDA keyring is already installed, you can install the cuDNN package directly
sudo apt-get install -y cudnn
# Install cuDNN for a specific CUDA version
sudo apt-get install -y cudnn-cuda-12
7.2 Tarball Installation
Use a tarball when you need a specific version or prefer not to use the system package manager.
# After downloading from the NVIDIA Developer site
tar -xvf cudnn-linux-x86_64-9.x.x.x_cudaXX-archive.tar.xz
# Copy files
sudo cp cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-linux-x86_64-9.x.x.x_cudaXX-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
7.3 Verify Installation
# Check cuDNN version (when installed via apt)
dpkg -l | grep cudnn
# Or check directly from the header file
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
8. Conda/Mamba Environment Setup
Separate from the system-level CUDA Toolkit, installing PyTorch and TensorFlow within Conda/Mamba virtual environments can prevent CUDA version conflicts. Recent versions of PyTorch and TensorFlow bundle their own CUDA Runtime, operating independently of the system CUDA version.
8.1 Miniforge (Mamba) Installation
# Install Miniforge (includes Mamba)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3
# Set environment variables
$HOME/miniforge3/bin/mamba init bash
source ~/.bashrc
8.2 PyTorch Installation
# Create deep learning environment
mamba create -n dl python=3.11 -y
mamba activate dl
# Install PyTorch (for CUDA 12.4 - check the official site for the latest command)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Verify installation (run the following in Python)
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
8.3 TensorFlow Installation
# Install TensorFlow (with GPU support)
pip install tensorflow[and-cuda]
# Verify installation (run the following in Python)
import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")
Tip: Installing PyTorch and TensorFlow in the same environment may cause CUDA library version conflicts. It is recommended to use separate Conda environments whenever possible.
9. Docker + NVIDIA Container Toolkit Setup
Docker allows you to manage deep learning environments isolated at the container level. Installing the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) enables GPU access within Docker containers.
9.1 Docker Installation
# Set up Docker official repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Add current user to docker group (to use without sudo)
sudo usermod -aG docker $USER
newgrp docker
9.2 NVIDIA Container Toolkit Installation
The installation process follows the official NVIDIA documentation.
# Set up NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install packages
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker
sudo systemctl restart docker
9.3 GPU Docker Testing
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
# Use specific GPU only
docker run --rm --gpus '"device=0"' nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
# Run PyTorch container with all GPUs
docker run --rm --gpus all -it pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime python -c \
"import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
9.4 Using GPUs in Docker Compose
# docker-compose.yml
services:
training:
image: pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all # or a specific number: 1, 2, etc.
capabilities: [gpu]
volumes:
- ./data:/workspace/data
- ./models:/workspace/models
shm_size: '8g' # PyTorch DataLoader shared memory
Note: If shm_size is not set, PyTorch's DataLoader may hit shared-memory shortage errors when num_workers > 0.
10. nvidia-smi and nvtop Monitoring Tools
Monitoring is essential in GPU server operations. You need to be able to check GPU utilization, memory usage, temperature, and other metrics in real time.
10.1 nvidia-smi
nvidia-smi is the default monitoring tool installed with the NVIDIA driver.
# Basic status check
nvidia-smi
# Real-time monitoring (1-second interval)
watch -n 1 nvidia-smi
# Query specific information
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
# Per-process GPU usage
nvidia-smi pmon -s um -d 1
# Enable Persistence Mode (recommended for server environments)
sudo nvidia-smi -pm 1
Enabling Persistence Mode keeps the GPU driver loaded at all times, eliminating the initialization delay (~several seconds) that occurs when the GPU is first called. This should always be enabled in server environments.
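The CSV query form shown above is convenient for scripting. Below is a stdlib-only parser sketch, run against a hypothetical captured sample of the command's output (the field names follow the --query-gpu flags used earlier; the sample values are invented for illustration):

```python
import csv
import io

# Hypothetical captured output of:
#   nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv
sample = """\
name, temperature.gpu, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
NVIDIA GeForce RTX 4090, 62, 97 %, 20133 MiB, 24564 MiB
"""

def parse_gpu_csv(text: str) -> list:
    """Parse nvidia-smi CSV output into a list of dicts with numeric fields."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows = []
    for line in reader:
        rec = dict(zip(header, (v.strip() for v in line)))
        # Strip units ("MiB", "%") and convert numeric columns.
        rec["memory.used [MiB]"] = int(rec["memory.used [MiB]"].split()[0])
        rec["memory.total [MiB]"] = int(rec["memory.total [MiB]"].split()[0])
        rec["utilization.gpu [%]"] = int(rec["utilization.gpu [%]"].split()[0])
        rec["temperature.gpu"] = int(rec["temperature.gpu"])
        rows.append(rec)
    return rows

gpus = parse_gpu_csv(sample)
print(gpus[0]["name"], gpus[0]["memory.used [MiB]"])
```

In a real script, the sample string would be replaced by the output of a subprocess call to nvidia-smi.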
10.2 nvtop
nvtop is a GPU monitoring tool with an interface similar to htop. It displays utilization, memory, temperature, and process information for multiple GPUs on a single screen in real time.
# Install
sudo apt install -y nvtop
# Run
nvtop
10.3 gpustat
If you prefer concise output, gpustat is also useful.
pip install gpustat
# Real-time monitoring
gpustat -i 1 --color
11. SSH Remote Development Environment
GPU servers are typically accessed remotely. Set up an efficient remote development environment.
11.1 SSH Server Setup
# Install OpenSSH server
sudo apt install -y openssh-server
# Start SSH service and enable auto-start
sudo systemctl enable ssh
sudo systemctl start ssh
11.2 VS Code Remote - SSH
Using VS Code's Remote - SSH extension, you can directly edit files on the remote server from your local VS Code and use the remote terminal.
Client (local) SSH config setup:
# ~/.ssh/config
Host gpu-server
HostName 192.168.1.100
User username
Port 22
IdentityFile ~/.ssh/id_ed25519
ForwardAgent yes
11.3 Session Persistence with tmux
Use tmux to keep training running even if the SSH connection is lost.
# Install tmux
sudo apt install -y tmux
# Create new session
tmux new -s training
# Detach session: Ctrl+b, d
# Reattach to session
tmux attach -t training
# List sessions
tmux ls
Essential tmux shortcuts:
- Ctrl+b, d: Detach session (training continues running)
- Ctrl+b, c: New window
- Ctrl+b, n / p: Next/previous window
- Ctrl+b, %: Split pane left/right
- Ctrl+b, ": Split pane top/bottom
11.4 Port Forwarding
SSH port forwarding configuration for accessing Jupyter Notebook, TensorBoard, and similar tools from your local browser.
# Jupyter Notebook port forwarding (access remote 8888 at local 8888)
ssh -L 8888:localhost:8888 gpu-server
# TensorBoard port forwarding (access remote 6006 at local 6006)
ssh -L 6006:localhost:6006 gpu-server
# Forward multiple ports simultaneously
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 gpu-server
12. Security Settings
If your GPU server is exposed to the network, basic security settings are essential.
12.1 SSH Key Authentication Setup
# Generate key on client
ssh-keygen -t ed25519 -C "your_email@example.com"
# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@gpu-server
# Disable password authentication on server
sudo vim /etc/ssh/sshd_config
Modify the following entries in /etc/ssh/sshd_config:
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
# Restart SSH service
sudo systemctl restart ssh
12.2 UFW Firewall Setup
# Install and enable UFW
sudo apt install -y ufw
# Default policy: deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing
# Allow SSH
sudo ufw allow ssh
# Allow SSH from specific IP only (more secure)
sudo ufw allow from 192.168.1.0/24 to any port 22
# Enable firewall
sudo ufw enable
# Check status
sudo ufw status verbose
12.3 fail2ban Setup
Install fail2ban to defend against SSH brute force attacks.
# Install
sudo apt install -y fail2ban
# Copy configuration file
sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local
# Modify [sshd] section in /etc/fail2ban/jail.local
sudo vim /etc/fail2ban/jail.local
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600
# Start fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban
# Check status
sudo fail2ban-client status sshd
13. Automation: Ansible Playbook Example
When managing multiple servers or repeatedly configuring environments, Ansible can be used to automate the entire setup process.
13.1 Ansible Inventory
# inventory.ini
[gpu_servers]
gpu-server-01 ansible_host=192.168.1.101
gpu-server-02 ansible_host=192.168.1.102
[gpu_servers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/id_ed25519
13.2 GPU Server Setup Playbook
# gpu_server_setup.yml
---
- name: GPU Server Initial Setup
hosts: gpu_servers
become: true
vars:
nvidia_driver_version: '550'
cuda_keyring_url: 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb'
tasks:
# ===== System Basic Setup =====
- name: Update and upgrade apt packages
apt:
update_cache: yes
upgrade: dist
cache_valid_time: 3600
- name: Install essential packages
apt:
name:
- build-essential
- gcc
- g++
- make
- cmake
- linux-headers-{{ ansible_kernel }}
- curl
- wget
- git
- vim
- htop
- tmux
- nvtop
- net-tools
- software-properties-common
state: present
- name: Set timezone to Asia/Seoul
community.general.timezone:
name: Asia/Seoul
# ===== NVIDIA Driver Installation =====
- name: Purge existing NVIDIA drivers
apt:
name: 'nvidia*'
state: absent
purge: yes
ignore_errors: yes
- name: Add NVIDIA PPA
apt_repository:
repo: ppa:graphics-drivers/ppa
state: present
- name: Install NVIDIA driver
apt:
name: 'nvidia-driver-{{ nvidia_driver_version }}'
state: present
update_cache: yes
# ===== CUDA Toolkit Installation =====
- name: Download CUDA keyring
get_url:
url: '{{ cuda_keyring_url }}'
dest: /tmp/cuda-keyring.deb
- name: Install CUDA keyring
apt:
deb: /tmp/cuda-keyring.deb
- name: Install CUDA Toolkit
apt:
name: cuda-toolkit
state: present
update_cache: yes
# ===== cuDNN Installation =====
- name: Install cuDNN
apt:
name: cudnn-cuda-12
state: present
# ===== Docker Installation =====
- name: Install Docker prerequisites
apt:
name:
- ca-certificates
- curl
- gnupg
state: present
- name: Add Docker GPG key
shell: |
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
args:
creates: /etc/apt/keyrings/docker.gpg
- name: Add Docker repository
shell: |
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
args:
creates: /etc/apt/sources.list.d/docker.list
- name: Install Docker
apt:
name:
- docker-ce
- docker-ce-cli
- containerd.io
- docker-buildx-plugin
- docker-compose-plugin
state: present
update_cache: yes
- name: Add user to docker group
user:
name: '{{ ansible_user }}'
groups: docker
append: yes
# ===== NVIDIA Container Toolkit Installation =====
- name: Add NVIDIA Container Toolkit GPG key
shell: |
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
args:
creates: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
- name: Add NVIDIA Container Toolkit repository
shell: |
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
args:
creates: /etc/apt/sources.list.d/nvidia-container-toolkit.list
- name: Install NVIDIA Container Toolkit
apt:
name: nvidia-container-toolkit
state: present
update_cache: yes
- name: Configure Docker runtime for NVIDIA
command: nvidia-ctk runtime configure --runtime=docker
- name: Restart Docker
systemd:
name: docker
state: restarted
# ===== Security Settings =====
- name: Install security packages
apt:
name:
- ufw
- fail2ban
state: present
- name: Configure UFW - deny incoming
ufw:
direction: incoming
policy: deny
- name: Configure UFW - allow outgoing
ufw:
direction: outgoing
policy: allow
- name: Configure UFW - allow SSH
ufw:
rule: allow
name: OpenSSH
- name: Enable UFW
ufw:
state: enabled
- name: Enable fail2ban
systemd:
name: fail2ban
enabled: yes
state: started
# ===== GPU Persistence Mode =====
- name: Enable NVIDIA Persistence Mode
command: nvidia-smi -pm 1
ignore_errors: yes
handlers:
- name: Reboot server
reboot:
reboot_timeout: 300
13.3 Running the Playbook
# Install Ansible
pip install ansible
# Run Playbook
ansible-playbook -i inventory.ini gpu_server_setup.yml
# Dry run (check without actually executing)
ansible-playbook -i inventory.ini gpu_server_setup.yml --check
14. Summary: Complete Installation Order
The complete flow for building a deep learning GPU server is as follows.
1. Hardware assembly and BIOS setup (disable Secure Boot)
2. Ubuntu 22.04/24.04 installation and basic package setup
3. NVIDIA driver installation (apt method recommended)
4. CUDA Toolkit installation (multiple versions can be installed)
5. cuDNN installation
6. Conda/Mamba environment setup -> PyTorch/TensorFlow installation
7. Docker + NVIDIA Container Toolkit installation
8. Security settings (SSH key, UFW, fail2ban)
9. Monitoring tools installation (nvtop, gpustat)
10. Remote development environment setup (VS Code Remote, tmux)
By following the content covered in this guide step by step, you can build a complete deep learning development server from NVIDIA drivers to Docker GPU environments. If issues arise at any stage, it is recommended to consult the official NVIDIA documentation for troubleshooting.
References
- NVIDIA Driver Installation Guide
- NVIDIA Driver Installation Guide - Ubuntu
- Ubuntu Server - NVIDIA Drivers Installation
- NVIDIA Data Center Drivers
- CUDA Installation Guide for Linux
- CUDA Toolkit Release Notes
- CUDA Compatibility
- Supported Drivers and CUDA Toolkit Versions
- cuDNN Installation Guide - Linux
- cuDNN Support Matrix
- NVIDIA Container Toolkit Installation Guide
- NVIDIA Container Toolkit - GitHub
- Docker Official Documentation