Skip to content

필사 모드: Mathematical Optimization for Machine Learning: From Adam to Convex Optimization and ZeRO

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Table of Contents

1. [Optimization Fundamentals: Convex Optimization and KKT Conditions](#optimization-fundamentals)

2. [Gradient Descent Family of Optimizers](#gradient-descent-family)

3. [Second-Order Optimization Methods](#second-order-optimization)

4. [Learning Rate Scheduling](#learning-rate-scheduling)

5. [Regularization Techniques](#regularization-techniques)

6. [Loss Function Design](#loss-function-design)

7. [LLM Training Optimization](#llm-training-optimization)

8. [Quiz](#quiz)

Optimization Fundamentals

Convex Optimization

A function $f: \mathbb{R}^n \to \mathbb{R}$ is **convex** if for any two points $x, y$ and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$

Key properties of convex functions:

- Every local minimum is a global minimum

- Gradient descent is guaranteed to converge

- Deep learning loss surfaces are mostly non-convex, but convex analysis techniques remain useful

**Strongly Convex**: If there exists $m > 0$ such that $f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y-x\|^2$, convergence is linear (exponentially fast).

Lagrange Multipliers

Handles equality-constrained optimization problems:

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) = 0, \; i = 1, \ldots, m$$

Lagrangian:

$$\mathcal{L}(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)$$

At the optimum: $\nabla_x \mathcal{L} = 0$ and $\nabla_\lambda \mathcal{L} = 0$.

KKT Conditions

For the general constrained optimization problem:

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \; h_j(x) = 0$$

KKT necessary conditions:

1. **Stationarity**: $\nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0$

2. **Primal feasibility**: $g_i(x^*) \leq 0$, $h_j(x^*) = 0$

3. **Dual feasibility**: $\mu_i \geq 0$

4. **Complementary slackness**: $\mu_i g_i(x^*) = 0$

For convex problems, KKT conditions are also sufficient.

Saddle Points

In deep learning, **saddle points** are a greater concern than local minima. At a saddle point, the gradient is zero but it is neither a local min nor max. The stochastic noise in SGD helps escape saddle points.

Gradient Descent Family

SGD and Its Variants

**Vanilla SGD**:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$$

**SGD with Momentum**:

$$v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

Momentum $\beta = 0.9$ is standard; it accumulates past gradients to reduce oscillation.

**Nesterov Accelerated Gradient (NAG)**:

$$v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t - \beta v_t)$$

$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$

Computes the gradient at a "lookahead" position rather than the current one.

AdaGrad, RMSProp, Adam

**AdaGrad**: Per-parameter adaptive learning rate

$$G_t = G_{t-1} + g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

Frequent features get smaller updates; rare features get larger updates. Drawback: monotonically shrinking learning rates cause learning to stall.

**RMSProp**: Fixes AdaGrad's accumulation problem

$$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t$$

**Adam (Adaptive Moment Estimation)**:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(1st moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(2nd moment)}$$

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

model = ... # define your model

Standard Adam

optimizer_adam = optim.Adam(

model.parameters(),

lr=1e-3,

betas=(0.9, 0.999),

eps=1e-8

)

AdamW (decoupled weight decay)

optimizer_adamw = optim.AdamW(

model.parameters(),

lr=1e-3,

betas=(0.9, 0.999),

eps=1e-8,

weight_decay=0.01 # applied independently of gradient scaling

)

AdamW and Lion

**AdamW**: Applies weight decay directly to parameter updates, separate from the gradient-based term.

$$\theta_{t+1} = \theta_t - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

This is mathematically inequivalent to adding L2 regularization inside Adam (see Quiz for details).

**Lion (EvoLved Sign Momentum)**:

$$u_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$

$$\theta_{t+1} = \theta_t - \eta \cdot \text{sign}(u_t)$$

$$m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t$$

Lion uses only the sign of the update, providing uniform update magnitude and better memory efficiency.

| Optimizer | Memory | Convergence | Best Use Case |

| ------------ | ------ | ----------- | ---------------------------- |

| SGD+Momentum | Low | Slow | Computer vision, large batch |

| Adam | Medium | Fast | NLP, general purpose |

| AdamW | Medium | Fast | Transformer training |

| Lion | Low | Fast | Large-scale models |

| L-BFGS | High | Very fast | Small models |

Second-Order Optimization

Newton's Method

Uses second-order derivatives (Hessian):

$$\theta_{t+1} = \theta_t - H_t^{-1} \nabla f(\theta_t)$$

where $H_t = \nabla^2 f(\theta_t)$ is the Hessian matrix. Achieves quadratic convergence, but inverting an $n \times n$ matrix requires $O(n^3)$ computation — impractical for deep learning.

L-BFGS (Limited-memory BFGS)

Approximates the inverse Hessian using the last $m$ gradient differences, without storing the full matrix.

$$H_t^{-1} \approx \text{approximation via vector sequences } \{s_k\}, \{y_k\}$$

where $s_k = \theta_{k+1} - \theta_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.

L-BFGS requires a closure function

optimizer = optim.LBFGS(

model.parameters(),

lr=1.0,

max_iter=20,

history_size=10,

line_search_fn='strong_wolfe'

)

def closure():

optimizer.zero_grad()

output = model(input_data)

loss = criterion(output, target)

loss.backward()

return loss

optimizer.step(closure)

Natural Gradient Descent

Uses the Fisher Information Matrix to account for the curvature of the parameter space:

$$\theta_{t+1} = \theta_t - \eta F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)$$

Fisher Matrix: $F(\theta) = \mathbb{E}\left[\nabla \log p(y|x;\theta) \nabla \log p(y|x;\theta)^T\right]$

K-FAC (Kronecker-factored Approximate Curvature) provides a practical implementation by factoring the Fisher matrix layer-wise.

Learning Rate Scheduling

Linear Warmup

Gradually increases the learning rate at the start to stabilize training:

$$\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} \quad (t \leq T_{\text{warmup}})$$

Cosine Annealing

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

Cosine Annealing

scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

OneCycleLR: Warmup + cosine decay

scheduler = OneCycleLR(

optimizer,

max_lr=1e-3,

total_steps=1000,

pct_start=0.3, # 30% warmup phase

anneal_strategy='cos'

)

ReduceLROnPlateau: reduce when validation loss stalls

scheduler = ReduceLROnPlateau(

optimizer,

mode='min',

factor=0.5,

patience=10,

min_lr=1e-6

)

Cyclical Learning Rate (CLR)

Periodically varies the learning rate to help escape saddle points:

$$\eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \max\left(0, 1 - \left|\frac{t}{\text{step\_size}} - 2k + 1\right|\right)$$

| Scheduler | Characteristic | Best Use Case |

| ----------------- | --------------------- | ----------------------- |

| Cosine Annealing | Smooth decay | Transformer pretraining |

| OneCycleLR | Warmup + fast decay | Fine-tuning, short runs |

| ReduceLROnPlateau | Adaptive | General training |

| Cyclical LR | Periodic oscillation | Saddle point escape |

| Linear Warmup | Initial stabilization | LLM training |

Regularization Techniques

L1 / L2 Regularization

**L2 Regularization (Ridge)**:

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \|\theta\|_2^2$$

Gradient: $\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L} + \lambda \theta$

**L1 Regularization (Lasso)**:

$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\theta\|_1$$

L1 induces sparse solutions, driving many weights to exactly zero.

Batch Normalization vs Layer Normalization

**Batch Normalization (BN)**:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$$

where $\mu_B$, $\sigma_B^2$ are computed across the mini-batch dimension. Normalizes along the batch axis.

**Layer Normalization (LN)**:

$$\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}} \cdot \gamma + \beta$$

Statistics are computed over the feature dimension of each individual sample.

| Normalization | Statistic Axis | Best Use Case |

| ------------- | ---------------------- | ---------------- |

| Batch Norm | Batch (same feature) | CNN, large batch |

| Layer Norm | Feature (same sample) | Transformer, RNN |

| Instance Norm | Spatial (same channel) | Style transfer |

| Group Norm | Channel groups | Small batch |

Weight Decay vs L2 Regularization

With SGD:

$$\theta_{t+1} = \theta_t - \eta(\nabla \mathcal{L} + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla \mathcal{L}$$

Weight decay and L2 regularization are equivalent here. However with Adam:

- **L2 Adam**: $\lambda\theta$ is added to the gradient, then divided by the adaptive scaling factor — regularization effect is weakened for parameters with large gradient variance

- **AdamW**: $\lambda\theta$ is applied after the gradient update — uniform decay for all parameters regardless of gradient scale

class RegularizedModel(nn.Module):

def __init__(self):

super().__init__()

self.fc1 = nn.Linear(512, 256)

self.bn1 = nn.BatchNorm1d(256)

self.ln1 = nn.LayerNorm(256)

self.dropout = nn.Dropout(p=0.3)

def forward(self, x):

x = self.fc1(x)

x = self.bn1(x) # or self.ln1(x) for transformers

x = torch.relu(x)

x = self.dropout(x)

return x

Loss Function Design

Cross-Entropy Loss

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log \hat{p}_c$$

Binary cross-entropy: $\mathcal{L}_{BCE} = -[y \log p + (1-y)\log(1-p)]$

Focal Loss

Addresses class imbalance by down-weighting easy examples:

$$\mathcal{L}_{FL} = -(1-p_t)^\gamma \log(p_t)$$

where $p_t$ is the predicted probability for the ground-truth class and $\gamma \geq 0$ is the focusing parameter. When $\gamma = 0$, this reduces to standard cross-entropy.

class FocalLoss(torch.nn.Module):

def __init__(self, alpha=0.25, gamma=2.0):

super().__init__()

self.alpha = alpha

self.gamma = gamma

def forward(self, logits, targets):

bce_loss = F.binary_cross_entropy_with_logits(

logits, targets.float(), reduction='none'

)

p = torch.sigmoid(logits)

p_t = p * targets + (1 - p) * (1 - targets)

focal_weight = (1 - p_t) ** self.gamma

alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)

loss = alpha_t * focal_weight * bce_loss

return loss.mean()

Contrastive Loss and Triplet Loss

**Contrastive Loss** (Siamese Networks):

$$\mathcal{L} = (1-y)\frac{d^2}{2} + y \cdot \max(0, m - d)^2$$

where $d = \|f(x_1) - f(x_2)\|_2$, $y=0$ for similar pairs and $y=1$ for dissimilar pairs.

**Triplet Loss**:

$$\mathcal{L}_{trip} = \max(0, \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + m)$$

Uses anchor (a), positive (p), and negative (n) samples.

InfoNCE Loss (NT-Xent)

The core loss function for contrastive self-supervised learning:

$$\mathcal{L}_{InfoNCE} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

where $\tau$ is the temperature parameter and $\text{sim}$ is cosine similarity.

def info_nce_loss(features, temperature=0.07):

"""

features: (2N, D) - two augmentation views of each image

"""

N = features.shape[0] // 2

features = F.normalize(features, dim=1)

Compute similarity matrix

similarity = torch.matmul(features, features.T) / temperature

Mask self-similarity (set diagonal to -inf)

mask = torch.eye(2 * N, dtype=torch.bool, device=features.device)

similarity.masked_fill_(mask, float('-inf'))

Positive pairs: i with i+N, and i+N with i

labels = torch.cat([

torch.arange(N, 2 * N),

torch.arange(N)

]).to(features.device)

loss = F.cross_entropy(similarity, labels)

return loss

LLM Training Optimization

Gradient Clipping

Prevents exploding gradients during training:

$$g \leftarrow g \cdot \min\left(1, \frac{\text{clip\_norm}}{\|g\|_2}\right)$$

def train_with_clipping(model, optimizer, loss, max_norm=1.0):

optimizer.zero_grad()

loss.backward()

Monitor gradient norm before clipping

total_norm = 0

for p in model.parameters():

if p.grad is not None:

param_norm = p.grad.data.norm(2)

total_norm += param_norm.item() ** 2

total_norm = total_norm ** 0.5

Apply clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

optimizer.step()

return total_norm

ZeRO Optimizer (Zero Redundancy Optimizer)

Partitions model training state across GPUs in three progressive stages:

| ZeRO Stage | Partitioned State | Memory Reduction |

| ---------- | ----------------- | ---------------- |

| Stage 1 | Optimizer states | ~4x |

| Stage 2 | + Gradients | ~8x |

| Stage 3 | + Parameters | ~64x (N GPUs) |

Mixed precision (FP16/BF16) combined with ZeRO-3 enables training multi-billion parameter models on a single node.

8-bit Adam

Uses quantization to store optimizer states in INT8 instead of FP32:

- Reduces optimizer state memory by 75% compared to FP32

- Block-wise quantization minimizes precision loss

- Available via the `bitsandbytes` library

8-bit Adam via bitsandbytes

optimizer = bnb.optim.Adam8bit(

model.parameters(),

lr=1e-4,

betas=(0.9, 0.999)

)

Adafactor

Approximates Adam's second moment matrix via low-rank factorization:

$$V_t \approx \hat{r}_t \hat{v}_t^T \quad \text{(rank-1 approximation)}$$

Requires memory proportional to parameter size (row + column vectors only). Used to train T5, PaLM, and other massive models.

| Optimizer | Memory (relative to params) | LLM Suitability |

| ---------- | --------------------------- | --------------- |

| Adam | 8x (params + 2 states) | Moderate |

| AdamW | 8x | Good |

| 8-bit Adam | 6x | Good |

| Adafactor | ~2x | Excellent |

| Lion | 6x | Good |

Quiz

**Answer**: To correct the bias introduced by initializing the moment estimates at zero.

**Explanation**: Adam initializes $m_0 = 0$ and $v_0 = 0$. In early timesteps, $m_t$ and $v_t$ underestimate the true moments of the gradients. For example, at $t=1$: $m_1 = (1-\beta_1)g_1$, whose expected value $(1-\beta_1)\mathbb{E}[g_1]$ is much smaller than $\mathbb{E}[g_1]$. Dividing by $(1-\beta_1^t)$ corrects this. With $\beta_1 = 0.9$ at $t=1$, the correction factor is $1/0.1 = 10$. As $t$ grows large, $\beta_1^t \to 0$ and the correction factor approaches 1, becoming negligible.

**Answer**: Because Adam's adaptive learning rate scales the L2 penalty gradient, weakening its regularization effect.

**Explanation**: In SGD, $\theta \leftarrow \theta - \eta(\nabla \mathcal{L} + \lambda\theta)$ makes both approaches mathematically identical. In Adam with L2 regularization, the combined gradient becomes $g_t + \lambda\theta_t$, which is then divided by the adaptive factor $1/\sqrt{\hat{v}_t}$. Parameters with high gradient variance (large $v_t$) receive proportionally smaller regularization. AdamW decouples weight decay from the gradient update: $\theta \leftarrow \theta(1-\eta\lambda) - \eta\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$, applying uniform decay to all parameters regardless of their gradient scale.

**Answer**: BN normalizes across the batch dimension; LN normalizes across the feature dimension of each sample.

**Explanation**: BN computes mean and variance over the mini-batch for each feature position. It depends on batch size — small batches yield unstable statistics. It is best suited for CNNs with fixed spatial structure and sufficiently large batches. LN computes statistics over the feature dimension of each sample independently, making it batch-size agnostic. It is ideal for Transformers (variable sequence lengths), RNNs, and online inference scenarios where batch statistics are unavailable or unreliable.

**Answer**: The $(1-p_t)^\gamma$ modulating factor dynamically down-weights easy examples during training.

**Explanation**: Standard CE loss $-\log(p_t)$ treats every sample equally regardless of prediction confidence. Focal Loss introduces $(1-p_t)^\gamma$: for an easy example with $p_t = 0.9$, the weight is $(1-0.9)^2 = 0.01$, reducing its contribution by 100x. For a hard example with $p_t = 0.1$, the weight is $(1-0.1)^2 = 0.81$, preserving nearly the full loss signal. With $\gamma = 2$, easy well-classified majority-class samples effectively stop contributing, forcing the model to focus training on the rare, hard minority-class examples.

**Answer**: By maximizing similarity between augmented views of the same image while pushing apart views from different images.

**Explanation**: InfoNCE maximizes a lower bound on mutual information. The numerator $\exp(\text{sim}(z_i, z_j)/\tau)$ increases cosine similarity between the two augmented views of the same image (positive pair). The denominator includes $2N-1$ negative pairs (other images in the batch). The temperature $\tau$ controls distribution sharpness: smaller $\tau$ creates a more peaked distribution, forcing tighter positive-pair clusters. Large batches provide more diverse negatives, improving representation quality. SimCLR, MoCo, and CLIP all rely on this loss formulation to learn generalizable visual and multimodal representations.

현재 단락 (1/264)

1. [Optimization Fundamentals: Convex Optimization and KKT Conditions](#optimization-fundamentals)

작성 글자: 0원문 글자: 15,290작성 단락: 0/264