
Cloud Computing for AI Engineers: Building AI Services with AWS/GCP/Azure

1. Cloud AI Platform Overview

Cloud computing is no longer optional for AI engineers; it is essential. You can provision hundreds of GPUs on demand and cover the entire workflow, from model training to serving, with managed services.

IaaS, PaaS, and SaaS

Cloud services are organized into three layers:

  • IaaS (Infrastructure as a Service): Provides virtual machines, storage, and networking. EC2 GPU instances are a classic example. You get maximum control but carry the burden of managing the infrastructure.
  • PaaS (Platform as a Service): Manages the runtime and middleware for you. AWS SageMaker, GCP Vertex AI, and Azure ML fall into this category. You focus on model code, not servers.
  • SaaS (Software as a Service): Delivers complete AI capabilities as APIs. AWS Bedrock, the GCP Gemini API, and Azure OpenAI Service are the leading examples.

Cloud AI Services Comparison

| Feature | AWS | GCP | Azure |
| ML Platform | SageMaker | Vertex AI | Azure ML |
| LLM API | Bedrock | Vertex AI (Gemini) | Azure OpenAI |
| Managed Notebooks | SageMaker Studio | Vertex AI Workbench | Azure ML Studio |
| AutoML | SageMaker Autopilot | Vertex AutoML | Azure AutoML |
| Feature Store | SageMaker Feature Store | Vertex Feature Store | Azure ML Feature Store |
| Model Registry | SageMaker Model Registry | Vertex Model Registry | Azure ML Registry |
| Serverless Inference | Lambda, Fargate | Cloud Run, Cloud Functions | Azure Functions, Container Apps |

GPU Instance Types Compared

AWS GPU instances:

  • p3.2xlarge: 1x V100, 61 GiB RAM (small-scale training)
  • p4d.24xlarge: 8x A100, 320 GB total GPU memory (large-scale distributed training)
  • p5.48xlarge: 8x H100, 2 TiB RAM (cutting-edge LLM training)

GCP GPU instances:

  • n1-standard-8 + V100: cost-effective training
  • a2-highgpu-8g: 8x A100 (standard choice for Vertex AI training)
  • a3-highgpu-8g: 8x H100 (latest large models)

Azure GPU instances:

  • NC6s_v3: 1x V100 (development and testing)
  • ND96asr_v4: 8x A100 (large-scale training)
  • ND96amsr_A100_v4: 8x A100 80GB (maximum performance)

Cost Optimization Strategies

More than 70% of cloud AI spend goes to compute. The main saving strategies:

  • Spot/Preemptible instances: Up to 90% cheaper than On-Demand. Checkpointing is essential to survive interruptions.
  • Reserved Instances / Committed Use Discounts: 40-60% savings for 1-3 year commitments. Suited to long-running projects.
  • Auto Scaling: Automatically adjust instance counts with inference traffic.
  • Savings Plans (AWS): Commit to a compute spend in exchange for flexible, instance-type-agnostic discounts.
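
To make the discount arithmetic concrete, here is a small cost sketch. The hourly rate is a made-up illustration, not a quoted price; real rates vary by region and change frequently, so check the provider's pricing pages.

```python
# Hypothetical hourly rate for illustration only
ON_DEMAND_HOURLY = 32.77  # e.g. a large multi-GPU training instance

def monthly_cost(hourly_rate: float, hours: float = 730, discount: float = 0.0) -> float:
    """Estimate monthly cost from an hourly rate and a fractional discount."""
    return hourly_rate * hours * (1.0 - discount)

on_demand = monthly_cost(ON_DEMAND_HOURLY)
spot      = monthly_cost(ON_DEMAND_HOURLY, discount=0.70)  # Spot often lands at 60-90% off
reserved  = monthly_cost(ON_DEMAND_HOURLY, discount=0.50)  # midpoint of the 40-60% range

print(f'On-Demand: ${on_demand:,.0f}/mo, Spot: ${spot:,.0f}/mo, Reserved: ${reserved:,.0f}/mo')
```

Even at the conservative end of the Spot discount range, the monthly difference on a large GPU instance is substantial, which is why checkpoint-and-resume engineering usually pays for itself quickly.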

2. AWS AI/ML Services

SageMaker Core Features

Amazon SageMaker is AWS's integrated ML platform. It covers the full ML lifecycle, from data preparation to model deployment and monitoring, in a single service.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()

# SageMaker Training Job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./src',
    role=role,
    instance_type='ml.p4d.24xlarge',
    instance_count=4,
    framework_version='2.1.0',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'batch-size': 32,
        'learning-rate': 0.001
    },
    distribution={
        'torch_distributed': {'enabled': True}
    }
)

estimator.fit({'train': 's3://bucket/train', 'val': 's3://bucket/val'})

The distribution parameter makes PyTorch DDP-based distributed training straightforward to set up. Specifying instance_count=4 together with the torch_distributed option automatically configures data-parallel training across 4 nodes.

SageMaker Model Deployment

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data='s3://bucket/model.tar.gz',
    role=role,
    framework_version='2.1.0',
    py_version='py310',
    entry_point='inference.py'
)

predictor = model.deploy(
    initial_instance_count=2,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='my-pytorch-endpoint'
)

# Invoke the endpoint
result = predictor.predict({'inputs': 'Hello, cloud AI!'})

AWS Bedrock (LLM API)

AWS Bedrock provides access to multiple foundation models, including Anthropic Claude, Meta Llama, and Amazon Titan, through a single API.

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': 'Explain recent AI trends'}]
    })
)
result = json.loads(response['body'].read())
print(result['content'][0]['text'])

Serverless Inference with AWS Lambda

Lightweight models can be deployed serverlessly on Lambda.

import json

def lambda_handler(event, context):
    # Parse the input data from the event
    body = json.loads(event['body'])
    input_data = body['input']

    # run_inference wraps a model loaded at module scope
    # (bundling the model into the container image is recommended)
    prediction = run_inference(input_data)

    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction})
    }
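
Handlers in this shape are easy to exercise locally by passing a fake API Gateway proxy event. A minimal sketch with a stubbed run_inference (the stub and its output are placeholders, not a real model):

```python
import json

def run_inference(x):
    # Stand-in for a real model call; returns a fixed dummy score
    return {'label': 'positive', 'score': 0.99}

def lambda_handler(event, context):
    body = json.loads(event['body'])
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': run_inference(body['input'])})
    }

# Simulate an API Gateway proxy event locally
event = {'body': json.dumps({'input': 'Hello, cloud AI!'})}
response = lambda_handler(event, context=None)
print(response['statusCode'])        # 200
print(json.loads(response['body']))  # {'prediction': {'label': 'positive', 'score': 0.99}}
```

Keeping the handler a thin wrapper around a pure function like run_inference is what makes this kind of local testing possible.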

3. GCP AI/ML Services

Vertex AI Training

Google Cloud's Vertex AI is a unified ML platform; its tight integration with BigQuery is a key strength.

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomTrainingJob(
    display_name='pytorch-training',
    script_path='train.py',
    container_uri='us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest',
    requirements=['transformers', 'datasets']
)

model = job.run(
    dataset=None,
    machine_type='a2-highgpu-8g',
    accelerator_type='NVIDIA_TESLA_A100',
    accelerator_count=8,
    args=['--epochs=10', '--batch_size=32']
)

Vertex AI Model Deployment

from google.cloud import aiplatform

# Upload the model
model = aiplatform.Model.upload(
    display_name='my-pytorch-model',
    artifact_uri='gs://bucket/model/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-0:latest'
)

# Create an endpoint and deploy
endpoint = aiplatform.Endpoint.create(display_name='my-endpoint')
model.deploy(
    endpoint=endpoint,
    dedicated_resources_machine_type='n1-standard-4',
    dedicated_resources_accelerator_type='NVIDIA_TESLA_T4',
    dedicated_resources_accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5
)

BigQuery ML

BigQuery ML is a powerful tool that lets you train ML models and run predictions using SQL alone.

-- Train a classification model with BigQuery ML
CREATE OR REPLACE MODEL `dataset.fraud_model`
OPTIONS(
    model_type = 'BOOSTED_TREE_CLASSIFIER',
    num_parallel_tree = 1,
    max_iterations = 50,
    input_label_cols = ['is_fraud']
) AS
SELECT * FROM `dataset.transactions_train`;

-- Evaluate the model
SELECT *
FROM ML.EVALUATE(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.transactions_test`)
);

-- Run predictions
SELECT *
FROM ML.PREDICT(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.new_transactions`)
);
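
The same statements can be submitted from Python with the google-cloud-bigquery client. As a sketch, this helper renders an ML.PREDICT statement as a string; the templating is pure Python, so no GCP project is needed to try it, and the actual submission is only shown in comments:

```python
def ml_predict_sql(model: str, source_table: str) -> str:
    """Render a BigQuery ML.PREDICT statement for a model and an input table."""
    return (
        'SELECT *\n'
        f'FROM ML.PREDICT(MODEL `{model}`,\n'
        f'  (SELECT * FROM `{source_table}`)\n'
        ');'
    )

sql = ml_predict_sql('dataset.fraud_model', 'dataset.new_transactions')
print(sql)

# Submitting it requires google-cloud-bigquery and credentials, roughly:
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
```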

4. Azure AI Services

Azure Machine Learning

Azure ML is Microsoft's enterprise-grade ML platform, with strong Active Directory integration and hybrid cloud support.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, Command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUBSCRIPTION",
    resource_group_name="rg-ai",
    workspace_name="ai-workspace"
)

# Create a GPU compute cluster
compute_config = AmlCompute(
    name="gpu-cluster",
    type="amlcompute",
    size="Standard_ND96asr_v4",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)
ml_client.compute.begin_create_or_update(compute_config).result()

Submitting an Azure ML Training Job

from azure.ai.ml.entities import Command
from azure.ai.ml import Input

job = Command(
    code="./src",
    command="python train.py --epochs 10 --learning_rate 0.001",
    environment="AzureML-pytorch-2.0-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="gpu-cluster",
    inputs={
        "train_data": Input(type="uri_folder", path="azureml://datastores/mydata/paths/train/")
    },
    display_name="pytorch-training-job"
)

returned_job = ml_client.jobs.create_or_update(job)
print(f"Job URL: {returned_job.studio_url}")

Azure OpenAI Service

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2024-02-01"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "What are the benefits of cloud AI?"}
    ]
)
print(response.choices[0].message.content)

5. Kubernetes for AI (EKS/GKE/AKS)

Kubernetes has become the standard for orchestrating large-scale AI workloads.

GPU Node Pool Setup

# GKE GPU node pool (Terraform)
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-pool"
  cluster    = google_container_cluster.primary.name
  node_count = 2

  node_config {
    machine_type = "a2-highgpu-1g"
    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 1
    }
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}

NVIDIA Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Kubeflow Pipeline Definition

import kfp
from kfp import dsl

@dsl.component(
    base_image='python:3.10',
    packages_to_install=['scikit-learn', 'pandas']
)
def train_model(
    data_path: str,
    model_path: kfp.dsl.OutputPath(str)
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    df = pd.read_csv(data_path)
    X = df.drop('label', axis=1)
    y = df['label']

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

@dsl.pipeline(name='ml-pipeline')
def ml_pipeline(data_path: str = 'gs://bucket/data.csv'):
    train_task = train_model(data_path=data_path)

Auto Scaling with KEDA

KEDA (Kubernetes Event-driven Autoscaling) automatically scales AI inference pods based on signals such as queue depth or HTTP request counts.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/inference-queue
        queueLength: '5'

6. Serverless AI Inference

Serverless is a cost-effective choice for AI services with intermittent or unpredictable traffic.

Cold Start Optimization

ML model cold starts can take anywhere from a few seconds to tens of seconds. Ways to minimize them:

  1. Provisioned Concurrency (AWS Lambda): Keep pre-warmed instances ready.
  2. Container image optimization: Remove unnecessary packages; use multi-stage builds.
  3. Model quantization: Cut model size by half or more with FP16/INT8.
  4. Eager initialization: Load the model outside the handler (module-level globals) so that warm containers reuse it.
# Lambda cold start optimization pattern
import json
from transformers import pipeline

# Load the model at module scope (outside the handler):
# this code does not run again when the container is reused
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

def lambda_handler(event, context):
    body = json.loads(event['body'])
    text = body['text']

    result = classifier(text)

    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }

Container-based Serverless with AWS Fargate

Fargate runs containers without any server management. It is not bound by Lambda's memory and 15-minute execution limits, so it can serve large models.

# ECS task definition (Fargate)
{
  "family": "ai-inference-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "4096",
  "memory": "16384",
  "containerDefinitions": [
    {
      "name": "inference-container",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "environment": [{ "name": "MODEL_PATH", "value": "/opt/ml/model" }]
    }
  ]
}

7. Data Storage for AI

Object Storage Comparison

| Service | Provider | Key Features |
| Amazon S3 | AWS | 99.999999999% (11 nines) durability, mature SDKs |
| Google Cloud Storage | GCP | Native BigQuery integration |
| Azure Blob Storage | Azure | Azure Data Lake Storage Gen2 support |

Data Lake Architecture

A lakehouse pattern for an efficient AI data pipeline:

Raw Layer (Bronze)
    └── Source data preserved as-is
    └── Partitioned by year/month/day/

Processed Layer (Silver)
    └── Cleaned, deduplicated, schema-enforced
    └── Parquet/Delta Lake format

Feature Layer (Gold)
    └── ML feature engineering complete
    └── Registered in the Feature Store
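
The year/month/day partitioning of the Bronze layer is worth standardizing in code rather than by convention. A small sketch of a key builder (the prefix and filename are hypothetical):

```python
from datetime import date

def raw_partition_key(prefix: str, d: date, filename: str) -> str:
    """Build a year/month/day-partitioned object key for the raw (Bronze) layer."""
    return f'{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}'

key = raw_partition_key('raw/transactions', date(2024, 3, 7), 'events.json')
print(key)  # raw/transactions/year=2024/month=03/day=07/events.json
```

The `key=value` style shown here is the Hive partitioning convention, which query engines such as BigQuery, Athena, and Spark can use to prune partitions automatically.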

Managing Large Model Checkpoints

import boto3

def save_checkpoint_to_s3(model, optimizer, epoch, loss, bucket, prefix):
    """Save a model checkpoint to S3"""
    import torch
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
    torch.save(checkpoint, local_path)

    s3 = boto3.client('s3')
    s3_key = f'{prefix}/checkpoint_epoch_{epoch}.pt'
    s3.upload_file(local_path, bucket, s3_key)
    print(f'Checkpoint saved to s3://{bucket}/{s3_key}')
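
A loading counterpart needs to find the most recent checkpoint under the prefix. The key-selection logic below is pure Python and assumes the checkpoint_epoch_{N}.pt naming used by save_checkpoint_to_s3; the S3 download itself is only sketched in comments since it needs credentials:

```python
import re

def latest_checkpoint_key(keys):
    """Pick the key with the highest epoch among checkpoint_epoch_{N}.pt names."""
    def epoch_of(key):
        m = re.search(r'checkpoint_epoch_(\d+)\.pt$', key)
        return int(m.group(1)) if m else -1
    candidates = [k for k in keys if epoch_of(k) >= 0]
    return max(candidates, key=epoch_of) if candidates else None

keys = ['ckpt/checkpoint_epoch_3.pt', 'ckpt/checkpoint_epoch_10.pt', 'ckpt/notes.txt']
print(latest_checkpoint_key(keys))  # ckpt/checkpoint_epoch_10.pt

# Restoring then looks roughly like (requires boto3/torch and AWS credentials):
#   s3 = boto3.client('s3')
#   objs = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']
#   key = latest_checkpoint_key([o['Key'] for o in objs])
#   s3.download_file(bucket, key, '/tmp/ckpt.pt')
#   checkpoint = torch.load('/tmp/ckpt.pt')
```

Parsing the epoch number rather than sorting keys lexicographically matters: as strings, 'epoch_10' sorts before 'epoch_3'.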

8. Cloud AI Monitoring

Model Performance Drift Detection

Production models can degrade over time. A SageMaker Model Monitor example:

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Configure data capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,
    destination_s3_uri='s3://bucket/capture'
)

# Create the model monitor
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600
)

# Suggest a baseline from the training data
monitor.suggest_baseline(
    baseline_dataset='s3://bucket/train_data.csv',
    dataset_format=DatasetFormat.csv(header=True)
)

Custom CloudWatch Metrics and Alarms

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Publish a custom metric
cloudwatch.put_metric_data(
    Namespace='MLOps/ModelPerformance',
    MetricData=[
        {
            'MetricName': 'PredictionAccuracy',
            'Value': 0.94,
            'Unit': 'None',
            'Dimensions': [
                {'Name': 'ModelName', 'Value': 'fraud-detector-v2'},
                {'Name': 'Environment', 'Value': 'production'}
            ]
        }
    ]
)
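
To actually alert on the metric, pair it with a CloudWatch alarm. This sketch builds the put_metric_alarm arguments as a plain dict so the shape is easy to inspect; the alarm name, threshold, and evaluation windows are illustrative choices, not fixed conventions:

```python
def accuracy_alarm_config(model_name: str, threshold: float = 0.90) -> dict:
    """Kwargs for cloudwatch.put_metric_alarm: fire when accuracy drops."""
    return {
        'AlarmName': f'{model_name}-accuracy-low',
        'Namespace': 'MLOps/ModelPerformance',
        'MetricName': 'PredictionAccuracy',
        'Dimensions': [{'Name': 'ModelName', 'Value': model_name}],
        'Statistic': 'Average',
        'Period': 300,            # evaluate over 5-minute windows
        'EvaluationPeriods': 3,   # require 3 consecutive breaches before alarming
        'Threshold': threshold,
        'ComparisonOperator': 'LessThanThreshold',
        'TreatMissingData': 'breaching',  # no data is treated as a problem
    }

config = accuracy_alarm_config('fraud-detector-v2')
print(config['AlarmName'])
# cloudwatch.put_metric_alarm(**config)  # requires AWS credentials
```

Treating missing data as breaching is a deliberately paranoid default for ML metrics: a model that stops reporting accuracy is usually a model you want to look at.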

9. MLOps on Cloud

CI/CD with GitHub Actions and AWS

name: ML Pipeline CI/CD

on:
  push:
    branches: [main]

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Run SageMaker training
        run: |
          python scripts/run_training.py \
            --instance-type ml.p3.2xlarge \
            --output-path s3://bucket/models/

      - name: Deploy to staging
        run: |
          python scripts/deploy_model.py \
            --endpoint-name staging-endpoint \
            --instance-type ml.g4dn.xlarge

MLflow Model Registry with S3 Artifact Storage

import mlflow
import mlflow.pytorch

# The tracking/registry store must be database-backed; point at a tracking
# server (placeholder URL) whose artifact store is S3, e.g. one started with:
#   mlflow server --backend-store-uri postgresql://... \
#                 --default-artifact-root s3://bucket/mlflow
mlflow.set_tracking_uri('http://mlflow-server:5000')
mlflow.set_experiment('fraud-detection')

with mlflow.start_run():
    # Log hyperparameters
    mlflow.log_params({
        'learning_rate': 0.001,
        'batch_size': 32,
        'epochs': 10
    })

    # Training loop
    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader, optimizer)
        val_accuracy = evaluate(model, val_loader)

        mlflow.log_metrics({
            'train_loss': train_loss,
            'val_accuracy': val_accuracy
        }, step=epoch)

    # Log and register the model
    mlflow.pytorch.log_model(model, 'model')
    mlflow.register_model(
        model_uri=f'runs:/{mlflow.active_run().info.run_id}/model',
        name='fraud-detector'
    )

Canary Deployment

# SageMaker canary deployment
import boto3

sm = boto3.client('sagemaker')

# Shift 10% of traffic to the new model
sm.update_endpoint_weights_and_capacities(
    EndpointName='production-endpoint',
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'current-model',
            'DesiredWeight': 90,
            'DesiredInstanceCount': 4
        },
        {
            'VariantName': 'new-model',
            'DesiredWeight': 10,
            'DesiredInstanceCount': 1
        }
    ]
)
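
In practice a canary is shifted gradually rather than in one jump. A small helper sketching a traffic-shift schedule (the step values are an example policy, not a SageMaker feature):

```python
def canary_schedule(steps=(10, 25, 50, 100)):
    """Return (current_weight, new_weight) pairs for a gradual traffic shift."""
    return [(100 - s, s) for s in steps]

for current, new in canary_schedule():
    print(f'current-model: {current}%, new-model: {new}%')
    # At each step: call update_endpoint_weights_and_capacities with these
    # weights, watch error/latency metrics, and roll back if they regress.
```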

10. Quiz

Q1. What is the correct configuration to enable PyTorch DDP distributed training in AWS SageMaker?

Answer: Set the distribution parameter to {'torch_distributed': {'enabled': True}} and set instance_count to 2 or more.

Explanation: The SageMaker PyTorch Estimator supports several distributed training strategies through the distribution option. torch_distributed uses PyTorch's native distributed framework, and SageMaker configures the inter-node communication automatically.

Q2. What is the main difference between Spot and On-Demand instances?

Answer: Spot instances use AWS's spare capacity and cost up to 90% less, but AWS can reclaim them: when capacity is needed you get an interruption notice and the instance is taken back within 2 minutes. On-Demand instances are always available, at the list price.

Explanation: When training ML models on Spot, checkpointing is mandatory. SageMaker supports automatic checkpointing to S3 via CheckpointConfig and can restart automatically after an interruption.
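
Concretely, in the SageMaker Python SDK these are estimator keyword arguments (use_spot_instances, max_run, max_wait, checkpoint_s3_uri, checkpoint_local_path). A sketch of a Spot training configuration, shown as a plain dict; the bucket path is a placeholder:

```python
# Kwargs you would add to a SageMaker estimator to use Spot capacity
# with automatic checkpointing to S3 (bucket name is a placeholder)
spot_kwargs = {
    'use_spot_instances': True,
    'max_run': 3600 * 8,    # maximum training time, in seconds
    'max_wait': 3600 * 12,  # must be >= max_run; includes time spent waiting for Spot
    'checkpoint_s3_uri': 's3://bucket/checkpoints/',
    'checkpoint_local_path': '/opt/ml/checkpoints',
}

# Sanity check: SageMaker rejects configs where max_wait < max_run
assert spot_kwargs['max_wait'] >= spot_kwargs['max_run']
print('estimator = PyTorch(..., **spot_kwargs)')
```

The training script then only needs to write checkpoints to the local checkpoint path; SageMaker syncs that directory to S3 and restores it on restart.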

Q3. In BigQuery ML's CREATE OR REPLACE MODEL statement, what does the input_label_cols option mean?

Answer: input_label_cols specifies the target column(s) the model should predict. The specified columns are automatically excluded from the features.

Explanation: BigQuery ML uses the result of the SQL query directly as training data. If input_label_cols is set incorrectly, the target value can be included among the features, causing data leakage.

Q4. What is the main purpose of KEDA in Kubernetes?

Answer: KEDA provides event-driven autoscaling. It adjusts pod counts based on external event sources such as SQS queue depth, Kafka topic offsets, or HTTP request counts. Unlike the default HPA (Horizontal Pod Autoscaler), which scales on CPU/memory, it scales on the actual workload queue.

Explanation: For AI inference services, an effective pattern is to scale out quickly as the request queue builds up and scale in to zero when idle to save costs.

Q5. How is S3 used with MLflow's model registry?

Answer: S3 serves as the artifact store (s3://bucket-name/prefix), while the tracking and registry store itself must be database-backed, typically an MLflow tracking server whose URI you pass to mlflow.set_tracking_uri (e.g. http://mlflow-server:5000) and which is configured with an S3 artifact root. An s3:// URI alone is not a valid tracking URI.

Explanation: MLflow separates the artifact store (models and files, for which S3 works well) from the backend store (experiment metadata and the model registry, which needs a database such as PostgreSQL or SQLite). On EC2 or SageMaker, IAM roles provide automatic authentication to S3, which makes this setup convenient for team MLOps environments.


참고 자료

Cloud Computing for AI Engineers: Build AI Services with AWS/GCP/Azure

1. Cloud AI Platform Overview

Cloud computing has become indispensable for AI engineers. The ability to provision hundreds of GPUs on demand, and to handle the full ML lifecycle — from training to serving — using managed services has fundamentally changed how AI systems are built.

IaaS, PaaS, and SaaS

Cloud services are organized into three layers:

  • IaaS (Infrastructure as a Service): Provides virtual machines, storage, and networking. EC2 GPU instances are a classic example. You get maximum control but must manage the infrastructure yourself.
  • PaaS (Platform as a Service): Manages the runtime and middleware for you. AWS SageMaker, GCP Vertex AI, and Azure ML fall into this category. You focus on model code, not servers.
  • SaaS (Software as a Service): Delivers complete AI capabilities as APIs. AWS Bedrock, GCP Gemini API, and Azure OpenAI Service are the leading examples.

Cloud AI Services Comparison

FeatureAWSGCPAzure
ML PlatformSageMakerVertex AIAzure ML
LLM APIBedrockVertex AI (Gemini)Azure OpenAI
Managed NotebooksSageMaker StudioVertex AI WorkbenchAzure ML Studio
AutoMLSageMaker AutopilotVertex AutoMLAzure AutoML
Feature StoreSageMaker Feature StoreVertex Feature StoreAzure ML Feature Store
Model RegistrySageMaker Model RegistryVertex Model RegistryAzure ML Registry
Serverless InferenceLambda, FargateCloud Run, Cloud FunctionsAzure Functions, Container Apps

GPU Instance Types Compared

AWS GPU Instances:

  • p3.2xlarge: 1x V100, 61 GiB RAM — small-scale training
  • p4d.24xlarge: 8x A100, 320 GiB RAM — large-scale distributed training
  • p5.48xlarge: 8x H100, 2 TiB RAM — latest LLM training

GCP GPU Instances:

  • n1-standard-8 + V100: cost-effective training
  • a2-highgpu-8g: 8x A100 — default Vertex AI training
  • a3-highgpu-8g: 8x H100 — latest large models

Azure GPU Instances:

  • NC6s_v3: 1x V100 — development and testing
  • ND96asr_v4: 8x A100 — large-scale training
  • ND96amsr_A100_v4: 8x A100 80GB — maximum performance

Cost Optimization Strategies

More than 70% of cloud AI costs come from compute. Key saving strategies:

  • Spot/Preemptible Instances: Up to 90% savings vs On-Demand. Checkpointing is essential.
  • Reserved Instances / Committed Use: 40–60% savings with 1–3 year commitments. Best for long-running projects.
  • Auto Scaling: Automatically adjust instance count based on inference traffic.
  • Savings Plans (AWS): Flexible instance-type discounts tied to compute usage commitments.

2. AWS AI/ML Services

SageMaker Core Features

Amazon SageMaker is AWS's integrated ML platform. It handles the full ML lifecycle — from data preparation to model deployment and monitoring — within a single service.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()

# SageMaker Training Job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./src',
    role=role,
    instance_type='ml.p4d.24xlarge',
    instance_count=4,
    framework_version='2.1.0',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'batch-size': 32,
        'learning-rate': 0.001
    },
    distribution={
        'torch_distributed': {'enabled': True}
    }
)

estimator.fit({'train': 's3://bucket/train', 'val': 's3://bucket/val'})

The distribution parameter enables PyTorch DDP-based distributed training across multiple nodes. Setting instance_count=4 with torch_distributed enabled automatically configures 4-node data-parallel training.

SageMaker Model Deployment

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data='s3://bucket/model.tar.gz',
    role=role,
    framework_version='2.1.0',
    py_version='py310',
    entry_point='inference.py'
)

predictor = model.deploy(
    initial_instance_count=2,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='my-pytorch-endpoint'
)

# Invoke the endpoint
result = predictor.predict({'inputs': 'Hello, cloud AI!'})

AWS Bedrock (LLM API)

AWS Bedrock provides access to multiple foundation models — Anthropic Claude, Meta Llama, Amazon Titan — through a single API.

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': 'Explain recent AI trends'}]
    })
)
result = json.loads(response['body'].read())
print(result['content'][0]['text'])

Serverless Inference with AWS Lambda

Lightweight models can be deployed as serverless functions.

import json
import boto3
import numpy as np

def lambda_handler(event, context):
    body = json.loads(event['body'])
    input_data = body['input']

    # Model is loaded at module level (outside handler) for warm reuse
    prediction = run_inference(input_data)

    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction})
    }

3. GCP AI/ML Services

Vertex AI Training

Google Cloud's Vertex AI is a unified ML platform with tight BigQuery integration as a key differentiator.

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomTrainingJob(
    display_name='pytorch-training',
    script_path='train.py',
    container_uri='us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest',
    requirements=['transformers', 'datasets']
)

model = job.run(
    dataset=None,
    machine_type='a2-highgpu-8g',
    accelerator_type='NVIDIA_TESLA_A100',
    accelerator_count=8,
    args=['--epochs=10', '--batch_size=32']
)

Vertex AI Model Deployment

from google.cloud import aiplatform

# Upload model artifact
model = aiplatform.Model.upload(
    display_name='my-pytorch-model',
    artifact_uri='gs://bucket/model/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-0:latest'
)

# Create endpoint and deploy
endpoint = aiplatform.Endpoint.create(display_name='my-endpoint')
model.deploy(
    endpoint=endpoint,
    dedicated_resources_machine_type='n1-standard-4',
    dedicated_resources_accelerator_type='NVIDIA_TESLA_T4',
    dedicated_resources_accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5
)

BigQuery ML

BigQuery ML lets you train and run predictions using SQL syntax — no Python required.

-- Train a classification model with BigQuery ML
CREATE OR REPLACE MODEL `dataset.fraud_model`
OPTIONS(
    model_type = 'BOOSTED_TREE_CLASSIFIER',
    num_parallel_tree = 1,
    max_iterations = 50,
    input_label_cols = ['is_fraud']
) AS
SELECT * FROM `dataset.transactions_train`;

-- Evaluate model
SELECT *
FROM ML.EVALUATE(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.transactions_test`)
);

-- Run predictions
SELECT *
FROM ML.PREDICT(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.new_transactions`)
);

4. Azure AI Services

Azure Machine Learning

Azure ML is Microsoft's enterprise-grade ML platform with strong Active Directory integration and hybrid cloud support.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, Command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUBSCRIPTION",
    resource_group_name="rg-ai",
    workspace_name="ai-workspace"
)

# Create GPU compute cluster
compute_config = AmlCompute(
    name="gpu-cluster",
    type="amlcompute",
    size="Standard_ND96asr_v4",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)
ml_client.compute.begin_create_or_update(compute_config).result()

Submitting an Azure ML Training Job

from azure.ai.ml.entities import Command
from azure.ai.ml import Input

job = Command(
    code="./src",
    command="python train.py --epochs 10 --learning_rate 0.001",
    environment="AzureML-pytorch-2.0-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="gpu-cluster",
    inputs={
        "train_data": Input(type="uri_folder", path="azureml://datastores/mydata/paths/train/")
    },
    display_name="pytorch-training-job"
)

returned_job = ml_client.jobs.create_or_update(job)
print(f"Job URL: {returned_job.studio_url}")

Azure OpenAI Service

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2024-02-01"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "What are the benefits of cloud AI?"}
    ]
)
print(response.choices[0].message.content)

5. Kubernetes for AI (EKS/GKE/AKS)

Kubernetes has become the standard for orchestrating large-scale AI workloads.

GPU Node Pool Setup

# GKE GPU node pool (Terraform)
resource "google_container_node_pool" "gpu_pool" {
name       = "gpu-pool"
cluster    = google_container_cluster.primary.name
node_count = 2

node_config {
machine_type = "a2-highgpu-1g"
guest_accelerator {
type  = "nvidia-tesla-a100"
count = 1
}
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
}
}

NVIDIA Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Kubeflow Pipeline Definition

import kfp
from kfp import dsl

@dsl.component(
    base_image='python:3.10',
    packages_to_install=['scikit-learn', 'pandas']
)
def train_model(
    data_path: str,
    model_path: kfp.dsl.OutputPath(str)
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    df = pd.read_csv(data_path)
    X = df.drop('label', axis=1)
    y = df['label']

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

@dsl.pipeline(name='ml-pipeline')
def ml_pipeline(data_path: str = 'gs://bucket/data.csv'):
    train_task = train_model(data_path=data_path)

Auto Scaling with KEDA

KEDA (Kubernetes Event-driven Autoscaling) scales AI inference pods based on queue depth or HTTP request count.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/inference-queue
        queueLength: '5'

6. Serverless AI Inference

Serverless is a cost-effective choice for AI services with intermittent or unpredictable traffic.

Cold Start Optimization

ML model cold starts can take several seconds to tens of seconds. Ways to minimize them:

  1. Provisioned Concurrency (AWS Lambda): Keep pre-warmed instances ready at all times.
  2. Container Image Optimization: Remove unnecessary packages; use multi-stage builds.
  3. Model Quantization: Reduce model size by more than half with FP16/INT8.
  4. Lazy-free Loading: Initialize the model outside the handler function (global variables).
# Lambda cold start optimization pattern
import json
from transformers import pipeline

# Load model outside the handler (global scope)
# This code only runs once per container lifecycle
classifier = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)

def lambda_handler(event, context):
    body = json.loads(event['body'])
    text = body['text']
    result = classifier(text)
    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }

Container-based Serverless with AWS Fargate

Fargate runs containers without server management. Unlike Lambda, there are no memory or timeout limits, making it suitable for serving large models.

{
  "family": "ai-inference-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "4096",
  "memory": "16384",
  "containerDefinitions": [
    {
      "name": "inference-container",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "environment": [{ "name": "MODEL_PATH", "value": "/opt/ml/model" }]
    }
  ]
}

7. Data Storage for AI

Object Storage Comparison

ServiceProviderKey Feature
Amazon S3AWS11 nines durability, rich SDK ecosystem
Google Cloud StorageGCPNative BigQuery integration
Azure Blob StorageAzureAzure Data Lake Gen2 support

Data Lake Architecture

A modern lakehouse pattern for AI data pipelines:

Raw Layer (Bronze)
    └── Ingest source data as-is
    └── Partition by: year/month/day/

Processed Layer (Silver)
    └── Cleaned, deduplicated, schema-enforced
    └── Parquet / Delta Lake format

Feature Layer (Gold)
    └── Feature engineering complete
    └── Registered in Feature Store

Large Model Checkpoint Management

import boto3

def save_checkpoint_to_s3(model, optimizer, epoch, loss, bucket, prefix):
    """Save model checkpoint to S3"""
    import torch
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
    torch.save(checkpoint, local_path)

    s3 = boto3.client('s3')
    s3_key = f'{prefix}/checkpoint_epoch_{epoch}.pt'
    s3.upload_file(local_path, bucket, s3_key)
    print(f'Checkpoint saved to s3://{bucket}/{s3_key}')

8. Cloud AI Monitoring

Model Performance Drift Detection

Production models degrade over time. SageMaker Model Monitor example:

from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Configure data capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,
    destination_s3_uri='s3://bucket/capture'
)

# Create model monitor
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600
)

# Create baseline from training data
monitor.suggest_baseline(
    baseline_dataset='s3://bucket/train_data.csv',
    dataset_format=DatasetFormat.csv(header=True)
)

Custom CloudWatch Metrics and Alarms

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Publish a custom metric
cloudwatch.put_metric_data(
    Namespace='MLOps/ModelPerformance',
    MetricData=[
        {
            'MetricName': 'PredictionAccuracy',
            'Value': 0.94,
            'Unit': 'None',
            'Dimensions': [
                {'Name': 'ModelName', 'Value': 'fraud-detector-v2'},
                {'Name': 'Environment', 'Value': 'production'}
            ]
        }
    ]
)

9. MLOps on Cloud

GitHub Actions + AWS CodePipeline CI/CD

name: ML Pipeline CI/CD

on:
  push:
    branches: [main]

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Run SageMaker training
        run: |
          python scripts/run_training.py \
            --instance-type ml.p3.2xlarge \
            --output-path s3://bucket/models/

      - name: Deploy to staging
        run: |
          python scripts/deploy_model.py \
            --endpoint-name staging-endpoint \
            --instance-type ml.g4dn.xlarge
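The `scripts/run_training.py` called by the workflow is project-specific and not shown in this guide; a hypothetical minimal version, wiring the two CLI flags into a SageMaker PyTorch Estimator (entry point, framework version, and input channel are assumptions):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='Launch a SageMaker training job')
    parser.add_argument('--instance-type', default='ml.p3.2xlarge')
    parser.add_argument('--output-path', required=True)
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Imported here so argument parsing stays testable without the SDK.
    import sagemaker
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point='train.py',              # training script in the repo
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type=args.instance_type,
        framework_version='2.1',
        py_version='py310',
        output_path=args.output_path,        # model artifacts land here
    )
    estimator.fit({'training': 's3://bucket/train_data/'})

if __name__ == '__main__':
    main()
```

Keeping the parameters as CLI flags lets the same script serve staging and production jobs by changing only the workflow arguments.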

MLflow + S3 Model Registry

import mlflow
import mlflow.pytorch

# The tracking URI must point at a tracking backend, typically an MLflow
# tracking server; S3 serves as the server's artifact store (e.g. started
# with `mlflow server --default-artifact-root s3://bucket/mlflow`).
mlflow.set_tracking_uri('http://mlflow-server:5000')
mlflow.set_experiment('fraud-detection')

with mlflow.start_run():
    mlflow.log_params({
        'learning_rate': 0.001,
        'batch_size': 32,
        'epochs': 10
    })

    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader, optimizer)
        val_accuracy = evaluate(model, val_loader)

        mlflow.log_metrics({
            'train_loss': train_loss,
            'val_accuracy': val_accuracy
        }, step=epoch)

    mlflow.pytorch.log_model(model, 'model')
    mlflow.register_model(
        model_uri=f'runs:/{mlflow.active_run().info.run_id}/model',
        name='fraud-detector'
    )

Canary Deployment

import boto3

sm = boto3.client('sagemaker')

# Route 10% of traffic to the new model
sm.update_endpoint_weights_and_capacities(
    EndpointName='production-endpoint',
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'current-model',
            'DesiredWeight': 90,
            'DesiredInstanceCount': 4
        },
        {
            'VariantName': 'new-model',
            'DesiredWeight': 10,
            'DesiredInstanceCount': 1
        }
    ]
)
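A canary usually does not stop at 10%: if metrics stay healthy, traffic shifts toward the new variant in steps until it carries 100%, then the old variant is retired. A sketch of that promotion loop, reusing the variant names above (the health check and wait time are placeholders):

```python
def promotion_steps(start=10, step=20):
    """Yield increasing canary weights, e.g. 10, 30, 50, 70, 90, 100."""
    weight = start
    while weight < 100:
        yield weight
        weight = min(weight + step, 100)
    yield 100

def promote_canary(endpoint_name, healthy, wait_seconds=600):
    """Gradually shift traffic to 'new-model' while the health check passes."""
    import time
    import boto3  # deferred so promotion_steps is testable without AWS
    sm = boto3.client('sagemaker')
    for weight in promotion_steps():
        sm.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {'VariantName': 'current-model', 'DesiredWeight': 100 - weight},
                {'VariantName': 'new-model', 'DesiredWeight': weight},
            ],
        )
        time.sleep(wait_seconds)   # let CloudWatch metrics accumulate
        if not healthy():          # e.g. error-rate or latency comparison
            raise RuntimeError(f'Canary failed at {weight}% traffic')
```

On failure, the same API call with the weights reversed rolls all traffic back to the current model.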

10. Quiz

Q1. What is the correct way to enable PyTorch DDP distributed training in SageMaker?

Answer: Set the distribution parameter to {'torch_distributed': {'enabled': True}} and set instance_count to 2 or more.

Explanation: SageMaker's PyTorch Estimator supports multiple distributed training strategies via the distribution parameter. The torch_distributed option leverages PyTorch's native distributed training framework, and SageMaker automatically handles the inter-node communication setup.

Q2. What is the key difference between Spot instances and On-Demand instances?

Answer: Spot instances leverage AWS spare capacity at up to 90% discount vs On-Demand, but AWS can reclaim them with a 2-minute interruption notice when capacity is needed. On-Demand instances are always available at full price.

Explanation: When using Spot for ML training, checkpointing is mandatory. SageMaker supports automatic S3 checkpointing via CheckpointConfig, and supports automatic restart after interruption, making it practical for long training runs.
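As the explanation notes, SageMaker wires Spot training and checkpointing together through estimator settings. A hedged sketch of the relevant keyword arguments for a PyTorch Estimator (the S3 checkpoint path is a placeholder; `max_wait` must be at least `max_run`):

```python
def spot_training_config(max_run=36000, extra_wait=7200):
    """Estimator kwargs for managed Spot training with S3 checkpointing.

    max_wait caps total time including waiting for Spot capacity, so it
    must be >= max_run (the training-time limit in seconds).
    """
    return {
        'use_spot_instances': True,
        'max_run': max_run,
        'max_wait': max_run + extra_wait,
        'checkpoint_s3_uri': 's3://bucket/checkpoints/',   # synced by SageMaker
        'checkpoint_local_path': '/opt/ml/checkpoints',    # container-side path
    }

# estimator = PyTorch(entry_point='train.py', ..., **spot_training_config())
```

The training script only needs to read and write checkpoints under the local path; SageMaker mirrors that directory to S3 and restores it when the job restarts after an interruption.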

Q3. In BigQuery ML, what does the input_label_cols option in CREATE MODEL specify?

Answer: input_label_cols specifies the target column(s) the model should predict. These columns are automatically excluded from the feature set.

Explanation: BigQuery ML uses the results of a SQL query directly as training data. If input_label_cols is not set correctly, the target value will be included as a feature, causing data leakage and artificially inflated model accuracy.

Q4. What is the primary purpose of KEDA in a Kubernetes AI inference setup?

Answer: KEDA provides event-driven autoscaling. It scales pods based on external event sources such as SQS queue depth, Kafka consumer lag, or HTTP request counts — unlike the standard HPA which only reacts to CPU and memory usage.

Explanation: For AI inference services, scaling based on actual workload queues is far more responsive than CPU-based scaling. KEDA can also scale pods down to zero during idle periods, eliminating idle compute costs.

Q5. When using MLflow with S3 as the artifact store, what should mlflow.set_tracking_uri point to?

Answer: The tracking URI must point to a tracking backend — typically an MLflow tracking server such as http://mlflow-server:5000, or a local file store — not directly to S3. S3 is configured on the server side as the artifact store, e.g. mlflow server --default-artifact-root s3://my-mlflow-bucket/mlflow.

Explanation: MLflow separates experiment metadata (the tracking store) from model artifacts (the artifact store), and s3:// is not a valid tracking-URI scheme. With a tracking server backed by S3, experiment metadata and model artifacts are centrally available — useful for team MLOps environments. On EC2 or SageMaker, IAM roles handle the S3 authentication automatically.


References