- Author: Youngju Kim (@fjvbn20031)
- 1. Cloud AI Platform Overview
- 2. AWS AI/ML Services
- 3. GCP AI/ML Services
- 4. Azure AI Services
- 5. Kubernetes for AI (EKS/GKE/AKS)
- 6. Serverless AI Inference
- 7. Data Storage for AI
- 8. Cloud AI Monitoring
- 9. MLOps on Cloud
- 10. Quiz
- References
1. Cloud AI Platform Overview
Cloud computing has become indispensable for AI engineers. The ability to provision hundreds of GPUs on demand, and to handle the full ML lifecycle — from training to serving — using managed services has fundamentally changed how AI systems are built.
IaaS, PaaS, and SaaS
Cloud services are organized into three layers:
- IaaS (Infrastructure as a Service): Provides virtual machines, storage, and networking. EC2 GPU instances are a classic example. You get maximum control but must manage the infrastructure yourself.
- PaaS (Platform as a Service): Manages the runtime and middleware for you. AWS SageMaker, GCP Vertex AI, and Azure ML fall into this category. You focus on model code, not servers.
- SaaS (Software as a Service): Delivers complete AI capabilities as APIs. AWS Bedrock, GCP Gemini API, and Azure OpenAI Service are the leading examples.
Cloud AI Services Comparison
| Feature | AWS | GCP | Azure |
|---|---|---|---|
| ML Platform | SageMaker | Vertex AI | Azure ML |
| LLM API | Bedrock | Vertex AI (Gemini) | Azure OpenAI |
| Managed Notebooks | SageMaker Studio | Vertex AI Workbench | Azure ML Studio |
| AutoML | SageMaker Autopilot | Vertex AutoML | Azure AutoML |
| Feature Store | SageMaker Feature Store | Vertex Feature Store | Azure ML Feature Store |
| Model Registry | SageMaker Model Registry | Vertex Model Registry | Azure ML Registry |
| Serverless Inference | Lambda, Fargate | Cloud Run, Cloud Functions | Azure Functions, Container Apps |
GPU Instance Types Compared
AWS GPU Instances:
- p3.2xlarge: 1x V100, 61 GiB RAM — small-scale training
- p4d.24xlarge: 8x A100 (320 GB total GPU memory) — large-scale distributed training
- p5.48xlarge: 8x H100, 2 TiB RAM — latest LLM training
GCP GPU Instances:
- n1-standard-8 + V100: cost-effective training
- a2-highgpu-8g: 8x A100 — standard choice for Vertex AI training
- a3-highgpu-8g: 8x H100 — latest large models
Azure GPU Instances:
- NC6s_v3: 1x V100 — development and testing
- ND96asr_v4: 8x A100 — large-scale training
- ND96amsr_A100_v4: 8x A100 80GB — maximum performance
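As a rough sanity check when choosing among these instances, you can estimate whether a model fits in aggregate GPU memory. A minimal sketch — the per-GPU memory figures and the ~18 bytes/parameter training estimate (weights + gradients + Adam states in mixed precision) are back-of-the-envelope assumptions, not vendor specs:

```python
# Per-GPU memory (GB); assumed common configurations:
# V100 16 GB, A100 40 GB, H100 80 GB.
GPU_MEMORY_GB = {'V100': 16, 'A100': 40, 'H100': 80}

# Assumed instance -> (GPU model, GPU count); illustrative, not exhaustive.
INSTANCES = {
    'p3.2xlarge': ('V100', 1),
    'p4d.24xlarge': ('A100', 8),
    'p5.48xlarge': ('H100', 8),
}

def training_memory_gb(num_params: float, bytes_per_param: int = 18) -> float:
    """Estimate training footprint: weights + grads + optimizer states."""
    return num_params * bytes_per_param / 1e9

def fits(instance: str, num_params: float) -> bool:
    """True if the estimated footprint fits in the instance's total GPU memory."""
    gpu, count = INSTANCES[instance]
    return training_memory_gb(num_params) <= GPU_MEMORY_GB[gpu] * count

# A 7B-parameter model (~126 GB estimated) needs a multi-GPU instance:
print(fits('p3.2xlarge', 7e9))   # False
print(fits('p4d.24xlarge', 7e9)) # True
```

The estimate ignores activations and framework overhead, so treat a "fits" result as a starting point, not a guarantee.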
Cost Optimization Strategies
Compute is typically the dominant share of cloud AI costs. Key saving strategies:
- Spot/Preemptible Instances: Up to 90% savings vs On-Demand. Checkpointing is essential.
- Reserved Instances / Committed Use: 40–60% savings with 1–3 year commitments. Best for long-running projects.
- Auto Scaling: Automatically adjust instance count based on inference traffic.
- Savings Plans (AWS): Flexible instance-type discounts tied to compute usage commitments.
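The spot trade-off above can be made concrete with a little arithmetic. A toy sketch — the hourly rate, discount, and interruption overhead are made-up illustration values:

```python
def effective_spot_cost(on_demand_hourly: float, hours: float,
                        spot_discount: float = 0.7,
                        interruptions: int = 0,
                        redo_hours_per_interruption: float = 0.5) -> float:
    """Expected spot cost: discounted rate, plus re-computed work lost
    since the last checkpoint for each interruption."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    total_hours = hours + interruptions * redo_hours_per_interruption
    return spot_hourly * total_hours

# 100-hour job at a hypothetical $32/hr on-demand rate ($3200 total):
spot = effective_spot_cost(32.0, 100, spot_discount=0.7, interruptions=4)
print(round(spot, 2))  # 979.2 -> ~69% cheaper even with 4 interruptions
```

This also shows why frequent checkpointing matters: the smaller the redo window per interruption, the closer the realized savings get to the headline discount.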
2. AWS AI/ML Services
SageMaker Core Features
Amazon SageMaker is AWS's integrated ML platform. It handles the full ML lifecycle — from data preparation to model deployment and monitoring — within a single service.
```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()

# SageMaker Training Job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./src',
    role=role,
    instance_type='ml.p4d.24xlarge',
    instance_count=4,
    framework_version='2.1.0',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'batch-size': 32,
        'learning-rate': 0.001
    },
    distribution={
        'torch_distributed': {'enabled': True}
    }
)

estimator.fit({'train': 's3://bucket/train', 'val': 's3://bucket/val'})
```
The distribution parameter enables PyTorch DDP-based distributed training across multiple nodes. Setting instance_count=4 with torch_distributed enabled automatically configures 4-node data-parallel training.
SageMaker Model Deployment
```python
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data='s3://bucket/model.tar.gz',
    role=role,
    framework_version='2.1.0',
    py_version='py310',
    entry_point='inference.py'
)

predictor = model.deploy(
    initial_instance_count=2,
    instance_type='ml.g4dn.xlarge',
    endpoint_name='my-pytorch-endpoint'
)

# Invoke the endpoint
result = predictor.predict({'inputs': 'Hello, cloud AI!'})
```
AWS Bedrock (LLM API)
AWS Bedrock provides access to multiple foundation models — Anthropic Claude, Meta Llama, Amazon Titan — through a single API.
```python
import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [{'role': 'user', 'content': 'Explain recent AI trends'}]
    })
)

result = json.loads(response['body'].read())
print(result['content'][0]['text'])
```
Serverless Inference with AWS Lambda
Lightweight models can be deployed as serverless functions.
```python
import json

def lambda_handler(event, context):
    body = json.loads(event['body'])
    input_data = body['input']
    # The model is loaded at module level (outside the handler) for warm reuse;
    # run_inference is a placeholder for your model's predict call.
    prediction = run_inference(input_data)
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction})
    }
```
3. GCP AI/ML Services
Vertex AI Training
Google Cloud's Vertex AI is a unified ML platform with tight BigQuery integration as a key differentiator.
```python
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomTrainingJob(
    display_name='pytorch-training',
    script_path='train.py',
    container_uri='us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-0:latest',
    requirements=['transformers', 'datasets']
)

model = job.run(
    dataset=None,
    machine_type='a2-highgpu-8g',
    accelerator_type='NVIDIA_TESLA_A100',
    accelerator_count=8,
    args=['--epochs=10', '--batch_size=32']
)
```
Vertex AI Model Deployment
```python
from google.cloud import aiplatform

# Upload model artifact
model = aiplatform.Model.upload(
    display_name='my-pytorch-model',
    artifact_uri='gs://bucket/model/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-0:latest'
)

# Create endpoint and deploy
endpoint = aiplatform.Endpoint.create(display_name='my-endpoint')

model.deploy(
    endpoint=endpoint,
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5
)
```
BigQuery ML
BigQuery ML lets you train and run predictions using SQL syntax — no Python required.
```sql
-- Train a classification model with BigQuery ML
CREATE OR REPLACE MODEL `dataset.fraud_model`
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  num_parallel_tree = 1,
  max_iterations = 50,
  input_label_cols = ['is_fraud']
) AS
SELECT * FROM `dataset.transactions_train`;

-- Evaluate model
SELECT *
FROM ML.EVALUATE(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.transactions_test`)
);

-- Run predictions
SELECT *
FROM ML.PREDICT(MODEL `dataset.fraud_model`,
  (SELECT * FROM `dataset.new_transactions`)
);
```
4. Azure AI Services
Azure Machine Learning
Azure ML is Microsoft's enterprise-grade ML platform with strong Active Directory integration and hybrid cloud support.
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUBSCRIPTION",
    resource_group_name="rg-ai",
    workspace_name="ai-workspace"
)

# Create GPU compute cluster
compute_config = AmlCompute(
    name="gpu-cluster",
    type="amlcompute",
    size="Standard_ND96asr_v4",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)
ml_client.compute.begin_create_or_update(compute_config).result()
```
Submitting an Azure ML Training Job
```python
from azure.ai.ml import Input
from azure.ai.ml.entities import Command

job = Command(
    code="./src",
    command="python train.py --epochs 10 --learning_rate 0.001",
    environment="AzureML-pytorch-2.0-ubuntu20.04-py38-cuda11-gpu@latest",
    compute="gpu-cluster",
    inputs={
        "train_data": Input(type="uri_folder", path="azureml://datastores/mydata/paths/train/")
    },
    display_name="pytorch-training-job"
)

returned_job = ml_client.jobs.create_or_update(job)
print(f"Job URL: {returned_job.studio_url}")
```
Azure OpenAI Service
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2024-02-01"
)

response = client.chat.completions.create(
    model="gpt-4o",  # the name of your Azure deployment, not the base model
    messages=[
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "What are the benefits of cloud AI?"}
    ]
)
print(response.choices[0].message.content)
```
5. Kubernetes for AI (EKS/GKE/AKS)
Kubernetes has become the standard for orchestrating large-scale AI workloads.
GPU Node Pool Setup
```hcl
# GKE GPU node pool (Terraform)
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-pool"
  cluster    = google_container_cluster.primary.name
  node_count = 2

  node_config {
    machine_type = "a2-highgpu-1g"

    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 1
    }

    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```
NVIDIA Device Plugin
GPU nodes also need the NVIDIA device plugin DaemonSet so Kubernetes can schedule against the nvidia.com/gpu resource:

```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```
Kubeflow Pipeline Definition
```python
import kfp
from kfp import dsl

@dsl.component(
    base_image='python:3.10',
    packages_to_install=['scikit-learn', 'pandas']
)
def train_model(
    data_path: str,
    model_path: dsl.OutputPath(str)
):
    import pickle
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv(data_path)
    X = df.drop('label', axis=1)
    y = df['label']

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

@dsl.pipeline(name='ml-pipeline')
def ml_pipeline(data_path: str = 'gs://bucket/data.csv'):
    train_task = train_model(data_path=data_path)
```
Auto Scaling with KEDA
KEDA (Kubernetes Event-driven Autoscaling) scales AI inference pods based on queue depth or HTTP request count.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/inference-queue
        queueLength: '5'
```
6. Serverless AI Inference
Serverless is a cost-effective choice for AI services with intermittent or unpredictable traffic.
Cold Start Optimization
ML model cold starts can take several seconds to tens of seconds. Ways to minimize them:
- Provisioned Concurrency (AWS Lambda): Keep pre-warmed instances ready at all times.
- Container Image Optimization: Remove unnecessary packages; use multi-stage builds.
- Model Quantization: Reduce model size by more than half with FP16/INT8.
- Module-level Loading: Initialize the model outside the handler function (global scope) so it is loaded once per container and reused across warm invocations.
```python
# Lambda cold start optimization pattern
import json
from transformers import pipeline

# Load model outside the handler (global scope).
# This code only runs once per container lifecycle.
classifier = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)

def lambda_handler(event, context):
    body = json.loads(event['body'])
    text = body['text']
    result = classifier(text)
    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }
```
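The quantization bullet above is mostly arithmetic: a parameter stored in FP16 takes 2 bytes instead of FP32's 4, and INT8 takes 1. A quick sketch (the 66M parameter count is an illustrative DistilBERT-scale figure, and the estimate ignores file headers and metadata):

```python
BYTES_PER_PARAM = {'fp32': 4, 'fp16': 2, 'int8': 1}

def model_size_mb(num_params: float, dtype: str) -> float:
    """Approximate weight size on disk/in memory for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# DistilBERT-scale model, ~66M parameters:
for dtype in ('fp32', 'fp16', 'int8'):
    print(dtype, model_size_mb(66e6, dtype))
# fp32 264.0, fp16 132.0, int8 66.0 -> FP16 halves it, INT8 quarters it
```

Smaller artifacts download and deserialize faster at cold start, which is why quantization helps serverless latency as well as cost.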
Container-based Serverless with AWS Fargate
Fargate runs containers without server management. Unlike Lambda's 15-minute timeout and 10 GB memory cap, Fargate tasks have no execution time limit and support much larger memory configurations, making them suitable for serving large models.
```json
{
  "family": "ai-inference-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "4096",
  "memory": "16384",
  "containerDefinitions": [
    {
      "name": "inference-container",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "environment": [{ "name": "MODEL_PATH", "value": "/opt/ml/model" }]
    }
  ]
}
```
7. Data Storage for AI
Object Storage Comparison
| Service | Provider | Key Feature |
|---|---|---|
| Amazon S3 | AWS | 11 nines durability, rich SDK ecosystem |
| Google Cloud Storage | GCP | Native BigQuery integration |
| Azure Blob Storage | Azure | Azure Data Lake Gen2 support |
Data Lake Architecture
A modern lakehouse pattern for AI data pipelines:
```
Raw Layer (Bronze)
 └── Ingest source data as-is
 └── Partition by: year/month/day/
Processed Layer (Silver)
 └── Cleaned, deduplicated, schema-enforced
 └── Parquet / Delta Lake format
Feature Layer (Gold)
 └── Feature engineering complete
 └── Registered in Feature Store
```
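The Bronze layer's date partitioning is usually just a key-path convention on object storage. A minimal sketch of building Hive-style partition prefixes (the bucket and prefix names are hypothetical):

```python
from datetime import date

def partition_prefix(base: str, day: date) -> str:
    """Hive-style year/month/day partition prefix for object storage keys."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

print(partition_prefix('s3://my-lake/raw/events', date(2024, 3, 7)))
# s3://my-lake/raw/events/year=2024/month=03/day=07/
```

Query engines such as Athena, BigQuery external tables, and Spark can prune partitions from these key prefixes, so scans touch only the days a query actually needs.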
Large Model Checkpoint Management
```python
import boto3
import torch

def save_checkpoint_to_s3(model, optimizer, epoch, loss, bucket, prefix):
    """Save model checkpoint to S3."""
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    local_path = f'/tmp/checkpoint_epoch_{epoch}.pt'
    torch.save(checkpoint, local_path)

    s3 = boto3.client('s3')
    s3_key = f'{prefix}/checkpoint_epoch_{epoch}.pt'
    s3.upload_file(local_path, bucket, s3_key)
    print(f'Checkpoint saved to s3://{bucket}/{s3_key}')
```
8. Cloud AI Monitoring
Model Performance Drift Detection
Production models degrade over time. SageMaker Model Monitor example:
```python
from sagemaker import get_execution_role
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = get_execution_role()

# Configure data capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=20,
    destination_s3_uri='s3://bucket/capture'
)

# Create model monitor
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600
)

# Create baseline from training data
monitor.suggest_baseline(
    baseline_dataset='s3://bucket/train_data.csv',
    dataset_format=DatasetFormat.csv(header=True)
)
```
Custom CloudWatch Metrics and Alarms
```python
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Publish a custom metric
cloudwatch.put_metric_data(
    Namespace='MLOps/ModelPerformance',
    MetricData=[
        {
            'MetricName': 'PredictionAccuracy',
            'Value': 0.94,
            'Unit': 'None',
            'Dimensions': [
                {'Name': 'ModelName', 'Value': 'fraud-detector-v2'},
                {'Name': 'Environment', 'Value': 'production'}
            ]
        }
    ]
)
```
9. MLOps on Cloud
GitHub Actions + AWS CodePipeline CI/CD
```yaml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main]

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Run SageMaker training
        run: |
          python scripts/run_training.py \
            --instance-type ml.p3.2xlarge \
            --output-path s3://bucket/models/

      - name: Deploy to staging
        run: |
          python scripts/deploy_model.py \
            --endpoint-name staging-endpoint \
            --instance-type ml.g4dn.xlarge
```
MLflow Model Registry with an S3 Artifact Store
```python
import mlflow
import mlflow.pytorch

# The tracking URI must point at a tracking server or local store
# (s3:// is not a valid tracking URI scheme); S3 serves as the
# artifact store behind the tracking server.
mlflow.set_tracking_uri('http://mlflow-server:5000')
mlflow.set_experiment('fraud-detection')

with mlflow.start_run():
    mlflow.log_params({
        'learning_rate': 0.001,
        'batch_size': 32,
        'epochs': 10
    })

    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader, optimizer)
        val_accuracy = evaluate(model, val_loader)
        mlflow.log_metrics({
            'train_loss': train_loss,
            'val_accuracy': val_accuracy
        }, step=epoch)

    mlflow.pytorch.log_model(model, 'model')
    mlflow.register_model(
        model_uri=f'runs:/{mlflow.active_run().info.run_id}/model',
        name='fraud-detector'
    )
```
Canary Deployment
```python
import boto3

sm = boto3.client('sagemaker')

# Route 10% of traffic to the new model
sm.update_endpoint_weights_and_capacities(
    EndpointName='production-endpoint',
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'current-model',
            'DesiredWeight': 90,
            'DesiredInstanceCount': 4
        },
        {
            'VariantName': 'new-model',
            'DesiredWeight': 10,
            'DesiredInstanceCount': 1
        }
    ]
)
```
10. Quiz
Q1. What is the correct way to enable PyTorch DDP distributed training in SageMaker?
Answer: Set the distribution parameter to {'torch_distributed': {'enabled': True}} and set instance_count to 2 or more.
Explanation: SageMaker's PyTorch Estimator supports multiple distributed training strategies via the distribution parameter. The torch_distributed option leverages PyTorch's native distributed training framework, and SageMaker automatically handles the inter-node communication setup.
Q2. What is the key difference between Spot instances and On-Demand instances?
Answer: Spot instances leverage AWS spare capacity at up to 90% discount vs On-Demand, but AWS can reclaim them with a 2-minute interruption notice when capacity is needed. On-Demand instances are always available at full price.
Explanation: When using Spot for ML training, checkpointing is mandatory. SageMaker supports automatic S3 checkpointing via CheckpointConfig, and supports automatic restart after interruption, making it practical for long training runs.
Q3. In BigQuery ML, what does the input_label_cols option in CREATE MODEL specify?
Answer: input_label_cols specifies the target column(s) the model should predict. These columns are automatically excluded from the feature set.
Explanation: BigQuery ML uses the results of a SQL query directly as training data. If input_label_cols is not set correctly, the target value will be included as a feature, causing data leakage and artificially inflated model accuracy.
Q4. What is the primary purpose of KEDA in a Kubernetes AI inference setup?
Answer: KEDA provides event-driven autoscaling. It scales pods based on external event sources such as SQS queue depth, Kafka consumer lag, or HTTP request counts — unlike the standard HPA which only reacts to CPU and memory usage.
Explanation: For AI inference services, scaling based on actual workload queues is far more responsive than CPU-based scaling. KEDA can also scale pods down to zero during idle periods, eliminating idle compute costs.
Q5. When using MLflow with S3, what goes in mlflow.set_tracking_uri versus the artifact store?
Answer: mlflow.set_tracking_uri must point to a tracking store — a local path, a database URI, or a tracking server such as http://mlflow-server:5000. s3:// is not a valid tracking URI scheme; S3 is configured as the artifact store, for example via the tracking server's --default-artifact-root s3://bucket/prefix.
Explanation: MLflow separates experiment metadata (tracking store) from large artifacts such as model files (artifact store). S3 is well suited to the latter, but metadata requires a file or database backend. On EC2 or SageMaker, IAM roles handle S3 authentication for artifact uploads automatically.