Chaos and Order

This post is organized into four major parts.

IT Outage Prevention - Common concerns, core defense strategies, SRE practices
Lab Environment Setup - Effective hands-on practice in local and cloud environments
Post-Exercise Nutrition - Golden window and carbohydrate/protein ratios
Global Table Tennis News - WTT tournament structure, notable players by country, recent results

Part 1: IT Outage Prevention

1. What IT Engineers Worry About

If you are an IT operations engineer, you have likely received an emergency call in the middle of the night at least once. System outages arrive without warning, and their impact ripples across the entire business.

Service Outage

The most feared situation. When users cannot access a service or core functionality fails, it directly translates to revenue loss. For large-scale services, losses can reach millions per minute.

Deployment Failure

New code is deployed and existing functionality breaks. Precious time passes while the team decides whether to roll back or apply a hotfix.

Security Incident

Data breaches, unauthorized access, DDoS attacks -- security incidents carry not just technical problems but also legal liability. With strengthened data protection laws, the cost of security incidents continues to grow.

Technical Debt

It works for now, but it is a ticking time bomb. When legacy system dependencies, undocumented configuration values, and untested code accumulate, identifying the root cause during an outage becomes extremely difficult.

2. Core Elements of Outage Prevention

You cannot completely prevent outages, but you can reduce their frequency and minimize their impact.

2-1. High Availability

Eliminating single points of failure (SPOF) is the first principle.

# HAProxy High Availability configuration example
frontend http_front
  bind *:80
  default_backend app_servers

backend app_servers
  balance roundrobin
  option httpchk GET /health
  server app1 10.0.1.10:8080 check inter 5s fall 3 rise 2
  server app2 10.0.1.11:8080 check inter 5s fall 3 rise 2
  server app3 10.0.1.12:8080 check inter 5s fall 3 rise 2

Active-Active: All nodes handle traffic. Provides both performance and availability.
Active-Standby: A standby node automatically takes over when failure occurs. Commonly used for databases.
Multi-Region: Prepares for region-level outages. Uses DNS-based failover.
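
The `check inter 5s fall 3 rise 2` settings in the HAProxy example mean a server leaves rotation after three consecutive failed checks and returns after two consecutive successes. A minimal Python sketch of that state machine follows; the `BackendHealth` class and `healthy_backends` helper are hypothetical names for illustration, not HAProxy internals.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's state using fall/rise thresholds."""
    name: str
    fall: int = 3      # consecutive failures before marking the backend DOWN
    rise: int = 2      # consecutive successes before marking it UP again
    up: bool = True
    _streak: int = 0   # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result and return the resulting state."""
        if check_passed == self.up:
            # The result agrees with the current state, so reset the streak.
            self._streak = 0
            return self.up
        self._streak += 1
        threshold = self.rise if check_passed else self.fall
        if self._streak >= threshold:
            self.up = check_passed
            self._streak = 0
        return self.up


def healthy_backends(backends):
    """Return the names of backends currently eligible for traffic."""
    return [b.name for b in backends if b.up]


app1 = BackendHealth("app1")
app2 = BackendHealth("app2")

# app1 fails three checks in a row and is taken out of rotation.
for result in (False, False, False):
    app1.record(result)
print(healthy_backends([app1, app2]))

# Two consecutive successes bring it back.
for result in (True, True):
    app1.record(result)
print(healthy_backends([app1, app2]))
```

Requiring several consecutive results before changing state keeps a single slow response from flapping a server in and out of the pool.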

2-2. Monitoring

Real-time awareness of system state is essential. Track the four golden signals described in Google's SRE book.

Signal	Description	Example Metric
Latency	Time to process a request	p50, p95, p99 response time
Traffic	Demand on the service	Requests per second (RPS)
Error Rate	Ratio of failed requests	HTTP 5xx rate
Saturation	Resource utilization	CPU, memory, disk usage
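
To make the table concrete, here is a rough sketch of how latency percentiles, traffic, and error rate could be derived from raw request samples. The `percentile` and `summarize` functions and the sample data are assumptions for illustration; a real system would get these from its metrics backend, and saturation needs host-level metrics that are not modeled here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize (latency_ms, http_status) tuples observed during one window."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


# Hypothetical sample: 100 requests in 10 seconds, the two slowest returned 5xx.
samples = [(ms, 500 if ms > 98 else 200) for ms in range(1, 101)]
print(summarize(samples, window_seconds=10))
```

Percentiles matter because an average hides the slow tail: here the median is 50 ms while the p99 is nearly twice that.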

# Prometheus alerting rule example
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Aggregate with sum() so the 5xx series divide by the total across all status codes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected - exceeds 5%"

2-3. Alerting

Intelligent alerting systems based on monitoring data are necessary.

Alert Fatigue Prevention: Tier alerts by severity.
P1 (Critical): Immediate response. PagerDuty call, SMS.
P2 (Warning): Check within 30 minutes. Slack channel notification.
P3 (Info): Check during business hours. Email or dashboard.
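
The tiers above amount to a routing table. A small sketch, assuming hypothetical channel names and a `route_alert` helper, might look like this; treating unknown severities as P1 is a design assumption, not something the tiering prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    channels: Tuple[str, ...]
    respond_within_minutes: Optional[int]  # None means "during business hours"


# Channel names are placeholders for whatever integrations a team actually uses.
ROUTES = {
    "P1": Route(("pagerduty", "sms"), 0),
    "P2": Route(("slack",), 30),
    "P3": Route(("email", "dashboard"), None),
}


def route_alert(severity):
    """Pick the notification route for a severity label."""
    try:
        return ROUTES[severity]
    except KeyError:
        # Assumption: an unlabeled alert escalates instead of disappearing silently.
        return ROUTES["P1"]


print(route_alert("P2"))
print(route_alert("unlabeled"))
```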

2-4. Logging

Adopting Structured Logging dramatically speeds up root cause analysis.

{
  "timestamp": "2026-04-11T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "Payment processing failed",
  "user_id": "user_789",
  "error_code": "GATEWAY_TIMEOUT",
  "duration_ms": 30000
}

Tools like the ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki enable centralized log management.
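
One way to emit log lines shaped like the JSON above is a custom formatter on Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under a `context` key are all assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} arrive as record.context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


log = build_logger("payment-api")
log.error(
    "Payment processing failed",
    extra={"context": {
        "trace_id": "abc123def456",
        "user_id": "user_789",
        "error_code": "GATEWAY_TIMEOUT",
        "duration_ms": 30000,
    }},
)
```

Because every line is valid JSON with a `trace_id`, a log backend can filter on fields instead of grepping free text.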

2-5. Backup

Follow the 3-2-1 rule.

3 copies of data
2 different media types
1 offsite (remote) copy
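
The rule is easy to check mechanically. The sketch below uses a hypothetical `BackupCopy` record and `check_3_2_1` function; the locations and media labels are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # e.g. "primary-db", "nas", "s3"
    media: str      # e.g. "ssd", "hdd", "object-storage"
    offsite: bool


def check_3_2_1(copies):
    """Return a list of rule violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies share one media type, need at least 2")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


copies = [
    BackupCopy("primary-db", "ssd", offsite=False),
    BackupCopy("nas", "hdd", offsite=False),
    BackupCopy("s3", "object-storage", offsite=True),
]
print(check_3_2_1(copies))       # compliant
print(check_3_2_1(copies[:2]))   # missing the third, offsite copy
```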

#!/bin/bash
# PostgreSQL automated backup script example
set -euo pipefail  # abort on any failure so a broken dump is never uploaded
BACKUP_DIR="/backup/postgres"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="production_db"

pg_dump -h localhost -U postgres -Fc "$DB_NAME" > "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump"

# Clean up backups older than 30 days
find "$BACKUP_DIR" -name "*.dump" -mtime +30 -delete

# Upload to S3
aws s3 cp "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump" \
  "s3://my-backup-bucket/postgres/${DB_NAME}_${TIMESTAMP}.dump"

echo "Backup complete: ${DB_NAME}_${TIMESTAMP}.dump"

2-6. Canary Deployment

Route only a portion of total traffic to the new version to detect issues early.

# Kubernetes Canary Deployment example (Argo Rollouts)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause:
            duration: 5m
        - setWeight: 30
        - pause:
            duration: 10m
        - setWeight: 60
        - pause:
            duration: 10m
        - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1
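
The control flow behind a rollout like this can be sketched in a few lines of Python: raise the traffic weight step by step, run an analysis at each step, and roll back on the first failure. The `run_canary` function is hypothetical, and the 5% threshold is borrowed from the alert rule earlier in this post rather than from the `error-rate-check` template, which is not shown here.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.05):
    """Walk through traffic weights, aborting on the first failed analysis.

    steps: canary traffic percentages, e.g. [10, 30, 60, 100]
    error_rate_at: callable returning the observed canary error rate at a weight
    Returns (final_weight, status); final_weight is 0 after a rollback.
    """
    if not steps:
        raise ValueError("at least one step is required")
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            return 0, f"aborted at {weight}%: error rate {observed:.1%}"
        # A real controller would pause here before moving to the next weight.
    return steps[-1], "promoted"


steps = [10, 30, 60, 100]

# Healthy release: the error rate stays at 1% for every weight.
print(run_canary(steps, lambda weight: 0.01))

# Faulty release: errors only show up once 60% of traffic reaches the canary.
print(run_canary(steps, lambda weight: 0.01 if weight < 60 else 0.12))
```

The second case shows why later steps still matter: some defects only appear under load that a 10% canary never sees.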

2-7. Chaos Engineering

Intentionally inject failures to verify system resilience. Netflix's Chaos Monkey is the classic example.

Network latency injection: Add artificial delays between services.
Instance termination: Randomly kill servers to verify automatic recovery.
Disk full: Exhaust storage space to verify response systems.

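# Litmus Chaos experiment example - Pod deletion
# Warning: always run in a staging environment first
# Note: illustrative command only. Litmus experiments are normally declared as
# ChaosEngine resources, so check the Litmus documentation for the exact invocation.
litmus chaos run pod-delete \
  --namespace staging \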
  --app-label app=payment-service \
  --total-chaos-duration 30 \
  --chaos-interval 10
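
The same idea can be tried in plain code before touching a cluster. The sketch below wraps a call with injected latency and failures, then checks whether a simple retry policy absorbs them. `ChaosProxy` and `call_with_retry` are hypothetical helpers, and the 30% failure rate is an arbitrary value chosen for the experiment.

```python
import random
import time


class ChaosProxy:
    """Wraps a callable and injects latency or failures at configured rates."""

    def __init__(self, target, failure_rate=0.0, latency_seconds=0.0,
                 seed=None, sleep=time.sleep):
        self.target = target
        self.failure_rate = failure_rate
        self.latency_seconds = latency_seconds
        self._rng = random.Random(seed)  # seeded so experiments are repeatable
        self._sleep = sleep

    def __call__(self, *args, **kwargs):
        if self.latency_seconds:
            self._sleep(self.latency_seconds)
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("chaos: injected failure")
        return self.target(*args, **kwargs)


def call_with_retry(func, attempts=3):
    """Call func, retrying on ConnectionError up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


# Experiment: with 30% injected failures, how often do three attempts still fail?
flaky = ChaosProxy(lambda: "ok", failure_rate=0.3, seed=42)
failures = 0
for _ in range(1000):
    try:
        call_with_retry(flaky)
    except ConnectionError:
        failures += 1
print(f"requests failed after retries: {failures} / 1000")
```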

3. SRE Practices

Google's SRE (Site Reliability Engineering) methodology offers a practical framework for running services reliably.

3-1. SLO / SLI / SLA

Three concepts that must be clearly distinguished.

SLI (Service Level Indicator): A measured metric of service quality. For example, "the proportion of requests that respond within 200ms."
SLO (Service Level Objective): Internally set targets. Based on SLIs, such as "99.95% monthly availability."
SLA (Service Level Agreement): A contract with customers. Set slightly looser than SLOs to ensure buffer.

# SLI calculation example
def calculate_availability_sli(total_requests, successful_requests):
    """Calculate availability SLI."""
    if total_requests == 0:
        return 100.0
    return (successful_requests / total_requests) * 100

# Example: 999,500 out of 1,000,000 requests successful
sli = calculate_availability_sli(1_000_000, 999_500)
print(f"Availability SLI: {sli:.2f}%")  # 99.95%

3-2. Error Budget

The amount of unreliability the SLO permits, that is, the gap between 100% and the SLO target.

If the SLO is 99.95%, 0.05% downtime per month is allowed.
Over a 30-day month (43,200 minutes), that translates to about 21.6 minutes.
When the error budget is exhausted, halt new feature deployments and focus on stability.
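
The arithmetic is short enough to script. This sketch assumes a 30-day window; the function names and the 15 minutes of downtime in the usage lines are made up for the example.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Total downtime, in minutes, that the SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - downtime_minutes) / budget


def can_deploy(slo_percent, downtime_minutes, window_days=30):
    """Feature deployments continue only while budget remains."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0


print(f"{error_budget_minutes(99.95):.1f} minutes")  # 21.6 minutes
print(can_deploy(99.95, downtime_minutes=15))         # True
print(can_deploy(99.95, downtime_minutes=25))         # False
```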

This approach gives development and operations teams a shared, objective basis for balancing feature velocity against stability.

3-3. Postmortem

An analysis that must be performed after an outage. The core principle is a Blameless Culture.

Items to include in a postmortem document:

Summary: Concise description of what happened
Impact: Number of affected users, duration, revenue loss
Timeline: Minute-by-minute event progression
Root Cause Analysis: Use techniques like 5-Whys to identify the real cause
Action Items: List with owners and deadlines
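
Keeping the template as a data structure makes it easy to spot an incomplete postmortem before review. The `Postmortem` and `ActionItem` classes below are a hypothetical sketch of the sections listed above, not a standard schema, and the incident details are invented.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    summary: str
    impact: str
    timeline: list = field(default_factory=list)       # (time, event) tuples
    root_cause: str = ""
    action_items: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that are still empty, plus action items without an owner."""
        sections = {
            "summary": self.summary,
            "impact": self.impact,
            "timeline": self.timeline,
            "root_cause": self.root_cause,
            "action_items": self.action_items,
        }
        missing = [name for name, value in sections.items() if not value]
        missing += [
            f"owner for action item: {item.description}"
            for item in self.action_items
            if not item.owner
        ]
        return missing


draft = Postmortem(
    summary="Payment API returned gateway timeouts",
    impact="Checkout unavailable for 18 minutes",
    timeline=[("10:30", "First alert fired"), ("10:48", "Rollback completed")],
    action_items=[ActionItem("Add timeout to gateway client", owner="", due=date(2026, 5, 1))],
)
print(draft.missing_sections())
```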

Part 2: Lab Environment Setup

4. Local Lab Environment

Theory alone is not enough. You must get hands-on to truly learn.

4-1. Docker Compose

The lowest-barrier lab tool. Let us build a monitoring stack locally.

# docker-compose.monitoring.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123  # lab only; never reuse outside a local sandbox
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    restart: unless-stopped

This setup alone lets you experience the Prometheus + Grafana + Alertmanager monitoring stack immediately.
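
The compose file mounts `./prometheus.yml` and `./alertmanager.yml`, which must exist before the stack starts. Below is a sketch of a helper that writes minimal versions of both. The scrape interval, job names, and the empty `default` receiver are assumptions chosen to match the service names above; adjust them for a real setup.

```python
from pathlib import Path

# Minimal Prometheus config: scrape itself and node-exporter, send alerts to Alertmanager.
PROMETHEUS_CONFIG = """\
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
"""

# Minimal Alertmanager config: one receiver with no integrations attached yet.
ALERTMANAGER_CONFIG = """\
route:
  receiver: default
receivers:
  - name: default
"""


def write_configs(directory="."):
    """Write both config files unless they already exist; return the paths written."""
    written = []
    files = {
        "prometheus.yml": PROMETHEUS_CONFIG,
        "alertmanager.yml": ALERTMANAGER_CONFIG,
    }
    for name, content in files.items():
        path = Path(directory) / name
        if not path.exists():
            path.write_text(content)
            written.append(path)
    return written
```

Calling `write_configs()` from the directory that holds the compose file, before running `docker compose up`, gives Prometheus something to scrape on first start. The hostnames work because Compose services can reach each other by service name on the default network.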

4-2. minikube / kind

For Kubernetes practice, minikube or kind (Kubernetes in Docker) is ideal.

# Create a cluster with kind
kind create cluster --name sre-lab --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

# Check cluster status
kubectl cluster-info --context kind-sre-lab
kubectl get nodes

kind runs Kubernetes inside Docker containers, so you can set up a multi-node cluster even on a laptop. Ideal for hands-on practice with high availability, rolling updates, and canary deployments.

5. Cloud Lab Environment

5-1. AWS Free Tier / GCP Free Trial

Practicing in cloud environments is also important. Key free options:

Cloud	Free Option	Key Services
AWS Free Tier	12 months free	EC2 t2.micro, RDS, S3
GCP Free Trial	90 days + 300 USD credit	GCE, GKE, Cloud Run
Azure Free	12 months + 200 USD credit	VM, AKS, App Service

5-2. Infrastructure as Code with Terraform

Managing lab environments as code makes them reproducible anytime.

# main.tf - AWS infrastructure for SRE lab
provider "aws" {
  region = "ap-northeast-2"
}

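resource "aws_vpc" "sre_lab" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = {
    Name        = "sre-lab-vpc"
    Environment = "lab"
  }
}

# Subnet referenced by the instances below.
# Note: internet access also needs an internet gateway and route table, omitted here.
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.sre_lab.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true

  tags = {
    Name = "sre-lab-public-subnet"
  }
}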

resource "aws_instance" "monitoring" {
  ami           = "ami-0c9c942bd7bf113a2"
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id

  tags = {
    Name = "monitoring-server"
    Role = "prometheus-grafana"
  }
}

resource "aws_instance" "app" {
  count         = 2
  ami           = "ami-0c9c942bd7bf113a2"
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id

  tags = {
    Name = "app-server-${count.index + 1}"
    Role = "application"
  }
}

6. Effective Lab Curriculum Design

A step-by-step approach maximizes learning effectiveness.

Stage 1: Basics (1-2 weeks)

Build a web service + DB with Docker Compose
Connect Prometheus + Grafana monitoring
Set up basic alerting rules

Stage 2: Intermediate (3-4 weeks)

Deploy services to a Kubernetes cluster
Practice rolling updates and canary deployments
Centralize logs with the ELK stack

Stage 3: Advanced (5-8 weeks)

Build multi-environment infrastructure with Terraform
Integrate SLO monitoring into CI/CD pipelines
Design and execute chaos engineering experiments
Practice writing postmortem documents

Key tip: Intentionally create failures at each stage. Disconnect the DB, limit memory, inject network latency -- these experiments let you directly experience system vulnerabilities and recovery processes.

Part 3: Post-Exercise Nutrition

7. Post-Exercise Golden Window and Nutrition Strategy

IT work involves long hours of sitting, making regular exercise important. After whole-body aerobic exercise like table tennis, proper nutrition is key to recovery and fitness maintenance.

7-1. Golden Window

The 30 minutes to 1 hour after exercise is the "golden window" for nutrition. During this period, muscles are most sensitive to glycogen synthesis and protein synthesis, so providing proper nutrients significantly speeds recovery.

7-2. Carbohydrate to Protein Ratio

Adjust the ratio based on exercise type.

Exercise Type	Carb : Protein	Example Foods
Aerobic (table tennis, running)	3:1 to 4:1	Banana + Greek yogurt
Resistance (weights)	2:1 to 3:1	Sweet potato + chicken breast
Combined (circuit)	3:1	Brown rice + salmon
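
The ratios turn into gram targets with simple multiplication. In the sketch below, the `snack_targets` helper is hypothetical and the 20 g of protein is only an illustrative input, not a recommendation.

```python
# Carb:protein ratio ranges from the table above, as (low, high) multipliers.
RATIOS = {
    "aerobic": (3, 4),
    "resistance": (2, 3),
    "combined": (3, 3),
}


def snack_targets(exercise_type, protein_grams):
    """Return the (min, max) carbohydrate grams for a given protein amount."""
    low, high = RATIOS[exercise_type]
    return protein_grams * low, protein_grams * high


# Example: 20 g of protein after a table tennis session.
print(snack_targets("aerobic", 20))      # (60, 80)
print(snack_targets("resistance", 20))   # (40, 60)
```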

7-3. Supplements

A balanced diet is most important, but some supplements can be considered.

BCAA (Branched-Chain Amino Acids): Helps reduce muscle breakdown during exercise.
Creatine: Assists energy supply during high-intensity exercise.
Vitamin D: Especially important for office workers. Involved in immune function and muscle maintenance.
Electrolyte supplementation: Replace sodium, potassium, and magnesium after heavy sweating.

Note: Supplements are exactly that -- supplementary. A balanced diet comes first, and it is best to consult a professional based on your individual health condition.

Part 4: Global Table Tennis News Analysis

8. Understanding the WTT Tournament Structure

WTT (World Table Tennis), launched in 2021, is the commercial partner of the International Table Tennis Federation (ITTF), operating professional table tennis tournaments. Tournaments vary in prize money and ranking points by tier.

Tournament Tier System

Grand Smash

The highest-tier events with the most prize money and ranking points.
In 2025, the US Smash was the first Grand Smash held on American soil.

Champions

Invitational events featuring the top 32 players.
The 2026 season includes events in Doha, Chongqing (China), and Yokohama (Japan).

Star Contender

Upper-mid tier events, key stages for securing world ranking points.

Contender

Base-tier events with diverse player participation.

Finals

Season-ending events for top-ranked players based on cumulative results.
The 2025 WTT Finals were held in Hong Kong.

Notable 2026 Events

WTT Champions Doha (January 2026) - First Champions event of the season
WTT Champions Chongqing (March 2026) - Champions event in China
WTT Singapore Smash (February-March 2026) - Asia Grand Smash
ITTF World Team Championships London (April 28 - May 10, 2026) - 100th anniversary event

The 2026 London World Team Championships is a historic event celebrating ITTF's 100th anniversary, held at OVO Arena Wembley.

9. Notable Players by Country

South Korea

Shin Yubin - Paris 2024 Olympic mixed doubles bronze medalist and the ace of Korean table tennis. Won women's doubles bronze at the 2025 World Championships with Yoo Hanna -- Korea's first women's doubles medal in 16 years. Reached the semifinals at the 2026 ITTF World Cup in Macau.

Jang Woojin - Korea's top men's player. Won the 2025 Korean Pro League Series 2 and King of Kings title, and clinched the men's singles title at the first event of 2026. Runner-up at WTT Champions Doha 2026.

Lim Jonghoon - Forms a dominant mixed doubles partnership with Shin Yubin. Won the 2025 WTT Finals in Hong Kong mixed doubles 3-0 over Wang Chuqin/Sun Yingsha.

Japan

Harimoto Tomokazu - Japan's top men's player, maintaining a world ranking among the elite. Remains a core member of the Japanese team despite an upset loss at the 2025 World Championships.

Hayata Hina - Won singles bronze at the 2024 Paris Olympics and has established herself as Japan's women's ace, consistently ranking among the world's best.

Shinozuka Hiroto / Togami Shunsuke - Won the 2025 World Championships men's doubles, bringing Japan its first men's doubles world title in 64 years.

China

Wang Chuqin - A top-ranked player in men's table tennis. Won the 2025 World Championships men's singles and three consecutive mixed doubles titles with Sun Yingsha. Won WTT Champions Chongqing 2026.

Lin Shidong - Rose rapidly from late 2024 to reach world No. 1 in early 2025. Maintains stable top ranking.

Sun Yingsha - China's women's ace. Won the 2025 World Championships women's singles. Lost the WTT Finals 2025 mixed doubles final 0-3 to Korea's Lim/Shin pair.

Fan Zhendong - 2024 Paris Olympic men's singles gold medalist and long-time world No. 1.

Europe and Americas

Hugo Calderano (Brazil) - Making history in South American table tennis. Won the 2025 ITTF World Cup by defeating Harimoto, Wang Chuqin, and Lin Shidong in succession. Reached world No. 2 in February 2026.

Truls Moregard (Sweden) - Won the 2025 WTT European Smash in Sweden, becoming the first European Grand Smash champion by defeating world No. 1 Lin Shidong. Currently ranked around No. 2-3 globally.

Alexis Lebrun (France) - Won back-to-back titles at the 2026 European Top 16 event. Holds the world doubles No. 1 ranking with brother Felix.

Felix Lebrun (France) - Became a national hero at the 2024 Paris Olympics. Won his first French national championship in 2025.

Kanak Jha (USA) - Achieved the best-ever result for a US men's player at the 2024 Paris Olympics (Round of 16). Rose to world No. 26 after winning the 2025 Pan American Cup.

Lily Zhang (USA) - Four-time Olympian and the top US women's player, ranked around 35th globally.

10. Recent Tournament Results Analysis

WTT Champions Doha 2026 (January 2026)

In the first Champions event of the 2026 season, Taiwan's Lin Yun-Ju defeated Jang Woojin 4-0 in the men's singles final. China's Zhu Yuling won the women's singles.

WTT Champions Chongqing 2026 (March 2026)

Wang Chuqin defeated Lin Shidong 4-1 (11-5, 6-11, 11-7, 11-5, 11-6) in the final to claim his second Champions title of the season.

2025 World Championships Doha (May 2025)

Men's Singles: Wang Chuqin defeated Calderano 4-1 to win
Women's Singles: Sun Yingsha won
Men's Doubles: Shinozuka-Togami (Japan) brought Japan its first men's doubles gold in 64 years
Mixed Doubles: Wang Chuqin-Sun Yingsha won for the third consecutive year

2025 WTT Finals Hong Kong (December 2025)

Korea's Lim Jonghoon-Shin Yubin dominated the mixed doubles final against Wang Chuqin-Sun Yingsha 3-0.

2026 ITTF World Cup Macau (March-April 2026)

Jang Woojin lost in the semifinals to Japan's Matsushima Sora 1-4, and Shin Yubin lost to Wang Manyu 2-4 in the semifinals. Breaking through the semifinal barrier remains a challenge for Korean table tennis.

Conclusion: The Commonality Between IT Reliability and Table Tennis

IT system reliability and table tennis share common ground.

Preparation is everything. Just as the ready position for receiving serves is critical in table tennis, proactive monitoring and alerting systems are the core of outage response in IT.
Repetitive practice builds skill. Like multi-ball drills in table tennis, repeatedly experiencing failure scenarios in lab environments is needed for rapid real-world response.
Teamwork matters. Just as doubles requires great chemistry with your partner, outage response requires collaboration between development and operations teams.
Record and analyze. Just as table tennis players study opponents' match videos, postmortems analyze outages and identify improvements.

Equip yourself with high availability, monitoring, and systematic SRE practices, and build endurance through consistent lab practice. And every now and then, pick up a table tennis paddle and enjoy sports too. Healthy code comes from a healthy body.
