Split View: [AWS] Kinesis 실전 아키텍처: Kafka 비교와 스트리밍 패턴

[AWS] Kinesis 실전 아키텍처: Kafka 비교와 스트리밍 패턴

1. Kinesis vs Apache Kafka 상세 비교

1.1 아키텍처 차이

Apache Kafka와 AWS Kinesis는 모두 실시간 데이터 스트리밍을 위한 플랫폼이지만, 근본적인 아키텍처 철학이 다릅니다.

Apache Kafka 아키텍처
======================

+----------+     +------------------------------------------+
| Producer | --> | Kafka Cluster                             |
+----------+     |                                          |
                 |  Broker 1    Broker 2    Broker 3        |
                 |  +-------+  +-------+  +-------+        |
                 |  |Topic A|  |Topic A|  |Topic A|        |
                 |  |Part 0 |  |Part 1 |  |Part 2 |        |
                 |  |       |  |       |  |       |        |
                 |  |Topic B|  |Topic B|  |Topic B|        |
                 |  |Part 0 |  |Part 1 |  |Part 2 |        |
                 |  +-------+  +-------+  +-------+        |
                 |                                          |
                 |  KRaft (Kafka 4.0+, ZooKeeper 제거)      |
                 +------------------------------------------+
                         |
                         v
                 +----------+
                 | Consumer |
                 | Group    |
                 +----------+

AWS Kinesis Data Streams 아키텍처
===================================

+----------+     +------------------------------------------+
| Producer | --> | Kinesis Stream (완전관리형)                |
+----------+     |                                          |
                 |  Shard 1     Shard 2     Shard 3         |
                 |  +-------+  +-------+  +-------+        |
                 |  |Records|  |Records|  |Records|        |
                 |  |1MB/s W|  |1MB/s W|  |1MB/s W|        |
                 |  |2MB/s R|  |2MB/s R|  |2MB/s R|        |
                 |  +-------+  +-------+  +-------+        |
                 |                                          |
                 |  AWS 완전관리 (인프라 관리 불필요)        |
                 +------------------------------------------+
                         |
                         v
                 +----------+
                 | Consumer |
                 | (KCL)    |
                 +----------+

1.2 종합 비교표

비교 항목	AWS Kinesis	Apache Kafka (자체 운영)	Amazon MSK
관리 방식	완전관리형	자체 운영	관리형 Kafka
확장 단위	샤드	파티션 + 브로커	브로커
샤드/파티션당 쓰기	1 MB/s	제한 없음 (디스크 I/O)	디스크 I/O 의존
샤드/파티션당 읽기	2 MB/s (공유)	제한 없음	디스크 I/O 의존
최대 보존 기간	365일	무제한	무제한
순서 보장	샤드 내	파티션 내	파티션 내
레코드 최대 크기	1 MB	기본 1 MB (설정 변경 가능)	기본 1 MB
컨슈머 모델	KCL, 향상된 팬아웃	컨슈머 그룹	컨슈머 그룹
프로토콜	HTTPS/HTTP2	자체 TCP 프로토콜	자체 TCP
에코시스템	AWS 서비스 통합	Kafka Connect, Schema Registry 등	Kafka 에코시스템
운영 복잡도	낮음	높음	중간
초기 설정	수 분	수 시간 ~ 수 일	수십 분

1.3 처리량과 지연 시간

처리량 비교
============

Kinesis (프로비저닝):
  쓰기: 1 MB/s x 샤드 수
  읽기: 2 MB/s x 샤드 수 (공유)
        2 MB/s x 샤드 수 x 컨슈머 수 (향상된 팬아웃)

  예시: 100 샤드
  쓰기: 100 MB/s
  읽기: 200 MB/s (공유) 또는 200 MB/s x N (향상된 팬아웃)

Kafka (자체 운영):
  브로커 성능에 의존
  단일 브로커: 수백 MB/s 가능
  클러스터: 수 GB/s 처리 가능

  예시: 6 브로커 클러스터
  쓰기: 600+ MB/s
  읽기: 수 GB/s

지연 시간 비교
===============
Kinesis:
  - PutRecord: 수십 ms
  - GetRecords (공유 팬아웃): ~200 ms
  - Enhanced Fan-Out: ~70 ms

Kafka:
  - 프로듀서 -> 컨슈머: ~2-10 ms (네트워크 의존)
  - 엔드투엔드: ~10-50 ms

1.4 비용 비교

월간 비용 추정 (미국 동부 기준)
================================

시나리오: 10 MB/s 지속적 데이터 수집, 3개 컨슈머

Kinesis (프로비저닝 모드):
  - 10 샤드 필요 (10 MB/s / 1 MB/s per shard)
  - 샤드 비용: 10 x 0.015 x 720 시간 = ~108 USD
  - PUT 유닛: ~360 USD (약 25.9B 유닛/월)
  - 향상된 팬아웃 (3 컨슈머): ~324 USD
  합계: ~792 USD/월

Kinesis (온디맨드 Advantage):
  - 데이터 쓰기: ~25.9 TB x 0.032 = ~829 USD
  - 데이터 읽기: ~25.9 TB x 3 x 0.016 = ~1,243 USD
  합계: ~2,072 USD/월

Kafka (EC2 자체 운영):
  - 3 브로커 (m5.xlarge): 3 x 140 = ~420 USD
  - EBS 스토리지 (1TB x 3): ~300 USD
  - 운영 인력 비용: 별도
  합계: ~720 USD/월 + 운영 비용

Amazon MSK:
  - 3 브로커 (kafka.m5.large): ~456 USD
  - 스토리지: ~300 USD
  합계: ~756 USD/월

1.5 언제 무엇을 선택할 것인가

Kinesis를 선택하는 경우:

AWS 환경에 깊이 통합된 아키텍처
운영 부담을 최소화하고 싶을 때
Lambda, Firehose 등 AWS 서비스와 직접 연동이 필요할 때
중소 규모 처리량 (수십 MB/s 이하)
빠른 프로토타이핑이 필요할 때

Kafka를 선택하는 경우:

매우 높은 처리량이 필요할 때 (수 GB/s)
멀티 클라우드 또는 하이브리드 환경
Kafka Connect 에코시스템이 필요할 때
초저지연이 요구될 때 (수 ms)
무제한 데이터 보존이 필요할 때

2. Kinesis vs SQS: 언제 무엇을 사용할 것인가

비교 항목	Kinesis Data Streams	Amazon SQS
데이터 처리 모델	스트리밍 (연속 처리)	메시지 큐 (개별 처리)
컨슈머 수	다중 컨슈머 동시 처리	기본적으로 단일 컨슈머
순서 보장	샤드 내 보장	FIFO 큐에서만 보장
데이터 보존	24시간 ~ 365일	최대 14일
데이터 리플레이	가능 (시퀀스 번호 기반)	불가 (처리 후 삭제)
처리량	샤드당 1 MB/s 쓰기	거의 무제한
메시지 크기	최대 1 MB	최대 256 KB
지연 시간	밀리초	밀리초
비용 모델	샤드 시간 + 데이터 전송	요청 수 기반
주요 용도	실시간 분석, 로그 수집	마이크로서비스 디커플링

사용 시나리오 결정 플로우차트
================================

데이터 처리 요구사항 분석
         |
    +----+----+
    |         |
같은 데이터를   메시지를 한 번만
여러 컨슈머가   처리하면 되는가?
읽어야 하는가?       |
    |              SQS
    |
실시간 순서가
보장되어야 하는가?
    |
    +----+----+
    |         |
   YES        NO
    |         |
 Kinesis    SQS FIFO
 Data       또는
 Streams    Kinesis

3. Amazon Managed Service for Apache Flink

3.1 개요

Amazon Managed Service for Apache Flink(구 Kinesis Data Analytics)는 Apache Flink를 완전관리형 인프라에서 실행할 수 있는 서비스입니다.

참고: 기존 Kinesis Data Analytics for SQL은 2025년 10월 이후 신규 생성이 중단되었으며, Amazon Managed Service for Apache Flink으로 마이그레이션이 권장됩니다.

3.2 주요 기능

Managed Flink 아키텍처
========================

+-----------+     +----------------------------+     +-----------+
| 소스      |     | Managed Flink              |     | 싱크      |
|           | --> |                            | --> |           |
| - Kinesis |     | +------------------------+ |     | - Kinesis |
| - MSK     |     | | Flink Application      | |     | - S3      |
| - S3      |     | |                        | |     | - DynamoDB|
|           |     | | - SQL 쿼리             | |     | - Open-   |
|           |     | | - Java/Scala 앱        | |     |   Search  |
|           |     | | - Python (PyFlink)     | |     | - Redshift|
|           |     | |                        | |     |           |
|           |     | | 윈도우 집계            | |     |           |
|           |     | | 패턴 감지              | |     |           |
|           |     | | CEP (복합 이벤트 처리) | |     |           |
|           |     | +------------------------+ |     |           |
+-----------+     +----------------------------+     +-----------+

3.3 윈도우 처리 유형

스트림 데이터를 시간 기반으로 그룹화하여 분석하는 핵심 기능입니다.

윈도우 유형
============

1) 텀블링 윈도우 (Tumbling Window)
   - 고정 크기, 겹치지 않음
   |-------|-------|-------|-------|
   0       5       10      15      20 (초)

2) 슬라이딩 윈도우 (Sliding/Hopping Window)
   - 고정 크기, 일정 간격으로 슬라이드
   |-----------|
       |-----------|
           |-----------|
   0   2   4   6   8   10 (초)
   크기: 6초, 슬라이드: 2초

3) 세션 윈도우 (Session Window)
   - 활동 기반, 비활성 기간으로 구분
   |---event-event---| gap |--event-event-event--| gap |
   <-- Session 1 -->       <------ Session 2 ---->

4) 글로벌 윈도우 (Global Window)
   - 전체 스트림에 대해 단일 윈도우

3.4 Flink SQL 예제

-- Kinesis 소스 테이블 정의
CREATE TABLE clickstream (
    user_id VARCHAR,
    page VARCHAR,
    action VARCHAR,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',
    'stream' = 'clickstream-data',
    'aws.region' = 'ap-northeast-2',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json'
);

-- 1분 텀블링 윈도우로 페이지별 조회수 집계
SELECT
    page,
    COUNT(*) AS view_count,
    COUNT(DISTINCT user_id) AS unique_users,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM clickstream
WHERE action = 'view'
GROUP BY
    page,
    TUMBLE(event_time, INTERVAL '1' MINUTE);

-- 이상 행동 감지: 1분 내 동일 사용자 10회 이상 클릭
SELECT
    user_id,
    COUNT(*) AS click_count,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start
FROM clickstream
WHERE action = 'click'
GROUP BY
    user_id,
    TUMBLE(event_time, INTERVAL '1' MINUTE)
HAVING COUNT(*) >= 10;

4. 실전 스트리밍 아키텍처 패턴

4.1 로그 집계 파이프라인

로그 집계 아키텍처
====================

+----------+     +----------+     +---------+     +----------+
| App      |     | Kinesis  |     | Firehose|     | S3       |
| Server 1 | --> | Agent    | --> |         | --> | (Raw     |
+----------+     +----------+     |         |     |  Logs)   |
                                  |         |     +----------+
+----------+     +----------+     |         |         |
| App      |     | Kinesis  |     |         |         v
| Server 2 | --> | Agent    | --> |         |     +----------+
+----------+     +----------+     |         |     | Athena   |
                                  |         |     | (Query)  |
+----------+     +----------+     |         |     +----------+
| App      |     | Kinesis  |     |         |
| Server N | --> | Agent    | --> |         |
+----------+     +----------+     +---------+
                      |
                      v
                 +----------+     +----------+
                 | Lambda   | --> | Open-    |
                 | (실시간  |     | Search   |
                 |  알림)   |     | (검색/   |
                 +----------+     |  대시보드)|
                                  +----------+

4.2 실시간 분석 대시보드

실시간 대시보드 아키텍처
==========================

+----------+     +----------+     +-----------+     +----------+
| 웹/모바일|     | API      |     | Kinesis   |     | Managed  |
| 클라이언트| --> | Gateway  | --> | Data      | --> | Flink    |
+----------+     +----------+     | Streams   |     | (집계/   |
                                  +-----------+     |  분석)   |
                                                    +----------+
                                                         |
                                          +---------+----+----+---------+
                                          |         |         |         |
                                          v         v         v         v
                                     +--------+ +--------+ +--------+ +--------+
                                     |DynamoDB| |Timestream| | S3    | |CloudWatch|
                                     |(실시간 | |(시계열  | |(장기  | |(메트릭  |
                                     | 데이터)| | 데이터) | | 저장) | | 알람)   |
                                     +--------+ +--------+ +--------+ +--------+
                                          |         |
                                          v         v
                                     +--------------------+
                                     | 대시보드 앱        |
                                     | (React/Vue +       |
                                     |  WebSocket)        |
                                     +--------------------+

4.3 IoT 데이터 수집

# IoT 디바이스 시뮬레이터
import boto3
import json
import time
import random
from datetime import datetime

kinesis = boto3.client('kinesis', region_name='ap-northeast-2')
STREAM_NAME = 'iot-sensor-data'

def simulate_sensor(device_id):
    """IoT 센서 데이터 시뮬레이션"""
    return {
        'device_id': device_id,
        'temperature': round(random.uniform(15.0, 45.0), 2),
        'humidity': round(random.uniform(20.0, 90.0), 2),
        'pressure': round(random.uniform(990.0, 1030.0), 2),
        'battery_level': round(random.uniform(0.0, 100.0), 1),
        'location': {
            'lat': round(random.uniform(33.0, 38.0), 6),
            'lon': round(random.uniform(126.0, 130.0), 6)
        },
        'timestamp': datetime.utcnow().isoformat() + 'Z'
    }

def ingest_iot_data(num_devices=100, interval=1.0):
    """IoT 데이터를 Kinesis에 수집"""
    device_ids = [f'sensor-{i:04d}' for i in range(num_devices)]

    while True:
        records = []
        for device_id in device_ids:
            sensor_data = simulate_sensor(device_id)
            records.append({
                'Data': json.dumps(sensor_data).encode('utf-8'),
                'PartitionKey': device_id
            })

        # PutRecords는 최대 500개까지
        for batch_start in range(0, len(records), 500):
            batch = records[batch_start:batch_start + 500]
            response = kinesis.put_records(
                StreamName=STREAM_NAME,
                Records=batch
            )

            failed = response['FailedRecordCount']
            if failed > 0:
                print(f"Batch failed: {failed} records")
                # 지수 백오프 재시도 로직
                retry_records = []
                for i, result in enumerate(response['Records']):
                    if 'ErrorCode' in result:
                        retry_records.append(batch[i])
                if retry_records:
                    time.sleep(0.5)
                    kinesis.put_records(
                        StreamName=STREAM_NAME,
                        Records=retry_records
                    )

        print(f"Ingested {len(records)} sensor readings")
        time.sleep(interval)

if __name__ == '__main__':
    ingest_iot_data()

4.4 이벤트 소싱 패턴

이벤트 소싱 아키텍처
======================

+----------+     +----------+     +-----------+
| Command  |     | Kinesis  |     | Event     |
| Handler  | --> | Data     | --> | Processor |
|          |     | Streams  |     | (Lambda/  |
| - 주문   |     | (이벤트  |     |  ECS)     |
| - 결제   |     |  스토어) |     +-----------+
| - 배송   |     +----------+          |
+----------+          |          +-----+-----+
                      |          |           |
                      v          v           v
                 +----------+ +--------+ +--------+
                 | S3       | |DynamoDB| |SNS     |
                 | (이벤트  | |(읽기   | |(알림)  |
                 |  아카이브)| | 모델)  | |        |
                 +----------+ +--------+ +--------+

이벤트 흐름 예시:
1. OrderCreated -> 주문 생성 이벤트
2. PaymentProcessed -> 결제 처리 이벤트
3. InventoryReserved -> 재고 예약 이벤트
4. ShipmentCreated -> 배송 생성 이벤트

4.5 ML 피처 파이프라인

ML 피처 파이프라인
====================

+----------+     +----------+     +-----------+     +----------+
| 이벤트   |     | Kinesis  |     | Managed   |     | Feature  |
| 소스     | --> | Data     | --> | Flink     | --> | Store    |
|          |     | Streams  |     | (피처     |     | (Sage-   |
| - 클릭   |     |          |     |  계산)    |     |  Maker)  |
| - 구매   |     |          |     |           |     +----------+
| - 검색   |     |          |     | 실시간:   |         |
+----------+     +----------+     | - 세션 수 |         v
                                  | - 최근    |     +----------+
                                  |   구매 수  |     | ML 모델  |
                                  | - 평균    |     | 추론     |
                                  |   체류 시간|     +----------+
                                  +-----------+

5. 성능 최적화

5.1 파티션 키 설계

파티션 키 설계는 Kinesis 성능의 가장 중요한 요소입니다.

좋은 파티션 키의 조건:

높은 카디널리티 (고유 값이 많아야 함)
균등한 분포 (특정 키에 쏠리지 않아야 함)
샤드 수의 최소 10배 이상의 고유 키 확보

좋은 파티션 키 예시
====================

1) 사용자 ID (높은 카디널리티)
   user-001 -> Shard 1
   user-002 -> Shard 3
   user-003 -> Shard 2
   ...
   고르게 분산됨

2) UUID (최고 분산)
   랜덤 UUID -> 완벽한 분산
   단점: 같은 엔티티의 순서 보장 불가

3) 복합 키
   "region-userType-userId"
   세밀한 분산 제어 가능

나쁜 파티션 키 예시
====================

1) 날짜 ("2026-03-20")
   모든 레코드가 같은 샤드로 -> 핫 샤드

2) 국가 코드 ("KR", "US", "JP")
   카디널리티가 너무 낮음
   특정 국가에 트래픽 쏠림

3) 고정값 ("default")
   단일 샤드에 모든 부하 집중

5.2 KPL을 활용한 집계

KPL 집계 최적화
================

집계 없이:
Record 1 (100B) -> PutRecord -> 1 API 호출
Record 2 (200B) -> PutRecord -> 1 API 호출
Record 3 (150B) -> PutRecord -> 1 API 호출
합계: 3 API 호출, 450B 전송

KPL 집계 사용:
Record 1 (100B) --+
Record 2 (200B) --+--> 집계 레코드 (450B) -> 1 API 호출
Record 3 (150B) --+
합계: 1 API 호출, 450B 전송

효과:
- API 호출 수 대폭 감소
- PUT 유닛 비용 절감
- 처리량 대폭 향상

5.3 향상된 팬아웃 전략

# 향상된 팬아웃 컨슈머 등록
import boto3

kinesis = boto3.client('kinesis', region_name='ap-northeast-2')

# 컨슈머 등록
response = kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:ap-northeast-2:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
consumer_arn = response['Consumer']['ConsumerARN']
print(f"Consumer ARN: {consumer_arn}")

# 컨슈머 상태 확인
response = kinesis.describe_stream_consumer(
    StreamARN='arn:aws:kinesis:ap-northeast-2:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
print(f"Status: {response['ConsumerDescription']['ConsumerStatus']}")

5.4 에러 처리와 재시도 전략

import time
import random

def put_records_with_retry(kinesis_client, stream_name, records, max_retries=3):
    """지수 백오프를 사용한 PutRecords 재시도"""

    for attempt in range(max_retries):
        response = kinesis_client.put_records(
            StreamName=stream_name,
            Records=records
        )

        failed_count = response['FailedRecordCount']

        if failed_count == 0:
            return response

        # 실패한 레코드만 추출
        retry_records = []
        for i, result in enumerate(response['Records']):
            if 'ErrorCode' in result:
                error_code = result['ErrorCode']
                if error_code == 'ProvisionedThroughputExceededException':
                    retry_records.append(records[i])
                else:
                    print(f"Non-retryable error: {error_code}")

        if not retry_records:
            return response

        records = retry_records

        # 지수 백오프 + 지터
        backoff = min(2 ** attempt * 0.1, 5.0)
        jitter = random.uniform(0, backoff * 0.5)
        wait_time = backoff + jitter
        print(f"Retry {attempt + 1}: {len(retry_records)} records, waiting {wait_time:.2f}s")
        time.sleep(wait_time)

    print(f"Failed after {max_retries} retries: {len(records)} records")
    return None

6. 모니터링: CloudWatch 메트릭

6.1 핵심 모니터링 메트릭

메트릭	설명	경고 임계값
IncomingBytes	스트림에 들어오는 바이트 수	샤드 용량의 80%
IncomingRecords	스트림에 들어오는 레코드 수	샤드당 800 rec/s
GetRecords.IteratorAgeMilliseconds	컨슈머가 얼마나 뒤처져 있는지	60,000 ms (1분)
WriteProvisionedThroughputExceeded	쓰기 제한 초과 횟수	0 초과 시 알람
ReadProvisionedThroughputExceeded	읽기 제한 초과 횟수	0 초과 시 알람
GetRecords.Latency	GetRecords 호출 지연 시간	1,000 ms
PutRecord.Latency	PutRecord 호출 지연 시간	1,000 ms
GetRecords.Success	GetRecords 성공률	99% 미만 시 알람

6.2 향상된 모니터링

향상된 모니터링 활성화 시 추가 메트릭
========================================

샤드 수준 메트릭:
- IncomingBytes (샤드별)
- IncomingRecords (샤드별)
- IteratorAgeMilliseconds (샤드별)
- OutgoingBytes (샤드별)
- OutgoingRecords (샤드별)
- ReadProvisionedThroughputExceeded (샤드별)
- WriteProvisionedThroughputExceeded (샤드별)

핫 샤드 탐지:
  Shard 1: IncomingBytes = 200 KB/s  [정상]
  Shard 2: IncomingBytes = 950 KB/s  [경고! 거의 한계]
  Shard 3: IncomingBytes = 300 KB/s  [정상]
  -> Shard 2 분할 권장

7. 베스트 프랙티스와 안티패턴

7.1 베스트 프랙티스

1) 파티션 키 설계

높은 카디널리티 키 사용 (사용자 ID, 디바이스 ID)
샤드 수의 10배 이상 고유 키 확보
순서가 필요한 경우 엔티티 ID를 파티션 키로 사용

2) 프로듀서 최적화

PutRecords (배치) API 사용으로 API 호출 최소화
KPL 사용으로 레코드 집계 및 수집 최적화
적절한 재시도 로직 구현 (지수 백오프 + 지터)

3) 컨슈머 최적화

다중 컨슈머 시 향상된 팬아웃 사용
KCL 사용으로 분산 처리 자동화
체크포인팅 주기 최적화 (너무 자주하면 DynamoDB 비용 증가)

4) 용량 관리

예측 가능한 워크로드: 프로비저닝 모드
불규칙한 워크로드: 온디맨드 모드
CloudWatch 알람으로 자동 스케일링 트리거

5) 비용 최적화

온디맨드 Advantage 모드 검토 (2025년 출시)
보존 기간을 필요한 만큼만 설정
불필요한 향상된 팬아웃 컨슈머 등록 해제

7.2 안티패턴

1) 단일 파티션 키 사용

모든 데이터가 하나의 샤드에 집중
샤드 제한에 즉시 도달

2) 과도한 샤드 수

비용 증가, 관리 복잡성 증가
KCL의 DynamoDB 리스 테이블 부하 증가

3) 체크포인트 미사용

장애 시 데이터 중복 처리 또는 유실
항상 적절한 체크포인팅 전략 구현

4) 에러 처리 부재

ProvisionedThroughputExceededException 무시
재시도 없이 실패 데이터 유실

5) GetRecords 과다 호출

샤드당 초당 5회 제한 준수 필요
적절한 폴링 간격 설정

8. 종합 비교 정리표

항목	Kinesis Data Streams	Kinesis Firehose	Kafka	SQS	Managed Flink
유형	데이터 스트리밍	데이터 전달	데이터 스트리밍	메시지 큐	스트림 처리
관리	완전관리형	완전관리형	자체/MSK	완전관리형	완전관리형
지연시간	ms	60s+	ms	ms	ms
순서보장	샤드 내	없음	파티션 내	FIFO만	입력 의존
리플레이	가능	불가	가능	불가	입력 의존
스케일링	샤드 추가	자동	파티션/브로커	자동	KPU 추가
변환	없음 (소비자 담당)	Lambda	Kafka Streams	없음	Flink 앱
비용모델	샤드+데이터	데이터량	인스턴스	요청수	KPU 시간
AWS통합	높음	매우높음	낮음/중간	높음	높음
최적용도	실시간 수집	자동 전달	대량 스트리밍	작업 큐	실시간 분석

9. 정리

서비스 선택 가이드

실시간 스트리밍 아키텍처를 설계할 때는 요구사항에 맞는 서비스를 선택하는 것이 중요합니다.

단순한 데이터 전달이 목적이라면: Amazon Data Firehose를 사용하여 S3, Redshift 등에 직접 전달
실시간 처리와 다중 컨슈머가 필요하다면: Kinesis Data Streams + KCL 또는 향상된 팬아웃
복잡한 스트림 분석이 필요하다면: Managed Service for Apache Flink
대용량 처리와 에코시스템이 필요하다면: Apache Kafka 또는 Amazon MSK
마이크로서비스 간 메시징이 필요하다면: Amazon SQS

각 서비스의 강점을 이해하고, 종종 여러 서비스를 조합하여 사용하는 것이 실전에서의 최적 패턴입니다. 예를 들어, Kinesis Data Streams로 수집하고, Managed Flink로 실시간 분석하며, Data Firehose로 S3에 장기 저장하는 조합은 매우 일반적인 아키텍처입니다.

[AWS] Kinesis in Practice: Kafka Comparison and Streaming Patterns

1. Kinesis vs Apache Kafka Detailed Comparison

1.1 Architecture Differences

Apache Kafka and AWS Kinesis are both platforms for real-time data streaming, but their fundamental architectural philosophies differ.

Apache Kafka Architecture
==========================

+----------+     +------------------------------------------+
| Producer | --> | Kafka Cluster                             |
+----------+     |                                          |
                 |  Broker 1    Broker 2    Broker 3        |
                 |  +-------+  +-------+  +-------+        |
                 |  |Topic A|  |Topic A|  |Topic A|        |
                 |  |Part 0 |  |Part 1 |  |Part 2 |        |
                 |  |       |  |       |  |       |        |
                 |  |Topic B|  |Topic B|  |Topic B|        |
                 |  |Part 0 |  |Part 1 |  |Part 2 |        |
                 |  +-------+  +-------+  +-------+        |
                 |                                          |
                 |  KRaft (Kafka 4.0+, ZooKeeper removed)   |
                 +------------------------------------------+
                         |
                         v
                 +----------+
                 | Consumer |
                 | Group    |
                 +----------+

AWS Kinesis Data Streams Architecture
=======================================

+----------+     +------------------------------------------+
| Producer | --> | Kinesis Stream (Fully Managed)            |
+----------+     |                                          |
                 |  Shard 1     Shard 2     Shard 3         |
                 |  +-------+  +-------+  +-------+        |
                 |  |Records|  |Records|  |Records|        |
                 |  |1MB/s W|  |1MB/s W|  |1MB/s W|        |
                 |  |2MB/s R|  |2MB/s R|  |2MB/s R|        |
                 |  +-------+  +-------+  +-------+        |
                 |                                          |
                 |  Fully managed by AWS                    |
                 +------------------------------------------+
                         |
                         v
                 +----------+
                 | Consumer |
                 | (KCL)    |
                 +----------+

1.2 Comprehensive Comparison Table

Comparison	AWS Kinesis	Apache Kafka (Self-hosted)	Amazon MSK
Management	Fully managed	Self-operated	Managed Kafka
Scale unit	Shard	Partition + Broker	Broker
Write per shard/partition	1 MB/s	No limit (disk I/O)	Disk I/O dependent
Read per shard/partition	2 MB/s (shared)	No limit	Disk I/O dependent
Max retention	365 days	Unlimited	Unlimited
Ordering	Within shard	Within partition	Within partition
Max record size	1 MB	Default 1 MB (configurable)	Default 1 MB
Consumer model	KCL, Enhanced Fan-Out	Consumer Groups	Consumer Groups
Protocol	HTTPS/HTTP2	Custom TCP protocol	Custom TCP
Ecosystem	AWS service integration	Kafka Connect, Schema Registry, etc.	Kafka ecosystem
Operational complexity	Low	High	Medium
Initial setup	Minutes	Hours to days	Tens of minutes

1.3 Throughput and Latency

Throughput Comparison
======================

Kinesis (Provisioned):
  Write: 1 MB/s x number of shards
  Read:  2 MB/s x number of shards (shared)
         2 MB/s x shards x consumers (enhanced fan-out)

  Example: 100 shards
  Write: 100 MB/s
  Read:  200 MB/s (shared) or 200 MB/s x N (enhanced fan-out)

Kafka (Self-hosted):
  Depends on broker performance
  Single broker: hundreds of MB/s possible
  Cluster: multiple GB/s achievable

  Example: 6 broker cluster
  Write: 600+ MB/s
  Read:  multiple GB/s

Latency Comparison
===================
Kinesis:
  - PutRecord: tens of ms
  - GetRecords (shared fan-out): ~200 ms
  - Enhanced Fan-Out: ~70 ms

Kafka:
  - Producer -> Consumer: ~2-10 ms (network dependent)
  - End-to-end: ~10-50 ms

1.4 Cost Comparison

Monthly Cost Estimate (US East)
=================================

Scenario: 10 MB/s sustained ingestion, 3 consumers

Kinesis (Provisioned Mode):
  - 10 shards needed (10 MB/s / 1 MB/s per shard)
  - Shard cost: 10 x 0.015 x 720 hours = ~108 USD
  - PUT units: ~360 USD (~25.9B units/month)
  - Enhanced fan-out (3 consumers): ~324 USD
  Total: ~792 USD/month

Kinesis (On-Demand Advantage):
  - Data write: ~25.9 TB x 0.032 = ~829 USD
  - Data read: ~25.9 TB x 3 x 0.016 = ~1,243 USD
  Total: ~2,072 USD/month

Kafka (Self-hosted on EC2):
  - 3 brokers (m5.xlarge): 3 x 140 = ~420 USD
  - EBS storage (1TB x 3): ~300 USD
  - Operations personnel: separate
  Total: ~720 USD/month + operational costs

Amazon MSK:
  - 3 brokers (kafka.m5.large): ~456 USD
  - Storage: ~300 USD
  Total: ~756 USD/month

1.5 When to Choose What

Choose Kinesis when:

Architecture deeply integrated with AWS
You want to minimize operational burden
Direct integration with Lambda, Firehose, and other AWS services is needed
Small to medium throughput (under tens of MB/s)
Rapid prototyping is needed

Choose Kafka when:

Very high throughput is required (multiple GB/s)
Multi-cloud or hybrid environment
Kafka Connect ecosystem is needed
Ultra-low latency is required (single-digit ms)
Unlimited data retention is needed

2. Kinesis vs SQS: When to Use Which

Comparison	Kinesis Data Streams	Amazon SQS
Processing model	Streaming (continuous)	Message queue (individual)
Consumer count	Multiple simultaneous	Primarily single consumer
Ordering	Guaranteed within shard	FIFO queue only
Data retention	24h to 365 days	Up to 14 days
Data replay	Possible (sequence number based)	Not possible (deleted after processing)
Throughput	1 MB/s write per shard	Nearly unlimited
Message size	Up to 1 MB	Up to 256 KB
Latency	Milliseconds	Milliseconds
Cost model	Shard hours + data transfer	Request-based
Primary use	Real-time analytics, log collection	Microservice decoupling

Usage Scenario Decision Flowchart
====================================

Analyze data processing requirements
         |
    +----+----+
    |         |
Do multiple     Do messages need
consumers need  single processing?
to read the          |
same data?         SQS
    |
Is real-time ordering
required?
    |
    +----+----+
    |         |
   YES        NO
    |         |
 Kinesis    SQS FIFO
 Data       or
 Streams    Kinesis

3. Amazon Managed Service for Apache Flink

3.1 Overview

Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) allows you to run Apache Flink on fully managed infrastructure.

Note: The legacy Kinesis Data Analytics for SQL was discontinued for new application creation after October 2025, and migration to Amazon Managed Service for Apache Flink is recommended.

3.2 Key Features

Managed Flink Architecture
============================

+-----------+     +----------------------------+     +-----------+
| Sources   |     | Managed Flink              |     | Sinks     |
|           | --> |                            | --> |           |
| - Kinesis |     | +------------------------+ |     | - Kinesis |
| - MSK     |     | | Flink Application      | |     | - S3      |
| - S3      |     | |                        | |     | - DynamoDB|
|           |     | | - SQL queries          | |     | - Open-   |
|           |     | | - Java/Scala apps      | |     |   Search  |
|           |     | | - Python (PyFlink)     | |     | - Redshift|
|           |     | |                        | |     |           |
|           |     | | Window aggregation     | |     |           |
|           |     | | Pattern detection      | |     |           |
|           |     | | CEP (Complex Event     | |     |           |
|           |     | |      Processing)       | |     |           |
|           |     | +------------------------+ |     |           |
+-----------+     +----------------------------+     +-----------+

3.3 Window Processing Types

A core feature for grouping stream data by time-based windows for analysis.

Window Types
==============

1) Tumbling Window
   - Fixed size, non-overlapping
   |-------|-------|-------|-------|
   0       5       10      15      20 (sec)

2) Sliding/Hopping Window
   - Fixed size, slides at regular intervals
   |-----------|
       |-----------|
           |-----------|
   0   2   4   6   8   10 (sec)
   Size: 6s, Slide: 2s

3) Session Window
   - Activity-based, separated by inactivity gaps
   |---event-event---| gap |--event-event-event--| gap |
   <-- Session 1 -->       <------ Session 2 ---->

4) Global Window
   - Single window over the entire stream

3.4 Flink SQL Example

-- Define Kinesis source table
CREATE TABLE clickstream (
    user_id VARCHAR,
    page VARCHAR,
    action VARCHAR,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',
    'stream' = 'clickstream-data',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json'
);

-- Aggregate page views per 1-minute tumbling window
SELECT
    page,
    COUNT(*) AS view_count,
    COUNT(DISTINCT user_id) AS unique_users,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM clickstream
WHERE action = 'view'
GROUP BY
    page,
    TUMBLE(event_time, INTERVAL '1' MINUTE);

-- Anomaly detection: 10+ clicks by same user within 1 minute
SELECT
    user_id,
    COUNT(*) AS click_count,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start
FROM clickstream
WHERE action = 'click'
GROUP BY
    user_id,
    TUMBLE(event_time, INTERVAL '1' MINUTE)
HAVING COUNT(*) >= 10;

4. Real-World Streaming Architecture Patterns

4.1 Log Aggregation Pipeline

Log Aggregation Architecture
==============================

+----------+     +----------+     +---------+     +----------+
| App      |     | Kinesis  |     | Firehose|     | S3       |
| Server 1 | --> | Agent    | --> |         | --> | (Raw     |
+----------+     +----------+     |         |     |  Logs)   |
                                  |         |     +----------+
+----------+     +----------+     |         |         |
| App      |     | Kinesis  |     |         |         v
| Server 2 | --> | Agent    | --> |         |     +----------+
+----------+     +----------+     |         |     | Athena   |
                                  |         |     | (Query)  |
+----------+     +----------+     |         |     +----------+
| App      |     | Kinesis  |     |         |
| Server N | --> | Agent    | --> |         |
+----------+     +----------+     +---------+
                      |
                      v
                 +----------+     +----------+
                 | Lambda   | --> | Open-    |
                 | (Real-   |     | Search   |
                 |  time    |     | (Search/ |
                 |  alerts) |     | Dashboard)|
                 +----------+     +----------+

4.2 Real-Time Analytics Dashboard

Real-Time Dashboard Architecture
==================================

+----------+     +----------+     +-----------+     +----------+
| Web/     |     | API      |     | Kinesis   |     | Managed  |
| Mobile   | --> | Gateway  | --> | Data      | --> | Flink    |
| Client   |     |          |     | Streams   |     | (Aggre-  |
+----------+     +----------+     +-----------+     |  gation) |
                                                    +----------+
                                                         |
                                          +---------+----+----+---------+
                                          |         |         |         |
                                          v         v         v         v
                                     +--------+ +--------+ +--------+ +--------+
                                     |DynamoDB| |Time-   | | S3     | |Cloud-  |
                                     |(Real-  | |stream  | |(Long-  | |Watch   |
                                     | time)  | |(Time-  | | term)  | |(Metric |
                                     |        | | series)| |        | | Alarms)|
                                     +--------+ +--------+ +--------+ +--------+
                                          |         |
                                          v         v
                                     +--------------------+
                                     | Dashboard App      |
                                     | (React/Vue +       |
                                     |  WebSocket)        |
                                     +--------------------+

4.3 IoT Data Ingestion

# IoT device simulator
import boto3
import json
import time
import random
from datetime import datetime

kinesis = boto3.client('kinesis', region_name='us-east-1')
STREAM_NAME = 'iot-sensor-data'

def simulate_sensor(device_id):
    """Simulate IoT sensor data"""
    return {
        'device_id': device_id,
        'temperature': round(random.uniform(15.0, 45.0), 2),
        'humidity': round(random.uniform(20.0, 90.0), 2),
        'pressure': round(random.uniform(990.0, 1030.0), 2),
        'battery_level': round(random.uniform(0.0, 100.0), 1),
        'location': {
            'lat': round(random.uniform(33.0, 38.0), 6),
            'lon': round(random.uniform(126.0, 130.0), 6)
        },
        'timestamp': datetime.utcnow().isoformat() + 'Z'
    }

def ingest_iot_data(num_devices=100, interval=1.0):
    """Ingest IoT data into Kinesis"""
    device_ids = [f'sensor-{i:04d}' for i in range(num_devices)]

    while True:
        records = []
        for device_id in device_ids:
            sensor_data = simulate_sensor(device_id)
            records.append({
                'Data': json.dumps(sensor_data).encode('utf-8'),
                'PartitionKey': device_id
            })

        # PutRecords supports up to 500 records
        for batch_start in range(0, len(records), 500):
            batch = records[batch_start:batch_start + 500]
            response = kinesis.put_records(
                StreamName=STREAM_NAME,
                Records=batch
            )

            failed = response['FailedRecordCount']
            if failed > 0:
                print(f"Batch failed: {failed} records")
                # Exponential backoff retry logic
                retry_records = []
                for i, result in enumerate(response['Records']):
                    if 'ErrorCode' in result:
                        retry_records.append(batch[i])
                if retry_records:
                    time.sleep(0.5)
                    kinesis.put_records(
                        StreamName=STREAM_NAME,
                        Records=retry_records
                    )

        print(f"Ingested {len(records)} sensor readings")
        time.sleep(interval)

if __name__ == '__main__':
    ingest_iot_data()

4.4 Event Sourcing Pattern

Event Sourcing Architecture
=============================

+----------+     +----------+     +-----------+
| Command  |     | Kinesis  |     | Event     |
| Handler  | --> | Data     | --> | Processor |
|          |     | Streams  |     | (Lambda/  |
| - Order  |     | (Event   |     |  ECS)     |
| - Payment|     |  Store)  |     +-----------+
| - Ship   |     +----------+          |
+----------+          |          +-----+-----+
                      |          |           |
                      v          v           v
                 +----------+ +--------+ +--------+
                 | S3       | |DynamoDB| |SNS     |
                 | (Event   | |(Read   | |(Notif- |
                 |  Archive)| | Model) | | ication)|
                 +----------+ +--------+ +--------+

Event Flow Example:
1. OrderCreated -> Order creation event
2. PaymentProcessed -> Payment processing event
3. InventoryReserved -> Inventory reservation event
4. ShipmentCreated -> Shipment creation event

4.5 ML Feature Pipeline

ML Feature Pipeline
=====================

+----------+     +----------+     +-----------+     +----------+
| Event    |     | Kinesis  |     | Managed   |     | Feature  |
| Sources  | --> | Data     | --> | Flink     | --> | Store    |
|          |     | Streams  |     | (Feature  |     | (Sage-   |
| - Clicks |     |          |     |  Compute) |     |  Maker)  |
| - Purch. |     |          |     |           |     +----------+
| - Search |     |          |     | Real-time:|         |
+----------+     +----------+     | - Session |         v
                                  |   count   |     +----------+
                                  | - Recent  |     | ML Model |
                                  |   purch.  |     | Inference|
                                  | - Avg     |     +----------+
                                  |   dwell   |
                                  +-----------+

5. Performance Optimization

5.1 Partition Key Design

Partition key design is the most critical factor for Kinesis performance.

Criteria for good partition keys:

High cardinality (many unique values)
Even distribution (no skew toward specific keys)
At least 10x as many unique keys as the number of shards

Good Partition Key Examples
============================

1) User ID (high cardinality)
   user-001 -> Shard 1
   user-002 -> Shard 3
   user-003 -> Shard 2
   ...
   Evenly distributed

2) UUID (best distribution)
   Random UUID -> Perfect distribution
   Downside: Cannot guarantee ordering for same entity

3) Composite key
   "region-userType-userId"
   Fine-grained distribution control

Bad Partition Key Examples
===========================

1) Date ("2026-03-20")
   All records to same shard -> Hot shard

2) Country code ("KR", "US", "JP")
   Cardinality too low
   Traffic skews toward specific countries

3) Fixed value ("default")
   All load concentrated on single shard

5.2 Aggregation with KPL

KPL Aggregation Optimization
==============================

Without aggregation:
Record 1 (100B) -> PutRecord -> 1 API call
Record 2 (200B) -> PutRecord -> 1 API call
Record 3 (150B) -> PutRecord -> 1 API call
Total: 3 API calls, 450B transmitted

With KPL aggregation:
Record 1 (100B) --+
Record 2 (200B) --+--> Aggregated record (450B) -> 1 API call
Record 3 (150B) --+
Total: 1 API call, 450B transmitted

Benefits:
- Dramatically reduced API calls
- Lower PUT unit costs
- Significantly improved throughput

5.3 Enhanced Fan-Out Strategy

# Register enhanced fan-out consumer
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# Register consumer
response = kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
consumer_arn = response['Consumer']['ConsumerARN']
print(f"Consumer ARN: {consumer_arn}")

# Check consumer status
response = kinesis.describe_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
print(f"Status: {response['ConsumerDescription']['ConsumerStatus']}")

5.4 Error Handling and Retry Strategy

import time
import random

def put_records_with_retry(kinesis_client, stream_name, records, max_retries=3):
    """PutRecords retry with exponential backoff"""

    for attempt in range(max_retries):
        response = kinesis_client.put_records(
            StreamName=stream_name,
            Records=records
        )

        failed_count = response['FailedRecordCount']

        if failed_count == 0:
            return response

        # Extract only failed records
        retry_records = []
        for i, result in enumerate(response['Records']):
            if 'ErrorCode' in result:
                error_code = result['ErrorCode']
                if error_code == 'ProvisionedThroughputExceededException':
                    retry_records.append(records[i])
                else:
                    print(f"Non-retryable error: {error_code}")

        if not retry_records:
            return response

        records = retry_records

        # Exponential backoff + jitter
        backoff = min(2 ** attempt * 0.1, 5.0)
        jitter = random.uniform(0, backoff * 0.5)
        wait_time = backoff + jitter
        print(f"Retry {attempt + 1}: {len(retry_records)} records, waiting {wait_time:.2f}s")
        time.sleep(wait_time)

    print(f"Failed after {max_retries} retries: {len(records)} records")
    return None

6. Monitoring: CloudWatch Metrics

6.1 Key Monitoring Metrics

Metric	Description	Alert Threshold
IncomingBytes	Bytes entering the stream	80% of shard capacity
IncomingRecords	Records entering the stream	800 rec/s per shard
GetRecords.IteratorAgeMilliseconds	How far behind a consumer is	60,000 ms (1 min)
WriteProvisionedThroughputExceeded	Write throttle count	Alarm when above 0
ReadProvisionedThroughputExceeded	Read throttle count	Alarm when above 0
GetRecords.Latency	GetRecords call latency	1,000 ms
PutRecord.Latency	PutRecord call latency	1,000 ms
GetRecords.Success	GetRecords success rate	Alarm when below 99%

6.2 Enhanced Monitoring

Additional Metrics with Enhanced Monitoring
=============================================

Shard-level metrics:
- IncomingBytes (per shard)
- IncomingRecords (per shard)
- IteratorAgeMilliseconds (per shard)
- OutgoingBytes (per shard)
- OutgoingRecords (per shard)
- ReadProvisionedThroughputExceeded (per shard)
- WriteProvisionedThroughputExceeded (per shard)

Hot Shard Detection:
  Shard 1: IncomingBytes = 200 KB/s  [Normal]
  Shard 2: IncomingBytes = 950 KB/s  [Warning! Near limit]
  Shard 3: IncomingBytes = 300 KB/s  [Normal]
  -> Recommend splitting Shard 2

7. Best Practices and Anti-Patterns

7.1 Best Practices

1) Partition Key Design

Use high-cardinality keys (user IDs, device IDs)
Ensure at least 10x unique keys relative to shard count
Use entity IDs as partition keys when ordering is needed

2) Producer Optimization

Use PutRecords (batch) API to minimize API calls
Use KPL for record aggregation and collection optimization
Implement proper retry logic (exponential backoff + jitter)

3) Consumer Optimization

Use enhanced fan-out for multiple consumers
Use KCL for automated distributed processing
Optimize checkpointing frequency (too frequent increases DynamoDB costs)

4) Capacity Management

Predictable workloads: Provisioned mode
Irregular workloads: On-demand mode
Trigger auto-scaling with CloudWatch alarms

5) Cost Optimization

Consider On-Demand Advantage mode (released 2025)
Set retention period only as long as needed
Deregister unnecessary enhanced fan-out consumers

7.2 Anti-Patterns

1) Single Partition Key

All data concentrates on one shard
Immediately hits shard limits

2) Excessive Shard Count

Increased costs and management complexity
Increased DynamoDB lease table load for KCL

3) Not Using Checkpoints

Duplicate processing or data loss on failure
Always implement proper checkpointing strategy

4) Missing Error Handling

Ignoring ProvisionedThroughputExceededException
Losing failed data without retries

5) Excessive GetRecords Calls

Must respect 5 calls per second per shard limit
Set appropriate polling intervals

8. Comprehensive Comparison Summary Table

Item	Kinesis Data Streams	Kinesis Firehose	Kafka	SQS	Managed Flink
Type	Data streaming	Data delivery	Data streaming	Message queue	Stream processing
Management	Fully managed	Fully managed	Self/MSK	Fully managed	Fully managed
Latency	ms	60s+	ms	ms	ms
Ordering	Within shard	None	Within partition	FIFO only	Input dependent
Replay	Yes	No	Yes	No	Input dependent
Scaling	Add shards	Automatic	Partition/broker	Automatic	Add KPUs
Transform	None (consumer)	Lambda	Kafka Streams	None	Flink app
Cost model	Shard+data	Data volume	Instance	Requests	KPU hours
AWS integration	High	Very high	Low/Medium	High	High
Best for	Real-time ingest	Auto delivery	High-volume streaming	Task queue	Real-time analytics

9. Summary

Service Selection Guide

When designing a real-time streaming architecture, choosing the right service for your requirements is crucial.

If simple data delivery is the goal: Use Amazon Data Firehose for direct delivery to S3, Redshift, etc.
If real-time processing with multiple consumers is needed: Kinesis Data Streams + KCL or Enhanced Fan-Out
If complex stream analytics are needed: Managed Service for Apache Flink
If high-volume processing and ecosystem are needed: Apache Kafka or Amazon MSK
If inter-microservice messaging is needed: Amazon SQS

Understand each service's strengths and consider combining multiple services for optimal production patterns. For example, collecting with Kinesis Data Streams, analyzing in real-time with Managed Flink, and storing long-term in S3 via Data Firehose is a very common architecture combination.