Skip to content

필사 모드: [AWS] Kinesis in Practice: Kafka Comparison and Streaming Patterns

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

1. Kinesis vs Apache Kafka Detailed Comparison

1.1 Architecture Differences

Apache Kafka and AWS Kinesis are both platforms for real-time data streaming,

but their fundamental architectural philosophies differ.

Apache Kafka Architecture

==========================

+----------+ +------------------------------------------+

| Producer | --> | Kafka Cluster |

+----------+ | |

| Broker 1 Broker 2 Broker 3 |

| +-------+ +-------+ +-------+ |

| |Topic A| |Topic A| |Topic A| |

| |Part 0 | |Part 1 | |Part 2 | |

| | | | | | | |

| |Topic B| |Topic B| |Topic B| |

| |Part 0 | |Part 1 | |Part 2 | |

| +-------+ +-------+ +-------+ |

| |

| KRaft (Kafka 4.0+, ZooKeeper removed) |

+------------------------------------------+

|

v

+----------+

| Consumer |

| Group |

+----------+

AWS Kinesis Data Streams Architecture

=======================================

+----------+ +------------------------------------------+

| Producer | --> | Kinesis Stream (Fully Managed) |

+----------+ | |

| Shard 1 Shard 2 Shard 3 |

| +-------+ +-------+ +-------+ |

| |Records| |Records| |Records| |

| |1MB/s W| |1MB/s W| |1MB/s W| |

| |2MB/s R| |2MB/s R| |2MB/s R| |

| +-------+ +-------+ +-------+ |

| |

| Fully managed by AWS |

+------------------------------------------+

|

v

+----------+

| Consumer |

| (KCL) |

+----------+

1.2 Comprehensive Comparison Table

| Comparison | AWS Kinesis | Apache Kafka (Self-hosted) | Amazon MSK |

| ----------------------------- | ----------------------- | ------------------------------------ | ------------------ |

| **Management** | Fully managed | Self-operated | Managed Kafka |

| **Scale unit** | Shard | Partition + Broker | Broker |

| **Write per shard/partition** | 1 MB/s | No limit (disk I/O) | Disk I/O dependent |

| **Read per shard/partition** | 2 MB/s (shared) | No limit | Disk I/O dependent |

| **Max retention** | 365 days | Unlimited | Unlimited |

| **Ordering** | Within shard | Within partition | Within partition |

| **Max record size** | 1 MB | Default 1 MB (configurable) | Default 1 MB |

| **Consumer model** | KCL, Enhanced Fan-Out | Consumer Groups | Consumer Groups |

| **Protocol** | HTTPS/HTTP2 | Custom TCP protocol | Custom TCP |

| **Ecosystem** | AWS service integration | Kafka Connect, Schema Registry, etc. | Kafka ecosystem |

| **Operational complexity** | Low | High | Medium |

| **Initial setup** | Minutes | Hours to days | Tens of minutes |

1.3 Throughput and Latency

Throughput Comparison

======================

Kinesis (Provisioned):

Write: 1 MB/s x number of shards

Read: 2 MB/s x number of shards (shared)

2 MB/s x shards x consumers (enhanced fan-out)

Example: 100 shards

Write: 100 MB/s

Read: 200 MB/s (shared) or 200 MB/s x N (enhanced fan-out)

Kafka (Self-hosted):

Depends on broker performance

Single broker: hundreds of MB/s possible

Cluster: multiple GB/s achievable

Example: 6 broker cluster

Write: 600+ MB/s

Read: multiple GB/s

Latency Comparison

===================

Kinesis:

- PutRecord: tens of ms

- GetRecords (shared fan-out): ~200 ms

- Enhanced Fan-Out: ~70 ms

Kafka:

- Producer -> Consumer: ~2-10 ms (network dependent)

- End-to-end: ~10-50 ms

1.4 Cost Comparison

Monthly Cost Estimate (US East)

=================================

Scenario: 10 MB/s sustained ingestion, 3 consumers

Kinesis (Provisioned Mode):

- 10 shards needed (10 MB/s / 1 MB/s per shard)

- Shard cost: 10 x 0.015 x 720 hours = ~108 USD

- PUT units: ~360 USD (~25.9B units/month)

- Enhanced fan-out (3 consumers): ~324 USD

Total: ~792 USD/month

Kinesis (On-Demand Advantage):

- Data write: ~25.9 TB x 0.032 = ~829 USD

- Data read: ~25.9 TB x 3 x 0.016 = ~1,243 USD

Total: ~2,072 USD/month

Kafka (Self-hosted on EC2):

- 3 brokers (m5.xlarge): 3 x 140 = ~420 USD

- EBS storage (1TB x 3): ~300 USD

- Operations personnel: separate

Total: ~720 USD/month + operational costs

Amazon MSK:

- 3 brokers (kafka.m5.large): ~456 USD

- Storage: ~300 USD

Total: ~756 USD/month

1.5 When to Choose What

**Choose Kinesis when:**

- Architecture deeply integrated with AWS

- You want to minimize operational burden

- Direct integration with Lambda, Firehose, and other AWS services is needed

- Small to medium throughput (under tens of MB/s)

- Rapid prototyping is needed

**Choose Kafka when:**

- Very high throughput is required (multiple GB/s)

- Multi-cloud or hybrid environment

- Kafka Connect ecosystem is needed

- Ultra-low latency is required (single-digit ms)

- Unlimited data retention is needed

2. Kinesis vs SQS: When to Use Which

| Comparison | Kinesis Data Streams | Amazon SQS |

| -------------------- | ----------------------------------- | --------------------------------------- |

| **Processing model** | Streaming (continuous) | Message queue (individual) |

| **Consumer count** | Multiple simultaneous | Primarily single consumer |

| **Ordering** | Guaranteed within shard | FIFO queue only |

| **Data retention** | 24h to 365 days | Up to 14 days |

| **Data replay** | Possible (sequence number based) | Not possible (deleted after processing) |

| **Throughput** | 1 MB/s write per shard | Nearly unlimited |

| **Message size** | Up to 1 MB | Up to 256 KB |

| **Latency** | Milliseconds | Milliseconds |

| **Cost model** | Shard hours + data transfer | Request-based |

| **Primary use** | Real-time analytics, log collection | Microservice decoupling |

Usage Scenario Decision Flowchart

====================================

Analyze data processing requirements

|

+----+----+

| |

Do multiple Do messages need

consumers need single processing?

to read the |

same data? SQS

|

Is real-time ordering

required?

|

+----+----+

| |

YES NO

| |

Kinesis SQS FIFO

Data or

Streams Kinesis

3. Amazon Managed Service for Apache Flink

3.1 Overview

Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) allows you to run Apache Flink

on fully managed infrastructure.

Note: The legacy Kinesis Data Analytics for SQL was discontinued for new application creation

after October 2025, and migration to Amazon Managed Service for Apache Flink is recommended.

3.2 Key Features

Managed Flink Architecture

============================

+-----------+ +----------------------------+ +-----------+

| Sources | | Managed Flink | | Sinks |

| | --> | | --> | |

| - Kinesis | | +------------------------+ | | - Kinesis |

| - MSK | | | Flink Application | | | - S3 |

| - S3 | | | | | | - DynamoDB|

| | | | - SQL queries | | | - Open- |

| | | | - Java/Scala apps | | | Search |

| | | | - Python (PyFlink) | | | - Redshift|

| | | | | | | |

| | | | Window aggregation | | | |

| | | | Pattern detection | | | |

| | | | CEP (Complex Event | | | |

| | | | Processing) | | | |

| | | +------------------------+ | | |

+-----------+ +----------------------------+ +-----------+

3.3 Window Processing Types

A core feature for grouping stream data by time-based windows for analysis.

Window Types

==============

1) Tumbling Window

- Fixed size, non-overlapping

|-------|-------|-------|-------|

0 5 10 15 20 (sec)

2) Sliding/Hopping Window

- Fixed size, slides at regular intervals

|-----------|

|-----------|

|-----------|

0 2 4 6 8 10 (sec)

Size: 6s, Slide: 2s

3) Session Window

- Activity-based, separated by inactivity gaps

|---event-event---| gap |--event-event-event--| gap |

<-- Session 1 --> <------ Session 2 ---->

4) Global Window

- Single window over the entire stream

3.4 Flink SQL Example

-- Define Kinesis source table

CREATE TABLE clickstream (

user_id VARCHAR,

page VARCHAR,

action VARCHAR,

event_time TIMESTAMP(3),

WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND

) WITH (

'connector' = 'kinesis',

'stream' = 'clickstream-data',

'aws.region' = 'us-east-1',

'scan.stream.initpos' = 'LATEST',

'format' = 'json'

);

-- Aggregate page views per 1-minute tumbling window

SELECT

page,

COUNT(*) AS view_count,

COUNT(DISTINCT user_id) AS unique_users,

TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,

TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end

FROM clickstream

WHERE action = 'view'

GROUP BY

page,

TUMBLE(event_time, INTERVAL '1' MINUTE);

-- Anomaly detection: 10+ clicks by same user within 1 minute

SELECT

user_id,

COUNT(*) AS click_count,

TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start

FROM clickstream

WHERE action = 'click'

GROUP BY

user_id,

TUMBLE(event_time, INTERVAL '1' MINUTE)

HAVING COUNT(*) >= 10;

4. Real-World Streaming Architecture Patterns

4.1 Log Aggregation Pipeline

Log Aggregation Architecture

==============================

+----------+ +----------+ +---------+ +----------+

| App | | Kinesis | | Firehose| | S3 |

| Server 1 | --> | Agent | --> | | --> | (Raw |

+----------+ +----------+ | | | Logs) |

| | +----------+

+----------+ +----------+ | | |

| App | | Kinesis | | | v

| Server 2 | --> | Agent | --> | | +----------+

+----------+ +----------+ | | | Athena |

| | | (Query) |

+----------+ +----------+ | | +----------+

| App | | Kinesis | | |

| Server N | --> | Agent | --> | |

+----------+ +----------+ +---------+

|

v

+----------+ +----------+

| Lambda | --> | Open- |

| (Real- | | Search |

| time | | (Search/ |

| alerts) | | Dashboard)|

+----------+ +----------+

4.2 Real-Time Analytics Dashboard

Real-Time Dashboard Architecture

==================================

+----------+ +----------+ +-----------+ +----------+

| Web/ | | API | | Kinesis | | Managed |

| Mobile | --> | Gateway | --> | Data | --> | Flink |

| Client | | | | Streams | | (Aggre- |

+----------+ +----------+ +-----------+ | gation) |

+----------+

|

+---------+----+----+---------+

| | | |

v v v v

+--------+ +--------+ +--------+ +--------+

|DynamoDB| |Time- | | S3 | |Cloud- |

|(Real- | |stream | |(Long- | |Watch |

| time) | |(Time- | | term) | |(Metric |

| | | series)| | | | Alarms)|

+--------+ +--------+ +--------+ +--------+

| |

v v

+--------------------+

| Dashboard App |

| (React/Vue + |

| WebSocket) |

+--------------------+

4.3 IoT Data Ingestion

IoT device simulator

from datetime import datetime

kinesis = boto3.client('kinesis', region_name='us-east-1')

STREAM_NAME = 'iot-sensor-data'

def simulate_sensor(device_id):

"""Simulate IoT sensor data"""

return {

'device_id': device_id,

'temperature': round(random.uniform(15.0, 45.0), 2),

'humidity': round(random.uniform(20.0, 90.0), 2),

'pressure': round(random.uniform(990.0, 1030.0), 2),

'battery_level': round(random.uniform(0.0, 100.0), 1),

'location': {

'lat': round(random.uniform(33.0, 38.0), 6),

'lon': round(random.uniform(126.0, 130.0), 6)

},

'timestamp': datetime.utcnow().isoformat() + 'Z'

}

def ingest_iot_data(num_devices=100, interval=1.0):

"""Ingest IoT data into Kinesis"""

device_ids = [f'sensor-{i:04d}' for i in range(num_devices)]

while True:

records = []

for device_id in device_ids:

sensor_data = simulate_sensor(device_id)

records.append({

'Data': json.dumps(sensor_data).encode('utf-8'),

'PartitionKey': device_id

})

PutRecords supports up to 500 records

for batch_start in range(0, len(records), 500):

batch = records[batch_start:batch_start + 500]

response = kinesis.put_records(

StreamName=STREAM_NAME,

Records=batch

)

failed = response['FailedRecordCount']

if failed > 0:

print(f"Batch failed: {failed} records")

Exponential backoff retry logic

retry_records = []

for i, result in enumerate(response['Records']):

if 'ErrorCode' in result:

retry_records.append(batch[i])

if retry_records:

time.sleep(0.5)

kinesis.put_records(

StreamName=STREAM_NAME,

Records=retry_records

)

print(f"Ingested {len(records)} sensor readings")

time.sleep(interval)

if __name__ == '__main__':

ingest_iot_data()

4.4 Event Sourcing Pattern

Event Sourcing Architecture

=============================

+----------+ +----------+ +-----------+

| Command | | Kinesis | | Event |

| Handler | --> | Data | --> | Processor |

| | | Streams | | (Lambda/ |

| - Order | | (Event | | ECS) |

| - Payment| | Store) | +-----------+

| - Ship | +----------+ |

+----------+ | +-----+-----+

| | |

v v v

+----------+ +--------+ +--------+

| S3 | |DynamoDB| |SNS |

| (Event | |(Read | |(Notif- |

| Archive)| | Model) | | ication)|

+----------+ +--------+ +--------+

Event Flow Example:

1. OrderCreated -> Order creation event

2. PaymentProcessed -> Payment processing event

3. InventoryReserved -> Inventory reservation event

4. ShipmentCreated -> Shipment creation event

4.5 ML Feature Pipeline

ML Feature Pipeline

=====================

+----------+ +----------+ +-----------+ +----------+

| Event | | Kinesis | | Managed | | Feature |

| Sources | --> | Data | --> | Flink | --> | Store |

| | | Streams | | (Feature | | (Sage- |

| - Clicks | | | | Compute) | | Maker) |

| - Purch. | | | | | +----------+

| - Search | | | | Real-time:| |

+----------+ +----------+ | - Session | v

| count | +----------+

| - Recent | | ML Model |

| purch. | | Inference|

| - Avg | +----------+

| dwell |

+-----------+

5. Performance Optimization

5.1 Partition Key Design

Partition key design is the most critical factor for Kinesis performance.

**Criteria for good partition keys:**

- High cardinality (many unique values)

- Even distribution (no skew toward specific keys)

- At least 10x as many unique keys as the number of shards

Good Partition Key Examples

============================

1) User ID (high cardinality)

user-001 -> Shard 1

user-002 -> Shard 3

user-003 -> Shard 2

...

Evenly distributed

2) UUID (best distribution)

Random UUID -> Perfect distribution

Downside: Cannot guarantee ordering for same entity

3) Composite key

"region-userType-userId"

Fine-grained distribution control

Bad Partition Key Examples

===========================

1) Date ("2026-03-20")

All records to same shard -> Hot shard

2) Country code ("KR", "US", "JP")

Cardinality too low

Traffic skews toward specific countries

3) Fixed value ("default")

All load concentrated on single shard

5.2 Aggregation with KPL

KPL Aggregation Optimization

==============================

Without aggregation:

Record 1 (100B) -> PutRecord -> 1 API call

Record 2 (200B) -> PutRecord -> 1 API call

Record 3 (150B) -> PutRecord -> 1 API call

Total: 3 API calls, 450B transmitted

With KPL aggregation:

Record 1 (100B) --+

Record 2 (200B) --+--> Aggregated record (450B) -> 1 API call

Record 3 (150B) --+

Total: 1 API call, 450B transmitted

Benefits:

- Dramatically reduced API calls

- Lower PUT unit costs

- Significantly improved throughput

5.3 Enhanced Fan-Out Strategy

Register enhanced fan-out consumer

kinesis = boto3.client('kinesis', region_name='us-east-1')

Register consumer

response = kinesis.register_stream_consumer(

StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',

ConsumerName='analytics-consumer'

)

consumer_arn = response['Consumer']['ConsumerARN']

print(f"Consumer ARN: {consumer_arn}")

Check consumer status

response = kinesis.describe_stream_consumer(

StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',

ConsumerName='analytics-consumer'

)

print(f"Status: {response['ConsumerDescription']['ConsumerStatus']}")

5.4 Error Handling and Retry Strategy

def put_records_with_retry(kinesis_client, stream_name, records, max_retries=3):

"""PutRecords retry with exponential backoff"""

for attempt in range(max_retries):

response = kinesis_client.put_records(

StreamName=stream_name,

Records=records

)

failed_count = response['FailedRecordCount']

if failed_count == 0:

return response

Extract only failed records

retry_records = []

for i, result in enumerate(response['Records']):

if 'ErrorCode' in result:

error_code = result['ErrorCode']

if error_code == 'ProvisionedThroughputExceededException':

retry_records.append(records[i])

else:

print(f"Non-retryable error: {error_code}")

if not retry_records:

return response

records = retry_records

Exponential backoff + jitter

backoff = min(2 ** attempt * 0.1, 5.0)

jitter = random.uniform(0, backoff * 0.5)

wait_time = backoff + jitter

print(f"Retry {attempt + 1}: {len(retry_records)} records, waiting {wait_time:.2f}s")

time.sleep(wait_time)

print(f"Failed after {max_retries} retries: {len(records)} records")

return None

6. Monitoring: CloudWatch Metrics

6.1 Key Monitoring Metrics

| Metric | Description | Alert Threshold |

| ---------------------------------- | ---------------------------- | --------------------- |

| IncomingBytes | Bytes entering the stream | 80% of shard capacity |

| IncomingRecords | Records entering the stream | 800 rec/s per shard |

| GetRecords.IteratorAgeMilliseconds | How far behind a consumer is | 60,000 ms (1 min) |

| WriteProvisionedThroughputExceeded | Write throttle count | Alarm when above 0 |

| ReadProvisionedThroughputExceeded | Read throttle count | Alarm when above 0 |

| GetRecords.Latency | GetRecords call latency | 1,000 ms |

| PutRecord.Latency | PutRecord call latency | 1,000 ms |

| GetRecords.Success | GetRecords success rate | Alarm when below 99% |

6.2 Enhanced Monitoring

Additional Metrics with Enhanced Monitoring

=============================================

Shard-level metrics:

- IncomingBytes (per shard)

- IncomingRecords (per shard)

- IteratorAgeMilliseconds (per shard)

- OutgoingBytes (per shard)

- OutgoingRecords (per shard)

- ReadProvisionedThroughputExceeded (per shard)

- WriteProvisionedThroughputExceeded (per shard)

Hot Shard Detection:

Shard 1: IncomingBytes = 200 KB/s [Normal]

Shard 2: IncomingBytes = 950 KB/s [Warning! Near limit]

Shard 3: IncomingBytes = 300 KB/s [Normal]

-> Recommend splitting Shard 2

7. Best Practices and Anti-Patterns

7.1 Best Practices

**1) Partition Key Design**

- Use high-cardinality keys (user IDs, device IDs)

- Ensure at least 10x unique keys relative to shard count

- Use entity IDs as partition keys when ordering is needed

**2) Producer Optimization**

- Use PutRecords (batch) API to minimize API calls

- Use KPL for record aggregation and collection optimization

- Implement proper retry logic (exponential backoff + jitter)

**3) Consumer Optimization**

- Use enhanced fan-out for multiple consumers

- Use KCL for automated distributed processing

- Optimize checkpointing frequency (too frequent increases DynamoDB costs)

**4) Capacity Management**

- Predictable workloads: Provisioned mode

- Irregular workloads: On-demand mode

- Trigger auto-scaling with CloudWatch alarms

**5) Cost Optimization**

- Consider On-Demand Advantage mode (released 2025)

- Set retention period only as long as needed

- Deregister unnecessary enhanced fan-out consumers

7.2 Anti-Patterns

**1) Single Partition Key**

- All data concentrates on one shard

- Immediately hits shard limits

**2) Excessive Shard Count**

- Increased costs and management complexity

- Increased DynamoDB lease table load for KCL

**3) Not Using Checkpoints**

- Duplicate processing or data loss on failure

- Always implement proper checkpointing strategy

**4) Missing Error Handling**

- Ignoring ProvisionedThroughputExceededException

- Losing failed data without retries

**5) Excessive GetRecords Calls**

- Must respect 5 calls per second per shard limit

- Set appropriate polling intervals

8. Comprehensive Comparison Summary Table

| Item | Kinesis Data Streams | Kinesis Firehose | Kafka | SQS | Managed Flink |

| ------------------- | -------------------- | ---------------- | --------------------- | ------------- | ------------------- |

| **Type** | Data streaming | Data delivery | Data streaming | Message queue | Stream processing |

| **Management** | Fully managed | Fully managed | Self/MSK | Fully managed | Fully managed |

| **Latency** | ms | 60s+ | ms | ms | ms |

| **Ordering** | Within shard | None | Within partition | FIFO only | Input dependent |

| **Replay** | Yes | No | Yes | No | Input dependent |

| **Scaling** | Add shards | Automatic | Partition/broker | Automatic | Add KPUs |

| **Transform** | None (consumer) | Lambda | Kafka Streams | None | Flink app |

| **Cost model** | Shard+data | Data volume | Instance | Requests | KPU hours |

| **AWS integration** | High | Very high | Low/Medium | High | High |

| **Best for** | Real-time ingest | Auto delivery | High-volume streaming | Task queue | Real-time analytics |

9. Summary

Service Selection Guide

When designing a real-time streaming architecture, choosing the right service for your requirements is crucial.

- **If simple data delivery is the goal**: Use Amazon Data Firehose for direct delivery to S3, Redshift, etc.

- **If real-time processing with multiple consumers is needed**: Kinesis Data Streams + KCL or Enhanced Fan-Out

- **If complex stream analytics are needed**: Managed Service for Apache Flink

- **If high-volume processing and ecosystem are needed**: Apache Kafka or Amazon MSK

- **If inter-microservice messaging is needed**: Amazon SQS

Understand each service's strengths and consider combining multiple services for optimal production patterns.

For example, collecting with Kinesis Data Streams, analyzing in real-time with Managed Flink, and storing

long-term in S3 via Data Firehose is a very common architecture combination.

현재 단락 (1/532)

Apache Kafka and AWS Kinesis are both platforms for real-time data streaming,

작성 글자: 0원문 글자: 17,848작성 단락: 0/532