Authors

- Youngju Kim (@fjvbn20031)
1. Kinesis vs Apache Kafka Detailed Comparison
1.1 Architecture Differences
Apache Kafka and AWS Kinesis are both platforms for real-time data streaming, but their fundamental architectural philosophies differ.
Apache Kafka Architecture
==========================
+----------+     +------------------------------------------+
| Producer | --> |              Kafka Cluster               |
+----------+     |                                          |
                 |  Broker 1    Broker 2    Broker 3        |
                 |  +-------+   +-------+   +-------+       |
                 |  |Topic A|   |Topic A|   |Topic A|       |
                 |  |Part 0 |   |Part 1 |   |Part 2 |       |
                 |  |       |   |       |   |       |       |
                 |  |Topic B|   |Topic B|   |Topic B|       |
                 |  |Part 0 |   |Part 1 |   |Part 2 |       |
                 |  +-------+   +-------+   +-------+       |
                 |                                          |
                 |  KRaft (Kafka 4.0+, ZooKeeper removed)   |
                 +------------------------------------------+
                                     |
                                     v
                               +----------+
                               | Consumer |
                               |  Group   |
                               +----------+
AWS Kinesis Data Streams Architecture
=======================================
+----------+     +------------------------------------------+
| Producer | --> |      Kinesis Stream (Fully Managed)      |
+----------+     |                                          |
                 |  Shard 1     Shard 2     Shard 3         |
                 |  +-------+   +-------+   +-------+       |
                 |  |Records|   |Records|   |Records|       |
                 |  |1MB/s W|   |1MB/s W|   |1MB/s W|       |
                 |  |2MB/s R|   |2MB/s R|   |2MB/s R|       |
                 |  +-------+   +-------+   +-------+       |
                 |                                          |
                 |  Fully managed by AWS                    |
                 +------------------------------------------+
                                     |
                                     v
                               +----------+
                               | Consumer |
                               |  (KCL)   |
                               +----------+
1.2 Comprehensive Comparison Table
| Comparison | AWS Kinesis | Apache Kafka (Self-hosted) | Amazon MSK |
|---|---|---|---|
| Management | Fully managed | Self-operated | Managed Kafka |
| Scale unit | Shard | Partition + Broker | Broker |
| Write per shard/partition | 1 MB/s | No limit (disk I/O) | Disk I/O dependent |
| Read per shard/partition | 2 MB/s (shared) | No limit | Disk I/O dependent |
| Max retention | 365 days | Unlimited | Unlimited |
| Ordering | Within shard | Within partition | Within partition |
| Max record size | 1 MB | Default 1 MB (configurable) | Default 1 MB |
| Consumer model | KCL, Enhanced Fan-Out | Consumer Groups | Consumer Groups |
| Protocol | HTTPS/HTTP2 | Custom TCP protocol | Custom TCP |
| Ecosystem | AWS service integration | Kafka Connect, Schema Registry, etc. | Kafka ecosystem |
| Operational complexity | Low | High | Medium |
| Initial setup | Minutes | Hours to days | Tens of minutes |
1.3 Throughput and Latency
Throughput Comparison
======================
Kinesis (Provisioned):
Write: 1 MB/s x number of shards
Read: 2 MB/s x number of shards (shared)
2 MB/s x shards x consumers (enhanced fan-out)
Example: 100 shards
Write: 100 MB/s
Read: 200 MB/s (shared) or 200 MB/s x N (enhanced fan-out)
Kafka (Self-hosted):
Depends on broker performance
Single broker: hundreds of MB/s possible
Cluster: multiple GB/s achievable
Example: 6 broker cluster
Write: 600+ MB/s
Read: multiple GB/s
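The Kinesis side of this arithmetic is easy to sanity-check in code. Below is a minimal sketch of a shard estimator based on the published per-shard write limits (1 MB/s and 1,000 records/s); the function name is illustrative, not an AWS API, and the record rate in the example is an assumed figure:

```python
import math

def required_shards(write_mbps: float, records_per_sec: int) -> int:
    """Provisioned shards needed on the write side: 1 MB/s or
    1,000 records/s per shard, whichever limit binds first."""
    by_bytes = math.ceil(write_mbps / 1.0)        # 1 MB/s write per shard
    by_records = math.ceil(records_per_sec / 1000)  # 1,000 records/s per shard
    return max(by_bytes, by_records, 1)

# The 100-shard example above: 100 MB/s write (assuming 50k records/s)
print(required_shards(write_mbps=100, records_per_sec=50_000))  # -> 100
```

Read capacity follows the same shape: 2 MB/s per shard shared across consumers, or 2 MB/s per shard per consumer with enhanced fan-out.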
Latency Comparison
===================
Kinesis:
- PutRecord: tens of ms
- GetRecords (shared fan-out): ~200 ms
- Enhanced Fan-Out: ~70 ms
Kafka:
- Producer -> Consumer: ~2-10 ms (network dependent)
- End-to-end: ~10-50 ms
1.4 Cost Comparison
Monthly Cost Estimate (US East)
=================================
Scenario: 10 MB/s sustained ingestion, 3 consumers
Kinesis (Provisioned Mode):
- 10 shards needed (10 MB/s / 1 MB/s per shard)
- Shard cost: 10 x 0.015 x 720 hours = ~108 USD
- PUT units: ~360 USD (~25.9B units/month)
- Enhanced fan-out (3 consumers): ~324 USD
Total: ~792 USD/month
Kinesis (On-Demand Advantage):
- Data write: ~25.9 TB x 0.032 = ~829 USD
- Data read: ~25.9 TB x 3 x 0.016 = ~1,243 USD
Total: ~2,072 USD/month
Kafka (Self-hosted on EC2):
- 3 brokers (m5.xlarge): 3 x 140 = ~420 USD
- EBS storage (1TB x 3): ~300 USD
- Operations personnel: not included in this estimate
Total: ~720 USD/month + operational costs
Amazon MSK:
- 3 brokers (kafka.m5.large): ~456 USD
- Storage: ~300 USD
Total: ~756 USD/month
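The provisioned-mode estimate above can be reproduced with a small calculator. The rates below are the illustrative us-east-1 figures used in this scenario (shard-hour, PUT payload units per million, enhanced fan-out consumer-shard-hour); they are assumptions for the example, so always check the current price list:

```python
def provisioned_monthly_cost(shards, put_units_millions, efo_consumers,
                             hours=720,
                             shard_hour_usd=0.015,
                             put_per_million_usd=0.014,
                             efo_consumer_shard_hour_usd=0.015):
    """Rough monthly cost: shard hours + PUT payload units +
    enhanced fan-out consumer-shard hours (data retrieval excluded)."""
    shard_cost = shards * hours * shard_hour_usd
    put_cost = put_units_millions * put_per_million_usd
    efo_cost = shards * efo_consumers * hours * efo_consumer_shard_hour_usd
    return round(shard_cost + put_cost + efo_cost)

# 10 shards, ~25.9B PUT units, 3 enhanced fan-out consumers
print(provisioned_monthly_cost(10, 25_900, 3))  # -> 795, close to ~792 above
```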
1.5 When to Choose What
Choose Kinesis when:
- Architecture deeply integrated with AWS
- You want to minimize operational burden
- Direct integration with Lambda, Firehose, and other AWS services is needed
- Small to medium throughput (under tens of MB/s)
- Rapid prototyping is needed
Choose Kafka when:
- Very high throughput is required (multiple GB/s)
- Multi-cloud or hybrid environment
- Kafka Connect ecosystem is needed
- Ultra-low latency is required (single-digit ms)
- Unlimited data retention is needed
2. Kinesis vs SQS: When to Use Which
| Comparison | Kinesis Data Streams | Amazon SQS |
|---|---|---|
| Processing model | Streaming (continuous) | Message queue (individual) |
| Consumer count | Multiple simultaneous | Primarily single consumer |
| Ordering | Guaranteed within shard | FIFO queue only |
| Data retention | 24h to 365 days | Up to 14 days |
| Data replay | Possible (sequence number based) | Not possible (deleted after processing) |
| Throughput | 1 MB/s write per shard | Nearly unlimited |
| Message size | Up to 1 MB | Up to 256 KB |
| Latency | Milliseconds | Milliseconds |
| Cost model | Shard hours + data transfer | Request-based |
| Primary use | Real-time analytics, log collection | Microservice decoupling |
Usage Scenario Decision Flowchart
====================================
Analyze the data processing requirements:

1) Do multiple consumers need to read the same data?
   - No (each message needs single processing): use SQS.
   - Yes: go to 2).
2) Is real-time ordering required?
   - Yes: use Kinesis Data Streams.
   - No: use SQS FIFO or Kinesis.
3. Amazon Managed Service for Apache Flink
3.1 Overview
Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) allows you to run Apache Flink on fully managed infrastructure.
Note: The legacy Kinesis Data Analytics for SQL stopped accepting new applications in October 2025; AWS recommends migrating to Amazon Managed Service for Apache Flink.
3.2 Key Features
Managed Flink Architecture
============================
+-----------+ +----------------------------+ +-----------+
| Sources | | Managed Flink | | Sinks |
| | --> | | --> | |
| - Kinesis | | +------------------------+ | | - Kinesis |
| - MSK | | | Flink Application | | | - S3 |
| - S3 | | | | | | - DynamoDB|
| | | | - SQL queries | | | - Open- |
| | | | - Java/Scala apps | | | Search |
| | | | - Python (PyFlink) | | | - Redshift|
| | | | | | | |
| | | | Window aggregation | | | |
| | | | Pattern detection | | | |
| | | | CEP (Complex Event | | | |
| | | | Processing) | | | |
| | | +------------------------+ | | |
+-----------+ +----------------------------+ +-----------+
3.3 Window Processing Types
A core feature for grouping stream data by time-based windows for analysis.
Window Types
==============
1) Tumbling Window
- Fixed size, non-overlapping
|-------|-------|-------|-------|
0       5       10      15      20  (sec)
2) Sliding/Hopping Window
- Fixed size, slides at regular intervals
|-----------|
    |-----------|
        |-----------|
0   2   4   6   8   10  (sec)
Size: 6s, Slide: 2s
3) Session Window
- Activity-based, separated by inactivity gaps
|---event-event---| gap |--event-event-event--| gap |
<-- Session 1 --> <------ Session 2 ---->
4) Global Window
- Single window over the entire stream
3.4 Flink SQL Example
```sql
-- Define Kinesis source table
CREATE TABLE clickstream (
  user_id VARCHAR,
  page VARCHAR,
  action VARCHAR,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kinesis',
  'stream' = 'clickstream-data',
  'aws.region' = 'us-east-1',
  'scan.stream.initpos' = 'LATEST',
  'format' = 'json'
);

-- Aggregate page views per 1-minute tumbling window
SELECT
  page,
  COUNT(*) AS view_count,
  COUNT(DISTINCT user_id) AS unique_users,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM clickstream
WHERE action = 'view'
GROUP BY
  page,
  TUMBLE(event_time, INTERVAL '1' MINUTE);

-- Anomaly detection: 10+ clicks by the same user within 1 minute
SELECT
  user_id,
  COUNT(*) AS click_count,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start
FROM clickstream
WHERE action = 'click'
GROUP BY
  user_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE)
HAVING COUNT(*) >= 10;
```
4. Real-World Streaming Architecture Patterns
4.1 Log Aggregation Pipeline
Log Aggregation Architecture
==============================
+----------+ +----------+ +---------+ +----------+
| App | | Kinesis | | Firehose| | S3 |
| Server 1 | --> | Agent | --> | | --> | (Raw |
+----------+ +----------+ | | | Logs) |
| | +----------+
+----------+ +----------+ | | |
| App | | Kinesis | | | v
| Server 2 | --> | Agent | --> | | +----------+
+----------+ +----------+ | | | Athena |
| | | (Query) |
+----------+ +----------+ | | +----------+
| App | | Kinesis | | |
| Server N | --> | Agent | --> | |
+----------+ +----------+ +---------+
|
v
+----------+ +----------+
| Lambda | --> | Open- |
| (Real- | | Search |
| time | | (Search/ |
| alerts) | | Dashboard)|
+----------+ +----------+
4.2 Real-Time Analytics Dashboard
Real-Time Dashboard Architecture
==================================
+----------+ +----------+ +-----------+ +----------+
| Web/ | | API | | Kinesis | | Managed |
| Mobile | --> | Gateway | --> | Data | --> | Flink |
| Client | | | | Streams | | (Aggre- |
+----------+ +----------+ +-----------+ | gation) |
+----------+
|
+---------+----+----+---------+
| | | |
v v v v
+--------+ +--------+ +--------+ +--------+
|DynamoDB| |Time- | | S3 | |Cloud- |
|(Real- | |stream | |(Long- | |Watch |
| time) | |(Time- | | term) | |(Metric |
| | | series)| | | | Alarms)|
+--------+ +--------+ +--------+ +--------+
| |
v v
+--------------------+
| Dashboard App |
| (React/Vue + |
| WebSocket) |
+--------------------+
4.3 IoT Data Ingestion
```python
# IoT device simulator
import json
import random
import time
from datetime import datetime

import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')
STREAM_NAME = 'iot-sensor-data'

def simulate_sensor(device_id):
    """Simulate IoT sensor data"""
    return {
        'device_id': device_id,
        'temperature': round(random.uniform(15.0, 45.0), 2),
        'humidity': round(random.uniform(20.0, 90.0), 2),
        'pressure': round(random.uniform(990.0, 1030.0), 2),
        'battery_level': round(random.uniform(0.0, 100.0), 1),
        'location': {
            'lat': round(random.uniform(33.0, 38.0), 6),
            'lon': round(random.uniform(126.0, 130.0), 6)
        },
        'timestamp': datetime.utcnow().isoformat() + 'Z'
    }

def ingest_iot_data(num_devices=100, interval=1.0):
    """Ingest IoT data into Kinesis"""
    device_ids = [f'sensor-{i:04d}' for i in range(num_devices)]
    while True:
        records = []
        for device_id in device_ids:
            sensor_data = simulate_sensor(device_id)
            records.append({
                'Data': json.dumps(sensor_data).encode('utf-8'),
                'PartitionKey': device_id
            })
        # PutRecords accepts at most 500 records per call
        for batch_start in range(0, len(records), 500):
            batch = records[batch_start:batch_start + 500]
            response = kinesis.put_records(
                StreamName=STREAM_NAME,
                Records=batch
            )
            failed = response['FailedRecordCount']
            if failed > 0:
                print(f"Batch failed: {failed} records")
                # Simple fixed-delay retry of only the failed records;
                # see section 5.4 for an exponential backoff version
                retry_records = []
                for i, result in enumerate(response['Records']):
                    if 'ErrorCode' in result:
                        retry_records.append(batch[i])
                if retry_records:
                    time.sleep(0.5)
                    kinesis.put_records(
                        StreamName=STREAM_NAME,
                        Records=retry_records
                    )
        print(f"Ingested {len(records)} sensor readings")
        time.sleep(interval)

if __name__ == '__main__':
    ingest_iot_data()
```
4.4 Event Sourcing Pattern
Event Sourcing Architecture
=============================
+----------+ +----------+ +-----------+
| Command | | Kinesis | | Event |
| Handler | --> | Data | --> | Processor |
| | | Streams | | (Lambda/ |
| - Order | | (Event | | ECS) |
| - Payment| | Store) | +-----------+
| - Ship | +----------+ |
+----------+ | +-----+-----+
| | |
v v v
+----------+ +--------+ +--------+
| S3 | |DynamoDB| |SNS |
| (Event | |(Read | |(Notif- |
| Archive)| | Model) | | ication)|
+----------+ +--------+ +--------+
Event Flow Example:
1. OrderCreated -> Order creation event
2. PaymentProcessed -> Payment processing event
3. InventoryReserved -> Inventory reservation event
4. ShipmentCreated -> Shipment creation event
4.5 ML Feature Pipeline
ML Feature Pipeline
=====================
+----------+ +----------+ +-----------+ +----------+
| Event | | Kinesis | | Managed | | Feature |
| Sources | --> | Data | --> | Flink | --> | Store |
| | | Streams | | (Feature | | (Sage- |
| - Clicks | | | | Compute) | | Maker) |
| - Purch. | | | | | +----------+
| - Search | | | | Real-time:| |
+----------+ +----------+ | - Session | v
| count | +----------+
| - Recent | | ML Model |
| purch. | | Inference|
| - Avg | +----------+
| dwell |
+-----------+
5. Performance Optimization
5.1 Partition Key Design
Partition key design is the most critical factor for Kinesis performance.
Criteria for good partition keys:
- High cardinality (many unique values)
- Even distribution (no skew toward specific keys)
- At least 10x as many unique keys as the number of shards
Good Partition Key Examples
============================
1) User ID (high cardinality)
user-001 -> Shard 1
user-002 -> Shard 3
user-003 -> Shard 2
...
Evenly distributed
2) UUID (best distribution)
Random UUID -> Perfect distribution
Downside: Cannot guarantee ordering for same entity
3) Composite key
"region-userType-userId"
Fine-grained distribution control
Bad Partition Key Examples
===========================
1) Date ("2026-03-20")
All records to same shard -> Hot shard
2) Country code ("KR", "US", "JP")
Cardinality too low
Traffic skews toward specific countries
3) Fixed value ("default")
All load concentrated on single shard
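You can check a candidate key scheme before going to production. Kinesis routes each record by the MD5 hash of its partition key into a shard's hash-key range; the sketch below mimics that mapping for evenly split ranges (an approximation, since real shards can have uneven ranges after splits/merges), which is enough to expose a hot key:

```python
import hashlib
from collections import Counter

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Approximate Kinesis routing: MD5 of the key as a 128-bit int,
    mapped onto evenly split shard hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode('utf-8')).hexdigest(), 16)
    return min(h * num_shards // 2**128, num_shards - 1)

def distribution(keys, num_shards):
    return Counter(shard_for_key(k, num_shards) for k in keys)

# High-cardinality user IDs spread evenly; a date key makes a hot shard
good = distribution([f'user-{i:04d}' for i in range(1000)], 4)
bad = distribution(['2026-03-20'] * 1000, 4)
print(dict(good))  # roughly 250 records per shard
print(dict(bad))   # all 1000 records on a single shard
```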
5.2 Aggregation with KPL
KPL Aggregation Optimization
==============================
Without aggregation:
Record 1 (100B) -> PutRecord -> 1 API call
Record 2 (200B) -> PutRecord -> 1 API call
Record 3 (150B) -> PutRecord -> 1 API call
Total: 3 API calls, 450B transmitted
With KPL aggregation:
Record 1 (100B) --+
Record 2 (200B) --+--> Aggregated record (450B) -> 1 API call
Record 3 (150B) --+
Total: 1 API call, 450B transmitted
Benefits:
- Dramatically reduced API calls
- Lower PUT unit costs
- Significantly improved throughput
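If you cannot run the KPL (for example, from a pure-Python producer), the batching idea can be approximated by hand. The sketch below packs small JSON records into newline-delimited blobs so one PutRecords entry carries many logical records. Note this is not the KPL wire format (the KPL uses a protobuf-based envelope that the KCL de-aggregates automatically), so consumers of these blobs must split on newlines themselves:

```python
import json

def aggregate(records, max_bytes=25 * 1024):
    """Pack records into newline-delimited blobs under max_bytes each."""
    batches, current, size = [], [], 0
    for rec in records:
        line = json.dumps(rec)
        # Flush the current blob when adding this line would overflow it
        if current and size + len(line) + 1 > max_bytes:
            batches.append('\n'.join(current).encode('utf-8'))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        batches.append('\n'.join(current).encode('utf-8'))
    return batches

events = [{'id': i, 'value': i * 2} for i in range(1000)]
blobs = aggregate(events)
print(f"{len(events)} events packed into {len(blobs)} Kinesis records")
```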
5.3 Enhanced Fan-Out Strategy
```python
# Register an enhanced fan-out consumer
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# Register the consumer against the stream
response = kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
consumer_arn = response['Consumer']['ConsumerARN']
print(f"Consumer ARN: {consumer_arn}")

# Check consumer status (must be ACTIVE before subscribing)
response = kinesis.describe_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',
    ConsumerName='analytics-consumer'
)
print(f"Status: {response['ConsumerDescription']['ConsumerStatus']}")
```
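Once the consumer is ACTIVE, it reads via SubscribeToShard, which pushes records over HTTP/2. A minimal sketch follows; the ARN and shard ID are placeholders, and a production consumer would re-subscribe before the roughly five-minute subscription window expires, resuming from its last checkpointed sequence number:

```python
def starting_position(sequence_number=None):
    """StartingPosition for SubscribeToShard: resume after a checkpointed
    sequence number, or start at the tip of the shard."""
    if sequence_number:
        return {'Type': 'AFTER_SEQUENCE_NUMBER',
                'SequenceNumber': sequence_number}
    return {'Type': 'LATEST'}

def consume_shard(consumer_arn, shard_id, region='us-east-1'):
    import boto3  # deferred so starting_position() works without AWS
    kinesis = boto3.client('kinesis', region_name=region)
    response = kinesis.subscribe_to_shard(
        ConsumerARN=consumer_arn,
        ShardId=shard_id,
        StartingPosition=starting_position()
    )
    # The EventStream yields SubscribeToShardEvent payloads as they arrive
    for event in response['EventStream']:
        for record in event['SubscribeToShardEvent']['Records']:
            print(record['SequenceNumber'], record['Data'][:50])
```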
5.4 Error Handling and Retry Strategy
```python
import random
import time

def put_records_with_retry(kinesis_client, stream_name, records, max_retries=3):
    """PutRecords retry with exponential backoff"""
    for attempt in range(max_retries):
        response = kinesis_client.put_records(
            StreamName=stream_name,
            Records=records
        )
        failed_count = response['FailedRecordCount']
        if failed_count == 0:
            return response
        # Extract only failed records
        retry_records = []
        for i, result in enumerate(response['Records']):
            if 'ErrorCode' in result:
                error_code = result['ErrorCode']
                if error_code == 'ProvisionedThroughputExceededException':
                    retry_records.append(records[i])
                else:
                    print(f"Non-retryable error: {error_code}")
        if not retry_records:
            return response
        records = retry_records
        # Exponential backoff + jitter
        backoff = min(2 ** attempt * 0.1, 5.0)
        jitter = random.uniform(0, backoff * 0.5)
        wait_time = backoff + jitter
        print(f"Retry {attempt + 1}: {len(retry_records)} records, "
              f"waiting {wait_time:.2f}s")
        time.sleep(wait_time)
    print(f"Failed after {max_retries} retries: {len(records)} records")
    return None
```
6. Monitoring: CloudWatch Metrics
6.1 Key Monitoring Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| IncomingBytes | Bytes entering the stream | 80% of shard capacity |
| IncomingRecords | Records entering the stream | 800 rec/s per shard |
| GetRecords.IteratorAgeMilliseconds | How far behind a consumer is | 60,000 ms (1 min) |
| WriteProvisionedThroughputExceeded | Write throttle count | Alarm when above 0 |
| ReadProvisionedThroughputExceeded | Read throttle count | Alarm when above 0 |
| GetRecords.Latency | GetRecords call latency | 1,000 ms |
| PutRecord.Latency | PutRecord call latency | 1,000 ms |
| GetRecords.Success | GetRecords success rate | Alarm when below 99% |
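These thresholds translate directly into CloudWatch alarms. Below is a sketch for the most important one, consumer lag; the alarm name, period, and evaluation settings are illustrative choices, and you would normally attach an SNS action as well:

```python
def iterator_age_alarm(stream_name, threshold_ms=60_000):
    """Alarm parameters for GetRecords.IteratorAgeMilliseconds,
    matching the 1-minute threshold in the table above."""
    return {
        'AlarmName': f'{stream_name}-iterator-age',
        'Namespace': 'AWS/Kinesis',
        'MetricName': 'GetRecords.IteratorAgeMilliseconds',
        'Dimensions': [{'Name': 'StreamName', 'Value': stream_name}],
        'Statistic': 'Maximum',
        'Period': 60,
        'EvaluationPeriods': 3,
        'Threshold': threshold_ms,
        'ComparisonOperator': 'GreaterThanThreshold',
    }

def create_iterator_age_alarm(stream_name):
    import boto3  # deferred so the parameter builder runs without AWS
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    cloudwatch.put_metric_alarm(**iterator_age_alarm(stream_name))
```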
6.2 Enhanced Monitoring
Additional Metrics with Enhanced Monitoring
=============================================
Shard-level metrics:
- IncomingBytes (per shard)
- IncomingRecords (per shard)
- IteratorAgeMilliseconds (per shard)
- OutgoingBytes (per shard)
- OutgoingRecords (per shard)
- ReadProvisionedThroughputExceeded (per shard)
- WriteProvisionedThroughputExceeded (per shard)
Hot Shard Detection:
Shard 1: IncomingBytes = 200 KB/s [Normal]
Shard 2: IncomingBytes = 950 KB/s [Warning! Near limit]
Shard 3: IncomingBytes = 300 KB/s [Normal]
-> Recommend splitting Shard 2
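With shard-level IncomingBytes in hand from enhanced monitoring, the detection step above is a one-liner and the remediation is a SplitShard call. A sketch, with the threshold, shard IDs, and rates as illustrative values:

```python
def classify_shards(bytes_per_sec_by_shard, limit=1_000_000, warn_ratio=0.8):
    """Flag shards whose write rate is near the 1 MB/s limit."""
    return {shard: 'split-candidate' if rate >= warn_ratio * limit else 'ok'
            for shard, rate in bytes_per_sec_by_shard.items()}

def split_hot_shard(stream_name, shard_id, new_starting_hash_key):
    import boto3  # deferred so classify_shards() runs without AWS
    boto3.client('kinesis', region_name='us-east-1').split_shard(
        StreamName=stream_name,
        ShardToSplit=shard_id,
        NewStartingHashKey=new_starting_hash_key,  # e.g. midpoint of the hot range
    )

status = classify_shards({
    'shardId-000000000000': 200_000,
    'shardId-000000000001': 950_000,
    'shardId-000000000002': 300_000,
})
print(status)  # the 950 KB/s shard is flagged as a split candidate
```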
7. Best Practices and Anti-Patterns
7.1 Best Practices
1) Partition Key Design
- Use high-cardinality keys (user IDs, device IDs)
- Ensure at least 10x unique keys relative to shard count
- Use entity IDs as partition keys when ordering is needed
2) Producer Optimization
- Use PutRecords (batch) API to minimize API calls
- Use KPL for record aggregation and collection optimization
- Implement proper retry logic (exponential backoff + jitter)
3) Consumer Optimization
- Use enhanced fan-out for multiple consumers
- Use KCL for automated distributed processing
- Optimize checkpointing frequency (too frequent increases DynamoDB costs)
4) Capacity Management
- Predictable workloads: Provisioned mode
- Irregular workloads: On-demand mode
- Trigger auto-scaling with CloudWatch alarms
5) Cost Optimization
- Consider On-Demand Advantage mode (released 2025)
- Set retention period only as long as needed
- Deregister unnecessary enhanced fan-out consumers
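For the capacity-management point above, resizing a provisioned stream is an UpdateShardCount call. Below is a sketch of a utilization-driven sizing helper plus the API call; the 50% target and function names are illustrative choices, and note the API accepts at most a doubling or a halving per call:

```python
import math

def scaled_shard_count(current, utilization, target_utilization=0.5):
    """Propose a shard count that brings utilization back to target,
    clamped to the 2x-up / 0.5x-down window UpdateShardCount allows."""
    desired = math.ceil(current * utilization / target_utilization)
    return max(min(desired, current * 2), max(current // 2, 1))

def apply_shard_count(stream_name, target):
    import boto3  # deferred so the sizing helper runs without AWS
    boto3.client('kinesis', region_name='us-east-1').update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target,
        ScalingType='UNIFORM_SCALING',
    )

# 10 shards at 90% utilization -> propose 18 to land near 50%
print(scaled_shard_count(10, utilization=0.9))  # -> 18
```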
7.2 Anti-Patterns
1) Single Partition Key
- All data concentrates on one shard
- Immediately hits shard limits
2) Excessive Shard Count
- Increased costs and management complexity
- Increased DynamoDB lease table load for KCL
3) Not Using Checkpoints
- Duplicate processing or data loss on failure
- Always implement proper checkpointing strategy
4) Missing Error Handling
- Ignoring ProvisionedThroughputExceededException
- Losing failed data without retries
5) Excessive GetRecords Calls
- Must respect 5 calls per second per shard limit
- Set appropriate polling intervals
8. Comprehensive Comparison Summary Table
| Item | Kinesis Data Streams | Kinesis Firehose | Kafka | SQS | Managed Flink |
|---|---|---|---|---|---|
| Type | Data streaming | Data delivery | Data streaming | Message queue | Stream processing |
| Management | Fully managed | Fully managed | Self/MSK | Fully managed | Fully managed |
| Latency | ms | 60s+ | ms | ms | ms |
| Ordering | Within shard | None | Within partition | FIFO only | Input dependent |
| Replay | Yes | No | Yes | No | Input dependent |
| Scaling | Add shards | Automatic | Partition/broker | Automatic | Add KPUs |
| Transform | None (consumer) | Lambda | Kafka Streams | None | Flink app |
| Cost model | Shard+data | Data volume | Instance | Requests | KPU hours |
| AWS integration | High | Very high | Low/Medium | High | High |
| Best for | Real-time ingest | Auto delivery | High-volume streaming | Task queue | Real-time analytics |
9. Summary
Service Selection Guide
When designing a real-time streaming architecture, choosing the right service for your requirements is crucial.
- If simple data delivery is the goal: Use Amazon Data Firehose for direct delivery to S3, Redshift, etc.
- If real-time processing with multiple consumers is needed: Kinesis Data Streams + KCL or Enhanced Fan-Out
- If complex stream analytics are needed: Managed Service for Apache Flink
- If high-volume processing and ecosystem are needed: Apache Kafka or Amazon MSK
- If inter-microservice messaging is needed: Amazon SQS
Understand each service's strengths and consider combining multiple services for optimal production patterns. For example, collecting with Kinesis Data Streams, analyzing in real-time with Managed Flink, and storing long-term in S3 via Data Firehose is a very common architecture combination.