메시지 큐 완전 비교: Kafka vs RabbitMQ vs SQS vs Pulsar vs NATS — 2025 실전 선택 가이드

1. 왜 메시지 큐인가?

시장 규모와 도입 현황

글로벌 메시지 큐 미들웨어 시장은 2025년 기준 약 60억 달러(6B USD) 규모로 성장했습니다. Fortune Business Insights 보고서에 따르면 기업의 41%가 이미 메시지 큐 기반 비동기 아키텍처를 도입했고, 추가 27%가 도입을 계획하고 있습니다.

왜 이렇게 많은 기업이 메시지 큐를 채택할까요?

동기 vs 비동기 통신

graph LR
    subgraph "동기 방식 - 강한 결합"
        A[주문 서비스] -->|HTTP 호출| B[결제 서비스]
        B -->|HTTP 호출| C[재고 서비스]
        C -->|HTTP 호출| D[알림 서비스]
    end

동기 방식에서는 결제 서비스가 1초 지연되면 전체 체인이 1초 이상 느려집니다. 하나가 죽으면 연쇄 장애가 발생합니다.

graph LR
    subgraph "비동기 방식 - 느슨한 결합"
        A2[주문 서비스] -->|이벤트 발행| MQ[메시지 큐]
        MQ -->|구독| B2[결제 서비스]
        MQ -->|구독| C2[재고 서비스]
        MQ -->|구독| D2[알림 서비스]
    end

비동기 방식에서는 주문 서비스가 메시지만 발행하면 됩니다. 결제 서비스가 일시적으로 다운되어도 메시지는 큐에 보관되어 복구 후 처리됩니다.

메시지 큐 도입이 필요한 5가지 신호

  1. 서비스 간 동기 호출이 3단계 이상 체이닝되는 경우
  2. 트래픽 스파이크로 인해 다운스트림 서비스가 과부하에 빠지는 경우
  3. 이벤트 리플레이가 필요한 경우 (장애 복구, 데이터 재처리)
  4. 다수의 소비자가 동일한 데이터를 각기 다른 목적으로 처리하는 경우
  5. 마이크로서비스 전환 시 서비스 간 결합도를 낮추려는 경우

2. 아키텍처 비교: 로그 vs 브로커 vs 하이브리드

메시지 큐는 크게 세 가지 아키텍처 패러다임으로 나뉩니다.

2.1 Apache Kafka — 분산 커밋 로그

graph TB
    subgraph "Kafka 클러스터"
        direction TB
        KR[KRaft Controller] --> B1[Broker 1]
        KR --> B2[Broker 2]
        KR --> B3[Broker 3]
    end
    P[Producer] --> B1
    B1 --> T["Topic: orders (3 partitions)"]
    T --> P0[Partition 0 - Leader: B1]
    T --> P1[Partition 1 - Leader: B2]
    T --> P2[Partition 2 - Leader: B3]
    P0 --> CG["Consumer Group"]
    P1 --> CG
    P2 --> CG

핵심 특징:

  • Append-Only 커밋 로그: 메시지는 파티션에 순서대로 추가되며, 불변(immutable)입니다
  • Consumer Offset: 소비자가 어디까지 읽었는지를 자체 관리 — 리플레이 가능
  • KRaft (Kafka 4.0): ZooKeeper를 완전히 제거하고 Raft 기반 메타데이터 관리
  • 보존 정책: 시간 또는 크기 기반으로 메시지를 유지 (기본 7일)
  • 파티션 병렬 처리: 파티션 수만큼 컨슈머를 병렬로 확장

적합한 경우: 이벤트 스트리밍, 로그 집계, 이벤트 소싱, 실시간 분석
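"같은 키는 항상 같은 파티션"이라는 키 기반 파티셔닝 규칙을 단순화해 본 스케치입니다. 실제 Kafka의 기본 파티셔너는 murmur2 해시를 사용하므로, 아래의 `hashCode()` 기반 계산은 개념 설명을 위한 가정입니다.

```java
import java.util.List;

// 개념 설명용 단순화 스케치입니다. 실제 Kafka는 키에 murmur2 해시를 적용합니다.
// 같은 키는 항상 같은 파티션으로 가므로 키 단위의 이벤트 순서가 보장됩니다.
public class PartitionRouting {

    // 키 해시를 파티션 수로 나눈 나머지로 파티션 결정 (음수 해시 처리를 위해 floorMod 사용)
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 12;
        // 같은 주문 ID는 항상 같은 파티션 → 주문별 이벤트 순서 유지
        int p1 = partitionFor("order-1001", partitions);
        int p2 = partitionFor("order-1001", partitions);
        System.out.println(p1 == p2);                   // true
        System.out.println(p1 >= 0 && p1 < partitions); // true
        for (String key : List.of("order-1001", "order-1002", "order-1003")) {
            System.out.println(key + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

키 없이 발행하면 파티션에 라운드로빈(정확히는 sticky 배치)으로 분산되어 순서 보장이 사라진다는 점이 이 규칙의 반대면입니다.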

2.2 RabbitMQ — AMQP 메시지 브로커

graph LR
    P[Producer] --> E["Exchange (type: topic)"]
    E -->|"routing.key.a"| Q1[Queue A]
    E -->|"routing.key.b"| Q2[Queue B]
    E -->|"routing.#"| Q3[Queue C - 와일드카드]
    Q1 --> C1[Consumer 1]
    Q2 --> C2[Consumer 2]
    Q3 --> C3[Consumer 3]

핵심 특징:

  • AMQP 프로토콜: Exchange, Binding, Queue의 유연한 라우팅 모델
  • Exchange 유형: Direct, Topic, Fanout, Headers — 복잡한 라우팅 가능
  • 메시지 ACK: Consumer가 확인 응답을 보내야 메시지가 큐에서 제거됨
  • 플러그인 생태계: Shovel, Federation, Management UI 등
  • Priority Queue: 메시지 우선순위 설정 가능

적합한 경우: 복잡한 라우팅, RPC 패턴, 태스크 분배, 기존 AMQP 에코시스템
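Topic Exchange의 바인딩 키 의미론(`*` = 정확히 한 단어, `#` = 0개 이상 단어)을 보여주는 스케치입니다. RabbitMQ 내부 구현이 아니며, `#`를 임의 문자열로 처리하는 등 일부 엣지 케이스는 단순화한 가정입니다.

```java
// 바인딩 키 매칭 규칙 설명용 단순 스케치 (RabbitMQ 내부 구현 아님).
// '*'는 정확히 한 단어, '#'는 0개 이상 단어와 매칭됩니다.
public class TopicMatcher {

    static boolean matches(String bindingKey, String routingKey) {
        String regex = bindingKey
            .replace(".", "\\.")   // '.'는 단어 구분자이므로 리터럴로 이스케이프
            .replace("*", "[^.]+") // '*' -> 정확히 한 단어
            .replace("#", ".*");   // '#' -> 0개 이상 단어 (단순화)
        return routingKey.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("routing.key.a", "routing.key.a")); // true
        System.out.println(matches("routing.#", "routing.key.b"));     // true
        System.out.println(matches("routing.*", "routing.key.a"));     // false ('*'는 한 단어만)
    }
}
```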

2.3 Amazon SQS — 완전 관리형 큐

graph LR
    P[Producer] -->|SendMessage| SQS["SQS Queue"]
    SQS -->|ReceiveMessage| C1[Consumer 1]
    SQS -->|ReceiveMessage| C2[Consumer 2]
    SQS -.->|실패 시| DLQ["Dead Letter Queue"]

핵심 특징:

  • 완전 관리형: 서버 프로비저닝, 패치, 스케일링 불필요
  • Pull 기반: Consumer가 능동적으로 메시지를 가져가는 구조
  • Standard vs FIFO: 무제한 처리량(Standard) 또는 순서 보장(FIFO, 초당 300~3,000 TPS)
  • Visibility Timeout: 메시지 처리 중 다른 소비자에게 보이지 않음
  • DLQ 네이티브 지원: 처리 실패 메시지 자동 이동

적합한 경우: 서버리스 아키텍처, AWS 네이티브 워크로드, 운영 부담 최소화
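Visibility Timeout의 동작 — 수신된 메시지는 타임아웃 동안 다른 소비자에게 숨겨지고, 삭제하지 않으면 다시 노출되는 것 — 을 인메모리로 시뮬레이션한 스케치입니다. 실제 SQS API가 아니라 의미론 설명용 가정 코드입니다.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Visibility Timeout 의미론 설명용 인메모리 시뮬레이션 (실제 SQS 아님).
// 수신된 메시지는 타임아웃 동안 숨겨지고, 삭제되지 않으면 다시 보입니다.
public class VisibilityQueue {
    private final Deque<String> messages = new ArrayDeque<>();
    private final Map<String, Long> invisibleUntil = new HashMap<>();
    private final long visibilityTimeoutMillis;

    public VisibilityQueue(long visibilityTimeoutMillis) {
        this.visibilityTimeoutMillis = visibilityTimeoutMillis;
    }

    public void send(String body) { messages.add(body); }

    // 현재 보이는 메시지 하나를 수신하고 타임아웃 동안 숨김
    public Optional<String> receive(long now) {
        for (String m : messages) {
            Long until = invisibleUntil.get(m);
            if (until == null || until <= now) {
                invisibleUntil.put(m, now + visibilityTimeoutMillis);
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }

    // 처리 완료 후 명시적으로 삭제 (삭제하지 않으면 타임아웃 후 재전달됨)
    public void delete(String body) {
        messages.remove(body);
        invisibleUntil.remove(body);
    }
}
```

처리 시간이 타임아웃보다 길면 같은 메시지가 중복 전달되는 것도 이 모델에서 그대로 드러납니다. SQS가 at-least-once인 이유입니다.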

2.4 Apache Pulsar — 분리형 아키텍처

graph TB
    subgraph "Pulsar 클러스터"
        direction TB
        B1[Broker 1 - Serving] --> BK1[BookKeeper 1 - Storage]
        B2[Broker 2 - Serving] --> BK2[BookKeeper 2 - Storage]
        B3[Broker 3 - Serving] --> BK3[BookKeeper 3 - Storage]
    end
    P[Producer] --> B1
    B1 --> T["Topic: events"]
    T --> SUB["Subscription (Shared/Key_Shared/Exclusive)"]
    SUB --> C1[Consumer 1]
    SUB --> C2[Consumer 2]

핵심 특징:

  • Compute/Storage 분리: Broker(서빙)와 BookKeeper(저장)가 독립적으로 스케일링
  • Multi-Tenancy: 네임스페이스와 테넌트 단위의 격리
  • Geo-Replication: 네이티브 교차 데이터센터 복제
  • Tiered Storage: 오래된 데이터를 S3 같은 저비용 스토리지로 자동 이관
  • 다양한 구독 모드: Exclusive, Shared, Failover, Key_Shared

적합한 경우: 멀티 테넌트 환경, 지역 분산, 대규모 스트리밍

2.5 NATS — 경량 메시지 메시

graph LR
    subgraph "NATS 클러스터"
        N1[NATS Server 1] --- N2[NATS Server 2]
        N2 --- N3[NATS Server 3]
        N1 --- N3
    end
    P[Publisher] --> N1
    N1 -->|"subject: sensor.temp"| S1[Subscriber 1]
    N2 -->|"subject: sensor.*"| S2[Subscriber 2 - 와일드카드]
    N3 -->|JetStream| JS[JetStream Consumer]

핵심 특징:

  • Core NATS: 최소한의 지연시간을 위한 fire-and-forget Pub/Sub
  • JetStream: 영속성, 리플레이, exactly-once가 필요할 때 활성화
  • 경량 바이너리: 단일 바이너리 약 20MB, Go로 작성
  • Subject 기반 라우팅: 계층적 subject 이름과 와일드카드 매칭
  • Leaf Node: 엣지 컴퓨팅을 위한 경량 노드 확장

적합한 경우: IoT, 엣지 컴퓨팅, 마이크로서비스 간 빠른 통신, 클라우드 네이티브

2.6 Redis Streams — 인메모리 로그 구조

graph LR
    P[Producer] -->|XADD| RS["Redis Stream (mystream)"]
    RS -->|XREADGROUP| CG["Consumer Group"]
    CG --> C1["Consumer 1 (pending entries)"]
    CG --> C2["Consumer 2 (pending entries)"]

핵심 특징:

  • 인메모리: 초저지연, 높은 처리량
  • Consumer Group: Kafka와 유사한 컨슈머 그룹 모델
  • XADD/XREAD: 간단한 커맨드 기반 API
  • Pending Entries List (PEL): 미확인 메시지 추적
  • MAXLEN 트리밍: 메모리 한도 내에서 자동 정리

적합한 경우: 경량 스트리밍, 이미 Redis를 사용 중인 환경, 캐시 + 큐 통합
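`XADD ... MAXLEN`이 하는 일 — 새 엔트리를 추가하면서 한도를 넘는 오래된 엔트리를 앞에서부터 잘라내는 것 — 을 보여주는 스케치입니다. Redis 클라이언트 코드가 아니라 트리밍 의미론을 설명하기 위한 가정 코드입니다.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// XADD ... MAXLEN 의미론 설명용 스케치 (Redis 클라이언트 아님).
// 추가할 때마다 한도를 넘는 가장 오래된 엔트리를 제거해 길이를 유지합니다.
public class MaxLenStream {
    private final Deque<String> entries = new ArrayDeque<>();
    private final int maxLen;

    public MaxLenStream(int maxLen) { this.maxLen = maxLen; }

    // XADD mystream MAXLEN <maxLen> ... 에 해당
    public void add(String entry) {
        entries.addLast(entry);
        while (entries.size() > maxLen) {
            entries.removeFirst(); // 가장 오래된 엔트리 트리밍
        }
    }

    public int size() { return entries.size(); }
    public String oldest() { return entries.peekFirst(); }
}
```

실제 Redis에서는 `MAXLEN ~ 10000`처럼 `~`를 붙여 근사 트리밍을 쓰면 매크로 노드 단위로 잘라내어 성능이 훨씬 좋습니다.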

아키텍처 패러다임 요약

| 시스템 | 패러다임 | 메시지 저장 | 소비 모델 | 메시지 삭제 |
|---|---|---|---|---|
| Kafka | 분산 커밋 로그 | 디스크 (보존 정책) | Pull + Offset | 보존 기간 후 삭제 |
| RabbitMQ | 메시지 브로커 | 메모리 + 디스크 | Push (ACK 후 삭제) | ACK 시 즉시 삭제 |
| SQS | 관리형 큐 | AWS 관리형 | Pull | 처리 후 삭제 |
| Pulsar | 분리형 로그 | BookKeeper + Tiered | Pull + 커서 | 보존 정책 기반 |
| NATS | 메시지 메시 | 메모리 / JetStream | Push / Pull | 정책 기반 |
| Redis Streams | 인메모리 로그 | 메모리 (AOF 옵션) | Pull (Consumer Group) | MAXLEN 트리밍 |

3. 성능 벤치마크

처리량과 지연시간 비교

다음 벤치마크는 표준적인 3-노드 클러스터 환경 (m5.2xlarge 또는 동급)에서의 측정치입니다.

| 시스템 | 처리량 (msgs/s) | P50 지연시간 | P99 지연시간 | 최적 사용처 |
|---|---|---|---|---|
| Kafka | 500K ~ 1M | 5 ~ 15ms | 10 ~ 50ms | 이벤트 스트림, 리플레이, 로그 집계 |
| Pulsar | 1M ~ 2.6M | 5 ~ 10ms | Kafka 대비 최대 300배 개선 (tail latency) | 멀티 테넌시, 지역 복제 |
| NATS | 200K ~ 400K | sub-ms | 1 ~ 5ms | IoT, 엣지, 빠른 메시징 |
| RabbitMQ | 50K ~ 100K | 1 ~ 5ms | 5 ~ 20ms | 복잡한 라우팅, RPC |
| Redis Streams | 100K ~ 500K | sub-ms | 1 ~ 3ms | 경량 스트리밍, 캐시 통합 |
| SQS | 3K ~ 30K (Standard) | 20 ~ 50ms | 50 ~ 100ms | 서버리스, AWS 네이티브 |

처리량 상세 분석

Kafka 1M msgs/s의 비밀:

# Kafka 처리량 최적화 구성
num.partitions=12
batch.size=65536          # 64KB 배치
linger.ms=5               # 5ms 대기 후 배치 전송
compression.type=lz4      # LZ4 압축
buffer.memory=67108864    # 64MB 버퍼
acks=1                    # Leader만 확인 (최대 처리량)

  • 파티션 12개 기준, 각 파티션당 약 83K msgs/s
  • acks=all로 변경 시 약 30~40% 처리량 감소하지만 데이터 안정성 확보
  • LZ4 압축으로 네트워크 대역폭 40~60% 절감

Pulsar가 P99에서 강한 이유:

  • Broker와 BookKeeper 분리로 쓰기 I/O와 읽기 I/O가 경쟁하지 않음
  • BookKeeper의 Journal(WAL)과 Ledger(데이터)가 물리적으로 분리된 디스크 사용
  • 결과: Kafka 대비 P99 tail latency가 최대 300배 개선

NATS sub-ms 달성 조건:

# NATS Core (JetStream 비활성화)
# - 영속성 없이 in-memory only
# - 10KB 이하 메시지 크기
# - 단일 데이터센터 내 통신

확장성 비교

| 시스템 | 수평 확장 방식 | 파티션/큐 한도 | 클러스터 최대 규모 |
|---|---|---|---|
| Kafka | 파티션 추가 + Broker 추가 | 실무 권장 4,000 ~ 20,000개/클러스터 | 수백 대 브로커 |
| Pulsar | Broker/BookKeeper 독립 확장 | 100만+ 토픽 가능 | 수천 대 노드 |
| NATS | Leaf Node 확장 | 제한 없음 (subject 기반) | 슈퍼 클러스터 |
| RabbitMQ | Quorum Queue + 노드 추가 | 실무 권장 수천 개 | 수십 대 노드 |
| Redis Streams | Redis Cluster 샤딩 | 샤드당 하나의 Stream | 수백 대 노드 |
| SQS | 자동 (무제한) | 무제한 | AWS 관리형 |

4. 2025 신기능 하이라이트

4.1 Kafka 4.0 — KRaft GA와 큐 모드

KRaft GA (KIP-833):

Kafka 4.0에서 ZooKeeper가 완전히 제거되었습니다. 모든 메타데이터 관리가 KRaft 컨트롤러로 이관됩니다.

# Kafka 4.0 KRaft 모드 설정
process.roles: controller,broker
node.id: 1
controller.quorum.voters: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners: PLAINTEXT://:9092,CONTROLLER://:9093
# ZooKeeper 관련 설정 완전 제거됨

개선 효과:

  • 메타데이터 전파 속도 10배 향상
  • 파티션 수 100만 개 이상으로 확장 가능
  • 운영 복잡도 대폭 감소 (ZooKeeper 클러스터 별도 관리 불필요)
  • 클러스터 시작 시간 수 분에서 수 초로 단축

KIP-932: Queues for Kafka (Share Groups):

Kafka에 전통적인 "큐" 의미론이 추가되었습니다. 기존에는 같은 컨슈머 그룹 내에서도 파티션 단위로만 분배할 수 있었지만, Share Groups를 사용하면 메시지 단위로 분배됩니다.

// Kafka 4.0 Share Group 사용 예시
Properties props = new Properties();
props.put("group.id", "my-share-group");
props.put("group.type", "share");  // 새로운 Share 타입

KafkaShareConsumer<String, String> consumer =
    new KafkaShareConsumer<>(props);
consumer.subscribe(List.of("orders"));

// 메시지 단위로 분배 - 파티션에 종속되지 않음
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record);
        consumer.acknowledge(record);  // 개별 메시지 ACK
    }
}

4.2 RabbitMQ 4.x — SQL 필터와 AMQP 1.0

Native AMQP 1.0 지원:

RabbitMQ 4.0부터 AMQP 1.0이 코어 프로토콜로 승격되었습니다. 기존에는 플러그인을 통해서만 AMQP 1.0을 사용할 수 있었으며, AMQP 0.9.1은 계속 코어에서 지원됩니다.

Stream Filtering (4.2):

// RabbitMQ 4.2 Stream SQL Filter
// 서버 사이드에서 필터링 - 네트워크 대역폭 절약
Consumer consumer = environment.consumerBuilder()
    .stream("events")
    .filter()
        .values("region-us", "priority-high")  // 서버 사이드 필터
        .postFilter(msg -> msg.getProperties().getSubject().equals("order"))
    .builder()
    .messageHandler((context, message) -> {
        // 필터링된 메시지만 수신
        processMessage(message);
        context.storeOffset();
    })
    .build();

처리 성능: 필터링 시에도 4M+ msgs/s 유지

Khepri 메타데이터 저장소:

Mnesia를 대체하는 Raft 기반 메타데이터 저장소로, 대규모 클러스터에서의 안정성이 크게 개선되었습니다.

4.3 Pulsar 4.1 — 19 PIPs

Apache Pulsar 4.1은 19개의 Pulsar Improvement Proposals(PIPs)가 포함되었습니다.

주요 변경:

  • Java 17 필수: Java 8/11 지원 종료
  • PIP-354: Topic Policy 성능 최적화
  • PIP-362: Broker Level Dispatch Throttling
  • PIP-374: 개선된 메모리 관리로 GC 부담 감소

4.4 NATS 2.11 — 메시지 TTL과 분산 트레이싱

# NATS 2.11: 메시지 단위 TTL을 허용하는 스트림 생성
# (CLI 버전에 따라 플래그 명칭이 다를 수 있습니다)
nats stream add EVENTS \
  --subjects "events.>" \
  --allow-msg-ttl \
  --storage file \
  --replicas 3

# 발행 시 Nats-TTL 헤더로 개별 메시지의 만료 시간 지정
nats pub events.login "payload" -H "Nats-TTL:3600s"

주요 변경:

  • 메시지 레벨 TTL: 스트림 전체가 아닌 개별 메시지에 만료 시간 설정
  • 분산 트레이싱: OpenTelemetry 통합으로 메시지 흐름 추적
  • SubjectTransform: subject 이름을 동적으로 변환
  • 개선된 모니터링: Prometheus 메트릭 확장

5. 클라우드 매니지드 서비스 비교 + 가격

주요 매니지드 서비스

| 서비스 | 기반 기술 | 제공사 | 주요 특징 |
|---|---|---|---|
| Confluent Cloud | Kafka | Confluent | Schema Registry, ksqlDB, Connectors |
| Amazon MSK | Kafka | AWS | VPC 통합, Serverless 옵션 |
| Amazon MSK Serverless | Kafka | AWS | 자동 스케일링, 사용량 과금 |
| Amazon SQS | 자체 | AWS | 완전 관리형, 서버리스 통합 |
| CloudAMQP | RabbitMQ | 84codes | Dedicated/Shared 인스턴스 |
| StreamNative | Pulsar | StreamNative | Pulsar 전문 매니지드 |
| Synadia | NATS | Synadia | NATS 전문 매니지드 |

월 비용 시뮬레이션 (일일 1억 메시지, 1KB 평균)

시나리오: 일일 1억 메시지, 평균 1KB, 3일 보존

Confluent Cloud (Basic):
  - 처리량: ~1,160 msgs/s 평균
  - 예상 비용: 월 약 800 ~ 1,200 USD
  - 포함: 브로커, 스토리지, 네트워크

Amazon MSK Provisioned (kafka.m5.large x 3):
  - 인스턴스: 0.21 USD/h x 3 = 월 약 454 USD
  - 스토리지: 300GB x 0.10 USD/GB = 30 USD
  - 예상 총비용: 월 약 500 ~ 700 USD

Amazon MSK Serverless:
  - 클러스터: 0.75 USD/h = 월 약 540 USD
  - 파티션: 0.0015 USD/h/partition
  - 스토리지: 0.10 USD/GB
  - 예상 총비용: 월 약 600 ~ 1,000 USD

Amazon SQS Standard:
  - 처음 100만 건 무료
  - 이후: 0.40 USD / 100만 건
  - 월 약 30억 건: 약 1,200 USD
  - 데이터 전송 비용 별도

CloudAMQP (Dedicated - Tiger):
  - 월 약 500 ~ 1,000 USD
  - 클러스터 포함, 모니터링 포함

자체 호스팅 Kafka (c5.2xlarge x 3 + EBS):
  - EC2: 0.34 USD/h x 3 = 월 약 734 USD
  - EBS gp3 300GB x 3: 월 약 72 USD
  - 운영 인건비 별도 (DevOps 1인 월 최소 수백만 원)
  - 예상 총비용: 월 약 800 USD + 운영비
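위 SQS 시나리오의 수치(평균 처리량 약 1,160 msgs/s, 월 약 1,200 USD)를 검산해 보는 스케치입니다. 단가(100만 건당 0.40 USD)와 메시지 양은 본문 수치를 그대로 가정했습니다.

```java
// 본문의 SQS 비용 계산을 코드로 재현한 스케치입니다.
// 단가(100만 건당 0.40 USD)와 메시지 양은 본문 수치를 그대로 가정합니다.
public class SqsCostEstimate {
    public static void main(String[] args) {
        long dailyMessages = 100_000_000L;          // 일일 1억 건
        long monthlyMessages = dailyMessages * 30;  // 월 30억 건

        double avgMsgsPerSec = dailyMessages / 86_400.0;  // 하루 86,400초 기준 평균 처리량
        double pricePerMillion = 0.40;                    // USD / 100만 건
        double monthlyCost = monthlyMessages / 1_000_000.0 * pricePerMillion;

        System.out.printf("평균 처리량: %.0f msgs/s%n", avgMsgsPerSec); // 약 1,157
        System.out.printf("월 예상 비용: %.0f USD%n", monthlyCost);     // 약 1,200
    }
}
```

실제 SQS 요금은 메시지당 Send/Receive/Delete 요청이 각각 과금되므로, 요청 단위로 환산하면 이보다 커질 수 있다는 점에 유의해야 합니다.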

선택 기준 정리

  • 비용 최소화: 소규모라면 SQS (사용량 과금), 대규모라면 자체 호스팅 Kafka
  • 운영 부담 최소화: SQS 또는 MSK Serverless
  • 기능 최대화: Confluent Cloud (Schema Registry, ksqlDB)
  • RabbitMQ 필요 시: CloudAMQP

6. Spring Boot 연동 코드

6.1 Kafka + Spring Boot

// build.gradle
// implementation 'org.springframework.kafka:spring-kafka:3.3.0'

// application.yml
// spring:
//   kafka:
//     bootstrap-servers: localhost:9092
//     producer:
//       key-serializer: org.apache.kafka.common.serialization.StringSerializer
//       value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
//     consumer:
//       group-id: order-service
//       auto-offset-reset: earliest
//       key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
//       value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
// KafkaProducerService.java
@Service
@RequiredArgsConstructor
public class KafkaProducerService {

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public CompletableFuture<SendResult<String, OrderEvent>> sendOrderEvent(
            OrderEvent event) {
        return kafkaTemplate.send("orders", event.getOrderId(), event);
    }
}

// KafkaConsumerService.java
@Service
@Slf4j
public class KafkaConsumerService {

    @KafkaListener(
        topics = "orders",
        groupId = "order-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    public void handleOrderEvent(
            @Payload OrderEvent event,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset) {

        log.info("Received order: {} from partition: {}, offset: {}",
            event.getOrderId(), partition, offset);
        processOrder(event);
    }

    private void processOrder(OrderEvent event) {
        // 주문 처리 로직
    }
}

6.2 RabbitMQ + Spring Boot

// build.gradle
// implementation 'org.springframework.boot:spring-boot-starter-amqp'

// application.yml
// spring:
//   rabbitmq:
//     host: localhost
//     port: 5672
//     username: guest
//     password: guest
// RabbitMQConfig.java
@Configuration
public class RabbitMQConfig {

    public static final String ORDER_EXCHANGE = "order.exchange";
    public static final String ORDER_QUEUE = "order.queue";
    public static final String ORDER_ROUTING_KEY = "order.created";
    public static final String DLQ_QUEUE = "order.dlq";

    @Bean
    public TopicExchange orderExchange() {
        return new TopicExchange(ORDER_EXCHANGE);
    }

    @Bean
    public Queue orderQueue() {
        return QueueBuilder.durable(ORDER_QUEUE)
            .withArgument("x-dead-letter-exchange", "")
            .withArgument("x-dead-letter-routing-key", DLQ_QUEUE)
            .withArgument("x-message-ttl", 60000)
            .build();
    }

    @Bean
    public Queue deadLetterQueue() {
        return QueueBuilder.durable(DLQ_QUEUE).build();
    }

    @Bean
    public Binding orderBinding(Queue orderQueue, TopicExchange orderExchange) {
        return BindingBuilder.bind(orderQueue)
            .to(orderExchange)
            .with(ORDER_ROUTING_KEY);
    }
}

// RabbitMQProducer.java
@Service
@RequiredArgsConstructor
public class RabbitMQProducer {

    private final RabbitTemplate rabbitTemplate;

    public void sendOrderEvent(OrderEvent event) {
        rabbitTemplate.convertAndSend(
            RabbitMQConfig.ORDER_EXCHANGE,
            RabbitMQConfig.ORDER_ROUTING_KEY,
            event,
            message -> {
                message.getMessageProperties().setContentType("application/json");
                message.getMessageProperties().setDeliveryMode(
                    MessageDeliveryMode.PERSISTENT);
                return message;
            }
        );
    }
}

// RabbitMQConsumer.java
@Service
@Slf4j
public class RabbitMQConsumer {

    @RabbitListener(queues = RabbitMQConfig.ORDER_QUEUE)
    public void handleOrderEvent(OrderEvent event, Channel channel,
            @Header(AmqpHeaders.DELIVERY_TAG) long tag) throws IOException {
        try {
            log.info("Received order: {}", event.getOrderId());
            processOrder(event);
            channel.basicAck(tag, false);
        } catch (Exception e) {
            log.error("Failed to process order: {}", event.getOrderId(), e);
            channel.basicNack(tag, false, false); // DLQ로 이동
        }
    }
}

6.3 SQS + Spring Boot

// build.gradle
// implementation 'io.awspring.cloud:spring-cloud-aws-starter-sqs:3.2.0'

// application.yml
// spring:
//   cloud:
//     aws:
//       region:
//         static: ap-northeast-2
//       sqs:
//         endpoint: https://sqs.ap-northeast-2.amazonaws.com
// SQSProducer.java
@Service
@RequiredArgsConstructor
public class SQSProducer {

    private final SqsTemplate sqsTemplate;

    public void sendOrderEvent(OrderEvent event) {
        sqsTemplate.send(to -> to
            .queue("order-queue")
            .payload(event)
            .header("eventType", "ORDER_CREATED")
        );
    }

    // FIFO 큐 전송
    public void sendOrderEventFifo(OrderEvent event) {
        sqsTemplate.send(to -> to
            .queue("order-queue.fifo")
            .payload(event)
            .header(SqsHeaders.MessageSystemAttributes.SQS_MESSAGE_GROUP_ID_HEADER,
                event.getOrderId())
            .header(SqsHeaders.MessageSystemAttributes
                .SQS_MESSAGE_DEDUPLICATION_ID_HEADER,
                UUID.randomUUID().toString())
        );
    }
}

// SQSConsumer.java
@Service
@Slf4j
public class SQSConsumer {

    @SqsListener(value = "order-queue", maxConcurrentMessages = "10",
        maxMessagesPerPoll = "5")
    public void handleOrderEvent(@Payload OrderEvent event,
            @Header("eventType") String eventType) {
        log.info("Received order: {} with type: {}",
            event.getOrderId(), eventType);
        processOrder(event);
    }
}

6.4 Pulsar + Spring Boot

// build.gradle
// implementation 'org.springframework.pulsar:spring-pulsar-spring-boot-starter:1.2.0'

// application.yml
// spring:
//   pulsar:
//     client:
//       service-url: pulsar://localhost:6650
//     producer:
//       topic-name: persistent://public/default/orders
//     consumer:
//       subscription-name: order-service-sub
//       subscription-type: shared
// PulsarProducer.java
@Service
@RequiredArgsConstructor
public class PulsarProducer {

    private final PulsarTemplate<OrderEvent> pulsarTemplate;

    public void sendOrderEvent(OrderEvent event) {
        pulsarTemplate.send("persistent://public/default/orders", event);
    }
}

// PulsarConsumer.java
@Service
@Slf4j
public class PulsarConsumer {

    @PulsarListener(
        topics = "persistent://public/default/orders",
        subscriptionName = "order-service-sub",
        subscriptionType = SubscriptionType.Shared,
        schemaType = SchemaType.JSON
    )
    public void handleOrderEvent(OrderEvent event) {
        log.info("Received order: {}", event.getOrderId());
        processOrder(event);
    }
}

6.5 NATS + Spring Boot

// build.gradle
// implementation 'io.nats:jnats:2.20.4'
// NATSConfig.java
@Configuration
public class NATSConfig {

    @Bean
    public Connection natsConnection() throws IOException, InterruptedException {
        Options options = new Options.Builder()
            .server("nats://localhost:4222")
            .reconnectWait(Duration.ofSeconds(2))
            .maxReconnects(-1) // 무한 재연결
            .connectionListener((conn, type) -> {
                System.out.println("NATS connection event: " + type);
            })
            .build();
        return Nats.connect(options);
    }

    @Bean
    public JetStream jetStream(Connection connection) throws IOException {
        return connection.jetStream();
    }
}

// NATSProducer.java
@Service
@RequiredArgsConstructor
public class NATSProducer {

    private final JetStream jetStream;
    private final ObjectMapper objectMapper;

    public PublishAck sendOrderEvent(OrderEvent event) throws Exception {
        byte[] data = objectMapper.writeValueAsBytes(event);
        Message msg = NatsMessage.builder()
            .subject("orders.created")
            .data(data)
            .build();
        return jetStream.publish(msg);
    }
}

// NATSConsumer.java
@Service
@Slf4j
@RequiredArgsConstructor
public class NATSConsumer {

    private final Connection natsConnection;
    private final ObjectMapper objectMapper;

    @PostConstruct
    public void startConsumer() throws Exception {
        JetStream js = natsConnection.jetStream();

        PushSubscribeOptions options = PushSubscribeOptions.builder()
            .durable("order-consumer")
            .build();

        Dispatcher dispatcher = natsConnection.createDispatcher();
        MessageHandler handler = msg -> {
            try {
                OrderEvent event = objectMapper.readValue(
                    msg.getData(), OrderEvent.class);
                log.info("Received order: {}", event.getOrderId());
                processOrder(event);
                msg.ack();
            } catch (Exception e) {
                log.error("Failed to process message", e);
                msg.nak();
            }
        };

        // autoAck=false: 처리 성공 시에만 수동 ACK
        js.subscribe("orders.>", "order-queue", dispatcher, handler, false, options);
    }
}

6.6 Redis Streams + Spring Boot

// build.gradle
// implementation 'org.springframework.boot:spring-boot-starter-data-redis'
// RedisStreamConfig.java
@Configuration
public class RedisStreamConfig {

    @Bean
    public StreamMessageListenerContainer<String, MapRecord<String, String, String>>
            streamListenerContainer(RedisConnectionFactory factory) {

        var options = StreamMessageListenerContainer
            .StreamMessageListenerContainerOptions.builder()
            .pollTimeout(Duration.ofSeconds(1))
            .batchSize(10)
            .targetType(MapRecord.class)
            .build();

        var container = StreamMessageListenerContainer.create(factory, options);

        // 주의: "order-group" 컨슈머 그룹이 미리 존재해야 합니다.
        // (예: XGROUP CREATE orders order-group $ MKSTREAM)
        container.receiveAutoAck(
            Consumer.from("order-group", "consumer-1"),
            StreamOffset.create("orders", ReadOffset.lastConsumed()),
            new OrderStreamListener()
        );

        container.start();
        return container;
    }
}

// RedisStreamProducer.java
@Service
@RequiredArgsConstructor
public class RedisStreamProducer {

    private final StringRedisTemplate redisTemplate;
    private final ObjectMapper objectMapper;

    public RecordId sendOrderEvent(OrderEvent event) throws Exception {
        Map<String, String> fields = Map.of(
            "orderId", event.getOrderId(),
            "type", "ORDER_CREATED",
            "payload", objectMapper.writeValueAsString(event)
        );
        return redisTemplate.opsForStream()
            .add(StreamRecords.newRecord()
                .ofMap(fields)
                .withStreamKey("orders"));
    }
}

// OrderStreamListener.java
@Slf4j
public class OrderStreamListener
        implements StreamListener<String, MapRecord<String, String, String>> {

    @Override
    public void onMessage(MapRecord<String, String, String> message) {
        Map<String, String> body = message.getValue();
        log.info("Received order: {} from stream: {}",
            body.get("orderId"), message.getStream());
        processOrder(body);
    }
}

7. 의사결정 플로우차트

메시지 큐 선택을 위한 의사결정 트리입니다.

graph TD
    START[메시지 큐 선택] --> Q1["이벤트 리플레이가 필요한가?"]

    Q1 -->|예| Q1A["멀티 테넌시 또는 지역 복제가 필요한가?"]
    Q1A -->|예| PULSAR["Apache Pulsar"]
    Q1A -->|아니오| KAFKA["Apache Kafka"]

    Q1 -->|아니오| Q2["복잡한 라우팅이 필요한가?"]
    Q2 -->|예| RABBIT["RabbitMQ"]

    Q2 -->|아니오| Q3["AWS 네이티브 서버리스인가?"]
    Q3 -->|예| SQS["Amazon SQS"]

    Q3 -->|아니오| Q4["초저지연 sub-ms가 필수인가?"]
    Q4 -->|예| Q4A["영속성이 필요한가?"]
    Q4A -->|예| NATS["NATS + JetStream"]
    Q4A -->|아니오| NATS_CORE["NATS Core"]

    Q4 -->|아니오| Q5["이미 Redis를 사용 중인가?"]
    Q5 -->|예| REDIS["Redis Streams"]
    Q5 -->|아니오| KAFKA2["Kafka (기본 선택)"]

빠른 선택 가이드

| 요구사항 | 추천 시스템 | 이유 |
|---|---|---|
| 이벤트 스트리밍 + 리플레이 | Kafka | 커밋 로그 기반, 업계 표준 |
| 복잡한 메시지 라우팅 | RabbitMQ | Exchange/Binding 모델 |
| 서버리스 + AWS 올인 | SQS | 제로 운영, Lambda 트리거 |
| 멀티 테넌시 + 지역 복제 | Pulsar | 네이티브 멀티 테넌시 |
| IoT + 엣지 컴퓨팅 | NATS | 경량, sub-ms 지연 |
| 이미 Redis 사용 중 | Redis Streams | 추가 인프라 불필요 |
| 대규모 이벤트 + 낮은 비용 | Kafka (자체 호스팅) | 높은 처리량 대비 비용 효율 |
| 운영 인력 없음 | SQS 또는 Confluent Cloud | 완전 관리형 |

8. 실전 아키텍처 패턴

8.1 이커머스: 주문 처리 파이프라인

graph LR
    subgraph "주문 생성"
        API[API Gateway] --> OS[Order Service]
    end

    OS -->|OrderCreated| KAFKA[Kafka - orders 토픽]

    KAFKA --> PAY[Payment Service]
    KAFKA --> INV[Inventory Service]
    KAFKA --> NOTIF[Notification Service]

    PAY -->|PaymentCompleted| KAFKA2[Kafka - payments 토픽]
    KAFKA2 --> SHIP[Shipping Service]

    INV -->|StockReserved| KAFKA3[Kafka - inventory 토픽]
    KAFKA3 --> OS2[Order Service - 상태 업데이트]

    SHIP -->|ShipmentCreated| KAFKA4[Kafka - shipments 토픽]
    KAFKA4 --> NOTIF2[Notification Service - 배송 알림]

왜 Kafka인가?

  • 주문 이벤트 리플레이 필요 (결제 실패 시 재처리)
  • 다수의 서비스가 동일 이벤트를 독립적으로 소비
  • 감사 로그로 활용 가능

핵심 설정:

# 주문 토픽 설정
topics:
  orders:
    partitions: 12
    replication-factor: 3
    configs:
      retention.ms: 604800000 # 7일
      min.insync.replicas: 2
      cleanup.policy: delete

  payments:
    partitions: 6
    replication-factor: 3
    configs:
      retention.ms: 2592000000 # 30일 (감사 로그)
      min.insync.replicas: 2

8.2 실시간 분석 파이프라인

graph LR
    subgraph "데이터 수집"
        WEB[Web Clickstream] --> KAFKA[Kafka]
        APP[App Events] --> KAFKA
        IOT[IoT Sensors] --> NATS[NATS]
        NATS -->|브릿지| KAFKA
    end

    subgraph "스트림 처리"
        KAFKA --> FLINK[Apache Flink]
        KAFKA --> KSQL[ksqlDB]
    end

    subgraph "저장 및 서빙"
        FLINK --> ES[Elasticsearch]
        FLINK --> CH[ClickHouse]
        KSQL --> REDIS[Redis Cache]
    end

    subgraph "시각화"
        ES --> GRAFANA[Grafana]
        CH --> SUPERSET[Apache Superset]
    end

왜 Kafka + NATS 조합인가?

  • Kafka: 높은 처리량의 이벤트 수집과 Flink/ksqlDB 연동
  • NATS: IoT 센서의 초저지연 요구사항 충족, 경량 에이전트
  • 브릿지: NATS에서 수집한 데이터를 Kafka로 통합

8.3 IoT 센서 데이터 수집

graph TB
    subgraph "엣지 레이어"
        S1[온도 센서] --> LN1[NATS Leaf Node - 공장 A]
        S2[습도 센서] --> LN1
        S3[진동 센서] --> LN2[NATS Leaf Node - 공장 B]
    end

    subgraph "코어 레이어"
        LN1 --> NATS[NATS 클러스터 - JetStream]
        LN2 --> NATS
        NATS --> PROC[Stream Processor]
        PROC -->|이상 감지| ALERT[Alert Service - RabbitMQ]
        PROC -->|집계 데이터| TS[TimescaleDB]
    end

    ALERT -->|이메일| EMAIL[Email Service]
    ALERT -->|Slack| SLACK[Slack Notification]

왜 NATS + RabbitMQ 조합인가?

  • NATS: Leaf Node로 엣지까지 확장, sub-ms 지연
  • JetStream: 센서 데이터 영속성과 리플레이
  • RabbitMQ: 알림 라우팅 (이메일, Slack, SMS를 유연하게 분배)

8.4 Saga 패턴 — 분산 트랜잭션

sequenceDiagram
    participant OS as Order Service
    participant K as Kafka
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service

    OS->>K: OrderCreated
    K->>PS: 결제 요청
    PS->>K: PaymentCompleted
    K->>IS: 재고 차감
    IS->>K: StockReserved
    K->>SS: 배송 생성
    SS->>K: ShipmentCreated
    K->>OS: 주문 완료

    Note over PS,IS: 실패 시 보상 트랜잭션
    PS-->>K: PaymentFailed (보상)
    K-->>IS: 재고 복구 (보상)
    K-->>OS: 주문 취소 (보상)

Choreography vs Orchestration:

| 방식 | 장점 | 단점 | 추천 시나리오 |
|---|---|---|---|
| Choreography | 서비스 간 결합도 낮음, 확장 용이 | 흐름 파악 어려움 | 단순한 워크플로우 |
| Orchestration | 중앙 제어, 흐름 명확 | 오케스트레이터 단일 장애점 | 복잡한 비즈니스 로직 |
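보상 트랜잭션의 핵심 아이디어 — 성공한 단계를 기록해 두었다가 실패 시 역순으로 되돌리는 것 — 을 보여주는 스케치입니다. 메시지 큐 연동은 생략한 순수 로직이며, `Step` 같은 타입 이름은 설명을 위한 가정입니다.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Saga 보상 트랜잭션의 핵심을 보여주는 스케치입니다 (큐 연동 생략).
// 성공한 단계를 스택에 쌓고, 실패하면 역순으로 보상을 실행합니다.
public class SagaSketch {

    record Step(String name, Runnable action, Runnable compensation) {}

    // 단계를 순서대로 실행하고, 중간 실패 시 이미 성공한 단계만 역순으로 보상
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run(); // 역순 보상
                }
                return false;
            }
        }
        return true;
    }
}
```

Choreography에서는 각 서비스가 실패 이벤트를 구독해 자신의 보상을 수행하고, Orchestration에서는 위와 같은 로직을 오케스트레이터 한 곳이 들고 있다는 차이만 있을 뿐, "성공 기록 + 역순 보상"이라는 골격은 같습니다.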

9. 운영 체크리스트

공통 모니터링 메트릭

| 메트릭 | 설명 | 위험 임계값 |
|---|---|---|
| Consumer Lag | 소비 지연 메시지 수 | 1만 건 이상 지속 |
| 메시지 처리 실패율 | DLQ 유입 비율 | 1% 이상 |
| Broker 디스크 사용률 | 저장소 용량 | 80% 이상 |
| 네트워크 I/O | 인/아웃 트래픽 | 대역폭의 70% 이상 |
| JVM GC 시간 | Kafka/Pulsar/RabbitMQ 해당 | Full GC 5초 이상 |

Kafka 프로덕션 체크리스트

1. Broker 설정
   - min.insync.replicas = 2 (replication-factor 3 기준)
   - unclean.leader.election.enable = false
   - auto.create.topics.enable = false

2. Producer 설정
   - acks = all (데이터 손실 방지)
   - retries = Integer.MAX_VALUE
   - enable.idempotence = true
   - max.in.flight.requests.per.connection = 5

3. Consumer 설정
   - enable.auto.commit = false (수동 커밋)
   - max.poll.records = 500
   - session.timeout.ms = 30000

4. 모니터링
   - Prometheus + Grafana (JMX Exporter)
   - Consumer Lag 알림 설정
   - Under-Replicated Partitions 알림

RabbitMQ 프로덕션 체크리스트

1. 클러스터 설정
   - Quorum Queue 사용 (Classic Mirrored Queue 대체)
   - vm_memory_high_watermark = 0.4
   - disk_free_limit = 2GB

2. 연결 관리
   - Connection Pool 사용
   - Heartbeat 타임아웃: 60초
   - 채널 재사용 (채널 생성 비용 높음)

3. 메시지 설정
   - Persistent delivery mode
   - Publisher Confirms 활성화
   - Dead Letter Exchange 설정

4. 모니터링
   - Management Plugin Dashboard
   - rabbitmq_prometheus 플러그인
   - Queue 깊이 알림 설정

10. 퀴즈

Q1. Kafka 4.0의 가장 큰 변화

Kafka 4.0에서 제거된 핵심 컴포넌트는 무엇이며, 이를 대체하는 기술은?

정답: ZooKeeper가 완전히 제거되고 KRaft (Kafka Raft)가 이를 대체합니다.

  • KRaft는 Raft 합의 알고리즘 기반의 메타데이터 관리 시스템입니다
  • 메타데이터 전파 속도가 10배 향상되고, 파티션 수가 100만 개 이상으로 확장 가능합니다
  • ZooKeeper 클러스터를 별도로 관리할 필요가 없어져 운영 복잡도가 크게 감소합니다

Q2. Pull vs Push 모델

Pull 기반 소비 모델(Kafka, SQS)과 Push 기반 소비 모델(RabbitMQ)의 차이점과 각각의 장단점은?

Pull 기반 (Kafka, SQS):

  • Consumer가 능동적으로 메시지를 가져감
  • 장점: Consumer가 자신의 처리 속도에 맞춰 가져갈 수 있음 (자연스러운 배압)
  • 단점: Long Polling이 필요하고 실시간성이 약간 떨어질 수 있음

Push 기반 (RabbitMQ):

  • Broker가 Consumer에게 메시지를 밀어넣음
  • 장점: 메시지 도착 즉시 전달 가능 (낮은 지연시간)
  • 단점: Consumer가 과부하 시 별도의 배압 메커니즘 필요 (prefetch count 등)

Q3. Exactly-Once 보장

Kafka에서 Exactly-Once Semantics(EOS)를 달성하기 위해 필요한 3가지 설정은?

정답:

  1. enable.idempotence = true: Producer가 동일 메시지를 중복 전송해도 한 번만 저장
  2. transactional.id 설정: Producer 트랜잭션을 식별하는 고유 ID
  3. isolation.level = read_committed: Consumer가 커밋된 트랜잭션 메시지만 읽음

이 세 가지를 조합하면 Producer에서 Consumer까지 end-to-end Exactly-Once가 보장됩니다.
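위 세 가지 설정을 Properties로 구성해 보는 스케치입니다. 설정 키는 Kafka의 실제 구성 키이지만, `transactional.id` 값 등은 예시용 가정이며 브로커 연동 코드(`initTransactions()` 등)는 생략했습니다.

```java
import java.util.Properties;

// 퀴즈의 세 가지 EOS 설정을 Properties로 구성하는 스케치입니다.
// 실제 사용 시 KafkaProducer/KafkaConsumer 생성자에 전달하고,
// Producer 측에서 initTransactions() / beginTransaction() / commitTransaction()을 호출합니다.
public class ExactlyOnceConfig {

    static Properties producerProps() {
        Properties props = new Properties();
        props.put("enable.idempotence", "true");     // 1. 중복 전송 방지
        props.put("transactional.id", "order-tx-1"); // 2. 트랜잭션 식별자 (예시 값)
        props.put("acks", "all");                    // idempotence 활성화 시 요구됨
        return props;
    }

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("isolation.level", "read_committed"); // 3. 커밋된 메시지만 읽기
        return props;
    }
}
```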

Q4. SQS Standard vs FIFO

Amazon SQS Standard 큐와 FIFO 큐의 처리량, 순서 보장, 중복 처리 측면에서의 차이점은?

| 특성 | Standard | FIFO |
|---|---|---|
| 처리량 | 거의 무제한 | 초당 300 TPS (배치 시 3,000) |
| 순서 보장 | Best-effort (순서 보장 없음) | 엄격한 FIFO 순서 |
| 중복 처리 | At-Least-Once (중복 가능) | Exactly-Once (5분 내 중복 제거) |
| 가격 | 0.40 USD / 100만 건 | 0.50 USD / 100만 건 |
| 사용 사례 | 높은 처리량, 순서 무관 | 결제, 주문 등 순서 중요 |

Q5. Pulsar vs Kafka 아키텍처

Pulsar의 Compute/Storage 분리 아키텍처가 Kafka 대비 갖는 운영상의 장점 2가지는?

정답:

  1. 독립적 스케일링: Broker(연산)와 BookKeeper(저장)를 별도로 확장할 수 있습니다. Kafka는 Broker가 데이터를 직접 저장하므로 저장 용량 확장 시 전체 Broker를 추가해야 합니다. Pulsar는 BookKeeper 노드만 추가하면 됩니다.

  2. Rebalancing 없는 확장: Kafka에서 Broker를 추가하면 파티션 Rebalancing이 필요하여 대량의 데이터 이동이 발생합니다. Pulsar는 새 Broker가 즉시 토픽 소유권을 받을 수 있어 확장 시 서비스 영향이 최소화됩니다.


참고 자료

  1. Apache Kafka 4.0 Release Notes — KRaft GA, Share Groups
  2. KIP-932: Queues for Kafka — Share Group 상세 스펙
  3. RabbitMQ 4.0 Release Blog — AMQP 1.0, Khepri
  4. RabbitMQ Streams — Stream Plugin 및 필터링
  5. Amazon SQS Developer Guide — Standard vs FIFO
  6. Apache Pulsar 4.1 Release — 19 PIPs
  7. NATS 2.11 Release Notes — Message TTL, 분산 트레이싱
  8. Confluent Benchmark: Kafka vs Pulsar — 성능 비교
  9. Spring for Apache Kafka — Spring Kafka 공식 문서
  10. Spring AMQP — Spring RabbitMQ 연동
  11. Spring Cloud AWS SQS — SQS 연동
  12. Spring for Apache Pulsar — Spring Pulsar 공식 문서
  13. Designing Data-Intensive Applications (Martin Kleppmann) — 메시지 시스템 이론
  14. Enterprise Integration Patterns — 메시징 패턴 바이블
  15. Confluent Cloud Pricing — 클라우드 비용 참고
  16. AWS MSK Pricing — MSK 비용 참고

Message Queue Showdown: Kafka vs RabbitMQ vs SQS vs Pulsar vs NATS — 2025 Decision Guide

1. Why Message Queues?

Market Size and Adoption

The global message queue middleware market has grown to approximately 6 billion USD as of 2025. According to Fortune Business Insights, 41% of enterprises have already adopted message-queue-based asynchronous architectures, with an additional 27% planning to adopt them.

Why are so many companies embracing message queues?

Synchronous vs Asynchronous Communication

graph LR
    subgraph "Synchronous - Tight Coupling"
        A[Order Service] -->|HTTP call| B[Payment Service]
        B -->|HTTP call| C[Inventory Service]
        C -->|HTTP call| D[Notification Service]
    end

In synchronous communication, a 1-second delay in the payment service slows the entire chain by over 1 second. If one service dies, cascading failures occur.

graph LR
    subgraph "Asynchronous - Loose Coupling"
        A2[Order Service] -->|Publish event| MQ[Message Queue]
        MQ -->|Subscribe| B2[Payment Service]
        MQ -->|Subscribe| C2[Inventory Service]
        MQ -->|Subscribe| D2[Notification Service]
    end

In asynchronous communication, the order service only needs to publish a message. Even if the payment service goes down temporarily, the message is retained in the queue and processed after recovery.

5 Signs You Need a Message Queue

  1. Synchronous calls between services chain more than 3 levels deep
  2. Traffic spikes overload downstream services
  3. Event replay is needed (failure recovery, data reprocessing)
  4. Multiple consumers need to process the same data for different purposes
  5. Microservice migration requires reducing inter-service coupling

2. Architecture Comparison: Log vs Broker vs Hybrid

Message queues fall into three main architectural paradigms.

2.1 Apache Kafka — Distributed Commit Log

graph TB
    subgraph "Kafka Cluster"
        direction TB
        KR[KRaft Controller] --> B1[Broker 1]
        KR --> B2[Broker 2]
        KR --> B3[Broker 3]
    end
    P[Producer] --> B1
    B1 --> T["Topic: orders (3 partitions)"]
    T --> P0[Partition 0 - Leader: B1]
    T --> P1[Partition 1 - Leader: B2]
    T --> P2[Partition 2 - Leader: B3]
    P0 --> CG["Consumer Group"]
    P1 --> CG
    P2 --> CG

Key Characteristics:

  • Append-Only Commit Log: Messages are appended sequentially to partitions and are immutable
  • Consumer Offsets: Consumers manage their read positions, enabling replay
  • KRaft (Kafka 4.0): Complete removal of ZooKeeper with Raft-based metadata management
  • Retention Policy: Messages retained based on time or size (default 7 days)
  • Partition Parallelism: Scale consumers in parallel up to the number of partitions

Best For: Event streaming, log aggregation, event sourcing, real-time analytics

2.2 RabbitMQ — AMQP Message Broker

graph LR
    P[Producer] --> E["Exchange (type: topic)"]
    E -->|"routing.key.a"| Q1[Queue A]
    E -->|"routing.key.b"| Q2[Queue B]
    E -->|"routing.#"| Q3[Queue C - wildcard]
    Q1 --> C1[Consumer 1]
    Q2 --> C2[Consumer 2]
    Q3 --> C3[Consumer 3]

Key Characteristics:

  • AMQP Protocol: Flexible routing model with Exchanges, Bindings, and Queues
  • Exchange Types: Direct, Topic, Fanout, Headers for complex routing
  • Message ACK: Messages are removed from the queue only after consumer acknowledgment
  • Plugin Ecosystem: Shovel, Federation, Management UI, and more
  • Priority Queue: Support for message priority levels

Best For: Complex routing, RPC patterns, task distribution, existing AMQP ecosystems
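The topic-exchange wildcard semantics shown in the diagram (`*` matches exactly one word, `#` matches zero or more words) can be sketched as a recursive matcher. Real brokers compile bindings into a trie for speed; this version only illustrates the matching rules.

```java
// Sketch of AMQP topic-exchange routing-key matching:
// '*' matches exactly one dot-separated word, '#' matches zero or more.
public class TopicMatch {
    public static boolean matches(String pattern, String routingKey) {
        return match(pattern.split("\\."), 0, routingKey.split("\\."), 0);
    }

    private static boolean match(String[] p, int pi, String[] k, int ki) {
        if (pi == p.length) return ki == k.length;
        if (p[pi].equals("#")) {
            // '#' can absorb zero or more words
            for (int skip = ki; skip <= k.length; skip++) {
                if (match(p, pi + 1, k, skip)) return true;
            }
            return false;
        }
        if (ki == k.length) return false;
        if (p[pi].equals("*") || p[pi].equals(k[ki])) {
            return match(p, pi + 1, k, ki + 1);
        }
        return false;
    }
}
```

So a binding of `routing.#` receives everything under the `routing` hierarchy, while `routing.*` only matches keys with exactly one word after `routing`.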

2.3 Amazon SQS — Fully Managed Queue

graph LR
    P[Producer] -->|SendMessage| SQS["SQS Queue"]
    SQS -->|ReceiveMessage| C1[Consumer 1]
    SQS -->|ReceiveMessage| C2[Consumer 2]
    SQS -.->|On failure| DLQ["Dead Letter Queue"]

Key Characteristics:

  • Fully Managed: No server provisioning, patching, or scaling needed
  • Pull-Based: Consumers actively fetch messages
  • Standard vs FIFO: Unlimited throughput (Standard) or ordered delivery (FIFO, 300-3,000 TPS)
  • Visibility Timeout: Messages are hidden from other consumers during processing
  • Native DLQ Support: Automatic routing of failed messages

Best For: Serverless architecture, AWS-native workloads, minimal operational overhead
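How the visibility timeout and DLQ redrive policy interact can be modeled in a few lines. The class and method names below are illustrative only, not the AWS SDK: a received message becomes invisible, reappears if not deleted before the timeout, and moves to the DLQ once it has been received `maxReceiveCount` times.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified in-memory model of SQS visibility timeout + DLQ redrive.
public class VisibilitySketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Deque<String> dlq = new ArrayDeque<>();
    private final Map<String, Integer> receiveCount = new HashMap<>();
    private final int maxReceiveCount;

    public VisibilitySketch(int maxReceiveCount) {
        this.maxReceiveCount = maxReceiveCount;
    }

    public void send(String body) {
        queue.addLast(body);
    }

    // Delivers the next visible message and counts the attempt.
    public String receive() {
        String msg = queue.pollFirst();
        if (msg != null) {
            receiveCount.merge(msg, 1, Integer::sum);
        }
        return msg;
    }

    // The consumer failed to delete the message in time: it becomes
    // visible again, unless it already hit maxReceiveCount, in which
    // case the redrive policy moves it to the DLQ.
    public void onVisibilityTimeout(String msg) {
        if (receiveCount.getOrDefault(msg, 0) >= maxReceiveCount) {
            dlq.addLast(msg);
        } else {
            queue.addLast(msg);
        }
    }

    public boolean inDlq(String msg) {
        return dlq.contains(msg);
    }
}
```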

2.4 Apache Pulsar — Decoupled Architecture

graph TB
    subgraph "Pulsar Cluster"
        direction TB
        B1[Broker 1 - Serving] --> BK1[BookKeeper 1 - Storage]
        B2[Broker 2 - Serving] --> BK2[BookKeeper 2 - Storage]
        B3[Broker 3 - Serving] --> BK3[BookKeeper 3 - Storage]
    end
    P[Producer] --> B1
    B1 --> T["Topic: events"]
    T --> SUB["Subscription (Shared/Key_Shared/Exclusive)"]
    SUB --> C1[Consumer 1]
    SUB --> C2[Consumer 2]

Key Characteristics:

  • Compute/Storage Separation: Brokers (serving) and BookKeeper (storage) scale independently
  • Multi-Tenancy: Isolation at the namespace and tenant level
  • Geo-Replication: Native cross-datacenter replication
  • Tiered Storage: Automatic offloading of old data to low-cost storage like S3
  • Subscription Modes: Exclusive, Shared, Failover, Key_Shared

Best For: Multi-tenant environments, geo-distributed deployments, large-scale streaming

2.5 NATS — Lightweight Message Mesh

graph LR
    subgraph "NATS Cluster"
        N1[NATS Server 1] --- N2[NATS Server 2]
        N2 --- N3[NATS Server 3]
        N1 --- N3
    end
    P[Publisher] --> N1
    N1 -->|"subject: sensor.temp"| S1[Subscriber 1]
    N2 -->|"subject: sensor.*"| S2[Subscriber 2 - wildcard]
    N3 -->|JetStream| JS[JetStream Consumer]

Key Characteristics:

  • Core NATS: Fire-and-forget Pub/Sub for minimal latency
  • JetStream: Enable when persistence, replay, and exactly-once are needed
  • Lightweight Binary: Single binary of about 20MB, written in Go
  • Subject-Based Routing: Hierarchical subject names with wildcard matching
  • Leaf Nodes: Lightweight node extension for edge computing

Best For: IoT, edge computing, fast inter-service communication, cloud-native
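NATS subject matching (`*` matches exactly one token, `>` matches one or more trailing tokens) can be sketched as below. The real server uses an optimized sublist structure, so this is only a semantic illustration.

```java
// Sketch of NATS subject matching semantics:
// '*' matches exactly one dot-separated token,
// '>' matches one or more trailing tokens.
public class SubjectMatch {
    public static boolean matches(String pattern, String subject) {
        String[] p = pattern.split("\\.");
        String[] s = subject.split("\\.");
        for (int i = 0; i < p.length; i++) {
            if (p[i].equals(">")) return s.length > i; // needs at least one more token
            if (i >= s.length) return false;
            if (!p[i].equals("*") && !p[i].equals(s[i])) return false;
        }
        return p.length == s.length;
    }
}
```

This is why a subscriber on `sensor.*` receives `sensor.temp` but not `sensor.temp.room1`, while `sensor.>` receives both.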

2.6 Redis Streams — In-Memory Log Structure

graph LR
    P[Producer] -->|XADD| RS["Redis Stream (mystream)"]
    RS -->|XREADGROUP| CG["Consumer Group"]
    CG --> C1["Consumer 1 (pending entries)"]
    CG --> C2["Consumer 2 (pending entries)"]

Key Characteristics:

  • In-Memory: Ultra-low latency, high throughput
  • Consumer Groups: Similar consumer group model to Kafka
  • XADD/XREAD: Simple command-based API
  • Pending Entries List (PEL): Tracks unacknowledged messages
  • MAXLEN Trimming: Automatic cleanup within memory limits

Best For: Lightweight streaming, environments already using Redis, cache + queue integration

Architecture Paradigm Summary

| System | Paradigm | Message Storage | Consumption Model | Message Deletion |
|---|---|---|---|---|
| Kafka | Distributed Commit Log | Disk (retention policy) | Pull + Offset | After retention period |
| RabbitMQ | Message Broker | Memory + Disk | Push (delete on ACK) | Immediately on ACK |
| SQS | Managed Queue | AWS managed | Pull | After processing |
| Pulsar | Decoupled Log | BookKeeper + Tiered | Pull + Cursor | Retention-policy based |
| NATS | Message Mesh | Memory / JetStream | Push / Pull | Policy-based |
| Redis Streams | In-Memory Log | Memory (AOF optional) | Pull (Consumer Group) | MAXLEN trimming |

3. Performance Benchmarks

Throughput and Latency Comparison

The following benchmarks are measured on standard 3-node cluster environments (m5.2xlarge or equivalent).

| System | Throughput (msgs/s) | P50 Latency | P99 Latency | Best For |
|---|---|---|---|---|
| Kafka | 500K - 1M | 5 - 15ms | 10 - 50ms | Event streams, replay, log aggregation |
| Pulsar | 1M - 2.6M | 5 - 10ms | Up to 300x lower tail latency than Kafka | Multi-tenancy, geo-replication |
| NATS | 200K - 400K | sub-ms | 1 - 5ms | IoT, edge, fast messaging |
| RabbitMQ | 50K - 100K | 1 - 5ms | 5 - 20ms | Complex routing, RPC |
| Redis Streams | 100K - 500K | sub-ms | 1 - 3ms | Lightweight streaming, cache integration |
| SQS | 3K - 30K (Standard) | 20 - 50ms | 50 - 100ms | Serverless, AWS native |

Throughput Deep Dive

The secret behind Kafka's 1M msgs/s:

# Kafka throughput optimization configuration
num.partitions=12
batch.size=65536          # 64KB batch
linger.ms=5               # Wait 5ms before sending batch
compression.type=lz4      # LZ4 compression
buffer.memory=67108864    # 64MB buffer
acks=1                    # Leader-only ACK (max throughput)
  • With 12 partitions, each partition handles approximately 83K msgs/s
  • Changing to acks=all reduces throughput by 30-40% but ensures data durability
  • LZ4 compression saves 40-60% network bandwidth

Why Pulsar excels at P99:

  • Broker and BookKeeper separation means write I/O and read I/O don't compete
  • BookKeeper's Journal (WAL) and Ledger (data) use physically separate disks
  • Result: Up to 300x improvement in P99 tail latency compared to Kafka

NATS sub-ms requirements:

# NATS Core (JetStream disabled)
# - No persistence, in-memory only
# - Message size under 10KB
# - Communication within a single datacenter

Scalability Comparison

| System | Horizontal Scaling | Partition/Queue Limit | Max Cluster Size |
|---|---|---|---|
| Kafka | Add partitions + brokers | Recommended 4,000 - 20,000/cluster | Hundreds of brokers |
| Pulsar | Independent broker/BookKeeper scaling | 1M+ topics possible | Thousands of nodes |
| NATS | Leaf node expansion | Unlimited (subject-based) | Super cluster |
| RabbitMQ | Quorum queues + node addition | Recommended thousands | Dozens of nodes |
| Redis Streams | Redis Cluster sharding | One stream per shard | Hundreds of nodes |
| SQS | Automatic (unlimited) | Unlimited | AWS managed |

4. 2025 New Feature Highlights

4.1 Kafka 4.0 — KRaft GA and Queue Mode

KRaft GA (KIP-833):

ZooKeeper has been completely removed in Kafka 4.0. All metadata management is now handled by KRaft controllers.

# Kafka 4.0 KRaft mode configuration
process.roles: controller,broker
node.id: 1
controller.quorum.voters: 1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners: PLAINTEXT://:9092,CONTROLLER://:9093
# All ZooKeeper-related configuration removed

Improvements:

  • 10x faster metadata propagation
  • Partition count scalable beyond 1 million
  • Drastically reduced operational complexity (no separate ZooKeeper cluster)
  • Cluster startup time reduced from minutes to seconds

KIP-932: Queues for Kafka (Share Groups):

Traditional "queue" semantics have been added to Kafka. Previously, messages within a consumer group could only be distributed at the partition level. With Share Groups, messages can be distributed at the individual message level.

// Kafka 4.0 Share Group example
Properties props = new Properties();
props.put("group.id", "my-share-group");
props.put("group.type", "share");  // New Share type

KafkaShareConsumer<String, String> consumer =
    new KafkaShareConsumer<>(props);
consumer.subscribe(List.of("orders"));

// Distributed per message, not bound to partitions
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record);
        consumer.acknowledge(record);  // Per-message ACK
    }
}

4.2 RabbitMQ 4.x — SQL Filters and AMQP 1.0

Native AMQP 1.0 Support:

Starting with RabbitMQ 4.0, AMQP 1.0 has been promoted to a core protocol. The existing AMQP 0.9.1 transitions to a legacy plugin.

Stream Filtering (4.2):

// RabbitMQ 4.2 Stream SQL Filter
// Server-side filtering saves network bandwidth
Consumer consumer = environment.consumerBuilder()
    .stream("events")
    .filter()
        .values("region-us", "priority-high")  // Server-side filter
        .postFilter(msg -> msg.getProperties().getSubject().equals("order"))
    .builder()
    .messageHandler((context, message) -> {
        // Only filtered messages received
        processMessage(message);
        context.storeOffset();
    })
    .build();

Performance: Maintains 4M+ msgs/s even with filtering enabled

Khepri Metadata Store:

A Raft-based metadata store replacing Mnesia, significantly improving stability in large-scale clusters.

4.3 Pulsar 4.1 — 19 PIPs

Apache Pulsar 4.1 includes 19 Pulsar Improvement Proposals (PIPs).

Key changes:

  • Java 17 Required: Java 8/11 support dropped
  • PIP-354: Topic Policy performance optimization
  • PIP-362: Broker Level Dispatch Throttling
  • PIP-374: Improved memory management reducing GC overhead

4.4 NATS 2.11 — Message TTL and Distributed Tracing

# NATS 2.11 JetStream message TTL configuration
nats stream add EVENTS \
  --subjects "events.>" \
  --max-msg-ttl 3600 \
  --storage file \
  --replicas 3

Key changes:

  • Message-Level TTL: Set expiration times on individual messages, not entire streams
  • Distributed Tracing: OpenTelemetry integration for message flow tracking
  • SubjectTransform: Dynamic subject name transformation
  • Improved Monitoring: Extended Prometheus metrics

5. Managed Cloud Services Comparison + Pricing

Major Managed Services

| Service | Underlying Tech | Provider | Key Features |
|---|---|---|---|
| Confluent Cloud | Kafka | Confluent | Schema Registry, ksqlDB, Connectors |
| Amazon MSK | Kafka | AWS | VPC integration, Serverless option |
| Amazon MSK Serverless | Kafka | AWS | Auto-scaling, pay-per-use |
| Amazon SQS | Proprietary | AWS | Fully managed, serverless integration |
| CloudAMQP | RabbitMQ | 84codes | Dedicated/Shared instances |
| StreamNative | Pulsar | StreamNative | Pulsar-specialized managed service |
| Synadia | NATS | Synadia | NATS-specialized managed service |

Monthly Cost Simulation (100M messages/day, 1KB average)

Scenario: 100M messages/day, 1KB average, 3-day retention

Confluent Cloud (Basic):
  - Throughput: ~1,160 msgs/s average
  - Estimated cost: ~800 - 1,200 USD/month
  - Includes: Brokers, storage, network

Amazon MSK Provisioned (kafka.m5.large x 3):
  - Instances: 0.21 USD/h x 3 = ~454 USD/month
  - Storage: 300GB x 0.10 USD/GB = 30 USD
  - Estimated total: ~500 - 700 USD/month

Amazon MSK Serverless:
  - Cluster: 0.75 USD/h = ~540 USD/month
  - Partitions: 0.0015 USD/h/partition
  - Storage: 0.10 USD/GB
  - Estimated total: ~600 - 1,000 USD/month

Amazon SQS Standard:
  - First 1M requests free
  - Then: 0.40 USD / 1M requests
  - ~3B requests/month: ~1,200 USD
  - Data transfer costs extra

CloudAMQP (Dedicated - Tiger):
  - ~500 - 1,000 USD/month
  - Cluster included, monitoring included

Self-Hosted Kafka (c5.2xlarge x 3 + EBS):
  - EC2: 0.34 USD/h x 3 = ~734 USD/month
  - EBS gp3 300GB x 3: ~72 USD/month
  - Operations cost extra (at least 1 DevOps engineer)
  - Estimated total: ~800 USD/month + ops costs

Selection Criteria Summary

  • Minimize cost: SQS for small scale (pay-per-use), self-hosted Kafka for large scale
  • Minimize ops overhead: SQS or MSK Serverless
  • Maximize features: Confluent Cloud (Schema Registry, ksqlDB)
  • Need RabbitMQ: CloudAMQP

6. Spring Boot Integration Code

6.1 Kafka + Spring Boot

// build.gradle
// implementation 'org.springframework.kafka:spring-kafka:3.3.0'

// application.yml
// spring:
//   kafka:
//     bootstrap-servers: localhost:9092
//     producer:
//       key-serializer: org.apache.kafka.common.serialization.StringSerializer
//       value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
//     consumer:
//       group-id: order-service
//       auto-offset-reset: earliest
//       key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
//       value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
// KafkaProducerService.java
@Service
@RequiredArgsConstructor
public class KafkaProducerService {

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public CompletableFuture<SendResult<String, OrderEvent>> sendOrderEvent(
            OrderEvent event) {
        return kafkaTemplate.send("orders", event.getOrderId(), event);
    }
}

// KafkaConsumerService.java
@Service
@Slf4j
public class KafkaConsumerService {

    @KafkaListener(
        topics = "orders",
        groupId = "order-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    public void handleOrderEvent(
            @Payload OrderEvent event,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset) {

        log.info("Received order: {} from partition: {}, offset: {}",
            event.getOrderId(), partition, offset);
        processOrder(event);
    }

    private void processOrder(OrderEvent event) {
        // Order processing logic
    }
}

6.2 RabbitMQ + Spring Boot

// build.gradle
// implementation 'org.springframework.boot:spring-boot-starter-amqp'

// application.yml
// spring:
//   rabbitmq:
//     host: localhost
//     port: 5672
//     username: guest
//     password: guest
// RabbitMQConfig.java
@Configuration
public class RabbitMQConfig {

    public static final String ORDER_EXCHANGE = "order.exchange";
    public static final String ORDER_QUEUE = "order.queue";
    public static final String ORDER_ROUTING_KEY = "order.created";
    public static final String DLQ_QUEUE = "order.dlq";

    @Bean
    public TopicExchange orderExchange() {
        return new TopicExchange(ORDER_EXCHANGE);
    }

    @Bean
    public Queue orderQueue() {
        return QueueBuilder.durable(ORDER_QUEUE)
            .withArgument("x-dead-letter-exchange", "")
            .withArgument("x-dead-letter-routing-key", DLQ_QUEUE)
            .withArgument("x-message-ttl", 60000)
            .build();
    }

    @Bean
    public Queue deadLetterQueue() {
        return QueueBuilder.durable(DLQ_QUEUE).build();
    }

    @Bean
    public Binding orderBinding(Queue orderQueue, TopicExchange orderExchange) {
        return BindingBuilder.bind(orderQueue)
            .to(orderExchange)
            .with(ORDER_ROUTING_KEY);
    }
}

// RabbitMQProducer.java
@Service
@RequiredArgsConstructor
public class RabbitMQProducer {

    private final RabbitTemplate rabbitTemplate;

    public void sendOrderEvent(OrderEvent event) {
        rabbitTemplate.convertAndSend(
            RabbitMQConfig.ORDER_EXCHANGE,
            RabbitMQConfig.ORDER_ROUTING_KEY,
            event,
            message -> {
                message.getMessageProperties().setContentType("application/json");
                message.getMessageProperties().setDeliveryMode(
                    MessageDeliveryMode.PERSISTENT);
                return message;
            }
        );
    }
}

// RabbitMQConsumer.java
@Service
@Slf4j
public class RabbitMQConsumer {

    @RabbitListener(queues = RabbitMQConfig.ORDER_QUEUE)
    public void handleOrderEvent(OrderEvent event, Channel channel,
            @Header(AmqpHeaders.DELIVERY_TAG) long tag) throws IOException {
        try {
            log.info("Received order: {}", event.getOrderId());
            processOrder(event);
            channel.basicAck(tag, false);
        } catch (Exception e) {
            log.error("Failed to process order: {}", event.getOrderId(), e);
            channel.basicNack(tag, false, false); // Move to DLQ
        }
    }
}

6.3 SQS + Spring Boot

// build.gradle
// implementation 'io.awspring.cloud:spring-cloud-aws-starter-sqs:3.2.0'

// application.yml
// spring:
//   cloud:
//     aws:
//       region:
//         static: ap-northeast-2
//       sqs:
//         endpoint: https://sqs.ap-northeast-2.amazonaws.com
// SQSProducer.java
@Service
@RequiredArgsConstructor
public class SQSProducer {

    private final SqsTemplate sqsTemplate;

    public void sendOrderEvent(OrderEvent event) {
        sqsTemplate.send(to -> to
            .queue("order-queue")
            .payload(event)
            .header("eventType", "ORDER_CREATED")
        );
    }

    // FIFO queue sending
    public void sendOrderEventFifo(OrderEvent event) {
        sqsTemplate.send(to -> to
            .queue("order-queue.fifo")
            .payload(event)
            .header(SqsHeaders.MessageSystemAttributes.SQS_MESSAGE_GROUP_ID_HEADER,
                event.getOrderId())
            .header(SqsHeaders.MessageSystemAttributes
                .SQS_MESSAGE_DEDUPLICATION_ID_HEADER,
                UUID.randomUUID().toString())
        );
    }
}

// SQSConsumer.java
@Service
@Slf4j
public class SQSConsumer {

    @SqsListener(value = "order-queue", maxConcurrentMessages = "10",
        maxMessagesPerPoll = "5")
    public void handleOrderEvent(@Payload OrderEvent event,
            @Header("eventType") String eventType) {
        log.info("Received order: {} with type: {}",
            event.getOrderId(), eventType);
        processOrder(event);
    }
}

6.4 Pulsar + Spring Boot

// build.gradle
// implementation 'org.springframework.pulsar:spring-pulsar-spring-boot-starter:1.2.0'

// application.yml
// spring:
//   pulsar:
//     client:
//       service-url: pulsar://localhost:6650
//     producer:
//       topic-name: persistent://public/default/orders
//     consumer:
//       subscription-name: order-service-sub
//       subscription-type: shared
// PulsarProducer.java
@Service
@RequiredArgsConstructor
public class PulsarProducer {

    private final PulsarTemplate<OrderEvent> pulsarTemplate;

    public void sendOrderEvent(OrderEvent event) {
        pulsarTemplate.send("persistent://public/default/orders", event);
    }
}

// PulsarConsumer.java
@Service
@Slf4j
public class PulsarConsumer {

    @PulsarListener(
        topics = "persistent://public/default/orders",
        subscriptionName = "order-service-sub",
        subscriptionType = SubscriptionType.Shared,
        schemaType = SchemaType.JSON
    )
    public void handleOrderEvent(OrderEvent event) {
        log.info("Received order: {}", event.getOrderId());
        processOrder(event);
    }
}

6.5 NATS + Spring Boot

// build.gradle
// implementation 'io.nats:jnats:2.20.4'
// NATSConfig.java
@Configuration
public class NATSConfig {

    @Bean
    public Connection natsConnection() throws IOException, InterruptedException {
        Options options = new Options.Builder()
            .server("nats://localhost:4222")
            .reconnectWait(Duration.ofSeconds(2))
            .maxReconnects(-1) // Unlimited reconnects
            .connectionListener((conn, type) -> {
                System.out.println("NATS connection event: " + type);
            })
            .build();
        return Nats.connect(options);
    }

    @Bean
    public JetStream jetStream(Connection connection) throws IOException {
        return connection.jetStream();
    }
}

// NATSProducer.java
@Service
@RequiredArgsConstructor
public class NATSProducer {

    private final JetStream jetStream;
    private final ObjectMapper objectMapper;

    public PublishAck sendOrderEvent(OrderEvent event) throws Exception {
        byte[] data = objectMapper.writeValueAsBytes(event);
        Message msg = NatsMessage.builder()
            .subject("orders.created")
            .data(data)
            .build();
        return jetStream.publish(msg);
    }
}

// NATSConsumer.java
@Service
@Slf4j
@RequiredArgsConstructor
public class NATSConsumer {

    private final Connection natsConnection;
    private final ObjectMapper objectMapper;

    @PostConstruct
    public void startConsumer() throws Exception {
        JetStream js = natsConnection.jetStream();

        PushSubscribeOptions options = PushSubscribeOptions.builder()
            .durable("order-consumer")
            .build();

        Dispatcher dispatcher = natsConnection.createDispatcher();

        MessageHandler handler = msg -> {
            try {
                OrderEvent event = objectMapper.readValue(
                    msg.getData(), OrderEvent.class);
                log.info("Received order: {}", event.getOrderId());
                processOrder(event);
                msg.ack();
            } catch (Exception e) {
                log.error("Failed to process message", e);
                msg.nak();
            }
        };

        // Durable push subscription; the queue group load-balances consumers
        js.subscribe("orders.>", "order-queue", dispatcher, handler, false, options);
    }
}

6.6 Redis Streams + Spring Boot

// build.gradle
// implementation 'org.springframework.boot:spring-boot-starter-data-redis'
// RedisStreamConfig.java
@Configuration
public class RedisStreamConfig {

    @Bean
    public StreamMessageListenerContainer<String, MapRecord<String, String, String>>
            streamListenerContainer(RedisConnectionFactory factory) {

        var options = StreamMessageListenerContainer
            .StreamMessageListenerContainerOptions.builder()
            .pollTimeout(Duration.ofSeconds(1))
            .batchSize(10)
            .targetType(MapRecord.class)
            .build();

        var container = StreamMessageListenerContainer.create(factory, options);

        container.receiveAutoAck(
            Consumer.from("order-group", "consumer-1"),
            StreamOffset.create("orders", ReadOffset.lastConsumed()),
            new OrderStreamListener()
        );

        container.start();
        return container;
    }
}

// RedisStreamProducer.java
@Service
@RequiredArgsConstructor
public class RedisStreamProducer {

    private final StringRedisTemplate redisTemplate;
    private final ObjectMapper objectMapper;

    public RecordId sendOrderEvent(OrderEvent event) throws Exception {
        Map<String, String> fields = Map.of(
            "orderId", event.getOrderId(),
            "type", "ORDER_CREATED",
            "payload", objectMapper.writeValueAsString(event)
        );
        return redisTemplate.opsForStream()
            .add(StreamRecords.newRecord()
                .ofMap(fields)
                .withStreamKey("orders"));
    }
}

// OrderStreamListener.java
@Slf4j
public class OrderStreamListener
        implements StreamListener<String, MapRecord<String, String, String>> {

    @Override
    public void onMessage(MapRecord<String, String, String> message) {
        Map<String, String> body = message.getValue();
        log.info("Received order: {} from stream: {}",
            body.get("orderId"), message.getStream());
        processOrder(body);
    }
}

7. Decision Flowchart

A decision tree for choosing a message queue system.

graph TD
    START[Choose a Message Queue] --> Q1["Need event replay?"]

    Q1 -->|Yes| Q1A["Need multi-tenancy or geo-replication?"]
    Q1A -->|Yes| PULSAR["Apache Pulsar"]
    Q1A -->|No| KAFKA["Apache Kafka"]

    Q1 -->|No| Q2["Need complex routing?"]
    Q2 -->|Yes| RABBIT["RabbitMQ"]

    Q2 -->|No| Q3["AWS-native serverless?"]
    Q3 -->|Yes| SQS["Amazon SQS"]

    Q3 -->|No| Q4["Need sub-ms ultra-low latency?"]
    Q4 -->|Yes| Q4A["Need persistence?"]
    Q4A -->|Yes| NATS["NATS + JetStream"]
    Q4A -->|No| NATS_CORE["NATS Core"]

    Q4 -->|No| Q5["Already using Redis?"]
    Q5 -->|Yes| REDIS["Redis Streams"]
    Q5 -->|No| KAFKA2["Kafka (default choice)"]

Quick Selection Guide

| Requirement | Recommended System | Reason |
|---|---|---|
| Event streaming + replay | Kafka | Commit-log based, industry standard |
| Complex message routing | RabbitMQ | Exchange/Binding model |
| Serverless + all-in on AWS | SQS | Zero ops, Lambda triggers |
| Multi-tenancy + geo-replication | Pulsar | Native multi-tenancy |
| IoT + edge computing | NATS | Lightweight, sub-ms latency |
| Already using Redis | Redis Streams | No additional infrastructure |
| High-volume events + low cost | Kafka (self-hosted) | High throughput per dollar |
| No operations team | SQS or Confluent Cloud | Fully managed |

8. Real-World Architecture Patterns

8.1 E-Commerce: Order Processing Pipeline

graph LR
    subgraph "Order Creation"
        API[API Gateway] --> OS[Order Service]
    end

    OS -->|OrderCreated| KAFKA[Kafka - orders topic]

    KAFKA --> PAY[Payment Service]
    KAFKA --> INV[Inventory Service]
    KAFKA --> NOTIF[Notification Service]

    PAY -->|PaymentCompleted| KAFKA2[Kafka - payments topic]
    KAFKA2 --> SHIP[Shipping Service]

    INV -->|StockReserved| KAFKA3[Kafka - inventory topic]
    KAFKA3 --> OS2[Order Service - status update]

    SHIP -->|ShipmentCreated| KAFKA4[Kafka - shipments topic]
    KAFKA4 --> NOTIF2[Notification Service - shipping alerts]

Why Kafka?

  • Order event replay needed (reprocessing on payment failures)
  • Multiple services consume the same event independently
  • Can serve as an audit log

Key Configuration:

# Order topic configuration
topics:
  orders:
    partitions: 12
    replication-factor: 3
    configs:
      retention.ms: 604800000 # 7 days
      min.insync.replicas: 2
      cleanup.policy: delete

  payments:
    partitions: 6
    replication-factor: 3
    configs:
      retention.ms: 2592000000 # 30 days (audit log)
      min.insync.replicas: 2

8.2 Real-Time Analytics Pipeline

graph LR
    subgraph "Data Collection"
        WEB[Web Clickstream] --> KAFKA[Kafka]
        APP[App Events] --> KAFKA
        IOT[IoT Sensors] --> NATS[NATS]
        NATS -->|Bridge| KAFKA
    end

    subgraph "Stream Processing"
        KAFKA --> FLINK[Apache Flink]
        KAFKA --> KSQL[ksqlDB]
    end

    subgraph "Storage and Serving"
        FLINK --> ES[Elasticsearch]
        FLINK --> CH[ClickHouse]
        KSQL --> REDIS[Redis Cache]
    end

    subgraph "Visualization"
        ES --> GRAFANA[Grafana]
        CH --> SUPERSET[Apache Superset]
    end

Why the Kafka + NATS combination?

  • Kafka: High-throughput event collection and Flink/ksqlDB integration
  • NATS: Meets ultra-low latency requirements for IoT sensors, lightweight agents
  • Bridge: Consolidates NATS-collected data into Kafka

8.3 IoT Sensor Data Collection

graph TB
    subgraph "Edge Layer"
        S1[Temperature Sensor] --> LN1[NATS Leaf Node - Factory A]
        S2[Humidity Sensor] --> LN1
        S3[Vibration Sensor] --> LN2[NATS Leaf Node - Factory B]
    end

    subgraph "Core Layer"
        LN1 --> NATS[NATS Cluster - JetStream]
        LN2 --> NATS
        NATS --> PROC[Stream Processor]
        PROC -->|Anomaly detection| ALERT[Alert Service - RabbitMQ]
        PROC -->|Aggregated data| TS[TimescaleDB]
    end

    ALERT -->|Email| EMAIL[Email Service]
    ALERT -->|Slack| SLACK[Slack Notification]

Why the NATS + RabbitMQ combination?

  • NATS: Extends to the edge with Leaf Nodes, sub-ms latency
  • JetStream: Sensor data persistence and replay
  • RabbitMQ: Flexible alert routing (email, Slack, SMS distribution)

8.4 Saga Pattern — Distributed Transactions

sequenceDiagram
    participant OS as Order Service
    participant K as Kafka
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service

    OS->>K: OrderCreated
    K->>PS: Payment request
    PS->>K: PaymentCompleted
    K->>IS: Deduct inventory
    IS->>K: StockReserved
    K->>SS: Create shipment
    SS->>K: ShipmentCreated
    K->>OS: Order complete

    Note over PS,IS: Compensating transactions on failure
    PS-->>K: PaymentFailed (compensate)
    K-->>IS: Restore inventory (compensate)
    K-->>OS: Cancel order (compensate)

Choreography vs Orchestration:

| Approach | Pros | Cons | Recommended Scenario |
|---|---|---|---|
| Choreography | Low coupling, easy to scale | Hard to trace flow | Simple workflows |
| Orchestration | Central control, clear flow | Orchestrator as single point of failure | Complex business logic |

9. Operations Checklist

Common Monitoring Metrics

| Metric | Description | Warning Threshold |
|---|---|---|
| Consumer Lag | Number of unconsumed messages | Sustained above 10K |
| Message Failure Rate | DLQ ingestion rate | Above 1% |
| Broker Disk Usage | Storage capacity | Above 80% |
| Network I/O | In/Out traffic | Above 70% of bandwidth |
| JVM GC Time | Kafka/Pulsar/RabbitMQ | Full GC over 5 seconds |

Kafka Production Checklist

1. Broker Configuration
   - min.insync.replicas = 2 (with replication-factor 3)
   - unclean.leader.election.enable = false
   - auto.create.topics.enable = false

2. Producer Configuration
   - acks = all (prevent data loss)
   - retries = Integer.MAX_VALUE
   - enable.idempotence = true
   - max.in.flight.requests.per.connection = 5

3. Consumer Configuration
   - enable.auto.commit = false (manual commits)
   - max.poll.records = 500
   - session.timeout.ms = 30000

4. Monitoring
   - Prometheus + Grafana (JMX Exporter)
   - Consumer Lag alerts
   - Under-Replicated Partitions alerts

RabbitMQ Production Checklist

1. Cluster Configuration
   - Use Quorum Queues (replaces Classic Mirrored Queues)
   - vm_memory_high_watermark = 0.4
   - disk_free_limit = 2GB

2. Connection Management
   - Use Connection Pools
   - Heartbeat timeout: 60 seconds
   - Reuse channels (channel creation is expensive)

3. Message Configuration
   - Persistent delivery mode
   - Enable Publisher Confirms
   - Configure Dead Letter Exchange

4. Monitoring
   - Management Plugin Dashboard
   - rabbitmq_prometheus plugin
   - Queue depth alerts

10. Quiz

Q1. The Biggest Change in Kafka 4.0

What core component was removed in Kafka 4.0, and what technology replaces it?

Answer: ZooKeeper was completely removed and replaced by KRaft (Kafka Raft).

  • KRaft is a Raft consensus algorithm-based metadata management system
  • Metadata propagation speed improved by 10x, partition count scalable beyond 1 million
  • No need to manage a separate ZooKeeper cluster, greatly reducing operational complexity

Q2. Pull vs Push Model

What are the differences, pros, and cons between pull-based consumption (Kafka, SQS) and push-based consumption (RabbitMQ)?

Pull-Based (Kafka, SQS):

  • Consumers actively fetch messages
  • Pros: Consumers can fetch at their own processing speed (natural backpressure)
  • Cons: Requires (long) polling, so delivery is slightly less immediate than push

Push-Based (RabbitMQ):

  • Broker pushes messages to consumers
  • Pros: Immediate delivery on message arrival (low latency)
  • Cons: Requires separate backpressure mechanisms when consumers are overloaded (prefetch count, etc.)
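The "natural backpressure" of the pull model can be illustrated with a bounded buffer: the producer is pushed back as soon as the consumer stops pulling. The class below is an illustrative sketch, not any broker's API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded buffer as a model of pull-based backpressure: publishing
// fails (or would block) when the consumer lags behind.
public class PullBackpressure {
    private final BlockingQueue<String> buffer;

    public PullBackpressure(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking probe: returns false when the buffer is full,
    // i.e. the producer is pushed back.
    public boolean tryPublish(String msg) {
        return buffer.offer(msg);
    }

    // The consumer fetches at its own pace.
    public String pull() {
        return buffer.poll();
    }
}
```

A push broker has no such intrinsic limit, which is why RabbitMQ needs an explicit prefetch count to cap unacknowledged deliveries per consumer.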

Q3. Exactly-Once Semantics

What 3 configurations are needed to achieve Exactly-Once Semantics (EOS) in Kafka?

Answer:

  1. enable.idempotence = true: Even if a producer sends duplicate messages, only one is stored
  2. transactional.id setting: A unique ID that identifies producer transactions
  3. isolation.level = read_committed: Consumers only read committed transaction messages

Combining all three guarantees end-to-end Exactly-Once from producer to consumer.
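The three settings above, collected as plain `Properties`. This is a sketch only: the `transactional.id` value is illustrative, and wiring these into actual Kafka clients plus the `initTransactions`/`beginTransaction`/`commitTransaction` calls is omitted.

```java
import java.util.Properties;

// The three EOS-related settings from the answer above.
public class EosConfig {
    public static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");             // broker drops duplicate sends
        p.setProperty("transactional.id", "order-service-tx-1"); // illustrative ID
        p.setProperty("acks", "all");                            // required by idempotence
        return p;
    }

    public static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("isolation.level", "read_committed"); // skip uncommitted/aborted messages
        return p;
    }
}
```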

Q4. SQS Standard vs FIFO

What are the differences between Amazon SQS Standard and FIFO queues in terms of throughput, ordering, and deduplication?

| Feature | Standard | FIFO |
|---|---|---|
| Throughput | Nearly unlimited | 300 TPS (3,000 with batching) |
| Ordering | Best-effort (no guarantee) | Strict FIFO ordering |
| Deduplication | At-Least-Once (duplicates possible) | Exactly-Once (5-minute dedup window) |
| Price | 0.40 USD / 1M requests | 0.50 USD / 1M requests |
| Use Case | High throughput, order-independent | Payments, orders where sequence matters |
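The FIFO deduplication window can be modeled in a few lines. `DedupWindow` below is an illustrative sketch, not SQS's implementation: a message whose deduplication ID was already seen within the 5-minute window is silently dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of SQS FIFO's 5-minute deduplication window.
public class DedupWindow {
    private static final long WINDOW_MS = 5 * 60 * 1000; // 5 minutes
    private final Map<String, Long> seen = new HashMap<>();

    // Returns true if the message is accepted,
    // false if it is a duplicate within the window.
    public boolean accept(String dedupId, long nowMs) {
        Long first = seen.get(dedupId);
        if (first != null && nowMs - first < WINDOW_MS) {
            return false;
        }
        seen.put(dedupId, nowMs);
        return true;
    }
}
```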

Q5. Pulsar vs Kafka Architecture

What are 2 operational advantages of Pulsar's compute/storage separation architecture compared to Kafka?

Answer:

  1. Independent Scaling: Brokers (compute) and BookKeeper (storage) can be scaled separately. In Kafka, brokers store data directly, so scaling storage requires adding entire brokers. In Pulsar, you just add BookKeeper nodes.

  2. Expansion Without Rebalancing: When adding a broker in Kafka, partition rebalancing is needed, causing massive data movement. In Pulsar, new brokers can immediately take topic ownership, minimizing service impact during scaling.


References

  1. Apache Kafka 4.0 Release Notes — KRaft GA, Share Groups
  2. KIP-932: Queues for Kafka — Share Group specification
  3. RabbitMQ 4.0 Release Blog — AMQP 1.0, Khepri
  4. RabbitMQ Streams — Stream Plugin and filtering
  5. Amazon SQS Developer Guide — Standard vs FIFO
  6. Apache Pulsar 4.1 Release — 19 PIPs
  7. NATS 2.11 Release Notes — Message TTL, distributed tracing
  8. Confluent Benchmark: Kafka vs Pulsar — Performance comparison
  9. Spring for Apache Kafka — Official Spring Kafka docs
  10. Spring AMQP — Spring RabbitMQ integration
  11. Spring Cloud AWS SQS — SQS integration
  12. Spring for Apache Pulsar — Official Spring Pulsar docs
  13. Designing Data-Intensive Applications (Martin Kleppmann) — Message system theory
  14. Enterprise Integration Patterns — Messaging pattern bible
  15. Confluent Cloud Pricing — Cloud cost reference
  16. AWS MSK Pricing — MSK cost reference