Redis Cluster Architecture and High Availability Operations Guide: Sentinel, Cluster Mode, Memory Optimization, and Disaster Recovery

Introduction
Redis Deployment Modes: Standalone vs Sentinel vs Cluster
- Deployment Mode Comparison
- Decision Criteria
Sentinel Architecture and Quorum Mechanism
Cluster Hash Slots and Resharding
Replication and PSYNC
- Replication Setup and PSYNC Protocol
- Replication Monitoring
Memory Management and Eviction Policies
Persistence: RDB and AOF
- RDB vs AOF Comparison
- AOF Rewrite and Management
Slow Log Analysis
- Safe Key Scanning with SCAN
Memory Fragmentation Handling
Failure Scenarios and Recovery Procedures
Operations Monitoring and Key Metrics
Operational Notes
Conclusion

Introduction

Redis is an in-memory data store used for caching, session storage, message brokering, and similar workloads. Starting with a single instance is straightforward, but achieving high availability (HA) and horizontal scaling in production requires a solid understanding of Sentinel or Cluster mode. Because Redis keeps its dataset in memory, operators also need working knowledge of memory management, eviction policies, persistence strategies, and failure recovery procedures.

This guide starts by comparing Redis's three deployment modes (Standalone, Sentinel, Cluster) and then covers Sentinel's quorum mechanism, Cluster hash slot distribution, replication protocol (PSYNC), memory optimization strategies, persistence (RDB/AOF), slow log analysis, and common failure scenarios (Split-brain, OOM) with recovery procedures.

Redis Deployment Modes: Standalone vs Sentinel vs Cluster

Deployment Mode Comparison

Aspect	Standalone	Sentinel	Cluster
Node Count	1 (+ optional replicas)	Min 3 Sentinel + 1 Master + N Replica	Min 6 (3 Master + 3 Replica)
Data Distribution	None	None (single master)	Automatic via hash slots
Automatic Failover	No	Yes (quorum-based)	Yes (majority vote)
Write Scaling	No	No	Yes (multi-master)
Read Scaling	Reads from replicas	Reads from replicas	Reads from replicas (after the READONLY command)
Client Complexity	Low	Medium (Sentinel-aware)	High (MOVED/ASK redirection)
Best For	Dev/Small-scale	HA for single dataset	Large-scale data + HA

Decision Criteria

# Decision flow
# 1. Does the data fit in a single node's memory?
#    - YES → Sentinel (if HA needed) or Standalone
#    - NO  → Cluster required
#
# 2. Do you need write performance scaling?
#    - YES → Cluster (multi-master)
#    - NO  → Sentinel is sufficient
#
# 3. Can you handle the operational complexity?
#    - Small team → Consider Sentinel first
#    - Dedicated infra team → Cluster is viable
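
The decision flow above can be sketched as a small function. This is only an illustration: the function name and its three boolean inputs are hypothetical, and the third question (operational capacity) is left as a human judgment rather than encoded.

```python
def choose_deployment_mode(fits_single_node: bool,
                           needs_write_scaling: bool,
                           needs_ha: bool) -> str:
    """Return 'standalone', 'sentinel', or 'cluster' for the given answers."""
    # Data that exceeds one node, or writes that must scale out,
    # both point to Cluster.
    if not fits_single_node or needs_write_scaling:
        return "cluster"
    # A single dataset that needs automatic failover fits Sentinel.
    return "sentinel" if needs_ha else "standalone"


if __name__ == "__main__":
    print(choose_deployment_mode(True, False, True))
```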

Sentinel Architecture and Quorum Mechanism

Role of Sentinel

Redis Sentinel is a distributed monitoring system that watches Redis instances and automatically promotes a replica to master when a primary failure is detected. Sentinel processes themselves are distributed to prevent single points of failure (SPOF).

# Sentinel configuration file (sentinel.conf)
port 26379
sentinel monitor mymaster 192.168.1.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

# sentinel monitor <master-name> <ip> <port> <quorum>
# quorum = minimum number of Sentinel agreements for failure detection

Quorum vs Majority

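Quorum and majority are distinct concepts. Quorum is the number of Sentinels that must agree the master is unreachable before it is flagged ODOWN, so it only governs failure detection. To actually run the failover, one Sentinel must be elected leader by a majority of all Sentinels, regardless of the quorum value.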

# 3 Sentinel deployment example
# quorum = 2 (2 agreements trigger ODOWN)
# majority = 2 (2 out of 3 = majority)

# 5 Sentinel deployment example
# quorum = 3 (3 agreements trigger ODOWN)
# majority = 3 (3 out of 5 = majority)

# Check Sentinel status
redis-cli -p 26379 SENTINEL masters
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
redis-cli -p 26379 SENTINEL replicas mymaster
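
To make the difference between quorum and majority concrete, here is a minimal sketch. The function names and return labels are hypothetical, and the model is simplified: it assumes every reachable Sentinel both sees the master as down and votes for the same leader.

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of Sentinels that forms a strict majority."""
    return total_sentinels // 2 + 1


def failover_outcome(total_sentinels: int, quorum: int, reachable: int) -> str:
    """Classify what the reachable Sentinels are able to do."""
    if reachable < quorum:
        # Not enough agreement to flag the master as ODOWN.
        return "no-odown"
    if reachable < majority(total_sentinels):
        # Failure is detected, but no leader can be elected.
        return "odown-but-no-failover"
    return "failover-authorized"


if __name__ == "__main__":
    # 5 Sentinels with quorum 2: two Sentinels detect the failure
    # but cannot authorize a failover on their own.
    print(failover_outcome(5, 2, 2))
```

With a quorum lower than the majority, detection becomes more sensitive, but the failover itself still needs the majority to be reachable.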

Failover Sequence

# 1. SDOWN (Subjective Down) - Individual Sentinel detects master unresponsive
#    Triggered after down-after-milliseconds elapsed

# 2. ODOWN (Objective Down) - Quorum or more Sentinels agree on SDOWN
#    Failover process begins at this point

# 3. Leader Election - One Sentinel elected as leader
#    Requires majority votes (Raft-like algorithm)

# 4. Replica Selection - Leader Sentinel selects the best replica
#    Priority: replica-priority → replication offset → runid

# 5. Failover Execution
#    SLAVEOF NO ONE command to selected replica
#    Redirect other replicas to new master

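# Monitor failover progress: inspect the flags field (and failover-state
# while a failover is in progress), then confirm the new master address
redis-cli -p 26379 SENTINEL master mymaster
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster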

Cluster Hash Slots and Resharding

Hash Slot Distribution

Redis Cluster divides the entire keyspace into 16384 hash slots. Each master node is responsible for a subset of slots, and a key's slot is determined by computing CRC16(key) mod 16384.

# Create cluster (minimum 6 nodes: 3 Master + 3 Replica)
redis-cli --cluster create \
  192.168.1.10:7000 192.168.1.11:7001 192.168.1.12:7002 \
  192.168.1.10:7003 192.168.1.11:7004 192.168.1.12:7005 \
  --cluster-replicas 1

# Check slot distribution
redis-cli -c -h 192.168.1.10 -p 7000 CLUSTER SLOTS

# Check which slot a key belongs to
redis-cli -c CLUSTER KEYSLOT mykey
# (integer) 14687

# Use hash tags to place keys in the same slot
# When a key contains {...}, only the string inside the braces is hashed
redis-cli -c SET "user:1001:profile" "data1"
redis-cli -c SET "user:1001:session" "data2"
# Without a hash tag, these two keys may end up in different slots

# With the {user:1001} hash tag, both keys map to the same slot
# Verify with CLUSTER KEYSLOT (both commands return the same slot number)
redis-cli CLUSTER KEYSLOT "{user:1001}:profile"
redis-cli CLUSTER KEYSLOT "{user:1001}:session"
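
The slot calculation can be reproduced offline. The sketch below implements the CRC16 variant used by Redis Cluster (XMODEM: polynomial 0x1021, initial value 0) together with the hash tag rule. The function names are illustrative; the authoritative answer for a live cluster is still CLUSTER KEYSLOT.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM: polynomial 0x1021, initial value 0x0000."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots, honoring hash tags."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty {...} section counts as a hash tag.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    for key in ("{user:1001}:profile", "{user:1001}:session"):
        print(key, key_slot(key))
```

Both keys print the same slot because only `user:1001` is hashed.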

Resharding

# Online resharding - move slots without service interruption
redis-cli --cluster reshard 192.168.1.10:7000 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 1000 \
  --cluster-yes

# Check resharding progress
redis-cli -c -h 192.168.1.10 -p 7000 CLUSTER INFO

# Cluster health check
redis-cli --cluster check 192.168.1.10:7000

# Slot rebalancing (automatic even distribution)
redis-cli --cluster rebalance 192.168.1.10:7000 \
  --cluster-threshold 2

MOVED and ASK Redirection

# Python example using redis-py 4.1+ (built-in cluster support)
from redis.cluster import RedisCluster

# RedisCluster follows MOVED/ASK redirections automatically
rc = RedisCluster(
    host='192.168.1.10',
    port=7000,
    decode_responses=True,
)

# Normal usage - redirection handled automatically
rc.set('session:abc123', 'user_data')
value = rc.get('session:abc123')

# Cluster pipelines are not transactional; redis-py groups the queued
# commands by node. Hash tags such as {order:1001} keep related keys in
# one slot, which multi-key commands and MULTI/EXEC require.
pipe = rc.pipeline()
pipe.set('{order:1001}:status', 'pending')
pipe.set('{order:1001}:total', '50000')
pipe.execute()

# MOVED: key lives on a different node (permanent move)
# ASK: key temporarily on a different node during resharding
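
Cluster-aware clients hide these replies, but it helps to see what they carry. The following sketch parses a redirection error of the form `MOVED <slot> <host>:<port>` or `ASK <slot> <host>:<port>`. The `Redirect` type and `parse_redirect` function are hypothetical names for illustration, not part of any client library.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a redirection error message; return None for anything else."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    # rpartition keeps hosts that themselves contain ':' intact.
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


if __name__ == "__main__":
    print(parse_redirect("MOVED 3999 127.0.0.1:6381"))
```

On MOVED a client would update its slot map and retry at the new node; on ASK it would send ASKING followed by the command to the target node, without updating the map.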

Replication and PSYNC

Replication Setup and PSYNC Protocol

# Replica configuration
# redis.conf (replica node)
replicaof 192.168.1.10 6379
masterauth your_password

# PSYNC protocol behavior:
# 1. Full Sync
#    - When replica connects for the first time or backlog is insufficient
#    - Master generates and sends RDB snapshot
#    - Changes during transfer are buffered and sent afterward

# 2. Partial Sync (PSYNC2)
#    - When replica reconnects after a brief disconnection
#    - Only delta from replication backlog is transmitted
#    - Much faster and more efficient

# backlog size (default 1MB, increase for production)
repl-backlog-size 256mb
repl-backlog-ttl 3600

# Check replication status
redis-cli INFO replication
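
The choice between full and partial sync can be modeled roughly as follows. This is a simplified sketch, not the server's actual logic: it ignores the secondary replication ID that PSYNC2 also accepts, and it approximates the backlog as the last `backlog_size` bytes of the replication stream. All names are hypothetical.

```python
def sync_type(replica_replid: str, replica_offset: int,
              master_replid: str, master_offset: int,
              backlog_size: int) -> str:
    """Return 'partial' if the backlog still covers the replica's gap."""
    if replica_replid != master_replid:
        # The replica followed a different history: full resync.
        return "full"
    backlog_start = max(master_offset - backlog_size, 0)
    if backlog_start <= replica_offset <= master_offset:
        return "partial"
    return "full"


def required_backlog_bytes(write_bytes_per_sec: int,
                           max_disconnect_sec: int) -> int:
    """Rough backlog size needed to survive a disconnection of given length."""
    return write_bytes_per_sec * max_disconnect_sec


if __name__ == "__main__":
    # 1 MB/s of writes and a 60 s outage need about 60 MB of backlog.
    print(required_backlog_bytes(1_000_000, 60))
```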

Replication Monitoring

# Check replication status on master
redis-cli INFO replication
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=1234567,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=1234567,lag=1
# master_replid:abc123def456...
# master_repl_offset:1234567
# repl_backlog_active:1
# repl_backlog_size:268435456

# Replication lag monitoring - lag is the number of seconds since the replica's
# last acknowledgement; investigate if it stays high
# lag > 10 : Check network or replica performance
# lag > 60 : Urgent action needed (full sync may occur)

# Verify read-only mode on replica
redis-cli -h 192.168.1.11 CONFIG GET replica-read-only
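
For alerting, the per-replica lag can be pulled out of the INFO replication text. A minimal sketch, assuming the `slaveN:ip=...,port=...,lag=...` line format shown above; the function name is hypothetical.

```python
from typing import Dict


def replica_lags(info_text: str) -> Dict[str, int]:
    """Map 'ip:port' of each replica to its lag in seconds."""
    lags = {}
    for line in info_text.splitlines():
        name, sep, value = line.strip().partition(":")
        # Replica lines look like slave0:ip=...,port=...; skip plain
        # counters such as connected_slaves or slave_repl_offset.
        if not sep or not name.startswith("slave") or "=" not in value:
            continue
        fields = dict(pair.split("=", 1) for pair in value.split(","))
        lags[f"{fields['ip']}:{fields['port']}"] = int(fields["lag"])
    return lags


if __name__ == "__main__":
    sample = (
        "role:master\n"
        "connected_slaves:1\n"
        "slave0:ip=192.168.1.11,port=6379,state=online,offset=1234567,lag=0\n"
    )
    print(replica_lags(sample))
```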

Memory Management and Eviction Policies

maxmemory Configuration

# maxmemory setting (redis.conf)
maxmemory 8gb

# Runtime change
redis-cli CONFIG SET maxmemory 8589934592

# Check current memory usage
redis-cli INFO memory
# used_memory:4294967296
# used_memory_human:4.00G
# used_memory_rss:4831838208
# used_memory_rss_human:4.50G
# mem_fragmentation_ratio:1.12
# maxmemory:8589934592
# maxmemory_human:8.00G
# maxmemory_policy:allkeys-lru

Eviction Policy Comparison

Policy	Target Keys	Algorithm	Best For
noeviction	None (rejects writes)	-	No data loss tolerated
allkeys-lru	All keys	LRU	General caching (most common)
allkeys-lfu	All keys	LFU	Popularity-based caching
allkeys-random	All keys	Random	Uniform access patterns
volatile-lru	TTL-set keys only	LRU	Mixed cache + persistent data
volatile-lfu	TTL-set keys only	LFU	Frequency-based among TTL keys
volatile-ttl	TTL-set keys only	Shortest remaining TTL	Expire-soon keys first
volatile-random	TTL-set keys only	Random	Random TTL key removal

# Set eviction policy
redis-cli CONFIG SET maxmemory-policy allkeys-lfu

# LFU counter configuration (Redis 4.0+)
# lfu-log-factor: counter increment speed (default 10, higher = slower)
# lfu-decay-time: counter decay interval in minutes (default 1)
redis-cli CONFIG SET lfu-log-factor 10
redis-cli CONFIG SET lfu-decay-time 1

# Check LFU frequency of a specific key
redis-cli OBJECT FREQ mykey

# Check eviction statistics
redis-cli INFO stats | grep evicted
# evicted_keys:12345
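
The effect of lfu-log-factor is easier to see numerically. The LFU counter is an 8-bit value that is incremented probabilistically; the sketch below reflects my understanding of that rule (new keys start at 5, the counter saturates at 255) and should be read as an approximation of the server logic rather than a copy of it.

```python
LFU_INIT_VAL = 5      # starting counter value assumed for new keys
LFU_MAX_COUNTER = 255  # the counter is stored in 8 bits


def lfu_increment_probability(counter: int, lfu_log_factor: int = 10) -> float:
    """Probability that one access increments the LFU counter."""
    if counter >= LFU_MAX_COUNTER:
        return 0.0
    base = max(counter - LFU_INIT_VAL, 0)
    return 1.0 / (base * lfu_log_factor + 1)


if __name__ == "__main__":
    for counter in (5, 10, 50, 100):
        print(counter, lfu_increment_probability(counter))
```

A higher lfu-log-factor lowers the probability at every counter value, so more accesses are needed to reach the same counter.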

Memory Optimization Techniques

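# 1. Data structure optimization - compact encodings (ziplist, replaced by listpack in Redis 7.0)
# Small hashes keep the compact encoding while they stay under both limits,
# which can cut memory use substantially; measure with MEMORY USAGE on real data
# Redis 7.0+ names these options hash-max-listpack-*; the ziplist names remain as aliases
redis-cli CONFIG SET hash-max-ziplist-entries 128
redis-cli CONFIG SET hash-max-ziplist-value 64

# Use listpack for small lists (Redis 7.0+; list-max-ziplist-size on older versions)
redis-cli CONFIG SET list-max-listpack-size -2

# Use the compact encoding for small Sorted Sets
redis-cli CONFIG SET zset-max-ziplist-entries 128
redis-cli CONFIG SET zset-max-ziplist-value 64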

# 2. Key naming optimization - use short key names
# Bad:  user:session:authentication:token:1001
# Good: u:s:t:1001

# 3. Memory usage analysis
redis-cli MEMORY USAGE mykey
redis-cli MEMORY DOCTOR

# 4. Big key detection (run the two scans separately)
redis-cli --bigkeys
redis-cli --memkeys

# 5. Lazy Free configuration (background deletion to avoid blocking)
redis-cli CONFIG SET lazyfree-lazy-eviction yes
redis-cli CONFIG SET lazyfree-lazy-expire yes
redis-cli CONFIG SET lazyfree-lazy-server-del yes

Persistence: RDB and AOF

RDB vs AOF Comparison

Aspect	RDB (Snapshot)	AOF (Append Only File)
Mechanism	Point-in-time full snapshot	Sequential recording of all write commands
Data Loss	Possible since last snapshot	Minimized based on fsync settings
File Size	Small (binary compressed)	Large (command text, reduced by rewrite)
Recovery Speed	Fast	Slow (command replay)
Performance Impact	Momentary latency during fork()	Continuous impact based on fsync frequency
Recommended Use	Backup	Data durability

# RDB configuration (redis.conf)
# Comments must sit on their own line; redis.conf does not allow trailing comments
# Snapshot if 1+ changes in 900 seconds (15 min)
save 900 1
# Snapshot if 10+ changes in 300 seconds (5 min)
save 300 10
# Snapshot if 10000+ changes in 60 seconds (1 min)
save 60 10000
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis

# AOF configuration
appendonly yes
appendfilename "appendonly.aof"
# Recommended: fsync every second (balance of safety and performance)
appendfsync everysec
# appendfsync always: fsync on every write (safest, slowest)
# appendfsync no: delegate flushing to the OS (fastest, larger loss window)

# AOF Rewrite configuration
# Rewrite when the AOF has grown 100% since the last rewrite
auto-aof-rewrite-percentage 100
# Only rewrite once the AOF is at least 64MB
auto-aof-rewrite-min-size 64mb

# Production recommendation: use both RDB + AOF
# RDB: fast recovery + backup
# AOF: minimize data loss
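
How the two auto-aof-rewrite settings interact can be sketched as a predicate. This is an approximation of the trigger condition as I understand it, with hypothetical names; the sizes correspond to aof_current_size and aof_base_size from INFO persistence.

```python
def should_auto_rewrite(current_size: int, base_size: int,
                        percentage: int = 100,
                        min_size: int = 64 * 1024 * 1024) -> bool:
    """True if an automatic AOF rewrite would be triggered."""
    # A percentage of 0 disables automatic rewrites; small files are skipped.
    if percentage == 0 or current_size <= min_size:
        return False
    base = base_size or 1  # avoid dividing by zero before the first rewrite
    growth = current_size * 100 // base - 100
    return growth >= percentage


if __name__ == "__main__":
    # 128 MB now versus 64 MB after the last rewrite: 100% growth.
    print(should_auto_rewrite(134217728, 67108864))
```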

AOF Rewrite and Management

# Manual AOF Rewrite
redis-cli BGREWRITEAOF

# Check AOF status
redis-cli INFO persistence
# aof_enabled:1
# aof_rewrite_in_progress:0
# aof_last_rewrite_time_sec:2
# aof_current_size:134217728
# aof_base_size:67108864

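# AOF integrity check (run without --fix first; --fix truncates the file at the first invalid entry)
redis-check-aof appendonly.aof
redis-check-aof --fix appendonly.aof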

# RDB integrity check
redis-check-rdb dump.rdb

# Manual backup (using BGSAVE)
redis-cli BGSAVE
# Creates RDB snapshot in background
# Copy dump.rdb to safe storage

Slow Log Analysis

# Slow log configuration
redis-cli CONFIG SET slowlog-log-slower-than 10000  # Log queries taking 10ms+
redis-cli CONFIG SET slowlog-max-len 128            # Keep up to 128 entries

# View slow log
redis-cli SLOWLOG GET 10
# 1) 1) (integer) 14          # Log ID
#    2) (integer) 1710230400   # Timestamp
#    3) (integer) 15230       # Execution time (microseconds)
#    4) 1) "KEYS"             # Command
#       2) "*session*"
#    5) "192.168.1.50:54321"  # Client address
#    6) ""                    # Client name (empty if not set)

# Slow log statistics
redis-cli SLOWLOG LEN
redis-cli SLOWLOG RESET

# O(N) commands to avoid in production:
# KEYS *        → Replace with SCAN
# SMEMBERS      → Replace with SSCAN
# HGETALL       → Replace with HSCAN
# LRANGE 0 -1   → Apply pagination
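
When the slow log holds many entries, grouping them by command shows which commands to replace first. A minimal sketch, assuming entries in the raw SLOWLOG GET layout shown above (ID, timestamp, duration in microseconds, argument list, ...); client libraries may return a different shape, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def summarize_slowlog(entries: Sequence[Sequence]) -> List[Tuple[str, int, int]]:
    """Return (command, count, total_microseconds), slowest total first."""
    totals = defaultdict(lambda: [0, 0])
    for entry in entries:
        duration_us, args = entry[2], entry[3]
        command = str(args[0]).upper()
        totals[command][0] += 1
        totals[command][1] += duration_us
    rows = [(cmd, count, total) for cmd, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    sample = [
        [14, 1710230400, 15230, ["KEYS", "*session*"], "192.168.1.50:54321", ""],
    ]
    print(summarize_slowlog(sample))
```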

Safe Key Scanning with SCAN

# Use SCAN instead of KEYS: each call does a bounded amount of work (COUNT is a hint, not a hard limit)
redis-cli SCAN 0 MATCH "session:*" COUNT 100
# 1) "17920"    # Next cursor
# 2) 1) "session:abc123"
#    2) "session:def456"
#    ...

# Iterate until cursor returns 0
redis-cli SCAN 17920 MATCH "session:*" COUNT 100
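
The cursor loop can be written once and reused. The sketch below takes the scan call as a parameter so it can be run without a server; with redis-py, passing a client's `scan` method should work because it accepts `cursor`, `match`, and `count` and returns a `(cursor, keys)` pair. The `scan_all` name is hypothetical, and redis-py's own `scan_iter` covers the same ground.

```python
from typing import Callable, Iterator, List, Tuple

ScanFn = Callable[..., Tuple[int, List[str]]]


def scan_all(scan: ScanFn, match: str, count: int = 100) -> Iterator[str]:
    """Yield every key matching the pattern, one SCAN page at a time."""
    cursor = 0
    while True:
        cursor, keys = scan(cursor=cursor, match=match, count=count)
        yield from keys
        # The server returns cursor 0 when the iteration is complete.
        if cursor == 0:
            break


if __name__ == "__main__":
    # Stand-in for a real client: two pages, then cursor 0.
    pages = {0: (17920, ["session:abc123"]), 17920: (0, ["session:def456"])}
    fake_scan = lambda cursor, match, count: pages[cursor]
    print(list(scan_all(fake_scan, "session:*")))
```

SCAN may return the same key more than once, so callers that need uniqueness should deduplicate.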

Memory Fragmentation Handling

# Check fragmentation ratio
redis-cli INFO memory | grep frag
# mem_fragmentation_ratio:1.45
# mem_fragmentation_bytes:536870912

# Interpreting mem_fragmentation_ratio:
# 1.0 ~ 1.5 : Normal range
# Above 1.5 : Severe fragmentation, action needed
# Below 1.0 : Swapping in use (very dangerous)

# Enable Active Defragmentation (Redis 4.0+, requires the bundled jemalloc allocator)
redis-cli CONFIG SET activedefrag yes

# Defragmentation threshold settings
redis-cli CONFIG SET active-defrag-ignore-bytes 100mb
redis-cli CONFIG SET active-defrag-threshold-lower 10   # Start at 10%
redis-cli CONFIG SET active-defrag-threshold-upper 100  # Max effort at 100%
redis-cli CONFIG SET active-defrag-cycle-min 1           # Min 1% CPU
redis-cli CONFIG SET active-defrag-cycle-max 25          # Max 25% CPU

# Check jemalloc statistics
redis-cli MEMORY MALLOC-STATS
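
The interpretation table above translates directly into a check. A minimal sketch, assuming the ratio is computed as used_memory_rss divided by used_memory and using the thresholds from this guide; the function name and status labels are hypothetical.

```python
from typing import Tuple


def fragmentation_status(used_memory_rss: int, used_memory: int) -> Tuple[float, str]:
    """Return the fragmentation ratio and a coarse status label."""
    if used_memory <= 0:
        raise ValueError("used_memory must be positive")
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        return ratio, "swap-suspected"
    if ratio <= 1.5:
        return ratio, "normal"
    return ratio, "fragmented"


if __name__ == "__main__":
    # Values from the INFO memory example earlier in this guide.
    print(fragmentation_status(4831838208, 4294967296))
```

On very small datasets the ratio is dominated by fixed allocator overhead, so pairing it with an absolute byte threshold avoids false alarms.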

Failure Scenarios and Recovery Procedures

Scenario 1: Split-brain

When a network partition occurs, the Sentinel cluster can split, resulting in two simultaneous masters. This is the most dangerous failure as it causes data inconsistency.

# Split-brain prevention settings
# Master rejects writes unless at least N replicas have acknowledged
# within min-replicas-max-lag seconds
redis-cli CONFIG SET min-replicas-to-write 1
redis-cli CONFIG SET min-replicas-max-lag 10

# Recovery procedure when split-brain occurs:
# 1. Check status of all Redis instances
redis-cli -h 192.168.1.10 -p 6379 INFO replication
redis-cli -h 192.168.1.11 -p 6379 INFO replication

# 2. Determine which master has the latest data
redis-cli -h 192.168.1.10 -p 6379 INFO replication | grep master_repl_offset
redis-cli -h 192.168.1.11 -p 6379 INFO replication | grep master_repl_offset

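# 3. Demote the stale master to replica
#    It discards its own dataset and resyncs, so writes it accepted during
#    the partition are lost; export anything you need beforehand
redis-cli -h 192.168.1.11 -p 6379 REPLICAOF 192.168.1.10 6379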

# 4. Reset Sentinel state
redis-cli -p 26379 SENTINEL RESET mymaster

# 5. Verify data consistency
redis-cli -h 192.168.1.10 DBSIZE
redis-cli -h 192.168.1.11 DBSIZE
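
Why the min-replicas settings limit split-brain damage can be seen in a small model: an isolated master stops seeing acknowledgements and starts refusing writes. The sketch below approximates that check with hypothetical names; lags are in seconds, as reported by INFO replication.

```python
from typing import Iterable


def master_accepts_writes(replica_lags_sec: Iterable[int],
                          min_replicas_to_write: int = 1,
                          min_replicas_max_lag: int = 10) -> bool:
    """True if enough replicas acknowledged recently enough."""
    # Setting either option to 0 disables the check.
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True
    good = sum(1 for lag in replica_lags_sec if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


if __name__ == "__main__":
    print(master_accepts_writes([0, 1]))  # healthy: writes accepted
    print(master_accepts_writes([]))      # isolated master: writes refused
```

The old master can still accept writes for up to min-replicas-max-lag seconds after the partition starts, so this narrows the window of lost writes without closing it.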

Scenario 2: OOM (Out of Memory)

# OOM prevention monitoring
redis-cli INFO memory
# used_memory_peak:8589934592
# used_memory_peak_human:8.00G

# Check if Redis was killed by OOM Killer
# Check system logs
# Look for OOM-related entries in dmesg or /var/log/syslog

# Emergency recovery procedure:
# 1. Check and adjust maxmemory
redis-cli CONFIG SET maxmemory 12gb

# 2. Change eviction policy if set to noeviction
redis-cli CONFIG GET maxmemory-policy
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# 3. Identify and clean up large keys
redis-cli --bigkeys
redis-cli --memkeys --memkeys-samples 100

# 4. Set TTL on unnecessary keys
redis-cli SCAN 0 MATCH "temp:*" COUNT 1000
# Set TTL on identified keys
redis-cli EXPIRE "temp:old-data" 3600

# 5. Set up alerts for future prevention (warn at 80% of maxmemory)
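
The 80% alert in step 5 could look like the following. This is a sketch with hypothetical names; the inputs correspond to used_memory and maxmemory from INFO memory, and a maxmemory of 0 means no limit is configured.

```python
def memory_alert(used_memory: int, maxmemory: int,
                 warn_ratio: float = 0.80) -> str:
    """Return 'no-limit', 'warn', or 'ok' for the current memory usage."""
    if maxmemory == 0:
        # Without a limit there is nothing to compare against;
        # treat this as a configuration problem in its own right.
        return "no-limit"
    return "warn" if used_memory / maxmemory >= warn_ratio else "ok"


if __name__ == "__main__":
    print(memory_alert(7 * 1024**3, 8 * 1024**3))
```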

Scenario 3: Cluster Node Failure

# Check cluster status
redis-cli -c CLUSTER INFO
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_slots_pfail:0
# cluster_slots_fail:0

# Identify failed node
redis-cli -c CLUSTER NODES
# Failed node will show "fail" flag

# Node replacement procedure:
# 1. Start new Redis instance
redis-server /etc/redis/7006.conf

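# 2. Add the new node to the cluster, choosing ONE of the following:
#    a) as an empty master
redis-cli --cluster add-node 192.168.1.13:7006 192.168.1.10:7000

#    b) as a replica of a specific master
redis-cli --cluster add-node 192.168.1.13:7006 192.168.1.10:7000 \
  --cluster-slave --cluster-master-id <master-node-id>

# 3. Remove the failed node
redis-cli --cluster del-node 192.168.1.10:7000 <failed-node-id>
#    If the failed node is unreachable and del-node cannot find it,
#    make every remaining node forget it instead:
redis-cli --cluster call 192.168.1.10:7000 CLUSTER FORGET <failed-node-id>

# 4. Repair slot coverage (if a master failed without a promotable replica)
redis-cli --cluster fix 192.168.1.10:7000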
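
Finding the failed node ID by eye is error-prone in larger clusters. The sketch below extracts node IDs whose flags include `fail` from CLUSTER NODES output, assuming the usual space-separated layout (ID, address, flags, ...). Nodes flagged only `fail?` (PFAIL, suspected but unconfirmed) are deliberately excluded. The function name is hypothetical.

```python
from typing import List


def failed_node_ids(cluster_nodes_output: str) -> List[str]:
    """Return IDs of nodes that the cluster has marked as failed."""
    failed = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        node_id, flags = fields[0], fields[2].split(",")
        # Exact match: "fail?" is the weaker PFAIL state.
        if "fail" in flags:
            failed.append(node_id)
    return failed


if __name__ == "__main__":
    sample = (
        "aaa 192.168.1.10:7000@17000 myself,master - 0 0 1 connected 0-5460\n"
        "bbb 192.168.1.11:7001@17001 master,fail - 1 2 2 disconnected 5461-10922\n"
    )
    print(failed_node_ids(sample))
```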

Operations Monitoring and Key Metrics

# Comprehensive status check script
redis-cli INFO ALL | grep -E "used_memory_human|mem_fragmentation_ratio|connected_clients|blocked_clients|instantaneous_ops_per_sec|keyspace_hits|keyspace_misses|evicted_keys"

# Key monitoring metrics:
# 1. Memory: used_memory / maxmemory ratio
# 2. Fragmentation: mem_fragmentation_ratio
# 3. Cache hit rate: keyspace_hits / (keyspace_hits + keyspace_misses)
# 4. Connections: connected_clients
# 5. Throughput: instantaneous_ops_per_sec
# 6. Evictions: evicted_keys (alert on spikes)
# 7. Replication lag: master_repl_offset vs slave_repl_offset

# Latency monitoring
redis-cli --latency
redis-cli --latency-history

# Latency event monitoring (Redis 2.8.13+)
redis-cli CONFIG SET latency-monitor-threshold 100
redis-cli LATENCY LATEST
redis-cli LATENCY HISTORY event-name
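
INFO does not report a hit rate directly, so metric 3 has to be derived from the two counters. A minimal sketch with a hypothetical function name:

```python
from typing import Optional


def cache_hit_rate(keyspace_hits: int, keyspace_misses: int) -> Optional[float]:
    """Hit rate in [0, 1], or None when there were no lookups yet."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        return None
    return keyspace_hits / total


if __name__ == "__main__":
    print(cache_hit_rate(900, 100))
```

Both counters are cumulative since server start (or CONFIG RESETSTAT), so a dashboard would normally compute the rate from the difference between two samples.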

Operational Notes

Memory Overcommit: On Linux, set vm.overcommit_memory=1. Without this, fork()-based RDB/AOF rewrite may fail due to insufficient memory.
Disable Transparent Huge Pages (THP): THP negatively affects Redis fork performance and must be disabled.
maxmemory Margin: Set to 70-80% of physical memory to leave headroom for replication buffers, AOF rewrite, and OS cache.
MULTI/EXEC in Cluster Mode: All keys in a transaction must reside in the same slot. Use hash tags to ensure co-location.
Sentinel Placement: Deploy Sentinel instances on separate physical servers or availability zones from Redis nodes to prevent simultaneous failures.
Ban KEYS Command: KEYS causes O(N) blocking in production. Replace with SCAN and disable via rename-command.

# Production kernel tuning
# /etc/sysctl.conf
# vm.overcommit_memory = 1
# net.core.somaxconn = 65535
# net.ipv4.tcp_max_syn_backlog = 65535

# Disable THP
# echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Disable dangerous commands (redis.conf)
# rename-command KEYS ""
# rename-command FLUSHDB ""
# rename-command FLUSHALL ""
# rename-command DEBUG ""

Conclusion

Operating Redis for high availability goes beyond simply configuring Sentinel or Cluster. Stable production operations require memory management, eviction policies, persistence strategies, replication monitoring, and preparation for failure scenarios.

The three essentials are: First, choose the right deployment mode for your workload. Second, always configure memory limits and eviction policies. Third, document recovery procedures for each failure scenario and practice them regularly. With these three pillars in place, Redis can reliably deliver its strengths as an ultra-fast in-memory data store.