AI Platform Feature Store 동기화 설계

Online-Offline 동기화가 실패하는 순간
Point-in-Time Join의 원리와 함정
- Feast의 point-in-time join 동작 방식
- Leakage 검증 유틸리티
CDC 기반 Online Store 동기화 파이프라인
- 동기화 아키텍처 구성
- Debezium CDC -> Kafka -> Feast Materializer 흐름
Freshness 모니터링과 SLA 알림
- Freshness 점검 SQL (Online Store가 PostgreSQL 기반일 때)
- Prometheus + Alertmanager 알림 설정
Feature View 버전 관리와 모델 Pinning
- KServe InferenceService에 피처 버전 명시
- Feature View 스키마 호환성 테스트
장애 시나리오별 대응
동기화 상태 종합 대시보드
배포 체크리스트: Feature View 변경 시
참고 자료

Online-Offline 동기화가 실패하는 순간

Feature Store를 도입한 팀 대부분이 겪는 첫 번째 장애는 "학습할 때 본 피처와 서빙 시 조회되는 피처가 다르다"는 것이다. 이 학습-서빙 불일치(Training-Serving Skew)는 단순한 버그가 아니라 아키텍처 수준의 설계 결함에서 비롯된다.

전형적인 시나리오를 살펴보자.

Offline Store(BigQuery, S3 Parquet)에서 학습 데이터를 point-in-time join으로 생성한다.
Online Store(Redis, DynamoDB)에는 CDC 파이프라인으로 최신 피처를 밀어 넣는다.
서빙 시 Online Store에서 조회한 피처와 학습 시 사용한 피처의 변환 로직이 미묘하게 다르다.

이 글에서는 이 불일치를 구조적으로 제거하는 동기화 설계를 Feast 0.40+와 Kafka 기반 CDC를 중심으로 설명한다.

Point-in-Time Join의 원리와 함정

Point-in-time join은 "예측 시점 기준으로 그때 알 수 있었던 피처만 사용한다"는 원칙이다. Feature leakage(미래 데이터 유입)를 방지하는 핵심 메커니즘이지만, 구현 시 미묘한 함정이 있다.

Feast의 point-in-time join 동작 방식

from feast import FeatureStore
from datetime import datetime, timedelta
import pandas as pd

store = FeatureStore(repo_path="./feature_repo")

# Entity DataFrame: 예측 대상 + 예측 시점
entity_df = pd.DataFrame({
    "user_id": ["u_001", "u_002", "u_003"],
    "event_timestamp": [
        datetime(2026, 3, 3, 10, 0, 0),  # 각 유저별 예측 시점이 다르다
        datetime(2026, 3, 3, 14, 30, 0),
        datetime(2026, 3, 4, 9, 0, 0),
    ],
})

# Feast가 각 row의 event_timestamp 이전에 발생한 최신 피처를 자동으로 매칭
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_purchase_stats:total_purchases_7d",
        "user_purchase_stats:avg_order_value_30d",
        "user_session_features:session_count_24h",
        "user_session_features:last_active_minutes_ago",
    ],
).to_df()

print(training_df.head())
# user_id | event_timestamp | total_purchases_7d | avg_order_value_30d | ...

Leakage 검증 유틸리티

학습 데이터 생성 후 반드시 leakage 검증을 수행해야 한다. 아래는 CI 파이프라인에 넣을 수 있는 검증 함수다.

import pandas as pd
from typing import List

def validate_point_in_time_correctness(
    training_df: pd.DataFrame,
    entity_timestamp_col: str,
    feature_timestamp_cols: List[str],
    tolerance_seconds: int = 0,
) -> dict:
    """
    학습 데이터에서 피처 타임스탬프가 엔티티 타임스탬프보다
    미래인 행을 탐지한다. tolerance_seconds로 클럭 드리프트 허용.
    """
    violations = {}
    for feat_ts_col in feature_timestamp_cols:
        if feat_ts_col not in training_df.columns:
            continue
        mask = (
            training_df[feat_ts_col]
            > training_df[entity_timestamp_col] + pd.Timedelta(seconds=tolerance_seconds)
        )
        violation_count = mask.sum()
        if violation_count > 0:
            violations[feat_ts_col] = {
                "count": int(violation_count),
                "ratio": round(violation_count / len(training_df), 4),
                "sample_entity_ts": str(training_df.loc[mask, entity_timestamp_col].iloc[0]),
                "sample_feat_ts": str(training_df.loc[mask, feat_ts_col].iloc[0]),
            }
    return violations  # 비어 있으면 정상

# CI에서 호출
result = validate_point_in_time_correctness(
    training_df,
    entity_timestamp_col="event_timestamp",
    feature_timestamp_cols=["feature_timestamp_purchase", "feature_timestamp_session"],
    tolerance_seconds=5,  # 5초 클럭 드리프트 허용
)
if result:
    raise AssertionError(f"Point-in-time leakage detected: {result}")

CDC 기반 Online Store 동기화 파이프라인

Offline Store의 원천 데이터가 변경되면 Online Store에도 반영해야 한다. 배치 주기(1시간, 1일)로 동기화하면 freshness 문제가 생긴다. CDC(Change Data Capture) + Kafka Streams 조합이 준실시간 동기화의 표준 패턴이다.

동기화 아키텍처 구성

# feature_store.yaml (Feast 0.40+ 구성)
project: recommendation_platform
provider: gcp
registry:
  registry_type: sql
  path: postgresql://feast:feast@pg-host:5432/feast_registry
  cache_ttl_seconds: 60

offline_store:
  type: bigquery
  dataset: features_offline

online_store:
  type: redis
  connection_string: redis://redis-cluster:6379
  key_ttl_seconds: 86400 # 24시간 TTL

# 동기화 설정
stream_ingestion:
  enabled: true
  kafka:
    bootstrap_servers: kafka-broker-1:9092,kafka-broker-2:9092
    topic_prefix: feast-features
    consumer_group: feast-materializer
    security_protocol: SASL_SSL
    sasl_mechanism: SCRAM-SHA-512

Debezium CDC -> Kafka -> Feast Materializer 흐름

"""
Kafka Consumer로 CDC 이벤트를 수신하여 Feast Online Store에 반영하는 워커.
Debezium의 PostgreSQL CDC 커넥터에서 발행된 이벤트를 처리한다.
"""
import json
from datetime import datetime
from confluent_kafka import Consumer, KafkaError
from feast import FeatureStore
import pandas as pd

KAFKA_CONFIG = {
    "bootstrap.servers": "kafka-broker-1:9092",
    "group.id": "feast-cdc-materializer",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
    "max.poll.interval.ms": 300000,
}

store = FeatureStore(repo_path="./feature_repo")
consumer = Consumer(KAFKA_CONFIG)
consumer.subscribe(["dbserver1.public.user_purchases"])

BATCH_SIZE = 500
FLUSH_INTERVAL_SEC = 10
buffer = []
last_flush = datetime.now()

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        pass
    elif msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        raise Exception(f"Kafka error: {msg.error()}")
    else:
        payload = json.loads(msg.value().decode("utf-8"))
        after = payload.get("after", {})
        if after:
            buffer.append({
                "user_id": after["user_id"],
                "total_purchases_7d": after["total_purchases_7d"],
                "avg_order_value_30d": after["avg_order_value_30d"],
                "event_timestamp": datetime.fromisoformat(after["updated_at"]),
            })

    elapsed = (datetime.now() - last_flush).total_seconds()
    if len(buffer) >= BATCH_SIZE or (buffer and elapsed >= FLUSH_INTERVAL_SEC):
        df = pd.DataFrame(buffer)
        # Feast Online Store에 직접 write
        store.write_to_online_store(
            feature_view_name="user_purchase_stats",
            df=df,
        )
        consumer.commit()
        print(f"Materialized {len(buffer)} rows to online store")
        buffer.clear()
        last_flush = datetime.now()

Freshness 모니터링과 SLA 알림

Online Store의 데이터가 얼마나 최신인지 지속적으로 추적해야 한다. "최근 5분 이내 업데이트된 entity 비율"이 핵심 지표다.

Freshness 점검 SQL (Online Store가 PostgreSQL 기반일 때)

-- Online Store freshness 대시보드 쿼리
-- entity별 마지막 업데이트 시각과 현재 시각의 차이를 계산한다

WITH freshness AS (
    SELECT
        entity_key,
        feature_view_name,
        MAX(event_ts)                              AS latest_event_ts,
        EXTRACT(EPOCH FROM now() - MAX(event_ts))  AS lag_seconds
    FROM online_store_features
    WHERE feature_view_name = 'user_purchase_stats'
    GROUP BY entity_key, feature_view_name
)
SELECT
    feature_view_name,
    COUNT(*)                                                      AS total_entities,
    COUNT(*) FILTER (WHERE lag_seconds <= 300)                    AS fresh_entities,
    ROUND(100.0 * COUNT(*) FILTER (WHERE lag_seconds <= 300) / COUNT(*), 2) AS freshness_pct,
    ROUND(AVG(lag_seconds), 1)                                   AS avg_lag_sec,
    ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_seconds), 1) AS p95_lag_sec,
    MAX(lag_seconds)                                              AS max_lag_sec
FROM freshness
GROUP BY feature_view_name;

Prometheus + Alertmanager 알림 설정

# prometheus-rules.yaml
groups:
  - name: feature_store_freshness
    interval: 30s
    rules:
      - alert: FeatureStoreFreshnessBreached
        expr: |
          feast_feature_view_freshness_seconds{feature_view="user_purchase_stats"} > 300
        for: 3m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: 'Feature View {{ $labels.feature_view }}의 freshness SLA 위반'
          description: |
            현재 lag: {{ $value }}초 (SLA: 300초).
            CDC 파이프라인 또는 Kafka consumer lag을 확인하세요.

      - alert: FeatureStoreOnlineStoreDown
        expr: |
          up{job="feast-online-store"} == 0
        for: 1m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: 'Online Store(Redis) 접속 불가'
          description: '서빙 중인 모델이 피처를 조회할 수 없습니다. 즉시 확인 필요.'

Feature View 버전 관리와 모델 Pinning

피처 스키마가 변경되면 기존 모델과의 호환성이 깨진다. Feature View 버전과 모델 버전을 함께 pinning 해야 한다.

KServe InferenceService에 피처 버전 명시

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model-v3
  labels:
    feature-view-version: 'v2026_03_04'
    model-version: 'v3.2.1'
  annotations:
    feast.dev/feature-views: 'user_purchase_stats:v2026_03_04,user_session_features:v2026_03_01'
spec:
  predictor:
    model:
      modelFormat:
        name: mlflow
      storageUri: gs://ml-models/recommendation/v3.2.1
    containers:
      - name: kserve-container
        env:
          - name: FEATURE_VIEW_VERSION
            value: 'v2026_03_04'
          - name: FEAST_REPO_PATH
            value: '/mnt/feast-repo'
          - name: ONLINE_STORE_TIMEOUT_MS
            value: '50'
        resources:
          requests:
            memory: '2Gi'
            cpu: '1'
          limits:
            memory: '4Gi'
            cpu: '2'

Feature View 스키마 호환성 테스트

"""
모델이 기대하는 피처 스키마와 Feature View가 제공하는 스키마의
호환성을 검증하는 CI 테스트. pytest로 실행한다.
"""
import pytest
from feast import FeatureStore
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModelFeatureContract:
    """모델이 요구하는 피처 계약"""
    model_name: str
    feature_view_version: str
    expected_features: Dict[str, str]  # feature_name -> expected_type

# 모델별 피처 계약 정의
CONTRACTS = [
    ModelFeatureContract(
        model_name="recommendation-model-v3",
        feature_view_version="v2026_03_04",
        expected_features={
            "total_purchases_7d": "INT64",
            "avg_order_value_30d": "DOUBLE",
            "session_count_24h": "INT64",
            "last_active_minutes_ago": "DOUBLE",
        },
    ),
]

@pytest.fixture
def feature_store():
    return FeatureStore(repo_path="./feature_repo")

@pytest.mark.parametrize("contract", CONTRACTS, ids=lambda c: c.model_name)
def test_feature_schema_compatibility(feature_store, contract):
    """Feature View 스키마가 모델 계약과 일치하는지 검증"""
    fv = feature_store.get_feature_view(f"user_purchase_stats_{contract.feature_view_version}")

    actual_schema = {f.name: str(f.dtype) for f in fv.features}

    for feat_name, expected_type in contract.expected_features.items():
        assert feat_name in actual_schema, (
            f"Feature '{feat_name}' not found in FeatureView. "
            f"Available: {list(actual_schema.keys())}"
        )
        assert actual_schema[feat_name] == expected_type, (
            f"Feature '{feat_name}' type mismatch: "
            f"expected {expected_type}, got {actual_schema[feat_name]}"
        )

장애 시나리오별 대응

시나리오 1: Online 추론값과 Offline 재현값 불일치

증상: A/B 테스트 리포트에서 offline 재현 수치와 online 수치가 3% 이상 차이남
원인: 서빙 코드에서 피처 변환(정규화, 클리핑)을 직접 구현한 부분이
      Feast Feature View의 변환 로직과 다름
에러 로그:
  AssertionError: offline_ctr=0.0823, online_ctr=0.1134, diff=0.0311

해결:
  1. 모든 피처 변환을 Feast FeatureView 또는 OnDemandFeatureView 안에 정의
  2. 서빙 코드에서 변환 로직 직접 구현 금지
  3. 재현성 테스트를 CI에 추가 (아래 코드)

def test_online_offline_parity(feature_store, sample_entities):
    """Online Store 조회값과 Offline Store 재현값이 일치하는지 검증"""
    online_features = feature_store.get_online_features(
        features=["user_purchase_stats:total_purchases_7d"],
        entity_rows=[{"user_id": eid} for eid in sample_entities],
    ).to_dict()

    # 동일 시점의 Offline 값 조회
    entity_df = pd.DataFrame({
        "user_id": sample_entities,
        "event_timestamp": [datetime.now()] * len(sample_entities),
    })
    offline_features = feature_store.get_historical_features(
        entity_df=entity_df,
        features=["user_purchase_stats:total_purchases_7d"],
    ).to_df()

    for i, eid in enumerate(sample_entities):
        online_val = online_features["total_purchases_7d"][i]
        offline_val = offline_features.loc[
            offline_features["user_id"] == eid, "total_purchases_7d"
        ].iloc[0]
        assert online_val == offline_val, (
            f"Parity violation for {eid}: online={online_val}, offline={offline_val}"
        )

시나리오 2: Backfill 후 모델 성능 급락

증상: 3일간 CDC 장애로 backfill 수행 후 모델 precision이 0.82 -> 0.61로 하락
원인: backfill 시 event_timestamp를 backfill 실행 시각으로 잘못 설정하여
      point-in-time join에서 미래 데이터 leakage 발생
에러: 직접적 에러 없음. 지표 하락으로만 발견됨.

해결:
  1. backfill 시 원본 event_timestamp 보존 필수
  2. backfill 완료 후 leakage 검증 스크립트 자동 실행
  3. 모델 품질 회귀 테스트를 backfill 런북에 포함

시나리오 3: Redis Online Store 타임아웃으로 서빙 5xx 급증

증상: 서빙 pod에서 간헐적 5xx 응답. 에러 로그:
  redis.exceptions.TimeoutError: Timeout reading from redis-cluster:6379
  feast.errors.FeatureRetrievalError: Failed to retrieve features in 50ms

원인: Redis 클러스터 노드 1대 교체 중 failover 지연

해결:
  1. Feast online_store 설정에 timeout + retry 추가
  2. 서빙 코드에 피처 조회 실패 시 fallback 로직 추가
  3. Circuit breaker 패턴 적용 (연속 3회 실패 시 fallback 전환)

from tenacity import retry, stop_after_attempt, wait_exponential
from feast import FeatureStore
from typing import Dict, List, Optional

class ResilientFeatureFetcher:
    """피처 조회 실패 시 fallback으로 전환하는 래퍼"""

    def __init__(self, store: FeatureStore, fallback_ttl_seconds: int = 300):
        self.store = store
        self.fallback_cache: Dict[str, dict] = {}
        self.consecutive_failures = 0
        self.circuit_open = False
        self.failure_threshold = 3

    @retry(stop=stop_after_attempt(2), wait=wait_exponential(multiplier=0.01, max=0.05))
    def _fetch_online(self, features: List[str], entity_rows: List[dict]) -> dict:
        return self.store.get_online_features(
            features=features, entity_rows=entity_rows
        ).to_dict()

    def get_features(
        self, features: List[str], entity_rows: List[dict]
    ) -> Optional[dict]:
        if self.circuit_open:
            return self._get_fallback(entity_rows)

        try:
            result = self._fetch_online(features, entity_rows)
            self.consecutive_failures = 0
            # 성공 시 캐시 업데이트
            for i, row in enumerate(entity_rows):
                key = str(row)
                self.fallback_cache[key] = {
                    f: result[f][i] for f in features
                }
            return result
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.circuit_open = True
            return self._get_fallback(entity_rows)

    def _get_fallback(self, entity_rows: List[dict]) -> Optional[dict]:
        """캐시에서 fallback 값 반환. 캐시 미스 시 기본값 사용."""
        # 실 서비스에서는 Redis가 아닌 로컬 캐시 또는 사전 계산 평균값 반환
        return None

동기화 상태 종합 대시보드

운영 안정성을 위해 아래 지표들을 Grafana 대시보드에 배치한다.

지표	SLA	알림 임계치	측정 방법
Online Store Freshness	p95 < 300s	> 300s for 3min	Feast metrics exporter
CDC Consumer Lag	< 1000 events	> 5000 events	Kafka consumer group lag
Point-in-time Leakage Rate	0%	> 0%	CI 테스트
Online-Offline Parity	100% 일치	< 99.5%	일 1회 샘플링 검증
Feature View Schema Drift	0 breaking changes	> 0	CI contract test
Redis p99 Latency	< 10ms	> 50ms	Redis slowlog

배포 체크리스트: Feature View 변경 시

새로운 Feature View를 배포하거나 기존 스키마를 변경할 때 아래 순서를 따른다.

퀴즈

Q1. Point-in-time join에서 leakage가 발생하는 근본 원인은?

||피처의 타임스탬프가 예측 시점의 타임스탬프보다 미래인 데이터가 학습에 포함되기 때문이다.||

Q2. CDC 기반 Online Store 동기화에서 Kafka consumer lag이 급증하면 어떤 문제가 생기는가?

||Online Store의 피처가 오래된 값을 반환하여 서빙 모델의 예측 품질이 저하된다. Freshness SLA 위반으로 이어진다.||

Q3. Feature View 스키마 변경 시 기존 모델과의 호환성을 보장하는 방법은?

||Feature View 버전과 모델 버전을 함께 pinning하고, CI에서 contract test를 실행하여 스키마 호환성을 자동 검증한다.||

Q4. Backfill 작업에서 event_timestamp를 backfill 실행 시각으로 설정하면 왜 문제가 되는가?

||Point-in-time join 시 원래 시점에서는 알 수 없었던 미래 데이터가 과거 학습 셋에 혼입되어 leakage가 발생하고 모델 성능이 왜곡된다.||

Q5. Redis Online Store의 연속 타임아웃 상황에서 Circuit Breaker를 적용하는 이유는?

||장애가 전파되어 서빙 전체가 다운되는 것을 방지하고, fallback 경로(캐시 또는 기본값)로 빠르게 전환하여 가용성을 유지하기 위해서다.||

Q6. Feast의 write_to_online_store와 materialize의 차이는?

||materialize는 Offline Store에서 Online Store로 배치 복사하는 명령이고, write_to_online_store는 DataFrame을 직접 Online Store에 기록하는 API다. CDC 패턴에서는 후자를 사용한다.||

Q7. Online-Offline Parity 테스트는 얼마나 자주 실행해야 하는가?

||일 1회 이상 샘플링 검증을 수행하되, Feature View 변경이나 배포 시에는 반드시 즉시 실행해야 한다. CI/CD 파이프라인에 통합하는 것이 이상적이다.||