- Introduction
- Feature Store Necessity
- Feast Architecture
- Installation and initial setup
- Feature definition and entity management
- Feature View and Feature Service
- Generate offline training data
- Online serving pipeline
- Materialization Strategy
- Training-Serving Skew Prevention
- Streaming feature integration
- Feature Store Comparison
- Monitoring
- Troubleshooting
- Precautions during operation
- Failure cases and recovery
- Checklist
- References

Introduction
Many production ML failures trace back to features, not the model itself. Representative examples include training-serving skew, where the features computed at training time differ from the features read at serving time; serving stale data after a feature pipeline failure; and breaking downstream models by changing a feature definition. A Feature Store is an infrastructure component that emerged to solve these problems at the architectural level.
Feast is an open source Feature Store project started by Gojek and Google in 2019, and it currently has the most active community in the space. This article covers the design decisions, architecture choices, incident response, and strategies for structurally preventing training-serving skew that come up when operating Feast in a production environment. Rather than a simple tutorial, the goal is a practical operations guide for teams managing dozens of Feature Views and millions of entities.
Feature Store Necessity
Practical problems of feature management
If you operate an ML system without a Feature Store, the following problems recur.
- Feature logic duplication: the feature transformation code a data scientist writes in a Jupyter Notebook and the code a backend engineer implements on the serving server differ subtly. The same `avg_purchase_amount_30d` feature fills NULLs with 0 in training but with the mean value in serving.
- No feature discoverability: Team B recomputes `user_click_rate_7d` without knowing Team A already calculates it. Feature assets within the organization go unmanaged.
- No time travel: it is impossible to answer "what was this user's feature value two weeks ago?", so retraining and debugging become impossible.
- Serving latency: computing features in real time from multiple data sources at inference time pushes p99 latency to hundreds of milliseconds.
What Feature Store solves
The Feature Store acts as a Single Source of Truth for features. Since both offline training data and online serving data are generated from a single feature definition, logic inconsistencies are eliminated at the root. Features can be discovered and reused through a central registry, and feature values at past points in time can be accurately reproduced through the Point-in-Time Join.
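The Point-in-Time Join semantics can be sketched with plain pandas: for each (entity, prediction timestamp) row, take the latest feature value whose `event_timestamp` is not later than the prediction timestamp — never a future value, which is exactly how feature leakage is avoided. The data below is illustrative.

```python
import pandas as pd

# Historical feature values for one user, sorted by event time
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "event_timestamp": pd.to_datetime(["2026-01-01", "2026-01-10", "2026-01-20"]),
    "total_purchases_7d": [1, 5, 9],
}).sort_values("event_timestamp")

# Entity dataframe: "we had to predict for this user at this time"
entity_df = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "event_timestamp": pd.to_datetime(["2026-01-15", "2026-01-25"]),
}).sort_values("event_timestamp")

# Backward as-of merge = point-in-time join: latest value at or before each timestamp
training_df = pd.merge_asof(
    entity_df, features,
    on="event_timestamp", by="user_id", direction="backward",
)
print(training_df["total_purchases_7d"].tolist())  # [5, 9]
```

The 2026-01-15 row picks up the 2026-01-10 value (5), not the future 2026-01-20 value — Feast's `get_historical_features()` applies the same rule at warehouse scale.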
Feast Architecture
Feast's architecture largely consists of four core components: Offline Store, Online Store, Registry, and Feature Server.
Offline Store
The offline store holds large volumes of historical feature data and performs the Point-in-Time Join when generating training data. Large-scale analytical data warehouses or data lakes such as BigQuery, Snowflake, Redshift, and S3/Parquet serve as backends. Read performance matters: it must be able to scan tens of TB of data efficiently.
Online Store
The online store is a low-latency key-value storage for real-time serving. Redis, DynamoDB, Bigtable, etc. are used as backends, and the latest feature value must be returned within 10ms based on the entity key. Through the materialization process, data from the offline store is copied to the online store.
Registry
The registry is the central catalog that stores metadata such as entities, feature views, and feature services. It can run file-based (local, S3, GCS) or SQL-based (PostgreSQL, MySQL). For production, a SQL-based registry is recommended, because it prevents conflicts when multiple teams run `feast apply` at the same time.
Feature Server
The Feast built-in feature server serves online features through REST/gRPC endpoints. It is a Go-based high-performance server that can be deployed as a sidecar or independent service in model serving infrastructure (KServe, Seldon, etc.).
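As a sketch of what a serving application sends to the feature server, the request body for the `POST /get-online-features` endpoint can be built as below. The endpoint path and payload shape follow the Feast python feature server (`feast serve`, which listens on port 6566 by default) — verify both against the Feast version you deploy before relying on them.

```python
import json

def build_online_features_request(feature_service: str, user_ids: list) -> dict:
    """Build the JSON body for the feature server's /get-online-features endpoint."""
    return {
        "feature_service": feature_service,
        "entities": {"user_id": user_ids},
    }

payload = build_online_features_request("recommendation_v3", ["u_1001", "u_1002"])
body = json.dumps(payload)
# e.g. requests.post("http://localhost:6566/get-online-features", data=body)
print(sorted(payload.keys()))  # ['entities', 'feature_service']
```

Requesting by `feature_service` rather than by individual feature names keeps the client pinned to a versioned feature set, matching the Feature Service pattern used later in this article.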
Installation and initial setup
Install Feast and initialize project
# Install Feast (with the Redis and PostgreSQL backends)
pip install "feast[redis,postgres]"

# Initialize a new project
feast init recommendation_features
cd recommendation_features

# Project layout
# recommendation_features/
#   feature_repo/
#     __init__.py
#     feature_store.yaml    # Feast project configuration
#     entities.py           # entity definitions
#     feature_views.py      # Feature View definitions
#     feature_services.py   # Feature Service definitions
#     data/                 # data for local testing
Production feature_store.yaml settings
# feature_store.yaml - production configuration
project: recommendation_platform
provider: local
entity_key_serialization_version: 3

registry:
  registry_type: sql
  path: postgresql://feast_user:${FEAST_REGISTRY_PASSWORD}@pg-registry.internal:5432/feast_registry
  cache_ttl_seconds: 60
  sqlalchemy_config_kwargs:
    pool_size: 10
    max_overflow: 20
    pool_pre_ping: true

offline_store:
  type: bigquery
  project_id: ml-platform-prod
  dataset: feast_offline_store
  location: asia-northeast3

online_store:
  type: redis
  connection_string: redis://:${REDIS_PASSWORD}@redis-feast.internal:6379
  key_ttl_seconds: 172800  # 48-hour TTL
  redis_type: redis_cluster

flags:
  alpha_features: true
  on_demand_transforms: true
The thing to watch in this configuration is `cache_ttl_seconds`. If the registry cache TTL is set too long, changes made by `feast apply` may not be reflected immediately, which can cause serving incidents. Conversely, if it is set too short, load concentrates on the registry DB. For production, 30 to 120 seconds is appropriate.
Feature definition and entity management
Entity Definition
An entity represents the subject of a feature. Key entities in the business domain, such as users, items, and orders, are defined as entities.
# entities.py
from feast import Entity, ValueType

# User entity
user = Entity(
    name="user_id",
    description="Unique identifier of a service user, in UUID v4 format.",
    value_type=ValueType.STRING,
    tags={
        "owner": "ml-platform-team",
        "pii": "true",
        "domain": "user",
    },
)

# Item (product) entity
item = Entity(
    name="item_id",
    description="Unique identifier in the product catalog. Integer ID.",
    value_type=ValueType.INT64,
    tags={
        "owner": "recommendation-team",
        "domain": "catalog",
    },
)

# User-item interaction entity (composite key)
user_item = Entity(
    name="user_item_id",
    description="Composite user-item interaction key, in 'user_id:item_id' format.",
    value_type=ValueType.STRING,
    tags={
        "owner": "recommendation-team",
        "domain": "interaction",
    },
)
A key principle when designing entities is that one entity should represent one business entity. When a composite key is needed, define a separate entity, but be careful not to increase the key cardinality excessively considering join performance.
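Since the `user_item` entity above encodes its composite key as a `'user_id:item_id'` string, a small pair of helpers keeps the encoding in one place so training pipelines and serving code cannot drift apart. The helper names are illustrative, not part of the Feast API.

```python
def make_user_item_key(user_id: str, item_id: int) -> str:
    """Encode a composite user-item entity key in the 'user_id:item_id' format."""
    return f"{user_id}:{item_id}"

def split_user_item_key(key: str) -> tuple:
    """Decode the composite key back into (user_id, item_id)."""
    user_id, item_id = key.rsplit(":", 1)
    return user_id, int(item_id)

key = make_user_item_key("u_1001", 42)
print(key)  # u_1001:42
assert split_user_item_key(key) == ("u_1001", 42)
```

Using `rsplit` with a max split of 1 keeps the decoder correct even if the user ID itself ever contains a colon.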
Feature View and Feature Service
Feature View Definition
A Feature View is a logical grouping of related features derived from a single data source. It is a core abstraction that provides features in a consistent way across both offline and online stores.
# feature_views.py
from datetime import timedelta

from feast import (
    FeatureView,
    Field,
    BigQuerySource,
    PushSource,
)
from feast.types import Float32, Float64, Int64

from entities import user

# User purchase statistics Feature View
user_purchase_source = BigQuerySource(
    name="user_purchase_stats_source",
    table="ml-platform-prod.feast_offline_store.user_purchase_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_at",
    description="Per-user purchase statistics table, refreshed hourly by dbt.",
)

user_purchase_stats = FeatureView(
    name="user_purchase_stats",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="total_purchases_7d", dtype=Int64, description="Purchases in the last 7 days"),
        Field(name="total_purchases_30d", dtype=Int64, description="Purchases in the last 30 days"),
        Field(name="avg_order_value_7d", dtype=Float64, description="Average order value, last 7 days"),
        Field(name="avg_order_value_30d", dtype=Float64, description="Average order value, last 30 days"),
        Field(name="max_order_value_30d", dtype=Float64, description="Maximum order value, last 30 days"),
        Field(name="purchase_frequency_score", dtype=Float32, description="Normalized purchase frequency score (0-1)"),
        Field(name="last_purchase_days_ago", dtype=Int64, description="Days since last purchase"),
    ],
    source=user_purchase_source,
    online=True,
    tags={
        "team": "recommendation",
        "sla_freshness_minutes": "60",
        "data_quality_owner": "data-eng@company.com",
    },
)

# User session features (includes a real-time Push source)
user_session_push_source = PushSource(
    name="user_session_push",
    batch_source=BigQuerySource(
        name="user_session_batch_source",
        table="ml-platform-prod.feast_offline_store.user_session_features",
        timestamp_field="event_timestamp",
    ),
)

user_session_features = FeatureView(
    name="user_session_features",
    entities=[user],
    ttl=timedelta(hours=24),
    schema=[
        Field(name="session_count_24h", dtype=Int64, description="Sessions in the last 24 hours"),
        Field(name="avg_session_duration_min", dtype=Float32, description="Average session duration (minutes)"),
        Field(name="pages_viewed_1h", dtype=Int64, description="Pages viewed in the last hour"),
        Field(name="last_active_minutes_ago", dtype=Int64, description="Minutes since last activity"),
        Field(name="is_currently_active", dtype=Int64, description="Whether a session is currently active (0/1)"),
    ],
    source=user_session_push_source,
    online=True,
    tags={
        "team": "recommendation",
        "sla_freshness_minutes": "5",
        "realtime": "true",
    },
)
On-Demand Feature View
Use an On-Demand Feature View when the same transformation logic must be applied in both training and serving. It is one of the key mechanisms for preventing training-serving skew.
# on_demand_features.py
import pandas as pd

from feast import on_demand_feature_view, Field
from feast.types import Float32, Int64

from feature_views import user_purchase_stats, user_session_features

@on_demand_feature_view(
    sources=[user_purchase_stats, user_session_features],
    schema=[
        Field(name="engagement_score", dtype=Float32),
        Field(name="purchase_recency_bucket", dtype=Int64),
        Field(name="activity_intensity", dtype=Float32),
    ],
    description="Derived features computed by combining purchase and session features",
)
def user_engagement_features(inputs: pd.DataFrame) -> pd.DataFrame:
    """
    Transformation logic executed identically in training and serving.
    Logic defined in this function is applied the same way by both
    get_historical_features() and get_online_features().
    """
    df = pd.DataFrame()
    # Engagement score: combine purchase frequency with session activity
    purchase_norm = inputs["purchase_frequency_score"].fillna(0.0)
    session_norm = (
        inputs["session_count_24h"].fillna(0).clip(upper=50) / 50.0
    )
    df["engagement_score"] = (0.6 * purchase_norm + 0.4 * session_norm).astype("float32")
    # Purchase recency bucket: 0 (today), 1 (1-3 days), 2 (4-7), 3 (8-14), 4 (15+)
    days = inputs["last_purchase_days_ago"].fillna(999)
    df["purchase_recency_bucket"] = pd.cut(
        days,
        bins=[-1, 0, 3, 7, 14, float("inf")],
        labels=[0, 1, 2, 3, 4],
    ).astype("int64")
    # Activity intensity: based on page views in the last hour
    df["activity_intensity"] = (
        inputs["pages_viewed_1h"].fillna(0).clip(upper=100) / 100.0
    ).astype("float32")
    return df
Feature Service Definition
A Feature Service bundles and versions the feature set used by a specific model. Mapping models to Feature Services 1:1 makes it clear which model depends on which features.
# feature_services.py
from feast import FeatureService

from feature_views import user_purchase_stats, user_session_features
from on_demand_features import user_engagement_features

# Feature Service for recommendation model v3
recommendation_v3_service = FeatureService(
    name="recommendation_v3",
    features=[
        user_purchase_stats[["total_purchases_7d", "avg_order_value_30d", "purchase_frequency_score"]],
        user_session_features[["session_count_24h", "last_active_minutes_ago"]],
        user_engagement_features,  # include the entire On-Demand Feature View
    ],
    description="Feature set used by recommendation model v3. Create a new FeatureService when the model version changes.",
    tags={
        "model_version": "v3.2.1",
        "model_owner": "rec-team",
        "last_validated": "2026-03-07",
    },
)

# Feature Service for the churn prediction model
churn_prediction_service = FeatureService(
    name="churn_prediction_v1",
    features=[
        user_purchase_stats,  # use all features
        user_session_features[["session_count_24h", "avg_session_duration_min"]],
    ],
    description="Feature set for churn prediction model v1",
    tags={
        "model_version": "v1.0.0",
        "model_owner": "growth-team",
    },
)
Generate offline training data
When generating training data, you must use the Point-in-Time Join. The principle is to use only feature values that were actually known at prediction time. Violating it causes feature leakage: the model performs well during training but degrades sharply in actual serving.
# training_data_generator.py
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="./feature_repo")

# 1. Prepare the entity dataframe: prediction targets + prediction timestamps.
#    Each row means "we had to make a prediction for this user at this time".
entity_df = pd.DataFrame({
    "user_id": ["u_1001", "u_1002", "u_1003", "u_1004", "u_1005"],
    "event_timestamp": [
        datetime(2026, 2, 15, 10, 0, 0),
        datetime(2026, 2, 20, 14, 30, 0),
        datetime(2026, 2, 25, 9, 0, 0),
        datetime(2026, 3, 1, 11, 15, 0),
        datetime(2026, 3, 5, 16, 45, 0),
    ],
})

# 2. Retrieve training data through the Feature Service.
#    The Point-in-Time Join is applied automatically.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("recommendation_v3"),
).to_df()

print(f"Training data shape: {training_df.shape}")
print(f"Columns: {list(training_df.columns)}")

# 3. For large training sets, read the entity dataframe directly from BigQuery.
#    Millions of entity rows are processed efficiently this way.
large_entity_df = pd.read_gbq(
    """
    SELECT
        user_id,
        prediction_timestamp AS event_timestamp
    FROM `ml-platform-prod.labels.recommendation_labels`
    WHERE prediction_timestamp BETWEEN '2026-01-01' AND '2026-03-01'
    """,
    project_id="ml-platform-prod",
)
training_df_large = store.get_historical_features(
    entity_df=large_entity_df,
    features=store.get_feature_service("recommendation_v3"),
).to_df()

# 4. Validate training data quality
assert training_df_large.isnull().sum().sum() / training_df_large.size < 0.05, \
    "NULL ratio exceeds 5%. Check the data sources."

training_df_large.to_parquet(
    "gs://ml-datasets/recommendation/training_2026_q1.parquet",
    index=False,
)
Online serving pipeline
Online feature retrieval client
# online_serving.py
"""
Client code that looks up online features from Feast at model serving time.
Used from a KServe Transformer or the serving application.
"""
import logging
import time
from typing import Any, Dict, List

from feast import FeatureStore

logger = logging.getLogger(__name__)


class FeatureClient:
    """
    Production Feast online feature client.
    Includes timeout handling, fallbacks, and metric collection.
    """

    def __init__(
        self,
        repo_path: str = "./feature_repo",
        timeout_ms: int = 50,
        fallback_enabled: bool = True,
    ):
        self.store = FeatureStore(repo_path=repo_path)
        self.timeout_ms = timeout_ms
        self.fallback_enabled = fallback_enabled
        self._default_values = self._load_default_values()

    def _load_default_values(self) -> Dict[str, Any]:
        """Preload per-feature default values."""
        return {
            "total_purchases_7d": 0,
            "avg_order_value_30d": 0.0,
            "purchase_frequency_score": 0.0,
            "session_count_24h": 0,
            "last_active_minutes_ago": 1440,
            "engagement_score": 0.0,
            "purchase_recency_bucket": 4,
            "activity_intensity": 0.0,
        }

    def get_features_for_prediction(
        self, user_ids: List[str]
    ) -> Dict[str, List[Any]]:
        """
        Feature lookup for recommendation model inference.
        Returns fallback defaults when the lookup fails.
        """
        entity_rows = [{"user_id": uid} for uid in user_ids]
        start_time = time.monotonic()
        try:
            result = self.store.get_online_features(
                features=self.store.get_feature_service("recommendation_v3"),
                entity_rows=entity_rows,
            ).to_dict()
            elapsed_ms = (time.monotonic() - start_time) * 1000
            logger.info(
                f"Feature retrieval: {len(user_ids)} entities, {elapsed_ms:.1f}ms"
            )
            if elapsed_ms > self.timeout_ms:
                logger.warning(
                    f"Feature retrieval exceeded SLA: {elapsed_ms:.1f}ms > {self.timeout_ms}ms"
                )
            return result
        except Exception as e:
            elapsed_ms = (time.monotonic() - start_time) * 1000
            logger.error(
                f"Feature retrieval failed after {elapsed_ms:.1f}ms: {e}"
            )
            if self.fallback_enabled:
                return self._build_fallback_response(user_ids)
            raise

    def _build_fallback_response(
        self, user_ids: List[str]
    ) -> Dict[str, List[Any]]:
        """Build a fallback response from default values when the lookup fails."""
        response = {"user_id": user_ids}
        for feat_name, default_val in self._default_values.items():
            response[feat_name] = [default_val] * len(user_ids)
        logger.warning(
            f"Returning fallback features for {len(user_ids)} entities"
        )
        return response


# Usage example
client = FeatureClient(repo_path="./feature_repo")
features = client.get_features_for_prediction(["u_1001", "u_1002"])
Materialization Strategy
Materialization is the process of copying data from an offline store to an online store. Strategies vary depending on the requirements of batch serving and real-time serving.
Batch Materialization
# Materialize all Feature Views over an explicit time range
feast materialize 2026-03-06T00:00:00 2026-03-07T00:00:00

# Incremental materialization (from the last run up to now)
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
Programmatic Materialization (Airflow DAG)
# airflow_materialization_dag.py
"""
Airflow DAG: materialize Feast Feature Views on an hourly schedule.
Each Feature View runs as an independent task so views materialize in
parallel and can be retried individually on failure.
"""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from feast import FeatureStore

FEATURE_VIEWS_TO_MATERIALIZE = [
    "user_purchase_stats",
    "user_session_features",
    "item_popularity_stats",
]


def materialize_feature_view(feature_view_name: str, **context):
    """Materialize a single Feature View."""
    store = FeatureStore(repo_path="/opt/feast/feature_repo")
    end_date = context["data_interval_end"]
    start_date = end_date - timedelta(hours=2)  # 2-hour lookback as a safety margin
    store.materialize(
        start_date=start_date,
        end_date=end_date,
        feature_views=[feature_view_name],
    )
    # Record a success metric
    row_count = _verify_materialization(store, feature_view_name, end_date)
    context["ti"].xcom_push(
        key=f"{feature_view_name}_materialized_rows",
        value=row_count,
    )


def _verify_materialization(store, fv_name, end_date):
    """Smoke-test the materialization result with an online lookup."""
    # Query the view's first feature for a health-check entity
    fv = store.get_feature_view(fv_name)
    sample_result = store.get_online_features(
        features=[f"{fv_name}:{fv.features[0].name}"],
        entity_rows=[{"user_id": "u_health_check"}],
    ).to_dict()
    return len(sample_result.get("user_id", []))


with DAG(
    dag_id="feast_materialization",
    schedule_interval="0 * * * *",  # hourly
    start_date=datetime(2026, 3, 1),
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=30),
    },
    tags=["feast", "feature-store", "materialization"],
) as dag:
    for fv_name in FEATURE_VIEWS_TO_MATERIALIZE:
        PythonOperator(
            task_id=f"materialize_{fv_name}",
            python_callable=materialize_feature_view,
            op_args=[fv_name],
        )
Materialization Cycle Decision Guide
| Feature type | Change frequency | Recommended materialization cycle | Example |
|---|---|---|---|
| Static profile | Almost never | Once a day | Signup date, gender, region |
| Daily aggregate | Once a day | 1-2 times a day | 30-day average purchase amount, weekly visits |
| Hourly aggregate | Hourly | Hourly | Sessions in the last 24 hours |
| Real-time features | Every few seconds | Push-based / streaming | Current session activity, recent clicks |
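The table above, combined with the TTL rule discussed later (TTL at least twice the materialization cycle), can be encoded as a small helper used when generating FeatureView definitions. The helper is illustrative, not part of the Feast API.

```python
from datetime import timedelta

def recommended_ttl(materialization_cycle: timedelta,
                    safety_factor: int = 2) -> timedelta:
    """TTL must outlive at least two materialization runs, so a single
    missed run does not immediately surface NULLs in online lookups."""
    return materialization_cycle * safety_factor

# Hourly aggregate -> at least a 2-hour TTL
print(recommended_ttl(timedelta(hours=1)))  # 2:00:00
# Daily aggregate -> at least a 2-day TTL
print(recommended_ttl(timedelta(days=1)))  # 2 days, 0:00:00
```

A larger `safety_factor` buys more tolerance for pipeline outages at the cost of serving staler values when upstream data genuinely stops updating.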
Training-Serving Skew Prevention
Training-serving skew is the most subtle and hardest-to-find class of bug in ML systems: performance degrades because the feature distribution the model saw during training differs from the distribution it receives during serving. Using Feast does not solve this automatically; it requires conscious design and verification.
Main causes of skew
- Dual implementation of transformation logic: the training-side SQL and the serving-side Python are maintained separately and drift subtly apart.
- Materialization delay: stale features remain in the online store.
- NULL handling mismatch: NULLs are dropped offline but filled with 0 online.
- Time zone mismatch: offline aggregates in UTC while online aggregates in KST, so the "last 7 days" window differs.
- Feature leakage: future data leaks into training.
Structural Prevention Strategy
# skew_prevention_test.py
"""
CI test that automatically detects training-serving skew.
Run regularly to verify consistency between online and offline values.
"""
from datetime import datetime

import numpy as np
import pandas as pd
import pytest
from feast import FeatureStore


@pytest.fixture(scope="module")
def store():
    return FeatureStore(repo_path="./feature_repo")


@pytest.fixture(scope="module")
def sample_entity_ids():
    """Sample entities for validation, randomly drawn from production."""
    return ["u_1001", "u_1002", "u_1003", "u_1005", "u_1010"]


FEATURE_LIST = [
    "user_purchase_stats:total_purchases_7d",
    "user_purchase_stats:avg_order_value_30d",
    "user_purchase_stats:purchase_frequency_score",
    "user_session_features:session_count_24h",
]


def test_online_offline_consistency(store, sample_entity_ids):
    """
    Verify that online store values match the latest offline store values.
    If materialization works correctly, the two must be identical.
    """
    # Online lookup
    online_result = store.get_online_features(
        features=FEATURE_LIST,
        entity_rows=[{"user_id": eid} for eid in sample_entity_ids],
    ).to_dict()
    # Offline lookup (as of now)
    entity_df = pd.DataFrame({
        "user_id": sample_entity_ids,
        "event_timestamp": [datetime.utcnow()] * len(sample_entity_ids),
    })
    offline_result = store.get_historical_features(
        entity_df=entity_df,
        features=FEATURE_LIST,
    ).to_df()
    # Compare online vs. offline values per feature
    for feat in FEATURE_LIST:
        feat_short = feat.split(":")[1]
        for i, eid in enumerate(sample_entity_ids):
            online_val = online_result[feat_short][i]
            offline_val = offline_result.loc[
                offline_result["user_id"] == eid, feat_short
            ].iloc[0]
            if online_val is None and pd.isna(offline_val):
                continue  # both NULL counts as a match
            if isinstance(online_val, float):
                assert np.isclose(online_val, offline_val, rtol=1e-5), (
                    f"Skew detected for {eid}.{feat_short}: "
                    f"online={online_val}, offline={offline_val}"
                )
            else:
                assert online_val == offline_val, (
                    f"Skew detected for {eid}.{feat_short}: "
                    f"online={online_val}, offline={offline_val}"
                )


def test_null_handling_consistency(store, sample_entity_ids):
    """
    Verify that NULL handling is identical online and offline.
    Looking up a non-existent entity must behave the same on both sides.
    """
    ghost_entities = ["u_nonexistent_001", "u_nonexistent_002"]
    online_result = store.get_online_features(
        features=FEATURE_LIST,
        entity_rows=[{"user_id": eid} for eid in ghost_entities],
    ).to_dict()
    for feat in FEATURE_LIST:
        feat_short = feat.split(":")[1]
        for val in online_result[feat_short]:
            assert val is None, (
                f"Non-existent entity should return None, got {val} for {feat_short}"
            )
Skew Detection Dashboard Metrics
| Indicator | Normal range | Alert threshold | Description |
|---|---|---|---|
| Online-offline parity rate | 100% | below 99.5% | Online/offline value match rate |
| Feature distribution drift (KL divergence) | below 0.01 | above 0.05 | Serving-time feature distribution shift vs. training time |
| NULL rate difference | 0% | above 1% | Gap between online and offline NULL rates |
| Materialization lag p95 | below 60 min | above 120 min | Time to complete materialization |
| Stale feature rate | 0% | above 5% | Share of features past their TTL |
Streaming feature integration
Batch materialization alone cannot serve real-time features (current session state, previous click, etc.). By combining Feast's Push API and Kafka streaming, real-time features are reflected in the online store.
# streaming_feature_ingest.py
"""
Streaming worker that consumes user session events from Kafka and
writes them to the Feast Online Store in near real time.
Implemented here with a plain Kafka Consumer; Faust or another Python
stream processor would work the same way.
"""
import json
from datetime import datetime

import pandas as pd
from confluent_kafka import Consumer, KafkaError
from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path="./feature_repo")

consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",
    "group.id": "feast-session-materializer",
    "auto.offset.reset": "latest",
    "enable.auto.commit": False,
    "max.poll.interval.ms": 300000,
    "session.timeout.ms": 45000,
})
consumer.subscribe(["user-session-events"])

MICRO_BATCH_SIZE = 200
FLUSH_INTERVAL_SEC = 5

buffer = []
last_flush = datetime.utcnow()

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is not None and not msg.error():
            event = json.loads(msg.value().decode("utf-8"))
            buffer.append({
                "user_id": event["user_id"],
                "session_count_24h": event["session_count_24h"],
                "avg_session_duration_min": event["avg_session_duration_min"],
                "pages_viewed_1h": event["pages_viewed_1h"],
                "last_active_minutes_ago": 0,  # just active, so 0
                "is_currently_active": 1,
                "event_timestamp": datetime.utcnow(),
            })
        elif msg is not None and msg.error():
            if msg.error().code() != KafkaError._PARTITION_EOF:
                raise Exception(f"Kafka error: {msg.error()}")
        elapsed = (datetime.utcnow() - last_flush).total_seconds()
        should_flush = (
            len(buffer) >= MICRO_BATCH_SIZE
            or (buffer and elapsed >= FLUSH_INTERVAL_SEC)
        )
        if should_flush:
            df = pd.DataFrame(buffer)
            # Write to the online and offline stores at once via the PushSource
            store.push(
                push_source_name="user_session_push",
                df=df,
                to=PushMode.ONLINE_AND_OFFLINE,
            )
            consumer.commit()
            buffer.clear()
            last_flush = datetime.utcnow()
finally:
    consumer.close()
Feature Store Comparison
| Item | Feast | Tecton | Hopsworks | Databricks Feature Store |
|---|---|---|---|---|
| License | Apache 2.0 (open source) | Commercial (SaaS/self-hosted) | AGPL + commercial | Tied to the Databricks platform |
| Deployment model | Self-managed | Managed SaaS | Self-managed / managed | Built into the Databricks workspace |
| Offline store | BigQuery, Snowflake, Redshift, S3 Parquet, file | Spark, Snowflake, Rift | HopsFS, S3 | Delta Lake (Unity Catalog) |
| Online store | Redis, DynamoDB, Bigtable, PostgreSQL, SQLite | DynamoDB (built-in optimization) | RonDB (MySQL NDB Cluster) | Cosmos DB, DynamoDB |
| Streaming support | Push API, Kafka (manual implementation) | Native Spark Streaming | Native Kafka/Spark | Spark Structured Streaming |
| Transformation engine | On-Demand (Python), dbt integration | Tecton SDK (Spark/Rift) | Hopsworks feature pipelines | Spark UDF |
| Feature registry | SQL/file based | Built-in (rich UI) | Built-in (Hopsworks UI) | Unity Catalog integration |
| Skew prevention | On-Demand FV, Point-in-Time Join | Time travel, guaranteed materialization | Provenance tracking | Time travel (Delta Lake) |
| Community activity | Very high (GitHub 5.5k+ stars) | Medium (commercial focus) | High (academic/open source) | High (Databricks ecosystem) |
| Best fit | Cloud-independent teams that need customization | Enterprises preferring fully managed | On-prem, GPU-training-focused teams | Teams already on Databricks |
Monitoring
Key monitoring indicators
There are certain metrics that you must track when running a production Feature Store.
- Freshness: how fresh the features in the online store are, measured as `current time - feature event_timestamp`.
- Serving latency: p50/p95/p99 latency of online feature lookups. A common SLA is p99 < 50ms.
- Materialization success rate: success/failure rate of materialization runs.
- Feature staleness rate: the percentage of features past their TTL.
- Online store hit rate: the share of lookup requests that return an actual value (non-NULL response rate).
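The freshness definition above reduces to simple timestamp arithmetic. A minimal sketch, assuming you can obtain the latest `event_timestamp` for a feature (e.g. from the online store response's metadata or a side channel — how you fetch it is deployment-specific):

```python
from datetime import datetime, timezone
from typing import Optional

def freshness_seconds(latest_event_ts: datetime,
                      now: Optional[datetime] = None) -> float:
    """Freshness = current time - the feature's event_timestamp, in seconds."""
    now = now or datetime.now(timezone.utc)
    return (now - latest_event_ts).total_seconds()

ts = datetime(2026, 3, 7, 10, 0, tzinfo=timezone.utc)
now = datetime(2026, 3, 7, 11, 30, tzinfo=timezone.utc)
print(freshness_seconds(ts, now))  # 5400.0
```

Alerting compares this value against the `sla_freshness_minutes` tag set on each Feature View earlier in this article.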
# feast_metrics_exporter.py
"""
Exporter that publishes Feast health as Prometheus metrics.
Running it as a Kubernetes CronJob at a 5-minute interval is recommended.
"""
import time

from feast import FeatureStore
from prometheus_client import Gauge, Histogram, start_http_server

# Prometheus metric definitions
FRESHNESS_GAUGE = Gauge(
    "feast_feature_view_freshness_seconds",
    "Online store freshness of a Feature View (seconds)",
    ["feature_view", "project"],
)
MATERIALIZATION_LAG = Gauge(
    "feast_materialization_lag_seconds",
    "Seconds elapsed since the last materialization",
    ["feature_view"],
)
ONLINE_STORE_LATENCY = Histogram(
    "feast_online_store_latency_seconds",
    "Online store lookup latency",
    ["feature_view"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5],
)


def collect_metrics():
    store = FeatureStore(repo_path="./feature_repo")
    feature_views = store.list_feature_views()
    for fv in feature_views:
        # Measure lookup latency against a health-check entity
        # (the freshness and lag gauges are populated by the
        # materialization job)
        try:
            start = time.monotonic()
            store.get_online_features(
                features=[f"{fv.name}:{fv.features[0].name}"],
                entity_rows=[{"user_id": "u_health_check"}],
            ).to_dict()
            elapsed = time.monotonic() - start
            ONLINE_STORE_LATENCY.labels(feature_view=fv.name).observe(elapsed)
        except Exception:
            pass


if __name__ == "__main__":
    start_http_server(8000)
    while True:
        collect_metrics()
        time.sleep(300)  # every 5 minutes
Troubleshooting
Issue 1: Feature not found in online serving after feast apply
Symptom: KeyError from get_online_features() right after a successful feast apply
Error: KeyError: "Feature view 'user_purchase_stats_v2' not found in registry"
Cause: stale metadata is still being referenced within the registry cache TTL
Fix:
1. Refresh the registry cache immediately:
   store = FeatureStore(repo_path="./feature_repo")
   store.refresh_registry()
2. Include a registry refresh step in the deployment process
3. Set cache_ttl_seconds to 60 seconds or less
Issue 2: Out of memory during materialization
Symptom: OOMKilled while running feast materialize
Error: Container killed due to OOM (exit code 137)
Cause: materializing a large Feature View in one shot tries to load the
entire dataset into memory
Fix:
1. Split the run into narrower time ranges:
   feast materialize 2026-03-06T00:00:00 2026-03-06T06:00:00
   feast materialize 2026-03-06T06:00:00 2026-03-06T12:00:00
2. Materialize each Feature View individually
3. Raise the Kubernetes Job memory limit (at least 4Gi recommended)
4. Reduce the batch size via an environment variable:
   FEAST_BATCH_MATERIALIZATION_MAX_WORKERS=2
Issue 3: Redis online store connection timeout spikes
Symptom: serving latency p99 spikes above 200ms
Error: redis.exceptions.TimeoutError: Timeout reading from socket
Cause: Redis cluster rebalancing or a single overloaded node
Fix:
1. Check memory/CPU on each Redis cluster node
2. Analyze hot keys with MONITOR or redis-cli --hotkeys
3. Add timeout parameters to the connection_string:
   redis://host:6379?socket_timeout=0.1&socket_connect_timeout=0.1
4. Apply a circuit breaker + fallback pattern in the serving code
Precautions during operation
TTL design
A Feature View's `ttl` defines how old a feature value may be and still count as valid. If the TTL is set too short relative to the materialization cycle, online lookups return NULL. Set the TTL to at least twice the materialization cycle: if you materialize hourly, the TTL should be at least 2 hours.
Entity Key Design
The cardinality of your entity keys determines the size of your online store. With 100 million users and 10 feature views, 1 billion key-value pairs land in the online store. At roughly 500 bytes per Redis key, that is about 500GB of memory. You need a strategy such as expiring inactive users via TTL or materializing only users active within the last N days.
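The sizing arithmetic above can be kept as a small capacity-planning helper (illustrative code; the 500-bytes-per-key figure is the rough Redis estimate from the text and should be measured for your own payloads):

```python
def online_store_bytes(num_entities: int, num_feature_views: int,
                       bytes_per_key: int = 500) -> int:
    """Estimated online store footprint: one key per (entity, feature view)."""
    return num_entities * num_feature_views * bytes_per_key

# 100M users x 10 feature views x ~500 bytes/key
total = online_store_bytes(100_000_000, 10)
print(total / 1e9)  # 500.0 (GB)
```

Re-running the estimate with only the active-user count (say, users active in the last 30 days) shows how much memory a TTL-based expiry strategy saves.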
feast apply and deployment order
When changing a Feature View schema, backward compatibility must be maintained. Adding a new feature is safe, but deleting an existing feature or changing its type immediately breaks dependent models. The rollout order is as follows.
1. Ship a Feature View definition that adds the new features
2. Update the registry with `feast apply`
3. Run materialization to load the new features into the online store
4. Deploy the model that uses the new features
5. If old features are deprecated, remove them in a subsequent release
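The backward-compatibility rule behind the steps above — no removed fields, no re-typed fields — can be checked mechanically before `feast apply`. Schemas are represented here as plain name-to-dtype dicts; in CI you would build them from the registry and the repo (illustrative code, not the Feast API):

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """Return schema changes that would break dependent models."""
    problems = []
    for name, dtype in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name] != dtype:
            problems.append(f"type change: {name} {dtype} -> {new_schema[name]}")
    return problems  # purely additive changes produce an empty list

old = {"avg_order_value_30d": "Float64", "total_purchases_7d": "Int64"}
new = {"avg_order_value_30d": "Int64",   # re-typed: breaking
       "total_purchases_7d": "Int64",
       "total_purchases_90d": "Int64"}   # added: safe
print(breaking_changes(old, new))
# ['type change: avg_order_value_30d Float64 -> Int64']
```

Failing CI whenever this list is non-empty would have caught the Float64-to-Int64 incident described in the failure cases below.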
Failure cases and recovery
Case 1: Serving stale features for 48 hours due to materialization failure
Situation: an Airflow DAG outage halted materialization for 48 hours,
during which the online store served features that were two days old.
Impact: the recommendation model's CTR dropped 15%, because
"sessions in the last 24 hours" was actually the value from 48 hours earlier.
Recovery:
1. Diagnose and fix the Airflow DAG (an expired BigQuery auth token)
2. Run a manual backfill materialization:
   feast materialize 2026-03-05T00:00:00 2026-03-07T12:00:00
3. Confirm online store freshness, then clear the alert
4. Follow-up: page via PagerDuty within 30 minutes of any materialization failure
Lesson: without dedicated monitoring of materialization status, the failure
is only discovered through degraded model performance, which delays the response.
Case 2: Full serving down due to Feature View schema change
Situation: the "avg_order_value_30d" feature's type was changed from Float64
to Int64, and immediately after feast apply the serving recommendation model
began throwing type errors.
Impact: 5 minutes of 500 errors on the recommendation API; emergency rollback.
Recovery:
1. git revert the Feature View definition to the previous version
2. Re-run feast apply
3. Confirm serving has recovered
Lesson: Feature View schema changes must be tested in staging first and
applied to production only after the dependent models' compatibility tests pass.
Case 3: Duplicate event handling in Push-based streaming features
Situation: the Kafka consumer's at-least-once guarantee caused the same event
to be processed twice, recording session_count_24h higher than reality.
Impact: some users' engagement_score became abnormally high,
distorting recommendation results.
Recovery:
1. Add duplicate-event detection (dedup keyed on event ID)
2. Identify the affected entities and re-materialize them
3. Make the Kafka consumer processing idempotent
Lesson: streaming feature pipelines must guarantee idempotency. Track event
IDs in a Redis Set to drop duplicates.
Checklist
Initial Deployment Checklist
- [ ] Production `feature_store.yaml` configured (SQL registry, Redis online store)
- [ ] Entities defined and naming conventions established
- [ ] Feature Views defined with TTLs set
- [ ] Feature Services mapped to model versions
- [ ] `feast apply` integrated into the CI/CD pipeline
- [ ] Materialization DAG built and scheduled
- [ ] Online store capacity estimated and provisioned
Daily Operations Checklist
- [ ] Daily check that materialization succeeded
- [ ] Online store freshness SLA compliance verified
- [ ] Backward compatibility verified for any Feature View change
- [ ] Online-offline parity tests run weekly
- [ ] Online store memory usage trend monitored
- [ ] Feature serving latency p99 tracked
Feature View Change Deployment Checklist
- [ ] `feast apply` and materialization tested in the staging environment
- [ ] Schema compatibility tests passed (all dependent models)
- [ ] Point-in-Time leakage verification passed
- [ ] Online-offline parity test passed
- [ ] `feast apply` run in production
- [ ] Materialization confirmed to complete normally
- [ ] Serving latency and error rate observed for 10 minutes
- [ ] Rollback procedure documented and tested