Skip to content

필사 모드: Feature Store Design and Operations Guide: Building Online/Offline Stores with Feast and ML Feature Pipeline Automation

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

As production deployment of machine learning models becomes commonplace, feature management has emerged as a core challenge in MLOps. Features used for model training must be identically reproduced during real-time serving, multiple teams need to share the same features to avoid redundant computation, and feature quality and freshness must be continuously monitored.

Feature Store is an infrastructure layer designed to solve these problems by centrally managing feature definition, storage, serving, and monitoring. Among them, Feast (Feature Store) is the most widely used open-source Feature Store, providing flexible feature serving while reusing existing data infrastructure.

This article covers Feature Store core concepts, Feast architecture, feature definitions and entity design, materialization pipelines, Online/Offline Store configuration, training-serving skew prevention strategies, feature monitoring, comparison with Tecton/Hopsworks, production deployment patterns, and failure recovery procedures.

Feature Store Core Concepts

Why You Need a Feature Store

Operating an ML pipeline without a Feature Store leads to the following problems.

| Problem | Description | Impact |

| ----------------------------- | -------------------------------------------------------------- | --------------------------------- |

| Training-Serving Skew | Feature computation logic differs between training and serving | Degraded model performance |

| Redundant Feature Computation | Each team implements the same features independently | Wasted computing resources |

| Data Leakage | Future data included in training | Overfitting, incorrect evaluation |

| Feature Discovery Difficulty | Cannot know what features exist | Reduced development productivity |

| Serving Latency | Computing features in real-time causes delays | Degraded user experience |

Online Store vs Offline Store

A Feature Store has two fundamental storage systems.

| Item | Online Store | Offline Store |

| ---------------- | -------------------- | ------------------------------- |

| Purpose | Real-time inference | Model training, batch inference |

| Latency | 1-10ms | Seconds to minutes |

| Data Scope | Latest values only | Full history |

| Storage Examples | Redis, DynamoDB | BigQuery, Redshift, S3 |

| Query Pattern | Key-Value lookup | SQL/DataFrame queries |

| Data Volume | GB scale | TB-PB scale |

| Consistency | Eventual consistency | Strong consistency |

Feature Freshness and Consistency

Feature freshness indicates how recently the data reflects the current state.

- **Batch features**: Updated hourly to daily (e.g., number of purchases in the last 30 days)

- **Streaming features**: Updated every seconds to minutes (e.g., transaction amount in the last 5 minutes)

- **Real-time features**: Computed at request time (e.g., number of clicks in the current session)

Feast Architecture

Overall Structure

Feast consists of the following components.

feature_store.yaml - Feast project configuration

project: fraud_detection

registry: gs://ml-feature-store/registry.db

provider: gcp

offline_store:

type: bigquery

dataset: feature_store

online_store:

type: redis

connection_string: 'redis-cluster.internal:6379'

redis_type: redis_cluster

entity_key_serialization_version: 2

| Component | Role | Description |

| ---------------------- | --------------------------- | ----------------------------------------------------------------- |

| Feature Registry | Feature metadata management | Stores feature definitions, entities, and data source information |

| Offline Store | Historical data storage | Extracts training data from BigQuery, Redshift, Spark, etc. |

| Online Store | Latest feature serving | Real-time lookups from Redis, DynamoDB, etc. |

| Feature Server | REST API serving | Low-latency feature serving endpoint based on FastAPI |

| Materialization Engine | Data synchronization | Copies features from Offline Store to Online Store |

Feature Definitions and Entity Design

features/fraud_detection.py

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, BigQuerySource

from feast.types import Float32, Int64, String

Entity definition - the target that features are linked to

user_entity = Entity(

name="user_id",

join_keys=["user_id"],

description="Unique identifier for a user",

)

merchant_entity = Entity(

name="merchant_id",

join_keys=["merchant_id"],

description="Unique identifier for a merchant",

)

Data Source definition

user_transactions_source = BigQuerySource(

name="user_transactions",

table="ml_data.user_transaction_features",

timestamp_field="event_timestamp",

created_timestamp_column="created_timestamp",

)

merchant_stats_source = BigQuerySource(

name="merchant_stats",

table="ml_data.merchant_statistics",

timestamp_field="event_timestamp",

)

Feature View definition - a group of features

user_transaction_features = FeatureView(

name="user_transaction_features",

entities=[user_entity],

ttl=timedelta(days=7),

schema=[

Field(name="transaction_count_7d", dtype=Int64),

Field(name="transaction_amount_avg_7d", dtype=Float32),

Field(name="transaction_amount_max_7d", dtype=Float32),

Field(name="unique_merchants_7d", dtype=Int64),

Field(name="avg_time_between_transactions", dtype=Float32),

],

source=user_transactions_source,

online=True,

tags={

"team": "fraud-detection",

"version": "v2",

},

)

merchant_risk_features = FeatureView(

name="merchant_risk_features",

entities=[merchant_entity],

ttl=timedelta(days=30),

schema=[

Field(name="chargeback_rate_30d", dtype=Float32),

Field(name="avg_transaction_amount", dtype=Float32),

Field(name="total_transactions_30d", dtype=Int64),

Field(name="risk_score", dtype=Float32),

],

source=merchant_stats_source,

online=True,

tags={

"team": "fraud-detection",

"version": "v1",

},

)

Feature Service Definition

features/services.py

from feast import FeatureService

fraud_detection_service = FeatureService(

name="fraud_detection_v2",

features=[

user_transaction_features,

merchant_risk_features,

],

tags={

"model": "fraud_detector_v2",

"owner": "ml-team",

},

)

Materialization Pipeline

Batch Materialization

Apply Feature Registry with Feast CLI

feast apply

Run batch materialization

feast materialize 2026-03-01T00:00:00 2026-03-12T00:00:00

Incremental materialization (since last run)

feast materialize-incremental 2026-03-12T00:00:00

Automation with Airflow

dags/feast_materialization.py

from datetime import datetime, timedelta

from airflow import DAG

from airflow.operators.bash import BashOperator

from airflow.operators.python import PythonOperator

default_args = {

"owner": "ml-team",

"retries": 3,

"retry_delay": timedelta(minutes=5),

}

dag = DAG(

"feast_materialization",

default_args=default_args,

description="Daily feature materialization pipeline",

schedule_interval="0 6 * * *",

start_date=datetime(2026, 1, 1),

catchup=False,

tags=["feast", "mlops"],

)

Validate feature source data

validate_sources = PythonOperator(

task_id="validate_sources",

python_callable=lambda: __import__("feast").FeatureStore(

repo_path="/opt/feast/feature_repo"

),

dag=dag,

)

Run materialization

materialize = BashOperator(

task_id="materialize_features",

bash_command="""

cd /opt/feast/feature_repo && \

feast materialize-incremental $(date -u +%Y-%m-%dT%H:%M:%S)

""",

dag=dag,

)

Validate Online Store consistency

validate_online = PythonOperator(

task_id="validate_online_store",

python_callable=lambda: print("Validating online store consistency..."),

dag=dag,

)

validate_sources >> materialize >> validate_online

Online Store Backend Configuration

Redis-Based Online Store

feature_store.yaml - Redis configuration

project: fraud_detection

registry: gs://ml-feature-store/registry.db

provider: gcp

online_store:

type: redis

connection_string: 'redis-cluster.internal:6379,redis-cluster.internal:6380,redis-cluster.internal:6381'

redis_type: redis_cluster

key_ttl_seconds: 604800 # 7 days

DynamoDB-Based Online Store

feature_store.yaml - DynamoDB configuration

project: fraud_detection

registry: s3://ml-feature-store/registry.db

provider: aws

online_store:

type: dynamodb

region: ap-northeast-2

table_name_template: 'feast_online_{project}_{table}'

Online Store Backend Comparison

| Item | Redis | DynamoDB | PostgreSQL |

| -------------------- | ----------------- | --------------- | --------------------------- |

| Latency | 0.5-2ms | 1-5ms | 2-10ms |

| Scalability | Manual (cluster) | Automatic | Manual |

| Cost Model | Instance-based | Request-based | Instance-based |

| TTL Support | Native | Native | Manual implementation |

| Operational Overhead | Medium | Low | Medium |

| Best For | Ultra-low latency | Serverless, AWS | Small scale, cost-sensitive |

Offline Store Configuration

BigQuery-Based Offline Store

feature_store.yaml - BigQuery configuration

project: fraud_detection

registry: gs://ml-feature-store/registry.db

provider: gcp

offline_store:

type: bigquery

dataset: feature_store

location: asia-northeast3

Training Data Generation (Point-in-Time Join)

training_data.py

from feast import FeatureStore

store = FeatureStore(repo_path="./feature_repo")

Entity DataFrame - timestamps and entities for training

entity_df = pd.DataFrame({

"user_id": ["user_001", "user_002", "user_003", "user_001"],

"merchant_id": ["merch_100", "merch_200", "merch_100", "merch_300"],

"event_timestamp": pd.to_datetime([

"2026-03-01 10:00:00",

"2026-03-02 14:30:00",

"2026-03-03 09:15:00",

"2026-03-05 16:45:00",

]),

"label": [0, 1, 0, 1], # fraud label

})

Extract features with point-in-time correctness

training_df = store.get_historical_features(

entity_df=entity_df,

features=[

"user_transaction_features:transaction_count_7d",

"user_transaction_features:transaction_amount_avg_7d",

"user_transaction_features:transaction_amount_max_7d",

"user_transaction_features:unique_merchants_7d",

"merchant_risk_features:chargeback_rate_30d",

"merchant_risk_features:risk_score",

],

).to_df()

print(training_df.head())

print(f"Training data shape: {training_df.shape}")

Online Feature Retrieval (Real-Time Inference)

inference.py

from feast import FeatureStore

store = FeatureStore(repo_path="./feature_repo")

Real-time feature retrieval

feature_vector = store.get_online_features(

features=[

"user_transaction_features:transaction_count_7d",

"user_transaction_features:transaction_amount_avg_7d",

"user_transaction_features:unique_merchants_7d",

"merchant_risk_features:chargeback_rate_30d",

"merchant_risk_features:risk_score",

],

entity_rows=[

{"user_id": "user_001", "merchant_id": "merch_100"},

],

).to_dict()

print(feature_vector)

Example output:

{

"user_id": ["user_001"],

"merchant_id": ["merch_100"],

"transaction_count_7d": [23],

"transaction_amount_avg_7d": [45000.5],

"unique_merchants_7d": [8],

"chargeback_rate_30d": [0.02],

"risk_score": [0.15]

}

Training-Serving Skew Prevention Strategies

Training-Serving Skew is one of the most common causes of ML model performance degradation.

Skew Causes and Countermeasures

| Cause | Description | Countermeasure |

| ---------------------------------- | ---------------------------------------- | --------------------------------------- |

| Feature computation logic mismatch | Different code used in training/serving | Unify source through Feature Store |

| Data leakage | Future data included in training | Apply Point-in-Time Join |

| Feature freshness difference | Batch vs real-time update cycle mismatch | TTL management and freshness monitoring |

| Schema changes | Feature definitions changed | Feature Registry version control |

| NULL handling differences | Inconsistent default value handling | Set unified default value policy |

Skew Prevention with Feast

skew_detection.py

from feast import FeatureStore

from scipy import stats

store = FeatureStore(repo_path="./feature_repo")

def detect_training_serving_skew(

feature_name: str,

training_values: pd.Series,

sample_size: int = 1000,

):

"""Compare feature distributions between training data and Online Store."""

Sample from Online Store

online_features = []

entity_rows = [{"user_id": f"user_{i:04d}"} for i in range(sample_size)]

online_result = store.get_online_features(

features=[feature_name],

entity_rows=entity_rows,

).to_df()

serving_values = online_result[feature_name.split(":")[-1]].dropna()

KS test for distribution comparison

ks_stat, p_value = stats.ks_2samp(

training_values.dropna(),

serving_values,

)

PSI (Population Stability Index) calculation

psi = calculate_psi(training_values.dropna(), serving_values)

return {

"feature": feature_name,

"ks_statistic": ks_stat,

"p_value": p_value,

"psi": psi,

"skew_detected": psi > 0.2 or p_value < 0.05,

}

def calculate_psi(expected, actual, bins=10):

"""Calculate Population Stability Index (PSI)."""

breakpoints = np.linspace(

min(expected.min(), actual.min()),

max(expected.max(), actual.max()),

bins + 1,

)

expected_counts = np.histogram(expected, breakpoints)[0] / len(expected)

actual_counts = np.histogram(actual, breakpoints)[0] / len(actual)

Avoid division by zero

expected_counts = np.clip(expected_counts, 0.001, None)

actual_counts = np.clip(actual_counts, 0.001, None)

psi = np.sum(

(actual_counts - expected_counts) * np.log(actual_counts / expected_counts)

)

return psi

Feature Monitoring and Drift Detection

Monitoring Metrics

monitoring/feature_monitor.py

from dataclasses import dataclass

from datetime import datetime

from typing import Optional

@dataclass

class FeatureStats:

feature_name: str

timestamp: datetime

mean: float

std: float

min_val: float

max_val: float

null_rate: float

unique_count: int

p99_latency_ms: Optional[float] = None

def compute_feature_stats(df: pd.DataFrame, feature_name: str) -> FeatureStats:

"""Compute statistical information for a feature."""

series = df[feature_name]

return FeatureStats(

feature_name=feature_name,

timestamp=datetime.utcnow(),

mean=series.mean(),

std=series.std(),

min_val=series.min(),

max_val=series.max(),

null_rate=series.isnull().sum() / len(series),

unique_count=series.nunique(),

)

def check_drift_alerts(

current: FeatureStats,

baseline: FeatureStats,

thresholds: dict,

) -> list:

"""Check for feature drift alerts."""

alerts = []

Check mean change rate

if baseline.mean != 0:

mean_change = abs(current.mean - baseline.mean) / abs(baseline.mean)

if mean_change > thresholds.get("mean_change", 0.3):

alerts.append(

f"Mean drift detected: {baseline.mean:.4f} -> {current.mean:.4f} "

f"(change: {mean_change:.2%})"

)

NULL rate change

null_diff = abs(current.null_rate - baseline.null_rate)

if null_diff > thresholds.get("null_rate_change", 0.05):

alerts.append(

f"Null rate change: {baseline.null_rate:.4f} -> {current.null_rate:.4f}"

)

Range anomaly

if current.max_val > baseline.max_val * thresholds.get("max_multiplier", 2.0):

alerts.append(

f"Max value anomaly: {current.max_val} (baseline max: {baseline.max_val})"

)

return alerts

Feature Store Solution Comparison

| Item | Feast | Tecton | Hopsworks |

| ---------------------- | -------------------------------------------------- | -------------------------------- | -------------------------------- |

| License | Apache 2.0 (Open Source) | Commercial (Managed) | AGPL + Commercial |

| Architecture | Modular, pluggable | Managed, end-to-end | Integrated platform |

| Real-time Features | Limited | Native support | Supported |

| Streaming | Push-based | Kafka/Kinesis native | Kafka integration |

| Feature Transformation | Python SDK | Spark/Pandas/SQL | Spark/Flink |

| Monitoring | Basic | Built-in (auto-alerts) | Built-in (drift detection) |

| Governance | Basic | RBAC, audit logs | RBAC, audit, lineage tracking |

| Cloud | Multi-cloud | AWS/Databricks | AWS/Azure/GCP |

| Best For | Flexibility-focused orgs with engineering capacity | Enterprises needing real-time ML | Regulated industries, all-in-one |

| Community | Very active (CNCF-related) | Commercial support | Active |

Operational Notes

1. Entity Design Principles

- Design entity keys to align with your business domain (user_id, order_id, device_id, etc.)

- Use composite entity keys carefully as they affect lookup performance

- Excessively high cardinality entity keys can cause Online Store memory usage to spike

2. TTL Management

- Set Online Store TTL generously relative to the materialization cycle

- If TTL is too short, NULL values are returned when materialization is delayed

- If TTL is too long, Online Store storage costs increase

3. Schema Change Management

- Adding features is backward compatible, but removing features or changing types requires model retraining

- Include versions in Feature View names for version management (e.g., `user_features_v2`)

- Always verify compatibility with existing models when changing schemas

4. Materialization Failure Response

Check materialization status

feast materialize-incremental 2026-03-12T00:00:00 --verbose

Rerun for specific Feature View only

feast materialize-incremental 2026-03-12T00:00:00 \

--feature-views user_transaction_features

Check feature freshness in Online Store

feast feature-views list

Failure Cases and Recovery Procedures

Failure Case 1: Online Store Failure (Redis Cluster Down)

**Symptom**: Feature retrieval fails for all real-time inference requests

**Recovery procedure**:

1. Check and recover Redis cluster status

2. If recovery is not possible, switch to backup Redis (Sentinel/Cluster Failover)

3. Rerun materialization to restore Online Store data

4. Verify feature freshness and resume inference service

Failure Case 2: Materialization Pipeline Failure

**Symptom**: Feature data in Online Store is not updated, serving stale values

Feature freshness check script

from feast import FeatureStore

from datetime import datetime, timedelta

store = FeatureStore(repo_path="./feature_repo")

Check last materialization time per Feature View

for fv in store.list_feature_views():

if fv.materialization_intervals:

last_mat = fv.materialization_intervals[-1]

staleness = datetime.utcnow() - last_mat.end_date

if staleness > timedelta(hours=24):

print(f"ALERT: {fv.name} is stale by {staleness}")

else:

print(f"OK: {fv.name} last materialized at {last_mat.end_date}")

else:

print(f"WARNING: {fv.name} has never been materialized")

**Recovery procedure**:

1. Identify the failure cause from materialization logs (Offline Store access issues, schema changes, etc.)

2. Verify data source availability

3. Retry materialization for the failed Feature View

4. Validate feature consistency in the Online Store

Failure Case 3: Feature Drift Detected

**Symptom**: Model performance metrics gradually degrade

**Recovery procedure**:

1. Confirm drift in the feature monitoring dashboard

2. Identify the root cause (data pipeline changes, upstream schema changes, actual distribution shifts)

3. Modify the feature pipeline if necessary

4. Trigger model retraining for severe drift

Production Deployment Checklist

Always verify the following items when deploying a Feature Store to production.

| Item | Checkpoint | Recommended Setting |

| ------------------------- | ------------------------------------- | --------------------------- |

| Online Store Availability | Cluster configuration, replicas | Redis Cluster 3+ nodes |

| Materialization Cycle | Freshness vs business requirements | Set cycle matching SLA |

| TTL Setting | Impact on materialization failure | 2-3x materialization cycle |

| Backup Strategy | Online/Offline Store backups | Daily snapshots |

| Monitoring | Feature drift, latency | Prometheus + Grafana |

| Alerting | Materialization failure, drift | PagerDuty/Slack integration |

| Security | Authentication/authorization, network | IAM, VPC, TLS |

Conclusion

Feature Store is a core infrastructure component that solves the complexity of feature management in production ML model operations. Feast provides the flexibility of open source and a modular architecture that naturally integrates with existing infrastructure.

Here is a summary of the key takeaways.

- **Online/Offline Store separation**: Optimize requirements for real-time serving and batch training independently

- **Point-in-Time Correctness**: Fundamentally prevent data leakage through Feature Store time-travel queries

- **Training-Serving Skew prevention**: Ensure consistency by supporting both training and serving from a single feature definition

- **Materialization automation**: Reliably operate feature refresh pipelines by integrating with Airflow and similar tools

- **Feature Monitoring**: Maintain model performance continuously through drift detection and feature quality monitoring

Adopting a Feature Store is an investment that raises the ML maturity of the entire organization, not just a single model. Start small, validate with one production model, and gradually expand.

References

- [Feast Official Documentation](https://docs.feast.dev)

- [Feast Architecture Overview](https://docs.feast.dev/getting-started/architecture/overview)

- [Feature Store Architecture and Storage - DragonflyDB](https://www.dragonflydb.io/blog/feature-store-architecture-and-storage)

- [A Comparative Analysis: Feast vs Tecton vs Hopsworks - Uplatz](https://uplatz.com/blog/a-comparative-analysis-of-modern-feature-stores-feast-vs-tecton-vs-hopsworks/)

- [Feature Store 101: Build, Serve, and Scale ML Features - Aerospike](https://aerospike.com/blog/feature-store/)

- [What is a Feature Store? - Databricks](https://www.databricks.com/blog/what-feature-store-complete-guide-ml-feature-engineering)

- [Solving Training-Serving Skew with Feast - Medium](https://medium.com/@scoopnisker/solving-the-training-serving-skew-problem-with-feast-feature-store-3719b47e23a2)

현재 단락 (1/411)

As production deployment of machine learning models becomes commonplace, feature management has emer...

작성 글자: 0원문 글자: 19,383작성 단락: 0/411