MLOps Feature Store in Practice: Building a Feature Pipeline with Feast

Overview

One of the most common problems when deploying ML models to production is **Training-Serving Skew** — a phenomenon where model performance degrades because the features used during training differ from those used during serving. A **Feature Store** is the infrastructure component that fundamentally solves this problem by centrally managing feature definitions, storage, and serving.

**Feast** (Feature Store) is one of the most widely used open-source feature stores, supporting both the offline (batch training) and online (real-time serving) paths. This article walks through building a feature pipeline with Feast, from project setup to deployment.

Why You Need a Feature Store

The Training-Serving Skew Problem

    # During training (offline)
    features = pd.read_sql("""
        SELECT user_id,
               AVG(purchase_amount) AS avg_purchase,
               COUNT(*) AS purchase_count
        FROM transactions
        WHERE timestamp < '2026-01-01'
        GROUP BY user_id
    """, conn)

    # During serving (online): skew occurs when different logic is used!
    features = redis_client.get(f"user:{user_id}:features")

When training and serving compute the same features with different code, subtle discrepancies emerge, and model performance diverges from offline experiments. A Feature Store provides consistent values from **a single feature definition** for both offline and online use.
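A minimal plain-Python sketch of how two reimplementations of "the same" feature can drift apart. The function names, the sample amounts, and the rule for refunded orders are illustrative assumptions, not part of Feast.

```python
from statistics import mean

# Hypothetical raw purchase amounts for one user; None marks a refunded order.
amounts = [120.0, None, 80.0, 100.0]


def avg_purchase_training(rows):
    """Training-side logic: refunded orders are dropped before averaging."""
    valid = [a for a in rows if a is not None]
    return mean(valid)


def avg_purchase_serving(rows):
    """Serving-side reimplementation: refunded orders are counted as 0."""
    return mean(a if a is not None else 0.0 for a in rows)


def avg_purchase(rows):
    """Single shared definition that both paths would call instead."""
    valid = [a for a in rows if a is not None]
    return mean(valid) if valid else 0.0


skew = avg_purchase_training(amounts) - avg_purchase_serving(amounts)
print(f"skew between the two implementations: {skew:.2f}")
```

Both functions look reasonable in isolation; the discrepancy only shows up when their outputs are compared, which is why a single shared definition is easier to keep correct.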

Core Capabilities of a Feature Store

| Capability             | Description                                                    |
| ---------------------- | -------------------------------------------------------------- |
| **Feature Registry**   | Manages feature metadata, schemas, and ownership               |
| **Offline Store**      | Bulk feature retrieval for batch training (Point-in-Time Join) |
| **Online Store**       | Low-latency feature retrieval for real-time serving            |
| **Feature Service**    | Serves features via gRPC/HTTP API                              |
| **Point-in-Time Join** | Joins exact feature values based on timestamps                 |

Installing Feast and Initializing a Project

Installation

    # Basic installation
    pip install feast

    # With PostgreSQL online store (quotes keep shells such as zsh from expanding the brackets)
    pip install "feast[postgres]"

    # With Redis online store
    pip install "feast[redis]"

    # Full dependencies
    pip install "feast[postgres,redis,aws,gcp]"

Project Initialization

    # Create project
    feast init my_feature_store
    cd my_feature_store

    # Directory structure
    my_feature_store/
    ├── feature_repo/
    │   ├── feature_store.yaml   # Feast configuration
    │   ├── example_repo.py      # Feature definition examples
    │   └── data/                # Sample data
    └── README.md

feature_store.yaml Configuration

    project: my_feature_store
    registry: data/registry.db
    provider: local
    online_store:
      type: sqlite
      path: data/online_store.db
    offline_store:
      type: file
    entity_key_serialization_version: 2

For production environments, modify as follows:

    project: my_feature_store
    registry:
      registry_type: sql
      path: postgresql://user:pass@host:5432/feast_registry
    provider: local
    online_store:
      type: redis
      connection_string: "localhost:6379"  # host:port form, without a redis:// scheme
    offline_store:
      type: file  # or bigquery, redshift, snowflake

Feature Definitions

Data Source and Entity Definitions

    # feature_repo/features.py
    from datetime import timedelta

    from feast import Entity, FeatureView, Field, FileSource, PushSource
    from feast.types import Float32, Int64, String

    # Data source definition
    user_transactions_source = FileSource(
        path="data/user_transactions.parquet",
        timestamp_field="event_timestamp",
        created_timestamp_column="created_timestamp",
    )

    # Entity definition (the key that features are based on)
    user = Entity(
        name="user_id",
        join_keys=["user_id"],
        description="Unique user ID",
    )

Feature View Definition

    # Offline + Online Feature View
    user_transaction_features = FeatureView(
        name="user_transaction_features",
        entities=[user],
        ttl=timedelta(days=7),  # Expires after 7 days in online store
        schema=[
            Field(name="total_purchases", dtype=Int64, description="Total number of purchases"),
            Field(name="avg_purchase_amount", dtype=Float32, description="Average purchase amount"),
            Field(name="last_purchase_amount", dtype=Float32, description="Most recent purchase amount"),
            Field(name="purchase_frequency", dtype=Float32, description="Purchase frequency (transactions/day)"),
            Field(name="user_segment", dtype=String, description="User segment"),
        ],
        online=True,
        source=user_transactions_source,
        tags={"team": "ml-platform", "version": "v1"},
    )
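The `ttl` setting can be pictured as a freshness check at read time. The sketch below is a conceptual model in plain Python, not Feast's implementation; the helper name `read_with_ttl` and the exact boundary handling are assumptions.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)  # same value as the ttl of the feature view


def read_with_ttl(stored_value, event_ts, now, ttl=TTL):
    """Return the stored value, or None when it is older than the TTL."""
    if now - event_ts > ttl:
        return None
    return stored_value


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
fresh = read_with_ttl(45, now - timedelta(days=2), now)   # within 7 days
stale = read_with_ttl(45, now - timedelta(days=10), now)  # older than 7 days
print(fresh, stale)
```

A model that receives a null for a stale feature has to handle it explicitly (for example with a default value), so the TTL should be chosen together with the materialization schedule.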

On-Demand Feature View (Real-time Transformation)

    import pandas as pd
    from feast import RequestSource
    from feast.on_demand_feature_view import on_demand_feature_view

    # Features computed dynamically at request time
    input_request = RequestSource(
        name="purchase_request",
        schema=[
            Field(name="current_amount", dtype=Float32),
        ],
    )

    @on_demand_feature_view(
        sources=[user_transaction_features, input_request],
        schema=[
            Field(name="amount_vs_avg_ratio", dtype=Float32),
            Field(name="is_high_value", dtype=Int64),
        ],
    )
    def purchase_analysis(inputs: pd.DataFrame) -> pd.DataFrame:
        """Calculate the ratio of current purchase amount to average purchase amount"""
        df = pd.DataFrame()
        # The small epsilon guards against division by zero
        df["amount_vs_avg_ratio"] = inputs["current_amount"] / (inputs["avg_purchase_amount"] + 1e-6)
        df["is_high_value"] = (df["amount_vs_avg_ratio"] > 2.0).astype("int64")
        return df

Generating Sample Data

    # scripts/generate_data.py
    from datetime import datetime, timedelta

    import numpy as np
    import pandas as pd

    np.random.seed(42)
    n_users = 1000
    n_records = 5000

    user_ids = [f"user_{i:04d}" for i in range(n_users)]

    records = []
    for _ in range(n_records):
        user_id = np.random.choice(user_ids)
        # Cast NumPy integers to plain int, which timedelta expects
        ts = datetime(2026, 1, 1) + timedelta(
            days=int(np.random.randint(0, 60)),
            hours=int(np.random.randint(0, 24)),
        )
        records.append({
            "user_id": user_id,
            "total_purchases": np.random.randint(1, 100),
            "avg_purchase_amount": round(np.random.uniform(10, 500), 2),
            "last_purchase_amount": round(np.random.uniform(5, 1000), 2),
            "purchase_frequency": round(np.random.uniform(0.1, 5.0), 3),
            "user_segment": np.random.choice(["bronze", "silver", "gold", "platinum"]),
            "event_timestamp": ts,
            "created_timestamp": ts,
        })

    df = pd.DataFrame(records)
    df.to_parquet("feature_repo/data/user_transactions.parquet", index=False)
    print(f"Generated {len(df)} records for {n_users} users")

Run the script from the project root:

    python scripts/generate_data.py
    # Generated 5000 records for 1000 users

Feast Workflow

1. Apply — Register Feature Definitions

    cd feature_repo
    feast apply

    # Created entity user_id
    # Created feature view user_transaction_features
    # Created on demand feature view purchase_analysis
    # Deploying infrastructure for my_feature_store...

2. Materialize — Sync Offline to Online Store

    # Load data for a specific time range into the online store
    feast materialize 2026-01-01T00:00:00 2026-03-01T00:00:00

    # Incremental load (from last materialize to now)
    feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

    # Materializing 1 feature views from 2026-01-01 to 2026-03-01
    # user_transaction_features:
    # 100%|████████████████████████| 1000/1000 [00:03<00:00, 312.45it/s]
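Conceptually, materialization copies the newest row per entity within the requested window from the offline store into a key-value online store. The sketch below models that idea in plain Python; the `materialize` function and the sample rows are hypothetical and say nothing about how Feast implements it internally.

```python
from datetime import datetime

# Hypothetical offline rows: (user_id, event_timestamp, total_purchases)
offline_rows = [
    ("user_0001", datetime(2026, 1, 5), 40),
    ("user_0001", datetime(2026, 2, 20), 45),
    ("user_0042", datetime(2026, 1, 15), 12),
    ("user_0042", datetime(2026, 3, 10), 13),  # outside the window below
]


def materialize(rows, start, end):
    """Keep only the newest in-window row per entity key."""
    newest = {}
    for user_id, ts, value in rows:
        if not (start <= ts <= end):
            continue
        current = newest.get(user_id)
        if current is None or ts > current[0]:
            newest[user_id] = (ts, value)
    # The online store only needs the value, keyed by entity
    return {user_id: value for user_id, (_, value) in newest.items()}


online_store = materialize(offline_rows, datetime(2026, 1, 1), datetime(2026, 3, 1))
print(online_store)
```

The online store ends up holding one value per user, which is what makes single-key, low-latency lookups possible at serving time.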

3. Offline Feature Retrieval (for Training)

    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path="feature_repo")

    # Entity DataFrame for generating training data
    entity_df = pd.DataFrame({
        "user_id": ["user_0001", "user_0042", "user_0100", "user_0500"],
        "event_timestamp": pd.to_datetime([
            "2026-02-01", "2026-02-15", "2026-01-20", "2026-02-28"
        ]),
    })

    # Retrieve features with Point-in-Time Join
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "user_transaction_features:total_purchases",
            "user_transaction_features:avg_purchase_amount",
            "user_transaction_features:last_purchase_amount",
            "user_transaction_features:purchase_frequency",
            "user_transaction_features:user_segment",
        ],
    ).to_df()

    print(training_df.head())
    #      user_id event_timestamp  total_purchases  avg_purchase_amount  ...
    # 0  user_0001      2026-02-01               45               234.56  ...
    # 1  user_0042      2026-02-15               12                89.30  ...
    # 2  user_0100      2026-01-20               78               456.78  ...
    # 3  user_0500      2026-02-28               33               167.42  ...

**Point-in-Time Join** is the key here. It retrieves the most recent feature values as of each entity's `event_timestamp`. This ensures accurate training data without data leakage.
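The lookup rule behind a point-in-time join can be sketched for a single entity with a binary search. This is an illustrative plain-Python model; the `history` data and the `point_in_time_lookup` helper are assumptions, and the inclusive comparison (`<=`) is a simplification that ignores TTL bounds.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one user, sorted by event timestamp.
history = [
    (datetime(2026, 1, 10), 30),
    (datetime(2026, 1, 25), 38),
    (datetime(2026, 2, 12), 45),
]


def point_in_time_lookup(history, as_of):
    """Return the latest value whose timestamp is <= as_of, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


print(point_in_time_lookup(history, datetime(2026, 2, 1)))  # uses the Jan 25 row
print(point_in_time_lookup(history, datetime(2026, 1, 1)))  # nothing known yet
```

The Feb 12 row is never returned for a Feb 1 label, which is exactly the property that keeps future information out of the training set.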

4. Online Feature Retrieval (for Serving)

    # Retrieve features in real-time serving
    online_features = store.get_online_features(
        features=[
            "user_transaction_features:total_purchases",
            "user_transaction_features:avg_purchase_amount",
            "user_transaction_features:user_segment",
            "purchase_analysis:amount_vs_avg_ratio",
            "purchase_analysis:is_high_value",
        ],
        entity_rows=[
            {"user_id": "user_0001", "current_amount": 750.0},
            {"user_id": "user_0042", "current_amount": 50.0},
        ],
    ).to_dict()

    print(online_features)
    # {
    #     "user_id": ["user_0001", "user_0042"],
    #     "total_purchases": [45, 12],
    #     "avg_purchase_amount": [234.56, 89.30],
    #     "user_segment": ["gold", "silver"],
    #     "amount_vs_avg_ratio": [3.197, 0.560],
    #     "is_high_value": [1, 0],
    # }
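To sanity-check the on-demand values, the transformation can be reproduced for a single row in plain Python. The helper `purchase_analysis_row` is a hypothetical stand-in for the `purchase_analysis` view, and the stored averages (234.56 and 89.30) are taken from the sample output.

```python
EPSILON = 1e-6  # same guard against division by zero as in purchase_analysis


def purchase_analysis_row(current_amount, avg_purchase_amount):
    """Plain-Python version of the two on-demand features for one row."""
    ratio = current_amount / (avg_purchase_amount + EPSILON)
    return {"amount_vs_avg_ratio": ratio, "is_high_value": int(ratio > 2.0)}


row_a = purchase_analysis_row(750.0, 234.56)
row_b = purchase_analysis_row(50.0, 89.30)
print(round(row_a["amount_vs_avg_ratio"], 3), row_a["is_high_value"])
print(round(row_b["amount_vs_avg_ratio"], 3), row_b["is_high_value"])
```

Only `current_amount` arrives with the request; the average comes from the online store, so the same ratio is computed identically for training and serving.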

Managing Feature Groups with Feature Service

    from feast import FeatureService

    # Bundle of features needed for the recommendation model
    recommendation_service = FeatureService(
        name="recommendation_features",
        features=[
            user_transaction_features[["total_purchases", "avg_purchase_amount", "user_segment"]],
            purchase_analysis,
        ],
        tags={"model": "recommendation-v2"},
    )

    # Bundle of features needed for the fraud detection model
    fraud_detection_service = FeatureService(
        name="fraud_detection_features",
        features=[
            user_transaction_features,
            purchase_analysis,
        ],
        tags={"model": "fraud-detection-v1"},
    )

    # Retrieve via Feature Service
    features = store.get_online_features(
        features=store.get_feature_service("recommendation_features"),
        entity_rows=[{"user_id": "user_0001", "current_amount": 750.0}],
    ).to_dict()

Real-time Feature Updates with Push Source

    import pandas as pd
    from feast import PushSource

    # Push source definition
    # Note: pushed rows only reach feature views whose `source` is this push source
    user_realtime_source = PushSource(
        name="user_realtime_push",
        batch_source=user_transactions_source,
    )

    # Update features when real-time events occur
    store.push(
        push_source_name="user_realtime_push",
        df=pd.DataFrame({
            "user_id": ["user_0001"],
            "total_purchases": [46],
            "avg_purchase_amount": [240.12],
            "last_purchase_amount": [750.0],
            "purchase_frequency": [2.1],
            "user_segment": ["gold"],
            "event_timestamp": [pd.Timestamp.now()],
            "created_timestamp": [pd.Timestamp.now()],
        }),
    )

Feature Server Deployment

    # Start local Feature Server
    feast serve -h 0.0.0.0 -p 6566

    # Retrieve features via HTTP API
    curl -X POST http://localhost:6566/get-online-features \
      -H "Content-Type: application/json" \
      -d '{
        "features": [
          "user_transaction_features:total_purchases",
          "user_transaction_features:avg_purchase_amount"
        ],
        "entities": {
          "user_id": ["user_0001", "user_0042"]
        }
      }'
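The same request can be issued from Python with only the standard library. This is a sketch that assumes the feature server from the curl example is listening on `localhost:6566`; the helper names `build_request` and `fetch_features` are hypothetical, and only the request construction runs without a server.

```python
import json
from urllib import request

FEATURE_SERVER_URL = "http://localhost:6566/get-online-features"


def build_request(user_ids):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "features": [
            "user_transaction_features:total_purchases",
            "user_transaction_features:avg_purchase_amount",
        ],
        "entities": {"user_id": list(user_ids)},
    }
    return request.Request(
        FEATURE_SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_features(user_ids, timeout=2.0):
    """Send the request; requires a running `feast serve` instance."""
    with request.urlopen(build_request(user_ids), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request(["user_0001", "user_0042"])
print(req.get_method(), req.full_url)
```

Setting an explicit timeout keeps a slow feature server from stalling the calling model service indefinitely.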

Deploying Feature Server with Docker

    FROM python:3.11-slim

    WORKDIR /app

    # Install Feast directly; no requirements.txt is needed for this image
    RUN pip install --no-cache-dir "feast[redis]"

    COPY feature_repo/ feature_repo/
    WORKDIR /app/feature_repo

    # Apply registry & start server
    CMD feast apply && feast serve -h 0.0.0.0 -p 6566

    # docker-compose.yml
    services:
      feast-server:
        build: .
        ports:
          - '6566:6566'
        depends_on:
          - redis
        environment:
          - REDIS_URL=redis://redis:6379

      redis:
        image: redis:7-alpine
        ports:
          - '6379:6379'

Integration with Airflow (Automated Materialize)

    # dags/feast_materialize.py
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "ml-platform",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="feast_materialize",
        default_args=default_args,
        schedule_interval="0 */6 * * *",  # Every 6 hours (Airflow 2.4+ prefers `schedule`)
        start_date=datetime(2026, 1, 1),
        catchup=False,
    ) as dag:
        materialize = BashOperator(
            task_id="materialize_incremental",
            bash_command=(
                "cd /opt/feature_repo && "
                "feast materialize-incremental $(date -u +'%Y-%m-%dT%H:%M:%S')"
            ),
        )

Conclusion

Here are the key takeaways for building a feature pipeline with Feast:

- **Consistent Feature Definitions**: Using the same feature definitions for training and serving prevents Training-Serving Skew

- **Point-in-Time Join**: Accurate feature joins based on timestamps prevent data leakage

- **Offline/Online Dual Stores**: Offline store for batch training, online store for real-time serving

- **Feature Service**: Managing feature groups per model improves reusability

- **Push Source**: Supports real-time event-based feature updates

A Feature Store may feel like overkill when you have just one or two ML models, but it becomes essential infrastructure as the number of models grows and the team scales. Its value is maximized especially when multiple models share the same features.

Quiz

Q1: What is Training-Serving Skew?

It is a phenomenon where model performance degrades because the features used during training differ from those used during serving. Common causes include inconsistencies in feature computation logic, differences in data sources, and misaligned time references.

Q2: What does a Point-in-Time Join do?

It joins the most recent feature values prior to each entity's event timestamp (`event_timestamp`). This prevents data leakage where future data would be used during training.

Q3: What is the difference between the offline store and the online store?

The offline store holds large volumes of historical features for batch training (files, BigQuery, etc.), while the online store holds only the latest feature values for low-latency real-time serving (Redis, DynamoDB, etc.).

Q4: What does `feast materialize` do?

It synchronizes (loads) feature data from the offline store into the online store. It stores the latest feature values for the specified time range in the online store, enabling real-time retrieval.

Q5: What is the difference between On-Demand Feature View and a regular Feature View?

A regular Feature View stores pre-computed features, while an On-Demand Feature View dynamically computes features at request time. It is used for real-time transformations that combine request parameters with existing features.

Q6: What is the role of a Feature Service?

It allows logical grouping of features needed by each model. It makes it clear which model uses which features and provides a consistent interface for feature retrieval.

Q7: What does the `ttl` setting of a Feature View do?

It specifies the validity period for feature values in the online store. Features past their TTL are returned as null during retrieval, preventing stale feature values from being used in serving.

Q8: When should you use a Push Source?

It is used when the online store features need to be updated immediately upon real-time events (payments, clicks, etc.). It keeps features up to date between periodic batch materializations.
