The Complete Guide to Building a Feature Store: Feast Architecture, Online/Offline Serving, and ML Pipeline Integration

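Introduction
1. Why You Need a Feature Store
- Problems That Arise Without Feature Management
- Core Challenges a Feature Store Solves
2. Feature Store Architecture
3. A Deep Dive into the Feast Framework
- Core Concepts
4. Installing Feast and Setting Up a Project
- Installation and Initialization
- Project Configuration (feature_store.yaml)
5. Defining Feature Views and Entities
- Defining Entities and Feature Views
- Defining a Feature Service
6. Implementing Online/Offline Serving
7. ML Pipeline Integration
- A Feature Pipeline with Airflow
- Kubeflow Pipelines Integration
8. Feature Store Comparison
- Summary of Key Differences
9. Failure Cases and Resolution Strategies
10. Operations Checklist
Conclusion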

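Introduction

In production, an ML model's performance often depends more on the quality and consistency of its features than on the model architecture. The moment the feature transformation logic a data scientist wrote in a Jupyter Notebook diverges, even subtly, from the logic implemented on the serving server, training-serving skew appears and model performance degrades sharply.

A Feature Store is core infrastructure that addresses this problem at the architecture level. This article centers on Feast, an open-source Feature Store, and covers architecture design, online/offline serving, ML pipeline integration with Airflow and Kubeflow, and a comparison with competing platforms such as Tecton and Hopsworks. It aims to go beyond an installation guide and serve as a practical reference for teams running millions of entities and dozens of Feature Views in production.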

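1. Why You Need a Feature Store

Problems That Arise Without Feature Management

Running an ML system without a Feature Store leads to the same problems again and again.

Duplicated feature logic: The code that computes avg_order_amount_30d in the training pipeline differs from the code on the serving server. Subtle differences creep in through NULL handling, aggregation window boundaries, and time zones.
No feature discoverability: Team B, unaware that Team A already computes user_click_rate_7d, computes it again. Feature assets become siloed across the organization.
No time travel: Nobody can answer "what were this user's feature values three weeks ago?", which makes reproducible retraining and debugging impossible.
Serving latency: Computing features on the fly from several data sources at inference time pushes p99 latency into the hundreds of milliseconds.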
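
To make the first problem concrete, here is a minimal sketch of how two implementations of the same feature can drift apart. The function names and sample data are hypothetical; the only difference between the two variants is how a missing amount is handled.

```python
from statistics import mean

# Hypothetical order amounts for one customer; None marks a missing amount.
orders = [30.0, None, 50.0, 40.0]

def avg_order_amount_training(amounts):
    """Training pipeline variant: missing amounts are skipped."""
    present = [a for a in amounts if a is not None]
    return mean(present) if present else None

def avg_order_amount_serving(amounts):
    """Serving variant: missing amounts are silently treated as 0."""
    filled = [a if a is not None else 0.0 for a in amounts]
    return mean(filled) if filled else None

train_value = avg_order_amount_training(orders)  # (30 + 50 + 40) / 3
serve_value = avg_order_amount_serving(orders)   # (30 + 0 + 50 + 40) / 4
print(train_value, serve_value)
```

The model is trained on one value and served another for the same customer, with no error raised anywhere.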

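Core Challenges a Feature Store Solves

A Feature Store acts as the single source of truth for features. Because both offline training data and online serving data are generated from one feature definition, the main source of logic mismatch is removed. Features can be searched and reused through a central registry, and Point-in-Time Joins accurately reproduce feature values as of a past moment.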

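2. Feature Store Architecture

The core architecture of a Feature Store consists of three components: the offline store, the online store, and the registry.

Offline Store

Stores large volumes of historical feature data and performs Point-in-Time Joins when training data is generated. It is backed by an analytical data warehouse or data lake such as BigQuery, Snowflake, Redshift, or S3/Parquet, and must be able to scan tens of terabytes efficiently.

Online Store

A low-latency key-value store for real-time serving. It is backed by Redis, DynamoDB, Bigtable, or similar, and should return the latest feature values for an entity key within a p99 of 10 ms. A Materialization process syncs data from the offline store to the online store.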
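
The relationship between the two stores can be sketched with plain Python. This is an in-memory illustration of the idea, not how Feast implements it: the row layout, the materialize function, and the dictionary standing in for the online store are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical offline rows: (entity_key, event_timestamp, feature values).
offline_rows = [
    (1001, datetime(2026, 3, 1, 9, 0), {"total_orders": 41}),
    (1001, datetime(2026, 3, 9, 9, 0), {"total_orders": 45}),
    (1002, datetime(2026, 3, 5, 9, 0), {"total_orders": 12}),
]

def materialize(rows, start, end):
    """Keep only the newest row per entity key with a timestamp in [start, end]."""
    online = {}
    for key, ts, features in rows:
        if not (start <= ts <= end):
            continue
        if key not in online or ts > online[key][0]:
            online[key] = (ts, features)
    return online

# The dict plays the role of the key-value online store.
online_store = materialize(offline_rows, datetime(2026, 1, 1), datetime(2026, 3, 10))
print(online_store[1001][1])
```

The offline store keeps every historical row, while the online store holds one latest row per entity, which is what makes key lookups fast.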

Registry

A central catalog that stores metadata for entities, Feature Views, Feature Services, and so on. It can be file-based (local, S3, GCS) or SQL-based (PostgreSQL, MySQL). A SQL-based registry is recommended for production.

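3. A Deep Dive into the Feast Framework

Feast (Feature Store) is an open-source project started in 2019 by Gojek and Google Cloud, and it is now hosted by the LF AI & Data Foundation under the Linux Foundation. Its main strength is a pluggable architecture: you keep your existing infrastructure (Spark, Kafka, Redis, Snowflake, and so on) and add only the Feature Store layer on top.

Core Concepts

Entity: the business object that features are attached to (for example, a user, product, or driver)
Feature View: a logical group of features derived from the same source
Feature Service: the set of features used by a specific model
Data Source: where feature data originates (BigQuery, Parquet, Kafka, and so on)
Materialization: the process that syncs data from the offline store to the online store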

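4. Installing Feast and Setting Up a Project

Installation and Initialization

# Install Feast with Redis, PostgreSQL, and GCP (BigQuery) support
pip install 'feast[redis,postgres,gcp]'

# Initialize the project
feast init feature_repo
cd feature_repo

# Check the directory structure (exact layout and file names vary by Feast version)
# feature_repo/
#   feature_store.yaml    -- project configuration
#   definitions.py        -- Entity and Feature View definitions
#   data/                 -- sample data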

Project Configuration (feature_store.yaml)

project: my_ml_platform
provider: gcp
registry:
  registry_type: sql
  path: postgresql://feast:feast@db-host:5432/feast_registry
  cache_ttl_seconds: 60
online_store:
  type: redis
  connection_string: redis-host:6379,password=secret
offline_store:
  type: bigquery
  dataset: feast_offline
entity_key_serialization_version: 2

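Three things stand out in this configuration. First, the SQL-based registry supports concurrent access from multiple teams. Second, Redis as the online store enables millisecond-level responses. Third, BigQuery as the offline store handles large historical joins. The Redis password appears inline only for illustration; in practice, inject it from a secret manager or an environment variable rather than committing it.

5. Defining Feature Views and Entities

Defining Entities and Feature Views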

from datetime import timedelta
from feast import Entity, FeatureView, Field, BigQuerySource
from feast.types import Float32, Float64, Int64, String

# Entity definitions (the join key defaults to the entity name)
customer = Entity(
    name="customer_id",
    description="Unique identifier of the customer",
)

driver = Entity(
    name="driver_id",
    description="Unique identifier of the driver",
)

# BigQuery source definition
customer_stats_source = BigQuerySource(
    name="customer_stats_source",
    table="my_project.feast_dataset.customer_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Feature View definition
customer_stats_fv = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=3),
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="avg_order_amount", dtype=Float64),
        Field(name="lifetime_value", dtype=Float64),
        Field(name="preferred_category", dtype=String),
        Field(name="churn_risk_score", dtype=Float32),
    ],
    source=customer_stats_source,
    online=True,
    tags={
        "team": "growth",
        "version": "v2",
    },
)

driver_stats_source = BigQuerySource(
    name="driver_stats_source",
    table="my_project.feast_dataset.driver_stats",
    timestamp_field="event_timestamp",
)

driver_stats_fv = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(hours=6),
    schema=[
        Field(name="avg_rating", dtype=Float64),
        Field(name="total_trips", dtype=Int64),
        Field(name="acceptance_rate", dtype=Float64),
        Field(name="avg_delivery_time_min", dtype=Float32),
    ],
    source=driver_stats_source,
    online=True,
)

Defining a Feature Service

Group the features a specific model uses into a Feature Service.

from feast import FeatureService

# Feature Service for the churn prediction model
churn_prediction_svc = FeatureService(
    name="churn_prediction_service",
    features=[
        customer_stats_fv[["total_orders", "avg_order_amount", "lifetime_value", "churn_risk_score"]],
    ],
    tags={
        "model": "churn_prediction_v3",
        "owner": "growth-team",
    },
)

# Feature Service for the driver matching model
driver_matching_svc = FeatureService(
    name="driver_matching_service",
    features=[
        driver_stats_fv[["avg_rating", "acceptance_rate", "avg_delivery_time_min"]],
        customer_stats_fv[["preferred_category"]],
    ],
)

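6. Implementing Online/Offline Serving

Offline Serving (Generating Training Data)

Offline serving uses a Point-in-Time Join to retrieve feature values exactly as they were at a given past moment. This is the key mechanism that prevents data leakage, where feature values from after the label timestamp slip into the training data.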

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="feature_repo/")

# Entity dataframe for training (includes labels)
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1001, 1002],
    "event_timestamp": pd.to_datetime([
        "2026-01-15 10:00:00",
        "2026-01-15 11:00:00",
        "2026-01-16 09:00:00",
        "2026-02-01 10:00:00",
        "2026-02-01 11:00:00",
    ]),
    "churned": [0, 1, 0, 1, 0],  # label
})

# Generate training data with a Point-in-Time Join
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_stats:total_orders",
        "customer_stats:avg_order_amount",
        "customer_stats:lifetime_value",
        "customer_stats:churn_risk_score",
    ],
).to_df()

print(training_df.head())
# customer_id | event_timestamp     | churned | total_orders | avg_order_amount | ...
# 1001        | 2026-01-15 10:00:00 | 0       | 42           | 35.50            | ...
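
The rule a Point-in-Time Join applies to each entity row can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Feast's implementation: the history list, the point_in_time_lookup helper, and the way the TTL is applied are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical feature history for one customer, sorted by event time.
history = [
    (datetime(2026, 1, 10), {"total_orders": 40}),
    (datetime(2026, 1, 14), {"total_orders": 42}),
    (datetime(2026, 1, 20), {"total_orders": 47}),
]

def point_in_time_lookup(history, as_of, ttl):
    """Return the newest row with timestamp <= as_of and no older than ttl, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of) - 1  # last row at or before as_of
    if idx < 0:
        return None                            # no row existed yet
    ts, features = history[idx]
    if as_of - ts > ttl:
        return None                            # newest usable row is past the TTL
    return features

# The 2026-01-20 row lies in the future relative to as_of, so it is never considered.
row = point_in_time_lookup(history, datetime(2026, 1, 15, 10), timedelta(days=3))
print(row)
```

Because rows after the entity timestamp are excluded, a label observed on January 15 can only be paired with features that were known by then.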

Online Serving (Real-Time Inference)

Online serving returns the latest feature values within milliseconds.

# Look up online features
online_features = store.get_online_features(
    features=[
        "customer_stats:total_orders",
        "customer_stats:avg_order_amount",
        "customer_stats:churn_risk_score",
    ],
    entity_rows=[
        {"customer_id": 1001},
        {"customer_id": 1002},
    ],
).to_dict()

print(online_features)
# Example output:
# {
#     "customer_id": [1001, 1002],
#     "total_orders": [45, 12],
#     "avg_order_amount": [35.50, 28.00],
#     "churn_risk_score": [0.15, 0.82],
# }
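
The dictionary returned by to_dict() is column-oriented, while most model code wants one record per entity. A small sketch of that pivot follows; the to_rows helper is hypothetical, and the input simply repeats the example output above.

```python
# Column-oriented result in the same shape as the example output.
online_features = {
    "customer_id": [1001, 1002],
    "total_orders": [45, 12],
    "avg_order_amount": [35.50, 28.00],
    "churn_risk_score": [0.15, 0.82],
}

def to_rows(columns):
    """Pivot {feature: [values...]} into one dict per entity."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

rows = to_rows(online_features)
print(rows[0])
```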

Materialization (Syncing Offline to Online)

# Materialize all Feature Views for an explicit time range
feast materialize 2026-01-01T00:00:00 2026-03-10T00:00:00

# Incremental Materialization (only the range since the last run)
feast materialize-incremental 2026-03-10T00:00:00
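
The difference between the two commands is how the start of the time range is chosen. The sketch below shows the general idea only and simplifies what Feast actually does: the incremental_window function and the first-run fallback of looking back one TTL from the end date are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def incremental_window(last_materialized_end, ttl, end_date):
    """Pick the [start, end] range for an incremental run.

    Assumption: resume from the previous run's end; on the first run,
    look back one TTL from end_date.
    """
    if last_materialized_end is not None:
        start = last_materialized_end
    else:
        start = end_date - ttl
    if start >= end_date:
        raise ValueError("nothing to materialize: start is not before end_date")
    return start, end_date

end = datetime(2026, 3, 10)
first_run = incremental_window(None, timedelta(days=3), end)
next_run = incremental_window(datetime(2026, 3, 9, 20), timedelta(days=3), end)
print(first_run, next_run)
```

Tracking the previous end time is what lets each scheduled run process only a few hours of data instead of the full history.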

7. ML Pipeline Integration

A Feature Pipeline with Airflow

Automating Feast Materialization as an Airflow DAG makes operations more reliable.

from airflow import DAG
from airflow.decorators import task
from datetime import datetime, timedelta

default_args = {
    "owner": "ml-platform",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="feast_materialization_pipeline",
    default_args=default_args,
    schedule="0 */4 * * *",  # every 4 hours (use schedule_interval on Airflow < 2.4)
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    @task()
    def validate_source_data():
        """Validate source data quality."""
        # run_checkpoint is the Great Expectations 0.x API; adjust for 1.x.
        from great_expectations import get_context
        context = get_context()
        result = context.run_checkpoint(checkpoint_name="feature_source_check")
        if not result.success:
            raise ValueError("Source data quality validation failed")
        return True

    @task()
    def materialize_features():
        """Materialize from the offline store to the online store."""
        from feast import RepoConfig, FeatureStore
        from feast.infra.online_stores.redis import RedisOnlineStoreConfig
        from feast.repo_config import RegistryConfig

        repo_config = RepoConfig(
            project="my_ml_platform",
            provider="gcp",
            registry=RegistryConfig(
                registry_type="sql",
                path="postgresql://feast:feast@db-host:5432/feast_registry",
            ),
            online_store=RedisOnlineStoreConfig(
                connection_string="redis-host:6379",
            ),
        )
        store = FeatureStore(config=repo_config)
        store.materialize_incremental(end_date=datetime.utcnow())
        return True

    @task()
    def validate_online_store():
        """Validate feature values in the online store."""
        from feast import FeatureStore
        store = FeatureStore(repo_path="feature_repo/")

        # Look up online features for a sample entity
        result = store.get_online_features(
            features=["customer_stats:total_orders"],
            entity_rows=[{"customer_id": 1001}],
        ).to_dict()

        if result["total_orders"][0] is None:
            raise ValueError("Features were not loaded into the online store")
        return True

    @task()
    def notify_completion():
        """Send a Slack notification."""
        import requests
        response = requests.post(
            "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
            json={"text": "Feast Materialization complete"},
            timeout=10,  # avoid hanging the task if Slack is unreachable
        )
        response.raise_for_status()

    validate_source_data() >> materialize_features() >> validate_online_store() >> notify_completion()

Kubeflow Pipelines Integration

In Kubeflow Pipelines, Feast operations can be defined as individual components.

from kfp import dsl
from kfp.dsl import component, Output, Dataset

@component(
    base_image="python:3.10",
    packages_to_install=["feast[redis,postgres,gcp]>=0.40.0"],  # gcp extra for the BigQuery offline store
)
def feast_materialize_op(
    project_name: str,
    registry_path: str,
    redis_connection: str,
):
    from feast import RepoConfig, FeatureStore
    from feast.infra.online_stores.redis import RedisOnlineStoreConfig
    from feast.repo_config import RegistryConfig
    from datetime import datetime

    config = RepoConfig(
        project=project_name,
        provider="gcp",
        registry=RegistryConfig(
            registry_type="sql",
            path=registry_path,
        ),
        online_store=RedisOnlineStoreConfig(
            connection_string=redis_connection,
        ),
    )
    store = FeatureStore(config=config)
    store.materialize_incremental(end_date=datetime.utcnow())

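@component(
    base_image="python:3.10",
    # gcsfs lets pandas read gs:// paths; the gcp extra covers the BigQuery offline store
    packages_to_install=["feast[redis,postgres,gcp]>=0.40.0", "scikit-learn", "gcsfs"],
)
def train_model_op(
    project_name: str,
    model_output: Output[Dataset],
):
    from feast import FeatureStore
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    import pickle

    # Assumes feature_repo/ (with feature_store.yaml) is available in the image;
    # otherwise build a RepoConfig as in feast_materialize_op.
    store = FeatureStore(repo_path="feature_repo/")
    entity_df = pd.read_parquet("gs://my-bucket/training_entities.parquet")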

    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "customer_stats:total_orders",
            "customer_stats:avg_order_amount",
            "customer_stats:churn_risk_score",
        ],
    ).to_df()

    X = training_df.drop(columns=["customer_id", "event_timestamp", "churned"])
    y = training_df["churned"]

    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(X, y)

    with open(model_output.path, "wb") as f:
        pickle.dump(model, f)

@dsl.pipeline(name="feast-ml-training-pipeline")
def feast_training_pipeline():
    materialize_task = feast_materialize_op(
        project_name="my_ml_platform",
        registry_path="postgresql://feast:feast@db-host:5432/feast_registry",
        redis_connection="redis-host:6379",
    )
    train_task = train_model_op(
        project_name="my_ml_platform",
    )
    train_task.after(materialize_task)

8. Feature Store Comparison

Item	Feast	Tecton	Hopsworks	SageMaker Feature Store
License	Apache 2.0 (open source)	Commercial (managed)	AGPL / commercial	Tied to AWS
Deployment	Self-hosted	SaaS / VPC	SaaS / self-hosted	AWS managed
Online store	Redis, DynamoDB, PostgreSQL	DynamoDB (built in)	RonDB (built in, high performance)	Proprietary store
Offline store	BigQuery, Snowflake, Redshift, Spark	Spark, Snowflake	Apache Hudi	S3 + Glue Catalog
Streaming support	Kafka push (basic)	Kafka, Kinesis (native)	Kafka, Spark Streaming	Kinesis
Transformation engine	On-Demand Transform	Spark, SQL, Python DSL	Spark, Flink	SageMaker Processing
Point-in-Time Join	Supported	Supported (advanced)	Supported	Limited support
Registry	SQL, file-based	Built in (web UI)	Built in (Hopsworks UI)	AWS Glue
GenAI/vector support	Limited	Embedding support	Embedding + RAG	None
Cost	Free (infrastructure cost only)	High (enterprise)	Medium	Based on AWS usage
Best fit	Teams that value flexibility and prefer open source	Enterprises with real-time requirements	Regulated industries needing governance	Existing AWS ecosystem users

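Summary of Key Differences

Feast: Maximum flexibility. Each component can be chosen to fit existing infrastructure, but the team carries the operational burden itself.
Tecton: A turnkey solution. Its streaming feature pipelines are strong, but it is expensive. A good fit for organizations where real-time ML is central.
Hopsworks: Strong data governance and audit logging make it popular in regulated industries such as finance and healthcare. Its RonDB-based online store is cited as reaching about 15% of SageMaker Feature Store's latency, a figure to verify against your own workload.
SageMaker Feature Store: Convenient for organizations already deep in the AWS ecosystem, but it carries vendor lock-in risk.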

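9. Failure Cases and Resolution Strategies

Preventing Training-Serving Skew

Training-serving skew does not disappear completely just because a Feature Store is in place. Typical scenarios and their countermeasures follow.

Scenario 1: Stale features from an exceeded TTL

If the TTL for the online store is 6 hours and the Materialization batch fails to run for 12 hours, some features are returned as null. The countermeasure is to alert immediately when Materialization fails and to set the TTL to at least three times the Materialization interval. With the 4-hour schedule used in section 7, that means a TTL of at least 12 hours, so the 6-hour TTL on driver_stats would need to be raised.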
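
The three-times rule is easy to check automatically. A minimal sketch follows, assuming a hypothetical inventory of Feature View TTLs that mirrors the definitions in section 5 and the 4-hour schedule from section 7; the ttl_is_safe helper is not part of Feast.

```python
from datetime import timedelta

def ttl_is_safe(ttl, materialization_interval, safety_factor=3):
    """True when the TTL covers at least safety_factor materialization cycles."""
    return ttl >= materialization_interval * safety_factor

# Hypothetical inventory of Feature View TTLs, checked against a 4-hour schedule.
ttls = {
    "customer_stats": timedelta(days=3),
    "driver_stats": timedelta(hours=6),
}
interval = timedelta(hours=4)

at_risk = [name for name, ttl in ttls.items() if not ttl_is_safe(ttl, interval)]
print(at_risk)
```

A check like this could run in CI whenever a Feature View definition or the DAG schedule changes.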

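Scenario 2: Breaking compatibility by changing a feature definition

Changing the aggregation window of avg_order_amount from 30 days to 90 days breaks compatibility with models already trained on it. Instead of modifying the feature, add a new one (for example, avg_order_amount_90d).

Scenario 3: Time zone mismatch

If offline training data is in UTC but the online source data uses a local time zone, feature values diverge. Standardize every timestamp on UTC.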
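
One way to enforce this is to normalize timestamps at the boundary where source data enters the pipeline. The sketch below assumes the source records naive timestamps in KST (UTC+9); the to_utc helper and that choice of zone are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))  # assumed source time zone for this example

def to_utc(ts, assumed_tz=KST):
    """Convert to UTC; naive timestamps are interpreted in assumed_tz."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=assumed_tz)
    return ts.astimezone(timezone.utc)

local_event = datetime(2026, 3, 10, 1, 30)  # naive, recorded in KST
utc_event = to_utc(local_event)
print(utc_event.isoformat())
```

Note that the event falls on March 10 locally but on March 9 in UTC, which is exactly the kind of shift that moves a row into a different daily aggregation window.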

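Latency Optimization

Causes of high online serving latency and how to address them:

Cause: A single request reads from too many Feature Views.
Resolution: Bundle only the required features in a Feature Service and use batch lookups (pass multiple entities to get_online_features in one call).
Cause: Hot keys in the Redis cluster.
Resolution: Spread the read load for hot entities, for example with read replicas or a short-lived client-side cache. Redis hash tags do not help here, because they pin every key that shares a tag to the same slot.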

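Ensuring Data Consistency

Consistency between the offline store and the online store depends on Materialization. To reinforce it:

Run a sampling validation task after Materialization that compares offline and online feature values.
Set up feature drift monitoring to detect abnormal changes in feature distributions.
Integrate a data quality tool such as Great Expectations into the source data pipeline.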
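
The sampling validation in the first item can be sketched as a plain comparison of two lookups. In this sketch the inputs are hypothetical dictionaries keyed by entity; in practice they would come from get_historical_features and get_online_features for the same sampled entities.

```python
import math

def find_mismatches(offline, online, rel_tol=1e-6):
    """Compare {entity_key: {feature: value}} maps and list the differing cells."""
    mismatches = []
    for key, expected in offline.items():
        actual = online.get(key, {})
        for feature, exp in expected.items():
            act = actual.get(feature)
            if exp is None or act is None:
                same = exp is act                       # both must be missing
            elif isinstance(exp, float) or isinstance(act, float):
                same = math.isclose(exp, act, rel_tol=rel_tol)
            else:
                same = exp == act
            if not same:
                mismatches.append((key, feature, exp, act))
    return mismatches

# Hypothetical samples for two entities.
offline_sample = {
    1001: {"total_orders": 45, "avg_order_amount": 35.5},
    1002: {"total_orders": 12, "avg_order_amount": 28.0},
}
online_sample = {
    1001: {"total_orders": 45, "avg_order_amount": 35.5},
    1002: {"total_orders": 11, "avg_order_amount": 28.0},
}
mismatches = find_mismatches(offline_sample, online_sample)
print(mismatches)
```

Floats are compared with a tolerance because values can pick up small differences when they pass through different storage types.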

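10. Operations Checklist

A checklist for running a production Feature Store reliably:

Design

Entity design: Are entity keys defined to match the business domain?
TTL settings: Is each Feature View's TTL consistent with its data refresh interval?
Offline/online separation: Not every Feature View needs to be in the online store. Have you identified the ones that can be set to online=False?
Registry: Are you using a SQL-based registry to prevent concurrent access conflicts?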

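Pipeline Setup

Materialization schedule: Is periodic Materialization scheduled with Airflow or cron?
Failure alerts: Are Slack/PagerDuty alerts configured for Materialization failures?
Source data validation: Is source data quality validated up front with Great Expectations or a similar tool?
Incremental Materialization: Are you using materialize-incremental to avoid full reprocessing?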

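Monitoring

Online store latency: Are p50, p95, and p99 latency monitored?
Feature freshness: Is the latest Materialization time tracked for each Feature View?
Feature drift: Is there monitoring that detects changes in feature distributions?
Null ratio: Is the ratio of null values returned by online feature lookups tracked?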
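
The null ratio in the last item can be computed directly from a column-oriented lookup result. A minimal sketch follows, assuming a response shaped like the to_dict() output in section 6; the null_ratio helper and the sample values are hypothetical.

```python
def null_ratio(columns, entity_column="customer_id"):
    """Return {feature: fraction of None values} for a column-oriented response."""
    ratios = {}
    for name, values in columns.items():
        if name == entity_column or not values:
            continue
        ratios[name] = sum(v is None for v in values) / len(values)
    return ratios

# Hypothetical response for four entities.
response = {
    "customer_id": [1001, 1002, 1003, 1004],
    "total_orders": [45, None, 12, 7],
    "churn_risk_score": [0.15, None, None, 0.4],
}
ratios = null_ratio(response)
print(ratios)
```

A sudden rise in this ratio for one Feature View usually points to a missed Materialization run or an expired TTL.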

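Security and Governance

RBAC: Is access to Feature Views separated by team?
Audit log: Is the change history of feature definitions recorded?
PII masking: Is appropriate masking applied to features that contain personal information?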

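Conclusion

A Feature Store is core infrastructure that raises the maturity of an ML system by a level. With its open-source flexibility and pluggable architecture, Feast is a good starting point for most organizations. Adopting a Feature Store should not become a goal in itself, though. Keep the focus on clear business value: consistent feature logic, prevention of training-serving skew, and better feature reuse.

Combining pipeline integration with Airflow or Kubeflow, Redis-based online serving, and metadata management through a SQL registry yields a production Feature Store that can reliably run dozens of Feature Views and millions of entities. Depending on the organization's size and requirements, managed solutions such as Tecton or Hopsworks are also worth evaluating.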