
AI Finance & Quant Trading: FinBERT, Reinforcement Learning, and Backtesting


1. Financial Data Collection and Preprocessing

The foundation of quantitative trading is data. Beyond OHLCV (Open, High, Low, Close, Volume), the scope now extends to order book snapshots, tick data, and alternative data such as news feeds and satellite imagery.

Downloading Stock Data with yfinance

import yfinance as yf

# Download daily OHLCV for multiple tickers
tickers = ["AAPL", "MSFT", "GOOGL", "NVDA"]
df = yf.download(tickers, start="2020-01-01", end="2026-01-01", auto_adjust=True)

# Select one field from the MultiIndex columns: one column per ticker
close = df["Close"]
volume = df["Volume"]

# Handle missing data: forward fill then drop leading NaNs
close = close.ffill().dropna()
print(close.tail())

Fetching Crypto Order Books with ccxt

import ccxt

exchange = ccxt.binance()
symbol = "BTC/USDT"
orderbook = exchange.fetch_order_book(symbol, limit=20)

bids = orderbook["bids"][:5]  # top-5 [price, qty] bid levels
asks = orderbook["asks"][:5]  # top-5 [price, qty] ask levels

mid_price = (bids[0][0] + asks[0][0]) / 2
spread_bps = (asks[0][0] - bids[0][0]) / mid_price * 10000
print(f"Mid: {mid_price:.2f}, Spread: {spread_bps:.2f} bps")

Alternative Data: News Headline Collection

News and social-media data provide **natural language alpha** that structured price data cannot capture.

from datetime import datetime, timedelta

import requests

API_KEY = "YOUR_NEWSAPI_KEY"
yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")

url = (
    "https://newsapi.org/v2/everything"
    f"?q=NVIDIA+earnings&from={yesterday}&sortBy=publishedAt"
    f"&language=en&apiKey={API_KEY}"
)

# A timeout keeps the script from hanging on a stalled connection
resp = requests.get(url, timeout=10).json()
headlines = [art["title"] for art in resp.get("articles", [])]
print(headlines[:5])

2. Technical Analysis Automation

TA-Lib and pandas-ta together provide well over a hundred technical indicators, each of which can be computed with a single function call.

RSI / MACD Calculation

import pandas_ta as ta  # registers the DataFrame.ta accessor
import talib
import yfinance as yf

df = yf.download("AAPL", start="2023-01-01", end="2026-01-01", auto_adjust=True)
close = df["Close"].squeeze().values.astype(float)

# RSI (14-period)
rsi = talib.RSI(close, timeperiod=14)

# MACD
macd, signal, hist = talib.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)

# pandas-ta alternative (no TA-Lib dependency)
df_ta = df["Close"].squeeze().to_frame("close")
df_ta.ta.rsi(length=14, append=True)
df_ta.ta.macd(fast=12, slow=26, signal=9, append=True)
print(df_ta.tail())
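To make concrete what a 14-period RSI call computes, here is a minimal pure-Python sketch using Wilder's smoothing. The function name `wilder_rsi` and the handling of a completely flat window are assumptions of this sketch; small numerical differences from TA-Lib or pandas-ta are possible.

```python
def wilder_rsi(prices, period=14):
    """Return RSI values aligned with `prices`; the first `period` entries are None."""
    if len(prices) <= period:
        return [None] * len(prices)

    changes = [b - a for a, b in zip(prices, prices[1:])]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]

    # Seed both averages with a simple mean over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period

    def to_rsi(gain, loss):
        if loss == 0:
            # No losses in the window; 50 for a flat window is this sketch's convention.
            return 100.0 if gain > 0 else 50.0
        return 100.0 - 100.0 / (1.0 + gain / loss)

    rsi = [None] * period + [to_rsi(avg_gain, avg_loss)]

    # Wilder smoothing: each new change enters with weight 1/period.
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
        rsi.append(to_rsi(avg_gain, avg_loss))
    return rsi
```

A steadily rising series pins the value at 100 and a steadily falling one at 0, which is why readings near those extremes are read as overbought or oversold.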

Candlestick Pattern Recognition

# Reuses `df` and `talib` from the RSI / MACD example
open_p = df["Open"].squeeze().values.astype(float)
high_p = df["High"].squeeze().values.astype(float)
low_p = df["Low"].squeeze().values.astype(float)
close_p = df["Close"].squeeze().values.astype(float)

hammer = talib.CDLHAMMER(open_p, high_p, low_p, close_p)
engulfing = talib.CDLENGULFING(open_p, high_p, low_p, close_p)
morning_star = talib.CDLMORNINGSTAR(open_p, high_p, low_p, close_p)

# Pattern functions return 100 (bullish) / -100 (bearish) / 0 (no pattern)
print("Hammer signals detected:", (hammer != 0).sum())

3. ML Trading Strategy: XGBoost Alpha Factors

Feature Engineering

import pandas_ta as ta  # registers the DataFrame.ta accessor
import yfinance as yf

df = yf.download("SPY", start="2018-01-01", end="2026-01-01", auto_adjust=True)
# yfinance may return MultiIndex columns even for a single ticker
df.columns = df.columns.droplevel(1) if df.columns.nlevels > 1 else df.columns
df.columns = [c.lower() for c in df.columns]

# Return features
df["ret_1d"] = df["close"].pct_change(1)
df["ret_5d"] = df["close"].pct_change(5)
df["ret_20d"] = df["close"].pct_change(20)

# Volatility feature
df["vol_20d"] = df["ret_1d"].rolling(20).std()

# Technical indicator features
df.ta.rsi(length=14, append=True)
df.ta.macd(fast=12, slow=26, signal=9, append=True)
df.ta.bbands(length=20, append=True)

# Volume feature
df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()

# Target: sign of 5-day forward return (1 = up, 0 = down)
df["fwd_ret_5d"] = df["close"].pct_change(5).shift(-5)

# Drop warm-up rows and the last 5 rows, whose forward return is not yet known.
# Labeling before dropna would silently mark those last rows as 0 (NaN > 0 is False).
df.dropna(inplace=True)
df["target"] = (df["fwd_ret_5d"] > 0).astype(int)
print(df.shape)

XGBoost with Walk-Forward Validation

import warnings

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier

warnings.filterwarnings("ignore")

feature_cols = [
    "ret_1d", "ret_5d", "ret_20d", "vol_20d",
    "RSI_14", "MACD_12_26_9", "MACDs_12_26_9",
    "BBL_20_2.0", "BBM_20_2.0", "BBU_20_2.0",
    "vol_ratio"
]
target_col = "target"

results = []
train_years = 2
test_months = 3
horizon = 5  # label look-ahead in trading days
dates = df.index
start_year = dates[0].year + train_years

for year in range(start_year, 2026):
    for q in range(1, 5):
        test_start = pd.Timestamp(f"{year}-{(q - 1) * 3 + 1:02d}-01")
        test_end = test_start + pd.DateOffset(months=test_months)

        # Rolling window of about two years. The last `horizon` rows before the
        # test period are skipped because their labels use test-period prices.
        train_df = df[df.index < test_start].iloc[:-horizon].tail(train_years * 252)
        test_df = df[(df.index >= test_start) & (df.index < test_end)]

        if len(train_df) < 100 or len(test_df) < 10:
            continue

        X_train, y_train = train_df[feature_cols], train_df[target_col]
        X_test, y_test = test_df[feature_cols], test_df[target_col]

        if y_test.nunique() < 2:
            continue  # ROC AUC is undefined when the test fold has one class

        model = XGBClassifier(
            n_estimators=200, max_depth=4,
            learning_rate=0.05, subsample=0.8,
            eval_metric="logloss", random_state=42
        )
        model.fit(X_train, y_train)

        preds = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]

        acc = accuracy_score(y_test, preds)
        auc = roc_auc_score(y_test, proba)
        results.append({"period": str(test_start.date()), "acc": acc, "auc": auc})

result_df = pd.DataFrame(results)
print(result_df.tail(8))
print(f"\nMean AUC: {result_df['auc'].mean():.4f}")

4. Deep Learning for Finance: LSTM and Temporal Fusion Transformer

LSTM Price Prediction

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

SEQ_LEN = 60
prices = df[["close"]].values
split = int((len(prices) - SEQ_LEN) * 0.8)  # number of training sequences

# Fit the scaler on the training span only so test-period prices do not leak in
scaler = MinMaxScaler()
scaler.fit(prices[:split + SEQ_LEN])
scaled = scaler.transform(prices)

def make_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)

X, y = make_sequences(scaled, SEQ_LEN)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32)

class LSTMModel(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # predict from the last time step

model = LSTMModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Full-batch training for brevity
for epoch in range(30):
    model.train()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.6f}")

FinRL Reinforcement Learning Trading Agent

FinRL builds on an OpenAI Gym-style environment to train RL agents for stock trading.

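# pip install finrl
import pandas as pd
from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

# Preprocessed financial data in FinRL format
# Required columns: date, tic, open, high, low, close, volume + tech indicators
processed_df = pd.read_csv("processed_stock_data.csv")

# Sizes below assume a 30-stock universe:
# state_space = 1 (cash) + 2 * 30 (prices, holdings) + 4 indicators * 30 = 181
env_kwargs = {
    "stock_dim": 30,            # number of tickers (matches action_space)
    "hmax": 100,                # max shares traded per stock per step
    "initial_amount": 100000,   # starting capital ($)
    "buy_cost_pct": 0.001,      # 0.1% commission
    "sell_cost_pct": 0.001,
    "reward_scaling": 1e-4,
    "state_space": 181,
    "action_space": 30,
    "tech_indicator_list": ["macd", "rsi_30", "cci_30", "dx_30"],
}
# Constructor arguments differ between FinRL releases; recent ones also expect
# num_stock_shares and per-stock lists for buy_cost_pct / sell_cost_pct.

train_env = StockTradingEnv(df=processed_df, **env_kwargs)
env_train, _ = train_env.get_sb_env()  # wrap for Stable-Baselines3

agent = DRLAgent(env=env_train)
model_ppo = agent.get_model("ppo")
trained_ppo = agent.train_model(
    model=model_ppo,
    tb_log_name="ppo_stock",
    total_timesteps=50000
)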
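To illustrate what a Gym-style trading environment means in practice, the following is a minimal single-asset sketch with the classic `reset()` / `step()` interface. `ToyTradingEnv`, its observation tuple, and its reward definition are hypothetical and are not FinRL's implementation; the real `StockTradingEnv` handles many tickers, indicators, and reward scaling.

```python
class ToyTradingEnv:
    """Single-asset toy environment; `prices` needs at least two entries."""

    def __init__(self, prices, initial_cash=100_000.0, cost_pct=0.001):
        self.prices = list(prices)
        self.initial_cash = initial_cash
        self.cost_pct = cost_pct
        self.reset()

    def _obs(self):
        # Observation: (cash, shares held, current price)
        return (self.cash, self.shares, self.prices[self.t])

    def _value(self):
        return self.cash + self.shares * self.prices[self.t]

    def reset(self):
        self.t, self.cash, self.shares = 0, self.initial_cash, 0
        return self._obs()

    def step(self, action):
        """action: signed number of shares to trade at the current price."""
        price = self.prices[self.t]
        before = self._value()
        if action > 0:  # buy, capped by cash available after commission
            affordable = int(self.cash // (price * (1 + self.cost_pct)))
            qty = min(int(action), affordable)
            self.cash -= qty * price * (1 + self.cost_pct)
            self.shares += qty
        elif action < 0:  # sell, capped by current holdings
            qty = min(int(-action), self.shares)
            self.cash += qty * price * (1 - self.cost_pct)
            self.shares -= qty
        self.t += 1  # advance one bar; holdings are marked at the new price
        done = self.t == len(self.prices) - 1
        reward = self._value() - before  # change in portfolio value
        return self._obs(), reward, done, {}


env = ToyTradingEnv([100.0, 102.0, 101.0, 105.0])
obs = env.reset()
obs, reward, done, info = env.step(10)  # buy 10 shares at 100; reward is about +19
```

An RL agent only ever sees this loop: it receives an observation, emits an action, and is trained to maximize the cumulative reward, so commission and position limits have to live inside `step()`.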

5. LLM for Finance: FinBERT Sentiment Analysis

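FinBERT is a BERT model further pre-trained on financial text; the `ProsusAI/finbert` checkpoint used here was adapted on financial news and fine-tuned for sentiment on the Financial PhraseBank dataset. It classifies text as positive, negative, or neutral.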

import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, BertTokenizer

model_name = "ProsusAI/finbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

def finbert_sentiment(texts):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=-1).numpy()
    # Read the label order from the checkpoint config instead of hard-coding it
    labels = [model.config.id2label[i] for i in range(probs.shape[1])]
    return [
        {"text": t, "label": labels[p.argmax()], "score": float(p.max())}
        for t, p in zip(texts, probs)
    ]

headlines = [
    "NVIDIA beats Q4 earnings estimates by 15%, raises guidance",
    "Fed signals higher-for-longer rates amid sticky inflation",
    "Apple reports record services revenue despite iPhone slowdown",
]

results = finbert_sentiment(headlines)
for r in results:
    print(f"[{r['label'].upper():8s}] {r['score']:.3f} | {r['text']}")

Separating Numeric Guidance from Text Tone

Earnings calls contain three distinct signals: (1) quantitative results such as EPS and revenue, (2) forward guidance figures, and (3) management tone in the spoken text. LLMs excel at capturing tone but can misread sentences that embed raw numbers. Parsing numbers via regex or structured NLP and combining them with FinBERT tone scores in an ensemble produces stronger natural-language alpha than either approach alone.
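The separation described above can be sketched as follows: a regex pulls a signed surprise figure out of the text, and the result is blended with a tone score supplied by the sentiment model. The regex, the function names, the weights, the 20% cap, and the `tone_score` input (assumed to lie in [-1, 1], for example P(positive) minus P(negative)) are all illustrative assumptions, not a tested recipe.

```python
import re

# Deliberately narrow pattern: "<verb> ... by <number>%", e.g. "missed by 8%".
SURPRISE_RE = re.compile(
    r"\b(beat|beats|topped|exceeded|miss|missed|misses)\b"
    r"[^.%]*?\bby\s+(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)
NEGATIVE_VERBS = {"miss", "missed", "misses"}


def extract_surprise_pct(text):
    """Return a signed surprise in percent, or None if no figure is found."""
    match = SURPRISE_RE.search(text)
    if match is None:
        return None
    sign = -1.0 if match.group(1).lower() in NEGATIVE_VERBS else 1.0
    return sign * float(match.group(2))


def combined_signal(text, tone_score, w_numeric=0.6, w_tone=0.4, cap_pct=20.0):
    """Blend a capped numeric surprise with a tone score in [-1, 1]."""
    surprise = extract_surprise_pct(text)
    if surprise is None:
        return tone_score  # no parsable figure: fall back to tone alone
    numeric = max(-1.0, min(1.0, surprise / cap_pct))
    return w_numeric * numeric + w_tone * tone_score


# A mildly positive tone does not mask an explicit 8% miss.
print(combined_signal("Revenue missed by 8% but management remains upbeat", 0.1))
```

Keeping the numeric channel separate means a sentence whose tone scores as neutral or positive can still contribute a negative signal when the figure itself is a miss.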

6. Risk Management: VaR, CVaR, and Kelly Criterion

VaR / CVaR Computation

import numpy as np

def calculate_var_cvar(returns, confidence=0.95):
    """
    Historical simulation VaR and CVaR, reported as positive loss fractions.
    returns: array of daily returns
    """
    sorted_returns = np.sort(returns)
    index = int((1 - confidence) * len(sorted_returns))
    var = -sorted_returns[index]
    # Average the losses beyond VaR; keep at least one observation so that
    # short samples do not produce an empty slice (NaN)
    cvar = -sorted_returns[:max(index, 1)].mean()
    return var, cvar

daily_returns = df["close"].pct_change().dropna().values

var_95, cvar_95 = calculate_var_cvar(daily_returns, 0.95)
var_99, cvar_99 = calculate_var_cvar(daily_returns, 0.99)

print(f"VaR 95%: {var_95:.4f} ({var_95*100:.2f}%)")
print(f"CVaR 95%: {cvar_95:.4f} ({cvar_95*100:.2f}%)")
print(f"VaR 99%: {var_99:.4f} ({var_99*100:.2f}%)")
print(f"CVaR 99%: {cvar_99:.4f} ({cvar_99*100:.2f}%)")

Key Risk Metric Comparison

| Metric        | Formula                  | Strength                          | Limitation                              |
| ------------- | ------------------------ | --------------------------------- | --------------------------------------- |
| Sharpe Ratio  | (Rp - Rf) / sigma_p      | Standardized risk-adjusted return | Penalizes upside volatility equally     |
| Sortino Ratio | (Rp - Rf) / sigma_d      | Penalizes only downside vol       | Denominator less intuitive              |
| Max Drawdown  | Peak-to-trough loss      | Captures extreme losses           | Ignores recovery duration               |
| VaR 95%       | 5th percentile loss      | Regulatory standard               | Underestimates tail risk                |
| CVaR 95%      | Expected loss beyond VaR | Captures tail risk                | Sensitive to distributional assumptions |
| Calmar Ratio  | CAGR / MDD               | Growth vs. drawdown               | Less meaningful for short periods       |
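The ratio and drawdown formulas in the table can be sketched in plain Python. The 252-day annualization, the daily risk-free input `rf_daily`, and the function names are assumptions of this sketch; the functions raise `ZeroDivisionError` when a series has no variance, no negative excess returns, or no drawdown.

```python
import math

TRADING_DAYS = 252  # annualization convention assumed here


def sharpe_ratio(returns, rf_daily=0.0):
    excess = [r - rf_daily for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(TRADING_DAYS)


def sortino_ratio(returns, rf_daily=0.0):
    excess = [r - rf_daily for r in returns]
    mean = sum(excess) / len(excess)
    # Downside deviation: root mean square of the negative excess returns only.
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    return mean / downside * math.sqrt(TRADING_DAYS)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (positive fraction)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def calmar_ratio(returns):
    total = math.prod(1.0 + r for r in returns)
    cagr = total ** (TRADING_DAYS / len(returns)) - 1.0
    return cagr / max_drawdown(returns)


# One large up day inflates total volatility but not downside deviation.
spiky = [0.01, -0.01, 0.01, -0.01, 0.20]
print(f"Sharpe:  {sharpe_ratio(spiky):.2f}")
print(f"Sortino: {sortino_ratio(spiky):.2f}")
```

On the `spiky` series the Sortino ratio comes out well above the Sharpe ratio, which is the asymmetry the Limitation column points to.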

Kelly Criterion Position Sizing

def kelly_fraction(win_rate, win_loss_ratio):
    """
    f* = W - (1 - W) / R
    W: win rate, R: average win/loss ratio
    """
    return win_rate - (1 - win_rate) / win_loss_ratio

# Example: 55% win rate, 1.5 win/loss ratio
f_full = kelly_fraction(0.55, 1.5)
f_half = f_full * 0.5  # fractional Kelly reduces variance

print(f"Full Kelly: {f_full:.2%}")
print(f"Half Kelly: {f_half:.2%}")

> Fractional Kelly (typically 0.25–0.5x) is used in practice because win-rate and edge estimates carry significant estimation error. Full Kelly can produce catastrophic drawdowns when inputs are off, so a safety margin is essential.
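The cost of over-betting can be made concrete by evaluating the expected log growth per bet at several multiples of the Kelly fraction. This sketch assumes a simple binary bet (win `R * f` with probability `W`, lose `f` otherwise); `log_growth` is a helper introduced here for illustration.

```python
import math


def kelly_fraction(win_rate, win_loss_ratio):
    return win_rate - (1 - win_rate) / win_loss_ratio


def log_growth(f, win_rate, win_loss_ratio):
    """Expected log growth per bet when staking fraction f of capital.

    Binary-bet assumption: a win returns +win_loss_ratio * f, a loss returns -f.
    """
    return (win_rate * math.log(1 + win_loss_ratio * f)
            + (1 - win_rate) * math.log(1 - f))


W, R = 0.55, 1.5
f_star = kelly_fraction(W, R)
for label, f in [("half Kelly", 0.5 * f_star), ("full Kelly", f_star),
                 ("1.5x Kelly", 1.5 * f_star), ("2x Kelly", 2.0 * f_star)]:
    print(f"{label:>10s}: f = {f:.3f}, growth per bet = {log_growth(f, W, R):+.5f}")
```

For these inputs half Kelly keeps roughly three quarters of the full-Kelly growth rate, while staking twice the Kelly fraction pushes expected log growth slightly below zero. An overestimated edge has the same effect as deliberately over-betting, which is the motivation for the safety margin.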

7. Backtesting: Vectorbt Strategy Verification

Moving Average Crossover Backtest with Vectorbt

import vectorbt as vbt
import yfinance as yf

price = yf.download("SPY", start="2018-01-01", end="2026-01-01",
                    auto_adjust=True)["Close"].squeeze()

fast_ma = vbt.MA.run(price, 20)
slow_ma = vbt.MA.run(price, 60)

entries = fast_ma.ma_crossed_above(slow_ma)
exits = fast_ma.ma_crossed_below(slow_ma)

portfolio = vbt.Portfolio.from_signals(
    price,
    entries,
    exits,
    init_cash=100_000,
    fees=0.001,      # 0.1% commission
    slippage=0.001,  # 0.1% slippage
    freq="D",
)

stats = portfolio.stats()
print(stats[["Total Return [%]", "Sharpe Ratio", "Max Drawdown [%]",
             "Win Rate [%]", "Profit Factor"]])

Backtesting Bias Checklist

When a backtest produces unexpectedly strong results, always audit these failure modes:

| Bias Type                        | Root Cause                                         | Mitigation                                      |
| -------------------------------- | -------------------------------------------------- | ----------------------------------------------- |
| Look-ahead bias                  | Future data used to compute current-period signals | Audit shift(-1) calls; check feature timestamps |
| Survivorship bias                | Delisted tickers excluded from universe            | Use point-in-time universe datasets             |
| Optimization bias                | In-sample parameter over-fitting                   | Walk-forward validation, out-of-sample holdout  |
| Market impact ignored            | Large orders assumed to fill at mid price          | Slippage model; volume-constrained sizing       |
| Underestimated transaction costs | Real spreads and fees excluded                     | Realistic commission + slippage parameters      |
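As a small illustration of the first row, the sketch below backtests the same "today closed up" signal twice: once applied to the bar it was computed from, and once lagged by one bar (the plain-Python equivalent of `signal.shift(1)` in pandas). The price series and helper names are made up for the example.

```python
def bar_returns(prices):
    return [b / a - 1.0 for a, b in zip(prices, prices[1:])]


def backtest(returns, signals, lag):
    """Compound the returns of bars where the signal from `lag` bars earlier is 1."""
    equity = 1.0
    for t in range(lag, len(returns)):
        if signals[t - lag] == 1:
            equity *= 1.0 + returns[t]
    return equity - 1.0


prices = [100, 102, 101, 104, 103, 106, 105, 108]
rets = bar_returns(prices)

# "Today closed up" is only known at today's close.
signals = [1 if r > 0 else 0 for r in rets]

leaky = backtest(rets, signals, lag=0)   # signal and return come from the same bar
honest = backtest(rets, signals, lag=1)  # trade the bar after the signal is known

print(f"Same-bar (look-ahead): {leaky:+.2%}")
print(f"Next-bar (lagged):     {honest:+.2%}")
```

The same-bar version collects every up day by construction, while the lagged version loses money on this alternating series; a gap of that kind between two otherwise identical backtests is a typical symptom of look-ahead.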

Quiz

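**Question**: Why should walk-forward validation be used instead of randomly shuffled k-fold cross-validation for financial time series?

**Answer**: Financial time series have temporal dependency; randomly splitting folds allows future data to leak into training, creating look-ahead bias that inflates apparent model performance.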

**Explanation**: Walk-forward validation always trains on past data and tests on future data, preserving temporal order. In k-fold, a training fold can contain observations that occur after some test observations, meaning the model effectively "knows the future." Financial returns also exhibit autocorrelation and regime shifts, making temporal ordering of validation essential.
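The difference can be checked directly by counting, for each split, how many training observations are dated after the earliest test observation. The split helpers below are simplified stand-ins written for this illustration (integer positions play the role of dates); they are not scikit-learn's implementations.

```python
import random


def shuffled_kfold(n, k, seed=0):
    """Yield (train, test) index lists from a randomly shuffled k-fold split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for m, fold in enumerate(folds) if m != i for j in fold)
        yield train, test


def walk_forward(n, k):
    """Expanding-window splits: train on everything before each test block."""
    block = n // (k + 1)
    for i in range(1, k + 1):
        train = list(range(0, i * block))
        test = list(range(i * block, min((i + 1) * block, n)))
        yield train, test


def future_leaks(train, test):
    """Count training observations that occur after the earliest test observation."""
    first_test = min(test)
    return sum(1 for j in train if j > first_test)


for name, splits in [("shuffled k-fold", shuffled_kfold(100, 5)),
                     ("walk-forward", walk_forward(100, 4))]:
    print(name, [future_leaks(train, test) for train, test in splits])
```

Every shuffled fold trains on many observations from the test period's future, whereas the walk-forward splits report zero for every fold.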

**Question**: What is the main limitation of the Sharpe ratio, and how does the Sortino ratio address it?

**Answer**: The Sharpe ratio treats upside and downside volatility identically. A strategy with large positive return spikes is penalized unfairly, making its Sharpe ratio look worse than it deserves.

**Explanation**: The Sortino ratio replaces the denominator with downside deviation, penalizing only the losses that investors actually dislike. It is more appropriate for strategies with positively skewed return distributions, such as momentum or trend-following, where upside variance is desirable and should not reduce the risk-adjusted score.

**Question**: How does look-ahead bias arise in a backtest?

**Answer**: It arises when signals are computed using data that did not exist at the time of the trade, such as the same bar's closing price used to trigger an entry at that bar's open. The model implicitly knows the future and records artificially high accuracy.

**Explanation**: Common sources include using the daily close to generate a same-day entry signal, applying `shift(-n)` to features rather than only to forward-return labels, forgetting to lag a signal by one bar before trading on it, normalizing with full-sample statistics (for example fitting a scaler on the whole series), and centered rolling windows or two-sided smoothing filters that pull future bars into the current value. Each instance makes the strategy appear to predict what it has actually already observed.

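**Question**: What does the Kelly criterion maximize, and why is fractional Kelly used in practice?

**Answer**: The Kelly formula f\* = W - (1 - W) / R maximizes the expected log return, which is equivalent to maximizing the long-run geometric growth rate of wealth.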

**Explanation**: By maximizing E[log(wealth)], Kelly provably grows capital faster than any other fixed-fraction strategy over the long run. However, the formula is sensitive to estimation error in W (win rate) and R (win/loss ratio). Overestimating edge leads to over-betting and severe drawdowns. Fractional Kelly (25–50% of f\*) sacrifices some asymptotic growth rate for dramatically reduced variance and drawdown, making it far more practical for live trading.

**Question**: Why should numeric guidance be scored separately from text tone when analyzing earnings calls?

**Answer**: Management often delivers strong headline numbers while guiding conservatively for the next quarter, or presents weak results in reassuring language. Mixing the two signals causes them to cancel out, diluting alpha.

**Explanation**: Earnings releases contain three distinct information types: (1) realized figures such as EPS and revenue, (2) forward guidance numbers, and (3) qualitative tone in management commentary. Models like FinBERT score tone well but can misclassify a sentence like "revenue missed by 8%" as neutral or positive depending on surrounding context. Parsing numeric figures with structured extraction, scoring text tone separately, and then combining the two in a weighted ensemble can produce more accurate and robust natural-language alpha signals.
