Authors: Youngju Kim (@fjvbn20031)
- 1. Financial Data Collection and Preprocessing
- 2. Technical Analysis Automation
- 3. ML Trading Strategy: XGBoost Alpha Factors
- 4. Deep Learning for Finance: LSTM and Temporal Fusion Transformer
- 5. LLM for Finance: FinBERT Sentiment Analysis
- 6. Risk Management: VaR, CVaR, and Kelly Criterion
- 7. Backtesting: Vectorbt Strategy Verification
- Quiz
1. Financial Data Collection and Preprocessing
The foundation of quantitative trading is data. Beyond OHLCV (Open, High, Low, Close, Volume), the scope now extends to order book snapshots, tick data, and alternative data such as news feeds and satellite imagery.
Downloading Stock Data with yfinance
import yfinance as yf
import pandas as pd
# Download daily OHLCV for multiple tickers
tickers = ["AAPL", "MSFT", "GOOGL", "NVDA"]
df = yf.download(tickers, start="2020-01-01", end="2026-01-01", auto_adjust=True)
# Flatten MultiIndex → per-ticker DataFrames
close = df["Close"]
volume = df["Volume"]
# Handle missing data: forward fill then drop leading NaNs
close = close.ffill().dropna()
print(close.tail())
Fetching Crypto Order Books with ccxt
import ccxt
exchange = ccxt.binance()
symbol = "BTC/USDT"
orderbook = exchange.fetch_order_book(symbol, limit=20)
bids = orderbook["bids"][:5] # top-5 [price, qty] bid levels
asks = orderbook["asks"][:5] # top-5 [price, qty] ask levels
mid_price = (bids[0][0] + asks[0][0]) / 2
spread_bps = (asks[0][0] - bids[0][0]) / mid_price * 10000
print(f"Mid: {mid_price:.2f}, Spread: {spread_bps:.2f} bps")
Alternative Data: News Headline Collection
News and social-media data provide natural language alpha that structured price data cannot capture.
import requests
from datetime import datetime, timedelta
API_KEY = "YOUR_NEWSAPI_KEY"
yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
url = (
    "https://newsapi.org/v2/everything"
    f"?q=NVIDIA+earnings&from={yesterday}&sortBy=publishedAt"
    f"&language=en&apiKey={API_KEY}"
)
resp = requests.get(url, timeout=10).json()
headlines = [art["title"] for art in resp.get("articles", [])]
print(headlines[:5])
2. Technical Analysis Automation
TA-Lib and pandas-ta expose hundreds of technical indicators, each computable with a single Python call.
RSI / MACD Calculation
import talib
import numpy as np
import yfinance as yf
df = yf.download("AAPL", start="2023-01-01", end="2026-01-01", auto_adjust=True)
close = df["Close"].squeeze().values.astype(float)
# RSI (14-period)
rsi = talib.RSI(close, timeperiod=14)
# MACD
macd, signal, hist = talib.MACD(close, fastperiod=12, slowperiod=26, signalperiod=9)
# pandas-ta alternative (no TA-Lib dependency)
import pandas_ta as ta
df_ta = df["Close"].squeeze().to_frame("close")
df_ta.ta.rsi(length=14, append=True)
df_ta.ta.macd(fast=12, slow=26, signal=9, append=True)
print(df_ta.tail())
Candlestick Pattern Recognition
open_p = df["Open"].squeeze().values.astype(float)
high_p = df["High"].squeeze().values.astype(float)
low_p = df["Low"].squeeze().values.astype(float)
close_p = df["Close"].squeeze().values.astype(float)
hammer = talib.CDLHAMMER(open_p, high_p, low_p, close_p)
engulfing = talib.CDLENGULFING(open_p, high_p, low_p, close_p)
morning_star = talib.CDLMORNINGSTAR(open_p, high_p, low_p, close_p)
# Returns 100 (bullish) / -100 (bearish) / 0 (no pattern)
print("Hammer signals detected:", (hammer != 0).sum())
3. ML Trading Strategy: XGBoost Alpha Factors
Feature Engineering
import pandas as pd
import numpy as np
import yfinance as yf
import pandas_ta as ta
df = yf.download("SPY", start="2018-01-01", end="2026-01-01", auto_adjust=True)
df.columns = df.columns.droplevel(1) if df.columns.nlevels > 1 else df.columns
df.columns = [c.lower() for c in df.columns]
# Return features
df["ret_1d"] = df["close"].pct_change(1)
df["ret_5d"] = df["close"].pct_change(5)
df["ret_20d"] = df["close"].pct_change(20)
# Volatility feature
df["vol_20d"] = df["ret_1d"].rolling(20).std()
# Technical indicator features
df.ta.rsi(length=14, append=True)
df.ta.macd(fast=12, slow=26, signal=9, append=True)
df.ta.bbands(length=20, append=True)
# Volume feature
df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()
# Target: sign of 5-day forward return (1 = up, 0 = down)
df["target"] = (df["close"].pct_change(5).shift(-5) > 0).astype(int)
df.dropna(inplace=True)
print(df.shape)
XGBoost with Walk-Forward Validation
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import warnings
warnings.filterwarnings("ignore")
feature_cols = [
    "ret_1d", "ret_5d", "ret_20d", "vol_20d",
    "RSI_14", "MACD_12_26_9", "MACDs_12_26_9",
    "BBL_20_2.0", "BBM_20_2.0", "BBU_20_2.0",
    "vol_ratio",
]
target_col = "target"
results = []
train_years = 2
test_months = 3
dates = df.index
start_year = dates[0].year + train_years
for year in range(start_year, 2026):
    for q in range(1, 5):
        test_start = pd.Timestamp(f"{year}-{(q - 1) * 3 + 1:02d}-01")
        test_end = test_start + pd.DateOffset(months=test_months)
        train_df = df[df.index < test_start].tail(504)  # ~2 years of trading days
        test_df = df[(df.index >= test_start) & (df.index < test_end)]
        if len(train_df) < 100 or len(test_df) < 10:
            continue
        X_train, y_train = train_df[feature_cols], train_df[target_col]
        X_test, y_test = test_df[feature_cols], test_df[target_col]
        model = XGBClassifier(
            n_estimators=200, max_depth=4,
            learning_rate=0.05, subsample=0.8,
            eval_metric="logloss", random_state=42,
        )
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]
        acc = accuracy_score(y_test, preds)
        auc = roc_auc_score(y_test, proba)
        results.append({"period": str(test_start.date()), "acc": acc, "auc": auc})
result_df = pd.DataFrame(results)
print(result_df.tail(8))
print(f"\nMean AUC: {result_df['auc'].mean():.4f}")
4. Deep Learning for Finance: LSTM and Temporal Fusion Transformer
LSTM Price Prediction
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["close"]].values)
SEQ_LEN = 60
def make_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)
X, y = make_sequences(scaled, SEQ_LEN)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32)
class LSTMModel(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])
model = LSTMModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
for epoch in range(30):
    model.train()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.6f}")
FinRL Reinforcement Learning Trading Agent
FinRL builds on an OpenAI Gym-style environment to train RL agents for stock trading.
# pip install finrl
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
from finrl.agents.stablebaselines3.models import DRLAgent
import pandas as pd
# Preprocessed financial data in FinRL format
# Required columns: date, tic, open, high, low, close, volume + tech indicators
processed_df = pd.read_csv("processed_stock_data.csv")
env_kwargs = {
    "hmax": 100,                 # max shares traded per action
    "initial_amount": 100000,    # starting capital ($)
    "buy_cost_pct": 0.001,       # 0.1% commission
    "sell_cost_pct": 0.001,
    "reward_scaling": 1e-4,
    "state_space": 181,
    "action_space": 30,
    "tech_indicator_list": ["macd", "rsi_30", "cci_30", "dx_30"],
}
train_env = StockTradingEnv(df=processed_df, **env_kwargs)
agent = DRLAgent(env=train_env)
model_ppo = agent.get_model("ppo")
trained_ppo = agent.train_model(
    model=model_ppo,
    tb_log_name="ppo_stock",
    total_timesteps=50000,
)
5. LLM for Finance: FinBERT Sentiment Analysis
FinBERT is a BERT model further pre-trained on financial text (news and earnings-call transcripts) and fine-tuned for sentiment. It classifies text into Positive, Negative, or Neutral.
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import torch.nn.functional as F
model_name = "ProsusAI/finbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()
def finbert_sentiment(texts):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=-1).numpy()
    labels = ["positive", "negative", "neutral"]
    return [
        {"text": t, "label": labels[p.argmax()], "score": float(p.max())}
        for t, p in zip(texts, probs)
    ]
headlines = [
    "NVIDIA beats Q4 earnings estimates by 15%, raises guidance",
    "Fed signals higher-for-longer rates amid sticky inflation",
    "Apple reports record services revenue despite iPhone slowdown",
]
results = finbert_sentiment(headlines)
for r in results:
    print(f"[{r['label'].upper():8s}] {r['score']:.3f} | {r['text']}")
Separating Numeric Guidance from Text Tone
Earnings calls contain three distinct signals: (1) quantitative results such as EPS and revenue, (2) forward guidance figures, and (3) management tone in the spoken text. LLMs excel at capturing tone but can misread sentences that embed raw numbers. Parsing numbers via regex or structured NLP and combining them with FinBERT tone scores in an ensemble produces stronger natural-language alpha than either approach alone.
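A minimal sketch of that ensemble idea. The regex heuristic, the helper names (`extract_surprises`, `ensemble_score`), and the 50/50 weights are illustrative assumptions, not part of FinBERT or any library:

```python
import re

def extract_surprises(text):
    """Find signed percentage figures by scanning the words near each 'N%'.
    Illustrative heuristic only -- real pipelines use structured extraction."""
    out = []
    for m in re.finditer(r"(\d+(?:\.\d+)?)\s*%", text):
        window = text[max(0, m.start() - 40):m.start()].lower()
        if any(w in window for w in ("beat", "raise", "above")):
            out.append(float(m.group(1)))       # positive surprise
        elif any(w in window for w in ("miss", "cut", "below")):
            out.append(-float(m.group(1)))      # negative surprise
    return out

def ensemble_score(tone_score, text, w_tone=0.5, w_num=0.5):
    """Blend a FinBERT-style tone score in [-1, 1] with the numeric surprises,
    scaled so that a +/-10% surprise saturates the numeric leg."""
    nums = extract_surprises(text)
    num_score = max(-1.0, min(1.0, sum(nums) / len(nums) / 10)) if nums else 0.0
    return w_tone * tone_score + w_num * num_score

print(ensemble_score(0.8, "NVIDIA beats Q4 earnings estimates by 15%"))
print(ensemble_score(-0.2, "Revenue missed consensus by 8% on weak demand"))
```

The point of the split is that a headline like "missed by 8%" contributes a hard negative number even if the surrounding tone reads as neutral.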
6. Risk Management: VaR, CVaR, and Kelly Criterion
VaR / CVaR Computation
import numpy as np
def calculate_var_cvar(returns, confidence=0.95):
    """
    Historical-simulation VaR and CVaR.
    returns: array of daily returns
    """
    sorted_returns = np.sort(returns)
    index = max(int((1 - confidence) * len(sorted_returns)), 1)  # avoid an empty tail
    var = -sorted_returns[index]
    cvar = -sorted_returns[:index].mean()
    return var, cvar
daily_returns = df["close"].pct_change().dropna().values
var_95, cvar_95 = calculate_var_cvar(daily_returns, 0.95)
var_99, cvar_99 = calculate_var_cvar(daily_returns, 0.99)
print(f"VaR 95%: {var_95:.4f} ({var_95*100:.2f}%)")
print(f"CVaR 95%: {cvar_95:.4f} ({cvar_95*100:.2f}%)")
print(f"VaR 99%: {var_99:.4f} ({var_99*100:.2f}%)")
print(f"CVaR 99%: {cvar_99:.4f} ({cvar_99*100:.2f}%)")
Key Risk Metric Comparison
| Metric | Formula | Strength | Limitation |
|---|---|---|---|
| Sharpe Ratio | (Rp - Rf) / sigma_p | Standardized risk-adjusted return | Penalizes upside volatility equally |
| Sortino Ratio | (Rp - Rf) / sigma_d | Penalizes only downside vol | Denominator less intuitive |
| Max Drawdown | Peak-to-trough loss | Captures extreme losses | Ignores recovery duration |
| VaR 95% | 5th percentile loss | Regulatory standard | Underestimates tail risk |
| CVaR 95% | Expected loss beyond VaR | Captures tail risk | Sensitive to distributional assumptions |
| Calmar Ratio | CAGR / MDD | Growth vs. drawdown | Less meaningful for short periods |
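The metrics in the table can be computed directly from a daily return series. A sketch on synthetic data (the function name `risk_metrics` and the annualization convention of 252 trading days are my assumptions):

```python
import numpy as np

def risk_metrics(returns, rf=0.0, periods=252):
    """Annualized Sharpe, Sortino, max drawdown, and Calmar from daily returns."""
    excess = returns - rf / periods
    sharpe = np.sqrt(periods) * excess.mean() / excess.std(ddof=1)
    downside = excess[excess < 0]                      # only losses in the denominator
    sortino = np.sqrt(periods) * excess.mean() / downside.std(ddof=1)
    equity = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(equity)
    mdd = ((equity - peak) / peak).min()               # most negative drawdown
    cagr = equity[-1] ** (periods / len(returns)) - 1  # annualized growth rate
    calmar = cagr / abs(mdd)
    return {"sharpe": sharpe, "sortino": sortino, "mdd": mdd, "calmar": calmar}

rng = np.random.default_rng(0)
daily = rng.normal(0.0004, 0.01, 1000)                 # synthetic daily returns
metrics = risk_metrics(daily)
for k, v in metrics.items():
    print(f"{k:8s} {v: .4f}")
```

Note how Sortino differs from Sharpe only in the denominator: the same mean excess return is divided by downside deviation instead of total volatility.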
Kelly Criterion Position Sizing
def kelly_fraction(win_rate, win_loss_ratio):
    """
    f* = W - (1 - W) / R
    W: win rate, R: average win/loss ratio
    """
    return win_rate - (1 - win_rate) / win_loss_ratio
# Example: 55% win rate, 1.5 win/loss ratio
f_full = kelly_fraction(0.55, 1.5)
f_half = f_full * 0.5 # fractional Kelly reduces variance
print(f"Full Kelly: {f_full:.2%}")
print(f"Half Kelly: {f_half:.2%}")
Fractional Kelly (typically 0.25–0.5x) is used in practice because win-rate and edge estimates carry significant estimation error. Full Kelly can produce catastrophic drawdowns when inputs are off, so a safety margin is essential.
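The variance argument is easy to see in a toy Monte Carlo. This sketch assumes a fixed repeated Bernoulli bet with the example's 55% win rate and 1.5 win/loss ratio; `simulate_kelly` is an illustrative helper, not a library function:

```python
import numpy as np

def simulate_kelly(frac, win_rate=0.55, rr=1.5, n_bets=200, n_paths=2000, seed=0):
    """Monte Carlo of a repeated Bernoulli bet staking `frac` of capital each time."""
    rng = np.random.default_rng(seed)
    wins = rng.random((n_paths, n_bets)) < win_rate
    growth = np.where(wins, 1 + frac * rr, 1 - frac)   # win: +frac*rr, loss: -frac
    wealth = np.cumprod(growth, axis=1)
    peak = np.maximum.accumulate(wealth, axis=1)
    max_dd = ((wealth - peak) / peak).min(axis=1)      # worst drawdown on each path
    return np.median(wealth[:, -1]), max_dd.mean()

f_full = 0.55 - (1 - 0.55) / 1.5                       # full Kelly = 0.25
for label, f in [("full", f_full), ("half", f_full / 2), ("quarter", f_full / 4)]:
    med_wealth, mean_dd = simulate_kelly(f)
    print(f"{label:7s} f={f:.3f}  median wealth x{med_wealth:10.1f}  mean max DD {mean_dd:.1%}")
```

Running this shows the trade-off in the paragraph above: full Kelly maximizes median terminal wealth but suffers markedly deeper drawdowns than half or quarter Kelly.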
7. Backtesting: Vectorbt Strategy Verification
Moving Average Crossover Backtest with Vectorbt
import vectorbt as vbt
import pandas as pd
import yfinance as yf
price = yf.download("SPY", start="2018-01-01", end="2026-01-01",
                    auto_adjust=True)["Close"].squeeze()
fast_ma = vbt.MA.run(price, 20)
slow_ma = vbt.MA.run(price, 60)
entries = fast_ma.ma_crossed_above(slow_ma)
exits = fast_ma.ma_crossed_below(slow_ma)
portfolio = vbt.Portfolio.from_signals(
    price,
    entries,
    exits,
    init_cash=100_000,
    fees=0.001,      # 0.1% commission
    slippage=0.001,  # 0.1% slippage
    freq="D",
)
stats = portfolio.stats()
print(stats[["Total Return [%]", "Sharpe Ratio", "Max Drawdown [%]",
             "Win Rate [%]", "Profit Factor"]])
Backtesting Bias Checklist
When a backtest produces unexpectedly strong results, always audit these failure modes:
| Bias Type | Root Cause | Mitigation |
|---|---|---|
| Look-ahead bias | Future data used to compute current-period signals | Audit shift(-1) calls; check feature timestamps |
| Survivorship bias | Delisted tickers excluded from universe | Use point-in-time universe datasets |
| Optimization bias | In-sample parameter over-fitting | Walk-forward validation, out-of-sample holdout |
| Market impact ignored | Large orders assumed to fill at mid price | Slippage model; volume-constrained sizing |
| Underestimated transaction costs | Real spreads and fees excluded | Realistic commission + slippage parameters |
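One way to audit the first row of the checklist is a hand-verifiable toy series. This sketch checks the `shift(-5)` labeling convention used in section 3 on synthetic prices (the dates and values here are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy audit: a 5-day forward-return label at time t must use only prices t..t+5.
idx = pd.date_range("2024-01-01", periods=12, freq="B")
close = pd.Series(np.arange(100.0, 112.0), index=idx)

fwd_5d = close.pct_change(5).shift(-5)       # forward label, as in section 3
manual = (close.iloc[5] - close.iloc[0]) / close.iloc[0]
assert np.isclose(fwd_5d.iloc[0], manual)    # label at t0 built from t0..t5 only

# Without shift(-5) the same number is stamped at t5 instead -- a look-back
# statistic, which is invalid as a prediction target.
lookback = close.pct_change(5)
print("forward label at t0: ", round(float(fwd_5d.iloc[0]), 4))
print("look-back value at t5:", round(float(lookback.iloc[5]), 4))
```

The two prints show the same number at different timestamps, which is exactly the distinction a look-ahead audit has to catch: the value is legitimate as a *label* at t0 but leaky as a *feature* anywhere before t5.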
Quiz
Q1. Why is walk-forward validation more appropriate for financial time series than k-fold cross-validation?
Answer: Financial time series have temporal dependency; randomly splitting folds allows future data to leak into training, creating look-ahead bias that inflates apparent model performance.
Explanation: Walk-forward validation always trains on past data and tests on future data, preserving temporal order. In k-fold, a training fold can contain observations that occur after some test observations, meaning the model effectively "knows the future." Financial returns also exhibit autocorrelation and regime shifts, making temporal ordering of validation essential.
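As one concrete implementation of this idea, scikit-learn's `TimeSeriesSplit` enforces the past-train/future-test ordering automatically (the 20-sample array and fold sizes here are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains strictly on the past and tests strictly on the future.
X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4, test_size=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    assert train_idx.max() < test_idx.min()   # no future observations in training
    print(f"fold {fold}: train 0..{train_idx.max()}, "
          f"test {test_idx.min()}..{test_idx.max()}")
```

A shuffled `KFold` on the same data would violate the asserted inequality in every fold, which is precisely the leak described above.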
Q2. What are the limitations of the Sharpe ratio, and when is the Sortino ratio more appropriate?
Answer: The Sharpe ratio treats upside and downside volatility identically. A strategy with large positive return spikes is penalized unfairly, making its Sharpe ratio look worse than it deserves.
Explanation: The Sortino ratio replaces the denominator with downside deviation, penalizing only losses that investors actually dislike. It is more appropriate for strategies with asymmetric return distributions — such as momentum, option writing, or trend-following — where upside variance is desirable and should not reduce the risk-adjusted score.
Q3. How does look-ahead bias cause backtest results to be overly optimistic?
Answer: When signals are computed using data that did not exist at the time of the trade — such as the same bar's closing price used to trigger an open-bar entry — the model implicitly knows the future and records artificially high accuracy.
Explanation: Common sources include: using the daily close to generate a same-day entry signal, failing to apply shift(-n) when labeling forward returns, rolling statistics that include the current bar, and exponential moving averages that back-propagate future information. Each instance makes the strategy appear to predict what it actually already observed.
Q4. What is the mathematical justification for Kelly Criterion in position sizing, and why use fractional Kelly?
Answer: The Kelly formula f* = W - (1 - W) / R maximizes the expected log return, which is equivalent to maximizing the long-run geometric growth rate of wealth.
Explanation: By maximizing E[log(wealth)], Kelly provably grows capital faster than any other fixed-fraction strategy over the long run. However, the formula is sensitive to estimation error in W (win rate) and R (win/loss ratio). Overestimating edge leads to over-betting and severe drawdowns. Fractional Kelly (25–50% of f*) sacrifices some asymptotic growth rate for dramatically reduced variance and drawdown, making it far more practical for live trading.
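The closed form can be checked numerically: maximize the expected log growth g(f) = W·log(1 + f·R) + (1 − W)·log(1 − f) over a grid and compare with the formula, using the example's W = 0.55 and R = 1.5:

```python
import numpy as np

# Expected log growth per bet when staking fraction f of capital
W, R = 0.55, 1.5
f = np.linspace(0.001, 0.6, 10_000)
g = W * np.log(1 + f * R) + (1 - W) * np.log(1 - f)

f_numeric = f[np.argmax(g)]            # grid argmax of expected log growth
f_formula = W - (1 - W) / R            # closed-form Kelly fraction
print(f"grid argmax: {f_numeric:.4f}, closed form: {f_formula:.4f}")
```

Both come out at f* = 0.25 for this example, and the concavity of g(f) also shows why over-betting is punished: g falls off steeply past the optimum as the log(1 − f) term dominates.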
Q5. Why should numeric guidance and text tone be analyzed separately when using LLMs on earnings calls?
Answer: Management often delivers strong headline numbers while guiding conservatively for the next quarter, or presents weak results in reassuring language. Mixing the two signals causes them to cancel out, diluting alpha.
Explanation: Earnings releases contain three distinct information types: (1) realized figures such as EPS and revenue, (2) forward guidance numbers, and (3) qualitative tone in management commentary. LLMs like FinBERT accurately score tone but can misclassify a sentence like "revenue missed by 8%" as neutral or positive depending on surrounding context. Parsing numeric figures with structured extraction and scoring text tone separately — then combining them in a weighted ensemble — produces more accurate and robust natural-language alpha signals.