AI for Healthcare & Finance: From Medical Imaging to Algorithmic Trading

Overview

AI is driving unprecedented change in two industries: healthcare and finance. From deep learning models that help physicians read medical images to algorithmic trading systems that execute thousands of trades per second, AI has become core infrastructure in both fields. This guide takes an in-depth look at the major applications in each domain, with working code.


Part 1: Healthcare AI

1. Medical Imaging AI

Medical image analysis is one of the areas where deep learning has had the greatest impact. Stanford's CheXNet model achieved radiologist-level pneumonia detection on chest X-rays and became the foundation for a large body of follow-on medical imaging research.

Processing DICOM Images

DICOM is the standard format for medical imaging. The following is a basic pipeline for processing DICOM files in Python.

import pydicom
import numpy as np
from torchvision import transforms
import torch

# Load the DICOM image
ds = pydicom.dcmread('chest_xray.dcm')
img_array = ds.pixel_array.astype(np.float32)

# Min-max normalization of pixel values
img_normalized = (img_array - img_array.min()) / (img_array.max() - img_array.min())

# Preprocess for model inference
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229])
])

tensor = transform(img_normalized).unsqueeze(0)
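
Min-max normalization as used above maps any non-constant array onto [0, 1]; a quick standalone check with NumPy on stand-in pixel values:

```python
import numpy as np

# Simulated 12-bit pixel intensities (a stand-in for ds.pixel_array)
img = np.array([[0, 1024], [2048, 4095]], dtype=np.float32)

# Same min-max normalization as in the DICOM pipeline above
img_norm = (img - img.min()) / (img.max() - img.min())

print(img_norm.min(), img_norm.max())  # 0.0 1.0
```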

Medical Image Segmentation with U-Net

The U-Net architecture is widely used to automatically segment regions of interest, such as tumors and organs, in MRI and CT scans.

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(128, 256)
        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec1 = DoubleConv(256, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = DoubleConv(128, 64)
        self.out_conv = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))
        return self.out_conv(d2)

Evaluation metrics: Dice Score and IoU (Intersection over Union) are the primary metrics for medical segmentation. The Dice Score measures the overlap between two sets and is far more informative than accuracy under the severe foreground-background imbalance typical of medical images.
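
The paragraph above defines both metrics only in words; here is a minimal NumPy sketch for binary masks (the helper names dice_score and iou_score are mine, not from a library):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice = 2*|A∩B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred, target, eps=1e-7):
    """IoU = |A∩B| / |A∪B| for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

pred   = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice_score(pred, target))  # 2*1 / (2+1) ≈ 0.667
print(iou_score(pred, target))   # 1 / 2 = 0.5
```

Note how Dice weights the intersection twice, so it is always at least as large as IoU for non-empty overlap.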


2. Clinical NLP

Electronic health records (EHRs) contain vast amounts of unstructured text. NLP models extract clinically meaningful information from this data.

Medical Named Entity Recognition (NER)

from transformers import pipeline

# BioBERT-based NER pipeline (the pipeline loads its own tokenizer)
ner_pipeline = pipeline(
    "ner",
    model="dmis-lab/biobert-v1.1",
    tokenizer="dmis-lab/biobert-v1.1",
    aggregation_strategy="simple"
)

clinical_text = """
Patient presents with acute myocardial infarction.
Administered aspirin 325mg and clopidogrel 600mg.
ECG shows ST elevation in leads V1-V4.
"""

entities = ner_pipeline(clinical_text)
for ent in entities:
    print(f"Entity: {ent['word']}, Label: {ent['entity_group']}, Score: {ent['score']:.3f}")

Clinical Text Summarization (Med-PaLM, BioMedLM)

Domain-specific language models such as Google's Med-PaLM and Stanford's BioMedLM are fine-tuned on clinical and biomedical text for tasks like clinical question answering and summarization. PhysioNet's MIMIC-III dataset (https://physionet.org/content/mimiciii/) is the standard benchmark for clinical NLP research. It contains de-identified discharge summaries, radiology reports, nursing notes, and other clinical text, which credentialed researchers use for model training.


3. AI for Drug Discovery

Drug development takes 10 to 15 years on average and costs billions of dollars. AI is accelerating multiple stages of this pipeline.

Molecular Property Prediction with RDKit

from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd

def compute_molecular_features(smiles_list):
    features = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            features.append({
                'MW': Descriptors.MolWt(mol),
                'LogP': Descriptors.MolLogP(mol),
                'HBD': Descriptors.NumHDonors(mol),
                'HBA': Descriptors.NumHAcceptors(mol),
                'TPSA': Descriptors.TPSA(mol)
            })
    return pd.DataFrame(features)

def lipinski_filter(df):
    """리핀스키 5의 규칙: 경구 투여 가능성 필터"""
    return df[
        (df['MW'] <= 500) &
        (df['LogP'] <= 5) &
        (df['HBD'] <= 5) &
        (df['HBA'] <= 10)
    ]

# Example SMILES strings
smiles_list = [
    'CC(=O)Oc1ccccc1C(=O)O',  # Aspirin
    'CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C',  # Testosterone
]
df = compute_molecular_features(smiles_list)
drug_candidates = lipinski_filter(df)

AlphaFold2 and Protein Structure Prediction

DeepMind's AlphaFold2 predicts 3D protein structures from amino acid sequences with near-experimental accuracy. By effectively solving the long-standing protein folding problem, it changed the paradigm of drug discovery; DeepMind has since released over 200 million predicted structures in a public database (https://alphafold.ebi.ac.uk/).

# Example: running AlphaFold2 through the ColabFold interface
# pip install colabfold

from colabfold.batch import run

# Input protein sequences as (name, sequence) pairs
queries = [("target_protein", "MKTIIALSYIFCLVFA")]

results = run(
    queries=queries,
    result_dir="./alphafold_results",
    use_templates=False,
    num_recycles=3,
    model_type="auto"
)

Molecular Generation with Graph Neural Networks

Graph neural networks (GNNs) represent a molecule as a graph, with atoms as nodes and bonds as edges, and are used to predict molecular properties and generate new drug candidates.

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

class MoleculeGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = torch.nn.Linear(hidden_dim, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.relu(self.conv1(x, edge_index))
        x = self.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)
        return self.classifier(x)

4. Wearables and AI

Biosignal data collected from smartwatches and other wearables is opening a new paradigm of real-time health monitoring.

ECG Arrhythmia Classification with a 1D CNN

import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """1D CNN for ECG arrhythmia classification"""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, padding=5),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(32)
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 32, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

# PhysioNet MIT-BIH Arrhythmia Database
# https://physionet.org/content/mitdb/1.0.0/

Typical ECG classes: Normal (N), Atrial Fibrillation (A), Other (O), and Noisy (~), following the PhysioNet/CinC 2017 challenge labeling. Accelerometer and gyroscope streams from the same devices also support human activity recognition (HAR): classifying walking, running, stair climbing, and similar activities.


Part 2: Finance AI

5. Stock Price Prediction

Stock price prediction is the most widely studied problem in finance AI, and also one of the hardest. Under the Efficient Market Hypothesis (EMH), public information is already reflected in prices, which makes it extremely difficult to earn consistent excess returns from price data alone.

Feature Engineering with Technical Indicators

import yfinance as yf
import pandas as pd
import numpy as np

# Download price data with yfinance
# Reference: https://pypi.org/project/yfinance/
ticker = yf.Ticker("AAPL")
df = ticker.history(period="5y")

def compute_rsi(prices, window=14):
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

def add_technical_indicators(df):
    df['SMA20'] = df['Close'].rolling(20).mean()
    df['SMA50'] = df['Close'].rolling(50).mean()
    df['EMA12'] = df['Close'].ewm(span=12).mean()
    df['EMA26'] = df['Close'].ewm(span=26).mean()
    df['MACD'] = df['EMA12'] - df['EMA26']
    df['Signal'] = df['MACD'].ewm(span=9).mean()
    df['RSI'] = compute_rsi(df['Close'], 14)
    df['BB_upper'] = df['SMA20'] + 2 * df['Close'].rolling(20).std()
    df['BB_lower'] = df['SMA20'] - 2 * df['Close'].rolling(20).std()
    df['Volatility'] = df['Close'].pct_change().rolling(20).std()
    df['Volume_MA'] = df['Volume'].rolling(20).mean()
    return df.dropna()

df = add_technical_indicators(df)
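
As a sanity check on the RSI logic above without downloading any data, here is a self-contained run on synthetic prices (compute_rsi is re-declared so the snippet stands alone; assumes only pandas and numpy). In a steady uptrend losses are zero, so RS diverges and RSI saturates at 100:

```python
import numpy as np
import pandas as pd

def compute_rsi(prices, window=14):
    delta = prices.diff()
    gain = delta.where(delta > 0, 0).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

# Steadily rising prices: every delta is positive, so loss == 0 and RSI -> 100
up = pd.Series(np.arange(1.0, 31.0))
rsi_up = compute_rsi(up).dropna()
print(rsi_up.iloc[-1])  # 100.0
```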

LSTM-Based Time Series Forecasting

import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

class StockLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=128, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

# Walk-forward split (prevents look-ahead bias)
def walk_forward_split(df, train_size=0.7, val_size=0.15):
    n = len(df)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    return df[:train_end], df[train_end:val_end], df[val_end:]

Important: never use a random split on time series data. Future data leaks into the training set (look-ahead bias), making backtest results look far better than anything achievable in live trading.
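
The split can be sanity-checked on a toy sequence: every training point must precede every validation point, which in turn must precede every test point. This re-declares walk_forward_split from the snippet above; it works on any sliceable sequence:

```python
def walk_forward_split(df, train_size=0.7, val_size=0.15):
    n = len(df)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    return df[:train_end], df[train_end:val_end], df[val_end:]

data = list(range(100))  # 100 chronological observations
train, val, test = walk_forward_split(data)
print(len(train), len(val), len(test))  # 70 15 15

# Chronological order is preserved across the folds
assert max(train) < min(val) < min(test)
```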


6. Algorithmic Trading

Moving Average Crossover Strategy with Backtrader

import backtrader as bt
# Reference: https://www.backtrader.com/docu/

class MACrossStrategy(bt.Strategy):
    params = (('fast', 10), ('slow', 30),)

    def __init__(self):
        self.sma_fast = bt.indicators.SMA(period=self.p.fast)
        self.sma_slow = bt.indicators.SMA(period=self.p.slow)
        self.crossover = bt.indicators.CrossOver(self.sma_fast, self.sma_slow)

    def next(self):
        if not self.position:
            if self.crossover > 0:
                self.buy(size=100)
        elif self.crossover < 0:
            self.sell(size=100)

# Run the backtest
cerebro = bt.Cerebro()
cerebro.addstrategy(MACrossStrategy)
cerebro.broker.setcash(100000.0)
cerebro.broker.setcommission(commission=0.001)
results = cerebro.run()
print(f"Final Portfolio Value: {cerebro.broker.getvalue():.2f}")

Portfolio Optimization (Markowitz Mean-Variance)

import numpy as np
from scipy.optimize import minimize

def portfolio_stats(weights, returns):
    port_return = np.sum(returns.mean() * weights) * 252
    port_vol = np.sqrt(weights @ returns.cov() @ weights * 252)
    sharpe = port_return / port_vol
    return port_return, port_vol, sharpe

def min_variance(returns):
    n = returns.shape[1]
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0, 1)] * n
    result = minimize(
        lambda w: portfolio_stats(w, returns)[1],
        x0=np.ones(n) / n,
        bounds=bounds,
        constraints=constraints
    )
    return result.x
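
A usage sketch of the minimum-variance idea on synthetic daily returns for three made-up assets A, B, C (assumes numpy, pandas, and scipy; this version minimizes portfolio variance directly, which has the same minimizer as volatility):

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Synthetic daily returns: asset A is least volatile, C is most volatile
returns = pd.DataFrame(
    rng.normal(0.0005, [0.01, 0.02, 0.03], size=(500, 3)),
    columns=['A', 'B', 'C']
)

def min_variance(returns):
    n = returns.shape[1]
    cov = returns.cov().values
    result = minimize(
        lambda w: w @ cov @ w,          # portfolio variance
        x0=np.ones(n) / n,              # start from equal weights
        bounds=[(0, 1)] * n,            # long-only
        constraints={'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
    )
    return result.x

w = min_variance(returns)
print(w.round(3))  # weights sum to 1, with most weight on the least volatile asset
```

For independent assets, minimum-variance weights are roughly inversely proportional to each asset's variance, which is why A dominates here.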

Reinforcement Learning for Trading (PPO)

import gym
import numpy as np
from stable_baselines3 import PPO
# Note: recent stable-baselines3 releases expect the Gymnasium API;
# this example uses the legacy gym interface

class TradingEnv(gym.Env):
    """A simplified stock trading environment"""
    def __init__(self, df, initial_balance=10000):
        super().__init__()
        self.df = df
        self.initial_balance = initial_balance
        self.action_space = gym.spaces.Discrete(3)  # 0: Hold, 1: Buy, 2: Sell
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(10,), dtype=np.float32
        )
        self.reset()

    def reset(self):
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        return self._get_obs()

    def _get_obs(self):
        row = self.df.iloc[self.current_step]
        return np.array([
            row['Close'], row['SMA20'], row['SMA50'],
            row['RSI'], row['MACD'], row['Volatility'],
            self.balance, self.shares,
            row['Volume'], row['BB_upper'] - row['BB_lower']
        ], dtype=np.float32)

    def step(self, action):
        price = self.df.iloc[self.current_step]['Close']
        reward = 0
        if action == 1 and self.balance >= price:
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:
            self.shares -= 1
            self.balance += price
            reward = price - self.df.iloc[max(0, self.current_step - 1)]['Close']
        self.current_step += 1
        done = self.current_step >= len(self.df) - 1
        return self._get_obs(), reward, done, {}

# Train a PPO agent on the environment (df from the feature engineering section)
model = PPO("MlpPolicy", TradingEnv(df), verbose=1)
model.learn(total_timesteps=100000)
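
The cash-and-shares bookkeeping inside step() can be replayed in plain Python, independent of gym, using toy prices:

```python
# Replay the buy/sell accounting from TradingEnv.step on a toy price series
prices = [100.0, 105.0, 110.0]
balance, shares = 10_000.0, 0

# Buy one share at step 0
balance -= prices[0]
shares += 1

# Sell one share at step 2; reward mirrors the one-step price change used above
balance += prices[2]
shares -= 1
reward = prices[2] - prices[1]

print(balance, shares, reward)  # 10010.0 0 5.0
```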

7. Fraud Detection

Credit card fraud detection faces the special challenge of extreme class imbalance: roughly 99.9% legitimate transactions versus 0.1% fraud.

SMOTE and Isolation Forest

from sklearn.ensemble import IsolationForest
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np

# Oversample the imbalanced training data (never the validation or test sets)
smote = SMOTE(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Unsupervised anomaly detection
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.001,
    random_state=42
)
iso_forest.fit(X_train)
anomaly_scores = iso_forest.decision_function(X_test)

# Explain the model with SHAP (assumes a fitted tree-based model named `model`)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:100])
shap.summary_plot(shap_values, X_test[:100], feature_names=feature_names)

Autoencoder-Based Anomaly Detection

import torch
import torch.nn as nn

class FraudAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

# Classify as fraud when the reconstruction error is large
def detect_fraud(model, x, threshold=0.5):
    reconstructed = model(x)
    reconstruction_error = torch.mean((x - reconstructed) ** 2, dim=1)
    return reconstruction_error > threshold
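
The same thresholding rule as detect_fraud, sketched with NumPy and toy numbers (the 0.5 threshold is illustrative; in practice it is tuned on a validation set):

```python
import numpy as np

x = np.array([[0.0, 0.0], [1.0, 1.0]])              # original transactions
reconstructed = np.array([[0.1, 0.0], [0.0, 0.0]])  # toy autoencoder output

# Per-sample mean squared reconstruction error
error = np.mean((x - reconstructed) ** 2, axis=1)
is_fraud = error > 0.5

print(error)     # per-sample errors: 0.005 and 1.0
print(is_fraud)  # [False  True]
```

The second transaction reconstructs poorly (the autoencoder never learned its pattern), so its error crosses the threshold.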

8. Credit Risk Modeling

Under the Basel III framework, financial institutions are required to quantify credit risk. Expected Loss (EL) is computed as:

  • EL = PD × LGD × EAD
    • PD (Probability of Default): the probability the borrower defaults
    • LGD (Loss Given Default): the fraction of exposure lost if a default occurs
    • EAD (Exposure at Default): the exposure outstanding at the time of default

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb

# PD model (logistic regression with calibrated probabilities)
pd_model = LogisticRegression(class_weight='balanced', max_iter=1000)
pd_model_calibrated = CalibratedClassifierCV(pd_model, cv=5)
pd_model_calibrated.fit(X_train, y_train)
pd_scores = pd_model_calibrated.predict_proba(X_test)[:, 1]

# Scorecard-style model with LightGBM
lgb_model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    class_weight='balanced'
)
lgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Expected Loss calculation
def expected_loss(pd, lgd, ead):
    return pd * lgd * ead

# Survival Analysis for Time-to-Default
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df_survival, duration_col='time_to_default', event_col='defaulted')
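
Plugging numbers into the EL formula above, e.g. a 2% default probability, 45% loss given default, and an exposure of 100,000 (the parameter is renamed pd_ here only to avoid shadowing the common pandas alias):

```python
def expected_loss(pd_, lgd, ead):
    """EL = PD × LGD × EAD"""
    return pd_ * lgd * ead

el = expected_loss(pd_=0.02, lgd=0.45, ead=100_000)
print(el)  # ≈ 900.0
```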

9. Financial NLP

News Sentiment Analysis and Stock Price Correlation

from transformers import pipeline
import pandas as pd

# FinBERT (ProsusAI/finbert): a finance-domain BERT for sentiment analysis
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="ProsusAI/finbert"
)

def analyze_news_sentiment(headlines):
    results = sentiment_pipeline(headlines)
    return pd.DataFrame([
        {'headline': h, 'label': r['label'], 'score': r['score']}
        for h, r in zip(headlines, results)
    ])

headlines = [
    "Apple reports record quarterly earnings",
    "Federal Reserve signals rate hike ahead",
    "Tech sector faces regulatory headwinds"
]

df_sentiment = analyze_news_sentiment(headlines)
print(df_sentiment)

Parsing Financial Statements from SEC EDGAR

import requests
from bs4 import BeautifulSoup
import re

def fetch_10k_filing(cik, accession_number):
    """Fetch a 10-K annual report from SEC EDGAR"""
    base_url = "https://www.sec.gov/Archives/edgar"
    # The document filename varies by filing; '10k.htm' is a placeholder
    url = f"{base_url}/{cik}/{accession_number}/10k.htm"
    response = requests.get(url, headers={'User-Agent': 'research@example.com'})
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract a specific section (e.g., Risk Factors)
    risk_section = soup.find(string=re.compile("Risk Factors", re.IGNORECASE))
    return risk_section.parent.get_text() if risk_section else ""

Key Caveats for Healthcare & Finance AI

Healthcare

  1. FDA regulation: In the US, medical AI software falls under the FDA's Software as a Medical Device (SaMD) framework.
  2. HIPAA compliance: Handling patient data requires de-identification and appropriate security controls.
  3. Bias: Models can underperform for demographic groups that are underrepresented in the training data.
  4. Clinical validation: Strong benchmark performance is not enough; validation in real clinical settings (clinical trials) is essential.

Finance

  1. Look-ahead bias: Rigorously prevent future information from leaking into time series training.
  2. Overfitting: Financial data is noisy, and complex models often perform worse in live trading.
  3. Regime change: Strategies that worked historically can fail abruptly when market conditions shift.
  4. Transaction costs: Backtests must include slippage, commissions, and market impact.

Quiz

Q1. In financial AI backtesting, what is look-ahead bias and why is it dangerous?

Answer: It is the error of using data that would not have been available at a given point in time for model training or signal generation during a backtest.

Explanation: For example, using today's closing price to generate today's trading signal describes a trade that is impossible in practice. Walk-forward validation and time-aware cross-validation prevent this. A backtest with this bias appears to have discovered alpha that does not actually exist, which can lead to large losses in live trading.

Q2. Why is Dice Score used instead of Accuracy in medical imaging AI?

Answer: Because medical image segmentation suffers from extreme class imbalance: background pixels vastly outnumber the region of interest.

Explanation: For example, if a tumor occupies 1% of the pixels in a brain MRI, predicting every pixel as background still yields 99% accuracy. The Dice Score, computed as 2 * TP / (2*TP + FP + FN), measures the overlap between the two sets and reflects minority-class performance far more faithfully.

Q3. What is the purpose of Lipinski's Rule of Five (Ro5) in drug discovery?

Answer: It is an empirical rule for screening drug candidates at an early stage for the pharmacokinetic properties needed for oral administration.

Explanation: Molecules with molecular weight of 500 or less, LogP of 5 or less, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors are likely to show good absorption and permeability when taken orally. Used as an early screen, it quickly filters millions of candidate molecules and reduces development cost.

Q4. What must you watch out for when using SMOTE in fraud detection?

Answer: SMOTE must be applied only to the training data, never to the validation or test sets.

Explanation: Applying SMOTE to the test set mixes synthetic samples into the evaluation and inflates measured performance. In addition, synthetic samples may not fully reflect real fraud patterns, so Precision-Recall AUC or F1 Score are more appropriate evaluation metrics than Accuracy.

Q5. Why is AlphaFold2 revolutionary for drug discovery?

Answer: Because it predicts a protein's 3D structure at atomic-level precision from its amino acid sequence alone, shrinking a structure determination process that used to take years down to minutes.

Explanation: In drug discovery, the target protein's 3D structure is essential for identifying drug binding sites. Before AlphaFold2, experimental methods such as X-ray crystallography and cryo-EM were required. DeepMind has released predictions for more than 200 million protein structures as a public database.

AI for Healthcare & Finance: From Medical Imaging to Algorithmic Trading

Overview

Artificial intelligence is driving unprecedented transformation in two of the most consequential industries: healthcare and finance. From deep learning models that assist radiologists in reading medical images, to algorithmic trading systems that execute thousands of transactions per second, AI has become core infrastructure in both fields. This guide explores key applications in each domain with practical, production-oriented code examples.


Part 1: Healthcare AI

1. Medical Imaging AI

Medical image analysis is one of the highest-impact areas for deep learning. CheXNet (Stanford) demonstrated radiologist-level pneumonia detection from chest X-rays, establishing a benchmark for AI-assisted diagnostics that has driven extensive follow-on research.

Processing DICOM Images

DICOM is the standard format for medical imaging. The following pipeline loads and preprocesses a DICOM file for inference.

import pydicom
import numpy as np
from torchvision import transforms
import torch

# Load DICOM image
ds = pydicom.dcmread('chest_xray.dcm')
img_array = ds.pixel_array.astype(np.float32)

# Normalize pixel values (min-max normalization)
img_normalized = (img_array - img_array.min()) / (img_array.max() - img_array.min())

# Preprocessing for model inference
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229])
])

tensor = transform(img_normalized).unsqueeze(0)

Medical Image Segmentation with U-Net

U-Net is the dominant architecture for segmenting anatomical structures and pathologies in MRI and CT scans.

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(128, 256)
        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec1 = DoubleConv(256, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = DoubleConv(128, 64)
        self.out_conv = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))
        return self.out_conv(d2)

Evaluation Metrics: Dice Score and IoU (Intersection over Union) are the primary metrics for medical segmentation. Dice Score measures the overlap between prediction and ground truth, and is far more informative than accuracy when dealing with the severe foreground-background imbalance typical in medical images.

Whole Slide Images (WSI): Pathology slides can be gigapixel images. The standard approach is multi-instance learning (MIL) with patch extraction — the slide is divided into tiles, each tile is embedded by a CNN, and the tile embeddings are aggregated (e.g., attention-based pooling) to produce a slide-level prediction.


2. Clinical NLP

Electronic Health Records (EHR) contain vast amounts of unstructured text. NLP models extract clinically meaningful information from physician notes, discharge summaries, and radiology reports.

Medical Named Entity Recognition

from transformers import pipeline

# BioBERT-based NER pipeline
ner_pipeline = pipeline(
    "ner",
    model="dmis-lab/biobert-v1.1",
    tokenizer="dmis-lab/biobert-v1.1",
    aggregation_strategy="simple"
)

clinical_text = """
Patient presents with acute myocardial infarction.
Administered aspirin 325mg and clopidogrel 600mg.
ECG shows ST elevation in leads V1-V4.
"""

entities = ner_pipeline(clinical_text)
for ent in entities:
    print(f"Entity: {ent['word']}, Label: {ent['entity_group']}, Score: {ent['score']:.3f}")

Clinical Text Summarization

Med-PaLM 2 (Google) and BioMedLM (Stanford) are domain-specific large language models fine-tuned on clinical and biomedical text. They significantly outperform general-purpose models on tasks such as clinical question answering, discharge summary generation, and medical knowledge retrieval.

PhysioNet & MIMIC-III: The MIMIC-III Clinical Database (available at PhysioNet: https://physionet.org/content/mimiciii/) is the standard benchmark for clinical NLP research. It contains de-identified EHR data including discharge summaries, radiology reports, and nursing notes from over 40,000 ICU patients.


3. AI-Accelerated Drug Discovery

Traditional drug development takes 10-15 years and costs billions of dollars. AI is accelerating multiple stages of this pipeline.

Molecular Property Prediction with RDKit

from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd

def compute_molecular_features(smiles_list):
    features = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            features.append({
                'MW': Descriptors.MolWt(mol),
                'LogP': Descriptors.MolLogP(mol),
                'HBD': Descriptors.NumHDonors(mol),
                'HBA': Descriptors.NumHAcceptors(mol),
                'TPSA': Descriptors.TPSA(mol)
            })
    return pd.DataFrame(features)

def lipinski_filter(df):
    """Lipinski's Rule of Five: filter for oral bioavailability"""
    return df[
        (df['MW'] <= 500) &
        (df['LogP'] <= 5) &
        (df['HBD'] <= 5) &
        (df['HBA'] <= 10)
    ]

smiles_list = [
    'CC(=O)Oc1ccccc1C(=O)O',  # Aspirin
    'CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C',  # Testosterone
]
df = compute_molecular_features(smiles_list)
drug_candidates = lipinski_filter(df)

AlphaFold2 and Protein Structure Prediction

DeepMind's AlphaFold2 (https://alphafold.ebi.ac.uk/) predicts 3D protein structures from amino acid sequences with near-experimental accuracy. It has fundamentally changed structural biology by solving the protein folding problem — a grand challenge that resisted solution for over 50 years.

# ColabFold (AlphaFold2 interface) usage example
# pip install colabfold

from colabfold.batch import get_queries, run

queries = [("target_protein", "MKTIIALSYIFCLVFA")]

results = run(
    queries=queries,
    result_dir="./alphafold_results",
    use_templates=False,
    num_recycles=3,
    model_type="auto"
)

DeepMind has publicly released over 200 million predicted protein structures. Drug developers use these structures for virtual screening and structure-based drug design, dramatically reducing the time needed to identify promising binding sites.

Graph Neural Networks for Molecular Generation

GNNs represent molecules as graphs — atoms as nodes, bonds as edges — and learn chemical patterns for property prediction and de novo molecular generation.

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

class MoleculeGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = torch.nn.Linear(hidden_dim, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.relu(self.conv1(x, edge_index))
        x = self.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)
        return self.classifier(x)

4. Wearables and AI

Consumer wearables generate continuous streams of physiological data that enable real-time health monitoring at scale.

ECG Arrhythmia Classification with 1D CNN

import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """1D CNN for ECG arrhythmia classification"""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, padding=5),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(32)
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 32, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

# PhysioNet MIT-BIH Arrhythmia Database
# Reference: https://physionet.org/content/mitdb/1.0.0/

Human Activity Recognition (HAR): Accelerometer and gyroscope data from wearables can classify activities such as walking, running, and climbing stairs. Transformer-based models have recently surpassed CNN/LSTM baselines on HAR benchmarks.


Part 2: Finance AI

5. Stock Price Prediction

Stock price prediction is one of the most studied but also most difficult problems in finance. The Efficient Market Hypothesis (EMH) posits that public information is already priced in — making consistent alpha generation from price data alone extremely challenging.

Feature Engineering with Technical Indicators

import yfinance as yf
import pandas as pd
import numpy as np

# Download historical price data
# Reference: https://pypi.org/project/yfinance/
ticker = yf.Ticker("AAPL")
df = ticker.history(period="5y")

def compute_rsi(prices, window=14):
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

def add_technical_indicators(df):
    df['SMA20'] = df['Close'].rolling(20).mean()
    df['SMA50'] = df['Close'].rolling(50).mean()
    df['EMA12'] = df['Close'].ewm(span=12).mean()
    df['EMA26'] = df['Close'].ewm(span=26).mean()
    df['MACD'] = df['EMA12'] - df['EMA26']
    df['Signal'] = df['MACD'].ewm(span=9).mean()
    df['RSI'] = compute_rsi(df['Close'], 14)
    df['BB_upper'] = df['SMA20'] + 2 * df['Close'].rolling(20).std()
    df['BB_lower'] = df['SMA20'] - 2 * df['Close'].rolling(20).std()
    df['Volatility'] = df['Close'].pct_change().rolling(20).std()
    df['Volume_MA'] = df['Volume'].rolling(20).mean()
    return df.dropna()

df = add_technical_indicators(df)

LSTM for Time Series Forecasting

import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

class StockLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=128, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

# Walk-Forward cross-validation (prevents look-ahead bias)
def walk_forward_split(df, train_size=0.7, val_size=0.15):
    n = len(df)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    return df[:train_end], df[train_end:val_end], df[val_end:]

Critical Warning: Never use random train/test splits on time series data. This introduces look-ahead bias — future data leaks into the training set — and produces backtesting results that are far better than what you would achieve in live trading.


6. Algorithmic Trading

Moving Average Crossover Strategy with Backtrader

import backtrader as bt
# Reference: https://www.backtrader.com/docu/

class MACrossStrategy(bt.Strategy):
    params = (('fast', 10), ('slow', 30),)

    def __init__(self):
        self.sma_fast = bt.indicators.SMA(period=self.p.fast)
        self.sma_slow = bt.indicators.SMA(period=self.p.slow)
        self.crossover = bt.indicators.CrossOver(self.sma_fast, self.sma_slow)

    def next(self):
        if not self.position:
            if self.crossover > 0:
                self.buy(size=100)
        elif self.crossover < 0:
            self.sell(size=100)

# Run backtest
cerebro = bt.Cerebro()
cerebro.addstrategy(MACrossStrategy)
cerebro.broker.setcash(100000.0)
cerebro.broker.setcommission(commission=0.001)
results = cerebro.run()
print(f"Final Portfolio Value: {cerebro.broker.getvalue():.2f}")

Markowitz Mean-Variance Portfolio Optimization

import numpy as np
from scipy.optimize import minimize

def portfolio_stats(weights, returns):
    port_return = np.sum(returns.mean() * weights) * 252
    port_vol = np.sqrt(weights @ returns.cov() @ weights * 252)
    sharpe = port_return / port_vol
    return port_return, port_vol, sharpe

def min_variance_portfolio(returns):
    n = returns.shape[1]
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0, 1)] * n
    result = minimize(
        lambda w: portfolio_stats(w, returns)[1],
        x0=np.ones(n) / n,
        bounds=bounds,
        constraints=constraints
    )
    return result.x
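As a sanity check, the optimizer can be run end to end on synthetic data. The snippet below repeats the minimum-variance helper in compact form so it is runnable on its own; the four assets and their returns are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def min_variance_portfolio(returns):
    """Minimize annualized volatility subject to full investment, long-only."""
    n = returns.shape[1]
    cov = returns.cov().values * 252  # annualized covariance
    result = minimize(
        lambda w: np.sqrt(w @ cov @ w),
        x0=np.ones(n) / n,
        bounds=[(0, 1)] * n,
        constraints={'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
    )
    return result.x

rng = np.random.default_rng(42)
# Synthetic daily returns for 4 hypothetical assets
returns = pd.DataFrame(rng.normal(0.0005, 0.01, size=(500, 4)),
                       columns=['A', 'B', 'C', 'D'])
w = min_variance_portfolio(returns)
print(np.round(w, 3))  # weights sum to 1, each in [0, 1]
```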

Reinforcement Learning for Trading (PPO)

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class TradingEnv(gym.Env):
    """Simplified stock trading environment for RL (Gymnasium API)"""
    def __init__(self, df, initial_balance=10000):
        super().__init__()
        self.df = df
        self.initial_balance = initial_balance
        # Actions: 0=Hold, 1=Buy, 2=Sell
        self.action_space = gym.spaces.Discrete(3)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(10,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        return self._get_obs(), {}

    def _get_obs(self):
        row = self.df.iloc[self.current_step]
        return np.array([
            row['Close'], row['SMA20'], row['SMA50'],
            row['RSI'], row['MACD'], row['Volatility'],
            self.balance, self.shares,
            row['Volume'], row['BB_upper'] - row['BB_lower']
        ], dtype=np.float32)

    def step(self, action):
        price = self.df.iloc[self.current_step]['Close']
        reward = 0
        if action == 1 and self.balance >= price:
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:
            self.shares -= 1
            self.balance += price
            reward = price - self.df.iloc[max(0, self.current_step - 1)]['Close']
        self.current_step += 1
        terminated = self.current_step >= len(self.df) - 1
        return self._get_obs(), reward, terminated, False, {}

# df must contain the feature columns read in _get_obs (Close, SMA20, ...)
model = PPO("MlpPolicy", TradingEnv(df), verbose=1)
model.learn(total_timesteps=100000)

7. Fraud Detection

Credit card fraud detection involves extreme class imbalance: fraudulent transactions typically represent less than 0.1% of all transactions.
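At that imbalance, plain accuracy is meaningless: a model that never flags anything still scores 99.9%. A minimal sketch with synthetic labels makes the point:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 0.1% fraud rate (100 frauds in 100,000 transactions)
y_true = np.zeros(100_000, dtype=int)
y_true[:100] = 1

# A useless model that always predicts "not fraud"
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                  # 0.999 — looks excellent
print(f1_score(y_true, y_pred, zero_division=0))       # 0.0 — catches zero fraud
```

This is why Precision-Recall AUC and F1 on the fraud class, not accuracy, drive model selection here.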

SMOTE Oversampling and Isolation Forest

from sklearn.ensemble import IsolationForest
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np

# Oversample minority class — apply ONLY to training data
smote = SMOTE(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Unsupervised anomaly detection
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.001,
    random_state=42
)
iso_forest.fit(X_train)
anomaly_scores = iso_forest.decision_function(X_test)

# Model explainability with SHAP ('model' is a trained tree-based
# classifier, e.g. an XGBoost or LightGBM fraud model)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:100])
shap.summary_plot(shap_values, X_test[:100], feature_names=feature_names)

Autoencoder-Based Anomaly Detection

Train an autoencoder only on normal transactions. At inference time, fraudulent transactions have high reconstruction error because the autoencoder was never trained to reconstruct them.

import torch
import torch.nn as nn

class FraudAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def detect_fraud(model, x, threshold=0.5):  # tune threshold on validation errors
    with torch.no_grad():
        reconstructed = model(x)
    reconstruction_error = torch.mean((x - reconstructed) ** 2, dim=1)
    return reconstruction_error > threshold
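A minimal training and threshold-selection sketch. It uses a tiny stand-in autoencoder and random data so it runs anywhere; in practice you would train `FraudAutoencoder` above on real normal transactions. Choosing the threshold from the percentiles of normal-data reconstruction error is a common alternative to a hard-coded constant:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim = 30
x_normal = torch.randn(512, input_dim)  # stand-in for normal transactions

# Tiny stand-in autoencoder (use FraudAutoencoder above in practice)
model = nn.Sequential(nn.Linear(input_dim, 8), nn.ReLU(), nn.Linear(8, input_dim))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train on normal data ONLY — fraud must never enter this loop
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_normal), x_normal)
    loss.backward()
    optimizer.step()

# Pick the threshold from the distribution of normal reconstruction
# errors, e.g. the 99th percentile
with torch.no_grad():
    errors = ((x_normal - model(x_normal)) ** 2).mean(dim=1)
threshold = torch.quantile(errors, 0.99).item()
print(f"threshold = {threshold:.4f}")
```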

8. Credit Risk Modeling

Under Basel III, financial institutions must quantify credit risk. Expected Loss is computed as:

  • EL = PD × LGD × EAD
    • PD (Probability of Default): likelihood the borrower will default
    • LGD (Loss Given Default): fraction of exposure lost in default
    • EAD (Exposure at Default): outstanding balance at time of default

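A worked number with illustrative inputs (PD = 2%, LGD = 45%, EAD = $100,000):

```python
# Expected Loss = PD × LGD × EAD, with illustrative inputs
pd_ = 0.02        # probability of default
lgd = 0.45        # loss given default (fraction of exposure)
ead = 100_000.0   # exposure at default

expected_loss = pd_ * lgd * ead
print(f"EL = ${expected_loss:,.2f}")  # EL = $900.00
```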
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb

# PD model with probability calibration
pd_model = LogisticRegression(class_weight='balanced', max_iter=1000)
pd_calibrated = CalibratedClassifierCV(pd_model, cv=5)
pd_calibrated.fit(X_train, y_train)
pd_scores = pd_calibrated.predict_proba(X_test)[:, 1]

# Gradient Boosting scorecard
lgb_model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    class_weight='balanced'
)
lgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Survival Analysis for time-to-default modeling
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df_survival, duration_col='time_to_default', event_col='defaulted')
cph.print_summary()

9. NLP for Finance

News Sentiment Analysis with FinBERT

from transformers import pipeline
import pandas as pd

# FinBERT: financial domain BERT
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="ProsusAI/finbert"
)

def analyze_news_sentiment(headlines):
    results = sentiment_pipeline(headlines)
    return pd.DataFrame([
        {'headline': h, 'label': r['label'], 'score': r['score']}
        for h, r in zip(headlines, results)
    ])

headlines = [
    "Apple reports record quarterly earnings",
    "Federal Reserve signals rate hike ahead",
    "Tech sector faces regulatory headwinds"
]

df_sentiment = analyze_news_sentiment(headlines)
print(df_sentiment)

Parsing SEC EDGAR Filings

import requests
from bs4 import BeautifulSoup
import re

def fetch_10k_section(cik, accession_number):
    """Retrieve Risk Factors section from an SEC 10-K filing"""
    base_url = "https://www.sec.gov/Archives/edgar"
    url = f"{base_url}/{cik}/{accession_number}/10k.htm"
    response = requests.get(url, headers={'User-Agent': 'research@example.com'})
    soup = BeautifulSoup(response.text, 'html.parser')
    risk_section = soup.find(string=re.compile("Risk Factors", re.IGNORECASE))
    return risk_section.parent.get_text() if risk_section else ""

ESG Scoring: LLMs are increasingly used to extract ESG (Environmental, Social, Governance) signals from sustainability reports, news articles, and earnings call transcripts. These signals feed directly into socially responsible investing (SRI) portfolios.
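A full LLM pipeline is beyond a snippet, but the shape of the task — tag each passage against the E/S/G pillars and aggregate into a score — can be sketched with a trivial keyword matcher. The keyword buckets below are purely illustrative; real systems replace this matcher with an LLM or a FinBERT-style classifier:

```python
import re

# Purely illustrative keyword buckets, not a production taxonomy
ESG_KEYWORDS = {
    "E": ["emissions", "climate", "renewable", "waste"],
    "S": ["diversity", "labor", "community", "safety"],
    "G": ["board", "audit", "compensation", "disclosure"],
}

def esg_signal(text):
    """Count keyword hits per ESG pillar in a report excerpt."""
    text = text.lower()
    return {pillar: sum(len(re.findall(rf"\b{kw}\b", text)) for kw in kws)
            for pillar, kws in ESG_KEYWORDS.items()}

excerpt = ("The board strengthened audit oversight while cutting emissions "
           "and investing in renewable energy and workforce safety.")
print(esg_signal(excerpt))  # {'E': 2, 'S': 1, 'G': 2}
```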


Key Considerations

Healthcare

  1. Regulatory approval: In the US, AI software used for clinical decision-making falls under FDA's Software as a Medical Device (SaMD) framework and requires pre-market approval or clearance.
  2. HIPAA compliance: Any system handling Protected Health Information (PHI) must implement de-identification, access control, and audit logging.
  3. Algorithmic bias: Models trained on data skewed toward specific demographics may perform poorly for underrepresented groups — a direct patient safety concern.
  4. Clinical validation: Strong in-silico performance does not guarantee real-world efficacy. Prospective clinical trials remain essential.

Finance

  1. Look-ahead bias: Strict temporal discipline in data splits is non-negotiable. Even small leakage can make a worthless strategy appear profitable.
  2. Overfitting: Financial data is noisy and non-stationary. Complex models often fail out-of-sample when market regimes shift.
  3. Regime change: Strategies that worked well historically can fail suddenly when market structure, volatility, or correlations change.
  4. Transaction costs: Backtests must model slippage, commission, and market impact. Ignoring these can make an unprofitable strategy look excellent.
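The cost drag compounds fast. With hypothetical numbers — a strategy with a 4 bps average daily gross edge that trades once a day at 5 bps round-trip cost — a year of compounding flips the sign:

```python
# Gross vs net annual performance of a hypothetical daily-trading strategy
daily_gross = 0.0004      # 4 bps average gross edge per day (illustrative)
cost_per_trade = 0.0005   # 5 bps round-trip cost: slippage + commission + impact
days = 252                # trading days per year

gross = (1 + daily_gross) ** days - 1
net = (1 + daily_gross - cost_per_trade) ** days - 1
print(f"gross: {gross:+.2%}  net: {net:+.2%}")  # ~+10.6% gross becomes ~-2.5% net
```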

Quiz

Q1. Why is look-ahead bias especially dangerous in financial backtesting?

Answer: It occurs when information unavailable at the time of the trading decision is used in model training or signal generation, producing artificially inflated backtest performance.

Explanation: For example, using end-of-day closing prices to generate same-day trading signals is look-ahead bias because you cannot trade at a price until it is known. Walk-forward validation and strict chronological splitting prevent this. Strategies built on leaked data may appear to generate significant alpha but will fail immediately in live trading, leading to substantial real financial losses.

Q2. Why is Dice Score preferred over accuracy for medical image segmentation?

Answer: Severe class imbalance between background and foreground pixels makes accuracy a misleading metric — a model that always predicts background achieves very high accuracy while being clinically useless.

Explanation: In a brain MRI where a tumor occupies 1% of voxels, always predicting "no tumor" achieves 99% accuracy. Dice Score is defined as Dice = 2TP / (2TP + FP + FN), which directly measures overlap between prediction and ground truth and correctly penalizes missing the small foreground class.
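The contrast is easy to verify in code with a toy mask where the "tumor" covers exactly 1% of pixels:

```python
import numpy as np

def dice_score(pred, target):
    """Dice = 2*TP / (2*TP + FP + FN) for binary masks."""
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return 2 * tp / (2 * tp + fp + fn)

# 100x100 mask with a 10x10 "tumor" (1% of pixels)
target = np.zeros((100, 100), dtype=bool)
target[:10, :10] = True

always_background = np.zeros_like(target)
print((always_background == target).mean())   # accuracy: 0.99
print(dice_score(always_background, target))  # Dice: 0.0
```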

Q3. What is Lipinski's Rule of Five and why is it used in drug discovery?

Answer: An empirical guideline for filtering drug candidates likely to have acceptable oral bioavailability, based on four physicochemical properties: molecular weight under 500 Da, LogP under 5, hydrogen bond donors under 5, hydrogen bond acceptors under 10.

Explanation: Poor oral bioavailability is a major cause of late-stage clinical trial failure. Applying the Rule of Five early in the screening pipeline eliminates molecules that are unlikely to be absorbed, dramatically reducing the number of candidates that proceed to expensive in-vitro and in-vivo testing.

Q4. What is the key pitfall when applying SMOTE to fraud detection datasets?

Answer: SMOTE must be applied only to training data. Applying it to validation or test sets introduces synthetic samples that inflate performance metrics and do not reflect real-world distribution.

Explanation: Precision-Recall AUC and F1 Score are more appropriate evaluation metrics than overall accuracy for fraud detection because they explicitly measure performance on the minority (fraud) class. Even on the training set, SMOTE-generated samples may not capture the true distribution of fraud patterns, so the technique should be combined with robust out-of-sample evaluation.

Q5. Why was AlphaFold2 a breakthrough for drug discovery specifically?

Answer: AlphaFold2 predicts 3D protein structures from amino acid sequences at near-experimental accuracy, reducing structure determination from years of laboratory work to minutes of compute time.

Explanation: Structure-based drug design requires knowing the 3D shape of the target protein — especially binding pockets — to design molecules that fit precisely. Before AlphaFold2, structures had to be determined experimentally via X-ray crystallography or cryo-EM, which is slow, expensive, and not always successful. DeepMind has released over 200 million predicted structures in a public database, enabling virtual screening at unprecedented scale.