The Complete Guide to Healthcare & Finance AI: From Medical Image Analysis to Algorithmic Trading

Overview
Part 1: Healthcare AI
Part 2: Finance AI
Key Cautions for Healthcare & Finance AI
- Healthcare
- Finance
References
Quiz

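Overview

AI is driving major change in two industries, healthcare and finance. From the deep learning models that assist physicians in reading medical images to the algorithmic trading systems that process thousands of orders per second, AI has become core infrastructure in both fields. This guide walks through the main applications in each field, with working code sketches.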

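Part 1: Healthcare AI

1. Medical Imaging AI

Medical image analysis is one of the areas where deep learning has had the greatest impact. CheXNet (Stanford) was reported to detect pneumonia on chest X-rays at a level comparable to practicing radiologists, and it became a reference point for much subsequent medical imaging research.

DICOM Image Processing

A basic pipeline for processing DICOM, the standard format for medical images, in Python.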

import pydicom
import numpy as np
from torchvision import transforms
import torch

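# Load the DICOM file and read the pixel data
ds = pydicom.dcmread('chest_xray.dcm')
img_array = ds.pixel_array.astype(np.float32)

# Min-max normalization to [0, 1] (this is not HU windowing);
# guard against a constant image to avoid division by zero
value_range = float(img_array.max() - img_array.min())
img_normalized = (img_array - img_array.min()) / max(value_range, 1e-8)

# Preprocessing for model inference
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Single-channel mean/std borrowed from ImageNet; dataset-specific values are preferable
    transforms.Normalize([0.485], [0.229])
])

# Add a batch dimension: (1, 1, 224, 224)
tensor = transform(img_normalized).unsqueeze(0)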
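
The normalization above rescales by the image's own minimum and maximum. For CT, where pixel values map to Hounsfield units (HU), a fixed display window is often applied before scaling instead. The following is a minimal sketch under stated assumptions: `apply_window` is a hypothetical helper, the slope and intercept are assumed to come from the DICOM `RescaleSlope` and `RescaleIntercept` attributes, and the lung window (center -600, width 1500) is an illustrative choice rather than a value from this guide.

```python
import numpy as np

def apply_window(raw, slope, intercept, center, width):
    """Convert raw CT values to HU, clip to a window, and scale to [0, 1]."""
    hu = raw.astype(np.float32) * slope + intercept
    low, high = center - width / 2.0, center + width / 2.0
    clipped = np.clip(hu, low, high)
    return (clipped - low) / (high - low)

# Tiny synthetic "image" of raw stored values (hypothetical numbers)
raw = np.array([[0, 1024], [424, 2000]], dtype=np.int16)
windowed = apply_window(raw, slope=1.0, intercept=-1024.0,
                        center=-600.0, width=1500.0)
print(windowed)
```

Values below the window map to 0 and values above it map to 1, so contrast is spent on the tissue range of interest rather than on outliers such as metal or air.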

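Medical Image Segmentation with U-Net

The U-Net architecture is widely used to automatically segment regions of interest, such as tumors and organs, in MRI and CT scans. The version below is a reduced two-level variant; input height and width should be divisible by 4 so that the skip connections line up.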

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(128, 256)
        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec1 = DoubleConv(256, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = DoubleConv(128, 64)
        self.out_conv = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))
        return self.out_conv(d2)

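Evaluation metrics: Dice Score and IoU (Intersection over Union) are the main metrics in medical segmentation. Dice measures the overlap between the predicted and ground-truth masks, 2|A∩B| / (|A| + |B|), which makes it far more informative than accuracy when background pixels vastly outnumber foreground pixels.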
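
To make the contrast with accuracy concrete, here is a small sketch that computes Dice and IoU for binary masks with NumPy. The helper name `dice_and_iou` and the toy masks are assumptions made for illustration.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Dice and IoU for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)

# 10x10 image: 4 true foreground pixels, 4 predicted, 2 of them overlapping
target = np.zeros((10, 10), dtype=np.uint8)
target[0, :4] = 1
pred = np.zeros((10, 10), dtype=np.uint8)
pred[0, 2:6] = 1

dice, iou = dice_and_iou(pred, target)
accuracy = float((pred == target).mean())
print(f"Dice={dice:.3f}  IoU={iou:.3f}  Accuracy={accuracy:.3f}")
```

Accuracy stays high because 96 of the 100 pixels are labeled correctly, mostly background, while Dice and IoU expose that only half of the predicted region overlaps the target.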

2. Clinical NLP

Electronic health records (EHRs) contain vast amounts of unstructured text. NLP models extract clinically meaningful information from this data.

Medical Named Entity Recognition (NER)

from transformers import pipeline

# NER pipeline built on BioBERT.
# Note: this is a pretrained base checkpoint. Without a fine-tuned
# token-classification head its predicted labels are not meaningful, so use
# a checkpoint fine-tuned for clinical or biomedical NER in practice.
model_name = "dmis-lab/biobert-v1.1"
ner_pipeline = pipeline(
    "ner",
    model=model_name,
    tokenizer=model_name,
    aggregation_strategy="simple"
)

clinical_text = """
Patient presents with acute myocardial infarction.
Administered aspirin 325mg and clopidogrel 600mg.
ECG shows ST elevation in leads V1-V4.
"""

entities = ner_pipeline(clinical_text)
for ent in entities:
    print(f"Entity: {ent['word']}, Label: {ent['entity_group']}, Score: {ent['score']:.3f}")

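Clinical Text Summarization (with Med-PaLM, BioMedLM)

MIMIC-III, hosted on PhysioNet, is a widely used benchmark dataset for clinical NLP research. It contains a range of clinical text, including discharge summaries, radiology reports, and nursing notes, and researchers who have obtained credentialed access use it to train models.

3. AI for Drug Discovery

Developing a new drug takes 10 to 15 years on average and costs billions of dollars. AI is accelerating several stages of this process.

Molecular Property Prediction (with RDKit)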

from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd

def compute_molecular_features(smiles_list):
    """Compute Lipinski-relevant descriptors; unparsable SMILES are skipped."""
    features = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
        if mol is not None:
            features.append({
                'SMILES': smiles,  # keep the input so rows stay traceable
                'MW': Descriptors.MolWt(mol),
                'LogP': Descriptors.MolLogP(mol),
                'HBD': Descriptors.NumHDonors(mol),
                'HBA': Descriptors.NumHAcceptors(mol),
                'TPSA': Descriptors.TPSA(mol)
            })
    return pd.DataFrame(features)

def lipinski_filter(df):
    """Lipinski's Rule of Five: a filter for likely oral bioavailability.

    This strict version requires all four criteria to hold.
    """
    return df[
        (df['MW'] <= 500) &
        (df['LogP'] <= 5) &
        (df['HBD'] <= 5) &
        (df['HBA'] <= 10)
    ]

# Example SMILES list
smiles_list = [
    'CC(=O)Oc1ccccc1C(=O)O',  # Aspirin
    'CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C',  # Testosterone
]
df = compute_molecular_features(smiles_list)
drug_candidates = lipinski_filter(df)

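AlphaFold2 and Protein Structure Prediction

DeepMind's AlphaFold2 is a landmark model that predicts 3D structure from a protein's amino acid sequence. It is widely regarded as having largely solved the decades-old single-chain structure prediction problem (the protein folding problem), which has changed how structure-based drug discovery is approached.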

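# Example using ColabFold (an interface to AlphaFold2)
# pip install "colabfold[alphafold]"
# Note: the arguments of run() differ between ColabFold releases. Treat this
# as a sketch and check the signature of the version you have installed.

from colabfold.batch import run

# Each query is (job name, amino acid sequence, optional precomputed MSA)
queries = [("target_protein", "MKTIIALSYIFCLVFA", None)]

results = run(
    queries=queries,
    result_dir="./alphafold_results",
    use_templates=False,
    num_models=1,       # number of AlphaFold2 models to run
    is_complex=False,   # single-chain (monomer) prediction
    num_recycles=3,
    model_type="auto"
)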

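Molecule Generation with GNNs

Graph neural networks (GNNs) represent a molecule as a graph of atoms (nodes) and bonds (edges) and are used in pipelines that generate and screen new drug candidates. The model below is a graph-level classifier that predicts a property for a whole molecule, so it scores candidates rather than generating them.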

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

class MoleculeGNN(torch.nn.Module):
    def __init__(self, num_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = torch.nn.Linear(hidden_dim, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.relu(self.conv1(x, edge_index))
        x = self.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)
        return self.classifier(x)

4. Wearable Data and AI

Biosignal data collected from smartwatches and other wearable devices is enabling continuous, real-time health monitoring.

ECG Arrhythmia Classification (1D CNN)

import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """1D CNN for ECG arrhythmia classification"""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, padding=5),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(32)
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 32, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

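# Uses the PhysioNet MIT-BIH Arrhythmia Database
# https://physionet.org/content/mitdb/1.0.0/

Main ECG classes: with MIT-BIH, beats are commonly grouped into the five AAMI classes N (normal), S (supraventricular ectopic), V (ventricular ectopic), F (fusion), and Q (unknown), which matches num_classes=5 above. The labels Normal (N), Atrial Fibrillation (A), Other (O), and Noisy (~) belong to a different dataset, the PhysioNet/CinC Challenge 2017 single-lead recordings.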

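Part 2: Finance AI

5. Stock Price Prediction Models

Stock price prediction is the most widely studied area of finance AI and also one of the hardest. According to the efficient market hypothesis (EMH), public information is already reflected in prices, so earning consistent excess returns from price data alone is extremely difficult.

Feature Engineering with Technical Indicators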

import yfinance as yf
import pandas as pd
import numpy as np

# Download price data with yfinance
# Reference: https://pypi.org/project/yfinance/
ticker = yf.Ticker("AAPL")
df = ticker.history(period="5y")

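def compute_rsi(prices, window=14):
    # RSI using simple rolling means of gains and losses
    # (Wilder's original definition uses exponential smoothing instead)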
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

def add_technical_indicators(df):
    df['SMA20'] = df['Close'].rolling(20).mean()
    df['SMA50'] = df['Close'].rolling(50).mean()
    df['EMA12'] = df['Close'].ewm(span=12).mean()
    df['EMA26'] = df['Close'].ewm(span=26).mean()
    df['MACD'] = df['EMA12'] - df['EMA26']
    df['Signal'] = df['MACD'].ewm(span=9).mean()
    df['RSI'] = compute_rsi(df['Close'], 14)
    df['BB_upper'] = df['SMA20'] + 2 * df['Close'].rolling(20).std()
    df['BB_lower'] = df['SMA20'] - 2 * df['Close'].rolling(20).std()
    df['Volatility'] = df['Close'].pct_change().rolling(20).std()
    df['Volume_MA'] = df['Volume'].rolling(20).mean()
    return df.dropna()

df = add_technical_indicators(df)

LSTM-Based Time Series Forecasting

import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

class StockLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=128, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

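# Chronological train/validation/test split (prevents look-ahead bias).
# This is a single hold-out split; full walk-forward validation repeats
# the train/test step over successive time windows.
def walk_forward_split(df, train_size=0.7, val_size=0.15):
    n = len(df)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    return df[:train_end], df[train_end:val_end], df[val_end:]

Important: never use a random split on time series data. It puts future observations into the training set, and the resulting look-ahead bias makes backtest results look far better than live performance would be.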
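
The single chronological split above can be extended to rolling walk-forward validation, where the model is re-trained and re-tested on successive windows. The following is a minimal index-only sketch; the generator name `rolling_walk_forward_splits` and the window sizes are assumptions chosen for illustration.

```python
def rolling_walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) ranges for rolling-window validation.

    Every test window starts immediately after its training window, so no
    training index is ever later than a test index.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = range(start, start + train_size)
        test_idx = range(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

folds = list(rolling_walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in folds:
    print(list(train_idx), "->", list(test_idx))
```

With a DataFrame, each fold can be materialized with `df.iloc[list(train_idx)]` and `df.iloc[list(test_idx)]`. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window scheme if a library implementation is preferred.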

6. Algorithmic Trading

Moving Average Crossover Strategy with Backtrader

import backtrader as bt
# Reference: https://www.backtrader.com/docu/

class MACrossStrategy(bt.Strategy):
    params = (('fast', 10), ('slow', 30),)

    def __init__(self):
        self.sma_fast = bt.indicators.SMA(period=self.p.fast)
        self.sma_slow = bt.indicators.SMA(period=self.p.slow)
        self.crossover = bt.indicators.CrossOver(self.sma_fast, self.sma_slow)

    def next(self):
        if not self.position:
            if self.crossover > 0:
                self.buy(size=100)
        elif self.crossover < 0:
            self.sell(size=100)

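# Run the backtest
cerebro = bt.Cerebro()
cerebro.addstrategy(MACrossStrategy)
# A data feed must be added before run(). `price_df` is assumed to be an OHLCV
# DataFrame with a DatetimeIndex (for example, the yfinance data loaded earlier;
# a timezone-naive index is the safer choice).
cerebro.adddata(bt.feeds.PandasData(dataname=price_df))
cerebro.broker.setcash(100000.0)
cerebro.broker.setcommission(commission=0.001)  # 0.1% per trade
results = cerebro.run()
print(f"Final Portfolio Value: {cerebro.broker.getvalue():.2f}")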

Portfolio Optimization (Markowitz Mean-Variance)

import numpy as np
from scipy.optimize import minimize

def portfolio_stats(weights, returns):
    port_return = np.sum(returns.mean() * weights) * 252
    port_vol = np.sqrt(weights @ returns.cov() @ weights * 252)
    sharpe = port_return / port_vol  # Sharpe ratio with the risk-free rate assumed to be zero
    return port_return, port_vol, sharpe

def min_variance(returns):
    n = returns.shape[1]
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0, 1)] * n
    result = minimize(
        lambda w: portfolio_stats(w, returns)[1],
        x0=np.ones(n) / n,
        bounds=bounds,
        constraints=constraints
    )
    return result.x
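
As a cross-check for the numerical optimizer in `min_variance`, the global minimum-variance portfolio has a closed form when short positions are allowed: w = C⁻¹1 / (1ᵀC⁻¹1), where C is the covariance matrix. The sketch below is an illustration under that assumption; it ignores the long-only bounds used above, so the two results agree only when the closed-form weights are all non-negative. The helper name and the covariance values are hypothetical.

```python
import numpy as np

def closed_form_min_variance(cov):
    """Global minimum-variance weights (shorting allowed), summing to 1."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)  # equals inv(cov) @ ones without forming the inverse
    return raw / raw.sum()

# Hypothetical annualized covariance matrix for two assets
cov = np.array([[0.04, 0.006],
                [0.006, 0.09]])
weights = closed_form_min_variance(cov)
portfolio_variance = float(weights @ cov @ weights)
print(weights, portfolio_variance)
```

The lower-variance asset receives the larger weight, and the resulting portfolio variance is below that of either asset held alone.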

Reinforcement Learning-Based Trading (PPO)

import gymnasium as gym  # Stable-Baselines3 2.x expects the Gymnasium API
import numpy as np
from stable_baselines3 import PPO

class TradingEnv(gym.Env):
    """Simplified single-stock trading environment."""
    def __init__(self, df, initial_balance=10000):
        super().__init__()
        self.df = df.reset_index(drop=True)
        self.initial_balance = initial_balance
        self.action_space = gym.spaces.Discrete(3)  # 0: Hold, 1: Buy, 2: Sell
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(10,), dtype=np.float32
        )

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = float(self.initial_balance)
        self.shares = 0
        self.current_step = 0
        return self._get_obs(), {}

    def _get_obs(self):
        row = self.df.iloc[self.current_step]
        return np.array([
            row['Close'], row['SMA20'], row['SMA50'],
            row['RSI'], row['MACD'], row['Volatility'],
            self.balance, self.shares,
            row['Volume'], row['BB_upper'] - row['BB_lower']
        ], dtype=np.float32)

    def step(self, action):
        price = self.df.iloc[self.current_step]['Close']
        if action == 1 and self.balance >= price:    # buy one share
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:        # sell one share
            self.shares -= 1
            self.balance += price
        self.current_step += 1
        next_price = self.df.iloc[self.current_step]['Close']
        # Reward: one-step change in portfolio value (mark-to-market P&L)
        reward = float(self.shares * (next_price - price))
        terminated = self.current_step >= len(self.df) - 1
        return self._get_obs(), reward, terminated, False, {}

# Training sketch (df must contain the indicator columns computed earlier):
# model = PPO("MlpPolicy", TradingEnv(df), verbose=0)
# model.learn(total_timesteps=10_000)

7. Fraud Detection

Credit card fraud detection poses a particular challenge: extreme class imbalance (for example, 99.9% legitimate transactions versus 0.1% fraudulent ones).
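
A short worked example shows why accuracy is a poor yardstick at this level of imbalance. The confusion-matrix counts below are hypothetical and chosen only to illustrate the arithmetic for 100,000 transactions of which 100 are fraudulent.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# A "model" that labels every transaction as legitimate
always_normal = confusion_metrics(tp=0, fp=0, fn=100, tn=99_900)

# A hypothetical detector that catches 80 of 100 frauds with 400 false alarms
simple_model = confusion_metrics(tp=80, fp=400, fn=20, tn=99_500)

print("always normal:", always_normal)
print("simple model: ", simple_model)
```

The do-nothing baseline reaches 99.9% accuracy while catching no fraud at all, and the detector that finds 80% of fraud actually scores lower on accuracy. Precision and recall make the trade-off visible.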

SMOTE and Isolation Forest

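from sklearn.ensemble import IsolationForest, RandomForestClassifier
from imblearn.over_sampling import SMOTE
import shap

# X_train, y_train, X_test, and feature_names are assumed to be defined already.

# Oversample the minority class (training data only)
smote = SMOTE(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Supervised classifier trained on the resampled data
# (a random forest is used here as an illustrative choice)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_resampled, y_resampled)

# Unsupervised anomaly detection on the original, non-resampled data
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.001,
    random_state=42
)
iso_forest.fit(X_train)
# decision_function: lower scores indicate more anomalous samples
anomaly_scores = iso_forest.decision_function(X_test)

# Explain the classifier with SHAP
# (the shape of shap_values for classifiers depends on the SHAP version)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:100])
shap.summary_plot(shap_values, X_test[:100], feature_names=feature_names)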

Autoencoder-Based Anomaly Detection

import torch
import torch.nn as nn

class FraudAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

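# Flag a transaction as fraud when its reconstruction error is large.
# The default threshold is a placeholder; in practice derive it from the
# error distribution of known-normal transactions.
def detect_fraud(model, x, threshold=0.5):
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        reconstructed = model(x)
        reconstruction_error = torch.mean((x - reconstructed) ** 2, dim=1)
    return reconstruction_error > threshold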
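
One common way to choose the threshold is to take a high percentile of the reconstruction errors measured on transactions known to be normal. The sketch below illustrates the idea with NumPy. The helper name is hypothetical, and the gamma-distributed errors are synthetic stand-ins for real per-sample errors from a trained autoencoder.

```python
import numpy as np

def threshold_from_normal_errors(errors, percentile=99.0):
    """Pick the error level exceeded by (100 - percentile)% of normal samples."""
    return float(np.percentile(errors, percentile))

rng = np.random.default_rng(0)
# Synthetic stand-in for reconstruction errors on normal validation data
normal_errors = rng.gamma(shape=2.0, scale=0.01, size=10_000)

threshold = threshold_from_normal_errors(normal_errors, percentile=99.0)
flag_rate = float((normal_errors > threshold).mean())
print(f"threshold={threshold:.4f}, flagged share of normal data={flag_rate:.4f}")
```

The percentile directly sets the expected false-alarm rate on normal traffic, which is usually easier to reason about with an operations team than a raw error value.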

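8. Credit Risk Models

Under the Basel III framework, financial institutions are required to quantify credit risk. Expected loss (EL) is calculated as follows.

EL = PD × LGD × EAD
- PD (Probability of Default): the probability that the borrower defaults
- LGD (Loss Given Default): the fraction of the exposure lost if default occurs
- EAD (Exposure at Default): the amount outstanding at the time of default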
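
The formula can be applied per loan and summed over a portfolio. The following worked example uses two hypothetical loans; all figures are invented for illustration.

```python
def expected_loss(prob_default, loss_given_default, exposure):
    """EL = PD x LGD x EAD for a single exposure."""
    return prob_default * loss_given_default * exposure

# Hypothetical loan book
loans = [
    {"id": "A", "pd": 0.02, "lgd": 0.45, "ead": 1_000_000},
    {"id": "B", "pd": 0.10, "lgd": 0.60, "ead": 250_000},
]

per_loan = {
    loan["id"]: expected_loss(loan["pd"], loan["lgd"], loan["ead"])
    for loan in loans
}
portfolio_el = sum(per_loan.values())
print(per_loan, portfolio_el)
```

Loan B is a quarter of the size of loan A but carries the larger expected loss, because its default probability and loss severity are both higher.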

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb

# PD model (logistic regression with probability calibration)
pd_model = LogisticRegression(class_weight='balanced', max_iter=1000)
pd_model_calibrated = CalibratedClassifierCV(pd_model, cv=5)
pd_model_calibrated.fit(X_train, y_train)
pd_scores = pd_model_calibrated.predict_proba(X_test)[:, 1]

# Scorecard-style model with LightGBM (X_val, y_val: a held-out validation set)
lgb_model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    class_weight='balanced'
)
lgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Expected loss calculation
# (the first parameter is not named `pd` to avoid clashing with the pandas alias)
def expected_loss(prob_default, lgd, ead):
    return prob_default * lgd * ead

# Survival Analysis for Time-to-Default
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(df_survival, duration_col='time_to_default', event_col='defaulted')

9. Financial NLP

News Sentiment Analysis and Stock Price Correlation

from transformers import pipeline
import pandas as pd

# Uses FinBERT (a BERT model fine-tuned for financial text)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="ProsusAI/finbert"
)

def analyze_news_sentiment(headlines):
    results = sentiment_pipeline(headlines)
    return pd.DataFrame([
        {'headline': h, 'label': r['label'], 'score': r['score']}
        for h, r in zip(headlines, results)
    ])

headlines = [
    "Apple reports record quarterly earnings",
    "Federal Reserve signals rate hike ahead",
    "Tech sector faces regulatory headwinds"
]

df_sentiment = analyze_news_sentiment(headlines)
print(df_sentiment)
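
To relate sentiment to price moves, one option is to convert each (label, score) result into a signed number and correlate the daily value with the next day's return. The following is a minimal sketch: the `signed_score` helper is hypothetical, the label names assume FinBERT-style positive/negative/neutral outputs, and every number is made up for illustration.

```python
import numpy as np

def signed_score(label, score):
    """Map a (label, score) pair to a signed value in [-1, 1]; neutral maps to 0."""
    sign = {"positive": 1.0, "negative": -1.0}.get(label.lower(), 0.0)
    return sign * score

# Hypothetical daily sentiment results and next-day returns
daily_results = [
    ("positive", 0.9), ("negative", 0.8), ("neutral", 0.7),
    ("positive", 0.6), ("negative", 0.5),
]
next_day_returns = np.array([0.012, -0.009, 0.001, 0.004, -0.006])

daily_sentiment = np.array([signed_score(l, s) for l, s in daily_results])
corr = float(np.corrcoef(daily_sentiment, next_day_returns)[0, 1])
print(f"Pearson correlation: {corr:.3f}")
```

Five invented data points prove nothing about real markets. With real data, the sentiment for day t must be paired with the return that follows the news timestamp, otherwise the correlation itself is contaminated by look-ahead bias.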

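Parsing EDGAR Financial Filings

import requests
from bs4 import BeautifulSoup
import re

def fetch_10k_filing(cik, accession_number, primary_doc):
    """Fetch a 10-K annual report from SEC EDGAR.

    primary_doc is the file name of the filing's main document, taken from
    the filing index; it differs from filing to filing.
    """
    base_url = "https://www.sec.gov/Archives/edgar/data"
    accession = accession_number.replace("-", "")  # folder name has no dashes
    url = f"{base_url}/{int(cik)}/{accession}/{primary_doc}"
    # SEC asks for a descriptive User-Agent that includes contact details
    response = requests.get(url, headers={'User-Agent': 'Example Research research@example.com'})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract a section (e.g. Risk Factors); the first match may be the
    # table-of-contents entry rather than the section body
    risk_section = soup.find(string=re.compile("Risk Factors", re.IGNORECASE))
    return risk_section.parent.get_text() if risk_section else ""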

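Key Cautions for Healthcare & Finance AI

Healthcare

- FDA regulation: in the United States, medical AI software falls under the FDA's Software as a Medical Device (SaMD) framework.
- HIPAA compliance: patient data must be de-identified and protected with appropriate security safeguards.
- Bias: if data from a particular race or sex is underrepresented, model performance can be lower for that group.
- Clinical validation: even a model with strong offline metrics must be validated in a real clinical setting (for example, through a clinical trial).

Finance

- Look-ahead bias: leakage of future information must be strictly prevented when training time series models.
- Overfitting: financial data is noisy, so a more complex model can perform worse in live trading.
- Regime change: a strategy that worked well in the past can fail abruptly when market conditions change.
- Transaction costs: backtests must include slippage, commissions, and market impact.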
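
To see how much transaction costs matter, the sketch below subtracts proportional costs from a gross return. The helper `net_return` and the figures are hypothetical; the 0.1% commission matches the rate used in the Backtrader example, while the slippage rate is an assumed value, and market impact is left out for simplicity.

```python
def net_return(gross_return, turnover, commission_rate, slippage_rate):
    """Gross return minus proportional trading costs.

    turnover is the traded notional divided by portfolio value for the period.
    """
    return gross_return - turnover * (commission_rate + slippage_rate)

gross = 0.0010  # 10 basis points gross daily return (hypothetical)
net = net_return(gross, turnover=0.5, commission_rate=0.001, slippage_rate=0.0005)
print(f"gross={gross:.5f}, net={net:.5f}")
```

Turning over half the portfolio per day removes three quarters of the gross return in this example, which is why a strategy that looks profitable without costs can be unprofitable once they are included.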

References

PhysioNet Clinical Data: https://physionet.org/
AlphaFold Protein Structure Database (DeepMind, EMBL-EBI): https://alphafold.ebi.ac.uk/
yfinance (PyPI project page): https://pypi.org/project/yfinance/
Backtrader documentation: https://www.backtrader.com/docu/
MIMIC-III Clinical Database: https://physionet.org/content/mimiciii/
SEC EDGAR: https://www.sec.gov/edgar/

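Quiz

Q1. What is look-ahead bias in finance AI backtesting, and why is it dangerous?

Answer: It is the error of using future data, which would not have been available at that point in time, for model training or signal generation during a backtest.

Explanation: For example, generating today's trading signal from today's closing price and assuming the trade also executes at that close describes a trade that could not happen in practice. Walk-forward validation and time-ordered cross-validation help prevent this. A backtest with this bias appears to have found alpha that does not exist, which can lead to large losses in live trading.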
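
The effect can be reproduced with a few lines of arithmetic. In the sketch below, a naive signal ("go long if today closed up") is evaluated twice: once applied to the same day's return, which uses information not yet available, and once lagged by one day. The prices are hypothetical.

```python
closes = [100.0, 102.0, 101.0, 105.0, 104.0]  # hypothetical daily closes

# day_returns[t] is the return from close t-1 to close t (index 0 is a placeholder)
day_returns = [0.0] + [closes[t] / closes[t - 1] - 1 for t in range(1, len(closes))]

# The signal for day t is only known once day t has closed
signals = [1 if r > 0 else 0 for r in day_returns]

# Biased: earn day t's return using the signal that day t's return produced
biased_pnl = sum(signals[t] * day_returns[t] for t in range(1, len(closes)))

# Correct: the position held during day t is based on the signal from day t-1
lagged_pnl = sum(signals[t - 1] * day_returns[t] for t in range(1, len(closes)))

print(f"biased={biased_pnl:.4f}, lagged={lagged_pnl:.4f}")
```

The biased version collects only the up days by construction, so it looks profitable, while the properly lagged version of the same rule loses money on this series.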

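Q2. Why is Dice Score used instead of Accuracy in medical imaging AI?

Answer: In medical image segmentation, background pixels vastly outnumber the region of interest, so class imbalance is extreme.

Explanation: For example, if a tumor occupies 1% of the pixels in a brain MRI, predicting every pixel as background still gives 99% accuracy. Dice Score is the overlap between the two sets, computed as 2*TP / (2*TP + FP + FN), so it reflects prediction quality on the minority class more faithfully.

Q3. What is the purpose of Lipinski's Rule of Five (Ro5) in drug discovery?

Answer: It is an empirical rule of thumb for screening, at an early stage, whether a drug candidate has properties compatible with oral administration.

Explanation: Molecules with a molecular weight of 500 or less, LogP of 5 or less, no more than 5 hydrogen bond donors, and no more than 10 hydrogen bond acceptors are more likely to show good absorption and permeability when taken orally. Used as an early screen, the rule quickly filters millions of candidate molecules and reduces development cost.

Q4. What should you watch out for when using SMOTE in fraud detection?

Answer: SMOTE must be applied only to the training data, never to the validation or test set.

Explanation: Applying SMOTE to the test set mixes in synthetic samples, so the measured performance no longer reflects the real class distribution and is typically overestimated. In addition, synthetic samples generated by SMOTE may not fully reflect real fraud patterns, so Precision-Recall AUC or F1 Score is a better evaluation metric than Accuracy.

Q5. Why is AlphaFold2 a breakthrough for drug discovery?

Answer: It predicts a protein's 3D structure from the amino acid sequence alone, often with accuracy approaching that of experimental methods, which shortens a structure determination process that used to take months or years to minutes or hours of computation.

Explanation: In drug discovery, the 3D structure of the target protein is essential for identifying drug binding sites. Before AlphaFold2, obtaining a structure required experimental methods such as X-ray crystallography or cryo-EM. DeepMind and EMBL-EBI provide predicted structures for more than 200 million proteins in a public database.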