AI for Science: AlphaFold, Drug Discovery, Climate AI, and Physics Simulation
AI for Science: How Artificial Intelligence Is Transforming Research
AI is no longer confined to generating text or classifying images. Today, AI solves protein folding problems, designs drug candidates, improves climate models, and embeds physical laws directly into neural networks, operating at the cutting edge of scientific discovery. This guide explores seven core areas of scientific AI with practical code examples.
1. AI Paper Analysis: arXiv and Semantic Scholar
Automated Paper Discovery
Hundreds of papers appear on arXiv every day. The Semantic Scholar API lets you automatically collect and summarize the latest work in any research area.
import requests

def search_papers(query: str, limit: int = 10) -> list[dict]:
    """Search papers via the Semantic Scholar API"""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        "query": query,
        "limit": limit,
        "fields": "title,abstract,year,citationCount,authors,externalIds"
    }
    headers = {"User-Agent": "ResearchBot/1.0"}
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    return data.get("data", [])

def format_paper(paper: dict) -> str:
    """Format paper info for readability"""
    title = paper.get("title", "N/A")
    year = paper.get("year", "N/A")
    citations = paper.get("citationCount", 0)
    authors = [a["name"] for a in paper.get("authors", [])[:3]]
    abstract = (paper.get("abstract") or "")[:300]  # abstract can be null in the API response
    return f"""
Title: {title}
Year: {year} | Citations: {citations}
Authors: {", ".join(authors)}
Abstract: {abstract}...
"""

# Example usage
papers = search_papers("protein structure prediction AlphaFold", limit=5)
for p in papers:
    print(format_paper(p))
Monitoring arXiv Categories for New Papers
import feedparser

def get_arxiv_papers(category: str = "cs.LG", max_results: int = 20) -> list[dict]:
    """Fetch the latest papers from an arXiv RSS feed"""
    url = f"http://export.arxiv.org/rss/{category}"
    feed = feedparser.parse(url)
    papers = []
    for entry in feed.entries[:max_results]:
        papers.append({
            "title": entry.title,
            "summary": entry.summary[:400],
            "link": entry.link,
            "published": entry.published
        })
    return papers

# Key arXiv categories for scientific AI
categories = {
    "cs.LG": "Machine Learning",
    "q-bio.BM": "Biomolecules",
    "physics.comp-ph": "Computational Physics",
    "stat.ML": "Statistical ML"
}
for cat, name in categories.items():
    papers = get_arxiv_papers(cat, max_results=3)
    print(f"\n=== {name} ({cat}) ===")
    for p in papers:
        print(f"- {p['title'][:80]}")
2. Protein Structure Prediction: How AlphaFold2/3 Works
MSA, Attention, and Recycling
AlphaFold2 solved protein structure prediction with three core innovations, refined further by an iterative recycling step.
Multiple Sequence Alignment (MSA): Aligns evolutionarily related protein sequences that encode millions of years of evolutionary information. Co-evolution patterns in the MSA reveal which residue pairs are in contact in the 3D structure.
Evoformer: An attention module that iteratively updates the MSA representation and the residue-pair representation, letting each inform the other.
Structure Module: Predicts per-residue rotations and translations that place each amino acid in 3D space.
Recycling: The initial prediction is fed back as input and refined over three additional iterations.
# Protein structure prediction with ESMFold (via Hugging Face transformers)
import torch
from transformers import EsmForProteinFolding, EsmTokenizer

def predict_structure_esmfold(sequence: str) -> dict:
    """
    Predict a protein's 3D structure with ESMFold.
    Unlike AlphaFold2, ESMFold needs only a single sequence, no MSA required.
    """
    model_name = "facebook/esmfold_v1"
    tokenizer = EsmTokenizer.from_pretrained(model_name)
    model = EsmForProteinFolding.from_pretrained(
        model_name,
        low_cpu_mem_usage=True
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()
    # Tokenize the sequence and move it to the model's device
    tokenized = tokenizer(
        sequence,
        return_tensors="pt",
        add_special_tokens=False
    )
    tokenized = {k: v.to(device) for k, v in tokenized.items()}
    with torch.no_grad():
        output = model(**tokenized)
    # pLDDT: per-residue confidence score (closer to 100 = higher confidence)
    plddt_scores = output.plddt.squeeze().cpu().numpy()
    return {
        "plddt_mean": float(plddt_scores.mean()),
        "plddt_per_residue": plddt_scores.tolist(),
        "positions": output.positions[-1].squeeze().cpu().numpy()
    }

# Example: a short helix-forming peptide
seq = "AAKAAAKAAAKAAAKAAAK"
result = predict_structure_esmfold(seq)
print(f"Mean pLDDT score: {result['plddt_mean']:.2f}")
print(f"Number of residues: {len(result['plddt_per_residue'])}")
AlphaFold2 vs ESMFold vs RoseTTAFold
| Model | MSA Required | Speed | Accuracy | Notes |
|---|---|---|---|---|
| AlphaFold2 | Yes | Slow | Very high | Gold standard; single-chain structures |
| AlphaFold3 | Yes | Slow | Highest | DNA/RNA/small-molecule complexes |
| ESMFold | No | Fast | High | Protein language model; single sequence |
| RoseTTAFold | Yes | Medium | High | Complex structures; open source |
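Raw pLDDT scores are easiest to act on when bucketed into the confidence bands used by the AlphaFold Protein Structure Database (above 90 very high, 70-90 confident, 50-70 low, below 50 very low, often disordered). A small helper along those lines, written here as an illustration rather than part of any of the libraries above:

```python
def plddt_band(score: float) -> str:
    """Map a per-residue pLDDT score to AlphaFold DB's confidence bands."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"

def band_fractions(plddt_per_residue: list[float]) -> dict[str, float]:
    """Fraction of residues falling in each confidence band."""
    n = len(plddt_per_residue)
    counts: dict[str, int] = {}
    for s in plddt_per_residue:
        b = plddt_band(s)
        counts[b] = counts.get(b, 0) + 1
    return {b: c / n for b, c in counts.items()}

print(band_fractions([95.0, 92.1, 71.3, 48.0]))
# {'very high': 0.5, 'confident': 0.25, 'very low': 0.25}
```

Feeding `result["plddt_per_residue"]` from the ESMFold example into `band_fractions` gives a quick read on which parts of the prediction are trustworthy.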
3. Drug Discovery AI: Molecular Graph Neural Networks
Representing Molecules as Graphs
In a molecule, atoms are nodes and chemical bonds are edges. Graph neural networks (GNNs) process this structure naturally.
# Molecular property prediction with Chemprop
# pip install chemprop
import chemprop
from chemprop.data import MoleculeDataLoader, MoleculeDataset, MoleculeDatapoint

def predict_molecular_properties(smiles_list: list[str]) -> list[float]:
    """
    Predict molecular properties with a Chemprop D-MPNN.
    Takes SMILES strings and predicts toxicity, solubility, etc.
    """
    # Build the dataset
    data = MoleculeDataset([
        MoleculeDatapoint.from_smi(smi) for smi in smiles_list
    ])
    loader = MoleculeDataLoader(dataset=data, batch_size=32)
    # Load a pre-trained model (e.g., an HIV inhibitor predictor)
    model = chemprop.models.MPNN.load_from_checkpoint("hiv_model.ckpt")
    predictions = []
    for batch in loader:
        pred = model(batch.mol_graph, batch.V_d, batch.E_d)
        predictions.extend(pred.squeeze().tolist())
    return predictions

# Example SMILES (aspirin, caffeine, ibuprofen)
molecules = [
    "CC(=O)Oc1ccccc1C(=O)O",          # aspirin
    "Cn1cnc2c1c(=O)n(c(=O)n2C)C",     # caffeine
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O"      # ibuprofen
]
# A custom molecular GNN with PyTorch Geometric
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data

class MolecularGNN(nn.Module):
    """Simple graph neural network for molecular property prediction"""
    def __init__(self, node_features: int = 9, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(node_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, data: Data) -> torch.Tensor:
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        x = torch.relu(self.conv3(x, edge_index))
        # Pool all node embeddings into one graph-level vector
        x = global_mean_pool(x, batch)
        return self.classifier(x)
ADMET Prediction Pipeline
In drug discovery, ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) is central to filtering candidate compounds.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def lipinski_filter(smiles: str) -> dict:
    """
    Lipinski's Rule of Five, a heuristic for oral bioavailability:
    MW <= 500, LogP <= 5, HBD <= 5, HBA <= 10
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {"valid": False}
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Lipinski.NumHDonors(mol)
    hba = Lipinski.NumHAcceptors(mol)
    violations = sum([mw > 500, logp > 5, hbd > 5, hba > 10])
    return {
        "valid": True,
        "MW": round(mw, 2),
        "LogP": round(logp, 2),
        "HBD": hbd,
        "HBA": hba,
        "violations": violations,
        "drug_like": violations <= 1
    }

# Test the filter
for smi in molecules:
    result = lipinski_filter(smi)
    print(f"SMILES: {smi[:30]}...")
    print(f"  Drug-like: {result['drug_like']}, Violations: {result['violations']}\n")
4. Climate and Energy AI
NeuralGCM: Physics-Based Weather Prediction
Google's NeuralGCM combines traditional numerical weather prediction (NWP) with neural networks: physical equations model the large-scale atmospheric dynamics, while neural networks parameterize subgrid-scale processes such as clouds and turbulence.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

def solar_power_forecast(
    weather_data: np.ndarray,
    target_hours: int = 24
) -> np.ndarray:
    """
    Solar power generation forecast.
    Inputs: temperature, irradiance, wind speed, humidity, time of day
    Output: hourly generation forecast (kWh)
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(weather_data)
    model = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=5,
        random_state=42
    )
    # In production, fit on real training data: model.fit(X_train, y_train)
    # Simulated predictions stand in for model.predict(X_scaled) here
    predictions = np.random.exponential(scale=50, size=target_hours)
    predictions = np.clip(predictions, 0, 500)  # clamp to the 0-500 kWh range
    return predictions

def co2_capture_optimization(
    temperature: float,
    pressure: float,
    flow_rate: float
) -> dict:
    """
    CO2 capture process optimization.
    Tunes parameters for a direct air capture (DAC) system.
    """
    # Simplified physics model (the real process is far more complex)
    efficiency = (
        0.85 * (1 - np.exp(-flow_rate / 100))
        * (1 / (1 + np.exp((temperature - 60) / 10)))
        * min(pressure / 1.5, 1.0)
    )
    energy_kwh_per_ton = 300 + (1 - efficiency) * 500
    return {
        "capture_efficiency": round(efficiency * 100, 2),
        "energy_cost_kwh_per_ton": round(energy_kwh_per_ton, 1),
        "optimal": efficiency > 0.75
    }

# Parameter sweep
for temp in [40, 60, 80]:
    result = co2_capture_optimization(temp, pressure=1.2, flow_rate=80)
    print(f"Temperature {temp}C: efficiency {result['capture_efficiency']}%")
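The NeuralGCM idea, a differentiable physics core plus a learned correction for unresolved processes, can be sketched in a toy one-dimensional setting. Everything below (the upwind advection step, the damping "correction", the grid size) is illustrative and not NeuralGCM's actual formulation:

```python
import numpy as np

N = 64
dx = 2 * np.pi / N
dt = 0.05  # satisfies the CFL condition dt <= dx for wave speed c = 1

def dynamics_step(u: np.ndarray) -> np.ndarray:
    """Physics core: first-order upwind advection for du/dt + du/dx = 0."""
    return u - dt / dx * (u - np.roll(u, 1))

def learned_correction(u: np.ndarray) -> np.ndarray:
    """Stand-in for the neural net that parameterizes subgrid processes."""
    return -0.01 * dt * (u - u.mean())  # toy damping toward the mean

def hybrid_step(u: np.ndarray) -> np.ndarray:
    """One hybrid step: physics update plus learned residual."""
    return dynamics_step(u) + learned_correction(u)

state = np.sin(np.arange(N) * dx)
for _ in range(200):
    state = hybrid_step(state)
print(state.shape)  # (64,)
```

In the real system the correction is a trained network, and both terms are differentiated end-to-end so the network learns exactly the part of the dynamics the physics core misses.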
5. Physics Simulation: PINNs
Physics-Informed Neural Networks (PINN)
PINNs embed partial differential equations (PDEs) directly into the loss function, so the network learns solutions that obey physical laws.
As an example, we solve the 1D heat equation du/dt = alpha * d2u/dx2 with a neural network u(x, t).
The loss has two parts: a data-fit term for the boundary and initial conditions, and a physics term.
The physics loss, the mean squared PDE residual |du/dt - alpha * d2u/dx2|^2 evaluated at collocation points, is the key ingredient.
import torch
import torch.nn as nn
import numpy as np

class PINN(nn.Module):
    """
    Physics-Informed Neural Network.
    Solves the 1D heat equation: du/dt = alpha * d2u/dx2
    """
    def __init__(self, hidden_layers: int = 4, neurons: int = 64):
        super().__init__()
        layers = [nn.Linear(2, neurons), nn.Tanh()]
        for _ in range(hidden_layers - 1):
            layers += [nn.Linear(neurons, neurons), nn.Tanh()]
        layers += [nn.Linear(neurons, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        inputs = torch.cat([x, t], dim=1)
        return self.net(inputs)

def physics_loss(model: PINN, x: torch.Tensor, t: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """PDE residual loss for the heat equation"""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = model(x, t)
    # Compute partial derivatives via automatic differentiation
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    # PDE residual: du/dt - alpha * d2u/dx2 = 0
    residual = u_t - alpha * u_xx
    return torch.mean(residual ** 2)

def train_pinn(epochs: int = 5000) -> PINN:
    """Train the PINN"""
    model = PINN()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Numbers of sampled training points
    N_pde = 1000  # PDE collocation points
    N_bc = 100    # boundary condition points (per boundary)
    N_ic = 200    # initial condition points
    for epoch in range(epochs):
        optimizer.zero_grad()
        # PDE residual points in the domain interior
        x_pde = torch.rand(N_pde, 1)
        t_pde = torch.rand(N_pde, 1)
        loss_pde = physics_loss(model, x_pde, t_pde)
        # Boundary conditions: u(0,t) = u(1,t) = 0 (sample both boundaries)
        x_bc = torch.cat([torch.zeros(N_bc, 1), torch.ones(N_bc, 1)])
        t_bc = torch.rand(2 * N_bc, 1)
        u_bc = model(x_bc, t_bc)
        loss_bc = torch.mean(u_bc ** 2)
        # Initial condition: u(x,0) = sin(pi*x)
        x_ic = torch.rand(N_ic, 1)
        t_ic = torch.zeros(N_ic, 1)
        u_ic = model(x_ic, t_ic)
        u_exact = torch.sin(np.pi * x_ic)
        loss_ic = torch.mean((u_ic - u_exact) ** 2)
        # Total loss: physics residual plus weighted boundary/initial terms
        loss = loss_pde + 10 * loss_bc + 10 * loss_ic
        loss.backward()
        optimizer.step()
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.6f}")
    return model

# Run training
# model = train_pinn(epochs=5000)
print("PINN architecture: input (x, t) -> 4x64 Tanh -> output u")
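This particular problem has a closed-form solution, u(x, t) = sin(pi*x) * exp(-alpha * pi^2 * t), which makes a handy sanity check for a trained network. The snippet below verifies with finite differences (independently of any training) that this function really satisfies the PDE; the same grid can then be reused to measure a trained PINN's error:

```python
import numpy as np

ALPHA = 0.01

def u_exact(x, t):
    """Closed-form solution of du/dt = alpha*d2u/dx2 with u(x,0)=sin(pi*x), u(0,t)=u(1,t)=0."""
    return np.sin(np.pi * x) * np.exp(-ALPHA * np.pi ** 2 * t)

# Check the PDE residual with central finite differences
x = np.linspace(0.1, 0.9, 50)
t = 0.5
h = 1e-4
u_t = (u_exact(x, t + h) - u_exact(x, t - h)) / (2 * h)
u_xx = (u_exact(x + h, t) - 2 * u_exact(x, t) + u_exact(x - h, t)) / h ** 2
residual = np.abs(u_t - ALPHA * u_xx).max()
print(f"max |u_t - alpha*u_xx| = {residual:.2e}")
```

The residual is zero up to finite-difference error, confirming the reference solution; the relative L2 gap between `u_exact` and the PINN's output on such a grid is the standard accuracy metric in the PINN literature.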
Neural ODE: Continuous-Time Dynamics
A Neural ODE models the rate of change of a hidden state with a neural network, which enables continuous-time sequence modeling.
# pip install torchdiffeq
import torch
import torch.nn as nn
from torchdiffeq import odeint

class ODEFunc(nn.Module):
    """The right-hand side f(t, y) = dy/dt"""
    def __init__(self, dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, dim)
        )

    def forward(self, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(y)

class NeuralODE(nn.Module):
    """Neural ODE model"""
    def __init__(self, dim: int = 2):
        super().__init__()
        self.odefunc = ODEFunc(dim)

    def forward(self, y0: torch.Tensor, t_span: torch.Tensor) -> torch.Tensor:
        """
        y0: initial state [batch, dim]
        t_span: time points [T]
        returns: states at each time point [T, batch, dim]
        """
        return odeint(self.odefunc, y0, t_span, method='dopri5')

# Example: a Lotka-Volterra predator-prey system
def simulate_lotka_volterra():
    model = NeuralODE(dim=2)
    # Initial conditions in normalized units (1.0 ~ 1000 rabbits, 0.1 ~ 100 foxes)
    y0 = torch.tensor([[1.0, 0.1]])
    t = torch.linspace(0, 15, 300)
    with torch.no_grad():
        trajectory = model(y0, t)
    print(f"Trajectory shape: {trajectory.shape}")  # [300, 1, 2]
    return trajectory

# simulate_lotka_volterra()
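Under the hood, odeint numerically integrates dy/dt = f(t, y). A minimal fixed-step sketch of the same idea in plain NumPy, using classic RK4 on the true Lotka-Volterra equations (the parameter values here are illustrative, not from any particular dataset), shows what the solver does and what trajectory a Neural ODE would be trained to reproduce:

```python
import numpy as np

def lotka_volterra(y, a=1.5, b=1.0, c=3.0, d=1.0):
    """True predator-prey dynamics dy/dt that a Neural ODE would have to learn."""
    prey, pred = y
    return np.array([a * prey - b * prey * pred,
                     -c * pred + d * prey * pred])

def rk4_step(f, y, dt):
    """One classic fourth-order Runge-Kutta step."""
    k1 = f(y)
    k2 = f(y + dt / 2 * k1)
    k3 = f(y + dt / 2 * k2)
    k4 = f(y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

y = np.array([1.0, 0.1])  # normalized prey/predator populations
traj = [y]
for _ in range(300):
    y = rk4_step(lotka_volterra, y, 0.05)
    traj.append(y)
traj = np.array(traj)
print(traj.shape)  # (301, 2)
```

Training a Neural ODE then amounts to minimizing the MSE between `odeint(odefunc, y0, t)` and observed trajectory points, with gradients flowing through the solver.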
Fourier Neural Operator (FNO)
FNO operates in the frequency domain, which yields resolution-independent PDE solvers.
import torch
import torch.nn as nn
import torch.fft

class SpectralConv2d(nn.Module):
    """Core FNO block: convolution in the spectral domain"""
    def __init__(self, in_channels: int, out_channels: int, modes: int = 12):
        super().__init__()
        self.modes = modes
        scale = 1 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.rand(in_channels, out_channels, modes, modes,
                               dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(B, self.weights.shape[1], H, W // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        # Keep (and learn) only the low-frequency modes
        out_ft[:, :, :self.modes, :self.modes] = torch.einsum(
            'bixy,ioxy->boxy',
            x_ft[:, :, :self.modes, :self.modes],
            self.weights
        )
        return torch.fft.irfft2(out_ft, s=(H, W))
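To see why this is resolution-independent, note that the learned weights only touch the lowest modes x modes Fourier coefficients, which exist at any grid size. A single-channel NumPy sketch of the same forward pass, applied at two different resolutions with the same weights:

```python
import numpy as np

def spectral_conv(x: np.ndarray, weights: np.ndarray, modes: int = 12) -> np.ndarray:
    """Single-channel version of SpectralConv2d's forward pass."""
    H, W = x.shape
    x_ft = np.fft.rfft2(x)                                   # to the frequency domain
    out_ft = np.zeros((H, W // 2 + 1), dtype=complex)
    out_ft[:modes, :modes] = x_ft[:modes, :modes] * weights  # scale low modes only
    return np.fft.irfft2(out_ft, s=(H, W))                   # back to the spatial domain

rng = np.random.default_rng(0)
w = rng.standard_normal((12, 12)) + 1j * rng.standard_normal((12, 12))
# The same 12x12 weight tensor works at 32x32 and at 64x64
y32 = spectral_conv(rng.standard_normal((32, 32)), w)
y64 = spectral_conv(rng.standard_normal((64, 64)), w)
print(y32.shape, y64.shape)  # (32, 32) (64, 64)
```

This is why an FNO trained on a coarse grid can be evaluated on a finer grid without retraining.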
6. AI Lab Automation: Self-Driving Labs
Bayesian Optimization for Experimental Design
A self-driving lab (SDL) is a closed loop in which AI designs experiments, robots run them, and AI analyzes the results to propose the next experiment.
# pip install scikit-optimize
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
import numpy as np

# Define the experimental parameter space
# Example: perovskite solar cell optimization
search_space = [
    Real(0.5, 2.0, name="pb_concentration"),   # Pb concentration (mol/L)
    Real(0.3, 0.8, name="ma_ratio"),           # MA:FA ratio
    Real(50, 150, name="annealing_temp"),      # annealing temperature (C)
    Integer(10, 60, name="annealing_time"),    # annealing time (minutes)
    Categorical(["DMF", "DMSO", "GBL"], name="solvent")  # solvent
]

@use_named_args(search_space)
def experimental_objective(
    pb_concentration, ma_ratio, annealing_temp,
    annealing_time, solvent
) -> float:
    """
    Experimental objective (in practice this calls the robotic lab system).
    Returns negative PCE (power conversion efficiency),
    turning the maximization problem into a minimization.
    """
    # Simulated experimental result (real systems drive hardware here)
    noise = np.random.normal(0, 0.5)
    pce = (
        15.0
        + 2.0 * np.exp(-((pb_concentration - 1.2) ** 2) / 0.1)
        + 1.5 * np.exp(-((ma_ratio - 0.6) ** 2) / 0.05)
        - 0.05 * abs(annealing_temp - 100)
        + noise
    )
    print(f"Experiment: Pb={pb_concentration:.2f}, MA={ma_ratio:.2f}, "
          f"T={annealing_temp:.0f}C -> PCE={pce:.2f}%")
    return -pce  # minimize

# Run the Bayesian optimization
result = gp_minimize(
    func=experimental_objective,
    dimensions=search_space,
    n_calls=30,            # total number of experiments
    n_initial_points=10,   # initial random exploration
    acq_func="EI",         # Expected Improvement acquisition function
    random_state=42
)

print(f"\nBest PCE: {-result.fun:.2f}%")
print("Best parameters:")
for name, val in zip([s.name for s in search_space], result.x):
    print(f"  {name}: {val}")
7. Research Reproducibility: DVC and Experiment Tracking
Data Versioning with DVC
# Initialize DVC and track data
git init
dvc init

# Track large datasets (DVC instead of Git LFS)
dvc add data/protein_structures/
dvc add data/molecular_datasets/

# Configure remote storage
dvc remote add -d myremote s3://mybucket/dvc-store

# Pipeline definition (dvc.yaml)
# stages:
#   preprocess:
#     cmd: python preprocess.py
#     deps: [data/raw/, src/preprocess.py]
#     outs: [data/processed/]
#   train:
#     cmd: python train.py --seed 42
#     deps: [data/processed/, src/train.py]
#     outs: [models/]
#     metrics: [metrics.json]

# Run and push the tracked pipeline
dvc repro
dvc push
Experiment Tracking with MLflow
import mlflow
import mlflow.pytorch
import numpy as np
import torch

def train_with_tracking(config: dict) -> float:
    """Track experiment parameters and metrics with MLflow"""
    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_params(config)
        # Fix seeds for reproducibility
        torch.manual_seed(config["seed"])
        np.random.seed(config["seed"])
        # Build the model (MolecularGNN from section 3)
        model = MolecularGNN(
            node_features=config["node_features"],
            hidden_dim=config["hidden_dim"]
        )
        # Log metrics (simulated training curves for illustration)
        for epoch in range(config["epochs"]):
            train_loss = np.random.exponential(1.0) / (epoch + 1)
            val_auc = 1 - np.exp(-epoch / 20)
            mlflow.log_metric("train_loss", train_loss, step=epoch)
            mlflow.log_metric("val_auc", val_auc, step=epoch)
        # Save the model artifact
        mlflow.pytorch.log_model(model, "model")
        final_auc = val_auc
        mlflow.log_metric("final_val_auc", final_auc)
    return final_auc

# Experiment configuration
config = {
    "seed": 42,
    "node_features": 9,
    "hidden_dim": 128,
    "epochs": 100,
    "lr": 1e-3,
    "dataset": "tox21"
}
mlflow.set_experiment("molecular-property-prediction")
# auc = train_with_tracking(config)
Docker + Conda for Environment Reproducibility
# Dockerfile
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

# System packages
RUN apt-get update && apt-get install -y \
    git wget curl \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Environment variables for reproducibility
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8

WORKDIR /workspace
COPY . .
Quiz
Q1. What is the evolutionary rationale for AlphaFold2's use of Multiple Sequence Alignments (MSA)?
Answer: To infer spatially adjacent residue pairs from co-evolution patterns.
Explanation: In protein families that have evolved over millions of years, two residues that interact functionally tend to mutate together (co-evolution). Analyzing correlated mutations between pairs of columns in the MSA reveals residue pairs that sit close together in the 3D structure. AlphaFold2's Evoformer processes this co-evolution signal in matrix form, which greatly improves structure prediction accuracy. ESMFold works without an MSA because the embeddings of a large protein language model (PLM) capture this evolutionary information implicitly.
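The co-evolution signal can be quantified directly: mutual information between two alignment columns is high when they co-vary. A toy illustration on a made-up alignment (real contact predictors add corrections for phylogenetic bias and gaps):

```python
from collections import Counter
from math import log

def column_mi(msa: list[str], i: int, j: int) -> float:
    """Mutual information between MSA columns i and j, a classic co-evolution signal."""
    n = len(msa)
    pairs = Counter((s[i], s[j]) for s in msa)
    ci = Counter(s[i] for s in msa)
    cj = Counter(s[j] for s in msa)
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        mi += p_ab * log(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi

# Toy alignment: columns 0 and 2 co-vary perfectly (A<->T), column 1 is weakly coupled
msa = ["AGT", "ACT", "TGA", "TCA", "AGT", "TCA"]
print(f"MI(0,2) = {column_mi(msa, 0, 2):.3f}")  # 0.693 (= ln 2): perfect co-variation
print(f"MI(0,1) = {column_mi(msa, 0, 1):.3f}")  # 0.057: weak coupling
```

High-MI column pairs are candidate 3D contacts; the Evoformer learns a far richer version of this statistic.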
Q2. How do Physics-Informed Neural Networks (PINNs) encode physical laws in the loss function?
Answer: By applying automatic differentiation to the network output to compute the PDE residual, which is added to the loss as a regularization term.
Explanation: The core idea of a PINN is to force the output of the network u(x, t) to satisfy the governing PDE. Using PyTorch's autograd, derivatives such as du/dt and d2u/dx2 are computed exactly, and the mean squared PDE residual (for the heat equation, du/dt - alpha * d2u/dx2) is used as the physics loss. Because the physics constrains the solution even in regions with little or no data, PINNs extrapolate better than purely data-driven models.
Q3. Why do molecular GNNs represent atoms as nodes and bonds as edges?
Answer: Because a molecule's chemical properties are determined by its atom types and bond topology.
Explanation: A SMILES string encodes a molecule as a 1D sequence, but the actual structure of a molecule is a graph. A GNN uses message passing: each atom aggregates information from its neighbors to learn its local chemical environment, and stacking layers widens the receptive field to capture longer-range structure. This representation works regardless of molecule size or topology (permutation invariance) and matches physicochemical intuition. Chemprop's D-MPNN refines this further with directed message passing, which improves the representation of ring systems.
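One message-passing step is small enough to write out by hand. A stripped-down NumPy sketch (sum-aggregation plus a shared linear map and ReLU; the graph and features are made up for illustration):

```python
import numpy as np

def message_passing_step(h: np.ndarray, adj: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GNN layer: each atom sums its neighbors' features, then linear + ReLU."""
    messages = adj @ h                      # per-node sum of neighbor features
    return np.maximum((h + messages) @ W, 0.0)

# Toy 3-atom chain (e.g., C-C-O heavy-atom skeleton)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)   # adjacency matrix
h = np.eye(3)                              # one-hot atom features
W = np.full((3, 4), 0.5)                   # toy weight matrix
h1 = message_passing_step(h, adj, W)
print(h1.shape)  # (3, 4)
```

After one step, the middle atom (two neighbors) carries more aggregated signal than the terminal atoms, exactly the "local chemical environment" effect described above.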
Q4. What is the mathematical basis for Bayesian optimization being more efficient than grid search for experimental design?
Answer: It approximates the objective with a Gaussian process and uses an acquisition function (such as Expected Improvement) to balance exploration and exploitation.
Explanation: Grid search scales exponentially with the number of parameters (the curse of dimensionality). Bayesian optimization combines (1) a Gaussian process surrogate that estimates the posterior distribution of the objective from previous experiments, and (2) an acquisition function (EI, UCB, PI) that proposes the next experiment. EI selects the point that maximizes the expected improvement over the current best value, efficiently probing both highly uncertain unexplored regions and promising ones. The more expensive each experiment is (e.g., chemical synthesis), the more dramatic the gains from Bayesian optimization.
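EI has a closed form under a Gaussian posterior. For minimization, with posterior mean mu and standard deviation sigma at a candidate point and incumbent best f_best, EI = (f_best - mu) * Phi(z) + sigma * phi(z), where z = (f_best - mu) / sigma and Phi, phi are the standard normal CDF and PDF. A stdlib-only implementation:

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """Closed-form EI for minimization, given the GP posterior (mu, sigma) at a point."""
    if sigma == 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1 + erf(z / sqrt(2)))     # standard normal CDF
    phi = exp(-z * z / 2) / sqrt(2 * pi)   # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

# A point predicted to beat the incumbent scores high (exploitation)...
print(expected_improvement(mu=9.0, sigma=1.0, f_best=10.0))   # about 1.083
# ...but high uncertainty alone also buys EI (exploration)
print(expected_improvement(mu=10.0, sigma=3.0, f_best=10.0))  # about 1.197
```

This is the quantity gp_minimize maximizes internally when acq_func="EI" to choose the next experiment.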
Q5. Why do Neural ODEs model continuous time series more naturally than ordinary RNNs?
Answer: Because a Neural ODE models the system dynamics directly with a continuous differential equation instead of discrete time steps.
Explanation: An RNN assumes fixed, discrete time steps, which is awkward for irregularly sampled data or continuous physical systems. A Neural ODE defines dy/dt = f_theta(t, y) and uses an ODE solver (RK45, Dopri5) to compute the state at arbitrary time points. This brings three advantages: (1) predictions at any time resolution, (2) continuous dynamics expressed with fewer parameters, and (3) memory-efficient backpropagation via the adjoint method. It is especially well suited to scientific problems with continuous dynamics, such as pharmacokinetic/pharmacodynamic (PK/PD) modeling and climate prediction.
Wrapping Up: The Future of AI for Science
AI is accelerating every stage of scientific research.
- Protein structure prediction: AlphaFold3 now predicts complexes with DNA, RNA, and small molecules
- Drug discovery: generative molecular AI is shrinking candidate discovery from years to months
- Climate science: AI weather models such as NeuralGCM and Pangu-Weather are beginning to outperform traditional numerical models
- Physics simulation: PINNs and FNOs make CFD and quantum-mechanical simulations hundreds of times faster
- Self-driving labs: experimental efficiency improves dramatically in materials discovery and chemical synthesis optimization
The heart of scientific AI is the combination of domain knowledge and data. The most powerful results come from integrating humanity's accumulated scientific knowledge, including physical laws, chemical structure, and evolutionary information, into AI models.
AI for Science: AlphaFold, Drug Discovery, Climate AI, and Physics Simulation
AI for Science: How Artificial Intelligence Is Transforming Research
AI is no longer confined to generating text or classifying images. Today, AI solves protein folding problems, designs drug candidates, improves climate models, and embeds physical laws directly into neural networks — all at the cutting edge of scientific discovery. This guide explores seven core areas of scientific AI with practical code examples.
1. AI Paper Analysis: arXiv and Semantic Scholar
Automated Paper Discovery
Hundreds of papers appear on arXiv every day. The Semantic Scholar API lets you automatically collect and summarize the latest work in any research area.
import requests
import json
from datetime import datetime, timedelta
def search_papers(query: str, limit: int = 10) -> list[dict]:
"""Search papers via Semantic Scholar API"""
url = "https://api.semanticscholar.org/graph/v1/paper/search"
params = {
"query": query,
"limit": limit,
"fields": "title,abstract,year,citationCount,authors,externalIds"
}
headers = {"User-Agent": "ResearchBot/1.0"}
response = requests.get(url, params=params, headers=headers)
data = response.json()
return data.get("data", [])
def format_paper(paper: dict) -> str:
"""Format paper info for readability"""
title = paper.get("title", "N/A")
year = paper.get("year", "N/A")
citations = paper.get("citationCount", 0)
authors = [a["name"] for a in paper.get("authors", [])[:3]]
abstract = paper.get("abstract", "")[:300]
return f"""
Title: {title}
Year: {year} | Citations: {citations}
Authors: {", ".join(authors)}
Abstract: {abstract}...
"""
# Example usage
papers = search_papers("protein structure prediction AlphaFold", limit=5)
for p in papers:
print(format_paper(p))
Monitoring arXiv Categories for New Papers
import feedparser
def get_arxiv_papers(category: str = "cs.LG", max_results: int = 20) -> list[dict]:
"""Fetch latest papers from an arXiv RSS feed"""
url = f"http://export.arxiv.org/rss/{category}"
feed = feedparser.parse(url)
papers = []
for entry in feed.entries[:max_results]:
papers.append({
"title": entry.title,
"summary": entry.summary[:400],
"link": entry.link,
"published": entry.published
})
return papers
# Key arXiv categories for scientific AI
categories = {
"cs.LG": "Machine Learning",
"q-bio.BM": "Biomolecules",
"physics.comp-ph": "Computational Physics",
"stat.ML": "Statistical ML"
}
for cat, name in categories.items():
papers = get_arxiv_papers(cat, max_results=3)
print(f"\n=== {name} ({cat}) ===")
for p in papers:
print(f"- {p['title'][:80]}")
2. Protein Structure Prediction: How AlphaFold2/3 Works
MSA, Attention, and Recycling
AlphaFold2 solved protein structure prediction with three core innovations.
Multiple Sequence Alignment (MSA): Aligns evolutionary related protein sequences that encode millions of years of evolutionary information. Co-evolution patterns in the MSA reveal which residue pairs are spatially close in the 3D structure.
Evoformer: An attention module that iteratively updates both the MSA representation and residue-pair representation, letting them inform each other.
Structure Module: Predicts per-residue rotations and translations to place each amino acid in 3D space.
Recycling: The initial prediction is fed back as input and refined across three iterations.
# Protein structure prediction with BioPython + ESMFold
import torch
from transformers import EsmForProteinFolding, EsmTokenizer
def predict_structure_esmfold(sequence: str) -> dict:
"""
Predict protein 3D structure with ESMFold.
Unlike AlphaFold2, ESMFold requires only a single sequence — no MSA needed.
"""
model_name = "facebook/esmfold_v1"
tokenizer = EsmTokenizer.from_pretrained(model_name)
model = EsmForProteinFolding.from_pretrained(
model_name,
low_cpu_mem_usage=True
)
model = model.cuda() if torch.cuda.is_available() else model
model.eval()
# Tokenize the sequence
tokenized = tokenizer(
sequence,
return_tensors="pt",
add_special_tokens=False
)
with torch.no_grad():
output = model(**tokenized)
# pLDDT: per-residue confidence score (closer to 100 = higher confidence)
plddt_scores = output.plddt.squeeze().cpu().numpy()
return {
"plddt_mean": float(plddt_scores.mean()),
"plddt_per_residue": plddt_scores.tolist(),
"positions": output.positions[-1].squeeze().cpu().numpy()
}
# Example: short helix-forming peptide
seq = "AAKAAAKAAAKAAAKAAAK"
result = predict_structure_esmfold(seq)
print(f"Mean pLDDT score: {result['plddt_mean']:.2f}")
print(f"Number of residues: {len(result['plddt_per_residue'])}")
AlphaFold2 vs ESMFold vs RoseTTAFold
| Model | MSA Required | Speed | Accuracy | Notes |
|---|---|---|---|---|
| AlphaFold2 | Yes | Slow | Very High | Gold standard, single-chain structures |
| AlphaFold3 | Yes | Slow | Highest | DNA/RNA/small-molecule complexes |
| ESMFold | No | Fast | High | LLM-based, single sequence only |
| RoseTTAFold | Yes | Medium | High | Complex structures, open source |
3. Drug Discovery AI: Molecular Graph Neural Networks
Representing Molecules as Graphs
In a molecule, atoms are nodes and chemical bonds are edges. Graph Neural Networks (GNNs) process this structure naturally and in a permutation-invariant way.
# Molecular property prediction with Chemprop
# pip install chemprop
import chemprop
from chemprop.data import MoleculeDataLoader, MoleculeDataset, MoleculeDatapoint
def predict_molecular_properties(smiles_list: list[str]) -> list[float]:
"""
Predict molecular properties with Chemprop D-MPNN.
Takes SMILES strings and predicts toxicity, solubility, etc.
"""
data = MoleculeDataset([
MoleculeDatapoint.from_smi(smi) for smi in smiles_list
])
loader = MoleculeDataLoader(dataset=data, batch_size=32)
# Load a pre-trained model (e.g., HIV inhibitor prediction)
model = chemprop.models.MPNN.load_from_checkpoint("hiv_model.ckpt")
predictions = []
for batch in loader:
pred = model(batch.mol_graph, batch.V_d, batch.E_d)
predictions.extend(pred.squeeze().tolist())
return predictions
# Example SMILES (aspirin, caffeine, ibuprofen)
molecules = [
"CC(=O)Oc1ccccc1C(=O)O", # aspirin
"Cn1cnc2c1c(=O)n(c(=O)n2C)C", # caffeine
"CC(C)Cc1ccc(cc1)C(C)C(=O)O" # ibuprofen
]
# Custom Molecular GNN with PyTorch Geometric
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data
class MolecularGNN(nn.Module):
"""Simple graph neural network for molecular property prediction"""
def __init__(self, node_features: int = 9, hidden_dim: int = 64):
super().__init__()
self.conv1 = GCNConv(node_features, hidden_dim)
self.conv2 = GCNConv(hidden_dim, hidden_dim)
self.conv3 = GCNConv(hidden_dim, hidden_dim)
self.classifier = nn.Sequential(
nn.Linear(hidden_dim, 32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1),
nn.Sigmoid()
)
def forward(self, data: Data) -> torch.Tensor:
x, edge_index, batch = data.x, data.edge_index, data.batch
x = torch.relu(self.conv1(x, edge_index))
x = torch.relu(self.conv2(x, edge_index))
x = torch.relu(self.conv3(x, edge_index))
# Pool all node features into a single graph-level vector
x = global_mean_pool(x, batch)
return self.classifier(x)
ADMET Prediction Pipeline
ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) filtering is central to drug candidate selection.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
def lipinski_filter(smiles: str) -> dict:
"""
Lipinski's Rule of Five — predicts oral bioavailability.
MW <= 500, LogP <= 5, HBD <= 5, HBA <= 10
"""
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return {"valid": False}
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Lipinski.NumHDonors(mol)
hba = Lipinski.NumHAcceptors(mol)
violations = sum([mw > 500, logp > 5, hbd > 5, hba > 10])
return {
"valid": True,
"MW": round(mw, 2),
"LogP": round(logp, 2),
"HBD": hbd,
"HBA": hba,
"violations": violations,
"drug_like": violations <= 1
}
# Test the filter
for smi in molecules:
result = lipinski_filter(smi)
print(f"SMILES: {smi[:30]}...")
print(f" Drug-like: {result['drug_like']}, Violations: {result['violations']}\n")
4. Climate and Energy AI
NeuralGCM: Physics-Driven Weather Prediction
Google's NeuralGCM combines traditional Numerical Weather Prediction (NWP) with neural networks. Physical equations model atmospheric dynamics, while neural networks parameterize subgrid-scale processes like cloud formation and turbulence.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
def solar_power_forecast(
weather_data: np.ndarray,
target_hours: int = 24
) -> np.ndarray:
"""
Solar power generation forecast.
Inputs: temperature, irradiance, wind speed, humidity, time-of-day
Output: hourly generation forecast (kWh)
"""
scaler = StandardScaler()
X_scaled = scaler.fit_transform(weather_data)
model = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=5,
random_state=42
)
# In production: model.fit(X_train, y_train)
# Simulated predictions
predictions = np.random.exponential(scale=50, size=target_hours)
predictions = np.clip(predictions, 0, 500) # 0–500 kWh range
return predictions
def co2_capture_optimization(
temperature: float,
pressure: float,
flow_rate: float
) -> dict:
"""
CO2 capture process optimization.
Optimizes parameters for a Direct Air Capture (DAC) system.
"""
# Simplified physics model
efficiency = (
0.85 * (1 - np.exp(-flow_rate / 100))
* (1 / (1 + np.exp((temperature - 60) / 10)))
* min(pressure / 1.5, 1.0)
)
energy_kwh_per_ton = 300 + (1 - efficiency) * 500
return {
"capture_efficiency": round(efficiency * 100, 2),
"energy_cost_kwh_per_ton": round(energy_kwh_per_ton, 1),
"optimal": efficiency > 0.75
}
# Parameter sweep
for temp in [40, 60, 80]:
result = co2_capture_optimization(temp, pressure=1.2, flow_rate=80)
print(f"Temp {temp}C: efficiency {result['capture_efficiency']}%")
5. Physics Simulation: PINN
Physics-Informed Neural Networks (PINN)
PINNs embed partial differential equations (PDEs) directly into the loss function, forcing the network to learn solutions that obey physical laws.
Consider the 1D heat equation .
The total loss has two components:
The physics loss enforces the PDE.
import torch
import torch.nn as nn
import numpy as np
class PINN(nn.Module):
"""
Physics-Informed Neural Network.
Solves the 1D heat equation: du/dt = alpha * d2u/dx2
"""
def __init__(self, hidden_layers: int = 4, neurons: int = 64):
super().__init__()
layers = [nn.Linear(2, neurons), nn.Tanh()]
for _ in range(hidden_layers - 1):
layers += [nn.Linear(neurons, neurons), nn.Tanh()]
layers += [nn.Linear(neurons, 1)]
self.net = nn.Sequential(*layers)
def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
inputs = torch.cat([x, t], dim=1)
return self.net(inputs)
def physics_loss(model: PINN, x: torch.Tensor, t: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
"""Compute PDE residual loss for the heat equation"""
x.requires_grad_(True)
t.requires_grad_(True)
u = model(x, t)
# Compute partial derivatives via automatic differentiation
u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
# PDE residual: du/dt - alpha * d2u/dx2 = 0
residual = u_t - alpha * u_xx
return torch.mean(residual ** 2)
def train_pinn(epochs: int = 5000) -> PINN:
"""Train a PINN for the heat equation"""
model = PINN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
N_pde = 1000 # PDE collocation points
N_bc = 100 # Boundary condition points
N_ic = 200 # Initial condition points
for epoch in range(epochs):
optimizer.zero_grad()
# PDE residual points (interior of domain)
x_pde = torch.rand(N_pde, 1)
t_pde = torch.rand(N_pde, 1)
loss_pde = physics_loss(model, x_pde, t_pde)
# Boundary condition: u(0,t) = u(1,t) = 0
x_bc = torch.zeros(N_bc, 1)
t_bc = torch.rand(N_bc, 1)
u_bc = model(x_bc, t_bc)
loss_bc = torch.mean(u_bc ** 2)
# Initial condition: u(x,0) = sin(pi*x)
x_ic = torch.rand(N_ic, 1)
t_ic = torch.zeros(N_ic, 1)
u_ic = model(x_ic, t_ic)
u_exact = torch.sin(np.pi * x_ic)
loss_ic = torch.mean((u_ic - u_exact) ** 2)
loss = loss_pde + 10 * loss_bc + 10 * loss_ic
loss.backward()
optimizer.step()
if epoch % 1000 == 0:
print(f"Epoch {epoch}: Loss = {loss.item():.6f}")
return model
print("PINN architecture: input(x,t) -> 4x64 Tanh -> output(u)")
Neural ODE: Continuous-Time Dynamics
Neural ODEs model the rate of change of a hidden state with a neural network, enabling continuous-time sequence modeling.
# pip install torchdiffeq
import torch
import torch.nn as nn
from torchdiffeq import odeint
class ODEFunc(nn.Module):
"""Defines the right-hand side f(t, y) = dy/dt"""
def __init__(self, dim: int = 2):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, dim)
)
def forward(self, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
return self.net(y)
class NeuralODE(nn.Module):
"""Neural ODE model"""
def __init__(self, dim: int = 2):
super().__init__()
self.odefunc = ODEFunc(dim)
def forward(self, y0: torch.Tensor, t_span: torch.Tensor) -> torch.Tensor:
"""
y0: initial state [batch, dim]
t_span: time points [T]
returns: states at each time point [T, batch, dim]
"""
return odeint(self.odefunc, y0, t_span, method='dopri5')
# Example: Lotka-Volterra predator-prey model
def simulate_lotka_volterra():
model = NeuralODE(dim=2)
# Initial conditions: 1000 rabbits, 100 foxes
y0 = torch.tensor([[1.0, 0.1]])
t = torch.linspace(0, 15, 300)
with torch.no_grad():
trajectory = model(y0, t)
print(f"Simulation shape: {trajectory.shape}") # [300, 1, 2]
return trajectory
Fourier Neural Operator (FNO)
FNO operates in the frequency domain to build resolution-independent PDE solvers.
import torch
import torch.nn as nn
import torch.fft
class SpectralConv2d(nn.Module):
"""Core FNO block: convolution in the spectral domain"""
def __init__(self, in_channels: int, out_channels: int, modes: int = 12):
super().__init__()
self.modes = modes
scale = 1 / (in_channels * out_channels)
self.weights = nn.Parameter(
scale * torch.rand(in_channels, out_channels, modes, modes,
dtype=torch.cfloat)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, C, H, W = x.shape
x_ft = torch.fft.rfft2(x)
out_ft = torch.zeros(B, self.weights.shape[1], H, W // 2 + 1,
dtype=torch.cfloat, device=x.device)
# Keep only the lowest retained modes (the full FNO also keeps the
# mirrored negative-frequency band along the first spatial axis)
out_ft[:, :, :self.modes, :self.modes] = torch.einsum(
'bixy,ioxy->boxy',
x_ft[:, :, :self.modes, :self.modes],
self.weights
)
return torch.fft.irfft2(out_ft, s=(H, W))
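Because the learned weights live on a fixed set of Fourier modes rather than on a grid, the same layer applies unchanged at any resolution. A minimal sketch, repeating the class above so it runs standalone:

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Same block as above, repeated for a self-contained demo."""
    def __init__(self, in_channels: int, out_channels: int, modes: int = 12):
        super().__init__()
        self.modes = modes
        scale = 1 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.rand(in_channels, out_channels, modes, modes,
                               dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(B, self.weights.shape[1], H, W // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes, :self.modes] = torch.einsum(
            'bixy,ioxy->boxy',
            x_ft[:, :, :self.modes, :self.modes], self.weights)
        return torch.fft.irfft2(out_ft, s=(H, W))

layer = SpectralConv2d(in_channels=1, out_channels=1, modes=12)
# The same parameters process a coarse and a fine grid:
coarse = layer(torch.randn(1, 1, 32, 32))
fine = layer(torch.randn(1, 1, 64, 64))
print(coarse.shape, fine.shape)
```

This is what makes FNO a "neural operator": it learns a map between function spaces, so a model trained on 64x64 simulations can be evaluated on 256x256 grids without retraining.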
6. AI Lab Automation: Self-Driving Laboratories
Bayesian Optimization for Experimental Design
A Self-Driving Lab (SDL) is a closed loop where AI designs experiments, robots execute them, and AI analyzes the results to propose the next experiment.
# pip install scikit-optimize
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
import numpy as np
# Define the experimental parameter space
# Example: perovskite solar cell optimization
search_space = [
Real(0.5, 2.0, name="pb_concentration"), # Pb concentration (mol/L)
Real(0.3, 0.8, name="ma_ratio"), # MA:FA ratio
Real(50, 150, name="annealing_temp"), # Annealing temperature (C)
Integer(10, 60, name="annealing_time"), # Annealing time (minutes)
Categorical(["DMF", "DMSO", "GBL"], name="solvent")
]
@use_named_args(search_space)
def experimental_objective(
pb_concentration, ma_ratio, annealing_temp,
annealing_time, solvent
) -> float:
"""
Objective function for an experiment (in practice, calls a robotic system).
Returns: negative PCE (Power Conversion Efficiency) — minimization problem.
"""
noise = np.random.normal(0, 0.5)
pce = (
15.0
+ 2.0 * np.exp(-((pb_concentration - 1.2) ** 2) / 0.1)
+ 1.5 * np.exp(-((ma_ratio - 0.6) ** 2) / 0.05)
- 0.05 * abs(annealing_temp - 100)
+ noise
)
print(f"Experiment: Pb={pb_concentration:.2f}, MA={ma_ratio:.2f}, "
f"T={annealing_temp:.0f}C -> PCE={pce:.2f}%")
return -pce # minimize negative PCE
# Run Bayesian optimization
result = gp_minimize(
func=experimental_objective,
dimensions=search_space,
n_calls=30, # total experiments
n_initial_points=10, # initial random exploration
acq_func="EI", # Expected Improvement acquisition function
random_state=42
)
print(f"\nBest PCE: {-result.fun:.2f}%")
print("Optimal parameters:")
for name, val in zip([s.name for s in search_space], result.x):
print(f" {name}: {val}")
7. Research Reproducibility: DVC and Experiment Tracking
Data Version Control with DVC
# Initialize DVC alongside Git
git init
dvc init
# Track large datasets (use DVC instead of Git-LFS)
dvc add data/protein_structures/
dvc add data/molecular_datasets/
# Configure remote storage
dvc remote add -d myremote s3://mybucket/dvc-store
# Define pipeline in dvc.yaml
# stages:
# preprocess:
# cmd: python preprocess.py
# deps: [data/raw/, src/preprocess.py]
# outs: [data/processed/]
# train:
# cmd: python train.py --seed 42
# deps: [data/processed/, src/train.py]
# outs: [models/]
# metrics: [metrics.json]
# Reproduce the full pipeline
dvc repro
dvc push
Experiment Tracking with MLflow
import mlflow
import mlflow.pytorch
import numpy as np
import torch
def train_with_tracking(config: dict) -> float:
"""Track experiment parameters and metrics with MLflow"""
with mlflow.start_run():
# Log hyperparameters
mlflow.log_params(config)
# Fix seeds for reproducibility
torch.manual_seed(config["seed"])
np.random.seed(config["seed"])
# Build model
model = MolecularGNN(
node_features=config["node_features"],
hidden_dim=config["hidden_dim"]
)
# Training loop with metric logging (metrics simulated here for illustration)
for epoch in range(config["epochs"]):
train_loss = np.random.exponential(1.0) / (epoch + 1)
val_auc = 1 - np.exp(-epoch / 20)
mlflow.log_metric("train_loss", train_loss, step=epoch)
mlflow.log_metric("val_auc", val_auc, step=epoch)
# Save model artifact
mlflow.pytorch.log_model(model, "model")
mlflow.log_metric("final_val_auc", val_auc)
return val_auc
config = {
"seed": 42,
"node_features": 9,
"hidden_dim": 128,
"epochs": 100,
"lr": 1e-3,
"dataset": "tox21"
}
mlflow.set_experiment("molecular-property-prediction")
# auc = train_with_tracking(config)
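The config above fixes a seed, but full reproducibility means seeding every RNG in play, not just PyTorch's. A helper along these lines (a sketch; the cuDNN flags trade speed for determinism) is worth keeping in any research codebase:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all common sources of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    # Force deterministic cuDNN convolution algorithms (slower, reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
print(torch.equal(a, b))  # re-seeding reproduces the same draws
```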
Reproducible Environments with Docker
# Dockerfile
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
RUN apt-get update && apt-get install -y \
git wget curl \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Environment variables for deterministic behavior
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:4096:8
WORKDIR /workspace
COPY . .
Quiz
Q1. Why does AlphaFold2 use Multiple Sequence Alignment (MSA) for protein structure prediction?
Answer: To infer spatially proximate residue pairs from co-evolutionary patterns encoded in the MSA.
Explanation: Over millions of years of evolution, residue pairs that interact functionally tend to co-vary — when one mutates, the other often compensates. By analyzing correlated mutations across an MSA, one can predict which residue pairs are in contact in the 3D structure. AlphaFold2's Evoformer processes this co-evolutionary information as a pair representation and iteratively refines it alongside the MSA representation. ESMFold bypasses the need for an explicit MSA because its large protein language model (PLM) implicitly captures evolutionary information through pre-training on hundreds of millions of protein sequences.
Q2. How does a PINN incorporate physical laws into its loss function?
Answer: By computing PDE residuals through automatic differentiation of the network output and adding them as a regularization term in the loss.
Explanation: The key idea is to treat the neural network output as a candidate solution and penalize violations of the governing PDE. PyTorch's autograd computes exact partial derivatives such as ∂u/∂t and ∂²u/∂x² through the network graph. The physics loss measures how well the network satisfies the PDE at a set of collocation points sampled inside the domain. Because physical laws constrain the solution everywhere — not just where data exists — PINNs often extrapolate more reliably than purely data-driven models.
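The mechanism is easy to see in isolation. The sketch below differentiates a toy closed-form u (standing in for a trained network) at a batch of collocation points and checks the heat-equation residual u_t − α·u_xx; for this particular u with α = 1/π² the residual vanishes up to float precision:

```python
import torch

# Toy candidate solution u(x, t) = sin(pi*x) * exp(-t), in place of a network
x = torch.linspace(0, 1, 50).reshape(-1, 1).requires_grad_(True)
t = torch.full_like(x, 0.3).requires_grad_(True)
u = torch.sin(torch.pi * x) * torch.exp(-t)

# Exact partial derivatives via autograd, as a PINN computes them
u_t = torch.autograd.grad(u, t, grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
u_x = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
u_xx = torch.autograd.grad(u_x, x, grad_outputs=torch.ones_like(u_x),
                           create_graph=True)[0]

# Residual of the heat equation u_t = alpha * u_xx
alpha = 1.0 / torch.pi ** 2  # this u satisfies the PDE exactly for this alpha
residual = u_t - alpha * u_xx
print(f"max |residual|: {residual.abs().max().item():.2e}")
```

In a PINN, `u` comes from the network, the residual is generally nonzero, and its mean square becomes the physics term of the loss.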
Q3. Why is the atom-node / bond-edge graph representation well-suited for molecular GNNs?
Answer: Because a molecule's chemical properties are determined by its atom types and bonding topology — a natural graph structure.
Explanation: While SMILES strings linearize molecules into 1D sequences, their true structure is a graph. GNNs exploit this with message passing: each atom aggregates information from neighboring atoms, building up a representation of its local chemical environment. Stacking multiple layers allows the model to capture increasingly global structural context. Critically, this representation is permutation-invariant — the prediction is independent of how atoms are ordered — which matches the physical reality that a molecule has no canonical atom ordering. Chemprop's Directed Message Passing Neural Network (D-MPNN) further improves ring-system representations by passing messages along directed edges.
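Permutation invariance can be verified directly: relabel the atoms of a tiny hypothetical graph (permuting node features and the adjacency matrix consistently) and check that a sum-pooled message-passing readout is unchanged:

```python
import torch

torch.manual_seed(0)
features = torch.randn(4, 8)            # 4 atoms, 8 features each
adj = torch.tensor([[0, 1, 0, 0],       # hypothetical bond structure
                    [1, 0, 1, 1],
                    [0, 1, 0, 0],
                    [0, 1, 0, 0]], dtype=torch.float)
W = torch.randn(8, 8)                   # shared message weights

def gnn_readout(feat, adj):
    """One round of neighbor aggregation, then order-independent sum pooling."""
    messages = adj @ feat @ W           # each atom sums its neighbors' messages
    return torch.relu(messages).sum(dim=0)

perm = torch.tensor([2, 0, 3, 1])       # relabel the atoms
out_a = gnn_readout(features, adj)
out_b = gnn_readout(features[perm], adj[perm][:, perm])
print(torch.allclose(out_a, out_b, atol=1e-6))  # True
```

A sequence model over SMILES has no such guarantee: two SMILES strings for the same molecule can yield different predictions.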
Q4. What is the mathematical justification for Bayesian optimization outperforming grid search in experimental design?
Answer: Bayesian optimization uses a Gaussian Process surrogate model and an acquisition function (e.g., Expected Improvement) to balance exploration and exploitation, avoiding the exponential scaling of grid search.
Explanation: Grid search scales exponentially with the number of parameters — the "curse of dimensionality." Bayesian optimization has two components: (1) a Gaussian Process that maintains a posterior distribution over the objective function conditioned on all previous evaluations, and (2) an acquisition function such as Expected Improvement (EI) that analytically identifies the point most likely to improve over the current best. EI naturally balances exploration (high uncertainty regions) and exploitation (promising regions near the current optimum). As each experiment is expensive (e.g., a synthesis run), converging to a near-optimal configuration in 30–50 experiments — rather than thousands — is a decisive practical advantage.
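Under the GP posterior with mean μ(x) and standard deviation σ(x), EI has a closed form (stated here for minimization, with f(x⁺) the best value observed so far):

```latex
\mathrm{EI}(x) = \mathbb{E}\!\left[\max\!\left(f(x^{+}) - f(x),\, 0\right)\right]
             = \left(f(x^{+}) - \mu(x)\right)\Phi(Z) + \sigma(x)\,\varphi(Z),
\qquad Z = \frac{f(x^{+}) - \mu(x)}{\sigma(x)}
```

where Φ and φ are the standard normal CDF and PDF. The first term rewards points whose predicted mean beats the incumbent (exploitation); the second rewards high posterior uncertainty (exploration).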
Q5. Why do Neural ODEs model continuous time-series data more naturally than standard RNNs?
Answer: Neural ODEs model system dynamics directly as a continuous differential equation, rather than assuming fixed discrete time steps.
Explanation: Standard RNNs assume a fixed, uniform time step. This makes them ill-suited for irregularly sampled data or systems governed by continuous physical dynamics. A Neural ODE defines dy/dt = f_θ(y, t) with a neural network and integrates it with an adaptive solver (e.g., Dormand-Prince RK45), which can evaluate the state at any time point. This yields three key advantages: (1) handling arbitrary time resolution without retraining, (2) representing continuous dynamics with fewer parameters than a deep RNN, and (3) computing gradients memory-efficiently via the adjoint method rather than storing intermediate activations. Neural ODEs are particularly well-suited for pharmacokinetics, climate variable trajectories, and any domain where an underlying continuous ODE governs the system.
Conclusion: The Future of AI for Science
AI is accelerating every phase of scientific research.
- Protein structure: AlphaFold3 now predicts complexes with DNA, RNA, and small molecules
- Drug discovery: Generative molecular AI compresses candidate identification from years to months
- Climate science: AI weather models like NeuralGCM and Pangu-Weather are beginning to match or exceed classical numerical forecasts
- Physics simulation: PINNs and FNO accelerate CFD and quantum simulations by orders of magnitude
- Self-Driving Labs: Closed-loop AI-robotic systems dramatically improve efficiency in materials discovery and chemical synthesis
The key insight underlying all of scientific AI is the synergy between domain knowledge and data. Embedding physical laws, chemical structure, or evolutionary information into AI models — rather than treating them as black boxes — consistently produces the most powerful and trustworthy results.