AI for Biology & Drug Discovery 2026 Complete Guide — AlphaFold 3, RoseTTAFold, ESM Atlas, Boltz, Chai-1, RFdiffusion, Isomorphic Labs, Recursion, Insilico Deep Dive
- Authors: Youngju Kim (@fjvbn20031)
Prologue — What the 2024 Nobel Prize in Chemistry Means
On October 9, 2024, the Royal Swedish Academy of Sciences announced the Chemistry laureates: David Baker (University of Washington), Demis Hassabis (CEO of DeepMind), and John Jumper (Senior Director, DeepMind). Half the prize went to Baker for de novo protein design, the other half to Hassabis and Jumper for AlphaFold 2 and protein structure prediction.
This was more than academic recognition. It was the official declaration that AI had solved a 50-year-old biology problem (the protein folding problem), and at the same time a signal that the companies industrializing that AI — DeepMind, Isomorphic Labs, Generate:Biomedicines, Recursion, Insilico Medicine, and others — were redrawing the future of drug discovery.
"Protein structure prediction is dead. The real game starts now." — wrote a molecular biologist on X right after the prize announcement. AlphaFold 2 solved static structure prediction; the next step is dynamic interactions, drug binding, and new protein design. And as of 2026, all of these are exploding simultaneously.
What this guide covers:
- The map of AI biology — the camps of 2026
- History of protein structure prediction — Anfinsen to AlphaFold
- AlphaFold 2, 3, and Server — the DeepMind line
- The RoseTTAFold family — Baker Lab's answer
- ESM-2, ESM-3, ESM Atlas — from Meta to EvolutionaryScale
- Boltz-1, Boltz-2 — MIT's open reproduction
- Chai-1, Protenix — the new entrants
- ColabFold, OmegaFold — the accessibility revolution
- RFdiffusion + ProteinMPNN — a new paradigm for protein design
- Antibody design — AbDesign, IgFold, Absci
- Small molecules and docking — MolMIM, DiffDock, NeuralPLexer
- Isomorphic Labs — DeepMind's drug discovery subsidiary
- Recursion Pharmaceuticals + Exscientia merger
- Insilico Medicine — pioneer of generative AI drug discovery
- Schrödinger, Atomwise, BenevolentAI, Cradle
- Genomics AI — DeepVariant, Enformer, Geneformer, scGPT
- Cell imaging AI — Cell Painting, JUMP-CP, CellPose
- Clinical trial AI — Saama, Unlearn.ai
- Bio foundation models — BioGPT, GeneGPT, NACL
- Korean AI bio — Standigm, Deep Bio, Syntekabio
- Japanese AI bio — Preferred Networks, Elix, MOLCURE
- Datasets and benchmarks — PDB, UniProt, ChEMBL, AlphaFold DB
- Simulation infrastructure — GROMACS, AMBER, DESMOND
- Ethics and regulation — new safety standards
- References
1. The Map of AI Biology — Camps of 2026
As of May 2026, AI biology splits into roughly five camps.
1) The structure prediction camp: predicts 3D structure from a given protein sequence. AlphaFold 2/3, RoseTTAFold, ESMFold, Boltz, Chai-1, OmegaFold, ColabFold, and Protenix belong here. AlphaFold 3's 2024 launch, which modeled not just proteins but DNA, RNA, ligands, and ions simultaneously, raised the bar.
2) The protein design camp: "Let's build proteins with the function we want from scratch." RFdiffusion, ProteinMPNN, ESM3 (its generative form), Chroma, and Genie are the main names. Baker Lab and Generate:Biomedicines are the twin pillars here.
3) The drug discovery company camp: companies that actually run clinical pipelines. Isomorphic Labs (Alphabet), Recursion (merged with Exscientia), Insilico Medicine, Schrödinger, Atomwise, BenevolentAI, Cradle, Absci, Generate:Biomedicines.
4) The genomics + single-cell camp: modeling DNA sequence, gene expression, and cell state. DeepVariant (variant calling), Enformer (expression prediction), Geneformer and scGPT (single-cell foundation models), and AlphaMissense (variant effect prediction) are representative.
5) The imaging + phenotypic camp: reading drug effects directly from cell images. Recursion's "Maps" platform, the public JUMP-CP dataset, and analysis tools like CellPose and CellProfiler form the core.
These camps overlap. Recursion does imaging plus design plus drugs. EvolutionaryScale does prediction plus design with ESM3. So instead of asking "which camp is a company in," you should ask what problem each company is trying to solve.
2. History of Protein Structure Prediction — Anfinsen to AlphaFold
A short history first. Protein structure prediction was a 50-year-old problem.
1972: Christian Anfinsen receives the Nobel Prize in Chemistry. He experimentally proves the "Anfinsen dogma" — that a protein's three-dimensional structure is determined by its one-dimensional amino acid sequence. If true, in principle structure should be predictable from sequence alone.
1994 to 2020: CASP (Critical Assessment of protein Structure Prediction) runs every two years. Traditional methods — homology modeling, threading, fragment assembly, Rosetta — make incremental progress, but GDT-TS (accuracy metric) is stuck in the 60s to 70s.
CASP13, 2018: DeepMind's first-generation AlphaFold records a GDT-TS of 58.9, leaving the second-place group about six points behind. Academia is stunned.
CASP14, December 2020: AlphaFold 2 scores a median GDT-TS of 92.4, above the roughly 90 that is generally regarded as competitive with experiment. CASP co-founder John Moult declares the protein structure prediction problem "largely solved."
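For readers new to the metric: GDT-TS averages the percentage of Cα atoms that land within 1, 2, 4, and 8 Å of their experimental positions. The sketch below assumes the model and the experimental structure are already superimposed and that per-residue distances are given. The official CASP calculation also searches over many superpositions, so this only illustrates the formula. The function name and sample distances are invented for the example.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT-TS from per-residue C-alpha distances (angstroms).

    Assumes one fixed superposition; the real metric maximizes each
    fraction over many superpositions.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each distance cutoff
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical distances for a 10-residue model
example = [0.4, 0.8, 1.5, 1.9, 2.5, 3.0, 3.9, 5.0, 7.5, 12.0]
print(round(gdt_ts(example), 1))  # 55.0
```

A score of 92.4 therefore means that, averaged over the four cutoffs, more than nine in ten residues sit where the experiment says they should.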
July 2021: AlphaFold 2 code and weights are released open source. The AlphaFold DB launches at the same time — first the human proteome (~20,000 proteins), then expanding to over 200 million predicted structures by 2022.
July 2021: David Baker's team announces RoseTTAFold. Same period as AlphaFold 2, similar accuracy. An attention-based three-track (sequence, distance, coordinates) architecture.
November 2022: Meta AI (FAIR) releases ESMFold and ESM Atlas. They predict over 600 million metagenomic protein structures and release them publicly. Without multiple sequence alignment (MSA), prediction runs through a language model only.
May 2024: AlphaFold 3 announced. Models not just proteins but DNA, RNA, small molecules (ligands), and ions simultaneously. At launch the model was closed, with access only through the AlphaFold Server web interface; in November 2024 DeepMind released the code, with weights available on request for non-commercial academic use.
June 2024: EvolutionaryScale releases ESM-3, the generative evolution of the ESM line (ESMFold = prediction, ESM-3 = prediction plus generation).
October 2024: Nobel Prize in Chemistry, half to David Baker, half to Demis Hassabis and John Jumper.
Late 2024 through 2025: Chai Discovery's Chai-1, MIT's Boltz-1, and ByteDance's Protenix arrive in quick succession as AlphaFold 3-class open models.
2026 today: Boltz-2 has shipped, AlphaFold 4 is rumored, and RFdiffusion All-Atom is reaching clinical candidate compounds. It has also become obvious that structure prediction by itself is no longer a differentiator.
3. AlphaFold 2, 3, and Server — The DeepMind Line
AlphaFold 2 (2021) architecture in essence.
- Input: target protein sequence plus MSA (multiple sequence alignment, evolutionary information)
- Evoformer: refines sequence and pair representations via attention
- Structure module: directly generates 3D coordinates. Rotations and translations handled in SE(3)-equivariant form
- Outputs confidence metrics like pLDDT and pTM
AlphaFold 2 specializes in static structure prediction. Dynamic conformations, binding-state changes, and interactions with small molecules required separate tools.
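In practice, pLDDT is the first thing to check on any prediction. AlphaFold writes the per-residue pLDDT into the B-factor column of its PDB output, so it can be read back with plain string slicing. The sketch below assumes standard fixed-width PDB ATOM records. The helper names and the three-residue demo structure are invented for illustration, and the band labels follow the convention used on AlphaFold DB.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, bfactor):
    """Format one fixed-width PDB ATOM record (B-factor in columns 61-66)."""
    return (f"ATOM  {serial:5d} {name:^4s} {resname:3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfactor:6.2f}")


def mean_plddt(pdb_text):
    """Average pLDDT over C-alpha atoms, read from the B-factor column."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return sum(scores) / len(scores)


def confidence_band(plddt):
    """Map a pLDDT value (0-100) to a qualitative confidence label."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


# Hypothetical three-residue structure
demo = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 95.0),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 95.0),
    atom_line(3, "CA", "LYS", "A", 2, 5.3, 0.0, 0.0, 72.5),
    atom_line(4, "CA", "THR", "A", 3, 9.1, 0.0, 0.0, 40.0),
])
score = mean_plddt(demo)
print(f"{score:.1f} -> {confidence_band(score)}")
```

Low-pLDDT stretches often correspond to flexible or disordered regions, which is one practical reminder that the model describes a static structure.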
AlphaFold 3 (2024) tackles those limits head on.
- Handles protein plus DNA plus RNA plus ligands plus ions in a single model
- Diffusion-based coordinate generation — the structure module replaced with a diffusion model
- DeepMind reports at least a 50% accuracy improvement over existing methods for protein interactions with other molecule types (most visibly protein-ligand complexes)
- Code and weights were closed at launch, with access through the AlphaFold Server only. In November 2024 DeepMind released the code and made weights available on request, for non-commercial academic use only.
The initial closed policy provoked major debate. The widely assumed rationale was commercial: Isomorphic Labs (a sister company) uses the model for drug discovery. In response, MIT, Chai Discovery, and ByteDance started open reproductions, and within a year nearly equivalent open models were released.
AlphaFold Server launched May 2024. Anyone logs in with a Google account, enters a sequence, and gets a structure within 24 hours. Academic usage exploded. Caveats:
- Results downloadable, model itself closed
- Non-commercial use only
- Daily job limits
As of 2026 AlphaFold DB offers about 214 million structures for free. It has predictions for nearly every protein registered in UniProt, not just the human proteome.
4. The RoseTTAFold Family — Baker Lab's Answer
The David Baker lab at the University of Washington (Nobel laureate) is DeepMind's rival on both structure prediction and design. Their answer is the RoseTTAFold series.
RoseTTAFold (2021)
- 3-track architecture: learns sequence, distance, and coordinates simultaneously
- Released around the same time as AlphaFold 2 with similar accuracy (slightly lower but faster)
- Open source
RoseTTAFold2 (2023)
- Nearly identical accuracy to AlphaFold 2
- Handles larger proteins
- Enhanced protein-protein complex prediction
RoseTTAFold All-Atom (RFAA, 2023)
- Protein plus DNA plus RNA plus ligand plus cofactor in a single model
- A similar concept to AlphaFold 3 but released earlier
- Open source plus weights public
RFdiffusion (2023, design)
- Diffusion model that generates protein backbones from scratch
- Used for binder, enzyme, and antibody design
- One of the core contributions cited in the Nobel
RFdiffusion All-Atom (2024)
- Builds protein backbones directly around a given small-molecule ligand, using RoseTTAFold All-Atom as the underlying network
- Used to design binding pockets and binders for small molecules
Baker Lab's value proposition is unambiguous: open, design, application. All models are released and design tools beyond raw prediction are bundled.
5. ESM-2, ESM-3, ESM Atlas — From Meta to EvolutionaryScale
The ESM (Evolutionary Scale Modeling) series was Meta AI's (formerly FAIR) protein language model project.
ESM-1, ESM-2 (2019-2022)
- Transformers treating protein sequence like text
- Pre-trained on roughly 65 million UniRef50 sequences
- The largest ESM-2 has 15 billion parameters
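"Treating protein sequence like text" means training with a masked-language-modeling objective: hide a fraction of residues and ask the model to recover them from context. The sketch below shows only the input preparation, not the model. The 15% mask rate follows the BERT-style convention, and the function name and `<mask>` token string are assumptions made for illustration.

```python
import random

MASK = "<mask>"  # placeholder token name, chosen for this example


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (tokens, targets): tokens with some residues hidden, and a
    dict mapping each masked position to the residue the model must predict."""
    if not seq:
        raise ValueError("empty sequence")
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        tokens[i] = MASK
    return tokens, targets


tokens, targets = mask_sequence("MKTLLLTLVVVTIVCLDLGY")
print(" ".join(tokens))
print(targets)
```

Because the only supervision is the sequence itself, this objective scales to tens of millions of unlabeled sequences, which is what makes the approach work without MSAs.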
ESMFold (2022)
- Attaches a structure prediction head to ESM-2
- Predicts structure from sequence alone, without MSA, and runs up to about 60 times faster than AlphaFold 2 on short sequences
- Slightly lower accuracy, but powerful where MSA is hard to build (metagenomic proteins, etc.)
ESM Atlas (2022)
- Predicted 617 million metagenomic protein structures using ESMFold
- First public visualization of the "dark proteome" from soil, ocean, and human microbiome
- Together with AlphaFold DB, one of the two pillars of the proteomic universe
2023 to 2024: Meta winds down the FAIR protein team, and its researchers found EvolutionaryScale as a separate company, which launches publicly in 2024. Alex Rives (ESM lead author) is a co-founder.
ESM-3 (2024, EvolutionaryScale)
- A multimodal generative model unifying sequence, structure, and function
- Beyond prediction it can also generate — design proteins with desired functions
- The largest ESM-3 has 98 billion parameters
- Only partially open: the larger models are API-only
- The smallest model (about 1.4B parameters) is released with open weights under a non-commercial license
EvolutionaryScale showcased ESM-3 with an evolution-simulation experiment (esmGFP) that designed a new GFP variant by compressing roughly 500 million years of evolutionary trajectory.
6. Boltz-1, Boltz-2 — MIT's Open Reproduction
When AlphaFold 3 went closed, MIT's Regina Barzilay group and collaborators released Boltz-1 in November 2024.
Boltz-1 (2024)
- AlphaFold 3-class accuracy (protein plus nucleic acid plus ligand plus ion)
- Fully open under the MIT license — code and weights
- Trained on publicly available structure data
- A game changer for commercial researchers who cannot use AlphaFold Server
Boltz-2 (2025)
- About 1.5x faster than Boltz-1
- Adds binding affinity prediction
- Memory efficiency improvements enable larger systems
- Same MIT license
Boltz's contribution is simple: "Can't use AlphaFold 3? Use Boltz-2." Free for internal R&D at pharma, academic research, and commercial applications alike.
Here's an example of invoking Boltz-2 from the command line.
# Install Boltz (PyPI)
pip install boltz
# Prepare input FASTA (header format: >CHAIN_ID|ENTITY_TYPE)
cat > target.fasta <<EOF
>A|protein
MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK
EOF
# Predict structure for a single protein, fetching the MSA from the public server
boltz predict target.fasta --use_msa_server --out_dir results/
# Outputs: mmCIF by default (pass --output_format pdb for PDB); look for a predictions/target/ folder under results/
YAML input also supports protein-ligand complexes.
version: 1
sequences:
  - protein:
      id: A
      sequence: MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK
  - ligand:
      id: B
      smiles: "CC(=O)OC1=CC=CC=C1C(=O)O" # aspirin
properties:
  - affinity:
      binder: B
A single GPU (A100 80GB) handles medium-sized proteins in 1-5 minutes.
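When screening many ligands against one target, it is convenient to generate these YAML files programmatically. The sketch below builds the same structure as the example above with plain string formatting, so it needs no YAML library. The function name and the output file name are hypothetical, and the keys simply mirror the example rather than an exhaustive reading of the Boltz schema.

```python
import json


def boltz_affinity_yaml(protein_seq, ligand_smiles, protein_id="A", ligand_id="B"):
    """Render a protein-ligand input with an affinity request as YAML text."""
    lines = [
        "version: 1",
        "sequences:",
        "  - protein:",
        f"      id: {protein_id}",
        f"      sequence: {protein_seq}",
        "  - ligand:",
        f"      id: {ligand_id}",
        # json.dumps yields a double-quoted string, which is also valid YAML
        f"      smiles: {json.dumps(ligand_smiles)}",
        "properties:",
        "  - affinity:",
        f"      binder: {ligand_id}",
    ]
    return "\n".join(lines) + "\n"


text = boltz_affinity_yaml(
    "MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK",
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
)
with open("complex.yaml", "w") as handle:  # hypothetical file name
    handle.write(text)
print(text)
```

The resulting file can then be passed to `boltz predict` in place of the FASTA input.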
7. Chai-1, Protenix — The New Entrants
Chai Discovery is a startup that appeared in fall 2024, building AlphaFold 3-class models in house.
Chai-1 (2024)
- Protein plus nucleic acid plus ligand plus ion
- Accuracy slightly below AlphaFold 3, comparable to Boltz-1
- Weights and inference code released (non-commercial at launch, since relicensed under Apache 2.0)
- Also provides a web UI — anyone can try
- Especially strong on antibody modeling
Chai-1r (2025)
- Adds binding affinity prediction
- Reinforcement learning-based reranking
- Used in binder design simulations
Protenix (ByteDance, 2024)
- Released by ByteDance Research (TikTok's parent)
- An AlphaFold 3 reproduction, fully open under Apache 2.0
- Weights plus training code
- Accuracy similar to Boltz-1
Thanks to these three models — Boltz, Chai, and Protenix — by spring 2025 there were effectively three open models at AlphaFold 3-class accuracy. DeepMind's closed policy paradoxically accelerated the open ecosystem.
8. ColabFold, OmegaFold — The Accessibility Revolution
AlphaFold 2 was released, but running it required expensive GPUs and enormous MSA databases (BFD, UniRef30, and others, several TB in total). What made it accessible to everyone was ColabFold.
ColabFold (2022)
- A notebook built by Sergey Ovchinnikov, Milot Mirdita, Martin Steinegger, and collaborators
- Runs AlphaFold 2 plus RoseTTAFold plus ESMFold on Google Colab
- Replaces the slow local MSA search over BFD-scale databases with a fast MMseqs2-based server
- An undergraduate can predict a protein structure in 30 minutes
- About one million users by 2025
OmegaFold (2022)
- Announced by Helixon
- Works without MSA
- Similar concept to ESMFold but trained separately
- More accurate than ESMFold on certain cases
ColabFold's significance is democratization: Nobel-level technology driven from a laptop browser, with the GPU work done in the cloud. As of 2025 ColabFold is gradually integrating AlphaFold 3, Boltz-2, and Chai-1 as well.
9. RFdiffusion + ProteinMPNN — A New Paradigm for Protein Design
So far it has been about prediction. Now let's move to design.
Traditional protein design was attempted with physics-based simulation like Rosetta. Evaluate possible side-chain combinations and find low-energy structures. Slow, and hard to invent new protein folds.
RFdiffusion (Baker Lab, 2023) changed the game.
- Generates protein backbones from scratch with a diffusion model
- Input: a portion of the target protein structure you want to bind plus the binding site
- Output: a new protein backbone that can bind at that location
- One of the Nobel-cited technologies
ProteinMPNN (Baker Lab, 2022)
- Given a backbone, generates an amino acid sequence that fits that backbone
- A message-passing graph neural network
- "Generate the backbone with RFdiffusion, fill in the sequence with ProteinMPNN" is the standard pipeline
The actual workflow of the RFdiffusion + ProteinMPNN pipeline:
- Choose a binding site on the target protein
- Use RFdiffusion to generate 10,000 backbones that could bind at that site
- Use ProteinMPNN to assign sequences to each backbone (8 per backbone)
- Refold those sequences with AlphaFold 2 to verify they match the backbones
- Express the top 100 in the wet lab and measure binding affinity
This pipeline put 10+ new binder proteins into preclinical or clinical stage in 2024 alone.
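The funnel arithmetic is worth making explicit: 10,000 backbones times 8 sequences gives 80,000 candidates to refold, of which about 100 reach the bench. The sketch below shows the self-consistency filter in the refolding step. It assumes each candidate already has an RMSD between the refolded and designed backbone plus a refolding pLDDT. The thresholds (2.0 Å, pLDDT 80) and all names are illustrative assumptions, not values from any specific paper.

```python
from dataclasses import dataclass


@dataclass
class Design:
    backbone_id: int
    seq_id: int
    rmsd: float   # refolded vs. designed backbone, angstroms
    plddt: float  # refolding confidence, 0-100


def select_for_wet_lab(designs, max_rmsd=2.0, min_plddt=80.0, top_n=100):
    """Keep designs that refold onto their intended backbone, best first."""
    passing = [d for d in designs if d.rmsd <= max_rmsd and d.plddt >= min_plddt]
    # Rank by confidence, breaking ties with the closer structural match
    passing.sort(key=lambda d: (-d.plddt, d.rmsd))
    return passing[:top_n]


n_backbones, seqs_per_backbone = 10_000, 8
print(n_backbones * seqs_per_backbone)  # 80000 candidates to refold

demo = [
    Design(0, 0, rmsd=1.0, plddt=90.0),
    Design(0, 1, rmsd=3.0, plddt=95.0),  # confident, but the wrong fold
    Design(1, 0, rmsd=1.5, plddt=85.0),
    Design(1, 1, rmsd=0.8, plddt=70.0),  # right fold, low confidence
]
for d in select_for_wet_lab(demo, top_n=2):
    print(d.backbone_id, d.seq_id)
```

The filter is cheap compared with expression and binding assays, which is why it is worth refolding every candidate before anything goes to the wet lab.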
RFdiffusion All-Atom (2024) generates the backbone with the ligand's atoms explicitly present, so the protein is shaped around the small molecule from the start. For example, you can design a binding pocket that fits closely around a drug molecule.
10. Antibody Design — AbDesign, IgFold, Absci
Antibodies are the most important biologic drug category (about $200 billion in 2024 revenue). So antibody design AI forms its own large market.
IgFold (Johns Hopkins, 2022)
- Specialized for antibody structure prediction (more accurate than vanilla AlphaFold)
- Enhanced CDR (complementarity-determining region) modeling
- Open source
ABodyBuilder (Oxford OPIG, 2024)
- Rapid modeling of antibody variable regions
- Under-1-second prediction on a single GPU
AbDesign / RFdiffusion-Ab (Baker Lab, 2024)
- Fine-tunes RFdiffusion for antibody design
- Generates antibodies that bind a target antigen from scratch
- Achieves about 1% or higher hit rate in wet-lab validation (10x to 100x over traditional display methods)
Absci (Nasdaq listed, 2021)
- "Generative AI for antibody discovery"
- Combines in-house ML and wet lab
- 2024 partnerships with GSK, Merck, and others
- Designs and expresses target-binding antibodies within six weeks
Generate:Biomedicines (founded 2018, $273M Series C in 2023)
- Incubated by Flagship Pioneering
- Develops the Chroma model in-house — antibody plus general protein design
- Multiple collaborations with global big pharma
The core KPIs for antibody design are affinity (binding affinity, Kd) and developability (aggregation, viscosity, immunogenicity). Optimizing both axes simultaneously is the challenge for AI.
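To put Kd on an energy scale, the standard relation is ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The small sketch below converts a few dissociation constants at 25 °C; the function name is made up for the example.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


for label, kd in [("1 uM", 1e-6), ("1 nM", 1e-9), ("10 pM", 1e-11)]:
    print(f"{label}: {binding_free_energy(kd):.2f} kcal/mol")
```

Each tenfold improvement in Kd is worth roughly 1.4 kcal/mol at room temperature, which is why moving from micromolar to nanomolar affinity is a substantial design step.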
11. Small Molecules + Docking — MolMIM, DiffDock, NeuralPLexer
The small molecule side has also moved fast under AI.
SMILES and SELFIES
- SMILES: a string representation of molecules (e.g., CC(=O)OC1=CC=CC=C1C(=O)O is aspirin)
- SELFIES: addresses SMILES limitations and always represents valid molecules
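The limitation SELFIES targets is easy to show: an arbitrary edit to a SMILES string can leave a branch or ring unclosed, so a generative model can emit strings that are not molecules at all. The sketch below is a deliberately shallow syntax check covering balanced parentheses and paired single-digit ring closures only. It ignores valence, aromaticity, and two-digit `%nn` closures, so it is not a substitute for a real cheminformatics parser such as RDKit. The function name is invented for the example.

```python
def looks_like_valid_smiles(smiles):
    """Shallow syntactic check: branches closed, ring-closure digits paired."""
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if in_bracket:
            # Digits inside [...] are isotopes, charges, or H counts
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


print(looks_like_valid_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin: True
print(looks_like_valid_smiles("CC(=O)OC1=CC=CC=C"))         # ring left open: False
```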
Mol-BERT, ChemBERTa, MoLFormer (2020-2022)
- Transformers pre-trained on SMILES
- Used for molecular property prediction
MolMIM (NVIDIA, 2024)
- A molecular generation model, part of NVIDIA BioNeMo
- Generates molecules with similar but improved properties starting from an input molecule
- Accelerates the medicinal chemist's hit-to-lead phase
DiffDock (MIT, 2023)
- Diffusion-based docking model
- Directly generates protein-ligand binding poses
- Tens of times faster than traditional docking (AutoDock Vina, etc.)
NeuralPLexer (2024, Caltech)
- Takes protein and ligand together as input and predicts the binding complex
- Considers cofactors and accessory proteins
AlphaFold 3, Boltz-2, and Chai-1 also predict small molecule binding in the end, so the docking field and the structure prediction field are practically merging.
12. Isomorphic Labs — DeepMind's Drug Discovery Subsidiary
Isomorphic Labs is Alphabet's drug discovery subsidiary, spun off in November 2021. Demis Hassabis is CEO concurrently with DeepMind.
Mission: "Re-imagining drug discovery through AI." AlphaFold is the basic tool for drug discovery.
Strategy:
- Dual-track strategy of internal pipeline plus big-pharma partnerships
- 2024 deal with Eli Lilly: $45 million upfront plus up to $1.7 billion in milestones
- 2024 deal with Novartis: $37.5 million upfront plus up to $1.2 billion in milestones
- Own candidates focus on oncology and immunology
Tech stack:
- AlphaFold 3 is core (restricted to non-commercial use externally, used commercially in house)
- Proprietary design model plus docking plus ADMET prediction
- Minimizes in-house wet lab, collaborates with CROs
Closed policy: Isomorphic's existence is widely seen as the reason AlphaFold 3 is not available for commercial use. Had AF3 been fully open, every big pharma could have used it internally, weakening Isomorphic's business model.
2025 status: First IND-enabling candidates are imminent. Phase 1 entry targeted within 2026.
13. Recursion Pharmaceuticals + Exscientia Merger
Recursion (Nasdaq RXRX) is the Salt Lake City-based AI drug company. IPO in 2021.
Core tech:
- "Recursion Maps" — phenotypic screening based on cell imaging
- About one million cell images automatically analyzed per experiment
- Models drug-gene-disease relationships as a graph
- NVIDIA collaboration on the BioHive-1 and BioHive-2 supercomputers (NVIDIA invested)
August 2024: announces acquisition of Exscientia (valued at about $700 million; the deal closed in November 2024). Exscientia is a UK-based AI drug company strong in proprietary molecular design. The merger combines imaging plus molecular design in one company.
Pipeline:
- 11+ preclinical/clinical assets
- Oncology, neurology, rare diseases
- 2024 collaborations with Bayer, Roche, Sanofi, and others
Vision:
- "Industrialize drug discovery"
- AI plus automated wet lab plus cloud computing
14. Insilico Medicine — Pioneer of Generative AI Drug Discovery
Insilico Medicine is an AI drug discovery company headquartered across Hong Kong, New York, and Shanghai. Founded in 2014. IPO underway in 2025 (Hong Kong exchange).
Core tech:
- The Pharma.AI platform — target discovery plus molecular design plus clinical trial design
- Composed of PandaOmics (targets), Chemistry42 (molecules), and InClinico (clinical)
- Combination of proprietary generative models and reinforcement learning
Hit:
- INS018_055 (IPF therapeutic candidate) — entered Phase 2 in 2023. The world's first "AI-discovered + AI-designed" clinical-stage drug.
- AI performs both target discovery (TNIK) and molecular design
- 18 months to candidate compound, more than halved compared to traditional approaches
Pipeline: 30+ programs, 7+ clinical assets.
2025 trends:
- Expanding collaboration with Sanofi
- INS018_055 Phase 2 readout expected
- Pursuing the Hong Kong IPO
Insilico's value proposition is clear: "AI discovers, AI designs, humans validate." Cut time and cost in half.
15. Schrödinger, Atomwise, BenevolentAI, Cradle
Schrödinger (Nasdaq SDGR)
- A leader in molecular dynamics (MD) and quantum chemistry software since 1990
- Industry-standard tools like DESMOND, Maestro, and Glide
- Integrated AI aggressively in the 2020s
- Runs its own pipeline as well — collaboration with Nimbus Therapeutics
Atomwise
- Founded 2012, "AtomNet" CNN-based docking model
- Many big pharma collaborations (Pfizer, Bayer, Merck, etc.)
- Virtual screening across 200+ targets
BenevolentAI (Euronext Amsterdam BAI)
- Integrates knowledge graph plus natural language plus molecular design
- Proposed baricitinib as a COVID-19 candidate early on, leading to FDA emergency use authorization
- Restructured in 2024 (underperformance), recovery mode in 2025
Cradle
- Netherlands/Switzerland, founded 2021
- Specializes in protein engineering (industrial enzymes, pharmaceutical proteins)
- Partnerships with Novartis, BASF, AstraZeneca
- 2024 Series B of $73 million
EvolutionaryScale (already covered in Section 5)
- The company behind ESM3
- 2024 seed round of $142 million, with investors including Amazon and NVIDIA's venture arm
- Model plus consulting business
16. Genomics AI — DeepVariant, Enformer, Geneformer, scGPT
DNA, RNA, and gene expression are also large AI domains beyond proteins.
DeepVariant (Google, 2018)
- Detects variants (SNPs, indels) from sequencing reads
- CNN-based, more accurate than traditional GATK
- As of 2025 supports both PacBio HiFi and ONT (nanopore) long reads
Enformer (DeepMind + Calico, 2021)
- Predicts gene expression from roughly 200 kb of DNA input
- Transformer-based
- Used to predict the expression impact of clinical variants
AlphaMissense (DeepMind, 2023)
- Pathogenicity prediction for missense variants (single amino acid substitutions)
- Public predictions for 71 million human missense variants
Geneformer (MIT Broad, 2023)
- Transformer over single-cell transcriptomic data
- "Rank-value encoding" — tokenizes by expression rank
- Pre-trained on about 30 million single cells
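Rank-value encoding is simple enough to sketch. The idea is to scale each gene's expression in a cell by that gene's typical level across the corpus, then feed the model gene names ordered from most to least distinctive. The version below is a minimal illustration under assumptions: counts are normalized by the cell total, the per-gene medians are supplied by the caller, and the sequence is truncated at `max_len`. The gene values, medians, and function name are made up for the example.

```python
def rank_value_encode(cell_counts, gene_medians, max_len=2048):
    """Return gene names ordered by median-scaled expression, highest first.

    cell_counts:  dict gene -> raw count in one cell
    gene_medians: dict gene -> corpus-wide median of normalized expression
    """
    total = sum(cell_counts.values())
    if total == 0:
        return []
    scored = []
    for gene, count in cell_counts.items():
        if count > 0 and gene in gene_medians:
            scored.append((count / total / gene_medians[gene], gene))
    # Highest scaled expression first; gene name breaks ties deterministically
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gene for _, gene in scored[:max_len]]


counts = {"ACTB": 100, "GAPDH": 80, "GATA4": 5, "TTN": 0}
medians = {"ACTB": 0.05, "GAPDH": 0.05, "GATA4": 0.001, "TTN": 0.002}
print(rank_value_encode(counts, medians))  # ['GATA4', 'ACTB', 'GAPDH']
```

The scaling pushes ubiquitous housekeeping genes down the ranking and lets a lowly expressed but cell-type-specific gene such as a transcription factor lead the sequence.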
scGPT (University of Toronto, Bo Wang lab, 2023)
- Single-cell foundation model
- Pre-trained on 33 million cells
- Multitasks across cell type classification, batch correction, perturbation prediction, etc.
Universal Cell Embeddings (UCE) (Stanford, 2023)
- A cross-species (human plus mouse plus fly, etc.) single-cell model
These models learn from public datasets like GTEx, Tabula Sapiens, and the Human Cell Atlas.
17. Cell Imaging AI — Cell Painting, JUMP-CP, CellPose
Cell Painting is a phenotypic profiling technique based on fluorescent staining plus automated microscopy. After treating cells with a compound, you automatically capture fluorescent images in five channels and extract roughly 1,500 morphological features.
JUMP-CP (2023, Broad + big pharma consortium)
- 116,000 compounds plus 12,000 gene perturbations
- Phenotypic profiles released via Cell Painting
- Used by the 12 big pharma co-funders (Bayer, Janssen, etc.)
- Fully released May 2024
CellPose (Janelia, 2021)
- Cell segmentation model — a U-Net variant
- Generalizes across many cell types
- Open source with ImageJ/Fiji plugins
CellProfiler (Broad)
- A cell image analysis tool first released in the mid-2000s
- Integrated deep learning models from 2023
Recursion Maps
- Recursion's proprietary platform
- A database of roughly 6 billion cell images
- A graph of drugs, diseases, and genes
- Trained on the BioHive-1 and BioHive-2 (NVIDIA) supercomputers
The core of this field is a phenotype-first approach. Even when targets are unknown, you find compounds that normalize cell phenotype first.
18. Clinical Trial AI — Saama, Unlearn.ai
Beyond discovery, clinical trials are the costliest stage (average clinical cost about $1.9 billion). AI enters here too.
Saama Technologies
- Founded 1997, specializes in clinical data management
- Proprietary LLM for automated data integrity checks
- Multiple big pharma collaborations
Unlearn.ai
- Founded 2018, synthetic control arms based on "digital twins"
- Generates virtual twins of patients to partially replace placebo controls
- Piloted in Alzheimer's trials in collaboration with the FDA
TriNetX
- Global patient data network, optimizes clinical design
- Pre-analyzes which cohorts can be recruited
Owkin (Paris)
- Federated learning-based multi-center clinical data analysis
- Patient data stays local, only models are shared
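The phrase "only models are shared" can be made concrete with federated averaging (FedAvg), a common baseline for this setting. Each hospital trains locally and sends back parameters, and the coordinator averages them weighted by local sample count. This is a generic sketch of that baseline, not a description of Owkin's actual system; the function name and numbers are invented.

```python
def federated_average(site_updates):
    """Weighted average of model parameters from several sites.

    site_updates: list of (weights, n_samples), where weights is a list of
    floats with the same length at every site.
    """
    if not site_updates:
        raise ValueError("no site updates")
    total = sum(n for _, n in site_updates)
    dim = len(site_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in site_updates:
        if len(weights) != dim:
            raise ValueError("all sites must share one model shape")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Two hypothetical hospitals with 100 and 300 patients
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(federated_average(updates))  # [2.5, 3.5]
```

Only the parameter vectors cross institutional boundaries; the patient records that produced them never leave the site.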
The core value of clinical trial AI is time reduction. Cutting one year off a single clinical phase can save over $100 million.
19. Bio Foundation Models — BioGPT, GeneGPT, NACL
Natural language-side bio foundation models are equally active.
BioGPT (Microsoft, 2022)
- A GPT-2 variant pre-trained on about 15 million PubMed abstracts
- Used for tasks like drug side effect and protein-drug relation extraction
GeneGPT (NCBI, 2023)
- A model trained to call genomics tool APIs
- Queries BLAST, dbSNP, ClinVar via natural language
NACL biomedical Llamas (NIH NACL, 2024)
- A series of Llama fine-tunes for biomedicine
- Domain-specific models for clinical, genomics, drugs, and more
Med-PaLM (Google, 2022-2024)
- A PaLM variant specialized for medical Q&A
- USMLE (U.S. medical licensing exam) passing level
Med-Gemini (Google, 2024)
- Gemini-based medical multimodal model
- Images plus text plus clinical notes
The common challenge for these models is hallucination control. Because medical accuracy is tied directly to life, strong RAG and human verification are essential.
20. Korean AI Bio — Standigm, Deep Bio, Syntekabio
Korea's AI bio ecosystem is growing fast.
Standigm
- Founded 2015, Korea's first-generation AI drug company
- Proprietary AI platform plus wet lab
- Collaborations with SK Chemicals and JW Pharmaceutical
- 2024 Series C of about 60 billion won
Deep Bio
- Specializes in pathology AI
- The prostate cancer grading AI (DeepDx-Prostate) registered with the FDA
- Commercial service in the U.S., Japan, and Korea
Syntekabio (Kosdaq listed)
- Supercomputer plus AI-based virtual screening
- Runs its own STB Cloud
- Collaborations with KT, Celltrion, and others
JLK Inspection
- Started in medical imaging AI and expanded into drug discovery
- Brain stroke and brain disease imaging analysis tied into target discovery
Macrogen
- Korea's largest sequencing and genomics analysis company
- Built its own AI variant interpretation platform
Lunit
- A leader in medical imaging AI, expanding into pathology
- Global expansion via the 2024 Volpara acquisition
Investment trends: 2024 Korean AI bio investment totaled about 500 billion won. That is small by global standards, but government support (Ministry of Health and Welfare data projects) is active.
21. Japanese AI Bio — Preferred Networks, Elix, MOLCURE
Japan is equally aggressive on AI bio.
Preferred Networks
- Japan's largest AI startup, known for Chainer
- From 2024 onward, Materials Project plus protein design
- ENEOS, Toyota, and other industrial partners
Elix Inc
- Tokyo, founded 2016, drug discovery AI
- Proprietary Elix Discovery platform
- Collaborations with Daiichi Sankyo and Shionogi
MOLCURE
- Specializes in antibody discovery AI
- In-house wet lab integrated with ML
Healios
- iPS cell-based regenerative medicine plus AI
- Listed on Tokyo Stock Exchange Mothers
Spiber
- Synthetic spider silk proteins — leverages protein design AI
- Collaborations with Uniqlo and GAP
Japan's strengths are deep roots in chemistry, precision engineering, and university research. Its weakness is an IPO market shallower than the U.S. one. Companies like PFN and Elix hint at the potential for globalization.
22. Datasets and Benchmarks — PDB, UniProt, ChEMBL, AlphaFold DB
The core datasets underpinning AI biology.
PDB (Protein Data Bank, 1971-)
- The standard repository for experimental protein structures
- About 230,000 structures as of 2025
- Experimental data from X-ray crystallography, cryo-EM, NMR, and more
- Core training data for AlphaFold
UniProt
- The standard protein sequence database
- About 250 million sequences (mostly auto-annotated)
- The curated part is SwissProt (about 570,000)
ChEMBL (EMBL-EBI)
- Database of bioactive molecules
- About 2.3 million compounds and 20 million activity measurements as of 2025
- Foundational for medicinal chemistry ML
AlphaFold DB
- About 214 million structures predicted with AlphaFold 2
- Predictions published for nearly every UniProt protein
- Free, usable for both academic and commercial purposes
ESM Atlas
- About 617 million metagenomic protein structures predicted with ESMFold
- Soil, ocean, and human microbiome proteins
The Human Cell Atlas
- A global consortium
- A single-cell map of human cell types
- About 100 million cells by 2025
JUMP-CP (see Section 17 above)
Open Targets (GSK + Sanofi + Bristol Myers Squibb + ...)
- A drug target prioritization database
- Combines genetics, clinical, and chemistry
ClinicalTrials.gov + clinicaltrialsregister.eu
- Clinical trial metadata
Data diversity and quality determine the ceiling of AI models. The biggest bottleneck as of 2026 is the shortage of wet-lab validation data.
23. Simulation Infrastructure — GROMACS, AMBER, DESMOND
AI predicts static structures well, but dynamic behavior is still where molecular dynamics (MD) leads.
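At its core, an MD engine integrates Newton's equations of motion in small time steps. The sketch below runs the velocity Verlet scheme, a standard integrator in this family, on a one-dimensional harmonic spring so that the result can be checked against the exact cosine solution. It is a toy under stated assumptions (unit mass, unit spring constant, no thermostat, no force field), and the names are illustrative.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance position x and velocity v by n_steps of size dt."""
    a = force(x) / mass
    for _ in range(n_steps):
        x += v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass           # force at the new position
        v += 0.5 * (a + a_new) * dt       # velocity from averaged acceleration
        a = a_new
    return x, v


def spring(x):
    """Harmonic restoring force with spring constant k = 1."""
    return -x


x, v = velocity_verlet(x=1.0, v=0.0, force=spring, mass=1.0, dt=0.01, n_steps=1000)
energy = 0.5 * v * v + 0.5 * x * x
print(f"x = {x:.4f}, exact = {math.cos(10.0):.4f}, energy = {energy:.6f}")
```

Production packages apply the same loop to millions of atoms with femtosecond time steps, which is why GPU acceleration dominates the tool comparisons in this section.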
GROMACS (Sweden KTH and others)
- Open source, used in both academia and industry
- Excellent GPU acceleration
- Used on protein, membrane, and nucleic acid systems
AMBER (UCSF + Rutgers and others)
- One of the oldest MD packages
- Wide variety of force-field options
- The AMBER force field is one of the de facto standards
NAMD (University of Illinois)
- Handles very large systems (10 million atoms and beyond)
- Used in COVID-19 spike protein simulations
DESMOND (Schrödinger commercial)
- Developed by D.E. Shaw Research, commercialized by Schrödinger
- Fast performance plus commercial support
- D.E. Shaw's Anton supercomputer is a separate dedicated piece of hardware
OpenMM (Stanford)
- An MD library callable from Python
- Integrates easily with AI workflows
- AlphaFold's relaxation step also uses OpenMM
ML potential rising:
- ML force fields like AIMNet2, ANI, and MACE deliver quantum-chemistry-level accuracy at speed
- Equivariant models like NequIP and Allegro
- Becoming a de facto standard tool from 2025 onward
On GPU infrastructure, NVIDIA H100/B100, AMD MI300, and Google TPU are all in use. Recursion's BioHive-2 is built on 504 H100 GPUs (63 DGX H100 systems).
24. Ethics and Regulation — New Safety Standards
The advance of AI biology equally raises misuse concerns.
Dual-use concerns:
- Can protein design AI be used to design new toxins or pathogens?
- A 2022 paper inverted the toxicity objective of a drug design AI and generated 40,000 candidate toxic molecules in under six hours (Urbina et al, Nature Machine Intelligence)
- Dual-use guidelines are under discussion at the U.S. NSABB and the UK SAGE, among others
Regulatory trends:
- FDA: drafting "AI in Drug Discovery" guidance from 2024
- EMA: published a reflection paper on the use of AI in the medicinal product lifecycle (2024)
- Japan's PMDA: accelerating medical AI approval
Open vs. closed:
- DeepMind's closure of AlphaFold 3 reflects both safety and commercial logic
- Baker Lab takes the position that "open improves safety"
- EvolutionaryScale splits the difference — small models open, large models API only
Biosecurity evaluation:
- Responsible AI policy — filters to detect dangerous protein designs
- Guidelines like "DNA synthesis companies should refuse suspicious sequences"
- IGSC (International Gene Synthesis Consortium) self-regulation
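As a rough illustration of what a sequence screening filter does, the sketch below flags an order when it shares an exact k-mer with any entry in a list of sequences of concern. This is a toy under stated assumptions, not how IGSC members or any specific provider screen orders. Real screening generally relies on homology search against curated databases plus human review, and exact matching is easy to evade. The sequences are meaningless placeholders, and `kmers`, `build_index`, `screen_order`, and `FLAGGED` are hypothetical names.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(flagged, k):
    """Map each k-mer to the names of the flagged sequences containing it."""
    index = {}
    for name, seq in flagged.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def screen_order(order_seq, index, k):
    """Return the sorted names of flagged sequences that share at least
    one exact k-mer with the ordered sequence."""
    hits = set()
    for kmer in kmers(order_seq, k):
        hits |= index.get(kmer, set())
    return sorted(hits)


# Placeholder entries with no biological meaning (assumption for illustration).
FLAGGED = {
    "placeholder_A": "ATGGCCTTAGGCTAACCGTAGGCT",
    "placeholder_B": "TTGACCGGTATTCCGGAAGTCTAA",
}

if __name__ == "__main__":
    index = build_index(FLAGGED, k=12)
    order = "ccccGGCCTTAGGCTAACCgggg"  # embeds a fragment of placeholder_A
    print(screen_order(order, index, k=12))
```

The choice of `k` trades sensitivity against false positives: shorter windows catch smaller fragments but match more benign sequences by chance.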
As of 2026, the regulatory framework for this field is still forming. Cooperation between the AI safety community (MIRI, ARC, METR) and the biosafety community (NTI, Johns Hopkins CHS) is growing.
25. Closing — From 2026 to 2030
The 2024 Nobel Prize was academia's recognition of AI biology. As of 2026, the downstream effects are spreading into industry.
Expected trends (2026-2030):
- First FDA approval of an AI-discovered plus AI-designed drug — possible between 2027 and 2029. Insilico's INS018_055 is one of the leading candidates
- Cloud SaaS for protein design tools — an era where medicinal chemists use RFdiffusion like Excel
- Integrated foundation models for single-cell plus phenotype plus structure — the merging trajectory of Recursion Maps, ESM3, and Geneformer
- Personalized antibodies — therapeutics designed per patient antigen
- Big pharma plus AI company integration — likely more mergers like Recursion-Exscientia
- Stricter dual-use regulation, possibly including mandated filters that detect hazardous designs
Right after the Nobel announcement, Demis Hassabis posted briefly on X. "This is just the beginning." The protein folding problem may be solved, but in the whole of biology AI has not even covered one percent. Dynamic behavior, cell-level simulation, tissue models, full-body models — the road ahead is long, and that road is the biggest science plus business opportunity of the next decade.
26. References
Key papers:
- AlphaFold 2 (Jumper et al, Nature 2021) — https://www.nature.com/articles/s41586-021-03819-2
- AlphaFold 3 (Abramson et al, Nature 2024) — https://www.nature.com/articles/s41586-024-07487-w
- RoseTTAFold (Baek et al, Science 2021) — https://www.science.org/doi/10.1126/science.abj8754
- RoseTTAFold All-Atom (Krishna et al, Science 2024) — https://www.science.org/doi/10.1126/science.adl2528
- ESM-2 / ESMFold (Lin et al, Science 2023) — https://www.science.org/doi/10.1126/science.ade2574
- ESM-3 (Hayes et al, bioRxiv 2024) — https://www.biorxiv.org/content/10.1101/2024.07.01.600583
- RFdiffusion (Watson et al, Nature 2023) — https://www.nature.com/articles/s41586-023-06415-8
- ProteinMPNN (Dauparas et al, Science 2022) — https://www.science.org/doi/10.1126/science.add2187
- DiffDock (Corso et al, ICLR 2023) — https://arxiv.org/abs/2210.01776
- Boltz-1 — https://github.com/jwohlwend/boltz
- Chai-1 — https://www.chaidiscovery.com/
- Protenix — https://github.com/bytedance/Protenix
- AlphaMissense (Cheng et al, Science 2023) — https://www.science.org/doi/10.1126/science.adg7492
- Enformer (Avsec et al, Nature Methods 2021) — https://www.nature.com/articles/s41592-021-01252-x
Databases and services:
- AlphaFold Server — https://alphafoldserver.com/
- AlphaFold DB — https://alphafold.ebi.ac.uk/
- PDB — https://www.rcsb.org/
- UniProt — https://www.uniprot.org/
- ChEMBL — https://www.ebi.ac.uk/chembl/
- ESM Atlas — https://esmatlas.com/
- Human Cell Atlas — https://www.humancellatlas.org/
- JUMP-CP — https://jump-cellpainting.broadinstitute.org/
- Open Targets — https://www.opentargets.org/
Companies and official sites:
- DeepMind — https://deepmind.google/
- Isomorphic Labs — https://www.isomorphiclabs.com/
- Recursion — https://www.recursion.com/
- Insilico Medicine — https://insilico.com/
- Schrödinger — https://www.schrodinger.com/
- Atomwise — https://www.atomwise.com/
- BenevolentAI — https://www.benevolent.com/
- Cradle — https://www.cradle.bio/
- Absci — https://www.absci.com/
- Generate:Biomedicines — https://generatebiomedicines.com/
- EvolutionaryScale — https://www.evolutionaryscale.ai/
- Chai Discovery — https://www.chaidiscovery.com/
Nobel resources:
- Nobel Prize 2024 — https://www.nobelprize.org/prizes/chemistry/2024/
Foundational tools:
- ColabFold — https://github.com/sokrypton/ColabFold
- CellPose — https://www.cellpose.org/
- CellProfiler — https://cellprofiler.org/
- OpenMM — https://openmm.org/
- GROMACS — https://www.gromacs.org/
- AMBER — https://ambermd.org/
Closing. AI solved the protein folding problem, but biology lies beyond folding. Dynamic interactions, cell level, tissue level, human level — the truly hard problems all live beyond that boundary. So this field will be most exciting in the decade ahead. A glorious time for computer scientists, and the first time biologists have tools strong enough to match the questions. Good luck to both fields.