AI for Biology & Drug Discovery 2026 Complete Guide — AlphaFold 3, RoseTTAFold, ESM Atlas, Boltz, Chai-1, RFdiffusion, Isomorphic Labs, Recursion, Insilico Deep Dive

Prologue — What the 2024 Nobel Prize in Chemistry Means

On October 9, 2024, the Royal Swedish Academy of Sciences announced the Chemistry laureates: David Baker (University of Washington), Demis Hassabis (CEO of DeepMind), and John Jumper (Senior Director, DeepMind). Half the prize went to Baker for de novo protein design, the other half to Hassabis and Jumper for AlphaFold 2 and protein structure prediction.

This was more than academic recognition. It was the official declaration that AI had solved a 50-year-old biology problem (the protein folding problem), and at the same time a signal that the companies industrializing that AI — DeepMind, Isomorphic Labs, Generate:Biomedicines, Recursion, Insilico Medicine, and others — were redrawing the future of drug discovery.

"Protein structure prediction is dead. The real game starts now." — wrote a molecular biologist on X right after the prize announcement. AlphaFold 2 solved static structure prediction; the next step is dynamic interactions, drug binding, and new protein design. And as of 2026, all of these are exploding simultaneously.

What this guide covers:

  1. The map of AI biology — the camps of 2026
  2. History of protein structure prediction — Anfinsen to AlphaFold
  3. AlphaFold 2, 3, and Server — the DeepMind line
  4. The RoseTTAFold family — Baker Lab's answer
  5. ESM-2, ESM-3, ESM Atlas — from Meta to EvolutionaryScale
  6. Boltz-1, Boltz-2 — MIT's open reproduction
  7. Chai-1, Protenix — the new entrants
  8. ColabFold, OmegaFold — the accessibility revolution
  9. RFdiffusion + ProteinMPNN — a new paradigm for protein design
  10. Antibody design — AbDesign, IgFold, Absci
  11. Small molecules and docking — MolMIM, DiffDock, NeuralPLexer
  12. Isomorphic Labs — DeepMind's drug discovery subsidiary
  13. Recursion Pharmaceuticals + Exscientia merger
  14. Insilico Medicine — pioneer of generative AI drug discovery
  15. Schrödinger, Atomwise, BenevolentAI, Cradle
  16. Genomics AI — DeepVariant, Enformer, Geneformer, scGPT
  17. Cell imaging AI — Cell Painting, JUMP-CP, CellPose
  18. Clinical trial AI — Saama, Unlearn.ai
  19. Bio foundation models — BioGPT, GeneGPT, NACL
  20. Korean AI bio — Standigm, Deep Bio, Syntekabio
  21. Japanese AI bio — Preferred Networks, Elix, MOLCURE
  22. Datasets and benchmarks — PDB, UniProt, ChEMBL, AlphaFold DB
  23. Simulation infrastructure — GROMACS, AMBER, DESMOND
  24. Ethics and regulation — new safety standards
  25. Closing: From 2026 to 2030

1. The Map of AI Biology — Camps of 2026

As of May 2026, AI biology splits into roughly five camps.

1) The structure prediction camp. Predicts 3D structure from a given protein sequence. AlphaFold 2/3, RoseTTAFold, ESMFold, Boltz, Chai-1, OmegaFold, ColabFold, and Protenix belong here. AlphaFold 3's 2024 launch, which modeled not just proteins but DNA, RNA, ligands, and ions simultaneously, bumped the game up a level.

2) The protein design camp. "Let's build proteins with the function we want from scratch." RFdiffusion, ProteinMPNN, ESM3 (its generative form), Chroma, and Genie are the main names. Baker Lab and Generate:Biomedicines are the twin pillars here.

3) The drug discovery company camp. Companies that actually run clinical pipelines: Isomorphic Labs (Alphabet), Recursion (merged with Exscientia), Insilico Medicine, Schrödinger, Atomwise, BenevolentAI, Cradle, Absci, Generate:Biomedicines.

4) The genomics + single-cell camp. Modeling DNA sequence, gene expression, and cell state. DeepVariant (variant calling), Enformer (expression prediction), Geneformer and scGPT (single-cell foundation models), and AlphaMissense (variant effect prediction) are representative.

5) The imaging + phenotypic camp. Reading drug effects directly from cell images. Recursion's "Maps" platform, the public JUMP-CP dataset, and analysis tools like CellPose and CellProfiler form the core.

These camps overlap. Recursion does imaging plus design plus drugs. EvolutionaryScale does prediction plus design with ESM3. So instead of asking "which camp is a company in," you should ask what problem each company is trying to solve.


2. History of Protein Structure Prediction — Anfinsen to AlphaFold

A short history first. Protein structure prediction was a 50-year-old problem.

1972: Christian Anfinsen receives the Nobel Prize in Chemistry. He experimentally proves the "Anfinsen dogma" — that a protein's three-dimensional structure is determined by its one-dimensional amino acid sequence. If true, in principle structure should be predictable from sequence alone.

1994 to 2020: CASP (Critical Assessment of protein Structure Prediction) runs every two years. Traditional methods — homology modeling, threading, fragment assembly, Rosetta — make incremental progress, but GDT-TS (accuracy metric) is stuck in the 60s to 70s.
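
The GDT-TS metric mentioned above can be made concrete with a short sketch. It is a simplified illustration, assuming the predicted and experimental structures are already superimposed and that one C-alpha distance per residue is available. The official CASP calculation searches over many superpositions for each cutoff, which this sketch does not attempt. The function name `gdt_ts` and the example distances are hypothetical.

```python
from typing import Sequence


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT_TS (0 to 100) from per-residue C-alpha distances in angstroms.

    Assumes a single, fixed superposition of prediction onto experiment.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    cutoffs = (1.0, 2.0, 4.0, 8.0)  # the four standard GDT_TS thresholds
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in cutoffs
    ]
    # GDT_TS is the mean of the four "fraction of residues within cutoff" values.
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical five-residue example: one residue per distance bin plus an outlier.
example_distances = [0.5, 1.5, 3.0, 6.0, 10.0]
example_score = gdt_ts(example_distances)
print(f"GDT_TS = {example_score:.1f}")
```

A score near 90 or above therefore means that almost every residue sits within a couple of angstroms of the experimental position, which is why the CASP14 result was read as near-experimental accuracy.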

CASP13, 2018: DeepMind's first AlphaFold 1 records a GDT-TS of 58.9, leaving the second-place group six points behind. Academia is stunned.

CASP14, December 2020: AlphaFold 2 scores GDT-TS 92.4 — essentially experimental accuracy (~95). CASP14 organizer John Moult declares the protein structure prediction problem "largely solved."

July 2021: AlphaFold 2 code and weights are released open source. The AlphaFold DB launches at the same time — first the human proteome (~20,000 proteins), then expanding to over 200 million predicted structures by 2022.

July 2021: David Baker's team announces RoseTTAFold. Same period as AlphaFold 2, similar accuracy. An attention-based three-track (sequence, distance, coordinates) architecture.

November 2022: Meta AI (FAIR) releases ESMFold and ESM Atlas. They predict over 600 million metagenomic protein structures and release them publicly. Without multiple sequence alignment (MSA), prediction runs through a language model only.

May 2024: AlphaFold 3 announced. Models not just proteins but DNA, RNA, small molecules (ligands), and ions simultaneously. The model is closed, however; access only via the AlphaFold Server web interface.

October 2024: Nobel Prize in Chemistry — half to David Baker, half to Demis Hassabis plus John Jumper.

May 2024 through 2025: MIT's Boltz-1, Chai Discovery's Chai-1, and ByteDance's Protenix sequentially release AlphaFold 3-class open models.

June 2024: EvolutionaryScale releases ESM-3, moving the ESM line from prediction to generation (ESMFold = prediction, ESM-3 = prediction plus generation).

2026 today: Boltz-2 has shipped, AlphaFold 4 is rumored, and RFdiffusion All-Atom designs are reaching clinical candidate compounds. It has also become obvious that structure prediction by itself is no longer a differentiator.


3. AlphaFold 2, 3, and Server — The DeepMind Line

AlphaFold 2 (2021) architecture in essence.

  • Input: target protein sequence plus MSA (multiple sequence alignment, evolutionary information)
  • Evoformer: refines sequence and pair representations via attention
  • Structure module: directly generates 3D coordinates. Rotations and translations handled in SE(3)-equivariant form
  • Outputs confidence metrics like pLDDT and pTM
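
As a rough illustration of how pLDDT is consumed in practice: AlphaFold writes the per-residue pLDDT into the B-factor column of its PDB output, so it can be read with plain fixed-width parsing. The sketch below assumes standard PDB ATOM records and uses the commonly quoted confidence bands (above 90, 70 to 90, 50 to 70, below 50). The helper names and the three-residue example are hypothetical, and the coordinates are dummy zeros.

```python
def atom_line(serial: int, name: str, res: str, resseq: int, bfactor: float) -> str:
    """Build a fixed-width PDB ATOM record on chain A (dummy coordinates)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res:>3s} A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


def plddt_per_residue(pdb_text: str) -> list[float]:
    """Read pLDDT from the B-factor field (columns 61-66) of each C-alpha atom."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_band(plddt: float) -> str:
    """Map a pLDDT value to the usual four confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Hypothetical three-residue fragment; only C-alpha atoms are counted.
pdb_text = "\n".join(
    [
        atom_line(1, "N", "MET", 1, 95.2),
        atom_line(2, "CA", "MET", 1, 95.2),
        atom_line(3, "CA", "LYS", 2, 72.4),
        atom_line(4, "CA", "THR", 3, 41.0),
    ]
)
scores = plddt_per_residue(pdb_text)
bands = [confidence_band(s) for s in scores]
print(list(zip(scores, bands)))
```

Regions in the lowest band are often disordered or simply unreliable, so a common first step is to mask them before docking or design.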

AlphaFold 2 specializes in static structure prediction. Dynamic conformations, binding-state changes, and interactions with small molecules required separate tools.

AlphaFold 3 (2024) tackles those limits head on.

  • Handles protein plus DNA plus RNA plus ligands plus ions in a single model
  • Diffusion-based coordinate generation — the structure module replaced with a diffusion model
  • DeepMind reported at least a 50% accuracy improvement over existing methods for interactions between proteins and other molecule types (protein-ligand interactions in particular)
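  • Code and weights were closed at launch, with access through the AlphaFold Server only (free for non-commercial use). In November 2024 DeepMind released the inference code, with weights available on request for academic, non-commercial use; commercial use remains off limits.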

This closed policy provoked major debate. DeepMind's position was clear: the model is closed because Isomorphic Labs (a sister company) must use it commercially. In response, MIT, Chai Discovery, and ByteDance immediately started open reproductions, and within a year nearly equivalent open models were released.

AlphaFold Server launched in May 2024. Anyone can log in with a Google account, enter a sequence, and get a structure back within 24 hours. Academic usage exploded. Caveats:

  • Results downloadable, model itself closed
  • Non-commercial use only
  • Daily job limits

As of 2026 AlphaFold DB offers about 214 million structures for free. It has predictions for nearly every protein registered in UniProt, not just the human proteome.


4. The RoseTTAFold Family — Baker Lab's Answer

The David Baker lab at the University of Washington (Nobel laureate) is DeepMind's rival on both structure prediction and design. Their answer is the RoseTTAFold series.

RoseTTAFold (2021)

  • 3-track architecture: learns sequence, distance, and coordinates simultaneously
  • Released around the same time as AlphaFold 2 with similar accuracy (slightly lower but faster)
  • Open source

RoseTTAFold2 (2023)

  • Nearly identical accuracy to AlphaFold 2
  • Handles larger proteins
  • Enhanced protein-protein complex prediction

RoseTTAFold All-Atom (RFAA, 2023)

  • Protein plus DNA plus RNA plus ligand plus cofactor in a single model
  • A similar concept to AlphaFold 3 but released earlier
  • Open source plus weights public

RFdiffusion (2023, design)

  • Diffusion model that generates protein backbones from scratch
  • Used for binder, enzyme, and antibody design
  • One of the core contributions cited in the Nobel

RFdiffusion All-Atom (2024)

  • Designs backbones plus side chains plus ligands simultaneously
  • Generates proteins with measurably higher binding affinity

Baker Lab's value proposition is unambiguous: open, design, application. All models are released and design tools beyond raw prediction are bundled.


5. ESM-2, ESM-3, ESM Atlas — From Meta to EvolutionaryScale

The ESM (Evolutionary Scale Modeling) series was Meta AI's (formerly FAIR) protein language model project.

ESM-1, ESM-2 (2019-2022)

  • Transformers treating protein sequence like text
  • Pre-trained on roughly 65 million UniRef50 sequences
  • The largest ESM-2 has 15 billion parameters

ESMFold (2022)

  • Attaches a structure prediction head to ESM-2
  • Predicts structure from sequence alone, without MSA — about 60 times faster than AlphaFold 2
  • Slightly lower accuracy, but powerful where MSA is hard to build (metagenomic proteins, etc.)

ESM Atlas (2022)

  • Predicted 617 million metagenomic protein structures using ESMFold
  • First public visualization of the "dark proteome" from soil, ocean, and human microbiome
  • Together with AlphaFold DB, one of the two pillars of the proteomic universe

2024: Meta spins off the FAIR protein team. EvolutionaryScale becomes a separate company. Alex Rives (ESM lead author) is a co-founder.

ESM-3 (2024, EvolutionaryScale)

  • A multimodal generative model unifying sequence, structure, and function
  • Beyond prediction it can also generate — design proteins with desired functions
  • The largest ESM-3 has 98 billion parameters
  • Only partially open: the larger models, including the 98B flagship, are API-only
  • The small open model (about 1.4B parameters) is released under a non-commercial license

EvolutionaryScale showcased ESM-3 with an evolution-simulation experiment (esmGFP) that designed a new GFP variant by compressing roughly 500 million years of evolutionary trajectory.


6. Boltz-1, Boltz-2 — MIT's Open Reproduction

After AlphaFold 3 launched as a closed model in May 2024, MIT's Regina Barzilay group and collaborators released Boltz-1 in November 2024.

Boltz-1 (2024)

  • AlphaFold 3-class accuracy (protein plus nucleic acid plus ligand plus ion)
  • Fully open under the MIT license — code and weights
  • Trained on a mix of internal and public data
  • A game changer for commercial researchers who cannot use AlphaFold Server

Boltz-2 (2025)

  • About 1.5x faster than Boltz-1
  • Adds binding affinity prediction
  • Memory efficiency improvements enable larger systems
  • Same MIT license

Boltz's contribution is simple: "Can't use AlphaFold 3? Use Boltz-2." Free for internal R&D at pharma, academic research, and commercial applications alike.

Here's an example of invoking Boltz-2 from the command line.

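# Install Boltz (the PyPI package ships the Boltz-2 models)
pip install boltz

# Prepare input FASTA; the header format is >CHAIN_ID|ENTITY_TYPE
cat > target.fasta <<EOF
>A|protein
MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK
EOF

# Predict structure for a single protein, building the MSA on the public MMseqs2 server
boltz predict target.fasta --use_msa_server --out_dir results/

# Outputs: mmCIF by default (add --output_format pdb for PDB), written under
# results/boltz_results_target/predictions/target/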

YAML input also supports protein-ligand complexes.

version: 1
sequences:
  - protein:
      id: A
      sequence: MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK
  - ligand:
      id: B
      smiles: "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
properties:
  - affinity:
      binder: B

A single GPU (A100 80GB) handles medium-sized proteins in 1-5 minutes.
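
When screening many ligands against one target, the protein-ligand YAML shown above is usually generated rather than written by hand. The following is a minimal sketch that only reproduces the fields from that example (`version`, `sequences`, `properties`); the function name `boltz_affinity_yaml` and the two-ligand loop are hypothetical, and the sketch does not validate sequences or SMILES.

```python
def boltz_affinity_yaml(protein_seq: str, ligand_smiles: str) -> str:
    """Render a protein-ligand input mirroring the YAML example in this section.

    Chain A is the protein, chain B the ligand, and affinity is requested for B.
    """
    if "'" in ligand_smiles:
        raise ValueError("unexpected quote character in SMILES")
    lines = [
        "version: 1",
        "sequences:",
        "  - protein:",
        "      id: A",
        f"      sequence: {protein_seq}",
        "  - ligand:",
        "      id: B",
        # Single quotes keep characters such as '#', '[' and '\\' literal in YAML.
        f"      smiles: '{ligand_smiles}'",
        "properties:",
        "  - affinity:",
        "      binder: B",
    ]
    return "\n".join(lines) + "\n"


target_seq = "MKTLLLTLVVVTIVCLDLGYTEEEEYNEELEKKMEEILSKLEKK"
ligands = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "ethanol": "CCO",
}
configs = {name: boltz_affinity_yaml(target_seq, smi) for name, smi in ligands.items()}
print(configs["aspirin"])
```

Each rendered string can be written to its own `.yaml` file and passed to `boltz predict` in the same way as the FASTA example.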


7. Chai-1, Protenix — The New Entrants

Chai Discovery is a startup that appeared in fall 2024, building AlphaFold 3-class models in house.

Chai-1 (2024)

  • Protein plus nucleic acid plus ligand plus ion
  • Accuracy slightly below AlphaFold 3, comparable to Boltz-1
  • Some weights released (non-commercial license)
  • Also provides a web UI — anyone can try
  • Especially strong on antibody modeling

Chai-1r (2025)

  • Adds binding affinity prediction
  • Reinforcement learning-based reranking
  • Used in binder design simulations

Protenix (ByteDance, 2024)

  • Released by ByteDance Research (TikTok's parent)
  • An AlphaFold 3 reproduction, fully open under Apache 2.0
  • Weights plus training code
  • Accuracy similar to Boltz-1

Thanks to these three models — Boltz, Chai, and Protenix — by spring 2025 there were effectively three open models at AlphaFold 3-class accuracy. DeepMind's closed policy paradoxically accelerated the open ecosystem.


8. ColabFold, OmegaFold — The Accessibility Revolution

AlphaFold 2 was released openly, but running it required expensive GPUs and enormous sequence databases for building MSAs (BFD, UniRef30, and others, several TB in total). ColabFold is what made it accessible to everyone.

ColabFold (2022)

  • A notebook built by Sergey Ovchinnikov and collaborators
  • Runs AlphaFold 2 plus RoseTTAFold plus ESMFold on Google Colab
  • Replaces the slow MSA-building step (jackhmmer/HHblits searches against BFD and similar databases) with a fast MMseqs2-based search
  • An undergraduate can predict a protein structure in 30 minutes
  • About one million users by 2025

OmegaFold (2022)

  • Announced by Helixon
  • Works without MSA
  • Similar concept to ESMFold but trained separately
  • More accurate than ESMFold on certain cases

ColabFold's significance is democratization: Nobel-level technology driven from a laptop browser, with the compute running on Colab's GPUs. As of 2025 ColabFold is gradually integrating AlphaFold 3, Boltz-2, and Chai-1 as well.


9. RFdiffusion + ProteinMPNN — A New Paradigm for Protein Design

So far it has been about prediction. Now let's move to design.

Traditional protein design relied on physics-based simulation such as Rosetta: evaluate possible side-chain combinations and search for low-energy structures. The approach was slow, and it struggled to invent new protein folds.

RFdiffusion (Baker Lab, 2023) changed the game.

  • Generates protein backbones from scratch with a diffusion model
  • Input: a portion of the target protein structure you want to bind plus the binding site
  • Output: a new protein backbone that can bind at that location
  • One of the Nobel-cited technologies

ProteinMPNN (Baker Lab, 2022)

  • Given a backbone, generates an amino acid sequence that fits that backbone
  • A message-passing graph neural network
  • "Generate the backbone with RFdiffusion, fill in the sequence with ProteinMPNN" is the standard pipeline

The actual workflow of the RFdiffusion + ProteinMPNN pipeline:

  1. Choose a binding site on the target protein
  2. Use RFdiffusion to generate 10,000 backbones that could bind at that site
  3. Use ProteinMPNN to assign sequences to each backbone (8 per backbone)
  4. Refold those sequences with AlphaFold 2 to verify they match the backbones
  5. Express the top 100 in the wet lab and measure binding affinity
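
The funnel in the steps above can be sketched as a simple filter-and-rank routine. This is an illustration under stated assumptions, not the Baker Lab's code: the `Design` record, the thresholds (`max_rmsd=2.0`, `min_plddt=80.0`), and the synthetic scores are all hypothetical stand-ins for the output of steps 2 to 4.

```python
from dataclasses import dataclass

N_BACKBONES = 10_000        # step 2: backbones from the diffusion model
SEQS_PER_BACKBONE = 8       # step 3: sequences assigned per backbone
total_candidates = N_BACKBONES * SEQS_PER_BACKBONE  # sequences to refold in step 4


@dataclass(frozen=True)
class Design:
    backbone_id: int
    sequence_id: int
    refold_rmsd: float  # angstroms between designed backbone and refolded structure
    plddt: float        # mean confidence of the refolded structure


def select_for_wet_lab(designs, max_rmsd=2.0, min_plddt=80.0, top_n=100):
    """Keep designs whose refolded structure matches the intended backbone,
    then rank by lowest RMSD (ties broken by higher pLDDT)."""
    passing = [
        d for d in designs if d.refold_rmsd <= max_rmsd and d.plddt >= min_plddt
    ]
    passing.sort(key=lambda d: (d.refold_rmsd, -d.plddt))
    return passing[:top_n]


# Tiny synthetic batch (5 backbones x 8 sequences) with deterministic fake scores.
designs = [
    Design(b, s, refold_rmsd=0.5 + 0.5 * ((b * 8 + s) % 7), plddt=60.0 + 5.0 * ((b + s) % 8))
    for b in range(5)
    for s in range(SEQS_PER_BACKBONE)
]
shortlist = select_for_wet_lab(designs, top_n=10)
print(f"{total_candidates} candidates at full scale; {len(shortlist)} shortlisted here")
```

The design choice worth noting is that the in-silico refolding check is cheap compared with expression, so it is applied to every candidate before anything reaches the wet lab.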

This pipeline put 10+ new binder proteins into preclinical or clinical stage in 2024 alone.

RFdiffusion All-Atom (2024) designs side chains and ligands together with the backbone in one shot. For example, you can design an enzyme that precisely fits around a drug molecule.


10. Antibody Design — AbDesign, IgFold, Absci

Antibodies are the most important biologic drug category (about $200 billion in 2024 revenue). So antibody design AI forms its own large market.

IgFold (Johns Hopkins, 2022)

  • Specialized for antibody structure prediction (more accurate than vanilla AlphaFold)
  • Enhanced CDR (complementarity-determining region) modeling
  • Open source

ABodyBuilder (Oxford OPIG, 2024)

  • Rapid modeling of antibody variable regions
  • Under-1-second prediction on a single GPU

AbDesign / RFdiffusion-Ab (Baker Lab, 2024)

  • Fine-tunes RFdiffusion for antibody design
  • Generates antibodies that bind a target antigen from scratch
  • Achieves about 1% or higher hit rate in wet-lab validation (10x to 100x over traditional display methods)

Absci (Nasdaq listed, 2021)

  • "Generative AI for antibody discovery"
  • Combines in-house ML and wet lab
  • 2024 partnerships with GSK, Merck, and others
  • Designs and expresses target-binding antibodies within six weeks

Generate:Biomedicines (spun off in 2022, $270M Series C in 2024)

  • Incubated by Flagship Pioneering
  • Develops the Chroma model in-house — antibody plus general protein design
  • Multiple collaborations with global big pharma

The core KPIs for antibody design are affinity (binding affinity, Kd) and developability (aggregation, viscosity, immunogenicity). Optimizing both axes simultaneously is the challenge for AI.
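
To put the affinity KPI in physical terms, a dissociation constant can be converted to a binding free energy with the standard relation ΔG = RT ln(Kd), taking a 1 M standard state. The sketch below assumes room temperature (298.15 K); the function name is hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


dg_1nm = binding_free_energy(1e-9)      # a 1 nM binder
dg_100pm = binding_free_energy(1e-10)   # a tenfold tighter binder
print(f"1 nM: {dg_1nm:.2f} kcal/mol, 100 pM: {dg_100pm:.2f} kcal/mol")
```

Each tenfold improvement in Kd is worth only about 1.4 kcal/mol at room temperature, which helps explain why affinity maturation and developability so easily trade off against each other.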


11. Small Molecules + Docking — MolMIM, DiffDock, NeuralPLexer

The small molecule side has also moved fast under AI.

SMILES and SELFIES

  • SMILES: a string representation of molecules (e.g., CC(=O)OC1=CC=CC=C1C(=O)O is aspirin)
  • SELFIES: addresses SMILES limitations and always represents valid molecules
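
To show that a SMILES string really is just a linear encoding of atoms and bonds, here is a deliberately naive heavy-atom counter. It is a sketch only: it handles the unbracketed organic subset, ignores hydrogens, charges, and stereochemistry, and is no substitute for a cheminformatics toolkit. The function name is hypothetical.

```python
import re
from collections import Counter

# Two-letter halogens first, then single-letter aliphatic and aromatic atoms.
_ATOM_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms in a SMILES string without bracket atoms."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    # Aromatic atoms are written in lowercase; capitalize() folds them together.
    return Counter(token.capitalize() for token in _ATOM_TOKEN.findall(smiles))


aspirin = heavy_atom_counts("CC(=O)OC1=CC=CC=C1C(=O)O")
print(dict(aspirin))
```

For aspirin this recovers the nine carbons and four oxygens of C9H8O4; ring-closure digits, parentheses, and bond symbols carry the connectivity that the counter discards.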

Mol-BERT, ChemBERTa, MoLFormer (2020-2022)

  • Transformers pre-trained on SMILES
  • Used for molecular property prediction

MolMIM (NVIDIA, 2024)

  • A molecular generation model, part of NVIDIA BioNeMo
  • Generates molecules with similar but improved properties starting from an input molecule
  • Accelerates the medicinal chemist's hit-to-lead phase

DiffDock (MIT, 2023)

  • Diffusion-based docking model
  • Directly generates protein-ligand binding poses
  • Tens of times faster than traditional docking (AutoDock Vina, etc.)

NeuralPLexer (Caltech, 2024)

  • Takes protein and ligand together as input and predicts the binding complex
  • Considers cofactors and accessory proteins

AlphaFold 3 + Boltz-2 + Chai-1 also predict small molecule binding in the end, so the docking field and the structure prediction field are practically merging.


12. Isomorphic Labs — DeepMind's Drug Discovery Subsidiary

Isomorphic Labs is Alphabet's drug discovery subsidiary, spun off in November 2021. Demis Hassabis serves as CEO of both Isomorphic Labs and Google DeepMind.

Mission: "Re-imagining drug discovery through AI." AlphaFold is the basic tool for drug discovery.

Strategy:

  • Dual-track strategy of internal pipeline plus big-pharma partnerships
  • January 2024 deal with Eli Lilly: $45 million upfront plus up to $1.7 billion in milestones
  • January 2024 deal with Novartis: $37.5 million upfront plus up to $1.2 billion in milestones
  • Own candidates focus on oncology and immunology

Tech stack:

  • AlphaFold 3 is core (closed externally, used first internally)
  • Proprietary design model plus docking plus ADMET prediction
  • Minimizes in-house wet lab, collaborates with CROs

Closed policy: Isomorphic's existence is the reason AlphaFold 3 is closed. If AF3 had been open, every big pharma would have used it internally and Isomorphic's business model would have weakened.

2025 status: First IND-enabling candidates are imminent. Phase 1 entry targeted within 2026.


13. Recursion Pharmaceuticals + Exscientia Merger

Recursion (Nasdaq RXRX) is the Salt Lake City-based AI drug company. IPO in 2021.

Core tech:

  • "Recursion Maps" — phenotypic screening based on cell imaging
  • About one million cell images automatically analyzed per experiment
  • Models drug-gene-disease relationships as a graph
  • NVIDIA collaboration on the BioHive-1 and BioHive-2 supercomputers (NVIDIA invested)

August 2024: announces the all-stock acquisition of Exscientia (about $700 million), which closes in November 2024. Exscientia is a UK-based AI drug company strong in proprietary molecular design. The merger combines imaging plus molecular design in one company.

Pipeline:

  • 11+ preclinical/clinical assets
  • Oncology, neurology, rare diseases
  • 2024 collaborations with Bayer, Roche, Sanofi, and others

Vision:

  • "Industrialize drug discovery"
  • AI plus automated wet lab plus cloud computing

14. Insilico Medicine — Pioneer of Generative AI Drug Discovery

Insilico Medicine is an AI drug discovery company headquartered across Hong Kong, New York, and Shanghai. Founded in 2014. IPO underway in 2025 (Hong Kong exchange).

Core tech:

  • The Pharma.AI platform — target discovery plus molecular design plus clinical trial design
  • Composed of PandaOmics (targets), Chemistry42 (molecules), and InClinico (clinical)
  • Combination of proprietary generative models and reinforcement learning

Hit:

  • INS018_055 (IPF therapeutic candidate) — entered Phase 2 in 2023. The world's first "AI-discovered + AI-designed" clinical-stage drug.
  • AI performs both target discovery (TNIK) and molecular design
  • 18 months to candidate compound, more than halved compared to traditional approaches

Pipeline: 30+ programs, 7+ clinical assets.

2025 trends:

  • Expanding collaboration with Sanofi
  • INS018_055 Phase 2 readout expected
  • Pursuing the Hong Kong IPO

Insilico's value proposition is clear: "AI discovers, AI designs, humans validate." Cut time and cost in half.


15. Schrödinger, Atomwise, BenevolentAI, Cradle

Schrödinger (Nasdaq SDGR)

  • A leader in molecular dynamics (MD) and quantum chemistry software since 1990
  • Industry-standard tools like DESMOND, Maestro, and Glide
  • Integrated AI aggressively in the 2020s
  • Runs its own pipeline as well — collaboration with Nimbus Therapeutics

Atomwise

  • Founded 2012, "AtomNet" CNN-based docking model
  • Many big pharma collaborations (Pfizer, Bayer, Merck, etc.)
  • Virtual screening across 200+ targets

BenevolentAI (London-based, listed on Euronext Amsterdam as BAI)

  • Integrates knowledge graph plus natural language plus molecular design
  • Proposed baricitinib as a COVID-19 candidate early on, leading to FDA emergency use authorization
  • Restructured in 2024 (underperformance), recovery mode in 2025

Cradle

  • Netherlands/Switzerland, founded 2021
  • Specializes in protein engineering (industrial enzymes, pharmaceutical proteins)
  • Partnerships with Novartis, BASF, AstraZeneca
  • 2024 Series B of $73 million

EvolutionaryScale (already covered in Section 5)

  • The company behind ESM3
  • 2024 seed round of $142 million, with Amazon, NVIDIA, and others participating
  • Model plus consulting business

16. Genomics AI — DeepVariant, Enformer, Geneformer, scGPT

DNA, RNA, and gene expression are also large AI domains beyond proteins.

DeepVariant (Google, 2018)

  • Detects variants (SNPs, indels) from sequencing reads
  • CNN-based, more accurate than traditional GATK
  • As of 2025 supports both PacBio HiFi and ONT (nanopore) long reads

Enformer (DeepMind + Calico, 2021)

  • Predicts gene expression from roughly 200 kb of DNA input
  • Transformer-based
  • Used to predict the expression impact of clinical variants

AlphaMissense (DeepMind, 2023)

  • Pathogenicity prediction for missense variants (single amino acid substitutions)
  • Public predictions for 71 million human missense variants

Geneformer (Broad Institute, 2023)

  • Transformer over single-cell transcriptomic data
  • "Rank-value encoding" — tokenizes by expression rank
  • Pre-trained on about 30 million single cells
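
Rank-value encoding can be sketched in a few lines. The version below is an assumption-laden illustration, not Geneformer's tokenizer: it assumes each gene's count is divided by a corpus-wide typical value (here called `corpus_median`) before ranking, and it truncates to a hypothetical `max_len` of 2048 tokens. All gene counts and medians are made up.

```python
def rank_value_encode(
    expression: dict[str, float],
    corpus_median: dict[str, float],
    max_len: int = 2048,
) -> list[str]:
    """Turn one cell's expression profile into a gene-token sequence.

    Genes are ordered by expression relative to their corpus-wide typical level,
    so ubiquitously high genes drop down and cell-specific genes rise.
    """
    normalized = {
        gene: count / corpus_median[gene]
        for gene, count in expression.items()
        if count > 0 and corpus_median.get(gene, 0) > 0
    }
    # Highest relative expression first; gene name breaks ties deterministically.
    ranked = sorted(normalized, key=lambda gene: (-normalized[gene], gene))
    return ranked[:max_len]


# Made-up cell: GAPDH has the highest raw count but is high everywhere.
cell = {"GAPDH": 500.0, "TP53": 20.0, "GATA4": 15.0, "ACTB": 0.0}
medians = {"GAPDH": 1000.0, "TP53": 10.0, "GATA4": 3.0, "ACTB": 800.0}
tokens = rank_value_encode(cell, medians)
print(tokens)
```

The transformer then sees only the order of gene tokens, not the raw counts, which makes the input robust to sequencing-depth differences between cells.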

scGPT (University of Toronto + Wang Lab, 2023)

  • Single-cell foundation model
  • Pre-trained on 33 million cells
  • Multitasks across cell type classification, batch correction, perturbation prediction, etc.

Universal Cell Embeddings (UCE) (Stanford, 2023)

  • A cross-species (human plus mouse plus fly, etc.) single-cell model

These models learn from public datasets like GTEx, Tabula Sapiens, and the Human Cell Atlas.


17. Cell Imaging AI — Cell Painting, JUMP-CP, CellPose

Cell Painting is a phenotypic profiling technique based on fluorescent staining plus automated microscopy. After treating cells with a compound, you automatically capture fluorescent images in five channels and extract roughly 1,500 morphological features.
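
Once each treatment is reduced to a feature vector, comparing phenotypes becomes vector arithmetic. The sketch below assumes a common (but not universal) recipe: z-score each feature against untreated control wells, then compare profiles by cosine similarity. The four-feature vectors are invented for illustration; real profiles have on the order of 1,500 features.

```python
import math


def zscore_against_controls(profile, control_mean, control_std):
    """Express each feature as standard deviations away from the control wells."""
    return [(p - m) / s for p, m, s in zip(profile, control_mean, control_std)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an all-zero profile")
    return dot / (norm_a * norm_b)


control_mean = [10.0, 5.0, 2.0, 8.0]
control_std = [2.0, 1.0, 0.5, 4.0]

compound_a = zscore_against_controls([14.0, 3.0, 3.0, 8.0], control_mean, control_std)
compound_b = zscore_against_controls([12.0, 4.0, 2.5, 8.0], control_mean, control_std)
compound_c = zscore_against_controls([6.0, 7.0, 1.0, 8.0], control_mean, control_std)

sim_ab = cosine_similarity(compound_a, compound_b)  # same direction, weaker effect
sim_ac = cosine_similarity(compound_a, compound_c)  # opposite phenotype
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

Compounds with highly similar profiles are candidates for a shared mechanism of action, while strongly opposed profiles hint at a treatment that reverses a phenotype.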

JUMP-CP (2023, Broad + big pharma consortium)

  • 116,000 compounds plus 12,000 gene perturbations
  • Phenotypic profiles released via Cell Painting
  • Used by the 12 big pharma co-funders (Bayer, Janssen, etc.)
  • Fully released May 2024

CellPose (Janelia, 2021)

  • Cell segmentation model — a U-Net variant
  • Generalizes across many cell types
  • Open source with ImageJ/Fiji plugins

CellProfiler (Broad)

  • A cell image analysis tool first released in the mid-2000s
  • Integrated deep learning models from 2023

Recursion Maps

  • Recursion's proprietary platform
  • A database of roughly 6 billion cell images
  • A graph of drugs, diseases, and genes
  • Trained on the BioHive-1 and BioHive-2 (NVIDIA) supercomputers

The core of this field is a phenotype-first approach. Even when targets are unknown, you find compounds that normalize cell phenotype first.


18. Clinical Trial AI — Saama, Unlearn.ai

Beyond discovery, clinical trials are the costliest stage (average clinical cost about $1.9 billion). AI enters here too.

Saama Technologies

  • Founded 2015, specializes in clinical data management
  • Proprietary LLM for automated data integrity checks
  • Multiple big pharma collaborations

Unlearn.ai

  • Founded 2018, synthetic control arms based on "digital twins"
  • Generates virtual twins of patients to partially replace placebo controls
  • Piloted in Alzheimer's trials in collaboration with the FDA

TriNetX

  • Global patient data network, optimizes clinical design
  • Pre-analyzes which cohorts can be recruited

Owkin (Paris)

  • Federated learning-based multi-center clinical data analysis
  • Patient data stays local, only models are shared
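
The "data stays local, only models are shared" idea is commonly implemented with federated averaging (FedAvg). The sketch below shows that generic technique under simplifying assumptions; it is not a description of Owkin's platform. Each site is assumed to send back a flat list of model parameters and its local sample count, and the two-hospital numbers are invented.

```python
def federated_average(
    site_params: list[list[float]], site_sizes: list[int]
) -> list[float]:
    """Combine locally trained parameters, weighting each site by its sample count."""
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one parameter vector and one size per site")
    n_params = len(site_params[0])
    if any(len(params) != n_params for params in site_params):
        raise ValueError("all sites must share the same model shape")
    total = sum(site_sizes)
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical hospitals; only these parameter vectors leave each site.
hospital_a = [1.0, 2.0]   # trained on 100 patients
hospital_b = [3.0, 4.0]   # trained on 300 patients
global_params = federated_average([hospital_a, hospital_b], [100, 300])
print(global_params)
```

In a real deployment this averaging step repeats over many rounds, and it is usually combined with secure aggregation or differential privacy because model updates can still leak information.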

The core value of clinical trial AI is time reduction. Cutting one year off a single clinical phase can save over $100 million.


19. Bio Foundation Models — BioGPT, GeneGPT, NACL

Natural language-side bio foundation models are equally active.

BioGPT (Microsoft, 2022)

  • A GPT-2 variant pre-trained on about 15 million PubMed abstracts
  • Used for tasks like drug side effect and protein-drug relation extraction

GeneGPT (NCBI, 2023)

  • A model trained to call genomics tool APIs
  • Queries BLAST, dbSNP, ClinVar via natural language

NACL biomedical Llamas (NIH NACL, 2024)

  • A series of Llama fine-tunes for biomedicine
  • Domain-specific models for clinical, genomics, drugs, and more

Med-PaLM (Google, 2022-2024)

  • A PaLM variant specialized for medical Q&A
  • USMLE (U.S. medical licensing exam) passing level

Med-Gemini (Google, 2024)

  • Gemini-based medical multimodal model
  • Images plus text plus clinical notes

The common challenge for these models is hallucination control. Because medical accuracy is tied directly to life, strong RAG and human verification are essential.


20. Korean AI Bio — Standigm, Deep Bio, Syntekabio

Korea's AI bio ecosystem is growing fast.

Standigm

  • Founded 2015, Korea's first-generation AI drug company
  • Proprietary AI platform plus wet lab
  • Collaborations with SK Chemicals and JW Pharmaceutical
  • 2024 Series C of about 60 billion won

Deep Bio

  • Specializes in pathology AI
  • The prostate cancer grading AI (DeepDx-Prostate) registered with the FDA
  • Commercial service in the U.S., Japan, and Korea

Syntekabio (Kosdaq listed)

  • Supercomputer plus AI-based virtual screening
  • Runs its own STB Cloud
  • Collaborations with KT, Celltrion, and others

JLK Inspection

  • Started in medical imaging AI and expanded into drug discovery
  • Brain stroke and brain disease imaging analysis tied into target discovery

Macrogen

  • Korea's largest sequencing and genomics analysis company
  • Built its own AI variant interpretation platform

Lunit

  • A leader in medical imaging AI, expanding into pathology
  • Global expansion via the 2024 Volpara acquisition

Investment trends: 2024 Korean AI bio investment totaled about 500 billion won. That is small by global standards, but government support (Ministry of Health and Welfare data projects) is active.


21. Japanese AI Bio — Preferred Networks, Elix, MOLCURE

Japan is equally aggressive on AI bio.

Preferred Networks

  • Japan's largest AI startup, known for Chainer
  • From 2024 onward, Materials Project plus protein design
  • ENEOS, Toyota, and other industrial partners

Elix Inc

  • Tokyo, founded 2016, drug discovery AI
  • Proprietary Elix Discovery platform
  • Collaborations with Daiichi Sankyo and Shionogi

MOLCURE

  • Specializes in antibody discovery AI
  • In-house wet lab integrated with ML

Healios

  • iPS cell-based regenerative medicine plus AI
  • Listed on Tokyo Stock Exchange Mothers

Spiber

  • Synthetic spider silk proteins — leverages protein design AI
  • Collaborations with Uniqlo and GAP

Japan's strengths in chemistry, precision engineering, and university research run deep, but its IPO market is weaker than that of the U.S. Companies like PFN and Elix hint at the potential for globalization.


22. Datasets and Benchmarks — PDB, UniProt, ChEMBL, AlphaFold DB

The core datasets underpinning AI biology.

PDB (Protein Data Bank, 1971-)

  • The standard repository for experimental protein structures
  • About 230,000 structures as of 2025
  • Experimental data from X-ray crystallography, cryo-EM, NMR, and more
  • Core training data for AlphaFold

UniProt

  • The standard protein sequence database
  • About 250 million sequences (mostly auto-annotated)
  • The curated part is SwissProt (about 570,000)

ChEMBL (EMBL-EBI)

  • Database of bioactive molecules
  • About 2.3 million compounds and 20 million activity measurements as of 2025
  • Foundational for medicinal chemistry ML

AlphaFold DB

  • About 214 million structures predicted with AlphaFold 2
  • Predictions published for nearly every UniProt protein
  • Free, usable for both academic and commercial purposes

ESM Atlas

  • About 617 million metagenomic protein structures predicted with ESMFold
  • Soil, ocean, and human microbiome proteins

The Human Cell Atlas

  • A global consortium
  • A single-cell map of human cell types
  • About 100 million cells by 2025

JUMP-CP (see Section 17 above)

Open Targets (GSK + Sanofi + Bristol Myers Squibb + ...)

  • A drug target prioritization database
  • Combines genetics, clinical, and chemistry

ClinicalTrials.gov + clinicaltrialsregister.eu

  • Clinical trial metadata

Data diversity and quality determine the ceiling of AI models. The biggest bottleneck as of 2026 is the shortage of wet-lab validation data.


23. Simulation Infrastructure — GROMACS, AMBER, DESMOND

AI predicts static structures well, but dynamic behavior is still where molecular dynamics (MD) leads.

GROMACS (Sweden KTH and others)

  • Open source, used in both academia and industry
  • Excellent GPU acceleration
  • Used on protein, membrane, and nucleic acid systems

AMBER (UCSF + Rutgers and others)

  • One of the oldest MD packages
  • Wide variety of force-field options
  • The AMBER force field is one of the de facto standards

NAMD (University of Illinois)

  • Handles very large systems (10 million atoms and beyond)
  • Used in COVID-19 spike protein simulations

DESMOND (Schrödinger commercial)

  • Developed by D.E. Shaw Research, commercialized by Schrödinger
  • Fast performance plus commercial support
  • D.E. Shaw's Anton supercomputer is a separate dedicated piece of hardware

OpenMM (Stanford)

  • An MD library callable from Python
  • Integrates easily with AI workflows
  • AlphaFold's relaxation step also uses OpenMM

ML potentials on the rise:

  • ML force fields like AIMNet2, ANI, and MACE deliver quantum-chemistry-level accuracy at speed
  • Equivariant models like NequIP and Allegro
  • Becoming a de facto standard tool from 2025 onward

On GPU infrastructure, NVIDIA H100/B100, AMD MI300, and Google TPU are all in use. Recursion's BioHive-2 is built on 504 H100s.


24. Ethics and Regulation — New Safety Standards

The advance of AI biology equally raises misuse concerns.

Dual-use concerns:

  • Can protein design AI be used to design new toxins or pathogens?
  • A 2022 paper inverted the objective of a drug design AI and generated about 40,000 candidate toxic molecules in a matter of hours (Urbina et al., Nature Machine Intelligence)
  • Dual-use guidelines being discussed at the U.S. NSABB and the UK SAGE, among others

Regulatory trends:

  • FDA: drafting "AI in Drug Discovery" guidance from 2024
  • EMA: published a reflection paper on AI use in clinical (2024)
  • Japan's PMDA: accelerating medical AI approval

Open vs. closed:

  • DeepMind's closure of AlphaFold 3 reflects both safety and commercial logic
  • Baker Lab takes the position that "open improves safety"
  • EvolutionaryScale splits the difference — small models open, large models API only

Bio security evaluation:

  • Responsible AI policy — filters to detect dangerous protein designs
  • Guidelines like "DNA synthesis companies should refuse suspicious sequences"
  • IGSC (International Gene Synthesis Consortium) self-regulation

As of 2026, the regulatory framework for this field is still forming. Cooperation between the AI safety community (MIRI, ARC, METR) and the bio safety community (NTI, Johns Hopkins CHS) is growing.


25. Closing — From 2026 to 2030

The 2024 Nobel Prize was academia's recognition of AI biology. As of 2026, the downstream effects are spreading into industry.

Expected trends (2026-2030):

  1. First FDA approval of an AI-discovered plus AI-designed drug — possible between 2027 and 2029. Insilico's INS018_055 is one of the leading candidates
  2. Cloud SaaS for protein design tools — an era where medicinal chemists use RFdiffusion like Excel
  3. Integrated foundation models for single-cell plus phenotype plus structure — the merging trajectory of Recursion Maps, ESM3, and Geneformer
  4. Personalized antibodies — therapeutics designed per patient antigen
  5. Big pharma plus AI company integration — likely more mergers like Recursion-Exscientia
  6. Stricter dual-use regulation — possible mandates for risk design detection filters

Right after the Nobel announcement, Demis Hassabis posted briefly on X. "This is just the beginning." The protein folding problem may be solved, but in the whole of biology AI has not even covered one percent. Dynamic behavior, cell-level simulation, tissue models, full-body models — the road ahead is long, and that road is the biggest science plus business opportunity of the next decade.


Closing. AI solved the protein folding problem, but biology lies beyond folding. Dynamic interactions, cell level, tissue level, human level — the truly hard problems all live beyond that boundary. So this field will be most exciting in the decade ahead. A glorious time for computer scientists, and the first time biologists have tools strong enough to match the questions. Good luck to both fields.