Bioinformatics Tools 2026 — Galaxy / BioPython / Nextflow / Snakemake / AlphaFold 3 / ESM3 / RoseTTAFold / Boltz-1 / Chai-1 / Foldseek Deep Dive
Prologue — The Landscape That the 2024 Nobel Prize Rearranged
In October 2024, the Nobel Prize in Chemistry went to three people: **David Baker** (University of Washington, *computational protein design*), **Demis Hassabis**, and **John Jumper** (Google DeepMind, *AlphaFold 2 for protein structure prediction*). One short citation, and the entire landscape of bioinformatics shifted.
Through the early 2010s, solving one protein structure with X-ray crystallography was a normal three-year PhD project. By the time AlphaFold 3 was published in May 2024, a comparable prediction took minutes on a free web server. **And not just proteins.** AlphaFold 3 predicts complexes of proteins + small-molecule ligands + DNA + RNA + ions in a single shot. ESM3 generates protein sequences the way GPT generates text. RoseTTAFold All-Atom does the same job as AlphaFold 3, the Baker Lab way. **Chai-1** (Chai Discovery, September 2024) and **Boltz-1** (MIT, November 2024) reached comparable accuracy with *openly released* weights.
All of that happened in one year.
This post threads the whole 2026 bioinformatics stack — from the moment data leaves a sequencer to the moment a protein structure renders — into one continuous read. The next 20+ chapters cover:
- **Galaxy** — the standard web UI for researchers who do not code
- **BioPython + Bioconductor** — the two language libraries (Python and R)
- **Nextflow + Snakemake** — workflow standards
- **BLAST + DIAMOND2 + MMseqs2** — sequence search (slow then fast then faster)
- **SAMtools + BCFtools + GATK** — BAM and VCF tooling
- **STAR + HISAT2 + Salmon + Kallisto + DESeq2 + edgeR** — RNA-seq pipeline
- **AlphaFold 3 + ESM3 + RoseTTAFold + ProteinMPNN + Boltz-1 + Chai-1 + Foldseek** — proteins
- **Anvi'o + QIIME 2** — microbiomes
- **Seurat + Scanpy** — single-cell RNA-seq
- **Illumina + 10x Genomics + Oxford Nanopore** — sequencers
- **AWS HealthOmics + GCP Healthcare API + Microsoft Genomics** — cloud
1. The 2026 Bioinformatics Map — Workflow / Alignment / Protein / Single Cell
Before walking through tools, draw the map. The 2026 stack splits into four layers.
[Sequencer] Illumina NovaSeq X / Nanopore PromethION / 10x Chromium
|
| BCL files (raw)
v
[Demultiplex / Convert] bcl2fastq, DRAGEN BCL Convert
|
| FASTQ files
v
[QC & Trim] FastQC, fastp, MultiQC
|
| Clean FASTQ
v
[Align / Quantify] BWA-MEM2, STAR, HISAT2, Salmon, Kallisto
|
| BAM / count matrix
v
[Variant Call / DE] GATK, BCFtools, DESeq2, edgeR
|
| VCF / DE table
v
[Downstream] Seurat, Scanpy, Anvi'o, QIIME 2
|
v
[Protein structure] AlphaFold 3, ESM3, Boltz-1, Chai-1, RoseTTAFold
A workflow engine binds the whole stack together. Nextflow and Snakemake are the two pillars, and Galaxy layers a web UI on top. The de facto 2026 combinations:
- **Starting a new lab**: Nextflow + nf-core + Seqera Tower, or Snakemake + Snakemake-Wrappers
- **Need a protein structure**: ColabFold (web) then AlphaFold 3 (precision) then Boltz-1 / Chai-1 (open alternatives)
- **Single cell**: 10x Cell Ranger then Scanpy (Python) or Seurat (R)
- **Microbiome**: QIIME 2 (16S) or Anvi'o (metagenome)
- **Cloud**: AWS HealthOmics (with NVIDIA Parabricks) or GCP Healthcare API
One-liner to remember: **"Files start as FASTQ, are organized into BAM and VCF, and meaning comes out of R or Python."**
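To make the first hop of that one-liner concrete, here is a minimal sketch of what a FASTQ record holds and how a QC tool might summarize its quality string. The names `read_fastq` and `mean_phred` are illustrative, not from any library, and the Phred+33 offset is an assumption that holds for current Illumina output.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read (assumes Phred+33 encoding)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
for record in read_fastq(example):
    print(record.name, record.sequence, mean_phred(record.quality))
```

Real pipelines leave this to FastQC or fastp; the point is only that a FASTQ file is plain text with one quality character per base.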
2. Galaxy — The Standard Web Platform
Galaxy is an open-source bioinformatics web platform started by Penn State and Johns Hopkins. It has run since 2005, and in 2026 the public instances **usegalaxy.org** (US), **usegalaxy.eu** (Freiburg, Germany), **usegalaxy.org.au** (Australia), and **usegalaxy.fr** (France) operate side by side. Anyone can sign up for free and run BLAST, STAR, DESeq2, and Cell Ranger by clicking.
Three core concepts:
1. **History** — Each user's workspace. Uploaded data, executed tools, and outputs accumulate chronologically.
2. **Tool** — One analysis step (e.g., FastQC, STAR, DESeq2). More than 8,000 tools are registered.
3. **Workflow** — Tools connected into a pipeline. Drag nodes around the GUI.
Galaxy's strength is **reproducibility**. Share one History and another researcher can rerun the identical data, tool versions, and parameters. Since 2025, Galaxy ToolShed integrates Bioconda and BioContainers directly, automating per-container tool installation.
# BioBlend (Python client for the Galaxy API): control a Galaxy instance from Python
pip install bioblend
python -c "
from bioblend.galaxy import GalaxyInstance
gi = GalaxyInstance('https://usegalaxy.org', key='YOUR_API_KEY')
history = gi.histories.create_history(name='RNA-seq 2026')
gi.tools.upload_file('reads.fastq.gz', history['id'])
"
**When to use it?** When you don't want to write code, or when teaching, sharing reproducible experiments, or onboarding new lab members. **When not to?** When you need to push thousands of CPU cores for 24 hours on an industrial pipeline. Then you spin up Nextflow on cloud.
3. BioPython + Bioconductor — Language Libraries
Bioinformatics has long been split between two languages: **Python** (data wrangling, machine learning) and **R** (statistics, visualization). Their standard libraries are BioPython and Bioconductor.
BioPython
The standard Python bio library since 1999. FASTA, FASTQ, GenBank, UniProt parsers, NCBI Entrez access, sequence alignment, and PDB structure handling all live under one library.
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
# 1. Read a FASTA
for record in SeqIO.parse("genome.fasta", "fasta"):
    print(record.id, len(record.seq))
# 2. Manipulate sequences
dna = Seq("ATGAAGCTGGAATTC")
print(dna.complement())          # TACTTCGACCTTAAG
print(dna.reverse_complement())  # GAATTCCAGCTTCAT
print(dna.translate())           # MKLEF (protein)
# 3. Fetch GenBank via NCBI Entrez
Entrez.email = "you@example.com"
handle = Entrez.efetch(db="nucleotide", id="NC_000913.3",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
Bioconductor
The bio package repository of the R ecosystem. Running since 2002, it now lists **more than 2,300 packages**. DESeq2, edgeR, limma, and ChIPseeker all live here (Seurat, by contrast, is distributed through CRAN). Twice-yearly releases ensure every package builds and tests against the same R version.
# Install Bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)
# Build a DE analysis object from a count matrix and sample info
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
**When Python vs R?** Data cleaning, machine learning, and deep learning (AlphaFold and friends) feel natural in Python. Statistical modeling, plots, and DE analysis feel natural in R. A 2026 lab uses **both**.
4. Nextflow (DSL2) — The Workflow Standard
Nextflow is a workflow language built by Italian engineer Paolo Di Tommaso at CRG Barcelona in 2013. It spun out as Seqera Labs in 2018, and in 2026 it is the **de facto workflow standard**.
The core idea is **dataflow + channels**. Every process has input and output channels, and data flows through channels. Running 100 samples in parallel through the same pipeline feels natural.
// DSL2: first two steps of an RNA-seq pipeline
nextflow.enable.dsl=2
process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'
    input:
    tuple val(sample_id), path(reads)
    output:
    path "*_fastqc.zip"
    script:
    """
    fastqc ${reads}
    """
}
process STAR_ALIGN {
    container 'quay.io/biocontainers/star:2.7.11a--h0033a41_0'
    cpus 16
    memory '64 GB'
    input:
    tuple val(sample_id), path(reads)
    path index
    output:
    tuple val(sample_id), path("*.bam")
    script:
    """
    STAR --runThreadN ${task.cpus} \\
         --genomeDir ${index} \\
         --readFilesIn ${reads} \\
         --readFilesCommand zcat \\
         --outSAMtype BAM SortedByCoordinate
    """
}
workflow {
    samples = Channel.fromFilePairs('data/*_R{1,2}.fastq.gz')
    FASTQC(samples)
    STAR_ALIGN(samples, file('star_index'))
}
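The channel idea described above can be imitated in a few lines of Python. This is a toy model under loud assumptions, not how Nextflow is implemented: a "channel" is just a list of `(sample_id, reads)` tuples, a "process" is a plain function, and `run_process`, `fastqc`, and `star_align` are hypothetical names that only return the file names a real step would produce.

```python
from concurrent.futures import ThreadPoolExecutor


def fastqc(sample_id, reads):
    # Stand-in for the FASTQC process: report the file it would write.
    return sample_id, f"{sample_id}_fastqc.zip"


def star_align(sample_id, reads):
    # Stand-in for the STAR_ALIGN process.
    return sample_id, f"{sample_id}.bam"


def run_process(process, channel, max_workers=4):
    """Apply one process to every item of a channel, items in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so outputs line up with samples.
        return list(pool.map(lambda item: process(*item), channel))


samples = [
    ("s1", ["s1_R1.fastq.gz", "s1_R2.fastq.gz"]),
    ("s2", ["s2_R1.fastq.gz", "s2_R2.fastq.gz"]),
]
qc_channel = run_process(fastqc, samples)
bam_channel = run_process(star_align, samples)
print(qc_channel)
print(bam_channel)
```

Adding a third sample means adding one tuple to `samples`; neither process changes, which is the property that makes 100-sample runs feel natural.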
Nextflow's killer feature is **execution-environment independence**. The same workflow runs locally, on SLURM, AWS Batch, Google Cloud Batch, Azure Batch, or Kubernetes unchanged.
**nf-core** is the community-curated catalog of standard pipelines. In 2026 there are over 100 pipelines, including nf-core/rnaseq, nf-core/sarek, nf-core/scrnaseq, and nf-core/proteinfold. For many new RNA-seq projects, the analysis starts and ends with running nf-core/rnaseq.
**Seqera Tower (Seqera Platform)** is the commercial management dashboard. Execution logs, cost analysis, and data catalogs in a web UI. Academic licenses are free; enterprise licenses are paid.
5. Snakemake — The Python Alternative
Snakemake combines **Python syntax + GNU Make's dependency tracking** in one tool. Johannes Köster released it in 2012 while at the University of Duisburg-Essen, and in 2026 the v8 series is in production.
If Nextflow uses a channel-based dataflow model, Snakemake uses a **rule + input/output file** model. It reasons backwards: to make this file, which rule must run on which input?
# Example Snakefile
SAMPLES = ["s1", "s2", "s3"]
rule all:
    input:
        expand("results/{sample}.sorted.bam", sample=SAMPLES)
rule fastqc:
    input:
        "data/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.zip"
    conda:
        "envs/fastqc.yaml"
    shell:
        "fastqc {input} -o qc/"
rule align:
    input:
        reads="data/{sample}.fastq.gz",
        index="reference/index"
    output:
        "results/{sample}.sorted.bam"
    threads: 8
    shell:
        "bwa-mem2 mem -t {threads} {input.index} {input.reads} | "
        "samtools sort -@ {threads} -o {output}"
**When Nextflow vs Snakemake?**
- **Nextflow** — industry, clinical, large-scale cloud, when you want to use nf-core pipelines off the shelf
- **Snakemake** — academic labs, Python-friendly groups, small to mid-sized analyses, when you author a workflow from scratch
Both support Conda, containers, and SLURM, so the decision comes down to the team's language affinity.
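Snakemake's backward reasoning (start from the requested file, find the rule whose output pattern matches, then recurse into its inputs) can be sketched in plain Python. This is a simplified illustration, not Snakemake's actual resolver: the `RULES` table, `match_rule`, and `plan` are hypothetical names, and only a single `{sample}` wildcard is supported.

```python
import re

# Rule table mirroring the two rules of the example Snakefile.
RULES = {
    "align": {"output": "results/{sample}.sorted.bam",
              "inputs": ["data/{sample}.fastq.gz", "reference/index"]},
    "fastqc": {"output": "qc/{sample}_fastqc.zip",
               "inputs": ["data/{sample}.fastq.gz"]},
}


def match_rule(target):
    """Return (rule name, wildcard values) for the rule that can make target."""
    for name, rule in RULES.items():
        pattern = re.escape(rule["output"]).replace(
            r"\{sample\}", r"(?P<sample>[^/]+)")
        found = re.fullmatch(pattern, target)
        if found:
            return name, found.groupdict()
    return None, {}


def plan(target, existing_files):
    """List the jobs needed to build target, dependencies first."""
    if target in existing_files:
        return []  # nothing to do, the file is already there
    name, wildcards = match_rule(target)
    if name is None:
        raise FileNotFoundError(f"no rule produces {target}")
    jobs = []
    for template in RULES[name]["inputs"]:
        jobs += plan(template.format(**wildcards), existing_files)
    return jobs + [(name, wildcards)]


on_disk = {"data/s1.fastq.gz", "reference/index"}
print(plan("results/s1.sorted.bam", on_disk))
```

The real tool additionally compares timestamps, deduplicates jobs, and schedules them in parallel; the sketch only shows the target-to-rule direction of the search.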
6. BLAST + DIAMOND2 + MMseqs2 — Sequence Search
Tools that answer "what is this DNA or protein sequence similar to?". Three tools do the same job at different speed and accuracy.
BLAST (Basic Local Alignment Search Tool)
The original, made by NCBI in 1990. Accuracy is top-tier but searching billions of proteins against a huge database takes days.
# BLAST+ usage
makeblastdb -in proteins.fasta -dbtype prot -out protdb
blastp -query query.fasta -db protdb \
    -outfmt 6 -num_threads 16 -evalue 1e-5 \
    -out hits.tsv
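The `-outfmt 6` flag used above writes a tab-separated table with twelve standard columns. As a hedged sketch of what to do with `hits.tsv` afterwards, the snippet below keeps the highest-bitscore hit per query; `best_hits` is an illustrative helper, and the two example rows are invented data, not real search results.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query id: hit dict} keeping the top-bitscore hit per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows standing in for a real hits.tsv.
example = io.StringIO(
    "q1\tP12345\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-110\t390\n"
    "q1\tQ99999\t45.0\t180\t90\t4\t10\t190\t20\t199\t2e-20\t95.1\n"
)
for query, hit in best_hits(example).items():
    print(query, hit["sseqid"], hit["bitscore"])
```

DIAMOND writes the same twelve-column layout by default, so the parser should work for both tools as long as the default columns are used.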
DIAMOND2
First released by Benjamin Buchfink in 2014 as a BLAST alternative that runs 100x to 10,000x faster. The v2 line (described in a 2021 *Nature Methods* paper) approaches **BLAST-level sensitivity** in `--ultra-sensitive` mode. For metagenomics with hundreds of millions of reads against NCBI nr, it is effectively required.
diamond makedb --in proteins.fasta -d protdb
diamond blastp -q query.fasta -d protdb -o hits.tsv \
--threads 16 --ultra-sensitive --evalue 1e-5
MMseqs2
Released by Martin Steinegger (Seoul National University, formerly Max Planck) in 2017. Its strength is **clustering in one shot**. UniRef50 and UniRef90 cluster databases are all built with MMseqs2. The MSA step of ColabFold is MMseqs2.
# Protein clustering: group at 50% identity
mmseqs createdb proteins.fasta seqDB
mmseqs cluster seqDB clusterDB tmp --min-seq-id 0.5 -c 0.8
mmseqs createtsv seqDB seqDB clusterDB clusters.tsv
One-liner: **"Use BLAST for accuracy, DIAMOND2 for speed, MMseqs2 for clustering."**
7. SAMtools + BCFtools + GATK — The Standard BAM and VCF Toolkit
When you align sequencing data, you get a **BAM** file (Binary Alignment Map). When you call variants, you get a **VCF** file (Variant Call Format). Three tools dominate these formats.
SAMtools
Created by Heng Li (then at the Wellcome Sanger Institute, later the Broad Institute, now DFCI/Harvard), the Swiss army knife for BAM. Sort, index, stats, view, subset, and markdup all live in one binary.
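# Common BAM post-processing
samtools sort -@ 16 -o sorted.bam input.sam
samtools index sorted.bam
samtools flagstat sorted.bam
samtools view -b -q 30 sorted.bam chr1:1000-2000 > region.bam
# markdup needs the mate tags that fixmate -m adds to name-collated reads
samtools collate -O -u input.sam | samtools fixmate -m -u - - \
    | samtools sort -@ 16 -u - | samtools markdup - dedup.bam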
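Much of what `samtools flagstat` and `samtools markdup` report comes from the FLAG column, a bit field defined in the SAM specification. A small decoder makes those bits readable; the short labels in `SAM_FLAGS` are this sketch's own wording, not official names.

```python
# Bit values follow the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


for value in (99, 147, 1123):
    print(value, decode_flag(value))
```

Flags 99 and 147 are the typical pair for a properly aligned read and its mate; 1123 is 99 with the duplicate bit added, which is what `markdup` sets.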
BCFtools
Originally written by the same Heng Li and now maintained mainly at the Wellcome Sanger Institute, it is the equivalent knife for VCF. Filtering, merging, normalizing, and subsetting all live here.
bcftools view -f PASS -O z -o pass.vcf.gz input.vcf.gz
bcftools norm -f reference.fa pass.vcf.gz -O z -o norm.vcf.gz
bcftools merge sample1.vcf.gz sample2.vcf.gz -O z -o cohort.vcf.gz
bcftools stats cohort.vcf.gz > stats.txt
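To show what the `-f PASS` filter above selects, here is a plain-Python sketch that reads uncompressed VCF text and keeps header lines plus records whose FILTER column (the seventh tab-separated field) equals `PASS`. The function name `pass_records` and the two example variants are invented for illustration; for real files, use bcftools.

```python
import io


def pass_records(handle):
    """Yield header lines and the data lines whose FILTER column is PASS."""
    for line in handle:
        if line.startswith("#"):
            yield line  # meta-information and column header pass through
        elif line.rstrip("\n").split("\t")[6] == "PASS":
            yield line


# Invented two-variant example, not real calls.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t2000\t.\tC\tT\t8\tLowQual\t.\n"
)
kept = list(pass_records(io.StringIO(vcf_text)))
print("".join(kept), end="")
```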
GATK (Genome Analysis Toolkit)
The Broad Institute's de facto standard for variant calling. For human germline variants, most projects run GATK's **HaplotypeCaller + GenomicsDBImport + GenotypeGVCFs** pipeline.
# Per-sample GVCF with HaplotypeCaller
gatk HaplotypeCaller \
    -R reference.fa -I dedup.bam \
    -O sample.g.vcf.gz -ERC GVCF
# Joint genotyping (GenomicsDB)
gatk GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    -L chr1 -V s1.g.vcf.gz -V s2.g.vcf.gz
# Final variant call
gatk GenotypeGVCFs \
    -R reference.fa -V gendb://my_database -O cohort.vcf.gz
In 2026 GATK 5 is the production line, and NVIDIA's **Parabricks** runs GATK on GPUs, shrinking an 18-hour job to 30 minutes. AWS HealthOmics offers Parabricks as a managed service.
8. STAR + HISAT2 + Salmon + Kallisto + DESeq2 + edgeR — Full-Stack RNA-seq
RNA-seq is one of the most common experiments in bioinformatics. Measuring how strongly each gene is expressed lets you compare tumor with normal tissue, samples before and after drug treatment, and time points in a time course.
Alignment vs Pseudo-alignment
[FASTQ] -- alignment ------------> [BAM] --- count ----> [count matrix]
| STAR / HISAT2 htseq / featureCounts
|
+----- pseudo-alignment --------> [count / TPM matrix]
Salmon / Kallisto
- **STAR** — built by Alexander Dobin at Cold Spring Harbor, a splice-aware aligner. Large indexes (~30 GB RAM), fast and accurate. The ENCODE and GTEx standard.
- **HISAT2** — built by Daehwan Kim at Johns Hopkins, a lightweight alternative. Around 8 GB RAM and STAR-level results.
- **Salmon / Kallisto** — skip alignment and statistically infer which transcript each read came from. Ten times faster and easier on disk. Salmon by Rob Patro (Maryland), Kallisto by Lior Pachter (Caltech).
# Salmon example
salmon index -t transcripts.fa -i salmon_index -k 31
salmon quant -i salmon_index -l A \
    -1 reads_1.fq.gz -2 reads_2.fq.gz \
    -p 16 --validateMappings -o quant_out
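The TPM values in the diagram above follow a simple formula: divide each transcript's read count by its length, then rescale so the values sum to one million. The sketch below shows only that arithmetic with invented numbers; the function name `tpm` is illustrative, and Salmon itself works with estimated counts and effective lengths rather than the raw values assumed here.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per base: corrects for longer transcripts attracting more reads.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so that all transcripts in the sample sum to 1,000,000.
    return [rate / total * 1_000_000 for rate in rates]


# Invented example: three transcripts in one sample.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1000]
values = tpm(counts, lengths_bp)
print([round(v) for v in values])
```

The first two transcripts end up with the same TPM even though one has twice the reads, because it is also twice as long.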
DE Analysis — DESeq2 vs edgeR
Once a count matrix is in hand, run **differential expression**. The two Bioconductor giants are DESeq2 and edgeR.
- **DESeq2** — Michael Love (UNC), Wolfgang Huber (EMBL). Negative binomial with shrinkage estimators. The most-cited DE tool.
- **edgeR** — Gordon Smyth (WEHI, Australia). Negative binomial with empirical Bayes. Same group as limma.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "treated", "control"))
summary(res)
plotMA(res, ylim = c(-2, 2))
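Before fitting the negative binomial model, DESeq2 rescales each sample by a size factor computed with the median-of-ratios method. The sketch below reimplements that one step in plain Python to show the arithmetic. It is not the package's code, the name `size_factors` and the tiny count matrix are made up, and real analyses should simply call `DESeq()`.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero in any sample are skipped, because their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # Log of the gene's geometric mean across samples.
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for sample, count in enumerate(gene):
            log_ratios[sample].append(math.log(count) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Invented matrix: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(counts)
print([round(f, 3) for f in factors])
```

Dividing each sample's counts by its factor puts both samples on a common scale, which is what makes the later fold-change estimates comparable.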
One-liner: **"Align with STAR, quantify quickly with Salmon, DE with DESeq2."**
9. AlphaFold 3 (May 2024, DeepMind) — Protein + Ligands + Nucleic Acids
In 2020, AlphaFold 2 effectively solved the protein structure prediction problem at CASP14. In May 2024, **AlphaFold 3** appeared in *Nature* and went one step further — it predicts complexes of **protein + small-molecule ligands + DNA + RNA + ions + modifications** in a single pass.
Key differences:
1. **Diffusion-based structure generator** — AF3 swaps AF2's Evoformer for a lighter Pairformer trunk and replaces the Structure Module with a **diffusion model** that gradually denoises atomic coordinates.
2. **Arbitrary molecules** — Not only protein sequences. SMILES for ligands and FASTA for nucleic acids go in together.
3. **AlphaFold Server (alphafoldserver.com)** — A free academic web service. Weights were released under a non-commercial academic license in November 2024.
Input
  Protein A sequence (FASTA)
  Protein B sequence (FASTA)
  Two strands of DNA
  Ligand (SMILES: CC(=O)Oc1ccccc1C(=O)O)
Output
  PDB-style mmCIF
  pLDDT (per-residue confidence)
  PAE (predicted aligned error)
  ipTM (interface confidence)
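The pLDDT values listed above are usually read in bands. A minimal sketch, assuming scores on a 0 to 100 scale and the cut-offs used for coloring in the AlphaFold Database (above 90, 70 to 90, 50 to 70, below 50); `plddt_band` and `summarize` are hypothetical helpers, and the example scores are invented.

```python
from collections import Counter


def plddt_band(score):
    """Map one pLDDT score (0-100) to a confidence band."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def summarize(plddt_scores):
    """Mean pLDDT plus how many positions fall in each band."""
    bands = Counter(plddt_band(score) for score in plddt_scores)
    mean = sum(plddt_scores) / len(plddt_scores)
    return {"mean": mean, "bands": dict(bands)}


# Invented scores for a six-residue toy example.
print(summarize([96.1, 92.4, 88.0, 71.5, 55.2, 34.9]))
```

A long run of "very low" positions often marks a disordered region rather than a failed prediction, so the band counts are a starting point for inspection, not a verdict.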
**When to use it?**
- Fast drug–target docking
- Mapping the interface of a protein complex
- Screening candidate ligand binding sites
**Limits**: AF3 gives a single static snapshot. Dynamics and conformational ensembles still require MD (molecular dynamics) simulations.
10. ESM3 (EvolutionaryScale)
**EvolutionaryScale**, a company founded by former Meta AI researchers, came out of stealth in June 2024 and released this protein language model the same month. If ESM2 is BERT, ESM3 is **GPT**: it generates protein sequences rather than only embedding them.
Three tracks are modeled together:
1. **Sequence** — amino acid sequence
2. **Structure** — 3D coordinates (tokenized form)
3. **Function** — functional annotations (InterPro, GO)
ESM3-open (1.4 B parameters) is released under a non-commercial/research license. ESM3-medium/large are served through the EvolutionaryScale API.
# ESM3 usage example
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
model = ESM3.from_pretrained("esm3-open").to("cuda")
# Sequence to structure
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK")
protein = model.generate(protein, GenerationConfig(track="structure",
                                                   num_steps=8,
                                                   temperature=0.7))
print(protein.coordinates.shape)
**When to use it?** When you need *generation* — variants of an existing protein, a new protein with a chosen binding site, or a sequence that satisfies a given functional annotation.
11. RoseTTAFold + ProteinMPNN (Baker Lab — 2024 Nobel Prize in Chemistry!)
The reason University of Washington's **David Baker** won the 2024 Nobel in Chemistry is not a single paper. It is that his lab's tools made *designing proteins on a computer* an everyday activity.
RoseTTAFold
A protein structure predictor published in *Science* in 2021, nearly simultaneously with AF2. In 2023 it extended into **RoseTTAFold All-Atom**, handling proteins + ligands + nucleic acids like AF3. RFdiffusion, RFantibody, and RF2NA are its descendants.
ProteinMPNN
An **inverse-folding** model. That is, "given a 3D backbone, design an amino acid sequence that folds to it." Published in *Science* in 2022, it produces sequences that often fold *better* than the original.
# ProteinMPNN inference (conceptual)
# 1. Feed backbone coordinates (N, CA, C)
# 2. Get an amino-acid distribution per residue
# 3. Sample sequences
python protein_mpnn_run.py \
    --pdb_path designed_backbone.pdb \
    --pdb_path_chains A \
    --out_folder ./output \
    --num_seq_per_target 8 \
    --sampling_temp "0.1"
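Steps 2 and 3 above, and the role of `--sampling_temp`, can be illustrated with a temperature-scaled softmax. This is a simplified sketch under stated assumptions: the logits are invented, each position is sampled independently (ProteinMPNN itself decodes residues autoregressively), and `softmax` and `sample_sequence` are illustrative names rather than functions from the ProteinMPNN code base.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits, temperature):
    """Turn logits into probabilities; low temperature sharpens the peak."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_sequence(per_residue_logits, temperature=0.1, seed=0):
    """Draw one amino acid per position from its temperature-scaled softmax."""
    rng = random.Random(seed)
    return "".join(
        rng.choices(AMINO_ACIDS, weights=softmax(logits, temperature))[0]
        for logits in per_residue_logits
    )


def one_hot_logits(letter, strength=5.0):
    """Invented logits that favor a single amino acid."""
    return [strength if aa == letter else 0.0 for aa in AMINO_ACIDS]


logits = [one_hot_logits(aa) for aa in "MKL"]
print(sample_sequence(logits, temperature=0.1))   # near-greedy sampling
print(sample_sequence(logits, temperature=5.0))   # much more diverse
```

At temperature 0.1 the favored residue wins almost every draw, which is why low temperatures are the usual choice when the goal is a small set of high-confidence designs.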
RFdiffusion
Published in *Nature* in July 2023. Generates **protein backbones from scratch**, with conditioning like "make a protein that binds this motif." Baker Lab used it to design ACE2 mimics for SARS-CoV-2, influenza binders, and snake venom neutralizers, and many of them *actually folded*.
One-liner: **"AlphaFold predicts structure. Baker Lab *designs* it."**
12. Boltz-1 (MIT, November 2024) — Open AlphaFold 3
When AlphaFold 3 was published, no code or weights came with it; the inference code and weights that followed in November 2024 are restricted to **non-commercial use**. Open alternatives appeared within months. **Boltz-1**, from MIT's Jameel Clinic, arrived in November 2024.
- **Open weights, MIT license** — commercial use is free
- The same **diffusion architecture** as AF3
- Protein + ligand + nucleic acid + ion complexes
- Accuracy close to AF3 (PoseBusters, RNA targets, and so on)
In 2025 **Boltz-2** extended into dynamics and affinity prediction.
# Boltz-1 quick start
pip install boltz
boltz predict input.yaml --use_msa_server
# input.yaml example
sequences:
  - protein:
      id: A
      sequence: MKTAYIAKQRQISFVKSHFSRQ...
  - ligand:
      id: B
      smiles: "CC(=O)Oc1ccccc1C(=O)O"
**When to use it?** Commercial drug discovery, large-scale screening on an academic cluster, or whenever the AF3 server queue is too long.
13. Chai-1 (Chai Discovery, 2024)
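A San Francisco startup, **Chai Discovery**, released another AF3-style model in September 2024. The initial release allowed only non-commercial use of the code and weights; the repository was later relicensed under Apache 2.0, so check the current LICENSE file before commercial use.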
- Equal-to-or-better benchmarks vs AF3 (per Chai's own paper)
- Protein + ligand + nucleic acid
- Web UI (chaiagent.com) and code (GitHub) released together
- Supports **constrained prediction** — you can specify "this residue and that residue must be close"
In 2025 **Chai-2** added *de novo* antibody design results.
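from pathlib import Path
from chai_lab.chai1 import run_inference
# Chai-1 FASTA headers take the form >kind|name=...
fasta = """
>protein|name=A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK
>ligand|name=B
CC(=O)Oc1ccccc1C(=O)O
""".strip()
# Write the FASTA to disk, because run_inference reads a file path
fasta_path = Path("input.fasta")
fasta_path.write_text(fasta)
result = run_inference(
    fasta_file=fasta_path,
    output_dir=Path("out/"),
    num_trunk_recycles=3,
    num_diffn_timesteps=200,
)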
**Boltz or Chai?** Both are free for academic work. For commercial work, Boltz-1's MIT license is unambiguous, while Chai-1's terms have changed since launch, so confirm the license of the version you deploy. Accuracy is case-dependent, so the 2026 production norm is to *run both* and pick the one that fits.
14. Foldseek (Martin Steinegger) — Structure Search
Developed in Martin Steinegger's group at Seoul National University (the same author behind MMseqs2) and published in *Nature Biotechnology* (van Kempen et al., online in 2023), **Foldseek** is a tool for **structure-based protein search**. If BLAST finds sequence-similar proteins, Foldseek finds *3D-structure-similar* proteins, and does so **thousands of times faster** than classical structural aligners such as Dali or TM-align.
The key idea: encode each residue's local 3D environment as one letter of a **20-state structural alphabet (3Di)**, then run MMseqs2-style search on the resulting strings. This is what made searching the more than 200 million structures in the AlphaFold Database feasible on a single machine.
foldseek easy-search query.pdb afdb result.m8 tmp \
    --format-output "query,target,evalue,alntmscore" \
    --threads 16
Use cases:
- Search the entire AlphaFold DB (more than 200 M structures) in minutes
- "Which species has a protein similar in structure to this one?" — evolutionary inference
- Novelty checks for *de novo* designed proteins
One-liner: **"BLAST is for sequences, Foldseek is for structures."**
15. Anvi'o + QIIME 2 — Microbiomes
Two standards for studying gut, ocean, and soil microbes.
QIIME 2
The 16S/ITS amplicon analysis standard, built by the Knight Lab lineage at UC San Diego and Northern Arizona University. Version 2 (2018) redesigned it around plugins. DADA2 (denoising), q2-feature-classifier (taxonomy), and q2-diversity (diversity metrics) are the core plugins.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 0 --p-trim-left-r 0 \
--p-trunc-len-f 240 --p-trunc-len-r 200 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats stats.qza
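Once denoising has produced a feature table, the diversity plugins reduce each sample's counts to summary numbers. As one worked example, Shannon diversity is H = -Σ p_i log(p_i) over the relative abundances p_i. The sketch below assumes log base 2 and uses invented counts; `shannon` is an illustrative function, not the plugin's implementation, so check the plugin documentation for the base it reports.

```python
import math


def shannon(counts, base=2.0):
    """Shannon diversity of one sample from its per-feature counts."""
    total = sum(counts)
    # Features with zero reads contribute nothing to the sum.
    proportions = [count / total for count in counts if count > 0]
    return -sum(p * math.log(p, base) for p in proportions)


# Invented samples: one perfectly even, one dominated by a single taxon.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
print(round(shannon(even_sample), 3))
print(round(shannon(skewed_sample), 3))
```

Four equally abundant features give exactly 2 bits; the skewed sample scores far lower even though it contains the same number of features.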
Anvi'o
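An integrated metagenomics platform built by A. Murat Eren (formerly University of Chicago and Marine Biological Laboratory, now at the Helmholtz Institute for Functional Marine Biodiversity in Oldenburg) and collaborators. Available since 2015, it takes assembled contigs and read mappings and handles profiling, binning, pangenomics, and visualization in one tool. Its interactive visualization is unusually strong.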
anvi-gen-contigs-database -f contigs.fa -o contigs.db -n "MyMetagenome"
anvi-run-hmms -c contigs.db
anvi-run-ncbi-cogs -c contigs.db
anvi-profile -i sample.bam -c contigs.db --output-dir profile
**When to use which?** 16S amplicon (cheap, taxonomy) goes to QIIME 2. Shotgun metagenome (expensive, functional genes) goes to Anvi'o.
16. Seurat + Scanpy — Single-Cell RNA-seq
10x Genomics Chromium made single-cell sequencing routine, and two downstream-analysis standards solidified around it.
Seurat (R)
The R standard from the Rahul Satija Lab (NYGC). In 2026 v5 is in production with v6 in beta. Clustering, UMAP, integration, and spatial all live inside.
library(Seurat)
data <- Read10X(data.dir = "filtered_feature_bc_matrix")
obj <- CreateSeuratObject(counts = data, project = "pbmc")
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)
obj <- FindNeighbors(obj, dims = 1:20)
obj <- FindClusters(obj, resolution = 0.5)
obj <- RunUMAP(obj, dims = 1:20)
DimPlot(obj, label = TRUE)
Scanpy (Python)
The Python standard from the Theis Lab (Helmholtz Munich). It sits on top of AnnData, and ML-based tools like scvi-tools, CellTypist, and scArches all share the same object.
import scanpy as sc
adata = sc.read_10x_mtx("filtered_feature_bc_matrix", var_names="gene_symbols")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=20)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")
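The two normalization calls in the Scanpy pipeline above (`normalize_total` with `target_sum=1e4`, then `log1p`) are simple arithmetic per cell. Here is a plain-Python sketch of that arithmetic for a single cell, using invented counts; the two functions mirror the Scanpy names for readability but are not Scanpy's implementation.

```python
import math


def normalize_total(cell_counts, target_sum=1e4):
    """Scale one cell's counts so that they sum to target_sum."""
    total = sum(cell_counts)
    return [count / total * target_sum for count in cell_counts]


def log1p_transform(values):
    """Apply log(1 + x), which keeps zeros at zero and compresses large values."""
    return [math.log1p(value) for value in values]


# Invented counts for one cell across four genes.
cell = [1, 3, 6, 0]
normalized = normalize_total(cell)
logged = log1p_transform(normalized)
print(normalized)
print([round(value, 3) for value in logged])
```

Scaling every cell to the same total removes differences in sequencing depth between cells, and the log transform keeps a handful of highly expressed genes from dominating PCA.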
**Seurat or Scanpy?** R/statistics-leaning labs go to Seurat. Python/ML-leaning labs go to Scanpy. Most of the 2026 ML successors (scVI, scGPT, scFoundation, and friends) attach to the Scanpy/AnnData ecosystem.
17. Illumina + 10x Genomics + Oxford Nanopore — Sequencers
The machines that make the data. Three families dominate 2026.
Illumina
The undisputed leader in short-read sequencing. **NovaSeq X Plus** delivers up to 16 Tb per run, while **MiSeq i100** is the small-to-mid-scale standard. The output is **BCL** (raw binary). Conversion uses **bcl2fastq** or its successor **BCL Convert**, which also runs FPGA-accelerated on DRAGEN servers.
# bcl2fastq usage example
bcl2fastq --runfolder-dir 250101_VH00123_456_AACDEFG \
    --output-dir fastq_out --sample-sheet SampleSheet.csv \
    -p 32
**Illumina BaseSpace** is the cloud managed-analysis service, and the **DRAGEN Bio-IT** platform provides FPGA/GPU-accelerated analysis.
10x Genomics
The Chromium platform effectively owns the single-cell and spatial transcriptomics market. **Cell Ranger** (scRNA-seq), **Space Ranger** (Visium), and **Xenium Analyzer** (in situ) are the core software.
cellranger count --id=sample1 \
--transcriptome=refdata-gex-GRCh38-2024-A \
--fastqs=/path/to/fastqs \
--sample=sample1 --localcores=16 --localmem=64
Oxford Nanopore
A UK company spun out of the University of Oxford. **MinION** (USB), **GridION** (desktop), and **PromethION** (data center) form one of the two long-read pillars (the other is PacBio Revio). Reads stretch from tens of kb to several Mb, making it strong for structural variants, methylation, and complete genome assembly.
# Dorado basecalling (Nanopore's current basecaller); the output is unaligned BAM
dorado basecaller hac pod5/ > basecalls.bam
# Convert to FASTQ, then align with minimap2
samtools fastq basecalls.bam > basecalls.fq
minimap2 -ax map-ont reference.fa basecalls.fq | samtools sort -o aln.bam
18. AWS HealthOmics + Google Cloud Healthcare API + Microsoft Genomics
All three clouds run managed services tailored for genomic data. In 2026 the differences are clear.
AWS HealthOmics
Launched in 2022 (formerly Amazon Omics), it runs Nextflow, WDL, and CWL workflows as a managed service. **NVIDIA Parabricks** is integrated, cutting GATK runtimes from 18 hours to 30 minutes on GPUs. Storage is split into reference store, sequence store, variant store, and annotation store.
aws omics start-run \
--workflow-id 1234567 \
--role-arn arn:aws:iam::123456789012:role/HealthOmicsRole \
--name "rnaseq-run-2026-05" \
--parameters file://params.json
Google Cloud Healthcare API
Strong at combining clinical data standards like FHIR, DICOM, and HL7 with genomic data. **Variant Transforms** and integration with **Verily** (an Alphabet subsidiary) are notable. Google's own **DeepVariant** (deep-learning variant calling) is also managed here.
Microsoft Genomics
Runs the BWA + GATK best-practices pipeline as a managed Azure service. The Microsoft Genomics SDK offers .NET and Python clients. It links into the AI for Health initiative.
**Which cloud when?** If you want to run Nextflow + nf-core directly, AWS HealthOmics. If a hospital handles FHIR/DICOM clinical data alongside genomes, GCP Healthcare API. If your enterprise already runs on Azure, Microsoft Genomics.
19. Korea — KAIST / Seoul National University / KIST / KRIBB
Korea's bioinformatics ecosystem has scaled rapidly.
- **KAIST Biological Sciences / Graduate School of Medical Science and Engineering** — Lee Dae-yeop (genome analysis), Kim Jae-Kyoung (systems biology), Cho Kwang-Hyun (systems biology), and others
- **Seoul National University Biology / Genetic Engineering Program** — Martin Steinegger (MMseqs2 and Foldseek author, joined SNU in 2021), Park Jong-hwan, Kim Sang-wook
- **POSTECH Life Sciences** — Kim Sang-wook, Song lab
- **Korea Institute of Science and Technology (KIST)** — natural products and drug discovery
- **Korea Research Institute of Bioscience and Biotechnology (KRIBB)** — Daedeok Innopolis, Daejeon. The hub of national bio R&D.
- **Korean Bioinformation Center (KOBIC)** — Korea's national bio-data hub
- **Korean Society for Bioinformatics (KSBi)** — runs an annual conference
Steinegger's move to SNU was a major event for Korean bioinformatics infrastructure. World-class tools like MMseqs2, Foldseek, and ColabFold (2021) are maintained from Seoul.
20. Japan — RIKEN / NIG / DDBJ
Three institutions anchor Japan's infrastructure.
- **RIKEN (理研)** — multidisciplinary research institute with campuses in Wako, Yokohama, and Kobe. Single cell, neuroscience, and high-performance computing. Supercomputer **Fugaku** lives here.
- **National Institute of Genetics (NIG, 国立遺伝学研究所, Mishima)** — Japan's counterpart to KRIBB. Comparative genomics, evolution, metagenomics.
- **DDBJ (DNA Data Bank of Japan)** — the Japanese arm of INSDC (NCBI GenBank, EBI ENA, DDBJ). Located in Mishima.
- **Institute of Medical Science, University of Tokyo (IMS-UT)** — single cell, immunity
- **Kyoto University iPS Cell Research Institute (CiRA)** — iPS cells
- **Keio University** — IAB Tsuruoka, systems biology
- **AMED and NEDO** — national R&D funding agencies
DDBJ mirrors with NCBI and EBI daily, acting as the primary repository for Japanese genomic data. It plays the same role as Korea's KOBIC and EBI's ENA.
21. Who Should Learn Bioinformatics — Students / Researchers / Pharma / Clinical
The same tools matter for different reasons depending on who is using them.
- **Undergraduates and graduate students (life sciences)** — Start with Galaxy then BioPython/R, save Nextflow for last. ColabFold (web) covers most protein structure needs.
- **Postdocs and staff scientists** — Run nf-core pipelines as is, fork them for your analysis, manage clusters and cloud through Seqera Tower.
- **Small-to-mid biotech** — AlphaFold 3 / Boltz-1 / Chai-1 for docking, RFdiffusion + ProteinMPNN for design, validate experimentally.
- **Large pharma** — In-house AlphaFold variants (BioNeMo, Iambic) plus GATK clinical variant analysis plus AWS HealthOmics.
- **Clinical geneticists and hospitals** — GATK + DRAGEN + ClinVar/OMIM integration. Reports are the deliverable. Security and HIPAA are the deciding factor.
- **Public health and infectious diseases** — Nextstrain, metagenome (Anvi'o/QIIME 2), portable Nanopore sequencing.
One-liner to remember: **"Onboard with Galaxy, automate with Nextflow, look at proteins with AlphaFold, and pull meaning out with R or Python."**
Epilogue — The 2026 State of Bioinformatics
Bioinformatics through the 2010s was about **organizing data**. From the mid-2020s onward, it became about **extracting meaning** and **designing new proteins**. The Nobel Prize endorsed both.
The 2026 landscape, one line per area:
- **Workflows** — Nextflow is the de facto standard, Snakemake the academic alternative
- **Sequence search** — BLAST then DIAMOND2 then MMseqs2 then Foldseek (structure)
- **RNA-seq** — STAR/Salmon then DESeq2/edgeR
- **Protein structure** — AlphaFold 3 / Boltz-1 / Chai-1 / RoseTTAFold
- **Protein design** — RFdiffusion + ProteinMPNN (Baker Lab)
- **Single cell** — Seurat / Scanpy
- **Cloud** — AWS HealthOmics / GCP Healthcare / Microsoft Genomics
If you are a student — start with Galaxy, learn both Python and R, save Nextflow for last. **The era of proteins literally folding in your own hands** is already here.
References
- [Galaxy Project](https://galaxyproject.org/)
- [BioPython](https://biopython.org/)
- [Bioconductor](https://www.bioconductor.org/)
- [Nextflow / Seqera Labs](https://www.nextflow.io/)
- [nf-core pipeline catalog](https://nf-co.re/)
- [Snakemake](https://snakemake.github.io/)
- [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/)
- [DIAMOND2 (Buchfink) on GitHub](https://github.com/bbuchfink/diamond)
- [MMseqs2 (Steinegger) on GitHub](https://github.com/soedinglab/MMseqs2)
- [SAMtools](http://www.htslib.org/)
- [BCFtools](https://samtools.github.io/bcftools/bcftools.html)
- [GATK (Broad Institute)](https://gatk.broadinstitute.org/)
- [STAR aligner on GitHub](https://github.com/alexdobin/STAR)
- [HISAT2](https://daehwankimlab.github.io/hisat2/)
- [Salmon (Patro Lab)](https://salmon.readthedocs.io/)
- [Kallisto (Pachter Lab)](https://pachterlab.github.io/kallisto/)
- [DESeq2 — Love, Anders, and Huber, Genome Biology 2014](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8)
- [edgeR — Robinson, McCarthy, and Smyth, Bioinformatics 2010](https://academic.oup.com/bioinformatics/article/26/1/139/182458)
- [AlphaFold 3 — Abramson et al., Nature 2024](https://www.nature.com/articles/s41586-024-07487-w)
- [AlphaFold Server](https://alphafoldserver.com/)
- [ESM3 — Hayes et al., 2024 / EvolutionaryScale](https://www.evolutionaryscale.ai/)
- [RoseTTAFold — Baek et al., Science 2021](https://www.science.org/doi/10.1126/science.abj8754)
- [ProteinMPNN — Dauparas et al., Science 2022](https://www.science.org/doi/10.1126/science.add2187)
- [RFdiffusion — Watson et al., Nature 2023](https://www.nature.com/articles/s41586-023-06415-8)
- [Boltz-1 — MIT Jameel Clinic on GitHub](https://github.com/jwohlwend/boltz)
- [Chai-1 — Chai Discovery](https://www.chaidiscovery.com/)
- [Foldseek — van Kempen et al., Nature Biotechnology 2024](https://www.nature.com/articles/s41587-023-01773-0)
- [Anvi'o](https://anvio.org/)
- [QIIME 2](https://qiime2.org/)
- [Seurat (Satija Lab)](https://satijalab.org/seurat/)
- [Scanpy (Theis Lab)](https://scanpy.readthedocs.io/)
- [10x Genomics Cell Ranger](https://www.10xgenomics.com/support/software/cell-ranger)
- [Oxford Nanopore Dorado](https://github.com/nanoporetech/dorado)
- [Illumina BaseSpace](https://basespace.illumina.com/)
- [AWS HealthOmics](https://aws.amazon.com/healthomics/)
- [Google Cloud Healthcare API](https://cloud.google.com/healthcare-api)
- [Microsoft Genomics](https://www.microsoft.com/en-us/genomics/)
- [KRIBB (Korea Research Institute of Bioscience and Biotechnology)](https://www.kribb.re.kr/)
- [KOBIC (Korean Bioinformation Center)](https://www.kobic.re.kr/)
- [RIKEN](https://www.riken.jp/)
- [NIG (National Institute of Genetics)](https://www.nig.ac.jp/)
- [DDBJ](https://www.ddbj.nig.ac.jp/)
- [2024 Nobel Prize in Chemistry — Baker, Hassabis, and Jumper](https://www.nobelprize.org/prizes/chemistry/2024/summary/)