Open Source AI Training Datasets in 2026 — Common Crawl / FineWeb (HF) / RedPajama-V2 / Dolma / SlimPajama / The Stack v2 / LAION / COYO-700M (Kakao) Deep Dive

Prologue — A Model Is a Function of Its Data

When we talk about the 2026 LLM race, we tend to focus on model size and architecture. The truth is simpler: a model is a function of its data (Model = f(Data)). Llama 3 pulled ahead of Llama 2 less through transformer changes than through training-data quantity and quality: roughly 15 trillion tokens. It is the same reason 7B models started catching up to older 30B models after FineWeb-Edu landed.

"Garbage in, garbage out" is an old machine-learning saying, but in the LLM era it carries new weight. If 5% of your trillion tokens is garbage, your model has learned 5% worth of hallucinations.

This piece is an atlas of the open source AI training datasets that matter in 2026. It runs from Common Crawl, the substrate of nearly every open LLM, through the refinement lineages (RefinedWeb, RedPajama, FineWeb, Dolma, SlimPajama), the code-only The Stack v2, multimodal LAION/DataComp, Korean COYO-700M and AI Hub, and Japanese NII/NTT/ABEJA datasets, and it ends with licensing, ethics, and the new GDPR right-to-be-forgotten era.

1. The 2026 Dataset Map — Four Categories

Open source datasets split cleanly into four families.

                 ┌─ Web Text ─────────────┐
                 │   Common Crawl          │
                 │   ├ RefinedWeb          │
                 │   ├ RedPajama-V2        │
                 │   ├ FineWeb / FW-Edu    │
                 │   ├ Dolma / SlimPajama  │
                 │   └ C4 / mC4 / OSCAR    │
                 │                         │
                 ├─ Academic / Books ──────┤
Open datasets    │   ├ The Pile            │
                 │   ├ arXiv / S2ORC       │
                 │   ├ Wikipedia / ROOTS   │
                 │   └ CommonPile          │
                 │                         │
                 ├─ Code ──────────────────┤
                 │   ├ The Stack v2        │
                 │   └ StarCoder Data      │
                 │                         │
                 └─ Multimodal ────────────┘
                     ├ LAION-5B / Aesth.
                     ├ DataComp
                     ├ ImageNet / COCO
                     ├ CC12M / Open Images
                     ├ COYO-700M (Kakao)
                     └ Open X-Embodiment

Four key insights:

All roads lead to Common Crawl — RefinedWeb, RedPajama, FineWeb, Dolma are just different refinement strategies on top of the same crawl.
The refinement pipeline is the differentiator — same raw material; different heuristics, dedup strategies, and LLM-based classifiers decide token quality.
2024 to 2026 was the golden age of refinement: FineWeb-Edu (May 2024) popularized LLM-annotated quality classifiers, and most new web datasets since follow that pattern.
Multimodal is its own universe — LAION wobbled under copyright lawsuits; DataComp is filling the gap.

2. Common Crawl — Foundation of Everything

Common Crawl is a non-profit, founded in 2007, that crawls the web on a roughly monthly cadence and releases the results for free. Petabytes of fetched pages, billions of URLs. Effectively the first input of every open LLM in existence.

2.1 Formats

WARC (Web ARChive): raw HTTP responses including headers, HTML, binaries.
WAT: extracted metadata as JSON.
WET: plain text only.

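Older pipelines such as C4 start from WET, where HTML parsing is already done and only boilerplate and junk remain to be filtered. Newer ones such as RefinedWeb and FineWeb start from WARC and run their own text extraction, which costs more compute but yields cleaner text.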

2.2 Crawl Units

New crawls drop roughly monthly, e.g. CC-MAIN-2026-21 (the 21st week of 2026). One crawl typically holds a few billion pages and a few hundred TiB of uncompressed content. The cumulative archive runs to petabytes.
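Crawl identifiers follow a fixed pattern, so they are easy to take apart in code. The following is a minimal sketch, assuming the `CC-MAIN-<year>-<week>` form described above; the helper `parse_crawl_id` and the `CrawlId` type are illustrative names, not part of any Common Crawl tooling.

```python
import re
from typing import NamedTuple


class CrawlId(NamedTuple):
    year: int
    week: int


# Matches identifiers of the form CC-MAIN-YYYY-WW (assumed pattern).
_CRAWL_RE = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_crawl_id(crawl_id: str) -> CrawlId:
    """Split a crawl identifier into its year and week number."""
    match = _CRAWL_RE.match(crawl_id)
    if match is None:
        raise ValueError(f"not a CC-MAIN crawl id: {crawl_id!r}")
    return CrawlId(year=int(match.group(1)), week=int(match.group(2)))


print(parse_crawl_id("CC-MAIN-2026-21"))
```

Sorting snapshots by the parsed `(year, week)` tuple gives chronological order, which is handy when choosing a date range of crawls to process.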

2.3 Limits

Heavy duplication: mirrored pages everywhere. Dedup is mandatory.
Extreme quality variance: Wikipedia text sits next to auto-generated SEO spam.
Language skew: English ~45%, then Russian / German / Chinese / Japanese / Korean.
robots.txt respected: opt-out domains simply do not appear.

2.4 Download

# WET index for a given crawl
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2026-21/

# Fetch one segment via Python
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
s3.download_file(
    "commoncrawl",
    "crawl-data/CC-MAIN-2026-21/segments/.../wet/...wet.gz",
    "sample.wet.gz",
)

Common Crawl itself is unfit for training — you always go through a refined derivative.

3. RefinedWeb (Falcon Team, 2023)

RefinedWeb is the Common Crawl refinement produced by UAE's Technology Innovation Institute (TII) for the Falcon models. It was the first dataset to prove "web data alone can beat curated book+paper mixes (e.g. The Pile)".

3.1 Contributions

A 5T-token web-only dataset (only a 600B-token sample is public).
The MacroData Refinement (MDR) pipeline: URL filter, text extraction (trafilatura), language ID, heuristics, MinHash dedup.
No model-based filter — heuristics + dedup pushed quality high enough. Elegance of the simple approach.

3.2 Pipeline Summary

Common Crawl WARC
   │
   ▼
URL filter (blacklists, NSFW / harmful domains)
   │
   ▼
Trafilatura (HTML → article body)
   │
   ▼
Language ID (fastText, English only)
   │
   ▼
Heuristics (repeated-line ratio, mean word length, ...)
   │
   ▼
MinHash document dedup + exact-substring dedup
   │
   ▼
600B tokens (public subset)
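The heuristics stage is the easiest one to picture in code. Below is a minimal sketch of two such rules, repeated-line ratio and mean word length. The thresholds and function names are illustrative assumptions, not the values used by the Falcon team.

```python
from collections import Counter


def repeated_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that duplicate another line in the document."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(lines)


def mean_word_length(text: str) -> float:
    """Average number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def passes_heuristics(
    text: str,
    max_repeated_ratio: float = 0.3,  # assumed threshold
    min_word_len: float = 3.0,        # assumed threshold
    max_word_len: float = 10.0,       # assumed threshold
) -> bool:
    """Keep a document only if it looks like ordinary prose."""
    if repeated_line_ratio(text) > max_repeated_ratio:
        return False
    return min_word_len <= mean_word_length(text) <= max_word_len
```

Rules like these are cheap enough to run on every document, which is why they sit before the far more expensive deduplication stage.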

3.3 Impact

RefinedWeb trained Falcon-7B / 40B, which beat LLaMA-1 at the time. Subsequent refinement datasets adopted RefinedWeb's dedup strategy (MinHash fuzzy dedup plus exact-match removal) as the de facto standard.
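MinHash estimates how similar two documents are without comparing them word by word. The sketch below is a toy, standard-library version of the idea (word shingles, seeded hashes, signature agreement). The shingle size, the number of hashes, and all function names are assumptions chosen for illustration; production pipelines add locality-sensitive hashing on top so that not every pair has to be compared.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item` under a given seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of signature positions that agree approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates, and only one of them is kept.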

4. RedPajama-V2 (Together AI, 2023)

RedPajama began as a community attempt to reproduce LLaMA-1's data mix openly. V1 was 1.2T tokens — a faithful LLaMA recipe replica. V2 went much further.

4.1 Scale

Over 100T raw tokens across 84 Common Crawl snapshots (2014–2023), about 30T after deduplication and filtering
Five languages: English, German, French, Spanish, Italian
Every document ships with precomputed quality signals, so users filter to their own threshold.

4.2 The Quality-Signal Innovation

RedPajama-V2 does not just hand you cleaned text. It attaches 40+ quality metrics (perplexity, natural-language ratio, code ratio, …) to each document, so you can build your own filter.

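# RedPajama-V2 loading example
import json
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",  # or "tail"
    snapshots=["2023-14"],
    languages=["en", "de"],
)

def filter_quality(doc):
    # quality_signals is stored as a JSON string (layout as described on the
    # dataset card): each signal is a list of [start, end, value] spans, so a
    # document-level value sits at [0][2].
    signals = json.loads(doc["quality_signals"])
    # rps_lines_javascript_counts is a per-line signal; turn it into a ratio.
    js_lines = signals["rps_lines_javascript_counts"]
    js_ratio = sum(1 for _, _, count in js_lines if count) / max(len(js_lines), 1)
    return (
        signals["rps_doc_lorem_ipsum"][0][2] == 0
        and signals["rps_doc_word_count"][0][2] >= 50
        and js_ratio < 0.1
    )

filtered = ds.filter(filter_quality)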

4.3 Significance

First dataset to bake quality filters into the dataset itself. FineWeb and its kin then standardized this approach.

5. FineWeb (Hugging Face, Apr 2024)

FineWeb is the 15T-token English web dataset Hugging Face released in April 2024. As of 2026, it is the most widely used open LLM training baseline.

5.1 Why FineWeb Matters

Benchmarks right after release showed FineWeb beat RefinedWeb, C4, and RedPajama-V2 at matched token counts. Why:

All 96 Common Crawl dumps used (RefinedWeb used only some).
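Text extracted from WARC with trafilatura rather than taken from the pre-extracted WET files.
Improved heuristics: C4 and Gopher-style filters (as used in RefinedWeb) merged with new custom rules.
Per-dump MinHash dedup: each dump is deduplicated on its own, because global dedup across dumps cost more compute without improving downstream results.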
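Per-dump deduplication means duplicates are only looked for inside one crawl snapshot at a time. The following is a minimal sketch of that scoping, with exact hashing standing in for MinHash and an assumed `dump` field on each record; the function name is illustrative.

```python
import hashlib
from collections import defaultdict


def dedup_per_dump(docs: list[dict]) -> list[dict]:
    """Drop documents whose text already appeared earlier in the same dump."""
    seen_by_dump: dict[str, set[str]] = defaultdict(set)
    kept = []
    for doc in docs:
        fingerprint = hashlib.sha1(doc["text"].encode("utf-8")).hexdigest()
        seen = seen_by_dump[doc["dump"]]
        if fingerprint in seen:
            continue  # duplicate inside this dump
        seen.add(fingerprint)
        kept.append(doc)
    return kept
```

Because each dump keeps its own `seen` set, the work parallelizes per snapshot and memory stays bounded by the largest single dump rather than the whole corpus.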

5.2 Pipeline (datatrove)

Hugging Face built a library, datatrove, to produce FineWeb, and they released the whole pipeline.

pip install datatrove
python pipeline.py

A typical pipeline.py looks like:

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, GopherQualityFilter, C4QualityFilter
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.writers import JsonlWriter

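pipeline = [
    # Read the raw WARC records of one crawl
    WarcReader(
        "s3://commoncrawl/crawl-data/CC-MAIN-2026-21/segments/",
        glob_pattern="*/warc/*",
    ),
    Trafilatura(),                     # HTML -> main text
    LanguageFilter(languages=["en"]),  # keep English only
    GopherQualityFilter(),
    C4QualityFilter(),
    JsonlWriter("output/"),            # filtered documents
    # First MinHash stage only: bucketing, clustering, and the final
    # duplicate filter run as separate follow-up pipelines.
    MinhashDedupSignature(output_folder="dedup_sigs/"),
]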

executor = LocalPipelineExecutor(pipeline=pipeline, tasks=64, workers=16)
executor.run()

5.3 Usage

from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for doc in ds:
    print(doc["text"][:200])
    break

6. FineWeb-Edu (HF, May 2024) — The Education Filter Revolution

FineWeb-Edu is a 1.3T-token subset of FineWeb released shortly after the parent dataset. It adds one extra step: a classifier trained on LLM annotations keeps only the most "educational" documents.

6.1 How It Was Built

Llama-3-70B-Instruct scored 500K documents on "educational value" 0–5 via prompt engineering.
Those scores trained a small classifier: a regression head on top of snowflake-arctic-embed-m embeddings.
The classifier swept across all 15T FineWeb tokens, keeping only score-3-and-above → 1.3T tokens.
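The last step is a plain threshold on the classifier output. The following is a minimal sketch, assuming each record carries a float `score` field produced by the classifier; the field name, helper names, and rounding rule are illustrative assumptions rather than the exact published procedure.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw classifier score to the 0-5 scale and round to an integer."""
    return int(round(max(0.0, min(score, 5.0))))


def keep_educational(docs: list[dict], threshold: int = 3) -> list[dict]:
    """Keep documents whose rounded educational score meets the threshold."""
    return [d for d in docs if to_int_score(d["score"]) >= threshold]
```

Raising `threshold` trades corpus size for average quality; lowering it keeps more tokens at the cost of more marginal documents.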

6.2 Result

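In Hugging Face's ablations, small models trained on FineWeb-Edu clearly outperformed same-size models trained on plain FineWeb on knowledge- and reasoning-heavy benchmarks such as MMLU and ARC, and they reached a given score with far fewer training tokens. A big jump in token efficiency.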

6.3 Significance

"Quality over quantity" stopped being a slogan and became a measurable fact. Most major web datasets released since then include a model-based quality-classifier step.

from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-100BT",
    split="train",
    streaming=True,
)

7. The Pile (EleutherAI) / Dolma (Allen AI) / SlimPajama (Cerebras)

7.1 The Pile (2020, EleutherAI)

The Pile is the 825 GB dataset that trained GPT-Neo / GPT-J / Pythia. A mix of 22 sub-datasets:

Common Crawl (Pile-CC)
PubMed Central, ArXiv, FreeLaw, USPTO Backgrounds
StackExchange, GitHub, Books3 (removed for copyright issues)
OpenWebText2, Wikipedia, OpenSubtitles
and more

The Books3 incident: in 2023 Books3 was revealed as a copyright-infringing dataset; it was pulled from The Pile. Open datasets have treated books cautiously ever since.

7.2 Dolma (Allen AI, 2024)

Dolma is the 3T-token dataset Allen AI released for OLMo. Distinctive features:

Fully transparent licensing: every document ships with provenance and license metadata.
Reproducible pipeline: the dolma toolkit is open.
Composition: Common Crawl refinement + Wikipedia + The Stack v1 + Reddit + arXiv + academic publishing + books.

pip install dolma
dolma tag --documents "path/to/documents/*" --taggers c4_v1

7.3 SlimPajama (Cerebras, 2023)

SlimPajama is RedPajama-V1 with additional deduplication — a 627B-token version. Key insight:

RedPajama-V1 contained up to ~50% duplicates.
Dedup halved the token count, but at matched token counts SlimPajama consistently beat RedPajama-V1.
Dedup = free lunch.

This result made aggressive dedup the default in every dataset that followed.

8. OSCAR (Inria) / C4 + mC4 (Google)

8.1 OSCAR (Inria, 2019–)

OSCAR (Open Super-large Crawled Aggregated coRpus) is the multilingual dataset led by INRIA in France. Language-classified Common Crawl in 151 languages.

As of 2024 (OSCAR 2301): ~35 GB Korean, ~270 GB Japanese.
The dominant base for early Korean/Japanese LLM training.

8.2 C4 (Google, 2019)

C4 (Colossal Clean Crawled Corpus) was the cleaned subset released in the T5 paper. About 750 GB of text, roughly 156B tokens. Simple heuristics:

Sentences must end with ., ?, !, or ".
At least five sentences.
Filter "lorem ipsum" and similar templated junk.
English only (langdetect).
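The rules above translate almost directly into code. The following is a rough sketch, not Google's implementation: the sentence counter is a crude regex, the language-detection step is left out, and the function name and `min_sentences` default simply mirror the list above.

```python
import re
from typing import Optional

TERMINAL_PUNCTUATION = (".", "?", "!", '"')


def clean_page(text: str, min_sentences: int = 5) -> Optional[str]:
    """Apply C4-style line and page rules; return cleaned text or None to drop."""
    if "lorem ipsum" in text.lower():
        return None  # templated placeholder text
    # Keep only lines that end in terminal punctuation.
    lines = [ln.strip() for ln in text.splitlines()]
    kept = [ln for ln in lines if ln.endswith(TERMINAL_PUNCTUATION)]
    cleaned = "\n".join(kept)
    # Rough sentence count: each run of ., ? or ! closes one sentence.
    sentences = re.findall(r"[^.?!]+[.?!]+", cleaned)
    if len(sentences) < min_sentences:
        return None
    return cleaned
```

The line rule is what strips menus, buttons, and other navigation fragments, since those rarely end in sentence punctuation.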

8.3 mC4 (Google, 2021)

mC4 (multilingual C4) is the multilingual extension. 101 languages, 27 TB. Trained mT5. Korean ~90 GB, Japanese ~200 GB.

from datasets import load_dataset
ds = load_dataset("mc4", "ko", split="train", streaming=True)

C4 / mC4 represent the older heuristic-only refinement era. In 2026 FineWeb is displacing them in English; mC4 and OSCAR still dominate non-English.

9. CommonPile (a16z) / ROOTS (BigScience BLOOM)

9.1 CommonPile (2024–, a16z funded)

CommonPile is The Pile's next generation, built by EleutherAI talent with a16z funding. Goals:

Only data with clear licensing (CC0, PD, CC-BY, …).
Books only from the public domain (mostly Project Gutenberg).
Heavier weight on government documents and open-access academic publishing.

Released in tranches through 2024–2026; expected to become The Pile's successor.

9.2 ROOTS (BigScience BLOOM, 2022)

ROOTS is the 1.6 TB multilingual dataset used to train BLOOM. 46 natural languages + 13 programming languages. Distinguishing features:

Language communities participated directly in curation (participatory data governance).
License and provenance metadata on every document.
Korean and Japanese are not included (focus was English plus Latin American, African, and other Asian languages).

ROOTS' governance model — "the data subject participates in the curation" — set the bar for every ethical dataset that followed.

10. arXiv / Wikipedia / S2ORC — Academic Data

10.1 Wikipedia Dumps

Wikipedia publishes a full dump every month. The cleanest, fact-densest text on the open web. Downside: tiny by LLM standards (~20 GB English).

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Process with wikiextractor or wikipedia2vec.

10.2 arXiv Corpus

arXiv hosts 2.2M+ papers accumulated since 1991. LaTeX sources, PDFs, and metadata are all available.

Training-side processing: LaTeX → plain text (math tokenization is hard).
Domains: math, physics, CS, statistics, quantitative biology / finance.

Bulk download via s3://arxiv/ (requester-pays).

10.3 S2ORC (Allen AI, 2020–)

S2ORC (Semantic Scholar Open Research Corpus) is metadata, abstracts, and partial full text for 80M+ academic papers. Backbone of Semantic Scholar.

Open-access papers: full text (~10M).
Closed-access: abstracts only.
Citation graph included (paper-to-paper edges).

import requests
api_key = "YOUR_KEY"
r = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "large language models", "limit": 10},
    headers={"x-api-key": api_key},
)

Academic data contributes disproportionately to reasoning and factuality. The Pile, Dolma, and CommonPile all weight arXiv and S2ORC as core components.

11. Code — The Stack v2 (BigCode, 900 GB) / StarCoder Data

11.1 The Stack (BigCode, 2022–)

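The Stack is the code dataset from the BigCode project, led by Hugging Face and ServiceNow. v1 was about 6 TB of source code. v2 is far larger: roughly 67 TB raw and about 32 TB after deduplication. The "900" figure usually quoted for v2 refers to its training set of roughly 900B tokens, not to gigabytes.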

600+ programming languages.
Only permissively licensed GitHub repos (MIT, BSD, Apache 2.0, ISC, …).
Author opt-out: search your GitHub username at https://huggingface.co/spaces/bigcode/in-the-stack to request removal.

11.2 The Stack v2 (2024)

Software Heritage partnership broadens code coverage.
Now includes issue discussions, PR comments, notebooks, commit messages.
License metadata attached per document.

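from datasets import load_dataset
# Note: the Hub dataset holds file identifiers (SWHIDs) and metadata;
# file contents are downloaded separately from Software Heritage.
ds = load_dataset(
    "bigcode/the-stack-v2",
    "Python",
    split="train",
    streaming=True,
)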

11.3 StarCoder Data

StarCoder Data is the further-refined slice of The Stack that BigCode used to train StarCoder; StarCoder2 uses an analogous slice of The Stack v2. StarCoder was trained for about 1T tokens across 80+ programming languages.

Code data is widely reported to lift systematic reasoning in general LLMs: several research groups have found that adding code to the pre-training mix improves performance on non-code reasoning tasks.

12. Korean — COYO-700M (Kakao Brain) / AI Hub / NIA / KAIST / Naver HyperCLOVA

12.1 COYO-700M (Kakao Brain, 2022)

COYO-700M is the 700M-pair image-text dataset from Kakao Brain. A Korean-origin counterpart to LAION-400M.

Sourced from <img alt="..."> pairs in Common Crawl HTML.
CLIP-score and aesthetic-score filters.
Larger than LAION-400M; used to train Kakao's own CLIP.
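Harvesting image-text pairs from alt attributes needs nothing beyond an HTML parser. The following is a minimal standard-library sketch of that first step; the class and function names are illustrative, and a real pipeline would also resolve relative URLs and apply the score filters mentioned above.

```python
from html.parser import HTMLParser


class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that have non-empty alt text."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html: str) -> list[tuple[str, str]]:
    """Return every (image URL, alt text) pair found in an HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    return parser.pairs
```

Images with empty or missing alt text are skipped, which already removes a large share of decorative graphics before any model-based filtering.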

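from datasets import load_dataset
# Loads metadata only (image URLs, alt text, scores); images are fetched separately.
ds = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)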

12.2 AI Hub (NIA)

AI Hub (aihub.or.kr) is the Korean government's AI dataset portal. Hundreds of datasets across text, speech, video, image. The standard source for Korean-language LLM training.

Korean conversation, Korean translation, Korean STT/TTS.
Specialized corpora for medical, legal, and financial Korean.
NIA terms of use apply (mixed commercial / non-commercial).

12.3 NIA Datasets

The Korean National Information Society Agency (NIA) runs annual dataset-construction programs. By 2026 the catalog has 1,000+ datasets.

12.4 KAIST Datasets

KAIST's Kim Jaechul AI Graduate School and others release Korean academic benchmarks:

KLUE (Korean Language Understanding Evaluation, 8 tasks).
KoBEST (Korean Balanced Evaluation of Significant Tasks).
KMMLU (Korean MMLU).

12.5 Naver HyperCLOVA Data

Naver's HyperCLOVA X is trained on Naver-curated Korean data. Some pieces stay proprietary, but the public side (KorQuAD, NSMC, KLUE) is rich.

A typical Korean LLM data recipe in 2026 looks like AI Hub + COYO + custom crawl + mC4(ko) + OSCAR(ko).

13. Japanese — NII / NTT / ABEJA

13.1 National Institute of Informatics (NII)

NII is the Japanese academic dataset hub. Notable resources:

NII Test Collection for IR Systems (NTCIR).
Cleaned Japanese Wikipedia dumps.
Academic paper corpora (CiNii).

13.2 NTT Datasets

NTT, Japan's largest telecom, runs its own LLM research. Public data is limited, but:

Japanese-task benchmarks (JGLUE, etc.).
Partial data-recipe disclosures from LLMs trained on the ABCI supercomputer.

13.3 ABEJA / Stockmark / CyberAgent

Japanese AI startups have released Japanese LLM datasets:

ABEJA: partial release of training data for ABEJA-LLM 7B / 13B.
Stockmark: business-domain Japanese corpus.
CyberAgent: advertising / marketing Japanese corpus.

13.4 Standard Japanese Recipe

A typical Japanese LLM recipe in 2026:

mC4(ja) + OSCAR(ja) — web base.
Japanese Wikipedia + public-domain books (Aozora Bunko etc.).
NII / NTCIR — academic.
Japanese code corpus releases via ABCI.

14. Image-Text — LAION-5B / DataComp / ImageNet / CC12M / Open Images / COCO

14.1 LAION-5B (LAION, 2022)

LAION-5B is the 5.8B-pair image-text dataset, scraped from Common Crawl <img alt="..."> pairs and filtered by CLIP score. The base for Stable Diffusion training.
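CLIP-score filtering keeps a pair only when the image embedding and the text embedding point in a similar direction. The following is a minimal sketch with plain Python lists standing in for real CLIP embeddings; the default threshold and the function names are assumptions for illustration, and an actual pipeline would compute the embeddings with a CLIP model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_clip_score(pairs, threshold: float = 0.28):
    """pairs: iterable of (image_embedding, text_embedding, caption) tuples.

    Returns the captions whose image-text similarity meets the threshold.
    """
    return [
        caption
        for image_emb, text_emb, caption in pairs
        if cosine_similarity(image_emb, text_emb) >= threshold
    ]
```

The threshold is the main tuning knob: a higher value yields fewer but better-aligned pairs.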

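LAION litigation and takedown (2023–): Getty Images and groups of artists sued downstream model providers such as Stability AI, and a photographer sued LAION itself in Germany. Separately, LAION took LAION-5B offline in December 2023 after a Stanford Internet Observatory report found links to child sexual abuse material, and it published a cleaned Re-LAION-5B in 2024. As of 2026 LAION's legal status remains a grey area.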

14.2 LAION-Aesthetics

LAION-Aesthetics is the aesthetic-score-high subset of LAION-5B — used for the high-quality generation finetuning stage of Stable Diffusion. ~120M pairs.

14.3 DataComp (2023–)

DataComp has emerged as the LAION alternative. It starts from 12.8B Common Crawl pairs and lets participants compete on filter strategies, evaluated against downstream model quality.

DataComp-1B: 1B pairs (a LAION-400M alternative).
Clean provenance.
Academic license, commercial use permitted.

from datasets import load_dataset
ds = load_dataset("mlfoundations/datacomp_1b", split="train")

14.4 ImageNet (2009–)

ImageNet is the computer-vision classic. 14M images, 20K+ classes. ImageNet-1K (1000 classes, 1.3M images) is most-used. Still the vision-eval standard in 2026.

14.5 CC12M (Google, 2021)

CC12M (Conceptual 12M) is 12M image-text pairs from Google. Widely used for vision-and-language pre-training.

14.6 Open Images (Google, 2016–)

Open Images ships 9M images with object detection and segmentation labels. 600 object classes. Larger than COCO.

14.7 COCO (Microsoft, 2014–)

COCO (Common Objects in Context) is 330K images, 80 object classes, 5 captions each. The standard benchmark for detection, segmentation, and captioning.

14.8 Standard Multimodal Recipe 2026

Open vision-language models (LLaVA, Idefics, etc.) typically use:

Pre-training: hundreds of millions of pairs from LAION or DataComp.
Instruction tuning: COCO captions + ScienceQA + custom curation.
Eval: ImageNet, COCO, MMVet, MMMU.

15. Robotics — Open X-Embodiment

Open X-Embodiment (RT-X, 2023–) is the Google DeepMind-led robotics dataset. 1M+ episodes across 22 robot platforms.

15.1 Core Idea

Before Open X-Embodiment, robot learning data was siloed per platform. UR5 data did not transfer to Franka. Open X-Embodiment unified across robots via the RLDS (Reinforcement Learning Datasets) format.

21 collaborating research institutions (Stanford, CMU, Berkeley, Google, …).
Unified action space (6-DOF end effector + gripper).
Unified visual observation (RGB camera, optional depth).
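A shared action layout is what lets one policy consume data from many robots. The following is a minimal sketch of a 7-dimensional action record (6-DOF end-effector motion plus gripper); the field names, the gripper convention, and the conversion helper are hypothetical and do not reproduce the RLDS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndEffectorAction:
    """7-D action: 6-DOF end-effector motion plus one gripper command."""
    dx: float
    dy: float
    dz: float
    droll: float
    dpitch: float
    dyaw: float
    gripper: float  # assumed convention: 0.0 = open, 1.0 = closed

    def as_vector(self) -> list[float]:
        """Flatten the action into the order a policy network would consume."""
        return [self.dx, self.dy, self.dz,
                self.droll, self.dpitch, self.dyaw, self.gripper]


def from_platform_action(translation, rotation, gripper_closed: bool) -> EndEffectorAction:
    """Map a hypothetical platform-specific action into the shared layout."""
    return EndEffectorAction(*translation, *rotation, 1.0 if gripper_closed else 0.0)
```

Each platform then needs only one small adapter like `from_platform_action`, while everything downstream sees the same vector shape.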

15.2 RT-1, RT-2, RT-X

RT-2-X, trained on Open X-Embodiment, first showed that a skill learned on one robot transfers to another. Robotics' "ImageNet moment".

import tensorflow_datasets as tfds
ds = tfds.load("bridge", split="train")

15.3 2026 State

Open X-Embodiment v2 (2025) covers 60+ robot platforms and 2M episodes. Tesla Optimus and Figure 02 humanoid data has partially joined.

16. Licensing + Ethics — Copyright, Opt-Out, Right to Be Forgotten

16.1 License Matrix

Dataset	License	Commercial Use
Common Crawl	Public	Yes (per-page original copyright still applies)
RefinedWeb	ODC-By 1.0	Yes
RedPajama-V2	Apache 2.0 (code), per-source for data	Partial
FineWeb / FineWeb-Edu	ODC-By 1.0	Yes
The Pile	MIT (code); some data removed (Books3)	Partial
Dolma	ODC-By 1.0	Yes
SlimPajama	Apache 2.0	Yes
The Stack v2	Original per-document license	Yes (if opt-outs respected)
LAION-5B	CC-BY 4.0 (metadata)	Contested
DataComp	CC-BY 4.0	Yes
COYO-700M	CC-BY 4.0	Yes
Open X-Embodiment	Apache 2.0	Yes
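A table like this is most useful when it is checked automatically against a training recipe. The following is a small sketch that transcribes the commercial-use column as-is and flags anything that is not a plain "yes"; the function name is illustrative, and the parenthetical conditions in the table (per-page copyright, opt-outs) are deliberately dropped, so a "yes" here still needs human review.

```python
# Commercial-use column from the table above, simplified to one word each.
COMMERCIAL_USE = {
    "Common Crawl": "yes",
    "RefinedWeb": "yes",
    "RedPajama-V2": "partial",
    "FineWeb / FineWeb-Edu": "yes",
    "The Pile": "partial",
    "Dolma": "yes",
    "SlimPajama": "yes",
    "The Stack v2": "yes",
    "LAION-5B": "contested",
    "DataComp": "yes",
    "COYO-700M": "yes",
    "Open X-Embodiment": "yes",
}


def needs_legal_review(recipe: list[str]) -> list[str]:
    """Return datasets in a recipe that are unknown or not a plain 'yes'."""
    return [name for name in recipe if COMMERCIAL_USE.get(name) != "yes"]


print(needs_legal_review(["Dolma", "LAION-5B", "The Pile", "MyCrawl"]))
```

Unknown datasets are flagged too, so a custom crawl cannot slip into a recipe without someone recording its license status.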

16.2 Opt-Out Mechanisms

Opt-out systems standardizing in 2026:

robots.txt: the established crawler standard. Disallow: / removes you from Common Crawl.
The Stack's "Am I in The Stack?": search your GitHub username; request removal.
Spawning's "Have I Been Trained?" (haveibeentrained.com): image-text opt-out.
ai.txt: a newer standard some domains adopt to explicitly state AI-training permission.
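The robots.txt route can be checked locally before publishing the file. The following is a minimal sketch using Python's standard `urllib.robotparser`; `CCBot` is Common Crawl's crawler user agent, while the helper name, the sample rules, and the example URL are illustrative.

```python
from urllib.robotparser import RobotFileParser


def allows_crawler(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a robots.txt body against one crawler and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Block Common Crawl's bot while leaving other crawlers unaffected.
ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(allows_crawler(ROBOTS, "CCBot", "https://example.com/page"))
print(allows_crawler(ROBOTS, "OtherBot", "https://example.com/page"))
```

Note that such a rule only affects future crawls; pages already captured in earlier snapshots stay in those snapshots.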

16.3 GDPR and the Right to Be Forgotten

Whether GDPR Article 17 ("right to erasure") applies to LLMs remains unsettled.

Easy to remove from pre-training corpora (document-level).
Much harder to remove from already-trained model weights — the field of machine unlearning is emerging.
The EU AI Act started partial enforcement in 2025–2026, with LAION among the first to feel the impact.
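Document-level removal from a pre-training corpus is the tractable half of the problem. The following is a minimal sketch, assuming each record carries a `url` field and that an erasure request names whole domains; the function name and record layout are illustrative.

```python
from urllib.parse import urlparse


def remove_by_domain(docs: list[dict], erased_domains: set[str]) -> list[dict]:
    """Drop every document whose URL host is, or is a subdomain of, an erased domain."""
    kept = []
    for doc in docs:
        host = (urlparse(doc["url"]).hostname or "").lower()
        erased = any(host == d or host.endswith("." + d) for d in erased_domains)
        if not erased:
            kept.append(doc)
    return kept
```

Matching on the exact host or a dot-prefixed suffix avoids accidentally removing unrelated sites whose names merely end with the same letters. Nothing comparable exists for weights that were already trained on the removed documents.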

16.4 Ethical-Use Checklist

When training a new LLM, run through this checklist:

Do you use only data with declared licensing?
Do you respect documented opt-outs?
Have you filtered personal data (PII)?
Have you filtered harmful content?
Have you released a data card (Datasheet for Datasets)?
Have you documented data governance (who participated in curation)?
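The PII item is the one most often automated first. The following is a rough sketch that masks email addresses and IPv4-looking strings with regular expressions; the patterns and placeholder tokens are simplified assumptions and will miss many real-world cases such as names, phone numbers, and postal addresses.

```python
import re

# Simplified patterns; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_pii(text: str) -> str:
    """Replace email addresses and IPv4-looking strings with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)


print(scrub_pii("Mail jane.doe@example.com from 192.168.0.1 now"))
```

Masking rather than deleting keeps sentence structure intact, so the surrounding text remains usable for training.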

Epilogue — The Data Era

The real center of gravity in 2026 LLM competition is not model weights — it is the dataset. Who has more clean tokens, who covers more domains, who carries less licensing risk: those choices decide the next generation.

Open source datasets are the equalizer. They are nearly the only way small labs and startups can stand against the proprietary datasets of hyperscalers. With FineWeb-Edu, a trillion high-quality tokens are open to anyone. The next game is who uses those tokens best.

Garbage in, garbage out — gold in, gold out.

The teams that treat data seriously lead the next generation.

References

Common Crawl — https://commoncrawl.org/
RefinedWeb (Falcon) — https://huggingface.co/datasets/tiiuae/falcon-refinedweb
RedPajama-V2 (Together AI) — https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
FineWeb (HF) — https://huggingface.co/datasets/HuggingFaceFW/fineweb
FineWeb-Edu (HF) — https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
The Pile (EleutherAI) — https://pile.eleuther.ai/
Dolma (Allen AI) — https://huggingface.co/datasets/allenai/dolma
SlimPajama (Cerebras) — https://huggingface.co/datasets/cerebras/SlimPajama-627B
OSCAR (Inria) — https://oscar-project.org/
C4 (Google) — https://www.tensorflow.org/datasets/catalog/c4
mC4 (Google) — https://huggingface.co/datasets/mc4
ROOTS (BigScience) — https://huggingface.co/bigscience-data
CommonPile (a16z) — https://github.com/r-three/common-pile
arXiv Bulk Access — https://info.arxiv.org/help/bulk_data_s3.html
S2ORC (Allen AI) — https://github.com/allenai/s2orc
Wikipedia Dumps — https://dumps.wikimedia.org/
The Stack v2 (BigCode) — https://huggingface.co/datasets/bigcode/the-stack-v2
StarCoder — https://huggingface.co/bigcode/starcoder
COYO-700M (Kakao Brain) — https://huggingface.co/datasets/kakaobrain/coyo-700m
AI Hub (NIA) — https://www.aihub.or.kr/
KLUE — https://klue-benchmark.com/
LAION-5B — https://laion.ai/blog/laion-5b/
LAION-Aesthetics — https://laion.ai/blog/laion-aesthetics/
DataComp — https://www.datacomp.ai/
ImageNet — https://www.image-net.org/
CC12M (Google) — https://github.com/google-research-datasets/conceptual-12m
Open Images — https://storage.googleapis.com/openimages/web/index.html
COCO — https://cocodataset.org/
Open X-Embodiment — https://robotics-transformer-x.github.io/
BigScience ROOTS — https://huggingface.co/spaces/bigscience/SourcingCatalog
datatrove (HF) — https://github.com/huggingface/datatrove
dolma toolkit (Allen AI) — https://github.com/allenai/dolma
Datasheets for Datasets — https://arxiv.org/abs/1803.09010
Am I in The Stack? — https://huggingface.co/spaces/bigcode/in-the-stack
Have I Been Trained? — https://haveibeentrained.com/