✍️ Transcription mode: AI Data Labeling & Curation Tools 2026 — Label Studio, CVAT, Roboflow, Cleanlab, Argilla, Apify, Firecrawl Deep Dive (data makes the model, not the other way around)
Prologue — data makes the model, not the other way around
A conversation every AI team has at some point in 2026.
PM: "Why does our model keep failing on the same case?"
ML eng: "...it's mislabeled in the data. Twelve of thirty examples."
PM: "Is that even possible?"
ML eng: "It's common. Five to fifteen percent label error is typical. Even ImageNet was around six percent."
This is still a common scene in 2026. We pour time into comparing models but barely look at the quality of the data the model trains on. And roughly seventy percent of the "model isn't working" cases are about data — label errors, imbalance, domain shift, fuzzy class definitions.
In 2020, labeling meant a person drawing boxes or highlighting text. 2026 is different. LLMs do the first-pass labels and humans verify only the high-stakes samples. Cleanlab finds noisy labels, Argilla curates LLM fine-tune data, Distilabel produces synthetic samples, and Apify or Firecrawl pull in training data.
This piece maps the 2026 landscape of data work tools. Label Studio, CVAT, Roboflow (general and vision labeling), Cleanlab (data quality), Argilla, Galileo, Phoenix (LLM eval and curation), Scale, Surge, Labelbox (managed labeling), Apify, BrightData, Firecrawl, Crawl4AI (web data collection) — where each tool sits, what the workflows actually look like, and how to build the first pipeline.
1. The landscape — where does the data work happen in 2026?
View the ML data lifecycle in seven stages.
| Stage | Activity | Tools in 2026 |
|---|---|---|
| 1. Collect | Pull raw from web, DB, logs | Apify, BrightData, Firecrawl, Crawl4AI |
| 2. Clean | Remove dupes, noise, PII | Pandas, Polars, Lilac, custom scripts |
| 3. Label | Attach ground truth | Label Studio, CVAT, Roboflow, Argilla, Refuel, Scale/Surge |
| 4. Quality | Find label errors and ambiguity | Cleanlab, Argilla, Lilac |
| 5. Curate | Compose train/eval sets | Argilla, Galileo, Phoenix, Lilac, HF Datasets |
| 6. Synthesize | Fill gaps with generated samples | Distilabel, Argilla synthetic, custom LLM pipes |
| 7. Version | Track dataset changes | DVC, HF Datasets, LakeFS, Weights and Biases |
The core insight: stages 3 through 5 are now dominated by LLMs. Even in vision, SAM, DINOv2, and Florence-2 produce a first pass. In text, GPT and Claude tag classification, summarization, sentiment, and toxicity labels first. The human role has shifted from "labeler" to verifier and adjudicator.
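That verifier role is easy to make concrete. A minimal sketch of the confidence-gated split — the threshold and field names are illustrative assumptions, not any particular tool's API:

```python
# Sketch: a model (often an LLM) proposes labels with a confidence score;
# only low-confidence samples are queued for human review.

def route_for_review(samples, threshold=0.85):
    """Split pre-labeled samples into auto-accepted and human-review queues."""
    auto, review = [], []
    for s in samples:
        (auto if s["confidence"] >= threshold else review).append(s)
    return auto, review

pre_labeled = [
    {"id": 1, "label": "cat", "confidence": 0.97},
    {"id": 2, "label": "dog", "confidence": 0.62},  # ambiguous -> human
    {"id": 3, "label": "cat", "confidence": 0.91},
]
auto, review = route_for_review(pre_labeled)
```

In production the threshold is tuned against a held-out set so the auto-accepted slice meets a target precision.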
A second insight: data quality tools (Cleanlab, Argilla disagreement, Galileo data eval) now matter as much as labeling tools. "Find the wrong labels in what we already have" beats "label more" most of the time — usually a five to fifteen percent label correction lifts model accuracy by one to five points.
2. General labeling platforms — Label Studio, CVAT
Label Studio — the open-source classic (Heartex/HumanSignal)
Broadest coverage. Text, image, audio, video, time-series, HTML — all in one tool. An XML-ish config defines the UI.
```xml
<View>
  <Image name="img" value="$image" />
  <RectangleLabels name="labels" toName="img">
    <Label value="cat" background="green" />
    <Label value="dog" background="blue" />
  </RectangleLabels>
</View>
```
That snippet gives you a bounding-box UI. Swap to Text/Labels for NER, AudioPlus/Labels for audio — consistent pattern.
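For instance, the NER variant follows the same pattern — a sketch with placeholder label values:

```xml
<View>
  <Labels name="label" toName="text">
    <Label value="PER" background="red" />
    <Label value="ORG" background="blue" />
  </Labels>
  <Text name="text" value="$text" />
</View>
```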
Strengths:
- Open source (Apache 2.0), self-hostable.
- Best data-type coverage on the market.
- Machine learning backend integration — easy to bring in external models for pre-labeling.
- Active community, friendly with Hugging Face.
Weaknesses:
- Large-team features (SSO, fine-grained permissions, workflows) are partly Enterprise-only.
- UI can feel heavy; learning curve exists.
- Review workflows often need to be built.
When to use: teams with mixed data types, teams that need self-hosting, OSS-first teams wanting full-stack labeling infra.
Enterprise (HumanSignal) adds SSO, audit logs, workflows, and analytics. By 2026, it's a SaaS-grade labeling backbone for many shops.
CVAT — vision-only, OSS from Intel
CVAT (Computer Vision Annotation Tool) is image and video specialized. Boxes, polygons, polylines, keypoints, cuboids, 3D — every vision label type is supported.
Strengths:
- Vision UX is faster and tighter than Label Studio (shortcuts, interpolation, tracking).
- Video annotation — cross-frame object IDs and auto-interpolation work well.
- Segment Anything (SAM) integration — one click to mask.
- Self-hostable, AGPL.
Weaknesses:
- Vision-only.
- Management features lag Label Studio.
- Search and filter on huge datasets is weak.
When to use: pure computer vision, video annotation is central, want SAM-style acceleration.
3. Vision-focused — Roboflow
Roboflow packs vision labeling, dataset hosting, and model training in one platform. If CVAT is a labeling tool, Roboflow is a vision ML workflow SaaS.
Core features:
- Roboflow Annotate — boxes, polygons, keypoints, masks.
- Smart Polygon / Auto Label — SAM-based auto-labeling, model first then human review.
- Roboflow Universe — public dataset marketplace (hundreds of thousands of datasets).
- Roboflow Train — train YOLOv8/YOLOv11 with a few clicks.
- Deploy — ship the model as API, edge, or mobile.
Strengths:
- Smoothest label-to-train-to-deploy loop, especially for small teams.
- Auto Label accuracy has improved significantly through 2026.
- Format conversion (COCO, YOLO, Pascal VOC, TFRecord) is a single click.
Weaknesses:
- SaaS-first; self-hosting is limited or paid.
- Large scale (millions of images) gets pricey.
- No non-vision data types.
When to use: startups shipping vision ML quickly, detection/segmentation where YOLO-family is enough.
4. Data quality — Cleanlab and friends
The late-2020s game changer. The balance between "label more" and "fix what we already have" has tilted toward the latter.
Cleanlab — label error scanner
Core idea: the model finds samples where its own prediction disagrees with the label. A meaningful share of those are label errors. Built on Confident Learning, a statistical framework.
```python
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

# X_train and labels are your features and (possibly noisy) labels.
cl = CleanLearning(clf=LogisticRegression())
cl.fit(X_train, labels)
issues = cl.find_label_issues(X_train, labels)
# `issues` is a DataFrame with a per-sample label quality score and an
# `is_label_issue` flag marking the samples most likely to be mislabeled.
```
Vision, text, tabular, multi-label, sequence — all supported. Cleanlab Studio (SaaS) layers a GUI to review and fix noisy labels in bulk.
Strengths:
- Empirically validated — found label errors in ImageNet, CIFAR, MNIST, and other public datasets (labelerrors.com).
- Model-agnostic — sklearn, PyTorch, HF transformers, XGBoost all work.
- Both OSS (`cleanlab/cleanlab`) and SaaS (Studio).
When to use: when you want to lift label quality on a classification, NER, or detection dataset in one pass. ROI scales with dataset size.
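The core intuition is easy to sketch without the library. This is a toy approximation: real Confident Learning estimates per-class thresholds from out-of-sample probabilities rather than using a fixed cutoff:

```python
# Toy version of the idea: flag samples where the model's out-of-sample
# predicted class disagrees with the given label AND the model is confident.

def find_suspect_labels(pred_probs, given_labels, threshold=0.8):
    """pred_probs: per-sample lists of class probabilities (out-of-sample).
    given_labels: int class indices. Returns indices of suspect labels."""
    suspects = []
    for i, (probs, label) in enumerate(zip(pred_probs, given_labels)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label and probs[predicted] >= threshold:
            suspects.append(i)
    return suspects

probs = [[0.95, 0.05], [0.40, 0.60], [0.10, 0.90]]
labels = [0, 0, 0]  # third sample labeled 0, but the model says class 1 at 0.90
```

Only the third sample is flagged: the second disagrees too, but not confidently enough.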
Argilla — text-first, the LLM fine-tune curation standard
Argilla is a text labeling and data curation tool. After Hugging Face acquired it in 2024, it became a first-class data tool in the HF ecosystem.
Core use cases:
- LLM SFT dataset curation — humans rate and edit instruction-response pairs.
- DPO/preference data — A vs B comparisons.
- Noisy text label cleanup — annotator disagreement is surfaced automatically.
- Synthetic data verification — review what Distilabel generates.
Strengths:
- First-class HF Datasets, Transformers, Hub integration (one-line push/pull).
- Built for LLM-era workflows (SFT, DPO, RLAIF).
- Self-hostable OSS plus free hosting on HF Spaces.
When to use: building LLM fine-tune data (SFT/DPO/RLAIF), using the HF stack, want annotator disagreement handled explicitly.
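Disagreement handling starts with measuring agreement. Cohen's kappa for two annotators is a few lines of plain Python — the standard formula, not Argilla's API:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists (chance-corrected
    agreement: 1.0 is perfect, 0.0 is no better than chance)."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
```

Here the raw agreement is 4/6 but kappa is only 1/3 — a reminder that raw percent agreement overstates consistency on balanced label sets.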
Lilac — dataset inspection
LLM training datasets are too big to eyeball. Lilac clusters by embeddings and surfaces topic, language, toxicity, and duplicate signals automatically. Databricks acquired Lilac in 2024; some core members later moved to datology AI.
When to use: scanning multi-million-row pretraining or SFT datasets, visually probing data distribution.
Galileo / Arize Phoenix — LLM eval plus data curation
Galileo started as an ML data quality tool (noisy label detection) and shifted weight toward LLM observability and eval through 2024 and 2025. By 2026 Galileo is a SaaS that builds eval datasets from production traces and scores hallucination and groundedness.
Arize Phoenix covers the same area in OSS. Telemetry (OpenInference), dataset, and eval in one tool.
When to use: when you want to turn production LLM traffic into datasets and catch regressions there.
5. Managed human labeling — Scale, Surge, Labelbox, Snorkel
If you don't run your own labeler team, you use a managed service. The 2026 big four (or five).
Scale AI
The biggest. Owns the autonomous-driving, defense, and LLM RLHF accounts. Scale Data Engine is full-stack labeling, QA, and dataset management. After Meta's reported 14 billion dollar investment in 2025, the focus has shifted toward LLM data.
Strengths: large throughput, domain-expert labelers (medical, legal, code), SLA. Weaknesses: expensive, heavy procurement for small teams.
Surge AI
Grew fast in the LLM era. RLHF preference data, instruction tuning data, red-teaming. The labeler pool's English ability and domain depth are the moat.
When to use: high-quality text for LLM fine-tune, preference labeling for RLHF/DPO.
Labelbox
Enterprise labeling platform. Use your own labelers, a managed workforce, or both. Vision, text, document, video all supported.
When to use: running in-house labelers plus external workforce, enterprise SSO/audit requirements.
Snorkel — programmatic labeling
A different angle. Snorkel composes many heuristics or models as labeling functions and denoises their combined votes into weak-supervision labels. Scales beyond what humans can touch.
When to use: domains where expert time is expensive but rules are writable (legal, medical, finance) and data is millions of rows.
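The labeling-function idea can be sketched as a majority vote with abstentions. Snorkel's actual label model weights functions by estimated accuracy; this is the naive baseline, with made-up spam/ham rules:

```python
ABSTAIN = None

# Each labeling function votes for a class or abstains.
def lf_contains_url(text):
    return "spam" if "http" in text else ABSTAIN

def lf_all_caps(text):
    return "spam" if text.isupper() else ABSTAIN

def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN

def weak_label(text, lfs):
    """Naive majority vote over non-abstaining labeling functions
    (ties broken arbitrarily; Snorkel weights by estimated accuracy)."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

LFS = [lf_contains_url, lf_all_caps, lf_greeting]
```

Samples where every function abstains stay unlabeled — that coverage gap is exactly what you iterate on by writing more functions.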
Refuel — LLM auto-label
Refuel is a SaaS that uses LLMs as the labeler. You provide a labeling instruction and a few few-shot examples; the LLM labels everything and humans review only low-confidence samples. By 2026 Refuel also ships fine-tuned labeling models of its own.
Core value: ten to a hundred times faster and cheaper than human labeling. Accuracy is on par with or better than humans in domains with low inter-annotator agreement (LLMs are more consistent there).
6. Tool by use-case matrix
| Use case | First choice | Second | Note |
|---|---|---|---|
| General image classification/detection | Label Studio or Roboflow | CVAT | CVAT/Roboflow faster for vision-only |
| Video object tracking | CVAT | Roboflow | CVAT's tracking/interpolation is strong |
| Text classification/NER | Label Studio | Argilla | Argilla wins for LLM-adjacent work |
| LLM SFT dataset | Argilla | Label Studio | Argilla dominant if on HF stack |
| RLHF/DPO preference | Argilla or Surge | Scale | Surge if outsourcing |
| Label error cleanup | Cleanlab | Argilla disagreement | Cleanlab is quantitative |
| Pretraining dataset inspection | Lilac/datology | Argilla | Lilac for millions of rows |
| Production LLM trace curation | Phoenix | Galileo | Phoenix OSS, Galileo SaaS |
| Medical/legal managed labeling | Scale | Labelbox | Need domain labeler pool |
| Weak supervision at scale | Snorkel | (none) | Only programmatic-labeling option |
| LLM auto-label | Refuel | DIY LLM pipe | Verification workflow needed |
| Web crawling for LLM data | Firecrawl or Crawl4AI | Apify | Crawl4AI if OSS |
| Proxy-required large-scale crawling | BrightData | Apify | Anti-blocking is the point |
7. Synthetic data — filling gaps
One of the biggest 2026 trends in data work is synthetic. When human labeling is expensive, LLMs make both the data and the labels.
Why now
- LLM quality is good enough that self-training works in many domains.
- Human labeling has gotten more expensive (labeler wages rose as LLMs raised the floor).
- Anthropic's Constitutional AI, Microsoft's phi series, and Meta's Llama 3 publicly acknowledge significant synthetic share.
Tools
- Distilabel (Argilla team) — instruction generation, preference generation, response critique. First-class HF integration.
- Argilla Synthetic — push Distilabel results into Argilla for human review.
- DIY LLM pipelines — `gpt-4` or `claude` for instruction generation, `gpt-4o-mini` or `claude-haiku` for responses, a judge LLM for grading. The most common 2026 pattern.
Traps (skip these and you break things)
- Monotony of synthetic data — LLM-generated samples have lower lexical and topical diversity than humans. Same domain on repeat.
- Self-reinforcing bias — training A on data generated by A keeps A's blind spots invisible.
- Synthetic without verification equals noise — always sample five to ten percent through Argilla or humans.
- Factuality — synthetic instructions can carry false facts. Groundedness scoring is mandatory.
The practical answer: synthetic data extends human data, it does not replace it. A 5k human + 50k synthetic mix often beats 50k human alone — but 0 human + 100k synthetic is dangerous.
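The monotony trap is cheap to measure before training. A naive near-duplicate pass over synthetic samples using token-set Jaccard similarity — an O(n²) sketch; at scale you'd reach for MinHash or embeddings:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def near_duplicate_pairs(samples, threshold=0.8):
    """Return index pairs of samples whose token overlap exceeds threshold."""
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if jaccard(samples[i], samples[j]) >= threshold:
                pairs.append((i, j))
    return pairs

synthetic = [
    "Write a Python function that reverses a string",
    "Write a Python function that reverses a list",
    "Explain how TCP handshakes work",
]
```

A high near-duplicate rate is the quantitative version of "same domain on repeat" — a signal to widen the generation prompts, not to generate more.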
8. Web crawling — the data-acquisition side
If labeling tools are about shaping data, crawling tools are about producing or collecting it. LLM pretraining, RAG, and domain datasets all lean on web crawling in 2026.
Apify — managed actor (script) marketplace
Apify is a SaaS for building or renting "actors" (crawling scripts) built on Puppeteer/Playwright. Instagram, Twitter, Amazon, Google Maps — every popular site has an existing actor.
When to use: pulling data from popular sites without writing code, scheduling plus proxies plus storage in one stop.
BrightData — proxy plus scraper
BrightData (formerly Luminati) started as a proxy company with the largest residential and mobile IP pool. They layered scrapers and web dataset APIs on top.
When to use: heavily blocked sites, very large scale (hundreds of millions of pages), enterprises that need legal/contractual clarity.
A caveat: some scraping is ToS-violating or legally gray. The hiQ v. LinkedIn precedent (public-data scraping allowed under certain conditions, though contract claims survived) and subsequent rulings are worth tracking.
Firecrawl — LLM-friendly crawling
Firecrawl appeared in 2024 and became a standard for LLM data pipelines by 2026. Give it a URL, get clean markdown — exactly the shape LLMs ingest.
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-xxx")
result = app.scrape_url("https://example.com", params={"formats": ["markdown"]})
print(result["markdown"])  # clean, LLM-ready markdown
```
Notable bits:
- JS rendering, 401/redirect handling.
- `crawl_url` queues a whole site and respects robots.txt.
- Structured extraction (`extract`) — give a schema, get JSON.
- LLM-friendly chunking and auto-tagged metadata.
When to use: RAG document collection, LLM training datasets, domain-specific chatbots.
Crawl4AI — OSS LLM-friendly crawler
Crawl4AI is the OSS alternative to Firecrawl. Same idea (LLM-friendly output), self-hostable. The GitHub project grew quickly in 2025.
When to use: Firecrawl cost or lock-in is a concern, want to run on your own infra, OSS-first.
Legal and ethical lines (do not skip)
- Public data does not override site ToS.
- Respecting robots.txt is not a legal obligation but is industry convention.
- Use of copyrighted data for training is being adjudicated more finely (New York Times v. OpenAI in 2025, the Reddit v. Anthropic settlement, etc.).
- Filter PII at collection — by labeling time it's already too late.
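Honoring robots.txt takes a few lines with Python's standard library. Here the rules are parsed from a string for illustration; in practice you would point the parser at the site's actual /robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Example rules; normally you'd call rp.set_url(...) and rp.read() instead.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
```

Then `rp.can_fetch("my-crawler", url)` answers per-URL before each request — convention rather than law, as noted above, but the cheap kind of diligence.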
9. Four real workflows
Workflow 1 — vision classification dataset (startup)
- Collect: Firecrawl/Crawl4AI to gather domain image URLs, then download.
- Pre-label: Roboflow Auto Label (SAM-based) — 50k images in a week.
- Human review: review only the 10k low-confidence cases in Roboflow Annotate.
- Cleanup: re-scan with Cleanlab for label errors — find about 200 issues and fix.
- Train: Roboflow Train (YOLOv11) or your own PyTorch.
- Post-deploy: loop low-confidence production cases back to Roboflow.
Workflow 2 — LLM SFT dataset (internal coding assistant)
- Collect: pull instruction-response candidates from internal PRs, issues, and Slack.
- Augment with synthetic: Distilabel for 10x instruction diversity.
- Curate: push to Argilla, five human reviewers round-robin.
- Quality: surface disagreement in Argilla, discuss, re-label.
- Train: HF TRL SFT/DPO trainer.
- Eval: collect production traces in Phoenix, add regressions back to the dataset.
Workflow 3 — RAG domain document set (legal)
- Collect: public case law DB plus internal documents through Firecrawl as markdown.
- Clean: distribution check in Lilac, drop dupes and low quality.
- Chunk: compare chunking strategies in LangChain/LlamaIndex.
- Embed: domain embedding evaluation, BEIR-style eval.
- Quality: collect RAG traces in Phoenix, score groundedness.
- Loop: add hallucination cases to the dataset, tune chunking and retriever.
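The chunking comparison in step 3 can start from a baseline: split the collected markdown at headings. A naive sketch — LangChain/LlamaIndex splitters add overlap and size caps on top of this idea:

```python
def split_on_headings(md, max_level=2):
    """Split markdown into chunks at headings of depth <= max_level,
    so each chunk carries its own section heading as context."""
    chunks, current = [], []
    for line in md.splitlines():
        hashes = len(line) - len(line.lstrip("#"))
        is_heading = 1 <= hashes <= max_level and line[hashes:hashes + 1] == " "
        if is_heading and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Title\nintro text\n## Section A\nbody a\n### Sub\ndetail\n## Section B\nbody b"
```

Deeper headings (here `### Sub`) stay inside their parent chunk, which is usually what you want for case-law sections.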
Workflow 4 — medical imaging (regulated)
- Collect: hospital PACS, export via DICOM standard.
- PII removal: automated removal of patient identifiers plus human review.
- Label: Scale or Labelbox medical labeler pool with double-labeling by radiologists.
- Adjudication: senior radiologist decides disagreement cases.
- Quality: Cleanlab to check inter-annotator agreement.
- Train: own infrastructure (no external data transfer).
- Audit trail: every step from labeling to deployment is recorded (for FDA/MDR).
10. Decision frame — which tool
Primary split by data type
- Vision only: CVAT (OSS) or Roboflow (SaaS).
- Text only, LLM fine-tune: Argilla.
- Mixed types (text + image + audio): Label Studio.
- Synthetic-heavy: Distilabel + Argilla.
Second split by team size
- Solo/startup: Roboflow or Label Studio Community, Cleanlab OSS.
- 10-50 people: Label Studio Enterprise or self-hosted Argilla, Cleanlab Studio.
- Enterprise (100+): Scale/Labelbox plus own infra, Snorkel for weak supervision.
Third split by budget
- Zero (OSS only): Label Studio Community + CVAT + Cleanlab OSS + self-hosted Argilla + Crawl4AI.
- Moderate (some SaaS): Roboflow + Cleanlab Studio + Firecrawl + Argilla on HF Spaces.
- Generous (enterprise): Scale/Surge + Labelbox + Galileo + BrightData.
Fourth split by domain
- General: the tools above.
- Medical/legal/finance: Scale or Labelbox domain pools, tools that support audit trails.
- Autonomous driving/robotics: Scale, CVAT (3D cuboid).
- Security/defense: self-hosting mandatory, no external data transfer.
One more — when curation matters more than new labels
If the dataset already has a million rows and model regressions keep happening, invest in finding the wrong ones rather than adding more. Cleanlab + Argilla disagreement + Lilac. A five percent correction often beats 100k new rows.
11. Cost intuition — what it actually runs
Rough market rates (spring 2026).
| Item | Unit | Note |
|---|---|---|
| Image classification label (human) | $0.01-0.05/img | Simple class; multi-label costs more |
| Image bounding box (human) | $0.05-0.30/box | Varies with boxes-per-image and domain |
| Text NER label (human) | $0.10-0.50/sentence | Scales with entity types |
| LLM first-pass label (GPT-4 class) | $0.0005-0.005/sample | Depends on token length |
| LLM first-pass label (small) | $0.00005-0.0005/sample | gpt-4o-mini, claude-haiku |
| RLHF preference (Surge) | $1-5/pair | Domain expertise dependent |
| Medical/legal domain label | $5-20/sample | Expert time |
| Roboflow Pro | from $250/mo | Scales with volume/team |
| Cleanlab Studio | quote | Volume/feature based |
| Apify | $49-499/mo plus usage | Compute/proxy separate |
| BrightData | $0.5-15/GB | Proxy class and volume |
| Firecrawl | $19-333/mo | Page-count based |
Key insight: LLM first-pass labels are 10 to 1000 times cheaper than human. That's why 2026 workflows put LLMs first and only verify five to twenty percent. Cost structure is shifting fast in domains where this is feasible.
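The gap is easy to sanity-check against the table's own mid-range rates — illustrative arithmetic, not a vendor quote:

```python
# Back-of-envelope: 100k classification labels, mid-range rates from the table.
N = 100_000

human_only = N * 0.03            # $0.03/sample, all-human labeling

llm_pass = N * 0.001             # $0.001/sample, GPT-4-class first pass
human_verify = 0.15 * N * 0.03   # humans re-check 15% of samples
llm_first = llm_pass + human_verify

savings = human_only / llm_first
```

Even with a generous 15% human verification slice, the LLM-first pipeline runs about 5x cheaper end-to-end ($550 vs $3,000); the raw per-label gap (30x here) is what the 10-1000x range describes.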
12. Adjacent standards and patterns
Dataset Cards (Hugging Face)
Hugging Face Dataset Cards document a dataset's provenance, labeling procedure, limitations, and ethics. The de facto standard for training data governance.
Croissant (Google/MLCommons)
`mlcommons/croissant` is a dataset metadata standard. Released in 2024 and supported by HF, Kaggle, and OpenML. Makes datasets portable between tools.
Datasheets for Datasets, Model Cards
Gebru et al.'s Datasheets, Mitchell et al.'s Model Cards. Responsible documentation for datasets and models. Increasingly tied to EU AI Act readiness in 2026.
DVC, LakeFS
Data versioning. Datasets need versions like code. DVC dominates OSS, LakeFS plays at the data-lake scale.
Epilogue — change the data before you change the model
The labeling and curation map compresses to one line.
Models are commoditizing in 2026. Differentiation comes from the data.
You can buy a good model. You make a good dataset. And good data is less a tool selection problem than a workflow design problem — where the LLM does the first pass, where humans verify, where synthetic fits, where cleanup runs.
Team data workflow checklist
- The collect/label/quality/version stages are on one diagram.
- The spots where LLMs can take the first pass are identified.
- A label error scanner (Cleanlab-class) runs before training, once.
- Annotator disagreement is surfaced automatically with an adjudication step.
- If synthetic data is in the mix, verification is in the loop.
- Datasets are versioned and you can trace which model trained on what.
- PII is filtered at collection, not at labeling.
- Production traces flow back to the dataset automatically.
- Dataset cards exist (provenance, limits, ethics).
- You know whether you're past the point where label cleanup beats new labels.
Ten anti-patterns
- "More labels are better" — high-noise corpora make models worse, not better.
- Human-labeling-only purism — in 2026 LLM first pass plus human verify is faster and more accurate.
- Skipping Cleanlab-class scans — the highest-ROI hour you can spend.
- Synthetic data with no human review — self-reinforcing bias and noise.
- Single-annotator labeling — disagreement signal is invisible.
- No dataset versioning — yesterday's model trained on what, exactly?
- PII filtering at labeling time — too late, legally risky.
- Crawling that ignores ToS — legal risk has spiked through 2025-2026.
- Production traces not flowing back to the dataset — the best data source thrown away.
- "Switching models will fix it" — usually a five percent data cleanup wins.
Next time
Candidates: Synthetic data pipelines deep dive — Distilabel + Argilla + judge LLM, Cleanlab internals — the Confident Learning math and where it breaks, Production LLM traces to eval datasets — Phoenix and Argilla integration patterns.
"Models commoditize. Differentiation lives in the data. Tools are the by-product of a workflow — buy tools without one and you collect expensive toys."
— AI data labeling and curation tools 2026, end.
References
- Label Studio — HumanSignal
- Label Studio GitHub — HumanSignal/label-studio
- Label Studio Enterprise
- CVAT — Computer Vision Annotation Tool
- CVAT GitHub — cvat-ai/cvat
- Roboflow
- Roboflow Universe — public datasets
- Cleanlab
- Cleanlab GitHub — cleanlab/cleanlab
- Confident Learning paper — Northcutt et al.
- labelerrors.com — ImageNet/CIFAR errors
- Argilla — Hugging Face
- Argilla GitHub — argilla-io/argilla
- Distilabel — synthetic data framework
- Galileo AI
- Arize Phoenix
- Phoenix GitHub — Arize-ai/phoenix
- Lilac — dataset visualization
- datology AI
- Scale AI
- Surge AI
- Labelbox
- Snorkel AI
- Refuel AI — autolabel
- Refuel GitHub — refuel-ai/autolabel
- Apify
- BrightData
- Firecrawl
- Firecrawl GitHub — mendableai/firecrawl
- Crawl4AI GitHub — unclecode/crawl4ai
- Hugging Face Datasets
- Croissant — ML dataset metadata
- Datasheets for Datasets — Gebru et al.
- Model Cards for Model Reporting — Mitchell et al.
- DVC — data version control
- LakeFS
- Anthropic — Constitutional AI
- Microsoft phi-3 technical report