AI Data Annotation & Labeling Tools 2026 Complete Guide - Labelbox · V7 · CVAT · Roboflow · Encord · SuperAnnotate · Supervisely · Scale AI · Label Studio Deep Dive

Prologue — Labeling Is Still Expensive and Hard in 2026

GPT-5, Claude 4, Gemini 3 — they all share the same secret. Data is more expensive than the model. When Meta dropped $14.3B into Scale AI in mid-2025, it wasn't a one-off — it was a signal. "If you want to build a frontier model, you need to buy an army of labelers."

As of May 2026, the labeling industry has split in two.

On one side, enterprise managed platforms. Scale AI, Labelbox, V7, Encord — they bundle their own labeler pools with their own tools. They win in places that need domain expertise: RLHF, autonomous driving, medical imaging.

On the other side, open-source self-hosted. CVAT, Label Studio, Doccano — they give you the tool for free and your team rounds up the labelers. They win when data is sensitive (medical, finance), when budgets are tight (startups, research labs), or when the domain is so specialized (Korean legal NER) that outsourcing won't fly.

And sitting on top of both sides — foundation models eating annotation alive. SAM 2 finds masks for you. Grounding DINO draws boxes from text. The annotator's role has shifted from "drawing boxes" to "reviewing AI-drawn boxes."

This article maps that landscape. 25 tools across 8 categories, with concrete advice on where to start whether your work is autonomous driving, medical, LLM RLHF, or Korean NER.


1. The 2026 Annotation Map — 8 Categories on One Page

Here is the landscape on a single page.

| Category | Representative tools | Who uses them |
| --- | --- | --- |
| 1. Enterprise managed | Scale AI, Labelbox, V7, Encord, SuperAnnotate | OpenAI, Tesla, Waymo, pharma |
| 2. CV-specialized | Roboflow, Supervisely, Hive | Indie CV teams, agri, industrial |
| 3. OSS self-hosted | CVAT, Label Studio, Doccano | Research, startups, governments |
| 4. 3D / Lidar | Segments.ai, Deepen AI, 3D Map Labs | Autonomy, robotics |
| 5. LLM eval / RLHF | Argilla, Surge AI, Outlier, Snorkel | Foundation model teams |
| 6. Data quality | Cleanlab, Galileo, Lilac | ML ops teams |
| 7. Crowdsourcing | Mechanical Turk, Clickworker, Appen | Bulk, low-difficulty |
| 8. Auto-labeling models | SAM 2, Grounding DINO, GPT-4V, Claude Vision | Embedded in categories 1–7 |

Three core observations.

  • Categories 1 and 3 solve the same problem differently. Managed sells "labelers + tool + QC" as a bundle. OSS sells only the tool. The decision variable is "can data leave your network" and "what is the budget."
  • Category 8 is embedded inside 1–7. Labelbox integrated SAM 2 for model-assisted labeling. So has CVAT. Roboflow sells its own auto-labeling API. "AI does the first pass, humans review" is the 2026 default.
  • Category 6 (data quality) is now as important as labeling itself. The standard pattern is: label, then run Cleanlab to find label errors, then curate with Argilla.

Don't look at one tool — look at the pipeline. Collect → auto-label → human review → quality check → curate. All five stages matter.


2. Scale AI — The Managed Labeling Champion, and What the Meta Deal Means

Scale AI was founded in 2016 by Alexandr Wang at age 19. In June 2025, Meta invested $14.3B in Scale AI and brought Wang on as Chief AI Officer of Meta Superintelligence Labs. The real meaning of this deal is twofold.

First, Scale is no longer a neutral labeler. OpenAI, Google DeepMind, and Anthropic all began winding down their dependence on Scale (per Reuters, July 2025). Once the Meta deal closed, OpenAI ramped up its own labeler pool and increased Surge AI's share. The logic is simple: "I'm not entrusting my frontier model training data to a vendor that a competitor holds a 49% stake in."

Second, labeling is no longer an event — it's market infrastructure. Worth $14.3B to Meta.

Scale's product line splits four ways.

  • Scale Data Engine — autonomous driving and robotics annotation. Used by Waymo, Cruise (pre-shutdown), Toyota.
  • Scale Donovan — government and defense. DoD contracts.
  • Scale GenAI — RLHF, prompt curation, eval data. Big role in OpenAI o1 and GPT-4 training.
  • Outlier.ai — Scale's labeler-facing platform. 240,000 labelers worldwide.

Pricing is not public. It ranges from about $0.05 per box to $60+ per hour, depending on domain, complexity, and QC tier. "Talk to enterprise sales" is the standard answer.

When to pick — autonomous driving, defense, or frontier LLM training where domain expertise is required and the budget is large. Overkill for indie/startups.

When to skip — teams at Meta's competitors (OpenAI, DeepMind, and the like) that are uneasy about routing training data through a Meta-affiliated vendor. They're migrating to Surge AI or in-house labelers.


3. Labelbox — Enterprise Self-Service Plus Managed

Labelbox launched out of SF in 2018 and raised its Series D in 2022. Its positioning: "Scale is too expensive, CVAT is too raw — we fill the middle."

They bundle three modes on one platform.

  • Self-service labeling — your team labels. From $25/seat/month.
  • Boost (managed) — Labelbox supplies the labelers.
  • Foundry / Model Foundry — foundation models do the first pass, humans review.

Labelbox's strengths are three.

  • Image, video, text, document, geospatial, LLM, audio — one UI for all. No need to relearn the tool when changing domains.
  • SAM 2 integration for auto-masking. A single click draws a mask. Reports of 5–10x annotator productivity.
  • Catalog + Model + Evaluation in one workspace. Datasets, models, predictions, and ground truth seen side by side.

Pricing (publicly listed as of May 2026).

  • Free — 5,000 data rows, 3 users.
  • Starter — from $25/seat/month.
  • Enterprise — quote, with SSO, SCIM, on-prem options.

When to pick — teams with multimodal datasets who want to freely mix self-service and managed, and who value tool standardization.

When to skip — when data cannot leave your own network for a SaaS (medical, finance, parts of government). Self-hosted CVAT is the answer there.


4. V7 Darwin — Image, Video, and Medical AI-Assisted Annotation

V7 is based in London. They pushed "Auto-Annotate" early (2020) and dominate medical imaging.

Three product lines.

  • V7 Darwin — general CV annotation platform.
  • V7 Go — document automation and extraction. OCR plus field extraction for receipts, invoices, contracts via LLM.
  • V7 Medical — DICOM, HIPAA, FDA 510(k) friendly. Charite, Mayo Clinic, others.

What V7 does well.

  • Model-assisted annotation — proprietary SAM-like model plus Grounding DINO. One click, one box, or a text prompt produces instant labels.
  • Video tracking — annotate keyframes once and V7 interpolates between frames.
  • Medical multi-frame — view DICOM series as a unit. 3D masking included.
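
To make the keyframe idea concrete, here is a minimal sketch of interpolation between annotated frames. It assumes plain linear interpolation of box coordinates, which is the simplest possible scheme; V7's actual tracker is not documented here and may well be model-based. The `Box` type and `interpolate_boxes` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float


def interpolate_boxes(keyframes: dict[int, Box]) -> dict[int, Box]:
    """Fill every frame between consecutive keyframes by linear interpolation."""
    frames = sorted(keyframes)
    filled: dict[int, Box] = {}
    for start, end in zip(frames, frames[1:]):
        a, b = keyframes[start], keyframes[end]
        span = end - start
        for frame in range(start, end):
            t = (frame - start) / span  # 0.0 at the first keyframe
            filled[frame] = Box(
                a.x + t * (b.x - a.x),
                a.y + t * (b.y - a.y),
                a.w + t * (b.w - a.w),
                a.h + t * (b.h - a.h),
            )
    if frames:
        # The last keyframe is copied as-is; nothing follows it to blend with.
        filled[frames[-1]] = keyframes[frames[-1]]
    return filled


# Two human-drawn keyframes, ten frames apart.
tracks = interpolate_boxes({0: Box(10, 20, 100, 50), 10: Box(30, 20, 100, 70)})
print(len(tracks), tracks[5])
```

The annotator draws 2 boxes and gets 11. Linear interpolation drifts on fast or non-linear motion, which is why a reviewer adds extra keyframes wherever the box slides off the object.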

Pricing is by quote. The usual entry is around $499/month per team, but medical/enterprise quickly climbs into five and six figures.

When to pick — medical and life-sciences imaging, video-heavy annotation, and teams that want GenAI to dramatically lift annotator productivity.


5. Roboflow — The De Facto Standard for Indie CV Teams

Roboflow launched in 2020. Its positioning is precise — "Hugging Face for Computer Vision." Dataset hosting, labeling, training, and deployment all in one site.

Four core features.

  • Roboflow Annotate — box, polygon, segmentation, keypoint. SAM 2 integrated.
  • Universe — over 500,000 public CV datasets. Same category as yours (helmet detection)? Grab one, fine-tune, done.
  • Train — one click to fine-tune YOLOv11, DETR, or VLMs. GPU is abstracted away.
  • Inference / Deploy — host the trained model via Roboflow API or push to edge (Jetson, Raspberry Pi).

Pricing.

  • Public — free, dataset must be public.
  • Starter — from $249/month, private.
  • Growth / Enterprise — from $999/month.

When to pick — indie teams, startups, students, and industrial/agriculture/retail side projects that need to go from dataset to deployed CV model in 1–2 days.

When to skip — text or audio annotation. Roboflow is CV-only.


6. Encord — DICOM Medical Plus Multimodal

Encord is based in London and raised its Series B in 2024. Their pitch: "Annotation plus active learning for medical imaging and multimodal data."

Three differentiators.

  • Native DICOM / NIfTI — they avoid the common trap of converting medical images to PNG. Pixel spacing, HU values, and series metadata are preserved.
  • Encord Active — active learning is a first-class feature. The model picks the samples it's least confident about and routes them to labelers first.
  • Multimodal — image, video, DICOM, document, audio. Medical clinical trials need all of them.
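
Why the PNG trap matters can be shown with a few lines of arithmetic. The sketch below applies the standard DICOM rescale (stored value times RescaleSlope plus RescaleIntercept) and then a simplified display window, roughly what an 8-bit export does. The default slope, intercept, and window values are illustrative assumptions, and the windowing formula is a simplification of the one in the DICOM standard.

```python
def to_hounsfield(stored: int, slope: float = 1.0, intercept: float = -1024.0) -> float:
    """Apply the DICOM rescale: HU = stored * RescaleSlope + RescaleIntercept."""
    return stored * slope + intercept


def window_to_uint8(hu: float, center: float = 40.0, width: float = 400.0) -> int:
    """Clamp HU to a display window and quantize to 0..255, as an 8-bit export would."""
    low = center - width / 2
    clipped = min(max(hu, low), low + width)
    return round((clipped - low) / width * 255)


# Air (-1000 HU) and lung tissue (-500 HU) are very different to a radiologist,
# but both fall below this soft-tissue window and collapse to the same pixel value.
for name, hu in [("air", -1000.0), ("lung", -500.0), ("water", 0.0), ("bone", 1000.0)]:
    print(name, hu, window_to_uint8(hu))
```

Once the export is done, the original HU values cannot be recovered from the 8-bit pixels, so any label or model that depends on density is working with less information than the scanner produced.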

Pricing is by quote. Medical-domain compliance support (HIPAA, ISO 13485, FDA validation) is a core selling point.

When to pick — radiology, pathology, endoscopy, and other medical imaging AI teams, plus teams that want active learning as a first-class citizen.


7. SuperAnnotate, Supervisely, Hive — The Other Managed Options

These three compete in adjacent positioning.

SuperAnnotate — out of Armenia. Big logos like Adobe and Databricks. Strengths: clean UI and strong QC workflows. Growing share of GenAI data (LLM RLHF). Pricing is quoted, starting around $500/month.

Supervisely — Czech / Russian origin. Strong in 3D point clouds and medical imaging. They advertise having processed over 100M annotations. Two tiers: Community (free, self-hosted) and Enterprise.

Hive — out of SF. Began in content moderation and turned it into labeling infrastructure. Their own labeler pool (2M+) plus Hive AI models. Pricing by quote.

Decision variables for these three.

  • Want the comfort of Adobe-tier logos — SuperAnnotate.
  • 3D point clouds are central — Supervisely.
  • Content moderation, NSFW, violence detection at high volume — Hive.

8. CVAT — The Open-Source CV Labeling Standard, Started at Intel

CVAT began at Intel as a tool for the OpenCV community. It's now run by a separate company, CVAT.ai, but the GitHub core remains OSS (MIT).

What CVAT does well.

  • Image, video, and 3D point cloud annotation — box, polygon, polyline, keypoint, mask, 3D cuboid.
  • SAM, SAM 2, YOLO integration — model-assisted annotation works on self-hosted.
  • Team workflow — Job / Task / Project hierarchy, review, and statistics.
  • One Docker Compose to deploy — self-hosting is actually easy.

Pricing.

  • Self-hosted OSS — free, MIT license.
  • CVAT Cloud — Free ($0, 10 users), Pro ($45/seat/month), Enterprise (quote).

When to pick — any CV team where data cannot leave the network, government/defense/medical/finance teams with hard self-host requirements, and budget-constrained research labs and startups.

When to skip — text, audio, or LLM data. CVAT is CV-only.


9. Label Studio (HumanSignal) — Multi-Domain OSS

Label Studio is built by Heartex (now HumanSignal). Where CVAT is CV-only, Label Studio handles every data type in one tool.

Supported data types.

  • Image (box, polygon, mask), video (tracking), audio (segment, transcription), text (NER, classification, summarization), HTML, time series, conversation (LLM data).

You define the UI with an XML-like labeling config.

<View>
  <Text name="text" value="$text" />
  <Labels name="entities" toName="text">
    <Label value="PERSON" background="orange" />
    <Label value="ORG" background="green" />
  </Labels>
</View>
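
The `$text` in that config points at a key inside each task's `data` object, so the import file has to carry a matching `text` field. Below is a small standard-library sketch that reads the config, checks that every `toName` refers to a declared `name`, and builds matching import tasks. The helper names (`summarize_config`, `build_tasks`) are hypothetical and not part of Label Studio; only the `{"data": {...}}` task shape comes from its import format.

```python
import json
import xml.etree.ElementTree as ET

LABEL_CONFIG = """
<View>
  <Text name="text" value="$text" />
  <Labels name="entities" toName="text">
    <Label value="PERSON" background="orange" />
    <Label value="ORG" background="green" />
  </Labels>
</View>
"""


def summarize_config(config_xml: str) -> dict:
    """Return the data keys the config reads and the label values it offers."""
    root = ET.fromstring(config_xml)
    names = {el.get("name") for el in root.iter() if el.get("name")}
    for el in root.iter():
        target = el.get("toName")
        if target and target not in names:
            raise ValueError(f"toName={target!r} does not match any name attribute")
    # value="$text" means: read the "text" key from each task's data object.
    data_keys = sorted(
        el.get("value")[1:] for el in root.iter()
        if (el.get("value") or "").startswith("$")
    )
    labels = [el.get("value") for el in root.iter("Label")]
    return {"data_keys": data_keys, "labels": labels}


def build_tasks(texts: list[str], data_key: str = "text") -> list[dict]:
    """Wrap raw strings as import tasks: one {"data": {...}} object per item."""
    return [{"data": {data_key: text}} for text in texts]


summary = summarize_config(LABEL_CONFIG)
tasks = build_tasks(["Alexandr Wang founded Scale AI."], summary["data_keys"][0])
print(summary)
print(json.dumps(tasks, indent=2))
```

Running a check like this before import catches the most common setup mistake: a config that reads `$text` while the uploaded JSON uses some other field name.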

Pricing.

  • Community Edition — free OSS (Apache 2.0).
  • Starter Cloud — from $99/user/month.
  • Enterprise — quote, with SSO, SCIM, on-prem.

When to pick — teams with diverse data types (text + image + audio), teams that need self-host but aren't only doing CV, and teams that like clean ML backend integration.


10. Doccano, LabelImg, VIA, MakeSense, COCO Annotator — Lightweight OSS

If the bigger platforms feel like overkill, the lighter OSS tools are right there.

Doccano — text-only. NER, classification, seq2seq. Installs with pip and starts with a few CLI commands. Popular for Korean, Japanese, and Chinese NER projects. MIT.

LabelImg — desktop app for boxes only. Pascal VOC / YOLO format. The 2024 deprecation notice landed, but the repo still sits at 20k+ stars as a classic. Good for teaching.

VIA (VGG Image Annotator) — Oxford VGG's academic tool. Runs as a single HTML file. Box, polygon, point. Friendly to airgapped environments.

MakeSense.ai — browser-only, no install. Good for quick demos. Exports YOLO, VOC, COCO.

COCO Annotator — native COCO format. Used by small teams doing instance segmentation.

Common thread — they're fast to start. Common weakness — no team / QC / model-assist workflows. Past the prototype stage, you'll migrate to CVAT or Label Studio.


11. 3D and Lidar Annotation — Segments, Deepen, 3D Map Labs

Autonomous driving and robotics live or die on 3D point cloud labeling.

Segments.ai — Belgian origin. Multi-sensor (Lidar plus camera) viewed simultaneously. Point-cloud instance segmentation, semantic segmentation, cuboids. Pricing by quote, roughly from $500/month.

Deepen AI — autonomy-specialized. Lidar sequence tracking, calibration tools, all bundled. Customers include Toyota, Honda, BMW.

3D Map Labs — HD map annotation specialist. Lane, sign, signal mapping for autonomous driving.

When to skip — one-off 3D projects. CVAT's or Supervisely's 3D mode is enough then.


12. LLM Evaluation Plus RLHF — Argilla, Surge AI, Outlier, Snorkel

In the LLM era, the shape of labeling changed. Instead of "draw a box," it's "which of these two responses is better" or "is this response factually correct." This is RLHF data or eval data.

Argilla (acquired by Hugging Face in 2024) — open-source LLM data labeling and curation. Pairs with Distilabel for synthetic data pipelines. Directly connected to HF Hub. Apache 2.0.

Surge AI — the real competitor to Scale AI. Managed RLHF / eval data. OpenAI and Anthropic are growing their Surge share as they reduce Scale. Labeler quality is the moat — they explicitly match labelers with specialty (law, medicine, coding).

Outlier — Scale AI's labeler-facing platform (rebranded in 2024). 240,000 worldwide. Main work: RLHF, evaluation, code review labeling.

Snorkel AI — the original programmatic labeling company. Heuristics and weak supervision produce first labels, then a model propagates. Used by enterprises like Snowflake and JPMorgan.
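
Programmatic labeling is easier to picture with code. The sketch below is a toy version of the idea: a few heuristic labeling functions vote, and a simple majority decides. It does not use the Snorkel library, and Snorkel's real label model weighs functions by estimated accuracy rather than counting votes. The labeling functions and label names here are invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule does not apply


def lf_mentions_refund(text: str):
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_mentions_invoice(text: str):
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_mentions_crash(text: str):
    return "bug" if "crash" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_invoice, lf_mentions_crash]


def weak_label(text: str):
    """Majority vote over non-abstaining functions; None when there is no clear winner."""
    votes = Counter(v for lf in LABELING_FUNCTIONS if (v := lf(text)) is not ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie: leave this one for a human
    return ranked[0][0]


for ticket in ["Please refund my invoice", "App crash on login", "Refund after crash", "Hello"]:
    print(ticket, "->", weak_label(ticket))
```

Samples that come back as `None` are exactly the ones worth sending to human labelers, which is how rules and people end up sharing the first pass.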

When to pick —

  • LLM fine-tuning data / eval sets at the center -> Argilla (OSS) or Surge AI (managed).
  • "No humans, just rules for the first pass" strategy -> Snorkel.

13. Data Quality — Cleanlab, Galileo, Lilac

After labeling comes quality checking.

Cleanlab — out of MIT. The "Confident Learning" algorithm automatically flags likely label errors. On real-world datasets it commonly surfaces somewhere around 5–15% of labels as suspect, though the share varies widely by dataset. Cleanlab Studio is the SaaS product; cleanlab is the open-source library (AGPL-3.0).

Galileo — LLM and NLP data observability. Visualizes "samples the model is confused about," "low-quality spans," and "drift" in training data. Enterprise SaaS.

Lilac (acquired by Databricks in 2024) — text dataset exploration, clustering, duplicate detection. Open source.

Core insight — "fix the 50 wrong labels out of 1,000" beats "make 100 more labels." Model accuracy typically climbs 1–5 points (especially in imbalanced domains).
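
For intuition on how label errors get flagged, here is a simplified sketch in the spirit of confident learning. It computes a per-class threshold (the average predicted probability for class k among samples labeled k) and flags samples where the model is confidently in a different class than the given label. This is a teaching version written in plain Python; the cleanlab library does considerably more (joint estimation, several pruning strategies), and the data below is made up.

```python
def find_likely_label_errors(labels: list[int], pred_probs: list[list[float]]):
    """Return (index, given_label, suggested_label) for suspicious samples."""
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean confidence in class k over samples labeled k.
    thresholds = []
    for k in range(n_classes):
        scores = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(scores) / len(scores) if scores else 1.0)

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this sample.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: p[k])
            if guess != y:
                suspects.append((i, y, guess))
    return suspects


# Out-of-sample predicted probabilities from any classifier (toy values).
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly says 1
    [0.25, 0.75],
    [0.10, 0.90],
    [0.30, 0.70],
]
print(find_likely_label_errors(labels, pred_probs))
```

Only the third sample is flagged. In practice the predicted probabilities should come from cross-validation, so that the model has not already memorized the wrong labels it is being asked to find.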


14. Crowdsourcing — MTurk, Clickworker, Appen, TELUS

When you need volume, low difficulty, or language diversity, crowdsourcing is the fit.

Amazon Mechanical Turk — started in 2005, the original. Cheapest (from $0.01 per task), least controlled. Quality management (qualifications, master workers, consensus) is the big chore.

Clickworker — German origin. More refined crowd than MTurk. Multilingual text, images, and audio.

Appen — Australian origin. Strong in voice (call center, ASR). Together with Lionbridge AI (acquired by TELUS), they form the two pillars of voice/language data.

TELUS International AI Data Solutions — folded in Lionbridge AI. Handles much of the training data for Microsoft, Google, and Apple voice assistants.

When to pick — high-volume simple tasks (image classification, short-text classification), or multilingual voice data collection. For tasks needing domain expertise, Scale, Surge, and Labelbox Boost are better.


15. Auto-Labeling — SAM 2, Grounding DINO, CLIP, GPT-4V, Claude Vision

The biggest change in 2026 annotation is models becoming the first-pass labeler.

SAM 2 (Meta, 2024) — the universal segmentation model for image and video. A click or a box yields a mask; SAM 2 itself takes no text prompt, so text-driven labeling pairs it with a detector such as Grounding DINO. Labelbox, CVAT, and Roboflow all integrate it.

Grounding DINO (IDEA) — text prompt ("a person wearing a helmet") to box. Open-vocabulary detection. Combined with SAM 2 (the Grounded SAM 2 pipeline), you go from text -> boxes -> masks in one pass.

CLIP / SigLIP — zero-shot classification. You ask "what is this image?" and the model picks one of your predefined labels. No boxes or masks, but strong for classification labeling.

GPT-4V / Claude Vision / Gemini Vision — send the image to a VLM and ask for the label. Most expensive, most flexible. Few-shot prompting can teach the domain.

The workflow pattern.

# Auto-labeling pipeline pseudo-code
for image in dataset:
    boxes = grounding_dino(image, prompt="helmet, vest, person")
    masks = sam2(image, boxes=boxes)
    labels = label_studio_predictions(image, boxes, masks)
    push_to_review(labels)  # humans only review

This single pattern is the 2026 standard for CV annotation. The annotator's job has clearly shifted from "drawing boxes" to "reviewing AI-drawn boxes." Productivity climbs 5–10x, monotony drops, and labeler burnout drops with it.
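
The "push to review" step usually comes down to converting model output into the labeling tool's pre-annotation format. As a sketch, the code below turns pixel boxes into a Label Studio style prediction, where box coordinates are percentages of the image size. The detector is a stub (`fake_detector`), the image path and `model_version` string are placeholders, and the `from_name` / `to_name` values are assumptions that must match whatever names your labeling config declares.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label: str
    score: float
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    w: float  # width, pixels
    h: float  # height, pixels


def to_prediction(image_url: str, width: int, height: int, detections: list[Detection],
                  min_score: float = 0.3, from_name: str = "label", to_name: str = "image") -> dict:
    """Build one task with pre-annotations; boxes become percentages of image size."""
    results = [
        {
            "from_name": from_name,
            "to_name": to_name,
            "type": "rectanglelabels",
            "original_width": width,
            "original_height": height,
            "score": d.score,
            "value": {
                "x": 100 * d.x / width,
                "y": 100 * d.y / height,
                "width": 100 * d.w / width,
                "height": 100 * d.h / height,
                "rotation": 0,
                "rectanglelabels": [d.label],
            },
        }
        for d in detections
        if d.score >= min_score  # drop low-confidence boxes before humans see them
    ]
    return {"data": {"image": image_url},
            "predictions": [{"model_version": "auto-v0", "result": results}]}


def fake_detector(image_url: str) -> list[Detection]:
    """Stand-in for a real open-vocabulary detector; returns fixed boxes."""
    return [Detection("helmet", 0.91, 64, 48, 128, 96),
            Detection("vest", 0.12, 0, 0, 10, 10)]


url = "s3://example-bucket/site_001.jpg"  # placeholder path
task = to_prediction(url, 640, 480, fake_detector(url))
print(task["predictions"][0]["result"][0]["value"])
```

The score threshold is the main knob: set it too low and reviewers spend their time deleting junk boxes, too high and they are back to drawing from scratch.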


16. AI Safety Labeling — Red Team and Jailbreak Annotation

A new category of labeling in the LLM era.

  • Red-team prompt curation — collect potentially dangerous prompts and evaluate model responses. Anthropic and OpenAI both run in-house and contracted.
  • Jailbreak data — collect cases where the model breaks guardrails. Both training and eval uses.
  • Harmful content classification — toxicity, hate speech, CSAM. Hive, ActiveFence, Surge AI.

The hard part is labeler mental health. Labelers handling violence, CSAM, and suicide content face real PTSD risk. After Time's 2023 exposé on OpenAI's Kenya labelers, the industry is improving guidelines. Sama and Surge AI explicitly run mental-health care programs.


17. Domain-Specific — Medical, Autonomy, Geospatial

When the domain is clear, a domain-specific tool gets you there faster.

Medical

  • Encord — DICOM native, FDA validation support.
  • V7 Medical — imaging plus clinical-trial workflows.
  • Centaur.ai (formerly Centaur Labs) — physician labeler network.
  • MD.ai, Cogitech — radiology-focused.

Autonomous driving

  • Scale AI Data Engine — synchronized camera + Lidar + radar.
  • Mighty AI (acquired by Uber)
  • Understand.ai (acquired by dSPACE)
  • Deepen AI — calibration plus Lidar.

Geospatial

  • GroundWork (CamoLabs) — satellite and drone imagery.
  • RemoteSensingAI — agriculture and forestry specialized.
  • Mapbox Labelbox integration — urban mapping.

18. Quality Management — IAA, Cohen's kappa, Consensus

Labeling is done by humans. Humans get it wrong. So quality management is a first-class feature of any labeling tool.

Three core metrics.

  • Inter-annotator agreement (IAA) — the rate at which two or more labelers agree on the same sample.
  • Cohen's kappa — IAA corrected for chance. 0.6+ is "okay," 0.8+ is "good."
  • Fleiss' kappa — the 3+ labelers version.
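
Cohen's kappa is short enough to compute by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement two labelers would reach by chance given how often each uses each class. A minimal sketch for two labelers, with made-up labels:

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two labelers on the same samples."""
    if len(a) != len(b) or not a:
        raise ValueError("both labelers must label the same non-empty set of samples")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick class c independently, summed over classes.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical class throughout
    return (observed - expected) / (1 - expected)


labeler_1 = ["cat"] * 6 + ["dog"] * 4
labeler_2 = ["cat"] * 5 + ["dog"] * 4 + ["cat"]
kappa = cohens_kappa(labeler_1, labeler_2)
print(round(kappa, 3))
```

The two labelers agree on 8 of 10 samples, yet kappa comes out near 0.58, below the "okay" line. Raw agreement flatters you whenever one class dominates, which is the whole reason to correct for chance.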

Workflow patterns.

  • Consensus voting — N labelers label the same sample, majority wins.
  • Gold standard injection — sprinkle in samples with known answers and monitor labeler accuracy.
  • Adjudication queue — samples where labelers disagree go to a senior annotator.
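
If you are wiring these patterns yourself on top of an OSS tool, the core logic is small. The sketch below covers consensus voting with an adjudication fallback, plus gold-standard scoring for one labeler. Function names, the strict-majority rule, and the sample IDs are illustrative choices, not any platform's built-in behavior.

```python
from collections import Counter


def resolve(votes: list[str], min_share: float = 0.5):
    """Return (label, None) on a strict majority, else (None, "adjudicate")."""
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) > min_share:
        return label, None
    return None, "adjudicate"  # no clear winner: route to a senior annotator


def gold_accuracy(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Share of injected gold samples that a labeler answered correctly."""
    scored = [sample_id for sample_id in gold if sample_id in answers]
    if not scored:
        raise ValueError("labeler has not seen any gold samples")
    return sum(answers[s] == gold[s] for s in scored) / len(scored)


print(resolve(["helmet", "helmet", "no_helmet"]))
print(resolve(["helmet", "no_helmet"]))
print(gold_accuracy({"g1": "cat", "g2": "dog", "x9": "cat"}, {"g1": "cat", "g2": "cat"}))
```

A common follow-up is to weight votes by each labeler's gold accuracy instead of counting them equally, but plain majority plus an adjudication queue already covers most small projects.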

Managed platforms (Scale, Labelbox, V7) ship this built in. CVAT and Label Studio require you to wire it yourself, but their Job / Review primitives provide the skeleton.


19. Active Learning — The Model Decides What to Label Next

Labeling budget is not infinite. So "what should we label first" is a big decision.

The active learning idea — the model pushes uncertain samples, samples near the class boundary, and samples in new clusters to labelers first.

Three strategies.

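  • Uncertainty sampling — samples the model is least sure about: predicted probability near 0.5 for binary tasks, or a low top-class probability (or high entropy) for multi-class.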
  • Margin sampling — samples where the top-1 and top-2 probabilities are close.
  • Diversity sampling — cluster representatives that are far apart in the embedding space.
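
The first two strategies reduce to scoring each unlabeled sample from its predicted class probabilities and sending the top scorers to labelers. A minimal sketch follows, with entropy added as a third common score; diversity sampling is left out because it needs embeddings rather than probabilities. The sample IDs and probabilities are made up.

```python
import math


def least_confidence(probs: list[float]) -> float:
    """1 minus the top-class probability; higher means less sure."""
    return 1.0 - max(probs)


def margin_uncertainty(probs: list[float]) -> float:
    """1 minus the gap between the top two classes; higher means a closer call."""
    first, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (first - second)


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


STRATEGIES = {
    "least_confidence": least_confidence,
    "margin": margin_uncertainty,
    "entropy": entropy,
}


def select_for_labeling(pool: dict[str, list[float]], budget: int,
                        strategy: str = "margin") -> list[str]:
    """Return the `budget` sample IDs the chosen strategy finds most uncertain."""
    score = STRATEGIES[strategy]
    ranked = sorted(pool, key=lambda sample_id: score(pool[sample_id]), reverse=True)
    return ranked[:budget]


pool = {
    "img_a": [0.98, 0.01, 0.01],  # confident
    "img_b": [0.40, 0.35, 0.25],  # unsure across all three classes
    "img_c": [0.50, 0.49, 0.01],  # torn between two classes
}
print(select_for_labeling(pool, 2, "least_confidence"))
print(select_for_labeling(pool, 2, "margin"))
```

The two strategies rank `img_b` and `img_c` in opposite order. Margin sampling favors samples that sit on a boundary between two specific classes, while least-confidence and entropy favor samples where the model has no idea at all.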

Tools.

  • Encord Active — first-class feature.
  • Cleanlab Studio — label errors and uncertainty in one place.
  • Roboflow — Smart Polygon plus model assist.
  • CVAT — buildable via its own nuclio pipelines.

Rule of thumb — with active learning, you can often reach the same model accuracy with roughly half the labels, though the saving varies by dataset and model. Half the labels is half your labeling cost.


20. Korean Annotation Ecosystem — AI Hub, EzData, Testworks

Korean / Korea-specific data isn't fully covered by global tools alone.

AI Hub (NIA, National Information Society Agency) — the Korean government's AI dataset hub. Thousands of public Korean NLP, Korean video, and Korean voice datasets. Many were labeled with public funding.

EzData — Korean managed labeling service. Korean NER, Korean medical imaging, more.

Testworks — labeling plus QA. Holds a social enterprise certification through diversity hiring.

Strategy — pull public datasets from AI Hub as the first training set, then add domain-specific labeling via EzData or Testworks.


21. Japanese Annotation Ecosystem — ABEJA, FastLabel, AnnoFab

Japan is strong in industrial and automotive data.

ABEJA Platform — Japan's ML platform. Annotation plus training plus deployment. Big customers like Toyota, NTT, Tokyu.

FastLabel — Tokyo origin AI annotation SaaS. Fastest-growing in the Japanese market. Customers include Honda, Sony.

Anolytics — Japan and India simultaneously. Managed labeling.

AnnoFab — Japanese-market annotation tool. Government plus manufacturing.

Strategy — for Japan-specific data (Japanese OCR, Japan road autonomous driving), Japanese companies dominate on domain knowledge and labeler pool.


22. Pricing Comparison — What Actually Costs What

Rough pricing map (as of May 2026).

| Category | Tool | Price band |
| --- | --- | --- |
| Managed enterprise | Scale AI | Quote, generally $100K+/year |
| Managed enterprise | Labelbox Enterprise | Quote, $50K–$500K/year |
| Self-service SaaS | Labelbox Starter | $25/seat/month |
| Self-service SaaS | Label Studio Cloud | $99/user/month |
| Self-service SaaS | Roboflow | $249–$999/month |
| Self-service SaaS | V7 Darwin | $499/month and up |
| Self-hosted OSS | CVAT | $0 |
| Self-hosted OSS | Label Studio Community | $0 |
| Self-hosted OSS | Doccano, LabelImg, VIA | $0 |
| Crowd | MTurk | from $0.01 per task |
| RLHF managed | Surge AI | Quote, $25–$80 per hour |
| Auto-labeling API | Roboflow Auto, Labelbox Foundry | $0.001–$0.01 per image |
| Auto-labeling VLM | GPT-4V, Claude Vision | $0.01–$0.05 per image |

Core point — self-hosted OSS is free for the tool, but labeler salaries are extra. Managed bills tool + labelers + QC as one line item.
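
A back-of-envelope calculation makes the "tool is free, labor is not" point concrete. The sketch below uses list prices quoted in this article ($25/seat/month, $0.01 per crowd task, $0.05 per VLM image). The labeler wage, throughput, and redundancy figures are pure assumptions for illustration; plug in your own.

```python
def self_service_cost(images: int, seats: int, months: int, seat_price: float,
                      wage_per_hour: float, images_per_hour: float) -> float:
    """Tool subscription plus the labor of your own labelers."""
    tool = seats * months * seat_price
    labor = images / images_per_hour * wage_per_hour
    return tool + labor


def per_item_cost(images: int, price_per_item: float, redundancy: int = 1) -> float:
    """Per-task pricing, optionally with several workers per item for consensus."""
    return images * price_per_item * redundancy


IMAGES = 10_000
in_house = self_service_cost(IMAGES, seats=3, months=1, seat_price=25.0,
                             wage_per_hour=20.0,    # assumed wage
                             images_per_hour=60.0)  # assumed throughput
crowd = per_item_cost(IMAGES, 0.01, redundancy=3)   # 3 votes per image, assumed
vlm_first_pass = per_item_cost(IMAGES, 0.05)        # top of the quoted VLM range

print(f"in-house: ${in_house:,.2f}  crowd: ${crowd:,.2f}  VLM pass: ${vlm_first_pass:,.2f}")
```

Under these assumptions the $75 of seat fees is a rounding error next to roughly $3,300 of labor. That is why speeding up review with auto-labeling usually moves the budget more than switching to a cheaper tool.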


23. Decision Tree — What Should Our Team Pick

Five branch points.

  1. Can data leave your network?
    • No -> CVAT, Label Studio Community, Doccano (self-hosted OSS).
    • Yes -> next branch.
  2. What's the domain?
    • General image/video -> Roboflow (indie) or Labelbox (enterprise).
    • Medical -> Encord, V7 Medical.
    • Autonomy 3D -> Scale AI, Deepen AI, Segments.ai.
    • Text/NER -> Label Studio, Doccano, Argilla.
    • LLM RLHF / eval -> Argilla (OSS), Surge AI (managed).
  3. Can you recruit labelers yourself?
    • Yes -> self-service (Labelbox, Roboflow, Label Studio Cloud).
    • No, you need outsourced labor -> managed (Scale, Surge, Labelbox Boost, V7).
  4. What's the budget?
    • $0–$10K/year -> OSS self-hosted plus interns.
    • $10K–$100K/year -> Roboflow, Labelbox Starter, Label Studio Cloud.
    • $100K+/year -> Labelbox Enterprise, V7, Encord, parts of Scale.
  5. Is auto-labeling a first-class citizen?
    • Yes -> Encord Active, Cleanlab, SAM 2-integrated tools.
    • Human-first -> Scale, Surge, MTurk.

24. Real-World Workflow — Your First Dataset in a Week

A workflow that takes a first CV dataset from zero to 100–1,000 labeled images in one week.

  • Day 1 — collection. Scrape (Apify, Firecrawl) or shoot it yourself. Storage on S3.
  • Day 2 — tool selection. Non-sensitive indie -> Roboflow. Sensitive -> CVAT self-hosted.
  • Day 3 — first auto-labeling pass. Grounding DINO plus SAM 2 for boxes and masks. In Roboflow, "Smart Polygon"; in CVAT, the SAM 2 module.
  • Day 4 — human review. Quickly review and fix what auto-labeling drew. Typically 3–5x faster than drawing from scratch.
  • Day 5 — quality check. Cleanlab or Encord Active surfaces likely label errors. Revisit 10–20.
  • Day 6 — training. Roboflow Train or your own PyTorch. First baseline model.
  • Day 7 — analysis. Push the N samples the model is most confused about into the next labeling queue via active learning.

Run this loop 4–6 times and you usually have a production-ready model.


25. Honest Decision — Build a Data Pipeline, Not a Model

One last line — in 2026, ML teams differentiate on the data pipeline, not the model.

Everyone is on the same GPT-4o, Llama 3, and YOLOv11. Our edge is our labeling data, our eval set, our quality management workflow.

Pick tools that are within reach. CV -> Roboflow or CVAT. Text -> Label Studio or Doccano. LLM -> Argilla. All free or low-cost to start. Hold off on managed until you genuinely can't recruit labelers — switching tools after going managed is hard, but moving from self-service to managed is natural.

And don't forget — fixing 50 wrong labels in your existing 1,000 beats making 100 more. Spin up Cleanlab for an afternoon. Start there.

