AI Content Moderation & Trust & Safety 2026 Deep Dive - Hive · Perspective API · Microsoft Content Safety · Spectrum Labs · Cinder · Sift · ActiveFence
Prologue: Why T&S Infrastructure, Why Now
Spring 2026, the Trust & Safety (T&S) operations room of a Southeast Asian gaming company. 4:17 AM. A user sends a voice-chat message with verbal abuse and a reference to the other party's young child. The voice stream flows into Hive AI real-time audio moderation and three labels light up simultaneously: "abuse, harassment, child reference." The same user's text chat goes to Spectrum Labs Guardian, which evaluates the probability of a grooming pattern. The two signals combine into a "P0 — child safety" case in the Cinder T&S workflow queue. Within fifteen minutes a T&S analyst suspends the account, and because user reports also came in from US and UK locations, a formal report to the NCMEC CyberTipline is auto-generated.
At the same time, the EU DSA transparency-report deadline is approaching for a messaging platform headquartered in Berlin. The T&S director must report, by category, the roughly sixty million content actions the platform took during the reporting period: removals, downranks, account suspensions, age gates. The report also needs exposure stats, share of human review, share of automated decisions, false-positive rates, and appeal handling times.
A social platform in Tokyo faces a different angle. After Japan's 2024 amendment to PCMA (Provider Liability Limitation Act), takedown obligation and disclosure of sender information for defamatory content both tightened. "Why didn't you remove this, and on what basis" carries the same weight as "Why did you remove this." Too much is liability; too little is liability.
This piece is the 2026 spring map of the content-moderation and T&S infrastructure behind all those scenes. AI moderation infrastructure, hash sharing, deepfake detection, LLM safety, workflow platforms, regulation — companies from the US, EU, UK, Korea, Japan — in one go.
1. Why T&S Became Core Infrastructure in 2026
Content moderation is one of the internet's oldest tasks, running from 1990s BBS sysops through 2000s forum mods to 2010s SNS report-and-remove flows. What is different in 2026 comes down to three pressures.
**First, regulatory upheaval.** The **Digital Services Act (DSA)**, passed in the EU in 2022 and fully in force in 2024, requires VLOPs (45M+ MAU in the EU) to perform systemic-risk assessments, undergo external audits, and publish transparency reports at least every six months. The UK **Online Safety Act 2023** (entering full enforcement in 2025) gave Ofcom strong powers and an active duty toward "harm to children." Korea layered the Telecommunications Business Act and the Information and Communications Network Act; Japan amended **PCMA** in 2024 to streamline sender-info disclosure and takedown procedure. The US has KOSA and a patchwork of state laws.
**Second, content production has exploded.** Generative AI pushed the cost of producing text, images, video, and audio to near zero, and the absolute volume of spam, scams, and deepfakes blew up. In 2024 alone Meta took action on more than five billion pieces of content. Humans alone cannot process this scale.
**Third, brand safety and the ad market.** Advertisers have become much stricter about which content their brand sits next to, so "brand safety" is now a second axis of T&S. GARM (Global Alliance for Responsible Media) brand-safety categories and the IAB content taxonomy feed directly into the ad-bidding stack.
The market produced by these three pressures crossed 10 billion USD in 2026. Inside that market we see infrastructure companies, workflow companies, and platform internal teams.
2. AI Moderation Categories — Image, Video, Text, Audio, Multimodal
Slicing content moderation by modality makes the outline clear.
**Image moderation** — the oldest area. CSAM (Child Sexual Abuse Material), nudity / sexual content, violence, hate symbols (Nazi imagery, Imperial Japanese rising-sun flag, terrorist-group banners), drugs, weapons — the taxonomy here is mostly settled. PhotoDNA (hash matching) and CNN-based classifiers run side by side.
**Video moderation** — an extension of image with time added. Violence, self-harm, CSAM video, and **deepfakes**. After 2024, non-consensual political-figure and celebrity deepfakes (especially targeting women) exploded, and authenticity detection rose into its own category.
**Text moderation** — hate speech, harassment, spam, scams, political disinformation. Models differ per language; slang and emerging terms shift fast. Accuracy drops in languages where tokenization itself is hard — Korean, Japanese, Arabic, Hindi.
**Audio moderation** — voice chat, voice rooms, live-stream audio. Slurs, harassment, voice-clone deepfakes. Game companies (Riot, Activision, Epic) are the most urgent customers here.
**Multimodal** — image+caption, video+subtitles, audio+video. Each modality alone may look harmless, but the combination (e.g., an innocuous photo plus threatening text) needs multimodal models. CLIP/BLIP families, LLaVA, and from 2024 onward zero-shot moderation via GPT-4V, Claude 3.5 Sonnet vision, and Gemini have become standard.
3. Hive AI — The De Facto Standard for Multimodal Moderation
San Francisco-based **Hive AI** was founded in 2017. Founders Kevin Guo and Dmitriy Karpman started as a data-labeling company, built their own models, and turned them into a content-moderation API.
As of 2026, Hive AI has the broadest modality coverage in content moderation.
- **Image moderation** — about 90 categories. NSFW, adult content, violence, drugs, hate symbols, self-harm.
- **Video moderation** — frame sampling plus temporal analysis.
- **Text moderation** — 30+ languages including English, Spanish, Portuguese, Japanese, Korean, Arabic.
- **Audio moderation** — real-time voice chat, live streaming, game voice chat.
- **AI-generated content detection** — identifying Stable Diffusion, Midjourney, and DALL-E images.
- **Deepfake detection** — face-swap video discrimination.
- **OCR + context** — reading text inside images and classifying together.
Hive's strength is **multimodal under a single API**. A platform can buy image, video, text, and audio moderation from one vendor. Customers include Reddit, Yubo, Bumble, and parts of the US Department of Defense. A multi-year deal with Reddit was disclosed in 2024.
Pricing depends on volume, but published reference rates are about $0.5–1 per 10K text requests and $2–5 per 10K image requests. Large contracts are negotiated separately.
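As a rough worked example of the reference rates above, the sketch below turns monthly request volumes into a low/high cost range. The volumes are invented for illustration and `estimate_monthly_cost` is a hypothetical helper; negotiated contracts will not follow this arithmetic.

```python
# Reference rates quoted above, in USD per 10,000 requests: (low, high).
RATES_PER_10K = {
    "text": (0.5, 1.0),
    "image": (2.0, 5.0),
}


def estimate_monthly_cost(volumes: dict[str, int]) -> tuple[float, float]:
    """Return (low, high) USD estimates for the given monthly request volumes."""
    low = high = 0.0
    for modality, requests in volumes.items():
        rate_low, rate_high = RATES_PER_10K[modality]
        low += requests / 10_000 * rate_low
        high += requests / 10_000 * rate_high
    return low, high


# Hypothetical platform: 50M text messages and 5M image uploads per month.
low, high = estimate_monthly_cost({"text": 50_000_000, "image": 5_000_000})
print(f"${low:,.0f} to ${high:,.0f} per month")  # $3,500 to $7,500 per month
```

At list rates, text is cheap enough that image and video volume usually drives the bill.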
4. Microsoft Azure AI Content Safety
Microsoft's **Azure AI Content Safety** became generally available in 2023. Offered as a separate line within Azure Cognitive Services, it exposes the infrastructure Microsoft built for its own LLM, search, and gaming businesses.
Core features:
- **Image and text moderation API** — four core categories (Hate, Self-harm, Sexual, Violence), each with a 0–7 severity score.
- **Prompt Shields** — detection of LLM prompt injection and jailbreaks.
- **Groundedness Detection** — verifying that an LLM response is grounded in the source documents for a RAG system.
- **Protected Material Detection** — detecting accidental reproduction of copyrighted text or code.
Integration with Azure runs deep. If you use Azure OpenAI Service, Content Safety automatically inserts itself on both input and output paths and provides Responsible AI filtering by default. In regulated industries (healthcare, finance, legal) this default filter is a decisive adoption factor.
From 2024 **Custom Categories** has been generally available — each platform can add its own categories (for example "spoilers," "medical-diagnosis statements," "investment recommendations") with few-shot learning.
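A severity score only becomes a decision once the platform maps it to an action. The sketch below shows one way to do that for the four core categories on the 0 to 7 scale. The thresholds, the category key names, and the flat input dictionary are assumptions for illustration; this is not the Azure SDK response shape.

```python
from typing import Literal

Action = Literal["allow", "review", "block"]

# Hypothetical policy: (review_at, block_at) on the 0-7 severity scale.
THRESHOLDS = {
    "Hate": (2, 4),
    "SelfHarm": (2, 4),
    "Sexual": (2, 4),
    "Violence": (4, 6),  # e.g. a game platform tolerating more depicted violence
}


def decide(severities: dict[str, int]) -> Action:
    """Return the strictest action triggered by any category."""
    action: Action = "allow"
    for category, severity in severities.items():
        review_at, block_at = THRESHOLDS[category]
        if severity >= block_at:
            return "block"
        if severity >= review_at:
            action = "review"
    return action


print(decide({"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 2}))  # allow
print(decide({"Hate": 2, "SelfHarm": 0, "Sexual": 0, "Violence": 2}))  # review
print(decide({"Hate": 0, "SelfHarm": 0, "Sexual": 0, "Violence": 6}))  # block
```

Keeping thresholds per category lets policy teams tune one harm type without touching the others.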
5. Google Perspective API — The Original Toxicity Scoring
Google's **Jigsaw** (formerly Google Ideas) released the **Perspective API** in 2017 as the first de facto standard for text toxicity scoring. The starting point was a model built originally to help moderate the New York Times comment section.
Core attributes:
- **TOXICITY** — rude or disrespectful comment.
- **SEVERE_TOXICITY** — a stronger form.
- **IDENTITY_ATTACK** — attack against identity (race, religion, sex, disability).
- **INSULT** — insult.
- **PROFANITY** — profanity.
- **THREAT** — threat.
- **SEXUALLY_EXPLICIT** (experimental), **FLIRTATION** (experimental), and others.
Each attribute returns a probability score between 0 and 1. The threshold is up to the platform.
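To make the score-plus-threshold pattern concrete, here is a minimal sketch that builds a request body and applies platform-chosen thresholds to a response. The JSON shapes follow the public `comments:analyze` format as best I understand it, so check them against the current reference. The sample response is hand-written rather than real API output, and the threshold values are arbitrary.

```python
import json

# Endpoint as documented publicly; an API key is appended as ?key=... when calling.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str, attributes: list[str], language: str = "en") -> str:
    """Serialize a request body asking for the given attributes."""
    body = {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {name: {} for name in attributes},
    }
    return json.dumps(body)


def flagged_attributes(response: dict, thresholds: dict[str, float]) -> list[str]:
    """Return attributes whose summary score meets the platform's threshold."""
    flagged = []
    for name, limit in thresholds.items():
        score = response["attributeScores"][name]["summaryScore"]["value"]
        if score >= limit:
            flagged.append(name)
    return flagged


# Hand-written sample in the assumed response shape.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.91, "type": "PROBABILITY"}},
        "THREAT": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}},
    }
}
print(flagged_attributes(sample_response, {"TOXICITY": 0.8, "THREAT": 0.5}))
# ['TOXICITY']
```

No network call is made here; the point is that the API returns probabilities and the policy lives entirely in the caller's thresholds.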
Supported languages: English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Arabic. Korean was added in late 2024.
The Perspective API is free (within fair-use quotas). That has made it the standard entry point for small platforms, researchers, and civil-society groups. The catch: academic research has repeatedly flagged false-positive bias on race and dialect — African American Vernacular English (AAVE) and LGBTQ+ identity terms tend to be misclassified as toxic. Jigsaw has been retraining through its "Unintended Bias" line of work.
6. OpenAI Moderation API · Anthropic Constitutional Classifiers
In the LLM era, moderation infrastructure is also built by the LLM companies themselves.
**OpenAI Moderation API**: free. The omni-moderation-latest model (GPT-4o based) shipped in 2024 with a substantial accuracy lift. Categories: sexual, hate, harassment, self-harm, sexual/minors, hate/threatening, violence/graphic, and more. Each returns a boolean flag plus a 0–1 score. The internal filtering for ChatGPT and Sora uses the same signals.
**Anthropic Constitutional Classifiers**: described publicly in early 2025. Safety classifiers that Anthropic trains from a written constitution and runs around Claude. The principles align with Constitutional AI and cover harm, discrimination, self-harm, violence, fraud, cyber attacks, chemical/biological/radiological/nuclear (CBRN) weapons, and drugs. They sit on both the input and the output side of the model.
The difference: OpenAI emphasizes general "harm-as-speech-act" categories, while Anthropic leans harder on "harms that emerge if AI outputs them" (especially CBRN and cyber). This is where LLM safety meets content moderation.
7. Spectrum Labs · Cinder · Cove — Community and Workflow
**Spectrum Labs**: San Francisco. Founded 2016. The flagship product **Guardian** combines text and behavioral signals to infer user intent. It goes beyond simple keywords: grooming, fraud, self-harm signals, and racism are inferred from the flow of a user's conversation. Major customers are game companies (Riot Games, Wildlife Studios), dating apps, and marketplaces. Acquired by ActiveFence in 2023.
**Cinder** — Y Combinator 2021. Co-founded by ex-Facebook T&S leaders Brian Fishman, Declan Cummings, and Glen Wise. Cinder is not a model company but a **T&S operations platform**. It pools multiple AI signals (Hive, Perspective, internal models) into a single queue and standardizes the T&S analyst workflow: triage, escalation, appeals handling, transparency-report generation. Customers include Discord, Yelp, Bumble, Patreon. Series B in 2024.
**Cove** — Y Combinator 2024 cohort. Targets smaller platforms and emerging SaaS, more focused than Cinder on the lower segment of the market. "T&S as a Service."
**ActiveFence** — Tel Aviv + New York. Started as an intel/threat-research firm focused on terrorism and child safety. In 2024 it extended into general moderation infrastructure with **ActiveScore** and **ActiveOS**. Supplies Microsoft, Reddit, X, Discord.
**Two Hat / Community Sift** — acquired by Microsoft in 2021. Specializes in gaming and kids' content moderation. The chat moderation backbone for Xbox and Minecraft.
**Sentropy** — acquired by Discord in 2021. Text moderation and anti-spam. Absorbed and became Discord's own T&S infrastructure.
**Sift** — US. Originally fraud and account-takeover protection, but extended into content and user-signal analysis closer to T&S. Airbnb, DoorDash, Twitch.
**TrustLab** — US. Election integrity and disinformation analysis. Monitoring partner for the EU Code of Practice on Disinformation.
**Bodyguard.ai** — France. Community moderation automation. Adopted by European media companies for comment moderation.
8. Image Hashing — PhotoDNA · PDQ · TMK+PDQF
CSAM and terrorist content are the oldest moderation categories and the ones with the broadest consensus. The technology that became the standard there is **perceptual hashing**: algorithms that produce the same or a nearly identical hash even when the image has been slightly modified (crop, resize, watermark, JPEG recompression), so matches are found by comparing hash distance rather than exact equality.
**PhotoDNA** — co-developed in 2009 by Microsoft and Professor Hany Farid (Dartmouth College). Provides free matching against the NCMEC database of known CSAM image hashes (hundreds of thousands to millions of entries). Adopted by Facebook, Twitter, Google, Reddit, and nearly every large platform. The oldest and most universal standard.
**PDQ + TMK** — open-sourced by Meta in 2019. PDQ is an image hash; TMK+PDQF is a video hash. Combined with ThreatExchange, they enable cross-platform hash sharing. Meta's bet: when one platform catches CSAM or terrorist content, others should catch it too.
**NeuralHash** (Apple) — announced in 2021 but paused after privacy-organization backlash. The starting point of client-side scanning for CSAM.
Hash matching is simple but powerful. Against known material it can "catch on first sight." But it does not help with previously-unseen CSAM, so classifier models always run alongside.
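The idea behind perceptual hashing can be sketched with a difference hash (dHash), a textbook technique used here as a stand-in. It is not PhotoDNA (which is proprietary) and not PDQ, both of which are far more robust. The sketch assumes the image has already been downscaled to a tiny grayscale grid, and the pixel values are invented.

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair.

    The bit is 1 when the left pixel is brighter than the right one, so the
    hash encodes the gradient structure rather than absolute brightness.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


original = [
    [10, 40, 90, 200],
    [15, 60, 30, 180],
    [220, 120, 80, 20],
]
# Same picture, uniformly brightened (as a filter or re-encode might do).
brightened = [[min(255, p + 12) for p in row] for row in original]
# A different picture: every row mirrored, so every gradient flips.
other = [list(reversed(row)) for row in original]

print(hamming(dhash(original), dhash(brightened)))  # 0
print(hamming(dhash(original), dhash(other)))       # 9
```

A cryptographic hash such as SHA-256 would change completely under the brightening step, which is why exact hashes are a poor fit for matching re-encoded media.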
9. NCMEC · IWF · GIFCT · Tech Coalition — The Hash-Sharing Network
A hash that is created but not shared is half a tool. So since the early 2010s, hash-sharing consortia have formed.
**NCMEC (National Center for Missing & Exploited Children)** — US, founded 1984. **CyberTipline** is the legally mandated channel for US-based platforms to report CSAM. In 2023 alone, roughly 36 million CyberTipline reports came in. NCMEC operates the hash database, licenses PhotoDNA to platforms, and bridges to law enforcement.
**IWF (Internet Watch Foundation)** — UK. The UK counterpart to NCMEC. Operates a URL block list and hash database. Tightly tied to Ofcom enforcement under the Online Safety Act.
**GIFCT (Global Internet Forum to Counter Terrorism)** — founded in 2017 by Facebook, Microsoft, Twitter, and YouTube. A consortium for sharing hashes of terrorist (violent extremist) content. Membership expanded sharply after the 2019 Christchurch attack. The core asset is the **Hash-Sharing Database** — content caught by one member can be blocked immediately by others.
**Tech Coalition**: a platform consortium for CSAM response. Coordinates with NCMEC and IWF and standardizes hash and signal sharing across members. Launched **Lantern** in 2023 to share CSAM signals as common infrastructure.
**StopNCII.org** (run by SWGfL's Revenge Porn Helpline, with Meta, Bumble, and others): pre-blocking hashes for non-consensual intimate-image abuse. A victim hashes their own image on their own device and registers only the hash; member platforms then block uploads that match.
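What these consortia share is a bank of hashes that any member can match uploads against. A minimal sketch of that shared-bank idea follows. `HashBank`, `BankEntry`, the 16-bit hash values, and the distance threshold are all hypothetical; production hashes are much longer (PDQ uses 256 bits) and real systems use an index rather than a linear scan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BankEntry:
    hash_value: int
    source: str    # which member contributed the hash (hypothetical field)
    category: str  # e.g. "ncii" or "terrorism"


class HashBank:
    """Shared store of perceptual hashes; only hashes are stored, never media."""

    def __init__(self, max_distance: int) -> None:
        self.max_distance = max_distance
        self.entries: list[BankEntry] = []

    def contribute(self, entry: BankEntry) -> None:
        self.entries.append(entry)

    def match(self, candidate: int) -> Optional[BankEntry]:
        """Return the closest entry within max_distance bits, or None."""
        best: Optional[BankEntry] = None
        best_distance = self.max_distance + 1
        for entry in self.entries:
            distance = bin(entry.hash_value ^ candidate).count("1")
            if distance < best_distance:
                best, best_distance = entry, distance
        return best


bank = HashBank(max_distance=2)
bank.contribute(BankEntry(0b1011_0110_0101_1100, "platform_a", "ncii"))

near_copy = 0b1011_0110_0101_1101  # one bit away from the registered hash
unrelated = 0b0100_1001_1010_0011  # all sixteen bits differ

print(bank.match(near_copy))  # the platform_a entry
print(bank.match(unrelated))  # None
```

Because only the hash is registered, a member platform can block a re-upload without ever receiving the original image.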
10. Deepfake Detection — Reality Defender · Sensity · Truepic · TrueMedia
The biggest content-safety story of 2024 was deepfakes — especially non-consensual intimate deepfakes (women and minors disproportionately affected) and political deepfakes.
**Reality Defender** — New York. Founded 2021. Multi-model deepfake detection across image, video, audio, and text. Customers include CNN, NBC, the US State Department, and NATO StratCom. Series A in 2024.
**Sensity AI** (formerly Deeptrace Labs) — Amsterdam. Has tracked deepfake threats since 2018. Customer base skews to security firms, financial institutions, and governments.
**Truepic**: San Diego. Takes a different angle, **C2PA (Coalition for Content Provenance and Authenticity)** metadata signing at the camera. Cryptographically establishes provenance so a photo can prove "I am real." Adobe, Microsoft, Nikon, and Sony are in the same standards camp.
**TrueMedia.org** — released in 2024 as a nonprofit deepfake-detection tool. Collaboration with AI2 (Allen Institute) and others. Free for journalists and researchers.
**Hive AI Deepfake Detection** — see chapter 3. Part of Hive's multimodal lineup.
**Microsoft Video Authenticator** — released in 2020 for the US election cycle. Scope limited to political video.
**Intel FakeCatcher** — analyzes subtle blood-flow (PPG) signals in the face to test for a real human.
The 2026 standard architecture is "detection + provenance, both." Detection alone has limits; signing from the camera (C2PA) is the complementary side.
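The "detection + provenance, both" pattern can be sketched as a two-step decision: check the provenance signature first, then fall back to a detector score. This is a simplification. C2PA uses certificate-based signatures inside a manifest, whereas the sketch uses an HMAC from the standard library as a stand-in; the key, the threshold, and the result labels are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def sign(content: bytes, key: bytes) -> str:
    """Stand-in for capture-time signing (real C2PA uses X.509 certificates)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def provenance_ok(content: bytes, signature: Optional[str], key: bytes) -> bool:
    """True only if a signature is present and matches the exact bytes."""
    if signature is None:
        return False
    return hmac.compare_digest(sign(content, key), signature)


def assess(content: bytes, signature: Optional[str], key: bytes,
           detector_score: float, fake_threshold: float = 0.7) -> str:
    """Combine provenance and detection into one label."""
    if provenance_ok(content, signature, key):
        return "verified_provenance"
    if detector_score >= fake_threshold:
        return "likely_synthetic"
    return "unverified"


key = b"demo-key"
photo = b"raw sensor bytes"
signature = sign(photo, key)

print(assess(photo, signature, key, detector_score=0.1))         # verified_provenance
print(assess(photo + b"x", signature, key, detector_score=0.9))  # likely_synthetic
print(assess(photo, None, key, detector_score=0.2))              # unverified
```

Letting a valid signature override the detector is itself a policy choice; a stricter platform could route signed content with a high detector score to human review instead.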
11. Platform-Internal Tools — Meta · YouTube · Microsoft · Google
Large platforms run a mix of external infrastructure and in-house tools.
**Meta Hasher-Matcher-Actioner (HMA)** — open-sourced in 2022. A pipeline that ingests PDQ/TMK hashes, matches them, and triggers actions. The standard starting point when a smaller platform builds its own hash-matching path.
**YouTube CSAI Match** — YouTube's in-house CSAM-video matching tool, offered to external partners under free license.
**Microsoft Content Moderator** — predecessor of Azure Content Safety. Some features migrated to Content Safety; others are on a deprecation path.
**Google Content ID** — YouTube's copyright matching. A different domain from T&S, but the largest production case of content-fingerprint matching. Tens of billions of matches per month.
**Meta Llama Guard 3** — see chapter 14. An LLM safety classifier, open source.
**Roblox Voice Safety** — in-house voice-chat moderation models. Kids-platform specifics.
**TikTok TIDAL** (Trust & Safety Insights, Data, Analytics, Learnings) — TikTok's internal T&S operations platform.
12. Korean Content Moderation — KOCSC · Kakao · Naver · KISA
Korea is a place where formal regulation and private self-regulation co-evolved.
**KOCSC (Korea Communications Standards Commission)** — an administrative body with the authority to order takedowns and corrective action against internet content: defamation, obscenity, gambling, drugs, suicide instigation, election misinformation. About 240,000 corrective orders in 2024. Often criticized — there is real tension with freedom of expression.
**KISA (Korea Internet & Security Agency)** — under the Ministry of Science and ICT. Operates the 118 illegal-content tip line, victim support for digital sex crime, and cybersecurity incident response. From 2018 it runs the digital-sex-crime content-removal support program.
**Kakao Safety Center** — the reporting channel for KakaoTalk, Daum News, KakaoStory, and others. Kakao started publishing a regular Trust & Safety report in 2024.
**Naver Report Center** — handles reports for Naver Cafe, blog, Knowledge iN, and news comments. Naver automates comment moderation with its own AI (Clova X line) — slur and hate-speech auto-hide.
**Kakao Internal AI — Siren** — Kakao's content-moderation AI. Report classification, automatic blocking, human review-queue routing. Auto moderation for KakaoTalk open-chat uses the same line.
**Naver Cleanbot / Comment Moderation** — auto-blocking for slurs, hate, and spam-flooding in Naver News comments. Generally deployed in 2020. Strong on Korean slur dictionaries and obfuscation (jamo splitting, whitespace variations).
**Nth Room Prevention Act (2020)** — an active duty in response to digital sex crime content. Platforms above a certain size are required to adopt technical measures. This law made content-matching infrastructure de facto mandatory in Korea.
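The jamo-splitting and whitespace obfuscation mentioned in the Naver Cleanbot entry above can be illustrated with a small normalizer: decompose every Hangul syllable into its component jamo and drop whitespace, so a composed word and its split-up variants compare equal. This is only a sketch of the general idea and says nothing about how Naver actually implements it. The blocklist uses the mild word "바보" ("fool") as a placeholder.

```python
# Compatibility jamo in Unicode syllable-composition order.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def to_jamo(text: str) -> str:
    """Drop whitespace and decompose precomposed Hangul syllables into jamo."""
    out = []
    for ch in text:
        if ch.isspace():
            continue
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # precomposed syllable block U+AC00..U+D7A3
            out.append(CHO[code // 588])
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28])
        else:
            out.append(ch)
    return "".join(out)


def contains_blocked(text: str, blocklist: list[str]) -> bool:
    """Match blocklist terms at the jamo level."""
    normalized = to_jamo(text)
    return any(to_jamo(term) in normalized for term in blocklist)


print(to_jamo("바 보") == to_jamo("ㅂㅏㅂㅗ"))       # True
print(contains_blocked("이 ㅂ ㅏ 보 야", ["바보"]))  # True
print(contains_blocked("안녕하세요", ["바보"]))      # False
```

Jamo-level substring matching trades precision for recall: it can fire across syllable boundaries in innocent text, so in practice it would feed a classifier or review queue rather than block on its own.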
13. Japanese Content Moderation — Yahoo!Japan · LINE · Mercari · Pixiv
In Japan, the 2024 amendment of PCMA was the inflection point. Sender-information disclosure procedure was streamlined, and platforms moved more actively.
**Yahoo!Japan Comment Moderator** — AI moderation for Yahoo!Japan news comments. In-house model since 2019, especially tuned to Japanese slurs and personal attacks. Upgraded to an LLM-based moderator in 2024.
**LINE Cleansing** — LINE's content-moderation AI for groups and OpenChat. Covers voice calls, emojis, and stickers — multimodal.
**Mercari Hate-Detection** — combines Microsoft tooling and an in-house model to detect counterfeits, prohibited items, and hateful symbols at listing time. From 2024 the AI moderator auto-blocks from the moment of listing.
**Cybozu Moderation** — moderation AI used inside Cybozu's enterprise SaaS.
**Pixiv Moderation** — illustration and fiction platform. Sexual-content classification, automatic R-18 / R-18G tagging, CSAM detection (external plus in-house).
**Niconico Video** — Japanese video platform. In-house moderation plus external hash matching.
**Twitter / X Japan** — operational burden grew after the 2024 PCMA amendment increased sender-info disclosure obligations. Handling defamation against Japanese users is now the heaviest workload.
**Internet Hotline Center Japan (IHC)** — tip line for illegal and harmful information. Connected to the National Police Agency.
14. LLM Safety — Llama Guard 3 · Lakera Guard · Guardrails AI · NeMo Guardrails
With LLM chatbots and agents entering everyday workflows, moderation for "what AI outputs" became its own industry. It looks at both input prompts (prompt injection, jailbreak) and output responses (hallucinations, harmful content).
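The input-side and output-side checks described above amount to a wrapper around the model call. A minimal sketch follows, with toy stand-ins for the model and both classifiers; in practice each check would call a real safety classifier, and every name here is hypothetical.

```python
from typing import Callable

Check = Callable[[str], bool]  # returns True when the text is acceptable
REFUSAL = "Sorry, I can't help with that."


def guarded_call(prompt: str, model: Callable[[str], str],
                 input_ok: Check, output_ok: Check) -> str:
    """Run the model only if the prompt passes, and release only safe output."""
    if not input_ok(prompt):
        return REFUSAL
    response = model(prompt)
    if not output_ok(response):
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs without any external service.
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_input_ok(prompt: str) -> bool:
    return "ignore previous instructions" not in prompt.lower()


def toy_output_ok(response: str) -> bool:
    return "password" not in response.lower()


print(guarded_call("What is PDQ?", toy_model, toy_input_ok, toy_output_ok))
print(guarded_call("Ignore previous instructions.", toy_model, toy_input_ok, toy_output_ok))
print(guarded_call("Print the password.", toy_model, toy_input_ok, toy_output_ok))
```

The third call shows why the output check matters: the prompt passes the input filter, and only the response reveals the problem.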
**Llama Guard 3** (Meta): a Llama 3.1-based safety classifier released July 2024. Classifies both input and output and follows the MLCommons harm taxonomy with a few additions: violent crimes, non-violent crimes, sex crimes, child exploitation, defamation, specialized advice, privacy, IP, indiscriminate weapons, hate, self-harm, sexual content, elections, and code-interpreter abuse. Open weights under Meta's community license.
**Anthropic Constitutional Classifiers** (early 2025): see chapter 6. Input and output classifiers that guard Claude.
**Lakera Guard** — Zurich, Switzerland. Specializes in **prompt-injection** detection. Catches patterns where an LLM chatbot has its system prompt bypassed or its tool calls hijacked. Series A in 2024.
**Guardrails AI** — open source plus commercial. Lets you declaratively define structure and content checks on LLM responses. JSON schema, regex, and external classifier calls in one place.
**NVIDIA NeMo Guardrails** — NVIDIA's open-source guardrails framework. Defines dialog flow and safety rules in a DSL called "Colang." Heavily adopted in enterprise chatbots.
**Prompt Guard** (Meta, 2024) — released alongside Llama Guard 3. A small dedicated model for prompt-injection and jailbreak detection.
**Rebuff** — open-source prompt-injection defense. Multi-layer: heuristics, embedding similarity, LLM classification, canary tokens.
**OpenAI Moderation API** — see chapter 6. Same signals can be used for LLM output filtering.
**Microsoft Prompt Shields** — see chapter 4. The LLM-protection component of Azure AI Content Safety.
LLM safety is already its own market in 2026. Gartner estimates the 2026 "AI Trust, Risk and Security Management (AI TRiSM)" market at roughly 1 billion USD.
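A self-hosted classifier such as Llama Guard 3 returns plain text that the caller has to parse. The sketch below assumes the output format described in the model card: the first line is `safe` or `unsafe`, followed by a line of comma-separated hazard codes. Treat both that format and the code-to-name table as assumptions to verify against the version you deploy.

```python
# Hazard codes as listed in the Llama Guard 3 model card (verify before use).
HAZARDS = {
    "S1": "Violent Crimes", "S2": "Non-Violent Crimes", "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation", "S5": "Defamation", "S6": "Specialized Advice",
    "S7": "Privacy", "S8": "Intellectual Property", "S9": "Indiscriminate Weapons",
    "S10": "Hate", "S11": "Suicide & Self-Harm", "S12": "Sexual Content",
    "S13": "Elections", "S14": "Code Interpreter Abuse",
}


def parse_verdict(raw: str) -> tuple[bool, list[str]]:
    """Return (is_safe, hazard_names) from the classifier's text output."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        # Fail closed: an empty generation must not be read as "safe".
        raise ValueError("empty classifier output")
    if lines[0] == "safe":
        return True, []
    if lines[0] != "unsafe":
        raise ValueError(f"unexpected verdict: {lines[0]!r}")
    codes = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
    # Unknown codes are passed through unchanged rather than dropped.
    return False, [HAZARDS.get(code, code) for code in codes]


print(parse_verdict("safe"))                # (True, [])
print(parse_verdict("\n\nunsafe\nS1,S10"))  # (False, ['Violent Crimes', 'Hate'])
```

Raising on malformed output keeps a truncated or garbled generation from silently passing as safe.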
15. Open Source — detoxify · Project Arachnid · Others
Researchers, smaller platforms, and civil-society groups rely on a rich open-source layer.
**detoxify** (Unitary) — UK. A Python library. Open-source classifier trained on Jigsaw toxicity datasets. One line of code returns a toxicity score. Heavily used in academic research.
**Perspective API** (Jigsaw) — see chapter 5. Free API.
**Project Arachnid** (C3P, Canadian Centre for Child Protection) — automates CSAM crawling, matching, and reporting. Based in Canada.
**Microsoft Reporting Service** — PhotoDNA is offered free to certain nonprofit organizations.
**Hive Submarine** (open model) — a project where Hive released some models under an academic license, with scope limits.
**LLM Guard** (open source) — an LLM input/output inspection library. PII masking, prompt-injection detection, topic blocking.
**Open-source CSAM hash database** — intentionally not publicly available. Provided only to vetted platforms via NCMEC and IWF for legitimate operational reasons.
16. Evaluation — Precision / Recall, Bias, Datasets
A content-moderation model is not finished when it has high accuracy. **Bias** is the core evaluation axis.
**False-positive bias**:
- Over-prediction of toxicity on AAVE (African American Vernacular English). Representative paper: Sap et al., 2019.
- LGBTQ+ identity terms (e.g., "gay," "lesbian," "trans") classified as toxic on their own.
- False positives on Korean dialects and Japanese casual register.
**False-negative bias**:
- Hate-speech detection failures in non-mainstream languages (Swahili, Uzbek, and so on).
- Missing detection in multimodal cases (image + text fused).
**Evaluation datasets**:
- **Jigsaw Toxicity Classification** (Kaggle) — Wikipedia Talk comments.
- **Jigsaw Unintended Bias** — identity-based bias evaluation.
- **HolisticBias** (Meta) — evaluation across about 600 identity descriptors.
- **TextDetox** (shared task) — multilingual toxicity plus detox (rewriting).
- **HateXplain** — hate-speech classification with rationales.
- **Stormfront dataset** — text from a white-supremacist forum (research-restricted).
- **CivilComments** — news comments with identity labels.
- **MMHS150K**: multimodal (image + text) hate speech in tweets.
**Platform-standard evaluation**:
- **MLCommons AILuminate** — released in 2024. An AI-safety benchmark. The category system follows the same harm taxonomy that Llama Guard 3 uses.
- **HELM Safety** — Stanford CRFM's evaluation set.
Core lesson: a single number cannot evaluate a content-moderation model. You must look at accuracy sliced by identity, language, and domain.
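Sliced evaluation is easy to sketch: compute precision, recall, and false-positive rate separately for each group instead of once overall. The example data is invented to show how a dialect slice can have a much higher false-positive rate while recall looks identical.

```python
from collections import defaultdict
from typing import Iterable, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def sliced_metrics(examples: Iterable[tuple[str, int, int]]) -> dict[str, dict]:
    """examples: (group, true_label, predicted_label), where 1 means toxic."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, truth, pred in examples:
        key = ("t" if truth == pred else "f") + ("p" if pred else "n")
        counts[group][key] += 1
    report = {}
    for group, c in counts.items():
        report[group] = {
            "precision": _ratio(c["tp"], c["tp"] + c["fp"]),
            "recall": _ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),
        }
    return report


# Invented labels: (group, truth, prediction).
data = (
    [("general", 1, 1)] * 2 + [("general", 0, 0)] * 3 + [("general", 0, 1)] * 1
    + [("dialect", 1, 1)] * 2 + [("dialect", 0, 0)] * 1 + [("dialect", 0, 1)] * 3
)
for group, metrics in sliced_metrics(data).items():
    print(group, metrics)
```

Here both groups have recall 1.0, but the false-positive rate is 0.25 for the general slice and 0.75 for the dialect slice. That gap is the kind of bias a single aggregate number hides.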
17. AI Red Teaming — Anthropic · OpenAI · GIFCT
T&S models have to survive an adversarial environment, so **red teaming** has become a required step.
**Anthropic Red Teaming** — internal and external red teaming before each Claude release. Categories like CBRN, cyber, and political influence are evaluated with expert panels. Outputs feed both the Model Card and Constitutional Classifier training.
**OpenAI Red Team Network** — operating since the GPT-4 launch. An external expert pool (security, chemistry, biology, politics, medicine, cyber) evaluates new models pre-launch. Findings are summarized in artifacts like the GPT-4 System Card.
**Microsoft AI Red Team** — internal adversarial evaluation of Azure AI systems. **PyRIT** (Python Risk Identification Tool, open-sourced 2024) was released to the public.
**GIFCT Red Team Exercises** — joint red teaming across member platforms on terrorist content. Run regularly since 2023.
**DEF CON AI Village** — the first large public LLM red-team event ran in 2023 (around 2,200 participants). Repeated annually.
**MITRE ATLAS** — threat taxonomy for AI systems. ATT&CK for AI.
The output of red teaming is not just a discovery report. It feeds **automated adversarial eval sets**, **retraining data for the model**, and **category-policy updates**. A single red-team round moves classifiers, policies, and LLM weights all at once.
18. Transparency Reports — DSA · Periodic Disclosure
In spring 2026, every large platform produces a transparency report on a fixed cadence.
**EU DSA Article 15** — all intermediary providers (including non-VLOPs) must publish annual transparency reports in English plus the local language. Categories: number of content actions, automation vs. manual share, breakdown by harm type, human review time, appeals.
**EU DSA Article 42**: VLOPs must publish these reports at least every six months.
**United States**: California AB 587 (signed in 2022) requires sufficiently large social platforms to file semiannual terms-of-service reports with the state attorney general. Texas and Florida have their own variants.
**Korea**: limited statutory requirements under the Information and Communications Network Act; full transparency reports remain voluntary. Kakao and Naver publish voluntarily.
**Japan**: voluntary. Yahoo!Japan and LINE publish voluntarily.
**Major company reports**:
- **Meta Community Standards Enforcement Report** — quarterly stats on Facebook and Instagram content actions.
- **YouTube Community Guidelines Enforcement Report** — quarterly.
- **TikTok Community Guidelines Enforcement Report**.
- **X (Twitter) Transparency Center** — consistency has been criticized.
- **Discord Transparency Report** — semi-annual.
- **Reddit Transparency Report** — annual plus partial quarterly.
- **Snap Transparency Report** — semi-annual.
Reporting precision continues to rise: after the DSA, the share of actions per category is reported at roughly 0.1 percent granularity.
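The headline figures in such a report are aggregations over an action log. A minimal sketch, assuming a hypothetical `ActionRecord` schema and a handful of invented records, computes per-category share, automated share, and appeal reversal rate rounded to 0.1 percent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionRecord:
    category: str    # harm type, e.g. "spam"
    action: str      # "removal", "downrank", "suspension", ...
    automated: bool  # True if no human reviewed the decision
    appealed: bool = False
    reversed_on_appeal: bool = False


def summarize(records: list[ActionRecord]) -> dict:
    """Aggregate an action log into transparency-report style figures."""
    total = len(records)
    if total == 0:
        raise ValueError("no actions in the reporting period")
    by_category = Counter(r.category for r in records)
    automated = sum(r.automated for r in records)
    appealed = sum(r.appealed for r in records)
    reversed_count = sum(r.reversed_on_appeal for r in records)
    return {
        "total_actions": total,
        "share_by_category_pct": {
            category: round(100 * n / total, 1) for category, n in by_category.items()
        },
        "automated_share_pct": round(100 * automated / total, 1),
        "appeal_reversal_rate_pct": (
            round(100 * reversed_count / appealed, 1) if appealed else None
        ),
    }


log = [ActionRecord("spam", "removal", True) for _ in range(5)] + [
    ActionRecord("harassment", "suspension", True, appealed=True, reversed_on_appeal=True),
    ActionRecord("harassment", "removal", False, appealed=True),
    ActionRecord("hate", "downrank", False),
]
print(summarize(log))
```

The appeal reversal rate doubles as a rough proxy for the false-positive rate of enforcement, which is why regulators ask for it.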
19. Moderation Stack for a Smaller Platform — A Real Architecture
Let us draw on one page the moderation stack a smaller platform (100K to 10M MAU) could realistically stand up in 2026.
**1) Input layer**:
- Text → Perspective API (free) or Hive Text Moderation.
- Image → Hive Image plus PhotoDNA matching (NCMEC license).
- Video → Hive Video plus PDQ / TMK hashes.
- Audio / voice chat → Hive Audio.
- LLM input/output → Llama Guard 3 (self-hosted) or Lakera Guard.
**2) Classification and queue routing**:
- A T&S workflow platform — Cinder or Cove.
- Combine signals and triage into P0 / P1 / P2 queues.
**3) Human review**:
- Internal T&S analysts plus external BPO partners such as Telus International, TaskUs, Majorel.
- For multilingual coverage an external BPO is essentially required.
**4) Action and appeal**:
- Content actions: removal, downrank, age gate, account suspension.
- Notify the user; provide an appeal channel.
**5) Reporting and notification**:
- On CSAM finding, automatic transmission to the NCMEC CyberTipline.
- Terrorist content → GIFCT hash sharing.
- Generate periodic transparency reports.
**6) Model governance**:
- Quarterly bias evaluation (HolisticBias and similar).
- Policy update → classifier retraining → A/B testing.
Building this stack in-house runs roughly 1M USD to several million USD per year. Composing external solutions can start in the low hundreds of thousands. Entering regulation-heavy markets (especially DSA / OSA) more than doubles the cost.
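The input layer of the stack above is essentially a routing table from modality to checks. The sketch below writes it as plain configuration. The check names are labels for the vendors and tools listed above, not real client libraries, and `plan_checks` is a hypothetical helper.

```python
# Modality -> ordered checks, mirroring the input layer described above.
STACK = {
    "text": ["perspective_api"],
    "image": ["hive_image", "photodna_match"],
    "video": ["hive_video", "pdq_tmk_match"],
    "audio": ["hive_audio"],
    "llm_io": ["llama_guard_3"],
}


def plan_checks(modalities: list[str]) -> list[str]:
    """Return the de-duplicated, ordered checks for a piece of content."""
    checks: list[str] = []
    for modality in modalities:
        # A KeyError on an unknown modality is intentional: fail loudly
        # rather than let unmoderated content through.
        for check in STACK[modality]:
            if check not in checks:
                checks.append(check)
    return checks


# A post with an image and a caption needs both image and text checks.
print(plan_checks(["image", "text"]))
# ['hive_image', 'photodna_match', 'perspective_api']
```

Keeping the table in one place makes swapping a vendor a configuration change rather than a code change.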
20. Labor and Compensation — Moderator Mental Health
A chapter this piece cannot skip is the human moderator. No matter how capable AI becomes, the hardest decisions are made by people. CSAM, self-harm, violent extremism, abuse — people see this every day.
Beginning in 2018, a series of investigations (Casey Newton's series at The Verge, the documentary The Cleaners) and lawsuits (Selena Scola v. Facebook, filed in 2018 and settled for $52M in 2020) brought moderator PTSD into the open. In 2024, Facebook moderators in Kenya (under the Sama contract) filed a class action.
**Improvement directions**:
- Grayscaled screens, processed audio, daily exposure caps.
- Mandatory psychological counseling and peer support.
- Conversion of BPO-dependent roles to full-time or direct employment.
- Gradual increase in "what AI can process so a human does not have to look."
The ethics of the T&S industry are not only about content accuracy — they are also about **moderator working conditions**. One of the KPIs a 2026 T&S director must track is the "human reviewer wellness score."
21. Case Study — Fifteen Minutes of Voice-Chat Moderation at a Game Company
Returning to the opening scenario, let us unfold the fifteen minutes of one voice-chat case at a gaming company.
- **T+0:00** — User A starts voice chat in a multiplayer match. The voice stream is sent simultaneously to Hive AI audio moderation and an in-house STT pipeline.
- **T+0:30** — Hive scores three labels above 0.8: abuse, slur, child reference. Internal STT generates text and ships it to Spectrum Labs Guardian.
- **T+1:00** — Guardian combines this with user A's seven-day chat history. A new signal arrives: "grooming pattern probability: 0.7."
- **T+1:30** — A "P0 — child safety" case is created in the Cinder T&S queue. An automatic notification goes to the on-call T&S analyst.
- **T+10:00** — The T&S analyst opens the case and reviews the voice clip, text, user history, and existing reports.
- **T+12:00** — The analyst suspends the account and triggers an automatic NCMEC CyberTipline submission.
- **T+15:00** — User B (the affected party) receives a safety-resource message. If a parent or guardian contact is on file, a separate channel is opened.
- **T+24h** — Within a day, the event is aggregated by category into the quarterly transparency report.
Every step in this flow has a company, a nonprofit, or a regulation behind it: voice moderation (Hive), pattern detection (Spectrum Labs), workflow (Cinder), child-safety reporting (NCMEC), and transparency reporting (DSA Article 15). Fifteen minutes of one gaming match crosses the entire T&S ecosystem.
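The signal-combination step in the timeline (T+0:30 to T+1:30) can be sketched as a small triage function. This is a sketch under stated assumptions, not how Hive, Spectrum Labs, or Cinder actually expose their data: the `Signals` and `Case` types, the label strings, and the P1 fallback rule are hypothetical. Only the 0.8 and 0.7 thresholds come from the timeline above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Thresholds taken from the timeline; everything else here is hypothetical.
AUDIO_LABEL_THRESHOLD = 0.8
GROOMING_THRESHOLD = 0.7


@dataclass
class Signals:
    audio_labels: Dict[str, float]  # label -> score from the audio classifier
    grooming_probability: float     # score from the text-pattern model


@dataclass
class Case:
    priority: str
    category: str
    reasons: List[str] = field(default_factory=list)


def triage(signals: Signals) -> Optional[Case]:
    """Combine both signals into one queue entry, or None if nothing fires."""
    flagged = sorted(
        label
        for label, score in signals.audio_labels.items()
        if score > AUDIO_LABEL_THRESHOLD
    )
    grooming = signals.grooming_probability >= GROOMING_THRESHOLD

    # A child reference in audio plus a grooming pattern in text is P0.
    if "child_reference" in flagged and grooming:
        return Case("P0", "child_safety", flagged + ["grooming_pattern"])
    # Any single signal still opens a lower-priority case (assumed rule).
    if flagged or grooming:
        reasons = flagged + (["grooming_pattern"] if grooming else [])
        return Case("P1", "abuse", reasons)
    return None


case = triage(
    Signals({"abuse": 0.93, "slur": 0.88, "child_reference": 0.85}, 0.7)
)
print(case.priority, case.category)
```

Keeping the reasons on the case record means the analyst who opens it at T+10:00 sees which signals fired, and the same fields can later feed the per-category transparency counts.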
22. Limits — Bias, Liability, Free Expression
This chapter is an honest accounting of the limits.
**Language and culture bias** — almost all content-moderation models are trained on English first, and performance drops in other languages and in non-mainstream dialects. Korean, Japanese, Arabic, Hindi, Swahili, and Filipino are not small markets, and the gap in model quality is not small either.
**False positives and free expression** — overly aggressive moderation hides legitimate opinion, satire, and art. AAVE, LGBTQ+ self-description, and political satire show up in error reports every year. The DSA's right to appeal is a partial answer, but once content is hidden it is hard to fully restore.
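The language gap and the false-positive problem can be measured together by computing the false-positive rate separately for each language on a human-labeled sample. The sketch below assumes a hypothetical record format of `(language, flagged_by_model, actually_violating)`, and the sample rows are made-up toy data, not real measurements.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (language, flagged_by_model, actually_violating)


def false_positive_rate_by_language(records: Iterable[Record]) -> Dict[str, float]:
    """FPR per language: benign items the model flagged / all benign items."""
    false_positives: Counter = Counter()
    benign_total: Counter = Counter()
    for language, flagged, violating in records:
        if violating:
            continue  # true violations do not enter the FPR denominator
        benign_total[language] += 1
        if flagged:
            false_positives[language] += 1
    return {
        lang: false_positives[lang] / benign_total[lang]
        for lang in benign_total
    }


# Toy, invented sample purely to show the shape of the calculation.
sample = [
    ("en", True, False), ("en", False, False),
    ("en", False, False), ("en", False, False),
    ("en", True, True),
    ("sw", True, False), ("sw", True, False),
    ("sw", False, False), ("sw", False, False),
]
rates = false_positive_rate_by_language(sample)
print(rates)
```

A large spread between languages in such a table is the kind of evidence that turns "performance drops in non-English languages" from an impression into a number that can be tracked quarter over quarter.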
**False negatives and harm** — undermoderation lets harm persist. The explosion of non-consensual intimate deepfakes made the gap visible.
**Liability** — who owns the consequence of a moderation decision? The platform? The AI model provider? The moderator? The DSA and OSA clarify platform responsibility, but case law is still thin when an AI model error is the proximate cause.
**Privacy vs. safety tension** — how do you catch CSAM in end-to-end encrypted messaging? Apple shelved its NeuralHash-based on-device scanning plan, the EU is still debating "Chat Control," and the UK Online Safety Act carries its own technical demands. The same question is producing different answers.
**Moderator mental health** — see chapter 20. AI relieves some load, but the darkest content is still seen by humans.
**Regulatory fragmentation** — EU DSA, UK OSA, Korean network law, Japanese PCMA, US state law. Global platforms tend to converge on the strictest baseline. That is the "Brussels effect" — EU regulation becoming a de facto global standard.
These limits are not reasons to dismiss the field. The balancing act between expression and safety has run through every medium since the printing press. AI walks the same path — critically, one step at a time.
23. Conclusion — Defense in Depth, Human Review, and Trust
Spring 2026, in fifteen minutes of one voice-chat case at a gaming company we saw a snapshot of the era. Hive, Spectrum Labs, Cinder, NCMEC, EU DSA — different companies, different standards, different algorithms. All converging on one point: a single user's safety.
The priorities for the next five years are clear: **defense in depth** (hash plus classifier plus behavior signal plus LLM plus human), **provenance standards** (C2PA), **bias evaluation discipline** (the HolisticBias generation of benchmarks), **industry standardization for moderator welfare**, and **transparency-report comparability** (DSA Articles 15, 24, and 42).
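The defense-in-depth layering can be sketched as an ordered chain in which each layer either decides or abstains, and anything left undecided falls through to a human. This is a minimal sketch under assumptions: every layer is a stub, the thresholds are invented, and exact SHA-256 matching stands in for the perceptual hashes (such as PhotoDNA or PDQ) that real hash-sharing systems use.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

Layer = Callable[[bytes, dict], Optional[str]]  # returns "block", "allow", or None

# Stand-in for a shared hash list; real systems use perceptual hashes.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known-bad-sample").hexdigest()}


def hash_layer(content: bytes, ctx: dict) -> Optional[str]:
    digest = hashlib.sha256(content).hexdigest()
    return "block" if digest in KNOWN_BAD_HASHES else None


def classifier_layer(content: bytes, ctx: dict) -> Optional[str]:
    score = ctx.get("classifier_score")  # assumed to be precomputed upstream
    if score is None:
        return None
    if score >= 0.95:
        return "block"
    if score <= 0.05:
        return "allow"
    return None  # uncertain band: let later layers weigh in


def behavior_layer(content: bytes, ctx: dict) -> Optional[str]:
    repeat_offender = ctx.get("prior_violations", 0) >= 3
    suspicious = (ctx.get("classifier_score") or 0.0) >= 0.7
    return "block" if repeat_offender and suspicious else None


def llm_layer(content: bytes, ctx: dict) -> Optional[str]:
    return ctx.get("llm_verdict")  # hypothetical precomputed LLM judgment


LAYERS: List[Layer] = [hash_layer, classifier_layer, behavior_layer, llm_layer]


def moderate(content: bytes, ctx: dict, layers: List[Layer] = LAYERS) -> Tuple[str, str]:
    """Return (decision, deciding_layer); undecided items go to a human."""
    for layer in layers:
        decision = layer(content, ctx)
        if decision is not None:
            return decision, layer.__name__
    return "human_review", "fallthrough"
```

Recording which layer decided makes the share of automated versus human decisions a by-product of normal operation, which is one of the figures transparency reports ask for.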
T&S has moved from a single company's secret weapon to common industry infrastructure. As NCMEC and GIFCT showed, "harm caught by one platform should be blocked quickly by another" is the standard. At the same time, each platform's moderation choices are entangled with its governance — free expression, user rights, external audit.
Trust is not built in one go, yet every lapse drains it by its full weight. The most important asset in a 2026 T&S stack is not the algorithm but the user's sense that "this platform protects me." That sense is built jointly by every organization above (Hive, Microsoft, Google, Anthropic, Spectrum Labs, Cinder, ActiveFence, NCMEC, IWF, GIFCT) and by the human moderators behind them.
T&S is not one country's game. And it is not one company's game either.
24. References
- [EU Digital Services Act · Official](https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/europe-fit-digital-age/digital-services-act_en)
- [UK Online Safety Act 2023 · Ofcom](https://www.ofcom.org.uk/online-safety)
- [Hive AI · Content Moderation API](https://thehive.ai/)
- [Microsoft Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety)
- [Google Perspective API · Jigsaw](https://perspectiveapi.com/)
- [OpenAI Moderation API · Docs](https://platform.openai.com/docs/guides/moderation)
- [Anthropic Constitutional Classifiers Announcement (Feb 2025)](https://www.anthropic.com/research/constitutional-classifiers)
- [Spectrum Labs · Guardian](https://www.spectrumlabsai.com/)
- [Cinder · Trust and Safety Operations](https://www.cinder.co/)
- [ActiveFence · Content Moderation and Threat Intelligence](https://www.activefence.com/)
- [Sift · Digital Trust and Safety](https://sift.com/)
- [NCMEC · CyberTipline](https://www.missingkids.org/gethelpnow/cybertipline)
- [Internet Watch Foundation (IWF)](https://www.iwf.org.uk/)
- [GIFCT · Global Internet Forum to Counter Terrorism](https://gifct.org/)
- [Tech Coalition · Lantern](https://www.technologycoalition.org/lantern)
- [Microsoft PhotoDNA](https://www.microsoft.com/en-us/photodna)
- [Meta · PDQ and TMK Open Source](https://github.com/facebook/ThreatExchange)
- [Meta · Hasher-Matcher-Actioner](https://github.com/facebook/ThreatExchange/tree/main/hasher-matcher-actioner)
- [Reality Defender · Deepfake Detection](https://www.realitydefender.com/)
- [Sensity AI · Visual Threat Intelligence](https://sensity.ai/)
- [Truepic · C2PA Provenance](https://truepic.com/)
- [TrueMedia.org · Nonprofit Deepfake Detection](https://www.truemedia.org/)
- [Llama Guard 3 · Meta](https://github.com/meta-llama/PurpleLlama)
- [Lakera Guard · Prompt Injection Defense](https://www.lakera.ai/)
- [NVIDIA NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)
- [Guardrails AI](https://www.guardrailsai.com/)
- [detoxify · Unitary Open Source](https://github.com/unitaryai/detoxify)
- [MLCommons AILuminate Benchmark](https://mlcommons.org/benchmarks/ailuminate/)
- [Korea Communications Standards Commission (KOCSC)](https://www.kocsc.or.kr/)
- [Japan Internet Hotline Center · IHC](https://www.internethotline.jp/)