Skip to content
Published on

Synthetic Data Generation 2026 — Gretel · MOSTLY AI · Tonic · Hazy · Synthea · SDV · Mimesis · Faker · Distilabel · Argilla Deep Dive

Authors

Prologue — 2026, the Second Spring of Synthetic Data

When I first heard about synthetic data in 2018, it sounded like "the consolation prize when real data is unavailable." Even when the CTGAN paper came out in 2019, synthetic data was closer to a toy for GAN researchers. The mood changed completely after ChatGPT in 2023. LLMs began to need trillions of new training tokens, and at the same time GDPR and HIPAA pressure choked off the flow of real data. The two currents met and produced the second spring of synthetic data.

In August 2024 the synthetic data market saw a decisive event — MOSTLY AI acquired Gretel (per press reports; exact deal terms undisclosed). With that, the #1 and #2 of the tabular synthetic data market came under one roof. In the same year Tonic AI launched its Ephemeral (database clone) product, and Synthea solidified its place as the open-source healthcare population simulator born from MITRE. On the LLM side, Anthropic's Constitutional AI training pipeline for synthetic data drew attention, and Hugging Face's 2024 acquisition of Argilla turned Distilabel into the standard pipeline for synthetic data.

Synthetic data is not a substitute for real data. It is a tool for filling the places real data cannot reach — the wall of privacy, the gap of rare classes, the corner of adversarial cases, and the diversity LLMs hunger for.

This article covers:

  1. The 2026 synthetic-data map
  2. Five problems synthetic data tries to solve
  3. The math of tabular synthesis — GAN, VAE, Diffusion
  4. SDV (Synthetic Data Vault) — open source from MIT
  5. CTGAN, TVAE, TabDDPM — core models
  6. MOSTLY AI — global leader of tabular synthesis
  7. Gretel AI — champion of Differential Privacy
  8. Tonic AI — Structural, Textual, Ephemeral
  9. Hazy, YData, Syntegra — European / healthcare camps
  10. Synthea — de facto standard of healthcare synthesis
  11. Image and video — Omniverse Replicator, Unity Perception
  12. LLM synthesis pipelines — Distilabel, Magpie, Self-Instruct
  13. Constitutional AI and RLAIF — synthetic preference data
  14. Faker libraries — Python Faker, Mimesis, Faker.js
  15. Structured output — Outlines, Instructor, DSPy
  16. Quality evaluation of synthetic data
  17. Privacy guarantees — Differential Privacy and MIA
  18. Law and regulation — GDPR, HIPAA, K-PIPA, APPI
  19. Korean synthetic data — KAIST, ETRI, NAVER LABS
  20. Japanese synthetic data — PFN, NTT, NICT
  21. Which tool to pick — a decision tree
  22. References

1. The 2026 Synthetic Data Map

Big picture first. If we classify synthetic-data tools by the data type they handle, the landscape looks like this.

Tabular — the enterprise mainstream

  • MOSTLY AI (Austria) — global leader of tabular synthesis. Reported to have acquired Gretel in 2024.
  • Gretel AI (US) — Differential Privacy + GAN-based. Cloud API and SDK.
  • Tonic AI (US) — Tonic Structural (RDBMS subsetting), Tonic Textual (PII masking), Tonic Ephemeral (DB clone).
  • Hazy (UK) — enterprise, especially financial services.
  • YData (Portugal) — Synthetic Data plus ydata-profiling.
  • Syntegra — clinical data specialist.
  • SDV (Synthetic Data Vault) — MIT open source. CTGAN, TVAE, PAR.

Healthcare

  • Synthea (MITRE) — population-level clinical simulation. The de facto standard.
  • HealthShare (InterSystems) — healthcare data platform.
  • Clinical Synthetic Data Generator — HHS / CMS synthetic data initiative.

Image / video

  • NVIDIA Omniverse Replicator — 3D simulation based.
  • Unity Perception — game engine as a data generator.
  • Datagen, Synthesis AI — faces and pose synthesis.
  • AI.Reverie (acquired by Meta, 2021) — autonomous driving and defense.

LLM / text

  • Distilabel (Argilla → Hugging Face) — standard pipeline for synthetic instructions and preferences.
  • Magpie (Princeton) — model self-instruction.
  • Self-Instruct (Yizhong Wang) — pioneer of LLM self-generated training data.
  • OpenHermes 2.5, OpenOrca, UltraChat — representative synthetic datasets.

Faker — fake identifiers, names, addresses

  • Python Faker (joke2k) — the most widely used fake-data lib.
  • Mimesis (Python) — faster than Faker, multi-locale.
  • Faker.js — the community fork after the Marak incident.
  • mockaroo.com — web UI fake-data generator.

Structured Output

  • Outlines, Instructor, DSPy — force JSON schemas onto LLM output.

What this map tells us: the single word "synthetic data" actually covers five different markets. Privacy-preserving tabular, population simulation for healthcare, instruction synthesis for LLMs, simulation for autonomous driving — all flying under the same flag, solving different problems.


2. The Five Problems Synthetic Data Tries to Solve

The reasons to use synthetic data boil down to five.

Problem 1: Privacy — GDPR (EU), HIPAA (US), K-PIPA (Korea), APPI (Japan) heavily restrict the transfer and sharing of identifiable personal data. Synthetic data preserves statistical distributions but is built so that individuals cannot be re-identified. "European HQ cannot send customer data to its Korean subsidiary → send synthetic data" is the most common scenario.

Problem 2: Data scarcity — In autonomous driving the real data for "almost ran over a child" is hard to collect (thankfully). In healthcare, rare diseases have few patients to begin with. In fraud detection the positive class is 0.1%. Simulation fills the gap.

Problem 3: Class imbalance — Anomaly, fraud, and clinical diagnosis tasks often have a positive class under 1%. From SMOTE/ADASYN to GAN/VAE-based oversampling, fighting class imbalance is the classic use case for synthetic data.

Problem 4: Augmentation — Apply transformations (rotation, noise, color shift) to real data to expand training sets. Cutout/mixup in vision, back-translation/EDA in NLP. Not synthetic data in the narrow sense, but the same goal.

Problem 5: LLM training data — The hottest application after 2024. Internet text is running dry, and human labelers are expensive. So LLMs synthesize data for other LLMs. Self-Instruct, Magpie, Constitutional AI, RLAIF — all the same current.

Which of these five you are solving, plus whether your data is tabular, image, or text, dictates the tool you pick.


3. The Math of Tabular Synthesis — GAN, VAE, Diffusion

The core of tabular synthesis is to learn the joint distribution P(X1, X2, ..., Xn) and draw new samples from it. Three paradigms compete in 2026.

Paradigm 1: GAN-based — CTGAN

# CTGAN core idea: conditional generator + mode-specific normalization
# (conceptual pseudocode)
class CTGAN:
    def fit(self, data, discrete_columns):
        # 1. Split continuous columns into multiple modes via Gaussian Mixture
        self.gmm = fit_gmm_per_column(data)
        # 2. One-hot encode discrete columns
        self.ohe = one_hot(data, discrete_columns)
        # 3. Use conditional sampling to handle class imbalance
        self.cond_sampler = ConditionalSampler(discrete_columns)
        # 4. Wasserstein GAN with gradient penalty
        train_wgan_gp(generator, critic, epochs=300)

    def sample(self, n):
        cond = self.cond_sampler.sample(n)
        return self.generator(noise, cond)

Paradigm 2: VAE-based — TVAE

TVAE shares CTGAN's preprocessing (GMM + one-hot) but the decoder is a VAE. Training is more stable than GAN and mode collapse is less common.

Paradigm 3: Diffusion-based — TabDDPM, TabSyn

TabDDPM (Kotelnikov et al., 2023) brought to tabular what diffusion proved on images. Gaussian diffusion for continuous, multinomial diffusion for discrete. TabSyn (Zhang et al., 2024) took it further with latent diffusion.

ModelParadigmStrengthWeakness
CTGANGANfast sampling, industry standardunstable training
TVAEVAEstable, mode-preservinglower diversity
TabDDPMDiffusionSOTA qualityslow sampling
TabSynLatent Diffusionquality + speednewcomer
ARFAdversarial Random Foreststrong on small dataweak on large data

Industry tools (MOSTLY AI, Gretel, Tonic) all have their own improved variants of these. Some are published, some are trade secrets.


4. SDV (Synthetic Data Vault) — Open Source from MIT

SDV started in 2016 at MIT's DAI Lab and is, in 2026, the de facto open-source standard for tabular synthetic data. It handles single tables, multi-table (relational) data, and time series.

# SDV single-table synthesis (representative example)
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
import pandas as pd

# 1) Load data
real_data = pd.read_csv('customers.csv')

# 2) Auto-detect metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# 3) Train model
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_data)

# 4) Generate 10k synthetic rows
synthetic = synthesizer.sample(num_rows=10_000)

# 5) Quality evaluation — SDMetrics ships with SDV
from sdv.evaluation.single_table import evaluate_quality
report = evaluate_quality(real_data, synthetic, metadata)
print(report.get_score())  # 0..1, higher is better

SDV's strength is multi-table relational preservation. HMA (Hierarchical Modeling Algorithm), HSA, and GaussianCopula respect foreign-key constraints while synthesizing parent-child tables together. PAR (Probabilistic AutoRegressive) handles time series.

The license is the Business Source License (a variant of MIT) — free for non-commercial / research use, with a separate SDV Enterprise license for commercial use through Datacebo.


5. CTGAN, TVAE, TabDDPM — Core Models Compared

Comparing the three on the same data (UCI Adult, n=48,842) shows the trends reported in public benchmarks.

MetricCTGANTVAETabDDPM
Marginal distribution (KS)0.050.040.02
Joint distribution (TVD)0.120.100.07
Downstream ML F10.810.830.85
Training time1x1.2x5x
Sampling time1x1x30x

Interpretation: TabDDPM wins on quality but costs more in training and sampling. Choose TVAE/CTGAN when you need volume, TabDDPM/TabSyn when quality is paramount. Real industry tools auto-pick between several models based on data size, sensitivity, and synthesis volume.


6. MOSTLY AI — Global Leader of Tabular Synthesis

MOSTLY AI was founded in Vienna, Austria in 2017 and is the largest player in tabular synthesis. They are strong in financial services (Erste Group, ING), insurance, telecom, and the EU public sector.

Product line

  • MOSTLY AI Platform — cloud SaaS or on-prem Docker. Web UI plus REST API.
  • mostlyai SDK — Python SDK open-sourced in 2024. Train and run synthesis models in your own environment.
  • AI Assistants — natural-language driven synthesis.
# MOSTLY AI open-source SDK (representative example)
from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)  # local mode

# 1) Train
g = mostly.train(
    data='customers.csv',
    name='customer-synth-v1',
)

# 2) Generate
syn = mostly.generate(g, size=50_000)
syn.data().to_csv('synthetic_customers.csv', index=False)

# 3) Quality report — printed as HTML
print(syn.report_path)

Technical highlights

  • Their own transformer-based generative model (announced 2023). One of the two transformer giants for tabular along with PFN.
  • Differential Privacy is an option; when enabled the ε (epsilon) budget is stated.
  • Supports multi-table relational synthesis and time series.

After the reported 2024 acquisition of Gretel, MOSTLY AI is effectively the dominant #1 in tabular synthesis. The two product lines have been kept separate for some time, however.


7. Gretel AI — Champion of Differential Privacy

Gretel was founded in California in 2019. They differ from MOSTLY AI in placing Differential Privacy as a first-class citizen.

Products

  • Gretel Cloud — SaaS. Handles tabular, free-form text (PII synthesis), and time series.
  • gretel-synthetics — open-source Python library. Includes ACTGAN, TimeSeries, LSTM, and GPT-based text synthesis models.
  • Gretel Tuner — automatic hyperparameter search for synthesis models.
# Generate synthetic data with the Gretel SDK (representative example)
from gretel_client import Gretel

gretel = Gretel(api_key='prompt')  # or env var
project = gretel.projects.create(name='customer-synth')

# Train a tabular synthesis model (in Gretel Cloud)
trained = project.create_model_obj(
    model_config='synthetics/tabular-actgan',
    data_source='customers.csv',
)
trained.submit_cloud()
trained.poll()  # wait for training

# Generate synthetic data
record_handler = trained.create_record_handler_obj(params={'num_records': 50_000})
record_handler.submit_cloud()
record_handler.poll()
synthetic_df = record_handler.get_artifact_handle('data_preview').download_to_dataframe()

Technical highlights

  • Differential Privacy GAN — DP-SGD bounds model-parameter leakage to ε.
  • PII redaction + synthesis combined — pipeline that masks names/addresses in text and fills the slots with synthetic substitutes.
  • Gretel Tuner — automatic model selection.

Even after the 2024 MOSTLY AI acquisition, the Gretel brand has been preserved and APIs remain compatible. New features arriving on a single unified line, however, is still in progress as of 2026.


8. Tonic AI — Structural, Textual, Ephemeral

Tonic AI was founded in San Francisco in 2018 as "synthetic data for engineers." If MOSTLY/Gretel assume a data scientist training a model, Tonic puts first the developer who needs a safe replica of the production DB.

Three product lines

  • Tonic Structural — subset + mask/synthesize while preserving foreign-key relationships in production RDBMS (PostgreSQL/MySQL/SQL Server/Oracle). Integrates into CI/CD.
  • Tonic Textual — detect PII in free-form text via NER + mask + synthetically substitute. Clinical notes, call center transcripts.
  • Tonic Ephemeral (GA 2024) — short-lived database instances. Spin up a synthetic DB per PR; tear it down when the PR closes.
# Tonic Structural workspace config (representative example)
workspace: customer-app
source:
  type: postgres
  host: prod-readonly.example.com
  database: app
subset:
  root_table: public.customers
  target_size: 10%
generators:
  public.customers.email:
    type: fake_email
  public.customers.phone:
    type: random_phone
  public.orders.notes:
    type: tonic_textual  # synthesizes PII inside free text
destination:
  type: postgres
  host: dev-db.internal
  database: app_dev

Market positioning

  • MOSTLY/Gretel build synthetic datasets that preserve statistics (for ML training) while protecting data.
  • Tonic builds safe DB replicas for dev/test (for engineering).
  • Same word — "synthetic data" — but they solve different problems.

9. Hazy, YData, Syntegra — European and Healthcare Camps

Hazy (UK, London, founded 2017)

  • Strong in financial and public sectors. Works with the UK Office for National Statistics' Secure Research Service.
  • Their own GAN variants plus Differential Privacy. Strong in air-gapped deployments.
  • 2024 NatWest and HSBC case studies have been publicized.

YData (Portugal, founded 2019)

  • ydata-synthetic — open-source synthesis lib. CTGAN, TimeGAN, DragonGAN.
  • ydata-profiling (formerly pandas-profiling) — the standard data-profiling tool. Joined YData in 2022.
  • YData Fabric — integrated SaaS for data prep, synthesis, and evaluation.

Syntegra (US, founded 2020)

  • Specializes in clinical data synthesis. EMR, claims, genomics.
  • Their own transformer-based models plus auto-generated HIPAA Safe Harbor evaluation reports.
  • Partnered with Mayo Clinic, Columbia, and others.

What this camp shares is deep domain knowledge in regulated industries (finance, healthcare). They automate domain-specific validation (e.g., clinical coding consistency) that generic tabular tools struggle with.


10. Synthea — De Facto Standard of Healthcare Synthesis

Synthea is the open-source healthcare population simulator released by MITRE Corporation in 2017, and as of 2026 the de facto standard of healthcare synthetic data. Unlike tabular models that learn "the distribution of existing data," Synthea is rule-based simulation that explicitly models clinical guidelines.

Architecture

  • Generic Module Framework — clinical modules defined in JSON (hypertension, diabetes, pregnancy, COVID-19, etc. — over a hundred).
  • Each module is a state machine: diagnosis → tests → treatment → follow-up.
  • Patients are simulated from birth to death, with incidence varying by age, sex, race, and region.
# Run Synthea (representative example) - Java jar
# Generate 10,000 synthetic patients in Massachusetts, output in FHIR R4 format
java -jar synthea-with-dependencies.jar \
  -p 10000 \
  -s 1234 \
  --exporter.fhir.export true \
  --exporter.csv.export true \
  Massachusetts

Output formats

  • FHIR R4 (JSON) — by far the most used.
  • C-CDA — HL7 clinical documents.
  • CSV — for tabular analysis.
  • CPCDS — claims data.

Limitations

  • Because Synthea does not learn statistics from real patients, there is no guarantee that "the distribution matches real data." Being guideline-based, it inherits the guidelines' limits (e.g., thin detail on rare diseases).
  • Hence a trend since 2024: scaffold with Synthea, then fill in details with GAN/diffusion — a hybrid approach.

11. Image and Video — Omniverse Replicator, Unity Perception

NVIDIA Omniverse Replicator — introduced in 2021. USD (Universal Scene Description)-based 3D simulation that emits RGB, segmentation, depth, and bounding boxes simultaneously. Strong in autonomous driving, robot manipulation, warehouse automation. Integrated with Isaac Sim in 2024, it became the standard for robot training data.

Unity Perception — the game engine Unity turned into a data generator. The public SynthDet baseline dataset is a flagship.

Datagen, Synthesis AI — face / pose / expression specialists. Used in driver monitoring for autonomous driving and AR/VR avatars.

AI.Reverie (acquired by Meta, 2021) — satellite, drone, and defense simulation. Effectively became Meta's internal tool after the 2021 acquisition.

# Omniverse Replicator (representative pseudocode)
import omni.replicator.core as rep

with rep.new_layer():
    camera = rep.create.camera()
    cube = rep.create.cube(position=(0, 0, 0))
    light = rep.create.light()

    with rep.trigger.on_frame(num_frames=1000):
        with cube:
            rep.modify.pose(position=rep.distribution.uniform((-5, 0, -5), (5, 0, 5)))

    writer = rep.WriterRegistry.get('BasicWriter')
    writer.initialize(output_dir='out/', rgb=True, bounding_box_2d_tight=True)
    writer.attach([camera])
    rep.orchestrator.run()

The core challenge of image and video synthesis is the sim-to-real gap. When simulation diverges from reality, models trained on it fail in the field. Domain randomization and domain adaptation are the key techniques for closing that gap.


12. LLM Synthesis Pipelines — Distilabel, Magpie, Self-Instruct

After 2023, the LLM thirst for training data opened a new front for synthetic data.

Self-Instruct (Yizhong Wang et al., 2022) — the pioneer of LLM self-generated training data. Starting from a small seed pool of human-written instructions, the LLM produces new instructions and answers. Alpaca, Vicuna, and WizardLM are direct descendants of this current.

Magpie (Princeton, 2024) — Self-Instruct without the seed. Feed only the user-token portion of a chat template; the LLM produces an instruction, then answers it. Run on Llama-3-Instruct, this produced a synthetic dataset on the scale of a million entries.

Distilabel (Argilla → Hugging Face, 2024) — the standard library for synthetic-data pipelines. Synthesis workflows are written as node graphs.

# Distilabel — synthetic instructions + preferences (representative example)
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from distilabel.llms import OpenAILLM

with Pipeline(name='preference-synthesis') as pipeline:
    generator_a = TextGeneration(
        name='gen_a',
        llm=OpenAILLM(model='gpt-4o'),
    )
    generator_b = TextGeneration(
        name='gen_b',
        llm=OpenAILLM(model='gpt-4o-mini'),
    )
    # GPT-4 then judges which response is better → produces preference pairs
    # (omitted) JudgeStep ...

distiset = pipeline.run(
    parameters={
        'gen_a': {'llm': {'generation_kwargs': {'temperature': 0.7}}},
    },
)
distiset.push_to_hub('username/synthetic-prefs')

OpenHermes 2.5, OpenOrca, UltraChat — all flagship synthetic datasets. Instruction-response pairs on the scale of a million to a few million. Licensing is tricky, so commercial use needs care.


13. Constitutional AI and RLAIF — Synthetic Preference Data

In LLM alignment the most expensive part is human preference labeling. So synthetic preference data has become a central topic.

Constitutional AI (Anthropic, 2022) — the model critiques its own output against a "constitution" (a list of principles) and revises it. The (response, revised response) pair becomes preference-learning data.

RLAIF (RL from AI Feedback) (Google, 2023) — replaces the human labeler in RLHF with an LLM judge. Reports agreement with humans above 90%.

Constitutional AI pipeline (representative)
1. Initial model answers a user question.
2. Randomly pick one principle from the constitution; model critiques its own answer.
3. Model revises its answer based on the critique.
4. (question, original answer, revised answer) → SFT training data.
5. (question, original vs revised answer) → DPO/PPO preference data.

Self-Reward (Meta, 2024) — same model becomes both responder and judge in a self-reinforcing loop. The extreme form of synthetic data.

The crux of this current is a loop that improves without humans. The catch: the model can amplify its own biases (model collapse, mode collapse), so the ratio of synthetic to real and the verification of synthetic data both matter.


14. Faker Libraries — Python Faker, Mimesis, Faker.js

The lightest form of synthetic data is fakery without statistical learning — the Faker family.

Python Faker (joke2k, 2014~) — the most widely used fake-data library. Over 70 locales.

from faker import Faker
fake = Faker('en_US')

print(fake.name())          # John Smith
print(fake.address())       # 123 Main St, ...
print(fake.phone_number())  # +1-555-123-4567
print(fake.email())         # john@example.com
print(fake.company())       # Acme Inc.

Mimesis (Python) — faster than Faker (parts in C extensions) and rich in multi-locale support.

from mimesis import Generic
g = Generic('en')
print(g.person.full_name())   # John Smith
print(g.address.address())
print(g.business.company())

Faker.js (npm) — after the January 2022 incident where the original author Marak deliberately broke the npm package (the sibling of the "node-ipc" affair), the community-forked @faker-js/faker became the standard. The original faker was deprecated on npm.

import { faker } from '@faker-js/faker/locale/en';

console.log(faker.person.fullName());
console.log(faker.location.streetAddress());
console.log(faker.phone.number());

mockaroo.com — design fake-data columns in a web UI and download as CSV/JSON. Over 200 data types.

The limit of the Faker family: column correlations are not preserved. Age and occupation, city and zip — the relationships are fake. When statistics matter you must move up to SDV/MOSTLY/Gretel.


15. Structured Output — Outlines, Instructor, DSPy

When LLMs generate synthetic data, the biggest headache is output format. Libraries that enforce JSON schemas have become essential parts of synthesis pipelines.

Outlines (dottxt-ai) — enforces JSON schema / regex / context-free grammar at the token level. It masks logits so invalid tokens are never even generated.

import outlines

model = outlines.models.transformers('meta-llama/Llama-3.1-8B-Instruct')

generator = outlines.generate.json(model, schema='''
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age":  {"type": "integer", "minimum": 0, "maximum": 120},
    "city": {"type": "string"}
  },
  "required": ["name", "age", "city"]
}
''')

print(generator('Build a JSON profile for a fictional person in New York.'))

Instructor (Jason Liu) — uses Pydantic models directly as the output schema on top of OpenAI/Anthropic APIs. The most popular abstraction.

DSPy (Stanford) — instead of hand-writing prompts, you declare "signatures" and the compiler optimizes the prompts. A meta tool for synthesis pipelines.

These tools force the output of synthesis generators, so the pipeline flows through a fixed schema. They are the reliability baseline of synthetic data.


16. Quality Evaluation of Synthetic Data

The quality of synthetic data is judged on four axes.

Axis 1: Marginal distribution — does each column's histogram resemble real data? KS-test, Total Variation Distance.

Axis 2: Joint distribution — are column correlations and dependencies preserved? Pearson/Spearman correlation matrix difference, mutual information.

Axis 3: Downstream utility — train an ML model on synthetic data, evaluate on real. How much accuracy drops? TSTR (Train Synthetic, Test Real).

Axis 4: Privacy — can a real person be re-identified from the synthetic data? Distance to Closest Record (DCR), Membership Inference Attack (MIA), Attribute Inference Attack (AIA).

# Evaluate synthetic data with SDMetrics (representative example)
from sdmetrics.reports.single_table import QualityReport
from sdmetrics.single_table import (
    NewRowSynthesis,        # is the exact row already in real? (privacy)
    BoundaryAdherence,      # column boundary adherence
    CategoryCoverage,       # category coverage
)

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
print(report.get_score())            # 0..1, overall
print(report.get_details('Column Shapes'))   # marginal
print(report.get_details('Column Pair Trends'))  # joint

These four axes trade off. Strong privacy hurts utility; maximum utility risks the model memorizing the original. Where you balance them is the central craft of synthetic-data engineering.


17. Privacy Guarantees — Differential Privacy and MIA

Differential Privacy (DP) — defined by Dwork in 2006. Guarantees that removing or adding one person from the dataset barely changes the algorithm's output distribution. The strength is parameterized by (ε, δ).

Definition — (ε, δ)-DP
For neighbor datasets D, D' (differing in one row), for every output set S:
  Pr[A(D) in S] <= e^ε · Pr[A(D') in S] + δ

Smaller ε means stronger guarantee. ε <= 1 is the usual recommendation.
δ should be far smaller than 1/n (e.g., 1e-6).

DP-SGD (Abadi et al., 2016) — adds noise and clips gradients during neural net training to enforce DP. Applied to synthetic-data generative models.

Membership Inference Attack (MIA) — the attacker tries to decide "was this person in the training set?" Against synthetic data the attacker only sees synthetic output and tries to infer original members. An MIA success rate near 50% (random guessing) means safety.

Attribute Inference Attack (AIA) — "I don't know this person's age, but I know occupation and city. Can I infer age?" The question of whether partial exposure of synthetic data leaks other attributes.

Industrial implication — MOSTLY AI and Gretel make DP optional and state ε when enabled. Heavily regulated industries (healthcare, finance) effectively default to DP-on. But since smaller ε hurts utility, picking the right value per domain is hard.


18. Law and Regulation — GDPR, HIPAA, K-PIPA, APPI

GDPR (EU, 2018) — strict consent for processing and transfer of personal data. For synthetic data to be recognized as "no longer personal data," re-identification must be effectively impossible. The EDPB's anonymization guideline (succeeding the Article 29 WP Opinion 05/2014) is the anchor: the three risks of singling-out, linkability, and inference must all be low for true anonymity.

HIPAA (US, 1996) — de-identification via either Safe Harbor (strip 18 identifiers) or Expert Determination (a statistical expert evaluates re-identification risk). Synthetic data normally takes the Expert Determination route. Tools like Syntegra auto-generate the Expert Determination report.

K-PIPA (Korea) — concepts of pseudonymized and anonymized data. Pseudonymized data may be used without consent for statistics and research; anonymized data has even more freedom. Synthetic data is typically classified as anonymous, but case-by-case review is required. After the 2020 "Data 3 Acts" amendment, the legal positions of pseudonymous/anonymous data became clear.

APPI (Japan) — the concept of "anonymously processed information" (匿名加工情報). Similar to Korea's anonymous data, but creation must meet specific technical criteria. See the PPC (Personal Information Protection Commission) guidelines.

US state laws — CCPA/CPRA (California), and as of 2024 over 20 states have passed their own privacy laws. There is still no unified federal law.

The legal key: to have synthetic data recognized as de-identified, you must show statistical re-identification risk with objective metrics, not just "we masked the names." So privacy reports automatically generated by synthesis tools (DCR, MIA, k-anonymity, ...) become part of the legal record.


19. Korean Synthetic Data — KAIST, ETRI, NAVER LABS

Korea's synthetic data ecosystem has a public-led + corporate-internal dual structure.

KAIST, Seoul National University, POSTECH — active follow-ups to CTGAN and TabDDPM. NeurIPS-accepted tabular-diffusion variants have come out of KAIST in 2024.

ETRI (Electronics and Telecommunications Research Institute) — government-funded development of pseudonymization/anonymization tools, used for public data release.

NAVER LABS, Kakao Brain — synthetic Korean datasets for LLM training. Used to train their own models (HyperCLOVA X, KoGPT, ...).

Finance and telecom — the MyData era — after MyData went live in 2022, industrial use of pseudonymized/anonymized combined data has grown, and so has demand for synthetic data. Cases at Shinhan Bank, KB Kookmin Bank, KT and others using in-house tooling or external solutions (MOSTLY/Gretel) have been reported.

K-DATA (Korea Data Agency) — publishes guides on the use of pseudonymous and anonymous data and adequacy assessments. Key government documents on the legal status of synthetic data.

Healthcare — the Ministry of Health and Welfare's medical MyData push (2023~), pilots of synthetic EMR at the National Cancer Center, Asan Medical Center, and others. Localizing Synthea to Korea (population stats, incidence, clinical coding) is an active academic topic.


20. Japanese Synthetic Data — PFN, NTT, NICT

PFN (Preferred Networks) — autonomous driving and robot simulation data. Built synthetic video datasets with Toyota.

NTT Data, NTT Labs — in-house tabular synthesis tools. Used to analyze telecom users. Externally disclosed only via select papers.

NICT (National Institute of Information and Communications Technology) — multilingual NLP data and speech synthesis data. The ASTREC speech synthesis corpus is the flagship.

RIKEN AIP — synthetic EMR research for medical AI. Japan's medical data is so closed that synthesis is in some fields the only practical channel for external sharing.

Commercial adoption — Japanese finance shows MUFG and SMBC piloting in-house tools plus Hazy/MOSTLY. In insurance, Tokio Marine has reported training a fraud-detection model on synthetic claims data.

APPI in practice — with clear criteria for creating anonymously processed information, the standard field approach is to design synthesis so the output qualifies under that category.

Difference from Korea — Japan's public-sector data release is slower than Korea's, so the motivation for synthetic data leans more toward "moving data safely inside the private sector" than "releasing public data."


21. Which Tool to Pick — A Decision Tree

ScenarioRecommended tool
Tabular, open source, fast prototypeSDV (CTGAN/TVAE)
Tabular, enterprise, cloudMOSTLY AI or Gretel
Tabular, enterprise, on-prem + DPMOSTLY AI on-prem / Hazy
Replicating prod RDBMS + maskingTonic Structural
Throwaway DB per PRTonic Ephemeral
Free-form text (notes, call center) PIITonic Textual / Gretel
Healthcare EHR population simulationSynthea
Clinical data synthesis (distribution-based)Syntegra
Autonomous driving / robotics training videoNVIDIA Omniverse Replicator
LLM instruction synthesisDistilabel + Magpie
Simple fake names and addressesPython Faker / Mimesis / Faker.js
Force JSON schemasOutlines / Instructor

Four dimensions to consider

  1. Data type — tabular, text, video, or time series?
  2. Purpose — ML training, DEV/QA, or data sharing?
  3. Regulation — is DP required? Is HIPAA Safe Harbor needed?
  4. Operating environment — cloud OK, or air-gap required?

Write these four on paper, map candidate tools, and you usually narrow down to two or three. Then run a POC.


22. References

Official docs and major papers/reports only.

Tabular synthesis (official docs)

Healthcare

Image and video

LLM / instruction synthesis

Faker

Structured output

Core papers

Law and regulation


Epilogue — Synthetic Does Not Replace Real

One-line summary: synthetic data is not a substitute for real data; it is a tool for filling the places real data cannot reach. Behind the wall of privacy, in the void of rare classes, in the diversity LLMs hunger for — synthesis belongs only there. Fill every place with synthesis and you end up with a model that learns only its own shadow.

The biggest risk for synthetic data in 2026 is model collapse. When the loop of training LLMs on LLM-generated data grows long, models lose the diversity of the real world. So the ratio of synthetic to real, measurement of synthetic diversity, and the real holdout for definitive evaluation — these are the core engineering topics for the next five years.

Next post candidates: AI model evaluation systems deep dive (Inspect AI · Promptfoo · OpenAI Evals), LLM data curation pipelines, Differential Privacy in practice.

"Synthetic is not a photograph of reality. It is the statistics of reality. Putting statistics where a photo is wanted will fail. Putting a photo where statistics is wanted will also fail. They are different tools."