Data that wasn't measured. It was made.
Synthetic data is information generated by an algorithm rather than collected from a real-world event. It mimics the statistical patterns of real data so closely that, for many purposes, you can train a model or test a system as if it were the real thing.
Real data comes from measurement: a sensor reading, a patient record, a transaction log. Synthetic data comes from a process (a rule, a probability distribution, a neural network) designed to produce records that look and behave like the real ones, without actually being them.
Synthetic data is any data produced by a generative process whose goal is to preserve selected statistical properties of an original dataset, while breaking the direct one-to-one link between records and the real individuals or events they came from.
Three things that tend to be true
- It is structurally indistinguishable from real data: same columns, same value ranges, same types.
- It is statistically faithful at the level you care about: distributions, correlations, dependencies.
- It is not a copy. No synthetic row should map directly back to a specific real individual.
Why would anyone make up data?
There are at least five honest reasons researchers and practitioners reach for synthetic data, and most projects you'll see have more than one.
Each motivation corresponds to a real, recurring problem. The same generation method might solve one of these well and another poorly, which is why the choice of method follows from the choice of motivation, not the other way around.
Synthetic data exists because real data is sometimes private, sometimes rare, sometimes biased, sometimes expensive, and sometimes simply absent for the case you care about.
Not all synthetic data is synthetic in the same way.
The word "synthetic" hides a spectrum. At one end, every cell is generated from scratch. At the other, only sensitive attributes are replaced. Knowing which kind you have changes everything downstream.
The grids in each card show how many cells of the original table are kept (dark) versus replaced (blue).
A common mistake is calling a dataset "synthetic" without specifying which kind. Privacy guarantees, evaluation metrics, and the appropriate generation method all depend on this choice. Always state it explicitly.
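To make the two ends of the spectrum concrete, here is a minimal sketch in Python using only the standard library. The toy table, the choice of "salary" as the sensitive attribute, and the helper names `fully_synthetic` and `partially_synthetic` are all hypothetical. Sampling each column independently from its observed values is a deliberately naive stand-in for a real generator: it preserves each column's marginal distribution but not the relationships between columns.

```python
import random

# Hypothetical toy table; "salary" is treated as the sensitive attribute.
real = [
    {"age": 34, "city": "Leeds", "salary": 41000},
    {"age": 51, "city": "York", "salary": 58000},
    {"age": 27, "city": "Leeds", "salary": 36000},
    {"age": 45, "city": "Hull", "salary": 52000},
]

def fully_synthetic(rows, n, rng):
    """Generate every cell by sampling each column's observed values independently."""
    columns = list(rows[0].keys())
    return [{c: rng.choice([r[c] for r in rows]) for c in columns} for _ in range(n)]

def partially_synthetic(rows, sensitive, rng):
    """Keep non-sensitive cells unchanged; replace only the sensitive columns."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the real table is not modified
        for c in sensitive:
            new_row[c] = rng.choice([r[c] for r in rows])
        out.append(new_row)
    return out

rng = random.Random(0)
full = fully_synthetic(real, n=4, rng=rng)
partial = partially_synthetic(real, sensitive=["salary"], rng=rng)

print("fully synthetic:    ", full[0])
print("partially synthetic:", partial[0])
```

In the partially synthetic table, the age and city cells are still the real ones, so any privacy claim has to account for them. That is the reason the kind of synthesis needs to be stated alongside the dataset.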
Four families, one shared idea.
Every method, no matter how sophisticated, does the same thing: it learns or assumes a probability distribution over the data, then draws new samples from that distribution.
Try it: fit, then sample
The demo uses the simplest Family 01 method: a Gaussian model. Pick a true distribution, generate a sample, and watch the Gaussian try to imitate it. You'll see immediately where simple methods succeed and where they fail.
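The fit-then-sample loop can be sketched in a few lines of standard-library Python. This is not the demo's own code: the "true" distribution below (an equal mixture of two normals centred at -3 and 3), the sample size, and the helper names are assumptions chosen to show one case where a single Gaussian fails.

```python
import random
import statistics

rng = random.Random(42)

def draw_real(rng):
    """Assumed true distribution: two well-separated normal components."""
    centre = -3.0 if rng.random() < 0.5 else 3.0
    return rng.gauss(centre, 0.5)

real = [draw_real(rng) for _ in range(5000)]

# Fit: a single Gaussian is fully described by its mean and standard deviation.
mu = statistics.fmean(real)
sigma = statistics.stdev(real)

# Sample: draw new records from the fitted model.
synthetic = [rng.gauss(mu, sigma) for _ in range(5000)]

def share_near_zero(values, width=1.0):
    """Fraction of values within `width` of zero, where the real data has a gap."""
    return sum(abs(v) < width for v in values) / len(values)

print(f"mean: real={mu:.2f}  synthetic={statistics.fmean(synthetic):.2f}")
print(f"std:  real={sigma:.2f}  synthetic={statistics.stdev(synthetic):.2f}")
print(f"share near 0: real={share_near_zero(real):.3f}  "
      f"synthetic={share_near_zero(synthetic):.3f}")
```

The mean and standard deviation of the two samples should agree closely, because those are exactly what the model fits. The share of values near zero should not: the real data has almost nothing there, while the fitted Gaussian puts its peak there.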
Fidelity. Utility. Privacy. Pick two.
A synthetic dataset is evaluated along three axes that cannot all be maximised at once. Push one up and at least one of the others drops.
The three axes
- Fidelity: how closely the synthetic distribution matches the real one. Measured with KS distance, Jensen–Shannon divergence, correlation similarity, and discriminator scores.
- Utility: how well a model trained on the synthetic data performs on a real test set. The standard protocol is train-on-synthetic, test-on-real (TSTR).
- Privacy: how hard it is to recover information about individuals in the real dataset. Measured with membership inference attacks, distance to closest record, and differential privacy bounds.
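As a rough illustration, the sketch below computes one metric per axis on a toy problem: the two-sample KS statistic for fidelity, TSTR accuracy for utility, and median distance to closest record for privacy. Everything here is an assumption made for the example: the one-feature dataset, the class-conditional Gaussian generator, the nearest-class-mean classifier, and all function names. It uses only the Python standard library, with the KS statistic written out by hand.

```python
import random
from bisect import bisect_right
from statistics import fmean, median, stdev

def make_real(n, rng):
    """Assumed real data: label y in {0, 1}, feature x ~ Normal(2*y, 1)."""
    rows = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        rows.append((rng.gauss(2.0 * y, 1.0), y))
    return rows

def synthesize(train, n, rng):
    """Hypothetical generator: one Gaussian per class, fitted to the real training rows."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        params[label] = (fmean(xs), stdev(xs))
    p1 = sum(y for _, y in train) / len(train)
    out = []
    for _ in range(n):
        y = int(rng.random() < p1)
        mu, sigma = params[y]
        out.append((rng.gauss(mu, sigma), y))
    return out

def ks_statistic(a, b):
    """Fidelity: largest gap between the two empirical CDFs (0 means identical)."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)

def fit_class_means(rows):
    """A deliberately simple model: the mean of x for each class."""
    return (fmean(x for x, y in rows if y == 0),
            fmean(x for x, y in rows if y == 1))

def accuracy(model, rows):
    """Predict the class whose mean is nearer, then score against the true labels."""
    m0, m1 = model
    hits = sum((abs(x - m1) < abs(x - m0)) == (y == 1) for x, y in rows)
    return hits / len(rows)

def median_dcr(synth, real):
    """Privacy proxy: median distance from each synthetic row to its nearest real row."""
    return median(min(abs(xs - xr) for xr, _ in real) for xs, _ in synth)

rng = random.Random(7)
real_train = make_real(500, rng)
real_test = make_real(500, rng)
synthetic = synthesize(real_train, 500, rng)

fidelity_gap = ks_statistic([x for x, _ in real_train], [x for x, _ in synthetic])
tstr = accuracy(fit_class_means(synthetic), real_test)   # train on synthetic
trtr = accuracy(fit_class_means(real_train), real_test)  # baseline: train on real
dcr = median_dcr(synthetic, real_train)

print(f"fidelity (KS statistic, lower is closer): {fidelity_gap:.3f}")
print(f"utility  (TSTR accuracy vs real baseline): {tstr:.3f} vs {trtr:.3f}")
print(f"privacy  (median distance to closest record): {dcr:.4f}")
```

Reporting the real-data baseline next to the TSTR score makes the utility number interpretable. A distance-to-closest-record of exactly zero would mean the generator has reproduced a real row.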
A perfect copy of the real data has perfect fidelity and utility and zero privacy. Pure random noise has perfect privacy and zero fidelity or utility. Real synthetic data sits somewhere between, and any honest report should state where.
See it move
Drag the noise slider to add privacy noise. Watch fidelity (correlation) go down as privacy goes up.
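The effect of the slider can be approximated offline with a short standard-library Python sketch. The data-generating process, the noise levels, and the `pearson` helper are assumptions for illustration. Plain additive Gaussian noise is used as a stand-in for whatever the demo does internally, and it carries no formal privacy guarantee on its own.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

rng = random.Random(1)

# Assumed "real" data: y depends linearly on x, plus a little noise of its own.
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys = [0.9 * x + rng.gauss(0, 0.3) for x in xs]

results = {}
for noise_sd in (0.0, 0.5, 1.0, 2.0):
    # Perturb both columns independently; a larger noise_sd hides more, keeps less.
    noisy_x = [x + rng.gauss(0, noise_sd) for x in xs]
    noisy_y = [y + rng.gauss(0, noise_sd) for y in ys]
    results[noise_sd] = pearson(noisy_x, noisy_y)
    print(f"noise sd = {noise_sd:.1f}  correlation = {results[noise_sd]:.3f}")
```

With these settings the correlation should fall steadily as the noise level rises, the same direction of movement the slider shows.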
Five questions. Pencil down.
If you've worked through the previous five lessons, you should be able to answer each of these quickly. If a question stumps you, the lesson it draws from is one click away in the sidebar.