The dataset used in Part 1 is simple and can be easily modeled with just a mixture of Gaussian distributions. Most real-world datasets, however, are far more complex. In this part of the story, we apply a number of synthetic data generators to several well-known real-world datasets. Our main focus is on comparing the distributions of maximum similarity within and between the observed and synthetic datasets, to assess the extent to which they can be considered random samples from the same parent distribution.
The six datasets come from the UCI repository² and are all popular datasets that have been widely used in the machine learning literature for decades. All were chosen because they are mixed-type datasets, each with a different balance of categorical and numerical features.
The six generators represent the main approaches used in synthetic data generation: copula-based, GAN-based, VAE-based, and sequential imputation. CopulaGAN, GaussianCopula, CTGAN, and TVAE are all available from the Synthetic Data Vault library⁴; synthpop⁵ is available as an open-source R package; and UNCRi is a proprietary Unified Numerical/Categorical Representation and Inference (UNCRi) framework⁶. All generators were used with default settings.
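As an illustration of what running one of these generators with default settings looks like, here is a minimal sketch. It assumes the SDV 1.x Python API (class and method names differ between SDV versions) and a hypothetical CSV file of observed data; it is not the exact experimental setup used in the article.

```python
# Minimal sketch: fitting an SDV generator with default settings (SDV 1.x API assumed).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer  # CopulaGAN, GaussianCopula, TVAE are analogous

df = pd.read_csv("adult.csv")             # hypothetical path to an observed dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)        # infer which columns are categorical vs numerical

synthesizer = CTGANSynthesizer(metadata)  # default hyperparameters
synthesizer.fit(df)

synthetic_df = synthesizer.sample(num_rows=len(df))  # synthetic dataset of the same size
```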
The table below shows the average maximum within-set and cross-set similarities for each generator applied to each dataset. Entries highlighted in red are those for which privacy is violated (i.e., the average maximum cross-set similarity exceeds the average maximum within-set similarity of the observed data). Entries highlighted in green are those with the highest average maximum cross-set similarity (excluding entries in red). The last column displays the result of a Train on Synthetic, Test on Real (TSTR) test, in which a classifier or regressor is trained on the synthetic examples and tested on the real (observed) examples. The Boston Housing dataset is a regression task, and the mean absolute error (MAE) is reported; all other tasks are classification tasks, and the reported value is the area under the ROC curve (AUC).
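To make the TSTR protocol concrete, here is a minimal sketch. It assumes scikit-learn, features that have already been numerically encoded, and a binary classification task scored by AUC; the choice of random forest is arbitrary and not taken from the article.

```python
# Minimal TSTR sketch: train on synthetic examples, test on real (observed) examples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(X_synth, y_synth, X_real, y_real):
    """Fit a classifier on synthetic data only, then report AUC on the real data."""
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_synth, y_synth)                   # training never sees real data
    scores = clf.predict_proba(X_real)[:, 1]    # probability of the positive class
    return roc_auc_score(y_real, scores)

# For a regression task (e.g. Boston Housing), swap in a regressor and report
# sklearn.metrics.mean_absolute_error instead of AUC.
```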
The figure below displays, for each dataset, the distributions of maximum within-set and cross-set similarities corresponding to the generator that achieved the highest average maximum cross-set similarity (excluding those highlighted in red above).
From the table, we can see that for the generators that did not violate privacy, the average maximum cross-set similarity is very close to the average maximum within-set similarity of the observed data. The histograms show the distributions of these maximum similarities, and in most cases the two distributions are clearly similar; this is particularly noticeable for datasets such as Census Income. The table also shows that the generators achieving the highest average maximum cross-set similarity on each dataset (excluding those highlighted in red) also performed best on the TSTR test. Therefore, although we can never claim to have discovered the “true” underlying distribution, these results indicate that the most effective generator for each dataset has captured the important features of that underlying distribution.
Privacy
Of the six generators, only two had privacy issues: synthpop and TVAE. Each violated privacy on three of the six datasets. In two cases, TVAE on Cleveland Heart Disease and TVAE on Credit Approval, the violations were particularly serious. TVAE’s histogram for Credit Approval is shown below. It shows that the synthetic examples are far too similar to each other and to their nearest neighbors in the observed data, and that the model is a particularly poor representation of the underlying parent distribution. A possible reason is that the Credit Approval dataset contains some highly skewed numerical features.
Other observations and comments
Two GAN-based generators (CopulaGAN and CTGAN) were consistently among the worst performing generators. This was somewhat surprising given the immense popularity of GANs.
GaussianCopula’s performance was mediocre on all datasets except Wisconsin Breast Cancer, on which it achieved the equal-highest mean maximum cross-set similarity. Given that copula-based methods were expected to perform well on simple data such as the Iris dataset, which can easily be modeled with a mixture of Gaussians, this strong performance on a more complex dataset was especially surprising.
The generators that performed most consistently well across all datasets were synthpop and UNCRi, both of which operate via sequential imputation. Sequential imputation requires estimating and sampling from only a univariate conditional distribution (e.g., P(X₇ | X₁, X₂, …)), which is typically much easier than modeling and sampling from a multivariate distribution (e.g., P(X₁, X₂, X₃, …)), which is what GANs and VAEs do (implicitly). Whereas synthpop estimates these distributions using decision trees (the source of the overfitting to which synthpop is prone), the UNCRi generator estimates them using a nearest-neighbor-based approach, with hyperparameters optimized via a cross-validation procedure that prevents overfitting. A rough sketch of the sequential-imputation idea is given below.
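The sketch below illustrates the general idea, using a CART-style model per feature loosely in the spirit of synthpop’s defaults; it is not synthpop’s or UNCRi’s actual implementation, and the function name and all modeling choices are illustrative assumptions.

```python
# Illustrative sequential-imputation generator: each column X_j is sampled from a
# univariate model conditioned on the columns generated so far.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def generate_sequential(df, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    cols = list(df.columns)

    # Sample the first feature from its observed marginal distribution.
    synth = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].to_numpy(), size=n_samples)})

    for j in range(1, len(cols)):
        target = cols[j]
        # Condition on the features generated so far (one-hot encode categoricals).
        X_obs = pd.get_dummies(df[cols[:j]])
        X_syn = pd.get_dummies(synth[cols[:j]]).reindex(columns=X_obs.columns, fill_value=0)

        # Fit a univariate conditional model P(X_j | X_1, ..., X_{j-1}).
        if pd.api.types.is_numeric_dtype(df[target]):
            tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
        else:
            tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=seed)
        tree.fit(X_obs, df[target])

        # Sample each synthetic value from the observed values that fall in the
        # same leaf as the synthetic row (a crude CART-style sampling step).
        obs_leaves = tree.apply(X_obs)
        synth[target] = [rng.choice(df[target].to_numpy()[obs_leaves == leaf])
                         for leaf in tree.apply(X_syn)]

    return synth
```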
Although synthetic data generation is a new and evolving field with no standard evaluation methodology yet, there is a consensus that testing should cover fidelity, utility, and privacy. However, while each of these is important, they do not stand on an equal footing. For example, a synthetic dataset can achieve good fidelity and utility yet fail on privacy. This is not a “two out of three” situation: if the synthetic examples are too close to the observed examples (and therefore fail the privacy test), the model has been overfitted, and the fidelity and utility tests become meaningless. Some vendors of synthetic data generation software propose a single-score performance measure that combines the results of many such tests; this is essentially based on the same “two out of three” logic.
If the synthetic dataset can be considered a random sample from the same parent distribution as the observed data, then nothing more can be achieved: maximum fidelity, utility, and privacy have all been attained. The maximum-similarity test provides a measure of the extent to which two datasets can be considered random samples from the same parent distribution. It is based on the simple and intuitive notion that if the observed and synthetic datasets are random samples from the same parent distribution, then the instances should be distributed such that a synthetic instance is, on average, no more similar to its nearest observed instance than the observed instances are to one another.
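As a concrete illustration of the test, the sketch below computes the average maximum within-set and cross-set similarities. The article does not specify the similarity measure used for mixed-type data, so this sketch assumes purely numerical, standardized features and uses 1 / (1 + Euclidean distance) as a stand-in similarity; the cross-set direction (each synthetic instance to its nearest observed instance) is also an assumption.

```python
# Sketch of the maximum-similarity test.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def max_similarities(A, B=None):
    """Similarity of each row of A to its most similar row of B
    (or to its most similar *other* row of A if B is None)."""
    if B is None:
        dist, _ = NearestNeighbors(n_neighbors=2).fit(A).kneighbors(A)
        d = dist[:, 1]            # column 0 is the self-match at distance 0
    else:
        dist, _ = NearestNeighbors(n_neighbors=1).fit(B).kneighbors(A)
        d = dist[:, 0]
    return 1.0 / (1.0 + d)        # assumed similarity: 1 / (1 + distance)

def similarity_summary(observed, synthetic):
    """Average maximum within-set and cross-set similarities."""
    avg_within_obs = max_similarities(observed).mean()
    avg_cross = max_similarities(synthetic, observed).mean()  # synthetic -> nearest observed
    return {
        "avg_max_within_observed": avg_within_obs,
        "avg_max_cross_set": avg_cross,
        # Privacy is in question when the cross-set average exceeds the
        # within-set average of the observed data.
        "privacy_flag": avg_cross > avg_within_obs,
    }
```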
We propose the following single-score measure of the quality of synthetic datasets.
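The measure is a ratio of the two averages defined earlier; written out (the notation below is mine, not the article’s), it is presumably:

```latex
\text{score} \;=\;
\frac{\text{average maximum cross-set similarity}}
     {\text{average maximum within-set similarity of the observed data}}
```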
The closer this ratio is to 1 (without exceeding 1), the higher the quality of the synthetic data. Of course, it should also be accompanied by a histogram sanity check.
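For the sanity check, a minimal plotting sketch is shown below; it continues the earlier sketch, reusing its `max_similarities` helper and the `observed` and `synthetic` feature arrays, and assumes matplotlib.

```python
# Histogram sanity check: overlay the two distributions of maximum similarity.
import matplotlib.pyplot as plt

within = max_similarities(observed)               # observed-to-observed
cross = max_similarities(synthetic, observed)     # synthetic-to-nearest-observed

plt.hist(within, bins=30, alpha=0.5, label="max within-set similarity (observed)")
plt.hist(cross, bins=30, alpha=0.5, label="max cross-set similarity")
plt.xlabel("maximum similarity")
plt.ylabel("count")
plt.legend()
plt.show()
```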