Syndat is a software package that provides basic functionality for the evaluation and visualization of synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation, and Distribution), and data can be visualized to inspect correlation structures or statistical distributions.
Install via pip:
pip install syndat
Compute data quality metrics by comparing real and synthetic data in terms of their separation complexity, distribution similarity or pairwise feature correlations:
import pandas as pd
import syndat
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
# How similar are the statistical distributions of real and synthetic features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
# How hard is it for a classifier to discriminate real and synthetic data
discrimination_score = syndat.scores.discrimination(real, synthetic)
# How well are pairwise feature correlations preserved
correlation_score = syndat.scores.correlation(real, synthetic)
Scores range from 0 to 100; a higher score indicates better data fidelity.
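To give an intuition for what a discrimination-style score on a 0-100 scale can mean, here is a minimal sketch in plain Python. It is not Syndat's implementation: the helper names `auc_from_scores` and `auc_to_fidelity`, and the linear mapping from AUC to a score, are assumptions made purely for illustration. The idea is that a classifier that cannot tell real from synthetic rows (AUC near 0.5) maps to a high score, while one that separates them perfectly (AUC near 0 or 1) maps to a low score.

```python
def auc_from_scores(real_scores, synthetic_scores):
    """Estimate AUC as the chance that a real sample receives a higher
    classifier score than a synthetic sample (ties count as one half)."""
    wins = 0.0
    for r in real_scores:
        for s in synthetic_scores:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(real_scores) * len(synthetic_scores))


def auc_to_fidelity(auc):
    """Hypothetical mapping (an assumption, not Syndat's formula):
    AUC 0.5 -> 100 (indistinguishable), AUC 0 or 1 -> 0 (fully separable)."""
    return 100.0 * (1.0 - 2.0 * abs(auc - 0.5))


# Classifier outputs ("probability of being real") for a few held-out rows.
separable = auc_from_scores([0.9, 0.8], [0.1, 0.2])
indistinguishable = auc_from_scores([0.5, 0.5], [0.5, 0.5])

print(auc_to_fidelity(separable))
print(auc_to_fidelity(indistinguishable))
```

In this toy setup, the perfectly separable case yields the lowest score and the indistinguishable case the highest, which matches the "higher is better" reading of the scores above.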
Visualize real vs. synthetic data distributions and summary statistics for each feature:
import pandas as pd
import syndat
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")
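This section does not say whether the `store_destination` directory is created automatically. As a precaution, it can be created beforehand with the standard library; the sketch below assumes the same `results/plots` path used above and is harmless if the directory already exists.

```python
from pathlib import Path

# Assumption: the plotting functions may expect the target directory to exist.
plot_dir = Path("results/plots")
plot_dir.mkdir(parents=True, exist_ok=True)  # no error if already present
```

Run this before the plotting calls so that the output files have a place to go.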