{
"sections": [
{
"name": "similarity",
"description": "",
"scores": [
{ "type": "score", "name": "Marginal distribution" },
{ "type": "score", "name": "Bi-joint distribution" },
{ "type": "score", "name": "Mutual Information" }
],
"graphs": [
{
"type": "histogram",
"data": {
"info": {
"name": "Marginal distribution",
"description": "Shows the overlay of the marginal distribution of the selected column for the real and the synthetic data. The more the histograms overlap, the better the statistics of the column have been captured."
}
}
},
{
"type": "histogram_2d",
"data": {
"info": {
"name": "Bi-joint distribution",
"description": "Shows the similarity of the joint distribution of the selected pair of columns in the real and the synthetic data. Each square represents a bin in the 2D histogram. In an ideal case all squares have a value of 1."
}
}
},
{
"type": "mutual_information",
"data": {
"info": {
"name": "Mutual Information",
"description": "Shows the difference between the mutual information matrix in the real and synthetic data."
}
}
}
]
},
{
"name": "utility",
"description": "",
"scores": [
{ "name": "Summary" },
{ "name": "pred_utility" },
{ "name": "Feature importance" },
{ "name": "Agreement rate" },
{ "name": "Query" }
],
"graphs": [
{
"type": "score_bar",
"data": {
"info": {
"name": "Predictive Performance analysis",
"description": "To compare the predictive power of the real data to its synthetic equivalent, we train the same machine learning algorithm on each, then measure performance with several metrics on an independent test set. The scores obtained for each algorithm are shown overlaid. A perfect overlap indicates no performance degradation."
}
}
},
{
"type": "confusion_matrix",
"data": {
"info": {
"name": "Confusion Matrix",
"description": "Shows the confusion matrix for a target variable."
}
}
},
{
"type": "feature_importance_graph",
"data": {
"info": {
"name": "Feature Importance Shift",
"description": "Represents the shift in position (sorted by importance) between an algorithm trained on real data and synthetic data. Ideally the graph should look like a ladder - meaning that no feature has moved."
}
}
},
{
"type": "feature_importance_grid",
"data": {
"info": {
"name": "Feature Importance Shift",
"description": "Represents the shift in position (sorted by importance) between an algorithm trained on real data and synthetic data. Ideally the positions should be on the diagonal axis - meaning that no feature has moved."
}
}
},
{
"type": "feature_importance_bar",
"data": {
"info": {
"name": "Feature Importance Shift",
"description": "Displays the difference in feature importance as a histogram. A perfect overlap indicates no difference between real and synthetic data."
}
}
}
]
},
{
"name": "privacy",
"description": "",
"scores": [
{ "name": "Summary" },
{ "name": "Density disclosure" }
],
"graphs": [
{
"type": "risk_histogram",
"data": {
"info": {
"name": "Synthetic disclosure risk histogram",
"description": "Shows the distribution of the density disclosure risk over generated synthetic records. Disclosure risk is the risk of an individual's presence in the source data being disclosed. Density disclosure risk is a specific way of calculating this based on how closely a synthetic data record can be mapped to individual records in the source data. The higher the density, the lower the risk of disclosure. The density disclosure threshold is the level of risk of disclosure that you're willing to tolerate. A value of 6% means that all records which have more than a 6% risk are removed from the synthetic data. This can impact utility, as information may be lost when removing the sensitive data points."
}
}
},
{
"type": "risk_scatter",
"data": {
"info": {
"name": "Synthetic disclosure risk scatter plot",
"description": "Shows a 2D mapping of the generated synthetic data overlaid on a density map of the real records. The color and size of each point show the density disclosure risk of that synthetic point: risky points tend to be big and red, while safe points tend to be light and small. (Note that this shows only a subsample of the total number of points generated.)"
}
}
}
]
}
]
}
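
The density disclosure threshold described under "Synthetic disclosure risk histogram" can be illustrated with a minimal Python sketch. It assumes each synthetic record already has a risk score between 0 and 1; how that score is computed is not specified in this file, and the function name `filter_by_disclosure_risk` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def filter_by_disclosure_risk(
    records: Sequence[dict],
    risks: Sequence[float],
    threshold: float = 0.06,
) -> Tuple[List[dict], int]:
    """Drop synthetic records whose risk exceeds the threshold.

    `risks[i]` is assumed to be the precomputed density disclosure
    risk of `records[i]`, expressed as a fraction (0.06 == 6%).
    Returns the kept records and the number of records removed.
    """
    if len(records) != len(risks):
        raise ValueError("records and risks must have the same length")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")

    # Keep only records at or below the tolerated risk level.
    kept = [rec for rec, risk in zip(records, risks) if risk <= threshold]
    return kept, len(records) - len(kept)


# Hypothetical example: three synthetic records with made-up risk scores.
synthetic = [{"id": 1}, {"id": 2}, {"id": 3}]
risk_scores = [0.01, 0.09, 0.06]
kept_records, removed_count = filter_by_disclosure_risk(synthetic, risk_scores)
```

Lowering the threshold removes more records, which is the utility trade-off the description mentions: every removed record is information the synthetic data no longer carries.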