Install the requirements with pip:
pip install -r requirements.txt
Run the evaluation suite on a dataset with the command:
python main.py [dataset]
For example: python main.py car
Differentially Private Synthetic Data Gym (DPSDGym) provides infrastructure to evaluate differentially private synthetic data generators on tabular datasets using the best methods from recent literature (see citations). As of now, this includes:
- propensity Mean Squared Error (pMSE) (1)
- Wasserstein Randomization (1)
- Synthetic Ranking Agreement (SRA) (2)
- Aggregate AUROC/machine learning accuracy across different epsilons (3). The models that currently support aggregation are listed below. Any sklearn model can be added through the KNOWN_MODELS list in conf.py.
- AdaBoost
- Bagging Classifier
- Logistic Regression
- Multilayer Perceptron
- Random Forest
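As a rough illustration of the first metric: pMSE is commonly defined as the mean squared difference between each record's propensity score (the probability, estimated by a classifier, that the record is synthetic) and the synthetic share of the combined data. The sketch below computes that quantity from precomputed scores. The function name and its inputs are hypothetical and are not taken from DPSDGym's code, which may compute the metric differently.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity mean squared error from precomputed propensity scores.

    propensity_scores: one score per record in the combined (real + synthetic)
        data, each the estimated probability that the record is synthetic.
    n_synthetic: number of synthetic records in the combined data.
    """
    n_total = len(propensity_scores)
    if n_total == 0:
        raise ValueError("propensity_scores must not be empty")
    if not 0 <= n_synthetic <= n_total:
        raise ValueError("n_synthetic must be between 0 and the number of scores")
    # Share of synthetic records: the score an uninformative classifier
    # would assign to every record.
    c = n_synthetic / n_total
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


# A classifier that cannot tell real from synthetic scores every record at c.
indistinguishable = pmse([0.5, 0.5, 0.5, 0.5], n_synthetic=2)
# A classifier that separates them perfectly scores real as 0, synthetic as 1.
separable = pmse([0.0, 0.0, 1.0, 1.0], n_synthetic=2)
```

Lower values mean the synthetic data is harder to distinguish from the real data.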
DPSDGym contains a data loader function that retrieves datasets from the UCI ML Repository. Currently, DPSDGym has been tested with:
- Car Evaluation Data Set
- Mushroom Data Set
- Adult Data Set
- Online Shoppers Data Set
- Bank Marketing Data Set
These datasets and their specifications are already included in the datasets.json file and should work without modification.
When adding a new UCI dataset specification to the datasets.json file, use the following format:
"dataset": {
"name": "dataset_name",
"url": "https://archive.ics.uci.edu/ml/machine-learning-databases/direct/link/to/data.data",
"categorical_columns": "col1,col2,col4",
"columns":"col1,col2,col3,col4",
"target": "col4",
"header": "f",
"sep": ",",
"imbalanced": "f"
}
The "columns" are assumed to be in order and to include all of the tabular columns; "categorical_columns" does not need to be ordered. The url must link directly to a csv/tsv file hosted by UCI. You must also specify whether the dataset has a header ("t" or "f") and what the separator is (it defaults to ',', but some UCI datasets use other separators). If you would like an imbalanced dataset to be rebalanced using imblearn's SMOTE implementation, set "imbalanced" to "t".
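Before adding an entry, it can help to check it against the rules above. The following is a minimal sketch of such a check. `parse_spec` and `REQUIRED_KEYS` are hypothetical helpers, not part of DPSDGym, and treating a missing "imbalanced" field as "f" is an assumption (only the ',' default for "sep" is stated above).

```python
import json

# Keys every entry is assumed to need; "sep" and "imbalanced" get defaults.
REQUIRED_KEYS = ("name", "url", "categorical_columns", "columns", "target", "header")


def parse_spec(spec):
    """Validate one datasets.json entry and return a normalized copy."""
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))

    columns = [c.strip() for c in spec["columns"].split(",")]
    raw_categorical = spec["categorical_columns"]
    categorical = [c.strip() for c in raw_categorical.split(",")] if raw_categorical else []

    # Categorical columns may be in any order but must all appear in "columns".
    unknown = sorted(set(categorical) - set(columns))
    if unknown:
        raise ValueError("categorical columns not in columns: " + ", ".join(unknown))
    if spec["target"] not in columns:
        raise ValueError("target is not one of the columns")

    return {
        "name": spec["name"],
        "url": spec["url"],
        "columns": columns,
        "categorical_columns": categorical,
        "target": spec["target"],
        "has_header": spec["header"] == "t",
        "sep": spec.get("sep", ","),
        "imbalanced": spec.get("imbalanced", "f") == "t",  # assumed default
    }


raw = """{"dataset": {
    "name": "dataset_name",
    "url": "https://archive.ics.uci.edu/ml/machine-learning-databases/direct/link/to/data.data",
    "categorical_columns": "col1,col2,col4",
    "columns": "col1,col2,col3,col4",
    "target": "col4",
    "header": "f"
}}"""
spec = parse_spec(json.loads(raw)["dataset"])
```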
UCI datasets are not standardized, although many look similar. If you add a dataset and find errors in loading or parsing, you may have to modify load_data.py to accommodate converting that dataset into a pandas DataFrame object. Datasets that are formatted similarly to mushroom, adult, etc. should work without modification.
In order to use the added dataset in evaluation, you must add it to the list of KNOWN_DATASETS.
You must also add a new configuration for every synthesizer intended to be evaluated with a new dataset. The configuration is simply the hyperparameters to be used with the synthesizer for that dataset. Inside conf.py, add a new entry under the synthesizer you intend to use in the SYNTH_SETTINGS dictionary. Specify the synthesizer's hyperparameters to be used in conjunction with the new dataset.
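The two registration steps might look roughly like the sketch below. The layout shown (a list of dataset names, and a dictionary keyed by synthesizer and then by dataset) is an assumption inferred from the description above, and the synthesizer name, dataset names, and hyperparameters are all hypothetical. Check the actual conf.py for the real structure.

```python
# Hypothetical stand-ins for the corresponding objects in conf.py.
KNOWN_DATASETS = ["car", "mushroom"]

SYNTH_SETTINGS = {
    "example_synth": {  # hypothetical synthesizer name
        "car": {"epochs": 100, "batch_size": 64},  # hypothetical hyperparameters
    },
}


def register_dataset(name, settings_by_synth):
    """Add a dataset name and its per-synthesizer hyperparameters."""
    if name not in KNOWN_DATASETS:
        KNOWN_DATASETS.append(name)
    for synth, hyperparams in settings_by_synth.items():
        # Create the synthesizer entry if needed, then store a copy of the
        # hyperparameters under the new dataset's name.
        SYNTH_SETTINGS.setdefault(synth, {})[name] = dict(hyperparams)


def synthesizers_for(name):
    """List the synthesizers that have settings for the given dataset."""
    return sorted(s for s, per_dataset in SYNTH_SETTINGS.items() if name in per_dataset)


register_dataset("wine", {"example_synth": {"epochs": 50, "batch_size": 32}})
```

A helper like `synthesizers_for` is one way to confirm that every synthesizer you plan to evaluate has an entry for the new dataset before starting a run.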
Make sure you have the following installed in a conda env before running the evaluation:
python=3.5+ pyemd imblearn scipy mlflow==1.4.0 diffprivlib pandas==0.25.3
Once the configuration and datasets are set up, the entire evaluation pipeline, with all metrics, can be run through the eval script:
python main.py [dataset]
The default epsilon increments for the evaluation (different privacy settings):
epsilons=[0.01, 0.1, 0.5, 1.0, 3.0, 6.0, 9.0]
Epsilon is usually graphed on a log scale, hence the large jumps between values. Adding more fine-grained epsilons, especially among the smaller values, is probably a good idea.
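One way to pick additional values is to space them evenly on a log scale, so they appear evenly spread on the usual plot. The sketch below is only an illustration: `log_spaced_epsilons` is a hypothetical helper, and how the resulting list is passed to the evaluation depends on DPSDGym's own configuration.

```python
import math


def log_spaced_epsilons(low, high, count):
    """Return `count` values from low to high, evenly spaced in log10."""
    if low <= 0 or high <= low:
        raise ValueError("need 0 < low < high")
    if count < 2:
        raise ValueError("count must be at least 2")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10 ** (log_low + i * step) for i in range(count)]


default_epsilons = [0.01, 0.1, 0.5, 1.0, 3.0, 6.0, 9.0]

# Add log-spaced points in the low-epsilon range, then merge with the
# defaults. Rounding removes duplicates that differ only by float error.
extra = log_spaced_epsilons(0.01, 1.0, 5)
epsilons = sorted(set(round(e, 4) for e in default_epsilons + extra))
```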
The output of the evaluation pipeline is recorded using mlflow. After a run, open the mlflow UI to begin your analysis:
mlflow ui
Because differential privacy is evaluated across a range of privacy budgets, it can be difficult to analyze the special-case metrics using the mlflow UI. We've added an accompanying notebook that helps visualize the metrics across privacy values:
jupyter notebook plot-v2.ipynb
DPSDGym draws inspiration from SDGym. It is also designed to play nicely with SDGym, so synthesizers that work for DPSDGym should also work with SDGym.