The goal of this toolbox is to make private generation of synthetic data samples accessible to machine learning practitioners. It currently implements five state-of-the-art generative models that can generate differentially private synthetic data. We evaluate the models on four public datasets from domains where privacy of sensitive data is paramount. Users can benchmark the models on the existing datasets, or feed in a new sensitive dataset and get back a synthetic dataset that can be distributed to third parties with strong differential privacy guarantees.
PATE-GAN : PATE-GAN : Generating Synthetic Data with Differential Privacy Guarantees. ICLR 2019
DP-WGAN : Implementation of private Wasserstein GAN using noisy gradient descent and moments accountant.
RON-GAUSS : Enhancing Utility in Non-Interactive Private Data Release, Proceedings on Privacy Enhancing Technologies (PETS), vol. 2019, no. 1, 2018
Private IMLE : Implementation of private Implicit Maximum Likelihood Estimation using noisy gradient descent and moments accountant.
Private PGM : Graphical-model based estimation and inference for differential privacy. Proceedings of the 36th International Conference on Machine Learning. 2019.
NOTE : Private IMLE code is released separately from this toolbox and can be found here : https://github.com/BorealisAI/IMLE. To run IMLE, do the following first:
git clone https://github.com/BorealisAI/IMLE.git
cp -r IMLE <root>/models
Also make sure to follow the build instructions in <root>/models/IMLE/dci_code/Makefile
Adult Census : The dataset comprises census attributes such as age, gender and native country, and the goal is to predict whether a person earns more than $50k a year. https://archive.ics.uci.edu/ml/datasets/adult
NHANES Diabetes : National Health and Nutrition Examination Survey (NHANES) questionnaire is used to predict the onset of type II diabetes. https://github.com/semerj/NHANES-diabetes/tree/master/data
Give Me Some Credit : Historical data are provided on 250,000 borrowers and the task is to help in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. https://www.kaggle.com/c/GiveMeSomeCredit/data
Home Credit Default Risk : Home Credit makes use of a variety of alternative data including telco and transactional information along with the client's past financial record to predict their clients' repayment abilities. https://www.kaggle.com/c/home-credit-default-risk/data
Adult Categorical : This dataset is the same as the Adult Census dataset, but the feature values for continuous attributes are put in buckets. We evaluate Private-PGM's performance on this dataset. https://github.com/ryan112358/private-pgm/tree/master/data
The datasets can be downloaded to the /data folder using download_datasets.sh and preprocessed using the scripts in the /preprocessing folder. Preprocessing is dataset-specific and mostly involves handling missing values, normalization, encoding of attribute values, and splitting the data into train and test sets.
Example :
sh download_datasets.sh adult
python preprocessing/preprocess_adult.py
The classifiers used are Logistic Regression, Multi-layer Perceptron, Gaussian Naive Bayes, Random Forest and Gradient Boosting, with default settings from sklearn.
The data needs to be in CSV format and has to be partitioned into train and test sets before feeding it to the models. The generative models are learned using the training data. The downstream classifiers are trained either on the real training data or on synthetic data generated by the models. The classifiers are evaluated on the held-out test data.
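The partitioning step can be sketched as follows. This is a minimal, dependency-free illustration rather than the toolbox's own preprocessing code; the function name `split_csv`, the 80/20 ratio and the file paths are assumptions made for the example.

```python
import csv
import random


def split_csv(src_path, train_path, test_path, test_fraction=0.2, seed=0):
    """Write a random train/test split of a CSV file, keeping the header row."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first row is treated as column names
        rows = list(reader)

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for path, part in ((train_path, train_rows), (test_path, test_rows)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train_rows), len(test_rows)
```

Calling `split_csv("all.csv", "train.csv", "test.csv")` on a hypothetical `all.csv` writes two files that both start with the original header row.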
Currently only two attribute types are supported :
- All attributes are continuous : supported models are ron-gauss, pate-gan, dp-wgan, imle
- All attributes are categorical : supported model is private-pgm. The categorical attribute values should be between 0 and max_category - 1.
If the data has both kinds of attributes, it needs to be pre-processed (discretization for continuous values, encoding for categorical attributes) to use one of the models. Missing values are not supported and need to be replaced appropriately by the user before usage.
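A minimal sketch of the two conversions mentioned above, for data that is to be made fully categorical. The helper names `encode_categories` and `discretize` are hypothetical and not part of the toolbox. Bucket boundaries are passed in as fixed arguments rather than computed from the data, and the set of category values is assumed to be public, since anything derived from the sensitive data would itself need to be accounted for in the privacy analysis.

```python
def encode_categories(values):
    """Map each distinct value to an integer code in 0 .. max_category - 1."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def discretize(values, num_bins, lower, upper):
    """Assign each value to one of num_bins equal-width buckets over [lower, upper]."""
    width = (upper - lower) / num_bins
    binned = []
    for v in values:
        clipped = min(max(v, lower), upper)  # out-of-range values go to the edge buckets
        # A value equal to `upper` falls into the last bucket.
        binned.append(min(int((clipped - lower) / width), num_bins - 1))
    return binned


encoded, mapping = encode_categories(["b", "a", "b"])
# encoded == [1, 0, 1], mapping == {"a": 0, "b": 1}
ages = discretize([17, 30, 90, 100], num_bins=4, lower=0, upper=100)
# ages == [0, 1, 3, 3]
```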
NOTE : Some imputation methods compute statistics using other data samples to fill missing values. Care needs to be taken to make the computed statistics differentially private and the cost must be added to the generative modeling privacy cost to compute the total privacy cost.
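As a rough illustration of the bookkeeping in the note above, the sketch below adds up per-step costs using basic sequential composition, under which the epsilons and the deltas of the individual steps simply sum. The function name `total_privacy_cost` and the imputation cost shown are hypothetical; tighter accounting methods can give a smaller total, so treat the result as an upper bound.

```python
def total_privacy_cost(costs):
    """Basic sequential composition over a list of (epsilon, delta) pairs."""
    epsilon = sum(eps for eps, _ in costs)
    delta = sum(dlt for _, dlt in costs)
    return epsilon, delta


imputation_cost = (0.5, 0.0)  # hypothetical cost of a DP imputation step
model_cost = (5.0, 1e-5)      # e.g. --target-epsilon=5 --target-delta=1e-5
total = total_privacy_cost([imputation_cost, model_cost])
# total == (5.5, 1e-05)
```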
The first line of the CSV data file is assumed to contain the column names, and the target column (labels) needs to be specified using the --target-variable flag when running the evaluation script as shown below.
python evaluate.py --target-variable=<> --train-data-path=<> --test-data-path=<> <model_name> --enable-privacy --target-epsilon=5 --target-delta=1e-5
Model names can be real-data, pate-gan, dp-wgan, ron-gauss, imle or private-pgm.
After preprocessing the Adult data using preprocess_adult.py, we can train a differentially private Wasserstein GAN on it and evaluate the quality of the synthetic dataset using the command below :
python evaluate.py --target-variable='income' --train-data-path=./data/adult_processed_train.csv --test-data-path=./data/adult_processed_test.csv --normalize-data dp-wgan --enable-privacy --sigma=0.8 --target-epsilon=8
AUC scores of downstream classifiers on test data :
----------------------------------------
LR: 0.7411981709396546
----------------------------------------
Random Forest: 0.7540559254517339
----------------------------------------
Neural Network: 0.7311882809628891
----------------------------------------
GaussianNB: 0.7580265076488256
----------------------------------------
GradientBoostingClassifier: 0.747129484720164
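For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a dependency-free illustration with a hypothetical function name `roc_auc`, not the toolbox's evaluation code, and its quadratic running time makes it suitable only for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# auc == 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of the test examples.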
Synthetic data can be saved in the /data folder using the flag --save-synthetic
--downstream-task :
classification or regression
--normalize-data :
Apply sigmoid function to each value in the data
--categorical :
If all attributes of the data are categorical
--target-variable :
Attribute name denoting the target
--enable-privacy :
Enables private data generation. Non-private mode can only be used for DP-WGAN and IMLE.
--target-epsilon :
epsilon parameter of differential privacy
--target-delta :
delta parameter of differential privacy
For more details refer to https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf
--sigma :
Gaussian noise variance multiplier. A larger sigma will make the model train for more epochs for the same privacy budget
--clip-coeff :
The coefficient to clip the gradients to before adding noise for private SGD training
--micro-batch-size :
Parameter to trade off speed vs efficiency. Gradients are averaged over a microbatch and then clipped before adding noise
--lap-scale :
Inverse Laplace noise scale multiplier. A larger lap_scale will reduce the noise that is added per iteration of training
--num-teachers :
Number of teacher discriminators
--teacher-iters :
Teacher iterations during training per generator iteration
--student-iters :
Student iterations during training per generator iteration
--num-moments :
Number of higher moments to use for epsilon calculation
--decay-step :
Learning rate decay step
--decay-rate :
Learning rate decay rate
--staleness :
Number of iterations after which new synthetic samples are generated
--num-samples-factor :
Number of synthetic samples generated per real data point
--clamp-lower :
Lower clamp parameter for the weights of the NN in Wasserstein GAN
--clamp-upper :
Upper clamp parameter for the weights of the NN in Wasserstein GAN
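To show how --sigma, --clip-coeff and --micro-batch-size interact, here is a minimal sketch of one noisy gradient computation in the style of private SGD. It assumes sigma acts as the usual noise multiplier, so that the Gaussian noise has standard deviation sigma * clip_coeff. The function names `clip` and `noisy_gradient` are hypothetical and gradients are plain Python lists. The toolbox's actual training loop may differ.

```python
import math
import random


def clip(vector, max_norm):
    """Scale vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vector]


def noisy_gradient(per_example_grads, micro_batch_size, clip_coeff, sigma, rng=random):
    """Average each microbatch, clip it, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    num_micro_batches = 0
    for start in range(0, len(per_example_grads), micro_batch_size):
        batch = per_example_grads[start:start + micro_batch_size]
        mean = [sum(g[i] for g in batch) / len(batch) for i in range(dim)]
        clipped = clip(mean, clip_coeff)
        total = [t + c for t, c in zip(total, clipped)]
        num_micro_batches += 1
    # Assumed noise scale: standard deviation sigma * clip_coeff per coordinate.
    noisy = [t + rng.gauss(0.0, sigma * clip_coeff) for t in total]
    return [x / num_micro_batches for x in noisy]


grads = [[3.0, 4.0], [3.0, 4.0], [0.3, 0.4], [0.3, 0.4]]
update = noisy_gradient(grads, micro_batch_size=2, clip_coeff=1.0, sigma=0.8,
                        rng=random.Random(0))
```

With a microbatch size of 1 every example is clipped individually, which is the slowest setting. Larger microbatches need fewer clipping operations per step.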