[FEATURE/BUG] Enable definition of training and validation set for privacy-preserving machine learning #377
Comments
When a user selects random records by clicking on "Select Randomly", the view output does not reflect that, e.g. when selecting 0.80:
I get the output:
I am using the view output to get the training and the testing subsets, so this causes a problem for me.
This is OK and the expected behaviour.
An implementation of this is provided in the following branch: feature-training-test. It might still need a little bit of polishing, though.
I was wondering if this feature can already be used in its current state?
Yes, it should work. We would be happy to receive feedback.
To be clear, the feature lives in this branch: https://github.com/arx-deidentifier/arx/tree/feature-training-test
Is your feature request related to a problem? Please describe.
The privacy-preserving machine learning framework in ARX uses k-fold cross-validation to quantify the performance of privacy-preserving models. This can lead to misleading estimates, as training and validation sets both influence the optimization process performed during anonymization.
Describe the solution you'd like
As an alternative, it would be good to enable users to specify a training and a validation set in such a way that only the training set influences the anonymization process. This can easily be done in ARX by using the "research subset" feature, which allows selecting a subset of the records in a dataset that are then anonymized. What would need to be added is a feature that allows specifying that machine learning performance is determined based on the set of records that are not included in the research subset.
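To illustrate the idea, here is a rough sketch of the split in plain Python. It does not use the ARX API: the function name `split_records`, its parameters, and the seed are hypothetical and only show how the records could be partitioned so that the training indices play the role of the research subset and the remaining indices are held out for validation.

```python
import random


def split_records(num_records, training_fraction, seed=0):
    """Partition record indices into a training and a validation set.

    Hypothetical helper, not part of ARX. The training indices stand in
    for the research subset (the only records that would influence the
    anonymization); the validation indices are all remaining records.
    """
    if not 0.0 < training_fraction < 1.0:
        raise ValueError("training_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(num_records))
    rng.shuffle(indices)
    cut = round(num_records * training_fraction)
    training = sorted(indices[:cut])
    validation = sorted(indices[cut:])
    return training, validation


# Example: select 0.80 of 10 records for training, hold out the rest.
training, validation = split_records(num_records=10, training_fraction=0.80)
```

The important property is that the two sets are disjoint and together cover the whole dataset, so performance measured on the validation records is not affected by the optimization performed on the training records.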