How to use the public dataset #125
Comments
This is a security issue; here are several things to consider:
It seems that even after network control, the following potential leakage path remains:

How could we prevent data leakage along the above pipeline?
As mentioned above in No. 2:

More details: users must use a special reader to pass public datasets to the trainer:

```python
trainer.train(reader=paddle.datasets.public.sample.train())

# Users can also select part of the feature columns, or apply a filter,
# to get a reader:
trainer.train(reader=paddle.datasets.public.sample.train(fields=[3, 4, 5],
                                                         filter=some_func))
```

... This reader returns encrypted data, which is decrypted by ... Store the encryption key as a ...
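To make the special-reader idea concrete, below is a minimal sketch of a reader creator that decrypts samples on the fly. The file layout, the Fernet cipher, and the key-injection path are assumptions for illustration, not the actual PaddleCloud implementation:

```python
import pickle
from cryptography.fernet import Fernet  # assumed symmetric cipher


def encrypted_reader(path, key, fields=None, filter=None):
    """Paddle-style reader creator over an encrypted sample file.

    `key` would be injected by the runtime rather than handed to user
    code; user code only sees the decrypted samples the reader yields.
    """
    cipher = Fernet(key)

    def reader():
        with open(path, "rb") as f:
            # Assumed layout: one base64 Fernet token per line.
            for line in f:
                sample = pickle.loads(cipher.decrypt(line.strip()))
                if filter is not None and not filter(sample):
                    continue
                if fields is not None:
                    sample = tuple(sample[i] for i in fields)
                yield sample

    return reader
```

A trainer would then consume it as `trainer.train(reader=encrypted_reader(...))`, mirroring the `fields`/`filter` options quoted above.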
Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.
Good point. If we allow the user to use a custom Paddle binary, he/she can always print the decrypted data. @typhoonzero and I discussed this question yesterday; maybe forbidding custom Paddle binaries and custom runtime Docker images is a good choice.
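One way to enforce that would be for the job-submission service to validate every trainer job against an image allowlist before creating pods. The sketch below is only an assumption about how such a check might look; ALLOWED_IMAGES and the pod-spec layout are illustrative:

```python
# Illustrative allowlist; the real image names would come from
# PaddleCloud's own release process.
ALLOWED_IMAGES = {
    "paddlecloud/paddle:latest",
    "paddlecloud/paddle:latest-gpu",
}


def validate_trainer_job(pod_spec):
    """Reject jobs whose containers request a non-approved runtime image."""
    for container in pod_spec["containers"]:
        image = container["image"]
        if image not in ALLOWED_IMAGES:
            raise ValueError("custom runtime image not allowed: %s" % image)
```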
PaddleCloud provides some public datasets for developers.

How to use

We can install a cluster_dataset Python package in the runtime Docker image and use it as sketched below.
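A hypothetical sketch of how such a package might be used; the cluster_dataset module name comes from the sentence above, while list, load, and the dataset name are invented for illustration:

```python
import cluster_dataset  # installed in the runtime Docker image

# Hypothetical API: discover which public datasets the image ships with.
print(cluster_dataset.list())

# Hypothetical API: obtain a Paddle-style reader creator for one dataset.
train_reader = cluster_dataset.load("uci_housing", split="train")

# The creator could then be passed to a trainer, e.g.
#   trainer.train(reader=train_reader)
for sample in train_reader():
    print(sample)
    break
```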
How to block the data leakage

Because developers can upload a trainer Python package to PaddleCloud, I think the most effective way to block data leakage is to block all connections from the Kubernetes nodes to the external Internet.
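One way to approximate this at the pod level is a Kubernetes NetworkPolicy that drops all egress except in-cluster traffic. The sketch below creates such a policy with the official Kubernetes Python client; the namespace, pod label, and cluster CIDR are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="deny-external-egress",
                                 namespace="paddlecloud"),
    spec=client.V1NetworkPolicySpec(
        # Assumed label that PaddleCloud puts on trainer pods.
        pod_selector=client.V1LabelSelector(match_labels={"app": "trainer"}),
        policy_types=["Egress"],
        # Egress is allowed only to the (assumed) cluster CIDR; traffic to
        # the external Internet is dropped.
        egress=[client.V1NetworkPolicyEgressRule(
            to=[client.V1NetworkPolicyPeer(
                ip_block=client.V1IPBlock(cidr="10.0.0.0/8"))],
        )],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="paddlecloud", body=policy)
```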