How to use the public dataset #125

Yancey1989 opened this issue Jun 6, 2017 · 5 comments
@Yancey1989

PaddleCloud provides some public datasets for developers.

How to use

We can install a cluster_dataset Python package in the runtime Docker image and use it as:

from paddle.cloud.cluster_dataset import mnist
...

trainer.train(reader=mnist.train(), ...)

How to block data leakage

Because developers can upload a trainer Python package to PaddleCloud, I think the most effective way to block data leakage is to block all connections from the Kubernetes nodes to the external internet.


typhoonzero commented Jun 6, 2017

This is a security issue; here are several things to consider:

  1. Can nodes/pods access the internet? Users may need to download dependencies or public data, but they would also be able to upload the public datasets elsewhere by injecting code into reader(). Maybe allowing no internet access from nodes/pods is a good idea (see the sketch after this list).
  2. Users can save training output models to their own cloud storage space and then download the models. This is another possible vulnerability: users can inject code into reader() to save the original data directly to that storage space. We can avoid this in the following ways:
    1. Validate the user-uploaded program to detect such code.
    2. Encrypt public datasets and store the key in a secure place on the cloud that can only be read by reader().
  3. We need to know whether an attack has occurred. This may be really hard; we can start by inspecting network bandwidth to infer unusual traffic.
  4. About network policies: we don't have network policies currently, so one user can sniff around the network or connect to any open port in the whole cluster. This may not lead to data leakage directly, but it is still a problem.
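
For points 1 and 4, a default-deny egress NetworkPolicy is one way to cut trainer pods off from the external internet. Below is a minimal sketch using the official kubernetes Python client; the "paddlecloud" namespace and policy name are assumptions, and enforcing egress rules needs Kubernetes 1.8+ plus a CNI plugin (e.g. Calico) that supports network policies:

# Hedged sketch: drop all outbound traffic from pods in a hypothetical
# "paddlecloud" namespace. Requires the `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

deny_egress = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="deny-trainer-egress",
                                 namespace="paddlecloud"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),  # empty selector = every pod
        policy_types=["Egress"],
        egress=[],  # no egress rules at all, so all outbound traffic is dropped
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="paddlecloud", body=deny_egress)

In-cluster traffic to CephFS and the PaddleCloud API server would then need explicit allow rules on top of this default deny.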


wangkuiyi commented Jun 6, 2017

It seems that even after networking control, the following potential leakage path remains:

  1. User programs should be able to read the data,
  2. User programs should be able to write data to CephFS, and
  3. Users are allowed to upload and download data from CephFS.

How could we prevent data leakage along the above pipeline?


typhoonzero commented Jun 7, 2017

As mentioned above in point 2:

Encrypt public datasets and store the key in a secure place on the cloud that can only be read by reader().

More details:

Users must use a special reader to pass public datasets to the trainer:

...
trainer.train(reader=paddle.datasets.public.sample.train())
# Users can also select a subset of the feature columns or apply a filter to get a reader:
trainer.train(reader=paddle.datasets.public.sample.train(fields=[3,4,5], filter=some_func))
...

This reader returns encrypted data, which is decrypted by DataProviderConverter or on the C++ side. We need to implement an encryption tool to encrypt the data and upload it to the cloud, and then implement decryption functions; the decryption must not be accessible to users.

Store the encryption key as a Kubernetes secret and keep it secret.
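
To make the proposed data flow concrete, here is a minimal Python sketch of such a reader. The key path, dataset path, and record encoding are all assumptions (illustrated with Fernet from the cryptography package); in the actual design, decryption would live in DataProviderConverter or on the C++ side so that user code can never reach the key or the plaintext:

# Hedged sketch of the encrypted-reader idea; all paths and names are
# hypothetical. The key comes from a Kubernetes secret mounted into the
# pod, readable only by the system user that runs the reader.
import pickle
from cryptography.fernet import Fernet

KEY_PATH = "/etc/paddle-secrets/dataset.key"   # hypothetical secret mount
DATA_PATH = "/pfs/public/sample/train.enc"     # hypothetical dataset file

def train(fields=None, filter=None):
    """Return a reader over the encrypted public dataset."""
    with open(KEY_PATH, "rb") as f:
        fernet = Fernet(f.read())

    def reader():
        # One Fernet token per line; tokens are base64, so newline-safe.
        with open(DATA_PATH, "rb") as data:
            for token in data:
                record = pickle.loads(fernet.decrypt(token.strip()))
                if filter is not None and not filter(record):
                    continue
                if fields is not None:
                    record = tuple(record[i] for i in fields)
                yield record

    return reader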

@helinwang

Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.

@Yancey1989

Do we allow users to use a custom-built Paddle? If so, a user can easily access the decrypted data by writing a custom layer.

Good point. If we allow the user to use a custom Paddle binary, he/she can always print the decrypted data. @typhoonzero and I discussed this question yesterday; maybe preventing custom Paddle binaries and custom runtime Docker images is a good choice.
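
A simple form of that restriction is to validate every job submission against an image allowlist before the pod spec reaches Kubernetes. A minimal sketch, in which the allowed tags and the job-spec layout are assumptions:

# Hedged sketch: reject jobs whose containers use a runtime image that
# is not whitelisted. Image tags and spec layout are hypothetical.
ALLOWED_IMAGES = {
    "paddlepaddle/paddle:0.10.0",
    "paddlepaddle/paddle:0.10.0-gpu",
}

def validate_runtime_images(job_spec):
    """Raise if any container in the job uses a non-whitelisted image."""
    for container in job_spec["spec"]["template"]["spec"]["containers"]:
        if container["image"] not in ALLOWED_IMAGES:
            raise ValueError(
                "custom runtime image not allowed: %s" % container["image"])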
