How to use the public dataset #125

Yancey1989 · 2017-06-06T00:07:45Z

Paddlecloud provides some public dataset for the developer.

How to Usage

We can install a cluster_dataset python package in the runtime Docker image, and use it as:

from paddle.cloud.cluster_dataset import mnist
...

trainer.train(reader = mnist.train(), ...)

How to block the data leakage

Because of developers can upload a trainer python package to the PaddleCloud, so I think the most effective way to block the data leakage is blocking all connections of Kubernetes nodes to the exteranl internal.

The text was updated successfully, but these errors were encountered:

typhoonzero · 2017-06-06T02:06:20Z

This is a security issue, here is several things to consider about:

Can nodes/pods access the internet? Users may need to download dependencies or public data, but also being able to upload public datasets to other places by injecting some code in the reader(). Maybe no internet access from nodes/pods is a good idea.
Users can save training output models to their own cloud storage space and then download the models. This is another possible vulnerability, users can inject code to reader() to save the original data directly to the storage space. We can avoid this by the following ways:
1. Validate the user uploaded program to find.
2. Encrypt public datasets and store the key in a secure place on cloud which can only be read by reader()
We need to know if their's attack occurred. This may be really hard, we can starting from inspecting network bandwidth to inference the unusual traffics.
About network policies. We don't have network policies currently, so one user can sniff around the network or connect to any opened ports in the whole cluster. This may not lead to data leakage directly but still a problem.

wangkuiyi · 2017-06-06T02:48:47Z

It seems that after networking control, there leaves the following potential leakage path:

User programs should be able to read the data,
User programs should be able to write data to CephFS, and
Users are allowed to upload and download data from CephFS.

How could we prevent data leakage alone above pipeline?

typhoonzero · 2017-06-07T02:06:37Z

As mentioned above in No.2:

Encrypt public datasets and store the key in a secure place on cloud which can only be read by reader()

More details:

Users must use a special reader to pass public datasets to trainer:

...
trainer.train(reader=paddle.datasets.public.sample.train())
# Users can also use part of the feature columns or filters to get a reader:
trainer.train(reader=paddle.datasets.public.sample.train(fields=[3,4,5], filter=some_func))
...

This reader returns encrypted data which is decrypted by DataProviderConverter or in the c++ side. We need to implement a encrypt tool to encrypt data and upload them to cloud and then implement functions to decrypt, decrypting can not be accessed by users.

Store the encrypt key as a secret storage in kubernetes and keep it secretly.

helinwang · 2017-06-12T18:47:38Z

Do we allow user to use custom built Paddle? If so, user can easily access the decrypted data by writing a custom layer.

Yancey1989 · 2017-06-13T02:15:25Z

Do we allow user to use custom built Paddle? If so, user can easily access the decrypted data by writing a custom layer.

Good point, if we allow the user to use a custom Paddle binary file, he/she can always print the decrypted data, @typhoonzero and I discussed this question at yesterday, maybe prevent custom Paddle binary files and prevent custom runtime Docker image is a good choice.

Yancey1989 assigned typhoonzero Jun 6, 2017

typhoonzero mentioned this issue Jun 12, 2017

[WIP]Simple public dataset encryption PaddlePaddle/Paddle#2447

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to use the public dataset #125

How to use the public dataset #125

Yancey1989 commented Jun 6, 2017

typhoonzero commented Jun 6, 2017 •

edited

Loading

wangkuiyi commented Jun 6, 2017 •

edited

Loading

typhoonzero commented Jun 7, 2017 •

edited

Loading

helinwang commented Jun 12, 2017

Yancey1989 commented Jun 13, 2017

How to use the public dataset #125

How to use the public dataset #125

Comments

Yancey1989 commented Jun 6, 2017

How to Usage

How to block the data leakage

typhoonzero commented Jun 6, 2017 • edited Loading

wangkuiyi commented Jun 6, 2017 • edited Loading

typhoonzero commented Jun 7, 2017 • edited Loading

helinwang commented Jun 12, 2017

Yancey1989 commented Jun 13, 2017

typhoonzero commented Jun 6, 2017 •

edited

Loading

wangkuiyi commented Jun 6, 2017 •

edited

Loading

typhoonzero commented Jun 7, 2017 •

edited

Loading