Add iterable dataset support for multiprocess DataLoader #25558

heavengate · 2020-07-16T05:36:15Z

PR types

New features

PR changes

APIs

Describe

add IterableDataset support for multiprocess DataLoader

add paddle.io.IterableDataset base class
add paddle.io.get_worker_info to get worker process information for data splitting in IterableDataset

paddle-bot-old · 2020-07-16T05:36:24Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

LielinJiang · 2020-08-05T04:15:08Z

python/paddle/fluid/dataloader/dataloader_iter.py

+def get_worker_info():
+    """
+    Get DataLoader worker process information function, this function is
+    used to splitd data copy in worker process for IterableDataset


splitd typo?

Done, thanks!

LielinJiang · 2020-08-05T04:17:33Z

python/paddle/fluid/dataloader/dataset.py

+
+class IterableDataset(Dataset):
+    """
+    An abstract class to encapsulates methods and behaviors of iterable datasets.


encapsulates -> encapsulate

Done, thanks!

chenwhql · 2020-08-05T05:23:50Z

python/paddle/fluid/dataloader/dataset.py

+    An abstract class to encapsulates methods and behaviors of iterable datasets.
+
+    All datasets in iterable-style(can only get sample one by one sequentially, like
+    a python iterater) should be a subclass of `paddle.io.IterableDataset`. All subclasses should


iterater -> iterator

Done, thanks!

chenwhql · 2020-08-05T05:25:02Z

python/paddle/fluid/dataloader/dataset.py

+
+    :code:`__iter__`: yield sample sequentially. This method is required by reading dataset sample in :code:`paddle.io.DataLoader`.
+
+    NOTE: do not implement :code:`__getitem__` and :code:`__len__` in IterableDataset, should not be called either.


use new doc style? NOTE -> .. note::

Done, thanks!
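For illustration, the contract described in the docstring quoted above (implement `__iter__`, do not implement or call `__getitem__` and `__len__`) can be sketched in plain Python. This is only a sketch under stated assumptions: `SketchIterableDataset` and `RangeDataset` are hypothetical names, and raising `RuntimeError` is an assumed behavior, not a copy of the code in this PR.

```python
class SketchIterableDataset(object):
    """Hypothetical stand-in for an iterable-style dataset base class."""

    def __iter__(self):
        # Subclasses are expected to yield samples one by one here.
        raise NotImplementedError(
            "{} must implement __iter__".format(type(self).__name__))

    def __getitem__(self, idx):
        # Assumption: random access is rejected for iterable-style datasets.
        raise RuntimeError(
            "__getitem__ is not supported by {}".format(type(self).__name__))

    def __len__(self):
        # Assumption: the total sample count is not exposed either.
        raise RuntimeError(
            "__len__ is not supported by {}".format(type(self).__name__))


class RangeDataset(SketchIterableDataset):
    """Hypothetical subclass that yields the integers in [start, end)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        for value in range(self.start, self.end):
            yield value


if __name__ == "__main__":
    # Samples can only be consumed sequentially.
    for sample in RangeDataset(2, 9):
        print(sample)
```

A consumer only ever needs `iter(dataset)`, which is why the two map-style methods can be left unusable.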

chenwhql · 2020-08-05T05:26:17Z

python/paddle/fluid/dataloader/dataset.py

+                print(img, lbl)
+
+    When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
+    will yield whole dataset samples, which means samples in dataset will be repeat in


repeat -> repeated

Done, thanks!
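To make the duplication described in the quoted docstring concrete, here is a small simulation in plain Python that does not start any real worker processes. It assumes each worker knows its own id and the total number of workers (the kind of information `paddle.io.get_worker_info` is described as providing); the helper names `unsharded` and `sharded` are hypothetical.

```python
import math


def unsharded(start, end, num_workers):
    """Each simulated worker iterates its own full copy of range(start, end)."""
    return [list(range(start, end)) for _ in range(num_workers)]


def sharded(start, end, num_workers):
    """Each simulated worker iterates only its own contiguous slice."""
    per_worker = int(math.ceil((end - start) / float(num_workers)))
    shards = []
    for worker_id in range(num_workers):
        lo = start + worker_id * per_worker
        hi = min(lo + per_worker, end)
        shards.append(list(range(lo, hi)))
    return shards


if __name__ == "__main__":
    # Without splitting, every sample appears num_workers times in total.
    print(unsharded(2, 9, num_workers=2))
    # With splitting, every sample appears exactly once across all workers.
    print(sharded(2, 9, num_workers=2))
```

Contiguous slices are only one possible split; striding by worker id would avoid duplicates just as well.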

chenwhql · 2020-08-05T05:31:00Z

python/paddle/fluid/reader.py

@@ -286,6 +288,10 @@ def forward(self, image, label=None):

            # -------------------------------------------------------

+    Note:


NOTE -> .. note:: ?

Done, thanks!

chenwhql · 2020-08-05T05:32:04Z

python/paddle/fluid/reader.py

+                    format(shuffle))
+            if batch_sampler is not None:
+                raise ValueError(
+                    "IterableDataset expect unspecified batch_sample")


batch_sample -> batch_sampler ?

Done, thanks!

qingqing01 · 2020-08-06T09:21:38Z

python/paddle/fluid/dataloader/dataloader_iter.py

        while not self._thread_done_event.is_set():
+            # For IterableDataset, batch indices is generate infinitely


is generate -> is generated

Done, thanks!

qingqing01 · 2020-08-06T09:34:22Z

python/paddle/fluid/dataloader/dataset.py

+    """
+    An abstract class to encapsulates methods and behaviors of iterable datasets.
+
+    All datasets in iterable-style(can only get sample one by one sequentially, like


iterable-style(can -> iterable-style ( can
python -> Python

Done, thanks!

qingqing01 · 2020-08-06T09:44:06Z

python/paddle/fluid/dataloader/dataset.py

+            place = fluid.CPUPlace()
+            with fluid.dygraph.guard(place):
+                dataset = SplitedIterableDataset(start=2, end=9)
+                dataloader = DataLoader(


Why is fluid.dygraph.guard needed for DataLoader? Can't it be used in static graph mode?

In static mode, fluid.data variables would have to be defined and passed in as the feed_list parameter, which is not what this test case is about, so dynamic mode is used to keep the test code simple.

Heeenrrry · 2020-08-10T11:27:38Z

python/paddle/fluid/dataloader/batch_sampler.py

+    def __init__(self, dataset, batch_size=1):
+        assert isinstance(
+            dataset, IterableDataset
+        ), "dataset should be an instnace of paddle.io.IterableDataset"


Done, thanks!

Heeenrrry · 2020-08-10T11:32:14Z

python/paddle/fluid/dataloader/dataset.py

+
+    When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
+    will yield whole dataset samples, which means samples in dataset will be repeated in
+    :attr:`num_workers` times. If it is require that each sample to be yield only once, there


If it is required for each sample to yield once only, ...

Done, thanks!

Heeenrrry · 2020-08-10T11:35:47Z

python/paddle/fluid/dataloader/dataset.py

+    will yield whole dataset samples, which means samples in dataset will be repeated in
+    :attr:`num_workers` times. If it is require that each sample to be yield only once, there
+    are two methods to configure different copy in each worker process to avoid duplicate data
+    among workers as follows. In both the two methods, worker information that can be get in


In both the methods, ... can be getted in...

Done, thanks!

Heeenrrry · 2020-08-10T11:40:17Z

python/paddle/fluid/reader.py

@@ -136,7 +137,8 @@ class DataLoader(object):

    Args:  
        dataset(Dataset): the dataset to load data from, should be an
-            instance of subclass of :code:`paddle.io.Dataset`.
+            instance of subclass of :code:`paddle.io.Dataset` or
+            :code:`paddle.io.IterableDataset`.
        feed_list (list(Variable)|tuple(Variable)): feed variable list.


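According to the new documentation guidelines, every mention of "variable" should be changed to "tensor": feed_list (list(Tensor)|tuple(Tensor)): feed tensor list. Please update the wording in the other places as well.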

Done, thanks!

Heeenrrry · 2020-08-10T12:00:32Z

python/paddle/fluid/dataloader/dataloader_iter.py

+    """
+    Get DataLoader worker process information function, this function is
+    used to split data copy in worker process for IterableDataset
+    (see :code:`paddle.io.IterableDataset`), worker informations contains


information

Done, thanks!

chenwhql

LGTM

Heeenrrry

LGTM

guoshengCS

LGTM

guoshengCS · 2020-08-11T05:21:18Z

python/paddle/fluid/reader.py

+                    "IterableDataset expect unspecified batch_sampler")
+        else:
+            self.dataset_kind = _DatasetKind.MAP
+
        if batch_sampler is not None:
            assert isinstance(batch_sampler, BatchSampler), \
                "batch_sampler should be None or subclass instance " \


Can we drop the BatchSampler requirement later? Maybe batch_sampler could also be any iterable object.

Additionally, can we support a user-specified batch_sampler for IterableDataset later? It seems that users can't customize sampling or batching strategies even on their own, since iterable data is only supported through IterableDataset and _InfiniteIterableSampler.

And will we also support Sampler, in addition to BatchSampler, later?

BatchSampler can be customized for a map-style dataset (one that implements __getitem__). For IterableDataset, which can only yield samples sequentially, I couldn't think of scenarios that require batch_sampler customization. Of course it should be supported if there is a need for customization.

Sampler is mostly a sub-function of BatchSampler. IMHO, a custom Sampler can be defined inside a custom BatchSampler?

An example of batch_sampler customization is Transformer: it counts batch size in words rather than in sentences. Currently it uses a map-style dataset.

When custom batching strategies are needed, Sampler may be abstracted out of BatchSampler so that the sampling strategies can be reused.

However, this is not a blocker and we can consider it later. I am also trying to provide a helper so that it can be used like this

Thanks, I'll try to do some research and try to add this later~
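As a rough illustration of the word-count batching mentioned in this thread, the sketch below groups sentences from a sequential stream so that each batch stays within a token budget. It is plain Python and not tied to any Paddle API; `batch_by_token_count` and `max_tokens` are hypothetical names, and a sentence longer than the budget is simply emitted as a batch of its own.

```python
def batch_by_token_count(sentences, max_tokens):
    """Yield lists of sentences whose total token count is at most max_tokens.

    `sentences` is any iterable of token lists, consumed sequentially.
    """
    batch, batch_tokens = [], 0
    for sentence in sentences:
        n_tokens = len(sentence)
        # Close the current batch if adding this sentence would exceed the budget.
        if batch and batch_tokens + n_tokens > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sentence)
        batch_tokens += n_tokens
    # Emit whatever is left once the stream is exhausted.
    if batch:
        yield batch


if __name__ == "__main__":
    stream = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
    for batch in batch_by_token_count(stream, max_tokens=3):
        print(batch)
```

Because the function only iterates its input once, the same idea would apply to a map-style or an iterable-style source.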

heavengate added 2 commits July 15, 2020 14:54

add IterableDataset support in multiprocess DataLoader. test=develop

3a41a44

fix single process exit. test=develop

7575a0f

heavengate added 6 commits July 22, 2020 14:36

epoch end success. test=develop

f3da524

unittest success. test=develop

a1d5456

add doc

6671b7d

polish comment

f1a18a5

merge develop. test=develop

978bde1

add get_worker_info doc

85ac743

heavengate requested review from qingqing01, LielinJiang, guoshengCS and chenwhql July 30, 2020 07:28

heavengate added 6 commits July 30, 2020 12:42

fix doc. test=develop

78851a3

fix test_batch_sampler

a92c221

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_iterable_dataset_support

0ebf5b7

fix unittest after merging develop. test=develop

789f9b4

fix sample code. test=develop

11fa586

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into add_iterable_dataset_support

cf8ad66

LielinJiang reviewed Aug 5, 2020

chenwhql reviewed Aug 5, 2020

qingqing01 reviewed Aug 6, 2020

fix according reviews. test=develop

2f5ee04

heavengate changed the title ~~Add iterable dataset support~~ Add iterable dataset support for multiprocess DataLoader Aug 7, 2020

heavengate added 4 commits August 7, 2020 10:52

add sample code for get_worker_info. test=develop

447573b

add runtime error test. test=develop

200c5d8

merge develop

c049e71

fix unittest after merge develop. test=develop

4120baa

Heeenrrry reviewed Aug 10, 2020

heavengate added 2 commits August 10, 2020 13:43

fix doc. test=develop

6bd3959

fix doc. test=develop

751fca0

chenwhql approved these changes Aug 11, 2020

chalsliu approved these changes Aug 11, 2020

qingqing01 approved these changes Aug 11, 2020

LielinJiang approved these changes Aug 11, 2020

Heeenrrry approved these changes Aug 11, 2020

guoshengCS approved these changes Aug 11, 2020

heavengate merged commit dbc88bb into PaddlePaddle:develop Aug 12, 2020

heavengate deleted the add_iterable_dataset_support branch August 12, 2020 02:30

heavengate mentioned this pull request Aug 27, 2020

fix dataloader performace decrease & unittest error #26739

Merged

