Add iterable dataset support for multiprocess DataLoader #25558
Thanks for your contribution!
… add_iterable_dataset_support
def get_worker_info():
    """
    Get DataLoader worker process information function, this function is
    used to splitd data copy in worker process for IterableDataset
splitd typo?
Done, thanks!

class IterableDataset(Dataset):
    """
    An abstract class to encapsulates methods and behaviors of iterable datasets.
encapsulates -> encapsulate
Done, thanks!
An abstract class to encapsulates methods and behaviors of iterable datasets.

All datasets in iterable-style(can only get sample one by one sequentially, like
a python iterater) should be a subclass of `paddle.io.IterableDataset`. All subclasses should
iterater -> iterator
Done, thanks!

:code:`__iter__`: yield sample sequentially. This method is required by reading dataset sample in :code:`paddle.io.DataLoader`.

NOTE: do not implement :code:`__getitem__` and :code:`__len__` in IterableDataset, should not be called either.
Use the new doc style? NOTE -> .. note::
Done, thanks!
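To make the docstring under review concrete, here is a minimal pure-Python sketch of an iterable-style dataset. It does not import Paddle: `RangeDataset` is a hypothetical class that only follows the rule described above (implement `__iter__`, leave `__getitem__` and `__len__` undefined), and a real dataset would subclass `paddle.io.IterableDataset` instead.

```python
class RangeDataset:
    """Hypothetical iterable-style dataset yielding the integers in [start, end)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        # Samples are produced one by one, in order; there is no random access.
        for value in range(self.start, self.end):
            yield value


dataset = RangeDataset(start=2, end=9)
samples = [sample for sample in dataset]
print(samples)
```

Because neither `__getitem__` nor `__len__` is defined in this sketch, indexing the dataset or calling `len()` on it fails, which matches the intent of the note.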
print(img, lbl)

When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
will yield whole dataset samples, which means samples in dataset will be repeat in
repeat -> repeated
Done, thanks!
python/paddle/fluid/reader.py
Outdated
@@ -286,6 +288,10 @@ def forward(self, image, label=None):

# -------------------------------------------------------

Note:
NOTE -> .. note:: ?
Done, thanks!
python/paddle/fluid/reader.py
Outdated
            format(shuffle))
    if batch_sampler is not None:
        raise ValueError(
            "IterableDataset expect unspecified batch_sample")
batch_sample -> batch_sampler ?
Done, thanks!
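The check discussed in this thread can be sketched in isolation. The snippet below is a simplified illustration, not the code from `reader.py`: `MapDatasetSketch`, `IterableDatasetSketch`, and `check_loader_args` are hypothetical names, and the wording of the error messages is an assumption. Only the subclass relationship and the rule that an iterable-style dataset rejects `shuffle` and a user-supplied `batch_sampler` come from the diff.

```python
class MapDatasetSketch:
    """Hypothetical stand-in for a map-style dataset base class."""


class IterableDatasetSketch(MapDatasetSketch):
    """Hypothetical stand-in for an iterable-style dataset base class."""


def check_loader_args(dataset, shuffle=False, batch_sampler=None):
    """Return the dataset kind, rejecting options that need random access."""
    if isinstance(dataset, IterableDatasetSketch):
        if shuffle:
            # Assumed message wording; shuffling needs indexable samples.
            raise ValueError(
                "IterableDataset does not support shuffle, "
                "got shuffle={}".format(shuffle))
        if batch_sampler is not None:
            # Index-based batch samplers cannot drive a sequential stream.
            raise ValueError(
                "IterableDataset expects batch_sampler to be unspecified")
        return "iterable"
    return "map"


print(check_loader_args(IterableDatasetSketch()))
print(check_loader_args(MapDatasetSketch(), shuffle=True))
```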
while not self._thread_done_event.is_set():
    # For IterableDataset, batch indices is generate infinitely
is generate -> is generated
Done, thanks!
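The comment in the diff says that batch indices are generated infinitely for an iterable-style dataset. One way this could work is sketched below in plain Python: a sampler yields placeholder index batches forever, and iteration stops only when the dataset iterator is exhausted. `InfiniteBatchIndexSketch` and `fetch_batches` are hypothetical names, and keeping the final partial batch is an assumption, not necessarily what the PR does.

```python
import itertools


class InfiniteBatchIndexSketch:
    """Hypothetical sampler that yields placeholder index batches forever."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size

    def __iter__(self):
        while True:
            # The indices carry no meaning; only the batch length is used.
            yield [None] * self.batch_size


def fetch_batches(dataset, batch_size):
    """Group a sequential stream into batches until the stream runs out."""
    data_iter = iter(dataset)
    for placeholder in InfiniteBatchIndexSketch(batch_size):
        batch = list(itertools.islice(data_iter, len(placeholder)))
        if not batch:
            return  # dataset exhausted: this is what ends the infinite loop
        yield batch


batches = list(fetch_batches(range(2, 9), batch_size=3))
print(batches)  # [[2, 3, 4], [5, 6, 7], [8]]
```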
"""
An abstract class to encapsulates methods and behaviors of iterable datasets.

All datasets in iterable-style(can only get sample one by one sequentially, like
iterable-style(can -> iterable-style ( can
python -> Python
Done, thanks!
place = fluid.CPUPlace()
with fluid.dygraph.guard(place):
    dataset = SplitedIterableDataset(start=2, end=9)
    dataloader = DataLoader(
Why is `fluid.dygraph.guard` needed for DataLoader? Can't it be used in static graph mode?
In static mode, `fluid.data` should be defined and passed as the `feed_list` parameter, which is not relevant to this test case, so dynamic mode is used to simplify the test code.
def __init__(self, dataset, batch_size=1):
    assert isinstance(
        dataset, IterableDataset
    ), "dataset should be an instnace of paddle.io.IterableDataset"
instance
Done, thanks!

When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
will yield whole dataset samples, which means samples in dataset will be repeated in
:attr:`num_workers` times. If it is require that each sample to be yield only once, there
If it is required for each sample to yield once only, ...
Done, thanks!
will yield whole dataset samples, which means samples in dataset will be repeated in
:attr:`num_workers` times. If it is require that each sample to be yield only once, there
are two methods to configure different copy in each worker process to avoid duplicate data
among workers as follows. In both the two methods, worker information that can be get in
In both the methods, ... can be getted in...
Done, thanks!
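The docstring above warns that every worker holds a full copy of the dataset, so samples are duplicated unless each copy is restricted to its own slice. A plain-Python sketch of one possible splitting rule follows, using the same `start=2, end=9` range as the test case in this PR. `WorkerInfoSketch` is a hypothetical stand-in for the object a worker would obtain from `paddle.io.get_worker_info`; its field names `id` and `num_workers` are assumptions made for this example, and `worker_range` is a hypothetical helper.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for per-worker information; field names are assumed.
WorkerInfoSketch = namedtuple("WorkerInfoSketch", ["id", "num_workers"])


def worker_range(start, end, info):
    """Return the sub-range of [start, end) that one worker should yield."""
    if info is None:
        # Single-process loading: no split is needed.
        return range(start, end)
    per_worker = int(math.ceil((end - start) / float(info.num_workers)))
    lo = start + info.id * per_worker
    hi = min(lo + per_worker, end)
    return range(lo, hi)  # empty when lo >= hi, i.e. more workers than data


shards = [list(worker_range(2, 9, WorkerInfoSketch(i, 2))) for i in range(2)]
print(shards)  # [[2, 3, 4, 5], [6, 7, 8]]
```

With this rule the shards are contiguous and disjoint, so their union covers each sample exactly once regardless of the number of workers.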
python/paddle/fluid/reader.py
Outdated
@@ -136,7 +137,8 @@ class DataLoader(object):

 Args:
 dataset(Dataset): the dataset to load data from, should be an
-instance of subclass of :code:`paddle.io.Dataset`.
+instance of subclass of :code:`paddle.io.Dataset` or
+:code:`paddle.io.IterableDataset`.
 feed_list (list(Variable)|tuple(Variable)): feed variable list.
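Per the new documentation guidelines, every mention of "variable" should be changed to "tensor": feed_list (list(Tensor)|tuple(Tensor)): feed tensor list. Please update the wording in the other places as well.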
Done, thanks!
"""
Get DataLoader worker process information function, this function is
used to split data copy in worker process for IterableDataset
(see :code:`paddle.io.IterableDataset`), worker informations contains
information
Done, thanks!
LGTM
LGTM
LGTM
            "IterableDataset expect unspecified batch_sampler")
else:
    self.dataset_kind = _DatasetKind.MAP

if batch_sampler is not None:
    assert isinstance(batch_sampler, BatchSampler), \
        "batch_sampler should be None or subclass instance " \
Can we remove the `BatchSampler` restriction later? Maybe it could also be an `Iterable` object.
Additionally, can we support a user-specified `batch_sampler` for `IterableDataset` later? It seems that users can't customize sampling or batching strategies even by themselves, since iterable data is only supported through `IterableDataset` and `_InfiniteIterableSampler`.
And would we also support `Sampler`, in addition to `BatchSampler`, later?
`BatchSampler` can be customized for a map-style dataset (one that implements `__getitem__`). For `IterableDataset`, which can only yield samples sequentially, I couldn't think of scenarios that require `batch_sampler` customization. Of course, it should be supported if such a requirement comes up.
`Sampler` is mostly a sub-function of `BatchSampler`. IMHO, a custom `Sampler` can be defined inside a custom `BatchSampler`?
An example of `batch_sampler` customization is Transformer: it measures batch size by the number of words rather than the number of sentences. Currently it uses a map-style dataset.
When custom batching strategies are needed, `Sampler` may be abstracted out of `BatchSampler` so that the sampling strategies can be reused.
However, this is not a blocker and we can consider it later. I am also trying to provide a helper so that it can be used like this
Thanks, I'll try to do some research and try to add this later~
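The batching strategy mentioned in this thread (sizing a batch by word count rather than sentence count) can also be applied to a sequential stream without any index-based sampler. The sketch below is a plain-Python illustration under stated assumptions: `batch_by_token_count` is a hypothetical helper, tokens are approximated by whitespace splitting, and a sentence longer than the budget is emitted as its own batch. It is not the helper referred to above and not part of this PR.

```python
def batch_by_token_count(sentences, max_tokens):
    """Group a stream of sentences so each batch holds at most max_tokens words.

    A single sentence longer than max_tokens still forms its own batch.
    """
    batch, tokens_in_batch = [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Close the current batch before it would exceed the word budget.
        if batch and tokens_in_batch + n_tokens > max_tokens:
            yield batch
            batch, tokens_in_batch = [], 0
        batch.append(sentence)
        tokens_in_batch += n_tokens
    if batch:
        yield batch


corpus = ["a b c", "d e", "f", "g h i j", "k"]
token_batches = list(batch_by_token_count(corpus, max_tokens=4))
print(token_batches)
# [['a b c'], ['d e', 'f'], ['g h i j'], ['k']]
```

Because the function only consumes an iterator, it works with iterable-style data, at the cost of not being able to reorder or bucket sentences by length the way an index-based sampler could.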
PR types
New features
PR changes
APIs
Describe
Add IterableDataset support for multiprocess DataLoader:
- `paddle.io.IterableDataset` base class
- `paddle.io.get_worker_info` to get worker process information for data splitting in IterableDataset