New TF dataset pipeline: draft #292
Some notes about why to use tf.data, with some relevant resources and StackOverflow questions for reference (including how this looks internally on the C++ side). Some interesting bits:
Datasets usually operate asynchronously. So dataset iterators can be fetched outside a session run step. This allows for a very straightforward implementation of wrapping up a RETURNN dataset.
How to design the configuration such that this becomes very straightforward to configure and also flexible? How to make this efficient for our multi-GPU training? |
(Note, I assigned you to this issue, but this does not really mean an assignment. It is rather that you might be interested in this, and might want to take part in planning, brainstorming, and maybe also the implementation.) |
@curufinwe Ha, that sounds very much like our implementation... |
Lingvo does not seem to use tf.data. |
Regarding Lingvo: yes, the project is older than the tf.data API |
(Sorry, clicked wrong button. Actually I also wanted to add @patrick-wilken but somehow that's not possible? Maybe because he is not part of the GitHub RETURNN team? I added you now. Pending invite.) |
Btw, update: The plan on the pipeline is mostly finished (see initial comment). Please review. What's missing is mostly the design of how you would configure that in the config. Which should be flexible, simple, and extensible (such that we can easily add other features later, esp all those mentioned above). I'm still thinking about that. (Maybe you have suggestions?) |
Before I start to review your ideas, I want to clarify whether I understood the current situation correctly (for the TF backend):
My additional questions would be:
|
Yes, that's the only complete implementation, and also the only used implementation (you will see when you check code usages).
Yes, the
Yes. Or depending on the dataset implementation. Originally loading the data was done in
Yes, but this is an implementation detail, which we can easily change.
Yes. But we can extend or modify this API if this would match our new pipeline. You are asking about somewhat irrelevant implementation details here. We can easily change any of that, as we want. This draft here is not too much about these implementation details, but rather about a high level organization.
Yes, sure. This is just a standard Python
There is no point in having multiple threads here; they would not help in any way. Multiple threads could help inside specific dataset implementations, but that is independent of this whole discussion. E.g. RASR could use multiple threads, and we could also extend that. But in any case, a single thread is enough to read from the RETURNN dataset. |
I also tried to create an example of my own to understand how tf.data works:
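Something along these lines (a minimal TF1-style sketch; the toy generator here just stands in for reading sequences from a real dataset, and all names and shapes are made up):

```python
import numpy
import tensorflow as tf

def toy_generator():
  for seq_idx in range(1, 4):  # a few dummy "sequences" of varying length
    yield {"data": numpy.zeros((seq_idx + 2, 5), dtype="float32")}

dataset = tf.data.Dataset.from_generator(
  toy_generator,
  output_types={"data": tf.float32},
  output_shapes={"data": tf.TensorShape([None, 5])})
dataset = dataset.padded_batch(2, padded_shapes={"data": [None, 5]})

iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_element = iterator.get_next()
with tf.compat.v1.Session() as session:
  while True:
    try:
      print(session.run(next_element)["data"].shape)
    except tf.errors.OutOfRangeError:
      break
```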
The output is
(I apologize for starting with 1) So wouldn't it be a possible approach to just extend the RETURNN Dataset with the generator function (and provide the correct types/shapes)? Or is there a drawback to doing this? Edit: |
Yes, but you are again discussing minor implementation details. I even mentioned this already in the draft ("RETURNN `Dataset` wrapped as a TF `tf.data.Dataset`"). The main discussion here is about the remaining parts, e.g. how the configuration in the config would look. Basically, let's clarify the open outstanding parts of the draft (see the TODO). |
Update: I think the draft is mostly ready now. Please check if this suits all possible use cases (multi GPU training, TPU training, having multiple dataset workers, etc, whatever you can imagine). It might be annoying/tricky to change the design/API later. |
I wanted to start with the easier parts before I assume wrong things. So far I did not fully understand how the distributed training works (I read the Wiki page and some of the API descriptions for TF1 but I need more time), so I will skip this for now. The parts that I would like to discuss are the config design and the possible multithreading of the feature pipeline. So the default pipeline you mentioned is clear:
And also the returnn.InputContext() seems reasonable. The question now is, how would this look in practice? If I understand it correctly, all non-TF processing should stay inside the original datasets and be fixed. So my questions are:
Maybe some questions are unnecessary details, but those are the most interesting questions for me from the user perspective. |
The multithreading would be covered by distributed TF.
In the config, under the option `dataset_pipeline`.
This data_network would not be part of the initial design. This is somewhat optional. We could later add this as a function to the
What do you mean by bypass? You can configure the RETURNN dataset as before, in whatever way you want (using
As said, this initial design does not cover the feature processing at all. It is simply that function
All of this draft is not about the RETURNN dataset. As I explained, this is totally separate, and irrelevant here. (Although it would allow working around it.) You could load multiple RETURNN datasets though. They would automatically be parallelized (in multiple dataset/producer workers, via distributed TF). (This probably would not be implemented initially, but the design of the API should make this possible.)
You could do that. This would be again via distributed TF. (But as said, this would probably not be implemented/supported initially. We just should make sure that this is easy to add later, and the API supports it directly, or can easily be extended.)
Note that in most of our cases, the dataset loading is probably fast enough.
Both. It's all part of |
One small remaining question: Should this new dataset pipeline (i.e. when you set `dataset_pipeline`) imply the use of distributed TF? Maybe it is more sensible to not use distributed TF by default, and keep these two things mostly orthogonal (although they are very related, as you see here in this issue). Another point: Not using distributed TF, but using the new dataset pipeline together with Horovod, would be somewhat incompatible, or requires explicit sharding (just as we do now in |
No, distributed TF (
I think this was just a demo. We would probably not implement it that way. We would probably instead use |
Yup, that was my point. Too much to my taste to call the suckers the same word. Keeping a parameter server on a CPU in a multi-GPU host is about 2 orders of magnitude better w.r.t. latency than a PS on a different host. There are too many "common in principle" things that make the difference between a go and a no-go in the real world.
|
Just as a note: I started implementing this, beginning with only the bare minimum. The first goal is to get single-GPU training to work. I will soon push some first commits. As this is an optional (non-default) feature, it should not interfere with anything else when not used, so it is safe to directly push this to master, and do the work directly there. (See also the contribution guidelines.) |
Note that our default type of multi-GPU training is anyway as much async as possible. I.e. we never used parameter servers, and probably also will not do so in the future (this is somewhat inconsistent with what TF means by "async training", which is almost always the use of a parameter server). I.e. this latency should not matter too much for us. |
Prepare for new dataset pipeline (#292).
Seems to work. At least for this simple case. #292
The simple case (no distributed TF, no dedicated dataset loader workers, no Horovod, i.e. no multi-GPU training) should work now, at least with the default pipeline. You can just set `dataset_pipeline = True` in the config. There are many outstanding TODOs (check the code), although most of them are only relevant for distributed TF (#281). I guess this is the next step. Anyway, already this current support should be tested a bit. So please test it, and report your experience, or any problems (and also debug them if possible). (Test with some reasonably new TF 1.* version.) Also, this issue can then be almost closed. This was about the design of the API (not really about the underlying implementation). If the design looks fine, i.e. is flexible enough to allow all features we need, simple enough to configure, and also straight-forward, then this is what we want. Otherwise please comment now! Now it's still possible to change all of it. This will get much uglier and more complicated later. |
I'm trying with such an implementation now for dynamic batch sizes via `bucket_by_sequence_length`:
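A sketch of what this could look like (the bucket boundaries and batch sizes are just placeholder values, and the element structure is assumed to be a dict with a `"data"` key of shape [time, dim]):

```python
import tensorflow as tf

def make_bucketed_batches(dataset):
  """
  :param tf.data.Dataset dataset: elements are dicts with "data" of shape [time, dim]
  :rtype: tf.data.Dataset
  """
  return dataset.apply(tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda element: tf.shape(element["data"])[0],
    bucket_boundaries=[100, 200, 400],    # placeholder sequence-length limits
    bucket_batch_sizes=[32, 16, 8, 4]))   # one more entry than boundaries
```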
This is also pretty generic. I wonder if we might want to use sth like this as the default? We could (should) also move that bucketing to its own function. |
Small status report:
|
Existing configs should work as before, without any change in behavior.
The datasets themselves (everything that derives from class `Dataset`) will stay as is, as well as their API. It would also be nice to support TensorFlow datasets (TFDS) (more) directly. There is also TensorFlow I/O, which provides further dataset functions and can read MP3, Ogg, etc. directly.
This (here) is about what follows from there, i.e. how to get the data from the dataset into TF, and how to build up mini-batches.
The file `TFDataPipeline.py` is currently there for this purpose. The standard and working implementation (`FeedDictDataProvider`) builds up the mini-batches with Numpy code, and then uses the TF session run feed dict. There were plans (right from the beginning of the RETURNN TF implementation) to do this via TF queues instead, but that was never finished. These plans are outdated now because `tf.data` is the better API for this. (Related is #171. See this `tf.data` example for the high-level logic to loop over epochs, and to initialize `tf.data.Iterator` and `tf.data.Dataset`.) Effectively this would be used instead of the current feed dict approach (another `DataProviderBase` implementation).
Multi-GPU training is another aspect. For that, it makes sense to have the dataset loading and mini-batch building live in a separate (sub)process. This can make sense even for single GPU. This process would then send the mini-batches to each computing node. We would use distributed TensorFlow for this, even for single GPU, to have unified code. (See #296 for a draft on distributed TensorFlow support in RETURNN.) (Alternatives would be ZeroMQ TF ops (e.g. this, this, or this), or some own custom TF op. However, distributed TensorFlow would allow us to extend our multi-GPU training code later more easily. Maybe Horovod does not cover all use cases.)
We would not use the input pipeline provided by the `tf.distribute.Strategy` API, and be more flexible on this. Our own processing would by default not use sharding (because that is inefficient in general; it requires extra work to make it efficient). We would rather have a single dedicated worker for the dataset, and its output would get distributed to the other train workers. I.e. we would have the training and the data preprocessing decoupled. There is no TF `Strategy` implementation which covers this case (TF feature request: Decoupling preprocessing and training). (RETURNN would not be restricted to this though; it would also support the TF dataset pipeline with sharding.)
TPU training is another aspect which has some more constraints (beyond multi-GPU training), like having fixed predefined batch sizes. By using the `tf.data.Dataset` pipeline, this should be relatively easy to accomplish. (Although maybe we would want to tell our RETURNN `Data` class that we have a fixed batch dim, but this is maybe a minor detail, and maybe also irrelevant (or not too relevant) for the new dataset pipeline.)
Further preprocessing on sequence level (such as custom feature extraction via pure TF, or custom data augmentation via pure TF) can be allowed by custom TF code, or a custom RETURNN TF network just for that. It might make sense to be able to run this on GPU as well.
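For instance, a sketch of such sequence-level preprocessing in pure TF (here log-mel features via `tf.signal`; all parameter values are only examples, and the element structure is assumed to be a dict with a `"data"` key holding raw audio):

```python
import tensorflow as tf

def raw_audio_to_log_mel(example, sample_rate=16000, num_mel_bins=80, fft_length=512):
  """example: dict with "data" = raw audio samples, shape [time]."""
  stft = tf.signal.stft(example["data"], frame_length=400, frame_step=160, fft_length=fft_length)
  spectrogram = tf.abs(stft)                     # [frames, fft_length // 2 + 1]
  mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=num_mel_bins,
    num_spectrogram_bins=fft_length // 2 + 1,
    sample_rate=sample_rate)
  mel = tf.matmul(spectrogram, mel_matrix)       # [frames, num_mel_bins]
  return {**example, "data": tf.math.log(mel + 1e-6)}

# Applied per sequence, before batching:
# dataset = dataset.map(raw_audio_to_log_mel)
```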
A further aspect is to have the CPU->GPU transfer of the data asynchronous, so that the session run call of the training does not first need to wait for that. TF queues, `tf.data`, or some custom solution can be used for that. If we already did the data preprocessing on GPU, we probably also want to avoid the GPU->CPU->GPU roundtrip (if it is all the same single GPU).
Another aspect to keep in mind might be streaming / online decoding (see e.g. #276). Maybe the same data network could be used in an online setting. We could use the `keep_over_epoch` (hidden state) functionality for that, which mostly provides exactly that functionality. Then the dataset would split up an incoming stream into chunks and pass on those chunks. This is also more efficient than just passing on individual frames. Maybe the dataset could optionally provide the information when to reset the state. (This aspect would likely not be part of the initial implementation, but should be easy to add later.)
Another note: The current computation of "computing time" (the log output like `train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)`) would not be valid anymore, or should be extended. More specifically, we should measure how much time is spent waiting for data.

High level ideas:
A `tf.data.Dataset` would wrap the existing RETURNN dataset API as a TF dataset. It would give access to individual sequences. It would call `dataset.load_seqs` and `dataset.get_data`. This would live (by default) in a dataset subprocess (although it should be flexible to possibly run also in the main process). (A sketch follows further below.)
Reimplement the mini-batch building logic (`Dataset._generate_batches` + `FeedDictDataProvider.get_next_batch`) in pure TF. `tf.data` also provides pure TF code for this purpose already, to prepare mini-batches based on the sequences. This is a good base, which we can extend if needed. There could be some generic way to define the mini-batch building logic. Originally the idea was also to use RETURNN layers for this, but this would not quite work, as RETURNN layers work on tensors, whereas here we would work with TF datasets (which represent a (possibly infinite) stream of tensors (or possibly nested structures of tensors), and do operations on the stream (e.g. bucketing, batching), not necessarily on individual tensors). We would assume a very simple default behavior, which would mostly do some standard mini-batching logic.
The user could write a custom `data_network` dict for the config, just like `network`, to define a RETURNN network. This would be separate from the mini-batch building logic, as it would run on the sequences (or a single sequence). This can and should use existing RETURNN layers. E.g. if the dataset is configured such that it returns the raw audio, you can do the feature extraction in pure TF, and easily add data augmentation and other things. This data network would be some optional aspect for now.
The dataset (and maybe the data network) could (by default) be created in the dataset subprocess, and be executed in that process. The output of it would be forwarded to the main RETURNN computing instances via distributed TensorFlow. If there are multiple computing instances (for multi-GPU), the mini-batches would be distributed to them.
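Coming back to the first point above, a rough sketch of how such a wrapper could look (simplified: really the dtypes/shapes would come from `ExternData`, the epoch init would happen via an init function, and this would run in the dataset subprocess; `feature_dim` is just an example):

```python
import tensorflow as tf

def returnn_dataset_to_tf_dataset(rn_dataset, data_key="data", feature_dim=40, epoch=1):
  """
  Sketch: wrap a RETURNN Dataset as a tf.data.Dataset over single sequences.

  :param Dataset rn_dataset: RETURNN dataset instance
  :rtype: tf.data.Dataset
  """
  def generator():
    rn_dataset.init_seq_order(epoch=epoch)
    seq_idx = 0
    while rn_dataset.is_less_than_num_seqs(seq_idx):
      rn_dataset.load_seqs(seq_idx, seq_idx + 1)
      yield {data_key: rn_dataset.get_data(seq_idx, data_key)}
      seq_idx += 1

  return tf.data.Dataset.from_generator(
    generator,
    output_types={data_key: tf.float32},
    output_shapes={data_key: tf.TensorShape([None, feature_dim])})
```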
So, the pipeline would look like this:
- `Dataset` wrapped as a TF `tf.data.Dataset`. (We would pass it the `Dataset` instance, where the `init_func` would initialize the epoch, and an `ExternData` instance to specify what data we expect and get from the dataset.)
- `shuffle`.
- `shuffle`.
- `PaddedBatchDataset` (via `dataset.padded_batch`) to form batches, or alternatively `bucket_by_sequence_length` for bucketing.
- `shuffle`.
- `PrefetchDataset` (in the dataset process).
- `tf.data.experimental.service.distribute` and `DispatchServer`. (Could be implemented with pure TF, quite straight-forward, via `_GeneratorDataset`, via distributed TensorFlow.) (`MultiDeviceIterator` / `_PerDeviceGenerator` does exactly that via distributed TF. See the code.)
- `PrefetchDataset` in the main process, living on GPU to have async CPU->GPU transfer.
- Optionally we could also run multiple datasets in subprocesses, each handling a subset, and then interleave them (`interleave`, `sample_from_datasets`). This would help if the dataset loading is slow and parallelization would help.

In the config, you can provide an option `dataset_pipeline`, which is supposed to be a function `returnn.InputContext -> tf.data.Dataset`. This function is supposed to return the final dataset, used directly as-is for training. The final dataset is supposed to match the data keys from `extern_data`. More specifically, the elements should be a dict where the key is the data key and the value the tensor. Size placeholders would have the special key `"size:%s:%i" % (key, i)` (similar as in `TFNetwork.get_fetches_dict`).
In the distributed TF case, this function will get executed on every worker, and also on dedicated dataset pipeline worker(s). It would be controlled via scopes such that some part of this only runs on the dataset pipeline worker, and the remaining part on all the train workers. (So we might end up with some parts in the TF computation graph which are not used, depending on the worker type, but this is not really critical or relevant.)
`returnn.InputContext` would be a class with an API such as:
- `get_dataset_name()`: Would return `"train"`, `"dev"` or so (corresponding to the config, or also `eval_datasets` in the config).
- `get_returnn_dataset(**kwargs)`: This would get the initial RETURNN dataset (configured via `train`, `dev` etc. as usual). `kwargs` could maybe overwrite parts (if you would want to do sharding here or so; by default you would use a single dataset worker). (This would also include seq length/size information.)
- `get_dataset_worker_scope()` (or `get_producer_worker_scope()`): To be used as `with context.get_dataset_worker_scope():`, for the dataset/producer worker(s). (Not sure if needed?)
- `get_train_worker_scope()` (or `get_std_worker_scope()` or `get_consumer_worker_scope()`?): Analogous, for the train/consumer worker(s). (Possibly related: `MultiDeviceIterator`. Maybe using `tf.distribute.InputContext`. We could (or even should) make use of `tf.distribute.get_strategy()`, as this would always return sth sensible, and then our code will work if any TF distribute strategies are used.)
- `map_producer_to_consumer(dataset: tf.data.Dataset) -> tf.data.Dataset`: Operating on the batches; maps the dataset(s) from the dataset worker(s) to the consumer (train worker). With distributed TF, if there are multiple dataset workers, it would join the data (e.g. using `interleave`). If there are multiple consumers (train workers), it would evenly distribute the batches such that each consumer gets exactly the same amount (e.g. via `MultiDeviceIterator`, or `tf.data.experimental.service.distribute` / `DispatchServer`). With Horovod and without distributed TF, this would use `shard`, but print a warning that this is inefficient, and that distributed TF should be enabled (which can be together with Horovod).
- `get_default_max_seqs()`: Returns an int, the number of seqs for a batch (batch size in number of seqs, i.e. the batch dimension). The default implementation would return sth like `config.int("max_seqs")` (asserting that `"max_seqs"` is defined).
- `add_seq_length(dataset: Dataset) -> Dataset`: Basically `dataset.map(lambda item: {**item, 'size:data:0': tf.shape(item['data'])[0]})` or similar.
- `get_consumer_device() -> str`: E.g. `"gpu"`, for the consumer (trainer) worker.
- `prefetch_to_consumer_device(dataset: Dataset) -> Dataset`: Basically `tf.data.experimental.prefetch_to_device(context.get_consumer_device())(dataset)`.
If just `dataset_pipeline = True`, we would have a default implementation; a sketch of it is shown below. A multi-producer setup (which would not be implemented/supported directly, but should be easy to add later) could be configured in a similar way.
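A sketch of what such a default implementation could look like, following the proposed `InputContext` API above (this is not the actual code; e.g. the shuffle buffer size and the padded-batch details are simplified):

```python
def dataset_pipeline(context):
  """
  Default pipeline sketch, using the proposed returnn.InputContext API.

  :param returnn.InputContext context:
  :rtype: tensorflow.data.Dataset
  """
  dataset = context.get_returnn_dataset()     # per-seq dicts, incl. "size:..." entries
  dataset = dataset.shuffle(buffer_size=100)  # buffer size is just an example value
  dataset = dataset.padded_batch(context.get_default_max_seqs())  # TF1 would also need padded_shapes
  dataset = context.map_producer_to_consumer(dataset)
  dataset = context.prefetch_to_consumer_device(dataset)
  return dataset
```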
Example for Librispeech via TFDS:
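A sketch of how that could look (not tested; it assumes the TFDS feature keys `"speech"` and `"text"`, the TFDS split naming, and the `"raw"` data key for the transcription, and again uses the proposed `InputContext` API):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

def dataset_pipeline(context):
  """
  :param returnn.InputContext context:
  :rtype: tf.data.Dataset
  """
  ds = tfds.load("librispeech", split="train_clean100")
  ds = ds.map(lambda ex: {
    "data": tf.cast(ex["speech"], tf.float32),  # raw audio samples, shape [time]
    "size:data:0": tf.shape(ex["speech"])[0],
    "raw": ex["text"]})                         # transcription as a raw byte string
  ds = ds.padded_batch(
    context.get_default_max_seqs(),
    padded_shapes={"data": [None], "size:data:0": [], "raw": []})
  ds = context.map_producer_to_consumer(ds)
  return context.prefetch_to_consumer_device(ds)
```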