Dataset post processing #1505
I would be very interested in this functionality, but I have not put too much thought yet into how this would look best. I am definitely a fan of using MetaDataset for everything, so I would not mind if it were part of that. |
Commenting on your 3 suggestions in order:
|
Well, you would factor this out, to have the logic in some common class. But yes, I kind of agree.
Well, we do use combinations/transformations of datasets already. So, such postprocessing logic would not really add anything new there; it fits very naturally into how we already compose datasets.
There is also another aspect which becomes ambiguous here: at which point in the pipeline such a transformation applies. Note, we also have existing partial mechanisms for this (see the examples in the main post). |
Regarding 1: |
@JackTemaki argued that he (and many others) do this anyway. So, how would that variant look? Just a function in the config, like:

```python
def dataset_post_process(data: TensorDict) -> TensorDict: ...
```
|
We definitely need to distinguish train and dev datasets. If we do data augmentation for training, we don't necessarily want to do it for cross-validation. If we are doing some sort of format conversion, then it would be needed for both. So in the end this should be a user choice. We can also have the type of dataset (train/dev/...) as one argument to the post-processing function. |
How would the user specify such a post-processing function per dataset? It could be another argument for the dataset itself, so the user specifies it like:

```python
train = {
    ...,
    "post_process": my_train_dataset_post_proc,
}
dev = {
    ...,
    "post_process": my_dev_dataset_post_proc,
}
```

It's a bit ugly. An alternative would be a separate mapping in the config:

```python
dataset_post_process_funcs = {
    "train": my_train_dataset_post_proc,
    "dev": my_dev_dataset_post_proc,
}
```

This is maybe fine for the training task, but for search or forward, it's ambiguous, and it also doesn't really work if RETURNN is used for scripting and not as a standalone tool. |
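To make the dict-based variant concrete, here is a minimal sketch of how an engine could dispatch on the dataset name. All names here (`apply_post_process`, the per-dataset functions) are hypothetical, and a plain dict of lists stands in for RETURNN's `TensorDict` to keep the example self-contained:

```python
from typing import Callable, Dict, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


def my_train_dataset_post_proc(data: TensorDict) -> TensorDict:
    # Toy stand-in for real augmentation: scale all values by 2.
    return {k: [v * 2.0 for v in vs] for k, vs in data.items()}


def my_dev_dataset_post_proc(data: TensorDict) -> TensorDict:
    # No augmentation for cross-validation.
    return data


dataset_post_process_funcs: Dict[str, Callable[[TensorDict], TensorDict]] = {
    "train": my_train_dataset_post_proc,
    "dev": my_dev_dataset_post_proc,
}


def apply_post_process(dataset_name: str, data: TensorDict) -> TensorDict:
    """Hypothetical engine-side dispatch: pick the function by config name,
    or pass the data through unchanged if none is registered."""
    func = dataset_post_process_funcs.get(dataset_name)
    return func(data) if func else data
```

The ambiguity mentioned above shows up exactly in `apply_post_process`: for search or forward, there is no obvious `dataset_name` key to use.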
Could we add this to (every) engine class? The engine knows what kind of task it performs and what data loader it uses for that task, and could pick the correct post-processing function for that task. |
The post-processing function is not per task but per dataset. At least that is what I wrote above. Or do you want to have it per task? But I guess you don't really want it per task, but rather depending on whether you train or eval? Or maybe a generic:

```python
def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
```
|
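A sketch of what a user-defined function with this `train` flag could look like. The augmentation itself (adding small uniform noise to a `"data"` key) is a made-up example, and a plain dict of lists stands in for RETURNN's `TensorDict`:

```python
import random
from typing import Dict, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict:
    if not train:
        # No augmentation at evaluation time (dev/eval datasets).
        return data
    # Hypothetical augmentation: add small uniform noise to the "data" stream.
    out = dict(data)
    out["data"] = [v + random.uniform(-0.01, 0.01) for v in data["data"]]
    return out
```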
Sorry, I was not precise enough. What I meant was that in the engine class you know what the dataset is used for and which name in the config it comes from (I hope). Then one can select the correct post-processing function to go with it (by picking it from the dict shown above with the correct key). |
No, you don't. E.g. we have this API for forward:

```python
def forward_with_callback(self, *, dataset: Dataset, callback: ForwardCallbackIface): ...
```

Or this API for the init (including training):

```python
def init_train_from_config(
    self,
    config: Optional[Config] = None,
    train_data: Optional[Dataset] = None,
    dev_data: Optional[Dataset] = None,
    eval_data: Optional[Dataset] = None,
): ...
```

The mapping from config names to datasets is handled outside the engine. But yes, so I guess we can simply use this API:

```python
def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
```
|
OK, let's use that API then. |
One aspect I realized now: where exactly would this be executed? As this is now outside the dataset, it is not obvious in which process it would run. |
I would also be interested in this feature. The discussed post-processing solution seems fine to me. However, I would definitely like to have the post-processing parallelizable over multiple procs. At least now, I have a setup where the on-the-fly processing is expensive enough that a single proc would be a bottleneck. |
Another aspect came up (@Judyxujj): we were interested in implementing mixup in this post-processing function. But this is not really possible with the current design. This additionally needs state across sequences (e.g. a buffer of recent feature sequences), and potentially syncing across workers.
(Note, I have a mixup implementation, but I did it inside the model.) |
If we limited this feature to PyTorch, we could also offer the user to inject custom DataPipes into the already existing pipeline instead of providing a callback-style API. That would interact favourably with multiprocessing, offer places to store state for e.g. mixup, and would just as well allow syncing data across workers, maybe given some setup or communication primitive to bootstrap further communication. @vieting Are you using Torch or TF in your current setups? |
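To illustrate the injected-pipe idea without depending on Torch here, this is a minimal sketch of a pipe stage modeled loosely after a torch `IterDataPipe`: it wraps an upstream iterable, applies a mapping, and has a place for worker-local state. The class and its names are hypothetical; a plain dict of lists stands in for `TensorDict`:

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


class PostProcessPipe:
    """Sketch of a user-injected pipe stage: wraps an upstream iterable and
    applies a mapping function, with local state that is reset at the start
    of every iteration (i.e. every epoch)."""

    def __init__(
        self,
        source: Iterable[TensorDict],
        map_fn: Callable[[TensorDict], TensorDict],
    ):
        self.source = source
        self.map_fn = map_fn
        self.num_seqs_seen = 0  # example of worker-local, per-epoch state

    def __iter__(self) -> Iterator[TensorDict]:
        self.num_seqs_seen = 0  # reset state at the beginning of each epoch
        for data in self.source:
            self.num_seqs_seen += 1
            yield self.map_fn(data)
```

With real torch DataPipes, the same shape of code would subclass `torch.utils.data.IterDataPipe` and be inserted into the existing pipeline.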
No, not really. Only the
I don't understand. Where? How? I don't think that a data pipe should really have state (except for temporary state which we would reset every subepoch). Having state also means that you would have to properly store/restore the state on disk after a restart, like the model parameters or optimizer state.
No, how? Every worker on the dataset is independent from the others. They don't have any way to communicate with each other.
I don't see any simple way to add this at this level. In any case, we should probably not overthink/overengineer this. E.g., for things like mixup, I think it's ok if the state is reset at the beginning of an epoch, and also if it's just local to the current worker. Otherwise mixup can just be done on the model level, which we have already implemented. And most other things you would want to do in such post-processing don't have state. Also, I tend to think it's ok to have multiple solutions here, and to see what is easiest for the user.
|
I was under the assumption that in RETURNN+PT the data loader already shards the data across its workers. |
Yes, this is likely wrong. We never really tested this, but: there is no sharding implemented for DataLoader with multiple workers. It cannot really be: there is no way you can do sharding in general for any dataset (or only in an inefficient way, by iterating through all data and discarding what you don't want). Only a dataset which knows its full structure up front can do sharding properly, and even then there are caveats. On the other hand, the post-processing itself could still be parallelized over workers. |
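For reference, the "inefficient way" mentioned above can be sketched in a few lines: every worker iterates the full stream and discards everything except its own share, so the iteration cost is paid once per worker. The function name and the plain-dict stand-in for `TensorDict` are illustrative only:

```python
from typing import Dict, Iterable, Iterator, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


def shard_by_discarding(
    stream: Iterable[TensorDict], worker_id: int, num_workers: int
) -> Iterator[TensorDict]:
    """Naive general-purpose "sharding": keep only every num_workers-th item.
    Correct for any iterable dataset, but each worker still iterates (and
    pays for) the whole stream -- the inefficiency referred to above."""
    for i, data in enumerate(stream):
        if i % num_workers == worker_id:
            yield data
```

With a real torch `IterableDataset`, `worker_id`/`num_workers` would come from `torch.utils.data.get_worker_info()`.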
Btw, after some discussion yesterday with @curufinwe, I think a pragmatic simple solution for now is really to implement this as a new separate dataset, which wraps another dataset and applies the post-processing. |
With those points I agree this is the simplest way to move forward, thanks for the explanations! |
For some other examples of similar processing datasets, see the existing wrapper datasets in RETURNN. Btw, in the main post, I extended the list of example post-processing functions a bit. One important type is also to support concatenating sequences (see #1573). I.e. the post-processing transformation is not necessarily only on a single individual sequence, but could also do things like concatenating sequences, maybe shuffling sequences, dropping sequences, inserting new sequences, etc. However, this should be implemented in a streaming way, i.e. it gets in a stream of sequences and produces a stream of sequences. |
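As one concrete (made-up) instance of such a streaming transformation that changes the number of sequences, here is a sketch that concatenates every two consecutive sequences into one, again with a plain dict of lists standing in for `TensorDict`:

```python
from typing import Dict, Iterator, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


def concat_pairs_stream(stream: Iterator[TensorDict]) -> Iterator[TensorDict]:
    """Concatenate every two consecutive sequences into one.
    An odd leftover sequence at the end passes through unchanged."""
    prev = None
    for data in stream:
        if prev is None:
            prev = data
        else:
            yield {k: prev[k] + data[k] for k in prev}
            prev = None
    if prev is not None:
        yield prev
```

Shuffling, dropping, or inserting sequences would follow the same iterator-in, iterator-out shape.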
The question is a bit how to define the API for the user then. Before, we suggested:

```python
def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
```

(This would now be an argument for the new dataset.) Maybe we can still also provide this simpler API, as in many cases the user just wants to transform a single sequence. But then, it should also support the operations over multiple sequences, via a stream-based API:

```python
post_process_stream: Callable[[Iterator[TensorDict]], Iterator[TensorDict]]
```

The user can simply implement this as a generator, like so:

```python
def no_op_post_process_stream(input_stream: Iterator[TensorDict]) -> Generator[TensorDict, None, None]:
    for data in input_stream:
        yield data
```
|
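If both APIs are offered, only the stream-based one needs to be implemented internally: the per-sequence variant can be lifted into it with a small adapter. This is a sketch under that assumption (the adapter name is hypothetical, and a plain dict of lists stands in for `TensorDict`):

```python
from typing import Callable, Dict, Iterator, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


def map_seq_to_stream(
    map_seq: Callable[[TensorDict], TensorDict],
) -> Callable[[Iterator[TensorDict]], Iterator[TensorDict]]:
    """Adapter: lift a per-sequence function to the stream-based API."""

    def stream_fn(input_stream: Iterator[TensorDict]) -> Iterator[TensorDict]:
        for data in input_stream:
            yield map_seq(data)

    return stream_fn
```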
Closes #1505 Co-authored-by: Albert Zeyer <albzey@gmail.com>
The new dataset covers this now. Except that it should not have state, but that's not really so much of a problem. You can still implement sth like mixup; you only should reset any internal state at the beginning of an epoch. |
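To show that mixup-like behavior fits the per-epoch-state constraint, here is a sketch of a stateful stream transformation that mixes each sequence with one from a small buffer of recent sequences, resetting the buffer at the start of every epoch. This is not RETURNN's actual mixup implementation; all names and the plain-dict `TensorDict` stand-in are illustrative:

```python
import random
from typing import Dict, Iterator, List

# Stand-in for returnn.tensor.TensorDict: data key -> values.
TensorDict = Dict[str, List[float]]


class MixupLikeStream:
    """Mixes each sequence's "data" with a randomly chosen buffered one.
    All state is local to the worker and reset at the start of each epoch."""

    def __init__(self, max_buffer: int = 10, alpha: float = 0.3):
        self.max_buffer = max_buffer
        self.alpha = alpha
        self.buffer: List[List[float]] = []

    def __call__(self, stream: Iterator[TensorDict]) -> Iterator[TensorDict]:
        self.buffer = []  # reset internal state at the beginning of the epoch
        for data in stream:
            feat = data["data"]
            if self.buffer:
                other = random.choice(self.buffer)
                n = min(len(feat), len(other))
                feat = [
                    (1.0 - self.alpha) * feat[i] + self.alpha * other[i]
                    for i in range(n)
                ] + feat[n:]
            self.buffer.append(data["data"])  # buffer the original sequence
            if len(self.buffer) > self.max_buffer:
                self.buffer.pop(0)
            yield {**data, "data": feat}
```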
Examples of post-processing: applying a `Vocabulary` (mapping raw text to labels) on-the-fly.

Some datasets already have partial support for post-processing. Examples:

- `OggZipDataset` `targets` can be any type of `Vocabulary` (e.g. `BytePairEncoding`, but also `SamplingBytePairEncoding`, or `SentencePieces`). Similarly `ExternSprintDataset` `orth_vocab`.
- `ExtractAudioFeatures` is used in a couple of places, e.g. by `OggZipDataset` `audio`. It also supports `pre_process` (on raw audio) and `post_process` (on audio features).

There was the idea about storing generic raw audio (or maybe even Ogg) inside the HDFDataset. And similarly, there was also the idea about storing the text (UTF8 bytes) inside the HDFDataset. In both cases, you would then maybe want to transform those into audio features or BPE labels on-the-fly as part of the dataset.
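The stored-UTF8-text idea could look roughly like the following sketch: the dataset stores raw bytes, and the post-processing maps them to label indices via a vocabulary at read time, so the stored data stays vocabulary-independent. The function and the toy word-level vocab are hypothetical, not an existing RETURNN API:

```python
from typing import Dict, List


def bytes_to_labels(raw: bytes, vocab: Dict[str, int]) -> List[int]:
    """Decode stored UTF-8 bytes and map each whitespace-separated token
    to its label index via a (hypothetical) word-level vocabulary."""
    return [vocab[w] for w in raw.decode("utf8").split()]
```

A BPE-based variant would replace the whitespace split with the subword segmentation of the chosen `Vocabulary`.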
There are multiple options how to implement this:

- Like `OggZipDataset`, extend some other dataset (e.g. `HDFDataset`) by such functionality. But how to do it in a somewhat generic and flexible way? One aspect to keep in mind is that this might also change the dimension or shape of the data. E.g. raw audio to audio features will add one dimension.
- A new separate wrapper dataset, similar to `MetaDataset`. Or maybe make it part of `MetaDataset`?
- Post-processing on the `TensorDict` (before batching), or on individual data streams. But the distinction when something should be done as part of the dataset and when it would be done as such post-processing would be kind of arbitrary.