Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset merge issue #487
Dataset merge issue #487
Changes from 3 commits
21499eb
326038c
75a76bf
eb131c1
8c48a98
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky because a generated dataset 1 with a given startTime filter, and dataset 2 with filter 2, then to be rigorous I need to know the startTime filter of both.
So that I know that the dataset 1 was produced with a given filter, and the other was produced with a different time filter. Idem for other filters.
Perhaps a new class
TRestDataSetMerged
could help. We keep the filters only forTRestDataSet
while inTRestDataSetMerged
we define as metadata members the dataset filenames used for the merge?And perhaps introduce a hash-id to identify/guarantee the TRestDataSet used to create the merged dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that previous post was me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed the timeStamp filters, I think that to make a new class TRestDataSetMerged is an overkill.
However, I don't know what is the best strategy here because apart of the filters there is plenty of metadata info which is not saved when we merge several dataSets. Note that before just only the metadata info from the first file was saved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, for me a dataset does not need to contain all the metadata information that is available to the run files. The dataset must guarantee that we can identify the run files used to generate the dataset. I guess, having the selected files vector is already a good starting point. Then we need to know if any filters have been applied for event selection. But the final user working with the dataset needs to know also other conditions applied to generate the dataset, e.g. the detector pressure, the data taking days, the voltage on the mesh, etc, etc. Then, if those filters were used to generate the datasets, it is relevant for the end-user.
The metadata information transferred to the dataset in the form of relevant quantities is to provide the dataset with physical meaning, and allow to have all the analysis required information into one single entity.
Indeed, if we added new columns to the dataset, we need to add also that metadata information, because if I generate a new column using
<addColumn
then I need to now the formula used to calculate that column. Not sure if we are doing this right now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see now that the columns added should be in the form
fColumnNameExpressions
. Just last time I couldn't find them in my dataset.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion this should never happen, as I already stated many times I am not agree on merging
runs
. Mergingdatasets
is different issue, butruns
should be considered asread-only
(or end-product of event data processing) entities, that can be used later on to generate a data compilation or dataset.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am talking about
datasets,
notTRestRuns
the issue is that if the metadata info of the dataset is not properly handled the results are wrong when I compute e.g. the rate. In this case the duration of the dataset has to be added, but other metadata info can be also wrong or misleading. I don't know what is the proper way to handle metadata info of the dataset while merging, but this issue was present before this PR.