Dataset merge issue #487
Conversation
if (REST_StringHelper::StringToTimeStamp(fFilterStartTime) >
    REST_StringHelper::StringToTimeStamp(dS.GetFilterStartTime()))
    fFilterStartTime = dS.GetFilterStartTime();
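The snippet above keeps the earlier of the two start times, so the merged window widens to cover both inputs. A minimal sketch of that logic, where `StringToTimeStamp` is a stand-in for `REST_StringHelper::StringToTimeStamp` (its real parsing behavior is an assumption here; this version just packs the date digits into a comparable number):

```cpp
#include <string>

// Stand-in for REST_StringHelper::StringToTimeStamp (illustration only):
// maps a "YYYY-MM-DD" date string to a number that compares in date order.
long StringToTimeStamp(const std::string& s) {
    long v = 0;
    for (char c : s)
        if (c >= '0' && c <= '9') v = v * 10 + (c - '0');
    return v;
}

// When merging two datasets, keep the earlier start time so the
// combined time window covers both inputs.
std::string MergeStartTime(const std::string& a, const std::string& b) {
    return (StringToTimeStamp(a) > StringToTimeStamp(b)) ? b : a;
}
```

The symmetric case for the end-time filter would keep the later of the two values.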
This is a bit tricky: suppose dataset 1 was generated with a given startTime filter and dataset 2 with a different one. To be rigorous, I need to know the startTime filter of both, so that I know dataset 1 was produced with one filter and the other with a different time filter. The same goes for the other filters.

Perhaps a new class TRestDataSetMerged could help. We would keep the filters only in TRestDataSet, while in TRestDataSetMerged we would define as metadata members the dataset filenames used for the merge. And perhaps introduce a hash-id to identify/guarantee the TRestDataSet used to create the merged dataset?
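A hypothetical sketch of the hash-id idea: derive a reproducible identifier for the merged dataset from the list of input dataset filenames, so the merged product can be traced back to its inputs. `std::hash` is used only for illustration; a real implementation would want a build-stable digest (e.g. SHA-1), since `std::hash` values are not guaranteed identical across compilers:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Illustration only: a reproducible id for a merged dataset, built from the
// input dataset filenames. Sorting first makes the id order-independent.
std::string MergedDataSetId(std::vector<std::string> files) {
    std::sort(files.begin(), files.end());
    std::string concat;
    for (const auto& f : files) concat += f + ";";
    return std::to_string(std::hash<std::string>{}(concat));
}
```

With this scheme, the same set of input files yields the same id regardless of the order in which they were imported.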
OK, that previous post was me.
I just removed the timeStamp filters; I think that making a new class TRestDataSetMerged is overkill.
However, I don't know what the best strategy is here, because apart from the filters there is plenty of metadata info which is not saved when we merge several datasets. Note that before, only the metadata info from the first file was saved.
Actually, for me a dataset does not need to contain all the metadata information available in the run files. The dataset must guarantee that we can identify the run files used to generate it; having the selected-files vector is already a good starting point. Then we need to know whether any filters have been applied for event selection. But the final user working with the dataset also needs to know the other conditions applied to generate it, e.g. the detector pressure, the data-taking days, the voltage on the mesh, etc. So, if those filters were used to generate the datasets, they are relevant to the end user.
The metadata information transferred to the dataset in the form of relevant quantities gives the dataset physical meaning and allows having all the information required for analysis in one single entity.
Indeed, if we add new columns to the dataset, we need to add that metadata information as well, because if I generate a new column using <addColumn
then I need to know the formula used to calculate that column. Not sure if we are doing this right now.
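The bookkeeping being asked for here could look like the following sketch: every added column records the expression used to compute it, so the formula is recoverable from the dataset metadata. The class and member names are illustrative (the member mirrors the fColumnNameExpressions mentioned later in the thread), not the real TRestDataSet API:

```cpp
#include <map>
#include <string>

// Illustration only: keep, for every column added to the dataset,
// the expression that was used to compute it.
class DataSetSketch {
    std::map<std::string, std::string> fColumnNameExpressions;

  public:
    void AddColumn(const std::string& name, const std::string& expression) {
        fColumnNameExpressions[name] = expression;  // store the formula as metadata
    }
    std::string GetExpression(const std::string& name) const {
        auto it = fColumnNameExpressions.find(name);
        return it != fColumnNameExpressions.end() ? it->second : "";
    }
};
```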
OK, I see now that the added columns should be stored in the form fColumnNameExpressions. It's just that last time I couldn't find them in my dataset.
> Note that before just only the metadata info from the first file was saved.

In my opinion this should never happen; as I have already stated many times, I do not agree with merging runs. Merging datasets is a different issue, but runs should be considered read-only entities (or end-products of event data processing) that can be used later on to generate a data compilation or dataset.
I am talking about datasets, not TRestRuns.
The issue is that if the metadata info of the dataset is not properly handled, the results are wrong when I compute e.g. the rate. In this case the duration of the dataset has to be added, but other metadata info can also be wrong or misleading. I don't know the proper way to handle the dataset metadata while merging, but this issue was present before this PR.
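The duration point above can be made concrete with a small sketch (illustrative types, not REST classes): when merging, the total duration must be the sum of the individual durations, otherwise a rate computed as events/duration comes out wrong.

```cpp
// Illustration only: the minimal bookkeeping needed for a correct
// rate after merging two datasets.
struct DataSetInfo {
    long events;      // entries selected into the dataset
    double duration;  // effective data-taking time in seconds
};

// Merging must accumulate both counters; keeping only the first
// dataset's duration would inflate the rate.
DataSetInfo Merge(const DataSetInfo& a, const DataSetInfo& b) {
    return {a.events + b.events, a.duration + b.duration};
}

double Rate(const DataSetInfo& d) { return d.events / d.duration; }
```

For example, merging {100 events, 100 s} with {300 events, 100 s} gives 400 events over 200 s, i.e. 2 Hz; if only the first dataset's duration survived the merge, the computed rate would wrongly be 4 Hz.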
Implementation of merging of metadata while importing more than one dataset; in addition, any file inside the list which is not a valid dataset will be skipped.
Addresses issue #486.
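The skip-invalid-files behavior described in this PR could be sketched as a filtering pass over the input list. `IsValidDataSet` here is a placeholder for the real check (which would presumably open the ROOT file and look for a TRestDataSet key); this illustration only inspects the extension:

```cpp
#include <string>
#include <vector>

// Placeholder validity check (illustration only); the real implementation
// would inspect the file contents, not just the name.
bool IsValidDataSet(const std::string& file) {
    return file.size() > 5 && file.substr(file.size() - 5) == ".root";
}

// Keep only files that pass validation, skipping the rest instead of
// aborting the whole import.
std::vector<std::string> SelectDataSets(const std::vector<std::string>& files) {
    std::vector<std::string> accepted;
    for (const auto& f : files)
        if (IsValidDataSet(f)) accepted.push_back(f);
    return accepted;
}
```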