
Dataset merge issue #487

Merged: 5 commits merged from dSMergeIssue into master on Nov 6, 2023

Conversation

juanangp (Member) commented Oct 5, 2023

Implementation of merging of metadata while importing more than one dataset; in addition, any file inside the list which is not a valid dataset will be skipped.

Addresses issue #486.
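
A rough sketch of the skipping behaviour, assuming ROOT-style file access (FilterValidDataSets is an illustrative helper, not the code added in this PR):

#include <TFile.h>
#include <TKey.h>
#include <TList.h>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative only: keep the files that actually contain a TRestDataSet object, skip the rest
std::vector<std::string> FilterValidDataSets(const std::vector<std::string>& fileNames) {
    std::vector<std::string> valid;
    for (const auto& name : fileNames) {
        std::unique_ptr<TFile> file(TFile::Open(name.c_str()));
        if (!file || file->IsZombie()) {
            std::cout << "Skipping unreadable file: " << name << std::endl;
            continue;
        }
        bool hasDataSet = false;
        TIter next(file->GetListOfKeys());
        while (auto* key = static_cast<TKey*>(next())) {
            if (std::string(key->GetClassName()) == "TRestDataSet") {
                hasDataSet = true;
                break;
            }
        }
        if (!hasDataSet) {
            std::cout << "Skipping file without a TRestDataSet: " << name << std::endl;
            continue;
        }
        valid.push_back(name);
    }
    return valid;
}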

@juanangp juanangp requested review from jgalan and lobis October 5, 2023 17:40
@juanangp juanangp requested a review from a team October 6, 2023 07:13

// Keep the earliest start-time filter among the imported datasets
if (REST_StringHelper::StringToTimeStamp(fFilterStartTime) >
    REST_StringHelper::StringToTimeStamp(dS.GetFilterStartTime()))
    fFilterStartTime = dS.GetFilterStartTime();

This is a bit tricky: if dataset 1 was generated with a given startTime filter and dataset 2 with a different one, then to be rigorous I need to know the startTime filter of both.

That way I know that dataset 1 was produced with a given filter and the other was produced with a different time filter. The same goes for the other filters.

Perhaps a new class TRestDataSetMerged could help. We would keep the filters only in TRestDataSet, while in TRestDataSetMerged we would define as metadata members the dataset filenames used for the merge.

And perhaps introduce a hash-id to identify/guarantee the TRestDataSet objects used to create the merged dataset?
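
Just to sketch the idea (this class does not exist; all names below are hypothetical):

#include <string>
#include <vector>

// Hypothetical class holding only the provenance of a merge, while each
// original TRestDataSet keeps its own filters
class TRestDataSetMerged {
   private:
    std::vector<std::string> fDataSetFileNames;  // dataset files used for the merge
    std::string fDataSetHash;                    // hash-id to identify/guarantee the input datasets

   public:
    void AddDataSet(const std::string& fileName) { fDataSetFileNames.push_back(fileName); }
    const std::vector<std::string>& GetDataSetFileNames() const { return fDataSetFileNames; }
    const std::string& GetDataSetHash() const { return fDataSetHash; }
};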

Member

Ok that previous post was me

Member Author

I just removed the timeStamp filters; I think that making a new class TRestDataSetMerged is overkill.

However, I don't know what the best strategy is here, because apart from the filters there is plenty of metadata info which is not saved when we merge several dataSets. Note that before, only the metadata info from the first file was saved.

Member

Actually, for me a dataset does not need to contain all the metadata information that is available in the run files. The dataset must guarantee that we can identify the run files used to generate it; having the selected-files vector is already a good starting point. Then we need to know whether any filters have been applied for event selection. But the final user working with the dataset also needs to know the other conditions applied to generate it, e.g. the detector pressure, the data-taking days, the voltage on the mesh, etc. So, if those filters were used to generate the datasets, they are relevant for the end user.

The metadata information transferred to the dataset in the form of relevant quantities is there to give the dataset physical meaning and to gather all the information required for the analysis into a single entity.

Indeed, if we add new columns to the dataset, we also need to add that metadata information, because if I generate a new column using <addColumn then I need to know the formula used to calculate that column. Not sure if we are doing this right now.
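
As a minimal sketch of that last point (the struct and the AddColumn helper are illustrative; only the member name fColumnNameExpressions is taken from this thread):

#include <map>
#include <string>

// Illustrative only: keep the formula of every derived column next to the data,
// so a merged dataset still documents how each column was computed
struct DataSetColumns {
    std::map<std::string, std::string> fColumnNameExpressions;  // column name -> formula

    void AddColumn(const std::string& columnName, const std::string& expression) {
        fColumnNameExpressions[columnName] = expression;
    }
};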

Member

Ok, I see now that the added columns should appear in fColumnNameExpressions. It's just that last time I couldn't find them in my dataset.

jgalan (Member) commented Oct 11, 2023

> Note that before, only the metadata info from the first file was saved.

In my opinion this should never happen; as I have already stated many times, I do not agree with merging runs. Merging datasets is a different issue, but runs should be considered read-only entities (the end product of event data processing) that can be used later on to generate a data compilation or dataset.

Member Author

> > Note that before, only the metadata info from the first file was saved.
>
> In my opinion this should never happen; as I have already stated many times, I do not agree with merging runs. Merging datasets is a different issue, but runs should be considered read-only entities (the end product of event data processing) that can be used later on to generate a data compilation or dataset.

I am talking about datasets, not TRestRuns. The issue is that if the metadata info of the dataset is not properly handled, the results are wrong when I compute e.g. the rate. In this case the durations of the merged datasets have to be added up, but other metadata info can also be wrong or misleading. I don't know what the proper way to handle the dataset's metadata info while merging is, but this issue was present before this PR.
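
A minimal standalone sketch of that point, just to illustrate why the durations must be summed when merging (all names and numbers here are illustrative, not the TRestDataSet interface):

#include <iostream>
#include <vector>

// Illustrative only: exposure and entries of each dataset entering the merge
struct DataSetInfo {
    double durationSeconds;
    long nEvents;
};

int main() {
    std::vector<DataSetInfo> inputs = {{3600.0, 7200}, {1800.0, 3600}};

    double totalDuration = 0;
    long totalEvents = 0;
    for (const auto& ds : inputs) {
        // Keeping only the first file's duration (3600 s) would give 10800 / 3600 = 3 Hz instead of 2 Hz
        totalDuration += ds.durationSeconds;
        totalEvents += ds.nEvents;
    }

    std::cout << "rate = " << totalEvents / totalDuration << " Hz" << std::endl;  // 10800 / 5400 = 2 Hz
}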

@juanangp juanangp requested a review from lobis November 6, 2023 16:07
@juanangp juanangp merged commit 8dc9f22 into master Nov 6, 2023
62 checks passed
@juanangp juanangp deleted the dSMergeIssue branch November 6, 2023 16:58