
Dataset merge issue #487

Merged: 5 commits, Nov 6, 2023

Changes from 3 commits
1 change: 1 addition & 0 deletions source/framework/core/inc/TRestDataSet.h
@@ -169,6 +169,7 @@ class TRestDataSet : public TRestMetadata {
inline void SetQuantity(const std::map<std::string, RelevantQuantity>& quantity) { fQuantity = quantity; }

TRestDataSet& operator=(TRestDataSet& dS);
Bool_t Merge(TRestDataSet& dS);
void Import(const std::string& fileName);
void Import(std::vector<std::string> fileNames);
void Export(const std::string& filename);
90 changes: 72 additions & 18 deletions source/framework/core/src/TRestDataSet.cxx
@@ -894,12 +894,43 @@ TRestDataSet& TRestDataSet::operator=(TRestDataSet& dS) {
fFilterEqualsTo = dS.GetFilterEqualsTo();
fQuantity = dS.GetQuantity();
fTotalDuration = dS.GetTotalTimeInSeconds();
fFileSelection = dS.GetFileSelection();
fCut = dS.GetCut();

return *this;
}

///////////////////////////////////////////////
/// \brief This function merges the metadata of a different TRestDataSet
/// into the current dataSet
///
Bool_t TRestDataSet::Merge(TRestDataSet& dS) {
auto obsNames = dS.GetObservablesList();
for (const auto& obs : fObservablesList) {
    if (std::find(obsNames.begin(), obsNames.end(), obs) == obsNames.end()) {
RESTError << "Cannot merge dataSets with different observable list " << RESTendl;
return false;
}
}

if (REST_StringHelper::StringToTimeStamp(fFilterStartTime) >
REST_StringHelper::StringToTimeStamp(dS.GetFilterStartTime()))
fFilterStartTime = dS.GetFilterStartTime();

This is a bit tricky: if dataset 1 was generated with a given startTime filter and dataset 2 with a different one, then to be rigorous I need to know the startTime filter of both.

That way I know that dataset 1 was produced with one time filter and the other with a different one. The same goes for the other filters.

Perhaps a new class TRestDataSetMerged could help: we keep the filters only in TRestDataSet, while in TRestDataSetMerged we define as metadata members the dataset filenames used for the merge?

And perhaps introduce a hash-id to identify/guarantee the TRestDataSet used to create the merged dataset?
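
For illustration only, here is an editor's sketch of that proposal; it is not code from this PR, and every member and method name in it is a hypothetical assumption:

```cpp
#include <string>
#include <vector>

#include "TRestMetadata.h"

// Hypothetical skeleton of the proposed TRestDataSetMerged: it records only
// which dataset files were combined and a hash-id identifying them, while the
// time/selection filters stay in TRestDataSet.
class TRestDataSetMerged : public TRestMetadata {
   private:
    std::vector<std::string> fDataSetFileNames;  ///< dataset files used for the merge
    std::string fDataSetHash;                    ///< hash-id identifying the merged inputs

   public:
    void AddDataSetFile(const std::string& fileName) { fDataSetFileNames.push_back(fileName); }
    inline std::vector<std::string> GetDataSetFileNames() const { return fDataSetFileNames; }

    inline void SetDataSetHash(const std::string& hash) { fDataSetHash = hash; }
    inline std::string GetDataSetHash() const { return fDataSetHash; }
};
```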

Member

Ok that previous post was me

Member Author

I just removed the timeStamp filters; I think that creating a new class TRestDataSetMerged is overkill.

However, I don't know what the best strategy is here, because apart from the filters there is plenty of metadata info that is not saved when we merge several dataSets. Note that before, only the metadata info from the first file was saved.

Member

Actually, for me a dataset does not need to contain all the metadata information that is available in the run files. The dataset must guarantee that we can identify the run files used to generate it; having the selected-files vector is already a good starting point. Then we need to know whether any filters have been applied for event selection. But the final user working with the dataset also needs to know the other conditions applied to generate it, e.g. the detector pressure, the data-taking days, the voltage on the mesh, etc. So, if those filters were used to generate the datasets, that is relevant for the end user.

The metadata information transferred to the dataset in the form of relevant quantities is there to provide the dataset with physical meaning, and to gather all the information required for the analysis into a single entity.

Indeed, if we add new columns to the dataset, we also need to add that metadata information: if I generate a new column using <addColumn then I need to know the formula used to calculate that column. Not sure we are doing this right now.
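
As a rough illustration of that concern (an editor's sketch, not part of this PR): before merging, one could at least check that the relevant quantities of both datasets agree. The sketch below compares only the quantity names through GetQuantity(), since the fields of RelevantQuantity are not shown here, and the helper name is made up:

```cpp
#include <map>
#include <string>

#include "TRestDataSet.h"

// Hypothetical helper: returns true when both datasets define exactly the
// same set of relevant-quantity names.
bool HaveSameRelevantQuantities(TRestDataSet& a, TRestDataSet& b) {
    const auto quantitiesA = a.GetQuantity();
    const auto quantitiesB = b.GetQuantity();
    if (quantitiesA.size() != quantitiesB.size()) return false;
    for (const auto& entry : quantitiesA)
        if (quantitiesB.find(entry.first) == quantitiesB.end()) return false;  // missing in the other dataset
    return true;
}
```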

Member

Ok, I see now that the added columns should be stored in fColumnNameExpressions. It's just that last time I couldn't find them in my dataset.

Member
@jgalan, Oct 11, 2023

> Note that before, only the metadata info from the first file was saved.

In my opinion this should never happen; as I have already stated many times, I do not agree with merging runs. Merging datasets is a different issue, but runs should be considered read-only entities (or end products of event data processing) that can be used later on to generate a data compilation or dataset.

Member Author

>> Note that before, only the metadata info from the first file was saved.
>
> In my opinion this should never happen; as I have already stated many times, I do not agree with merging runs. Merging datasets is a different issue, but runs should be considered read-only entities (or end products of event data processing) that can be used later on to generate a data compilation or dataset.

I am talking about datasets, not TRestRuns. The issue is that, if the metadata info of the dataset is not handled properly, the results are wrong when I compute e.g. the rate. In this case the durations of the datasets have to be added up, but other metadata info can also be wrong or misleading. I don't know what the proper way of handling the dataset's metadata while merging is, but this issue was present before this PR.
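
To make the rate point concrete, here is a tiny numeric illustration from the editor (made-up numbers, plain C++, no REST classes): if the merged dataset kept only the first file's duration instead of accumulating fTotalDuration, the computed rate would be inflated.

```cpp
#include <iostream>

int main() {
    // Hypothetical numbers for two datasets being merged.
    double counts1 = 1000., seconds1 = 2000.;  // events and duration of dataset 1
    double counts2 = 500., seconds2 = 1000.;   // events and duration of dataset 2

    // Correct: durations are added up, as Merge() does with fTotalDuration.
    double mergedRate = (counts1 + counts2) / (seconds1 + seconds2);  // 0.5 Hz

    // Wrong: keeping only the first dataset's duration overestimates the rate.
    double naiveRate = (counts1 + counts2) / seconds1;  // 0.75 Hz

    std::cout << "merged rate = " << mergedRate << " Hz, naive rate = " << naiveRate << " Hz\n";
    return 0;
}
```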

if (REST_StringHelper::StringToTimeStamp(fFilterEndTime) <
REST_StringHelper::StringToTimeStamp(dS.GetFilterEndTime()))
fFilterEndTime = dS.GetFilterEndTime();
if (fStartTime > dS.GetStartTime()) fStartTime = dS.GetStartTime();
if (fEndTime < dS.GetEndTime()) fEndTime = dS.GetEndTime();

auto fileSelection = dS.GetFileSelection();
fFileSelection.insert(fFileSelection.end(), fileSelection.begin(), fileSelection.end());

fProcessObservablesList = dS.GetProcessObservablesList();

fTotalDuration += dS.GetTotalTimeInSeconds();

return true;
}

///////////////////////////////////////////////
/// \brief This function imports metadata from a root file
/// it imports the metadata info from the previous dataSet
@@ -956,25 +987,48 @@ void TRestDataSet::Import(std::vector<std::string> fileNames) {
return;
}

if (fileNames.size() == 0) return;

TFile* file = TFile::Open(fileNames[0].c_str(), "READ");
if (file != nullptr) {
TIter nextkey(file->GetListOfKeys());
TKey* key;
while ((key = (TKey*)nextkey())) {
std::string kName = key->GetClassName();
if (REST_Reflection::GetClassQuick(kName.c_str()) != nullptr &&
REST_Reflection::GetClassQuick(kName.c_str())->InheritsFrom("TRestDataSet")) {
TRestDataSet* dS = file->Get<TRestDataSet>(key->GetName());
if (GetVerboseLevel() >= TRestStringOutput::REST_Verbose_Level::REST_Info)
dS->PrintMetadata();
*this = *dS;
int count = 0;
auto it = fileNames.begin();
while (it != fileNames.end()) {
std::string fileName = *it;
TFile* file = TFile::Open(fileName.c_str(), "READ");
bool isValid = false;
if (file != nullptr) {
TIter nextkey(file->GetListOfKeys());
TKey* key;
while ((key = (TKey*)nextkey())) {
std::string kName = key->GetClassName();
if (REST_Reflection::GetClassQuick(kName.c_str()) != nullptr &&
REST_Reflection::GetClassQuick(kName.c_str())->InheritsFrom("TRestDataSet")) {
TRestDataSet* dS = file->Get<TRestDataSet>(key->GetName());
if (GetVerboseLevel() >= TRestStringOutput::REST_Verbose_Level::REST_Info)
dS->PrintMetadata();

if (count == 0) {
*this = *dS;
isValid = true;
} else {
isValid = Merge(*dS);
}

if (isValid) count++;
}
}
} else {
RESTError << "Cannot open " << fileName << RESTendl;
}
} else {
RESTError << "Cannot open " << fileNames[0] << RESTendl;
exit(1);

if (!isValid) {
            RESTError << fileName << " is not a valid dataSet, skipping..." << RESTendl;
it = fileNames.erase(it);
} else {
++it;
}
}

if (fileNames.empty()) {
RESTError << "File selection is empty, dataSet will not be imported " << RESTendl;
return;
}

    RESTInfo << "Opening list of files. First file: " << fileNames[0] << RESTendl;
Expand Down