-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a require_data()
method
#715
Add a require_data()
method
#715
Conversation
Codecov Report
@@ Coverage Diff @@
## main #715 +/- ##
=====================================
Coverage 94.9% 94.9%
=====================================
Files 58 58
Lines 5914 5952 +38
=====================================
+ Hits 5614 5654 +40
+ Misses 300 298 -2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
I think the return values should contain all identifiers to missing data points, not just model and scenario as it could otherwise lead to ambiguities. As an example these data:
with a required data specification: region: [region_a, region_b]
variable: variable_a
year: [2020, 2025] would currently return model_a and scenario_a. That does not tell you the whole story though as for the first line in the data a value for 2025 is missing while for the second line, which is also model_a and scenario_a, everything is in order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly the continuation of my previous comment.
|
||
# identify scenarios that have some but not all required values | ||
columns = [i for i in self.coordinates if i not in required] | ||
rows = rows.droplevel(level=columns).drop_duplicates() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In keeping with the theme of my comment of returning all identifiers of the missing values I don't think we should drop the relevant columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make the validation later len(df) != n
impossible, because there may be multiple entries along a not-required dimension.
) | ||
|
||
index_incomplete = pd.DataFrame( | ||
[idx for idx, df in data if len(df) != n], columns=self.index.names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is purely a clarifying question on my side, why is the if len(df) !=n
needed? As I understand it n
is the total number of combinations of required data fields, right? So for example two variables for two regions for three years would be 2x2x3=12?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the number of required data (n
, computed as the product of the number of values per dimension) does not equal to the number of existing data len(df)
, then some of the required data is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that assumption not fall flat as soon as you provide multiple values for unit
? Maybe we should restrict the unit attribute to a single one then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is why I remove the columns that are not required and then drop duplicates...
(And this issue is not restricted the units, you have the same problem when you only require variables and you have multiple regions).
It's not quite straightforward to develop a useful interface (or object) for this "complete" return-object. Riffing off the example above: Should the returned dataframe contain a column "unit"? If yes, the value should probably be Second, I see a problem that computing the missing index will be quite a lot of processing, only to be thrown away later anyway. Suggestion: we leave as is for now (or even simplify just return |
My preference would be the latter. Just keeping it as a kind of validator of, "is everything that we require there?". The question of "what is missing?" could almost be a separate one. |
I had an idea of a different, maybe more simple, approach that might be worth exploring. def require_data(self, region=None, variable=None, unit=None, year=None, exclude_on_fail=False):
missing_datapoints = {}
# combine "everything with everything"
for datapoint in itertools.product([region, variable, unit, year]):
# iterate over all model, scenario combinations
for model, scenario in self.index:
# obviously this won't work exactly like this but just to illustrate the idea
if datapoint not in self[model, scenario]:
missing_datapoints[(model, scenario)] = datapoint
if missing_datapoints:
return missing_datapoints |
Yes, that's the (only) way to go if we want to have a complete list of all missing (combinations of) required items. But if the use case (for now) is just to know whether all required items exist and (optionally) exclude those scenarios that do meet the requirement from further processing, checking the length of an existing index is much more efficient than creating a second index and comparing the two indices line-by-line. |
Yes, true. I was already thinking towards providing user feedback to point exactly what data points are missing but that’s probably better handled in a different function if we really need that level of precision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, ready to be merged.
Please confirm that this PR has done the following:
Name of contributors Added to AUTHORS.rstDescription of PR
This PR adds a
require_data()
method as a generalized approach torequire_variable()
. The new method returns none if all scenarios have all (combinations) of required values on the dimensions, or it returns a DataFrame of model-scenarios where not all required values are present.The PR also adds a (data) "coordinates" attribute to quickly access the data columns that are not in the index.
Question
The existing validation methods like
require_variable()
are very verbose. Here, I have not added any logging output. Any preferences?