Possible improvements in cube concatenation to be discussed with iris developers #1423

Open
valeriupredoi opened this issue Jan 17, 2022 · 16 comments
Labels
question Further information is requested

Comments

@valeriupredoi
Contributor

valeriupredoi commented Jan 17, 2022

Hey @ESMValGroup/esmvaltool-coreteam - I got a reply on a long forgotten issue I had opened in the SciTools GH repo about cube concatenation, see SciTools/iris#3696 (comment) - and was wondering if we need to float any ideas about possible improvements of this functionality - shout out to the iris folk who are always improving their package based on our suggestions BTW 🍺

@valeriupredoi valeriupredoi added the question Further information is requested label Jan 17, 2022
@bouweandela
Member

This seems a relevant issue SciTools/iris#4446 and I recently opened SciTools/iris#4453. Another issue that could be interesting is #1068, though that might be an issue with Dask instead of iris, not sure.

@zklaus

zklaus commented Jan 17, 2022

I am rather confused about the original issue mentioned by @valeriupredoi. Basically, this is about what to do when concatenating two cubes that have an overlap in the concatenation dimension. But there really is no sensible default. The first three options that come to my mind are:

  • Prefer the earlier cube and throw away the overlapping part in the later cube
  • Prefer the later cube and throw away the overlapping part in the earlier cube
  • Average the cubes on the overlap

Instead of a simple average, any number of interpolation and mixing options are imaginable.

Point is, all of these options make sense in different situations, so what should a poor library do about it? I think Iris's behavior of not doing the concatenation is quite sensible.
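To make the three options above concrete, here is a minimal sketch. It is not Iris code: plain `{time: value}` dictionaries stand in for cubes, and `merge_overlapping` and its `strategy` argument are hypothetical names used only for illustration.

```python
from statistics import mean

STRATEGIES = ("prefer_earlier", "prefer_later", "average")


def merge_overlapping(earlier, later, strategy):
    """Merge two {time: value} series that may overlap in time.

    Hypothetical helper: dictionaries stand in for cubes here.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    merged = {}
    for time in sorted(set(earlier) | set(later)):
        if time in earlier and time in later:
            # Overlapping point: the chosen strategy decides the value.
            if strategy == "prefer_earlier":
                merged[time] = earlier[time]
            elif strategy == "prefer_later":
                merged[time] = later[time]
            else:
                merged[time] = mean([earlier[time], later[time]])
        else:
            # Point present in only one series: keep it unchanged.
            merged[time] = earlier[time] if time in earlier else later[time]
    return merged


historical = {2013: 1.0, 2014: 2.0, 2015: 3.0}
scenario = {2015: 5.0, 2016: 6.0}

for name in STRATEGIES:
    print(name, merge_overlapping(historical, scenario, name))
```

All three calls return a valid series over 2013 to 2016 and differ only in the value at 2015, which is why no single choice is an obvious default.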

@valeriupredoi
Contributor Author

I agree with Klaus. Shall we give people more time to chime in if they want to, say until the end of this week, then close both issues?

@zklaus ->

I am rather confused about the original issue mentioned by @valeriupredoi.

That's an old issue, and I honestly can't remember what I was trying to get out of it. I believe I wrote a fix for us, and that's been used since. What matters here is Will's will (heh, pun intended!) to make the concatenation a bit more flexible, and whether we want to send feedback on that 👍

@schlunma
Contributor

I agree with Klaus

+1

@zklaus

zklaus commented Jan 18, 2022

Yeah, what we did in #615 is force option two from my little list above.

@valeriupredoi
Contributor Author

aha! here's a relevant concatenation issue #932

@WilliamIngramAtmosphericPhysics

there really is no sensible default. The first three options that come to my mind are:

  • Prefer the earlier cube and throw away the overlapping part in the later cube
  • Prefer the later cube and throw away the overlapping part in the earlier cube
  • Average the cubes on the overlap

To me the sensible default seems

  • Check the earlier & later cube agree within rounding error for the overlapping part. If so, prefer whichever is more convenient to code because it doesn't matter. If not, fail telling the user clearly how the cubes contradict each other.

But am I missing something?
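A minimal sketch of that default, assuming plain `{time: value}` dictionaries in place of cubes. The function name `concatenate_if_consistent` and the tolerance parameters are hypothetical and not part of Iris; `math.isclose` is used as one possible reading of "agree within rounding error".

```python
import math


def concatenate_if_consistent(first, second, rel_tol=1e-9, abs_tol=0.0):
    """Join two {time: value} series, failing loudly if the overlap disagrees.

    Hypothetical helper: dictionaries stand in for cubes here.
    """
    overlap = sorted(set(first) & set(second))
    conflicts = [
        (t, first[t], second[t])
        for t in overlap
        if not math.isclose(first[t], second[t], rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    if conflicts:
        # Tell the user exactly where and how the inputs contradict each other.
        details = ", ".join(f"t={t}: {a} vs {b}" for t, a, b in conflicts)
        raise ValueError(f"series disagree on the overlap: {details}")
    # The overlap agrees, so it does not matter which copy is kept.
    return dict(sorted({**first, **second}.items()))


run_a = {0: 1.0, 1: 2.0}
run_b = {1: 2.0 + 1e-12, 2: 3.0}
print(concatenate_if_consistent(run_a, run_b))
```

Note that the check touches every overlapping value, so its cost grows with the size of the overlap.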

@zklaus

zklaus commented Jan 31, 2022

there really is no sensible default. The first three options that come to my mind are:

  • Prefer the earlier cube and throw away the overlapping part in the later cube
  • Prefer the later cube and throw away the overlapping part in the earlier cube
  • Average the cubes on the overlap

To me the sensible default seems

  • Check the earlier & later cube agree within rounding error for the overlapping part. If so, prefer whichever is more convenient to code because it doesn't matter. If not, fail telling the user clearly how the cubes contradict each other.

But am I missing something?

That seems to be an extraordinarily costly operation that still would only hide a manifest error in the input data. Perhaps I should clarify a little bit the use-cases that I/we have in mind.

Generally speaking, they have to do with different experiments that are direct continuations of each other. The first that comes to mind is a historical simulation, i.e. one that uses observations as forcings for a past period, that serves as the starting point for a scenario, i.e. a climate simulation of a future time period that is driven by best-estimate forcings. In this situation, it is fairly common to later extend the historical experiment, which creates the overlap; it is then interesting to compare the extension with the earlier scenario.
Another situation is at the beginning of these historical simulations. They frequently take their initial conditions from long-running constant-forcing, so-called control simulations. These control simulations usually run alongside the historical simulations to provide a baseline comparison, again creating an overlap. In this situation, it is interesting to compare time series that are stitched together at different points in time.

Many different situations are imaginable, but I struggle to come up with one, where it is expected that the time series actually agree on the overlap.

@WilliamIngramAtmosphericPhysics

Perhaps I should clarify a little bit the use-cases that I/we have in mind.

Ah, thanks. They had not occurred to me because they are not concatenating the 2 cubes as I understand the words. What I would say you want to do is to concatenate the earlier part of the cube of data from the original run with the cube of data from the spun-off run.
So the transparent way to do what you want seemed immediately to me to take a slice in time (or in principle any dimension) to remove the data you do not want to concatenate, & concatenate what you do want.
But OK, while it is very easy to take a slice like that if you know which dimension is where, & how much you want to keep or to discard, it may not always be simple to establish that, so there may be good reason to allow users to leave it to the code.
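As a rough illustration of the slice-then-concatenate approach, the sketch below uses plain lists of `(time, value)` pairs in place of cubes. The helper `trim_before` and the sample data are made up for this example; none of it is Iris API.

```python
def trim_before(series, cutoff):
    """Keep only the (time, value) points strictly before `cutoff`."""
    return [(t, v) for t, v in series if t < cutoff]


# Original run and a run spun off from it at 2012 (made-up values).
original_run = [(2010, 0.1), (2011, 0.2), (2012, 0.3), (2013, 0.4)]
spun_off_run = [(2012, 0.9), (2013, 1.0), (2014, 1.1)]

# Explicitly discard the part of the original run that the spun-off run covers,
# then join the two pieces; nothing is dropped without the user saying so.
branch_time = spun_off_run[0][0]
combined = trim_before(original_run, branch_time) + spun_off_run
print(combined)
```

The explicit step is finding `branch_time`; once it is known, the concatenation itself has no overlap left to resolve.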

Still, IMO apparently-valid data should not be discarded without explicit instructions. So I'd suggest adding arguments "discard_from_1st" & "discard_from_2nd". If one was set, overlap would be accepted & data discarded as stated. If both, it would fail saying why. If neither, & it found overlap, it should fail or do what I previously suggested. (The latter sounds better to me, but I have absolutely no use case.)
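A sketch of how those two arguments could behave, again with `{time: value}` dictionaries standing in for cubes. The function `concatenate_with_discard` is hypothetical; only the argument names come from the suggestion above, and the "neither flag set" case is implemented here as a plain failure (the simpler of the two options mentioned).

```python
def concatenate_with_discard(first, second,
                             discard_from_1st=False, discard_from_2nd=False):
    """Join two {time: value} series, discarding overlap only when told to.

    Hypothetical helper: dictionaries stand in for cubes here.
    """
    if discard_from_1st and discard_from_2nd:
        raise ValueError("set only one of discard_from_1st and discard_from_2nd")
    overlap = set(first) & set(second)
    if overlap and not (discard_from_1st or discard_from_2nd):
        raise ValueError(
            f"overlap at {sorted(overlap)} and no discard flag was set"
        )
    if discard_from_1st:
        first = {t: v for t, v in first.items() if t not in overlap}
    if discard_from_2nd:
        second = {t: v for t, v in second.items() if t not in overlap}
    return dict(sorted({**first, **second}.items()))


control = {1: "control", 2: "control", 3: "control"}
branch = {3: "branch", 4: "branch"}
print(concatenate_with_discard(control, branch, discard_from_1st=True))
print(concatenate_with_discard(control, branch, discard_from_2nd=True))
```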

Many different situations are imaginable, but I struggle to come up with one, where it is expected that the time series actually agree on the overlap.

The ones I thought of in response to what you said do (apart possibly from rounding error) - & have only a "technical" overlap of 1 timelevel. I imagined e.g. breaking down big datasets to process a year/decade/century at a time, but doing each completely, so each cube had, say, 00Z on New Year's Day & 24:00Z on New Year's Eve; or spinning off an instantaneous CO2 doubling from a control run, creating its cubes including its starting-point as physically necessary for analysis, & then having reason to add earlier decades/centuries. I suppose anything longer would not have seemed "concatenation" to me.

@WilliamIngramAtmosphericPhysics

I'd suggest adding arguments "discard_from_1st" & "discard_from_2nd".

I was forgetting one can concatenate more than 2 cubes!

So "discard_from_earlier_in_list" & "discard_from_later_in_list" seem right to me - not earlier or later in time, as it should work for all coordinates, & also the user may not know or want to know if the coordinate values increase or decrease (they may not even have the physical direction they expect, e.g. pressure v height), while they will normally know what order they're specifying the cubes they want to combine. (OK, if they want the 2nd cube to take priority over both the 1st & 3rd, say, they'd have to concatenate twice.)

But as always, I may be missing something.
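One way to read the list-order idea, sketched with `{coordinate: value}` dictionaries instead of cubes. The function `concatenate_by_priority` is a hypothetical name, and this sketch covers only the "discard from later in list" case, where the first series in the list to supply a coordinate value wins.

```python
def concatenate_by_priority(series_in_priority_order):
    """Join {coordinate: value} series; earlier entries in the list win overlaps.

    Hypothetical helper: dictionaries stand in for cubes here.
    """
    merged = {}
    for series in series_in_priority_order:
        for coord, value in series.items():
            # setdefault keeps the value from the higher-priority series.
            merged.setdefault(coord, value)
    # Sort by coordinate only for display; priority came from the list order.
    return dict(sorted(merged.items()))


cube_a = {1: "a", 2: "a"}
cube_b = {2: "b", 3: "b"}
cube_c = {3: "c", 4: "c"}

# cube_b takes priority over both cube_a and cube_c in a single call.
print(concatenate_by_priority([cube_b, cube_a, cube_c]))
```

The result does not depend on whether the coordinate values increase or decrease, only on the order in which the series are listed.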

@zklaus

zklaus commented Feb 1, 2022

The issue with that approach would be that the order of the cubes in the list can be changed on concatenation, precisely to line all the (e.g.) times up correctly. Overall, I think this is all more hassle than it's worth since, as you correctly mention, slicing the cubes beforehand appropriately is the right and explicit approach.

I do think that in downstream (from Iris) projects (such as ESMValTool) corresponding helper functions can evolve. If they turn into something sufficiently general, we might propose it for inclusion in a later version of Iris.

@WilliamIngramAtmosphericPhysics

The issue with that approach would be that the order of the cubes in the list can be changed on concatenation,

To me that seems an advantage - you can get any cube to over-ride any cube just by specifying them in order of importance, & only 1 of the 2 arguments I suggested is needed.

Overall, I think this is all more hassle than it's worth

:-)

@zklaus

zklaus commented Feb 2, 2022

Sorry for being unclear. What I meant is that Iris already re-orders the cubes on concatenate as it deems appropriate. As such, it is not clear that there is a good way to take any prior order into account.

@WilliamIngramAtmosphericPhysics

I don't think you were at all unclear, but I must have been - what I was suggesting is that if the user knows or suspects there is overlap, & wants concatenation to go ahead, & knows which cubes they want data to be discarded from, they could specify the cubes in "priority" order & set the flag to say so.

But as you say, it may be best to forget the idea.

@zklaus

zklaus commented Feb 2, 2022

Don't get me wrong, I think the topic could benefit from addressing and helping the user to realize different use-cases. I just wouldn't bake it into the main CubeList.concatenate function at this point, but rather draw up a few concrete use-cases and implement support for them in another place; perhaps a separate function in Iris, perhaps in a tool like ESMValTool, perhaps somewhere else.

@valeriupredoi
Contributor Author

here's another facet to this general issue SciTools/iris#4720
