Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure positions are in code_length in get_radius function. #4091

Merged
merged 3 commits into from
Aug 25, 2022

Conversation

brittonsmith
Copy link
Member

PR Summary

PR #4079 changed results for particle_radius calculation within a yt frontend embedded in ytree. After some digging, I discovered that positions for that frontend are returned in "unitary" units which, for that data, are not the same as "code_length". Unless we have a specification requiring positions to be returned in "code_length" for all frontends, this change is probably necessary. This may be quietly failing for other frontends.

PR Checklist

  • New features are documented, with docstrings and narrative docs
  • Adds a test for any bugs fixed. Adds tests for new features.

@neutrinoceros
Copy link
Member

Good catch. I note that unyt_array.to will unconditionally create a copy, so maybe we could make it conditional to preserve optimal performances when input units are already compliant.

@neutrinoceros neutrinoceros added this to the 4.1.0 milestone Aug 22, 2022
@neutrinoceros neutrinoceros added the release critical Highest priority (in a milestone) label Aug 22, 2022
@matthewturk
Copy link
Member

Once upon a time in the long past we talked about having some types of unit arrays that were "IndexArrays" (@jzuhone worked on this for a while) where the units were immutable internally to the array. Because of the amount of time we spend doing stuff like unit checks and unit conversions even internally in selection_routines.pyx (and its included files) this would likely be a huge speedup if we could sidestep 100% of those checks. (For instance, a fun profile is to look at how time is spent computing ghost zones for a FLASH dataset.)

@brittonsmith
Copy link
Member Author

Good catch. I note that unyt_array.to will unconditionally create a copy, so maybe we could make it conditional to preserve optimal performances when input units are already compliant.

Good points. I'll do this.

@brittonsmith
Copy link
Member Author

Is this test failure related?

@neutrinoceros
Copy link
Member

It's not. It's just a random failure. I note that it's the third time I'm seeing it this week, so maybe we'll need to do something about it, but for now let's just try again
@yt-fido test this please

@yipihey
Copy link
Contributor

yipihey commented Aug 23, 2022

I had commented on this on the closed PR but perhaps it can help the discussion here?:

Perhaps we are mixing two concepts?
One is "code_units" and one is the "original_units", i.e. which units are used in the file we are using. I think it is advantageous to require frontends to not modify the numbers it reads from disk.
I think of two different use cases of yt.

  1. Debugging a code that generates the data we are reading in. It is desirable that yt attempts to not modify the data it reads so we can have some certainty we are debugging the code that generates the data.
  2. If I use yt to convert from one data format to another. E.g. I read all particle positions in enzo and then store them as a simpler hdf5 file where the positions are a simple array. In this case I want to make sure that I'm just changing the shape of the data but do not change any of the numbers.

The positive side effect that it also avoid unnecessary computation which saves a little time.

Is there already the concept of "original_units" ? I.e. give us a simple way to check in what units the data was provided?

@brittonsmith
Copy link
Member Author

Is there already the concept of "original_units" ? I.e. give us a simple way to check in what units the data was provided?

The closest we have to this is the field info container (e.g., ds.field_info). For example, for the Enzo particle positions, it shows:

>>> ds.field_info['all', 'particle_position_x']
On-Disk Field ('all', 'particle_position_x'): (units: 'code_length', particle field)

In theory, get_radius could check the field_info to see what units the data should be in, then convert the relevant variables (e.g., center, domain_width, radius2) to that. This would likely be the solution that would result in converting units for the data array least frequently, but there is still the potential to mess this up if units are not checked at all. For example, cached data may have been converted to a different unit by another operation.

@@ -18,8 +18,11 @@ def get_radius(data, field_prefix, ftype):
if any(data.ds.periodicity):
rdw = radius2.v
for i, ax in enumerate("xyz"):
pos = data[ftype, f"{field_prefix}{ax}"]
if str(pos.units) != "code_length":
pos.convert_to_units("code_length")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we're converting in place, changing the data outputted by ytree. Are we certain this isn't going to cause problems in ytree ?
Also I don't think I understand why it can't be changed in ytree.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we can guarantee that every yt frontend is reading particle positions in "code_length", the problem will not be limited to ytree. It makes me nervous that we are not checking units before operating on memory views.

I'm happy to switch it back to making a copy. Would the solution I proposed above be better? That is, should we check the units of the position field using the field info container and the convert the other variables to that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we can guarantee that every yt frontend is reading particle positions in "code_length", the problem will not be limited to ytree. It makes me nervous that we are not checking units before operating on memory views.

ok that makes sense.

I'm happy to switch it back to making a copy.

I think it'd be the best option because it's the most conservative approach for these edge cases where we're not sure what to expect, and also an incentive for frontend development to use "code_lenght" more uniformly since it allows to use the fast track where no copy is performed.

Would the solution I proposed above be better? That is, should we check the units of the position field using the field info container and the convert the other variables to that?

I guess I don't have strong opinions there. I think the current solution is already good in that it is probably the simplest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with sticking with the simplest solution for now. I'll switch thing back to using a copy and forgo any further optimization.

Copy link
Member

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is now in an uncontroversial state that should preserve both backward compatibility for ytree and recent optimization when available.

@neutrinoceros neutrinoceros merged commit 379cd1a into yt-project:main Aug 25, 2022
@brittonsmith brittonsmith deleted the rfix branch August 26, 2022 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug release critical Highest priority (in a milestone)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants