Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

55/refactor fix_latlon_coord #63

Merged
merged 30 commits into from
Sep 17, 2024
Merged

55/refactor fix_latlon_coord #63

merged 30 commits into from
Sep 17, 2024

Conversation

blimlim
Copy link
Collaborator

@blimlim blimlim commented Aug 2, 2024

This pull request closes #55. Apologies that there's a bit in here... It refactors the fix_latlon_coord function, which was used to add bounds, convert types, and add variable names for each cube's horizontal coordinates.

The main structural changes are to split the function into several more modular ones in order to help with readability and unit testing. Other changes to try and improve readability and testability include:

  • Adding docstring
  • Adding constants for repeated strings and preset bounds arrays
  • Modifying exception handling so that it occurs within the fix_latlon_coord function, and raises a more specific exception.
  • Update run script to handle alternate exception class?
  • Removes redundant cube coordinate lookups
  • Convert if check calcs to plain functions
  • Use descriptive names for variables used in the checks

To help with the unit tests, the pull request adds a DummyCoordinate class to mimic iris cube coordinate objects with imitations of the name and has_bounds methods needed to test the new functions. It also adds an extension to the DummyCube class, DummyCubeWithCoords which can be used to add DummyCoordinate to a DummyCube.

Any suggestions about the structure, logic, tests, and anything else would be great!

@@ -1,6 +1,8 @@
import unittest.mock as mock
from dataclasses import dataclass
from collections import namedtuple
import numpy as np
from iris.exceptions import CoordinateNotFoundError

import umpost.um2netcdf as um2nc
Copy link
Contributor

@marc-white marc-white Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, how am I meant to get this working for testing? I can't pip install -e the umpost directory in its current state, so I'm getting test collection errors.

Copy link
Collaborator

@truth-quark truth-quark Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, due to prior history, this project has started oddly. Spencer is using a pre-existing gadi environment & I'm reusing a local virtualenv from another related project. There's been no setup docs until last week, when I grabbed docs from a related project. Do you feel like yet another PR review for docs? #62
(these docs are missing pip -e)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like me some good docs...

umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@truth-quark truth-quark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, there is much more testing that I initially thought!

For some initial thoughts: fixing um2nc is tricky on account of uncertain requirements (gathered from the code), retro fitting tests & that the processing code contains substantial nested boolean logic. The latter causes a combinational testing blowout (more than the the fix naming funcs).

In both the earlier cube renaming & lat/long fixes, our testing approach is blowing out in an attempt to cover most/all code paths. Part of the problem in being this thorough, is writing tests against some internal implementation details. Some mocks test_fix_lat_coord_name() & the process/masking fixes are a symptom of coupling to internal details (although some of this is unavoidable as we work around some constraints).

Additionally, I assume writing the lat/long tests was tricky, which implies future maintenance/code changes could be harder (e..g changing processing code potentially breaks multiple tests). This could happen if cube modification requirements change.

Thus, I've been rethinking our testing strategy, helped with these TDD videos (~1 hour each):

Ian Cooper: “TDD, Where did it all go wrong?”
https://www.youtube.com/watch?v=EZ05e7EMOLM

Ian Cooper: “TDD Revisited” (more instructional)
https://www.youtube.com/watch?v=IN9lftH0cJc

One key message is focusing testing to external interfaces avoiding referencing internal implementation details. This suggests focusing tests around fix_latlon_coords() & skipping direct testing of sub functions. Calling the sub-functions can be treated as an implementation detail, with the tests asserting correct end results (thus, we treeat the check functions as correct/tested if the cube vars are correct). Also, as um2nc cube mod functions often don't return anything, we'll have to make assertions on modified cubes. In decoupling from details, we should move away from assertions on mocks (except for the DummyCubes). Some mocking is needed as an intermediate stop gap while the codebase is redesigned, but I think we can reduce it iteratively.

As a next step, I think we should try an experiment to see how this works, trying these steps:

  • Temporarily comment out the grid type & bounds tests as an implementation detail
    • e.g. test_is_lat/lon_river_grid, test_add_latlon_coord_bounds_has_bounds()
  • Test a common use case with a test_fix_latlon_coords_type_change() type function
    • Remove mock calls, so fix_latlon_coords() executes code for grids, bounds & coord naming fixes etc
    • Are more compound data fixtures needed to make a valid input?
  • Try assertions on the modded cube (check bounds, data types etc)
  • Run coverage tests to explore how this higher level testing covers the code

If the above is straightforward:

  • configure a cube with different data to execute other branches
  • run tests & re-analyse the coverage

Otherwise, if the testing is fiddly/tricky with setup or otherwise:

  • adjust experiment, test against the mid level funcs add_latlon_coord_bounds() & fix_..._coord_name()
  • explore the coverage...

Analysing higher level testing & coverage is probably a good pair dev exercise. It's also likely our code repair efforts will require starting with coupled tests, then iteratively refactoring the processing code & tests. Thus while things are harder now, it's temporary while iterating towards better design.

Hopefully this makes sense!

umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
test/test_um2netcdf.py Outdated Show resolved Hide resolved
test/test_um2netcdf.py Outdated Show resolved Hide resolved
test/test_um2netcdf.py Outdated Show resolved Hide resolved
test/test_um2netcdf.py Outdated Show resolved Hide resolved
test/test_um2netcdf.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@truth-quark truth-quark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that required substantially more effort than the initial skim indicated. Apologies!

This looks good to go & will help with the process of merging all the current PRs. @marc-white do you want to do a final check?

Comment on lines +136 to +140
# TODO: This test will not catch changes to the exceptions in um2nc. E.g.
# if decided to instead raise a RuntimeError in um2nc when timeseries
# are encountered, the conversion of ESM1.5 outputs would undesirably
# crash, but this test would say everything is fine, since we're
# prescribing the error that is raised.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could define the error type raised in this instance as a global variable in conversion_driver_esm1p5.py, and then import it for use in the test as the expected error type?

test/test_um2netcdf.py Outdated Show resolved Hide resolved
Comment on lines 17 to 18
D_LAT_N96 = 1.25
D_LON_N96 = 1.875
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment what these numbers are for clarity?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these meant to be DELTA_LAT_N96 & DELTA_LON_N96 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use similar naming here as in other parts of um2nc, eg here and here

I'm wondering would it be best to stick to that naming, and add a comment clarifying that these are grid spacings in degrees?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes just go the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments added in 21dc396

Comment on lines 129 to 185
def fix_lat_coord_name(lat_coordinate, grid_type, dlat):
"""
Add a 'var_name' attribute to a latitude coordinate object
based on the grid it lies on.

lat = cube.coord('latitude')
NB - Grid spacing dlon only refers to variables on the main
horizontal grids, and not the river grid.

# Force to double for consistency with CMOR
lat.points = lat.points.astype(np.float64)
_add_coord_bounds(lat)
lon = cube.coord('longitude')
lon.points = lon.points.astype(np.float64)
_add_coord_bounds(lon)

lat = cube.coord('latitude')
if len(lat.points) == 180:
lat.var_name = 'lat_river'
elif (lat.points[0] == -90 and grid_type == 'EG') or \
(np.allclose(-90.+0.5*dlat, lat.points[0]) and grid_type == 'ND'):
lat.var_name = 'lat_v'
Parameters
----------
lat_coordinate: coordinate object from iris cube (edits in place).
grid_type: (string) model horizontal grid type.
dlat: (float) meridional spacing between latitude grid points.
"""

if lat_coordinate.name() != LATITUDE:
raise ValueError(
f"Wrong coordinate {lat_coordinate.name()} supplied. "
f"Expected {LATITUDE}."
)

if is_lat_river_grid(lat_coordinate.points):
lat_coordinate.var_name = VAR_NAME_LAT_RIVER
elif is_lat_v_grid(lat_coordinate.points, grid_type, dlat):
lat_coordinate.var_name = VAR_NAME_LAT_V
else:
lat_coordinate.var_name = VAR_NAME_LAT_STANDARD


def fix_lon_coord_name(lon_coordinate, grid_type, dlon):
"""
Add a 'var_name' attribute to a longitude coordinate object
based on the grid it lies on.

NB - Grid spacing dlon only refers to variables on the main
horizontal grids, and not the river grid.

Parameters
----------
lon_coordinate: coordinate object from iris cube (edits in place).
grid_type: (string) model horizontal grid type.
dlon: (float) zonal spacing between longitude grid points.
"""

if lon_coordinate.name() != LONGITUDE:
raise ValueError(
f"Wrong coordinate {lon_coordinate.name()} supplied. "
f"Expected {LATITUDE}."
)

if is_lon_river_grid(lon_coordinate.points):
lon_coordinate.var_name = VAR_NAME_LON_RIVER
elif is_lon_u_grid(lon_coordinate.points, grid_type, dlon):
lon_coordinate.var_name = VAR_NAME_LON_U
else:
lon_coordinate.var_name = VAR_NAME_LON_STANDARD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given these two functions have an almost identical structure, is it worth considering refactoring them as follows:

  • A private helper function that does the actual work;
  • Turn the existing functions into wrappers that call the private functions with the expecyed coordinate name, and comparison options for that coordinate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put together a trial of this - let me know if it's along the lines of what you were thinking! The first attempt was using a private function:

def _fix_horizontal_coord_name(coordinate, grid_type, grid_spacing, 
                               river_grid_check,
                               river_grid_name, staggered_grid_check,
                               staggered_name, base_name):
    if river_grid_check(coordinate.points):
        coordinate.var_name = river_grid_name
    elif staggered_grid_check(coordinate, grid_type, grid_spacing):
        coordinate.var_name = staggered_name
    else:
        coordinate.var_name = base_name



def fix_lat_coord_name(lat_coordinate, grid_type, dlat):
    _fix_horizontal_coord_name(coordinate = lat_coordinate,
                               grid_type=grid_type,
                               grid_spacing=dlat,
                               river_grid_check=is_lat_river,
                               river_grid_name=VAR_NAME_LAT_RIVER,
                               staggered_grid_check=is_lat_v_grid,
                               staggered_name=VAR_NAME_LAT_V,
                               base_name=VAR_NAME_LAT_STANDARD)

And another option was to put together a dictionary holding the specified checks:

HORIZONTAL_GRID_NAMING_DATA = {
    LATITUDE: {
        "river_grid_check": is_lat_river_grid,
        "river_grid_name": VAR_NAME_LAT_RIVER,
        "staggered_name": VAR_NAME_LAT_V,
        "base_name":  VAR_NAME_LAT_STANDARD,
        "staggered_grid_check": is_lat_v_grid
    },
    LONGITUDE: {
        "river_grid_check": is_lon_river_grid,
        "river_grid_name": VAR_NAME_LON_RIVER,
        "staggered_grid_check": is_lon_u_grid,
        "staggered_name": VAR_NAME_LON_U,
        "base_name":  VAR_NAME_LAT_STANDARD,
    }
}


def fix_latlon_coord_name(coordinate, grid_type, grid_spacing):
    coord_name = coordinate.name

    if HORIZONTAL_GRID_NAMING_DATA["river_grid_check"](coordinate.points):
        coordinate.var_name = HORIZONTAL_GRID_NAMING_DATA["river_grid_name"]
    elif HORIZONTAL_GRID_NAMING_DATA["staggered_grid_check"](coordinate.points, grid_type, grid_spacing):
        coordinate.var_name = HORIZONTAL_GRID_NAMING_DATA["staggered_name"]
    else:
        coordinate.var_name = HORIZONTAL_GRID_NAMING_DATA["base_name"]

In terms of readability, I think I find the separate tests a bit easier to follow, though it is some amount of duplicate code. Happy to hear other opinions and ideas though!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these experiments on other branches?

While the code & tests work here (with a bit of duplication), refining the code can be deferred in favour of modularisation & test coverage. @blimlim What are your thoughts on adding an neatening task to issue #27, linking it to this discussion & returning to it later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred action sounds like a good idea.

My original thought was like your option 2, but even more general - each coordinate could have an iterable of (check, name) tuples that you go through until you hit a match. The last tuple in the iterable would then be (True, <default name>).

umpost/um2netcdf.py Outdated Show resolved Hide resolved
umpost/um2netcdf.py Outdated Show resolved Hide resolved
Comment on lines 264 to 268
if coordinate_name not in [LONGITUDE, LATITUDE]:
raise ValueError(
f"Wrong coordinate {coordinate_name} supplied. "
f"Expected one of {LONGITUDE}, {LATITUDE}."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever likely to get coordinate names other than LATITUDE and LONGITUDE? If so, it might be worth parameterizing out the 'set of coordinate names I expect' for use in this situation. You could also do it using GLOBAL_COORDS_BOUNDS.keys().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added in this error to try to guard against add_latlon_coord_bounds being inadvertently used on the wrong coordinates, and so here we should only be expecting LATITUDE or Longitude.

Would adding a constant eg HORIZONTAL_COORD_NAMES = [LATITUDE, LONGITUDE], and then checking if coordinate_name not in HORIZONTAL_COORD_NAMES be a bit cleaner here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precisely, although you've possibly already got the 'constant' available as GLOBAL_COORDS_BOUNDS.keys().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is GLOBAL_COORDS_BOUNDS likely to change, therefore adding keys that coord name should not be?

The advantage of the current line is that the test condition is obvious.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a minor formatting thing, lines 266-268 look over indented.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated the indentation in 37049c7

Copy link
Contributor

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments I've made are only tweaks that you're free to think about and implement or not at your discretion - good job!

@truth-quark
Copy link
Collaborator

The comments I've made are only tweaks that you're free to think about and implement or not at your discretion - good job!

As a general observation, I think staggering the 2 reviews has been beneficial. The later 2nd review brings fresh eyes after a few refinements. I'm pretty sure I inadvertently started to skim the code as it got more familiar.

@blimlim blimlim mentioned this pull request Sep 17, 2024
18 tasks
@blimlim blimlim merged commit 9acda2f into develop Sep 17, 2024
4 checks passed
@blimlim
Copy link
Collaborator Author

blimlim commented Sep 17, 2024

Thanks @truth-quark and @marc-white for the very helpful reviews! And apologies it took so much work on your end!

@blimlim blimlim deleted the 55/refactor-fix_latlon_coord branch September 26, 2024 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor fix_latlon_coord() function
3 participants