Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the virtualfile_in function to accept more 1-D arrays #2744

Closed
wants to merge 30 commits into from

Conversation

seisman
Copy link
Member

@seisman seisman commented Oct 14, 2023

Description of proposed changes

Here are the current definitions of the virtualfile_from_data method and the data_kind function:

pygmt/pygmt/clib/session.py

Lines 1473 to 1483 in c9d6147

def virtualfile_from_data(
self,
check_kind=None,
data=None,
x=None,
y=None,
z=None,
extra_arrays=None,
required_z=False,
required_data=True,
):

def data_kind(data=None, x=None, y=None, z=None, required_z=False, required_data=True):

When I started issue #2731, I realized the current function definitions have some limitations:

  1. For some modules, the number of input columns can vary, depending on the given options. For example, binstats usually requires 3 columns (x/y/z), but only requires 2 columns (x/y) if -Cn is used, and requires 4 columns (x/y/z/w) if -W is used. I don't think we want to add w=None and required_w=False to these functions. Also, we don't check if the input table has the required number of columns.
  2. The data_kind function does three things: (1) determines the kind of the input data, and (2) checks if the data/x/y/z combinations are valid; and (3) checks if the matrix-type data has 3 columns. The data_kind function is called inside virtualfile_from_data, but sometimes we need to know in data kind when wrapping GMT modules, for example, in Figure.plot and Figure.plot3d. It means the data_kind function is called twice, which is not necessary.

Solutions:

  1. Refactor the virtualfile_from_data function like data=None, vectors=None, names=["x", "y"]. vectors is a list of vectors (e.g., vectors=[x, y]) and names is a list of column names. The wrappers are responsible for preparing the list of 1-D arrays (vectors) and counting the column names (names).
  2. Let the data_kind focus on determining the data kind and have another separate function to check if the input data/vectors are valid.

@seisman seisman force-pushed the refactor/virtualfile-to-data branch from 49b1b3f to 5512d2f Compare October 14, 2023 07:53
@seisman seisman changed the title POC: Refactor the virtualfile_to_data and data_kind function to accept more 1-D arrays WIP POC: Refactor the virtualfile_to_data and data_kind function to accept more 1-D arrays Oct 14, 2023
@seisman seisman changed the title WIP POC: Refactor the virtualfile_to_data and data_kind function to accept more 1-D arrays WIP/POC: Refactor the virtualfile_to_data and data_kind function to accept more 1-D arrays Oct 14, 2023
@seisman seisman force-pushed the refactor/virtualfile-to-data branch from 5512d2f to 70fc9e4 Compare October 14, 2023 12:15
@seisman seisman force-pushed the refactor/virtualfile-to-data branch from 70fc9e4 to 66c4b97 Compare October 14, 2023 12:19
@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Oct 14, 2023
@seisman seisman added this to the 0.11.0 milestone Oct 14, 2023
@@ -15,127 +15,133 @@
from pygmt.exceptions import GMTInvalidInput


def _validate_data_input(
data=None, x=None, y=None, z=None, required_z=False, required_data=True, kind=None
def validate_data_input(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more useful to pass the list of column names instead, i.e., replacing ncols=2 with names=["x", "y"].

So, for most modules, vectors=["x", "y"] and names=["x", "y"] or vectors=[x, y, z] and names=["x", "y", "z"].

For more complicated modules like plot or plot3d, the names can be
names=["x", "y", "direction_arg1", "direction_arg2", "fill", "size", "symbol", "transparency"].

The column names will be very useful when the GMTInvalidInput exception is raised.
For example, instead of "Column 5 can't be None.", we can say "Column 5 ('size') can't be None.". Instead of "data must have at least 8 columns.", we can say

data must have at least 8 columns:
x y direction_arg1 direction_arg2 fill size symbol transparency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f37413b

pygmt/clib/session.py Outdated Show resolved Hide resolved
@seisman seisman modified the milestones: 0.11.0, 0.12.0 Dec 11, 2023
pygmt/helpers/utils.py Outdated Show resolved Hide resolved
Comment on lines +116 to +119
if len(vectors) < len(names):
raise GMTInvalidInput(
f"Requires {len(names)} 1-D arrays but got {len(vectors)}."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing unit test for this if-condition.

Comment on lines +129 to +130
if len(data.shape) == 1 and data.shape[0] < len(names):
raise GMTInvalidInput(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing unit test for this if-condition.

Comment on lines +226 to +228
vectors, names = [x, y], "xy"
if z is not None:
vectors.append(z)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to append 'z' to names here? Also, need a unit test for this if-condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no. The problem is that project only requires two columns, but three or more columns are required. Currently, the variable names are used for two purposes: (1) names of passed columns; (2) the number of columns. So, if we append z to names here, the calling pygmt.project(data=data) will fail if data has only two columns. I think we still need to maintain a variable for the number of required columns.

pygmt/tests/test_helpers.py Outdated Show resolved Hide resolved
Comment on lines 1530 to 1537
kind = data_kind(data, required=required_data)
validate_data_input(
data=data,
vectors=vectors,
names=names,
required_data=required_data,
kind=kind,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation checks have been moved from within data_kind to virtualfile_from_data here. But in plot.py, we actually use data_kind on its own here:

kind = data_kind(data, x, y)

Are we ok with raising GMTInvalidInput much later here in virtualfile_from_data (after all the keyword argument parsing), rather than early on in data_kind?

@seisman seisman removed this from the 0.12.0 milestone Feb 26, 2024
@seisman seisman marked this pull request as draft March 1, 2024 04:51
@seisman seisman changed the title Refactor the virtualfile_from_data and data_kind function to accept more 1-D arrays Refactor the virtualfile_in and data_kind function to accept more 1-D arrays Mar 4, 2024
@seisman seisman changed the title Refactor the virtualfile_in and data_kind function to accept more 1-D arrays Refactor the virtualfile_in afunction to accept more 1-D arrays Jul 20, 2024
@seisman seisman changed the title Refactor the virtualfile_in afunction to accept more 1-D arrays Refactor the virtualfile_in function to accept more 1-D arrays Jul 20, 2024
@seisman
Copy link
Member Author

seisman commented Oct 3, 2024

Closing this PR since it will be superseded by #3369.

@seisman seisman closed this Oct 3, 2024
@seisman seisman deleted the refactor/virtualfile-to-data branch October 3, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants