Define a clear and enforceable standard file layout for cuDF Python #11474

vyasr · 2022-08-04T21:48:17Z

Is your feature request related to a problem? Please describe.
Aside from the core classes (DataFrame, Series, etc.), each of which typically lives in an eponymous file, the organization of functions in cuDF Python currently leaves quite a bit to be desired. The main offender is the core subpackage, which has a number of functions living in largely arbitrary locations. For instance, we have files algorithms.py and common.py, and it's not clear why pipe was placed in the latter while other functions were placed in the former. We also lack a consistent strategy for when a function gets its own file and when it lives with other functions in a shared module (we are generally OK about one class per file, although even that rule is violated occasionally). This problem also propagates to tests and benchmarks. Tests, in particular, are so disorganized that it's nearly impossible to know which files to look at to determine whether some functionality is tested (cf. #9999). A desire for improved organization has come up in multiple PRs adding developer documentation (see the various PRs contributing to #6481).

Describe the solution you'd like
We should come up with a simple, easily enforceable set of rules addressing the following:

- What files do classes go into? (I vote one class per file, with the file named for that class.)
- What files do functions go into? We typically aim to group by functionality, so we likely need either more subpackages inside core or (nearly equivalently) an agreed set of named submodules where functions go.
- To what extent should (and can) test and benchmark organization mirror the contents of the main package?
- How does docs organization fit in? Our documentation is organized to match the pandas documentation, which provides a pretty reasonable default organization for us.

The next question is how we enforce these rules. It would be nice if we could find a way to enforce these programmatically (ideally with a pre-commit hook). If we use the documentation organization as our baseline, the difficulty here would be that we'd potentially need to do significant rst parsing in order to use this ordering. On the flip side, it would be pretty straightforward to enforce with linting rules.
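To make the pre-commit idea concrete, here is a minimal sketch of what a "one class per file, file named for the class" check could look like. Everything in it is an assumption for illustration: the helper names (`public_classes`, `check_module`, `main`) are hypothetical, and so is the matching rule (compare names case-insensitively and ignore underscores in the file name). It is not an existing cuDF tool.

```python
import ast
from pathlib import Path


def public_classes(source: str) -> list[str]:
    """Return names of top-level classes that do not start with an underscore."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, ast.ClassDef) and not node.name.startswith("_")
    ]


def check_module(filename: str, source: str) -> list[str]:
    """Return rule violations for one module (an empty list means compliant)."""
    classes = public_classes(source)
    stem = Path(filename).stem
    errors = []
    # Assumed rule 1: at most one public class per file.
    if len(classes) > 1:
        errors.append(
            f"{filename}: found {len(classes)} public classes {classes}, "
            "expected at most 1"
        )
    # Assumed rule 2: the file is named for the class, ignoring case and
    # underscores (so DataFrame matches dataframe.py or data_frame.py).
    for name in classes:
        if name.lower() != stem.replace("_", "").lower():
            errors.append(
                f"{filename}: class {name} does not match file name {stem}.py"
            )
    return errors


def main(argv: list[str]) -> int:
    """Check every file name passed on the command line; return an exit code."""
    errors = []
    for filename in argv:
        errors.extend(check_module(filename, Path(filename).read_text()))
    for error in errors:
        print(error)
    return 1 if errors else 0
```

A hook entry point would call `sys.exit(main(sys.argv[1:]))`, since pre-commit passes the staged file names as arguments. Modules that contain only functions pass trivially, so this sketch says nothing about the harder question of where functions should live.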


shwina · 2022-08-04T22:08:39Z

> The next question is how we enforce these rules. It would be nice if we could find a way to enforce these programmatically (ideally with a pre-commit hook).

Do we know of other projects that do something similar?

shwina · 2022-08-04T22:53:04Z

> For instance, we have files algorithms.py and common.py, and it's not clear why pipe was placed in the latter while other functions were placed in the former.

Because Pandas :)

>>> pd.core.common.pipe
<function pandas.core.common.pipe(obj, func: 'Callable[..., T] | tuple[Callable[..., T], str]', *args, **kwargs) -> 'T'>

I think it's useful to mimic the way Pandas organizes its public modules. This seems to be the route cuml has taken. Note that because Pandas occasionally makes changes to its public namespaces, we'd have to decide which version of Pandas we want to match (latest?).

That doesn't answer the question about how to organize our internal functions/classes though, and I'd be open to discussions around that.

As for tests/benchmarks, I really think it's useful for them to follow the same layout as the API reference. So tests for DataFrame/Constructor would go in tests/dataframe/constructor.py or tests/dataframe/construction.py.
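As a rough sketch of that mapping, assuming each API reference section is identified by a class name and a section title, the target test file could be derived like this. The function name `path_for_tests` and the normalization (lowercase, spaces to underscores) are hypothetical choices, not an agreed convention.

```python
from pathlib import PurePosixPath


def path_for_tests(
    api_class: str, section: str, root: str = "tests"
) -> PurePosixPath:
    """Map an API reference section to the test file that would cover it."""
    # Assumed normalization: lowercase everything, spaces become underscores.
    module = section.strip().lower().replace(" ", "_") + ".py"
    return PurePosixPath(root, api_class.lower(), module)


print(path_for_tests("DataFrame", "Constructor"))
```

With these assumptions, the DataFrame/Constructor section maps to tests/dataframe/constructor.py.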

As for enforcement of any of these rules, I'm worried about over-engineering a solution here. Hence my question above about whether other projects do similar enforcement for code organization.

vyasr · 2022-08-05T19:01:55Z

Hmm so then would we want to move towards matching pandas more closely? For instance, DataFrame is in pd.core.frame.DataFrame, whereas ours is in cudf.core.dataframe.DataFrame. We have an actual Frame class (pandas has NDFrame) so we run into some potential conflicts. How would we want to address that? I think this also circles back to some of the questions that you and @bdice ran into when trying to analyze the public APIs of pandas, namely how we determine what is public and what is not.

If we go with matching pandas for APIs, then for tests/benchmarks I would say we have two options:

1. Follow the same layout as the API reference. That's what we've discussed before.
2. Follow the same layout as the source. Tests go in files that exactly match the source layout.

I have also never seen anything enforcing a file layout and I agree that it might be overengineered unless it's very easy. It would be nice if it were possible, though. @mroeschke suggested this and may have some ideas.

shwina · 2022-08-08T12:54:44Z

> If we go with matching pandas for APIs, then for tests/benchmarks I would say we have two options:
>
> 1. Follow the same layout as the API reference. That's what we've discussed before.
> 2. Follow the same layout as the source. Tests go in files that exactly match the source layout.

Recognizing that Vyas is on vacation for a few weeks, so it'll be a while before we can pick back up on this discussion, just responding now so I don't forget:

I'm much more in favor of 1. Matching tests with source files 1-1 can lead to unnecessary churn as we frequently move functions/methods across files.

bdice · 2022-08-08T13:21:54Z

Agreed. I think the natural separation of API components is more aligned with the docs than the package structure.

mroeschke · 2022-08-08T20:49:17Z

+1 as well for aligning the API with the docs.

I had thrown out the idea of an "enforceable" file layout after noticing, in pandas at least, that relying on soft conventions to organize a code base can still make it hard for new and experienced contributors to know where to place things, and that this causes conventions to drift. Maybe this is a largely unsolved problem in general, though. If there's no readily available tool out there today, I definitely agree with the worries about over-engineering a solution.

github-actions · 2022-09-07T21:02:57Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr · 2024-05-14T00:38:41Z

Since this issue was first opened, cudf.pandas was created and requires an even closer matching to pandas. As a result, this issue isn't quite relevant anymore. pylibcudf's modules will largely be organized to match libcudf headers, while cudf modules will match pandas as closely as possible.

For discussion and work on the tests front, see #4730, #15723, and #12288.

vyasr added the feature request, Python, and improvement labels Aug 4, 2022

github-actions bot added the inactive-30d label Sep 7, 2022

vyasr mentioned this issue Oct 26, 2022

Add developer docs for writing tests #11199

Merged

vyasr self-assigned this Oct 26, 2022

GregoryKimball added this to the cuDF Python Refactoring milestone Nov 19, 2022

GregoryKimball removed the inactive-30d label Apr 2, 2023

vyasr closed this as completed May 14, 2024
