-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define a clear and enforceable standard file layout for cuDF Python #11474
Comments
Do we know of other projects that do something similar? |
Because Pandas :) >>> pd.core.common.pipe
<function pandas.core.common.pipe(obj, func: 'Callable[..., T] | tuple[Callable[..., T], str]', *args, **kwargs) -> 'T'> I think it's useful to mimic the way Pandas organizes its public modules. This seems to be the route That doesn't answer the question about how to organize our internal functions/classes though, and I'd be open to discussions around that. As for tests/benchmarks, I really think it's useful for them to follow the same layout as the API reference. So tests As for enforcement of any of these rules, I'm worried about over-engineering a solution here. Thus my question above if other projects do similar enforcement for code organization. |
Hmm so then would we want to move towards matching pandas more closely? For instance, If we go with matching pandas for APIs, then for tests/benchmarks I would say we have two options:
I have also never seen anything enforcing a file layout and I agree that it might be overengineered unless it's very easy. It would be nice if it were possible, though. @mroeschke suggested this and may have some ideas. |
Recognizing that Vyas is on vacation for a few weeks, so it'll be a while before we can pick back up on this discussion, just responding now so I don't forget: I'm much more in favor of 1. Matching tests with source files 1-1 can lead to unnecessary churn as we frequently move functions/methods across files. |
Agreed. I think the natural separation of API components is more aligned with the docs than the package structure. |
+1 as well for aligning the API with the docs as well. I had thrown out the idea of "enforceable" file layout just noticing in pandas, at least, that having soft conventions in organizing a code base can still make it hard for new and experienced contributors where to place things and causing drift in conventions. Maybe this is a largely unsolved problem in general though. If there's no easily available tool out there today, I definitely agree with the worries about over-engineering a solution. |
This issue has been labeled |
Since this issue was first opened, cudf.pandas was created and requires an even closer matching to pandas. As a result, this issue isn't quite relevant anymore. pylibcudf's modules will largely be organized to match libcudf headers, while cudf modules will match pandas as closely as possible. For discussion and work on the tests front, see #4730, #15723, and #12288. |
Is your feature request related to a problem? Please describe.
Aside from the core classes (DataFrame, Series, etc), each of which typically live in eponymous files, the organization of functions in cuDF Python currently leaves quite a bit to be desired. The main offender is the
core
subpackage, which has a number of functions living in largely arbitrary locations. For instance, we have filesalgorithms.py
andcommon.py
, and it's not clear whypipe
was placed in the latter while other functions were placed in the former. We also lack a consistent strategy on when a function gets its own file and when it lives with other functions in a shared module (we are generally OK about one class per file, although even that rule is violated occasionally). This problem also propagates to tests and benchmarks. Tests, in particular, are very disorganized to the point where it's nearly impossible to know which files to look at to determine whether some functionality is tested (cf. #9999). A desire for improved organization has come up in multiple PRs adding developer documentation (see the various PRs contributing to #6481).Describe the solution you'd like
We should come up with a simple, easily enforceable set of rules addressing the following:
core
, or (nearly equivalent) we need to decide on a set of named submodules where functions go.The next question is how we enforce these rules. It would be nice if we could find a way to enforce these programatically (ideally with a pre-commit hook). If we use the documentation organization as our baseline, the difficulty here would be that we'd potentially need to do significant rst parsing in order to use this ordering. On the flip side, it would be pretty straightforward to enforce with linting rules.
The text was updated successfully, but these errors were encountered: