Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PySpark to the add-ons flow #3169

Merged
merged 31 commits into from
Oct 26, 2023
Merged

Add PySpark to the add-ons flow #3169

merged 31 commits into from
Oct 26, 2023

Conversation

SajidAlamQB
Copy link
Contributor

@SajidAlamQB SajidAlamQB commented Oct 12, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

PySpark should be added to the add-ons options, allowing users to get setup for pyspark (see #2506 (comment)). This option should make use of the new spaceflights-pyspark starter: #2984

When a user select pyspark they should get the config needed to run Kedro project with pyspark. This shouldn't include the example pipelines. This will happen in a follow up ticket.

Development notes

Blocked by:
kedro-org/kedro-starters#170

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB self-assigned this Oct 12, 2023
@SajidAlamQB SajidAlamQB changed the base branch from main to develop October 12, 2023 14:03
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB marked this pull request as ready for review October 12, 2023 15:30
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general approach looks good, but instead of adding the spark specific files in the basic template, these should be fetched from the new starter: https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pyspark. You might have to also create a post_gen_project file there..

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB linked an issue Oct 13, 2023 that may be closed by this pull request
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
SajidAlamQB and others added 3 commits October 25, 2023 11:25
Signed-off-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
 into pyspark-add-on-flow

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@noklam
Copy link
Contributor

noklam commented Oct 25, 2023

I see all tests are passing but linting, I take the liberty to fix it.

@SajidAlamQB
Copy link
Contributor Author

I see all tests are passing but linting, I take the liberty to fix it.

lol we did it at the same time.

@merelcht
Copy link
Member

If you combine pyspark with any of the other add-ons you potentially get duplicate entries in your pyproject.toml, because the pyproject.toml in the starter already contains all settings for e.g. docs, linting etc. I guess the pyproject.toml will need to be stripped.

Not related to just pyspark: I also noticed that docs requirements are already present in the base pyproject.toml and will also end up being duplicated even if pyspark isn't selected.

@merelcht
Copy link
Member

merelcht commented Oct 25, 2023

Yeah part of the stuff was stripped from the pyproject.toml in the base template in https://github.com/kedro-org/kedro/pull/2987/files, but the starters were merged first so I didn't realise this had to happen.. 😅

@SajidAlamQB
Copy link
Contributor Author

If you combine pyspark with any of the other add-ons you potentially get duplicate entries in your pyproject.toml, because the pyproject.toml in the starter already contains all settings for e.g. docs, linting etc. I guess the pyproject.toml will need to be stripped.

We could just add checks to make sure that the entries we're adding are not already present.

@merelcht
Copy link
Member

If you combine pyspark with any of the other add-ons you potentially get duplicate entries in your pyproject.toml, because the pyproject.toml in the starter already contains all settings for e.g. docs, linting etc. I guess the pyproject.toml will need to be stripped.

We could just add checks to make sure that the entries we're adding are not already present.

But these optional requirements should only be added if the user wants them. If they haven't selected lint they shouldn't get any requirements/settings in their pyproject.toml related to ruff.

@noklam
Copy link
Contributor

noklam commented Oct 25, 2023

Our tests should have caught this - I will make a separate PR to fix the test. Essentially I think we need to parse the pyproject.toml properly instead of reading it as text file.

        with open(pyproject_file_path) as pyproject_file:
            requirements = pyproject_file.read()

@noklam
Copy link
Contributor

noklam commented Oct 25, 2023

I created the test in #3230, it catches an invalid combination that we need to fix.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some minor comments but I think this is almost ready to be merged. I suggest fixing the other issues I found in a separate PR, because they aren't caused by the changes here and that shouldn't block the work for adding the viz option.

kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
tests/framework/cli/test_starters.py Show resolved Hide resolved
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB
Copy link
Contributor Author

I've left some minor comments but I think this is almost ready to be merged. I suggest fixing the other issues I found in a separate PR, because they aren't caused by the changes here and that shouldn't block the work for adding the viz option.

Oh, I've already added some changes that attempt to fix this but I can revert and open it up on new PR? @merelcht

@merelcht
Copy link
Member

Oh, I've already added some changes that attempt to fix this but I can revert and open it up on new PR? @merelcht

Ah no, that's great!! I'll have a look 😄 I just didn't want to block this PR by my earlier comments.

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB requested a review from merelcht October 26, 2023 12:48
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one more comment about the pyspark check. add_ons is a list there so == will always fail. But otherwise this looks good to be merged! 👍

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
@SajidAlamQB SajidAlamQB merged commit 5e9e05e into develop Oct 26, 2023
28 of 29 checks passed
@SajidAlamQB SajidAlamQB deleted the pyspark-add-on-flow branch October 26, 2023 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add pyspark to add-ons options
3 participants