
[SPARK-45390][PYTHON] Remove distutils usage #43192

Closed
wants to merge 3 commits

Conversation

Member

@dongjoon-hyun dongjoon-hyun commented Oct 1, 2023

What changes were proposed in this pull request?

This PR aims to remove `distutils` usage from the Spark codebase.

BEFORE

```
$ git grep distutils | wc -l
      38
```

AFTER

```
$ git grep distutils | wc -l
       0
```
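
At each call site, the change is typically a one-line import swap. An illustrative example, assuming a PySpark build that includes this PR's `python/pyspark/loose_version.py` (actual call sites vary):

```python
# Before (distutils is removed in Python 3.12 by PEP-632):
#   from distutils.version import LooseVersion

# After (the class embedded by this PR):
from pyspark.loose_version import LooseVersion

if LooseVersion("1.5.1") >= LooseVersion("1.4.0"):
    print("new enough")
```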

Why are the changes needed?

Currently, Apache Spark suppresses the deprecation warnings, but the module itself is removed in Python 3.12 via PEP-632 in favor of the `packaging` package:

```python
filterwarnings(
    "ignore", message="distutils Version classes are deprecated. Use packaging.version instead."
)
```
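
For context, the replacement the deprecation message points to looks like this (a minimal sketch using the third-party `packaging` package):

```python
# Requires: pip install packaging  (the new dependency this PR avoids)
from packaging.version import Version

assert Version("3.12.0") >= Version("3.9")
assert Version("1.5a2") < Version("1.5")  # pre-releases sort before final releases
```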

Initially, #43184 proposed to follow the Python community guideline by using the `packaging` package, but this PR instead embeds a `LooseVersion` Python class to avoid adding a new package requirement.
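
For reference, an embedded replacement can look roughly like the following. This is a sketch approximating the distutils behavior, not necessarily the exact code merged here:

```python
import re
from functools import total_ordering


@total_ordering
class LooseVersion:
    """Version parser in the spirit of distutils.version.LooseVersion.

    "1.5.1" parses to [1, 5, 1] and "1.5a2" to [1, 5, "a", 2]; instances
    compare by piecewise list comparison, mirroring distutils semantics
    (including its quirks with mixed int/str components).
    """

    component_re = re.compile(r"(\d+|[a-z]+|\.)")

    def __init__(self, vstring: str):
        self.vstring = vstring
        # Keep non-empty components, dropping the "." separators.
        components = [x for x in self.component_re.split(vstring) if x and x != "."]
        for i, obj in enumerate(components):
            try:
                components[i] = int(obj)
            except ValueError:
                pass  # alphabetic components stay as strings
        self.version = components

    def __repr__(self):
        return f"LooseVersion('{self.vstring}')"

    def __eq__(self, other):
        if isinstance(other, str):
            other = LooseVersion(other)
        return self.version == other.version

    def __lt__(self, other):
        if isinstance(other, str):
            other = LooseVersion(other)
        return self.version < other.version


assert LooseVersion("1.5.1") < LooseVersion("1.5.2b2") < "1.6"
```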

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 1, 2023 01:46
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-45390][PYTHON] Remove distutils usage Oct 1, 2023
Member

@HyukjinKwon HyukjinKwon left a comment


Didn't check super closely but lgtm if it works cc @zhengruifeng and @ueshin

@dongjoon-hyun
Member Author

Thank you!

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review October 1, 2023 04:28
Comment on lines +224 to 225
python/pyspark/loose_version.py
python/docs/source/_static/copybutton.js
Member


> I believe it's compatible, but we need to take one more look before doing that, because the Apache Spark (up to 3.5.0) binary distribution doesn't include a Python Software Foundation License entry yet.

So actually we already have PSF License stuff?

Member Author


Yes and no~

Yes, we already had the copybutton.js file:

```
spark-3.5.0-bin-hadoop3:$ find . -name copybutton.js
./python/docs/source/_static/copybutton.js
```

But also no: we didn't have a PSF entry in LICENSE-binary, which is part of the Apache Spark binary distribution. So I added one in this PR.

Member Author


I'm not sure about copybutton.js, but we need to add loose_version.py to the binary license file because we already had python/pyspark/cloudpickle.py and python/pyspark/join.py in the BSD 3-Clause section. So I added it to LICENSE-binary too.

```python
import re


class LooseVersion:
```
Member


Is this copied from distutils? If so, maybe we need to add a few comments here to explain it?

Member


For example, python/docs/source/_static/copybutton.js has a few lines of comments.

Member Author


Actually, it's reimplemented by squashing the Version class into the existing LooseVersion class. Let me add a comment for that.
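
Something along these lines would do it (hypothetical wording, not necessarily the exact comment that was merged):

```python
# This class is a reimplementation of distutils.version.LooseVersion,
# which was removed in Python 3.12 (PEP-632). The abstract Version base
# class has been squashed into LooseVersion so that PySpark depends on
# neither distutils nor the third-party packaging package.
```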

Member

@viirya viirya left a comment


Just have two comments, on the license and a code comment.

@dongjoon-hyun
Member Author

Thank you, Liang-Chi!

@dongjoon-hyun
Member Author

Thank you all. All Python tests passed.

@dongjoon-hyun dongjoon-hyun deleted the remove_distutils branch October 1, 2023 17:45
dongjoon-hyun added a commit that referenced this pull request Nov 21, 2023
### What changes were proposed in this pull request?

This PR aims to add `Python 3.12` to Infra docker images.

Note that `Python 3.12` has a breaking change in the installation.
- The `distutils` module itself is removed in Python 3.12 via [PEP-632](https://peps.python.org/pep-0632) in favor of the `packaging` package.
- Apache Spark 4.0.0 is ready for Python 3.12 via SPARK-45390, which removed `distutils` usages.
    - #43192
- However, some 3rd-party packages are not ready for Python 3.12 yet, so this PR skips those kinds of packages (see the sketch after this list).
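
A sketch of the kind of version guard used to skip not-yet-ready packages (illustrative only; the package names below are hypothetical stand-ins and the real CI scripts differ):

```python
import subprocess
import sys

# Hypothetical stand-ins for 3rd-party packages that lag a new Python release.
PACKAGES = ["numpy", "pandas", "pyarrow", "some-unready-package"]
NOT_READY_FOR_312 = {"some-unready-package"}

for pkg in PACKAGES:
    if sys.version_info >= (3, 12) and pkg in NOT_READY_FOR_312:
        print(f"Skipping {pkg}: not ready for Python 3.12 yet")
        continue
    subprocess.run([sys.executable, "-m", "pip", "install", pkg], check=True)
```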

### Why are the changes needed?

This PR is a preparation for adding a daily `Python 3.12` GitHub Action job later for Apache Spark 4.0.0.

As of today, Apache Spark 4.0.0 has Python 3.8 ~ Python 3.11 test coverage.
- Python 3.9 (Main)
    - https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml
- PyPy3.8, Python 3.10, Python 3.11 (Daily)
    - https://github.com/apache/spark/actions/workflows/build_python.yml

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ docker run -it --rm ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6939290578 python3.12 --version
Python 3.12.0

$ docker run -it --rm ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6939290578 python3.12 -m pip freeze
alembic==1.12.1
blinker==1.7.0
certifi==2019.11.28
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
databricks-cli==0.18.0
dbus-python==1.2.16
distro-info==0.23+ubuntu1.1
docker==6.1.3
entrypoints==0.4
et-xmlfile==1.1.0
Flask==3.0.0
fonttools==4.45.0
gitdb==4.0.11
GitPython==3.1.40
googleapis-common-protos==1.56.4
greenlet==3.0.1
gunicorn==21.2.0
idna==2.8
importlib-metadata==6.8.0
itsdangerous==2.1.2
Jinja2==3.1.2
joblib==1.3.2
kiwisolver==1.4.5
lxml==4.9.3
Mako==1.3.0
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib==3.8.2
mlflow==2.8.1
numpy==1.26.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.2
pandas==2.1.3
Pillow==10.1.0
plotly==5.18.0
protobuf==4.25.1
pyarrow==14.0.1
PyGObject==3.36.0
PyJWT==2.8.0
pyparsing==3.1.1
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
querystring-parser==1.2.4
requests==2.31.0
requests-unixsocket==0.2.0
scikit-learn==1.3.2
scipy==1.11.4
setuptools==45.2.0
six==1.14.0
smmap==5.0.1
SQLAlchemy==2.0.23
sqlparse==0.4.4
tabulate==0.9.0
tenacity==8.2.3
threadpoolctl==3.2.0
typing_extensions==4.8.0
tzdata==2023.3
unattended-upgrades==0.1
unittest-xml-reporting==3.2.0
urllib3==2.1.0
websocket-client==1.6.4
Werkzeug==3.0.1
wheel==0.34.2
zipp==3.17.0
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43922 from dongjoon-hyun/SPARK-46020.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
… Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17 cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch35.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47488 from dongjoon-hyun/SPARK-49005.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
…vent Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17-jre` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17-jre cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch34.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47489 from dongjoon-hyun/SPARK-49005-3.4.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
…vent Python 3.12
