Add Spark 3.x support for sarplus #1566

Merged
merged 40 commits into from
Dec 14, 2021
Changes from 24 commits
0d5fa3d
Upgrade sarplus to support Spark 3.x
simonzhaoms Nov 22, 2021
66b76d9
Add corresponding docs and simplify version specification
simonzhaoms Nov 22, 2021
b5d3b55
Update python package url
simonzhaoms Nov 23, 2021
dd1ceeb
Add macros for Spark 3.2.x
simonzhaoms Nov 26, 2021
24b4d48
Add sarplus testing and packaging workflow
simonzhaoms Dec 2, 2021
5af0c80
Add steps to publish python package
simonzhaoms Dec 8, 2021
52999e3
Add configs for scala package publish
simonzhaoms Dec 8, 2021
48028cf
Merge branch 'staging' into simonz/sarplus/spark3
simonzhaoms Dec 8, 2021
b50b94a
Add python 3.6 and 3.7
simonzhaoms Dec 8, 2021
74298ce
Add steps for Scala packaging
simonzhaoms Dec 9, 2021
92923ad
Rename scala bundle
simonzhaoms Dec 9, 2021
d34c3e2
Add license hader
simonzhaoms Dec 10, 2021
466ebc8
Format Python code with black
simonzhaoms Dec 12, 2021
bdb1892
Remove trailing whitespaces
simonzhaoms Dec 13, 2021
2006b58
Add Python README
simonzhaoms Dec 13, 2021
46d3ae3
Use VERSION as the only place for version update
simonzhaoms Dec 13, 2021
d2dbe11
Update workflow
simonzhaoms Dec 13, 2021
87d78d8
Remove unused code
simonzhaoms Dec 13, 2021
52376d1
Remove azure-pipelines.yml
simonzhaoms Dec 13, 2021
5c786fb
Update DEVELOPMENT.md
simonzhaoms Dec 13, 2021
f1f8cf0
Update README.md
simonzhaoms Dec 13, 2021
da74735
Update sarplus.yml
simonzhaoms Dec 13, 2021
8c696e4
Add link to publish scala package manually to central repository
simonzhaoms Dec 13, 2021
2da6acd
Merge branch 'staging' into simonz/sarplus/spark3
miguelgfierro Dec 13, 2021
1094e2a
Add docstring for SARPlus init function
simonzhaoms Dec 13, 2021
6845051
Merge remote-tracking branch 'origin/simonz/sarplus/spark3' into simo…
simonzhaoms Dec 13, 2021
bf1f0ec
Merge branch 'staging' into simonz/sarplus/spark3
anargyri Dec 13, 2021
fc7cab2
Use VERSION
simonzhaoms Dec 13, 2021
932c698
Merge branch 'simonz/sarplus/spark3' of simonzhaomsgithub:simonzhaoms…
simonzhaoms Dec 13, 2021
34c72f7
Add simon in AUTHORS.md
simonzhaoms Dec 13, 2021
ebff27b
Merge branch 'staging' into simonz/sarplus/spark3
anargyri Dec 13, 2021
fa9ece8
Remove GPG key
simonzhaoms Dec 13, 2021
d016370
Update sarplus.yml
simonzhaoms Dec 13, 2021
b8f31f2
Resolve flake8 errors
simonzhaoms Dec 13, 2021
decdb28
Merge branch 'simonz/sarplus/spark3' of simonzhaomsgithub:simonzhaoms…
simonzhaoms Dec 13, 2021
7ef200d
Update setup.py
simonzhaoms Dec 13, 2021
75028d5
Move VERSION as package data file of pysarplus
simonzhaoms Dec 14, 2021
8c44d14
Remove test data access token and move fixtures into conftest.py
simonzhaoms Dec 14, 2021
4ec139a
Corrent VERSION path
simonzhaoms Dec 14, 2021
594674c
Fix flake8 issues
simonzhaoms Dec 14, 2021
160 changes: 160 additions & 0 deletions .github/workflows/sarplus.yml
@@ -0,0 +1,160 @@
# This workflow will run tests and do packaging for contrib/sarplus.
#
# References:
# * GitHub Actions workflow templates
# + [python package](https://github.com/actions/starter-workflows/blob/main/ci/python-package.yml)
# + [scala](https://github.com/actions/starter-workflows/blob/main/ci/scala.yml)
# * [GitHub hosted runner - Ubuntu 20.04 LTS](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-README.md)
# * [Azure Databricks runtime releases](https://docs.microsoft.com/en-us/azure/databricks/release-notes/runtime/releases)


name: sarplus test and package

on:
  push:
    paths:
      - contrib/sarplus/python/**
      - contrib/sarplus/scala/**
      - contrib/sarplus/VERSION

env:
  PYTHON_ROOT: ${{ github.workspace }}/contrib/sarplus/python
  SCALA_ROOT: ${{ github.workspace }}/contrib/sarplus/scala

jobs:
  python:
    # Test pysarplus with different versions of Python.
    # Package pysarplus and upload as GitHub workflow artifact when merged into
    # the main branch.
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install -U build pip twine
          python -m pip install -U flake8 pytest pytest-cov scikit-learn

      - name: Lint with flake8
        run: |
          cd "${PYTHON_ROOT}"
          # See https://flake8.pycqa.org/en/latest/user/index.html
          flake8 .

      - name: Package and check
        run: |
          cd "${PYTHON_ROOT}"
          cp ../VERSION ./
          python -m build --sdist
          python -m twine check dist/*

      - name: Test
        env:
          ACCESS_TOKEN: ${{ secrets.SARPLUS_TESTDATA_ACCESS_TOKEN }}
        run: |
          cd "${PYTHON_ROOT}"
          python -m pip install dist/*.gz

          cd "${SCALA_ROOT}"
          export SPARK_VERSION=$(python -m pip show pyspark | grep -i version | cut -d ' ' -f 2)
          SPARK_JAR_DIR=$(python -m pip show pyspark | grep -i location | cut -d ' ' -f2)/pyspark/jars
          SCALA_JAR=$(ls ${SPARK_JAR_DIR}/scala-library*)
          HADOOP_JAR=$(ls ${SPARK_JAR_DIR}/hadoop-client-api*)
          SCALA_VERSION=${SCALA_JAR##*-}
          export SCALA_VERSION=${SCALA_VERSION%.*}
          HADOOP_VERSION=${HADOOP_JAR##*-}
          export HADOOP_VERSION=${HADOOP_VERSION%.*}
          sbt ++"${SCALA_VERSION}"! package

          cd "${PYTHON_ROOT}"
          pytest --token "${ACCESS_TOKEN}" ./tests
          echo "sarplus_version=$(cat ../VERSION)" >> $GITHUB_ENV

      - name: Upload Python package as GitHub artifact
        if: github.ref == 'refs/heads/main' && matrix.python-version == '3.10'
        uses: actions/upload-artifact@v2
        with:
          name: pysarplus-${{ env.sarplus_version }}
          path: ${{ env.PYTHON_ROOT }}/dist/*.gz

  scala-test:
    # Test sarplus with different versions of Databricks runtime, 2 LTSs and 1
    # latest.
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - scala-version: "2.12.10"
            spark-version: "3.0.1"
            hadoop-version: "2.7.4"
            databricks-runtime: "ADB 7.3 LTS"

          - scala-version: "2.12.10"
            spark-version: "3.1.2"
            hadoop-version: "2.7.4"
            databricks-runtime: "ADB 9.1 LTS"

          - scala-version: "2.12.14"
            spark-version: "3.2.0"
            hadoop-version: "3.3.1"
            databricks-runtime: "ADB 10.0"

    steps:
      - uses: actions/checkout@v2

      - name: Test
        run: |
          cd "${SCALA_ROOT}"
          export SPARK_VERSION="${{ matrix.spark-version }}"
          export HADOOP_VERSION="${{ matrix.hadoop-version }}"
          sbt ++${{ matrix.scala-version }}! test

  scala-package:
    # Package sarplus and upload as GitHub workflow artifact when merged into
    # the main branch.
    needs: scala-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Package
        env:
          GPG_KEY: ${{ secrets.SARPLUS_GPG_PRI_KEY_ASC }}
        run: |
          # generate artifacts
          cd "${SCALA_ROOT}"
          export SPARK_VERSION="3.1.2"
          export HADOOP_VERSION="2.7.4"
          export SCALA_VERSION="2.12.10"
          sbt ++${SCALA_VERSION}! package
          sbt ++${SCALA_VERSION}! packageDoc
          sbt ++${SCALA_VERSION}! packageSrc
          sbt ++${SCALA_VERSION}! makePom
          export SPARK_VERSION="3.2.0"
          export HADOOP_VERSION="3.3.1"
          export SCALA_VERSION="2.12.14"
          sbt ++${SCALA_VERSION}! package

          # sign with GPG
          cd target/scala-2.12
          gpg --import <(cat <<< "${GPG_KEY}")
          for file in {*.jar,*.pom}; do gpg -ab "${file}"; done

          # bundle
          jar cvf sarplus-bundle_2.12-$(cat ../VERSION).jar *.jar *.pom *.asc
          echo "sarplus_version=$(cat ../VERSION)" >> $GITHUB_ENV

      - name: Upload Scala bundle as GitHub artifact
        uses: actions/upload-artifact@v2
        with:
          name: sarplus-bundle_2.12-${{ env.sarplus_version }}
          path: ${{ env.SCALA_ROOT }}/target/scala-2.12/sarplus-bundle_2.12-${{ env.sarplus_version }}.jar
99 changes: 85 additions & 14 deletions contrib/sarplus/DEVELOPMENT.md
@@ -1,41 +1,112 @@
# Packaging

For [databricks](https://databricks.com/) to properly install a [C++ extension](https://docs.python.org/3/extending/building.html), one must take a detour through [pypi](https://pypi.org/).
Use [twine](https://github.com/pypa/twine) to upload the package to [pypi](https://pypi.org/).
For [databricks](https://databricks.com/) to properly install a [C++
extension](https://docs.python.org/3/extending/building.html), one
must take a detour through [pypi](https://pypi.org/). Use
[twine](https://github.com/pypa/twine) to upload the package to
[pypi](https://pypi.org/).

```bash
cd python

python setup.py sdist
# build dependencies
python -m pip install -U build pip twine

twine upload dist/pysarplus-*.tar.gz
cd python
cp ../VERSION ./ # version file
python -m build --sdist
python -m twine upload dist/*
```

On [Spark](https://spark.apache.org/) one can install all 3 components (C++, Python, Scala) in one pass by creating a [Spark Package](https://spark-packages.org/). Documentation is rather sparse. Steps to install
On [Spark](https://spark.apache.org/) one can install all 3 components
(C++, Python, Scala) in one pass by creating a [Spark
Package](https://spark-packages.org/). Steps to install

1. Package and publish the [pip package](python/setup.py) (see above)
2. Package the [Spark package](scala/build.sbt), which includes the [Scala formatter](scala/src/main/scala/microsoft/sarplus) and references the [pip package](scala/python/requirements.txt) (see below)
3. Upload the zipped Scala package to [Spark Package](https://spark-packages.org/) through a browser. [sbt spPublish](https://github.com/databricks/sbt-spark-package) has a few [issues](https://github.com/databricks/sbt-spark-package/issues/31) so it always fails for me. Don't use spPublishLocal as the packages are not created properly (names don't match up, [issue](https://github.com/databricks/sbt-spark-package/issues/17)) and furthermore fail to install if published to [Spark-Packages.org](https://spark-packages.org/).
2. Package the [Spark package](scala/build.sbt), which includes the
[Scala formatter](scala/src/main/scala/microsoft/sarplus) and
references the pip package (see below)
3. Upload the zipped Scala package bundle to [Nexus Repository
Manager](https://oss.sonatype.org/) through a browser (see [publish
manual](https://central.sonatype.org/publish/publish-manual/)).

```bash
export SPARK_VERSION="3.1.2"
export HADOOP_VERSION="2.7.4"
export SCALA_VERSION="2.12.10"
GPG_KEY="<gpg-private-key>"

# generate artifacts
cd scala
sbt spPublish
sbt ++${SCALA_VERSION}! package
sbt ++${SCALA_VERSION}! packageDoc
sbt ++${SCALA_VERSION}! packageSrc
sbt ++${SCALA_VERSION}! makePom

# generate the artifact (sarplus-*-spark32.jar) for Spark 3.2+
export SPARK_VERSION="3.2.0"
export HADOOP_VERSION="3.3.1"
export SCALA_VERSION="2.12.14"
sbt ++${SCALA_VERSION}! package

# sign with GPG
cd target/scala-${SCALA_VERSION%.*}
gpg --import <(cat <<< "${GPG_KEY}")
for file in {*.jar,*.pom}; do gpg -ab "${file}"; done

# bundle
jar cvf sarplus-bundle_2.12-$(cat ../VERSION).jar *.jar *.pom *.asc
```

where `SPARK_VERSION`, `HADOOP_VERSION`, `SCALA_VERSION` should be
customized as needed.
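
The signing step above changes into `target/scala-${SCALA_VERSION%.*}`, which relies on shell parameter expansion to strip the patch component from the full Scala version. A minimal sketch of that derivation follows; the function names `scala_binary_version` and `scala_target_dir` are hypothetical helpers introduced only for illustration and are not part of sarplus.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of sarplus) illustrating how the
# sbt output directory name follows from the full Scala version.

# Print the Scala binary version, e.g. 2.12.10 -> 2.12.
scala_binary_version() {
    local full="$1"
    # ${full%.*} removes the shortest suffix matching ".*",
    # i.e. the trailing ".<patch>" component.
    printf '%s\n' "${full%.*}"
}

# Print the sbt target directory for a given full Scala version.
scala_target_dir() {
    printf 'target/scala-%s\n' "$(scala_binary_version "$1")"
}

# Both Scala versions used above map to the same directory, which is
# why the two `sbt package` runs leave their JARs side by side.
scala_target_dir "2.12.10"
scala_target_dir "2.12.14"
```

Because both versions resolve to the same binary version, the bundle name `sarplus-bundle_2.12-...` stays valid for either build.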


## Testing

To test the python UDF + C++ backend

```bash
cd python
python setup.py install && pytest -s tests/
# access token for https://recodatasets.blob.core.windows.net/sarunittest/
ACCESS_TOKEN="<test-data-blob-access-token>"

# build dependencies
python -m pip install -U build pip twine

# build and install
cd python
cp ../VERSION ./                # version file
python -m build --sdist
python -m pip install dist/*.gz

# test
pytest --token "${ACCESS_TOKEN}" ./tests
```

To test the Scala formatter

```bash
export SPARK_VERSION=3.2.0
export HADOOP_VERSION=3.3.1
export SCALA_VERSION=2.12.14

cd scala
sbt test
sbt ++${SCALA_VERSION}! test
```

(use ~test and it will automatically check for changes in source files, but not build.sbt)

## Notes for Spark 3.x ##

The code has been modified to support Spark 3.x and has been
tested under different versions of Databricks Runtime (including 6.4
Extended Support, 7.3 LTS, 9.1 LTS, 10.0 and 10.1) on Azure Databricks
Service. However, **Spark 3.2** introduces a breaking change in
[org.apache.spark.sql.execution.datasources.OutputWriter](https://github.com/apache/spark/blob/dc0fa1eef74238d745dabfdc86705b59d95b07e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala#L74),
which adds an extra function `path()`, so an additional JAR file with
the classifier `spark32` is needed when running on Spark 3.2 (see
above for packaging).
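
As a rough illustration of the rule above, the following sketch picks the JAR classifier from a Spark version string. The function `sarplus_classifier` is a hypothetical helper, not something shipped with sarplus, and it assumes that later 3.x releases keep the Spark 3.2 `OutputWriter` interface, as the "Spark 3.2+" comment in the packaging steps suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of sarplus): print the JAR classifier
# to use for a Spark version given as major.minor[.patch].
# Prints "spark32" for Spark 3.2 and later 3.x, and an empty line
# otherwise (meaning: use the default JAR without a classifier).
sarplus_classifier() {
    local spark_version="$1"
    local major="${spark_version%%.*}"   # text before the first dot
    local rest="${spark_version#*.}"     # text after the first dot
    local minor="${rest%%.*}"            # text before the next dot

    if [ "${major}" -eq 3 ] && [ "${minor}" -ge 2 ]; then
        printf 'spark32\n'
    else
        printf '\n'
    fi
}

sarplus_classifier "3.1.2"   # default JAR, prints an empty line
sarplus_classifier "3.2.0"   # prints: spark32
```

Versions outside 3.x are deliberately treated as "no classifier" here, since the source makes no claim about them.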

Extra configurations are also required when running on Spark 3.x:

```
spark.sql.sources.default parquet
spark.sql.legacy.createHiveTableByDefault true
```
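
One way to supply these two settings outside a cluster configuration page is through `--conf key=value` arguments to `spark-submit`. The sketch below only assembles those arguments; `spark_conf_args` is a hypothetical helper and `my_job.py` is a placeholder script name, neither of which comes from sarplus.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of sarplus): print the spark-submit
# arguments for the two Spark 3.x settings listed above, one per line.
spark_conf_args() {
    printf '%s\n' \
        "--conf" "spark.sql.sources.default=parquet" \
        "--conf" "spark.sql.legacy.createHiveTableByDefault=true"
}

# Show the arguments that would be passed.
spark_conf_args

# Example invocation (placeholder job script, left commented out so
# the sketch runs without a Spark installation):
# spark-submit $(spark_conf_args) my_job.py
```

On Databricks the same key/value pairs would instead go into the cluster's Spark config, in the space-separated form shown above.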