Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add nightly cache tests #2083

Merged
merged 9 commits into from
Apr 19, 2021
Merged

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Apr 6, 2021

This PR will run cache tests every night against Spark version 3.1.1 or greater. We can remove this once Spark 3.1.1 is the main supported version

There are bugs in the Serializer code and will be fixed once #1717 is merged

Signed-off-by: Raza Jafri rjafri@nvidia.com

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

@NvTimLiu @pxLi please review

@@ -90,6 +90,10 @@ jps

echo "----------------------------START TEST------------------------------------"
pushd $RAPIDS_INT_TESTS_HOME
#only run cache tests with our serializer in nightly test for Spark version >= 3.1.1
if [ "$(printf '%s\n' "3.1.1" "$SPARK_VER" | sort -V | head -n1)" = "3.1.1" ]; then
spark-submit $BASE_SPARK_SUBMIT_ARGS --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer --jars $RAPIDS_TEST_JAR ./runtests.py -v -rfExXs --std_input_path="$WORKSPACE/integration_tests/src/test/resources/" -k cache_test.py
Copy link
Collaborator

@pxLi pxLi Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: continue command w/ newline \, and indentation use 4 spaces. Also it would be better to move this after existing tests

not sure if we would like to give it a pytest marker for better management pytest.ini

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spark 3.2.0 line is using the class from com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer? I think we should create a separate one even if just a wrapper. We do this for the shuffle for instance to make it easier for users. it seems odd to use a class from 3.1.1 with 3.2.0 for instance. This can be a separate issue though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also agree, should run this last. @NvTimLiu @pxLi we don't have any common version parsing code do we?

Here we need ti as a separate spark-submit call because of the way the configuration is initialized, you can't change it during runtime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgravescs currently all of our integration tests depend on https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/jenkins/version-def.sh for versions that actually get ENVs from pipeline setup. Can you share in which case we need to change it during runtime?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgravescs was referring to the case where we will have to match the package name based on the spark version. I have already addressed his concern but if there is a common place for the version parsing I can reuse that.

Please take a look at the code and let me know if you have any questions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand it correctly, need we to convert spark version to shim version as below?

`SPARK_VER=3.2.0-SNAPSHTOT --> SHIM_VER=320`

If yes, we can add below scripts

`SHIM_VER=${SPARK_VER/-SNAPSHTOT/} && SHIM_VER=${SHIM_VER//./}`

`--conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark"${SHIM_VER}".ParquetCachedBatchSerializer`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the change @NvTimLiu PTAL

@pxLi pxLi requested review from tgravescs, revans2 and jlowe April 6, 2021 05:15
@pxLi pxLi added the test Only impacts tests label Apr 6, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

@pxLi @tgravescs I have addressed your concerns

@razajafri
Copy link
Collaborator Author

build

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@tgravescs
Copy link
Collaborator

is there a reason this is in draft, if you want reviews lets remove from draft.

* This is a wrapper to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
*/
class ParquetCachedBatchSerializer extends
com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a 3.2.0 version as well right?

@razajafri
Copy link
Collaborator Author

is there a reason this is in draft, if you want reviews lets remove from draft.

If we merge this, our nightly build will start failing as the ParquetCachedBatchSerializer doesn't have the Parquet changes. Once #1717 is merged we can merge this

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

@NvTimLiu Thanks, I have updated the script. PTAL

@razajafri razajafri marked this pull request as ready for review April 13, 2021 18:16
@razajafri razajafri requested a review from GaryShen2008 as a code owner April 13, 2021 18:16
@razajafri
Copy link
Collaborator Author

build

/**
* This is a wrapper to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
*/
class ParquetCachedBatchSerializer extends
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should update our docs -> additional-functionality/cache-serializer.md. to clarify to use the correct version that matches spark version.

@tgravescs
Copy link
Collaborator

overall looks good, I think we just need to update docs

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

@tgravescs I have updated the docs PTAL

@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri merged commit c5175b2 into NVIDIA:branch-0.5 Apr 19, 2021
@razajafri razajafri deleted the nightly_cache branch April 19, 2021 12:41
razajafri added a commit that referenced this pull request Apr 20, 2021
This reverts commit c5175b2.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
pxLi pushed a commit that referenced this pull request Apr 21, 2021
This reverts commit c5175b2.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* add nightly cache tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* fixed copyright

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* add serializer to 320

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Taking care of SNAPSHOT

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added more docs for Serializer per Spark version

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This reverts commit c5175b2.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* add nightly cache tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* fixed copyright

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* add serializer to 320

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Taking care of SNAPSHOT

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added more docs for Serializer per Spark version

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This reverts commit c5175b2.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants