add nightly cache tests #2083
Conversation
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
jenkins/spark-tests.sh (Outdated)

@@ -90,6 +90,10 @@ jps

echo "----------------------------START TEST------------------------------------"
pushd $RAPIDS_INT_TESTS_HOME
# only run cache tests with our serializer in nightly test for Spark version >= 3.1.1
if [ "$(printf '%s\n' "3.1.1" "$SPARK_VER" | sort -V | head -n1)" = "3.1.1" ]; then
spark-submit $BASE_SPARK_SUBMIT_ARGS --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer --jars $RAPIDS_TEST_JAR ./runtests.py -v -rfExXs --std_input_path="$WORKSPACE/integration_tests/src/test/resources/" -k cache_test.py
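The `sort -V` comparison in the diff above is a common shell idiom for "version A is at least version B": print both versions, version-sort them, and check which one sorts first. A minimal standalone sketch of the same check (the `version_ge` helper name and the example value are illustrative, not from the script):

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) when version $1 >= version $2 under version ordering.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

SPARK_VER=3.1.1  # example value; in the CI script this comes from the environment
if version_ge "$SPARK_VER" "3.1.1"; then
    echo "run cache tests"
else
    echo "skip cache tests"
fi
```

`sort -V` handles multi-digit components correctly (e.g. `3.10.0` sorts after `3.2.0`), which a plain string comparison would get wrong.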
nit: continue the command with a newline `\` and use 4 spaces for indentation. Also, it would be better to move this after the existing tests.
Not sure if we would like to give it a pytest marker in pytest.ini for better management.
Is the Spark 3.2.0 line using the class from com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer? I think we should create a separate one, even if it's just a wrapper. We do this for the shuffle, for instance, to make it easier for users. It seems odd to use a class from 3.1.1 with 3.2.0. This can be a separate issue though.
@tgravescs currently all of our integration tests depend on https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/jenkins/version-def.sh for versions that actually get ENVs from pipeline setup. Can you share in which case we need to change it during runtime?
@tgravescs was referring to the case where we will have to match the package name based on the Spark version. I have already addressed his concern, but if there is a common place for the version parsing I can reuse that.
Please take a look at the code and let me know if you have any questions.
I am not sure I understand correctly: do we need to convert the Spark version to a shim version, as below?
`SPARK_VER=3.2.0-SNAPSHOT --> SHIM_VER=320`
If yes, we can add the scripts below:
`SHIM_VER=${SPARK_VER/-SNAPSHOT/} && SHIM_VER=${SHIM_VER//./}`
`--conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark"${SHIM_VER}".ParquetCachedBatchSerializer`
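The suggested conversion relies on bash parameter expansion: one substitution drops the `-SNAPSHOT` suffix (assuming that is the intended spelling), and a global substitution drops the dots. A runnable sketch:

```shell
#!/usr/bin/env bash
# Derive a shim version like "320" from a Spark version like "3.2.0-SNAPSHOT".
SPARK_VER=3.2.0-SNAPSHOT          # example input
SHIM_VER=${SPARK_VER/-SNAPSHOT/}  # drop the -SNAPSHOT suffix -> 3.2.0
SHIM_VER=${SHIM_VER//./}          # drop all dots (// = global) -> 320
echo "com.nvidia.spark.rapids.shims.spark${SHIM_VER}.ParquetCachedBatchSerializer"
# -> com.nvidia.spark.rapids.shims.spark320.ParquetCachedBatchSerializer
```

For a release version without the suffix (e.g. `3.1.1`), the first substitution is a no-op and the result is `311`.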
I have made the change @NvTimLiu PTAL
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@pxLi @tgravescs I have addressed your concerns
build
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Is there a reason this is in draft? If you want reviews, let's remove it from draft.
 * This is a wrapper to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
 */
class ParquetCachedBatchSerializer extends
    com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer {}
We need a 3.2.0 version as well, right?
If we merge this, our nightly build will start failing because ParquetCachedBatchSerializer doesn't have the Parquet changes. Once #1717 is merged we can merge this.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@NvTimLiu Thanks, I have updated the script. PTAL
build
/**
 * This is a wrapper to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
 */
class ParquetCachedBatchSerializer extends
We should update our docs (additional-functionality/cache-serializer.md) to clarify that users should use the serializer version that matches their Spark version.
Overall looks good; I think we just need to update the docs.
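For illustration, building the version-matched serializer config could look like the following. The class names follow this PR's shim packages; the variable names and the example version are hypothetical, and the spark-submit line is shown only as a comment:

```shell
#!/usr/bin/env bash
# Illustrative only: pick the ParquetCachedBatchSerializer shim class that
# matches the Spark version in use (spark311 for 3.1.1, spark320 for 3.2.0).
SPARK_VER=3.1.1   # example; normally taken from the environment
SHIM_VER=${SPARK_VER/-SNAPSHOT/}
SHIM_VER=${SHIM_VER//./}
SERIALIZER=com.nvidia.spark.rapids.shims.spark${SHIM_VER}.ParquetCachedBatchSerializer

# Then pass it to spark-submit, e.g.:
#   spark-submit --conf spark.sql.cache.serializer=$SERIALIZER ...
echo "$SERIALIZER"
```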
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@tgravescs I have updated the docs PTAL
build
This reverts commit c5175b2. Signed-off-by: Raza Jafri <rjafri@nvidia.com>
* add nightly cache tests
* addressed review comments
* fixed copyright
* add serializer to 320
* Taking care of SNAPSHOT
* added more docs for Serializer per Spark version

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Raza Jafri <rjafri@nvidia.com>
This PR will run cache tests every night against Spark version 3.1.1 or greater. We can remove the version check once Spark 3.1.1 is the minimum supported version.
There are bugs in the serializer code that will be fixed once #1717 is merged.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>