Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CI fails on an inconsistent set of partial builds #3911

Closed
abellina opened this issue Oct 25, 2021 · 10 comments · Fixed by #3912
Closed

[BUG] CI fails on an inconsistent set of partial builds #3911

abellina opened this issue Oct 25, 2021 · 10 comments · Fixed by #3912
Assignees
Labels
bug Something isn't working build Related to CI / CD or cleanly building

Comments

@abellina
Copy link
Collaborator

I got this failure:

11:20:03  [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (create-parallel-world) on project rapids-4-spark_2.12: An Ant BuildException has occured: exec returned: 255
11:20:03  [ERROR] around Ant part ...<exec failonerror="true" dir="/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3066/dist/target" executable="/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3066/dist/scripts/binary-dedupe.sh"/>... @ 78:222 in /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3066/dist/target/antrun/build-main.xml

It looks to come from:

function verify_same_sha_for_unshimmed() {
  set -e
  class_file="$1"

  # the raw spark3xx-common.txt file list contains all single-sha1 classes
  # including the ones that are unshimmed. Instead of expensively recomputing
  # sha1 look up if there is an entry with the unshimmed class as a suffix

  class_file_quoted=$(printf '%q' "$class_file")

  # TODO currently RapidsShuffleManager is "removed" from /spark3* by construction in
  # dist pom.xml via ant. We could delegate this logic to this script
  # and make both simmpler
  if [[ ! "$class_file_quoted" =~ (com/nvidia/spark/rapids/spark3.*/.*ShuffleManager.class|org/apache/spark/sql/rapids/shims/spark3.*/ProxyRapidsShuffleInternalManager.class) ]]; then

    if ! grep -q "/spark.\+/$class_file_quoted" "$SPARK3XX_COMMON_TXT"; then
      echo >&2 "$classFile is not bitwise-identical across shims"
      exit 255
    fi
  fi
}

There is more log output, that indicates that classFile is not defined:

11:35:17       [exec] 8/ verifying unshimmed classes have unique sha1 across shims
11:35:17       [exec]  is not bitwise-identical across shims
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 25, 2021
@gerashegalov
Copy link
Collaborator

gerashegalov commented Oct 25, 2021

The bug here is that we failed to print the class file triggering the error. Otherwise it catches the very problem it is designed to:

adb32b6c95a5b9f2a82a11ed9828863750ee6e73 *./parallel-world/spark301/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark302/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark303/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark304/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark311/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark311cdh/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark312/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark313/com/nvidia/spark/rapids/SparkShims.class
d5baa0f70d061b869c501360da2d127ffe1d0e96 *./parallel-world/spark320/com/nvidia/spark/rapids/SparkShims.class

#3820 changed SparkShims trait causing implementing classes change the bytecode and it did not make it to the internal repo yet. So the build attempting to mix incompatible implementations with the one produced in this build fails as expected.

@jlowe
Copy link
Member

jlowe commented Oct 25, 2021

it did not make it to the internal repo yet.

So does that mean any PR that changes the shim code is likely to fail premerge CI until a nightly build runs?

@gerashegalov
Copy link
Collaborator

gerashegalov commented Oct 25, 2021

@jlowe yes, this is correct. We can fix this by eliminating the mismatch in premerge where we install only non-snapshots locally but request -Psnapshots for dist build.

https://github.com/NVIDIA/spark-rapids/blob/branch-21.12/jenkins/spark-premerge-build.sh#L49

mvn -U -B $MVN_URM_MIRROR '-Psnapshots,pre-merge' 

Making this consistent one way or another will fix it.

@jlowe
Copy link
Member

jlowe commented Oct 25, 2021

in premerge where we install only non-snapshots locally but request -Psnapshots for dist build.

I thought we were building and locally installing the snapshots in premerge? Looks like we are.
https://github.com/NVIDIA/spark-rapids/blob/branch-21.12/jenkins/spark-premerge-build.sh#L38-L49
Is one missing?

@gerashegalov
Copy link
Collaborator

I thought we were building and locally installing the snapshots in premerge? Looks like we are. https://github.com/NVIDIA/spark-rapids/blob/branch-21.12/jenkins/spark-premerge-build.sh#L38-L49 Is one missing?

@jlowe you are right, I got it backwards, confirming we do install snapshots

@jlowe
Copy link
Member

jlowe commented Oct 25, 2021

One issue is that we've peppered the premerge install script with -U which forces Maven to check for newer snapshots. If one happened to be published after we installed our version locally, it will prefer to download the one just published. I don't think we want (or need) -U in this script.

@tgravescs
Copy link
Collaborator

I think originally -U was used to pull the latest spark and cudf snapshots.. but that was before we changed the build and I'm not sure if at the time we were reusing .m2 across builds. We should check to see if build is reusing .m2. But either way the final 301 one doesn't need it.

@gerashegalov gerashegalov removed the ? - Needs Triage Need team to review and classify label Oct 25, 2021
@gerashegalov gerashegalov changed the title [BUG] CI failing at verify_same_sha_for_unshimmed [BUG] CI fails on an inconsistent set of partial builds Oct 25, 2021
@gerashegalov gerashegalov added this to the Oct 18 - Oct 29 milestone Oct 25, 2021
@gerashegalov gerashegalov added the build Related to CI / CD or cleanly building label Oct 26, 2021
@gerashegalov
Copy link
Collaborator

the root cause is fixed in #3912, a follow-up for diagnostic logging fix filed #3915

@pxLi
Copy link
Collaborator

pxLi commented Oct 26, 2021

This also failed in nightly build.
Per https://maven.apache.org/plugins/maven-dependency-plugin/usage.html#Overwrite_Rules, looks like if we -U the snapshot artifacts will always overwrite.
image

@gerashegalov CMIW. Not sure if we still need <overWriteSnapshots>true</overWriteSnapshots> in https://github.com/NVIDIA/spark-rapids/blob/branch-21.12/dist/pom.xml

@gerashegalov
Copy link
Collaborator

Thank @pxLi for pointing this out. Discussed this with @jlowe and @tgravescs , and we have an idea how to stop using this switch. I will add it to #3923

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working build Related to CI / CD or cleanly building
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants