[BUG] Unable to start the RapidsShuffleManager in databricks 9.1 #4141

Closed
abellina opened this issue Nov 17, 2021 · 2 comments
Assignees
Labels
bug Something isn't working P0 Must have for release shuffle things that impact the shuffle plugin

Comments

@abellina
Collaborator

abellina commented Nov 17, 2021

I tried this with Databricks 9.1; I haven't tried other runtimes. I see the following:

java.lang.IllegalStateException: The ShuffleBufferCatalog is not initialized but the RapidsShuffleManager is configured
	at org.apache.spark.sql.rapids.RapidsShuffleInternalManagerBase.$anonfun$getCatalogOrThrow$1(RapidsShuffleInternalManagerBase.scala:271)
	at scala.Option.getOrElse(Option.scala:189)

This used to work, and I am looking at it so I can document how one might use this shuffle manager in Databricks for #3768. It looks to me like a class mismatch at startup, because the code that decides whether the ShuffleBufferCatalog gets instantiated is:

  private lazy val isRapidsShuffleConfigured: Boolean = {
    conf.contains("spark.shuffle.manager") &&
      conf.get("spark.shuffle.manager") == GpuShuffleEnv.RAPIDS_SHUFFLE_CLASS
  }

  def init(): Unit = {
    if (isRapidsShuffleConfigured) {
      val diskBlockManager = new RapidsDiskBlockManager(conf)
      shuffleCatalog =
          new ShuffleBufferCatalog(RapidsBufferCatalog.singleton, diskBlockManager)
      shuffleReceivedBufferCatalog =
          new ShuffleReceivedBufferCatalog(RapidsBufferCatalog.singleton)
    }
  }
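
To make the failure mode concrete, here is a minimal standalone sketch of the same toggle. It is not the plugin's code: a plain `Map` stands in for the Spark conf, a `String` stands in for the real `ShuffleBufferCatalog`, and `ShuffleEnvSketch`, `Env`, and `RapidsShuffleClass` are hypothetical names. It only illustrates that the catalog stays empty, and the lookup throws, whenever the configured class does not match exactly or `init()` never runs.

```scala
object ShuffleEnvSketch {
  // Assumed value, copied from the cluster "Spark Config" in this issue.
  val RapidsShuffleClass: String =
    "com.nvidia.spark.rapids.spark312db.RapidsShuffleManager"

  // Hypothetical stand-in: `conf` replaces the Spark conf, and a String
  // replaces the real ShuffleBufferCatalog.
  final class Env(conf: Map[String, String]) {
    private var shuffleCatalog: Option[String] = None

    // True only when the key is present and the value matches exactly.
    private lazy val isRapidsShuffleConfigured: Boolean =
      conf.get("spark.shuffle.manager").contains(RapidsShuffleClass)

    // Creates the catalog only when the shuffle manager is configured.
    def init(): Unit =
      if (isRapidsShuffleConfigured) shuffleCatalog = Some("catalog")

    // Mirrors the error in the stack trace when no catalog was created.
    def getCatalogOrThrow: String =
      shuffleCatalog.getOrElse(
        throw new IllegalStateException(
          "The ShuffleBufferCatalog is not initialized but the " +
            "RapidsShuffleManager is configured"))
  }
}
```

In this sketch a correctly configured `Env` still throws from `getCatalogOrThrow` if `init()` was never called, so the exception alone does not distinguish a class mismatch from an initialization path that never ran.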

The steps I followed were to install UCX on the nodes via an init script (which does install UCX successfully):

#!/bin/bash
sudo apt install -y wget && \
wget https://github.com/openucx/ucx/releases/download/v1.11.2/ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb &&
sudo dpkg -i ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb &&
rm ucx-v1.11.2-ubuntu18.04-mofed5.x-cuda11.2.deb

Then I set up the "Spark Config" in my cluster:

spark.shuffle.service.enabled false
spark.shuffle.manager com.nvidia.spark.rapids.spark312db.RapidsShuffleManager
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n

My next step is to build my own jar with some added logging.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 17, 2021
@sameerz sameerz added the P0 Must have for release label Nov 17, 2021
@abellina abellina self-assigned this Nov 17, 2021
@abellina abellina added the shuffle things that impact the shuffle plugin label Nov 17, 2021
@abellina abellina changed the title [BUG] Unable to start the RapidsShuffleManage in databricks 9.1 [BUG] Unable to start the RapidsShuffleManager in databricks 9.1 Nov 17, 2021
@abellina
Collaborator Author

OK, this is likely due to misconfiguration. I tried adding some debugging and didn't see the logs, but I also didn't see us trying to get memory from RMM, which means the executor plugin was never started. Once I confirm, I'll likely close this.

@abellina
Collaborator Author

Confirming the above: I had forgotten a couple of things. I had more than one GPU per box, which can work if the resources are set up correctly, and I had also not turned on the SQLPlugin itself in the config.

We document some of this in the "advanced" setup here: https://github.com/NVIDIA/spark-rapids/blob/branch-21.12/docs/get-started/getting-started-databricks.md#advanced-cluster-configuration. This unblocks what I was trying to do, which was to document how one might get UCX running here.

spark.plugins com.nvidia.spark.SQLPlugin
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.shuffle.service.enabled false
spark.sql.adaptive.enabled false
spark.shuffle.manager com.nvidia.spark.rapids.spark312db.RapidsShuffleManager
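
Since the root cause was a forgotten setting, a small pre-flight check can catch this kind of omission early. The following is a sketch only: `ShuffleConfCheck` and `problems` are hypothetical names, not part of the plugin, and a plain `Map` stands in for the cluster's Spark conf. The expected values are the Spark settings listed above; the UCX environment variables are left out for brevity.

```scala
object ShuffleConfCheck {
  // Expected values taken from the Spark Config listed in this issue.
  val expected: Seq[(String, String)] = Seq(
    "spark.plugins" -> "com.nvidia.spark.SQLPlugin",
    "spark.dynamicAllocation.enabled" -> "false",
    "spark.shuffle.service.enabled" -> "false",
    "spark.sql.adaptive.enabled" -> "false",
    "spark.shuffle.manager" ->
      "com.nvidia.spark.rapids.spark312db.RapidsShuffleManager"
  )

  // Returns one message per missing or mismatched key; empty means all match.
  // Simplification: uses exact string equality, so a spark.plugins value that
  // lists several plugins would be reported as a mismatch.
  def problems(conf: Map[String, String]): Seq[String] =
    expected.flatMap { case (key, want) =>
      conf.get(key) match {
        case Some(got) if got == want => None
        case Some(got) => Some(s"$key is '$got', expected '$want'")
        case None => Some(s"$key is not set")
      }
    }
}
```

Running `ShuffleConfCheck.problems` against the first configuration in this issue would report `spark.plugins is not set`, among other missing keys, which is the omission that kept the executor plugin from starting.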

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 23, 2021