I tried this with Databricks 9.1, but I haven't tried it in other runtimes. I see the following:
java.lang.IllegalStateException: The ShuffleBufferCatalog is not initialized but the RapidsShuffleManager is configured
    at org.apache.spark.sql.rapids.RapidsShuffleInternalManagerBase.$anonfun$getCatalogOrThrow$1(RapidsShuffleInternalManagerBase.scala:271)
    at scala.Option.getOrElse(Option.scala:189)
This used to work, and I am looking at it in order to document how one might use this shuffle manager in Databricks for #3768. It seems to me that we have a class mismatch at startup, because the code that toggles whether the ShuffleBufferCatalog gets instantiated is:
private lazy val isRapidsShuffleConfigured: Boolean = {
  conf.contains("spark.shuffle.manager") &&
    conf.get("spark.shuffle.manager") == GpuShuffleEnv.RAPIDS_SHUFFLE_CLASS
}

def init(): Unit = {
  if (isRapidsShuffleConfigured) {
    val diskBlockManager = new RapidsDiskBlockManager(conf)
    shuffleCatalog =
      new ShuffleBufferCatalog(RapidsBufferCatalog.singleton, diskBlockManager)
    shuffleReceivedBufferCatalog =
      new ShuffleReceivedBufferCatalog(RapidsBufferCatalog.singleton)
  }
}
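To make the suspected mismatch concrete, here is a minimal sketch of the same gating logic with a plain Map standing in for SparkConf. The object name, the `expectedClass` parameter, and the `com.example...` class names are hypothetical placeholders, not the plugin's real values. It only illustrates that the check is an exact string comparison, so a shuffle manager class from a different shim would leave the catalog uninitialized.

```scala
// Sketch only: a Map stands in for SparkConf, and all class names
// below are hypothetical placeholders, not real plugin classes.
object ShuffleToggleSketch {
  val ShuffleManagerKey = "spark.shuffle.manager"

  // Mirrors the quoted check: the key must be present AND equal the
  // exact class name that this build of the plugin expects.
  def isRapidsShuffleConfigured(
      conf: Map[String, String],
      expectedClass: String): Boolean =
    conf.get(ShuffleManagerKey).contains(expectedClass)

  def main(args: Array[String]): Unit = {
    val expected = "com.example.shimA.RapidsShuffleManager" // hypothetical
    val matching = Map(ShuffleManagerKey -> expected)
    val mismatched =
      Map(ShuffleManagerKey -> "com.example.shimB.RapidsShuffleManager")

    // Only the exact match would let init() create the catalogs.
    println(isRapidsShuffleConfigured(matching, expected))
    println(isRapidsShuffleConfigured(mismatched, expected))
    println(isRapidsShuffleConfigured(Map.empty[String, String], expected))
  }
}
```

If the toggle evaluates to false while Spark still loads a RapidsShuffleManager, the first lookup of the catalog would fail in the way the stack trace shows.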
The first step I followed was to install UCX on the nodes via an init script (which does install UCX successfully):
abellina changed the title from "[BUG] Unable to start the RapidsShuffleManage in databricks 9.1" to "[BUG] Unable to start the RapidsShuffleManager in databricks 9.1" on Nov 17, 2021
OK, this is likely due to misconfiguration. I tried adding some debugging and didn't see the logs, but I also didn't see us trying to get memory from RMM, which just means the executor plugin really wasn't started. Once I confirm, I'll likely close this.
Just confirming the above: I had forgotten a couple of things. I had more than one GPU per box, which is possible to use if we set up the resources correctly, and I had also not turned on the SQLPlugin itself in the config.
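A small pre-flight check could have caught this kind of misconfiguration earlier. The sketch below is an illustration, not part of the plugin: it uses a plain Map in place of the cluster's Spark config, the object and method names are hypothetical, and it assumes the plugin is enabled by listing `com.nvidia.spark.SQLPlugin` in `spark.plugins`. It flags the case where a shuffle manager is set but the SQLPlugin is not.

```scala
// Sketch only: a Map stands in for the cluster's Spark config.
// The object and method names are hypothetical helpers.
object PluginPreflightSketch {
  val SqlPluginClass = "com.nvidia.spark.SQLPlugin"

  // spark.plugins is a comma-separated list of plugin class names.
  def pluginEnabled(conf: Map[String, String]): Boolean =
    conf.get("spark.plugins")
      .exists(_.split(",").map(_.trim).contains(SqlPluginClass))

  // Returns human-readable warnings; an empty result means nothing was flagged.
  def problems(conf: Map[String, String]): Seq[String] =
    if (conf.contains("spark.shuffle.manager") && !pluginEnabled(conf)) {
      Seq("spark.shuffle.manager is set, but spark.plugins does not include " +
        SqlPluginClass)
    } else {
      Seq.empty
    }
}
```

For example, `PluginPreflightSketch.problems(Map("spark.shuffle.manager" -> "some.ShuffleManager"))` returns one warning, while adding `"spark.plugins" -> "com.nvidia.spark.SQLPlugin"` to the map returns none.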
Then I set up the "Spark Config" in my cluster:
Next step is to build my own jar and add some logging.