---
layout: page
title: Shim Development
nav_order: 4
parent: Developer Overview
---
RAPIDS Accelerator For Apache Spark supports multiple feature version lines of Apache Spark, such as 3.1.x, 3.2.x, and 3.3.0, as well as a number of vendor releases that contain a mix of patches from different upstream releases. These artifacts are generally incompatible with each other, at the source code level and even more often at the binary level. The role of the Shim layer is to hide these issues from the common code, maximize reuse, and minimize logic duplication.
This is achieved by using a ServiceProvider pattern. All Shims implement the same API, and the suitable Shim implementation is loaded after detecting the Spark build version that is attempting to instantiate our plugin. We use the ShimLoader class as a tight entry point for interacting with the host Spark runtime.
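As a rough, hedged illustration of this pattern (the trait and method names below are simplified stand-ins, not the plugin's actual API), the version-matching lookup can be sketched as:

```scala
// Sketch of a ServiceProvider-style shim lookup keyed on the Spark build version.
trait ShimProviderSketch {
  def matchesVersion(sparkBuildVersion: String): Boolean
}

object ShimLoaderSketch {
  // Pick the provider that claims the Spark version loading the plugin.
  def findProvider(providers: Seq[ShimProviderSketch]): ShimProviderSketch = {
    val version = org.apache.spark.SPARK_VERSION
    providers.find(_.matchesVersion(version)).getOrElse(
      throw new IllegalStateException(s"Could not find a shim provider for Spark $version"))
  }
}
```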
In the following we provide recipes for typical scenarios addressed by the Shim layer.
Method signature discrepancies between Spark versions are among the easiest issues to resolve. We define a method in the `SparkShims` trait covering a superset of the parameters from all versions and call

`SparkShimImpl.methodWithDiscrepancies(p_1, ..., p_n)`

instead of referencing the version-dependent method directly. Shim implementations (`SparkShimImpl`) are in charge of dispatching it further to the correct version-dependent methods. Moreover, unlike in the sections below, conflicts between versions are easily avoided by using different package or class names for the conflicting Shim implementations.
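As a hedged sketch of the idea (the parameter types and per-version class names below are made up for illustration; only `SparkShims` and `SparkShimImpl` come from the text above):

```scala
// The common code programs against the superset signature on the shim trait.
trait SparkShims {
  def methodWithDiscrepancies(p1: String, p2: Int, p3: Option[Long]): Unit
}

// Hypothetical shim compiled only against Spark 3.1.x sources:
// this Spark line needs only the first two parameters.
class Spark31XShims extends SparkShims {
  override def methodWithDiscrepancies(p1: String, p2: Int, p3: Option[Long]): Unit =
    println(s"3.1.x dispatch: $p1, $p2")
}

// Hypothetical shim compiled only against Spark 3.2.x sources:
// this Spark line also consumes the third parameter.
class Spark32XShims extends SparkShims {
  override def methodWithDiscrepancies(p1: String, p2: Int, p3: Option[Long]): Unit =
    println(s"3.2.x dispatch: $p1, $p2, ${p3.getOrElse(0L)}")
}
```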
Upstream base classes we derive from might be incompatible in the sense that one version requires us to implement/override a method `M` whereas the other prohibits it by marking the base implementation `final`. For example, `org.apache.spark.sql.catalyst.trees.TreeNode` changes between Spark 3.1.x and Spark 3.2.x. So instead of deriving from such classes directly we inject an intermediate trait, e.g. `com.nvidia.spark.rapids.shims.ShimExpression`, whose source code varies depending on the Spark version we compile against, as you can see by comparing the per-version `ShimExpression` sources.
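To make this concrete without depending on the exact `TreeNode` details, here is a hedged, self-contained sketch; the two variants are placed in differently named packages only so they can coexist in one illustrative file, whereas the real shim trait keeps the same fully qualified name in both per-version source roots:

```scala
// Variant compiled from the hypothetical Spark 3.1.x source root:
// the upstream base class imposes no extra requirement, so the trait is empty.
package sketch.spark31x {
  trait ShimExpression
}

// Variant compiled from the hypothetical Spark 3.2.x source root:
// the upstream base class now demands an extra member, so the trait
// provides a default for all plugin expressions that mix it in.
package sketch.spark32x {
  trait ShimExpression {
    protected def withNewChildrenSketch(newChildren: Seq[AnyRef]): AnyRef = this
  }
}
```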
This resolves the compile-time problems; however, we now face a problem at run time.

The Plugin has to be able to deterministically load the right class files for identically named classes depending on the detected Spark runtime version. This is typically solved by using ASM-based relocation in the bytecode; however, that does not work easily with Scala packages.

So instead we resort to the idea of JDK's ParallelWorldClassloader, in combination with the fact that the Spark runtime uses mutable classloaders that we can alter after detecting the runtime version. Using JarURLConnection URLs we create a Parallel World of the current version within the jar, e.g.:
Spark 3.0.2's URLs:
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/spark3xx-common/
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/spark302/
Spark 3.2.0's URLs:
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/spark3xx-common/
jar:file:/home/spark/rapids-4-spark_2.12-22.06.0.jar!/spark320/
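As a hedged sketch of how such URLs can be used (the real ShimLoader mutates Spark's own mutable classloader rather than creating a new one, and the helper name here is made up):

```scala
import java.net.{URL, URLClassLoader}

object ParallelWorldSketch {
  // Build a classloader whose search path is the shared location plus the
  // detected version's "parallel world" inside the same plugin jar.
  def loaderFor(pluginJarPath: String, shimId: String, parent: ClassLoader): ClassLoader = {
    val roots = Array(
      new URL(s"jar:file:$pluginJarPath!/spark3xx-common/"),
      new URL(s"jar:file:$pluginJarPath!/$shimId/"))
    new URLClassLoader(roots, parent)
  }
}

// e.g. on a Spark 3.2.0 runtime (paths purely illustrative):
// ParallelWorldSketch.loaderFor("/home/spark/rapids-4-spark_2.12-22.06.0.jar", "spark320", getClass.getClassLoader)
```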
Most classes needed by the plugin can be disambiguated using Parallel World locations without reservations, except for documented classes exposed to the user that may be loaded before the Plugin is even instantiated by the Spark runtime. The most important example of such a class is a configurable `ShuffleManager`. `ShuffleManager` has also changed in a backwards-incompatible manner over the span of supported Spark versions.
The first issue is that, since such a class is loaded by Spark directly, outside our control, we cannot have a single class name for our implementation that works across versions. This is resolved by having documented facade classes with a shim specifier in their package names.
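A hedged sketch of such a facade (the stub base class is made up so the example is self-contained; the real facades extend the shared base described later in this section):

```scala
// Per-version facade: the package name carries the shim specifier, so each
// supported Spark version gets its own documented, user-visible class name.
package com.nvidia.spark.rapids.spark320

import org.apache.spark.SparkConf

// Stub standing in for the shared, bitwise-identical base class.
abstract class FacadeBaseSketch(conf: SparkConf, isDriver: Boolean)

class RapidsShuffleManager(conf: SparkConf, isDriver: Boolean)
    extends FacadeBaseSketch(conf, isDriver)
```

The user then selects the facade matching their Spark build, e.g. via `spark.shuffle.manager=com.nvidia.spark.rapids.spark320.RapidsShuffleManager` on a Spark 3.2.0 cluster (consult the shuffle documentation for the exact class names of a given release).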
The second issue is that every parent class/trait in the inheritance graph of such a class is loaded using a classloader outside the Plugin's control. Therefore, all this bytecode must reside in the conventional jar location, and it must be bitwise-identical across all shims. The only way to keep the source code for shared functionality unduplicated (i.e., in `sql-plugin/src/main/scala` as opposed to duplicating it in the `sql-plugin/src/main/3*/scala` source code roots) is to delay inheriting `ShuffleManager` until as late as possible, as close as possible to the facade class where we have to split the source code anyway. Use traits as much as possible for flexibility, as in the sketch below.
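A hedged sketch of the late-inheritance approach, using a made-up stand-in for the version-dependent Spark base class:

```scala
// Stand-in for the upstream class whose API differs across Spark versions.
abstract class VersionedShuffleManagerSketch {
  def stop(): Unit
}

// Shared logic: compiled once and placed in the common, bitwise-identical
// location; crucially it does NOT inherit the version-dependent base class.
trait RapidsShuffleLogicSketch {
  def releaseResources(): Unit = println("releasing shared shuffle resources")
}

// Per-version glue: the only place that inherits the version-dependent base,
// kept as thin as possible and compiled per shim.
class PerShimShuffleManagerSketch
    extends VersionedShuffleManagerSketch with RapidsShuffleLogicSketch {
  override def stop(): Unit = releaseResources()
}
```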
The third issue may arise from the fact that the shared logic may transitively reference a class that for one reason or another resides in a Parallel World. Untangling this is tedious and may be unnecessary. The following approach robustly prevents running into such issues.
We know that at the time such a class is loaded by Spark it's not strictly needed if the Plugin has not been loaded yet. More accurately, it may not be strictly needed until later, when the first query can be run and the Spark SQL session and its extensions are initialized. It improves the user experience, though, if the first query is not penalized more than necessary. By design, the Plugin guarantees that the classloader is set up at load time, before the `init` method is called on the DriverPlugin and ExecutorPlugin instances.
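A hedged sketch of that ordering (the `SparkPlugin`, `DriverPlugin`, and `ExecutorPlugin` interfaces are Spark's; everything else below is made up for illustration):

```scala
import java.util
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

object ClassLoaderFixupSketch {
  // Placeholder for the real parallel-world classloader setup.
  def setup(): Unit = ()
}

class SQLPluginSketch extends SparkPlugin {
  // Runs when the plugin class is instantiated by Spark, i.e. strictly before
  // init() is invoked on the driver/executor plugin instances returned below.
  ClassLoaderFixupSketch.setup()

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): util.Map[String, String] = {
      // The parallel-world classloader is already in place here.
      new util.HashMap[String, String]()
    }
  }

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {}
}
```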
By making a visible class merely a wrapper of the real implementation, extending `scala.Proxy` where `self` is a `lazy val`, we prevent classes from Parallel Worlds from being loaded before they can be, and actually are, required.
For examples see:

- `abstract class ProxyRapidsShuffleInternalManagerBase`
- `class ExclusiveModeGpuDiscoveryPlugin`
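A hedged, self-contained sketch of the pattern (only `scala.Proxy` and the lazy `self` come from the description above; the other names are invented):

```scala
// Interface the visible class must expose; in the plugin this role is played
// by the relevant Spark API.
trait WorkerApiSketch {
  def doWork(x: Int): Int
}

object ShimInstantiationSketch {
  // Placeholder for reflectively loading the version-specific implementation
  // through the shim classloader.
  def newInternalImplementation(): WorkerApiSketch = new WorkerApiSketch {
    override def doWork(x: Int): Int = x + 1
  }
}

// The visible class is merely a wrapper: because `self` is a lazy val, nothing
// from a Parallel World is loaded until the first delegated call.
class VisibleFacadeSketch extends WorkerApiSketch with Proxy {
  lazy val self: WorkerApiSketch = ShimInstantiationSketch.newInternalImplementation()

  // Hand-written delegation, as noted below.
  override def doWork(x: Int): Int = self.doWork(x)
}
```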
Note that we currently have to manually code up the delegation methods to the tune of:
`def method(x: SomeThing) = self.method(x)`

This could be automatically generated with a simple tool processing the `scalap` output, or with Scala macros at build/compile time. Pull requests are welcome.