
Improve shading scope #1398

Closed
jlowe opened this issue Dec 15, 2020 · 4 comments

Labels
task Work required that improves the product but is not user facing

Comments

jlowe (Member) commented Dec 15, 2020

Is your feature request related to a problem? Please describe.
Currently the build system is shading things like ORC and Hive across the entire SQL plugin, and this triggers some issues with areas of the plugin that need to access Hive classes when interacting with spark-hive Catalyst classes (e.g.: #1393).

Describe the solution you'd like
I think it would be cleaner if the portions of the plugin that require shading lived in a separate Maven submodule; then we could avoid shading across the entire SQL plugin code. Shading would be localized to the code that requires it.

Describe alternatives you've considered
If we could get out of the business of shading entirely that would be awesome, but I don't see how we can support multiple versions of Apache Spark that could contain conflicting versions of Parquet/ORC if we try to use those dependencies directly.

@jlowe jlowe added feature request New feature or request ? - Needs Triage Need team to review and classify build Related to CI / CD or cleanly building labels Dec 15, 2020
@sameerz sameerz added task Work required that improves the product but is not user facing and removed ? - Needs Triage Need team to review and classify feature request New feature or request build Related to CI / CD or cleanly building labels Dec 15, 2020
@jlowe jlowe changed the title [FEA] Improve shading scope Improve shading scope Dec 15, 2020
res-life (Collaborator) commented Dec 28, 2021

Issue
Why are Hive classes excluded in the aggregator pom? Because of a conflict over the package org.apache.hadoop.hive: the hive-storage-api jar and the Hive UDF jars both contain this package, and the ORC jars depend on hive-storage-api.

The Hive packages are relocated in the RAPIDS UDF code, but these Hive classes are not included in the shaded jar, because the Hive jars are provided dependencies and the shade plugin only includes compile-scope dependencies.

For example, in HiveSimpleUDF.scala:

lazy val function = funcWrapper.createFunction[UDF]()

The type of "function" is org.apache.hadoop.hive.ql.exec.UDF. If that package is not excluded in aggregator.xml, this line is relocated to

lazy val function: com.nvidia.shaded.spark.hadoop.hive.exec.UDF = funcWrapper.createFunction[UDF]()

The package of UDF in the Spark runtime hive-exec jar is still org.apache.hadoop.hive.ql.exec.UDF, so this causes a ClassNotFoundException at runtime. That is why some exclusions were added to aggregator.xml.
ORC depends on hive-storage-api, and hive-storage-api also contains the package org.apache.hadoop.hive, so relocating and shading the ORC code also relocates the Hive classes that the UDF code refers to.
In short: the RAPIDS code involving ORC should be shaded, while the RAPIDS code involving UDFs should not be.
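To illustrate, here is a minimal sketch of what such an exclude-based relocation might look like in a maven-shade-plugin configuration. The element names are standard shade-plugin schema, but the specific patterns are illustrative, not the actual aggregator.xml contents:

  <!-- Hypothetical relocation with excludes; patterns are illustrative. -->
  <relocations>
    <relocation>
      <!-- Relocate ORC's copy of the Hive storage-api classes... -->
      <pattern>org.apache.hadoop.hive</pattern>
      <shadedPattern>com.nvidia.shaded.spark.hadoop.hive</shadedPattern>
      <!-- ...but leave the UDF-facing classes untouched so they still
           resolve against Spark's runtime hive-exec jar. -->
      <excludes>
        <exclude>org.apache.hadoop.hive.ql.exec.*</exclude>
      </excludes>
    </relocation>
  </relocations>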
Option 1: Split the ORC reader/writer related code into a new submodule.
Shade the ORC reader/writer related code and leave the rest of the sql-plugin code unshaded.
In practice, creating the new module is not a routine refactor, because all the related classes must move along with it.
For example, moving GpuOrcScanBase means arm.scala must be moved, and then RapidsBuffer, then RapidsDiskBlockManager, then ShimDiskBlockManager. This would make a mess of the code.
Option 2: Use include instead of exclude in aggregator.xml (sketched below).
Collect all the classes in orc-core, orc-shims, orc-mapreduce, and protobuf-java, put them into includes in aggregator.xml, and remove all the excludes.
Of course, the classes would have to be collected for every supported Spark version if PR #4408 is merged.
Of course, we could also reduce the number of included classes by collecting only the classes actually touched, using some Java agent code.
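A minimal sketch of the include-based relocation Option 2 describes, again assuming the maven-shade-plugin; the class patterns are placeholders standing in for the collected ORC class list, not real entries:

  <!-- Hypothetical include-based relocation; patterns are placeholders. -->
  <relocations>
    <relocation>
      <pattern>org.apache.orc</pattern>
      <shadedPattern>com.nvidia.shaded.spark.orc</shadedPattern>
      <!-- Only the explicitly listed classes are relocated; everything
           else, including the Hive UDF classes, is left alone. -->
      <includes>
        <include>org.apache.orc.impl.*</include>
        <include>org.apache.orc.OrcFile</include>
      </includes>
    </relocation>
  </relocations>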

jlowe (Member, Author) commented Jan 5, 2022

The problem with Option 2 is that it's essentially where we already are now, just using an include approach rather than an exclude approach. Either way, we have to carefully maintain the include/exclude list as the code evolves; otherwise we get a nasty problem caught not at compile time but only at runtime, where an incorrect shading policy results in an unresolved or incompatible class error. Also, Option 2 as written seems to have problems since it fails to address the Hive classes that are needed by ORC. If we run against a Spark version that does not provide an ORC-with-hive classifier, we'll get class-not-found errors for the Hive classes ORC needs at runtime unless we also shade Hive. And then we're back to the 'oops, I shaded too much of Hive' problem.

Option 1 is a lot more up-front code motion, but it has the nice property that once the code has been moved, we no longer need to manually keep the shading inclusion/exclusion list in sync with the code as it evolves. We can shade all of the ORC and Hive packages for the plugin I/O module without needing to worry about accidentally breaking the UDF code or any other code in the plugin that needs to access raw ORC or Hive classes directly for some reason. Basically this option means we need to split sql-plugin into three modules. Note that I'm using sample names here; I'm not suggesting these are the best names:

  • sql-plugin-core
    This module contains all "core" or library-like functionality of sql-plugin, like Arm, type checking, spill framework, etc. Essentially everything that doesn't need to depend on the I/O execs like GpuParquetScan, GpuOrcScan, etc. and does not access Parquet, ORC, Hive, Hadoop classes, etc.
  • sql-plugin-io
    This module contains all the reader and writer code, e.g.: GpuParquetScan, GpuOrcScan, etc. It depends on the sql-plugin-core module. This module will shade the ORC/Hive jars as we are doing today but without the need for a maintained exclusion list.
  • sql-plugin
    This module contains all the top-level functionality that needs the I/O execs from sql-plugin-io but otherwise doesn't access Parquet, ORC, Hive, Hadoop or other problematic Spark dependencies that might require shading currently or in the future. Example classes in this module would include GpuOverrides, Plugin, etc. This module depends on sql-plugin-io.
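For concreteness, a minimal sketch of how the parent pom's module list might look under this split; the names are just the sample names above, not a committed design:

  <!-- Hypothetical parent pom <modules> section for Option 1. -->
  <modules>
    <!-- Library-like core: Arm, type checking, spill framework, ... -->
    <module>sql-plugin-core</module>
    <!-- Reader/writer code; the only module that shades ORC/Hive. -->
    <module>sql-plugin-io</module>
    <!-- Top-level functionality: GpuOverrides, Plugin, ... -->
    <module>sql-plugin</module>
  </modules>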

To be clear, I'm not adamant that we use Option 1, but I'm not a big fan of Option 2 since it seems essentially equivalent to what we're already doing today, just switching to an include list rather than an exclude list (and I don't think it will actually work in all cases because it ignores the Hive classes). That's only better or worse depending on which part of the code base we expect to evolve faster, as both require manual maintenance of the inclusion or exclusion list for shading.

So from my perspective, the discussion should be about whether we do Option 1 or stick with the manually maintained shading lists we already have today.

res-life (Collaborator) commented Feb 8, 2022

PR #4408 already removed the ORC shading, so we can close this issue. @jlowe

jlowe (Member, Author) commented Feb 8, 2022

Agree, thanks @res-life!

@jlowe jlowe closed this as completed Feb 8, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023