
Add support for AtomicCreateTableAsSelectExec #895

Merged: 2 commits merged into NVIDIA:dev on Apr 3, 2024

Conversation

amahussein (Collaborator)

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes #881

  • Update the support level for existing Execs such as OverwriteByExpr, AppendDataExecV1, and AtomicReplaceTableAsSelect.
  • The Op is mainly supported if the provider is Delta. Otherwise, the GPU should fall back to the CPU.
  • We check whether the provider is Delta by matching a regex against the node description (see the sketch after this list). Ideally we would check the second argument of TableSpec, but we don't have an AST for that expression.
  • Updates the score sheets, assigning a default score as DataWritingCmd.
  • Fixes a hardcoded exec. Before this change, all write commands mapped to DataWritingCmdExec; this code changes that behavior when the Delta provider is detected.
  • Updates the override_supported_configs.json file.
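
A minimal standalone sketch of that provider check. The regex and the object name below are assumptions for illustration only; the PR's actual check lives in the tools' DeltaLakeHelper and may use different patterns:

// Illustrative only: detect a Delta provider by matching the physical node's
// description. The exact regex used in the PR may differ.
object DeltaProviderCheck {
  private val deltaProviderRegex = "(?i)(delta|_delta_log)".r

  def isDeltaProvider(nodeDesc: String): Boolean =
    deltaProviderRegex.findFirstIn(nodeDesc).isDefined
}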

@amahussein amahussein added bug Something isn't working core_tools Scope the core module (scala) labels Apr 2, 2024
@amahussein amahussein self-assigned this Apr 2, 2024

@parthosa parthosa left a comment


Thanks @amahussein. Do we need unit tests for these, since delta tables are involved as well?

// Checks whether a node is a write CMD Exec
def isWritingCmdExec(nodeName: String): Boolean = {
  logicalWriteCommands.exists(nodeName.contains(_)) || DeltaLakeHelper.accepts(nodeName)
}

def getPhysicalExecName(opName: String): String = {
parthosa (Collaborator)

Does this look risky? Earlier, if the logical operator was not present in logicalToPhysicalCmdMap, it would throw an error. Now we would be returning the defaultPhysicalCMD value.

amahussein (Collaborator, Author)

It is safer to have a default value. For example, this allows us to use the same method to get a physical representation of Delta operators that are missing from the CSV files (i.e., MergeIntoCommandEdge, WriteIntoDeltaCommand, and SaveIntoDataSrcCMD).
Otherwise, we would have to fill the map with all the operators in Hive/Delta, which could crash if someone forgets to update the map with a new entry.
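
For context, a minimal sketch of the lookup-with-default pattern being discussed here, using the names from this thread (logicalToPhysicalCmdMap, defaultPhysicalCMD); the values and entries shown are placeholders, not the real table:

// Sketch only: a missing key used to throw via Map#apply; getOrElse falls back
// to a default physical command instead of failing.
val defaultPhysicalCMD = "DataWritingCommandExec"  // assumed default label
val logicalToPhysicalCmdMap: Map[String, String] = Map(
  "InsertIntoHadoopFsRelationCommand" -> "DataWritingCommandExec"  // placeholder entry
)

def getPhysicalExecName(opName: String): String =
  logicalToPhysicalCmdMap.getOrElse(opName, defaultPhysicalCMD)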

parthosa (Collaborator)

Okay. This is definitely reducing redundancy. We should be aware of this default behaviour.

amahussein (Collaborator, Author)

Thanks @amahussein. Do we need unit tests for these, since delta tables are involved as well?

We are lacking some of the tests in this area due to the complexity of the environment.
Spark 3.5.0 requires Delta 3.x, which requires Scala 2.13 binaries. As a workaround, I import the 2.12 binaries, which somehow work with the unit tests we have.
A unit test would create a Spark session with the Delta Lake catalog enabled, then run SQL queries that generate the execs we are interested in (roughly along the lines of the sketch below).

Do you have the bandwidth to take this over by committing new tests to this PR?
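
A rough sketch of what such a test setup could look like, assuming a local SparkSession with the standard Delta Lake extension and catalog settings; the app, table, and view names are made up for illustration:

import org.apache.spark.sql.SparkSession

// Illustrative only: a local session wired for Delta Lake, followed by a CTAS
// query that should be planned through an atomic create-table-as-select exec
// when the catalog supports staged (atomic) table creation.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("delta-ctas-test")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

spark.range(100).createOrReplaceTempView("src")
spark.sql("CREATE TABLE delta_ctas USING DELTA AS SELECT id FROM src")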

@amahussein amahussein requested a review from parthosa April 2, 2024 17:53

parthosa commented Apr 2, 2024

Do you have the bandwidth to take this over by committing new tests to this PR?

Sure. I was looking into delta tables a few weeks ago. This would be a good starting point.


@parthosa parthosa left a comment


Thanks @amahussein. Based on an offline discussion, we need to bump the delta package version to 3.x to support unit tests for Spark 3.5.0. Created a separate issue #899 to add unit tests for delta tables.
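
For reference, a hedged sketch of what that bump could look like in sbt-style coordinates; the repo's actual build files and the exact Delta 3.x version may differ. Note that Delta renamed its artifact from delta-core to delta-spark in the 3.x line:

// Illustrative sbt-style dependency only; pick the Delta 3.x release that
// matches the Spark 3.5.x line used by the tests.
libraryDependencies += "io.delta" %% "delta-spark" % "3.1.0" % Test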

@amahussein amahussein merged commit 83c8f7d into NVIDIA:dev Apr 3, 2024
15 checks passed
@amahussein amahussein deleted the spark-rapids-tools-865-881 branch April 3, 2024 18:11