
Add rootExecutionID to output csv files #871

Merged (5 commits) on Mar 25, 2024

Conversation

nartal1
Collaborator

@nartal1 nartal1 commented Mar 23, 2024

This fixes #818

This PR adds a new column, "Root SQL ID". It is a follow-on of #810.
For the profiling tool, the new column is added to: sql_duration_and_executor_cpu_time_percent.csv
For the qualification tool, the new column is added to the per-SQL output file: rapids_4_spark_qualification_output_persql.csv

Updated the current unit tests to include the new column. For event logs that do not have rootSQLIds, the column is left empty.

Notes:

  • In order to generate a non-empty Root SQL ID, you need Spark 3.4.1+ on the classpath
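For illustration, a minimal sketch (not part of this PR; the column name follows the description above, and the grouping helper is hypothetical) of reading the per-SQL qualification CSV and grouping rows by the new column, treating the empty value as "no root":

```python
import csv
from collections import defaultdict

def group_by_root_sql_id(csv_path):
    """Group per-SQL rows by the new "Root SQL ID" column.

    Rows from event logs written without rootExecutionId have an
    empty value in that column and are collected under "<none>".
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            root_id = row.get("Root SQL ID", "")  # empty when unavailable
            groups[root_id or "<none>"].append(row.get("SQL ID"))
    return dict(groups)
```

This only assumes the documented header name; the rest of the per-SQL columns pass through untouched in each row dict.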

@nartal1 nartal1 self-assigned this Mar 23, 2024
@nartal1 nartal1 added the core_tools Scope the core module (scala) label Mar 23, 2024
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Collaborator

@amahussein amahussein left a comment


Thanks @nartal1 for making the changes.

There is a tricky part that we need to address in order to fully resolve #818 and #780.
By default we build the Tools for Spark 3.3.3, which means the user-tools wrapper is configured to download dependencies for Spark 3.3.3.
As a result, running the default build neither fixes the duplicate calculation nor shows the new column correctly, so users won't see the change. That last point matters most because we are working on integrating estimation modeling into the Qualification tool.

There are two different ways to address this problem:

  1. Upgrade the default build to a more recent Spark. This requires:

    • updating the accepted buildVer in the Maven pom and setting a different default profile for the build.
    • updating the user-tools CSP configurations accordingly. This is tricky because we need to find CSP and Hadoop dependency jars that match the Spark version.
  2. Modify the event-log parser to handle the rootID field regardless of the Spark version used at runtime. Technically it is an easy fix, but it would be good to find the right way to add it so that it can easily be ported to other overrides we might need in the future.
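The parser-side option (2) could look roughly like the following sketch. It is not the PR's Scala code; it is a hedged Python illustration of a version-tolerant parse, assuming the Spark 3.4.1+ event-log schema where SparkListenerSQLExecutionStart carries a rootExecutionId field that older Spark builds simply omit:

```python
import json

# Fully-qualified event name as it appears in Spark event logs.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def parse_root_execution_id(eventlog_line):
    """Return (executionId, rootExecutionId) for a SQLExecutionStart
    event, or None for any other event.

    rootExecutionId comes back as None when the log was written by a
    Spark build (< 3.4.1) that does not emit the field, which is what
    lets the same parser run against any Spark version.
    """
    event = json.loads(eventlog_line)
    if event.get("Event") != SQL_START:
        return None
    # .get() tolerates both the old and the new event schema.
    return event.get("executionId"), event.get("rootExecutionId")
```

The point of the sketch is the `.get()` fallback: the column stays empty for old logs instead of the parse failing, matching the PR's "column is empty" behavior.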

Whichever approach we take to address the visibility problem, we will need a follow-up issue/PR to cover it before we go for a new release.

Let me know if you have different thoughts.

@amahussein
Collaborator

amahussein commented Mar 25, 2024

There is an open issue, #872, to upgrade the default Spark versions.

Collaborator

@amahussein amahussein left a comment


Thanks @nartal1 !
We can move forward with this PR, given that the dependencies can be addressed in #872.

@amahussein amahussein merged commit 96dce39 into NVIDIA:dev Mar 25, 2024
13 checks passed
Successfully merging this pull request may close these issues.

[FEA] Add rootExecutionID to output files.