
Add rootExecutionID to output csv files #871

Merged (5 commits) on Mar 25, 2024

Conversation

nartal1
Collaborator

@nartal1 nartal1 commented Mar 23, 2024

This fixes #818

This PR adds a new column, "Root SQL ID". It is a follow-on of #810.
For the profiling tool, the new column is added to: sql_duration_and_executor_cpu_time_percent.csv
For the qualification tool, the new column is added to the per-SQL output file: rapids_4_spark_qualification_output_persql.csv

Updated the current unit tests to include the new column. For event logs that do not have rootSQLIds, the column is left empty.

Notes:

  • In order to generate a non-empty Root SQL ID, you need Spark 3.4.1+ on the classpath
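For illustration, a minimal sketch (not part of this PR; the column name follows the description above, and the grouping helper is hypothetical) of reading the per-SQL qualification CSV and grouping rows by the new column, treating the empty value as "no root":

```python
import csv
from collections import defaultdict

def group_by_root_sql_id(csv_path):
    """Group per-SQL rows by the new "Root SQL ID" column.

    Rows from event logs written without rootExecutionId have an
    empty value in that column and are collected under "<none>".
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            root_id = row.get("Root SQL ID", "")  # empty when unavailable
            groups[root_id or "<none>"].append(row.get("SQL ID"))
    return dict(groups)
```

This only assumes the documented header name; the rest of the per-SQL columns pass through untouched in each row dict.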

@nartal1 nartal1 self-assigned this Mar 23, 2024
@nartal1 nartal1 added the core_tools Scope the core module (scala) label Mar 23, 2024
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Collaborator

@amahussein amahussein left a comment


Thanks @nartal1 for making the changes.

There is a tricky part that we need to address in order to fully resolve #818 and #780.
By default we build the Tools for Spark 3.3.3, which means the user-tools wrapper is configured to download dependencies for Spark 3.3.3.
As a result, running the default build neither fixes the duplicate calculation nor shows the new column correctly, so users won't see the change. That last point matters most because we are working on integrating estimation modeling into the Qualification tool.

There are two different ways to address this problem:

  1. Upgrade the default build to a more recent Spark. This requires:

    • updating the accepted buildVer in the Maven pom and setting a different default profile for the build.
    • updating the user-tools CSP configurations accordingly. This is tricky because we need to find CSP and Hadoop dependency jars that match the Spark version.
  2. Modify the event-log parser to handle the rootID field regardless of the Spark version used at runtime. Technically it is an easy fix, but it would be good to find the right way to add it so that it can easily be ported to other overrides we might need in the future.
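The parser-side option (2) could look roughly like the following sketch. It is not the PR's Scala code; it is a hedged Python illustration of a version-tolerant parse, assuming the Spark 3.4.1+ event-log schema where SparkListenerSQLExecutionStart carries a rootExecutionId field that older Spark builds simply omit:

```python
import json

# Fully-qualified event name as it appears in Spark event logs.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def parse_root_execution_id(eventlog_line):
    """Return (executionId, rootExecutionId) for a SQLExecutionStart
    event, or None for any other event.

    rootExecutionId comes back as None when the log was written by a
    Spark build (< 3.4.1) that does not emit the field, which is what
    lets the same parser run against any Spark version.
    """
    event = json.loads(eventlog_line)
    if event.get("Event") != SQL_START:
        return None
    # .get() tolerates both the old and the new event schema.
    return event.get("executionId"), event.get("rootExecutionId")
```

The point of the sketch is the `.get()` fallback: the column stays empty for old logs instead of the parse failing, matching the PR's "column is empty" behavior.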

Whichever approach we take to address the visibility problem, we will need a follow-up issue/PR to cover it before we go for a new release.

Let me know if you have different thoughts.

@amahussein
Collaborator

amahussein commented Mar 25, 2024

There is an open issue, #872, to upgrade the default Spark versions.

Collaborator

@amahussein amahussein left a comment


Thanks @nartal1 !
We can move forward with this PR, given that the dependencies can be addressed in #872.

@amahussein amahussein merged commit 96dce39 into NVIDIA:dev Mar 25, 2024
13 checks passed
Successfully merging this pull request may close these issues.

[FEA] Add rootExecutionID to output files.