Qualification tool: Detect RDD APIs in SQL plan #3819
Conversation
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
If possible, we should ask the requestor of this feature to test on the event logs where they ran into this issue.
@@ -121,6 +121,8 @@ total task time in SQL Dataframe operations).
Each application (event log) could have multiple SQL queries. If a SQL's plan has a Dataset API keyword inside, such as
`$Lambda` or `.apply`, that SQL query is categorized as a Dataset SQL query; otherwise it is a Dataframe SQL query.
If there is an RDD to Dataset/Dataframe conversion, then it would have `SerializeFromObject` in its SQL plan. These
So `SerializeFromObject` can show up when doing a Dataset to Dataframe operation as well.
Also, I would reword the sentence above; perhaps have one that says: If a SQL's plan has a Dataset API or RDD call inside of it, that SQL query is not categorized as a Dataframe SQL query. We are unable to determine how much of that query is made up of Dataset or RDD calls, so the entire query task time is not included in the score.
And honestly, I would remove the other bit about how we determine RDD or Dataset, as the user shouldn't really need to know that.
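For context, here is a minimal sketch of the kind of keyword check described in the doc change above. The object and method names are illustrative, not the actual spark-rapids implementation, which works over parsed plan nodes rather than a raw plan string:

```scala
// Illustrative sketch only: categorize a SQL query from its plan text.
object PlanCategorizer {
  // Keywords whose presence in a plan description suggests Dataset API
  // usage or an RDD conversion.
  private val datasetKeywords = Seq("$Lambda", ".apply", "SerializeFromObject")

  // Returns true when the plan text suggests Dataset/RDD usage, meaning the
  // query should not be counted as a pure Dataframe SQL query.
  def isDatasetOrRddQuery(planDescription: String): Boolean =
    datasetKeywords.exists(k => planDescription.contains(k))
}
```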
@@ -1,2 +1,2 @@
App Name,App ID,Score,Potential Problems,SQL Dataframe Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,ReadFileFormat Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
-Spark shell,local-1624892957956,37581.00,"",3751,37581,17801,58.47,false,0,"",20,100.00,"","","",""
+Spark shell,local-1624892957956,0.0,"",0,0,17801,0.0,false,0,"",20,100.00,"","","",""
I'd rather not have these go to 0; I want to make sure we are parsing spark2 logs properly. If we need to create a new spark2 log that doesn't use RDD, let's do that.
Sounds good. I will create another spark2 event log that doesn't use RDD.
@@ -0,0 +1,2 @@
+App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
+Spark shell,local-1634067832387,0.0,"",0,0,108138,0.0,false,0,"",20,100.0,"","","",""
I'd also like to start moving away from static files if it's something small that can be run in real time. In this case, I don't think we really need this one, as it appears to be tested by others, right? If so, just remove it.
Yes, we do have `SerializeFromObject` in the SQL plan in other event logs, so we are already testing it in other files. Will remove this and the corresponding tests and event log.
@viadea - Is it possible for you to test on the event logs where RDD to Dataframe conversion is used and verify whether the scoring (i.e. top N) better represents the queries which use only Dataframes?
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Tests failing, please take a look.
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
tools/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv
tools/src/main/scala/org/apache/spark/sql/rapids/tool/profiling/ApplicationInfo.scala
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
This fixes #3535.
Whenever an RDD is converted to a Dataset/DataFrame, there is a `SerializeFromObject` node in the SQL plan. In this PR, the logic is the same as that for Dataset: if we see an RDD to Dataset/DataFrame conversion, then we don't consider the time taken by that operation while calculating the score.
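For reference, a minimal standalone sketch (not part of this PR; the app name and column names are illustrative) showing how an RDD to Dataframe conversion surfaces `SerializeFromObject` in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

object RddToDfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-df-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Create an RDD and convert it to a Dataframe; this conversion inserts
    // a SerializeFromObject node into the physical plan.
    val rdd = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
    val df = rdd.toDF("id", "value")

    // The extended plan output includes the SerializeFromObject node that
    // the qualification tool looks for in the event log.
    df.explain(true)

    spark.stop()
  }
}
```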