Qualification tool: Detect RDD APIs in SQL plan #3819
Conversation
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
If possible, we should ask the requestor of this feature to test on the event logs where they ran into this issue.
@@ -121,6 +121,8 @@ total task time in SQL Dataframe operations).
Each application (event log) could have multiple SQL queries. If a SQL's plan has a Dataset API keyword inside, such as
`$Lambda` or `.apply`, that SQL query is categorized as a Dataset SQL query; otherwise it is a Dataframe SQL query.
If there is an RDD to Dataset/Dataframe conversion, then it would have `SerializeFromObject` in its SQL plan. These
So `SerializeFromObject` can show up when doing a Dataset to Dataframe operation as well.
Also, I would reword the sentence above; perhaps have one that says: If a SQL's plan has a Dataset API or RDD call inside of it, that SQL query is not categorized as a Dataframe SQL query. We are unable to determine how much of that query is made up of Dataset or RDD calls, so the entire query task time is not included in the score.
And honestly, I would remove the other bit about how we determine RDD or Dataset, as the user shouldn't really need to know that.
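For context, here is a minimal sketch of the kind of keyword check described in the doc change above. The object and method names are illustrative, not the actual spark-rapids implementation, which works over parsed plan nodes rather than a raw plan string:

```scala
// Illustrative sketch only: categorize a SQL query from its plan text.
object PlanCategorizer {
  // Keywords whose presence in a plan description suggests Dataset API
  // usage or an RDD conversion.
  private val datasetKeywords = Seq("$Lambda", ".apply", "SerializeFromObject")

  // Returns true when the plan text suggests Dataset/RDD usage, meaning the
  // query should not be counted as a pure Dataframe SQL query.
  def isDatasetOrRddQuery(planDescription: String): Boolean =
    datasetKeywords.exists(k => planDescription.contains(k))
}
```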
@@ -1,2 +1,2 @@
App Name,App ID,Score,Potential Problems,SQL Dataframe Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,ReadFileFormat Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
-Spark shell,local-1624892957956,37581.00,"",3751,37581,17801,58.47,false,0,"",20,100.00,"","","",""
+Spark shell,local-1624892957956,0.0,"",0,0,17801,0.0,false,0,"",20,100.00,"","","",""
I'd rather not have these go to 0; I want to make sure we are parsing spark2 logs properly. If we need to create a new spark2 log that doesn't use RDD, let's do that.
Sounds good. I will create another spark2 event log that doesn't use RDD.
@@ -0,0 +1,2 @@
+App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
+Spark shell,local-1634067832387,0.0,"",0,0,108138,0.0,false,0,"",20,100.0,"","","",""
I'd also like to start moving away from static files if it's something small that can be run in real time. In this case, I don't think we really need this one, as it appears to be tested by others, right? If so, just remove it.
Yes, we do have `SerializeFromObject` in the SQL plan in other event logs, so we are already testing it in other files. Will remove this and the corresponding tests and event log.
@viadea - Is it possible for you to test on the event logs where RDD to Dataframe conversion is used and verify whether the scoring (i.e. top N) better represents the queries which use only Dataframes?
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Tests failing, please take a look.
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
tools/src/test/resources/ProfilingExpectations/rapids_duration_and_cpu_expectation.csv
tools/src/main/scala/org/apache/spark/sql/rapids/tool/profiling/ApplicationInfo.scala
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
build
This fixes #3535.
Whenever an RDD is converted to a Dataset/DataFrame, there is a `SerializeFromObject` node in the SQL plan. In this PR, the logic is the same as that for Dataset: if we see an RDD to Dataset/DataFrame conversion, then we don't consider the time taken by that operation while calculating the score.
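For reference, a minimal standalone sketch (not part of this PR; the app name and column names are illustrative) showing how an RDD to Dataframe conversion surfaces `SerializeFromObject` in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

object RddToDfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-df-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Create an RDD and convert it to a Dataframe; this conversion inserts
    // a SerializeFromObject node into the physical plan.
    val rdd = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
    val df = rdd.toDF("id", "value")

    // The extended plan output includes the SerializeFromObject node that
    // the qualification tool looks for in the event log.
    df.explain(true)

    spark.stop()
  }
}
```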