Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve implementation of finding median in StatisticsMetrics #1474

Merged
merged 2 commits into from
Dec 20, 2024

Conversation

amahussein
Copy link
Collaborator

Fixes #1461

Adds an InPlace median finding to improve the performance of the metric aggregates.
We used to sort a sequence to create StatisticsMetrics which turned out to be very expensive in large eventlogs.

It was found that there is a bottleneck in generateStageLevelAccums and generateSQLAccums due to the cost of sorting and using sequences of Long on metrics aggregates.

Impact on performance:

The PR improves the Qualification runtime. methods like

  • generateSQLAccums is down from 21,550 to 3,480 ms (~80% improvement)
  • generateStageLevelAccums is down from 115580 to 61,611 ms (~45% improvement)

Main optimization

Median finding: we used to convert a map to sequence. Then get it sorted. Then we pick the median, max, and min, and then we call seq.sum

  • This turned to be expensive for large eventlogs
  • This PR implements adds finding-median in linear time. On average, this is a linear complexity compared to sorting
  • This pull request includes several changes to the AppSQLPlanAnalyzer and AppSparkMetricsAnalyzer classes to improve performance and simplify the codebase by using the breakOut method for collection transformations and introducing new methods in StatisticsMetrics. Additionally, it removes an unused object and refactors the AccumInfo class.

Impact on output

By doing a diff on the output folder, it was found that the sql_plan_metrics_for_application.csv is differemt. The new output sounds more correct

In output generated by the dev branch:

appIndex,sqlID,nodeID,nodeName,accumulatorId,name,min,median,max,total,metricType,stageIds
1,236,0,"Execute InsertIntoHadoopFsRelationCommand",0,"number of written files",0,0,0,2,"sum","2"

The above seems incorrect because the total is non-zero while min, max and median are zeros.

Vs the new output

appIndex,sqlID,nodeID,nodeName,accumulatorId,name,min,median,max,total,metricType,stageIds
1,236,0,"Execute InsertIntoHadoopFsRelationCommand",0,"number of written files",2,2,2,2,"sum","2"

Improvements to Collection Transformations:

Introduction of New Methods in StatisticsMetrics:

Refactoring and Cleanup:

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#1461

Adds an InPlace median finding to improve the performance of the metric
aggregates.
We used to sort a sequence to create StatisticsMetrics which turned out
to be very expensive in large eventlogs.

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
@amahussein amahussein added the core_tools Scope the core module (scala) label Dec 19, 2024
@amahussein amahussein self-assigned this Dec 19, 2024
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein for these improvements. Minor question.

}

override def toString = {
arr mkString ("ArraySize(", ", ", ")")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Should this be bounded by until - from?

* @return the median of the array.
*/
def findMedianInPlace(
arr: Array[Long])(implicit choosePivot: InPlaceMedianArrView => Long): Long = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting usage of implicit mechanism.

Copy link
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein for this refactor!

@cindyyuanjiang cindyyuanjiang merged commit 3db52ef into NVIDIA:dev Dec 20, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core_tools Scope the core module (scala)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Investigate long execution time of eventlogs
3 participants