Rework Profile tool to not require Spark to run and process files faster #3161
Conversation
Signed-off-by: Thomas Graves <tgraves@apache.org>
I did a quick pass through it, but I didn't look deeply enough to feel confident enough to approve it yet. My biggest concern is methods like Analysis.jobAndStageMetricsAggregation that write things out but also produce a result. It was odd before, when it would print things out and update state, but now it is kind of a hybrid, and it makes me a bit nervous that it could get called too frequently.
Almost all of the functions do this, and did it before, with the caveat that the writer has to be defined. The main reason they return things is just for testing purposes. We could separate out fetching it and writing it.
If it is just for testing, then a comment or two about that would be fine. It just confused me.
I have had more time to go through it. I don't know if I caught everything, but it does look good to me.
Merging this one. I am working on adding writing to CSV, and in that PR I'll look at either adding comments or splitting things into get functions and print functions.
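The split discussed above could look something like the following. This is only an illustrative sketch; the type names (`StageMetrics`, `ProfileWriter`) and the `getJobAndStageMetrics` helper are hypothetical and not taken from the actual spark-rapids codebase — only the `jobAndStageMetricsAggregation` name comes from the conversation.

```scala
// Hypothetical stand-ins for the tool's real types.
case class StageMetrics(stageId: Int, duration: Long, shuffleBytes: Long)

trait ProfileWriter {
  def write(line: String): Unit
}

class Analysis(writer: Option[ProfileWriter]) {

  // Pure "get" function: aggregates metrics and returns them with no
  // side effects, so tests can assert on the result directly.
  def getJobAndStageMetrics(raw: Seq[StageMetrics]): Seq[StageMetrics] =
    raw.groupBy(_.stageId).map { case (id, ms) =>
      StageMetrics(id, ms.map(_.duration).sum, ms.map(_.shuffleBytes).sum)
    }.toSeq.sortBy(_.stageId)

  // Thin "print" wrapper: the only place output happens, and it is a
  // no-op when no writer is defined, so calling it repeatedly only
  // risks duplicate output, never corrupted state.
  def jobAndStageMetricsAggregation(raw: Seq[StageMetrics]): Seq[StageMetrics] = {
    val agg = getJobAndStageMetrics(raw)
    writer.foreach(w => agg.foreach(m => w.write(m.toString)))
    agg
  }
}
```

With this shape, tests call `getJobAndStageMetrics` directly, and the hybrid method becomes a trivial wrapper that is safe to keep returning a value.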
fixes #3046
Redesign the Profile tool so it no longer requires Spark and SQL, similar to the Qualification tool changes. It now runs much faster and uses less memory. We could probably improve memory usage further if we run into issues, but I was able to run on a file with all 105 NDS queries without changing the heap size.
It now takes 9.7 seconds to run collect mode on all 105 NDS query event logs. It took about a minute to profile all 105 NDS queries that were in a single event log (using the default 1G heap size), and 16.7 seconds to compare all 105 NDS queries using the default heap size.
For comparison, the spark-submit approach took 10.5 minutes to run collect mode on 105 NDS queries and required much more memory. For compare mode we previously suggested limiting to 10 event logs; that limit has been removed now.
This does change a few things:
All tests pass. I spot-checked our other integration tests; I still need to run all of them and update their expected results, and I will do that in parallel with this being reviewed.