Qualification and Profiling tool handle Read formats and datatypes #2904

Merged
tgravescs merged 154 commits into NVIDIA:branch-21.08 from datatypesnew on Jul 14, 2021


tgravescs
Collaborator

@tgravescs tgravescs commented Jul 9, 2021

fixes #2757

Here is the summary of changes:

  1. Parse event logs for the datasource read schema. This differs between datasource v1 and datasource v2 readers: v1 logs the entire schema, while v2 readers do not include the entire schema because it gets truncated.
  2. The profiling tool just prints out the information it finds: format, schema, location, and filters. This works in compare mode as well.
  3. Modify our dist pom file so that on verify it generates a CSV file of the supported read formats and their datatypes. It also looks at some configs that are off by default. This file gets included in the tools jar and is read by the qualification tool only.
  4. The starting score has been changed to the total task time rather than the SQL duration/app duration. We think this will be more accurate: it takes the number of executors out of the picture, and it handles the case where you have a 12-hour job and 3 hours of it is DF. Previously such a job was ranked below, for instance, a 3-second job that was all DF ops.
  5. The qualification tool now has a read format score, which looks for any unsupported (or configured off by default) formats or data types. If we find any, it deducts from the starting score based on a configurable percent. The default is 20%, so if the starting score is 100, the read format part of that score is 20. If we don't find any unsupported formats or data types, it stays 20; if all formats are unsupported, it goes to 0 and the final score is 80.
  6. Rankings in the summary report now use the new score, and the score is printed to the CSV file. The read data source format and schema can be printed to the CSV file as well, but that is off by default because it can be very long.
  7. Tests updated to the new scoring algorithm.
  8. Qualification summary info is also printed to stdout and made narrower to fit in 80 characters. I could split this off if you want; I was already modifying it, so I just included it.
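
The arithmetic in items 4 and 5 can be sketched as follows. This is a minimal Python illustration, not the tool's actual code: `qualification_score` and its parameters are hypothetical names, and the linear scaling used when only some formats are unsupported is an assumption, since the description above only gives the two endpoints (read part stays at 20, or drops to 0).

```python
def qualification_score(task_durations_ms, read_formats_supported,
                        read_score_percent=20):
    """Hypothetical sketch of the scoring described in items 4 and 5.

    task_durations_ms: task times that make up the starting score.
    read_formats_supported: one bool per read format found in the event log,
        True if the format and its data types are supported.
    read_score_percent: configurable share of the starting score that
        depends on read format support (20 by default, per the summary).
    """
    # Item 4: the starting score is the total task time.
    starting_score = float(sum(task_durations_ms))

    # Item 5: carve out the read format part of the starting score.
    read_part = starting_score * read_score_percent / 100.0

    # Assumption: scale the read part linearly by the supported fraction.
    if read_formats_supported:
        supported_fraction = (
            sum(read_formats_supported) / len(read_formats_supported)
        )
    else:
        # No reads found, so nothing to deduct.
        supported_fraction = 1.0

    return (starting_score - read_part) + read_part * supported_fraction
```

With a starting score of 100, all formats supported leaves the score at 100, and all formats unsupported gives 80, matching the example in item 5.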

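For items 1 and 2, a rough sketch of pulling the read information out of one event log line might look like this. It is a Python illustration only (the tool itself is not written in Python), and it assumes the line is a JSON event carrying a `sparkPlanInfo` tree whose scan nodes have `metadata` keys such as `Format`, `ReadSchema`, `Location`, and `PushedFilters`. The helper names and the sample line are made up for illustration.

```python
import json

# Metadata keys assumed to describe a datasource read.
READ_KEYS = ("Format", "ReadSchema", "Location", "PushedFilters")


def collect_read_info(plan_node):
    """Walk a plan tree and return read metadata for every scan node found."""
    found = []
    metadata = plan_node.get("metadata") or {}
    if "ReadSchema" in metadata:
        found.append({key: metadata.get(key) for key in READ_KEYS})
    for child in plan_node.get("children", []):
        found.extend(collect_read_info(child))
    return found


def read_info_from_event(line):
    """Parse one event log line; return [] if it carries no plan info."""
    event = json.loads(line)
    plan = event.get("sparkPlanInfo")
    return collect_read_info(plan) if plan else []


# Made-up sample line, shaped like the assumption described above.
sample_line = json.dumps({
    "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
    "sparkPlanInfo": {
        "nodeName": "Project",
        "metadata": {},
        "children": [{
            "nodeName": "Scan parquet",
            "metadata": {
                "Format": "Parquet",
                "ReadSchema": "struct<id:int,name:string>",
                "Location": "InMemoryFileIndex[file:/tmp/example]",
                "PushedFilters": "[IsNotNull(id)]",
            },
            "children": [],
        }],
    },
})

for info in read_info_from_event(sample_line):
    print(info["Format"], info["ReadSchema"])
```

The sketch does not try to recover a truncated v2 schema; it only reports what the log line contains.
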
@tgravescs
Collaborator Author

build

@nartal1
Collaborator

nartal1 commented Jul 14, 2021

Overall it looks good. I don't have any additional comments as it meets the user requirements.
Did you try running a large number of event logs with this patch, and was the total runtime not affected considerably?

@tgravescs
Collaborator Author

tgravescs commented Jul 14, 2021

Yes, I ran it over the NDS results; the time was 44 seconds, so about the same as before.

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

@tgravescs tgravescs merged commit 9998174 into NVIDIA:branch-21.08 Jul 14, 2021
@tgravescs tgravescs deleted the datatypesnew branch July 14, 2021 20:28

Successfully merging this pull request may close these issues.

[FEA] Profiling tool display input data types