Qualification and Profiling tool handle Read formats and datatypes #2904

tgravescs · 2021-07-09T21:50:44Z

fixes #2757

Here is the summary of changes:

Parse event logs for the datasource read schema. This differs between datasource v1 and datasource v2 readers: v1 includes the entire schema, whereas v2 readers truncate it.
For the profiling tool, it just prints out the information it finds: format, schema, location, and filters. This works in compare mode as well.
Modify our dist pom file so that on verify it generates a CSV file of the supported read formats and their data types. It also looks at some configs that are off by default. This file gets included in the tools jar and is read only by the qualification tool.
The starting score has been changed to be the total task time rather than the SQL duration / app duration. We think this will be more accurate: it takes out the issue of how many executors there are, and it handles the case where you have a 12 hour job and 3 hours of it is DF. Before, such a job was ranked below, for instance, a 3 second job that was all DF ops.
The qualification tool now has a read format score, which just looks for any unsupported (or configured off by default) formats or data types. If we find any unsupported ones, it takes off from the starting score based on a configurable percent. The default is 20%, so if the starting score is 100, the read format part of that score is 20. If we don't find any unsupported formats or data types, it stays 20; if we find all formats unsupported, it goes to 0 and the final score is 80.
The rankings in the summary report changed to use the new score, and it's printed to the CSV file. The read data source format and schema can be printed to a CSV file as well, but that's off by default because it can be very long.
Tests updated to the new scoring algorithm.
Qualification summary info is also printed to stdout and made smaller to fit in 80 characters. I could split this off if you want; I was already modifying it so I just included it.
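
To make the scoring arithmetic above concrete, here is a minimal sketch in Python. It is not the tool's actual code (which lives in the Scala sources): the function name, parameter names, and the linear scaling between "nothing unsupported" and "everything unsupported" are assumptions made for illustration. Only the two endpoints (100 stays 100, and 100 drops to 80 with the default 20%) come from the description.

```python
def qualification_score(total_task_time, num_read_formats, num_unsupported,
                        read_score_percent=20):
    """Hypothetical sketch of the new score.

    total_task_time    -- starting score (total task time, per the description)
    num_read_formats   -- read formats/data types found in the event log
    num_unsupported    -- how many of those are unsupported or off by default
    read_score_percent -- configurable share of the score tied to reads (default 20)
    """
    starting = float(total_task_time)
    # Portion of the starting score that depends on read formats.
    read_part = starting * read_score_percent / 100.0
    if num_read_formats == 0:
        # Assumption: with no reads recorded, nothing is deducted.
        supported_fraction = 1.0
    else:
        supported_fraction = (num_read_formats - num_unsupported) / num_read_formats
    # Assumption: the read part scales linearly with the supported fraction.
    return (starting - read_part) + read_part * supported_fraction


print(qualification_score(100, 4, 0))  # nothing unsupported
print(qualification_score(100, 4, 4))  # everything unsupported
```

With a starting score of 100 and the default 20%, the first call keeps the full score and the second drops it to 80, matching the example in the description. How partially unsupported cases are weighted in the real tool may differ from the linear scaling assumed here.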

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Signed-off-by: Thomas Graves <tgraves@apache.org>

tgravescs · 2021-07-12T16:07:07Z

build

tools/src/main/scala/org/apache/spark/sql/rapids/tool/qualification/QualAppInfo.scala

nartal1 · 2021-07-14T06:54:09Z

Overall it looks good. I don't have any additional comments, as it meets the user requirements.
I wanted to know whether you tried running it over a large number of event logs with this patch, and whether the total runtime was considerably affected.

tgravescs · 2021-07-14T13:02:21Z

Yes, I ran it over the NDS results; the time was 44 seconds, so about the same as before.

tgravescs · 2021-07-14T13:02:32Z

build

tgravescs · 2021-07-14T18:23:05Z

build

tgravescs and others added 30 commits June 17, 2021 12:19

Support rolled and compressed logs for CSPs and Apache Spark, do some cleanup as well

0cf96a4

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

add test files

8462be3

Add in db sim eventlogs

5e287df

add missing files

421f082

fix line length

e616293

print metadata

1514052

catch more exceptions

04c1e27

recurse

1482249

return actual node

1a80bb9

Add in another column to sort to keep output consistent

546bc5a

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Add in printing read schema

c1a7173

Signed-off-by: Thomas Graves <tgraves@apache.org>

refactor

5ea4fe9

fix null pointer

f0cb1a7

add app index col

f446950

fix

1d9c904

change to use lit

e45cb23

look for datasource v2

df38c5a

Update to print v2

45e1290

finish parsing schema v2

0df55be

sort it

5c6b367

fix parsing schema v2

7a6d3f9

handle ...

b816378

change to store string for now

2fb7b12

remove struct< from string

9aa5a52

remove debug

070792e

remove debug messages

4970797

remove log

5c41d2c

parse v2 file format

fc71e1b

fix including format:

274fa77

rename

c106f9a

tgravescs and others added 18 commits July 12, 2021 09:03

Add csv output for just not supported format and types

c91ee3d

Signed-off-by: Thomas Graves <tgraves@apache.org>

update tests

f90f9ec

handle empty string

b00f95b

rename test files

ea717ff

Change the way we report ns

4e1006a

fixes

0f0b47c

add more tests

746b404

update expected results

05f8a7e

fix tests

fb6f05b

dedup types more

36b0472

fix typo

7c775d0

update test

2ecdc74

fix bug processing jobs without sql

86049d7

fix bug

b1094da

add in complex and decimal eventlog

5aafb32

add test for complex and ecimal eventlog

3caac11

add in expectataion file

495e324

update readme

c8be975

nartal1 reviewed Jul 14, 2021

tools/src/main/scala/org/apache/spark/sql/rapids/tool/qualification/QualAppInfo.scala

fix typo

378134e

Merge remote-tracking branch 'origin/branch-21.08' into datatypesnew

3b39b09

nartal1 approved these changes Jul 14, 2021

tgravescs merged commit 9998174 into NVIDIA:branch-21.08 Jul 14, 2021

tgravescs deleted the datatypesnew branch July 14, 2021 20:28
