Qualification and Profiling tool handle Read formats and datatypes #2904

Merged
merged 154 commits into branch-21.08 on Jul 14, 2021

Commits (154)
0cf96a4
Support rolled and compressed logs for CSPs and Apache Spark, do some
tgravescs Jun 17, 2021
8462be3
add test files
tgravescs Jun 17, 2021
5e287df
Add in db sim eventlogs
tgravescs Jun 17, 2021
421f082
add missing files
tgravescs Jun 17, 2021
e616293
fix line length
tgravescs Jun 17, 2021
1514052
print metadata
tgravescs Jun 17, 2021
04c1e27
catch more exceptions
tgravescs Jun 17, 2021
1482249
recurse
tgravescs Jun 17, 2021
1a80bb9
return actual node
tgravescs Jun 17, 2021
546bc5a
Add in another column to sort to keep output consistent
tgravescs Jun 21, 2021
c1a7173
Add in printing read schema
tgravescs Jun 21, 2021
5ea4fe9
refactor
tgravescs Jun 21, 2021
f0cb1a7
fix null pointer
tgravescs Jun 21, 2021
f446950
add app index col
tgravescs Jun 21, 2021
1d9c904
fix
tgravescs Jun 21, 2021
e45cb23
change to use lit
tgravescs Jun 21, 2021
df38c5a
look for datasource v2
tgravescs Jun 21, 2021
45e1290
Update to print v2
tgravescs Jun 21, 2021
0df55be
finish parsing schema v2
tgravescs Jun 21, 2021
5c6b367
sort it
tgravescs Jun 21, 2021
7a6d3f9
fix parsing schema v2
tgravescs Jun 21, 2021
b816378
handle ...
tgravescs Jun 21, 2021
2fb7b12
change to store string for now
tgravescs Jun 21, 2021
9aa5a52
remove struct< from string
tgravescs Jun 21, 2021
070792e
remove debug
tgravescs Jun 21, 2021
4970797
remove debug messages
tgravescs Jun 21, 2021
5c41d2c
remove log
tgravescs Jun 21, 2021
fc71e1b
parse v2 file format
tgravescs Jun 22, 2021
274fa77
fix including format:
tgravescs Jun 22, 2021
c106f9a
rename
tgravescs Jun 22, 2021
4a93de0
add in test files
tgravescs Jun 22, 2021
ffcd0a0
update docs to monitoring page
tgravescs Jun 22, 2021
119bc01
Merge remote-tracking branch 'origin/branch-21.08' into datatypes
tgravescs Jun 22, 2021
466b9c5
cleanup and use spark session hadoop configuration
tgravescs Jun 22, 2021
0280263
cleanup
tgravescs Jun 22, 2021
cb73db8
Add in generation of support ops for tools
tgravescs Jun 22, 2021
b3d9e95
Merge branch 'datatypes' of github.com:tgravescs/spark-rapids into da…
tgravescs Jun 22, 2021
a5fc1fb
fixes
tgravescs Jun 22, 2021
d8c299b
add in text support and write comma
tgravescs Jun 22, 2021
9744e12
update output to csv
tgravescs Jun 22, 2021
25c7464
update reading qualification
tgravescs Jun 22, 2021
61e126a
Merge branch 'datatypes' of github.com:tgravescs/spark-rapids into da…
tgravescs Jun 22, 2021
655d8dd
update way qualification with schema
tgravescs Jun 22, 2021
123d37f
format
tgravescs Jun 22, 2021
8aee779
do per format
tgravescs Jun 22, 2021
58b9f62
fixes
tgravescs Jun 23, 2021
87094cc
Merge remote-tracking branch 'origin/branch-21.08' into datatypes
tgravescs Jun 25, 2021
05ea957
fix
tgravescs Jun 28, 2021
abc7ea2
Merge remote-tracking branch 'origin/branch-21.08' into datatypesnew
tgravescs Jun 30, 2021
d69c768
redo for redesign
tgravescs Jun 30, 2021
bd930fd
update to new design
tgravescs Jul 6, 2021
da4f470
fixes
tgravescs Jul 6, 2021
e61dbdd
check values
tgravescs Jul 6, 2021
8569133
fix syntax
tgravescs Jul 6, 2021
0d39c6e
fixes
tgravescs Jul 6, 2021
732565b
change the way we read csv and headers
tgravescs Jul 6, 2021
0496eac
lower case
tgravescs Jul 6, 2021
a8cf6ae
fixes
tgravescs Jul 6, 2021
351b22b
Add in auto generation of supported ops for tools
tgravescs Jul 6, 2021
f7ec113
Add type conversions
tgravescs Jul 6, 2021
51e7adc
add file format to csv
tgravescs Jul 6, 2021
e6ebdd9
fix close
tgravescs Jul 6, 2021
ab70406
print incomplete
tgravescs Jul 6, 2021
aa49544
Add separate class for type checker
tgravescs Jul 6, 2021
723dd2d
fix parameters
tgravescs Jul 6, 2021
31926cb
fix and debug
tgravescs Jul 6, 2021
43be9a1
calculate as percent
tgravescs Jul 6, 2021
ea83770
Calculate score with the read format and datatypes included
tgravescs Jul 7, 2021
2daacff
fixes
tgravescs Jul 7, 2021
79584e5
calculate percent rounded
tgravescs Jul 7, 2021
c9a460e
document and cleanup
tgravescs Jul 7, 2021
5792f84
use ratio
tgravescs Jul 7, 2021
78fee51
calculate task duration
tgravescs Jul 7, 2021
e79a99b
take into account configs
tgravescs Jul 7, 2021
6841b36
update recordings
tgravescs Jul 7, 2021
d7e6619
use committed off
tgravescs Jul 7, 2021
24f03b6
fix output
tgravescs Jul 7, 2021
9fb48ef
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 7, 2021
b05866d
fix log file output
tgravescs Jul 7, 2021
ad88de2
round
tgravescs Jul 7, 2021
048f253
remove some unneeded rounding
tgravescs Jul 7, 2021
1dbf970
fix syntax
tgravescs Jul 8, 2021
19e80f8
Merge remote-tracking branch 'origin/branch-21.08' into datatypesnew
tgravescs Jul 8, 2021
b412282
Change finding datatypes
tgravescs Jul 8, 2021
757a154
fix types
tgravescs Jul 8, 2021
07701be
fixes and debug
tgravescs Jul 8, 2021
25aa8ca
add debug
tgravescs Jul 8, 2021
80f1f85
fix equals vs contains
tgravescs Jul 8, 2021
41d2536
Change scores
tgravescs Jul 8, 2021
0e98967
add option for outputting file formats
tgravescs Jul 8, 2021
f49adaf
fix string interpol
tgravescs Jul 8, 2021
a5cd06d
update output files
tgravescs Jul 8, 2021
c9a0659
cleanup
tgravescs Jul 8, 2021
5862ff9
Add test for profile datasource
tgravescs Jul 8, 2021
11e5cde
fix tests
tgravescs Jul 8, 2021
5fd1cc9
add compare info
tgravescs Jul 8, 2021
2742bb5
move case class
tgravescs Jul 8, 2021
e93fadd
write out compare
tgravescs Jul 8, 2021
f967425
write out header
tgravescs Jul 8, 2021
0816bc2
add test for dsv2
tgravescs Jul 9, 2021
66f4c16
Merge remote-tracking branch 'origin/branch-21.08' into datatypesnew
tgravescs Jul 9, 2021
6fe165c
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 9, 2021
bae200a
add check for if decimal enabled
tgravescs Jul 9, 2021
bb3c3f7
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 9, 2021
d8249e8
fixes
tgravescs Jul 9, 2021
92cdc52
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 9, 2021
eaadbc4
Update decimal to use configured off
tgravescs Jul 9, 2021
2844e45
simplify weighted score for read data types
tgravescs Jul 9, 2021
df2f055
add plugin checker suite
tgravescs Jul 9, 2021
579b947
more tests
tgravescs Jul 9, 2021
a8bdca3
force re-read
tgravescs Jul 9, 2021
6ee3131
change source for testing
tgravescs Jul 9, 2021
7ac42c2
more tests
tgravescs Jul 9, 2021
1ed5573
fix tests
tgravescs Jul 9, 2021
eab9112
fix other test
tgravescs Jul 9, 2021
ee39cee
make plugin type checker optional
tgravescs Jul 9, 2021
57e7d3a
Update qualification test results
tgravescs Jul 9, 2021
cf9b0eb
update test and fixes
tgravescs Jul 9, 2021
a060f83
add test files
tgravescs Jul 9, 2021
2ec2461
more tests and cleanup
tgravescs Jul 9, 2021
a584df9
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 9, 2021
3a10f38
move rounding
tgravescs Jul 9, 2021
1358ea8
Merge branch 'datatypesnew' of github.com:tgravescs/spark-rapids into…
tgravescs Jul 9, 2021
b1e3d95
write to stdout as well
tgravescs Jul 9, 2021
9417ddc
output
tgravescs Jul 9, 2021
6ba5efc
remove ln from println
tgravescs Jul 9, 2021
7219173
shrink output report spacing
tgravescs Jul 9, 2021
ecfd4c7
commonize size
tgravescs Jul 9, 2021
bd09305
add df back in header
tgravescs Jul 9, 2021
9955d2e
configure stdout off for tests
tgravescs Jul 9, 2021
bec2335
update readme
tgravescs Jul 9, 2021
3d213ff
update desc
tgravescs Jul 9, 2021
394a8a6
Fix missing extra info with type checks
tgravescs Jul 12, 2021
1d6aace
don't include supplement text in qualification tool csv file
tgravescs Jul 12, 2021
c91ee3d
Add csv output for just not supported format and types
tgravescs Jul 12, 2021
f90f9ec
update tests
tgravescs Jul 12, 2021
b00f95b
handle empty string
tgravescs Jul 12, 2021
ea717ff
rename test files
tgravescs Jul 12, 2021
4e1006a
Change the way we report ns
tgravescs Jul 12, 2021
0f0b47c
fixes
tgravescs Jul 12, 2021
746b404
add more tests
tgravescs Jul 12, 2021
05f8a7e
update expected results
tgravescs Jul 12, 2021
fb6f05b
fix tests
tgravescs Jul 12, 2021
36b0472
dedup types more
tgravescs Jul 12, 2021
7c775d0
fix typo
tgravescs Jul 12, 2021
2ecdc74
update test
tgravescs Jul 12, 2021
86049d7
fix bug processing jobs without sql
tgravescs Jul 12, 2021
b1094da
fix bug
tgravescs Jul 12, 2021
5aafb32
add in complex and decimal eventlog
tgravescs Jul 12, 2021
3caac11
add test for complex and decimal eventlog
tgravescs Jul 12, 2021
495e324
add in expectation file
tgravescs Jul 12, 2021
c8be975
update readme
tgravescs Jul 12, 2021
378134e
fix typo
tgravescs Jul 14, 2021
3b39b09
Merge remote-tracking branch 'origin/branch-21.08' into datatypesnew
tgravescs Jul 14, 2021
18 changes: 18 additions & 0 deletions dist/pom.xml
@@ -176,6 +176,24 @@
</launchers>
</configuration>
</execution>
<execution>
<id>update_supported_tools</id>
<phase>verify</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<launchers>
<launcher>
<id>update_rapids_support_tools</id>
<mainClass>com.nvidia.spark.rapids.SupportedOpsForTools</mainClass>
<args>
<arg>${project.basedir}/../tools/src/main/resources/supportedDataSource.csv</arg>
</args>
</launcher>
</launchers>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
1 change: 1 addition & 0 deletions pom.xml
@@ -627,6 +627,7 @@
<exclude>dependency-reduced-pom.xml</exclude>
<exclude>**/.*/**</exclude>
<exclude>src/main/java/com/nvidia/spark/rapids/format/*</exclude>
<exclude>src/main/resources/supportedDataSource.csv</exclude>
<!-- Apache Rat excludes target folder for projects that are included by
default, but there are some projects that are conditionally included. -->
<exclude>**/target/**/*</exclude>
91 changes: 85 additions & 6 deletions sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala
@@ -30,20 +30,23 @@ import org.apache.spark.sql.types._
*/
sealed abstract class SupportLevel {
def htmlTag: String
def text: String
}

/**
* N/A neither spark nor the plugin supports this.
*/
object NotApplicable extends SupportLevel {
override def htmlTag: String = "<td> </td>"
override def text: String = "NA"
}

/**
* Spark supports this but the plugin does not.
*/
object NotSupported extends SupportLevel {
override def htmlTag: String = "<td><b>NS</b></td>"
override def htmlTag: String = s"<td><b>$text</b></td>"
override def text: String = "NS"
}

/**
@@ -52,12 +55,14 @@ object NotSupported extends SupportLevel {
* types because they are not 100% supported.
*/
class Supported(val asterisks: Boolean = false) extends SupportLevel {
override def htmlTag: String =
override def htmlTag: String = s"<td>$text</td>"
override def text: String = {
if (asterisks) {
"<td>S*</td>"
"S*"
} else {
"<td>S</td>"
"S"
}
}
}

/**
@@ -86,10 +91,17 @@ class PartiallySupported(
None
}
val extraInfo = (note.toSeq ++ litOnly.toSeq ++ typeStr.toSeq).mkString("; ")
val allText = s"$text ($extraInfo)"
s"<td><em>$allText</em></td>"
}

// don't include the extra info in the supported text field for now
// as the qualification tool doesn't use it
override def text: String = {
if (asterisks) {
"<td><em>PS* (" + extraInfo + ")</em></td>"
"PS*"
} else {
"<td><em>PS (" + extraInfo + ")</em></td>"
"PS"
}
}
}
@@ -1641,3 +1653,70 @@ object SupportedOpsDocs {
}
}
}

object SupportedOpsForTools {
Review comment from tgravescs (Collaborator, Author), Jul 9, 2021:

@revans2 it would be great if you could look at this part since you did the type checking stuff. It outputs the file https://github.com/NVIDIA/spark-rapids/pull/2904/files#diff-fffb7b35c0ad8bb096be63eecf71179426428a95063a6aab8772086f10bd4711 when running `mvn verify`.

private def outputSupportIO() {
// Look at what we have for defaults for some configs because if the configs are off
// it likely means something isn't completely compatible.
val conf = new RapidsConf(Map.empty[String, String])
val types = TypeEnum.values.toSeq
val header = Seq("Format", "Direction") ++ types
println(header.mkString(","))
GpuOverrides.fileFormats.toSeq.sortBy(_._1.toString).foreach {
case (format, ioMap) =>
val formatEnabled = format.toString.toLowerCase match {
case "csv" => conf.isCsvEnabled && conf.isCsvReadEnabled
case "parquet" => conf.isParquetEnabled && conf.isParquetReadEnabled
case "orc" => conf.isOrcEnabled && conf.isOrcReadEnabled
case _ =>
throw new IllegalArgumentException("Format is unknown we need to add it here!")
}
val read = ioMap(ReadFileOp)
// we have lots of configs for various operations, just try to get the main ones
val readOps = types.map { t =>
val typeEnabled = if (format.toString.toLowerCase.equals("csv")) {
t.toString() match {
case "BOOLEAN" => conf.isCsvBoolReadEnabled
case "BYTE" => conf.isCsvByteReadEnabled
case "SHORT" => conf.isCsvShortReadEnabled
case "INT" => conf.isCsvIntReadEnabled
case "LONG" => conf.isCsvLongReadEnabled
case "FLOAT" => conf.isCsvFloatReadEnabled
case "DOUBLE" => conf.isCsvDoubleReadEnabled
case "TIMESTAMP" => conf.isCsvTimestampReadEnabled
case "DATE" => conf.isCsvDateReadEnabled
case "DECIMAL" => conf.decimalTypeEnabled
case _ => true
}
} else {
t.toString() match {
case "DECIMAL" => conf.decimalTypeEnabled
case _ => true
}
}
if (!formatEnabled || !typeEnabled) {
// indicate configured off by default
"CO"
} else {
read.support(t).text
}
}
// only support reads for now
println(s"${(Seq(format, "read") ++ readOps).mkString(",")}")
}
}

def help(): Unit = {
outputSupportIO()
}

def main(args: Array[String]): Unit = {
val out = new FileOutputStream(new File(args(0)))
Console.withOut(out) {
Console.withErr(out) {
SupportedOpsForTools.help()
}
}
}
}
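
As a rough usage sketch (the `dist` pom above wires this generator into the Maven `verify` phase), it can also be invoked directly; the single argument is the destination CSV path, matching `args(0)` in `main` above:

```scala
// Minimal sketch, not part of the PR: regenerate the data-source support CSV directly.
// The one argument is the output file path written by main() above.
SupportedOpsForTools.main(Array("tools/src/main/resources/supportedDataSource.csv"))
```
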
107 changes: 57 additions & 50 deletions tools/README.md
@@ -20,7 +20,9 @@ GPU generated event logs.
at the top level when specifying a directory.

Note: Spark event logs can be downloaded from Spark UI using a "Download" button on the right side,
or can be found in the location specified by `spark.eventLog.dir`.
or can be found in the location specified by `spark.eventLog.dir`. See the
[Apache Spark Monitoring](http://spark.apache.org/docs/latest/monitoring.html) documentation for
more information.

Optional:
- maven installed
@@ -99,12 +101,12 @@ rapids-4-spark-tools_2.12-<version>.jar \
The qualification tool is used to look at a set of applications to determine if the RAPIDS Accelerator for Apache Spark
might be a good fit for those applications. The tool works by processing the CPU generated event logs from Spark.

Currently it does this by looking at the amount of time spent doing SQL Dataframe
operations vs the entire application time: `(sum(SQL Dataframe Duration) / (application-duration))`.
The more time spent doing SQL Dataframe operations the higher the score is
and the more likely the plugin will be able to help accelerate that application.
Note that the application time is from application start to application end so if you are using an interactive
shell where there is nothing running from a while, this time will include that which might skew the score.
This tool is intended to give the users a starting point and does not guarantee the applications it scores highest
will actually be accelerated the most. Currently it works by looking at the amount of time spent in tasks of SQL
Dataframe operations. The more total task time doing SQL Dataframe operations the higher the score is and the more
likely the plugin will be able to help accelerate that application. The tool also looks for read data formats and types
that the plugin doesn't support and if it finds any not supported it will take away from the score (based on the
total task time in SQL Dataframe operations).
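
For illustration, a minimal sketch of the weighting described above (not the exact implementation; the 20% default matches the `--read-score-percent` option documented below):

```scala
// Illustrative sketch only: the read format/type support ratio is applied to a
// configurable percentage (default 20%) of the SQL Dataframe task duration.
def estimatedScore(
    sqlDfTaskDuration: Long,     // total task time of SQL Dataframe operations
    supportedReadRatio: Double,  // fraction of read formats/types the plugin supports (0.0 - 1.0)
    readScorePercent: Int = 20): Double = {
  val readWeighted = sqlDfTaskDuration * (readScorePercent / 100.0)
  (sqlDfTaskDuration - readWeighted) + readWeighted * supportedReadRatio
}
```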

Each application(event log) could have multiple SQL queries. If a SQL's plan has Dataset API inside such as keyword
`$Lambda` or `.apply`, that SQL query is categorized as a DataSet SQL query, otherwise it is a Dataframe SQL query.
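
As a minimal sketch of that categorization (a hypothetical helper, not the tool's actual code):

```scala
// Hypothetical helper: a SQL query whose plan description contains Dataset API
// markers is categorized as a Dataset SQL query, otherwise as a Dataframe SQL query.
def isDatasetSQL(planDescription: String): Boolean =
  planDescription.contains("$Lambda") || planDescription.contains(".apply")
```
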
@@ -113,7 +115,8 @@ Note: the duration(s) reported are in milli-seconds.

There are 2 output files from running the tool. One is a summary text file printing in order the applications most
likely to be good candidates for the GPU to the ones least likely. It outputs the application ID, duration,
the SQL Dataframe duration and the SQL duration spent when we found SQL queries with potential problems.
the SQL Dataframe duration and the SQL duration spent when we found SQL queries with potential problems. It also
outputs this same report to STDOUT.
The other file is a CSV file that contains more information and can be used for further post processing.

Note, potential problems are reported in the CSV file in a separate column, which is not included in the score. This
@@ -133,21 +136,20 @@ Note that SQL queries that contain failed jobs are not included.

Sample output in csv:
```
App Name,App ID,Score,Potential Problems,SQL Dataframe Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures
job1,app-20210507174503-2538,98.13,"",952802,970984,63.14,false,0,""
job2,app-20210507180116-2539,97.88,"",903845,923419,64.88,false,0,""
job3,app-20210319151533-1704,97.59,"",737826,756039,33.95,false,0,""
App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types
job3,app-20210507174503-1704,4320658.0,"",9569,4320658,26171,35.34,false,0,"",20,100.0,""
job1,app-20210507174503-2538,19864.04,"",6760,21802,83728,71.3,false,0,"",20,55.56,"Parquet[decimal]"
```

Sample output in text:
```
================================================================================================================
| App ID| App Duration| SQL Dataframe Duration|SQL Duration For Problematic|
================================================================================================================
|app-20210507174503-2538| 970984| 952802| 0|
|app-20210507180116-2539| 923419| 903845| 0|
|app-20210319151533-1704| 756039| 737826| 0|
===========================================================================
| App ID|App Duration|SQL DF Duration|Problematic Duration|
===========================================================================
|app-20210507174503-2538| 26171| 9569| 0|
|app-20210507174503-1704| 83738| 6760| 0|
```

## Download the Spark 3.x distribution
The Qualification tool requires the Spark 3.x jars to be able to run. If you do not already have
Spark 3.x installed, you can download the Spark distribution to any machine and include the jars
@@ -186,35 +188,40 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
com.nvidia.spark.rapids.tool.qualification.QualificationMain [options]
<eventlogs | eventlog directories ...>

-f, --filter-criteria <arg> Filter newest or oldest N eventlogs for
processing.eg: 100-newest (for processing
newest 100 event logs). eg: 100-oldest (for
processing oldest 100 event logs)
-m, --match-event-logs <arg> Filter event logs whose filenames contain the
input string
-n, --num-output-rows <arg> Number of output rows in the summary report.
Default is 1000.
--num-threads <arg> Number of thread to use for parallel
processing. The default is the number of cores
on host divided by 4.
--order <arg> Specify the sort order of the report. desc or
asc, desc is the default. desc (descending)
would report applications most likely to be
accelerated at the top and asc (ascending)
would show the least likely to be accelerated
at the top.
-o, --output-directory <arg> Base output directory. Default is current
directory for the default filesystem. The
final output will go into a subdirectory
called rapids_4_spark_qualification_output. It
will overwrite any existing directory with the
same name.
-t, --timeout <arg> Maximum time in seconds to wait for the event
logs to be processed. Default is 24 hours
(86400 seconds) and must be greater than 3
seconds. If it times out, it will report what
it was able to process up until the timeout.
-h, --help Show help message
-f, --filter-criteria <arg> Filter newest or oldest N eventlogs for
processing.eg: 100-newest (for processing
newest 100 event logs). eg: 100-oldest (for
processing oldest 100 event logs)
-m, --match-event-logs <arg> Filter event logs whose filenames contain the
input string
-n, --num-output-rows <arg> Number of output rows in the summary report.
Default is 1000.
--num-threads <arg> Number of thread to use for parallel
processing. The default is the number of cores
on host divided by 4.
--order <arg> Specify the sort order of the report. desc or
asc, desc is the default. desc (descending)
would report applications most likely to be
accelerated at the top and asc (ascending)
would show the least likely to be accelerated
at the top.
-o, --output-directory <arg> Base output directory. Default is current
directory for the default filesystem. The
final output will go into a subdirectory
called rapids_4_spark_qualification_output. It
will overwrite any existing directory with the
same name.
-r, --read-score-percent <arg> The percent the read format and datatypes
apply to the score. Default is 20 percent.
--report-read-schema Whether to output the read formats and
datatypes to the CSV file. This can be very
long. Default is false.
-t, --timeout <arg> Maximum time in seconds to wait for the event
logs to be processed. Default is 24 hours
(86400 seconds) and must be greater than 3
seconds. If it times out, it will report what
it was able to process up until the timeout.
-h, --help Show help message

trailing arguments:
eventlog (required) Event log filenames(space separated) or directories
@@ -223,9 +230,9 @@
```

### Output
By default this outputs a 2 files under sub-directory `./rapids_4_spark_qualification_output/` that contains
the processed applications. The output will go into your default filesystem, it supports local filesystem
or HDFS.
The summary report goes to STDOUT and by default it outputs 2 files under sub-directory
`./rapids_4_spark_qualification_output/` that contain the processed applications. The output will
go into your default filesystem, it supports local filesystem or HDFS.

The output location can be changed using the `--output-directory` option. Default is current directory.

4 changes: 4 additions & 0 deletions tools/src/main/resources/supportedDataSource.csv
@@ -0,0 +1,4 @@
Format,Direction,BOOLEAN,BYTE,SHORT,INT,LONG,FLOAT,DOUBLE,DATE,TIMESTAMP,STRING,DECIMAL,NULL,BINARY,CALENDAR,ARRAY,MAP,STRUCT,UDT
CSV,read,CO,CO,CO,CO,CO,CO,CO,CO,CO,S,CO,NA,NS,NA,NA,NA,NA,NA
ORC,read,S,S,S,S,S,S,S,S,S*,S,CO,NA,NS,NA,NS,NS,NS,NS
Parquet,read,S,S,S,S,S,S,S,S,S*,S,CO,NA,NS,NA,PS*,PS*,PS*,NS
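
For reference, the cell values above follow the `SupportLevel.text` strings emitted by `SupportedOpsForTools`: `S` = supported, `S*` = supported with caveats, `PS*` = partially supported, `NS` = not supported, `NA` = not applicable (neither Spark nor the plugin supports it), and `CO` = supported but the relevant config is off by default.
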
@@ -137,9 +137,12 @@
}.toMap
}
} catch {
case e: FileNotFoundException =>
case fe: FileNotFoundException =>
logWarning(s"$pathString not found, skipping!")
Map.empty[EventLogInfo, Long]
case e: Exception =>
logWarning(s"Unexpected exception occurred reading $pathString, skipping!", e)
Map.empty[EventLogInfo, Long]
}
}

@@ -25,7 +25,8 @@ import org.apache.spark.sql.rapids.tool.ToolUtils
/**
* Class for writing local files, allows writing to distributed file systems.
*/
class ToolTextFileWriter(finalOutputDir: String, logFileName: String) extends Logging {
class ToolTextFileWriter(finalOutputDir: String, logFileName: String,
finalLocationText: String) extends Logging {

private val textOutputPath = new Path(s"$finalOutputDir/$logFileName")
private val fs = FileSystem.get(textOutputPath.toUri, new Configuration())
@@ -42,7 +43,7 @@ class ToolTextFileWriter(finalOutputDir: String, logFileName: String) extends Lo

def close(): Unit = {
outFile.foreach { file =>
logInfo(s"Output location: $textOutputPath")
logInfo(s"$finalLocationText output location: $textOutputPath")
file.flush()
file.close()
outFile = None
@@ -165,3 +165,19 @@ case class DatasetSQLCase(sqlID: Long)
case class ProblematicSQLCase(sqlID: Long, reason: String)

case class UnsupportedSQLPlan(sqlID: Long, nodeID: Long, nodeName: String, nodeDesc: String)

case class DataSourceCase(
sqlID: Long,
format: String,
location: String,
pushedFilters: String,
schema: String)

case class DataSourceCompareCase(
appIndex: Int,
appId: String,
sqlID: Long,
format: String,
location: String,
pushedFilters: String,
schema: String)