qualification and profiling tool support rolled and compressed event logs for CSPs and Apache Spark #2732
Conversation
cleanup as well Signed-off-by: Thomas Graves <tgraves@nvidia.com>
build
val EVENT_LOG_FILE_NAME_PREFIX = "events_"

def isEventLogDir(status: FileStatus): Boolean = {
  status.isDirectory && status.getPath.getName.startsWith(EVENT_LOG_DIR_NAME_PREFIX)
}
nit: there would be slightly less duplication if the name check were delegated to isEventLogDir(path: String)
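A minimal sketch of the suggested delegation. Hadoop's FileStatus/Path are stubbed with plain case classes so the sketch is self-contained, and the prefix value is an assumption, not taken from the PR:

```scala
// Stubs standing in for org.apache.hadoop.fs.{FileStatus, Path} (assumption for a self-contained sketch).
case class PathStub(getName: String)
case class FileStatusStub(isDirectory: Boolean, getPath: PathStub)

val EVENT_LOG_DIR_NAME_PREFIX = "eventlog_v2_" // assumed value

// String-based check holds the actual prefix logic.
def isEventLogDir(name: String): Boolean =
  name.startsWith(EVENT_LOG_DIR_NAME_PREFIX)

// The FileStatus overload delegates the name check instead of duplicating it.
def isEventLogDir(status: FileStatusStub): Boolean =
  status.isDirectory && isEventLogDir(status.getPath.getName)
```

With this shape, any future change to the naming convention lives in one place.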
tools/src/main/scala/com/nvidia/spark/rapids/tool/EventLogPathProcessor.scala (two outdated review comments, resolved)
Signed-off-by: Thomas Graves <tgraves@apache.org>
build
Investigating the test failures: it seems in this environment something must be deleted while it's still open; locally there are no issues and everything is cleaned up.
build
LGTM, some nits
The tool does not support nested directories. Event log files or event log directories should be
at the top level when specifying a directory.

Note: Spark event logs can be downloaded from the Spark UI using a "Download" button on the right side,
low priority: we can link to https://spark.apache.org/docs/3.1.2/monitoring.html
// assume this is the current log and we want that one to be read last
LocalDateTime.now()
} else {
  val date = fileParts(0).split("-")
nit: usually prefer pattern matching to indexed access, along the lines of
val Array(_, yearStr, monthStr, dayStr, _*) = Array("something", "2021", "06", "22", "something", "else")
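Applied to the snippet above, the destructuring might look like this. The sample array mirrors the reviewer's example; the real file-name layout is an assumption:

```scala
// Illustrative input only, not the real log file name parts.
val parts = Array("something", "2021", "06", "22", "something", "else")

// Destructure by pattern matching instead of parts(1), parts(2), parts(3);
// a name like yearStr documents intent where an index does not.
val Array(_, yearStr, monthStr, dayStr, _*) = parts
```

Note the pattern is refutable: if the array is shorter than expected it throws a MatchError at runtime, so it suits inputs whose shape is already validated.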
def openEventLogInternal(log: Path, fs: FileSystem): InputStream = {
  EventLogFileWriter.codecName(log) match {
    case c if (c.isDefined && c.get.equals("gz")) =>
      val in = new BufferedInputStream(fs.open(log))
BufferedInputStream is not needed for GZIPInputStream; in = fs.open(log) is sufficient.
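To illustrate the reviewer's point: GZIPInputStream (via InflaterInputStream) already reads through an internal buffer, so wrapping the raw stream in a BufferedInputStream adds little. A self-contained sketch, with an in-memory byte-array stream standing in for fs.open(log):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Build some gzipped bytes in memory (stands in for a .gz event log on the FileSystem).
val compressed: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write("""{"Event":"SparkListenerLogStart"}""".getBytes("UTF-8"))
  gz.close()
  bos.toByteArray
}

// No BufferedInputStream needed: GZIPInputStream buffers internally.
val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
val decoded = new String(in.readAllBytes(), "UTF-8")
in.close()
```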
// at this point all paths should be valid event logs or event log dirs
val fs = eventlog.getFileSystem(new Configuration())
prefer wiring the Hadoop conf through from the SparkContext rather than creating brand-new instances.
Thanks, @gerashegalov. I'll incorporate the nits in my next PR.
The main part of this PR is to support compressed and rolled event logs from the various CSPs and Apache Spark. It also includes some cleanup: consolidating duplicate code, moving the event-log-path parsing into a common class, compressing some of the test files, and a few style changes.
The Databricks event logs are different from Apache Spark's, so special handling was added for them.
fixes #2690
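As an illustration of the compressed-log handling the PR describes, detecting a codec from the file extension might be sketched like this. The helper name and codec list are assumptions for the sketch, not the PR's actual code:

```scala
// Codecs Spark commonly supports for event log compression (assumed list).
val knownCodecs = Set("gz", "lz4", "snappy", "zstd")

// Hypothetical helper: map a log file's extension to a codec short name,
// returning None for uncompressed or unrecognized files.
def codecFromName(fileName: String): Option[String] = {
  val idx = fileName.lastIndexOf('.')
  if (idx < 0) None
  else Some(fileName.substring(idx + 1)).filter(knownCodecs.contains)
}
```

A tool can then branch on the Option to pick the right decompressing InputStream, or open the file directly when it is None.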