qualification and profiling tool support rolled and compressed event logs for CSPs and Apache Spark #2732
Conversation
cleanup as well Signed-off-by: Thomas Graves <tgraves@nvidia.com>
build
val EVENT_LOG_FILE_NAME_PREFIX = "events_"

def isEventLogDir(status: FileStatus): Boolean = {
  status.isDirectory && status.getPath.getName.startsWith(EVENT_LOG_DIR_NAME_PREFIX)
}
nit: there would be slightly less duplication if the name check were delegated to isEventLogDir(path: String)
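A minimal sketch of the suggested delegation. Hadoop's FileStatus/Path are stubbed with plain case classes so the sketch is self-contained, and the prefix value is an assumption, not taken from the PR:

```scala
// Stubs standing in for org.apache.hadoop.fs.{FileStatus, Path} (assumption for a self-contained sketch).
case class PathStub(getName: String)
case class FileStatusStub(isDirectory: Boolean, getPath: PathStub)

val EVENT_LOG_DIR_NAME_PREFIX = "eventlog_v2_" // assumed value

// String-based check holds the actual prefix logic.
def isEventLogDir(name: String): Boolean =
  name.startsWith(EVENT_LOG_DIR_NAME_PREFIX)

// The FileStatus overload delegates the name check instead of duplicating it.
def isEventLogDir(status: FileStatusStub): Boolean =
  status.isDirectory && isEventLogDir(status.getPath.getName)
```

With this shape, any future change to the naming convention lives in one place.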
tools/src/main/scala/com/nvidia/spark/rapids/tool/EventLogPathProcessor.scala (two outdated review comments, resolved)
Signed-off-by: Thomas Graves <tgraves@apache.org>
build
Investigating the test failures: it seems in this environment something must be deleted while it's still open; locally there are no issues and everything is cleaned up.
build
LGTM, some nits
The tool does not support nested directories. Event log files or event log directories should be
at the top level when specifying a directory.

Note: Spark event logs can be downloaded from the Spark UI using a "Download" button on the right side,
low priority: we can link to https://spark.apache.org/docs/3.1.2/monitoring.html
// assume this is the current log and we want that one to be read last
LocalDateTime.now()
} else {
  val date = fileParts(0).split("-")
nit: usually prefer pattern matching to indexed access, along the lines of
val Array(_, yearStr, monthStr, dayStr, _*) = Array("something", "2021", "06", "22", "something", "else")
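Applied to the snippet above, the destructuring might look like this. The sample array mirrors the reviewer's example; the real file-name layout is an assumption:

```scala
// Illustrative input only, not the real log file name parts.
val parts = Array("something", "2021", "06", "22", "something", "else")

// Destructure by pattern matching instead of parts(1), parts(2), parts(3);
// a name like yearStr documents intent where an index does not.
val Array(_, yearStr, monthStr, dayStr, _*) = parts
```

Note the pattern is refutable: if the array is shorter than expected it throws a MatchError at runtime, so it suits inputs whose shape is already validated.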
def openEventLogInternal(log: Path, fs: FileSystem): InputStream = {
  EventLogFileWriter.codecName(log) match {
    case c if (c.isDefined && c.get.equals("gz")) =>
      val in = new BufferedInputStream(fs.open(log))
BufferedInputStream is not needed for GZIPInputStream; in = fs.open(log) is sufficient.
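To illustrate the reviewer's point: GZIPInputStream (via InflaterInputStream) already reads through an internal buffer, so wrapping the raw stream in a BufferedInputStream adds little. A self-contained sketch, with an in-memory byte-array stream standing in for fs.open(log):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Build some gzipped bytes in memory (stands in for a .gz event log on the FileSystem).
val compressed: Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write("""{"Event":"SparkListenerLogStart"}""".getBytes("UTF-8"))
  gz.close()
  bos.toByteArray
}

// No BufferedInputStream needed: GZIPInputStream buffers internally.
val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
val decoded = new String(in.readAllBytes(), "UTF-8")
in.close()
```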
// at this point all paths should be valid event logs or event log dirs
val fs = eventlog.getFileSystem(new Configuration())
prefer wiring the Hadoop conf through from the SparkContext rather than creating brand-new instances.
Thanks, @gerashegalov. I'll incorporate the nits in my next PR.
The main part of this PR is to support compressed and rolled event logs from the various CSPs and Apache Spark. It also includes some cleanup: consolidating duplicate code, moving the event-log-path parsing into a common class, compressing some of the test files, and a few style changes.
The Databricks event logs are different from Apache Spark's, so special handling was added for them.
fixes #2690
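As an illustration of the compressed-log handling the PR describes, detecting a codec from the file extension might be sketched like this. The helper name and codec list are assumptions for the sketch, not the PR's actual code:

```scala
// Codecs Spark commonly supports for event log compression (assumed list).
val knownCodecs = Set("gz", "lz4", "snappy", "zstd")

// Hypothetical helper: map a log file's extension to a codec short name,
// returning None for uncompressed or unrecognized files.
def codecFromName(fileName: String): Option[String] = {
  val idx = fileName.lastIndexOf('.')
  if (idx < 0) None
  else Some(fileName.substring(idx + 1)).filter(knownCodecs.contains)
}
```

A tool can then branch on the Option to pick the right decompressing InputStream, or open the file directly when it is None.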