Refactor loadArchives() function #257
Conversation
Great, thanks @borislin. Could you provide your script here showing how you used the refactored function? I can kick the tires on that soon.
* @param fs filesystem
* @param prefix prefix of archive files
* @param numFiles number of archive files
* @param maxSize maximu size of archive files
"maximu": typo.
* @param dir the path to the directory containing archive files
* @param fs filesystem
* @param prefix prefix of archive files
* @param numFiles number of archive files
Can you explain how numFiles and maxSize interact? Both shouldn't be set, right?
numFiles is used for debugging only. For instance, we can do a sanity check by running on only 10 files by setting numFiles = 10. We can also limit the maximum file size, since I found out that large files are causing the heartbeat timeout issue when I run AUT on those ARC files.
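To make that concrete, here's a minimal sketch of the kind of sanity-check run described above. It is an illustration only: the directory path and the 500 MB cap are placeholders, and it assumes loadArchives is reached through the RecordLoader object in a spark-shell session, as elsewhere in aut.

import io.archivesunleashed._

// Sanity check: load only the first 10 archive files from the directory
// and skip anything larger than 500 MB (placeholder values).
val records = RecordLoader.loadArchives("/path/to/arcs", sc,
  numFiles = Some(10),
  maxSize = Some(500L))
records.count()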
We ran into that issue a few weeks ago with auk in production. We run every job with the heartbeat flag now: https://github.com/archivesunleashed/auk/blob/master/app/jobs/collections_spark_job.rb#L51
@ruebot but how do you determine the correct value for spark_heartbeat_interval? By trial-and-error?
FWIW, I determined it by trial-and-error. Since it only affects a small number of (W)ARCs, putting the number relatively high was fine in our particular application. We have it set to 600s by default right now.
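For reference, a sketch of how that interval can be set when building the context. spark.executor.heartbeatInterval is the standard Spark property (presumably what auk's spark_heartbeat_interval flag ends up setting), and spark.network.timeout must stay larger than it; the 600s value mirrors the default mentioned above, everything else is illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// Raise the executor heartbeat interval to 600s for jobs on large (W)ARCs.
// spark.network.timeout must remain larger than the heartbeat interval.
val conf = new SparkConf()
  .setAppName("aut job")
  .set("spark.executor.heartbeatInterval", "600s")
  .set("spark.network.timeout", "1200s")
val sc = new SparkContext(conf)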
var files = indexFiles.filter(f => {
  val path = f.getPath.getName
  val fileSize = fs.getContentSummary(f.getPath).getLength
  f.isFile && (prefix.isEmpty || path.startsWith(prefix.get)) && path.endsWith(".gz") && (maxSize.isEmpty || fileSize <= maxSize.get * 1000000)
line too long... add wrapping?
Also, maxSize.get * 1000000 is a bit janky - do Int.MaxValue? https://www.scala-lang.org/api/2.12.0/scala/Int$.html
What I mean here is to convert MB to bytes. I've added a helper function in the new commit.
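Roughly what the wrapped predicate might look like once the conversion goes through the mbToBytes helper added in the new commit (the exact formatting is a guess, the logic is unchanged from the diff above):

// Same predicate as above, wrapped for the line-length check, with the
// hard-coded 1000000 replaced by the mbToBytes helper.
var files = indexFiles.filter(f => {
  val path = f.getPath.getName
  val fileSize = fs.getContentSummary(f.getPath).getLength
  f.isFile &&
    (prefix.isEmpty || path.startsWith(prefix.get)) &&
    path.endsWith(".gz") &&
    (maxSize.isEmpty || fileSize <= mbToBytes(maxSize.get))
})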
sc.newAPIHadoopFile(path, classOf[ArchiveRecordInputFormat], classOf[LongWritable], classOf[ArchiveRecordWritable])
  .filter(r => (r._2.getFormat == ArchiveFormat.ARC) ||
    ((r._2.getFormat == ArchiveFormat.WARC) && r._2.getRecord.getHeader.getHeaderValue("WARC-Type").equals("response")))
def loadArchives(path: String, sc: SparkContext, format: Option[String] = None, prefix: Option[String] = None, numFiles: Option[Int] = None, maxSize : Option[Long] = None): RDD[ArchiveRecord] = {
wrap line?
import io.archivesunleashed.matchbox.ImageDetails
import io.archivesunleashed.matchbox.ExtractDate.DateComponent
// scalastyle:off underscore.import
import io.archivesunleashed.matchbox.ExtractDate.DateComponent._

Remove blank line.
val log: Logger = Logger.getLogger(getClass.getName)

/** Gets all archive files by applying filters prefix, numFiles and maxSize
Full stop at the end.
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.rdd.RDD

Remove blank line.
Codecov Report

@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   70.35%   70.41%   +0.05%
==========================================
  Files          41       41
  Lines        1039     1058      +19
  Branches      191      193       +2
==========================================
+ Hits          731      745      +14
- Misses        242      244       +2
- Partials       66       69       +3

Continue to review full report at Codecov.
val log: Logger = Logger.getLogger(getClass.getName)

/** Convert MB to Bytes. **/
def mbToBytes(size: Long): Long = {
Doesn't look like this is covered in tests. https://codecov.io/gh/archivesunleashed/aut/pull/257/diff?src=pr&el=tree#D1-53
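A small ScalaTest sketch that would cover it; whether the helper is visible from the test scope and whether the project uses FunSuite for this are assumptions:

import org.scalatest.FunSuite

// Hypothetical coverage for the helper. The expected values assume the
// decimal conversion (1 MB = 1,000,000 bytes) implied by the original
// maxSize.get * 1000000 expression.
class MbToBytesTest extends FunSuite {
  test("mbToBytes converts megabytes to bytes") {
    assert(mbToBytes(0L) == 0L)
    assert(mbToBytes(1L) == 1000000L)
    assert(mbToBytes(500L) == 500000000L)
  }
}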
val indexFiles = fs.listStatus(dir)
var files = indexFiles.filter(f => isValidFile(f, fs, prefix, maxSize)).map(f => f.getPath)
if (numFiles.isDefined) {
  files = files.take(numFiles.get)
Needs to be tested. https://codecov.io/gh/archivesunleashed/aut/pull/257/diff?src=pr&el=tree#D1-76
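A sketch of what a test for the numFiles branch could look like; the test fixture path, the RecordLoader entry point, and the local SparkContext setup are assumptions modelled on how existing aut tests are typically written, not code from this PR:

import com.google.common.io.Resources
import io.archivesunleashed.RecordLoader
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfter, FunSuite}

// Hypothetical test: with numFiles = Some(1) the loader should still return
// records from a single fixture file; a stronger assertion on the number of
// files actually loaded would need a directory with several fixtures.
class LoadArchivesNumFilesTest extends FunSuite with BeforeAndAfter {
  private var sc: SparkContext = _

  before {
    val conf = new SparkConf().setMaster("local[2]").setAppName("numFiles test")
    sc = new SparkContext(conf)
  }

  test("numFiles limits how many archive files are loaded") {
    val arcPath = Resources.getResource("arc/example.arc.gz").getPath
    val records = RecordLoader.loadArchives(arcPath, sc, numFiles = Some(1))
    assert(records.count() > 0L)
  }

  after {
    if (sc != null) {
      sc.stop()
    }
  }
}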
hi @borislin - given your latest analyses, I also think we need a max size variable?
Just following up on this PR - any movement on responding to the reviews and getting this moved forward?
@ianmilligan1 @lintool is this PR superseded by #275? It's unclear which one is which for #247 -- this one or #275 -- or if it should be both.
Don't delete this branch.
Patch for #247.

What does this Pull Request do?
This PR refactors the loadArchives() function to accept the optional parameters numFiles, maxSize, format and prefix, giving us some fine-grained control over how archive files are loaded. This will greatly help us debug large collections and won't affect any existing code that calls loadArchives().

How should this be tested?
Call the refactored loadArchives() function in any of the following ways (a sketch of a driver using these call forms follows below):
loadArchives(path, sc)
loadArchives(path, sc, args.format.toOption, args.prefix.toOption, args.files.toOption, args.size.toOption)
loadArchives(path, sc, prefix = args.prefix.toOption, numFiles = args.files.toOption, maxSize = args.size.toOption)
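For anyone who wants to kick the tires, here is a minimal sketch of a driver using the second call form. It is an illustration only: the Scallop option names, the RecordLoader entry point, and the app name are assumptions based on this PR's description, not code from this PR.

import io.archivesunleashed._
import org.apache.spark.{SparkConf, SparkContext}
import org.rogach.scallop._

// Hypothetical Scallop configuration mirroring the args.*.toOption call form above.
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val input = opt[String](required = true)
  val format = opt[String]()
  val prefix = opt[String]()
  val files = opt[Int]()
  val size = opt[Long]()
  verify()
}

object LoadArchivesDriver {
  def main(argv: Array[String]): Unit = {
    val args = new Conf(argv)
    val sc = new SparkContext(new SparkConf().setAppName("loadArchives smoke test"))
    // Pass the optional filters straight through to the refactored loader.
    val records = RecordLoader.loadArchives(args.input(), sc,
      args.format.toOption, args.prefix.toOption, args.files.toOption, args.size.toOption)
    println(records.count())
    sc.stop()
  }
}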
Interested parties
@lintool @ianmilligan1