
Refactor loadArchives() function #257

Closed · wants to merge 9 commits

Conversation

@borislin (Collaborator) commented Aug 12, 2018

Patch for #247.


What does this Pull Request do?

This PR refactors the loadArchives() function to accept optional parameters numFiles, maxSize, format, and prefix, giving us fine-grained control over how archive files are loaded. This will greatly help us debug large collections and won't affect any existing code that calls loadArchives().
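
As a quick illustration (not part of the original PR text), a call using the new parameters might look like the sketch below; the RecordLoader object, the path, and the prefix value are assumptions for illustration.

```scala
// Minimal usage sketch, assuming the refactored signature from this PR and that
// loadArchives is still called through RecordLoader; path and prefix are placeholders.
import io.archivesunleashed._

val records = RecordLoader.loadArchives(
  "/data/warcs",                    // directory containing the .gz archive files
  sc,                               // existing SparkContext
  prefix = Some("ARCHIVEIT-227")    // only load files whose names start with this prefix
)
// All four new parameters default to None, so existing
// loadArchives(path, sc) calls behave exactly as before.
```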

How should this be tested?

A description of what steps someone could take to:

Interested parties

@lintool @ianmilligan1

@borislin requested review from lintool and ruebot, August 12, 2018 01:42
@ianmilligan1 (Member)

Great, thanks @borislin. Could you provide your script here showing how you used the refactored function? I can kick the tires on it soon.

* @param fs filesystem
* @param prefix prefix of archive files
* @param numFiles number of archive files
* @param maxSize maximu size of archive files
Member:

maximu typo

* @param dir the path to the directory containing archive files
* @param fs filesystem
* @param prefix prefix of archive files
* @param numFiles number of archive files
Member:

Can you explain how numFiles and maxSize interact? Both shouldn't be set, right?

Collaborator Author:

numFiles is used for debugging only. For instance, we can do a sanity check by running on only 10 files by setting numFiles = 10.

We can also limit the maximum file size, since I found that large files are causing the heartbeat timeout issue when I run AUT on those ARC files.
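
For example, a hypothetical sanity-check run (values made up) could cap the job at 10 files and skip anything over 500 MB; maxSize is interpreted in MB, as the maxSize.get * 1000000 conversion in the filter quoted further below shows.

```scala
// Hypothetical debugging call: at most 10 archive files, none larger than ~500 MB.
val sample = RecordLoader.loadArchives(
  "/data/warcs",
  sc,
  numFiles = Some(10),
  maxSize = Some(500L)
)
```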

Member:

We ran into that issue a few weeks ago with auk in production. We run every job with the heartbeat flag now: https://github.com/archivesunleashed/auk/blob/master/app/jobs/collections_spark_job.rb#L51

Collaborator Author:

@ruebot But how do you determine the correct value for spark_heartbeat_interval? By trial and error?

Member:

FWIW, I determined it by trial and error. Since it only affects a small number of (W)ARCs, setting the number relatively high was fine in our particular application. We have it set to 600s by default right now.
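
For reference, a hedged sketch of raising that interval on the Spark side: the 600s value mirrors the default mentioned above, while the 1200s network timeout is an assumed value (Spark expects spark.network.timeout to be larger than spark.executor.heartbeatInterval).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the executor heartbeat interval to 600s; bump spark.network.timeout
// as well, since it must stay above the heartbeat interval (1200s is assumed).
val conf = new SparkConf()
  .setAppName("aut-job")
  .set("spark.executor.heartbeatInterval", "600s")
  .set("spark.network.timeout", "1200s")
val sc = new SparkContext(conf)
```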

var files = indexFiles.filter(f => {
val path = f.getPath.getName
val fileSize = fs.getContentSummary(f.getPath).getLength
f.isFile && (prefix.isEmpty || path.startsWith(prefix.get)) && path.endsWith(".gz") && (maxSize.isEmpty || fileSize <= maxSize.get * 1000000)
Member:

line too long... add wrapping?

Also, maxSize.get * 1000000 is a bit janky - do Int.MaxValue? https://www.scala-lang.org/api/2.12.0/scala/Int$.html

Collaborator Author:

What I mean here is to convert MB to bytes. I've added a helper function in the new commit.
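
A minimal sketch of what such a helper could look like; the mbToBytes name appears in a later hunk of this diff, but the body shown here is an assumption.

```scala
/** Converts a size in (decimal) megabytes to bytes, mirroring the
  * maxSize.get * 1000000 expression above. Body is assumed, not the actual commit. */
def mbToBytes(size: Long): Long = size * 1000000L
```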

sc.newAPIHadoopFile(path, classOf[ArchiveRecordInputFormat], classOf[LongWritable], classOf[ArchiveRecordWritable])
.filter(r => (r._2.getFormat == ArchiveFormat.ARC) ||
((r._2.getFormat == ArchiveFormat.WARC) && r._2.getRecord.getHeader.getHeaderValue("WARC-Type").equals("response")))
def loadArchives(path: String, sc: SparkContext, format: Option[String] = None, prefix: Option[String] = None, numFiles: Option[Int] = None, maxSize : Option[Long] = None): RDD[ArchiveRecord] = {
Member:

wrap line?

import io.archivesunleashed.matchbox.ImageDetails
import io.archivesunleashed.matchbox.ExtractDate.DateComponent
// scalastyle:off underscore.import
import io.archivesunleashed.matchbox.ExtractDate.DateComponent._

Member:

Remove blank line.


val log: Logger = Logger.getLogger(getClass.getName)

/** Gets all archive files by applying filters prefix, numFiles and maxSize
Member:

Full stop at the end.

import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.rdd.RDD

Member:

Remove blank line.

@codecov bot commented Aug 13, 2018

Codecov Report

Merging #257 into master will increase coverage by 0.05%.
The diff coverage is 72.72%.


@@            Coverage Diff             @@
##           master     #257      +/-   ##
==========================================
+ Coverage   70.35%   70.41%   +0.05%     
==========================================
  Files          41       41              
  Lines        1039     1058      +19     
  Branches      191      193       +2     
==========================================
+ Hits          731      745      +14     
- Misses        242      244       +2     
- Partials       66       69       +3
Impacted Files                                       Coverage Δ
src/main/scala/io/archivesunleashed/package.scala   82.53% <72.72%> (-1.58%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84a4c09...c733936.

val log: Logger = Logger.getLogger(getClass.getName)

/** Convert MB to Bytes. **/
def mbToBytes(size: Long): Long = {

val indexFiles = fs.listStatus(dir)
var files = indexFiles.filter(f => isValidFile(f, fs, prefix, maxSize)).map(f => f.getPath)
if (numFiles.isDefined) {
files = files.take(numFiles.get)

@lintool (Member) commented Aug 14, 2018

hi @borislin - given your latest analyses, I also think we need a max size variable?

@ianmilligan1 (Member)

Just following up on this PR - any movement on responding to the reviews and getting this moved forward?

@ruebot (Member) commented Oct 17, 2018

@ianmilligan1 @lintool Is this PR superseded by #275? It's unclear which one addresses #247 -- this one or #275 -- or whether it should be both.

@borislin (Collaborator Author)

@ruebot This PR has been superseded by #275. Closing now.

@borislin closed this Oct 17, 2018
@ruebot (Member) commented Oct 25, 2018

Don't delete this branch.
