Add additional filters for textFiles; resolves #362. #393

ruebot · 2019-12-18T03:19:17Z

GitHub issue(s):

DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

What does this Pull Request do?

Add additional filters for textFiles; resolves #362.

Add filedesc and dns filters (ARC files)
Add test case
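
A minimal sketch of the idea behind the new filters, in plain Scala without Spark. It assumes the filter is a prefix check on the record URL; the object name SchemeFilterSketch, the helper names, and the exact prefix strings are hypothetical and are not copied from the aut source. It also shows why these records matter: java.net.URL rejects schemes it has no handler for, which is the MalformedURLException reported in #362.

```scala
import java.net.URL
import scala.util.Try

object SchemeFilterSketch {
  // Assumed prefixes, taken from the URLs shown in this PR; the exact
  // strings used in the aut filter may differ.
  val skippedPrefixes: Seq[String] = Seq("filedesc:", "dns:")

  // True when the URL starts with one of the ARC bookkeeping schemes.
  def isSkipped(url: String): Boolean = {
    val lower = url.toLowerCase
    skippedPrefixes.exists(p => lower.startsWith(p))
  }

  // True when java.net.URL accepts the string without throwing.
  def parses(url: String): Boolean = Try(new URL(url)).isSuccess

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "filedesc://IAH-20080430204825-00000-blackbook.arc",
      "dns:www.archive.org",
      "http://www.archive.org/robots.txt"
    )
    // Drop the bookkeeping records before anything tries to parse the URL.
    val kept = urls.filterNot(isSkipped)
    kept.foreach(println)
  }
}
```

Filtering on the raw string before any URL parsing keeps the check cheap and avoids relying on exception handling for records that are known not to be text files.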

You can see filedesc and dns in the ARC test fixtures:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/*gz", sc)
df.all().select("url").show(5, false)

+-------------------------------------------------+
|url                                              |
+-------------------------------------------------+
|filedesc://IAH-20080430204825-00000-blackbook.arc|
|dns:www.archive.org                              |
|http://www.archive.org/robots.txt                |
|http://www.archive.org/                          |
|http://www.archive.org/index.php                 |
+-------------------------------------------------+
only showing top 5 rows
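
To check which schemes a set of URLs contains, a rough tally like the following can help. This is a sketch in plain Scala, not part of aut; SchemeTallySketch and its helpers are hypothetical names, and treating the text before the first ':' as the scheme is a simplifying assumption.

```scala
object SchemeTallySketch {
  // Text before the first ':'; a rough stand-in for the URL scheme.
  def scheme(url: String): String = url.takeWhile(_ != ':')

  // Count how many URLs share each scheme.
  def tally(urls: Seq[String]): Map[String, Int] =
    urls.groupBy(scheme).map { case (s, us) => (s, us.size) }

  def main(args: Array[String]): Unit = {
    // The five URLs from the output above.
    val urls = Seq(
      "filedesc://IAH-20080430204825-00000-blackbook.arc",
      "dns:www.archive.org",
      "http://www.archive.org/robots.txt",
      "http://www.archive.org/",
      "http://www.archive.org/index.php"
    )
    tally(urls).toSeq.sortBy(_._1).foreach { case (s, n) => println(s"$s\t$n") }
  }
}
```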

How should this be tested?

The updated test catches the examples above.

I'm now running a more robust test on the BANQ collection in question from #362. I'll move this out of draft if it succeeds.

ruebot · 2019-12-18T03:19:47Z

Testing with:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ/warcs/*gz", sc).textFiles()

df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.parquet("/data/banq-datathon/PQ/derivatives/parquet/text")

df.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit

codecov · 2019-12-18T03:33:28Z

Codecov Report

Merging #393 into master will decrease coverage by 0.03%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   77.15%   77.11%   -0.04%     
==========================================
  Files          40       40              
  Lines        1484     1486       +2     
  Branches      278      280       +2     
==========================================
+ Hits         1145     1146       +1     
  Misses        217      217              
- Partials      122      123       +1

ruebot · 2019-12-18T14:39:23Z

Successfully ran the job on the BANQ dataset twice (with both commits) without issue.

ianmilligan1

Tested locally and looks great!

Add additional filters for textFiles; resolves #362.

1eb5ce8

- Add filedesc and dns filters (ARC files)
- Add test case

tweak

9ae49f1

ruebot marked this pull request as ready for review December 18, 2019 14:38

ruebot requested a review from ianmilligan1 December 18, 2019 14:39

ianmilligan1 approved these changes Dec 18, 2019

ianmilligan1 merged commit 8eb43ff into master Dec 18, 2019

ianmilligan1 deleted the issue-362 branch December 18, 2019 14:48
