Add additional filters for textFiles; resolves #362. #393

Merged 2 commits on Dec 18, 2019

@ruebot ruebot commented Dec 18, 2019

GitHub issue(s):

What does this Pull Request do?

Add additional filters for textFiles; resolves #362.

  • Add filedesc and dns filters (ARC files)
  • Add test case
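
For context, #362 reports java.net.MalformedURLException: unknown protocol: filedesc. Here is a minimal sketch of why such records fail, assuming the record URL ends up being parsed with java.net.URL somewhere downstream. The object and helper names are hypothetical and not part of aut:

```scala
import java.net.{MalformedURLException, URL}
import scala.util.{Failure, Success, Try}

object UrlSchemeCheck {
  // Hypothetical helper: report how java.net.URL reacts to a record URL.
  // The JDK only ships handlers for a few schemes (http, https, file, ...),
  // so unknown schemes are rejected when the URL object is constructed.
  def describe(url: String): String =
    Try(new URL(url)) match {
      case Success(u)                        => s"ok: ${u.getProtocol}"
      case Failure(e: MalformedURLException) => s"malformed: ${e.getMessage}"
      case Failure(e)                        => s"other: ${e.getMessage}"
    }

  def main(args: Array[String]): Unit =
    Seq(
      "filedesc://IAH-20080430204825-00000-blackbook.arc",
      "dns:www.archive.org",
      "http://www.archive.org/robots.txt"
    ).foreach(u => println(s"$u -> ${describe(u)}"))
}
```

Only the http record should parse; the other two are ARC bookkeeping entries rather than web resources.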

You can see filedesc and dns in the ARC test fixtures:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut/src/test/resources/arc/*gz", sc)
df.all().select("url").show(5, false)
+-------------------------------------------------+
|url                                              |
+-------------------------------------------------+
|filedesc://IAH-20080430204825-00000-blackbook.arc|
|dns:www.archive.org                              |
|http://www.archive.org/robots.txt                |
|http://www.archive.org/                          |
|http://www.archive.org/index.php                 |
+-------------------------------------------------+
only showing top 5 rows
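
The kind of filter this PR describes can be sketched in plain Scala as a prefix check on the URL. This is an illustration only: the object name, the helper name, and the exact prefixes are assumptions drawn from the rows above, not the code in this diff.

```scala
object SkipArcMetadata {
  // Assumed prefixes, taken from the two non-web rows in the output above.
  val skippedPrefixes: Seq[String] = Seq("filedesc:", "dns:")

  // Hypothetical predicate: true when the URL is not an ARC metadata record.
  def isWebUrl(url: String): Boolean = {
    val lower = url.toLowerCase
    !skippedPrefixes.exists(prefix => lower.startsWith(prefix))
  }

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "filedesc://IAH-20080430204825-00000-blackbook.arc",
      "dns:www.archive.org",
      "http://www.archive.org/robots.txt",
      "http://www.archive.org/",
      "http://www.archive.org/index.php"
    )
    // Keeps only the three http:// rows.
    urls.filter(isWebUrl).foreach(println)
  }
}
```

Checking the prefix before any URL parsing means the malformed records are dropped without ever raising an exception.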

How should this be tested?

The updated test catches the above examples.

I'm running a more robust test on the BANQ collection from #362 now. I'll move this out of draft if it is successful.

ruebot commented Dec 18, 2019

Testing with:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ/warcs/*gz", sc).textFiles()

df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.parquet("/data/banq-datathon/PQ/derivatives/parquet/text")

df.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit

codecov bot commented Dec 18, 2019

Codecov Report

Merging #393 into master will decrease coverage by 0.03%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   77.15%   77.11%   -0.04%     
==========================================
  Files          40       40              
  Lines        1484     1486       +2     
  Branches      278      280       +2     
==========================================
+ Hits         1145     1146       +1     
  Misses        217      217              
- Partials      122      123       +1

@ruebot ruebot marked this pull request as ready for review December 18, 2019 14:38
ruebot commented Dec 18, 2019

Successfully ran the job on the BANQ dataset twice (with both commits) without issue.

@ruebot ruebot requested a review from ianmilligan1 December 18, 2019 14:39
@ianmilligan1 ianmilligan1 left a comment

Tested locally and looks great!

@ianmilligan1 ianmilligan1 merged commit 8eb43ff into master Dec 18, 2019
@ianmilligan1 ianmilligan1 deleted the issue-362 branch December 18, 2019 14:48
Successfully merging this pull request may close these issues.

DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc