Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344

Merged
merged 1 commit into from
Aug 14, 2019

Conversation

ruebot
Copy link
Member

@ruebot ruebot commented Aug 14, 2019

GitHub issue(s): #342

What does this Pull Request do?

How should this be tested?

4809 audio files
644 PDF files
232 video files
5685 total

Times were slightly slower 🤷‍♂️

  • This PR 4212.43s user 224.14s system 623% cpu 11:51.81 total
  • HEAD on master now: 3899.34s user 186.65s system 598% cpu 11:23.12 total

Interested parties

@jrwiebe @ianmilligan1

@ruebot ruebot requested review from ianmilligan1 and jrwiebe August 14, 2019 17:35
- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342
@codecov
Copy link

codecov bot commented Aug 14, 2019

Codecov Report

Merging #344 into master will increase coverage by 0.28%.
The diff coverage is 60%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #344      +/-   ##
=========================================
+ Coverage   74.91%   75.2%   +0.28%     
=========================================
  Files          39      39              
  Lines        1216    1230      +14     
  Branches      224     224              
=========================================
+ Hits          911     925      +14     
  Misses        214     214              
  Partials       91      91
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 80.38% <60%> (+1.4%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54c0c3e...47612ce. Read the comment docs.

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally on sample data and worked nicely on my end. Same results as in #341 as well.

Screen Shot 2019-08-14 at 2 49 11 PM

Copy link
Contributor

@jrwiebe jrwiebe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with a sample from the CPP dataset, using the script from #341 and modifying the paths. It worked as expected, but especially on video there were a lot of files with no extension. (Also a few with extensions like ".php".) A strategy for extensions like I describe in #343 would fix this.

@ruebot
Copy link
Member Author

ruebot commented Aug 14, 2019

@jrwiebe yeah, I've noticed that the majority of missing extensions so far have been video files. My guess is that it's mostly to do with streaming.

@ianmilligan1 ianmilligan1 merged commit 01d12b4 into master Aug 14, 2019
@ianmilligan1 ianmilligan1 deleted the mimetype-columns branch August 14, 2019 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Tika's detected MIME type instead of ArchiveRecord getMimeType?
3 participants