Add datathon derivatives to app (binary info, web pages, web graph) #447

Closed
ruebot opened this issue Apr 20, 2020 · 0 comments · Fixed by #451

Comments

Member

ruebot commented Apr 20, 2020

Is your feature request related to a problem? Please describe.

The only way to create the derivatives we used for the recent datathon(s) is to run them in the Spark shell. We should add them to the app.

Describe the solution you'd like

Add the following derivatives to app:

  • Binaries
    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files
    • Videos
  • Web pages
    • .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
  • Web graph
    • .webgraph()
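
For reference, the Spark shell workflow this request wants to replace might look roughly like the sketch below. It reuses the `.webpages()` and `.webgraph()` expressions listed above. The input and output paths are hypothetical, and the `io.archivesunleashed.df._` import is an assumption about where `RemoveHTMLDF` and `RemoveHTTPHeaderDF` live in the toolkit version in use. The binary derivatives are left out because this issue does not give their expressions.

```scala
// Paste into a spark-shell session started with the Archives Unleashed
// Toolkit on the classpath. `sc` and the `$"col"` syntax are provided by
// spark-shell itself.
import io.archivesunleashed._
import io.archivesunleashed.df._ // assumed location of RemoveHTMLDF / RemoveHTTPHeaderDF

// Hypothetical paths, for illustration only.
val input = "/path/to/warcs/*.warc.gz"
val output = "/path/to/derivatives"

val archives = RecordLoader.loadArchives(input, sc)

// Web pages derivative: same columns as the expression in this issue.
archives
  .webpages()
  .select(
    $"crawl_date",
    $"url",
    $"mime_type_web_server",
    $"mime_type_tika",
    RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content")
  )
  .write
  .csv(output + "/webpages")

// Web graph derivative.
archives
  .webgraph()
  .write
  .csv(output + "/webgraph")
```

Writing CSV here is only one possible choice; which output format the app should use for each derivative is one of the open questions in this issue.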

Additional context

  • For the binary derivatives, we need to decide what to produce: just the binaries, just the information about each binary, or both?
  • For webpages, should we add a domain column so that it is similar to the "full-text" derivative, or should it replace the "full-text" derivative completely?
  • For webgraph, should this just be the DomainGraphExtractor as "TEXT"?
ruebot self-assigned this Apr 20, 2020
ruebot added the App label Apr 20, 2020
ruebot added a commit that referenced this issue Apr 21, 2020
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
ruebot added a commit that referenced this issue Apr 21, 2020
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
ianmilligan1 pushed a commit that referenced this issue Apr 21, 2020
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"