For the binary derivatives, we should decide whether to extract just the binaries, just the information about the binaries, or both.
For webpages, should we add a domain column, so it is similar to the "full-text" derivative, or should it completely replace the "full-text" derivative?
For webgraph, should this just be the DomainGraphExtractor as "TEXT"?
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
Is your feature request related to a problem? Please describe.
The only way to create the derivatives we used for the recent datathon(s) is via the Spark shell. We should add them to the app.
Describe the solution you'd like
Add the following derivatives to the app:
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
.webgraph()
Additional context
- For webpages, should we add a domain column, so it is similar to the "full-text" derivative, or should it completely replace the "full-text" derivative?
- For webgraph, should this just be the DomainGraphExtractor as "TEXT"?
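To make the domain column question concrete, here is a minimal sketch of what adding such a column could look like. It is plain Scala with no Spark or toolkit dependency. The `extractDomain` and `withDomain` helpers and the three-field row shape are hypothetical stand-ins for illustration, not the toolkit's own functions or schema.

```scala
import java.net.URI
import scala.util.Try

object DomainColumnSketch {
  // Hypothetical helper (not the toolkit's own UDF): returns the lower-cased
  // host of a URL with a leading "www." stripped, or "" if there is no host
  // or the URL cannot be parsed.
  def extractDomain(url: String): String =
    Try(Option(new URI(url.trim).getHost))
      .toOption
      .flatten
      .map(_.toLowerCase.stripPrefix("www."))
      .getOrElse("")

  // Hypothetical row shape: (crawl_date, url, content) becomes
  // (crawl_date, domain, url, content), mirroring a "full-text"-style layout.
  def withDomain(
      rows: Seq[(String, String, String)]
  ): Seq[(String, String, String, String)] =
    rows.map { case (date, url, content) =>
      (date, extractDomain(url), url, content)
    }
}
```

If webpages gained a domain column along these lines, it would carry everything the "full-text" derivative does plus the extra columns, which is the argument for letting it replace "full-text" rather than sit beside it.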