Update PlainTextExtractor to output a single column; text. #453
Conversation
- Resolves #452
- PlainTextExtractor runs RemoveHTML and ExtractBoilerplate on `content`
- Update test
Codecov Report
@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
- Coverage   76.72%   76.70%    -0.02%
==========================================
  Files          49       49
  Lines        1422     1421        -1
  Branches      264      264
==========================================
- Hits         1091     1090        -1
  Misses        215      215
  Partials      116      116
Documentation PR: archivesunleashed/aut-docs#58
Something seems to have gone awry - I get the two part files when running on data, but all I'm seeing is row after row of "".
I used this command in case I did something wrong:
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz --output /Users/ianmilligan1/desktop/results/plaintext
That's right. If you're using the test warc, there should only be one line with text.
https://github.com/archivesunleashed/aut/pull/453/files#diff-617068d8eb9f49b4cac9249793a2d409R48 Line 35 in the output will have text.
Oh, I'm using data from the CPP collection – and there doesn't appear to be any data for the whole collection, whereas yes, there are like two records that come out in the text extractor. There are a lot of junk records you'd expect to see removed, but there are some legit URLs that aren't coming through, i.e. I'm getting 46KB of text whereas running the old version produced more. Let me poke at this a bit - I am pretty sure I've run BoilerPipe on these WARCs before and the results have been bigger than this.
@@ -32,7 +32,6 @@ object PlainTextExtractor {
     // scalastyle:off
     import spark.implicits._
     // scalastyle:on
-    d.select($"crawl_date", ExtractDomainDF($"url").as("domain"),
-      $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("text"))
+    d.select(ExtractBoilerpipeTextDF(RemoveHTMLDF($"content")).as("content"))
aut/src/main/scala/io/archivesunleashed/matchbox/ExtractBoilerpipeTextRDD.scala
Lines 31 to 33 in f1eb43b

  def apply(input: String): String = {
    removeBoilerplate(RemoveHTTPHeaderRDD(input))
  }
We really don't have any documentation on using that from what I can tell. Maybe we shouldn't be calling RemoveHTMLDF before calling ExtractBoilerpipeTextDF here?
Yeah, I think you're right on that @ruebot. Comparing our docs: when we do a regular text extract in DF, we use a command like:
.select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))
so yes, the headers are removed and then the HTML is removed. Whereas for boilerpipe we use:
.select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
So we omit the HTML step and run Boilerpipe on the HTTP-headerless content.
Cool. It should just be ExtractBoilerpipeTextDF then, since that calls ExtractBoilerpipeTextRDD, which runs RemoveHTTPHeaderRDD before running removeBoilerplate.
I'll update this, the test, and push it up shortly.
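To make the upshot concrete, here is a minimal sketch (not the merged code) of what the simplified select in PlainTextExtractor might look like, assuming the output column is named text as in the PR title:

// Sketch only: ExtractBoilerpipeTextDF delegates to ExtractBoilerpipeTextRDD,
// which already runs RemoveHTTPHeaderRDD before removeBoilerplate, so no
// RemoveHTMLDF or RemoveHTTPHeaderDF wrapping should be needed here.
d.select(ExtractBoilerpipeTextDF($"content").as("text"))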
Perfect! And I've got the output now from the shell comparator, so I can quickly see how it more or less lines up.
Tested with the DF script from here:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("plain-text-no-boilerplate-df-testing-453/")

And yes, the results are more robust: from scrolling through the CSV, at a glance obvious boilerplate has been removed but content is still there in many cases; exponentially more than with the earlier version of this PR.
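For a quick quantitative check, a sketch like the following could count how many rows actually contain text. This is illustrative only: it reuses the sample-data path from the script above, and the column name text is chosen here for the sketch rather than taken from the thread.

import io.archivesunleashed._
import io.archivesunleashed.df._

// Count non-empty rows produced by the boilerpipe-based extraction,
// compared against the total number of webpage records.
val extracted = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select(ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")).as("text"))

extracted.filter($"text" =!= "").count() // rows with actual extracted text
extracted.count()                        // total rows, for comparison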
Works well now - thanks @ruebot!
archivesunleashed/aut-docs#58: Documentation update for archivesunleashed/aut#453
GitHub issue(s): #452
What does this Pull Request do?
Update PlainTextExtractor to output a single column, text, extracted from content.
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/452-test/plaintext
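Beyond eyeballing the part files, one hedged way to spot-check the result in spark-shell might look like the sketch below. The output path is the one from the command above; the _c0 column name is Spark's default for headerless CSV output, not something defined by this PR.

// Load the extractor's CSV output and confirm a single text column with
// non-empty rows.
val out = spark.read.csv("/home/nruest/Projects/au/sample-data/452-test/plaintext")
out.printSchema()                 // expect a single string column (_c0)
out.filter($"_c0" =!= "").count() // rows that actually contain extracted text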