Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39
Conversation
Just did a prose review – caught a few things, @ruebot.
Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)
```
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use Parquet format:
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
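For anyone digging in further: [Apache Parquet](https://parquet.apache.org/) is a columnar storage format that preserves a DataFrame's schema, so results can be read back later without re-parsing CSV. A rough sketch of the round trip (untested here; assumes a SparkSession `spark`, a results DataFrame `df`, and placeholder paths):

```scala
// Persist results with their schema intact (path is a placeholder).
df.write.parquet("/path/to/derivatives/parquet/")

// In a later session, read the results back for further processing.
val results = spark.read.parquet("/path/to/derivatives/parquet/")
results.printSchema()
```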
@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
Decided to bite the bullet and plow through this! Looks great, @ruebot - I've tested all the new scripts.
A few errors, which I've put in the comments. Three quarters of the docs refer to `example.arc.gz` and the other quarter to `example.warc.gz`; I'd be a fan of just using `example.arc.gz`, as you'll see.
current/text-analysis.md
Outdated
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Change to `example.arc.gz`?
IIRC, I use WARC for a bunch of them so you actually get results.
```scala
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")))
  .filter($"language" == "fr")
  .write.csv("plain-text-fr-df/")
```
Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
       .filter($"language" == "fr")
        ^
```
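For what it's worth, the likely fix (untested here) is Spark's `===` Column operator: plain Scala `==` compares the Column object against the string and yields a Boolean, which none of `.filter`'s overloads accept, whereas `===` builds a Column expression:

```scala
// === produces a Column expression that .filter accepts;
// == produces a plain Boolean and triggers the overload error.
  .filter($"language" === "fr")
```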
current/text-analysis.md
Outdated
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Recommend changing to `example.arc.gz` for consistency.
See above note.
Added in review; mostly just addresses quick and minor fixes. Note: while reviewing, line numbers are provided, especially in cases where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting.
My focus was on text rather than code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing.
The documentation is looking fantastic @ruebot!!
If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Delete space preceding paragraph.
```diff
@@ -142,7 +142,7 @@ only showing top 20 rows

 ### Scala RDD

-TODO
+**Will not be implemented.**

 ### Scala DF
```
I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
```diff
@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
   .filter(r => r._2 != "" && r._3 != "")
   .countItems()
   .filter(r => r._2 > 5)
-  .saveAsTextFile("sitelinks-by-date/")
+  .saveAsTextFile("sitelinks-by-date-rdd/")
 ```

The format of this output is:
Question: line 205, "- Field one: Crawldate": should it be `yyyyMMdd` or `yyyymmdd`?
Also line 220, ExtractLinks --> `ExtractLinks`?
yyyyMMdd
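The capitalization matters because, in Java/Scala date patterns, `MM` is month-of-year while `mm` is minute-of-hour. A small illustrative sketch (not from the docs):

```scala
import java.text.SimpleDateFormat

// "yyyyMMdd" parses a crawl date like 20091027 as year/month/day;
// "yyyymmdd" would misread the "10" as minutes rather than October.
val fmt = new SimpleDateFormat("yyyyMMdd")
val date = fmt.parse("20091027")
```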
```scala
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```

### Python DF
line 295: open-soure --> open-source
```diff
@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
 sc.setLogLevel("INFO")
```
Line 15: should we add the code command (inline) for concatenation, as the graph pass command is written directly below the paragraph?
current/standard-derivatives.md
Outdated
```diff
@@ -67,9 +66,10 @@ TODO
 How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?
```
"How do I extract binary information " --> "How do I extract the binary information"
```diff
@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
 import io.archivesunleashed._
```
Line 12 --> capitalize Text (filtered by keyword)
…Addresses #372.

- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
@SamFritz @ianmilligan1 I think I hit everything raised.
Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!). 😄
👍 good to go :)
A whole bunch of updates for archivesunleashed/aut#372 (comment)
Depends on archivesunleashed/aut#406
Partially hits #29.
Resolves #22.
Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄
When y'all approve, I'll squash and merge with a sane commit message.