Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39
Conversation
Just did a prose review – caught a few things, @ruebot.
Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)
```
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use Parquet format:
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
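For anyone digging in further: [Apache Parquet](https://parquet.apache.org/) is a columnar storage format that preserves a DataFrame's schema, so results can be read back later without re-parsing CSV. A rough sketch of the round trip (untested here; assumes a SparkSession `spark`, a results DataFrame `df`, and placeholder paths):

```scala
// Persist results with their schema intact (path is a placeholder).
df.write.parquet("/path/to/derivatives/parquet/")

// In a later session, read the results back for further processing.
val results = spark.read.parquet("/path/to/derivatives/parquet/")
results.printSchema()
```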
@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
Decided to bite the bullet and plow through this! Looks great, @ruebot - I've tested all the new scripts.
A few errors, which I've put in the comments. Three quarters of the docs refer to `example.arc.gz` and the other quarter to `example.warc.gz`; I'd be a fan of just using `example.arc.gz`, as you'll see.
current/text-analysis.md
Outdated
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Change to `example.arc.gz`?
IIRC, I use WARC for a bunch of them so you actually get results.
```scala
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")))
  .filter($"language" == "fr")
  .write.csv("plain-text-fr-df/")
```
Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
       .filter($"language" == "fr")
        ^
```
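For what it's worth, the likely fix (untested here) is Spark's `===` Column operator: plain Scala `==` compares the Column object against the string and yields a Boolean, which none of `.filter`'s overloads accept, whereas `===` builds a Column expression:

```scala
// === produces a Column expression that .filter accepts;
// == produces a plain Boolean and triggers the overload error.
  .filter($"language" === "fr")
```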
current/text-analysis.md
Outdated
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Recommend changing to `example.arc.gz` for consistency.
See above note.
Added in review; mostly just addresses quick and minor fixes. Note: while reviewing, line numbers are provided, especially in cases where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting.
My focus was on text rather than code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing.
The documentation is looking fantastic @ruebot!!
If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Delete space preceding paragraph.
```diff
@@ -142,7 +142,7 @@ only showing top 20 rows

 ### Scala RDD

-TODO
+**Will not be implemented.**

 ### Scala DF
```
I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
```diff
@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
   .filter(r => r._2 != "" && r._3 != "")
   .countItems()
   .filter(r => r._2 > 5)
-  .saveAsTextFile("sitelinks-by-date/")
+  .saveAsTextFile("sitelinks-by-date-rdd/")
 ```

The format of this output is:
Question: line 205, "- Field one: Crawldate": should it be `yyyyMMdd` or `yyyymmdd`?
Also line 220, ExtractLinks --> `ExtractLinks`?
yyyyMMdd
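The capitalization matters because, in Java/Scala date patterns, `MM` is month-of-year while `mm` is minute-of-hour. A small illustrative sketch (not from the docs):

```scala
import java.text.SimpleDateFormat

// "yyyyMMdd" parses a crawl date like 20091027 as year/month/day;
// "yyyymmdd" would misread the "10" as minutes rather than October.
val fmt = new SimpleDateFormat("yyyyMMdd")
val date = fmt.parse("20091027")
```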
```scala
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```

### Python DF
line 295: open-soure --> open-source
```diff
@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
 sc.setLogLevel("INFO")
```
Line 15: should we add the code command (inline) for concatenation, as the graph pass command is written directly below the paragraph?
current/standard-derivatives.md
Outdated
```diff
@@ -67,9 +66,10 @@ TODO
 How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?
```
"How do I extract binary information " --> "How do I extract the binary information"
```diff
@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
 import io.archivesunleashed._
```
Line 12 --> capitalize Text (filtered by keyword)
…Addresses #372.

- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
@SamFritz @ianmilligan1 I think I hit everything raised.
Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!). 😄
👍 good to go :)
A whole bunch of updates for archivesunleashed/aut#372 (comment)
Depends on archivesunleashed/aut#406
Partially hits #29.
Resolves #22.
Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄
When y'all approve, I'll squash and merge with a sane commit message.