
Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Merged: 11 commits merged into master on Jan 20, 2020

Conversation

@ruebot (Member) commented on Jan 16, 2020:

A whole bunch of updates for archivesunleashed/aut#372 (comment)

Depends on archivesunleashed/aut#406

Partially hits #29.

Resolves #22.

Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have some funny mess-ups. Let me know 😄

When y'all approve, I'll squash and merge with a sane commit message.

@ianmilligan1 (Member) left a comment:

Just did a prose review – caught a few things, @ruebot.

Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)

current/README.md (resolved; outdated)
current/binary-analysis.md (resolved)
```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, use the Parquet format:
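A minimal sketch of that step, assuming a DataFrame named archive as in the CSV line above (shown here in the Scala shell; the paths are placeholders):

```scala
// Write the DataFrame out as Parquet instead of CSV.
archive.write.parquet("/path/to/export/directory/")

// Later, read the results back in for further processing.
val results = spark.read.parquet("/path/to/export/directory/")
```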
Member:
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?

current/link-analysis.md (resolved)
current/rdd-results.md (resolved; outdated)
@ruebot (Member Author) commented on Jan 17, 2020:

@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.

@ianmilligan1 (Member) left a comment:

Decided to bite the bullet and plow through this! Looks great, @ruebot - I've tested all the new scripts.

A few errors, which I've put in the comments. Three quarters of the docs refer to example.arc.gz and the other quarter to example.warc.gz; I'd be a fan of just using example.arc.gz, as you'll see.

current/collection-analysis.md (3 resolved threads)
current/link-analysis.md (2 resolved threads)
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Member:

change to example.arc.gz?

Member Author:

IIRC, I use WARC for a bunch of them so you actually get results.

```scala
.select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")))
.filter($"language" == "fr")
.write.csv("plain-text-fr-df/")
```
Member:

Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
          .filter($"language" == "fr")
           ^
```
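The cause: in Spark's Scala API, `==` is plain Scala equality and returns a Boolean, while `filter` here needs a Column; the column-equality operator is `===`. A minimal sketch of the fix:

```scala
// Use Spark's Column equality operator (===) rather than Scala's ==,
// so filter receives a Column instead of a Boolean.
.filter($"language" === "fr")
```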

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```
Member:

recommend changing to arc for consistency

Member Author:

See above note.

current/text-analysis.md (2 resolved threads)
@SamFritz (Member) left a comment:

Added a review; it mostly addresses quick and minor fixes. Note: while reviewing, line numbers are provided, especially where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting.

My focus was on text rather than code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing.

The documentation is looking fantastic @ruebot!!

current/README.md (2 resolved threads)

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Member:

Delete the space preceding the paragraph.

current/README.md (resolved)
```diff
@@ -142,7 +142,7 @@ only showing top 20 rows

 ### Scala RDD

-TODO
+**Will not be implemented.**

 ### Scala DF
```
Member:

I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns. Are the dimensions embedded in the columns that are shown?
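One quick way to check, assuming the extraction produces a Spark DataFrame (called images here purely as a placeholder):

```scala
// List the columns actually present, to see whether width and height made it in.
images.printSchema()

// If they are present, they can be selected alongside the other fields.
images.select("url", "width", "height").show()
```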

```diff
@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
   .filter(r => r._2 != "" && r._3 != "")
   .countItems()
   .filter(r => r._2 > 5)
-  .saveAsTextFile("sitelinks-by-date/")
+  .saveAsTextFile("sitelinks-by-date-rdd/")
```

The format of this output is:
Member:

Question: line 205, "- Field one: Crawldate". Should it be yyyyMMdd or yyyymmdd?

Member:

Also line 220, ExtractLinks --> `ExtractLinks`?

Member Author:

yyyyMMdd
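For anyone wondering why the case matters: in the Java/Scala date-pattern syntax that Spark uses, MM is month-of-year while mm is minute-of-hour, so yyyyMMdd is the right pattern for a crawl date. A quick sketch:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// MM = month-of-year; mm would mean minute-of-hour.
val crawlDate = LocalDate.parse("20200120", DateTimeFormatter.ofPattern("yyyyMMdd"))
// crawlDate: java.time.LocalDate = 2020-01-20
```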

```scala
.count()
.filter($"count" > 5)
.write.csv("sitelinks-details-df/")
```

### Python DF

Member:

line 295: open-soure --> open-source

```diff
@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
 sc.setLogLevel("INFO")
```
Member:

Line 15: should we add the inline code command for concatenation, since the graph pass command is written directly below the paragraph?

```diff
@@ -67,9 +66,10 @@ TODO
 How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?
```
Member:

"How do I extract binary information " --> "How do I extract the binary information"

```diff
@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
 import io.archivesunleashed._
```
Member:

Line 12 --> capitalize Text (filtered by keyword)

ianmilligan1 pushed a commit to archivesunleashed/aut that referenced this pull request Jan 17, 2020
…Addresses #372.

- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
@ruebot (Member Author) commented on Jan 17, 2020:

@SamFritz @ianmilligan1 I think I hit everything raised.

@ianmilligan1 (Member) commented on Jan 17, 2020:

Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!). 😄

@SamFritz (Member):
👍 good to go :)

@ruebot merged commit 4ce3cb5 into master on Jan 20, 2020
@ruebot deleted the aut-372 branch on January 20, 2020 at 14:51
Successfully merging this pull request may close this issue: DISCUSSION: Mark TODO sections as won't implement where necessary

3 participants