
Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Merged: 11 commits, Jan 20, 2020
5 changes: 2 additions & 3 deletions current/README.md
@@ -2,17 +2,16 @@

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.

Most of this documentation is built on [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and at our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
This documentation takes a cookbook approach, offering a series of "recipes" that address common analytics tasks and provide inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose the Scala or Python flavour of Spark.
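
To give a flavour of what those recipes look like, here is a minimal DataFrame sketch in PySpark. It is a sketch, not part of the recipes themselves: it assumes a PySpark session launched with the toolkit (so `sc` and `sqlContext` are available), a sample archive named `example.arc.gz`, and that the `aut` Python package exposes the `WebArchive` entry point used later in this documentation.

```python
# A minimal DataFrame recipe sketch in PySpark.
# Assumes a PySpark shell started with the toolkit, so `sc` and
# `sqlContext` already exist, and a local sample archive file.
from aut import WebArchive  # import path is an assumption

archive = WebArchive(sc, sqlContext, "example.arc.gz")

# Load the crawled pages as a DataFrame and peek at a few status codes.
archive.webpages() \
    .select("http_status_code") \
    .show(10, False)
```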

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

Delete space preceding paragraph


### Getting Started

- [Setting up the Archives Unleashed Toolkit](setting-up-aut.md)
- [Setting Things Up](https://github.com/archivesunleashed/aut/#dependencies)
- [Using the Archives Unleashed Toolkit at Scale](aut-at-scale.md)
- [Archives Unleashed Toolkit Walkthrough](toolkit-walkthrough.md)

14 changes: 7 additions & 7 deletions current/binary-analysis.md
@@ -15,7 +15,7 @@ The Archives Unleashed Toolkit supports binary object types for analysis:

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -142,7 +142,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
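
For what it's worth, a quick way to check is to inspect the schema and select the dimension columns directly. The sketch below is a hypothetical PySpark check that assumes the Python `WebArchive` wrapper exposes an `images()` DataFrame carrying `width` and `height` columns, as the prose describes; the method and column names are assumptions to be verified against the actual output.

```python
# Hypothetical check: list the image DataFrame's columns and, assuming
# dimensions are stored as "width" and "height", pull them out with the URL.
from aut import WebArchive  # assumes the aut Python package is on the PySpark path

images = WebArchive(sc, sqlContext, "example.arc.gz").images()

images.printSchema()  # shows every extracted column, dimensions included if present
images.select("url", "width", "height").show(10, False)
```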

@@ -274,7 +274,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -406,7 +406,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -538,7 +538,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -670,7 +670,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -802,7 +802,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

36 changes: 32 additions & 4 deletions current/collection-analysis.md
@@ -108,7 +108,7 @@ For example, suppose I wanted to extract the first-level directories?
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc) .keepValidPages()
RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.flatMap(r => """http://[^/]+/[^/]+/""".r.findAllIn(r.getUrl).toList)
.take(10)
```
@@ -119,7 +119,19 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Keep only pages whose URL matches the first-level directory pattern.
val urlPattern = Set("""http://[^/]+/[^/]+/""".r)

RecordLoader.loadArchives("example.arc.gz", sc)
  .webpages()
  .select($"url")
  .keepUrlPatternsDF(urlPattern)
  .show(10, false)
```

### Python DF

@@ -144,7 +156,15 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// List each URL alongside the HTTP status code returned for it.
RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .select($"url", $"http_status_code")
  .show(10, false)
```

### Python DF

@@ -183,7 +203,15 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Show which web archive file each URL was captured in.
RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .select($"url", $"archive_filename")
  .show(10, false)
```

### Python DF

67 changes: 65 additions & 2 deletions current/df-results.md
@@ -59,7 +59,7 @@ Depending on your intended use of the output, you may want to include headers in
.write.option("header","true").csv("/path/to/export/directory/")
```

If you want to store the results with the intention to read the results back later for further processing, then use Parquet format:
If you want to store the results with the intention of reading them back later for further processing, then use [Parquet](https://parquet.apache.org/) format (a [columnar storage format](http://en.wikipedia.org/wiki/Column-oriented_DBMS)):

```scala
.write.parquet("/path/to/export/directory/")
@@ -81,4 +81,67 @@ Note that this works even across languages (e.g., export to Parquet from Scala,

## Python

TODO: Python basically the same, but with Python syntax. However, we should be explicit and lay out the steps.
If you want to return a set of results, the counterpart of `.take(10)` with RDDs is `.head(10)`.
So, something like (in Python):

```python
(WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocites/1")
    .webpages()
    # more transformations here...
    .select("http_status_code")
    .head(10))
```

In the PySpark console, the results are returned as a list of `Row` objects, like the following:

```
[Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200')]
```

You can assign the transformations to a variable, like this:

```python
archive = (WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocites/1")
    .webpages()
    # more transformations here...
)

archive.head(10)
```

If you want _all_ results, replace `.head(10)` with `.collect()`.
This will return _all_ results to the console.

**WARNING**: Be careful with `.collect()`! If your results contain ten million records, Spark will try to return _all of them_ to your console (on your physical machine).
Most likely, your machine won't have enough memory!
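
For reference, a minimal sketch using the hypothetical `archive` DataFrame assigned above:

```python
# Bring every row back to the driver as a list of Row objects.
# Only do this when the result comfortably fits in local memory.
all_rows = archive.collect()
print(len(all_rows))
```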

Alternatively, if you want to save the results to disk instead of returning them to the console, write the `archive` DataFrame out as CSV:

```python
archive.write.csv("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.
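
Spark writes one part file per partition into that directory (files named like `part-00000-*.csv`, plus a `_SUCCESS` marker). If you would rather end up with a single part file, one common approach (a sketch, not a toolkit requirement) is to collapse the DataFrame to one partition before writing:

```python
# Coalescing to a single partition yields one CSV part file in the directory.
archive.coalesce(1).write.csv("/path/to/export/directory/")
```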

Depending on your intended use of the output, you may want to include headers in the CSV file, in which case:

```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use [Parquet](https://parquet.apache.org/) format (a [columnar storage format](http://en.wikipedia.org/wiki/Column-oriented_DBMS)):
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?


```python
archive.write.parquet("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.

Later, as in a completely separate session, you can read the results back in and continue processing, as follows:

```python
archive = spark.read.parquet("/path/to/export/directory/")

archive.show(20, False)
```

Parquet encodes metadata such as the schema and column types, so you can pick up exactly where you left off.
Note that this works even across languages (e.g., export to Parquet from Scala, read back in Python) or any system that supports Parquet.
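
A quick way to confirm that the schema survives the round trip, sketched below on the assumption that the stored DataFrame still carries the `http_status_code` column from the earlier examples:

```python
# The schema and column types written in the original session come back intact.
archive.printSchema()

# So transformations can resume immediately on columns from the stored schema.
archive.select("http_status_code").distinct().show()
```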