
Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Merged: 11 commits, Jan 20, 2020
5 changes: 2 additions & 3 deletions current/README.md
@@ -2,17 +2,16 @@

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.

Most of this documentation is built on [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and at our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
This documentation takes a cookbook approach, offering a series of "recipes" that address common analytics tasks and provide inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose the Scala or Python flavour of Spark.
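
To give a flavour of what those recipes look like, here is a minimal DataFrame sketch in PySpark. It is a sketch, not part of the recipes themselves: it assumes a PySpark session launched with the toolkit (so `sc` and `sqlContext` are available), a sample archive named `example.arc.gz`, and that the `aut` Python package exposes the `WebArchive` entry point used later in this documentation.

```python
# A minimal DataFrame recipe sketch in PySpark.
# Assumes a PySpark shell started with the toolkit, so `sc` and
# `sqlContext` already exist, and a local sample archive file.
from aut import WebArchive  # import path is an assumption

archive = WebArchive(sc, sqlContext, "example.arc.gz")

# Load the crawled pages as a DataFrame and peek at a few status codes.
archive.webpages() \
    .select("http_status_code") \
    .show(10, False)
```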

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

Delete space preceding paragraph


### Getting Started

- [Setting up the Archives Unleashed Toolkit](setting-up-aut.md)
- [Setting Things Up](https://github.com/archivesunleashed/aut/#dependencies)
- [Using the Archives Unleashed Toolkit at Scale](aut-at-scale.md)
- [Archives Unleashed Toolkit Walkthrough](toolkit-walkthrough.md)

14 changes: 7 additions & 7 deletions current/binary-analysis.md
@@ -15,7 +15,7 @@ The Archives Unleashed Toolkit supports binary object types for analysis:

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -142,7 +142,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
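
For what it's worth, a quick way to check is to inspect the schema and select the dimension columns directly. The sketch below is a hypothetical PySpark check that assumes the Python `WebArchive` wrapper exposes an `images()` DataFrame carrying `width` and `height` columns, as the prose describes; the method and column names are assumptions to be verified against the actual output.

```python
# Hypothetical check: list the image DataFrame's columns and, assuming
# dimensions are stored as "width" and "height", pull them out with the URL.
from aut import WebArchive  # assumes the aut Python package is on the PySpark path

images = WebArchive(sc, sqlContext, "example.arc.gz").images()

images.printSchema()  # shows every extracted column, dimensions included if present
images.select("url", "width", "height").show(10, False)
```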

@@ -274,7 +274,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -406,7 +406,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -538,7 +538,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -670,7 +670,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

@@ -802,7 +802,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

36 changes: 32 additions & 4 deletions current/collection-analysis.md
@@ -108,7 +108,7 @@ For example, suppose I wanted to extract the first-level directories?
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc) .keepValidPages()
RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.flatMap(r => """http://[^/]+/[^/]+/""".r.findAllIn(r.getUrl).toList)
.take(10)
```
@@ -119,7 +119,19 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Keep only pages whose URL matches the first-level directory pattern.
val urlPattern = Set("""http://[^/]+/[^/]+/""".r)

RecordLoader.loadArchives("example.arc.gz", sc)
  .webpages()
  .select($"url")
  .keepUrlPatternsDF(urlPattern)
  .show(10, false)
```

### Python DF

@@ -144,7 +156,15 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// List each URL alongside the HTTP status code returned for it.
RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .select($"url", $"http_status_code")
  .show(10, false)
```

### Python DF

@@ -183,7 +203,15 @@ What do I do with the results? See [this guide](rdd-results.md)!

### Scala DF

TODO
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Show which web archive file each URL was captured in.
RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .select($"url", $"archive_filename")
  .show(10, false)
```

### Python DF

67 changes: 65 additions & 2 deletions current/df-results.md
@@ -59,7 +59,7 @@ Depending on your intended use of the output, you may want to include headers in
.write.option("header","true").csv("/path/to/export/directory/")
```

If you want to store the results with the intention to read the results back later for further processing, then use Parquet format:
If you want to store the results with the intention of reading them back later for further processing, then use [Parquet](https://parquet.apache.org/) format (a [columnar storage format](http://en.wikipedia.org/wiki/Column-oriented_DBMS)):

```scala
.write.parquet("/path/to/export/directory/")
@@ -81,4 +81,67 @@ Note that this works even across languages (e.g., export to Parquet from Scala,

## Python

TODO: Python basically the same, but with Python syntax. However, we should be explicit and lay out the steps.
If you want to return a set of results, the counterpart of `.take(10)` with RDDs is `.head(10)`.
So, something like (in Python):

```python
(WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocites/1")
    .webpages()
    # more transformations here...
    .select("http_status_code")
    .head(10))
```

In the PySpark console, the results are returned as a list of `Row` objects, like the following:

```
[Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200'), Row(http_status_code='200')]
```

You can assign the transformations to a variable, like this:

```python
archive = (WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocites/1")
    .webpages()
    # more transformations here...
)

archive.head(10)
```

If you want _all_ results, replace `.head(10)` with `.collect()`.
This will return _all_ results to the console.

**WARNING**: Be careful with `.collect()`! If your results contain ten million records, Spark will try to return _all of them_ to your console (on your physical machine).
Most likely, your machine won't have enough memory!
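
For reference, a minimal sketch using the hypothetical `archive` DataFrame assigned above:

```python
# Bring every row back to the driver as a list of Row objects.
# Only do this when the result comfortably fits in local memory.
all_rows = archive.collect()
print(len(all_rows))
```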

Alternatively, if you want to save the results to disk instead of returning them to the console, write the `archive` DataFrame out as CSV:

```python
archive.write.csv("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.
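
Spark writes one part file per partition into that directory (files named like `part-00000-*.csv`, plus a `_SUCCESS` marker). If you would rather end up with a single part file, one common approach (a sketch, not a toolkit requirement) is to collapse the DataFrame to one partition before writing:

```python
# Coalescing to a single partition yields one CSV part file in the directory.
archive.coalesce(1).write.csv("/path/to/export/directory/")
```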

Depending on your intended use of the output, you may want to include headers in the CSV file, in which case:

```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use [Parquet](https://parquet.apache.org/) format (a [columnar storage format](http://en.wikipedia.org/wiki/Column-oriented_DBMS)):
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?


```python
archive.write.parquet("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.

Later, as in a completely separate session, you can read the results back in and continue processing, as follows:

```python
archive = spark.read.parquet("/path/to/export/directory/")

archive.show(20, False)
```

Parquet encodes metadata such as the schema and column types, so you can pick up exactly where you left off.
Note that this works even across languages (e.g., export to Parquet from Scala, read back in Python) or any system that supports Parquet.
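
A quick way to confirm that the schema survives the round trip, sketched below on the assumption that the stored DataFrame still carries the `http_status_code` column from the earlier examples:

```python
# The schema and column types written in the original session come back intact.
archive.printSchema()

# So transformations can resume immediately on columns from the stored schema.
archive.select("http_status_code").distinct().show()
```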