Documentation updates for archivesunleashed/aut#541
ruebot committed Jun 14, 2022
1 parent 71ee62e commit 73e333e
Showing 5 changed files with 861 additions and 2 deletions.
104 changes: 104 additions & 0 deletions docs/auk-derivatives.md
@@ -146,6 +146,61 @@ RecordLoader.loadArchives(warcs, sc)
.option("encoding", "utf-8")
.save(results + "word-processor")

// Text files.
RecordLoader.loadArchives(warcs, sc)
.css()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "css")

RecordLoader.loadArchives(warcs, sc)
.html()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "html")

RecordLoader.loadArchives(warcs, sc)
.js()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "js")

RecordLoader.loadArchives(warcs, sc)
.json()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "json")

RecordLoader.loadArchives(warcs, sc)
.plainText()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "plain-text")

RecordLoader.loadArchives(warcs, sc)
.xml()
.write
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
.format("csv")
.option("escape", "\"")
.option("encoding", "utf-8")
.save(results + "xml")

sys.exit
```

@@ -275,4 +330,53 @@ WebArchive(sc, sqlContext, warcs).word_processor()\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "word_processor")

# Text files.
WebArchive(sc, sqlContext, warcs).css()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "css")

WebArchive(sc, sqlContext, warcs).html()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "html")

WebArchive(sc, sqlContext, warcs).js()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "js")

WebArchive(sc, sqlContext, warcs).json()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "json")

WebArchive(sc, sqlContext, warcs).plain_text()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "plain-text")

WebArchive(sc, sqlContext, warcs).xml()\
.write\
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")\
.format("csv")\
.option("escape", "\"")\
.option("encoding", "utf-8")\
.save(results + "xml")
```
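The six text-file blocks in the Python script differ only in the method called and the output directory. As a rough sketch, and not part of the toolkit or of this commit, the same calls could be driven from a table. `TEXT_DERIVATIVES`, `write_text_derivatives`, and `make_archive` are hypothetical names introduced here for illustration; the method names, writer options, and directory names are copied from the script.

```python
# Hypothetical helper; these names are not part of the Archives Unleashed Toolkit.
# Each pair is (DataFrame method name, output directory suffix), copied
# from the script above.
TEXT_DERIVATIVES = [
    ("css", "css"),
    ("html", "html"),
    ("js", "js"),
    ("json", "json"),
    ("plain_text", "plain-text"),
    ("xml", "xml"),
]


def write_text_derivatives(make_archive, results):
    """Write each text-file derivative as CSV and return the output paths.

    make_archive is assumed to be a zero-argument callable that returns an
    object exposing the methods named in TEXT_DERIVATIVES.
    """
    written = []
    for method_name, directory in TEXT_DERIVATIVES:
        # Look the method up by name and call it to get the DataFrame.
        df = getattr(make_archive(), method_name)()
        (df.write
            .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
            .format("csv")
            .option("escape", "\"")
            .option("encoding", "utf-8")
            .save(results + directory))
        written.append(results + directory)
    return written
```

Under those assumptions, the six blocks would reduce to a call such as `write_text_derivatives(lambda: WebArchive(sc, sqlContext, warcs), results)`. The explicit blocks in the script remain easier to copy and adapt one derivative at a time.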
84 changes: 84 additions & 0 deletions docs/dataframe-schemas.md
@@ -153,3 +153,87 @@ hyperlink information.
- `md5` (string)
- `sha1` (string)
- `bytes` (binary)

## CSS Information and Content

**`.css()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)

## HTML Information and Content

**`.html()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)

## JavaScript Information and Content

**`.js()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)

## JSON Information and Content

**`.json()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)

## Plain Text Information and Content

**`.plainText()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)

## XML Information and Content

**`.xml()`**

- `crawl_date` (string)
- `url` (string)
- `filename` (string)
- `extension` (string)
- `mime_type_web_server` (string)
- `mime_type_tika` (string)
- `md5` (string)
- `sha1` (string)
- `content` (string)
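
The six DataFrames above share the same nine string columns. Spark does not write a header row to CSV unless the `header` option is set, so a derivative saved without that option needs the column names supplied when it is read back. The following is a minimal sketch using only Python's standard library. It assumes a single `part-*.csv` file written with `escape` set to `"`, so that quotes inside a field appear doubled. `TEXT_COLUMNS` and `read_text_derivative` are hypothetical names, not toolkit API.

```python
import csv

# Column order taken from the schema lists above.
TEXT_COLUMNS = [
    "crawl_date",
    "url",
    "filename",
    "extension",
    "mime_type_web_server",
    "mime_type_tika",
    "md5",
    "sha1",
    "content",
]


def read_text_derivative(path):
    """Yield one dict per record from a headerless CSV part file."""
    # newline="" lets the csv module handle line breaks inside quoted content.
    with open(path, newline="", encoding="utf-8") as handle:
        # The csv module's default dialect reads a doubled quote ("") inside
        # a quoted field as one literal quote.
        for row in csv.DictReader(handle, fieldnames=TEXT_COLUMNS):
            yield row
```

Large `content` values can exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it before reading.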
10 changes: 9 additions & 1 deletion docs/home.md
@@ -71,6 +71,14 @@ and working with the results.
- [Extract Spreadsheet Information](binary-analysis.md#extract-spreadsheet-information)
- [Extract Video Information](binary-analysis.md#extract-video-information)
- [Extract Word Processor File Information](binary-analysis.md#extract-word-processor-files-information)
- **[Text Files (html, text, css, js, json, xml) Analysis](text-files-analysis.md)**:
How do I...
- [Extract CSS Information](text-files-analysis.md#extract-css-information)
- [Extract HTML Information](text-files-analysis.md#extract-html-information)
- [Extract JavaScript Information](text-files-analysis.md#extract-javascript-information)
- [Extract JSON Information](text-files-analysis.md#extract-json-information)
- [Extract Plain Text Information](text-files-analysis.md#extract-plain-text-information)
- [Extract XML Information](text-files-analysis.md#extract-xml-information)

### Filtering Results

@@ -82,7 +90,7 @@ and working with the results.
**How do I...**

- [Use the Toolkit with spark-submit](aut-spark-submit-app.md)
- [Create the Archives Unleashed Cloud Scholarly Derivatives](auk-derivatives.md)
- [Create the Archives Research Compute Hub (ARCH) Derivatives](auk-derivatives.md)
- [Extract Binary Info](extract-binary-info.md)
- [Extract Binaries to Disk](extract-binary.md)
