
Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406

Merged: 6 commits into master from issue-372, Jan 17, 2020

Conversation

@ruebot (Member) commented Jan 16, 2020

GitHub issue(s): #372

What does this Pull Request do?

Various DataFrame implementation updates for documentation clean-up.

Adds `archive_filename` to `.all()`

Renames the `.all()` column `HttpStatus` to `http_status_code` (both changes are illustrated in the sketch below)

Big README update to align with the documentation and remove a lot of duplication that was confusing and difficult to keep up to date and in sync. We can keep it all here in the README now.

See: archivesunleashed/aut-docs#39
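
For illustration, a minimal spark-shell sketch of the updated `.all()` DataFrame. This assumes a locally built fatjar (named as elsewhere in this README) and a hypothetical WARC path; `sc` is the SparkContext that spark-shell provides:

```shell
$ spark-shell --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar

scala> import io.archivesunleashed._

scala> // archive_filename is the new column; http_status_code was HttpStatus
scala> RecordLoader.loadArchives("/path/to/warcs", sc).all().select("archive_filename", "http_status_code").show(5)
```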

How should this be tested?

- Looks like repos are forcing HTTPS to be used now:

```
[WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
```
codecov bot commented Jan 16, 2020

Codecov Report

Merging #406 into master will not change coverage.
The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master     #406   +/-   ##
=======================================
  Coverage   77.56%   77.56%
=======================================
  Files          40       40
  Lines        1542     1542
  Branches      292      292
=======================================
  Hits         1196     1196
  Misses        218      218
  Partials      128      128
```

@ruebot ruebot requested a review from ianmilligan1 January 17, 2020 02:20
@ruebot ruebot marked this pull request as ready for review January 17, 2020 02:23
@ianmilligan1 (Member) left a comment:

Looks great - ran through it and the commands are nice. Big suggestions from my end are (a) adding back in local build instructions; and (b) trying to make the Python commands easier to parse, as I know we've struggled in the past.

Overall this is really great - thanks @ruebot !

### Python

If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the [Anaconda Distribution](https://www.anaconda.com/distribution). This _should_ install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with `conda install` or `pip install`.
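
If Anaconda doesn't pull those in, a minimal sketch of the fallback installs (assuming the standard `pyspark` and `jupyter` packages on conda-forge and PyPI):

```shell
# with conda (Anaconda/Miniconda)
$ conda install -c conda-forge pyspark jupyter

# or with pip
$ pip install pyspark jupyter
```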

Member:

I would really recommend keeping the build-from-local instructions here - my rationale is (a) this is a GitHub repo, so it'd be nice for people to be able to build it themselves; (b) we refer to local builds throughout these docs anyway. I know there are worries about keeping this README.md too long, but I would almost prefer to see the local builds here and, if we're worried about length, move some of the notebooks etc. stuff out to the docs.

I think a long README for a GitHub repo is ok, esp. if there's good heading structure.

Some slight adaptation of the following:

Build it Locally

Build it yourself as per the instructions below.

Clone the repo:

```shell
$ git clone https://github.com/archivesunleashed/aut.git
```

You can then build The Archives Unleashed Toolkit:

```shell
$ mvn clean install
```

For the impatient, to skip tests:

```shell
$ mvn clean install -DskipTests
```

README.md Outdated
```shell
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
```

#### Running Apache Spark Shell (Scala)
Member:

Consider removing this generic instruction? Apart from a sanity test to see that the spark shell is working, I don't know what we gain by getting people to launch it w/o the AUT package or jar, only to have to launch it again.

People who don't read the whole README will also get confused.


If you have Apache Spark ready to go, it's as easy as:
There are two options for loading the Archives Unleashed Toolkit. The advantages and disadvantages of using either option are going to depend on your setup (single machine vs. cluster):
Member:

There are two options for loading the Archives Unleashed Toolkit. -> There are two options for loading the Archives Unleashed Toolkit: using packages to download the Archives Unleashed package, or building it yourself.

Member Author:

No. I think this is misleading. That's not what packages does, and you can also download the uberjar. I'd prefer to stick with the existing language here.

Member:

OK!
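
(For reference, a sketch of the two options as launch commands; the `--packages` coordinate and the fatjar path are the ones used elsewhere in this README:)

```shell
# Option 1: have Spark fetch a released package from Maven Central
$ spark-shell --packages "io.archivesunleashed:aut:0.18.1"

# Option 2: point Spark at an uberjar you downloaded or built locally
$ spark-shell --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
```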


### A little less easy
HEAD (built locally):
Member:

you could put the local build instructions here if you don't want them up above? (your call?)

README.md Outdated
HEAD (built locally):

```shell
$ spark-shell --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
```
Member:

--jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
->
--jars /path/to/aut/target/aut-0.18.2-SNAPSHOT-fatjar.jar

Member Author:

You can also download it and put it wherever you want.

Member:

Aren't we saying built locally here?

README.md Outdated

### Archives Unleashed Toolkit with PySpark

To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala packages can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#Introduction) (the zip file will be found in the `target` directory).
Member:

This is where the local build comes in handy. I would prefer to be able to build locally and just have all this stuff here, rather than futzing with downloads (all of us are different of course).

```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```

Member:

I think this command really throws people - @SamFritz might have had trouble with this too (not sure if she ever got it working). Maybe an example command? Or suggesting they run `which python` to get their Python path?

(putting a lot of this in a README might be a bridge too far)

Member:

i.e. here's my command:

```shell
export PYSPARK_PYTHON=/Users/ianmilligan1/anaconda/bin/Python; export PYSPARK_DRIVER_PYTHON=/Users/ianmilligan1/anaconda/bin/Python; ./bin/pyspark --py-files /Users/ianmilligan1/dropbox/git/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.1"
```

(the important thing, I think, is the `which python`, as finding a Python path can really throw users for a loop)
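
A minimal sketch of that suggestion, assuming a POSIX shell (the printed path is illustrative):

```shell
$ which python
/Users/yourname/anaconda/bin/python

# point PySpark at whatever `which python` printed
$ export PYSPARK_PYTHON=$(which python)
$ export PYSPARK_DRIVER_PYTHON=$(which python)
```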


```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```
Member:

boom, works great

and I think it flows nicely in this order


### Even less easy
### Archives Unleashed Toolkit with Jupyter

Member:

add a note that they'll need to have Jupyter Notebooks installed? Just link out to https://jupyter.org/install?
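
(For instance, the note could be as small as one line; `jupyter` here is the PyPI metapackage, an assumption on my part:)

```shell
$ pip install jupyter
```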

@ianmilligan1 ianmilligan1 merged commit 9277e68 into master Jan 17, 2020
@ianmilligan1 ianmilligan1 deleted the issue-372 branch January 17, 2020 17:56