Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406
Conversation
- Looks like repos are forcing https to be used now: [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
Codecov Report
@@ Coverage Diff @@
## master #406 +/- ##
=======================================
Coverage 77.56% 77.56%
=======================================
Files 40 40
Lines 1542 1542
Branches 292 292
=======================================
Hits 1196 1196
Misses 218 218
Partials 128 128
Looks great - ran through it and the commands are nice. Big suggestions from my end are (a) adding back in local build instructions; and (b) trying to make the Python commands easier to parse, as I know we've struggled in the past.
Overall this is really great - thanks @ruebot !
### Python

If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the [Anaconda Distribution](https://www.anaconda.com/distribution). This _should_ install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with `conda install` or `pip install`.
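For readers installing these pieces by hand, a minimal sketch (package names are the usual ones for these tools; adjust for your environment):

```shell
# With Anaconda/conda (assumes conda is on your PATH):
$ conda install pyspark jupyter

# Or with pip:
$ pip install pyspark notebook
```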
I would really recommend keeping the local build instructions here - my rationale is (a) this is a GitHub repo, so it'd be nice for people to be able to build it themselves; and (b) we refer to local builds throughout these docs anyway. I know there are worries about keeping this README.md too long, but I would almost prefer to see the local builds here and, if we're worried about length, move some of the notebooks etc. stuff out to the docs.
I think a long README for a GitHub repo is ok, esp. if there's good heading structure.
Some slight adaptation of the following:
Build it Locally
Build it yourself as per the instructions below:
Clone the repo:
$ git clone http://github.com/archivesunleashed/aut.git
You can then build the Archives Unleashed Toolkit:
$ mvn clean install
For the impatient, to skip tests:
$ mvn clean install -DskipTests
README.md (outdated)
```shell
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
```

#### Running Apache Spark Shell (Scala)
Consider removing this generic instruction? Apart from a sanity test to see that the spark shell is working, I don't know what we gain by getting people to launch it w/o the AUT package or jar, only to have to launch it again.
People who don't read the whole README will also get confused.
If you have Apache Spark ready to go, it's as easy as:
There are two options for loading the Archives Unleashed Toolkit. The advantages and disadvantages of using either option are going to depend on your setup (single machine vs cluster):
There are two options for loading the Archives Unleashed Toolkit.
-> There are two options for loading the Archives Unleashed Toolkit: using packages to download the Archives Unleashed package, or building it yourself.
No. I think this is misleading. That's not what packages does, and you can also download the uberjar. I'd prefer to stick with the existing language here.
OK!
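For readers following along, a minimal sketch of what the two options look like in practice (the version number follows the 0.18.1 examples elsewhere in this thread; the jar path is illustrative):

```shell
# Option 1: have Spark fetch the release from Maven Central via --packages
$ spark-shell --packages "io.archivesunleashed:aut:0.18.1"

# Option 2: point Spark at a fatjar you downloaded or built locally via --jars
$ spark-shell --jars /path/to/aut-0.18.1-fatjar.jar
```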
### A little less easy
HEAD (built locally):
you could put the local build instructions here if you don't want them up above? (your call?)
README.md (outdated)
HEAD (built locally):

```shell
$ spark-shell --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
```
--jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
->
--jars /path/to/aut/target/aut-0.18.2-SNAPSHOT-fatjar.jar
You can also download it, and put it wherever you want.
Aren't we saying built locally here?
README.md (outdated)
### Archives Unleashed Toolkit with PySpark

To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package, and the Python bindings. The Java/Scala packages can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#Introduction) (the zip file will be found in the `target` directory).
This is where the local build comes in handy. I would prefer to be able to build locally and just have all this stuff here, rather than futzing with downloads (all of us are different of course).
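As a sketch of that local-build workflow (the `target/aut.zip` path matches the reviewer's command further down; exact filenames may vary by version):

```shell
# Build the toolkit locally, skipping tests for speed
$ mvn clean install -DskipTests

# The Python bindings zip ends up under target/
$ ls target/aut.zip
```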
```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```
I think this command really throws people - @SamFritz might have had trouble with this too (not sure if she ever got it working). Maybe an example command? Or suggesting they run `which python` to get their Python path?
(putting a lot of this in a README might be a bridge too far)
i.e. here's my command:
export PYSPARK_PYTHON=/Users/ianmilligan1/anaconda/bin/Python; export PYSPARK_DRIVER_PYTHON=/Users/ianmilligan1/anaconda/bin/Python; ./bin/pyspark --py-files /Users/ianmilligan1/dropbox/git/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.1"
(the important thing is, I think, the `which python`, as finding a Python path can really throw users for a loop)
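A small sketch of that `which python` suggestion (the interpreter path shown is just an example):

```shell
# Find your Python interpreter, then feed it to PySpark
$ which python
/Users/example/anaconda/bin/python
$ export PYSPARK_PYTHON=$(which python); export PYSPARK_DRIVER_PYTHON=$(which python)
```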
```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```
boom, works great
and I think it flows nicely in this order
### Even less easy
### Archives Unleashed Toolkit with Jupyter
add a note that they'll need to have Jupyter Notebooks installed? Just link out to https://jupyter.org/install?
GitHub issue(s): #372
What does this Pull Request do?
Various DataFrame implementation updates for documentation clean-up.
- Adds `archive_filename` to `.all()`
- Renames `.all()` column `HttpStatus` to `http_status_code`
Big README update to align with the documentation, and to remove lots of duplication that can be confusing and is difficult to keep up-to-date and in sync. We can keep it all here in the README now.
See: archivesunleashed/aut-docs#39
How should this be tested?