diff --git a/README.rst b/README.rst index f56642a3af..6e01458e97 100644 --- a/README.rst +++ b/README.rst @@ -47,6 +47,19 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation. +Who is PUDL for? +---------------- + +The project is focused on serving researchers, activists, journalists, policy makers, +and small businesses that might not otherwise be able to afford access to this data from +commercial sources and who may not have the time or expertise to do all the data +processing themselves from scratch. + +We want to make this data accessible and easy to work with for as wide an audience as +possible: anyone from grassroots youth climate organizers working with Google Sheets +to university researchers with access to scalable cloud computing resources and everyone +in between! + What data is available? ----------------------- @@ -73,90 +86,37 @@ Program `__, from * `PHMSA Natural Gas Annual Report `__ * Machine Readable Specifications of State Clean Energy Standards -Who is PUDL for? ----------------- - -The project is focused on serving researchers, activists, journalists, policy makers, -and small businesses that might not otherwise be able to afford access to this data -from commercial sources and who may not have the time or expertise to do all the -data processing themselves from scratch. - -We want to make this data accessible and easy to work with for as wide an audience as -possible: anyone from a grassroots youth climate organizers working with Google -sheets to university researchers with access to scalable cloud computing -resources and everyone in between! - How do I access the data? ------------------------- -There are several ways to access PUDL outputs. 
For more details you'll want -to check out `the complete documentation -`__, but here's a quick overview: - -Datasette -^^^^^^^^^ -We publish a lot of the data on https://data.catalyst.coop using a tool called -`Datasette `__ that lets us wrap our databases in a relatively -friendly web interface. You can browse and query the data, make simple charts and -maps, and download portions of the data as CSV files or JSON so you can work with it -locally. For a quick introduction to what you can do with the Datasette interface, -check out `this 17 minute video `__. - -This access mode is good for casual data explorers or anyone who just wants to grab a -small subset of the data. It also lets you share links to a particular subset of the -data and provides a REST API for querying the data from other applications. - -Docker + Jupyter -^^^^^^^^^^^^^^^^ -Want access to all the published data in bulk? If you're familiar with Python -and `Jupyter Notebooks `__ and are willing to install Docker you -can: - -* `Download a PUDL data release `__ from - CERN's `Zenodo `__ archiving service. -* `Install Docker `__ -* Run the archived image using ``docker-compose up`` -* Access the data via the resulting Jupyter Notebook server running on your machine. - -If you'd rather work with the PUDL `SQLite `__ Databases and -`Apache Parquet `__ files directly, they are accessible -within the same Zenodo archive. - -The `PUDL Examples repository `__ -has more detailed instructions on how to work with the Zenodo data archive and Docker -image. - -The PUDL Development Environment -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -If you're more familiar with the Python data science stack and are comfortable working -with git, ``conda`` environments, and the Unix command line, then you can set up the -whole PUDL Development Environment on your own computer. This will allow you to run the -full data processing pipeline yourself, tweak the underlying source code, and (we hope!) 
-make contributions back to the project. - -This is by far the most involved way to access the data and isn't recommended for -most users. You should check out the `Development section `__ -of the main `PUDL documentation `__ for more -details. - -Nightly Data Builds -^^^^^^^^^^^^^^^^^^^ -If you are less concerned with reproducibility and want the freshest possible data -we automatically upload the outputs of our nightly builds to public S3 storage buckets -as part of the `AWS Open Data Registry -`__. This data is based on -the `dev branch `__, of PUDL, and -is updated most weekday mornings. It is also the data used to populate Datasette. - -The nightly build outputs can be accessed using the AWS CLI, the S3 API, or downloaded -directly via the web. See `Accessing Nightly Builds `__ -for links to the individual SQLite, JSON, and Apache Parquet outputs. +For details on how to access PUDL data, see the `data access documentation +`__. A quick +summary: + +* `Datasette `__ + provides browsable and queryable data from our nightly builds on the web: + https://data.catalyst.coop +* `Kaggle `__ + provides easy Jupyter notebook access to the PUDL data, updated weekly: + https://www.kaggle.com/datasets/catalystcooperative/pudl-project +* `Zenodo `__ + provides stable long-term access to our versioned data releases with a citeable DOI: + https://doi.org/10.5281/zenodo.3653158 +* `Nightly Data Builds `__ + push their outputs to the AWS Open Data Registry: + https://registry.opendata.aws/catalyst-cooperative-pudl/ + See `the nightly build docs `__ + for direct download links. +* `The PUDL Development Environment `__ + lets you run the PUDL data processing pipeline locally. Contributing to PUDL -------------------- + Find PUDL useful? Want to help make it better? There are lots of ways to help! -* First, be sure to read our `Code of Conduct `__. +* Check out our `contribution guide `__ + including our `Code of Conduct `__. 
* You can file a bug report, make a feature request, or ask questions in the `GitHub issue tracker `__. * Feel free to fork the project and make a pull request with new code, better @@ -165,8 +125,6 @@ Find PUDL useful? Want to help make it better? There are lots of ways to help! to support our work liberating public energy data. * `Hire us to do some custom analysis `__ and allow us to integrate the resulting code into PUDL. -* For more information check out the Contributing section of the - `PUDL Documentation `__ Licensing --------- @@ -193,10 +151,15 @@ Contact Us * Want to schedule a time to chat with us one-on-one about your PUDL use case, ideas for improvement, or get some personalized support? Join us for `Office Hours `__ +* `Follow us here on GitHub `__ +* Follow us on Mastodon: `@CatalystCoop@mastodon.energy `__ +* Follow us on BlueSky: `@catalyst.coop `__ +* `Follow us on LinkedIn `__ +* `Follow us on HuggingFace `__ * Follow us on Twitter: `@CatalystCoop `__ +* `Follow us on Kaggle `__ * More info on our website: https://catalyst.coop -* To hire us to provide customized data - extraction and analysis, you can email the maintainers: +* Email us if you'd like to hire us to provide customized data extraction and analysis: `hello@catalyst.coop `__ About Catalyst Cooperative diff --git a/docs/data_access.rst b/docs/data_access.rst index 49f49e55d1..3fb5655b74 100644 --- a/docs/data_access.rst +++ b/docs/data_access.rst @@ -30,14 +30,17 @@ which one is right for you and your use case. Select data to download as CSVs for local analysis in spreadsheets. Create sharable links to a particular selection of data. Access PUDL data via a REST API. + * - :ref:`access-kaggle` + - Data scientist, data analyst, Jupyter notebook user + - Easy Jupyter notebook access to all PUDL data products, including example + notebooks. Updated weekly based on the nightly builds. 
* - :ref:`access-nightly-builds` - Cloud Developer, Database User, Beta Tester - - Get the freshest data that has passed all data validations, updated most weekday - mornings. Fast downloads from AWS S3 storage buckets. + - Get the freshest data that has passed all of our data validations, updated most + weekday mornings. Fast, free downloads from AWS S3 storage buckets. * - :ref:`access-zenodo` - Researcher, Database User, Notebook Analyst - Use a stable, citable, fully processed version of PUDL on your own computer. - Use PUDL in Jupyer Notebooks running in a stable, archived Docker container. Access the SQLite DB and Parquet files directly using any toolset. * - :ref:`access-development` - Python Developer, Data Wrangler @@ -69,6 +72,19 @@ data you've selected. SQLite to improve accessibility of the raw inputs, but they should generally not be used directly if the data you need has been integrated into the PUDL database. +.. _access-kaggle: + +--------------------------------------------------------------------------------------- +Kaggle +--------------------------------------------------------------------------------------- + +Want to explore the PUDL data interactively in a Jupyter Notebook without needing to do +any setup? Our nightly build outputs (see below) automatically update `the PUDL Project +Dataset on Kaggle `__ +once a week. There are `several notebooks `__ +associated with the dataset, both curated by Catalyst and contributed by other Kaggle +users, which you can use to get oriented to the PUDL database. + .. _access-nightly-builds: --------------------------------------------------------------------------------------- @@ -129,42 +145,22 @@ HTTPS using the following links: be quite large when uncompressed. To decompress them locally, you can use the ``gunzip`` command. - .. code-block:: console $ gunzip *.sqlite.gz - .. 
_access-zenodo: --------------------------------------------------------------------------------------- -Zenodo Archives +Zenodo --------------------------------------------------------------------------------------- -We use Zenodo to archive our fully processed data as SQLite databases and -Parquet files. We also archive a Docker image that contains the software environment -required to use PUDL within Jupyter Notebooks. You can find all our archived data -products in `the Catalyst Cooperative Community on Zenodo -`__. - -* The current version of the archived data and Docker container can be - downloaded from `This Zenodo archive `__ -* Detailed instructions on how to access the archived PUDL data using a Docker - container can be found in our `PUDL Examples repository - `__. -* The SQLite databases and Parquet files containing the PUDL data, the complete FERC 1 - database, and EPA CEMS hourly data are contained in that same archive, if you want - to access them directly without using PUDL. - -.. note:: - - If you're already familiar with Docker, you can also pull - `the image we use `__ to run - Jupyter directly: - - .. code-block:: console - - $ docker pull catalystcoop/pudl-jupyter:latest +We use Zenodo to archive and version our raw data inputs, the fully processed outputs, +and the PUDL software repositories. You can find all of our archives in +`the Catalyst Cooperative Community `__. +Zenodo assigns long-lived DOIs to each archive, suitable for citation in academic +journals and other publications. The most recent versioned PUDL data release can be +found using this Concept DOI: https://doi.org/10.5281/zenodo.3653158 .. _access-development: