Restructure intro.rst and other pages for data warehouse #2912

112 changes: 82 additions & 30 deletions README.rst
What is PUDL?
-------------

The `PUDL <https://catalyst.coop/pudl/>`__ Project is an open source data processing
pipeline created by `Catalyst Cooperative
<https://catalyst.coop/>`__ that cleans, integrates, and standardizes some of the most
widely used public energy datasets in the US. Hundreds of gigabytes of valuable data
are published by US government agencies, but they are often difficult to work with.
PUDL takes the original spreadsheets, CSV files, and databases and turns them into a
unified resource.

PUDL comprises three core components:

- **Raw Data Archives**

- We `archive <https://github.com/catalyst-cooperative/pudl-archiver>`__ all the raw
data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
to ensure permanent, versioned access to the data. In the event that an agency
changes how they publish data or deletes old files, the ETL will still have access
to the original inputs. Each data input may have several different versions
archived, each assigned a unique DOI and made available through the REST API.
- **ETL Pipeline**

- The ETL pipeline (this repo) ingests the raw archives, cleans them, integrates
them, and outputs them to a series of tables stored in SQLite databases, Parquet
files, and pickle files (the Data Warehouse). Each release of the PUDL Python
package is embedded with a set of DOIs indicating which version of the raw
inputs it is meant to process. This helps ensure that the ETL and its
outputs are replicable.
- **Data Warehouse**

- The outputs from the ETL, sometimes called "PUDL outputs", are stored in a data
warehouse so that users can access the data without having to run any code. The
majority of the outputs are stored in ``pudl.sqlite``; however, CEMS data are stored
in separate Parquet files due to their large size. The warehouse also contains
pickled interim assets from the ETL process, should users want to access the data
at various stages of the cleaning process, and SQLite databases for the raw FERC
inputs.

For more information about each of the components, read our
`documentation <https://catalystcoop-pudl--2874.org.readthedocs.build/en/2874/intro.html>`__.
Comment on lines +64 to +92
Member: I think including this early in the readme forces users to scroll through
more text to get to the data access section, which I'm assuming is what they care
about.

I think this type of architecture information is more important for contributors,
who will be reading through the Development section of the docs.

Member Author: That's a fair point. I don't want to assume users know what they
want yet, though, and this provides them with the opportunity to understand what
happens to the data before they use it. Maybe we could run this by some other
people to see what they think.

Member Author: We could also just copy what's in the intro and use that instead:

- **Raw Data Archives** (raw, versioned inputs)
- **ETL Pipeline** (code to process, clean, and organize the raw inputs)
- **Data Warehouse** (location where ETL outputs, both interim and final, are stored)



What data is available?
-----------------------

PUDL currently integrates data from:

* **EIA Form 860**: 2001-2022
- `Source <https://www.eia.gov/electricity/data/eia860/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia860.html>`__
* **EIA Form 860m**: 2023-06
- `Source <https://www.eia.gov/electricity/data/eia860m/>`__
* **EIA Form 861**: 2001-2022
- `Source <https://www.eia.gov/electricity/data/eia861/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia861.html>`__
* **EIA Form 923**: 2001-2022
- `Source <https://www.eia.gov/electricity/data/eia923/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia923.html>`__
* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
- `Source <https://campd.epa.gov/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/epacems.html>`__
* **FERC Form 1**: 1994-2021
- `Source <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1.html>`__
* **FERC Form 714**: 2006-2020
- `Source <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc714.html>`__
* **FERC Form 2**: 2021 (raw only)
- `Source <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__
* **FERC Form 6**: 2021 (raw only)
- `Source <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__
* **FERC Form 60**: 2021 (raw only)
- `Source <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__
* **US Census Demographic Profile 1 Geodatabase**: 2010
- `Source <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program <https://sloan.org/programs/research/energy-and-environment>`__, from
2021 to 2024 we will be cleaning and integrating the following data as well:

* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
(The Annual Report of Natural Gas Supply and Disposition)
* `FERC Electric Quarterly Reports (EQR) <https://www.ferc.gov/industries-data/electric/power-sales-and-markets/electric-quarterly-reports-eqr>`__
* `FERC Form 2 <https://www.ferc.gov/industries-data/natural-gas/overview/general-information/natural-gas-industry-forms/form-22a-data>`__
(Annual Report of Major Natural Gas Companies)
* `PHMSA Natural Gas Annual Report <https://www.phmsa.dot.gov/data-and-statistics/pipeline/gas-distribution-gas-gathering-gas-transmission-hazardous-liquids>`__
* Machine Readable Specifications of State Clean Energy Standards

Who is PUDL for?
----------------
resources and everyone in between!
How do I access the data?
-------------------------

There are several ways to access the information in the PUDL Data Warehouse. For more
details you'll want to check out `the complete documentation
<https://catalystcoop-pudl.readthedocs.io>`__, but here's a quick overview:

Datasette
^^^^^^^^^

This access mode is good for casual data explorers or anyone who just wants to grab a
small subset of the data. It also lets you share links to a particular subset of the
data and provides a REST API for querying the data from other applications.
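
As a quick sketch, the REST API can be queried from Python like this; the
instance URL and table name below are assumptions for illustration, so check the
data access docs for the real ones:

.. code-block:: python

    import requests

    # Fetch 5 rows from a table as a JSON array via Datasette's JSON API.
    url = "https://data.catalyst.coop/pudl/plants_entity_eia.json"
    resp = requests.get(url, params={"_size": 5, "_shape": "array"}, timeout=30)
    resp.raise_for_status()
    for row in resp.json():
        print(row)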

Nightly Data Builds
^^^^^^^^^^^^^^^^^^^
We automatically run the ETL every weeknight and upload the outputs to public S3
storage buckets as part of the `AWS Open Data Registry
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__. This data is based on
the `dev branch <https://github.com/catalyst-cooperative/pudl/tree/dev>`__ of PUDL and
is what we use to populate Datasette. Use this data access method if you want to
download the SQLite files directly.

You can download the outputs using the AWS CLI, the S3 API, or directly via the web.
See `Accessing Nightly Builds <https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds>`__
for links to the individual SQLite, JSON, and Apache Parquet outputs.
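
Because the bucket is public, downloads can also be scripted with ``boto3``
using unsigned requests. This is a sketch only; the bucket and key names below
are assumptions, so see the link above for canonical paths:

.. code-block:: python

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Public bucket: use an unsigned (credential-free) client.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("pudl.catalyst.coop", "nightly/pudl.sqlite", "pudl.sqlite")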

Docker + Jupyter
^^^^^^^^^^^^^^^^
Want access to all the published data in bulk? If you're familiar with Python
most users. You should check out the `Development section <https://catalystcoop-
of the main `PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__ for more
details.


Contributing to PUDL
--------------------
Find PUDL useful? Want to help make it better? There are lots of ways to help!
106 changes: 58 additions & 48 deletions docs/dev/naming_conventions.rst
Asset Naming Conventions
------------------------
PUDL's data processing is divided into three layers of Dagster assets: Raw, Core
and Output. Dagster assets are the core unit of computation in PUDL. The outputs
of assets can be persisted to any type of storage, though PUDL outputs are typically
tables in a SQLite database, Parquet files, or pickle files (see :doc:`../intro`).
The asset name is used for the table or Parquet file name. Asset names should
generally follow this naming convention:

.. code-block::

    {layer}_{source}__{asset_type}_{asset_name}

Raw layer
^^^^^^^^^
This layer contains assets that extract data from spreadsheets and databases
and are persisted as pickle files.

Naming convention: ``raw_{source}__{asset_name}``

* ``asset_name`` is typically copied from the source data.
* ``asset_type`` is not included in this layer because the data modeling does not
yet conform to PUDL standards. Raw assets are typically just copies of the
source data.

Core layer
^^^^^^^^^^

This layer contains assets that typically break denormalized raw assets into
well-modeled tables that serve as building blocks for downstream wide tables
and analyses. Well-modeled means tables in the database have logical
primary keys, foreign keys, datatypes and generally follow
:ref:`Tidy Data standards <tidy-data>`. Assets in this layer create
consistent categorical variables, deduplicate, and impute data.
These assets are typically stored in parquet files or tables in a database.

Naming convention: ``core_{source}__{asset_type}_{asset_name}``

* ``asset_type`` describes how the asset is modeled and its role in PUDL’s
collection of core assets. There are a handful of table types in this layer:
* ``assn``: Association tables provide connections between entities. This data
can be manually compiled or extracted from data sources. Examples:

* ``yearly/monthly/hourly``: Time-series tables containing attributes that are
  expected to change with each reported timestamp. Examples:
* ``core_ferc714__hourly_demand_pa``,
* ``core_ferc1__yearly_plant_in_service``.

Core Layer (Intermediate Assets)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Intermediate assets are logical steps towards a final well-modeled core or
user-facing output asset. These assets are not intended to be persisted in the
database or accessible to the user. These assets are denoted by a preceding
underscore, like a private Python method. For example, the intermediate asset
``_core_eia860__plants`` is a logical step towards the
``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets.
``_core_eia860__plants`` does some basic cleaning of the ``raw_eia860__plant``
asset but still contains duplicate plant entities. The computation-intensive
harvesting process deduplicates ``_core_eia860__plants`` and outputs the
``core_eia860__entity_plants`` and ``core_eia860__scd_plants`` assets, which
follow Tidy Data standards.

Limit the number of intermediate assets to avoid an extremely
cluttered DAG. It is appropriate to create an intermediate asset when:

* there is a short- and a long-running portion of a process. It is convenient to
  separate the long- and short-running processing portions into separate assets so
  that debugging the short-running portion doesn't require rerunning the
  long-running portion.
* a step of the process is a common debugging checkpoint. For example, the
  pre-harvest assets in the ``_core_eia860`` and ``_core_eia923`` groups are
  frequently inspected when new years of data are added.

Output layer
^^^^^^^^^^^^
This layer uses assets in the Core layer to construct wide and complete tables
suitable for users to perform analysis on. This layer can contain intermediate
tables that bridge the core and user-facing tables.

Naming convention: ``out_{source}__{asset_type}_{asset_name}``

* ``source`` is optional in this layer because there can be assets that join data from
multiple sources.
* ``asset_type`` is also optional. It will likely describe the frequency at which
the data is reported (annual/monthly/hourly).
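
To make the three layers concrete, here is a sketch of how the naming convention
maps onto Dagster assets. These functions are hypothetical stand-ins with
placeholder transformations, not actual PUDL assets:

.. code-block:: python

    import pandas as pd
    from dagster import asset

    @asset
    def raw_eia860__generators() -> pd.DataFrame:
        """Raw layer: an as-is copy of the source data (assumed input path)."""
        return pd.read_excel("eia860_generators.xlsx")

    @asset
    def core_eia860__scd_generators(raw_eia860__generators: pd.DataFrame) -> pd.DataFrame:
        """Core layer: a deduplicated, well-modeled building block."""
        return raw_eia860__generators.drop_duplicates()

    @asset
    def out_eia860__yearly_generators(core_eia860__scd_generators: pd.DataFrame) -> pd.DataFrame:
        """Output layer: a wide, user-facing table."""
        return core_eia860__scd_generators

Dagster infers the dependency graph from the parameter names, so the asset names
double as both table names and DAG wiring.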


Columns and Field Names
------------------------------

If two columns in different tables record the same quantity in the same units,
give them the same name. That way if they end up in the same dataframe for
comparison it's easy to automatically rename them with suffixes indicating
where they came from. For example, net electricity generation is reported to
both :doc:`FERC Form 1 <../data_sources/ferc1>` and
:doc:`EIA 923<../data_sources/eia923>`, so we've named columns ``net_generation_mwh``
in each of those data sources. Similarly, give non-comparable quantities reported in
different data sources **different** column names. This helps make it clear that the
quantities are actually different.
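
A minimal pandas sketch (with fabricated numbers) of why shared names help:

.. code-block:: python

    import pandas as pd

    # Both sources report net_generation_mwh, so a merge disambiguates the
    # two columns automatically with source suffixes.
    ferc1 = pd.DataFrame({"plant_id": [1], "net_generation_mwh": [1000.0]})
    eia923 = pd.DataFrame({"plant_id": [1], "net_generation_mwh": [998.5]})
    compare = ferc1.merge(eia923, on="plant_id", suffixes=("_ferc1", "_eia923"))
    print(compare.columns.tolist())
    # ['plant_id', 'net_generation_mwh_ferc1', 'net_generation_mwh_eia923']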

* ``total`` should come at the beginning of the name (e.g.
``total_expns_production``)