
Restructure intro.rst and other pages for data warehouse #2912

Conversation

@aesharpe (Member) commented Oct 2, 2023

Still WIP.

Need more input on the ETL section of the intro.rst page! I think you can probably just go ahead and work off this branch to add it, @bendnorman. What do you think?

…info. Add three components of PUDL description
@aesharpe aesharpe requested a review from bendnorman October 2, 2023 14:54
@bendnorman (Member) left a comment:

Thank you @aesharpe! I propose we:

What do you think?

README.rst (resolved review comments)
Comment on lines +64 to +92
- **Raw Data Archives**

- We `archive <https://github.com/catalyst-cooperative/pudl-archiver>`__ all the raw
  data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
  to ensure permanent, versioned access to the data. In the event that an agency
  changes how it publishes data or deletes old files, the ETL will still have access
  to the original inputs. Each of the data inputs may have several different versions
  archived, and all are assigned a unique DOI and made available through the REST API.
- **ETL Pipeline**

- The ETL pipeline (this repo) ingests the raw archives, cleans them, integrates
  them, and outputs them to a series of tables stored in SQLite databases, Parquet
  files, and pickle files (the Data Warehouse). Each release of the PUDL Python
  package is embedded with a set of DOIs to indicate which version of the raw
  inputs it is meant to process. This process helps ensure that the ETL and its
  outputs are replicable.
- **Data Warehouse**

- The outputs from the ETL, sometimes called "PUDL outputs", are stored in a data
  warehouse so that users can access the data without having to run any code. The
  majority of the outputs are stored in ``pudl.sqlite``; however, CEMS data are stored
  in separate Parquet files due to their large size. The warehouse also contains
  pickled interim assets from the ETL process, should users want to access the data
  at various stages of the cleaning process, and SQLite databases for the raw FERC
  inputs.

For more information about each of the components, read our
`documentation <https://catalystcoop-pudl--2874.org.readthedocs.build/en/2874/intro.html>`__.
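To make "users can access the data without having to run any code" concrete, here is a minimal sketch of inspecting a locally downloaded copy of ``pudl.sqlite`` with only the Python standard library. The helper name and the local file path are hypothetical, and actual table names vary by PUDL release.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file.

    Works on any SQLite file, including a locally downloaded copy of
    ``pudl.sqlite`` (the PUDL data warehouse).
    """
    conn = sqlite3.connect(db_path)
    try:
        # Every SQLite database records its schema in the sqlite_master catalog.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in rows]
    finally:
        conn.close()
```

For example, ``list_tables("pudl.sqlite")`` would list the warehouse tables after downloading the database. The larger hourly CEMS outputs live in Parquet files instead, which would be read with a columnar library such as pyarrow or pandas rather than SQL.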
@bendnorman (Member):
I think including this early in the README forces users to scroll through more text to get to the data access section, which I'm assuming is what they care about.

I think this type of architecture information is more important for contributors, who will be reading through the Development section of the docs.

@aesharpe (Member, Author):

That's a fair point. I don't want to assume users know what they want yet, though, and this provides them with the opportunity to understand what happens to the data before they use it. Maybe we could run this by some other people to see what they think.

@aesharpe (Member, Author):

We could also just copy what's in the intro and use that instead:

- **Raw Data Archives** (raw, versioned inputs)
- **ETL Pipeline** (code to process, clean, and organize the raw inputs)
- **Data Warehouse** (location where ETL outputs, both interim and final, are stored)

docs/dev/naming_conventions.rst (outdated; resolved)
@@ -74,13 +46,43 @@ needed and organize them in a local :doc:`datastore <dev/datastore>`.
.. _etl-process:

---------------------------------------------------------------------------------------
The Data Warehouse Design
The ETL Pipeline
@bendnorman (Member):

I'm tempted to move this information to the development section of the docs. Do users actually care about the raw data archives, data warehouse, and data validation?

I'm thinking we could move the "The Data Warehouse Design" and "Data Validation" sections to the Data and ETL Design Guidelines page?

@bendnorman (Member):

I'm a little hesitant to have an ETL and Data Warehouse section because they cover similar topics. I think it's easier to think about our data processing just in terms of the raw, core, output layers as opposed to ETL steps.

@aesharpe (Member, Author):

Ok! As long as the concept of the Data Warehouse doesn't get lost in the data processing description, I think that's fine. My concern in wanting to pull out the Data Warehouse section was similar to your comment above about people being primarily concerned with Data Access: being able to jump straight to a Data Warehouse page / section might be nice. But depending on how we structure the rest of the docs, this might not be an issue.

@aesharpe (Member, Author) commented Oct 4, 2023

If we do this, I would wonder what the purpose of this introduction page is. Maybe we don't need it? Idk... I do feel like some brief description of what's going on would be nice, as I don't think only developers would want to know this type of information. A lot of users might be curious about what is actually happening to the data they are using between raw and final. The Data and ETL Design Guidelines page feels a little hidden. Maybe we could take the mini paragraph descriptions for each section from the README page and put them in the intro instead of having longer descriptions there.

@bendnorman (Member):

I think you're right; I shouldn't assume users don't care about how the data is processed. In that case, what if we just keep the data warehouse / processing language from the create-naming-convention-docs branch on the intro page? Users can jump to the data access page if they like, or continue reading about the data processing steps:

To get started using PUDL data, visit our Data Access page, or continue reading to learn more about the PUDL data processing pipeline.

Or we can move the data warehouse design language to the ETL Guidelines page and just link to it in the intro page.

I think we're starting to bump up against larger unanswered questions about our docs that are out of the scope of the renaming docs. To keep things simple, what if we:

  • Use the README changes on this branch
  • Keep the intro.rst page from the other branch
  • Use the Naming Convention section changes from this branch

bendnorman added a commit that referenced this pull request Nov 1, 2023
@bendnorman (Member):

Changes in the branch were incorporated into #2874.

@bendnorman bendnorman closed this Nov 7, 2023
@bendnorman bendnorman deleted the create-naming-convention-docs-austen branch November 7, 2023 02:07
bendnorman added a commit that referenced this pull request Dec 16, 2023
…cols (#2818)

* Rename static tables

* Rename Census DP1 assets

* Test doc fix

* Update core table names for EIA 860, 923, harvested tables, FERC1, code

* Fix integration tests

* Fix alembic

* Rename 714, 861, epacems

* update tests and rest of assets

* Fix validation tests

* Rename ferc output assets

* Rename denorm_cash_flow_ferc1 and remove leading underscore from cross refs in pudl_db docs

* Rename a missing ferc output table and add migration

* Rename EIA denorm assets

* Recreate ferc rename migration

* Add docs cross ref fix for intermediate assets

* Resolve small denorm EIA rename issues

* Clean up notebooks

* Apply naming convention to allocate generation fuel assets

* Fix a missing gen fuel asset name in PudlTabl

* Update migrations post ferc1 output rename merge

* Update contributor facing documentation with new asset naming conventions

* Add new naming convention to user facing documentation

* Correct allocate-get-fuel down revision

* Apply new naming convention to ferc714 respondents, hourly demand and eia861 service territories

* Fix refs to renamed tables in release notes

* Rename ferc714 and eia861 output tables in integration tests

* Add missing balance authority fk migration

* Rename out_ferc714__fipsified_respondents to out_ferc714__respondents_with_fips

* Respond to first round of Austen's comments

* Update rename-core-assets and clarify raw asset sentence

* Restrict astroid version to avoid random autoapi error

* Reset migrations and fix old table refs in docs

* Fix names of inputs to exploded tables and xbrl calculation fixes

* Rename mcoe and ppl assets

* Fix small ppl migration issue

* Format and sort intermediate resource name cross refs in data dictionary

* Add upstream mcoe assets back to metadata

* Update straggler PudlTabl method name

* Add frequency to ppl asset name and some clean up

* rename six of the non-controversial FERC1 tables (core + out)

* initial rename of the FERC1 core and out tables

* add db migration

* rename the ferc1 transformer classes in line with new table names

* Incorporate some docs changes from #2912

* FINAL FINAL rename of ferc assets

* ooooops remove the eia860m extraction edit bc that was not supposed to be in here ooop

* Remove README.rst from index.rst and move intro content to index

* Add deprecation warnings to PudlTabl and add minor naming docs updates

* Rename heat_rate_mmbtu_mwh -> heat_rate_mmbtu_mwh_by_unit

* Rename heat rate mmbtu mwh to follow existing naming convention

* Remove PudlTabl removal data and make assn table name sources alphabetical

* Explain why CEMS is stored as parquet

* Rename heat_rate_mmbtu_mwh_eia/ferc1 columns to unit_heat_rate_mmbtu_per_mwh_eia/ferc1

* Remove unused ppe_cols_to_grab variable

* Make association asset names more consistent

* Add association asset naming convention to docs

* Resolve migration issues with unit heat rate column

* Update conda-lock.yml and rendered conda environment files.

* Recreate heat rate migration revision

* Use pudl_sqlite_io_manager for fuel_cost_by_generator assets

* Update conda-lock.yml and rendered conda environment files.

* Checkout lock files from dev

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove intro.rst and update ferc s3 urls again

* Update conda-lock.yml and rendered conda environment files.

* Remove some old table names from metadata

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove ref to non-existent doc page, remove files no longer in dev

---------

Co-authored-by: bendnorman <bdn29@cornell.edu>
Co-authored-by: Bennett Norman <bennett.norman@catalyst.coop>
Co-authored-by: Christina Gosnell <cgosnell@catalyst.coop>
Co-authored-by: bendnorman <bendnorman@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>