Add parquet download link to data dictionary #3984

Merged
merged 5 commits into main from add-parquet-files-data-dictionary
Dec 18, 2024

Conversation

bendnorman
@bendnorman bendnorman commented Dec 3, 2024

I added a Parquet file download link to the data dictionary so it's easier for people to access the S3 files.

Testing

I built the docs locally and was able to download a parquet file.

{%- endif %}
* `Download this table as a Parquet file. <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/stable/{{ resource.name }}.parquet>`__
Member Author

I can't figure out how to format the URL so that users download the version of the file that matches the version of the docs they are viewing. Also, any idea what the latest ref is pointing to: stable or nightly?
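
For reference, the template line above fills in only the table name and hard-codes the release prefix. A minimal Python sketch of the same URL pattern, with the prefix exposed as a parameter, might look like the following. The function name `parquet_url` and the table name `example_table` are hypothetical. Only the `stable` and `nightly` prefixes appear in this thread, and nothing here confirms that per-version prefixes exist in the bucket.

```python
# Base of the path-style S3 URLs used in the docs links in this PR.
BASE_URL = "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop"

# Assumption: these are the only two prefixes mentioned in this thread.
KNOWN_RELEASES = {"stable", "nightly"}


def parquet_url(table_name: str, release: str = "nightly") -> str:
    """Build the HTTPS download URL for one table's Parquet file."""
    if release not in KNOWN_RELEASES:
        raise ValueError(f"Unknown release prefix: {release!r}")
    return f"{BASE_URL}/{release}/{table_name}.parquet"


# "example_table" is a placeholder, not a real table name.
print(parquet_url("example_table"))
print(parquet_url("example_table", release="stable"))
```

Mapping a docs version to a prefix would be an extra lookup in front of this function. That mapping is the open question in the comment above, so it is left out here.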

@bendnorman bendnorman requested a review from zschira December 3, 2024 21:12
@bendnorman
Member Author

I wrote up an issue about this readthedocs build failure.

@zschira zschira left a comment

This will be super nice to have! I'm not sure about the refs, or how to point to the corresponding version. I wonder if, for the time being, it would be best to point to nightly builds and just make it clear in the comment that it's pointing to the latest version?

@e-belfer e-belfer added docs Documentation for users and contributors. parquet Issues related to the Apache Parquet file format which we use for long tables. labels Dec 11, 2024
@e-belfer
Member

@bendnorman Just reran after #3989 merged to see if that's fixed the readthedocs problem. Did you not want to add documentation on Parquet files to the Data Access page as well?

@e-belfer
Member

@bendnorman @zschira For what it's worth, my instinct is to link to the nightly Parquet file because Datasette shows the nightly data (unless I'm totally misreading our ETL script), and it'd be confusing to have links going to two different underlying datasets right next to one another.


Fully Processed SQLite Databases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* `Main PUDL Database <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/pudl.sqlite.zip>`__
* `US Census DP1 Database (2010) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/censusdp1tract.sqlite.zip>`__

Hourly Tables as Parquet
Member Author

I removed this section because now all tables are available as Parquet.

Member

This is true, but these tables aren't in SQLite and so I think shouting them out here is still helpful - if people are looking for them they won't be able to find them in the full DB.

@bendnorman
Member Author

Thanks for the input y'all! I changed the template to point at the nightly files. I also updated the data access page.

@bendnorman bendnorman marked this pull request as ready for review December 12, 2024 01:17
@bendnorman bendnorman requested a review from e-belfer December 12, 2024 01:17
@e-belfer e-belfer left a comment

One small request, but otherwise this looks great, thank you!


@@ -106,32 +108,19 @@ resulting outputs pass all of the data validation tests we've defined, the outpu
automatically uploaded to the `AWS Open Data Registry
<https://registry.opendata.aws/catalyst-cooperative-pudl/>`__, and used to deploy a new
version of Datasette (see above). These nightly build outputs can be accessed using the
-AWS CLI, or programmatically via the S3 API. They can also be downloaded directly over
-HTTPS using the following links:
+AWS CLI, or programmatically via the S3 API. If you don't want to mess with the API
Member

Suggested change
-AWS CLI, or programmatically via the S3 API. If you don't want to mess with the API
+AWS CLI, or programmatically via the S3 API.
+If you don't want to mess with the API
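
For readers weighing the HTTPS links against the AWS CLI or S3 API route, the two address the same objects. The links in this PR are path-style S3 URLs, so the first path segment is the bucket and the rest is the object key. The sketch below converts one into an `s3://` URI. The helper name `https_to_s3_uri` is hypothetical, and the conversion assumes path-style URLs only.

```python
from urllib.parse import urlparse


def https_to_s3_uri(url: str) -> str:
    """Convert a path-style S3 HTTPS URL into an s3:// URI.

    Assumes the form https://s3.<region>.amazonaws.com/<bucket>/<key>.
    """
    path = urlparse(url).path.lstrip("/")
    # Split once: everything before the first "/" is the bucket name.
    bucket, _, key = path.partition("/")
    if not bucket or not key:
        raise ValueError(f"Not a path-style S3 object URL: {url!r}")
    return f"s3://{bucket}/{key}"


# One of the nightly links from the data access page in this PR.
url = "https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/pudl.sqlite.zip"
print(https_to_s3_uri(url))
```

The resulting URI is the form that `aws s3 cp` expects. If the bucket permits anonymous reads, which is an assumption here, adding the `--no-sign-request` flag avoids needing AWS credentials.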

@bendnorman
Member Author

Made the changes!

@bendnorman bendnorman requested a review from e-belfer December 18, 2024 00:09
@e-belfer e-belfer left a comment

Thanks for this, all looks good to me!

@bendnorman bendnorman added this pull request to the merge queue Dec 18, 2024
Merged via the queue into main with commit 99ee4e5 Dec 18, 2024
17 checks passed
@bendnorman bendnorman deleted the add-parquet-files-data-dictionary branch December 18, 2024 19:45
Labels
docs Documentation for users and contributors. parquet Issues related to the Apache Parquet file format which we use for long tables.
Projects
Status: Done
3 participants