Use new Download API #54

Merged
merged 131 commits into from
Sep 25, 2024

Conversation

avaldebe
Collaborator

@avaldebe avaldebe commented Aug 23, 2024

The functionality for the new Parquet downloads API (see #52 for details) lives in airbase.download_api:

  • airbase.download_api.types: type annotations for the API requests
  • airbase.download_api.dataset: data structures representing an API dataset and the data required for requesting URLs
  • airbase.download_api.client: single requests to the API
  • airbase.download_api.session: machinery for requesting multiple URLs and downloading the files

The CLI has 3 new subcommands:

  • airbase historical: Historical Airbase data delivered between 2002 and 2012 before Air Quality Directive 2008/50/EC entered into force.
  • airbase verified: Verified data (E1a) from 2013 to 2022 reported by countries by 30 September each year for the previous year.
  • airbase unverified: Unverified data (Up-To-Date/UTD/E2a) transmitted continuously from the beginning of 2023.
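
As a rough illustration of how the three subcommands partition the data by year, the sketch below maps a year to the subcommand whose description covers it. The function name `dataset_for_year` is hypothetical and is not part of airbase; the year ranges are taken from the descriptions above.

```python
def dataset_for_year(year: int) -> str:
    """Return the name of the CLI subcommand whose dataset covers `year`.

    Hypothetical helper: the ranges mirror the subcommand descriptions,
    not any function that exists in airbase.
    """
    if 2002 <= year <= 2012:
        return "historical"  # Historical Airbase data
    if 2013 <= year <= 2022:
        return "verified"  # Verified data (E1a)
    if year >= 2023:
        return "unverified"  # Up-To-Date data (E2a)
    raise ValueError(f"no dataset covers the year {year}")
```

A caller that needs a multi-year range would have to query more than one dataset when the range crosses 2012/2013 or 2022/2023.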

Also, airbase.summary.DB was updated to handle the information needed for the Parquet downloads API.

Still missing:

  • download new metadata
  • full integration with airbase.AirbaseClient and airbase.AirbaseRequest, or
  • restoring airbase.AirbaseClient and airbase.AirbaseRequest to downloading CSVs from the old API until its EOL at the end of this year, or
  • re-implement CSV downloading only from the CLI

@avaldebe avaldebe linked an issue Aug 23, 2024 that may be closed by this pull request
@JohnPaton
Owner

Nice, solid start.

I'm inclined to create an httpx.AsyncClient for the download service for connection re-use and resource limiting. I think it might be cleaner to put all methods on there, and then for the top-level user-facing functions we could instantiate a global client to use under the hood. What do you think?
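
One way to picture the proposal is sketched below, under stated assumptions: every method lives on a client object that caps concurrent requests, and a top-level function lazily creates one shared instance. `DownloadClient`, `get_default_client`, `download`, and `_placeholder_fetch` are hypothetical names, and the HTTP call is injected as a plain coroutine so the sketch runs without a network; in the real design that role would be played by an httpx.AsyncClient.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence

# Any coroutine that turns a URL into bytes (stand-in for a real HTTP GET).
Fetch = Callable[[str], Awaitable[bytes]]


class DownloadClient:
    """Hypothetical client: all download methods share one concurrency cap."""

    def __init__(self, fetch: Fetch, max_concurrent: int = 10) -> None:
        self._fetch = fetch
        self._max_concurrent = max_concurrent

    async def get_many(self, urls: Sequence[str]) -> List[bytes]:
        # Create the semaphore inside the running event loop.
        semaphore = asyncio.Semaphore(self._max_concurrent)

        async def limited(url: str) -> bytes:
            async with semaphore:  # at most max_concurrent fetches in flight
                return await self._fetch(url)

        # gather() returns results in the same order as the input URLs.
        return list(await asyncio.gather(*(limited(u) for u in urls)))


async def _placeholder_fetch(url: str) -> bytes:
    # Assumption: placeholder for a real request; it only echoes the URL.
    await asyncio.sleep(0)
    return url.encode()


_default_client: Optional[DownloadClient] = None


def get_default_client() -> DownloadClient:
    """Create the shared client on first use and reuse it afterwards."""
    global _default_client
    if _default_client is None:
        _default_client = DownloadClient(_placeholder_fetch)
    return _default_client


def download(urls: Sequence[str]) -> List[bytes]:
    """Top-level, user-facing function that uses the shared client."""
    return asyncio.run(get_default_client().get_many(urls))
```

Keeping the cap on the client rather than on each call site means every code path that downloads files is throttled the same way.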

I'll branch off here to try it out

@avaldebe
Collaborator Author

Nice, solid start.

I'm inclined to create an httpx.AsyncClient for the download service for connection re-use and resource limiting. I think it might be cleaner to put all methods on there, and then for the top-level user-facing functions we could instantiate a global client to use under the hood. What do you think?

I'll branch off here to try it out

I updated this PR, taking inspiration from yours. Please have a look at the CLI and the download function behind it.
It turns out that when cities are added to a request without a matching country, the API returns URLs for the complete dataset.
It took a bit of experimentation to discover this "undocumented feature", and some more to produce the right requests.
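
A minimal sketch of a guard against this behaviour, assuming a request is assembled from a list of countries and a list of cities. The function name `build_city_filter` and the payload keys `"countries"` and `"cities"` are assumptions for illustration, not the actual airbase or API names.

```python
from typing import Dict, List, Sequence


def build_city_filter(
    countries: Sequence[str], cities: Sequence[str]
) -> Dict[str, List[str]]:
    """Build the country/city part of a request, refusing unmatched cities.

    Hypothetical helper: without a country, the API was observed to
    return URLs for the complete dataset, so fail early instead.
    """
    if cities and not countries:
        raise ValueError(
            "cities were given without a country; "
            "the request would match the complete dataset"
        )
    return {"countries": list(countries), "cities": list(cities)}
```

Failing before the request is sent avoids silently downloading far more data than the user asked for.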

@avaldebe avaldebe marked this pull request as draft September 6, 2024 16:50
@JohnPaton
Owner

Hi, thanks for all the effort on this! I will take a look today or tomorrow 🤝

Owner

@JohnPaton JohnPaton left a comment

This looks really great, thanks for all the hard work. I think we've ended up with something really clean.

I just saw one little typo that I've requested a change for, after that we can merge.

Once we've merged we may need a separate PR updating the docs, and I'd like to see if I can get my uv PR in shape as well, and then we can go ahead and drop 1.0.0 I think.

airbase/parquet_api/session.py (outdated, resolved)
@JohnPaton
Owner

Also, something is up with the integration tests. It looks like the service may be flaky, but they are pretty red.

@avaldebe
Collaborator Author

avaldebe commented Sep 17, 2024

Also, something is up with the integration tests. It looks like the service may be flaky, but they are pretty red.

EEA updated the API schema but kept the version number.
There are 3 additional fields for a request.
The latest version of the docs (from June 2024?) says:

  • dateTimeStart: value that indicates from which date and time user want to filter the download. If parameter is not pa Format: yyyy-mm-ddTHH:MM:SSZ. Example: 2024-05-27T12:00:19Z.
  • dateTimeEnd: value that indicates until which date and time user want to filter the download. Format: yyyy-mm-ddTHH:MM:SSZ. Example: 2024-05-28T12:00:19Z.
    If dataTimeStart and dateTimeEnd parameters are not included in the request, the filter for the temporal coverage will not be applied and the entire set of data will be downloaded.
  • aggregationType: represents whether the data collected is obtaining the values:
    1. Hourly data.
    2. Daily data.
    3. Variable intervals (different than the previous observations such as weekly, monthly, etc.)
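
The timestamp format quoted above can be produced as in the following sketch. The helper names `format_timestamp` and `temporal_fields` are hypothetical, and treating naive datetimes as UTC is an assumption. aggregationType is left out on purpose, because the quoted docs do not make clear which literal values the API expects.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def format_timestamp(moment: datetime) -> str:
    """Render a datetime as yyyy-mm-ddTHH:MM:SSZ, e.g. 2024-05-27T12:00:19Z."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def temporal_fields(
    start: Optional[datetime] = None, end: Optional[datetime] = None
) -> Dict[str, str]:
    """Return only the temporal fields that were actually requested.

    Omitting both keys matches the documented behaviour that no temporal
    filter is applied and the entire set of data is downloaded.
    """
    fields: Dict[str, str] = {}
    if start is not None:
        fields["dateTimeStart"] = format_timestamp(start)
    if end is not None:
        fields["dateTimeEnd"] = format_timestamp(end)
    return fields
```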

A different part of the docs says:

Temporal coverage
It is possible to filter and download the data with a specific temporal coverage. For this it is necessary to
select the beginning (Start) and the end (End) of the period to be downloaded.
In case Up To Date data (E2a) has been selected, it is possible to select a typical download temporal
coverage. If no temporal coverage is selected full data will be downloaded.

Also, the Swagger UI shows a new undocumented entry point, /ParquetFile/async.

However, the breaking change is that the dataset field went from accepting a list of integers to accepting a single integer.
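
One way a client could adapt to this change is to fan a multi-dataset request out into one request per dataset, as in the sketch below. The function name `split_by_dataset` and the dict-shaped request are assumptions for illustration; only the `dataset` field name comes from the discussion above.

```python
from typing import Any, Dict, List


def split_by_dataset(request: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a request whose "dataset" is a list into one request per dataset.

    Hypothetical helper: a request that already carries a single integer
    is returned unchanged (as a copy) in a one-element list.
    """
    datasets = request["dataset"]
    if isinstance(datasets, int):
        return [dict(request)]
    # Copy every other field and replace the list with a single integer.
    return [{**request, "dataset": dataset} for dataset in datasets]
```

Each resulting request can then be sent separately and the returned URLs merged on the client side.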

@JohnPaton
Owner

I wonder how unstable this thing is going to be... good thing the integration tests are actually catching this stuff though.

Do you want to make those changes in this MR still? I can also push some commits to this branch if you want to hand off, you've done a lot already

@avaldebe
Collaborator Author

I wonder how unstable this thing is going to be... good thing the integration tests are actually catching this stuff though.

Do you want to make those changes in this MR still? I can also push some commits to this branch if you want to hand off, you've done a lot already

I'm motivated to see this through, and I need it for work...
I'll update to the latest version of the Parquet API and update AirbaseClient.download_metadata in this PR.

@avaldebe avaldebe marked this pull request as ready for review September 19, 2024 12:24
@avaldebe avaldebe requested a review from JohnPaton September 19, 2024 12:24
Owner

@JohnPaton JohnPaton left a comment

I have a few nits but I'm not worried about them, this has ended up really clean and well-structured. Great collab and great work, thanks so much as always. Let's get this merged! I'll let you do the honours

This was referenced Sep 24, 2024
@avaldebe avaldebe merged commit 7acc37c into master Sep 25, 2024
19 checks passed
@avaldebe avaldebe deleted the new-api branch September 25, 2024 08:14
Development

Successfully merging this pull request may close these issues.

New download service and format
2 participants