What's the canonical way to ETL an incremental pipeline of this data? #2839
Comments
@wpfl-dbt I no longer work on GOV.UK search, but the public timestamp in the JSON and Atom feed responses is reliable for subscribing to changes to documents - docs are written to the search API almost immediately after being published or updated. If you share more about your goals for the pipeline, that'd make it easier to give advice.
Related pages - I guess you mean the related_links attribute on content items? These aren't generated by the search API. Since they're neither complete nor deterministic (they're more of a navigational aid for human users), I wouldn't use them for your purpose. To find docs published by your department, you can either use the topic taxonomy or use the search API's organisation filter to get a complete, machine-readable list.
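For example, something like this (an untested sketch - the organisation slug is illustrative, so substitute your own department's):

```python
# Untested sketch: list every document an organisation has published via
# the search API's organisation filter, paging through the results.
import requests

SEARCH_URL = "https://www.gov.uk/api/search.json"

def list_org_docs(org_slug, page_size=100):
    """Yield (link, public_timestamp) for each document the org published."""
    start = 0
    while True:
        resp = requests.get(SEARCH_URL, params={
            "filter_organisations": org_slug,
            "fields": "link,public_timestamp",
            "count": page_size,
            "start": start,
        })
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            return
        for doc in results:
            yield doc.get("link"), doc.get("public_timestamp")
        start += page_size

for link, ts in list_org_docs("department-for-business-and-trade"):
    print(ts, link)
```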
Drafts aren't published to the search-api, so this is probably something else. I'd check the status code and raw body of the response the next time you see this error.
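i.e. something along these lines, so the status and body surface instead of a bare decode error (minimal sketch):

```python
# Minimal sketch: raise a readable error carrying the status code and a
# slice of the raw body instead of letting .json() fail opaquely.
import requests

def fetch_json(url, **params):
    resp = requests.get(url, params=params)
    if resp.status_code != 200:
        raise RuntimeError(f"{url} -> HTTP {resp.status_code}: {resp.text[:500]}")
    try:
        return resp.json()
    except ValueError:  # requests' JSONDecodeError subclasses ValueError
        raise RuntimeError(f"{url} -> non-JSON body: {resp.text[:500]}")
```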
Thanks so much for coming back on this @bilbof, hugely appreciated! The assurance that we're okay to subscribe to those published/updated fields is perfect, thank you. The aim of my pipeline is to have a table of published text and some metadata of this content for our analysts to use for ML models. On the missing documents, I've put together an example. The net zero strategy is an example of content I'd want my pipeline to ingest. This API request, which is essentially what I'm using plus a query string, returns the publication, but it doesn't return all the links I'd need to send to the content API to get the published text -- if I didn't recurse on what comes back from this content API request, I'd miss the text of individual documents like 1. Why Net Zero, 2. The journey to Net Zero, etc. Am I missing something here? Is there a better way to do this?
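For concreteness, this is roughly the recursion I mean (a rough sketch - following every entry in the `links` hash and capping the depth are my own choices, not anything the API prescribes, and the base path is just this example's):

```python
# Rough sketch: walk the "links" hash of a Content API item and collect
# every reachable base_path, so child documents like "Why Net Zero"
# aren't missed. Depth cap and following all link types are my choices.
import requests

CONTENT_URL = "https://www.gov.uk/api/content"

def collect_base_paths(base_path, seen=None, depth=0, max_depth=2):
    seen = set() if seen is None else seen
    if base_path in seen or depth > max_depth:
        return seen
    seen.add(base_path)
    resp = requests.get(CONTENT_URL + base_path)
    if resp.status_code != 200:
        return seen  # e.g. a child that 404s (see next comment)
    for linked in resp.json().get("links", {}).values():
        for child in linked:
            if child.get("base_path"):
                collect_base_paths(child["base_path"], seen, depth + 1, max_depth)
    return seen

paths = collect_base_paths("/government/publications/net-zero-strategy")
```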
Also, on the missing documents, an example of the problem (at the time of writing) is the Specialist Investigations Alcohol Guidance manual from HMRC. The page lists and links to SIAG1000 Introduction: contents, a link that currently 404s for me in the frontend. Similarly, in the content API the manual page returns SIAG1000 in its child sections, and the API call for that page also 404s. Looking through the API, I can't see a way to avoid this kind of error?
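The best workaround I can think of is to skip and log those children (a minimal sketch):

```python
# Minimal sketch: treat a 404 on a child section as a gap to log and
# skip, rather than letting it kill the pipeline run.
import logging
import requests

log = logging.getLogger("govuk_etl")

def fetch_content_or_none(base_path):
    resp = requests.get("https://www.gov.uk/api/content" + base_path)
    if resp.status_code == 404:
        log.warning("child section %s 404s in the content API; skipping", base_path)
        return None
    resp.raise_for_status()
    return resp.json()
```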
I work at DBT and have been improving an ETL pipeline we have for gov.uk content, based on parameters the department needs. I'd like to configure it so it ingests and overwrites data that's changed, rather than re-ingesting everything over and over again.
My plan is:

- query the search API with a filter on the `updated_at` field to return results changed in the last few days
- upsert those results into my table, relying on `updated_at` for new content (a sketch of this step follows below)

From the other side of the API, is that a good plan? Specifically:

- Is `updated_at` reliably updated? Is it safe to base a pipeline on?
- I occasionally get a `JSONDecodeError` for very new items, which makes me think I'm picking up drafts. Is there a field I'm missing to ignore these until they're ready?
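For concreteness, the incremental step I have in mind is roughly the following - a minimal sketch assuming the search API exposes a sortable timestamp (`public_timestamp` in its responses) and that `link` is a stable key; the cutoff bookkeeping and the sqlite table are my own scheme:

```python
# Minimal sketch of the incremental step: pull docs whose timestamp is
# newer than a stored cutoff and upsert them into a local table keyed
# by "link", then return the new cutoff for the next run.
import sqlite3
import requests

SEARCH_URL = "https://www.gov.uk/api/search.json"

def sync(db, cutoff, page_size=100):
    """Upsert every result newer than cutoff; return the new cutoff."""
    newest = cutoff
    start = 0
    while True:
        resp = requests.get(SEARCH_URL, params={
            "fields": "link,title,public_timestamp",
            "order": "-public_timestamp",  # newest first
            "count": page_size,
            "start": start,
        })
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            break
        for doc in results:
            ts = doc.get("public_timestamp") or ""
            if ts and ts < cutoff:  # ISO-8601 strings compare chronologically
                return newest
            db.execute(
                "INSERT INTO docs (link, title, public_timestamp) VALUES (?, ?, ?) "
                "ON CONFLICT(link) DO UPDATE SET title=excluded.title, "
                "public_timestamp=excluded.public_timestamp",
                (doc["link"], doc.get("title"), ts),
            )
            newest = max(newest, ts)
        start += page_size
    return newest

db = sqlite3.connect("govuk.db")
db.execute("CREATE TABLE IF NOT EXISTS docs "
           "(link TEXT PRIMARY KEY, title TEXT, public_timestamp TEXT)")
new_cutoff = sync(db, cutoff="2024-01-01T00:00:00Z")
db.commit()
```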