
Avoid re-scraping the file-list pages #13

Closed
mdavis-xyz opened this issue Apr 30, 2024 · 0 comments

@mdavis-xyz (Contributor)

I'm trying to download a few years of data for one small table. It takes about 1 minute to download each 100 kB file, which is quite slow.
When I set the logging level to DEBUG, I can see that this library makes 37 requests for each file it downloads, mostly to re-crawl the index pages.

e.g. when I try to download data for 2024, this library fetches the page /Data_Archive/Wholesale_Electricity/MMSDM/2009/ twice.

[Screenshot: DEBUG log of the repeated requests]

What's the purpose of this?

e.g. if I request data for Jan 2024, the library should only need to look in the folder http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2024/MMSDM_2024_01/MMSDM_Historical_Data_SQLLoader/DATA/. I don't see why it needs to look in any other folder. Isn't this file structure predictable?
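
For illustration, here is a minimal sketch of building that URL directly from the year and month. The template string is inferred from the example path above, so it is an assumption about the folder layout, not the library's actual code:

```python
# Hypothetical sketch: build the DATA folder URL directly from year/month.
# The URL template is inferred from the example path above (an assumption).
BASE = "http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM"

def data_folder_url(year: int, month: int) -> str:
    """Return the expected DATA folder URL for a given year and month."""
    return (
        f"{BASE}/{year}/MMSDM_{year}_{month:02d}/"
        "MMSDM_Historical_Data_SQLLoader/DATA/"
    )

print(data_folder_url(2024, 1))
# -> .../MMSDM/2024/MMSDM_2024_01/MMSDM_Historical_Data_SQLLoader/DATA/
```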
In fact, the code on this line appears to guess the exact filename rather than scrape it. So why are the requests for HTML pages made at all? If they are required, could you please add a caching decorator? That would be a roughly 2-line change, and should speed things up by about 37x.
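
A sketch of the kind of caching I mean, assuming the fetching happens in a single helper function (the name get_index_page is hypothetical, standing in for whatever function fetches each index page):

```python
from functools import lru_cache

import requests

# Hypothetical helper name; stands in for whatever function fetches an
# index page. With lru_cache, each distinct URL is fetched at most once
# per run, and repeat calls return the cached page body.
@lru_cache(maxsize=None)
def get_index_page(url: str) -> str:
    return requests.get(url).text
```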
