
Avoid re-scraping the file-list pages #13

Closed
mdavis-xyz opened this issue Apr 30, 2024 · 0 comments

@mdavis-xyz (Contributor)

I'm trying to download a few years of data for one small table. It takes about 1 minute to download each 100 kB file, which is quite slow.
When I set the logging level to DEBUG, I can see that this library makes 37 requests for each file it downloads, mostly to re-crawl the index pages.

e.g. when I try to download data for 2024, this library fetches the page /Data_Archive/Wholesale_Electricity/MMSDM/2009/ twice.

[Screenshot: DEBUG log of the repeated requests]

What's the purpose of this?

e.g. if I request data for Jan 2024, the library should only need to look in the folder http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2024/MMSDM_2024_01/MMSDM_Historical_Data_SQLLoader/DATA/. I don't see why it needs to look in any other folder. Isn't this file structure predictable?
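
For illustration, here is a minimal sketch of building that URL directly from the year and month. The template string is inferred from the example path above, so it is an assumption about the folder layout, not the library's actual code:

```python
# Hypothetical sketch: build the DATA folder URL directly from year/month.
# The URL template is inferred from the example path above (an assumption).
BASE = "http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM"

def data_folder_url(year: int, month: int) -> str:
    """Return the expected DATA folder URL for a given year and month."""
    return (
        f"{BASE}/{year}/MMSDM_{year}_{month:02d}/"
        "MMSDM_Historical_Data_SQLLoader/DATA/"
    )

print(data_folder_url(2024, 1))
# -> .../MMSDM/2024/MMSDM_2024_01/MMSDM_Historical_Data_SQLLoader/DATA/
```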
In fact, the code on this line appears to guess the exact filename rather than scrape it. So why are the requests for HTML pages made at all? If they are required, could you please add a caching decorator? That would be a roughly 2-line change, and should speed things up by about 37x.
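
A sketch of the kind of caching I mean, assuming the fetching happens in a single helper function (the name get_index_page is hypothetical, standing in for whatever function fetches each index page):

```python
from functools import lru_cache

import requests

# Hypothetical helper name; stands in for whatever function fetches an
# index page. With lru_cache, each distinct URL is fetched at most once
# per run, and repeat calls return the cached page body.
@lru_cache(maxsize=None)
def get_index_page(url: str) -> str:
    return requests.get(url).text
```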
