Scraper Infrastructure Notes
Comparison of ebdata.retrieval vs. ebdata.blobs
|  | ebdata.blobs | ebdata.retrieval |
| --- | --- | --- |
|  | YES | NO |
|  | YES | NO |
| Stores crawl history? | YES [#note1] | YES [#note2] [#note3] |
|  | NO | YES |
| Scrapers tend to be shorter? | YES [#note4] | NO [#note4] |
| Validates locations? | NO | YES, see safe_location(), but not automatic |
| Can create NewsItems? | YES (geotagging.py) | YES, create_newsitem() |
| Creates NewsItems automatically? | YES (geotagging.py) | NO, you have to call create_newsitem() manually (see the sketch after the notes) |
|  | NO? | YES |
| Can display scraped data for debugging? | NO? | YES, display_data() |
Notes:
[#note1] blobs stores crawl history as Page objects, which hold the text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
[#note2] retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance, which has just some basic statistics. Scraped content is not saved.
[#note3] retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (content and a bit of metadata about a crawled page, much simpler than blobs.models.Page) and NewsItemHistory (just an m2m mapping of ScrapedPages to NewsItems).
[#note4] Anecdotally, scrapers written against ebdata.blobs tend to be shorter.
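
To make the create_newsitem() and display_data() rows concrete, here is a minimal sketch of a scraper written against ebdata.retrieval. It is an illustration, not the canonical implementation: the schema slug, URL, and record fields are made up, it assumes the NewsItemListDetailScraper hook methods (list_pages, parse_list, clean_list_record, existing_record, save) as described in the OpenBlock docs, and it has to run inside a configured OpenBlock/Django environment.

```python
# A minimal, hypothetical scraper against ebdata.retrieval.
from datetime import date
import urllib2  # OpenBlock targets Python 2

from ebdata.retrieval.scrapers.newsitem_list_detail import (
    NewsItemListDetailScraper)
from ebpub.db.models import NewsItem


class ExampleScraper(NewsItemListDetailScraper):
    schema_slugs = ('example-events',)  # hypothetical schema
    has_detail = False                  # everything comes from the list page

    def list_pages(self):
        # Yield the raw HTML of each list page to scrape.
        yield urllib2.urlopen('http://example.com/events.html').read()

    def parse_list(self, page):
        # Parse the page and yield one dict per record; a real scraper
        # would use lxml or regexes here.
        yield {'title': 'Example event',
               'url': 'http://example.com/events/1',
               'item_date': date.today(),
               'location_name': '123 Main St.'}

    def clean_list_record(self, record):
        # Normalize field values before saving.
        record['title'] = record['title'].strip()
        return record

    def existing_record(self, record):
        # Return the already-saved NewsItem for this record, or None,
        # so the framework can skip duplicates.
        qs = NewsItem.objects.filter(schema__id=self.schema.id,
                                     url=record['url'])
        return qs[0] if qs else None

    def save(self, old_record, list_record, detail_record):
        if old_record is not None:
            return  # already saved
        # NewsItem creation is explicit here, unlike blobs' geotagging.py.
        self.create_newsitem(
            attributes=None,  # no schema-specific attributes in this sketch
            title=list_record['title'],
            url=list_record['url'],
            item_date=list_record['item_date'],
            location_name=list_record['location_name'],
        )


if __name__ == '__main__':
    ExampleScraper().display_data()  # dry run: print records, save nothing
    # ExampleScraper().update()      # real run: fetch, clean, and save
```

Running display_data() prints each cleaned record without touching the database, which is the debugging capability noted in the table; update() does the real run and records the DataUpdate statistics mentioned in [#note2].

For comparison, the crawl history that blobs keeps (see [#note1]) is ordinary Django model data, so it can be inspected directly; a hypothetical query:

```python
from ebdata.blobs.models import Page

# The ten most recently crawled pages, ordered by the .when_crawled
# timestamp described in [#note1].
recent = Page.objects.order_by('-when_crawled')[:10]
```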