Use HTTP 304 as saved_state to cut down on duplicates (Last-Modified/If-Modified-Since/ETag/If-None-Match) #101

danieleperera · 2020-10-05T20:36:54Z

Hi guys,

Do you have any other ideas on how to cut down on duplicates without relying on Last-Modified/If-Modified-Since/ETag/If-None-Match? I'm trying to use the web.py source, but some HTTP responses don't include those headers.
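For context, the header-based approach works roughly as in the sketch below. This is only an illustration: the function names are hypothetical, and storing `saved_state` as a dict with `etag` and `last_modified` keys is an assumption, not how web.py necessarily stores it.

```python
def conditional_headers(saved_state):
    """Build request headers from validators remembered on a previous run.

    `saved_state` is assumed (for this sketch) to be a dict that may hold
    "etag" and "last_modified" values copied from an earlier response.
    """
    headers = {}
    if saved_state.get("etag"):
        headers["If-None-Match"] = saved_state["etag"]
    if saved_state.get("last_modified"):
        headers["If-Modified-Since"] = saved_state["last_modified"]
    return headers


def update_state(status_code, response_headers, saved_state):
    """Return (changed, new_state) for the response to a conditional GET."""
    if status_code == 304:
        # Not Modified: nothing new to process, keep the old validators.
        return False, saved_state
    # Any other response: remember whatever validators the server sent.
    new_state = {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
    return True, new_state
```

When a server sends neither `ETag` nor `Last-Modified`, `conditional_headers` returns an empty dict on the next run and the server never has a reason to answer 304, which is the problem described here.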

I was thinking of computing a shasum of the page content, saving it as the saved_state, and comparing it on the next run to see whether there are any new items. However, this would only work if you are scraping a single page.
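A minimal sketch of the checksum idea, assuming the page body is available as bytes and that `saved_state` holds the previous hex digest (or `None` on the first run). The function names are hypothetical and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return a hex digest that identifies this exact page content."""
    return hashlib.sha256(body).hexdigest()


def check_page(body: bytes, saved_state):
    """Compare the page against the digest stored on the previous run.

    Returns (changed, new_saved_state). `saved_state` is assumed to be the
    previous hex digest, or None if the page has never been fetched.
    """
    digest = content_fingerprint(body)
    return digest != saved_state, digest
```

For example, `changed, saved_state = check_page(response_body, saved_state)` would skip processing whenever `changed` is false. One caveat: any byte that differs between fetches (a timestamp, a session token, a rotating banner) changes the digest, so pages with dynamic content would always look new.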

cmmorrow · 2020-10-07T03:07:10Z

Hey @danieleperera, I like that idea. If you want to try to get it working, I'll review the PR.

cmmorrow added the enhancement (New feature or general improvement) label Oct 7, 2020

battleoverflow self-assigned this Jun 12, 2023

battleoverflow linked a pull request Jun 12, 2023 that will close this issue

Added HTTP status code to saved_state to cut down on duplicates #153

Merged

battleoverflow mentioned this issue Jun 12, 2023

v1.2.0 #151

Merged

battleoverflow closed this as completed Jun 12, 2023

battleoverflow linked a pull request Jun 12, 2023 that will close this issue

v1.2.0 #151

Merged
