Use HTTP 304 as saved_state to cut down on duplicates (Last-Modified/If-Modified-Since/ETag/If-None-Match) #101

Closed
danieleperera opened this issue Oct 5, 2020 · 1 comment · Fixed by #153 or #151
Labels
enhancement New feature or general improvement

Comments

@danieleperera

Hi guys,

Do you have any other ideas for cutting down on duplicates without using Last-Modified/If-Modified-Since/ETag/If-None-Match? I'm trying to use the web.py source, but some HTTP responses don't include those headers.

I was thinking of creating a shasum of a page's content, saving it as the saved_state, and comparing it on the next run to see whether there are any new items. However, this would only work if you are scraping a single page.
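
Something along these lines is what I have in mind. This is only a rough sketch, not the project's actual code: `content_fingerprint` and `check_for_update` are hypothetical names, SHA-256 is just one possible choice of hash, and I'm assuming the saved_state can be stored as a plain string.

```python
import hashlib
from typing import Optional, Tuple


def content_fingerprint(body: bytes) -> str:
    """Return a hex SHA-256 digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def check_for_update(body: bytes, saved_state: Optional[str]) -> Tuple[bool, str]:
    """Compare the page body against the previously saved digest.

    Returns (changed, new_state). `changed` is True on the first run
    (saved_state is None) or whenever the digest differs.
    """
    new_state = content_fingerprint(body)
    return new_state != saved_state, new_state


# Hypothetical usage: in practice `page` would be the HTTP response body.
page = b"<html><body>item 1</body></html>"

changed, state = check_for_update(page, None)    # first run: no saved state yet
unchanged, _ = check_for_update(page, state)     # same content on the next run
print(changed, unchanged)
```

One caveat with this approach: any byte-level difference (a timestamp, a rotating ad, a CSRF token) changes the digest, so the page would be treated as new even when the items themselves are the same.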

@cmmorrow
Contributor

cmmorrow commented Oct 7, 2020

Hey @danieleperera, I like that idea. If you want to try to get it working, I'll review the PR.

@cmmorrow cmmorrow added the enhancement New feature or general improvement label Oct 7, 2020
@battleoverflow battleoverflow self-assigned this Jun 12, 2023
@battleoverflow battleoverflow mentioned this issue Jun 12, 2023
@battleoverflow battleoverflow linked a pull request Jun 12, 2023 that will close this issue
3 participants