Initial check of URL is causing issues #256

Closed · benoit74 opened this issue Dec 4, 2023 · 1 comment

benoit74 (Collaborator) commented Dec 4, 2023

In zimit, at the beginning of scraper execution, the scraper performs what is called a check_url:

zimit/zimit.py, lines 467 to 508 at a62f31e:

```python
def check_url(url, user_agent, scope=None):
    url = urllib.parse.urlparse(url)
    try:
        with requests.get(
            url.geturl(),
            stream=True,
            allow_redirects=True,
            timeout=(12.2, 27),
            headers={"User-Agent": user_agent},
        ) as resp:
            resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"failed to connect to {url.geturl()}: {exc}", flush=True)
        raise SystemExit(1)

    actual_url = urllib.parse.urlparse(resp.url)

    # remove explicit port in URI for default-for-scheme as browsers does it
    if actual_url.scheme == "https" and actual_url.port == 443:
        actual_url = rebuild_uri(actual_url, port="")
    if actual_url.scheme == "http" and actual_url.port == 80:
        actual_url = rebuild_uri(actual_url, port="")

    if actual_url.geturl() != url.geturl():
        if scope in (None, "any"):
            return actual_url.geturl()

        print(
            "[WARN] Your URL ({0}) redirects to {1} which {2} on same "
            "first-level domain. Depending on your scopeType ({3}), "
            "your homepage might be out-of-scope. Please check!".format(
                url.geturl(),
                actual_url.geturl(),
                "is"
                if get_fld(url.geturl()) == get_fld(actual_url.geturl())
                else "is not",
                scope,
            )
        )
        return actual_url.geturl()

    return url.geturl()
```

This check seems intended to validate the URL and clean it up, including by following redirects.
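
To make the intent concrete, here is roughly what the function is supposed to return for a few hypothetical inputs (illustrative values only, not taken from a real run; `ua` stands for a placeholder user-agent string):

```python
# Illustrative only, not actual output from a real run (ua = placeholder user agent).

# Explicit default port is dropped, as browsers do:
#   check_url("https://example.com:443/wiki", ua)  ->  "https://example.com/wiki"

# Redirects are followed and the final URL becomes the seed; with a scope other
# than None/"any", a [WARN] is printed telling whether the redirect stays on the
# same first-level domain:
#   check_url("http://example.com/", ua, scope="host")  ->  "https://www.example.com/"

# Any requests exception aborts the scraper:
#   check_url("https://unreachable.invalid/", ua)  ->  SystemExit(1)
```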

It is however doing some harm: since the request is made with the Python requests library, anti-bot protections are regularly triggered.

See #255 for instance, where removing the check_url (manually on my machine) allows Browsertrix to proceed (even if I'm not sure it will finish, since protections might still stop us at some point). The same problem occurs in #232. And we have many cases reported in the weekly routine where a youzim.it task is stopped by a Python error, i.e. something that happened in check_url.

We tried to improve the situation with #229 and, while it is way better now, it is still not sufficient. Advanced anti-bot protections are not fooled by the user agent and still identify us as a bot (probably via TLS fingerprinting techniques).

I'm not sure how to move this forward, but clearly there is something to do.

I wonder if we should simply remove this URL check: it seems to me it is doing more harm than good, and it is the user's responsibility to input proper URLs. Do we have any notes or recollection of why exactly this was introduced?

Note that doing the check and simply ignoring errors is not sufficient, since performing the check usually triggers a temporary ban of our scraper's IP.

Another option would be to introduce a CLI flag to optionally disable this check, but I feel this scraper already has too many flags, and on youzim.it it would be hard for end-users to know they should disable it. And if they have just run the scraper with the check, the IP might already be banned and they would have to wait (without really knowing about it) before running the scraper again without the check.

benoit74 (Collaborator, Author) commented

This might be solved by the upgrade to 1.0.0-beta5, where we will probably be able to remove the check_url operation since redirects will now be handled by Browsertrix and the redirect target will be considered a seed (and hence not suffer from scope issues).

The only thing we should probably keep is the removal of the default 443 and 80 ports.
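
For reference, a minimal sketch of that port normalization on its own, using only urllib.parse (zimit uses its own rebuild_uri helper for this; the code below is just an approximation, not the actual implementation):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Drop an explicit :80 (http) or :443 (https) from a URL, as browsers do.

    Rough approximation of what zimit does via rebuild_uri, not the actual code.
    """
    parts = urlsplit(url)
    default = {"http": 80, "https": 443}.get(parts.scheme)
    if default is not None and parts.port == default and parts.hostname:
        netloc = parts.hostname
        if parts.username:
            creds = parts.username + (f":{parts.password}" if parts.password else "")
            netloc = f"{creds}@{netloc}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)


print(strip_default_port("https://example.com:443/path"))  # https://example.com/path
print(strip_default_port("http://example.com:8080/path"))  # left untouched
```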

To be confirmed in the PR for upgrading to 1.0.0-beta5
