Initial check of URL is causing issues #256

Closed · benoit74 opened this issue Dec 4, 2023 · 1 comment

benoit74 (Collaborator) commented Dec 4, 2023

In zimit, at the beginning of scraper execution, the scraper performs what is called a check_url:

zimit/zimit.py, lines 467 to 508 at a62f31e:

```python
def check_url(url, user_agent, scope=None):
    url = urllib.parse.urlparse(url)
    try:
        with requests.get(
            url.geturl(),
            stream=True,
            allow_redirects=True,
            timeout=(12.2, 27),
            headers={"User-Agent": user_agent},
        ) as resp:
            resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"failed to connect to {url.geturl()}: {exc}", flush=True)
        raise SystemExit(1)

    actual_url = urllib.parse.urlparse(resp.url)

    # remove explicit port in URI for default-for-scheme as browsers does it
    if actual_url.scheme == "https" and actual_url.port == 443:
        actual_url = rebuild_uri(actual_url, port="")
    if actual_url.scheme == "http" and actual_url.port == 80:
        actual_url = rebuild_uri(actual_url, port="")

    if actual_url.geturl() != url.geturl():
        if scope in (None, "any"):
            return actual_url.geturl()

        print(
            "[WARN] Your URL ({0}) redirects to {1} which {2} on same "
            "first-level domain. Depending on your scopeType ({3}), "
            "your homepage might be out-of-scope. Please check!".format(
                url.geturl(),
                actual_url.geturl(),
                "is"
                if get_fld(url.geturl()) == get_fld(actual_url.geturl())
                else "is not",
                scope,
            )
        )
        return actual_url.geturl()

    return url.geturl()
```

This check seems intended to validate the URL and clean it up, including by following redirects.
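
To make the intent concrete, here is roughly what the function is supposed to return for a few hypothetical inputs (illustrative values only, not taken from a real run; `ua` stands for a placeholder user-agent string):

```python
# Illustrative only, not actual output from a real run (ua = placeholder user agent).

# Explicit default port is dropped, as browsers do:
#   check_url("https://example.com:443/wiki", ua)  ->  "https://example.com/wiki"

# Redirects are followed and the final URL becomes the seed; with a scope other
# than None/"any", a [WARN] is printed telling whether the redirect stays on the
# same first-level domain:
#   check_url("http://example.com/", ua, scope="host")  ->  "https://www.example.com/"

# Any requests exception aborts the scraper:
#   check_url("https://unreachable.invalid/", ua)  ->  SystemExit(1)
```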

It is however doing some harm: since the request is made with the Python requests library, anti-bot protections are regularly triggered.

See #255 for instance, where removing the check_url (manually on my machine) allows Browsertrix to proceed (even if I'm not sure it will finish, since protections might still stop us at some point). The same problem occurs in #232. And we have many cases reported in the weekly routine where a youzim.it task is stopped by a Python error, i.e. something that happened in check_url.

We tried to improve the situation with #229 and, while it is way better now, it is still not sufficient. Advanced anti-bot protections are not fooled by the user agent and still identify us as a bot (probably via TLS fingerprinting techniques).

I'm not sure how to move this forward, but clearly there is something to do.

I wonder if we should simply remove this URL check: it seems to me it is doing more harm than good, and it is the user's responsibility to input proper URLs. Do we have any notes or recollection of why exactly this was introduced?

Note that doing the check and simply ignoring errors is not sufficient, since performing the check usually triggers a temporary ban of our scraper's IP.

Another option would be to introduce a CLI flag to optionally disable this check, but I feel this scraper already has too many flags, and on youzim.it it would be hard for end-users to know they should disable it. And if they have just run the scraper with the check, the IP might already be banned and they would have to wait (without really knowing about it) before running the scraper again without the check.

benoit74 (Collaborator, Author) commented

This might be solved by the upgrade to 1.0.0-beta5, where we will probably be able to remove the check_url operation since redirects will now be handled by Browsertrix and the redirect target will be considered a seed (and hence not suffer from scope issues).

The only thing we should probably keep is the removal of the default 443 and 80 ports.
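
For reference, a minimal sketch of that port normalization on its own, using only urllib.parse (zimit uses its own rebuild_uri helper for this; the code below is just an approximation, not the actual implementation):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Drop an explicit :80 (http) or :443 (https) from a URL, as browsers do.

    Rough approximation of what zimit does via rebuild_uri, not the actual code.
    """
    parts = urlsplit(url)
    default = {"http": 80, "https": 443}.get(parts.scheme)
    if default is not None and parts.port == default and parts.hostname:
        netloc = parts.hostname
        if parts.username:
            creds = parts.username + (f":{parts.password}" if parts.password else "")
            netloc = f"{creds}@{netloc}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)


print(strip_default_port("https://example.com:443/path"))  # https://example.com/path
print(strip_default_port("http://example.com:8080/path"))  # left untouched
```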

To be confirmed in the PR for upgrading to 1.0.0-beta5
