Traceback (most recent call last):
File "./bin/triage_links", line 34, in get_url_parts
link = urljoin(record.url, record.href)
File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./bin/triage_links", line 102, in <module>
main()
File "./bin/triage_links", line 13, in main
CSVPipeline(callback=process).execute()
File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
self.save_csv()
File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
df = df.compute()
File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
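For reference, the top of the traceback corresponds to a call shaped roughly like the sketch below. The `Record` type, the example URL, and the idea that the offending href reaches `urljoin()` as raw bytes containing an invalid UTF-8 sequence are all my assumptions, not code from the project:

```python
# Hypothetical reconstruction of the failing call site (first frame of the traceback).
# Record, the sample URL, and the bad href are made up; the real values come from the
# crawled CSV. The href here is raw bytes with a stray 0xf0, which is my guess at what
# scurl.cgurl.urljoin fails to decode as UTF-8.
from collections import namedtuple

from scurl.cgurl import urljoin

Record = namedtuple("Record", ["url", "href"])


def get_url_parts(record):
    # bin/triage_links line 34 in the traceback: join the href against the page URL.
    return urljoin(record.url, record.href)


bad = Record(url="https://example.com/page", href=b"/next?q=\xf0oops")

try:
    get_url_parts(bad)
except UnicodeDecodeError as exc:
    print("urljoin raised:", exc)
```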
That it works fine when I drop duplicates is somewhat intriguing; maybe the code is running out of memory, or there is a memory leak in there somewhere.
If you have more memory than I do and it doesn't choke on your system as a result, you can probably call df.append() a few times to make the DataFrame large enough to segfault.
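By that I mean something along these lines; `df` stands for whatever pandas DataFrame the pipeline builds before handing it to dask, `links.csv` is a placeholder, and the number of doublings is arbitrary:

```python
# Sketch of the inflation step; "links.csv" and the doubling count are placeholders.
# df.append() is fine on the pandas versions this ran against; on pandas >= 2.0 use
# pd.concat([df, df], ignore_index=True) instead.
import pandas as pd


def inflate(df: pd.DataFrame, doublings: int = 3) -> pd.DataFrame:
    for _ in range(doublings):
        df = df.append(df, ignore_index=True)
    return df


# e.g. inflate(pd.read_csv("links.csv"), doublings=4)
```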
See #58 (comment) and #58 (comment)
Also repeating it here:
To reproduce, run a broad crawl on this dataset and extract all links: https://www.kaggle.com/cheedcheed/top1m
Then use urljoin() and urlsplit() on each one.
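The urljoin()/urlsplit() part of that loop looks roughly like the sketch below. The file name, the column layout of the Kaggle CSV, the dummy href, and the assumption that urlsplit lives alongside urljoin in scurl.cgurl are mine; in the real run the hrefs come out of the crawled pages, which is where the non-UTF-8 bytes show up:

```python
# Rough shape of the reproduction loop. "top-1m.csv" and its rank,domain layout are
# assumptions about the downloaded Kaggle file; the real hrefs come from the crawled
# pages rather than the hard-coded "/index.html" used here.
import csv

from scurl.cgurl import urljoin, urlsplit

with open("top-1m.csv", newline="") as fh:
    for rank, domain in csv.reader(fh):
        base = f"http://{domain}/"
        link = urljoin(base, "/index.html")
        parts = urlsplit(link)
```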