An async Python web crawler.
In a virtual env:
pip install -r requirements.txt
(or pip-sync
if you have pip-tools
)
cd
to the root directory and:
./run_crawler.py https://github.com
cd
to the root directory and:
./run_tests.sh
Unfortunately, because of an issue with aiohttp
/python
, as mentioned here, the code that is reponsible for making the get requests might fail on machines that do not have a Python version of 3.7.4 and higher. For taht case, I have included a docker file and some commands where that you can use to run the crawler and the tests inside docker, using a Python 3.8 image.
- Install Docker.
- From the root directory, run:
docker build -t crawler .
docker run crawler python run_crawler.py https://github.com
docker run crawler python -m pytest tests/
If you need to ssh into the container to check what is there, like the crawler resuls file, you can do:
docker run -it crawler /bin/sh