Key Features • Design • Stack Decisions • Documentation • Usage • Testing • Load Testing • Tasks Monitoring • Continuous Integration • Enhancements
## Key Features

- Crawling 🕸️ - Crawl any website and count the occurrences of each word (a minimal sketch of the idea follows this list).
- Distributed 🚀 - All the heavy work happens in the background, handled by Celery workers.
- Caching using Redis 🏪 - We don't crawl the same website more than once within a short interval.
- Two endpoints: `POST /crawl` and `GET /check_crawl_status/{task_id}`.
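A minimal sketch of the crawl-and-count idea. It assumes `requests` and `BeautifulSoup` for fetching and stripping HTML; the function name `count_words` and the simple tokenizer are illustrative, not the project's actual crawler code.

```python
# Illustrative only: fetch a page, strip the HTML, and count word occurrences.
import re
from collections import Counter

import requests
from bs4 import BeautifulSoup


def count_words(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Drop the markup and keep only the visible text.
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ")
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return dict(Counter(words))
```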
## Design

- The user starts by calling `POST /crawl` with a URL.
- We check whether we've crawled this URL in the past hour. If we have, we return the cached response to the user. The response includes `task_id` and `task_status`.
- Asynchronously, we send the request to a message broker (Redis). Two Celery workers consume requests from this broker.
- We can scale the number of workers as far as our hardware allows. Since the work is distributed, we can easily scale horizontally.
- The workers process the request, store the result in Redis, and update the `task_status`.
- The user calls `GET /check_crawl_status/{task_id}` to get the status of their crawl. If it succeeded, they receive a response with the word counts.
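A minimal sketch of how the flow above could be wired, assuming FastAPI and Celery with Redis as both broker and result backend. The task name `crawl_words`, the module layout, and the response shapes are illustrative, and the per-URL cache check is omitted for brevity.

```python
# Illustrative wiring of the API and the background task; not the project's actual code.
from celery import Celery
from celery.result import AsyncResult
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
celery_app = Celery(
    "crawler",
    broker="redis://localhost:6379/0",   # message broker
    backend="redis://localhost:6379/1",  # result backend
)


class CrawlRequest(BaseModel):
    url: str


@celery_app.task(name="crawl_words")
def crawl_words(url: str) -> dict:
    # The worker would fetch the page and count words here
    # (e.g. along the lines of the count_words sketch above).
    return {}


@app.post("/crawl")
def crawl(request: CrawlRequest):
    # Dispatch the heavy work to a Celery worker via the Redis broker.
    task = crawl_words.delay(request.url)
    return {"id": task.id, "url": f"localhost:8000/check_crawl_status/{task.id}"}


@app.get("/check_crawl_status/{task_id}")
def check_crawl_status(task_id: str):
    # Look the task up in the Redis result backend.
    result = AsyncResult(task_id, app=celery_app)
    return {
        "status": result.status,
        "result": result.result if result.ready() else None,
        "task_id": task_id,
    }
```

In a layout like this the API only enqueues work while the workers execute it, which matches the "separate the API from the execution" decision below.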
## Stack Decisions

- FastAPI: Reliable async web server with built-in documentation.
- Celery: Reliable distributed task queue to make sure the solution can scale when needed.
- Separate the API from the execution. This way we can scale the workers independently of the API.
- Use caching to make sure we don't scrape the same URL within a short interval (1 hour).
- Use Redis to store the workers' results. The default result expiry is 24 hours, and we don't need to persist the word counts for longer than that since they are always changing.
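To make the last two decisions concrete, here is an illustrative configuration sketch; the key names, database numbers, and helper functions are assumptions, not the project's actual settings.

```python
# Illustrative configuration only; values mirror the decisions above.
from datetime import timedelta

import redis
from celery import Celery

celery_app = Celery(
    "crawler",
    broker="redis://localhost:6379/0",   # message broker for dispatching tasks
    backend="redis://localhost:6379/1",  # result backend for the word counts
)
celery_app.conf.result_expires = timedelta(hours=24)  # results expire after a day

# Separate cache of recently crawled URLs so the same site
# is not re-crawled within one hour.
cache = redis.Redis(host="localhost", port=6379, db=2)
CACHE_TTL_SECONDS = 60 * 60


def cache_task_for_url(url: str, task_id: str) -> None:
    cache.setex(f"crawl:{url}", CACHE_TTL_SECONDS, task_id)


def cached_task_for_url(url: str):
    task_id = cache.get(f"crawl:{url}")
    return task_id.decode() if task_id else None
```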
## Documentation

```bash
$ sh local_env_up.sh
```

Then visit http://localhost:8000/docs for the Swagger documentation.
## Usage

Bring the local environment up:

```bash
$ sh local_env_up.sh
```

Scale the workers:

```bash
$ sudo docker-compose -f docker-compose.yml up --scale worker=2 --build
```

Tear the environment down:

```bash
$ sh local_env_down.sh
```
Crawl a URL:

```bash
$ curl --location --request POST 'http://localhost:8000/crawl' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "url": "https://www.bbc.com"
  }'
```

Response:

```json
{
    "id": "8b1766b4-6dc1-4f3d-bc6f-426066edc46f",
    "url": "localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f"
}
```
Check the crawl status:

```bash
$ curl --location --request \
  GET 'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f'
```

Response:

```json
{
    "status": "SUCCESS",
    "result": {
        "the": 54,
        "to": 46,
        "in": 22,
        "of": 21,
        ..,
        ..
    },
    "task_id": "b4035abd-f58f-4ab9-90bb-ebad535869d4"
}
```
The default is to return the words sorted by occurrences. Use the `sort` parameter to get them sorted alphabetically instead, for example:

```bash
$ curl --location --request \
  GET 'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f?sort=alphabetically'
```
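Server side, applying the `sort` parameter can be as simple as the hypothetical helper below (illustrative only, not the project's actual code):

```python
# Hypothetical helper for applying the `sort` query parameter to a word-count result.
def sort_counts(counts: dict, sort: str = "occurrences") -> dict:
    if sort == "alphabetically":
        return dict(sorted(counts.items()))  # sort by word
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))  # by count, descending


print(sort_counts({"to": 46, "the": 54, "in": 22}))
# {'the': 54, 'to': 46, 'in': 22}
print(sort_counts({"to": 46, "the": 54, "in": 22}, sort="alphabetically"))
# {'in': 22, 'the': 54, 'to': 46}
```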
## Testing

```bash
$ docker-compose exec worker pytest .
$ docker-compose exec fastapi pytest .
```
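For orientation, an API-level test could look roughly like the sketch below. The module path `app.main`, the task name `crawl_words`, and the response shape are assumptions; the repo's real tests may be structured differently.

```python
# Hypothetical test sketch; `app.main` and `crawl_words` are assumed names,
# not necessarily the project's real module layout.
from fastapi.testclient import TestClient

from app.main import app, crawl_words


class FakeAsyncResult:
    id = "fake-task-id"


def test_crawl_dispatches_task(monkeypatch):
    # Avoid touching Redis/Celery: replace the task dispatch with a stub.
    monkeypatch.setattr(crawl_words, "delay", lambda url: FakeAsyncResult())

    client = TestClient(app)
    response = client.post("/crawl", json={"url": "https://www.bbc.com"})

    assert response.status_code == 200
    assert response.json()["id"] == "fake-task-id"
```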
## Load Testing

I use locust for load testing.

```bash
$ pip install locust
$ locust -f load_test.py
```
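The shape of a locustfile for this API could look like the sketch below; the actual `load_test.py` in the repo may define different tasks and wait times.

```python
# Hypothetical locustfile along the lines of load_test.py.
from locust import HttpUser, between, task


class CrawlerUser(HttpUser):
    wait_time = between(1, 3)  # seconds between simulated requests

    @task
    def crawl(self):
        self.client.post("/crawl", json={"url": "https://www.bbc.com"})
```

With the stack running locally, `locust -f load_test.py --host http://localhost:8000` points the simulated users at the API and serves the locust web UI on port 8089.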
## Tasks Monitoring

Monitor the tasks and workers using the Flower dashboard at http://localhost:5555/dashboard
## Continuous Integration

Basic CI is integrated into the repo using GitHub Actions to run the test cases on PRs and merges to master.
## Enhancements

- Smart Crawler: Stream the text instead of downloading it (one possible approach is sketched below).
- Better mocks in the tests and better test coverage.
- A Continuous Delivery action to publish the Docker images to a Docker registry.
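For the streaming enhancement, one possible (purely illustrative) approach is to consume the response in chunks and update a running counter instead of loading the whole page into memory:

```python
# One possible shape for the "stream instead of download" enhancement; not part of
# the current codebase. HTML tags are not stripped here to keep the sketch short.
import re
from collections import Counter

import requests


def count_words_streaming(url: str) -> dict:
    """Count words chunk by chunk instead of holding the whole page in memory."""
    counts = Counter()
    leftover = ""
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        response.encoding = response.encoding or "utf-8"
        for chunk in response.iter_content(chunk_size=8192, decode_unicode=True):
            text = leftover + chunk
            words = re.findall(r"[a-zA-Z']+", text.lower())
            # The last word may be cut off at the chunk boundary; carry it over.
            leftover = words.pop() if words and not text[-1].isspace() else ""
            counts.update(words)
    if leftover:
        counts[leftover] += 1
    return dict(counts)
```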