Words Counter Crawler

Distributed web crawler to count the occurrences of words

Key Features • Design • Stack Decisions • Documentation • Usage • Testing • Load Testing • Tasks Monitoring • Continuous Integration • Enhancements

Key Features

  • Crawling 🕸️ - Crawl any website and count the occurrences of each word (a counting sketch follows this list).
  • Distributed 🚀 - All the heavy lifting happens in the background in Celery workers.
  • Caching using Redis 🏪 - We don't crawl the same website more than once within a short interval.
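
A minimal sketch of the counting step, assuming the page is fetched with requests and parsed with BeautifulSoup; the function name count_words and the exact tokenization are illustrative, not necessarily what the repo does:

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup


def count_words(url: str) -> dict:
    """Fetch a page and count how often each word appears in its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Strip the markup and keep only the visible text.
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ")

    # Lowercase and split into words before counting.
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words))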

High Level Design


  • Two endpoints: POST /crawl and GET /check_crawl_status/{task_id} (a wiring sketch follows this list).
  • The user starts by calling POST /crawl with a URL.
  • We check whether we've crawled this URL in the past hour. If we have, we return the cached response, which includes a task_id and task_status.
  • Otherwise, we asynchronously send the request to a message broker (Redis), and the Celery workers consume requests from it.
  • The number of workers can scale as far as the hardware allows; since the work is distributed, it scales out horizontally with ease.
  • The workers process the request, store the result in Redis, and update the task_status.
  • The user calls GET /check_crawl_status/{task_id} to get the status of their crawl; if it succeeded, the response contains the word counts.
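
A rough sketch of how the two endpoints could be wired to Celery and Redis. The module names (worker, crawl_task), the cache-key scheme, and the one-hour TTL constant are assumptions for illustration, not necessarily the repo's actual code:

from celery.result import AsyncResult
from fastapi import FastAPI
from pydantic import BaseModel
from redis import Redis

from worker import celery_app, crawl_task  # hypothetical Celery app and task

app = FastAPI()
cache = Redis(host="redis", port=6379, decode_responses=True)

CACHE_TTL = 60 * 60  # one hour, matching the caching rule above


class CrawlRequest(BaseModel):
    url: str


@app.post("/crawl")
def crawl(request: CrawlRequest):
    # Reuse the existing task if this URL was crawled within the last hour.
    task_id = cache.get(f"crawl:{request.url}")
    if not task_id:
        task_id = crawl_task.delay(request.url).id
        cache.set(f"crawl:{request.url}", task_id, ex=CACHE_TTL)
    return {"id": task_id, "url": f"localhost:8000/check_crawl_status/{task_id}"}


@app.get("/check_crawl_status/{task_id}")
def check_crawl_status(task_id: str):
    # Look up the Celery task state and, once finished, its stored result.
    result = AsyncResult(task_id, app=celery_app)
    return {
        "status": result.status,
        "result": result.result if result.successful() else None,
        "task_id": task_id,
    }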

Stack Decisions

  • FastAPI: A reliable async web server with built-in documentation.
  • Celery: A reliable distributed task queue that keeps the solution scalable when needed.
  • The API is separated from the execution, so the workers can be scaled independently of the API.
  • Caching ensures the same URL is not scraped again within a short interval (1 hour).
  • Redis stores the workers' results; Celery's default result expiry is 24 hours, and since the word counts are always changing there is no need to persist them for longer (a configuration sketch follows this list).
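
A minimal Celery configuration along these lines ties this together, assuming Redis is reachable as the redis service from docker-compose (the module layout and app name are illustrative):

from celery import Celery

# Redis serves as both the message broker and the result backend.
celery_app = Celery(
    "word_counter",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/0",
)

# Celery expires stored results after 24 hours by default (result_expires),
# which is enough here since the word counts go stale quickly anyway.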

Documentation

$ sh local_env_up.sh 

Then visit http://localhost:8000/docs for the Swagger documentation.


Usage

Start the services

$ sh local_env_up.sh 

Content of local_env_up.sh

$ sudo docker-compose -f docker-compose.yml up --scale worker=2 --build

Stop the services

$ sh local_env_down.sh 

POST a Crawl Request

$ curl --location --request POST 'http://localhost:8000/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.bbc.com"
}'

POST a Crawl Response Example

{
    "id": "8b1766b4-6dc1-4f3d-bc6f-426066edc46f",
    "url": "localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f"
}

Get Crawl Status Example

$ curl --location --request GET \
'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f'

Get Crawl Status Response Example

{
    "status": "SUCCESS",
    "result": {
        "the": 54,
        "to": 46,
        "in": 22,
        "of": 21,
        ...
    },
    "task_id": "b4035abd-f58f-4ab9-90bb-ebad535869d4"
}

By default the words are sorted by number of occurrences. Use the sort query parameter to sort them alphabetically instead, for example:

$ curl --location --request GET \
'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f?sort=alphabetically'
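
The sorting itself is simple; a sketch of what the status endpoint might do with the counts (the helper name sort_counts is illustrative, not the repo's actual code):

def sort_counts(counts, sort="occurrences"):
    """Sort word counts by occurrences (default) or alphabetically."""
    if sort == "alphabetically":
        return dict(sorted(counts.items()))
    # Default: most frequent words first.
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

# sort_counts({"of": 21, "the": 54})                   -> {"the": 54, "of": 21}
# sort_counts({"of": 21, "the": 54}, "alphabetically") -> {"of": 21, "the": 54}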

Testing

Test the Worker

$ docker-compose exec worker pytest .

Test the API

$ docker-compose exec fastapi pytest .
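
For reference, an API test with FastAPI's TestClient could look roughly like this, with the Redis cache and the Celery dispatch mocked out; the module name main and the attribute names are assumptions, not the repo's actual tests:

from unittest.mock import patch

from fastapi.testclient import TestClient

from main import app  # hypothetical module holding the FastAPI app

client = TestClient(app)


def test_crawl_returns_task_id():
    # Patch the cache and the Celery dispatch so no Redis or broker is needed.
    with patch("main.cache") as mock_cache, patch("main.crawl_task") as mock_task:
        mock_cache.get.return_value = None
        mock_task.delay.return_value.id = "fake-task-id"
        response = client.post("/crawl", json={"url": "https://www.bbc.com"})

    assert response.status_code == 200
    assert response.json()["id"] == "fake-task-id"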

Load Testing

I use Locust for load testing.

$ pip install locust 
$ locust -f load_test.py 
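
The load_test.py scenario might look roughly like this (a sketch, not necessarily the repo's exact file):

from locust import HttpUser, task, between


class CrawlUser(HttpUser):
    # Simulated users wait 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def post_crawl(self):
        self.client.post("/crawl", json={"url": "https://www.bbc.com"})

Point Locust at the API with --host http://localhost:8000 and drive the traffic from its web UI.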


Tasks Monitoring

Monitor the tasks and workers using the Flower dashboard at http://localhost:5555/dashboard


Continuous Integration

Basic CI is integrated into the repo using GitHub Actions to run the test cases on PRs and merges to master.

Enhancements

  • Smart Crawler: Stream the text instead of downloading it.
  • Better mocks in the tests and better test coverage.
  • A Continuous Delivery action to publish the Docker images to a Docker registry.
