Words Counter Crawler

Distributed web crawler to count the occurrences of words

Key Features • Design • Stack Decisions • Documentation • Usage • Testing • Load Testing • Tasks Monitoring • Continuous Integration • Enhancements

Key Features

  • Crawling 🕸️ - Crawl any website and count the occurrences of each word (a counting sketch follows this list).
  • Distributed 🚀 - All the heavy lifting happens in the background in Celery workers.
  • Caching using Redis 🏪 - We don't crawl the same website more than once within a short interval.
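
A minimal sketch of the counting step, assuming the page is fetched with requests and parsed with BeautifulSoup; the function name count_words and the exact tokenization are illustrative, not necessarily what the repo does:

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup


def count_words(url: str) -> dict:
    """Fetch a page and count how often each word appears in its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Strip the markup and keep only the visible text.
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ")

    # Lowercase and split into words before counting.
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words))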

High Level Design


  • Two endpoints: POST /crawl and GET /check_crawl_status/{task_id} (a wiring sketch follows this list).
  • The user starts by calling POST /crawl with a URL.
  • We check whether we've crawled this URL in the past hour. If we have, we return the cached response, which includes a task_id and task_status.
  • Otherwise, we asynchronously send the request to a message broker (Redis), and the Celery workers consume requests from it.
  • The number of workers can scale as far as the hardware allows; since the work is distributed, it scales out horizontally with ease.
  • The workers process the request, store the result in Redis, and update the task_status.
  • The user calls GET /check_crawl_status/{task_id} to get the status of their crawl; if it succeeded, the response contains the word counts.
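
A rough sketch of how the two endpoints could be wired to Celery and Redis. The module names (worker, crawl_task), the cache-key scheme, and the one-hour TTL constant are assumptions for illustration, not necessarily the repo's actual code:

from celery.result import AsyncResult
from fastapi import FastAPI
from pydantic import BaseModel
from redis import Redis

from worker import celery_app, crawl_task  # hypothetical Celery app and task

app = FastAPI()
cache = Redis(host="redis", port=6379, decode_responses=True)

CACHE_TTL = 60 * 60  # one hour, matching the caching rule above


class CrawlRequest(BaseModel):
    url: str


@app.post("/crawl")
def crawl(request: CrawlRequest):
    # Reuse the existing task if this URL was crawled within the last hour.
    task_id = cache.get(f"crawl:{request.url}")
    if not task_id:
        task_id = crawl_task.delay(request.url).id
        cache.set(f"crawl:{request.url}", task_id, ex=CACHE_TTL)
    return {"id": task_id, "url": f"localhost:8000/check_crawl_status/{task_id}"}


@app.get("/check_crawl_status/{task_id}")
def check_crawl_status(task_id: str):
    # Look up the Celery task state and, once finished, its stored result.
    result = AsyncResult(task_id, app=celery_app)
    return {
        "status": result.status,
        "result": result.result if result.successful() else None,
        "task_id": task_id,
    }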

Stack Decisions

  • FastAPI: A reliable async web server with built-in documentation.
  • Celery: A reliable distributed task queue that keeps the solution scalable when needed.
  • The API is separated from the execution, so the workers can be scaled independently of the API.
  • Caching ensures the same URL is not scraped again within a short interval (1 hour).
  • Redis stores the workers' results; Celery's default result expiry is 24 hours, and since the word counts are always changing there is no need to persist them for longer (a configuration sketch follows this list).
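
A minimal Celery configuration along these lines ties this together, assuming Redis is reachable as the redis service from docker-compose (the module layout and app name are illustrative):

from celery import Celery

# Redis serves as both the message broker and the result backend.
celery_app = Celery(
    "word_counter",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/0",
)

# Celery expires stored results after 24 hours by default (result_expires),
# which is enough here since the word counts go stale quickly anyway.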

Documentation

$ sh local_env_up.sh 

Then visit http://localhost:8000/docs for the Swagger documentation.


Usage

Start the services

$ sh local_env_up.sh 

Content of local_env_up.sh

$ sudo docker-compose -f docker-compose.yml up --scale worker=2 --build

Stop the services

$ sh local_env_down.sh 

POST a Crawl Request

$ curl --location --request POST 'http://localhost:8000/crawl' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.bbc.com"
}'

POST a Crawl Response Example

{
    "id": "8b1766b4-6dc1-4f3d-bc6f-426066edc46f",
    "url": "localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f"
}

Get Crawl Status Example

$ curl --location --request GET \
'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f'

Get Crawl Status Response Example

{
    "status": "SUCCESS",
    "result": {
        "the": 54,
        "to": 46,
        "in": 22,
        "of": 21,
        ...
    },
    "task_id": "b4035abd-f58f-4ab9-90bb-ebad535869d4"
}

By default the words are sorted by number of occurrences. Use the sort query parameter to sort them alphabetically instead, for example:

$ curl --location --request GET \
'localhost:8000/check_crawl_status/8b1766b4-6dc1-4f3d-bc6f-426066edc46f?sort=alphabetically'
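
The sorting itself is simple; a sketch of what the status endpoint might do with the counts (the helper name sort_counts is illustrative, not the repo's actual code):

def sort_counts(counts, sort="occurrences"):
    """Sort word counts by occurrences (default) or alphabetically."""
    if sort == "alphabetically":
        return dict(sorted(counts.items()))
    # Default: most frequent words first.
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

# sort_counts({"of": 21, "the": 54})                   -> {"the": 54, "of": 21}
# sort_counts({"of": 21, "the": 54}, "alphabetically") -> {"of": 21, "the": 54}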

Testing

Test the Worker

$ docker-compose exec worker pytest .

Test the API

$ docker-compose exec fastapi pytest .
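
For reference, an API test with FastAPI's TestClient could look roughly like this, with the Redis cache and the Celery dispatch mocked out; the module name main and the attribute names are assumptions, not the repo's actual tests:

from unittest.mock import patch

from fastapi.testclient import TestClient

from main import app  # hypothetical module holding the FastAPI app

client = TestClient(app)


def test_crawl_returns_task_id():
    # Patch the cache and the Celery dispatch so no Redis or broker is needed.
    with patch("main.cache") as mock_cache, patch("main.crawl_task") as mock_task:
        mock_cache.get.return_value = None
        mock_task.delay.return_value.id = "fake-task-id"
        response = client.post("/crawl", json={"url": "https://www.bbc.com"})

    assert response.status_code == 200
    assert response.json()["id"] == "fake-task-id"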

Load Testing

I use Locust for load testing.

$ pip install locust 
$ locust -f load_test.py 
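
The load_test.py scenario might look roughly like this (a sketch, not necessarily the repo's exact file):

from locust import HttpUser, task, between


class CrawlUser(HttpUser):
    # Simulated users wait 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def post_crawl(self):
        self.client.post("/crawl", json={"url": "https://www.bbc.com"})

Point Locust at the API with --host http://localhost:8000 and drive the traffic from its web UI.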


Tasks Monitoring

Monitor the tasks and workers using the Flower dashboard at http://localhost:5555/dashboard


Continuous Integration

Basic CI is integrated into the repo using GitHub Actions to run the test cases on PRs and merges to master.

Enhancements

  • Smart Crawler: Stream the text instead of downloading it.
  • Better mocks in the tests and better test coverage.
  • A Continuous Delivery action to publish the Docker images to a Docker registry.
