Both the development and the documentation are works in progress.
DBFeeder is an all-in-one solution that crawls and scrapes information from the web, then populates a relational database with it.
The solution can be configured following the steps below:

- Create JSON configuration files for the crawler (instructions here); a rough sketch of such a file follows this list
- Create JSON configuration files for the scraper (instructions here)
- Define entities (EF Core) using Devart Entity Developer (instructions here)
- Update the `docker-compose.yml` file to create a DAC service for each entity created (a sketch of such a service entry appears further below)
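As a rough illustration only (the real schema is described in the linked instructions; every field name below is an assumption), a crawler configuration file might look like this:

```json
{
  "source": "example-catalog",
  "startUrl": "https://example.com/catalog",
  "urlSelector": "a.detail-link",
  "pollingIntervalSeconds": 3600
}
```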
The solution runs using the `docker-compose.yml` file:

```
docker compose build
docker compose up
```
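Because the command side is split per entity, `docker-compose.yml` grows by one DAC service per table. A minimal sketch of what such an entry could look like (service names, paths, and the environment variable are assumptions, not the repository's actual values):

```yaml
services:
  rabbitmq:
    image: rabbitmq:3
  dac-person:                                   # hypothetical DAC service for a Person entity/table
    build:
      context: .
      dockerfile: DataAccessCommand/Dockerfile  # assumed path
    depends_on:
      - rabbitmq                                # scraped data arrives over RabbitMQ
    environment:
      - ENTITY_NAME=Person                      # assumed switch selecting the entity this DAC owns
```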
A complete retrieval of a single entity's information comprises the following phases (hypothetical message shapes for the hand-offs are sketched after the list):

- The crawler extracts the target URL
- The scraper extracts information from the target URL
- The Data Access Command (DAC) generates the entity and populates the corresponding table
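Each phase hands its output to the next over RabbitMQ. The contracts below are a sketch of what those messages might carry; the type and property names are assumptions for illustration, not the repository's actual contracts:

```csharp
using System.Collections.Generic;

// Hypothetical message passed from the crawler to the scraper.
public record CrawledUrlMessage(string Source, string Url);

// Hypothetical message passed from the scraper to the DAC.
public record ScrapedEntityMessage(
    string Source,
    string EntityName,
    Dictionary<string, string> Fields);
```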
The solution is composed of the following Docker images:

- Crawler: a container built from a .NET 7 worker service image, running multithreaded with one task per source/configuration
- Scraper: a container running multiple .NET 7 worker service processes, one process per source
- DataAccessCommand: one container per entity/DB table
Stack:
- Docker
- .NET 7
- RabbitMQ
- EF Core
- SQLite
The design aims to:

- Maximize throughput
- Allow scalability
- Improve efficiency
- Ensure robustness (needs more work)
- Allow reusability
A simplified CQRS pattern has been applied, consisting of a single database and one DAC service per table.
- Crawler: in charge of retrieving URLs from an HTML source page. More information here.
- Scraper: in charge of retrieving, from the crawled URLs, the information used to populate the database. More information here.
- DataAccessCommand: in charge of populating the database with the scraped information (a sketch follows this list). More information here.
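To make the command side concrete, here is a minimal sketch of what one DAC worker could look like, assuming a `Person` entity, a `person-scraped` queue, and JSON payloads (none of these names come from the repository, and the entity would normally be generated with Devart Entity Developer rather than written by hand):

```csharp
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Hosting;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Hypothetical entity owned by this DAC service.
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class FeederContext : DbContext
{
    public DbSet<Person> People => Set<Person>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=feeder.db"); // SQLite, as in the stack above
}

// One worker per entity/table, mirroring the one-DAC-per-table layout.
public class PersonDacWorker : BackgroundService
{
    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var factory = new ConnectionFactory { HostName = "rabbitmq" }; // assumed broker host
        var connection = factory.CreateConnection();                   // disposal elided in this sketch
        var channel = connection.CreateModel();
        channel.QueueDeclare("person-scraped", durable: true, exclusive: false, autoDelete: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, ea) =>
        {
            // Deserialize the scraped payload and write it to the table this DAC owns.
            var person = JsonSerializer.Deserialize<Person>(ea.Body.Span)!;
            using var db = new FeederContext();
            db.Database.EnsureCreated(); // sketch only; real code would use migrations
            db.People.Add(person);
            db.SaveChanges();
        };
        channel.BasicConsume(queue: "person-scraped", autoAck: true, consumer: consumer);
        return Task.CompletedTask;
    }
}
```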
This repo is dedicated to Peter, a friend who gave me the chance to learn how life can be enjoyable.