Both the development and the documentation are works in progress.
DBFeeder is an all-in-one solution that crawls and scrapes information from the web, then populates a relational database with it.
The solution can be configured following the steps below:

- Create JSON configuration files for the crawler (instructions here); a rough sketch of such a file follows this list
- Create JSON configuration files for the scraper (instructions here)
- Define entities (EF Core) using Devart Entity Developer (instructions here)
- Update the `docker-compose.yml` file to create a DAC service for each entity created (a sketch of such a service entry appears further below)
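As a rough illustration only (the real schema is described in the linked instructions; every field name below is an assumption), a crawler configuration file might look like this:

```json
{
  "source": "example-catalog",
  "startUrl": "https://example.com/catalog",
  "urlSelector": "a.detail-link",
  "pollingIntervalSeconds": 3600
}
```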
The solution runs using the `docker-compose.yml` file:

```
docker compose build
docker compose up
```
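Because the command side is split per entity, `docker-compose.yml` grows by one DAC service per table. A minimal sketch of what such an entry could look like (service names, paths, and the environment variable are assumptions, not the repository's actual values):

```yaml
services:
  rabbitmq:
    image: rabbitmq:3
  dac-person:                                   # hypothetical DAC service for a Person entity/table
    build:
      context: .
      dockerfile: DataAccessCommand/Dockerfile  # assumed path
    depends_on:
      - rabbitmq                                # scraped data arrives over RabbitMQ
    environment:
      - ENTITY_NAME=Person                      # assumed switch selecting the entity this DAC owns
```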
A complete retrieval of a single entity's information comprises the following phases (hypothetical message shapes for the hand-offs are sketched after the list):

- The crawler extracts the target URL
- The scraper extracts information from the target URL
- The Data Access Command (DAC) generates the entity and populates the corresponding table
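Each phase hands its output to the next over RabbitMQ. The contracts below are a sketch of what those messages might carry; the type and property names are assumptions for illustration, not the repository's actual contracts:

```csharp
using System.Collections.Generic;

// Hypothetical message passed from the crawler to the scraper.
public record CrawledUrlMessage(string Source, string Url);

// Hypothetical message passed from the scraper to the DAC.
public record ScrapedEntityMessage(
    string Source,
    string EntityName,
    Dictionary<string, string> Fields);
```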
The solution is composed of the following Docker images:

- Crawler: a container built from a .NET 7 worker service image, running multithreaded with one task per source/configuration
- Scraper: a container running multiple .NET 7 worker service processes, one process per source
- DataAccessCommand: one container per entity/DB table
Stack:
- Docker
- .NET 7
- RabbitMQ
- EF Core
- SQLite
The design aims to:

- Maximize throughput
- Allow scalability
- Improve efficiency
- Ensure robustness (needs more work)
- Allow reusability
A simplified CQRS pattern has been applied, consisting of a single database and one DAC service per table.
- Crawler: in charge of retrieving URLs from an HTML source page. More information here.
- Scraper: in charge of retrieving, from the crawled URLs, the information used to populate the database. More information here.
- DataAccessCommand: in charge of populating the database with the scraped information (a sketch follows this list). More information here.
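To make the command side concrete, here is a minimal sketch of what one DAC worker could look like, assuming a `Person` entity, a `person-scraped` queue, and JSON payloads (none of these names come from the repository, and the entity would normally be generated with Devart Entity Developer rather than written by hand):

```csharp
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Hosting;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Hypothetical entity owned by this DAC service.
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class FeederContext : DbContext
{
    public DbSet<Person> People => Set<Person>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=feeder.db"); // SQLite, as in the stack above
}

// One worker per entity/table, mirroring the one-DAC-per-table layout.
public class PersonDacWorker : BackgroundService
{
    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var factory = new ConnectionFactory { HostName = "rabbitmq" }; // assumed broker host
        var connection = factory.CreateConnection();                   // disposal elided in this sketch
        var channel = connection.CreateModel();
        channel.QueueDeclare("person-scraped", durable: true, exclusive: false, autoDelete: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, ea) =>
        {
            // Deserialize the scraped payload and write it to the table this DAC owns.
            var person = JsonSerializer.Deserialize<Person>(ea.Body.Span)!;
            using var db = new FeederContext();
            db.Database.EnsureCreated(); // sketch only; real code would use migrations
            db.People.Add(person);
            db.SaveChanges();
        };
        channel.BasicConsume(queue: "person-scraped", autoAck: true, consumer: consumer);
        return Task.CompletedTask;
    }
}
```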
This repo is dedicated to Peter, a friend who gave me the chance to learn how life can be enjoyable.