covid19-data-lake

This project aims to create a data lake on AWS S3 with the following indices:

File-based content index based on the following article.
HyperLogLog metadata index that estimates the number of unique values of a metadata field
Count-Min Sketch metadata index that estimates how many times each value of a metadata field occurs
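
To make the two metadata indices more concrete, here is a minimal, illustrative sketch of both data structures in Python. It is not the project's implementation (the project itself targets .NET); the class names, the hash function, and the default sizes (`width`, `depth`, `precision`) are assumptions chosen only for demonstration.

```python
import hashlib
import math


def _hash64(value: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a string; the seed selects a hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class CountMinSketch:
    """Approximate per-value counts; estimates never fall below the true count."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, value: str, count: int = 1) -> None:
        # Increment one counter per row, each chosen by a different hash.
        for row in range(self.depth):
            self.table[row][_hash64(value, row) % self.width] += count

    def estimate(self, value: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][_hash64(value, row) % self.width]
            for row in range(self.depth)
        )


class HyperLogLog:
    """Approximate number of distinct values using 2**precision registers."""

    def __init__(self, precision: int = 10):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        index = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    cms, hll = CountMinSketch(), HyperLogLog()
    for country in ["IL", "US", "IL", "FR", "IL"]:
        cms.add(country)
        hll.add(country)
    print("estimated count of 'IL':", cms.estimate("IL"))
    print("estimated distinct values:", round(hll.count()))
```

Both structures use a fixed amount of memory regardless of how many values are indexed, which is the usual reason to prefer them over exact counts for metadata statistics.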

The data lake is accompanied by an API to do the following:

Upload a file to the data lake and start the indexing process
Query for content based on the content-index
Query for metadata statistics based on the metadata indices

Project Structure

Project Name	Purpose
CovidDataLake.Cloud	Common code to access the cloud resources of the data lake
CovidDataLake.Common	Common code that is shared between all of the services
CovidDataLake.ContentIndexer	The engine that indexes the contents of files in the data lake
CovidDataLake.MetadataIndexer	The engine that indexes the metadata of files in the data lake
CovidDataLake.Pubsub	Common code to publish and subscribe to events in the ETL process
CovidDataLake.Queries	The business-logic of the queries performed on the data lake
CovidDataLake.Storage	Common code to handle usage of local disk storage
CovidDataLake.WebApi	The API for the data lake, includes updates and queries

Getting Started Requirements

.NET 6.0 installed
Redis server running and configured correctly in all relevant appsettings.json files in the following way:

{
    "Redis": "[HOSTNAME]:[PORT],connectTimeout=15000,syncTimeout=15000"
}
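
As a quick sanity check of this setting, the following illustrative Python sketch splits a value of the form `host:port,option=value,...` into endpoints and options. The helper name `parse_redis_setting`, the default port fallback, and the sample host are assumptions for demonstration; the services themselves read this string through their own .NET configuration code.

```python
import json


def parse_redis_setting(setting: str):
    """Split 'host:port,key=value,...' into (endpoints, options)."""
    endpoints, options = [], {}
    for part in setting.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            # Option such as connectTimeout=15000
            key, value = part.split("=", 1)
            options[key] = value
        elif ":" in part:
            host, _, port = part.rpartition(":")
            endpoints.append((host, int(port)))
        else:
            # Assumption: fall back to the standard Redis port when none is given.
            endpoints.append((part, 6379))
    return endpoints, options


# Hypothetical appsettings.json content with the placeholders filled in.
config = json.loads(
    '{"Redis": "localhost:6379,connectTimeout=15000,syncTimeout=15000"}'
)
endpoints, options = parse_redis_setting(config["Redis"])
print(endpoints)  # [('localhost', 6379)]
print(options)    # {'connectTimeout': '15000', 'syncTimeout': '15000'}
```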

Kafka cluster running with all instances configured correctly in all relevant appsettings.json files in the following way:

{
    "Kafka": {
        "Instances": [
            {
                "Host": "[HOST_NAME]",
                "Port": 9092
            }
        ],
        "Topic": "[TOPIC_NAME]",
        "GroupId": "[CONSUMER_GROUP_ID]" // used only by consuming projects (i.e. the indexing engines)
    }
}
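
Kafka clients typically expect the broker list as a single comma-separated `host:port` string. The sketch below, in illustrative Python, shows one way the `Instances` list could be turned into such a string. The helper names, the sample hosts, and the naive comment stripping are assumptions for demonstration and are not taken from the project's code.

```python
import json
import re


def strip_line_comments(text: str) -> str:
    """Naively drop // comments that follow whitespace or start the text.

    Assumption: no string value contains whitespace followed by '//'.
    """
    return re.sub(r"(^|\s)//[^\n]*", r"\1", text)


def bootstrap_servers(kafka_section: dict) -> str:
    """Join every configured instance into 'host:port,host:port'."""
    return ",".join(
        f'{instance["Host"]}:{instance["Port"]}'
        for instance in kafka_section["Instances"]
    )


# Hypothetical settings with two brokers; all values are placeholders.
SAMPLE = """
{
    "Kafka": {
        "Instances": [
            {"Host": "broker-1", "Port": 9092},
            {"Host": "broker-2", "Port": 9092}
        ],
        "Topic": "example-topic",
        "GroupId": "example-indexer" // only needed by consuming projects
    }
}
"""

kafka = json.loads(strip_line_comments(SAMPLE))["Kafka"]
print(bootstrap_servers(kafka))  # broker-1:9092,broker-2:9092
```

The comment stripping is needed only because Python's `json` module rejects `//` comments, which the settings fragment above contains.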

Project-specific requirements are listed inside each project's folder

eran-gil/covid19-data-lake
