This project creates a data lake on AWS S3 with the following indices:
- A file-based content index, based on the following article.
- A HyperLogLog metadata index, which estimates the number of unique values of a metadata field.
- A Count-Min-Sketch metadata index, which estimates how many times each value of a metadata field occurs.
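To make the HyperLogLog index concrete, here is a minimal, illustrative sketch of the underlying idea in Python; the class name, register count, and hash choice are assumptions for illustration only and are not taken from this project's .NET implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Illustrative HyperLogLog: estimates distinct values with m = 2**b registers."""

    def __init__(self, b: int = 10):
        self.b = b
        self.m = 1 << b              # number of registers
        self.registers = [0] * self.m

    def _hash(self, value: str) -> int:
        # 64-bit hash derived from SHA-1; any well-mixed hash works here
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value: str) -> None:
        h = self._hash(value)
        idx = h & (self.m - 1)       # low b bits choose a register
        rest = h >> self.b           # remaining (64 - b) bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # linear-counting correction for small cardinalities
            return self.m * math.log(self.m / zeros)
        return raw
```

Because only the per-register maxima are stored, re-adding a value never changes the estimate, and the whole structure stays a few kilobytes regardless of how many values are indexed.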
The data lake is accompanied by an API that supports the following operations:
- Upload a file to the data lake and start the indexing process
- Query for content based on the content-index
- Query for metadata statistics based on the metadata indices
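The Count-Min-Sketch index behind the metadata-statistics queries can likewise be sketched in a few lines; this is an illustrative Python rendition of the general technique (the names and dimensions are assumptions, not this project's actual .NET code):

```python
import hashlib

class CountMinSketch:
    """Illustrative Count-Min-Sketch: estimates per-value counts, never undercounting."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.counts = [[0] * width for _ in range(depth)]

    def _index(self, row: int, value: str) -> int:
        # one independent hash function per row, derived by seeding with the row number
        digest = hashlib.sha1(f"{row}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value: str, amount: int = 1) -> None:
        for row in range(self.depth):
            self.counts[row][self._index(row, value)] += amount

    def estimate(self, value: str) -> int:
        # the minimum across rows is the least-collided counter for this value
        return min(self.counts[row][self._index(row, value)]
                   for row in range(self.depth))
```

Collisions can only inflate a counter, so the estimate is always an upper bound on the true count, which is why the minimum across rows is the right answer to report.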
| Project Name | Purpose |
|---|---|
| CovidDataLake.Cloud | Common code to access the cloud resources of the data lake |
| CovidDataLake.Common | Common code shared between all of the services |
| CovidDataLake.ContentIndexer | The engine that indexes the contents of files in the data lake |
| CovidDataLake.MetadataIndexer | The engine that indexes the metadata of files in the data lake |
| CovidDataLake.Pubsub | Common code to publish and subscribe to events in the ETL process |
| CovidDataLake.Queries | The business logic of the queries performed on the data lake |
| CovidDataLake.Storage | Common code to handle usage of local disk storage |
| CovidDataLake.WebAPI | The API for the data lake, including updates and queries |
- .NET 6.0 installed
- Redis server running and configured correctly in all relevant `appsettings.json` files in the following way:

  ```json
  {
    "Redis": "[HOSTNAME]:[PORT],connectTimeout=15000,syncTimeout=15000"
  }
  ```
- Kafka cluster running with all instances configured correctly in all relevant `appsettings.json` files in the following way:

  ```json
  {
    "Kafka": {
      "Instances": [
        {
          "Host": "[HOST_NAME]",
          "Port": 9092
        }
      ],
      "Topic": "[TOPIC_NAME]",
      "GroupId": "[CONSUMER_GROUP_ID]"
    }
  }
  ```

  The `GroupId` setting is used only by the consuming projects (i.e. the indexing engines).
- Project-specific requirements are listed inside each project's folder