mgorav/data-health-service

Data Health As Service Using Graph DB

In this blog I will explore the design and implementation of "Health Of The Data" by designing and building "Data Health As Service", which finds the health of data from various contributors using graph inferencing. We will also see how to calculate the health of data mathematically. Let's start our journey by understanding a modern data platform architecture.

Modern Data Platform Architecture

The picture below shows a modern data platform architecture:

[Figure: Modern Data Platform architecture]

The above architecture shows the key components/services of the "Data Platform Architecture". In this blog, we will look at the "Data Health Service" and the components/services it requires to function.

Context

The prime goal of data in a data platform is to enable data-driven decisions using reports. The quality of the reports built from that data depends, at a broad level, on:

  • Quality of data
  • Timeliness of data
  • Data lineage
  • ...

Since the quality of data depends on various factors, I would like to propose a service/component in the design called "Data Health As Service". This service will derive the health of data from the following key components:

  • Data Quality as service
  • ETL pipeline metadata service (pipeline context metadata)
  • ...

Hence, in a nutshell, "Data Health As Service" will provide an "Index Of Readiness" of data, as shown below:

[Figure: Index Of Readiness provided by Data Health As Service]

Data Health Using Index Of Readiness

As the saying goes - "Data is Gold". A good jeweller will tell you to double or triple check the quality before buying gold. The quality of gold is measured in "carats", and the carat value governs the cost of the gold. For example, 24 carat gold is more expensive than 22 carat gold, and so on. The same applies to data. The "index of readiness of data" is calculated as:

Index Of Readiness = 1 / ( sum(scores of metrics influencing data quality) + score(data lineage) + score(data integrity) )

If the index of readiness is close to zero, the data is healthy: higher scores make the denominator larger and push the index towards zero.
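The calculation can be sketched in a few lines of Java. This is only an illustration, not code from the service: the class name `IndexOfReadiness`, the method `compute`, and the example score values are all hypothetical. It also assumes that every score sits in the denominator and that a higher score means better data, which is the reading that matches "closer to zero is healthy".

```java
import java.util.List;

// Hypothetical helper, not part of the repository.
final class IndexOfReadiness {

    private IndexOfReadiness() {}

    /**
     * Assumed reading of the formula:
     * index = 1 / (sum(quality scores) + lineage score + integrity score).
     * Higher scores mean better data, so healthy data yields an index near zero.
     */
    static double compute(List<Double> qualityScores, double lineageScore, double integrityScore) {
        if (lineageScore < 0 || integrityScore < 0) {
            throw new IllegalArgumentException("scores must be non-negative");
        }
        double total = lineageScore + integrityScore;
        for (double score : qualityScores) {
            if (score < 0) {
                throw new IllegalArgumentException("scores must be non-negative");
            }
            total += score;
        }
        // No positive score at all: treat the data as least healthy (assumption).
        if (total == 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return 1.0 / total;
    }

    public static void main(String[] args) {
        // Example scores are made up for illustration.
        double healthy = compute(List.of(9.0, 8.5, 9.5), 10.0, 9.0);
        double poor = compute(List.of(1.0, 0.5), 1.0, 0.5);
        System.out.println("healthy index = " + healthy);
        System.out.println("poor index    = " + poor);
    }
}
```

With these made-up scores the well-scored dataset gets a smaller index than the poorly scored one, which is the ordering the formula is meant to express.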

So mathematically, we can derive the health of data using the "Index Of Readiness". In a real-world scenario, however, it is hard to obtain all the parameters that affect the health of data.

The million-dollar question: how can we derive a "data health sense" from the metadata non-mathematically?

Data Health Using Graph DB

Graph databases are powerful at deducing connections between data (no matter what type of data). Hence, I would like to propose using a graph database to derive the health of data. What would the graph model look like for such a service?

Data Health Graph Model

The following diagram shows the design of the graph model for data health deduction:

[Figure: Data health graph model]

DataHealth

The starting point is the "Data Health Service". Its responsibility is to enroll the data contributors that contribute to the final report/outcome, DQ metrics, etc.

Contributor

This is responsible for capturing information affecting the final report. An example of a Contributor is an ETL pipeline.

Report

This is the final node in the graph. A typical example is a cube or a batch report.
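The three node types described above can be sketched as plain Java classes. This is a minimal sketch, not the repository's actual entities: the field names, the `enroll` and `feeds` methods, and the edge directions (DataHealth to Contributor, Contributor to Report) are assumptions. The Spring Data Neo4j mapping annotations that a real entity would carry are left out so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java view of the graph model; names and edges are assumptions.
final class DataHealthModel {

    // Final node in the graph, e.g. a cube or a batch report.
    static final class Report {
        final String name;

        Report(String name) {
            this.name = name;
        }
    }

    // Captures information affecting the final report, e.g. an ETL pipeline.
    static final class Contributor {
        final String name;
        final List<Report> reports = new ArrayList<>(); // assumed edge: Contributor -> Report

        Contributor(String name) {
            this.name = name;
        }

        Contributor feeds(Report report) {
            reports.add(report);
            return this;
        }
    }

    // Starting point of the graph; enrolls the contributors.
    static final class DataHealth {
        final String name;
        final List<Contributor> contributors = new ArrayList<>(); // assumed edge: DataHealth -> Contributor

        DataHealth(String name) {
            this.name = name;
        }

        DataHealth enroll(Contributor contributor) {
            contributors.add(contributor);
            return this;
        }
    }

    public static void main(String[] args) {
        // Node names are made up for illustration.
        Report salesCube = new Report("sales-cube");
        Contributor etl = new Contributor("etl-pipeline").feeds(salesCube);
        DataHealth service = new DataHealth("data-health-service").enroll(etl);

        // Walk the graph from the starting node to every report.
        for (Contributor c : service.contributors) {
            for (Report r : c.reports) {
                System.out.println(service.name + " -> " + c.name + " -> " + r.name);
            }
        }
    }
}
```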

Data Health Entities UML Class Diagram

The diagram below shows the data health entities class diagram:

[Figure: Data health entities UML class diagram]

Data Health Output

The graph DB output looks like this:

[Figure: Data health graph output from the graph DB]

Conclusion

Using the above graph depiction, it is easy to deduce the health of data. If any link from a contributor or report is missing, the data is directly implied to be unhealthy. This is a great way to find the health of data, because the graph-based approach allows metrics that influence data health to be added iteratively.
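The "missing link means unhealthy" rule can be sketched without a database. The example below is an illustration under assumptions, not the service's implementation: the class `LinkHealthCheck`, its methods, and the contributor and report names are hypothetical, and the graph is reduced to an in-memory map from each contributor to the reports it is linked to.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory check; in the real service the links live in the graph DB.
final class LinkHealthCheck {

    private LinkHealthCheck() {}

    /**
     * Returns the expected contributors that have no link to the given report.
     * 'links' maps a contributor name to the set of reports it is linked to.
     */
    static Set<String> missingContributors(String report,
                                           Set<String> expectedContributors,
                                           Map<String, Set<String>> links) {
        Set<String> missing = new TreeSet<>();
        for (String contributor : expectedContributors) {
            Set<String> linkedReports = links.getOrDefault(contributor, Set.of());
            if (!linkedReports.contains(report)) {
                missing.add(contributor);
            }
        }
        return missing;
    }

    /** A report is treated as healthy only when no expected link is missing. */
    static boolean isHealthy(String report,
                             Set<String> expectedContributors,
                             Map<String, Set<String>> links) {
        return missingContributors(report, expectedContributors, links).isEmpty();
    }

    public static void main(String[] args) {
        // Names are made up: the DQ service never linked itself to the report.
        Map<String, Set<String>> links = Map.of("etl-pipeline", Set.of("sales-cube"));
        Set<String> expected = Set.of("etl-pipeline", "dq-service");

        System.out.println("healthy: " + isHealthy("sales-cube", expected, links));
        System.out.println("missing: " + missingContributors("sales-cube", expected, links));
    }
}
```

Adding a new health-influencing metric then amounts to adding one more expected contributor, which mirrors the iterative extension described above.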

How to Run?

Data Health As Service is built using:

  • Java 14
  • Spring Boot 2.3.3
  • Neo4j
  • Spring Data Neo4j

Below are the steps to run this service:

  • Run the Neo4j Docker container
  • Run the DataHealthServiceApplication
