Version: 0.0.1
ODD Redshift adapter is used OpenDataDiscovery for pull metadata to OpenDataDiscovery from AWS Amazon Redshift.
Metadata discovery process is very similar to metrics/logs/traces gathering process. We might have pull or push model. OpenDataDiscovery use pull model for ODD Redshift adapter.
Pulling metadata directly from the source seems is the most straightforward way to gather metadata, but it may become a nightmare to develope and maintain a centralized fleet of domain-specific crawlers. OpenDataDiscovery introduces new entity: OpenDataDiscovery Adapter. In this case, ODD Redshift adapter. The main goal of these adapters are to be source specific and expose only information could be gathered from certain data source.
ODD Redshift adapter is Python Flask application composed to Docker image. Web server is gunicorn. DB API library is psycopg2. For details about image see docker-compose.yml and Dockerfile. Use this files for local tests and development. In production Kubernetes on Amazon EKS (Amazon Elastic Kubernetes Service) can be used to run ODD Redshift adapter. See Helm Charts files for example.
Endpoint http://0.0.0.0:8080/entities/
ODD Redshift adapter start endpoint http://0.0.0.0:8080/entities/ by default,
and it connects to Redshift for collect metadata.
ODD Redshift adapter provides list of available metadata entities updated after changed_since time.
ODD Redshift adapter used next environmental variables:
Cache refreshing interval (minutes), default 60:
- SCHEDULER_INTERVAL_MINUTES=60
ODD Redshift adapter used PostgreSQL driver and setting for it:
- REDSHIFT_HOST=...redshift.amazonaws.com
- REDSHIFT_PORT=5439
- REDSHIFT_DATABASE=oddadapter
- REDSHIFT_USER=oddadapter
- REDSHIFT_PASSWORD=odd-adapter-password
More details see https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
Variables for FLASK see in FLASK documentation, for example:
- FLASK_ENVIRONMENT=development
Knowledge about data is spread amongst many people and systems. OpenDataDiscovery role is to provide a standard protocol how metadata can be collected and correlated in as automated fashion as possible. To enable many different datasources and tools to expose the metadata we need agreement on what data should be exposed and in what format (structures). Specification contains of high level entities:
- DataInput,
- DataTransformer (DataTransformerRun),
- DataSet
- DataQualityTest (DataQualityTest)
- DataConsumer.
Each entity has a unique url describing a place, system and an identifier in this system.
ODD Redshift adapter map Amazon Redshift metadata to OpenDataDiscovery metadata by Specification. Mapping details and examples see below.
BaseObject and DataEntity is base components for other metadata components. DataEntityList is list all types of DataEntity and it has data_source_oddrn.
Format ODDRN:
- DataEntityList.data_source_oddrn
format: f'//redshift/host/{REDSHIFT_HOST}:{REDSHIFT_PORT}/databases/{REDSHIFT_DATABASE}'
example: "//redshift/odd-redshift-adapter.c0mxpzchxkqz.us-west-2.redshift.amazonaws.com:5439/databases/oddadapter" - DataSet.parent_oddrn
format: f'//redshift/host/{REDSHIFT_HOST}:{REDSHIFT_PORT}/databases/{REDSHIFT_DATABASE}/schemas/{schema_name}' example: "//redshift/odd-redshift-adapter.c0mxpzchxkqz.us-west-2.redshift.amazonaws.com:5439/databases/oddadapter/schemas/smith" - BaseObject.oddrn for DataSet, DataSetField.parent_field_oddrn
format: f'//redshift/host/{REDSHIFT_HOST}:{REDSHIFT_PORT}/databases/{REDSHIFT_DATABASE}/schemas/{schema_name}/tables/{table_name}' example: "//redshift/odd-redshift-adapter.c0mxpzchxkqz.us-west-2.redshift.amazonaws.com:5439/databases/oddadapter/schemas/public/tables/event" - BaseObject.oddrn for DataSetField
format: f'//redshift/host/{REDSHIFT_HOST}:{REDSHIFT_PORT}/databases/{REDSHIFT_DATABASE}/schemas/{schema_name}/tables/{table_name}/columns/{column_name}' example: "//redshift/odd-redshift-adapter.c0mxpzchxkqz.us-west-2.redshift.amazonaws.com:5439/databases/oddadapter/schemas/public/tables/event/columns/eventid" - BaseObject.oddrn for DataTransformer
format: f'//redshift/host/{REDSHIFT_HOST}:{REDSHIFT_PORT}/databases/{REDSHIFT_DATABASE}/schemas/{schema_name}/tables/{view_name}' example: "//redshift/odd-redshift-adapter.c0mxpzchxkqz.us-west-2.redshift.amazonaws.com:5439/databases/oddadapter/schemas/public/tables/event_all"
Format DataEntity.created_at, DataEntity.updated_at and other values of datetime type:
format: %Y-%m-%dT%H:%M:%S.ssssss
example: 2021-03-25T06:12:13.488486
python: datetime.isoformat()
MetadataExtension.schema_url field use definitions from file schema:
https://raw.githubusercontent.com/opendatadiscovery/opendatadiscovery-specification/main/specification/extensions/redshift.json
DataSet is a collection of data stored in structured format. In ODD Redshift adapter DataSet might be a base table, view, materialized view, external table, external view, external materialized view.
Redshift views PG_CATALOG.SVV_TABLES, PG_CATALOG.SVV_ALL_TABLES, PG_CATALOG.SVV_REDSHIFT_TABLES, PG_CATALOG.SVV_EXTERNAL_TABLES, PG_CATALOG.SVV_TABLE_INFO maps to DataSet (and BaseObject, DataEntity, MetadataExtension) entities.
Redshift views PG_CATALOG.SVV_ALL_COLUMNS, PG_CATALOG.SVV_REDSHIFT_COLUMNS, PG_CATALOG.SVV_EXTERNAL_COLUMNS maps to DataSetField, DataSetFieldType (and BaseObject, DataEntity, MetadataExtension) entities.
Redshift views maps to DataTransformer entity (and Dataset too).
Column pg_catalog.svv_tables.remarks maps to DataTransformer.description.
DataTransformer.source_code_url field is not used.
DataTransformer.inputs field is not used (="").
DataTransformer.outputs field is not used (="").
Column pg_get_viewdef(pg_catalog.pg_class_info.reloid, true) maps to DataTransformer.sql field.
DataTransformer.subtype field = DATATRANSFORMER_VIEW.
Amazon Redshift has next types of DataTransformer:
- view,
- materialized view,
- external view,
- external materialized view
BooleanFieldStat, ComplexFieldStat, NumberFieldStat, StringFieldStat, BinaryFieldStat, IntegerFieldStat, DateTimeFieldStat
Redshift adapter get statistic data from Redshift tables and views to BooleanFieldStat, ComplexFieldStat, NumberFieldStat, StringFieldStat, BinaryFieldStat, IntegerFieldStat, DateTimeFieldStat entities.