To-do list

Legend
⬆️ Higher Priority
⬇️ Lower Priority
🔵 In Progress
🔷 On Deck

Basic Setup

  • Basic implementation of table namespace: C:03/02/2020
  • Basic Configuration: C:03/03/2020
  • Basic Operator Registration: C:03/04/2020
  • Package configuration: C:03/03/2020
  • Basic Documentation: C:03/04/2020
  • Prettify outputs and update documentation: C:03/04/2020
  • Validate mason configuration file using json_schema: C:03/04/2020
  • Validate operators according to json_schema: C:03/04/2020 (see the validation sketch after this list)
  • Add logger with log levels. C:03/05/2020
  • ⬆️ Validate client compatibility with operators C: 03/06/2020
  • ⬆️ Bring the old REST API interface up to date (migrate https://github.com/samtecspg/data/tree/master/catalog/api to operators): C: 03/09/2020
  • ⬆️ Move over tests and mocks. C: 03/11/2020
  • ⬆️ More test coverage of basic functionality:
    • Parameters C: 03/11/2020
    • Configurations C:03/11/2020
    • Operators C: 03/12/2020
    • Engines C: 03/13/2020
    • Clients C: 03/23/2020
    • CLI (started, progress made)
  • ⬆️ Clean up REST API implementation C: 03/13/2020
  • Create "mason run" CLI command C: 03/09/2020
  • Pull REST API responses through to the Swagger spec (200 status example). Redid the REST API interface so this is no longer needed
  • ⬆️ Advanced Operator Registration Documentation
  • ⬆️ New Client Documentation
  • ⬇️ New Engine Documentation
  • ⬆️ Dockerize mason implementation C: 03/09/2020
  • Build and refine "Engines" first order concept C: 03/06/2020
  • Establish docker style sha registration for installed operators to fix version conflicts
  • Explore GraphQL for the API? Note: found a way around this for now; won't do
  • Generalize Engines to be "registerable" and serial
  • Support multiple clients for a single engine type.
  • Establish common interfaces for metastore engine objects (metastore engine models, i.e. Table, Database, Schedule, etc.) C: 03/20/2020
  • Allow operator definitions to have valid "engine set" configurations C:03/22/2020
  • Allow for multiple configurations C: 04/08/2020
  • Clean up multiple configurations -> add id, don't use enumerate. C:05/10/2020
  • Allow operators to be only one level deep, i.e. not have a namespace (both in definition and folder configuration): not going to do right now
  • ⬆️ Consolidate response.add_ actions and logger._ actions into one command C: 04/13/2020
  • Interpolate environment variables into config and have that affect config validation (sketched after this list): C: 04/23/2020
  • Clean up mock implementations: C: 04/23/2020
  • Consolidate all AWS response error parsing methods.
  • Improve performance by moving around imports. C: 06/15/2020
  • Version checking in installed operators
  • Replace operator installation method with something more robust (Done, kind of)
  • Parameter type inference and checking
  • Parameter aliases: ex: database_name -> bucket_name
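
The json_schema validation and environment-variable items above suggest a two-step load: interpolate, then validate. A minimal sketch, assuming the jsonschema package and a hypothetical config_schema.json; load_config is illustrative, not mason's actual API:

```python
import json
import os

from jsonschema import ValidationError, validate

def load_config(config_path: str, schema_path: str) -> dict:
    # Interpolate ${VAR}-style environment variables BEFORE validation,
    # so the schema checks the resolved values, not the placeholders.
    with open(config_path) as f:
        raw = os.path.expandvars(f.read())
    config = json.loads(raw)

    # Validate the resolved config against the JSON Schema document.
    with open(schema_path) as f:
        schema = json.load(f)
    try:
        validate(instance=config, schema=schema)
    except ValidationError as e:
        raise ValueError(f"Invalid mason configuration: {e.message}")
    return config
```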

Test Cases

  • Malformed Parameters, extraneous ":". Improve parameter specification. Make docs more explicit C: 03/11/2020
  • Extraneous parameters. Showing up in "required parameters" error return incorrectly. C: 03/11/2020
  • Better errors around Permission errors C: 03/13/2020

Execution Engine

  • Look into using calcite or coral to extend spark operators to presto and hive (***)
  • Look into using protos to communicate metastore schema to execution engine or possibly look into other serialization formats (avro)
  • 'job_proxy' execution client which hits a mason client running against a job queue for requests

Metastore

  • Look into datahub internal schema representation

Workflows

  • Validated Infer Workflow (5-step)
  • Infer Workflow (1-step)
    • Glue Support C: 05/10/2020
    • Athena Support with local scheduler (this ended up just being local instantiation of the underlying infer operator)
  • Athena Support with airflow scheduler
  • Allow run flag to trigger existing workflow

Operators

  • table summary operator
  • Infer operator
    • Glue Support: C: long time ago
  • Schema merge operator C:09/04/2020
  • JSON explode operator
  • S3 -> ES egress operator
  • 🔵 Table Format operator (reformats and repartitions data)
  • 🔷 Table "join" operator (on set of columns)
  • Dedupe Operator
  • Table Operators
    • Query (requires metastore and execution engine) C:04/28/2020
    • Delete. C: 04/29/2020
    • Delete Database
  • Separate out database operator?
  • Metastore Database operator
    • List databases (~= s3 list buckets)
  • Jobs operators (scheduler):
    • Get C: 04/08/2020
    • List
  • Scheduler operators:
    • Delete C: 04/29/2020
    • Create
    • List
  • ⬇️ Smart cast operator --> if all partitions but one have Int and one has String, cast the String partition (sketched below)
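
A rough sketch of the smart cast idea, assuming per-partition column dtypes have already been collected into a dict; the actual cast would be handed off to an execution engine, so only the planning step is shown:

```python
from collections import Counter

def plan_smart_casts(partition_dtypes: dict) -> list:
    """partition_dtypes maps partition -> dtype for a single column,
    e.g. {"year=2019": "Int", "year=2020": "String"}."""
    majority_dtype, _ = Counter(partition_dtypes.values()).most_common(1)[0]
    # Any partition whose dtype disagrees with the majority gets cast.
    return [(partition, dtype, majority_dtype)
            for partition, dtype in partition_dtypes.items()
            if dtype != majority_dtype]

# One String partition among Int partitions -> cast it to Int.
print(plan_smart_casts({"p=1": "Int", "p=2": "Int", "p=3": "String"}))
# [('p=3', 'String', 'Int')]
```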

Clients

Metastore

Glue

  • Basic setup. C: 4/04/2020
  • Fix conflicting schemas error with differing partition data. C: 04/2020

Hive

  • ⬇️ Basic setup

S3

  • ⬆️ Basic Setup C: 3/20/2020
  • Schema implementations
    • ParquetSchema C: 3/17/2020
    • CSV Schema C: 4/16/2020
    • JsonL schema
    • Json schema
    • ⬇️ Avro schema
    • ⬇️ Msgpack schema
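
For the schema implementations above, the Parquet case only needs the file footer; a minimal sketch using pyarrow (an assumption; the actual reader may differ):

```python
import pyarrow.parquet as pq

# Read only the footer metadata; no row data is loaded.
schema = pq.read_schema("part-00000.parquet")
for field in schema:
    print(field.name, field.type)
```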

Athena

  • DDL Generation
  • Add partitioning concepts to DDL generation
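
A sketch of what partition-aware DDL generation might emit; the table, columns, and STORED AS choice are made-up examples:

```python
def athena_ddl(table, columns, partitions, location):
    # columns and partitions are (name, athena_type) pairs; partition
    # columns must not be repeated in the main column list.
    cols = ",\n  ".join(f"{n} {t}" for n, t in columns)
    parts = ", ".join(f"{n} {t}" for n, t in partitions)
    return (f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
            f"PARTITIONED BY ({parts})\n"
            f"STORED AS PARQUET\n"
            f"LOCATION '{location}'")

print(athena_ddl("events", [("id", "string"), ("value", "double")],
                 [("dt", "string")], "s3://bucket/events/"))
```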

Execution Engine

Local

  • ⬆️ Basic setup

IPython/Jupyter

  • ⬆️ Basic setup
  • ⬆️ Papermill integration
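
Papermill's core API is a single call; a sketch of how a notebook-backed execution client might invoke it (the notebook paths and parameters are illustrative):

```python
import papermill as pm

# Execute a parameterized notebook, writing an executed copy with outputs.
pm.execute_notebook(
    "operators/table_infer.ipynb",   # hypothetical input notebook
    "runs/table_infer_out.ipynb",    # executed copy
    parameters={"database_name": "db", "table_name": "events"},
)
```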

Athena

  • Basic setup C: 04/26/2020

Spark

  • Basic setup
    • Kubernetes Operator Runner C: 4/06/2020
    • EMR Runner
    • Local Runner
  • Check that file format is supported

Presto

  • ⬆️ Basic setup

Dask

  • Basic Setup
    • Kubernetes Runner

Scheduler

  • Multiple step workflow implementation
  • DAG validation (validate that it is actually a Directed Acyclic Graph; general validity is already checked)
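
Acyclicity can be checked with Kahn's algorithm: repeatedly remove zero-in-degree nodes, and if any node survives, there is a cycle. A self-contained sketch (the adjacency-dict shape is an assumption about how workflow steps are modeled):

```python
from collections import deque

def is_dag(edges: dict) -> bool:
    """edges maps step -> list of downstream steps."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    indegree = {n: 0 for n in nodes}
    for vs in edges.values():
        for v in vs:
            indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    seen = 0
    while queue:
        n = queue.popleft()
        seen += 1
        for v in edges.get(n, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # A complete topological order exists iff the graph is acyclic.
    return seen == len(nodes)

assert is_dag({"a": ["b"], "b": ["c"]})
assert not is_dag({"a": ["b"], "b": ["a"]})
```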

Glue

  • Basic setup
  • Workflow Implementation

Airflow

  • Basic setup

Local (synchronous)

  • Basic setup: C: 06/10/2020

Storage

Redshift

Elasticsearch

S3

  • Move some metastore concepts over here like "paths"
  • Basic Setup
  • Redshift
  • Elasticsearch

⬆️ Preparing for public

  • Remove Samtec-specific examples from examples/ files. Use public examples.