Skip to content

Latest commit

 

History

History
135 lines (104 loc) · 2.72 KB

setup.md

File metadata and controls

135 lines (104 loc) · 2.72 KB

Setup

Here below a brief overview of the setup realized.

Architecture

The architecture is composed of 5 components:

  • Storage
  • Processing
  • Interfaces
  • CI/CD
  • Containers

Storage:

  • data is stored on hdfs
  • documents are indexed on elastic search
  • code is stored on git and managed by gitlab

Processing:

  • distributed: spark (python API)
  • non distributed: scikit-learn, pandas (python)

Interfaces:

  • services and custom UI via APIs in Flask and Vue.js apps
  • JDBC of parquet tables via spark jdbc thrift server
  • JupyterHub: ETL and Data Science Notebooks
  • Kibana for elastic search exploration and dashboards
  • BI analtics with Redash and Power BI

CI/CD:

  • Scheduling and Execution via Gitlab CI

Containers:

  • Management, Deployment, Scaling via Kubernetes (To do)

Provisioning:

  • Ansible
  • Kubernetes

Getting Started

  • Data

    • hierarchy
    • conventions
  • Ingest:

    • jupyter
    • datalabframework,
    • spark
    • git / gitlab
    • gitlab ci/cd
  • ETL:

    • jupyter
    • datalabframework,
    • spark
    • git / gitlab
    • gitlab ci/cd
  • Reporting:

    • jupyter
    • datalabframework,
    • spark
    • git / gitlab
    • gitlab ci/cd
    • Templating
  • Data Science:

    • jupyter
    • datalabframework
    • spark
    • elastic
    • elastictools
    • git, gitlab
    • Flask Restful API
  • BI and Analytics

    • Kibana
    • BI Tools: Redash, Power BI, Tableau, etc
  • Provisioning

    • overview of system:
      • services running
      • Managing APIs
      • Managing Batch flows
      • Managing Streaming
    • boostrapping

Workshop

2 hours workshop audience: BI, data science, data engineering, devops and application teams

  • Teko

    • Intro, principles and architecture (Nat) (10 min)

    • Services overview (Dzung) (10 min)

      • Spark UI
      • Spark History
      • Spark thrift server
      • JupyterHub
      • HDFS (UI)
      • Gitlab CI/CD
      • BI tools: Redash
    • Ingestion (Tuan Anh) (10 min)

      • list of database and tables (status)
      • options for read and write (incremental vs full table scan)
      • diagram (source -> notebook -> gitlab ci -> hdfs)
      • future plans and improvements
    • ETL and Reporting (Hung) (10 min)

    • Data Services (Data Science) (Thuc) ( 10 min )

    • Data Services (Search) (Thuc) ( 10 min )

    • Data Services (API and UI) (Thuc) ( 10 min )

  • -break- (5 min)

  • VnPay

    • Data Model and Flow (Huong) (10 min)
    • ETL VnPay (Quan) (10 min)
    • ETL VnPay Reporting (Tuan) (10 min)
  • Future plans, Roadmap, Vision (Nat)

    • Minio
    • Kubernetes
    • BI connections
      • Power BI
      • Tableau
  • Q&A (all)