Data Workspace - a PostgreSQL-based open source data analysis platform


This is the entry-point repository for Data Workspace, a PostgreSQL-based open source data analysis platform with features for users with a range of technical skills. It contains a brief catalogue of all Data Workspace repositories (below), the source for the Data Workspace developer documentation, and the Terraform code to deploy Data Workspace into AWS.

Tip

Looking for the Data Workspace Django application? It's now in the data-workspace-frontend repo.


Catalogue of Data Workspace repositories

The components of Data Workspace are stored across several Git repositories.

Core

  • data-workspace (this repository)

    Contains the Terraform code to deploy Data Workspace in AWS, and the public-facing developer documentation for Data Workspace. See Contents of this repository for details of what goes where.

  • data-workspace-frontend

    Contains the core Django application that defines most of the user-facing components of Data Workspace. Also contains "the proxy" that sits in front of the Django application, integrates with SSO, and routes requests, for example to tools.

    Also contains the Dockerfiles for other components. However, it's planned to move these out to separate repositories.

Tools

Low level

Some of the components of Data Workspace are lower level and less Data Workspace-specific. They can, at least theoretically, be re-used outside of Data Workspace.

  • pg-sync-roles

    Used to synchronise permissions between the data-workspace-frontend metadata database and users in the main PostgreSQL database.

  • mobius3

    Used in on-demand tools to sync users' files with S3.

  • dns-rewrite-proxy

    Used in tools to filter and rewrite DNS requests.

  • theia-postgres

    Used in Theia to give reasonably straightforward access to a PostgreSQL database.

  • mirror-git-to-s3
    git-lfs-http-mirror

    Used to mirror git repositories that use Large File Storage (LFS) to S3 and to then access them from inside tools.

  • ecs-pipeline

    Used to deploy Data Workspace from Jenkins.

  • quicksight-bulk-update-datasets

    A CLI script to make bulk updates to Amazon QuickSight datasets.

Ingesting data

These components are usually used to ingest data into the PostgreSQL database that's the core of Data Workspace.

  • pg-bulk-ingest
    pg-force-execute

    Used to ingest large amounts of data into the PostgreSQL database.

  • to-file-like-obj

    Used in several ways to convert from iterables of bytes to a file-like object for memory-efficient data ingestion, for example when parsing CSVs.

  • iterable-subprocess

    Used to extract data from archives in a format that requires running an external program.

  • stream-read-ods

    Used to extract data from Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.

  • stream-unzip

    Used to extract data from ZIP files in a memory-efficient and disk-efficient way.

  • stream-read-xbrl

    Used to ingest data from Companies House.

  • sqlite-s3vfs

    Used to generate large and complex SQLite files that are then ingested into the Data Workspace PostgreSQL database.

  • s3-dropbox

    Used to power a simple API that accepts incoming data files in any format and drops them in S3, from where they are subsequently ingested into Data Workspace.
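
To illustrate the idea behind the to-file-like-obj entry above, here is a minimal sketch that exposes an iterable of bytes as a file-like object so a CSV can be parsed without holding the whole file in memory. It uses only the Python standard library and is not the to-file-like-obj implementation or API; the names `IterableReader` and `rows_from_chunks` are hypothetical and exist only for this example.

```python
import csv
import io


class IterableReader(io.RawIOBase):
    """Read-only file-like view over an iterable of bytes (hypothetical helper)."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull chunks until there is data to hand over or the iterable ends
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # zero bytes read signals end of file
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def rows_from_chunks(chunks):
    """Lazily yield CSV rows from an iterable of bytes (assumes UTF-8)."""
    binary = io.BufferedReader(IterableReader(chunks))
    text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
    yield from csv.reader(text)


# Chunk boundaries deliberately fall mid-field and mid-line
chunks = (b"id,na", b"me\n1,al", b"ice\n2,bob\n")
rows = list(rows_from_chunks(chunks))
```

Because the chunks are consumed only as the parser asks for more data, the same pattern should work when the iterable is backed by a streamed HTTP response or an S3 download rather than an in-memory tuple.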

Publishing data

These components are used when publishing data from Data Workspace.

  • public-data-api

    Makes data available to the public.

  • stream-zip

    Creates ZIP files in a memory-efficient and disk-efficient way.

  • stream-write-ods

    Creates Open Document Spreadsheet (ODS) files in a memory-efficient and disk-efficient way.

  • postgresql-proxy

    Part of the system that makes data available to other internal applications.


Contents of this repository