Clean top-level directory files
pflooky committed Oct 16, 2024
1 parent bb88a0c commit 8360dce
Showing 15 changed files with 81 additions and 159 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/check.yml
@@ -17,8 +17,8 @@ jobs:
- name: Run integration tests
id: tests
uses: data-catering/insta-integration@v1
env:
LOG_LEVEL: debug
with:
configuration_file: misc/insta-integration/insta-integration.yaml
- name: Print results
run: |
echo "Records generated: ${{ steps.tests.outputs.num_records_generated }}"
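The action now takes its settings from a file in the repository rather than an environment variable. A quick local sanity check that the referenced file exists and is valid YAML (a sketch, assuming Python with PyYAML is installed):

```shell
# Confirm the config the workflow points at is present and parses as YAML
test -f misc/insta-integration/insta-integration.yaml && echo "config present"
python -c "import yaml; yaml.safe_load(open('misc/insta-integration/insta-integration.yaml'))"
```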
2 changes: 1 addition & 1 deletion Dockerfile
@@ -8,7 +8,7 @@ RUN addgroup -S app \
&& apk add --no-cache libc6-compat bash \
&& mkdir -p /opt/app /opt/DataCaterer/connection /opt/DataCaterer/plan /opt/DataCaterer/execution /opt/DataCaterer/report \
&& chown -R app:app /opt/app /opt/DataCaterer/connection /opt/DataCaterer/plan /opt/DataCaterer/execution /opt/DataCaterer/report
COPY --chown=app:app script app/src/main/resources app/build/libs /opt/app/
COPY --chown=app:app misc/docker-image app/src/main/resources app/build/libs /opt/app/

USER app
WORKDIR /opt/app
180 changes: 49 additions & 131 deletions README.md
@@ -6,7 +6,7 @@

A test data management tool with automated data generation, validation and cleanup.

![Basic data flow for Data Caterer](design/high_level_flow-run-config-basic-flow.svg)
![Basic data flow for Data Caterer](misc/design/high_level_flow-run-config-basic-flow.svg)

[Generate data](https://data.catering/setup/generator/data-generator/) for databases, files, messaging systems or HTTP
requests via UI, Scala/Java SDK or YAML input, executed via Spark. Run
@@ -34,21 +34,21 @@ and deep dive into issues [from the generated report](https://data.catering/samp
- [Alerts to be notified of results](https://data.catering/setup/report/alert/)
- [Run as GitHub Action](https://github.com/data-catering/insta-integration)

![Basic flow](design/basic_data_caterer_flow_medium.gif)
![Basic flow](misc/design/basic_data_caterer_flow_medium.gif)

## Quick start

1. [Mac download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-mac.zip)
2. [Windows download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-windows.zip)
1. [UI App: Mac download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-mac.zip)
2. [UI App: Windows download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-windows.zip)
1. After downloading, go to the 'Downloads' folder and run 'Extract All' on data-caterer-windows
2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
3. Click 'More info', then at the bottom click 'Run anyway'
4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
5. If your browser doesn't open, go to [http://localhost:9898](http://localhost:9898) in your preferred browser
3. [Linux download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-linux.zip)
3. [UI App: Linux download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-linux.zip)
4. Docker
```shell
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer-basic:0.11.9
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer-basic:0.11.11
```
[Open localhost:9898](http://localhost:9898).
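
If the UI does not come up, a couple of standard Docker commands can confirm the container started (names match the run command above):

```shell
# Check the container is running and inspect startup logs
docker ps --filter name=datacaterer
docker logs datacaterer
# The UI should respond on port 9898
curl -I http://localhost:9898
```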

@@ -64,147 +64,65 @@ cd data-caterer-example && ./run.sh

### Supported data sources

Data Caterer supports the below data sources. Additional data sources can be added on demand. [Check here for
the full roadmap](#roadmap).

| Data Source Type | Data Source | Support | Free |
|------------------|------------------------------------|---------|------|
| Cloud Storage | AWS S3 |||
| Cloud Storage | Azure Blob Storage |||
| Cloud Storage | GCP Cloud Storage |||
| Database | Cassandra |||
| Database | MySQL |||
| Database | Postgres |||
| Database | Elasticsearch |||
| Database | MongoDB |||
| File | CSV |||
| File | Delta Lake |||
| File | JSON |||
| File | Iceberg |||
| File | ORC |||
| File | Parquet |||
| File | Hudi |||
| HTTP | REST API |||
| Messaging | Kafka |||
| Messaging | Solace |||
| Messaging | ActiveMQ |||
| Messaging | Pulsar |||
| Messaging | RabbitMQ |||
| Metadata | Great Expectations |||
| Metadata | Marquez |||
| Metadata | OpenAPI/Swagger |||
| Metadata | OpenMetadata |||
| Metadata | Open Data Contract Standard (ODCS) |||
| Metadata | Amundsen |||
| Metadata | Datahub |||
| Metadata | Data Contract CLI |||
| Metadata | Solace Event Portal |||


## Supported use cases

1. Insert into single data sink
2. Insert into multiple data sinks
1. Foreign keys associated between data sources
2. Number of records per column value
3. Set random seed at column and whole data generation level
4. Generate real-looking data (via DataFaker) and edge cases
1. Names, addresses, places etc.
2. Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
3. Nullability
5. Send events progressively
6. Automatically insert data into a data source
1. Read metadata from the data source and insert into all sub data sources (e.g. tables)
2. Get statistics from existing data in the data source, if it exists
7. Track and delete generated data
8. Extract data profiling and metadata from given data sources
1. Calculate the total number of combinations
9. Validate data
1. Basic column validations (not null, contains, equals, greater than)
2. Aggregate validations (group by account_id and sum amounts should be less than 100, each account should have at
least one transaction)
3. Upstream data source validations (generate data and then check same data is inserted in another data source with
potential transformations)
4. Column name validations (check count and ordering of column names)
10. Data migration validations
1. Ensure row counts are equal
2. Check both data sources have same values for key columns
Data Caterer supports the below data sources. [Check here for the full roadmap](#roadmap).

| Data Source Type | Data Source | Support |
|------------------|------------------------------------|---------|
| Cloud Storage | AWS S3 ||
| Cloud Storage | Azure Blob Storage ||
| Cloud Storage | GCP Cloud Storage ||
| Database | Cassandra ||
| Database | MySQL ||
| Database | Postgres ||
| Database | Elasticsearch ||
| Database | MongoDB ||
| File | CSV ||
| File | Delta Lake ||
| File | JSON ||
| File | Iceberg ||
| File | ORC ||
| File | Parquet ||
| File | Hudi ||
| HTTP | REST API ||
| Messaging | Kafka ||
| Messaging | Solace ||
| Messaging | ActiveMQ ||
| Messaging | Pulsar ||
| Messaging | RabbitMQ ||
| Metadata | Data Contract CLI ||
| Metadata | Great Expectations ||
| Metadata | Marquez ||
| Metadata | OpenAPI/Swagger ||
| Metadata | OpenMetadata ||
| Metadata | Open Data Contract Standard (ODCS) ||
| Metadata | Amundsen ||
| Metadata | Datahub ||
| Metadata | Solace Event Portal ||

## Run Configurations

Different ways to run Data Caterer based on your use case:

![Types of run configurations](design/high_level_flow-run-config.svg)

## Sponsorship

Data Caterer is set up under a sponsorware model where all features are available to sponsors. The core features
are available here in this project for anyone to use, fork, update and improve as the open core.

Sponsors have access to the following features:

- All data sources (see [here for all data sources](https://data.catering/setup/connection/))
- Batch and Event generation
- [Auto generation from data connections or metadata sources](https://data.catering/setup/guide/scenario/auto-generate-connection/)
- Suggest data validations
- [Clean up generated and consumed data](https://data.catering/setup/guide/scenario/delete-generated-data/)
- Run as many times as you want, not charged by usage
- Metadata discovery
- [Plus more to come](#roadmap)
Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer
as an enterprise, you will need to sponsor the project.

[Find out more details here to help with sponsorship.](https://data.catering/sponsor)

This is inspired by the [mkdocs-material project](https://github.com/squidfunk/mkdocs-material) which
[follows the same model](https://squidfunk.github.io/mkdocs-material/insiders/).

## Contributing

[View details here about how you can contribute to the project.](CONTRIBUTING.md)
[View details here about how you can contribute to the project.](misc/CONTRIBUTING.md)

## Additional Details

## Run Configurations

Different ways to run Data Caterer based on your use case:

![Types of run configurations](misc/design/high_level_flow-run-config.svg)

### Design

[Design motivations and details can be found here.](https://data.catering/setup/design)

### Roadmap

[Check here for the full list.](https://data.catering/use-case/roadmap/)

#### UI

1. Allow the application to run with UI enabled
2. Run as a long-lived app with a UI that interacts with the existing app as a single container
3. Ability to run as UI, Spark job or both
4. Persist data in files or database (Postgres)
5. UI will show the history of data generation/validation runs and allow users to delete generated data, create new scenarios and define data connections

#### Distribution

##### Docker

```shell
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
#open localhost:9898
```
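
To clean up after a local test of the image (standard Docker commands; the container and volume names match the run command above):

```shell
# Stop and remove the container, then drop the persisted volume
docker rm -f datacaterer
docker volume rm data-caterer-data
```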

##### Jpackage

```bash
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
```
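
jpackage ships with the JDK (14+) and reads the config files above via the @file argument syntax; a quick preflight before building installers:

```shell
# jpackage needs a full JDK on the PATH; both should report matching versions
jpackage --version
java --version
```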

##### Java 17 VM Options

```shell
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
-Dlog4j.configurationFile=classpath:log4j2.properties
```
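
These are JVM arguments for running the app jar directly on Java 17. A sketch, where the jar path is illustrative and the flag list is abbreviated to the first two entries from the block above:

```shell
# Abbreviated here; include every --add-opens flag from the block above
JAVA_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED"
# Jar path is hypothetical; use the shadow jar produced by the gradle build
java $JAVA_OPTS -Dlog4j.configurationFile=classpath:log4j2.properties -jar app/build/libs/data-caterer.jar
```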
20 changes: 0 additions & 20 deletions local-docker-build.sh

This file was deleted.

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
29 changes: 29 additions & 0 deletions misc/distribution/README.md
@@ -0,0 +1,29 @@
#### Distribution

##### Docker

```shell
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
#open localhost:9898
```

##### Jpackage

```bash
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
```

##### Java 17 VM Options

```shell
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
-Dlog4j.configurationFile=classpath:log4j2.properties
```
File renamed without changes.
File renamed without changes.
5 changes: 0 additions & 5 deletions run-docker.sh

This file was deleted.
