Merge pull request #48 from exasol/develop
Adds Kafka import integration
morazow authored Oct 31, 2019
2 parents e069a05 + fcd6b20 commit c87eaac
Showing 81 changed files with 4,455 additions and 1,346 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -37,6 +37,7 @@ script:

after_success:
- bash <(curl -s https://codecov.io/bash)
- ./sbtx coveralls

before_deploy:
- echo "Ensure assembly jar file is created for a $TRAVIS_TAG"
18 changes: 18 additions & 0 deletions AUTHORS.md
@@ -0,0 +1,18 @@
# Authors

A successful open-source project depends on its [community of
contributors][contributors].

## Maintainers

The maintainers of the project are:

* Exasol Developers <[Exasol](https://github.com/exasol)>

## Contributors

These are the people whose contributions have made the project possible:

* Hari Nair (CommScope)

[contributors]: https://github.com/exasol/cloud-storage-etl-udfs/graphs/contributors
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -64,7 +64,7 @@ steps below to submit your patches.
- **Make sure everything is working**, run `./scripts/ci.sh`
- If everything is okay, commit and push to your fork
- [Submit a pull request][submit-pr]
- Let's work together to get your changes reviewed
- Let us work together to get your changes reviewed
- Merge into master or development branches

If your commit fixes any particular issue, please specify it in your commit
264 changes: 32 additions & 232 deletions README.md
@@ -2,6 +2,7 @@

[![Build Status][travis-badge]][travis-link]
[![Codecov][codecov-badge]][codecov-link]
[![Coveralls][coveralls-badge]][coveralls-link]
[![GitHub Latest Release][gh-release-badge]][gh-release-link]

<p style="border: 1px solid black;padding: 10px; background-color: #FFFFCC;">
@@ -10,239 +11,37 @@ source project which is officially supported by Exasol. For any question, you
can contact our support team.
</p>

## Table of contents

* [Overview](#overview)
* [A short example](#a-short-example)
* [Features](#features)
* [Configuration](#configuration-parameters)
* [Setup and deployment](#setup-and-deployment)
* [Building from source](#building-from-source)
* [Contributing](#contributing)

## Overview

This repository contains helper code to create [Exasol][exasol] ETL UDFs in
order to read from and write to public cloud storage services such as [AWS
S3][s3], [Google Cloud Storage][gcs] and [Azure Blob Storage][azure].

Please be aware that Exasol already natively supports [loading CSV format from
AWS S3][sol-594]; however, not from Google Cloud Storage or Azure Storage
systems. Additionally, transferring data between Exasol and [Apache
Hive][apache-hive] is supported via [Hadoop ETL UDFs][hadoop-etl-udfs].

## A short example

Here we show an excerpt from a simple example of importing and exporting Parquet
formatted data stored in Amazon S3.

Please see [the full list of all cloud storage providers and guidelines to
configure them](./docs/overview.md).

### Create an Exasol table

We are going to use a `SALES_POSITIONS` Exasol table to import data into or to
export its contents to Amazon S3.

```sql
CREATE SCHEMA RETAIL;
OPEN SCHEMA RETAIL;

DROP TABLE IF EXISTS SALES_POSITIONS;

CREATE TABLE SALES_POSITIONS (
SALES_ID INTEGER,
POSITION_ID SMALLINT,
ARTICLE_ID SMALLINT,
AMOUNT SMALLINT,
PRICE DECIMAL(9,2),
VOUCHER_ID SMALLINT,
CANCELED BOOLEAN
);
```

### Import from S3

```sql
IMPORT INTO SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 's3a://my-bucket/parquet/import/sales_positions/*'
DATA_FORMAT = 'PARQUET'
S3_ACCESS_KEY = 'MY_AWS_ACCESS_KEY'
S3_SECRET_KEY = 'MY_AWS_SECRET_KEY'
S3_ENDPOINT = 's3.MY_REGION.amazonaws.com'
PARALLELISM = 'nproc()';
```
This repository contains helper code to create [Exasol][exasol] user defined
functions (UDFs) in order to read from and write to public cloud storage
systems.

### Export to S3

```sql
EXPORT SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 's3a://my-bucket/parquet/export/sales_positions/'
S3_ACCESS_KEY = 'MY_AWS_ACCESS_KEY'
S3_SECRET_KEY = 'MY_AWS_SECRET_KEY'
S3_ENDPOINT = 's3.MY_REGION.amazonaws.com'
PARALLELISM = 'iproc(), floor(random()*4)';
```

Please change the paths and parameters accordingly.
Additionally, it provides UDF scripts to import data from [Apache
Kafka][apache-kafka] clusters.
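
Since this release also ships the Kafka integration, a minimal import sketch
follows. It is an illustration only: the script name `ETL.KAFKA_PATH` and the
parameter names (`BOOTSTRAP_SERVERS`, `SCHEMA_REGISTRY_URL`, `TOPICS`,
`TABLE_NAME`) are assumptions, so please check [the Kafka import
guide](docs/kafka/import.md) for the exact interface.

```sql
-- Hypothetical sketch of a Kafka import; the script and parameter names are
-- assumptions, see docs/kafka/import.md for the authoritative syntax.
IMPORT INTO SALES_POSITIONS
FROM SCRIPT ETL.KAFKA_PATH WITH
BOOTSTRAP_SERVERS = 'kafka01.example.com:9092'
SCHEMA_REGISTRY_URL = 'http://schema-registry.example.com:8081'
TOPICS = 'SALES-POSITIONS'
TABLE_NAME = 'RETAIL.SALES_POSITIONS';
```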

## Features

The following table shows the features currently supported in the latest release.

<table>
<tr>
<th rowspan="2">Storage System / Data Format</th>
<th colspan="2">Parquet</th>
<th colspan="2">Avro</th>
<th colspan="2">Orc</th>
</tr>
<tr>
<th>IMPORT</th>
<th>EXPORT</th>
<th>IMPORT</th>
<th>EXPORT</th>
<th>IMPORT</th>
<th>EXPORT</th>
</tr>
<tr>
<td>Amazon S3</td>
<td rowspan="4" align="center">&#10004;</td>
<td rowspan="4" align="center">&#10004;</td>
<td rowspan="4" align="center">&#10004;</td>
<td rowspan="4" align="center">&#10005;</td>
<td rowspan="4" align="center">&#10004;</td>
<td rowspan="4" align="center">&#10005;</td>
</tr>
<tr>
<td>Google Cloud Storage</td>
</tr>
<tr>
<td>Azure Blob Storage</td>
</tr>
<tr>
<td>Azure Data Lake (Gen1) Storage</td>
</tr>
</table>

## Configuration Parameters

The following configuration parameters should be provided when using the
cloud-storage-etl-udfs.

| Parameter | Default | Description |
|:-------------------------------|:---------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------|
|``BUCKET_PATH`` |*<none>* |A path to the data bucket. It should start with the cloud storage system specific scheme, for example `s3a`. |
|``DATA_FORMAT`` |``PARQUET`` |The data storage format in the provided path. |
|``PARALLELISM IN IMPORT`` |``nproc()`` |The number of parallel instances to be started for importing data. *Please multiply this to increase the parallelism*. |
|``PARALLELISM IN EXPORT`` |``iproc()`` |The parallel instances for exporting data. *Add another random number to increase the parallelism per node*. For example, ``iproc(), floor(random()*4)``. |
|``PARQUET_COMPRESSION_CODEC`` |``uncompressed``|The compression codec to use when exporting the data into parquet files. Other options are: `snappy`, `gzip` and `lzo`. |
|``EXPORT_BATCH_SIZE`` |``100000`` |The number of records per file from each VM. For example, if a single VM gets `1M` records, it will export ten files with 100000 records each by default. |
|``storage specific parameters`` |*<none>* |These are parameters specific to each cloud storage system, mainly used for authentication. |

Please see [the parameters specific to each cloud storage system and how to
configure them here](./docs/overview.md).
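
For instance, to raise the import parallelism you can multiply the `nproc()`
expression, and an export can combine `iproc()` with a random term as described
in the table above. The sketch below only varies the S3 examples from the
earlier sections using the documented parameters; adjust the bucket path and
credentials to your setup.

```sql
-- Import with doubled parallelism per data node
IMPORT INTO SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 's3a://my-bucket/parquet/import/sales_positions/*'
DATA_FORMAT = 'PARQUET'
S3_ACCESS_KEY = 'MY_AWS_ACCESS_KEY'
S3_SECRET_KEY = 'MY_AWS_SECRET_KEY'
S3_ENDPOINT = 's3.MY_REGION.amazonaws.com'
PARALLELISM = 'nproc()*2';

-- Export with up to four writers per node and gzip-compressed Parquet files
EXPORT SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 's3a://my-bucket/parquet/export/sales_positions/'
PARQUET_COMPRESSION_CODEC = 'gzip'
S3_ACCESS_KEY = 'MY_AWS_ACCESS_KEY'
S3_SECRET_KEY = 'MY_AWS_SECRET_KEY'
S3_ENDPOINT = 's3.MY_REGION.amazonaws.com'
PARALLELISM = 'iproc(), floor(random()*4)';
```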

## Setup and deployment

Please follow the steps described below in order to set up the `IMPORT` and
`EXPORT` UDF scripts.

### Download the file

Download the latest jar file from [releases][jars].

Alternatively, you can build it from source by following the [build from
source](#building-from-source) guide. This allows you to use the latest commits
that are not released yet.
* Import formatted data from public cloud storage systems.
* The following data formats are supported as source file formats when importing:
[Apache Avro][avro], [Apache Orc][orc] and [Apache Parquet][parquet].
* Export Exasol table data to public cloud storage systems.
* The following data format is supported as a sink file format when exporting:
[Apache Parquet][parquet].
* The following cloud storage systems are supported: [Amazon S3][s3], [Google Cloud
Storage][gcs], [Azure Blob Storage][azure-blob] and [Azure Data Lake (Gen1)
Storage][azure-data-lake].
* Import Apache Avro formatted data from Apache Kafka clusters.

### Create Exasol Bucket
## Documentation

In order to use the import or export functionality of `cloud-storage-etl-udfs`,
you have to upload the jar to a bucket in the Exasol bucket file system
(BucketFS).
For more information please check out the following guides.

For this overview we are using an example bucket named `bucket1`.

### Upload the JAR file to the bucket

This will allow using the jar in the ETL UDF scripts later on. Before uploading
the jar, please make sure that the BucketFS ports are open.

Here we use the port number `2580` for HTTP.

```bash
curl \
-X PUT \
-T path/to/jar/cloud-storage-etl-udfs-{VERSION}.jar \
http://w:MY-PASSWORD@EXA-NODE-ID:2580/bucket1/cloud-storage-etl-udfs-{VERSION}.jar
```

Please change other required parameters such as `VERSION` and `EXA-NODE-ID`.

### Create ETL UDFs scripts

Run the following SQL commands to create Exasol scripts.

```sql
CREATE SCHEMA ETL;
OPEN SCHEMA ETL;

-- Import related scripts

CREATE OR REPLACE JAVA SET SCRIPT IMPORT_PATH(...) EMITS (...) AS
%scriptclass com.exasol.cloudetl.scriptclasses.ImportPath;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

CREATE OR REPLACE JAVA SET SCRIPT IMPORT_FILES(...) EMITS (...) AS
%env LD_LIBRARY_PATH=/tmp/;
%scriptclass com.exasol.cloudetl.scriptclasses.ImportFiles;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

CREATE OR REPLACE JAVA SCALAR SCRIPT IMPORT_METADATA(...)
EMITS (filename VARCHAR(200), partition_index VARCHAR(100)) AS
%scriptclass com.exasol.cloudetl.scriptclasses.ImportMetadata;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

-- Export related scripts

CREATE OR REPLACE JAVA SET SCRIPT EXPORT_PATH(...) EMITS (...) AS
%scriptclass com.exasol.cloudetl.scriptclasses.ExportPath;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

CREATE OR REPLACE JAVA SET SCRIPT EXPORT_TABLE(...) EMITS (ROWS_AFFECTED INT) AS
%scriptclass com.exasol.cloudetl.scriptclasses.ExportTable;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/
```

Please do not forget to change the bucket name and the jar version according
to your setup.
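
The Kafka import added with this release needs its own scripts in addition to
the ones above. The script and class names below (for example
`com.exasol.cloudetl.scriptclasses.KafkaPath`) are assumptions for
illustration; the [Kafka import guide](docs/kafka/import.md) documents the
exact definitions.

```sql
-- Hypothetical Kafka import scripts; the names, classes and emitted columns
-- are assumptions, see docs/kafka/import.md for the authoritative versions.
CREATE OR REPLACE JAVA SET SCRIPT KAFKA_PATH(...) EMITS (...) AS
%scriptclass com.exasol.cloudetl.scriptclasses.KafkaPath;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

CREATE OR REPLACE JAVA SET SCRIPT KAFKA_IMPORT(...) EMITS (...) AS
%scriptclass com.exasol.cloudetl.scriptclasses.KafkaImport;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/

CREATE OR REPLACE JAVA SET SCRIPT KAFKA_METADATA(...)
EMITS (partition_index DECIMAL(18,0), max_offset DECIMAL(36,0)) AS
%scriptclass com.exasol.cloudetl.scriptclasses.KafkaMetadata;
%jar /buckets/bfsdefault/bucket1/cloud-storage-etl-udfs-{VERSION}.jar;
/
```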

## Building from source

Clone the repository,

```bash
git clone https://github.com/exasol/cloud-storage-etl-udfs

cd cloud-storage-etl-udfs/
```

Create assembly jar,

```bash
./sbtx assembly
```

The packaged jar should be located at
`target/scala-2.12/cloud-storage-etl-udfs-{VERSION}.jar`.
* [User Guide](docs/user_guide.md)
- [Cloud Storage Systems](docs/storage/cloud_storages.md)
- [Apache Kafka Import](docs/kafka/import.md)
* [Deployment Guide](docs/deployment_guide.md)
* [Developer Guide](docs/developer_guide.md)

## Contributing

@@ -251,20 +50,21 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.
For requesting a feature, providing feedback or reporting an issue, please
open a [GitHub issue][gh-issues].

[travis-badge]: https://travis-ci.org/exasol/cloud-storage-etl-udfs.svg?branch=master
[travis-badge]: https://img.shields.io/travis/exasol/cloud-storage-etl-udfs/master.svg?logo=travis
[travis-link]: https://travis-ci.org/exasol/cloud-storage-etl-udfs
[codecov-badge]: https://codecov.io/gh/exasol/cloud-storage-etl-udfs/branch/master/graph/badge.svg
[codecov-link]: https://codecov.io/gh/exasol/cloud-storage-etl-udfs
[gh-release-badge]: https://img.shields.io/github/release/exasol/cloud-storage-etl-udfs.svg
[coveralls-badge]: https://img.shields.io/coveralls/exasol/cloud-storage-etl-udfs.svg
[coveralls-link]: https://coveralls.io/github/exasol/cloud-storage-etl-udfs
[gh-release-badge]: https://img.shields.io/github/release/exasol/cloud-storage-etl-udfs.svg?logo=github
[gh-release-link]: https://github.com/exasol/cloud-storage-etl-udfs/releases/latest
[gh-issues]: https://github.com/exasol/cloud-storage-etl-udfs/issues
[exasol]: https://www.exasol.com/en/
[sol-594]: https://www.exasol.com/support/browse/SOL-594
[apache-hive]: https://hive.apache.org/
[hadoop-etl-udfs]: https://github.com/exasol/hadoop-etl-udfs
[s3]: https://aws.amazon.com/s3/
[gcs]: https://cloud.google.com/storage/
[azure]: https://azure.microsoft.com/en-us/services/storage/blobs/
[parquet]: https://parquet.apache.org/
[azure-blob]: https://azure.microsoft.com/en-us/services/storage/blobs/
[azure-data-lake]: https://azure.microsoft.com/en-us/solutions/data-lake/
[apache-kafka]: https://kafka.apache.org/
[avro]: https://avro.apache.org/
[jars]: https://github.com/exasol/cloud-storage-etl-udfs/releases
[orc]: https://orc.apache.org/
[parquet]: https://parquet.apache.org/