
redshift-spectrum-demo

How to query Parquet data in S3 via Redshift Spectrum, based on a Glue Table.

Requirements

  • Terraform 0.14.9
  • Python 3.8.7
    • pipenv

Environment

  • macOS (Catalina)

Usage

First of all, change the name of the S3 bucket that will store the Parquet data.

// terraform.tfvars
- s3_bucket_parquet = "redshift-spectrum-demo"
+ s3_bucket_parquet = "YOUR_BUCKET_NAME"

Review the planned infrastructure resources, then deploy them to AWS.

$ terraform init
$ terraform plan
$ terraform apply

Create the Parquet test data and upload it to the S3 bucket.

$ cd utils
$ pipenv --python 3.8.7
$ pipenv install --dev
$ pipenv run python generate_parquet.py
$ aws s3 cp ../data/sample.parq s3://YOUR_BUCKET_NAME/parquet/
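
For reference, here is a minimal sketch of what utils/generate_parquet.py might contain, assuming pandas and pyarrow as the dev dependencies; the rows mirror the query result shown later, but the actual script may differ.

# utils/generate_parquet.py (hypothetical sketch)
import pandas as pd

# Sample rows matching the columns queried later: year, make, model.
df = pd.DataFrame({
    "year": [2014, 2017, 2018, 2020],
    "make": ["toyota", "nissan", "honda", "hyndai"],
    "model": ["corolla", "sentra", "civic", "nissan"],
})

# Write the Parquet file to the path used by the aws s3 cp command above.
df.to_parquet("../data/sample.parq", engine="pyarrow", index=False)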

Check the endpoint and port of the Redshift cluster, then try connecting to it with the psql command.

$ aws redshift describe-clusters --cluster-identifier demo-redshift | jq '.Clusters[].Endpoint'
{
  "Address": "demo-redshift.xxxxxxxxxxxx.ap-northeast-1.redshift.amazonaws.com",
  "Port": 5439
}
$ psql -h demo-redshift.xxxxxxxxxxxx.ap-northeast-1.redshift.amazonaws.com -p 5439 -U admin -d demo
Password for user admin:
psql (13.2, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

demo=#
demo=# \q

Run the Glue Crawler to create a Glue Table in the Glue Data Catalog from the Parquet data uploaded to the S3 bucket. Note that the crawler runs asynchronously, so the table only appears once the crawl has finished.

$ aws glue start-crawler --name demo-crawler
$ aws glue get-table --database-name demo-database --name demo_parquet

Then create an external schema for Redshift Spectrum so that the Redshift cluster can reach the Glue Table. Don't forget to replace YOUR_ACCOUNT_ID with your AWS account ID in sql/create-external-schema.sql.
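
For reference, the statement in sql/create-external-schema.sql presumably looks something like the sketch below; the schema name spectrum and the IAM role name are assumptions, while demo-database comes from the crawler step above.

-- sql/create-external-schema.sql (hypothetical sketch)
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'demo-database'
IAM_ROLE 'arn:aws:iam::YOUR_ACCOUNT_ID:role/demo-redshift-spectrum-role';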

$ psql -h demo-redshift.xxxxxxxxxxxx.ap-northeast-1.redshift.amazonaws.com -p 5439 -U admin -d demo < sql/create-external-schema.sql
Password for user admin:
CREATE SCHEMA

Finally, let's query the Parquet data in S3 via Redshift Spectrum.
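
The query in sql/get-parquet.sql presumably reads the external table along these lines (the spectrum schema name is the same assumption as above):

-- sql/get-parquet.sql (hypothetical sketch)
SELECT year, make, model
FROM spectrum.demo_parquet
ORDER BY year;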

$ psql -h demo-redshift.xxxxxxxxxxxx.ap-northeast-1.redshift.amazonaws.com -p 5439 -U admin -d demo < sql/get-parquet.sql > result
$ cat result
 year |  make  |  model
------+--------+---------
 2014 | toyota | corolla
 2017 | nissan | sentra
 2018 | honda  | civic
 2020 | hyndai | nissan
(4 rows)

Once the verification is complete, make sure to delete the AWS resources.

$ terraform destroy

Author

Inseo Kim

License

MIT
