forked from cloudposse/terraform-aws-glue
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.yaml
322 lines (261 loc) · 11.2 KB
/
README.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
#
# This is the canonical configuration for the `README.md`
# Run `make readme` to rebuild the `README.md`
#
# Name of this project
name: terraform-aws-glue
# Logo for this project
#logo: docs/logo.png
# License of this project
license: "APACHE2"
# Copyrights
copyrights:
- name: "Cloud Posse, LLC"
url: "https://cloudposse.com"
year: "2021"
# Canonical GitHub repo
github_repo: cloudposse/terraform-aws-glue
# Badges to display
badges:
- name: "Latest Release"
image: "https://img.shields.io/github/release/cloudposse/terraform-aws-glue.svg"
url: "https://github.com/cloudposse/terraform-aws-glue/releases/latest"
- name: "Slack Community"
image: "https://slack.cloudposse.com/badge.svg"
url: "https://slack.cloudposse.com"
- name: "Discourse Forum"
image: "https://img.shields.io/discourse/https/ask.sweetops.com/posts.svg"
url: "https://ask.sweetops.com/"
# List any related terraform modules that this module may be used with or that this module depends on.
related:
- name: "terraform-aws-components"
description: "Catalog of terraform AWS components"
url: "https://github.com/cloudposse/terraform-aws-components"
references:
- name: Glue Getting Started Guide
description: Guide for getting oriented with glue and spark
url: https://docs.aws.amazon.com/glue/latest/dg/getting-started.html
- name: Program AWS Glue ETL Scripts in Python
description: Documentation about the process of running ETL with AWS Glue and the Python programming language
url: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html
- name: Python shell jobs in AWS Glue
description: Documentation about the process of configuring and running Python shell jobs in AWS Glue
url: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
- name: AWS Glue Jobs unit testing
description: Illustrates the execution of PyTest unit test cases for AWS Glue jobs in AWS CodePipeline using AWS CodeBuild projects
url: https://github.com/aws-samples/aws-glue-jobs-unit-testing
- name: AWS Glue knowledge center
description: Why does my AWS Glue crawler or ETL job fail with the error "Insufficient Lake Formation permission(s)"?
url: https://aws.amazon.com/premiumsupport/knowledge-center/glue-insufficient-lakeformation-permissions/
# Short description of this project
description: |-
Terraform modules for provisioning and managing AWS [Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) resources.
The following Glue resources are supported:
- [Catalog database](modules/glue-catalog-database)
- [Catalog table](modules/glue-catalog-table)
- [Connection](modules/glue-connection)
- [Crawler](modules/glue-crawler)
- [Job](modules/glue-job)
- [Registry](modules/glue-registry)
- [Schema](modules/glue-schema)
- [Trigger](modules/glue-trigger)
- [Workflow](modules/glue-workflow)
Refer to [modules](modules) for more details.
# How to use this module
usage: |-
For a complete example, see [examples/complete](examples/complete).
The example provisions a Glue catalog database and a Glue crawler that crawls a public dataset in an S3 bucket and writes the metadata into the Glue catalog database.
It also provisions an S3 bucket with a Glue Job Python script, and a destination S3 bucket for Glue job results.
And finally, it provisions a Glue job pointing to the Python script in the S3 bucket, and a Glue trigger that triggers the Glue job on a schedule.
The Glue job processes the dataset, cleans up the data, and writes the result into the destination S3 bucket.
For an example on how to provision source and destination S3 buckets, Glue Catalog database and table, and a Glue crawler that processes
data in the source S3 bucket and writes the result into the destination S3 bucket,
see [examples/crawler](examples/crawler).
For automated tests of the examples using [bats](https://github.com/bats-core/bats-core) and [Terratest](https://github.com/gruntwork-io/terratest)
(which tests and deploys the examples on AWS), see [test](test).
# Example usage
examples: |-
```hcl
locals {
enabled = module.this.enabled
s3_bucket_source = module.s3_bucket_source.bucket_id
role_arn = module.iam_role.arn
# The dataset used in this example consists of Medicare-Provider payment data downloaded from two Data.CMS.gov sites:
# Inpatient Prospective Payment System Provider Summary for the Top 100 Diagnosis-Related Groups - FY2011, and Inpatient Charge Data FY 2011.
# AWS modified the data to introduce a couple of erroneous records at the tail end of the file
data_source = "s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv"
}
module "glue_catalog_database" {
source = "cloudposse/glue/aws//modules/glue-catalog-database"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
catalog_database_description = "Glue Catalog database for the data located in ${local.data_source}"
location_uri = local.data_source
attributes = ["payments"]
context = module.this.context
}
module "glue_catalog_table" {
source = "cloudposse/glue/aws//modules/glue-catalog-table"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
catalog_table_name = "medicare"
catalog_table_description = "Test Glue Catalog table"
database_name = module.glue_catalog_database.name
storage_descriptor = {
# Physical location of the table
location = local.data_source
}
context = module.this.context
}
resource "aws_lakeformation_permissions" "default" {
principal = local.role_arn
permissions = ["ALL"]
table {
database_name = module.glue_catalog_database.name
name = module.glue_catalog_table.name
}
}
# Crawls the data in the S3 bucket and puts the results into a database in the Glue Data Catalog.
# The crawler will read the first 2 MB of data from that file, and recognize the schema.
# After that, the crawler will sync the table `medicare` in the Glue database.
module "glue_crawler" {
source = "cloudposse/glue/aws//modules/glue-crawler"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
crawler_description = "Glue crawler that processes data in ${local.data_source} and writes the metadata into a Glue Catalog database"
database_name = module.glue_catalog_database.name
role = local.role_arn
schedule = "cron(0 1 * * ? *)"
schema_change_policy = {
delete_behavior = "LOG"
update_behavior = null
}
catalog_target = [
{
database_name = module.glue_catalog_database.name
tables = [module.glue_catalog_table.name]
}
]
context = module.this.context
depends_on = [
aws_lakeformation_permissions.default
]
}
# Source S3 bucket to store Glue Job scripts
module "s3_bucket_source" {
source = "cloudposse/s3-bucket/aws"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
acl = "private"
versioning_enabled = false
force_destroy = true
allow_encrypted_uploads_only = true
allow_ssl_requests_only = true
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
attributes = ["source"]
context = module.this.context
}
resource "aws_s3_object" "job_script" {
bucket = local.s3_bucket_source
key = "data_cleaning.py"
source = "${path.module}/scripts/data_cleaning.py"
force_destroy = true
etag = filemd5("${path.module}/scripts/data_cleaning.py")
tags = module.this.tags
}
# Destination S3 bucket to store Glue Job results
module "s3_bucket_destination" {
source = "cloudposse/s3-bucket/aws"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
acl = "private"
versioning_enabled = false
force_destroy = true
allow_encrypted_uploads_only = true
allow_ssl_requests_only = true
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
attributes = ["destination"]
context = module.this.context
}
module "iam_role" {
source = "cloudposse/iam-role/aws"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
principals = {
"Service" = ["glue.amazonaws.com"]
}
managed_policy_arns = [
"arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
]
policy_document_count = 0
policy_description = "Policy for AWS Glue with access to EC2, S3, and Cloudwatch Logs"
role_description = "Role for AWS Glue with access to EC2, S3, and Cloudwatch Logs"
context = module.this.context
}
module "glue_workflow" {
source = "cloudposse/glue/aws//modules/glue-workflow"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
workflow_description = "Test Glue Workflow"
max_concurrent_runs = 2
context = module.this.context
}
module "glue_job" {
source = "cloudposse/glue/aws//modules/glue-job"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
job_description = "Glue Job that runs data_cleaning.py Python script"
role_arn = local.role_arn
glue_version = var.glue_version
worker_type = "Standard"
number_of_workers = 2
max_retries = 2
# The job timeout in minutes
timeout = 20
command = {
# The name of the job command. Defaults to `glueetl`.
# Use `pythonshell` for Python Shell Job Type, or `gluestreaming` for Streaming Job Type.
name = "glueetl"
script_location = format("s3://%s/data_cleaning.py", local.s3_bucket_source)
python_version = 3
}
context = module.this.context
}
module "glue_trigger" {
source = "cloudposse/glue/aws//modules/glue-trigger"
# Cloud Posse recommends pinning every module to a specific version
# version = "x.x.x"
workflow_name = module.glue_workflow.name
trigger_enabled = true
start_on_creation = true
trigger_description = "Glue Trigger that triggers a Glue Job on a schedule"
schedule = "cron(15 12 * * ? *)"
type = "SCHEDULED"
actions = [
{
job_name = module.glue_job.name
# The job run timeout in minutes. It overrides the timeout value of the job
timeout = 10
}
]
context = module.this.context
}
```
# Other files to include in this README from the project folder
include:
- "docs/targets.md"
- "docs/terraform.md"
# Contributors to this project
contributors:
- name: "Erik Osterman"
github: "osterman"
- name: "Leo Przybylski"
github: "r351574nc3"
- name: "Andriy Knysh"
github: "aknysh"