[athena] changes to make athena part of a default deployment #599

ryandeivert · 2018-02-14T22:52:12Z

to: @austinbyers or @chunyong-lin
cc: @airbnb/streamalert-maintainers, @jacknagz
size: large
resolves #N/A

NOTE: this is a replacement for #592. All new updates will be made here.

Background

In order to support some upcoming features, StreamAlert must always have alerts sent to S3 to allow for Athena querying. Therefore, the Athena partition refresh function is no longer optional and should be created upon StreamAlert initialization.

Changes

Updating the CLI to handle deploying Athena on init
- This removes some now-unnecessary manage.py athena subcommands like init, create-db, etc.
- Athena database creation is now handled via Terraform.
The Athena AWS resources are now namespaced with the user-specified prefix so they do not conflict if multiple deployments exist in one AWS account.
Updating the Athena Terraform generation code and the default local config for Athena.
Three additional options related to the Athena function are now configurable by the user (within athena_partition_refresh_config in lambda.json):
- database_name: the name of the Athena database to use (default: <prefix>_streamalert)
- results_bucket: the S3 bucket to use for Athena query results and metastore storage (default: <prefix>.streamalert.athena-results)
- queue_name: the name of the SQS queue to use for bucket notifications (default: <prefix>_streamalert_athena_data_bucket_notifications)
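
For illustration, the fallback behavior of the three options above could look roughly like the sketch below. This is not the code from this PR: the helper name `resolve_athena_settings` is hypothetical, and only the key names and default patterns are taken from the description.

```python
def resolve_athena_settings(prefix, athena_config):
    """Return the effective Athena settings for a deployment prefix.

    Hypothetical helper for illustration only; the key names and the
    default patterns follow the PR description.
    """
    defaults = {
        'database_name': '{}_streamalert'.format(prefix),
        'results_bucket': '{}.streamalert.athena-results'.format(prefix),
        'queue_name': '{}_streamalert_athena_data_bucket_notifications'.format(prefix),
    }
    settings = {}
    for key, default in defaults.items():
        value = athena_config.get(key, '').strip()
        # A missing or empty value falls back to the prefixed default
        settings[key] = value or default
    return settings


# Only database_name is overridden; the other two use the defaults
print(resolve_athena_settings('acme', {'database_name': 'custom_db'}))
```

Because every default embeds the prefix, two deployments with different prefixes in the same AWS account would end up with distinct resource names.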

Other Changes

Preliminary doc updates to go with changes. Another wave will follow.
Syncing carbon black logs schema updates

Testing

Updating some unit tests for changes.
Deployed end-to-end in a test account and verified that alerts are searchable in S3.

austinbyers

I'm not an expert on the Athena changes, but the code LGTM!

coveralls · 2018-02-15T07:17:56Z

Coverage decreased (-0.04%) to 95.66% when pulling 0003e5a on ryandeivert-athena-default-output into 8ad15c4 on release-2.0.0.

austinbyers · 2018-02-15T17:28:29Z

stream_alert/athena_partition_refresh/main.py

+        results_bucket = athena_config.get('results_bucket', '').strip()
+        if results_bucket == '':
+            self.athena_results_bucket = 's3://{}.streamalert.athena-results'.format(self.prefix)
+        elif results_bucket[:5] != 's3://':


results_bucket.startswith('s3://')
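
A minimal sketch of how the check might read with `str.startswith`. The diff above is cut off after the `elif`, so the body of that branch (prepending `s3://`) is an assumption, as is the standalone function name `normalize_results_bucket`.

```python
def normalize_results_bucket(results_bucket, prefix):
    """Return an s3:// URI for the Athena results bucket (illustrative only)."""
    results_bucket = results_bucket.strip()
    if not results_bucket:
        # Fall back to the default bucket name shown in the diff
        return 's3://{}.streamalert.athena-results'.format(prefix)
    if not results_bucket.startswith('s3://'):
        # Assumed behavior: add the scheme when the user omitted it
        return 's3://{}'.format(results_bucket)
    return results_bucket


print(normalize_results_bucket('my.results.bucket', 'acme'))
```

Compared with slicing (`results_bucket[:5]`), `startswith` avoids hard-coding the length of the scheme string.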

jacknagz

I'm excited for this change!! A couple more clarifying comments

jacknagz · 2018-02-15T18:12:09Z

stream_alert_cli/athena/handler.py

-
-
-def rebuild_partitions(athena_client, options, config):
+def rebuild_partitions(options, config):


This method still needs the constraint on options.type == data, since it won't work for the alerts table

jacknagz · 2018-02-15T18:12:47Z

stream_alert_cli/athena/handler.py

@@ -199,17 +191,15 @@ def _construct_create_table_statement(schema, table_name, bucket):
        bucket=bucket)


-def create_table(athena_client, options, config):
+def create_table(options, table_type, config):


I might be missing something, but is there an added benefit of passing in table_type as an arg if it's in options.type?

the idea was this would allow for a more generic function with a better interface by accepting a string for the table_type.

the only property of options that was accessed within this function was type, which means if any caller wanted to use this function it essentially has to create a namedtuple first that would have a type property corresponding to the table type. it's fine to keep as is, just makes using it more difficult
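
The interface trade-off described above can be sketched with two stand-in functions. Both bodies are placeholders and the names `create_table_from_options` and `Options` are hypothetical; only the shape of the call sites is the point.

```python
from collections import namedtuple


def create_table_from_options(options):
    """Options-based style: the caller must supply an object with a .type attribute."""
    return 'creating {} table'.format(options.type)


def create_table(table_type):
    """String-based style: the caller passes the table type directly."""
    return 'creating {} table'.format(table_type)


# A non-CLI caller has to fabricate a namedtuple just to carry one field
Options = namedtuple('Options', ['type'])
print(create_table_from_options(Options(type='alerts')))

# With a plain string argument the call is direct
print(create_table('alerts'))
```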

jacknagz · 2018-02-15T18:20:09Z

stream_alert_cli/manage_lambda/deploy.py

+
+        athena_opts = namedtuple('AthenaOptions', ['bucket', 'refresh_type'])
+        opts = athena_opts(alerts_bucket, 'add_hive_partition')
+        athena_create_table(opts, 'alerts', config)


I was under the impression we were revisiting table creation on every deploy per the last PR convo, is that not the case?

jacknagz · 2018-02-15T18:22:12Z

stream_alert_cli/config.py

@@ -116,26 +70,29 @@ def set_prefix(self, prefix):
            LOGGER_CLI.error('Prefix cannot contain underscores')
            return

+        tf_state_bucket = '{}.streamalert.terraform.state'.format(prefix)


(not related to this line)

Any reason to delete the generate_athena method? It could still be useful for current users who don't have an Athena config

@jacknagz We don't auto-generate any other part of the config (e.g. Lambda template code for the alert processor or alert merger), and I think we should be consistent. So either we should auto-generate all of it or none of it, and IMO it would be over-engineering to auto-generate every part of the config an upgrading user might not have (because they'll likely need to change the config anyway). With documentation, users can just update their config manually. So my vote is to get rid of the generate_athena method

#simplify 💣

@austinbyers and @jacknagz maybe we need to discuss this because I'm not sure what the best approach is. my initial impression was to nix it as well (as seen in my commit history). let's chat today

ryandeivert · 2018-02-27T02:55:18Z

@jacknagz & @austinbyers I should clarify that this PR is a direct mirror of #592 and your feedback there was not addressed yet. This PR was simply to go against the release branch instead of master. Your comments led me to believe you thought I had addressed previous PR feedback in this PR, which is not the case. Sorry for the misunderstanding!!

needs rebase

ryandeivert · 2018-03-12T23:00:04Z

hey @austinbyers and @jacknagz PTAL. I've updated the branch and addressed the feedback from previous comments

EDIT: just fixed a pylint error and pushed a commit

austinbyers

Thanks @ryandeivert! I read the whole thing over again and found a few small things now that I have a better understanding of the codebase :)

austinbyers · 2018-03-13T17:51:46Z

conf/lambda.json

+    "source_bucket": "PREFIX_GOES_HERE.streamalert.source",
+    "source_current_hash": "<auto_generated>",
+    "source_object_key": "<auto_generated>",
+    "third_party_libraries": []


This key is no longer necessary

austinbyers · 2018-03-13T17:53:17Z

manage.py

-    manage.py athena create-table --type alerts --bucket s3.bucket.name --refresh_type add_hive_partition
-
-    manage.py athena create-table --type data --bucket s3.bucket.name --refresh_type add_hive_partition --table_name my_athena_table
+    manage.py athena init                     Initialize the Athena base config (for legacy support)


Since 2.0 will not be backwards compatible, do we need this legacy support? If all it does is update the conf/lambda.json, I would say nix it (it's not hard to manually add)
EDIT: see also my response to Jack below

thanks @austinbyers, that was my impression as well

austinbyers · 2018-03-13T18:03:05Z

stream_alert_cli/config.py

@@ -116,26 +70,29 @@ def set_prefix(self, prefix):
            LOGGER_CLI.error('Prefix cannot contain underscores')
            return

+        tf_state_bucket = '{}.streamalert.terraform.state'.format(prefix)


@jacknagz We don't auto-generate any other part of the config (e.g. Lambda template code for the alert processor or alert merger), and I think we should be consistent. So either we should auto-generate all of it or none of it, and IMO it would be over-engineering to auto-generate every part of the config an upgrading user might not have (because they'll likely need to change the config anyway). With documentation, users can just update their config manually. So my vote is to get rid of the generate_athena method

#simplify 💣

austinbyers · 2018-03-13T18:09:29Z

stream_alert_cli/terraform/handler.py

-        # Remove old Terraform files
-        terraform_clean(config)
+    LOGGER_CLI.info('Deploying Lambda Functions')
+    # deploy both lambda functions


We don't really need this comment

austinbyers · 2018-03-13T18:11:54Z

tests/unit/stream_alert_athena_partition_refresh/test_main.py

-                    "unit-testing.streamalerts": "alerts",
-                    "unit-testing.streamalert.data": "data"
-                }
+            'enabled': True,


This is no longer used in the config, so let's also remove it from the test.
EDIT: I just found it in the config too! But are we using it? If so, we should probably remove it since it's no longer optional

yes good catch - I'll remove :)

chunyong-lin

Thanks for the refactoring! 👏 Few comments from me.

chunyong-lin · 2018-03-13T20:32:53Z

manage.py

+    _add_default_athena_args(athena_create_table_parser)
+
+    # Validate the provided schema-override options
+    def _validate_override(val):


val will be a set, so should we go through each element in it? Also, a value such as column_foo= would currently pass validation. Maybe we can do
if len(val.split('=')) != 2:

hey @chunyong-lin - val will actually not be a set, but will be each individual item within the set.

Good suggestion though - I can add an additional check to make sure the right number of equals-separated values is provided

Made the change!
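
A hedged sketch of what such a per-item validator could look like when wired into argparse. The flag name `--schema-override` and the function name `validate_override` are assumptions, not taken from the PR. One detail worth noting: `'column_foo='.split('=')` has length 2, so a length check on its own would not reject a missing value, and the sketch also checks that both sides are non-empty.

```python
import argparse


def validate_override(val):
    """Validate a single 'column=type' override and return it unchanged."""
    parts = val.split('=')
    # Require exactly one '=' and a non-empty string on each side of it
    if len(parts) != 2 or not all(part.strip() for part in parts):
        raise argparse.ArgumentTypeError(
            'expected column_name=value_type, got: {!r}'.format(val))
    return val


parser = argparse.ArgumentParser()
# Hypothetical flag name; argparse calls the type function once per item,
# so the validator only ever sees a single string
parser.add_argument('--schema-override', nargs='+', type=validate_override)

args = parser.parse_args(['--schema-override', 'col_a=string', 'col_b=bigint'])
print(args.schema_override)
```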

chunyong-lin · 2018-03-13T20:39:47Z

manage.py

+        help=ARGPARSE_SUPPRESS)
+
+
+def _add_default_athena_args(athena_parser):


Does it make sense to rename this method to _add_required_athena_args? My first impression on seeing the word default is that these args have default values and do not need to be set. Actually, all the args are required here, and users must provide values for them.
But I might be interpreting this differently from others.

I can rename it, but I chose this name since the arguments are the 'defaults' for the argparsers that we're constructing here (not 'defaults' the user will have to provide). Also, you'll notice that not every argument added within the function is in fact 'required' - see the 'debug' param.

chunyong-lin · 2018-03-13T20:54:17Z

stream_alert_cli/athena/handler.py

+        table (str): The name of the table being rebuilt
+        bucket (str): The s3 bucket to be used as the location for Athena data
+        table_type (str): The type of table being refreshed
+            Types of 'data' and 'alert' are accepted, but only 'data' is implemented


Is it data or alert that is implemented? In docs/source/athena-setup.rst line 37, it sounds like alert is the one that is implemented.

Good catch! But this applies only to the 'rebuilding' part, not to support for alerts in general

chunyong-lin · 2018-03-13T21:04:09Z

stream_alert_cli/terraform/athena.py

+    database = athena_config.get('database_name', '{}_streamalert'.format(prefix))
+
+    results_bucket_name = athena_config.get('results_bucket', '').strip()
+    if results_bucket_name == '':


You may just use if not results_bucket_name:. The empty string is treated as false anyway.

Oh good catch on this - I missed updating this one :)

chunyong-lin · 2018-03-13T21:04:47Z

stream_alert_cli/terraform/athena.py

+        results_bucket_name = '{}.streamalert.athena-results'.format(prefix)
+
+    queue_name = athena_config.get('queue_name', '').strip()
+    if queue_name == '':


Same as above, optional to use if not queue_name:

jacknagz

A couple final comments

jacknagz · 2018-03-14T16:32:01Z

stream_alert/athena_partition_refresh/main.py

@@ -403,7 +383,7 @@ class StreamAlertSQSClient(object):
        received_messages: A list of receieved SQS messages
        processed_messages: A list of processed SQS messages
    """
-    QUEUENAME = 'streamalert_athena_data_bucket_notifications'
+    DEFAULT_QUEUE_NAME = '{}_streamalert_athena_data_bucket_notifications'


This is a pretty long Queue name, can we shorten it? The max length is 80 chars.

Yeah we can - I was just mimicking what our queue name is internally
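
To make the length concern concrete, here is a small sketch that measures how much room the default template leaves for the prefix under the 80-character limit mentioned above. The function names are hypothetical; the template string is the one from the diff.

```python
SQS_MAX_QUEUE_NAME_LENGTH = 80
QUEUE_NAME_TEMPLATE = '{}_streamalert_athena_data_bucket_notifications'


def max_prefix_length(template=QUEUE_NAME_TEMPLATE):
    """Return how many characters remain for the prefix under the limit."""
    fixed_length = len(template.format(''))
    return SQS_MAX_QUEUE_NAME_LENGTH - fixed_length


def queue_name_fits(prefix, template=QUEUE_NAME_TEMPLATE):
    """Return True if the rendered queue name stays within the limit."""
    return len(template.format(prefix)) <= SQS_MAX_QUEUE_NAME_LENGTH


print(max_prefix_length())
print(queue_name_fits('acme'))
```

A shorter fixed suffix would leave more of the 80 characters for the user-specified prefix.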

jacknagz · 2018-03-14T16:34:15Z

stream_alert_cli/athena/handler.py

@@ -309,22 +297,23 @@ def athena_handler(options, config):
        options (namedtuple): The parsed args passed from the CLI
        config (CLIConfig): Loaded StreamAlert CLI
    """
-    athena_client = StreamAlertAthenaClient(config, results_key_prefix='stream_alert_cli')


What's the benefit of removing the instantiation here and doing it in each subcommand's method?

This allows any caller using these functions to not have to worry about creating an Athena client that needs to be passed in.

Example here:

streamalert/stream_alert_cli/terraform/handler.py

Line 121 in 8d653cb

create_table(None, alerts_bucket, 'alerts', config)

Also, not every subcommand needs this (ie: init)

jacknagz · 2018-03-14T16:44:29Z

conf/lambda.json

@@ -23,6 +23,18 @@
      "subnet_ids": []
    }
  },
+  "athena_partition_refresh_config": {


is this missing the results_bucket?

also database_name?

those options are only needed if the user wants to override the defaults that we use. I can add the defaults here but have some concerns. Mainly, I worry that our config is growing unnecessarily large with superfluous options that the vast majority of users wouldn't need to worry about. What are your thoughts? Have it here or omit it and just document the optional config settings well?

austinbyers

We'll have a larger discussion on the interaction between the CLI and the config at another time. Thanks for this change!

👍 🚀

chunyong-lin

LGTM. Thanks for the change.

ryandeivert force-pushed the ryandeivert-athena-default-output branch from 6afe624 to 0a6fbc4 Compare February 14, 2018 22:53

austinbyers reviewed Feb 14, 2018

austinbyers previously approved these changes Feb 14, 2018

austinbyers reviewed Feb 15, 2018

jacknagz reviewed Feb 15, 2018

ryandeivert added cli terraform documentation historical search labels Feb 21, 2018

ryandeivert mentioned this pull request Feb 21, 2018

[athena] changes to make athena part of a default deployment #592

Closed

ryandeivert force-pushed the ryandeivert-athena-default-output branch 3 times, most recently from 21c5df5 to 93409b2 Compare March 12, 2018 22:58

ryandeivert added 7 commits March 12, 2018 15:58

[cli] updating cli to handle deploying athena on init

1d2f9db

[tf] athena terraform updates to create database, etc

32f8825

[config] updating athena config options

3d8e7fd

* Storing default lambda configuration for athena * Updating CLI to remove config addition logic

[logs] syncing carbon black logs schema updates

a1c9310

[docs] start of athena doc updates

5f1e3d1

[tests] updating some unit tests

7e9a05e

[pr] addressing feedback from @jacknagz and @austinbyers

5faeff0

ryandeivert force-pushed the ryandeivert-athena-default-output branch from 93409b2 to 5faeff0 Compare March 12, 2018 22:58

[cli] breaking out function due to pylint too-many-statements error

42d33e8

airbnb deleted a comment from coveralls Mar 12, 2018

austinbyers reviewed Mar 13, 2018

ryandeivert force-pushed the ryandeivert-athena-default-output branch from ec7a3b5 to a32892b Compare March 13, 2018 20:05

[conf] adding prefix to kms alias name so it is namespaced per-deploy

523c1bc

ryandeivert force-pushed the ryandeivert-athena-default-output branch from 6541bf4 to b9bf4b9 Compare March 13, 2018 21:04

chunyong-lin reviewed Mar 13, 2018

ryandeivert force-pushed the ryandeivert-athena-default-output branch 3 times, most recently from 594c93f to 8d653cb Compare March 14, 2018 00:32

jacknagz approved these changes Mar 14, 2018

austinbyers approved these changes Mar 14, 2018

chunyong-lin approved these changes Mar 14, 2018

ryandeivert force-pushed the ryandeivert-athena-default-output branch 2 times, most recently from 851db76 to bfff54a Compare March 14, 2018 18:44

[pr] addressing feedback from jacknagz and austinbyers, take 2

0003e5a

ryandeivert force-pushed the ryandeivert-athena-default-output branch from bfff54a to 0003e5a Compare March 14, 2018 21:01

ryandeivert merged commit 8d426e3 into release-2.0.0 Mar 14, 2018

ryandeivert deleted the ryandeivert-athena-default-output branch March 14, 2018 21:18

ryandeivert added this to the 2.0.0 milestone Mar 15, 2018

ryandeivert mentioned this pull request Mar 20, 2018

[outputs] adding support for required alerting outputs #643

Merged

ryandeivert added improvement athena and removed historical search labels Jul 10, 2018


