[athena] changes to make athena part of a default deployment #599

Merged
merged 10 commits into from
Mar 14, 2018

Contributor

@ryandeivert ryandeivert commented Feb 14, 2018

to: @austinbyers or @chunyong-lin
cc: @airbnb/streamalert-maintainers, @jacknagz
size: large
resolves #N/A

NOTE: this is a replacement for #592. All new updates will be made here.

Background

  • In order to support some upcoming features, StreamAlert must always have alerts sent to S3 to allow for Athena querying. Therefore, the Athena partition refresh function is no longer optional and should be created upon StreamAlert initialization.

Changes

  • Updating cli to handle deploying athena on init
    • This removes some now unnecessary manage.py athena subcommands like init, create-db, etc.
    • Athena database creation is now handled via Terraform.
  • The athena AWS resources are now name-spaced to the user-specified prefix so as not to conflict if multiple deployments exist in one AWS account.
  • Updating athena terraform generate code and updating default local config for athena.
  • Three additional options related to the athena function are now configurable by the user (within the athena_partition_refresh_config in lambda.json):
    • database_name: the name of the Athena database to use. (default is: <prefix>_streamalert)
    • results_bucket: the S3 bucket to use for Athena query results and metastore storage (default is: <prefix>.streamalert.athena-results)
    • queue_name: the name of the sqs queue to use for bucket notifications (default is: <prefix>_streamalert_athena_data_bucket_notifications)
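
For illustration, the fallback behavior for these three options could be sketched as below. This is a minimal sketch, not the code in this PR: the function name `resolve_athena_settings` and the shape of the returned dictionary are hypothetical; only the option names and default patterns come from the list above.

```python
def resolve_athena_settings(prefix, athena_config):
    """Return the effective Athena settings for a deployment prefix.

    Hypothetical helper: missing or blank values fall back to the
    prefix-based defaults listed in the PR description.
    """
    defaults = {
        'database_name': '{}_streamalert'.format(prefix),
        'results_bucket': '{}.streamalert.athena-results'.format(prefix),
        'queue_name': '{}_streamalert_athena_data_bucket_notifications'.format(prefix),
    }
    resolved = {}
    for key, default in defaults.items():
        value = (athena_config.get(key) or '').strip()
        # An empty or whitespace-only value is treated the same as a missing one
        resolved[key] = value or default
    return resolved
```

A caller that only overrides `database_name` would get the other two values derived from the prefix.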

Other Changes

  • Preliminary doc updates to go with changes. Another wave will follow.
  • Syncing carbon black logs schema updates

Testing

  • Updating some unit tests for changes.
  • Deployed in a test account end-to-end and tested to make sure alerts are searchable in S3.

@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch from 6afe624 to 0a6fbc4 Compare February 14, 2018 22:53
Contributor

@austinbyers austinbyers left a comment

I'm not an expert on the Athena changes, but the code LGTM!

austinbyers previously approved these changes Feb 14, 2018

coveralls commented Feb 15, 2018

Coverage Status

Coverage decreased (-0.04%) to 95.66% when pulling 0003e5a on ryandeivert-athena-default-output into 8ad15c4 on release-2.0.0.

results_bucket = athena_config.get('results_bucket', '').strip()
if results_bucket == '':
    self.athena_results_bucket = 's3://{}.streamalert.athena-results'.format(self.prefix)
elif results_bucket[:5] != 's3://':
Contributor

results_bucket.startswith('s3://')
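
A sketch of how the suggested `startswith` check might read in context. The body of the `elif` branch is not shown in the quoted diff, so prepending the scheme is an assumption here, as is the function name `normalize_results_bucket`.

```python
def normalize_results_bucket(results_bucket, prefix):
    """Return an s3:// URI for the Athena results bucket (illustrative only)."""
    results_bucket = results_bucket.strip()
    if not results_bucket:
        # Fall back to the prefix-based default from the quoted diff
        return 's3://{}.streamalert.athena-results'.format(prefix)
    if not results_bucket.startswith('s3://'):
        # Assumption: a bare bucket name gets the scheme prepended
        return 's3://{}'.format(results_bucket)
    return results_bucket
```

Compared with slicing (`results_bucket[:5]`), `startswith` avoids the hard-coded length that has to stay in sync with the literal.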

Contributor

@jacknagz jacknagz left a comment

I'm excited for this change!! A couple more clarifying comments



- def rebuild_partitions(athena_client, options, config):
+ def rebuild_partitions(options, config):
Contributor

This method still needs the constraint on options.type == data, since it won't work for the alerts table

@@ -199,17 +191,15 @@ def _construct_create_table_statement(schema, table_name, bucket):
bucket=bucket)


- def create_table(athena_client, options, config):
+ def create_table(options, table_type, config):
Contributor

I might be missing something, but is there an added benefit of passing in table_type as an arg if it's in options.type?

Contributor Author

the idea was this would allow for a more generic function with a better interface by accepting a string for the table_type.

the only property of options that was accessed within this function was type, which means if any caller wanted to use this function it essentially has to create a namedtuple first that would have a type property corresponding to the table type. it's fine to keep as is, just makes using it more difficult
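
To illustrate the difference in calling convention, here is a minimal sketch with stand-in functions. The function bodies and the names `create_table_from_options`, `create_table_from_type`, and `FakeOptions` are hypothetical and are not the functions in this PR.

```python
from collections import namedtuple


def create_table_from_options(options, config):
    """Variant that reads the table type off an options object."""
    return 'create {} table'.format(options.type)


def create_table_from_type(table_type, config):
    """Variant that accepts the table type directly as a string."""
    return 'create {} table'.format(table_type)


# With the options-based interface, a non-CLI caller has to build a
# throwaway object that exposes a `type` attribute first.
FakeOptions = namedtuple('FakeOptions', ['type'])
via_options = create_table_from_options(FakeOptions(type='alerts'), config={})

# With the string-based interface, the caller just passes the value.
via_string = create_table_from_type('alerts', config={})
```

Both calls do the same work; the second simply needs less setup from the caller.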


athena_opts = namedtuple('AthenaOptions', ['bucket', 'refresh_type'])
opts = athena_opts(alerts_bucket, 'add_hive_partition')
athena_create_table(opts, 'alerts', config)
Contributor

I was under the impression we were revisiting table creation on every deploy per the last PR convo, is that not the case?

@@ -116,26 +70,29 @@ def set_prefix(self, prefix):
LOGGER_CLI.error('Prefix cannot contain underscores')
return

tf_state_bucket = '{}.streamalert.terraform.state'.format(prefix)
Contributor

(not related to this line)

Any reason to delete the generate_athena method? It could still be useful for current users who don't have an Athena config

Contributor

@austinbyers austinbyers Mar 13, 2018

@jacknagz We don't auto-generate any other part of the config (e.g. Lambda template code for the alert processor or alert merger), and I think we should be consistent. So either we should auto-generate all of it or none of it, and IMO it would be over-engineering to auto-generate every part of the config an upgrading user might not have (because they'll likely need to change the config anyway). With documentation, users can just update their config manually. So my vote is to get rid of the generate_athena method

#simplify 💣

Contributor Author

@austinbyers and @jacknagz maybe we need to discuss this because I'm not sure what the best approach is. my initial impression was to nix it as well (as seen in my commit history). let's chat today

@ryandeivert
Contributor Author

@jacknagz & @austinbyers I should clarify that this PR is a direct mirror of #592 and your feedback there was not addressed yet. This PR was simply to go against the release branch instead of master. Your comments lead me to believe you thought I had addressed previous PR feedback in this PR, which is not the case. Sorry for the misunderstanding!!

@austinbyers austinbyers dismissed their stale review March 2, 2018 18:58

needs rebase

@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch 3 times, most recently from 21c5df5 to 93409b2 Compare March 12, 2018 22:58
@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch from 93409b2 to 5faeff0 Compare March 12, 2018 22:58
@ryandeivert
Contributor Author

ryandeivert commented Mar 12, 2018

hey @austinbyers and @jacknagz PTAL. I've updated the branch and addressed the feedback from previous comments

EDIT: just fixed a pylint error and pushed a commit

@airbnb airbnb deleted a comment from coveralls Mar 12, 2018
Contributor

@austinbyers austinbyers left a comment

Thanks @ryandeivert! I read the whole thing over again and found a few small things now that I have a better understanding of the codebase :)

conf/lambda.json Outdated
"source_bucket": "PREFIX_GOES_HERE.streamalert.source",
"source_current_hash": "<auto_generated>",
"source_object_key": "<auto_generated>",
"third_party_libraries": []
Contributor

This key is no longer necessary

manage.py athena create-table --type alerts --bucket s3.bucket.name --refresh_type add_hive_partition

manage.py athena create-table --type data --bucket s3.bucket.name --refresh_type add_hive_partition --table_name my_athena_table
manage.py athena init Initialize the Athena base config (for legacy support)
Contributor

Since 2.0 will not be backwards compatible, do we need this legacy support? If all it does is update the conf/lambda.json, I would say nix it (it's not hard to manually add)
EDIT: see also my response to Jack below

Contributor Author

thanks @austinbyers, that was my impression as well

# Remove old Terraform files
terraform_clean(config)
LOGGER_CLI.info('Deploying Lambda Functions')
# deploy both lambda functions
Contributor

We don't really need this comment

"unit-testing.streamalerts": "alerts",
"unit-testing.streamalert.data": "data"
}
'enabled': True,
Contributor

@austinbyers austinbyers Mar 13, 2018

This is no longer used in the config, let's also remove it from the test.
EDIT: I just found it in the config too! But are we using it? If so, we should probably remove it since it's no longer optional

Contributor Author

yes good catch - I'll remove :)

@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch from ec7a3b5 to a32892b Compare March 13, 2018 20:05
@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch from 6541bf4 to b9bf4b9 Compare March 13, 2018 21:04
Contributor

@chunyong-lin chunyong-lin left a comment

Thanks for the refactoring! 👏 Few comments from me.

_add_default_athena_args(athena_create_table_parser)

# Validate the provided schema-override options
def _validate_override(val):
Contributor

val will be a set, so should we go through each element in the set? Also, a value like column_foo= will pass the validation. Maybe we can do
if len(val.split('=')) != 2:

Contributor Author

hey @chunyong-lin - val will actually not be a set, but will be each individual item within the set.

Good suggestion though - I can add an additional check to make sure the right number of equals-separated values is provided

Contributor Author

Made the change!
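
A rough sketch of what such a per-item check could look like with argparse. The actual change is not shown in this thread, so the function name `validate_override`, the `--schema-override` flag name, and the error message are assumptions made for illustration.

```python
import argparse


def validate_override(val):
    """argparse `type` callable; invoked once per value given on the command line."""
    parts = val.split('=')
    # Exactly one '=' is required, and both sides must be non-empty.
    # Note that 'column_foo='.split('=') has length 2, so a length
    # check alone would not reject it.
    if len(parts) != 2 or not all(part.strip() for part in parts):
        raise argparse.ArgumentTypeError(
            'expected column_name=type, got: {!r}'.format(val))
    return val


parser = argparse.ArgumentParser()
# The flag name is a placeholder for this sketch
parser.add_argument('--schema-override', nargs='+', type=validate_override, default=[])
args = parser.parse_args(['--schema-override', 'col_a=string', 'col_b=bigint'])
```

Because argparse applies `type` to each value individually when `nargs='+'` is used, the validator never sees the whole collection.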

help=ARGPARSE_SUPPRESS)


def _add_default_athena_args(athena_parser):
Contributor

Does it make sense to rename this method to _add_required_athena_args? When I see the word "default", my first impression is that there is a default value and the argument does not have to be set. Actually, all the args are required here, and users must provide values for them.
But I might interpret this differently from others.

Contributor Author

I can rename but chose this since the arguments are the 'defaults' related to the actual argparsers that we're constructing here (not the 'defaults' the user will have to provide). Also, you'll notice that not every argument that is added within the function is in fact 'required' - see the 'debug' param.

table (str): The name of the table being rebuilt
bucket (str): The s3 bucket to be used as the location for Athena data
table_type (str): The type of table being refreshed
Types of 'data' and 'alert' are accepted, but only 'data' is implemented
Contributor

Is data or alert implemented? In docs/source/athena-setup.rst line 37, it sounds like alert is the one that is implemented.

Contributor Author

Good catch! But this is just for the 'rebuilding' part, not simply supporting the alerts

database = athena_config.get('database_name', '{}_streamalert'.format(prefix))

results_bucket_name = athena_config.get('results_bucket', '').strip()
if results_bucket_name == '':
Contributor

You may just use if not results_bucket_name:. The empty string is treated as false anyway.

Contributor Author

Oh good catch on this - I missed updating this one :)

results_bucket_name = '{}.streamalert.athena-results'.format(prefix)

queue_name = athena_config.get('queue_name', '').strip()
if queue_name == '':
Contributor

Same as above, optional to use if not queue_name:

@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch 3 times, most recently from 594c93f to 8d653cb Compare March 14, 2018 00:32
Contributor

@jacknagz jacknagz left a comment

A couple final comments

@@ -403,7 +383,7 @@ class StreamAlertSQSClient(object):
received_messages: A list of receieved SQS messages
processed_messages: A list of processed SQS messages
"""
- QUEUENAME = 'streamalert_athena_data_bucket_notifications'
+ DEFAULT_QUEUE_NAME = '{}_streamalert_athena_data_bucket_notifications'
Contributor

This is a pretty long Queue name, can we shorten it? The max length is 80 chars.

Contributor Author

Yeah we can - I was just mimicking what our queue name is internally
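
One way to guard against the 80-character limit mentioned above is to check the formatted name before using it. This is only a sketch: `build_queue_name` is a hypothetical helper, and the template is the one quoted in the diff, not a shortened replacement.

```python
SQS_MAX_QUEUE_NAME_LENGTH = 80  # limit cited in the review comment above
DEFAULT_QUEUE_NAME = '{}_streamalert_athena_data_bucket_notifications'


def build_queue_name(prefix, template=DEFAULT_QUEUE_NAME):
    """Format the queue name and reject results that exceed the SQS limit."""
    name = template.format(prefix)
    if len(name) > SQS_MAX_QUEUE_NAME_LENGTH:
        raise ValueError(
            'queue name is {} chars, max is {}: {}'.format(
                len(name), SQS_MAX_QUEUE_NAME_LENGTH, name))
    return name
```

With the template as quoted, the room left for the prefix is whatever remains of the 80 characters after the fixed suffix, so a long prefix would fail this check.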

@@ -309,22 +297,23 @@ def athena_handler(options, config):
options (namedtuple): The parsed args passed from the CLI
config (CLIConfig): Loaded StreamAlert CLI
"""
athena_client = StreamAlertAthenaClient(config, results_key_prefix='stream_alert_cli')
Contributor

What's the benefit of removing the instantiation here and doing it in each subcommand's method?

Contributor Author

This allows any caller using these functions to not have to worry about creating an Athena client that needs to be passed in.

Example here:

create_table(None, alerts_bucket, 'alerts', config)

Also, not every subcommand needs this (ie: init)

@@ -23,6 +23,18 @@
"subnet_ids": []
}
},
"athena_partition_refresh_config": {
Contributor

is this missing the results_bucket?

Contributor

also database_name?

Contributor Author

those options are only needed if the user wants to override the defaults that we use. I can add the defaults here but have some concerns. Mainly, I worry that our config is growing unnecessarily large with superfluous options that the vast majority of users wouldn't need to worry about. What are your thoughts? Have it here or omit it and just document the optional config settings well?

Contributor

@austinbyers austinbyers left a comment

We'll have a larger discussion on the interaction between the CLI and the config at another time. Thanks for this change!

👍 🚀

Contributor

@chunyong-lin chunyong-lin left a comment

LGTM. Thanks for the change.

@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch 2 times, most recently from 851db76 to bfff54a Compare March 14, 2018 18:44
@ryandeivert ryandeivert force-pushed the ryandeivert-athena-default-output branch from bfff54a to 0003e5a Compare March 14, 2018 21:01
@ryandeivert ryandeivert merged commit 8d426e3 into release-2.0.0 Mar 14, 2018
@ryandeivert ryandeivert deleted the ryandeivert-athena-default-output branch March 14, 2018 21:18
@ryandeivert ryandeivert added this to the 2.0.0 milestone Mar 15, 2018