-
Notifications
You must be signed in to change notification settings - Fork 334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[athena] changes to make athena part of a default deployment #599
[athena] changes to make athena part of a default deployment #599
Conversation
6afe624
to
0a6fbc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert on the Athena changes, but the code LGTM!
results_bucket = athena_config.get('results_bucket', '').strip() | ||
if results_bucket == '': | ||
self.athena_results_bucket = 's3://{}.streamalert.athena-results'.format(self.prefix) | ||
elif results_bucket[:5] != 's3://': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
results_bucket.startswith('s3://')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm excited for this change!! A couple more clarifying comments
stream_alert_cli/athena/handler.py
Outdated
|
||
|
||
def rebuild_partitions(athena_client, options, config): | ||
def rebuild_partitions(options, config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is still needs the constraint on options.type == data
, since it won't work for the alerts
table
stream_alert_cli/athena/handler.py
Outdated
@@ -199,17 +191,15 @@ def _construct_create_table_statement(schema, table_name, bucket): | |||
bucket=bucket) | |||
|
|||
|
|||
def create_table(athena_client, options, config): | |||
def create_table(options, table_type, config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but is there an added benefit of passing in table_type
as an arg if it's in options.type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idea was this would allow for a more generic function with a better interface by accepting a string for the table_type.
the only property of options
that was accessed within this function was type
, which means if any caller wanted to use this function it essentially has to create a namedtuple first that would have a type
property corresponding to the table type. it's fine to keep as is, just makes using it more difficult
|
||
athena_opts = namedtuple('AthenaOptions', ['bucket', 'refresh_type']) | ||
opts = athena_opts(alerts_bucket, 'add_hive_partition') | ||
athena_create_table(opts, 'alerts', config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression we were revisiting table creation on every deploy per the last PR convo, is that not the case?
@@ -116,26 +70,29 @@ def set_prefix(self, prefix): | |||
LOGGER_CLI.error('Prefix cannot contain underscores') | |||
return | |||
|
|||
tf_state_bucket = '{}.streamalert.terraform.state'.format(prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not related to this line)
Any reason to delete the generate_athena
method? It could still be useful for current users who don't have an Athena config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacknagz We don't auto-generate any other part of the config (e.g. Lambda template code for the alert processor or alert merger), and I think we should be consistent. So either we should auto-generate all of it or none of it, and IMO it would be over-engineering to auto-generate every part of the config an upgrading user might not have (because they'll likely need to change the config anyway). With documentation, users can just update their config manually. So my vote is to get rid of the generate_athena
method
#simplify 💣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@austinbyers and @jacknagz maybe we need to discuss this because I'm not sure what the best approach is. my initial impression was to nix it as well (as seen in my commit history). let's chat today
@jacknagz & @austinbyers I should clarify that this PR is a direct mirror of #592 and your feedback there was not addressed yet. This PR was simply to go against the release branch instead of master. Your comments lead me to believe you thought I had addressed previous PR feedback in this PR, which is not the case. Sorry for the misunderstanding!! |
21c5df5
to
93409b2
Compare
* Storing default lambda configuration for athena * Updating CLI to remove config addition logic
93409b2
to
5faeff0
Compare
hey @austinbyers and @jacknagz PTAL. I've updated the branch and addressed the feedback from previous comments EDIT: just fixed a pylint error and pushed a commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ryandeivert! I read the whole thing over again and found a few small things now that I have a better understanding of the codebase :)
conf/lambda.json
Outdated
"source_bucket": "PREFIX_GOES_HERE.streamalert.source", | ||
"source_current_hash": "<auto_generated>", | ||
"source_object_key": "<auto_generated>", | ||
"third_party_libraries": [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This key is no longer necessary
manage.py athena create-table --type alerts --bucket s3.bucket.name --refresh_type add_hive_partition | ||
|
||
manage.py athena create-table --type data --bucket s3.bucket.name --refresh_type add_hive_partition --table_name my_athena_table | ||
manage.py athena init Initialize the Athena base config (for legacy support) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 2.0 will not be backwards compatible, do we need this legacy support? If all it does is update the conf/lambda.json
, I would say nix it (it's not hard to manually add)
EDIT: see also my response to Jack below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @austinbyers, that was my impression as well
@@ -116,26 +70,29 @@ def set_prefix(self, prefix): | |||
LOGGER_CLI.error('Prefix cannot contain underscores') | |||
return | |||
|
|||
tf_state_bucket = '{}.streamalert.terraform.state'.format(prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacknagz We don't auto-generate any other part of the config (e.g. Lambda template code for the alert processor or alert merger), and I think we should be consistent. So either we should auto-generate all of it or none of it, and IMO it would be over-engineering to auto-generate every part of the config an upgrading user might not have (because they'll likely need to change the config anyway). With documentation, users can just update their config manually. So my vote is to get rid of the generate_athena
method
#simplify 💣
# Remove old Terraform files | ||
terraform_clean(config) | ||
LOGGER_CLI.info('Deploying Lambda Functions') | ||
# deploy both lambda functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really need this comment
"unit-testing.streamalerts": "alerts", | ||
"unit-testing.streamalert.data": "data" | ||
} | ||
'enabled': True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer used in the config , let's also remove it from the test.
EDIT: I just found it in the config too! But are we using it? If so, we should probably remove it since it's no longer optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes good catch - I'll remove :)
ec7a3b5
to
a32892b
Compare
6541bf4
to
b9bf4b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the refactoring! 👏 Few comments from me.
_add_default_athena_args(athena_create_table_parser) | ||
|
||
# Validate the provided schema-override options | ||
def _validate_override(val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The val
will be a set, should we go through each element in the set? Also, for the case that column_foo=
will be passed the validation. Maybe we can do
if len(val.split('=')) != 2:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @chunyong-lin - val
will actually not be a set, but will be each individual item within the set.
Good suggestion though - I can add an additional check to make sure the right about of equals-separated-values is provided
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change!
help=ARGPARSE_SUPPRESS) | ||
|
||
|
||
def _add_default_athena_args(athena_parser): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to rename this method to _add_required_athena_args
? I first impression when see default
word and it gives me impression that there is default value, it is not required to set it. Actually, the all the args are required here, and users must provide values for these args.
But I might interpreter this differently from others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can rename but chose this since the arguments are the 'defaults' related to the actual argparsers that we're constructing here (not the 'defaults' the user will have to provide). Also, you'll notice that not ever argument that is added within the function is in fact 'required' - see the 'debug' param.
table (str): The name of the table being rebuilt | ||
bucket (str): The s3 bucket to be used as the location for Athena data | ||
table_type (str): The type of table being refreshed | ||
Types of 'data' and 'alert' are accepted, but only 'data' is implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is data
or alert
is implemented? In docs/source/athena-setup.rst
line 37, it sounds that it is alert
is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! But this is just for 'rebuilding' part, not simply supporting the alerts
stream_alert_cli/terraform/athena.py
Outdated
database = athena_config.get('database_name', '{}_streamalert'.format(prefix)) | ||
|
||
results_bucket_name = athena_config.get('results_bucket', '').strip() | ||
if results_bucket_name == '': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may just use if not results_bucket_name:
. The empty string is treated as false anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good catch on this - I missed updating this one :)
stream_alert_cli/terraform/athena.py
Outdated
results_bucket_name = '{}.streamalert.athena-results'.format(prefix) | ||
|
||
queue_name = athena_config.get('queue_name', '').strip() | ||
if queue_name == '': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, optional to use if not queue_name:
594c93f
to
8d653cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple final comments
@@ -403,7 +383,7 @@ class StreamAlertSQSClient(object): | |||
received_messages: A list of receieved SQS messages | |||
processed_messages: A list of processed SQS messages | |||
""" | |||
QUEUENAME = 'streamalert_athena_data_bucket_notifications' | |||
DEFAULT_QUEUE_NAME = '{}_streamalert_athena_data_bucket_notifications' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty long Queue name, can we shorten it? The max length is 80 chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we can - I was just mimicking what our queue name is internally
@@ -309,22 +297,23 @@ def athena_handler(options, config): | |||
options (namedtuple): The parsed args passed from the CLI | |||
config (CLIConfig): Loaded StreamAlert CLI | |||
""" | |||
athena_client = StreamAlertAthenaClient(config, results_key_prefix='stream_alert_cli') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit of removing the instantiation here and doing it in each subcommand's method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows any caller using these functions to not have to worry about created an athena client that needs to be passed it.
Example here:
create_table(None, alerts_bucket, 'alerts', config) |
Also, not every subcommand needs this (ie: init
)
@@ -23,6 +23,18 @@ | |||
"subnet_ids": [] | |||
} | |||
}, | |||
"athena_partition_refresh_config": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this missing the results_bucket
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also database_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those options are only needed if the user wants to override the defaults that we use. I can add the defaults here but have some concerns. Mainly, I worry that our config is growing unnecessarily large with superfluous options that the vast majority of users wouldn't need to worry about. What are your thoughts? Have it here or omit it and just document the options config settings well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll have a larger discussion on the interaction between the CLI and the config at another time. Thanks for this change!
👍 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the change.
851db76
to
bfff54a
Compare
bfff54a
to
0003e5a
Compare
to: @austinbyers or @chunyong-lin
cc: @airbnb/streamalert-maintainers, @jacknagz
size: large
resolves #N/A
NOTE: this is a replacement for #592. All new updates will be made here.
Background
Changes
manage.py
athena subcommands likeinit
,create-db
, etc.prefix
as to not conflict if multiple deploys exist in one AWS account.athena_partition_refresh_config
inlambda.json
:database_name
: the name of the Athena database to use. (default is:<prefix>_streamalert
)results_bucket
: the S3 bucket to use for Athena query results and metastore storage (default is:<prefix>.streamalert.athena-results
)queue_name
: the name of the sqs queue to use for bucket notifications (default is:<prefix>_streamalert_athena_data_bucket_notifications
)Other Changes
Testing