Merge pull request #420 from airbnb/jacknaglieri-update-docs-october
[docs] add athena data table creation docs
jacknagz authored Oct 27, 2017
2 parents 6a52f6a + bba14d5 commit 1372b2b
Showing 4 changed files with 56 additions and 14 deletions.
4 changes: 2 additions & 2 deletions docs/source/account.rst
@@ -10,9 +10,9 @@ However, the StreamAlert application itself, and its supporting services, must b

Configuration
-------------
As outlined above, choose or create the AWS account you'll use to house the StreamAlert infrastructure.
As outlined above, choose or create the AWS account you will use to house the StreamAlert infrastructure.

If you're interested in demo'ing StreamAlert, you can create a hassle-free AWS Free Tier account `here <https://aws.amazon.com/free/>`_.
If you are interested in simply demo'ing StreamAlert, you can create a hassle-free AWS Free Tier account `here <https://aws.amazon.com/free/>`_.

account_id
~~~~~~~~~~
27 changes: 21 additions & 6 deletions docs/source/athena-deploy.rst
@@ -35,20 +35,20 @@ Internals

Each time the Athena Partition Refresh Lambda function starts up, it does the following:

* Polls the SQS Queue for the latest S3 event notifications (up to 50)
* Polls the SQS Queue for the latest S3 event notifications (up to 100)
* S3 event notifications contain context around any new object written to a data bucket (as configured below)
* A deduplicated set of unique S3 bucket IDs is extracted from the notifications
* Queries Athena to verify the ``streamalert`` database exists
* Refreshes the Athena tables as configured below in the ``repair_type.repair_hive_table`` key
* Deletes messages off the Queue once the Athena table(s) is successfully refreshed
* Refreshes the Athena tables as configured below in the ``refresh_type`` key
* Deletes messages off the Queue once partitions are created
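
As a rough illustration only (not the actual StreamAlert source), the loop above can be sketched with boto3 as follows; the queue URL, results location, and ``CONFIGURED_TABLES`` mapping are hypothetical placeholders:

.. code-block:: python

  import json
  import boto3

  sqs = boto3.client('sqs')
  athena = boto3.client('athena')

  QUEUE_URL = '<sqs-queue-url>'                     # hypothetical placeholder
  RESULTS = 's3://<athena-results-bucket>/'         # hypothetical placeholder
  CONFIGURED_TABLES = {'<data-bucket>': '<table>'}  # mirrors refresh_type in conf/lambda.json

  def handler(event, context):
      # Poll up to 100 S3 event notifications, in batches of 10 (the SQS per-call maximum)
      messages = []
      for _ in range(10):
          resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
          messages.extend(resp.get('Messages', []))

      # Build a deduplicated set of the S3 buckets referenced by the notifications
      buckets = {
          record['s3']['bucket']['name']
          for msg in messages
          for record in json.loads(msg['Body']).get('Records', [])
      }

      # Verify the streamalert database exists
      athena.start_query_execution(
          QueryString="SHOW DATABASES LIKE 'streamalert';",
          ResultConfiguration={'OutputLocation': RESULTS})

      # Repair (re-partition) each configured table whose bucket received new objects
      for bucket in buckets & set(CONFIGURED_TABLES):
          athena.start_query_execution(
              QueryString='MSCK REPAIR TABLE {};'.format(CONFIGURED_TABLES[bucket]),
              QueryExecutionContext={'Database': 'streamalert'},
              ResultConfiguration={'OutputLocation': RESULTS})

      # Delete messages from the queue only after the partition queries have been issued
      for msg in messages:
          sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])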

Getting Started
---------------

Configure Lambda Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

Open ``conf/lambda.json``, and fill in the following ``Required`` options:
Open ``conf/lambda.json`` and fill in the required options below:


=================================== ======== ==================== ===========
@@ -60,7 +60,7 @@ Key Required Default Description
``memory`` ``No`` ``128`` The amount of memory (in MB) allocated to the Lambda function
``timeout`` ``No`` ``60`` The maximum duration of the Lambda function (in seconds)
``refresh_interval`` ``No`` ``rate(10 minutes)`` The rate at which the Athena Lambda function is invoked in the form of a `CloudWatch schedule expression <http://amzn.to/2u5t0hS>`_.
``refresh_type.add_hive_partition`` ``No`` ``{}`` Not currently supported
``refresh_type.add_hive_partition`` ``No`` ``{}`` Add specific Hive partitions for new S3 objects. This field is automatically populated when configuring your data tables.
``refresh_type.repair_hive_table`` ``Yes`` ``{}`` Key value pairs of S3 buckets and associated Athena table names. Currently only supports the default alerts bucket created with every cluster.
=================================== ======== ==================== ===========
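
For reference, a filled-in configuration might look like the sketch below. The enclosing ``athena_partition_refresh_config`` key name, the ``enabled`` field, and the example bucket value are assumptions to verify against your own ``conf/lambda.json``:

.. code-block:: json

  {
    "athena_partition_refresh_config": {
      "enabled": true,
      "memory": 128,
      "timeout": 60,
      "refresh_interval": "rate(10 minutes)",
      "refresh_type": {
        "add_hive_partition": {},
        "repair_hive_table": {
          "<your-alerts-bucket>": "alerts"
        }
      }
    }
  }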

@@ -98,4 +98,19 @@ After configuring the above settings, deploy the Lambda function:
This will create all of the underlying infrastructure to automatically refresh Athena tables.

Going forward, if the deploy flag ``--processor all`` is used, it will redeploy this function along with the ``rule_processor`` and ``alert_processor``.
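
For example, using the same CLI conventions as the rest of these docs:

.. code-block:: bash

  $ python manage.py lambda deploy --processor all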

Monitoring
~~~~~~~~~~

Once deployed, it's recommended to monitor the following SQS metrics for ``streamalert_athena_data_bucket_notifications``:

* ``NumberOfMessagesReceived``
* ``NumberOfMessagesSent``
* ``NumberOfMessagesDeleted``

All three of these metrics should have very close values.
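
To compare them, the values can be pulled with the AWS CLI as a quick check; the time window below is illustrative:

.. code-block:: bash

  $ aws cloudwatch get-metric-statistics \
      --namespace AWS/SQS \
      --metric-name NumberOfMessagesSent \
      --dimensions Name=QueueName,Value=streamalert_athena_data_bucket_notifications \
      --statistics Sum \
      --period 3600 \
      --start-time 2017-10-26T00:00:00Z \
      --end-time 2017-10-27T00:00:00Z

Repeating the call with the other two metric names makes any growing gap easy to spot.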

If the ``NumberOfMessagesSent`` is much higher than the other two metrics, the ``refresh_interval`` should be increased in the configuration.

For high-throughput production environments, an interval of 1 to 2 minutes is recommended.
23 changes: 20 additions & 3 deletions docs/source/athena-setup.rst
@@ -49,13 +49,30 @@ Next, create the ``streamalert`` database:
$ python manage.py athena create-db
Finally, create the ``alerts`` table for searching generated StreamAlerts:
Create the ``alerts`` table for searching generated StreamAlerts:

.. code-block:: bash

  $ python manage.py athena create-table --type alerts --bucket <s3.bucket.id.goes.here>

Create tables for data sent to StreamAlert:

.. code-block:: bash

  $ python manage.py athena create-table \
      --type data \
      --bucket <prefix>.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name <log_name>

Note: The log name above should correspond to a log source enabled in your StreamAlert deployment.

For example, if you have 'cloudwatch' in your sources, you would want to create tables for all possible subtypes. This includes ``cloudwatch_events`` and ``cloudwatch_flow_logs``. Also notice that ``:`` is substituted with ``_``; this is due to Hive limitations on table names.

Repeat this process for all relevant data tables in your deployment.
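
For instance, assuming ``cloudwatch`` is an enabled source and your prefix is ``acme`` (both illustrative), the two subtype tables mentioned above would be created with:

.. code-block:: bash

  $ python manage.py athena create-table \
      --type data \
      --bucket acme.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name cloudwatch_events

  $ python manage.py athena create-table \
      --type data \
      --bucket acme.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name cloudwatch_flow_logs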

Next Steps
~~~~~~~~~~
----------

`Configure and deploy the Athena Partition Refresher Lambda function <athena-deploy.html>`_
* `Configure and deploy the Athena Partition Refresher Lambda function <athena-deploy.html>`_
* `Configure and deploy Kinesis Firehose for delivery of data to S3 <firehose.html>`_
16 changes: 13 additions & 3 deletions docs/source/firehose.rst
@@ -6,10 +6,10 @@ Overview

* To enable historical search of all data classified by StreamAlert, Kinesis Firehose can be used.
* This feature can be used for long-term data persistence and historical search (coming soon).
* This works by delivering data to Amazon S3, which can be loaded and queried by AWS Athena.
* This works by delivering data to AWS S3, which can be loaded and queried by AWS Athena.

Infrastructure
--------------
Configuration
-------------

When enabling the Kinesis Firehose module, a dedicated Delivery Stream is created for each log type.

@@ -112,3 +112,13 @@ Key Required Default Description
``buffer_interval`` ``No`` ``300 (seconds)`` The frequency of data delivery to Amazon S3
``compression_format`` ``No`` ``GZIP`` The compression algorithm to use on data stored in S3
====================== ======== ==================== ===========
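
As a point of reference only, a hypothetical snippet using these keys is shown below; the enclosing file and parent ``firehose`` block, along with the ``enabled`` field, are assumptions to check against your own configuration:

.. code-block:: json

  {
    "firehose": {
      "enabled": true,
      "buffer_interval": 300,
      "compression_format": "GZIP"
    }
  }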

Deploying
---------

Once the options above are set, deploy the infrastructure with the following commands:

.. code-block:: bash

  $ python manage.py terraform build
  $ python manage.py lambda deploy --processor rule
