Merge pull request #420 from airbnb/jacknaglieri-update-docs-october
[docs] add athena data table creation docs
jacknagz authored Oct 27, 2017
2 parents 6a52f6a + bba14d5 commit 1372b2b
Showing 4 changed files with 56 additions and 14 deletions.
4 changes: 2 additions & 2 deletions docs/source/account.rst
@@ -10,9 +10,9 @@ However, the StreamAlert application itself, and its supporting services, must b

Configuration
-------------
As outlined above, choose or create the AWS account you'll use to house the StreamAlert infrastructure.
As outlined above, choose or create the AWS account you will use to house the StreamAlert infrastructure.

If you're interested in demo'ing StreamAlert, you can create a hassle-free AWS Free Tier account `here <https://aws.amazon.com/free/>`_.
If you are interested in simply demo'ing StreamAlert, you can create a hassle-free AWS Free Tier account `here <https://aws.amazon.com/free/>`_.

account_id
~~~~~~~~~~
27 changes: 21 additions & 6 deletions docs/source/athena-deploy.rst
@@ -35,20 +35,20 @@ Internals

Each time the Athena Partition Refresh Lambda function starts up, it does the following:

* Polls the SQS Queue for the latest S3 event notifications (up to 50)
* Polls the SQS Queue for the latest S3 event notifications (up to 100)
* S3 event notifications contain context around any new object written to a data bucket (as configured below)
* A deduplicated set of unique S3 bucket IDs is extracted from the notifications
* Queries Athena to verify the ``streamalert`` database exists
* Refreshes the Athena tables as configured below in the ``repair_type.repair_hive_table`` key
* Deletes messages off the Queue once the Athena table(s) is successfully refreshed
* Refreshes the Athena tables as configured below in the ``refresh_type`` key
* Deletes messages off the Queue once partitions are created
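
As a rough illustration only (not the actual StreamAlert source), the loop above can be sketched with boto3 as follows; the queue URL, results location, and ``CONFIGURED_TABLES`` mapping are hypothetical placeholders:

.. code-block:: python

  import json
  import boto3

  sqs = boto3.client('sqs')
  athena = boto3.client('athena')

  QUEUE_URL = '<sqs-queue-url>'                     # hypothetical placeholder
  RESULTS = 's3://<athena-results-bucket>/'         # hypothetical placeholder
  CONFIGURED_TABLES = {'<data-bucket>': '<table>'}  # mirrors refresh_type in conf/lambda.json

  def handler(event, context):
      # Poll up to 100 S3 event notifications, in batches of 10 (the SQS per-call maximum)
      messages = []
      for _ in range(10):
          resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
          messages.extend(resp.get('Messages', []))

      # Build a deduplicated set of the S3 buckets referenced by the notifications
      buckets = {
          record['s3']['bucket']['name']
          for msg in messages
          for record in json.loads(msg['Body']).get('Records', [])
      }

      # Verify the streamalert database exists
      athena.start_query_execution(
          QueryString="SHOW DATABASES LIKE 'streamalert';",
          ResultConfiguration={'OutputLocation': RESULTS})

      # Repair (re-partition) each configured table whose bucket received new objects
      for bucket in buckets & set(CONFIGURED_TABLES):
          athena.start_query_execution(
              QueryString='MSCK REPAIR TABLE {};'.format(CONFIGURED_TABLES[bucket]),
              QueryExecutionContext={'Database': 'streamalert'},
              ResultConfiguration={'OutputLocation': RESULTS})

      # Delete messages from the queue only after the partition queries have been issued
      for msg in messages:
          sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])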

Getting Started
---------------

Configure Lambda Settings
~~~~~~~~~~~~~~~~~~~~~~~~~

Open ``conf/lambda.json``, and fill in the following ``Required`` options:
Open ``conf/lambda.json`` and fill in the required options below:


=================================== ======== ==================== ===========
@@ -60,7 +60,7 @@ Key Required Default Description
``memory`` ``No`` ``128`` The amount of memory (in MB) allocated to the Lambda function
``timeout`` ``No`` ``60`` The maximum duration of the Lambda function (in seconds)
``refresh_interval`` ``No`` ``rate(10 minutes)`` The rate at which the Athena Lambda function is invoked in the form of a `CloudWatch schedule expression <http://amzn.to/2u5t0hS>`_.
``refresh_type.add_hive_partition`` ``No`` ``{}`` Not currently supported
``refresh_type.add_hive_partition`` ``No`` ``{}`` Add specific Hive partitions for new S3 objects. This field is automatically populated when configuring your data tables.
``refresh_type.repair_hive_table`` ``Yes`` ``{}`` Key value pairs of S3 buckets and associated Athena table names. Currently only supports the default alerts bucket created with every cluster.
=================================== ======== ==================== ===========
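
For reference, a filled-in configuration might look like the sketch below. The enclosing ``athena_partition_refresh_config`` key name, the ``enabled`` field, and the example bucket value are assumptions to verify against your own ``conf/lambda.json``:

.. code-block:: json

  {
    "athena_partition_refresh_config": {
      "enabled": true,
      "memory": 128,
      "timeout": 60,
      "refresh_interval": "rate(10 minutes)",
      "refresh_type": {
        "add_hive_partition": {},
        "repair_hive_table": {
          "<your-alerts-bucket>": "alerts"
        }
      }
    }
  }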

@@ -98,4 +98,19 @@ After configuring the above settings, deploy the Lambda function:
This will create all of the underlying infrastructure to automatically refresh Athena tables.

Going forward, if the deploy flag ``--processor all`` is used, it will redeploy this function along with the ``rule_processor`` and ``alert_processor``.
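
For example, using the same CLI conventions as the rest of these docs:

.. code-block:: bash

  $ python manage.py lambda deploy --processor all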

Monitoring
~~~~~~~~~~

Once deployed, it's recommended to monitor the following SQS metrics for ``streamalert_athena_data_bucket_notifications``:

* ``NumberOfMessagesReceived``
* ``NumberOfMessagesSent``
* ``NumberOfMessagesDeleted``

All three of these metrics should have very close values.
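
To compare them, the values can be pulled with the AWS CLI as a quick check; the time window below is illustrative:

.. code-block:: bash

  $ aws cloudwatch get-metric-statistics \
      --namespace AWS/SQS \
      --metric-name NumberOfMessagesSent \
      --dimensions Name=QueueName,Value=streamalert_athena_data_bucket_notifications \
      --statistics Sum \
      --period 3600 \
      --start-time 2017-10-26T00:00:00Z \
      --end-time 2017-10-27T00:00:00Z

Repeating the call with the other two metric names makes any growing gap easy to spot.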

If the ``NumberOfMessagesSent`` is much higher than the other two metrics, the ``refresh_interval`` should be increased in the configuration.

For high-throughput production environments, an interval of 1 to 2 minutes is recommended.
23 changes: 20 additions & 3 deletions docs/source/athena-setup.rst
@@ -49,13 +49,30 @@ Next, create the ``streamalert`` database:
$ python manage.py athena create-db
Finally, create the ``alerts`` table for searching generated StreamAlerts:
Create the ``alerts`` table for searching generated StreamAlerts:

.. code-block:: bash

  $ python manage.py athena create-table --type alerts --bucket <s3.bucket.id.goes.here>

Create tables for data sent to StreamAlert:

.. code-block:: bash

  $ python manage.py athena create-table \
      --type data \
      --bucket <prefix>.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name <log_name>

Note: The log name above should correspond to a log source enabled in your StreamAlert deployment.

For example, if you have 'cloudwatch' in your sources, you would want to create tables for all possible subtypes. This includes ``cloudwatch_events`` and ``cloudwatch_flow_logs``. Also notice that ``:`` is substituted with ``_``; this is due to Hive limitations on table names.

Repeat this process for all relevant data tables in your deployment.
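
For instance, assuming ``cloudwatch`` is an enabled source and your prefix is ``acme`` (both illustrative), the two subtype tables mentioned above would be created with:

.. code-block:: bash

  $ python manage.py athena create-table \
      --type data \
      --bucket acme.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name cloudwatch_events

  $ python manage.py athena create-table \
      --type data \
      --bucket acme.streamalert.data \
      --refresh_type add_hive_partition \
      --table_name cloudwatch_flow_logs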

Next Steps
~~~~~~~~~~
----------

`Configure and deploy the Athena Partition Refresher Lambda function <athena-deploy.html>`_
* `Configure and deploy the Athena Partition Refresher Lambda function <athena-deploy.html>`_
* `Configure and deploy Kinesis Firehose for delivery of data to S3 <firehose.html>`_
16 changes: 13 additions & 3 deletions docs/source/firehose.rst
@@ -6,10 +6,10 @@ Overview

* To enable historical search of all data classified by StreamAlert, Kinesis Firehose can be used.
* This feature can be used for long-term data persistence and historical search (coming soon).
* This works by delivering data to Amazon S3, which can be loaded and queried by AWS Athena.
* This works by delivering data to AWS S3, which can be loaded and queried by AWS Athena.

Infrastructure
--------------
Configuration
-------------

When enabling the Kinesis Firehose module, a dedicated Delivery Stream is created for each log type.

@@ -112,3 +112,13 @@ Key Required Default Description
``buffer_interval`` ``No`` ``300 (seconds)`` The frequency of data delivery to Amazon S3
``compression_format`` ``No`` ``GZIP`` The compression algorithm to use on data stored in S3
====================== ======== ==================== ===========
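
As a point of reference only, a hypothetical snippet using these keys is shown below; the enclosing file and parent ``firehose`` block, along with the ``enabled`` field, are assumptions to check against your own configuration:

.. code-block:: json

  {
    "firehose": {
      "enabled": true,
      "buffer_interval": 300,
      "compression_format": "GZIP"
    }
  }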

Deploying
---------

Once the options above are set, deploy the infrastructure with the following commands:

.. code-block:: bash

  $ python manage.py terraform build
  $ python manage.py lambda deploy --processor rule
