This tool helps an organization understand how its GCS buckets are used across all of its projects. It computes aggregate statistics on bucket access patterns to help generate object lifecycle management (OLM) policies.
This tool uses GCS audit logs to find how often objects are read within a bucket, and then displays the corresponding statistics. Once these are computed, the tool creates a recommended OLM policy, for example advising a user to downgrade the storage class of objects within a bucket to "nearline" if they have not been read in over 30 days. This helps an organization optimize costs, since it is otherwise hard to know how often buckets are accessed and whether each bucket is configured with the appropriate storage class and OLM policy for its access patterns.
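As a rough illustration of the kind of rule described above, here is a minimal sketch. The function name and constant are hypothetical, and the only threshold it encodes is the 30-day "nearline" example from the text; the tool's actual recommendation logic may differ.

```python
from typing import Optional

# Threshold taken from the example in the text: not read in over 30 days.
NEARLINE_AFTER_DAYS = 30


def recommend_storage_class(days_since_last_read: int) -> Optional[str]:
    """Return a suggested storage class, or None to keep the current one.

    Hypothetical helper for illustration only, not the tool's real logic.
    """
    if days_since_last_read > NEARLINE_AFTER_DAYS:
        return "nearline"
    return None


if __name__ == "__main__":
    for days in (5, 30, 45):
        print(days, recommend_storage_class(days))
```

Only objects strictly older than the threshold get a recommendation, matching the "over 30 days" wording.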
An organization-level audit log sink captures create, read, and delete events for any GCS bucket and stores them in two daily tables, cloudaudit_googleapis_com_activity_<date> and cloudaudit_googleapis_com_data_access_<date>, in a BigQuery dataset. These events include relevant metadata about the object, such as labels, location, and details about where the request is coming from.
The audit logs will have separate entries for the creation and read of a GCS bucket or object, but these entries are generic audit log entries and we’ll need to do some additional work to surface the metadata interesting to us.
Before starting, let’s gather some prerequisite information.
- OUTPUT_PROJECT_ID: The ID of the project where you want to create the table.
- OUTPUT_DATASET_ID: The ID of the BigQuery dataset where you want to create the table.
- BQ_LOCATION: BigQuery location to use for your dataset, such as 'US' or 'EU'.
- ORG_ID: The numeric ID of the GCP organization.
Now export them as environment variables in your Cloud Shell for subsequent use. The following are example values of what you may use.
export AUDIT_LOG_PROJECT_ID=gcsauditlogs
export AUDIT_LOG_DATASET_ID=gcs_audit_logs
export OUTPUT_PROJECT_ID=gcsusagerecommder
export OUTPUT_DATASET_ID=gcs_usage_log
export OUTPUT_TABLE_NAME=gcs_bucket_insights
export BQ_LOCATION=EU
export ORG_ID=622243302570
- AUDIT_LOG_PROJECT_ID is the project ID in BigQuery where you will export the audit logs.
- AUDIT_LOG_DATASET_ID is the dataset ID in BigQuery where the audit logs will be exported to.
- OUTPUT_PROJECT_ID is the project ID where the bucket insights table will be.
- OUTPUT_DATASET_ID is the dataset ID where the bucket insights table will be.
- OUTPUT_TABLE_NAME is the name of the BigQuery table holding the bucket insights.
- BQ_LOCATION is either EU or US, and will be used for both datasets created.
- ORG_ID is the numeric ID of the GCP organization. Find yours by invoking gcloud organizations list
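Before running the later steps, it can help to confirm that every variable is actually set. The following is a small sketch of such a check; the helper name `missing_vars` is hypothetical and is not part of this tool.

```python
import os
from typing import List, Mapping

# Variable names as exported in the example above.
REQUIRED_VARS = (
    "AUDIT_LOG_PROJECT_ID",
    "AUDIT_LOG_DATASET_ID",
    "OUTPUT_PROJECT_ID",
    "OUTPUT_DATASET_ID",
    "OUTPUT_TABLE_NAME",
    "BQ_LOCATION",
    "ORG_ID",
)


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```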
To follow these instructions, you will need permissions to:
- Create and manage a BigQuery dataset in the project you’ve chosen
  - bigquery.datasets.create
  - bigquery.tables.create
  - bigquery.tables.getData
  - bigquery.tables.updateData
And you will need permissions at the organization level to:
- Create an organization-level audit log sink
  - roles/logging.configWriter
- Scan your inventory:
  - storage.buckets.get
  - storage.buckets.list
  - storage.objects.get
  - storage.objects.list
  - resourcemanager.organizations.get
  - resourcemanager.projects.get
  - resourcemanager.projects.list
You will need these permissions only during initial setup.
Let’s walk through the steps necessary to create and populate the dataset. First, create the dataset inside the project you specified above. Note that this command does not specify a partition expiration. Audit logs can get costly, so it is recommended to choose a default partition expiration for this dataset, such as 90 days.
bq mk --location=${BQ_LOCATION} -d "${OUTPUT_PROJECT_ID}:${OUTPUT_DATASET_ID}"
For GCS, we'll be enabling two types of audit logs:
- Admin Activity logs: Entries for operations that modify the configuration or metadata of a project, bucket, or object.
- Data Access logs: Entries for operations that modify objects or read a project, bucket, or object. There are several sub-types of data access logs:
  - ADMIN_READ: Entries for operations that read the configuration or metadata of a project, bucket, or object.
  - DATA_READ: Entries for operations that read an object.
  - DATA_WRITE: Entries for operations that create or modify an object.
By default, only Admin Activity logs are enabled for organizations; Data Access logs are not. For this example, we will enable them in the GCP console, but you could also do so with the CLI or programmatically.
To do this, go to the console.
- Filter on Cloud Storage, and select ADMIN_READ, DATA_READ, and DATA_WRITE. Then click Save.
Now, we will export the logs from Stackdriver to BigQuery.
gcloud logging sinks create gcs_usage \
bigquery.googleapis.com/projects/${AUDIT_LOG_PROJECT_ID}/datasets/${AUDIT_LOG_DATASET_ID} \
--log-filter='resource.type="gcs_bucket" AND
(protoPayload.methodName:"storage.buckets.delete" OR
protoPayload.methodName:"storage.buckets.create" OR
protoPayload.methodName:"storage.buckets.get" OR
protoPayload.methodName:"storage.objects.get" OR
protoPayload.methodName:"storage.objects.delete" OR
protoPayload.methodName:"storage.objects.create")' \
--organization=${ORG_ID} --include-children
This command will create and return a service account ID, such as serviceAccount:o125240632470-886280@gcp-sa-logging.iam.gserviceaccount.com.
Note down the account name, as we'll use it in the next step!
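The same filter is reused later for the optional exclusion, so it can be convenient to generate it from a single list of method names. This is a minimal sketch, not part of the tool; `build_log_filter` is a hypothetical helper, and it joins the clauses on one line rather than across several as in the command above.

```python
from typing import Iterable

# Method names copied from the sink filter in this document.
METHODS = [
    "storage.buckets.delete",
    "storage.buckets.create",
    "storage.buckets.get",
    "storage.objects.get",
    "storage.objects.delete",
    "storage.objects.create",
]


def build_log_filter(methods: Iterable[str]) -> str:
    """Build a filter matching GCS bucket entries for the given methods."""
    clauses = " OR ".join(
        f'protoPayload.methodName:"{method}"' for method in methods
    )
    return f'resource.type="gcs_bucket" AND ({clauses})'


if __name__ == "__main__":
    print(build_log_filter(METHODS))
```

Generating the string once makes it harder for the sink filter and the exclusion filter to drift apart.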
- Optional: Create a Stackdriver Logging Exclusion Filter
In the spirit of cost optimization, we can optionally set an exclusion filter for Stackdriver Logging once the logging sink is set up. This lets us control which log entries are sent to Stackdriver: for this solution we only need them in BigQuery, so there is no reason to also pay for their ingestion into Stackdriver Logging. This matters specifically for the Data Access logs, as Admin Activity and System Event logs are both free and cannot be excluded. If your organization only wants to consume these logs in BigQuery, follow these steps to save on ingestion costs. Note that this must be done in the console, as it's not yet available in gcloud.
To do this:
- Navigate to the Logs Ingestion Page as a part of Stackdriver Logging Resource Usage Page, and select the Exclusions Tab.
- In the expansion panel, enter the following filtering query, which should match what you used to create the sink in the previous step:
resource.type="gcs_bucket" AND
(protoPayload.methodName:"storage.buckets.delete" OR
protoPayload.methodName:"storage.buckets.create" OR
protoPayload.methodName:"storage.buckets.get" OR
protoPayload.methodName:"storage.objects.get" OR
protoPayload.methodName:"storage.objects.delete" OR
protoPayload.methodName:"storage.objects.create")
- In the Exclusions Editor right-hand panel, enter:
- Name: GCS_Data_Access_Logs_Filter
- Description: Excluding GCS Data Access logs by enabling direct export to BigQuery sink for bucket usage intelligence.
- Percent to Exclude: 100
- Click 'Create Exclusion'.
- A warning will appear, specifying that logs can still go to our BigQuery sink. Click 'Create Exclusion'.
You can verify that the sink works in a number of ways; here we’ll just follow the manual process. Create a GCS object in any project in your GCP organization, and you should see an entry arrive in the ${AUDIT_LOG_DATASET_ID} dataset, which is the destination of the sink created above.
The daily BigQuery tables named cloudaudit_googleapis_com_activity_<date> and cloudaudit_googleapis_com_data_access_<date> will appear in your project the first time a GCS object or bucket is created or read after the audit log sink is created. If you are not seeing tables being created, you may have a permissions issue. Troubleshoot by looking at the project activity in the project where you created the GCS resource. There you should see your audit log sink service account creating the daily audit log BigQuery tables.
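When checking whether the tables for a given day exist, it helps to know the exact names to look for. The sketch below builds them from a date. It assumes the <date> suffix is formatted as YYYYMMDD, which the text above does not state, and `daily_audit_tables` is a hypothetical helper.

```python
import datetime
from typing import Tuple

# Table name prefixes as given in this document.
TABLE_PREFIXES = (
    "cloudaudit_googleapis_com_activity",
    "cloudaudit_googleapis_com_data_access",
)


def daily_audit_tables(day: datetime.date) -> Tuple[str, ...]:
    """Return the two daily audit log table names for the given day."""
    # Assumption: the <date> suffix uses the YYYYMMDD format.
    suffix = day.strftime("%Y%m%d")
    return tuple(f"{prefix}_{suffix}" for prefix in TABLE_PREFIXES)


if __name__ == "__main__":
    for name in daily_audit_tables(datetime.date.today()):
        print(name)
```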
The audit logs only cover buckets that are created or accessed after the sink is enabled. To fill in this gap, we’ll run a process to populate a new, separate table in our dataset with an inventory of the buckets that currently exist in your GCP organization.
This process only needs to be run once and would benefit from a low-latency network location to GCP. In our example we will run it in Cloud Shell.
git clone https://github.com/GoogleCloudPlatform/professional-services.git
cd professional-services/tools/gcs-usage-recommender
gcloud auth application-default login
Set up your Python environment:
virtualenv venv --python python3
source venv/bin/activate
pip install -r requirements.txt
Run the script to create a local JSON file of all GCS buckets in your org:
python3 -m python.backfill ${ORG_ID}
Create the table holding the GCS inventory that existed before audit logs were enabled. This uses the JSON file created in the previous step.
bq --location=${BQ_LOCATION} load \
--source_format=NEWLINE_DELIMITED_JSON \
${OUTPUT_PROJECT_ID}:${OUTPUT_DATASET_ID}.${OUTPUT_TABLE_NAME} \
initial_gcs_inventory.json \
schema.json
Enable the BigQuery Data Transfer API
gcloud services enable bigquerydatatransfer.googleapis.com
Substitute your configuration variables into the query:
export audit_log_query=$(cat audit_log_query.sql | sed -e "s/{OUTPUT_PROJECT_ID}/$OUTPUT_PROJECT_ID/g" -e "s/{OUTPUT_DATASET_ID}/$OUTPUT_DATASET_ID/g" -e "s/{OUTPUT_TABLE_NAME}/$OUTPUT_TABLE_NAME/g" -e "s/{AUDIT_LOG_PROJECT_ID}/$AUDIT_LOG_PROJECT_ID/g" -e "s/{AUDIT_LOG_DATASET_ID}/$AUDIT_LOG_DATASET_ID/g")
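For readers who prefer not to chain `sed` expressions, the same placeholder substitution can be sketched in Python. This is an illustrative equivalent rather than part of the tool; `render_query` is a hypothetical helper, and the template in the demo is a made-up one-line query, not the contents of audit_log_query.sql.

```python
from typing import Mapping


def render_query(template: str, values: Mapping[str, str]) -> str:
    """Replace each {NAME} placeholder in template with its value."""
    rendered = template
    for name, value in values.items():
        # Literal replacement, so other braces in the SQL are left untouched.
        rendered = rendered.replace("{" + name + "}", value)
    return rendered


if __name__ == "__main__":
    # Made-up template for demonstration only.
    template = (
        "SELECT * FROM "
        "`{OUTPUT_PROJECT_ID}.{OUTPUT_DATASET_ID}.{OUTPUT_TABLE_NAME}`"
    )
    print(render_query(template, {
        "OUTPUT_PROJECT_ID": "gcsusagerecommder",
        "OUTPUT_DATASET_ID": "gcs_usage_log",
        "OUTPUT_TABLE_NAME": "gcs_bucket_insights",
    }))
```

Unlike `sed`, `str.replace` treats the values literally, so characters such as `/` or `&` in a project or dataset name need no escaping.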
Create a scheduled query job from this query. Running it daily is recommended, as it computes the read count over days.
bq query \
--project_id $OUTPUT_PROJECT_ID \
--use_legacy_sql=false \
--destination_table=$OUTPUT_DATASET_ID.$OUTPUT_TABLE_NAME \
--display_name="GCS Bucket Insights Table" \
--replace=true \
--schedule='every 24 hours' "$audit_log_query"
This will prompt you to enter an authorization code the first time you run it. Go to the URL that the prompt specifies, copy the code, and paste it back into the terminal. After this, your scheduled query is created. Verify this by checking in the Cloud Console.
The table that is created uses the schema described above.
Note that this will only be interesting after audit logs have been enabled for at least a few days. All backfilled inventory also defaults to "-1", meaning that it has not yet been accessed.