preprocessd

Simple example showing how to use Cloud Run to pre-process events before persisting them to the backing store (e.g. BigQuery). This is a common use case where the raw data (e.g. submitted through a REST API) needs to be pre-processed (e.g. decorated with additional attributes, classified, or simply validated) before saving.

Cloud Run is a great platform on which to build these kinds of ingestion or pre-processing services:

  • Write each one of the pre-processing steps in the most appropriate (or favorite) development language
  • Bring your own runtime (or even specific version of that runtime) along with custom libraries
  • Dynamically scale up and down with your PubSub event load
  • Scale to 0, and don't pay anything, when there is nothing to process
  • Use granular access control with service accounts and policy bindings

Event Source

In this example we will use the synthetic events published to a PubSub topic by the pubsub-event-maker utility. We will use it to mock utilization data from 3 devices and publish it to Cloud PubSub on the eventmaker topic in your project. The PubSub payload looks something like this:

{
    "source_id": "device-1",
    "event_id": "eid-b6569857-232c-4e6f-bd51-cda4e81f3e1f",
    "event_ts": "2019-06-05T11:39:50.403778Z",
    "label": "utilization",
    "mem_used": 34.47265625,
    "cpu_used": 6.5,
    "load_1": 1.55,
    "load_5": 2.25,
    "load_15": 2.49,
    "random_metric": 94.05090880450125
}

The instructions on how to configure pubsub-event-maker to start sending these events are here.
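
When PubSub later pushes these events to the Cloud Run service (see the PubSub Subscription step below), each event arrives wrapped in the standard PubSub push envelope, with the payload above base64-encoded in the data field. The values below are illustrative placeholders (the data string is just the encoding of a truncated sample payload):

{
    "message": {
        "data": "eyJzb3VyY2VfaWQiOiAiZGV2aWNlLTEiLCAuLi59",
        "messageId": "1234567890",
        "publishTime": "2019-06-05T11:39:51.123456Z"
    },
    "subscription": "projects/YOUR_PROJECT_ID/subscriptions/YOUR_SUBSCRIPTION_NAME"
}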

Prerequisites

GCP Project and gcloud SDK

If you don't have one already, start by creating a new project and configuring the Google Cloud SDK. Similarly, if you have not done so already, you will have to set up Cloud Run.
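
If you need the exact commands, a minimal sketch of that setup could look like the following (the project ID and region are placeholders, and your environment may require additional steps):

gcloud config set project YOUR_PROJECT_ID
gcloud config set run/region us-central1
gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    pubsub.googleapis.com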

Setup

Build Container Image

Cloud Run runs container images. To build one we are going to use the included Dockerfile and submit the build job to Cloud Build using the bin/image script.

Note: you should review each of the provided scripts for the complete content of these commands

bin/image

If this is the first time you use the build service, you may be prompted to enable the Cloud Build API
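
The script is essentially a thin wrapper around Cloud Build. A rough sketch of the equivalent command (the image name here is an assumption, check bin/image for the actual one):

gcloud builds submit \
    --project YOUR_PROJECT_ID \
    --tag gcr.io/YOUR_PROJECT_ID/preprocessd:latest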

Service Account and IAM Policies

In this example we are going to follow the principle of least privilege (POLP) to ensure our Cloud Run service has only the necessary rights and nothing more:

  • run.invoker - required to execute the Cloud Run service
  • pubsub.editor - required to create and publish to Cloud PubSub topics
  • logging.logWriter - required for Stackdriver logging
  • cloudtrace.agent - required for Stackdriver tracing
  • monitoring.metricWriter - required to write custom metrics to Stackdriver

To do that, we will create a GCP service account and assign the necessary IAM policies and roles using the bin/account script:

bin/account
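
For reference, a minimal sketch of what such a script does (the preprocessd-sa account name is an assumption and only one role binding is shown; see bin/account for the actual names and the full list):

gcloud iam service-accounts create preprocessd-sa \
    --display-name "preprocessd service account"

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member serviceAccount:preprocessd-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com \
    --role roles/pubsub.editor

# repeat add-iam-policy-binding for roles/logging.logWriter,
# roles/cloudtrace.agent, and roles/monitoring.metricWriter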

Cloud Run Service

Once you have configured the GCP service account, you can deploy the new Cloud Run service, set it to run under that account, and prevent unauthenticated access using the bin/service script:

bin/service
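
The deploy command inside bin/service will look roughly like this sketch (the service name, image, and region are assumptions; check the script for the actual values):

gcloud run deploy preprocessd \
    --image gcr.io/YOUR_PROJECT_ID/preprocessd:latest \
    --service-account preprocessd-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com \
    --no-allow-unauthenticated \
    --platform managed \
    --region us-central1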

PubSub Subscription

To enable PubSub to send topic data to the Cloud Run service, we will need to create a PubSub topic subscription and configure it to "push" events to the Cloud Run service we deployed above.

bin/pubsub
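
A minimal sketch of the equivalent commands (the subscription name, service name, and service URL are placeholders; see bin/pubsub for what the script actually does):

# allow the service account to invoke the Cloud Run service
gcloud run services add-iam-policy-binding preprocessd \
    --member serviceAccount:preprocessd-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com \
    --role roles/run.invoker \
    --platform managed \
    --region us-central1

# create the push subscription pointing at the deployed service URL
gcloud pubsub subscriptions create preprocessd-push \
    --topic eventmaker \
    --push-endpoint https://YOUR_SERVICE_URL/ \
    --push-auth-service-account preprocessd-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com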

Log

You can see the raw data and all the application log entries made by the service in the Cloud Run service logs.

Cloud Run Log

Saving Results

The process of saving the resulting data from this service will depend on your target (the place where you want to save the data). GCP has a number of existing connectors and templates so, in most cases, you do not even have to write any code. Here is an example of a Dataflow template that streams PubSub topic data to BigQuery:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --parameters \
inputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME,\
outputTableSpec=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME

This approach automatically deals with back-pressure, retries, and monitoring, and is not subject to the batch insert quota limits.

Cleanup

To clean up all resources created by this sample, execute the bin/cleanup script.

bin/cleanup
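
For reference, cleanup amounts to deleting the resources created above, roughly (using the same assumed names as in the sketches earlier; see bin/cleanup for the actual commands):

gcloud pubsub subscriptions delete preprocessd-push
gcloud run services delete preprocessd --platform managed --region us-central1
gcloud iam service-accounts delete \
    preprocessd-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com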

Disclaimer

This is my personal project and it does not represent my employer. I take no responsibility for issues caused by this code. I do my best to ensure that everything works, but if something goes wrong, my apologies is all you will get.
