# Setting GCP region

## What to consider

Google Cloud Platform services are available in many locations across the globe. You can minimize network latency and network transport costs by running your Dataflow job in the same region where its input bucket, output dataset, and temporary directory are located. More specifically, to run Variant Transforms most efficiently, make sure all of the following resources are located in the same region (see the sketch after this list):

* Your source bucket, set by the `--input_pattern` flag.
* Your pipeline's temporary location, set by the `--temp_location` flag.
* Your output BigQuery dataset, set by the `--output_table` flag.
* Your Dataflow pipeline, set by the `--region` flag.
* Your Life Sciences API location, set by the `--location` flag.
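As a minimal sketch (the bucket, dataset, and project names here are hypothetical), you can reuse a single shell variable so that every flag, and every resource referenced in the command, points at the same region:

```bash
# Hypothetical names throughout; the point is that --region and --location
# reuse the same value, and the bucket and dataset referenced in COMMAND
# were created in that same region.
REGION="europe-west4"
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
  --input_pattern gs://my-input-bucket/*.vcf \
  --output_table my-project:my_dataset.my_table \
  --temp_location gs://my-temp-bucket/temp"

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project my-project \
  --region "${REGION}" \
  --location "${REGION}" \
  --temp_location gs://my-temp-bucket/temp \
  "${COMMAND}"
```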

## Running jobs in a particular region

The Dataflow API requires a GCP region, set via the `--region` flag, in order to run.

When running from Docker, the Cloud Life Sciences API is used to spin up a worker that launches and monitors the Dataflow job. The Cloud Life Sciences API is a regionalized service that runs in multiple regions; its location is set with the `--location` flag. The Life Sciences API location is where metadata about the pipeline's progress is stored, and it can differ from the region where the data is processed. Note that the Cloud Life Sciences API is not available in all regions; if this flag is omitted, the metadata is stored in `us-central1`. See the list of Currently Available Locations.

In addition to this requirement, you might also choose to run Variant Transforms in a specific region to satisfy your project's security and compliance requirements. For example, to restrict your processing job to `europe-west4` (Netherlands), set the region and location as follows:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ..."

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west4 \
  --location europe-west4 \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"
```

Note that the values of the `--project`, `--region`, and `--temp_location` flags are automatically passed as `COMMAND` inputs in `pipelines_runner.sh`.

Instead of setting the `--region` flag for each run, you can set a default region using the following command; the `--region` flag can then be omitted. For more information, refer to the Cloud SDK page.

```bash
gcloud config set compute/region "europe-west1"
```

Similarly, you can set the default project using the following command:

```bash
gcloud config set project GOOGLE_CLOUD_PROJECT
```
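To confirm the defaults took effect, you can read them back:

```bash
# Print the currently configured default region and project.
gcloud config get-value compute/region
gcloud config get-value project
```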

If you are running Variant Transforms from GitHub, you need to specify all three required Dataflow inputs, as shown below:

```bash
python3 -m gcp_variant_transforms.vcf_to_bq \
  ... \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west1 \
  --temp_location "${TEMP_LOCATION}"
```

## Setting Google Cloud Storage bucket region

You can choose your GCS bucket's region when you create it. When you create a bucket, you permanently define its name, its geographic location, and the project it belongs to. For an existing bucket, you can check its information to find out its geographic location.
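For instance (the bucket name here is hypothetical), `gsutil` can create a bucket in a specific region and inspect an existing bucket's location:

```bash
# Create a bucket pinned to europe-west4 (-l sets the location permanently).
gsutil mb -l europe-west4 gs://my-input-bucket

# Check an existing bucket's details; look for the "Location constraint" field.
gsutil ls -L -b gs://my-input-bucket
```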

## Setting BigQuery dataset region

You can choose the region for the BigQuery dataset at dataset creation time.

*Figure: BigQuery dataset region selection in the dataset creation dialog.*
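Equivalently, from the command line (the project and dataset names here are hypothetical), the global `--location` flag pins the dataset's region at creation time:

```bash
# Create a dataset in europe-west4; a dataset's location cannot be changed later.
bq --location=europe-west4 mk --dataset my-project:my_dataset
```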

## Advanced Flags

Variant Transforms supports specifying a subnetwork to use with the `--subnetwork` flag. This can be used to start the processing VMs in a specific network of your Google Cloud project, as opposed to the default network.

Variant Transforms allows disabling the use of external IP addresses with the `--use_public_ips` flag. If not specified, this defaults to `true`, so to restrict the use of external IP addresses, use `--use_public_ips false`. Note that without external IP addresses, VMs can only send packets to other internal IP addresses. To allow these VMs to connect to the external IP addresses used by Google APIs and services, you can enable Private Google Access on the subnet.
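Enabling Private Google Access is a one-time subnet setting. As a sketch (the subnet and region names are assumptions matching the example below):

```bash
# Allow VMs without external IPs on this subnet to reach Google APIs and services.
gcloud compute networks subnets update custom-network-eu-west \
  --region europe-west4 \
  --enable-private-ip-google-access
```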

For example, to run Variant Transforms with no public IP addresses in a VPC you have already created, called `custom-network-eu-west`, add these flags to the example above as follows:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ..."

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west4 \
  --location europe-west4 \
  --temp_location "${TEMP_LOCATION}" \
  --subnetwork custom-network-eu-west \
  --use_public_ips false \
  "${COMMAND}"
```