Nodejs package to manage Google Cloud BigQuery datasets, and tables and perform CRUD operations against them.

nicolasdao/google-cloud-bigquery

Google Cloud BigQuery

Google Cloud BigQuery is a Node.js package to maintain BigQuery tables, either explicitly or using Google Cloud Storage (including automatically updating the tables' schema).

Install

npm i google-cloud-bigquery

Getting started

Prerequisite

Before using this package, you must first:

  1. Have a Google Cloud Account.
  2. Have both a BigQuery DB and a Bucket in the same region (the bucket is only needed if you wish to maintain the BigQuery schema using data stored in Google Cloud Storage). As of December 2018, BigQuery is only supported in the following locations:
    • asia-northeast1 (Tokyo)
    • asia-east1 (Taiwan)
    • asia-southeast1 (Singapore)
    • australia-southeast1 (Sydney)
    • europe-north1 (Finland)
    • europe-west2 (London)
    • us-east4 (Northern Virginia)
    • eu (Multi regions in the EU)
    • us (Multi regions in the US)
  3. Have a Service Account set up with the following 2 roles:
    • roles/bigquery.admin
    • roles/storage.objectAdmin (only needed if you wish to maintain the BigQuery schema using data stored in Google Cloud Storage)
  4. Get the JSON key file for the Service Account above.
  5. Save that JSON key into a service-account.json file (make sure it is located under a path that is accessible to your app), or save the following properties to either manually set up the client or set up environment variables:
    • project_id
    • client_email
    • private_key
  6. Modify the service-account.json above by adding a new location_id property with the location ID of your BigQuery service (e.g., australia-southeast1).
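
Step 6 can be scripted rather than done by hand. The sketch below is one possible way to do it, assuming the key file is readable and writable from where the script runs. The helper names withLocationId and addLocationIdToFile are hypothetical and are not part of this package.

```javascript
const fs = require('fs')

// Hypothetical helper: returns a copy of the parsed key file with a location_id added.
const withLocationId = (serviceAccount, locationId) => {
	if (!locationId)
		throw new Error('Missing required locationId')
	return { ...serviceAccount, location_id: locationId }
}

// Hypothetical helper: reads the key file, adds location_id and writes it back in place.
const addLocationIdToFile = (filePath, locationId) => {
	const serviceAccount = JSON.parse(fs.readFileSync(filePath, 'utf8'))
	const updated = withLocationId(serviceAccount, locationId)
	fs.writeFileSync(filePath, JSON.stringify(updated, null, 2))
	return updated
}

// Example (assumes ./service-account.json exists):
// addLocationIdToFile('./service-account.json', 'australia-southeast1')
```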

Four ways to create a client

This library supports four different ways to create a client. The first method is the recommended way:

  1. Using the hosting identity
  2. Using a service-account.json
  3. Using explicit credentials
  4. Using environment variables

Using the hosting identity

const { client } = require('google-cloud-bigquery')

const bigQuery = client.new()

In this case, the package fetches the credentials automatically. It tries three different techniques to get them, and if none of them works, an error is thrown. Those techniques are:

  1. If the code is hosted on GCP (e.g., Compute Engine, App Engine, Cloud Functions or Cloud Run), the credentials are extracted from the service account associated with the GCP service.
  2. If the GOOGLE_APPLICATION_CREDENTIALS environment variable exists, its value is expected to be the path to a service account JSON key file on the hosting machine.
  3. If the ~/.config/gcloud/application_default_credentials.json file exists, the credentials it contains are used (more about setting that file up below).

When developing on your local environment, use either #2 or #3. #3 is equivalent to being invited by the SysAdmin to the project and granted specific privileges. To set up ~/.config/gcloud/application_default_credentials.json, follow these steps:

  • Make sure you have a Google account that can access both the GCP project and the resources you need on GCP.
  • Install the GCloud CLI on your environment.
  • Execute the following commands:
     gcloud auth login
     gcloud config set project <YOUR_GCP_PROJECT_HERE>
     gcloud auth application-default login
    
    The first command logs you in. The second command sets the <YOUR_GCP_PROJECT_HERE> as your default project. Finally, the third command creates a new ~/.config/gcloud/application_default_credentials.json file with the credentials you need for the <YOUR_GCP_PROJECT_HERE> project.
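
Before creating a client locally, it can help to check which of techniques #2 and #3 is actually available on your machine. The sketch below is only a diagnostic aid and is not how the package resolves credentials internally. The helper name findLocalCredentialSource is hypothetical, and the file path assumes the default gcloud location on Linux and macOS.

```javascript
const fs = require('fs')
const os = require('os')
const { join } = require('path')

// Default file written by 'gcloud auth application-default login' on Linux and macOS.
const ADC_FILE = join(os.homedir(), '.config', 'gcloud', 'application_default_credentials.json')

// Hypothetical helper: reports which local credential source (technique #2 or #3) is available.
// It does not detect technique #1 (code hosted on GCP).
const findLocalCredentialSource = (env = process.env, fileExists = fs.existsSync) => {
	const keyFile = env.GOOGLE_APPLICATION_CREDENTIALS
	if (keyFile && fileExists(keyFile))
		return { source: 'GOOGLE_APPLICATION_CREDENTIALS', path: keyFile }
	if (fileExists(ADC_FILE))
		return { source: 'application_default_credentials', path: ADC_FILE }
	return null
}

const found = findLocalCredentialSource()
console.log(found ? `Local credentials found via ${found.source}` : 'No local credentials found')
```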

Using a service-account.json

We assume that you have created a Service Account in your Google Cloud Account (using IAM) and that you've downloaded a service-account.json (the name of the file does not matter as long as it is a valid JSON file). To create a client this way, provide the path to that service-account.json as shown in the following example:

const { join } = require('path')
const { client } = require('google-cloud-bigquery')

const bigQuery = client.new({ jsonKeyFile: join(__dirname, './service-account.json') })

Using explicit credentials

This method is similar to the previous one. You should have downloaded a service-account.json, but instead of providing its path, you provide some of its details explicitly:

const { client } = require('google-cloud-bigquery')

const bigQuery = client.new({
	credentials: {
		project_id: 'test', 
		location_id: 'australia-southeast1',
		client_email:'something-1234@your-project-id.iam.gserviceaccount.com', 
		private_key: '-----BEGIN PRIVATE KEY-----\n123456789-----END PRIVATE KEY-----\n'
	}
})

All those details should be coming from the service-account.json you downloaded in the Prerequisite step.

Using environment variables

const { client } = require('google-cloud-bigquery')

const bigQuery = client.new()

The above will only work if all the following environment variables are set:

  • GOOGLE_CLOUD_BIGQUERY_PROJECT_ID or GOOGLE_CLOUD_PROJECT_ID
  • GOOGLE_CLOUD_BIGQUERY_LOCATION_ID or GOOGLE_CLOUD_LOCATION_ID
  • GOOGLE_CLOUD_BIGQUERY_CLIENT_EMAIL or GOOGLE_CLOUD_CLIENT_EMAIL
  • GOOGLE_CLOUD_BIGQUERY_PRIVATE_KEY or GOOGLE_CLOUD_PRIVATE_KEY

WARNING: If you're using NPM's dotenv, wrap your PRIVATE_KEY between double-quotes, otherwise some characters are escaped which corrupts the key.
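
Since client.new() with no arguments depends on all four settings being present, a quick check at startup can make a missing variable easier to spot. This is a minimal sketch based only on the variable names listed above. The helper name findMissingSettings is hypothetical.

```javascript
// Each entry lists the accepted variable names for one required setting.
const REQUIRED_SETTINGS = [
	['GOOGLE_CLOUD_BIGQUERY_PROJECT_ID', 'GOOGLE_CLOUD_PROJECT_ID'],
	['GOOGLE_CLOUD_BIGQUERY_LOCATION_ID', 'GOOGLE_CLOUD_LOCATION_ID'],
	['GOOGLE_CLOUD_BIGQUERY_CLIENT_EMAIL', 'GOOGLE_CLOUD_CLIENT_EMAIL'],
	['GOOGLE_CLOUD_BIGQUERY_PRIVATE_KEY', 'GOOGLE_CLOUD_PRIVATE_KEY']
]

// Hypothetical helper: returns the settings for which none of the accepted variables is set.
const findMissingSettings = (env = process.env) => REQUIRED_SETTINGS
	.filter(names => !names.some(name => env[name]))
	.map(names => names.join(' or '))

const missing = findMissingSettings()
if (missing.length)
	console.log(`Missing environment variables:\n  - ${missing.join('\n  - ')}`)
```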

Creating a new table

const { join } = require('path')
const { client } = require('google-cloud-bigquery')

const bigQuery = client.new({ jsonKeyFile: join(__dirname, './service-account.json') })

// Assumes that 'your-dataset-id' already exists
const db = bigQuery.db.get('your-dataset-id')
const userTbl = db.table('user')

userTbl.exists()
	.then(yes => yes 
		? console.log(`Table '${userTbl.name}' already exists in DB '${db.name}'`)
		: userTbl.create.new({ 
			schema: {
				id: 'integer',
				username: 'string',
				friends: [{
					id: 'integer',
					username: 'string',
					score: 'float'
				}],
				country: {
					code: 'string',
					name: 'string'
				},
				married: 'boolean',
				tags:['string'],
				inserted_date: 'timestamp'
			} 
		}).then(() => console.log(`Table '${userTbl.name}' successfully added to DB '${db.name}'`)))

Inserting data

userTbl.insert.values({ data:[{
		id: 1,
		username: 'Nicolas',
		inserted_date: new Date()
	}, {
		id: 2,
		username: 'Brendan',
		country: {
			code: 'AU',
			name: 'Australia'
		},
		friends:[{
			id: 1,
			username: 'Nicolas',
			score: 0.87
		}, {
			id: 3,
			username: 'Boris',
			score: 0.9
		}],
		inserted_date: new Date()
	}, {
		id: '3',
		username: 'Boris',
		tags:['admin',1],
		inserted_date: Date.now()/1000
	}]
})

IMPORTANT NOTE ABOUT QUOTAS AND LIMITS

Notice that the data input accepts either a single object or an array of objects. Though BigQuery can ingest up to 10,000 rows per request and 100,000 rows per second, it is recommended to keep the maximum number of rows per request to 500. You can read more about the quotas and limits at https://cloud.google.com/bigquery/quotas#streaming_inserts.

To prevent inserting more than 500 rows per request, you can either code it yourself, or rely on our own implementation using the safeMode flag as follows:

userTbl.insert.values({ data: lotsOfUsers, safeMode: true })
	.then(() => console.log(`All users inserted`))

The safeMode flag checks whether there are more than 500 items in the lotsOfUsers array. If so, the array is broken down into batches of 500 items, which are then inserted sequentially. That means that if you're inserting 5,000 users, there will be 10 sequential requests of 500 users each.
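
If you prefer to code the batching yourself, the sketch below shows one way to split an array into batches of 500 and process them one after the other. It illustrates the idea only and is not the package's safeMode implementation. The helper names toBatches and insertSequentially are hypothetical, and the insert function is a stand-in that only logs.

```javascript
// Splits an array into consecutive batches of at most 'size' items.
const toBatches = (items, size = 500) => {
	const batches = []
	for (let i = 0; i < items.length; i += size)
		batches.push(items.slice(i, i + size))
	return batches
}

// Runs 'insert' on each batch one after the other (never in parallel).
const insertSequentially = (items, insert, size = 500) => toBatches(items, size)
	.reduce((job, batch) => job.then(() => insert(batch)), Promise.resolve())

// Usage with a stand-in insert function that only logs the batch size.
const lotsOfUsers = Array.from({ length: 5000 }, (_, i) => ({ id: i, username: `user_${i}` }))
insertSequentially(lotsOfUsers, batch => Promise.resolve(console.log(`Inserting ${batch.length} users`)))
	.then(() => console.log('All users inserted'))
```

In real code, the stand-in would be replaced with something like batch => userTbl.insert.values({ data: batch }).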

Getting data

db.query.execute({ 
	sql:`select * from ${db.name}.${userTbl.name} where id = @id`, 
	params: { id: 2 } 
})
.then(data => console.log(JSON.stringify(data, null, ' ')))

// Query Output
// ============
//
// [
//  {
//   "id": 2,
//   "username": "Brendan",
//   "friends": [
//    {
//     "id": 1,
//     "username": "Nicolas",
//     "score": 0.87
//    },
//    {
//     "id": 3,
//     "username": "Boris",
//     "score": 0.9
//    }
//   ],
//   "country": {
//    "code": "AU",
//    "name": "Australia"
//   },
//   "married": null,
//   "tags": [],
//   "inserted_date": "2018-11-14T03:17:16.830Z"
//  }
// ]

Updating the table's schema

With BigQuery, only 2 types of updates are possible:

  1. Adding new fields
  2. Relaxing the constraint on a field from REQUIRED to NULLABLE

The second type of update is not useful here, as this project always creates nullable fields. The following example shows how to perform a schema update if the local schema is different from the current BigQuery schema:

// Let's add a new 'deleted_date' field to our local schema
const newSchema = {
	id: 'integer',
	username: 'string',
	friends: [{
		id: 'integer',
		username: 'string',
		score: 'float'
	}],
	country: {
		code: 'string',
		name: 'string'
	},
	married: 'boolean',
	tags:['string'],
	inserted_date: 'timestamp',
	deleted_date: 'timestamp'
}

userTbl.schema.isDiff(newSchema)
	.then(yes => yes
		? Promise.resolve(console.log(`Schema changes detected. Updating now...`))
			.then(() => userTbl.schema.update(newSchema))
			.then(() => console.log(`Schema successfully updated.`))
		: console.log(`No schema updates found`)
	)
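
Because BigQuery only accepts additive changes, it can be useful to see which fields a new local schema adds before calling update. The sketch below compares two local schema objects written in the notation used above. It is not how schema.isDiff works internally, and the helper name findAddedFields is hypothetical.

```javascript
// Hypothetical helper: lists the fields present in 'next' but missing from 'current',
// using dotted paths for nested fields. Arrays are compared through their first item.
const findAddedFields = (current = {}, next = {}, prefix = '') => Object.keys(next).reduce((added, key) => {
	const path = prefix + key
	const [a, b] = [current[key], next[key]].map(v => Array.isArray(v) ? v[0] : v)
	if (a === undefined)
		return [...added, path]
	if (a && typeof a === 'object' && b && typeof b === 'object')
		return [...added, ...findAddedFields(a, b, `${path}.`)]
	return added
}, [])

const currentSchema = { id: 'integer', username: 'string', inserted_date: 'timestamp' }
const nextSchema = { ...currentSchema, deleted_date: 'timestamp' }

console.log(findAddedFields(currentSchema, nextSchema)) // [ 'deleted_date' ]
```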

Best Practices - Reliability & Performance

Inserting A Lot Of Rows At Once

All insert operations use the BigQuery Streaming Insert API. There are many quotas and limits (more details at https://cloud.google.com/bigquery/quotas#streaming_inserts), but the top ones to be aware of are:

  • Each row cannot be larger than 1 MB.
  • The maximum number of rows that can be inserted at once is 10,000, but the documentation recommends limiting batch inserts to 500 rows for performance reasons.
  • The total size of all rows in a single insert cannot exceed 10 MB. Google makes no recommendation about using smaller inserts to improve performance, but in our experience, limiting inserts to 2 MB improves performance and reliability (it all depends on your network conditions).

To alleviate the need to pre-process your data before inserting all your rows, we've added a configurable safeMode in our insert API:

userTbl.insert.values({
	data: veryBigArrayOfUsers,
	safeMode: true
})

This will automatically insert all those rows sequentially in batches of 2 MB or 500 rows (whichever is reached first). You can configure that safe mode as follows:

userTbl.insert.values({
	data: veryBigArrayOfUsers,
	safeMode: true,
	batchSize: 5*1024*1024, // 5 MB max instead of the default 2 MB.
	batchCount: 1000 	// 1000 rows max instead of the default 500.
})
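
To make the "whichever is reached first" rule concrete, here is a minimal sketch of batching by both size and row count. It is not the package's implementation. The helper name toSafeBatches is hypothetical, and measuring a row by the byte length of its JSON serialisation is an assumption made for illustration.

```javascript
// Hypothetical helper: groups rows into batches that stay under both a byte budget and a row budget.
// A row's size is approximated by the byte length of its JSON serialisation.
const toSafeBatches = (rows, { batchSize = 2*1024*1024, batchCount = 500 } = {}) => {
	const batches = []
	let current = []
	let currentBytes = 0
	for (const row of rows) {
		const rowBytes = Buffer.byteLength(JSON.stringify(row), 'utf8')
		const isFull = current.length >= batchCount || currentBytes + rowBytes > batchSize
		// Close the current batch before it exceeds either budget.
		// A single row larger than batchSize still gets a batch of its own.
		if (current.length > 0 && isFull) {
			batches.push(current)
			current = []
			currentBytes = 0
		}
		current.push(row)
		currentBytes += rowBytes
	}
	if (current.length > 0)
		batches.push(current)
	return batches
}

const manyRows = Array.from({ length: 1200 }, (_, id) => ({ id }))
console.log(toSafeBatches(manyRows).map(batch => batch.length)) // [ 500, 500, 200 ]
```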

Avoiding Schema Errors When Inserting Data

BigQuery's casting capabilities are quite limited. When a value's type does not fit the table's schema, that row will either crash the entire insert or be completely ignored (we're using the latter setting). To make sure that as much data as possible is inserted, we've added an option called forcedSchema in the db.table('some-table').insert.values API:

userTbl.insert.values({
	data:{
		id: '123.34',
		username: { hello: 'world' },
		inserted_date: new Date(2018,10,14)
	},
	forcedSchema:{
		id: 'integer',
		username: 'string',
		inserted_date: 'timestamp'
	}
})

Under the hood, this code will transform the data payload to the following:

{
	id: 123,
	username: 'Object',
	inserted_date: '2018-11-13T13:00:00.000Z'
}

This object is guaranteed to comply with the schema, which guarantees that all the data is inserted.
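
The transformation above can be approximated with a small coercion table. The sketch below is an illustration under stated assumptions, not the package's actual forcedSchema logic: it only handles flat schemas, the coercion rules are guesses that happen to reproduce the example above, and the names coercers and coerceRow are hypothetical.

```javascript
// Hypothetical coercion rules for a few of the scalar types used in this document.
const coercers = {
	integer: v => {
		const n = parseInt(v, 10)
		return Number.isNaN(n) ? null : n
	},
	float: v => {
		const n = parseFloat(v)
		return Number.isNaN(n) ? null : n
	},
	string: v => typeof v === 'object' ? 'Object' : String(v),
	timestamp: v => {
		const d = new Date(v)
		return Number.isNaN(d.getTime()) ? null : d.toISOString()
	}
}

// Keeps only the fields declared in the (flat) schema and coerces each value to its declared type.
// Missing values and unknown types become null.
const coerceRow = (row, schema) => Object.keys(schema).reduce((out, field) => {
	const value = row[field]
	const coerce = coercers[schema[field]]
	out[field] = value === undefined || value === null || !coerce ? null : coerce(value)
	return out
}, {})

const coerced = coerceRow(
	{ id: '123.34', username: { hello: 'world' }, inserted_date: new Date(Date.UTC(2018, 10, 13, 13)) },
	{ id: 'integer', username: 'string', inserted_date: 'timestamp' }
)
console.log(coerced)
// { id: 123, username: 'Object', inserted_date: '2018-11-13T13:00:00.000Z' }
```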

Avoiding Network Errors

Network errors (e.g., socket hang up, connect ECONNREFUSED) are a fact of life. To deal with those non-deterministic errors, this library uses a simple exponential backoff retry strategy, which will reprocess your read or write request for 10 seconds by default. You can increase that retry period as follows:

// Retry timeout for QUERIES
db.query.execute({ 
	sql:`select * from ${db.name}.${userTbl.name} where id = @id`, 
	params: { id: 2 },
	timeout: 30000		// 30 seconds retry period timeout
})

// Retry timeout for INSERTS
userTbl.insert.values({
	data: users,
	timeout: 30000 		// 30 seconds retry period timeout
})
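
For readers unfamiliar with exponential backoff, the sketch below shows the general idea: wait a little after the first failure, then double the wait after each further failure until the retry period is used up. It is not the library's internal code. The 100 ms first delay, the factor of 2 and the helper names backoffDelays and retry are assumptions made for illustration.

```javascript
// Hypothetical helper: lists the waits (in ms) between attempts, growing by 'factor' each time,
// for as long as the cumulated wait fits within the retry period.
const backoffDelays = (timeout = 10000, firstDelay = 100, factor = 2) => {
	if (firstDelay <= 0 || factor < 1)
		throw new Error('firstDelay must be positive and factor must be at least 1')
	const delays = []
	let total = 0
	for (let delay = firstDelay; total + delay <= timeout; delay *= factor) {
		delays.push(delay)
		total += delay
	}
	return delays
}

const wait = ms => new Promise(resolve => setTimeout(resolve, ms))

// Runs 'task', waiting for the next delay after each failure.
// Once the delays are exhausted, a final attempt is made and its error, if any, is thrown.
const retry = async (task, { timeout, firstDelay, factor } = {}) => {
	for (const delay of backoffDelays(timeout, firstDelay, factor)) {
		try {
			return await task()
		} catch (err) {
			await wait(delay)
		}
	}
	return task()
}

console.log(backoffDelays(10000)) // [ 100, 200, 400, 800, 1600, 3200 ]
```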

Snippets To Put It All Together

Idempotent Script To Keep Your DB Tables In Sync

The code snippet below shows how you can create new tables if they don't exist yet and update their schema if it has changed when compared with the local version.

const { join } = require('path')
const { client } = require('google-cloud-bigquery')
// The line below assumes you have a file 'schema.js' located under 'path-to-your-schema-file'
// organised in a way where the 'schema' object below is structured as follow:
// 	schema.table_01 	This is the schema of 'table_01'
//	schema.table_02 	This is the schema of 'table_02'
const schema = require('path-to-your-schema-file/schema.js')

const bigQuery = client.new({ jsonKeyFile: join(__dirname, './service-account.json') })
const db = bigQuery.db.get('your-dataset-id')

const tbl_01 = db.table('table_01')
const tbl_02 = db.table('table_02')

const maintainTablesScript = () => {
	console.log('\nChecking for BigQuery tables updates...')
	return [tbl_01, tbl_02].map(table => ({ table, schema: schema[table.name] })).reduce((job, { table, schema }) => job
		.then(() => 
			table.exists()
				.then(tableExists => tableExists
					? console.log(`  - Table '${table.name}': Table already exists in DB '${db.name}'.`)
					: Promise.resolve(console.log(`  - Table '${table.name}': Table not found. Creating it now...`))
						.then(() => table.create.new({ schema }))
						.then(() => console.log(`  - Table '${table.name}': Table successfully created.`))
				)
				.then(() => table.schema.isDiff(schema))
				.then(schemaHasChanged => schemaHasChanged
					? Promise.resolve(console.log(`  - Table '${table.name}': Schema changes detected in table. Updating now...`))
						.then(() => table.schema.update(schema))
						.then(() => console.log(`  - Table '${table.name}': Schema successfully updated.`))
					: console.log(`  - Table '${table.name}': No schema updates found.`)
				)
		)
		.catch(err => {
			console.log(`  - Table '${table.name}': Oops... An error occurred: ${err.message}`)
		}), 
	Promise.resolve(null))
}

maintainTablesScript()

This Is What We're Up To

We are Neap, an Australian Technology consultancy powering the startup ecosystem in Sydney. We simply love building Tech and also meeting new people, so don't hesitate to connect with us at https://neap.co.

Our other open-sourced projects:

GraphQL

  • graphql-serverless: GraphQL (incl. a GraphiQL interface) middleware for webfunc.
  • schemaglue: Naturally breaks down your monolithic graphql schema into bits and pieces and then glue them back together.
  • graphql-s2s: Add GraphQL Schema support for type inheritance, generic typing, metadata decoration. Transpile the enriched GraphQL string schema into the standard string schema understood by graphql.js and the Apollo server client.
  • graphql-authorize: Authorization middleware for graphql-serverless. Add inline authorization straight into your GraphQl schema to restrict access to certain fields based on your user's rights.

License

Copyright (c) 2018, Neap Pty Ltd. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of Neap Pty Ltd nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NEAP PTY LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
