-
Notifications
You must be signed in to change notification settings - Fork 79
UDFs
Note: this page is about the legacy %bigquery udf
magic, which is being deprecated as of the v1.0 release. Please see the migration guide page here.
Datalab supports Javascript user-defined functions (UDFs) for BigQuery. UDFs are Javascript functions that take a row object and emitter function as input, perform some computation, and then call the emitter function to output a resulting transformed row object (or possibly multiple rows). A UDF is thus similar to the Map function in a MapReduce.
The BigQuery UDF documentation explains that a defineFunction()
call is needed to define a UDF, including its input fields, output schema, and so forth. Datalab is a bit simpler; it makes use of jsdoc-style @param
comments to achieve the same result. Also note that a UDF function should not have a name or be assigned to a variable. It should ideally be stateless, as you cannot guarantee consistency of state across multiple nodes, but you can call support functions.
The basic skeleton of a UDF in Datalab looks like:
%bigquery udf -m sample_udf
/**
* A UDF function should take an input row `r` and emitter function `emitFn()`.
* It should create one or more transformed forms of the row and output
* them by calling the emitter.
*
* We define two parameters, below, which specify the schema of the input row and
* the output row.
*
* In our example we will output a row with one less column which is the product
* of the count and weight.
*
* @param {{timestamp: timestamp, label: string, count: integer, weight: float}} r
* @param function({{timestamp: timestamp, label: string, weighted_count:float}}) emitFn
*/
function(r, emitFn) {
emitFn({
timestamp: r.timestamp,
label: r.label,
weighted_count: r.count * r.weight
});
}
Strictly speaking, the types are not needed for the input row, so we could have used:
* @param {{timestamp, label, count, weight}} r
but it is good practice to include them.
To apply this UDF to a table [foo]
we can use:
%%sql
SELECT * FROM sample_udf([foo])
The %udf magic parses the cell and creates a variable named sample_udf
in the notebook, referencing a datalab.bigquery.UDF
class instance. This object in turn contains a number of fields:
-
_name
: the name of the UDF; -
_implementation
: a string containing the body of the UDF function only (no support function code); -
_support_code
: a string containing the Javascript code for any other functions defined in the cell; as mentioned before it is possible to include and call support functions from the main UDF; -
_imports
: the support functions do not need to be explicitly contained in the cell. They could be in a Javascript file stored in GCS and referenced via an@import
docstring. This member holds the list of such import URLs; -
_code
: this is a string containing the inline support code and the necessary call todefineFunction
; it is what is needed to be passed to BigQuery along with the_implementation
and_imports
to use the UDF.
When parsing a query defined in a %sql
cell, Datalab looks for calls to UDFs, and try to resolve them. If the %sql
cell is meant to be executed immediately (i.e. no -m/--module argument) then it will try to resolve these using the notebook execution context; otherwise it will use the UDFs passed in to the Query
constructor with the udfs
parameter or any other UDFs that are passed in as named arguments to the Query. That is, the %sql
cell above could be implemented with:
%%sql -m sample_query
SELECT * FROM sample_udf([foo])
and then either:
import datalab.bigquery as bq
bq.Query(sample_query, udfs=[sample_udf]).results()
or:
import datalab.bigquery as bq
bq.Query(sample_query, sample_udf=sample_udf).results()