Circus Train BigQuery is an extension of Circus Train which allows tables in Google BigQuery to be replicated to Hive.
This document contains a collection of notes put together by developers who have worked on the project to provide some explanations of the code and how it works. It is not an exhaustive account of the project's inner workings, so do feel free to add more information or detail.
First and foremost, it's worth having a read through the Circus Train README.md file and the Circus Train BigQuery README.md. These are pretty extensive guides containing a lot of info on the projects, including how to run each of them and the different configurations which can be used.
It may also be useful to read through the DEVELOPERS.md file for Circus Train.
At a high level, CTBQ uses the Replication and S3-MapReduce Copier integration points provided by CT to replicate data out of BigQuery and into Hive. It first runs a query to extract the data out of BQ into a temporary BQ table (applying any necessary partition filters) and then uses Google's BigQuery and Storage APIs to create a Job which extracts that data out of this temporary table onto Google Cloud Storage (GCS) as Avro files. A Hive table object (i.e. not a real Hive table) is then put "on top" of this data and Circus Train is instructed to replicate the data from this GCS source to the target as normal.
When a partition filter is configured:
- A temporary table is created by applying the partition filter to the source table.
- The data from this temporary table is extracted to GCS.
- Circus Train then uses this GCS location to perform the replication to Hive.

When no partition filter is configured:
- The data from the entire source table is extracted to GCS.
- This GCS location is used by Circus Train to perform the replication.
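The steps listed above can be sketched as a small planning function. This is only an illustration under stated assumptions, not code from the project: the class ReplicationFlowSketch, the method plan, the "_tmp" suffix and the step strings are all hypothetical stand-ins for the real BigQuery query job, extract job and Circus Train replication.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the flow described above; none of these names come from CTBQ.
final class ReplicationFlowSketch {

  // Returns the ordered steps for a source table, given an optional partition filter.
  static List<String> plan(String sourceTable, Optional<String> partitionFilter, String gcsUri) {
    List<String> steps = new ArrayList<>();
    String tableToExtract = sourceTable;
    if (partitionFilter.isPresent()) {
      // With a filter, the matching rows are first queried into a temporary BigQuery table.
      // The "_tmp" suffix is an assumption made for this sketch only.
      tableToExtract = sourceTable + "_tmp";
      steps.add("QUERY " + sourceTable + " WHERE " + partitionFilter.get() + " INTO " + tableToExtract);
    }
    // The table (temporary or full) is extracted to GCS as Avro files.
    steps.add("EXTRACT " + tableToExtract + " TO " + gcsUri + " AS AVRO");
    // Circus Train then replicates from the GCS location to Hive as normal.
    steps.add("REPLICATE " + gcsUri + " TO HIVE");
    return steps;
  }
}
```

Calling plan with a filter yields three steps (query, extract, replicate); calling it with Optional.empty() yields only the extract and replicate steps.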
As mentioned, Circus Train BigQuery is an extension of Circus Train, so processing begins in the Circus Train code. Control is then given to Circus Train BigQuery, which sets up the table to be replicated, and Circus Train performs the actual replication.
name - Title for the class being described
(CT) - Indicates that the class is a part of the Circus Train code
<Name> - Capitalised names indicate classes
<name> - Lowercase names indicate methods
Locomotive (CT)
- Creates a new Replication (CT) object using the ReplicationFactory (CT) and calls replicate on it.
ReplicationFactory (CT)
- Returns a Replication (CT) object; the type depends on whether the source table is partitioned or not.
- Creates a Source (CT) object which refers back to HiveEndpoint (CT), which then calls getTable on the BigQueryMetastoreClient object.
Replication (CT)
- Either PartitionedTableReplication (CT) or UnpartitionedTableReplication (CT).
- Uses the BigQueryCopierFactory to generate the BigQueryCopier, then calls copy to copy over the table data.
- After that, it updates the metadata of the table.
BigQueryMetastoreClientFactory
- Checks the given URI is acceptable for BigQuery.
- Returns a new instance of the BigQueryMetastoreClient.
BigQueryMetastoreClient
- Registers a container with the ExtractionService.
- The container has an UpdateTableSchemaAction to be executed by the ExtractionService.
- Returns a Hive table based on the BigQuery table. Creates the cached table object.
BigQueryMetastore
- Configures and runs a Query Job in GBQ which executes the partition filter query on the source table and loads the result into a temporary GBQ table.
TableServiceFactory
- Will create either a PartitionedTableService or an UnpartitionedTableService, based on whether the Hive table being replicated is partitioned or not.
- Created inside BigQueryMetastoreClient.
UnpartitionedTableService
- Takes the Hive table in the constructor.
- Returns an empty list of partitions.
PartitionedTableService
- Takes a BigQueryTableFilterer in the constructor. This executes the filter query on the table.
- Calls generate on the HivePartitionGenerator.
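The relationship between TableServiceFactory and the two table services can be sketched as follows. This is a minimal illustration, not the project's implementation: the TableService interface, the getPartitions method, the boolean flag and the Supplier standing in for the HivePartitionGenerator are all assumptions, and partitions are represented as plain strings.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical, simplified versions of the table services described above.
interface TableService {
  List<String> getPartitions();
}

final class UnpartitionedTableService implements TableService {
  @Override
  public List<String> getPartitions() {
    // An unpartitioned table has no partitions to report.
    return Collections.emptyList();
  }
}

final class PartitionedTableService implements TableService {
  // Stand-in for the HivePartitionGenerator mentioned in the text.
  private final Supplier<List<String>> partitionGenerator;

  PartitionedTableService(Supplier<List<String>> partitionGenerator) {
    this.partitionGenerator = partitionGenerator;
  }

  @Override
  public List<String> getPartitions() {
    return partitionGenerator.get();
  }
}

final class TableServiceFactory {
  // Picks the service based on whether the table being replicated is partitioned.
  static TableService newInstance(boolean partitioned, Supplier<List<String>> partitionGenerator) {
    return partitioned
        ? new PartitionedTableService(partitionGenerator)
        : new UnpartitionedTableService();
  }
}
```

Keeping both services behind one interface means the caller does not need to know which kind of table it is dealing with when it asks for partitions.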
HivePartitionGenerator
- Calls the BigQueryToHivePartitionConverter to generate a new basic partition.
- Uses BigQueryPartitionGenerator to add data from the BQ table to the partition.
- Sets partition parameters.
BigQueryPartitionGenerator
- Generates the partition in BigQuery using the query: select * except ....
- Schedules this partition for extraction with the ExtractionService.
BigQueryTableFilterer
- Creates the GBQ tables using the filtered query.
- Has a DeleteTableAction so that they are deleted afterwards.
CompositeCopierFactory (CT)
- Creates the BigQueryCopierFactory instance.
- Adds it to a CompositeCopier object.
- Later, in PartitionedTableReplication or UnpartitionedTableReplication, copy will be called on the CompositeCopier, which will call copy on the BigQueryCopier.
BigQueryCopierFactory
- Creates the BigQueryCopier.
BigQueryCopier
- Runs extract on ExtractionService.
- Delegates to the Circus Train copier, e.g. S3MapReduce, which performs the copy using the location of the data in GCS.
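The "extract first, then delegate" behaviour described for BigQueryCopier is essentially a decorator around another copier. A minimal sketch, assuming hypothetical Copier and Extractor interfaces and a hypothetical ExtractThenCopyCopier class (none of which are the real Circus Train types), might look like this:

```java
// Hypothetical single-method interfaces standing in for the real copier and extraction types.
interface Copier {
  String copy();
}

interface Extractor {
  void extract();
}

// Sketch of a copier that triggers the extraction before handing over to a delegate copier.
final class ExtractThenCopyCopier implements Copier {
  private final Extractor extractionService;
  private final Copier delegate;

  ExtractThenCopyCopier(Extractor extractionService, Copier delegate) {
    this.extractionService = extractionService;
    this.delegate = delegate;
  }

  @Override
  public String copy() {
    // The data must be in GCS before the delegate can copy it, so extraction runs first.
    extractionService.extract();
    // The delegate (for example an S3 copier) then copies from the GCS location.
    return delegate.copy();
  }
}
```

The ordering is the point of the design: the delegate copier only ever sees data that has already been written to GCS.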
ExtractionService
- Runs the PostExtractionActions inside the ExtractionContainers.
- Calls the DataExtractor class, which gets the data from the temp tables.
- Also has a cleanup method.
DataExtractor
- Extracts the data from the temporary BigQueryTable and puts it into GCS.
ExtractionContainer
- Takes a PostExtractionAction to run after the extraction of data.
- In the BigQueryTableFilterer it passes in an action to delete the table. PartitionedTableService calls the method that does this.
- The ExtractionContainer object is passed to ExtractionService via the register(_) method.
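The interplay between ExtractionService, ExtractionContainer and PostExtractionAction can be sketched as below. The class names follow the text, but the fields, constructors and the Consumer standing in for DataExtractor are assumptions made for illustration. The sketch also assumes that each action runs straight after its own table has been extracted, which the notes above do not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical action to run once a table's data has been extracted (e.g. delete a temp table).
interface PostExtractionAction {
  void run();
}

// Simplified container: what to extract, where to put it, and what to do afterwards.
final class ExtractionContainer {
  final String table;
  final String gcsUri;
  final PostExtractionAction postExtractionAction;

  ExtractionContainer(String table, String gcsUri, PostExtractionAction postExtractionAction) {
    this.table = table;
    this.gcsUri = gcsUri;
    this.postExtractionAction = postExtractionAction;
  }
}

final class ExtractionService {
  // Stand-in for the DataExtractor, which moves data from a temp table to GCS.
  private final Consumer<ExtractionContainer> dataExtractor;
  private final List<ExtractionContainer> registered = new ArrayList<>();

  ExtractionService(Consumer<ExtractionContainer> dataExtractor) {
    this.dataExtractor = dataExtractor;
  }

  void register(ExtractionContainer container) {
    registered.add(container);
  }

  void extract() {
    for (ExtractionContainer container : registered) {
      dataExtractor.accept(container);
      // Assumption: the action runs immediately after this container's extraction.
      container.postExtractionAction.run();
    }
  }
}
```

Registering work up front and running it later lets the metastore client and table services schedule extractions while the copier decides when they actually happen.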
The copying of the data is then carried out by Circus Train copiers.
If you would like to ask any questions about or discuss Circus Train or Circus Train BigQuery, please join our mailing list at https://groups.google.com/forum/#!forum/circus-train-user
The Circus Train BigQuery logo is licensed under the Creative Commons Attribution-Share Alike 4.0 International license. It includes an adaption of the Google BigQuery logo that is similarly licensed under the CC BY-SA 4.0 International license. The Circus Train logo uses the Ewert font by Johan Kallas under the SIL Open Font License (OFL).
This project is available under the Apache 2.0 License.
Copyright 2016-2020 Expedia, Inc.