Extending Squerall
Squerall was built from the ground up with extensibility in mind. It can programmatically be extended in two ways:
- Supporting more data sources.
- Adding a new query engine.
Squerall makes use of Spark's and Presto's connectors, and both engines make it convenient to plug in a connector.
To add a new data source in Spark, in most cases only a connection template is needed:
spark.read.format(format).options(options).load
Here, format is a predefined Spark string constant denoting the data source to connect to, and options is a map of Spark options specific to that data source.
For example, to load data from a Cassandra table, the template takes the form:
val options = Map("keyspace" -> "db", "table" -> "product")
val df = spark.read.format("org.apache.spark.sql.cassandra").options(options).load
Injecting this template into the code is easy: head to SparkExecutor.scala, locate the sourceType match .. case block, and add a new case at the end with the necessary code. Once there, take a moment to look at how the currently supported data sources are configured.
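As a rough illustration of adding a case, the sketch below uses a simplified stand-in for the match block. The object SourceFormats, the method formatFor, and the "orc" source type are hypothetical names chosen for this example; the real block in SparkExecutor.scala also builds the DataFrame and may be organized differently.

```scala
// Hypothetical, simplified stand-in for the `sourceType match` in SparkExecutor.scala.
// It only resolves the Spark format string for a given source type.
object SourceFormats {
  def formatFor(sourceType: String): String = sourceType match {
    case "csv"       => "csv"
    case "parquet"   => "parquet"
    case "cassandra" => "org.apache.spark.sql.cassandra"
    // New case added at the end for an additional source (ORC files, as an assumption).
    case "orc"       => "orc"
    case other       =>
      throw new IllegalArgumentException(s"Unsupported source type: $other")
  }
}

// With a SparkSession named `spark` in scope, the result plugs into the template:
//   val df = spark.read.format(SourceFormats.formatFor("orc")).options(options).load
```

Keeping the new case at the end leaves the existing cases untouched, so the currently supported sources keep behaving as before.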
In a few cases, some preparation is needed before executing the template. For MongoDB, for example, a ReadConfig object has to be created beforehand (see it in SparkExecutor.scala). The connector documentation is generally clear about such requirements.
Visit the Spark Packages web page to browse all available connectors and find your way to their documentation pages.
Presto makes adding a new source even easier: no code change is required. All it takes is creating a new config file inside (Presto home)/etc/catalog and adding a number of key=value lines. For example, to add support for Cassandra, create a file named cassandra.properties with the following content:
connector.name=cassandra
cassandra.contact-points=localhost
Visit the Presto connectors web page to browse all available connectors.
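If catalog files need to be generated rather than written by hand, the Cassandra example above could be produced along the following lines. This is only a sketch: CatalogFile, render, and write are hypothetical helpers that are not part of Squerall or Presto, and the directory passed in is assumed to be (Presto home)/etc/catalog.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Squerall or Presto.
object CatalogFile {
  // Renders the entries as key=value lines, one per entry.
  def render(props: Seq[(String, String)]): String =
    props.map { case (key, value) => s"$key=$value" }.mkString("", "\n", "\n")

  // Writes <catalogDir>/<catalogName>.properties and returns its path.
  def write(catalogDir: Path, catalogName: String, props: Seq[(String, String)]): Path = {
    Files.createDirectories(catalogDir)
    val target = catalogDir.resolve(s"$catalogName.properties")
    Files.write(target, render(props).getBytes(StandardCharsets.UTF_8))
  }
}

// Example call (the directory is an assumption standing in for the Presto catalog folder):
//   CatalogFile.write(prestoCatalogDir, "cassandra", Seq(
//     "connector.name" -> "cassandra",
//     "cassandra.contact-points" -> "localhost"))
```

The file name without the .properties extension becomes the catalog name, so a table from this source is addressed in Presto SQL as cassandra.(schema).(table).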
Thanks to Squerall's modular code design, other query engines can be added alongside Spark and Presto. The requirement is that the new query engine has a concept similar to the connectors explained above. The effort and expertise required here are significantly higher than for connecting a new data source to an existing query engine, but this documentation page will try to guide you through the process.
You need to create two Scala classes:
- A class to define an entity model.
- A class to define an executor.
The entity model class represents the object that is loaded from the data source and then transformed and queried (projection, selection, join, etc.). In Spark, this corresponds to DataFrame; in Presto, a custom model called DataQueryFrame has been created.
DataFrame is a huge class, so have a look at DataQueryFrame instead to get an idea of what is needed there.
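To make the idea of an entity model more concrete, here is a minimal sketch for an engine that accepts SQL text. SimpleQueryFrame and all of its members are hypothetical and deliberately much smaller than Squerall's DataQueryFrame; the sketch only shows the kind of state such a model might accumulate before a query is sent to the engine.

```scala
// Hypothetical, simplified entity model; NOT Squerall's DataQueryFrame.
// It accumulates the parts of a query and renders them as one SQL string.
final case class SimpleQueryFrame(
    table: String = "",
    columns: Vector[String] = Vector.empty,
    filters: Vector[String] = Vector.empty,
    joins: Vector[String] = Vector.empty
) {
  def from(t: String): SimpleQueryFrame = copy(table = t)

  // Projection: columns to keep.
  def select(cols: String*): SimpleQueryFrame = copy(columns = columns ++ cols)

  // Selection: filter conditions, combined with AND.
  def where(cond: String): SimpleQueryFrame = copy(filters = filters :+ cond)

  // Join with another table on a given condition.
  def join(other: String, on: String): SimpleQueryFrame =
    copy(joins = joins :+ s"JOIN $other ON $on")

  def toSql: String = {
    val projection = if (columns.isEmpty) "*" else columns.mkString(", ")
    val joinPart   = if (joins.isEmpty) "" else joins.mkString(" ", " ", "")
    val wherePart  = if (filters.isEmpty) "" else filters.mkString(" WHERE ", " AND ", "")
    s"SELECT $projection FROM $table$joinPart$wherePart"
  }
}
```

The model is immutable here, so each transformation returns a new frame. That is a design choice of this sketch, not a requirement.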
The executor class should extend the unified class QueryExecutor, which contains the methods necessary for executing the query, such as query(), join(), and project(). Have a look at SparkExecutor and PrestoExecutor to get an idea of what is needed.
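The shape of such an executor can be sketched as follows. MiniExecutor is a hypothetical stand-in for QueryExecutor: the real class has more methods and its signatures are not reproduced here. SqlTextExecutor is likewise an illustrative name, and it uses a plain SQL string as its entity model to stay short.

```scala
// Hypothetical stand-in for Squerall's QueryExecutor, parameterized by the
// entity model type T. Method names follow the text; signatures are assumptions.
trait MiniExecutor[T] {
  def query(source: String): T
  def project(data: T, columns: Seq[String]): T
  def join(left: T, right: T, leftKey: String, rightKey: String): T
}

// Illustrative executor for an engine that accepts SQL statements directly.
// The entity model is just the SQL text built so far.
object SqlTextExecutor extends MiniExecutor[String] {
  def query(source: String): String =
    s"SELECT * FROM $source"

  def project(data: String, columns: Seq[String]): String =
    s"SELECT ${columns.mkString(", ")} FROM ($data) AS t"

  def join(left: String, right: String, leftKey: String, rightKey: String): String =
    s"SELECT * FROM ($left) AS l JOIN ($right) AS r ON l.$leftKey = r.$rightKey"
}
```

Parameterizing the trait by the entity model type lets a Spark-style executor work on DataFrame values and a SQL-style executor work on a query object, while the calling code stays the same.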
The Executor is instantiated and used in the Main class, whereas the Entity Model is instantiated inside the Executor class. The Entity Model is created differently in Spark and Presto. In Spark, it is created from the connection template explained above:
var df : DataFrame = null
...
case "cassandra" => df = spark.read.format("org.apache.spark.sql.cassandra").options(options).load
In Presto, by contrast, it is instantiated directly:
val finalDQF = new DataQueryFrame()
Candidate query engines include Drill, Alluxio, Impala, and Flink. For query engines that, like Presto, work like databases directly accepting SQL statements (Drill, for example), both the Entity Model and the Executor class can be largely copied or adapted from Presto's Entity Model and Executor classes.