
Extending Squerall

Mohamed Nadjib MAMI edited this page Mar 20, 2019 · 9 revisions

Squerall was built from the ground up with extensibility in mind. It can be extended programmatically in two ways:

  • Supporting more data sources.
  • Adding a new query engine.

Supporting more data sources

Squerall makes use of Spark's and Presto's connectors. Both engines make it convenient to plug in a connector.

In Spark

To add a new data source, in most cases only a connection template is needed:

spark.read.format(format).options(options).load

Here, format is a predefined string denoting the data source to connect to, and options is a map of Spark options specific to that data source.

For example, to load data from a Cassandra table, the template takes the form:

val options = Map("keyspace" -> "db", "table" -> "product")
val df = spark.read.format("org.apache.spark.sql.cassandra").options(options).load

Injecting the above into the code is straightforward: open SparkExecutor.scala, locate the sourceType match .. case expression, and add a new case at the end with the necessary code. While there, take a moment to look at how the currently supported data sources are configured.
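The dispatch on the source type can be sketched as follows. This is a minimal, Spark-free illustration, not Squerall's actual code: ConnectionTemplates, Template and templateFor are hypothetical names, and the "jdbc" case stands in for a newly added source. Only the format strings ("org.apache.spark.sql.cassandra", "parquet", "jdbc") are real Spark format names.

```scala
// Hypothetical sketch: the names below are illustrative, not taken from SparkExecutor.scala.
object ConnectionTemplates {
  // What the connection template needs for one source: a format string and its options.
  final case class Template(format: String, options: Map[String, String])

  // Maps a source type to the values to pass to spark.read.format(...).options(...).
  def templateFor(sourceType: String, options: Map[String, String]): Template =
    sourceType match {
      case "cassandra" => Template("org.apache.spark.sql.cassandra", options)
      case "parquet"   => Template("parquet", options)
      // A new case added at the end, here for a JDBC source (assumed example).
      case "jdbc"      => Template("jdbc", options)
      case other =>
        throw new IllegalArgumentException(s"Unsupported source type: $other")
    }
}
```

With a SparkSession at hand, a template t returned by templateFor would then be loaded with spark.read.format(t.format).options(t.options).load.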

In a few cases, some preparation is needed before executing the template. For MongoDB, for example, a ReadConfig object has to be created beforehand (see SparkExecutor.scala). The connector documentation is generally clear about such steps.

Visit the Spark Packages web page to browse all available connectors and find your way to their documentation pages.

In Presto

Presto makes adding a new source even easier: no code change is required. All it takes is creating a new config file inside (Presto home)/etc/catalog and adding a number of key=value lines. For example, to add support for Cassandra, create a file named cassandra.properties with the following content:

connector.name=cassandra
cassandra.contact-points=localhost
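If catalog files are to be generated rather than written by hand, the step could be scripted along these lines. This is only a sketch under stated assumptions: PrestoCatalog, render and write are hypothetical helpers that are not part of Squerall or Presto, and the catalog directory is passed in by the caller.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Squerall or Presto.
object PrestoCatalog {
  // Renders the settings as key=value lines, one per line, in the given order.
  def render(settings: Seq[(String, String)]): String =
    settings.map { case (key, value) => s"$key=$value" }.mkString("", "\n", "\n")

  // Writes <catalogDir>/<name>.properties and returns the path of the written file.
  def write(catalogDir: Path, name: String, settings: Seq[(String, String)]): Path = {
    Files.createDirectories(catalogDir)
    val file = catalogDir.resolve(s"$name.properties")
    Files.write(file, render(settings).getBytes(StandardCharsets.UTF_8))
  }
}
```

Calling PrestoCatalog.write with the name "cassandra" and the two settings shown above would produce the same cassandra.properties file.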

Visit the Presto connectors web page to browse all available connectors.

Adding a new query engine

Thanks to Squerall's modular code design, other query engines can be added alongside Spark and Presto. The requirement is that the new query engine offers a concept similar to the connectors explained above. The effort and expertise required are significantly higher than for connecting a new data source to an existing query engine, but this page tries to guide you through the process.

You need to create two Scala classes:

  • A class to define an entity model.
  • A class to define an executor.

Entity Model class

This class represents the object that is loaded from the data source and then transformed and queried (projection, selection, join, etc.). In Spark, this corresponds to DataFrame; in Presto, a custom model called DataQueryFrame has been created.

DataFrame is a huge class; have a look at DataQueryFrame instead to get an idea of what is needed.

Executor class

This class should extend the unified class QueryExecutor. It contains the methods necessary for executing the query, like query(), join(), project(). Have a look at SparkExecutor and PrestoExecutor to get an idea of what is needed.

The Executor is instantiated and used in the Main class, whereas the Entity Model is instantiated inside the Executor class. The latter is created differently in Spark and Presto. In Spark, it is created from the connection template explained above:

var df : DataFrame = null
...
case "cassandra" => df = spark.read.format("org.apache.spark.sql.cassandra").options(options).load

Whereas in Presto, it is instantiated directly:

val finalDQF = new DataQueryFrame()

Candidate query engines are Drill, Alluxio, Impala, and Flink. For query engines that, like Presto, work like databases directly accepting SQL statements (Drill, for example), both the Entity Model and Executor classes can be largely modeled on Presto's Entity Model and Executor classes.
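To give a feel for the shape of the two classes for a SQL-accepting engine, here is a deliberately simplified sketch. Everything in it is an assumption made for illustration: SimpleExecutor is a stand-in for QueryExecutor with invented signatures, and SqlFrame is a toy entity model that only accumulates the parts of one SQL statement. The real QueryExecutor and DataQueryFrame should be consulted for the actual methods and fields.

```scala
// Hypothetical, simplified stand-in for QueryExecutor; the real signatures differ.
trait SimpleExecutor[T] {
  def query(table: String): T
  def project(entity: T, columns: Seq[String]): T
  def join(left: T, right: T, leftCol: String, rightCol: String): T
}

// Hypothetical entity model: collects the parts of one SQL statement.
final case class SqlFrame(
    select: Seq[String] = Seq.empty,
    from: Seq[String] = Seq.empty,
    where: Seq[String] = Seq.empty) {

  // Assembles the collected parts into a SQL string for the engine to run.
  def toSql: String = {
    val cols = if (select.isEmpty) "*" else select.mkString(", ")
    val cond = if (where.isEmpty) "" else where.mkString(" WHERE ", " AND ", "")
    s"SELECT $cols FROM ${from.mkString(", ")}$cond"
  }
}

// Hypothetical executor: each operation returns an updated entity model.
object SqlExecutor extends SimpleExecutor[SqlFrame] {
  def query(table: String): SqlFrame = SqlFrame(from = Seq(table))

  def project(entity: SqlFrame, columns: Seq[String]): SqlFrame =
    entity.copy(select = columns)

  // Expresses the join as an equality condition over both tables.
  def join(left: SqlFrame, right: SqlFrame, leftCol: String, rightCol: String): SqlFrame =
    SqlFrame(
      select = left.select ++ right.select,
      from = left.from ++ right.from,
      where = (left.where ++ right.where) :+ s"$leftCol = $rightCol")
}
```

In this sketch the executor never touches data itself; it only builds the statement, which a real executor would then submit to the engine, for example over JDBC.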
