A light weight single header alternative to DBMS
This library was developed during my honors project at the University of Victoria under the supervision of Bill Bird. The original development occurred in this Metaprogramming Optimization repository, but was moved into a new, dedicated, home repository. The project was inspired and influenced by Hana Dusíková's great Compile Time Regular Expressions library (CTRE).
Maintenance may slow in the near future due to my employer's open source software policy. I will be looking into getting approval to continue maintaining this project.
Supported features:
- SQL query syntax for data processing
SELECT
data queryingAS
column renamingCROSS JOIN
(note: all column names of each relation must be unique)NATURAL JOIN
(note: natural join will attempt to join on the first column of each relation)WHERE
clause predicates on numeric andstd::string
types- Wildcard selection with
*
- Nested queries
- Uppercase and lowercase SQL keywords
- Modern
!=
and legacy<>
not-equal operator - Standard SQL operator precedence in
WHERE
clause - Schemas support all default constructable types
- Indexes for schemas (used for sorting the data)
- Range loop and structured binding declaration support
- Loading data from files (no header row)
- Storing data from
sql::schema
andsql::query
objects to files - Element querying from
sql::row
objects withsql::get<"column-name">
Unsupported features (future work):
INNER JOIN
,OUTER JOIN
,LEFT JOIN
, andRIGHT JOIN
GROUP BY
,HAVING
, andORDER BY
(using indexes can simulate some of these features)IN
operation withinWHERE
clause- Template argument error detection
As of April 2020, Constexpr SQL is only supported by GCC 9.0+
. The compiler support is constrained because of the widespread use of the new C++20
feature "Class Types in Non-Type Template Parameters" (proposal P0732R2
) which is only implemented by GCC 9.0+
. Library users specify SQL queries and column labels with string literals which are converted into constexpr
objects all of which relies on functionality from P0732R2
.
The following example shows usage of all class templates a user is expected to define to use the library.
#include <iostream>
#include <string>
#include "sql.hpp"
using books =
sql::schema<
"books", sql::index<"title">,
sql::column<"title", std::string>,
sql::column<"genre", std::string>,
sql::column<"year", unsigned>,
sql::column<"pages", unsigned>
>;
using authored =
sql::schema<
"authored", sql::index<>,
sql::column<"title", std::string>,
sql::column<"name", std::string>
>;
using query =
sql::query<
"SELECT title AS book, name AS author, year, pages "
"FROM books NATURAL JOIN (SELECT * FROM authored WHERE name = \"Harlan Ellison\") "
"WHERE year = 1967 OR year >= 1972 AND genre = \"science fiction\"",
books, authored
>;
int main()
{
authored a{ sql::load<authored>("tests/data/authored.tsv", '\t') };
books b{ sql::load<books>("tests/data/books.tsv", '\t') };
for (query q{ b, a }; auto const& [book, author, year, pages] : q)
{
std::cout << book << '\t' << author << '\t' << year << '\t' << pages << '\n';
}
return 0;
}
sql::schema
defines a relation used in a query. sql::index
defines how an sql::schema
sorts its data (unsorted if unspecified). sql::column
types are used to define the rows in an sql::schema
. sql::query
wraps a query statement and the sql::schema
types the query will operate on. sql::load
can be used to load data from a file into an sql::schema
.
The example is from example.cpp
in the root of the repository, and can be compiled and executed with the following command:
g++ -std=c++2a -pedantic -Wall -Wextra -Werror -O3 -Isingle-header/ -o example example.cpp && ./example
It is strongly recommended to compile with optimizations enabled, otherwise expect template bloat. Use of multiple objects of the same sql::query
type is considered undefined behavior (due to issues involving static members). Instantiating sql::query
objects should be performed within a guarded scope, like in the example. However, there are no use restrictions to sql::schema
types. sql::schema
types may be used multiple times within a single query or in many queries at once. There are more examples and information in presentation.pdf
at the root of the repository.
The library has a significant testing system which is composed of two script pipelines. All tests use the data from another project of mine called Terminus
which is a library database shell. The correctness testing pipeline generates nearly 1.5 million test queries, then Constexpr SQL's output is compared against the output of SQLite3
performing the same queries. The performance testing pipeline executes six different SQL queries implemented using Constexpr SQL and hand coded SQL. The queries are executed over 65 thousand times (256 for CROSS JOIN
due to computational complexity), and the execution timing is captured using the Linux time
tool.
The runner.sh
script in the tests
directory will execute correctness testing, and the runner.sh
script in tests/perf
will execute performance testing.
The following sections provide a high-level description about how the library is implemented. Hopefully, the sections will provide useful code and document references to others looking to write similar libraries.
The sql::schema
class template represents relational schemas and, when instantiated, SQL tables. The class template is parameterized on three template parameters: Name
, Index
, and Col
template parameter pack. Name
defines the SQL table name which is matched against table names in a query's FROM
statement. The Index
template argument is used to support GROUP BY
statements by using SFINAE to select the underlying column data container (std::vector
or std::multiset
). The Index
template argument, when fully specified, provides the comparator functor used by the std::multiset
container. The Cols
template parameter pack is expanded into the sql::row
type for the schema. sql::schema
objects support structured binding declarations which is facilitated partly through the sql::schema
API and partly through std
namespace injections from sql::row
helping to satisfy the argument dependant lookup of the get<i>
function.
Reference the example earlier for proper usage of sql::schema
. Notice in the example the string literal as a template argument. String literals are lvalue reference types which are passed as const
pointers. Normally, pointers can not be used as template arguments. With the new C++20 feature mentioned earlier, a cexpr::string
constructor template can be deduced to turn the string literal into a constexpr
object. The deduction is enabled through cexpr::string
's class template argument deduction guide which provides a mapping of constructor arguments to template parameters.
The sql::query
class template is the user interface to the SQL query parser. The class is templated on a cexpr::string
object (the SQL query) and a template parameter pack of sql::schema
types. At compile time, the SQL query string is parsed into the relational algebra expression tree representing the query's computation. The constructor to a fully specified sql::query
class takes a variadic pack of sql::schema
objects which it uses to seed the relational algebra expression tree with iterators to data. The sql::query
object can then be used in a range loop with structured binding declarations like in the example.
The relational algebra expression tree uses static members to hold data, so only one object of a single fully specified sql::query
class can exist at once in the program. To ensure this the object should be constructed within a guarded scope like in the example. It's worth noting that even though this class template's source file is the largest among the code base, nearly all of it is only live during compilation to parse the SQL query. In fact, the runtime interface is deliberately insubstantial, merely providing an wrapper to support range loops and structured binding declarations.
In compliance with range loop syntax, sql::query
has an associated iterator class sql::query_iterator
. sql::query_iterator
wraps the type representing the relational algebra expression and handles all of the idiosyncrasies of its usage in favor of the familiar forward iterator
interface. When an sql::query_iterator
is dereferenced, it returns a constant reference to an sql::row
object representing the current row of output from the query stream.
The sql::row
class template is a template recursive linked-list (similar to std::tuple
). A template recursive linked-list is a template metaprogramming pattern which expresses a type analogous to a traditional linked-list. sql::row
implements this pattern with two template parameters Col
and Next
. Col
represents the sql::column
which the node in list holds a data element from. Next
represents the type of the next node in the list, which is either another sql::row
type or sql::void_row
. Because the next node is expressed as a type the template linked-list does not incur the overhead of holding a next pointer nor the run time cost of dereferencing a pointer to iterate (also makes better use of the cache). A quirk to this pattern is that the node data type need not be homogenous across the list, instead the list may be composed of heterogenous data types. Also, template linked-list access is computed at compile time, so the run time cost is constant.
The example does not demonstrate the sql::get<cexpr::string>
helper function. This helper function allows the user to query a specific element from an sql::row
object by column name. Additionally, sql::row
has ML-like functions for getting the head
and tail
of the object.
At the moment, ra::projection
, ra::rename
, ra::cross
, ra::natural
, ra::selection
, and ra::relation
are the only relational algebra nodes implemented. ra::projection
and ra::rename
are unary operators which take a single sql::row
from their Input
relational algebra operator and fold their operation over the row before propagating the transformed row to their Output
. The fold
is implemented as a template recursive function. ra::cross
outputs the cross product of two relations. ra::natural
implements a natural join between two relations using a hash table buffer of the right relation for performance. ra::selection
uses a predicate function constructed from a WHERE
clause to filter rows in a query. ra::relation
is the only terminal node in the expression tree which is used for retrieving the next input in the stream. These operators are composable types and are used to serialize the relational algebra expression tree. Individual objects of each type are not instantiated to compose the expression tree. Instead to ensure the expression tree is a zero overhead abstraction, the types implement a static
member function next
used to request data from its input type. The actual constexpr
template recursive recursive descent SQL parser will serialize these individual nodes together into the appropriate expression tree.
As a proof of concept for constexpr
parsing, two math expression parsers were implemented in old repository: cexpr::prefix
and cexpr::infix
. cexpr::prefix
demonstrates the fundamental method of constexpr
parsing an expression tree into a class template. cexpr::infix
extends this to perform constepxr
recursive descent parsing. cexpr::infix
and the SQL query parser are a whole order of magnitude more complex, because there's recursive function template instantiations to many different function templates. The explanation of constexpr
parsing is illustrated through cexpr::prefix
for simplicity.
The expression tree created while parsing is a template recursive tree which shares similar properties to the template linked-list (discussed earlier). A notable benefit to this data structure is that because the tree is composed of types rather than data values, the tree can be used to express computation models (expression trees) rather than just a tree based container.
Fundamentally, the parsing is accomplished by calling a template recursive static constexpr
parsing function member parameterized on the token position which the parser's "cursor" is standing on (Pos
template parameter). At each "cursor" position, the function uses the token to decide the how to proceed. If the token indicates the start of an operation, the parser recurses the immediate left subexpression ("cursor" + 1). On return, the left subexpression will report the depth in the token stream it recursed at which point the right subexpression will pick up at this position. This control flow is expressed in this line of cexpr::prefix::parse
. Once both left and right subexpressions are parsed, the new node's type is formed (decltype
left and right subexpressions) which is then propagated to the caller. Otherwise, if the token indicates a terminal, then an appropriate terminal node is constructed. It is necessary that the "cursor" position is unique across template instantiations, otherwise template memoization will lead to "infinite" recursion.
The few ancillary class templates used to support this parsing and the math node struct
templates can be found in the templ
namespace of the old repository. There is also a driver program using the parsers in the old repository. In the Constexpr SQL parser, all of the entities in the templ
namespace were replaced for more template metaprogramming idiomatic structures.