This is DataFusion's implementation of SQLancer on SQLancer's testing framework.
datafusion-sqlancer
has found ~30 bugs in DataFusion.
SQLancer (Synthesized Query Lancer) is a tool to automatically test Database Management Systems (DBMS) in order to find logic bugs in their implementation. It's a black box fuzzer which performs SQL-level testings.
SQLancer operates in the following two phases:
- Database and Query Generation: Generating random database tables, and then generate complex queries using those tables
- Testing: The goal of this phase is to detect the logic bugs automatically based on the generated database, using several test oracles.
- The weakest test oracle ensures that a query won't crash the
datafusion
engine. Since the query generator can construct chaotic queries, it's possible to detect crash bugs. SQLancer
comes with several stronger test oracles, here is the example of one of those:Non-optimizing Reference Engine Construction (NoREC) Randomly generated query(Q1): select * from t1 where v1 > 0; Mutated query(Q2): select v1 > 0 from t1; Consistency check: The result size of Q1 should be equal to the number of `True` in Q2's output
- The weakest test oracle ensures that a query won't crash the
Above showed consistency check generated Q1 (very likely to be optimized by predicate pushdown), and Q2(hard to be optimized), such test oracle focuses on the correctness of the optimizer. There are 5 similar test oracles available to be implemented, those carefully designed checks make this testing framework really powerful.
More details about test oracles can be found in SQLancer page.
DataFusion-SQLancer
can be viewed as an extendible SQL-level fuzz testing framework:
- It can generate random tables and queries that are supported by
DataFusion
. - It's convenient to implement more test oracles.
- One way is to come up with more
SQLancer
style oracles like making the aboveQ2
intoselect v1>= 0 from t1
, and checkQ2
's result size is greater or equal toQ1
's - Another way is to change
DataFusion
's configurations: for exampleQ1
'st1
is in-memory table, andQ2
'st2
can uset1
backed by aParquet
file with same content. Or letQ1
execute in multiple thread/partition, andQ2
will be executed in single thread/partition.
- One way is to come up with more
Requirements:
- Java 11 or above
- Maven (
sudo apt install maven
on Ubuntu)
Download datafusion-sqlancer
git clone https://github.com/datafusion-contrib/datafusion-sqllancer
cd sqlancer
First start the DataFusion
server
cd src/sqlancer/datafusion/server/datafusion_server
cargo run --release
Then compile and start SQLancer
mvn package -DskipTests
cd target
java --add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED -jar sqlancer-*.jar --random-seed 0 --num-threads 1 --max-generated-databases 1 --num-tries 10 --num-queries 500 datafusion
If the execution prints progress information every five seconds, then the tool works as expected. Execution logs can be found at target/logs/datafusion/
For execution java --add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED -jar sqlancer-*.jar --random-seed 0 --num-threads 1 --max-num-inserts 10 --num-tries 10 --num-queries 500 datafusion
- It will perform 10(
--num-tries
) rounds of tests, at the beginning of each round, several random tables will be created. - In each round, 100(
--num-queries
) random query will be generated, each query will be test against random test oracles. If some query failed or 100 queries finished,SQLancer
will go to next round and generate new databases.
Table generator will first pick random number of tables, also random number of column (in random type), and get corresponding SQL CREATE
statements to populate tables.
After, query generator know existing table schemas, it will generate a SQL query in a top-down way:
- Generate a statement like
SELECT [expr1] FROM [expr2] WHERE [expr3]
, and with some probability, add extraORDER BY/JOIN/...
clause. - Generate expressions also in a top-down recursive way, for example in
WHERE
clause generator wants a boolean expressiongenerateExpression(BOOL) -> exprA==exprB // Pick one expression which will generate boolean result -> generateExpression(numeric) for exprA -> ... -> generateExpression(numeric) for exprB -> ...
Notes for query generation:
- Expression generation is "typed", which means generator can specify what type to generate for a expression, but it can eventually generate a different type.
- One reason is the generation implementation is "best effort" for simplicity, so the generation process can fail and produce a query that can't be executed on datafusion (which is fine: just continue with the next query)
- Generator will ignore target type and pick another random type sometimes, to stress
DataFusion
engine to handle incorrect input argument types.(e.g. generate a numeric type inWHERE
clause)
- Expression generation is just a simple random walk on allowed syntax tree, some traditional fuzzing techniques is hard to apply in this context:
- There is no feedback to guide the test case generation yet.
- Generation is grammar-based, it can't generate bad input like
SELECT SELECT 1;
. (It's possible to explore this strategy in the future) - One way to let this approach attack more corner case is deliberately put more extreme values in the generation process:
- e.g., Populating empty tables, insert more
NULL
,INF
,NaN
into tables etc.
- e.g., Populating empty tables, insert more
DataFusion
server is adapted from datafusion-examples, now only supported single threadSQLancer
execution.generateAndTestDatabase()
is a "driver" method to create random databases, generate queries, and finally test results.DataFusionExpressionGenerator.java
includes top-down expression generation logic.test/DataFusionNoRECOracle.java
contains the final result check for generated queries.
JOIN
s,ORDER BY
,WHERE
- Numeric scalar functions/expression operators
- String scalar functions/expression operators
- Aggregate functions,
HAVING
clause - Window functions
- (TODO) Time related data type functions
- (TODO) Subquery
- (TODO) Queries from parquet, csv
- (TODO) Exploit different configurations (change config knobs like
target_partition
,prefer_hash_join
etc.
Note: most oracles only apply to a subset of available query types, for advanced SQL features like window functions we can only generate random queries and report crashes. More context for below test oracles at https://github.com/sqlancer/sqlancer/tree/main
- NoREC
- TLP
- (TODO) PQS
- (TODO) DQP for logical bugs in joins
- (TODO) EET for logic bugs in joins and subqueries
If any bug is found by SQLancer
, it will print a full reproducer to terminal output, and also writes to logs/datafusion_custom_log/error_report.log
.
- Then, first verify the bug with latest
datafusion
main branch. - If bug is confirmed, try to reduce the bug and report it to the
datafusion
community. - If it's a false positive, feel free to open an issue here.
Upstream SQLancer repository have implementations of many DBMS, we plan to keep the development in datafusion-contrib
, and sync with upstream once a while.
All kinds of contributions are welcome: feel free to open issues, discussions, or pull requests.