Apache Drill

https://drill.apache.org/

Open source distributed MPP columnar in-memory SQL engine that runs SQL-on-Hadoop or directly on files.

Key Points

  • Forked from MapR around Drill 0.6 / 0.7

  • ANSI SQL 2003 or 2011 + extensions for nested data formats

  • Calcite SQL parser

  • read-only as of 0.6

  • no desc SQL command as of 0.7

  • 0.7 introduced a multicast replacement - nodes can be specified explicitly, eg. on AWS

  • 1.1+ SQL Windowing functions

  • MPP - columnar execution in memory

    • optimized pipelined execution

    • late schema binding

      • schema changes during live query -> all operators reconfigure themselves
    • query optimizer pluggable rules + cost model at various stages of execution

      • storage plugins can expose rules to drill optimizer
    • vectorized CPU execution (simultaneous batches) to keep all CPU pipelines busy for efficiency

    • 256KB batch pipelines between nodes

    • only spills to disk on memory overflow

    • no query checkpointing - faster, but the user must retry the query upon any node failure

    • no real scheduling

    • no concurrent resource management - memory / CPU contention etc.

      • limits query submissions but cannot calculate resources each query uses
      • can crash JVM OOM
  • HBase storage plugin supports predicate push down to filter at HBase rather than retrieving everything and then filtering

  • optimizer outputs Physical plan

  • clients can set session / system options, eg. which join algorithm or what parallelization to apply - see the example below
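
For example, join algorithm and parallelism can be toggled per session or system-wide. A quick sketch - these option names exist in Drill's sys.options, the values here are purely illustrative:

ALTER SESSION SET `planner.enable_hashjoin` = false;  -- fall back to merge join for this session
ALTER SESSION SET `planner.width.max_per_node` = 8;   -- cap per-node parallelism
ALTER SYSTEM  SET `planner.enable_mergejoin` = true;  -- cluster-wide default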

Data Support

  • can run SQL queries directly on JSON data from json files under a directory
  • rapidly evolve schema - auto-determines JSON schema on the fly
  • create view <select query>; to allow SQL access to the data - see the sketch below this list
  • multi-source queries can combine different kinds of data sources in the same query
  • schema can change over course of a query, operators designed to reconfigure themselves
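
A minimal sketch of both, assuming JSON files under a hypothetical /data/customers directory containing a name field (dfs.tmp is writable by default):

-- query JSON files directly; schema is inferred on the fly
SELECT t.name FROM dfs.`/data/customers` t LIMIT 10;

-- wrap the query in a view for plain SQL / BI tool access
CREATE VIEW dfs.tmp.customer_names AS SELECT t.name FROM dfs.`/data/customers` t;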

Drillbits

  • Drillbit on every server (equivalent to an Impala node)
  • All drillbits are peers (same as Impala); the connection node becomes the query planner (the Foreman)
  • Clients can connect directly to drillbit, but recommended to go via ZooKeeper
  • node failure triggers query re-run
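
For reference, the two JDBC URL forms look like this (angle-bracket names are placeholders; 31010 and 2181 are the default ports, and /drill/<cluster_id> is the ZooKeeper path Drill registers under):

jdbc:drill:drillbit=<drillbit_host>:31010                # direct to one drillbit
jdbc:drill:zk=<zk1>:2181,<zk2>:2181/drill/<cluster_id>   # via ZooKeeper (recommended)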

Drill Explorer

Windows GUI app bundled with ODBC driver.

C:\Program Files\MapR Drill Driver\lib\DrillExplorer.exe
  • ODBC opts:
    • QueryTimeout=180
    • HandShakeTimeout=5

UDFs

User defined functions.

  • Dynamic UDFs - jars in $CLASSPATH - configure a dir for users to drop jars into (registration example below)
  • Manual UDFs - edit module.conf
  • Hive UDF jars
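
For the Dynamic UDF case, registration is done in SQL (Drill 1.9+; the jar name here is a placeholder, and the jar plus its sources jar must already be copied to the configured staging area):

CREATE FUNCTION USING JAR 'my-udfs-1.0.jar';
DROP   FUNCTION USING JAR 'my-udfs-1.0.jar';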

Query Execution

Query Parsing

  • SQL Parser

    • -> Logical Plan
      • -> Rule + CBO Cost Based Optimizer
        • -> Physical Plan
          • -> EPF Execution Plan Fragments
            • -> nodes (based on load etc)
              • -> ZooKeeper for the drillbit locations (bottleneck?)
  • Root (Foreman) -> Intermediate nodes -> Leaf nodes

    • leaf nodes read + execute EPFs
    • intermediate nodes do partial aggregations as soon as receive from leaf
    • foreman - final aggregation + return to client
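
The plans described above can be inspected directly with EXPLAIN - a quick example using the bundled sample data:

-- full physical plan
EXPLAIN PLAN FOR SELECT * FROM cp.`employee.json` LIMIT 3;

-- logical plan only
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR SELECT * FROM cp.`employee.json` LIMIT 3;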

Drill Improvements Needed

  • query scheduling
  • resource constraints
  • in-flight query recovery
  • no SQL Grants
  • no column level security, only file level
    • MapR are not going to fix this - they believe ACLs should be done at the storage layer
    • Workarounds:
      • Views - not recommended - giving Spark access later bypasses SQL-layer column security as it has to read the files directly
      • use MapR-DB as Drill backend, has ACEs on CFs
  • CSV - no header skipping support yet as of 1.1.0, see DRILL-624
  • no DESCRIBE support for JSON/CSV/Parquet/MongoDB as of 1.1.0

Connectivity

  • HTTP UI
  • JDBC driver
  • ODBC driver - not open source (Simba)

Ports

Port    Description
8047    HTTP UI
31010   JDBC / ODBC
31011   Control port
31012   Data port

HAProxy Configs

For load balancing the HTTP UI and JDBC / ODBC ports:

HariSekhon/HAProxy-configs - apache-drill.cfg

HariSekhon/HAProxy-configs - apache-drill-centralized-foreman.cfg

Storage Backends

  • Files:
    • JSON
    • CSV
    • Apache logs
  • Hive
  • HBase
  • MapR-DB
  • JDBC - RDBMS - MySQL, PostgreSQL, Oracle, SQL Server
  • AWS S3 (1.3+)
  • Kafka 0.10+ (1.12+) - json messages
  • OpenTSDB (1.12+) - lists metrics up to Integer.MAX -> USE <opentsdb_name>; SHOW TABLES;
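
Multi-source queries (mentioned under Data Support) can join across these backends in one statement. A sketch joining a hypothetical JSON file against the MongoDB zips test collection loaded later on this page (the customers.json file and its zip field are assumptions):

-- one query spanning two storage plugins
SELECT c.name, z.city
FROM dfs.tmp.`customers.json` c
JOIN mongo.test.zips z ON c.zip = z.`_id`;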

Show All Queries in all Drillbit UIs

Admin users get shutdown button in UI -> Options tab

Point pstore to HDFS / MapR-FS for UI on all nodes to show all queries

(defaults to local filesystem /opt/mapr/drill/drill-<version>/logs/profiles)

Add to drill-override.conf:

drill.exec {
  sys.store.provider.zk.blobroot: "maprfs:///apps/drill/pstore"
  # or on HDFS:
  # sys.store.provider.zk.blobroot: "hdfs:///apps/drill/pstore"
}

or use HBase which should be more performant:

sys.store.provider: {
  class: "org.apache.drill.exec.store.hbase.config.HBasePStoreProvider",
  hbase: {
    table: "drill_store",
    config: {
      "hbase.zookeeper.quorum": "<ip>,<ip>,<ip>",
      "hbase.zookeeper.property.clientPort": "2181"
    }
  }
}

Security

  • Kerberos

    • supports client encryption via SASL (Drill 1.11+)
  • Plain - uses PAM

    • does not support client encryption
  • no column-level security - views permissions are file-based ACE / ACLs

  • use views to limit rows / columns

  • hack: build a CSV read into the view to verify the user appears in the CSV, for dynamic permissioning

  • views - stored as a file in current workspace

    • only file-level permissions on .view file
    • set max.chained.user.hops = 1 (default: 3)
    • no way to restrict view creation to admins only
    • no way to prevent users creating views then granting permissions to unauthorized users (would have to lock down to no hops for both admins + users)
    • limit data set to superuser + superuser create all views + max.chained.user.hops = 1
  • cannot query Hive views, only tables (as of Drill 1.11 - early 2018)

  • Chained Impersonation - view executes as view owner using max.chained.user.hops = 3

    • avoid this, use the below instead
  • Inbound Impersonation - proxy_principals like Hadoop's proxyusers

drill-override.conf:

drill.exec.impersonation.enabled: true

Then set the policy pairs via SQL - use 'users' or 'groups' for each principal, repeating the pairs as many times as needed:

ALTER SYSTEM SET `exec.impersonation.inbound.policies`='[
  {
    proxy_principals:  { users: ["myServiceAccount"] },
    target_principals: { users: ["*"] }
  }
]';

Connect as the proxy user and impersonate the target user (-n is sqlline's username switch):

bin/sqlline -u "jdbc:drill:schema=dfs;zk=myZkCluster;impersonation_target=hari" -n "myServiceAccount" -p "$password"
Querying

Generic query pattern:

SELECT blah FROM <cluster>.<workspace>.<table_or_pattern> WHERE blah > N;

What the path components map to per storage plugin:

Plugin    Workspace   Table
dfs       sub-dir     pattern
HBase     -           -
Hive      database    table
mongo     db          collection

SELECT name, flatten(fillings) AS f FROM dfs.users.`/donuts.json` WHERE f.cal < 300;

Interpret HBase data as JSON on the fly - convert_from() function is pluggable:

SELECT d.name, count(d.fillings) FROM (SELECT convert_from(cf1.donut, 'JSON') AS d FROM hbase.user.`donuts`) GROUP BY d.name;

Install

On MapR

yum install -y mapr-drill
/opt/mapr/server/configure.sh -R -L /opt/mapr/configure-$(date '+%F_%T').log
maprcli nodes services -name drill -start -nodes <list_of_nodes>

Embedded ZK mode

sqlline -u "jdbc:drill:zk=local"

Standalone Mode with a separate ZooKeeper host

sqlline -u "jdbc:drill:zk=$zookeeper_host"

Distributed Mode with ZooKeeper Cluster

sqlline -u "jdbc:drill:[schema=<storage plugin>;]zk=<zk name>[:<port>][,<zk name2>[:<port>]... ]"

The dfs. storage plugin can be omitted (quote the URL, otherwise the semicolon breaks the zk argument); other storage plugins can still be queried explicitly, such as:

cp.`employee.json`
sqlline -u "jdbc:drill:schema=dfs;zk=local"

drill-embedded is a shortcut for sqlline -u jdbc:drill:zk=local, available from 1.0 onwards:

drill-embedded
2015-07-17 14:15:47.381 java[48990:1877987] Unable to load realm info from SCDynamicStore
Jul 17, 2015 2:15:52 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"drill baby drill"
0: jdbc:drill:zk=local>

Configure via http://localhost:8047/

!set
!set maxwidth 10000

Distributed Mode

Requires ZooKeeper.

Configure cluster-id and zk.connect:

In conf/drill-override.conf:

drill.exec:{
  cluster-id: "<mydrillcluster>",
  zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
}

In conf/drill-env.sh:

# related Drill system option (set via ALTER SYSTEM / ALTER SESSION, not here):
# planner.memory.max_query_memory_per_node
DRILL_MAX_DIRECT_MEMORY="8G" # will default to total system memory if not set
DRILL_MAX_HEAP="4G"

# DRILL_JAVA_OPTS uses the vars above for -Xmx
export DRILL_JAVA_OPTS="..."

If performance is an issue, replace the -ea flag with -Dbounds=false in DRILL_JAVA_OPTS.

drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
bin/drillbit.sh start

Finds Drill using ZooKeeper via drill-override.conf:

drill-conf

Connects to the drillbit on localhost:

drill-localhost

SELECT * FROM sys.drillbits;

output:

+-----------+------------+---------------+------------+----------+
| hostname  | user_port  | control_port  | data_port  | current  |
+-----------+------------+---------------+------------+----------+
| starfury  | 31010      | 31011         | 31012      | true     |
+-----------+------------+---------------+------------+----------+
SELECT * FROM sys.options WHERE name LIKE 'security.admin%';

output:

'security.admin.user_groups'

Sample-data shipped with Drill

cp - the ClassPath plugin points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H that you can query.

SELECT * FROM cp.`employee.json` LIMIT 3;
SELECT * FROM dfs.`/usr/local/drill/sample-data/region.parquet`;

Custom Parsing

Storage -> DFS -> update -> formats sections add:

"hari": {
  "type": "text",
  "extensions": [ "txt" ],
  "quote": "~",
  "escape": "~",
  "delimiter": "^"
},
cd /tmp
wget https://www.ars.usda.gov/SP2UserFiles/Place/12354500/Data/SR27/dnload/sr27asc.zip
unzip sr27asc.zip
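
With the format defined and the data in place, the files can be queried positionally via Drill's columns array (the same pattern the CTAS in the Conversion to Parquet section uses):

-- peek at the first two fields of the custom-delimited file
SELECT columns[0], columns[1] FROM dfs.tmp.`sr27asc/FOOD_DES.txt` LIMIT 5;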

Workspaces

Workspaces are abbreviations to access files or tables.

Storage Plugins -> DFS -> update -> workspace section add:

"hari": {
  "location": "/users/Hari/Downloads/drill-data",
  "writable": true,
  "storageformat": "parquet"
},
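
Once saved, the workspace can be used like a schema:

USE dfs.hari;
SHOW FILES;  -- lists the files in the workspace's location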

Conversion to Parquet

Create a Parquet table from this weird delimited data; the data is written to Downloads/drill-nndb/<myTable>/0_0_0.parquet.

The DFS storage plugin must be configured with a workspace set to writable => true before you can write a table there, which is Parquet by default.

CREATE TABLE dfs.hari.myTable(col1, col2)
AS SELECT columns[0], columns[1] FROM dfs.tmp.`sr27asc/FOOD_DES.txt`;

Convert JSON to CSV

Works nicely, though it requires the .json file extension; find the output file in myDir/0_0_0.csv.

USE dfs.hari;
ALTER SESSION SET `store.format`='csv';
CREATE TABLE dfs.hari.`myDir` AS SELECT * FROM dfs.hari.`my.json`;

Where multiple CSVs are generated, some CSVs have columns that others don't, which makes it harder to deal with them all together.

Storage Plugins

Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode).

Storage plugins:

MongoDB 3.0

  • read only

Load MongoDB test data set

wget -c "http://media.mongodb.org/zips.json"

Import to test.zips:

mongoimport --db test --collection zips --file zips.json

http://localhost:8047/storage -> mongo -> update -> change "enabled": true -> update

SHOW DATABASES;
SHOW TABLES IN `dfs.default`;
SHOW TABLES IN sys;
SELECT * FROM sys.drillbits;
SELECT * FROM information_schema.columns WHERE TABLE_NAME = 'myTable';

Default all queries to the mongo storage plugin using 'USE mongo'.

This doesn't seem to work though - tables still need the mongo. prefix, otherwise you get:

Jul 20, 2015 6:01:50 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'test.zips' not found
Jul 20, 2015 6:01:50 PM org.apache.calcite.runtime.CalciteException <init>
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 18: Table 'test.zips' not found
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'test.zips' not found
[Error Id: 6b2dd038-af33-4466-930b-d58f0ad8e55d on starfury:31010] (state=,code=0)
USE mongo;
USE mongo.test;
SHOW TABLES;
SELECT * FROM mongo.test.zips;
SELECT COUNT(*) FROM zips;
SELECT state,city,AVG(pop) FROM zips GROUP BY state,city;
SELECT loc FROM zips LIMIT 10;

Select into a JSON list using array index notation:

SELECT loc[0] FROM zips LIMIT 10;

ETL mongo to parquet:

CREATE TABLE dfs.nndb.mongoZips AS SELECT * from mongo.test.zips;

There is no DROP TABLE yet - DRILL-3426, targeted for 1.3+.

Access embedded JSON fields via a fully qualified table object reference:

SELECT b.myField.subField FROM dfs.down.`my.json` b LIMIT 1;

Troubleshooting

Drill Kerberos principals require the host component, eg. drill/<cluster_name>@$REALM not drill@$REALM, which results in:

IllegalStateException: empty name string

drill-env.sh - avoid:

DRILL_JAVA_LIB_PATH # use pam4j instead in drill-override.conf
MAPR_TICKETFILE_LOCATION=/tmp/mapruserticket_$UID

Drill:

SELECT * FROM dfs.`/tmp/test.dat`
ERROR: VALIDATION ERROR: From line 1, column 17: Object '/tmp/test.dat' not found within 'dfs'
SQL Query null

Fix: caused by wrong filename extension, renaming to test.csv works.

sqlline IllegalArgumentException - remove spaces in connection string!

If an older Drill version is started against stale ZooKeeper state from a higher version (eg. in Docker) it will error out with:

Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "config" (class org.apache.drill.exec.store.dfs.FileSystemConfig), not marked as ignorable (4 known properties: "enabled", "formats", "connection", "workspaces"])

https://issues.apache.org/jira/browse/DRILL-4383

Fix: delete ZK and start fresh

Ported from private Knowledge Base page 2014+