Open source distributed MPP columnar in-memory SQL query engine that runs SQL-on-Hadoop or directly on files.
- Key Points
- Connectivity
- Storage Backends
- Show All Queries in all Drillbit UIs
- Security
- Install
- Sample-data shipped with Drill
- Custom Parsing
- Workspaces
- Conversion to Parquet
- Convert JSON to CSV
- Storage Plugins
- Troubleshooting
- Forked from MapR around Drill 0.6 / 0.7
- ANSI SQL 2003 or 2011 + extensions for nested data formats
- Calcite SQL parser
- read-only as of 0.6
- no desc SQL command as of 0.7
- 0.7 introduced multicast replacement, specify nodes for AWS
- 1.1+ SQL windowing functions
- MPP - columnar execution in memory
- optimized pipelined execution
- late schema binding
  - schema changes during live query -> all operators reconfigure themselves
- query optimizer - pluggable rules + cost model at various stages of execution
  - storage plugins can expose rules to the Drill optimizer
- vectorized CPU execution (simultaneous batches) to keep all CPU pipelines busy for efficiency
- 256KB batch pipelines between nodes
- only spills to disk on memory overflow
- no query checkpointing - faster, but the user must retry the query upon any node failure
- no real scheduling
- no concurrent resource management (memory / CPU contention etc.)
  - limits query submissions but cannot calculate the resources each query uses
  - can crash the JVM with OOM
- HBase storage plugin predicate push-down - filters at HBase rather than retrieving everything and then filtering
- optimizer outputs a physical plan
- clients can give session / system options, eg. which join algorithm or what parallelization to apply (see the ALTER SESSION sketch after this list)
- can run SQL queries directly on JSON data from JSON files under a directory
  - rapidly evolving schema - auto-determines the JSON schema on the fly
  - create view <select query>; to allow SQL access to the data
- multi-source queries can combine different kinds of data sources in the same query
- schema can change over course of a query, operators designed to reconfigure themselves
- Drillbit on every server (equivalent to an Impala node)
- All drillbits are peers, same as Impala, connection node becomes the query planner
- Clients can connect directly to drillbit, but recommended to go via ZooKeeper
- node failure triggers query re-run
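A minimal sketch of the session-level tuning mentioned above, using two of Drill's standard planner options (the values here are only illustrative):
-- disable hash join for this session so the planner falls back to merge join
ALTER SESSION SET `planner.enable_hashjoin` = false;
-- cap per-node parallelization (maximum fragment width per node)
ALTER SESSION SET `planner.width.max_per_node` = 4;
-- inspect the planner options currently in effect
SELECT * FROM sys.options WHERE name LIKE 'planner.%';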
Windows GUI app bundled with ODBC driver.
C:\Program Files\MapR Drill Driver\lib\DrillExplorer.exe
- ODBC opts:
QueryTimeout=180
HandShakeTimeout=5
User defined functions:
- Dynamic UDFs - jars in $CLASSPATH - configure a dir for users to drop jars into (see the sketch below)
- Manual UDFs - edit module.conf
- Hive UDF jars
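As a sketch of the Dynamic UDF flow (Drill 1.9+): once a UDF jar (plus its sources jar) has been copied into the configured staging directory it can be registered from SQL - the jar name below is hypothetical:
CREATE FUNCTION USING JAR 'my-udfs-1.0.jar';  -- register the UDFs packaged in this (hypothetical) jar
DROP FUNCTION USING JAR 'my-udfs-1.0.jar';    -- unregister them again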
- SQL Parser -> Logical Plan -> Rules + CBO (Cost Based Optimizer) -> Physical Plan -> EPF (Execution Plan Fragments) -> nodes (based on load etc.) - plans can be inspected with EXPLAIN, see the sketch below
  - -> ZooKeeper for the drillbit locations (bottleneck?)
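To see what the optimizer produced for a given query, Drill's EXPLAIN syntax prints the physical plan, or the logical plan with WITHOUT IMPLEMENTATION - for example against the bundled sample data:
EXPLAIN PLAN FOR SELECT * FROM cp.`employee.json` LIMIT 3;                         -- physical plan
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR SELECT * FROM cp.`employee.json` LIMIT 3;  -- logical plan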
- Root (Foreman) -> Intermediate nodes -> Leaf nodes
  - leaf nodes read + execute EPFs
  - intermediate nodes do partial aggregations as soon as they receive data from the leaf nodes
  - foreman - final aggregation + return to the client
- no query scheduling, resource constraints or in-flight query recovery (as noted in the key points above)
- no SQL GRANTs
- no column-level security, only file level
  - MapR are not going to fix this - they believe ACLs should be done at the storage layer
  - workarounds:
    - views - not recommended - giving Spark access later bypasses SQL-layer column security as it has to read the files directly
    - use MapR-DB as the Drill backend, which has ACEs on column families
- CSV - no header skipping support yet as of 1.1.0, see DRILL-624
- no DESCRIBE support for JSON/CSV/Parquet/MongoDB as of 1.1.0
- HTTP UI
- JDBC driver
- ODBC driver - not open source (Simba)
| Port  | Description  |
|-------|--------------|
| 8047  | HTTP UI      |
| 31010 | JDBC / ODBC  |
| 31011 | Control port |
| 31012 | Data port    |
For load balancing the HTTP UI and JDBC / ODBC ports:
HariSekhon/HAProxy-configs - apache-drill.cfg
HariSekhon/HAProxy-configs - apache-drill-centralized-foreman.cfg
- Files:
- JSON
- CSV
- Apache logs
- Hive
- HBase
- MapR-DB
- JDBC - RDBMS - MySQL, PostgreSQL, Oracle, SQL Server
- AWS S3 (1.3+)
- Kafka 0.10+ (1.12+) - json messages
- OpenTSDB (1.12+) - lists metrics up to Integer.MAX -> use <opentsdb_name>; show tables;
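A hedged sketch of a multi-source query joining two of the backends above - all schema, table and column names here are made up:
SELECT c.name, o.total
FROM hive.sales.customers c            -- hypothetical Hive table
JOIN dfs.tmp.`orders.json` o           -- hypothetical JSON file on DFS
  ON c.id = o.customer_id
LIMIT 10;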
Admin users get a shutdown button in the UI -> Options tab.
Point the pstore at HDFS / MapR-FS for the UI on all nodes to show all queries (defaults to the local filesystem under /opt/mapr/drill/drill-<version>/logs/profiles).
Add to drill-override.conf:
drill.exec {
  sys.store.provider.zk.blobroot: "maprfs:///apps/drill/pstore"
  # or on HDFS:
  # sys.store.provider.zk.blobroot: "hdfs:///apps/drill/pstore"
}
or use HBase, which should be more performant:
sys.store.provider: {
  class: "org.apache.drill.exec.store.hbase.config.HBasePStoreProvider",
  hbase: {
    table: "drill_store",
    config: {
      "hbase.zookeeper.quorum": "<ip>,<ip>,<ip>",
      "hbase.zookeeper.property.clientPort": "2181"
    }
  }
}
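To sanity-check which pstore settings actually took effect, one option (assuming the sys.boot system table is available in your Drill version) is to query the boot-time config:
SELECT * FROM sys.boot WHERE name LIKE '%store.provider%';  -- should reflect the drill-override.conf settings above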
- Kerberos
  - supports client encryption via SASL (Drill 1.11+)
- Plain - uses PAM
  - does not support client encryption
- no column-level security - view permissions are file-based ACEs / ACLs
- use views to limit rows / columns (see the CREATE VIEW sketch after this list)
- hack: build a CSV read into a view to verify the user is listed in the CSV, for dynamic permissioning
- views - stored as a file in the current workspace
  - only file-level permissions on the .view file
  - set max.chained.user.hops = 1 (default: 3)
  - no way to restrict view creation to admins only
  - no way to prevent users creating views and then granting permissions to unauthorized users (would have to lock down to no hops for both admins + users)
  - limit the data set to a superuser + have the superuser create all views + max.chained.user.hops = 1
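A minimal sketch of the view-based row / column limiting mentioned above (workspace, file, column names and predicate are all hypothetical):
CREATE VIEW dfs.tmp.customers_limited AS
SELECT name, city                      -- expose only non-sensitive columns
FROM dfs.tmp.`customers.parquet`       -- hypothetical source file
WHERE region = 'EMEA';                 -- and only a subset of rows
-- access is then controlled only by filesystem permissions on the generated .view file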
- cannot query Hive views, only tables (as of Drill 1.11 - early 2018)
- Chained Impersonation - view executes as the view owner using max.chained.user.hops = 3
  - avoid this, use the below instead
- Inbound Impersonation - proxy_principals, like Hadoop's proxyusers
drill-override.conf
:
drill.exec.impersonation.enabled: true
ALTER SYSTEM SET `exec.impersonation.inbound.policies`='[
{
proxy_principals: { users: ["myServiceAccount"] }, # set 'users'
target_principals: { users: ["*"] } # or 'groups' for each
}, # repeat these pairs as many times as needed
]';
bin/sqlline -u "jdbc:drill:schema=dfs;zk=myZkCluster;impersonation_target=hari" -n "myServiceAccount" -p "$password"
SELECT blah FROM <cluster>.<workspace>.<table_or_pattern> WHERE blah > N;
where <table_or_pattern> can be:
- a dfs sub-dir pattern
- an HBase table
- a Hive database table
- a MongoDB collection
SELECT name, flatten(fillings) AS f FROM dfs.users.`/donuts.json` WHERE f.cal < 300;
Interpret HBase data as JSON on the fly - the convert_from() function is pluggable:
SELECT d.name, COUNT(d.fillings) FROM ( SELECT convert_from(cf1.donut, 'JSON') AS d FROM hbase.user.`donuts` ) GROUP BY d.name;
yum install -y mapr-drill
/opt/mapr/server/configure.sh -R -L /opt/mapr/configure-$(date '+%F_%T').log
maprcli nodes services -name drill -start -nodes <list_of_nodes>
sqlline -u "jdbc:drill:zk=local"
Standalone Mode with a separate ZooKeeper host
sqlline -u "jdbc:drill:zk=$zookeeper_host"
sqlline -u "jdbc:drill:[schema=<storage plugin>;]zk=<zk name>[:<port>][,<zk name2>[:<port>]... ]"
Can omit specifying the dfs. storage plugin (must quote the URL, otherwise the semicolon breaks the zk setting); you can still specify other storage plugins to query though, such as cp.`employee.json`:
sqlline -u "jdbc:drill:schema=dfs;zk=local"
Shortcut for sqlline -u "jdbc:drill:zk=local", available from 1.0 onwards:
drill-embedded
2015-07-17 14:15:47.381 java[48990:1877987] Unable to load realm info from SCDynamicStore
Jul 17, 2015 2:15:52 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"drill baby drill"
0: jdbc:drill:zk=local>
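A quick smoke test from the sqlline prompt above - sys.version is a built-in system table:
SELECT * FROM sys.version;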
Configure via http://localhost:8047/
!set
!set maxwidth 10000
Requires ZooKeeper.
Configure cluster-id and zk.connect in conf/drill-override.conf:
drill.exec: {
  cluster-id: "<mydrillcluster>",
  zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
}
In conf/drill-env.sh:
# (the related per-query limit planner.memory.max_query_memory_per_node is a system/session option, not set here)
DRILL_MAX_DIRECT_MEMORY="8G"  # will default to total system memory if not set
DRILL_MAX_HEAP="4G"
export DRILL_JAVA_OPTS="..."  # uses the vars above for -Xmx etc.
If performance is an issue, replace the -ea flag with -Dbounds=false in DRILL_JAVA_OPTS above.
drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
bin/drillbit.sh start
Finds Drill using ZooKeeper via drill-override.conf
:
drill-conf
drill-localhost
SELECT * FROM sys.drillbits;
output:
+-----------+------------+---------------+------------+----------+
| hostname | user_port | control_port | data_port | current |
+-----------+------------+---------------+------------+----------+
| starfury | 31010 | 31011 | 31012 | true |
+-----------+------------+---------------+------------+----------+
SELECT * FROM sys.options WHERE name LIKE 'security.admin%';
output:
'security.admin.user_groups'
cp - the ClassPath plugin points to a JAR file in the Drill classpath that contains the Transaction Processing Performance Council (TPC) benchmark schema TPC-H, which you can query:
SELECT * FROM cp.`employee.json` LIMIT 3;
SELECT * FROM dfs.`/usr/local/drill/sample-data/region.parquet`;
Storage -> DFS -> update -> formats section, add:
"hari": {
"type": "text",
"extensions": [ "txt" ],
"quote": "~",
"escape": "~",
"delimiter": "^"
},
cd /tmp
wget https://www.ars.usda.gov/SP2UserFiles/Place/12354500/Data/SR27/dnload/sr27asc.zip
unzip sr27asc.zip
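With the custom "hari" txt format from the previous section in place (and assuming the default dfs.tmp workspace points at /tmp), the downloaded delimited files can be queried directly - Drill exposes the fields as the columns[] array:
SELECT columns[0], columns[1] FROM dfs.tmp.`sr27asc/FOOD_DES.txt` LIMIT 5;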
Workspaces are shorthand aliases for accessing files or tables.
Storage Plugins -> DFS -> update -> workspaces section, add:
"hari": {
"location": "/users/Hari/Downloads/drill-data",
"writable": true,
"storageformat": "parquet"
},
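A quick check that the new workspace resolves (SHOW FILES is standard Drill syntax):
USE dfs.hari;
SHOW FILES;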
Create a Parquet table from this weird delimited data - this writes the data to Downloads/drill-nndb/<myTable>/0_0_0.parquet.
You need to have configured the DFS storage plugin with a workspace set to writable => true before you can write a table there (Parquet is the default format):
CREATE TABLE dfs.hari.myTable(col1, col2)
AS SELECT columns[0], columns[1] FROM dfs.tmp.`sr27asc/FOOD_DES.txt`;
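The new Parquet table can then be queried straight back, e.g.:
SELECT * FROM dfs.hari.`myTable` LIMIT 5;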
Works nicely, but requires the .json file extension; find the output file in myDir/0_0_0.csv.
USE dfs.hari;
ALTER SESSION SET `store.format`='csv';
CREATE TABLE dfs.hari.`myDir` AS SELECT * FROM dfs.hari.`my.json`;
Where multiple CSVs are generated, some CSVs have columns that others don't, which makes it harder to deal with them all together.
Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode).
Storage plugins:
- Hive (metadata only, does not use Hive execution engine, uses data directly)
- HBase
- Cassandra
- Kudu
- MongoDB
MongoDB 3.0
- read only
Load MongoDB test data set
wget -c "http://media.mongodb.org/zips.json"
Import to test.zips:
mongoimport "zips.json"
http://localhost:8047/storage -> mongo -> update -> change "enabled": true -> update
SHOW DATABASES;
SHOW TABLES IN `dfs.default`;
SHOW TABLES IN sys;
SELECT * FROM sys.drillbits;
SELECT * FROM information_schema.columns WHERE TABLE_NAME = 'myTable';
Defaulting all queries to the mongo storage plugin using 'use mongo' doesn't seem to work - still need to prefix tables with mongo, otherwise you get:
Jul 20, 2015 6:01:50 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'test.zips' not found
Jul 20, 2015 6:01:50 PM org.apache.calcite.runtime.CalciteException <init>
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 15 to line 1, column 18: Table 'test.zips' not found
Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'test.zips' not found
[Error Id: 6b2dd038-af33-4466-930b-d58f0ad8e55d on starfury:31010] (state=,code=0)
USE mongo;
USE mongo.test;
SHOW TABLES;
SELECT * FROM mongo.test.zips;
SELECT COUNT(*) FROM zips;
SELECT state,city,AVG(pop) FROM zips GROUP BY state,city;
SELECT loc FROM zips LIMIT 10;
Select into a JSON list using array index notation:
SELECT loc[0] FROM zips LIMIT 10;
ETL mongo to parquet:
CREATE TABLE dfs.nndb.mongoZips AS SELECT * from mongo.test.zips;
There is no DROP TABLE yet - DRILL-3426, targeted for 1.3+.
Access embedded JSON fields via fully qualified table object reference
SELECT b.myField.subField FROM dfs.down.`my.json` b LIMIT 1;
Drill Kerberos principals require a host component, eg. drill/<cluster_name>@$REALM, not drill@$REALM, which otherwise results in:
IllegalStateException empty name string
drill-env.sh
- avoid:
DRILL_JAVA_LIB_PATH # use pam4j instead in drill-override.conf
MAPR_TICKETFILE_LOCATION=/tmp/mapruserticket_$UID
Drill:
SELECT * FROM dfs.`/tmp/test.dat`
ERROR: VALIDATION ERROR: From line 1, column 17: Object '/tmp/test.dat' not found within 'dfs'
SQL Query null
Fix: caused by the wrong filename extension - renaming to test.csv works.
sqlline IllegalArgumentException - remove spaces in the connection string!
If an older Drill version is started against stale ZooKeeper state from a higher version (eg. via Docker), it will error out with:
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "config" (class org.apache.drill.exec.store.dfs.FileSystemConfig), not marked as ignorable (4 known properties: "enabled", "formats", "connection", "workspaces"])
https://issues.apache.org/jira/browse/DRILL-4383
Fix: delete the stale ZooKeeper data and start fresh.
Ported from private Knowledge Base page 2014+