Druid Datasource Options
Name | Description | Default (if any) | Override in SQLContext? |
---|---|---|---|
sourceDataframe | The DataFrame with the raw data. | (required) | no |
timeDimensionColumn | The column that represents time in the Druid Index. | (required) | no |
druidHost | Zookeeper ensemble used by the Druid servers. | (required) | no |
druidDatasource | The name of the corresponding DataSource in Druid for this Spark DataSource. | (required) | no |
starSchema | The details of the Star Schema; see [Defining a StarSchema](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a-Star-Schema) for details. | (required) | no |
columnMapping | Mapping of names from the raw schema to Druid. | none | no |
functionalDependencies | **future use** | none | no |
pushHLLTODruid | Push the HyperLogLog aggregator to Druid. **future use** | true | no |
streamDruidQueryResults | Controls whether query results from Druid are streamed into the Spark operator pipeline. **currently cannot be changed** | true | no |
loadMetadataFromAllSegments | When loading Druid DataSource metadata, should the query interval be the entire DataSource interval, or is the latest segment enough? The default is to load from the latest segment only; loading from all segments can be very slow. **currently cannot be changed** | false | no |
zkSessionTimeoutMilliSecs | Zookeeper connection timeout. | 30000 | no |
zkEnableCompression | Enable compression on the Zookeeper connection. | true | no |
zkDruidPath | Root path in Zookeeper for Druid. | /druid | no |
queryHistoricalServers | A query execution optimization that talks directly to the Historical servers and performs a post aggregation across the Historical outputs in Spark. **only takes effect if the cost model is off** | false | yes |
maxResultCardinality | If the result cardinality of a query exceeds this value, the query is not converted to a Druid query. **future use** | | no |
numSegmentsPerHistoricalQuery | The number of segments queried in one DruidQuery to a Historical server. **only takes effect if the cost model is off** | Int.MaxInt | yes |
zkQualifyDiscoveryNames | When connecting to a Druid 0.9 cluster, set this to true. | false | no |
numProcessingThreadsPerHistorical | Number of processing threads per Druid Historical daemon. | equal to spark.num.cores | no |
useSmile | Use the Smile binary JSON format for communication with Druid. | true | yes |
nonAggQueryHandling | Controls pushing of Druid Select queries for this DataSource. Set to push_filters to push when there is at least one filter expression; set to push_project_and_filters to push even for simple scans. | push_none | no |
queryGranularity | Used to estimate index cardinality for any time period. Valid values are none, all, second, minute, hour, day, etc., or a custom PeriodGranularity; these match the granularities available in Druid. Set it to the query granularity of your index. | none | no |
allowTopN | Druid TopN queries are approximate in their aggregation and ranking; this flag controls whether TopN query rewrites should happen. | false | yes |
topNMaxThreshold | If Druid TopN queries are enabled, this property controls the maximum limit for which such rewrites are done. For limits beyond this value the GroupBy query is executed. | 100000 | yes |
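These options are supplied in the `OPTIONS` clause when registering the DataSource. A minimal sketch, assuming a hypothetical TPCH-style base DataFrame already registered with the SQLContext and a local Druid cluster; the table name, column names, option values, and the star-schema JSON below are illustrative, not taken from this page:

```sql
CREATE TEMPORARY TABLE orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase", -- the DataFrame with the raw data
  timeDimensionColumn "l_shipdate",                -- the time column in the Druid index
  druidDatasource "tpch",                          -- name of the DataSource in Druid
  druidHost "localhost",                           -- Zookeeper ensemble used by Druid
  zkQualifyDiscoveryNames "true",                  -- needed when connecting to Druid 0.9
  queryGranularity "day",                          -- match the query granularity of the index
  starSchema '{ "factTable" : "orderLineItemPartSupplier", "relations" : [] }'
);
```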
- "Override in SQLContext?" means that a setting in the SQLContext (these have the prefix `spark.sparklinedata.druid.option`) takes precedence over the value in the Druid DataSource. This enables runtime behavior changes, for example whether or not to use the Smile protocol.
- `queryHistoricalServers` and `numSegmentsPerHistoricalQuery` are ignored if the cost model is on. If the cost model is off, all queries on this DataSource are executed using these settings; of course, if a query cannot be pushed to a Historical (for example, a query with a Limit), these settings are ignored for that query. Both options can be overridden by setting their values in the SQLContext; when a query is executed, the current settings in the SQLContext are used. This is how you can try different query execution options on a per-query basis, as shown in the sketch below.
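For example, to experiment with the two Historical-query options on a per-query basis, set them in the SQLContext before running the query. A sketch, assuming the full key is the documented prefix followed by the option name (my assumption about the exact key spelling), and reusing the hypothetical table from the earlier example:

```sql
-- assumed key spelling: the documented prefix plus the option name
SET spark.sparklinedata.druid.option.queryHistoricalServers=true;
SET spark.sparklinedata.druid.option.numSegmentsPerHistoricalQuery=3;

-- with the cost model off, this query now goes directly to the
-- Historical servers, at most 3 segments per Druid query
SELECT l_returnflag, COUNT(*)
FROM orderLineItemPartSupplier
GROUP BY l_returnflag;
```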