
Druid Datasource Options

| Name | Description | Default (if any) | Override in SQLContext? |
|------|-------------|------------------|-------------------------|
| sourceDataframe | The DataFrame with the raw data | (required) | no |
| timeDimensionColumn | The column that represents time in the Druid Index | (required) | no |
| druidHost | Zookeeper ensemble used by the Druid servers | (required) | no |
| druidDatasource | The name of the corresponding DataSource in Druid for this Spark DataSource | (required) | no |
| starSchema | The details of the Star Schema; see [Defining a StarSchema](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a-Star-Schema) for details | (required) | no |
| columnMapping | Mapping of names from the raw schema to Druid | none | no |
| functionalDependencies | **future use** | none | no |
| pushHLLTODruid | Push the HyperLogLog aggregator to Druid. **future use** | true | no |
| streamDruidQueryResults | Controls whether Query results from Druid are streamed into the Spark Operator pipeline. **currently cannot be changed** | true | no |
| loadMetadataFromAllSegments | When loading Druid DataSource metadata, should the query interval be the entire dataSource interval, or is the latest segment enough? The default is to load from the latest segment; loading from all segments can be very slow. **currently cannot be changed** | false | no |
| zkSessionTimeoutMilliSecs | Zookeeper connection timeout | 30000 | no |
| zkEnableCompression | Enable compression on the Zookeeper connection | true | no |
| zkDruidPath | Root path in Zookeeper for Druid | /druid | no |
| queryHistoricalServers | A Query Execution Optimization that talks directly to the Historical servers and does a post aggregation across the Historical outputs in Spark. **only takes effect if the cost model is off** | false | yes |
| maxResultCardinality | If the result cardinality of a Query exceeds this value, the Query is not converted to a Druid Query. **future use** | | no |
| numSegmentsPerHistoricalQuery | The number of segments queried in one DruidQuery to a Historical server. **only takes effect if the cost model is off** | Int.MaxInt | yes |
| zkQualifyDiscoveryNames | When connecting to a Druid 0.9 cluster, set this to true | false | no |
| numProcessingThreadsPerHistorical | Number of processing threads per Druid Historical daemon | equal to spark.num.cores | no |
| useSmile | Use the Smile binary JSON format for communication with Druid | true | yes |
| nonAggQueryHandling | Allow Druid Select Queries on the DataSource. Set to push_filters to push when there is at least one filter expression; set to push_project_and_filters to push even for simple scans | push_none | no |
| queryGranularity | Used to estimate index cardinality for any timePeriod. Valid values are none, all, second, minute, hour, day, etc., or a custom PeriodGranularity; these match the granularities available in Druid. Set it to the Query Granularity of your index | none | no |
| allowTopN | Druid TopN queries are approximate in their aggregation and ranking; this flag controls whether TopN query rewrites should happen | false | yes |
| topNMaxThreshold | If Druid TopN queries are enabled, this property controls the maximum limit for which such rewrites are done. For limits beyond this value the GroupBy query is executed | 100000 | yes |
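
As a minimal sketch of how these options are typically supplied, the example below registers a Druid DataSource over a hypothetical base table `salesBase` with a hypothetical time column `saleDate` and Druid datasource `sales`; the option names come from the table above, while the provider name, table names, and option values shown are illustrative assumptions, not a prescribed setup.

```scala
// Sketch only: register a Spark DataSource backed by a Druid index.
// "salesBase", "saleDate", "sales", and all option values are hypothetical.
sqlContext.sql("""
  CREATE TEMPORARY TABLE sales
  USING org.sparklinedata.druid
  OPTIONS (
    sourceDataframe "salesBase",
    timeDimensionColumn "saleDate",
    druidDatasource "sales",
    druidHost "localhost",
    zkQualifyDiscoveryNames "true",
    queryGranularity "day",
    starSchema '{"factTable" : "sales", "relations" : []}'
  )
""")
```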
* **Override in SQLContext** means that the setting in the SQLContext (these have the prefix `spark.sparklinedata.druid.option`) takes precedence over the value in the Druid DataSource definition. This enables runtime behavior changes, for example whether or not to use the Smile protocol.
* `queryHistoricalServers` and `numSegmentsPerHistoricalQuery` are ignored if the cost model is on. If the cost model is off, all queries on this DataSource are executed using these settings. Of course, if a Query cannot be pushed to the Historical servers (for example, queries with a Limit), these settings are ignored for that Query. `queryHistoricalServers` and `numSegmentsPerHistoricalQuery` can be overridden by setting the values in the SQLContext; when a Query is executed, the current settings in the SQLContext are used. This is how you can try different Query execution options on a per-Query basis (see the sketch below).
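
The sketch below illustrates the per-Query override mechanism described above. It assumes the SQLContext property key is the `spark.sparklinedata.druid.option` prefix followed by the option name, and reuses the hypothetical `sales` table from the earlier example.

```scala
// Sketch only: override overridable options for subsequent queries by setting
// the corresponding SQLContext properties (prefix spark.sparklinedata.druid.option).
sqlContext.setConf("spark.sparklinedata.druid.option.useSmile", "false")
sqlContext.setConf("spark.sparklinedata.druid.option.queryHistoricalServers", "true")

// Queries issued after this point pick up the current SQLContext settings.
sqlContext.sql("SELECT COUNT(*) FROM sales").show()
```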