Approximate Count and Spatial Queries

We can push approximate count and spatial filters to Druid, if the index has the relevant HyperLogLog, Sketch or Spatial Index on the column.

Approximate Count Queries

Approximate Aggregations are pushed to Druid if the pushHLLTODruid option is true on the Druid DataSource(by default this is true). The approx_count_distinct is translated to a hyperUnique aggregation in Druid; whereas DataSketchhive udfs are pushed down as a ThetaSketch aggregator to druid.

When defining a Druid DataSource define DruidRelationColumnInfos to describe the association of source columns to the relevant HyperLogLog, Sketch or Spatial Index. For example consider a ZipCodes DataSet where each row captures the recording of presence(no other metric) at a given time, location(longitude, latitude) and place(state, county, city, zipcode). We build a Druid Index with a HLL and Sketch for the City column(i.e. the city dimension is replaced by an approximate sketch data-structure of it) and a Spatial Index based on the latitude/longitude fields. The Field definition will be:

[  
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"city",
      "hllMetric":"unique_city",
      "sketchMetric":"city_sketch"
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"latitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":0,
         "minValue":-90.0,
         "maxValue":90.0
      }
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"longitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":1,
         "minValue":-180.0,
         "maxValue":180.0
      }
   }
]

The entire DDL for the Druid DataSource is:

CREATE TABLE if not exists zipCodes
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "default.zipCodesBase",
      timeDimensionColumn "record_date",
      druidDatasource "zipCodes",
      druidHost 'localhost',
      zkQualifyDiscoveryNames "true",
      numProcessingThreadsPerHistorical '1',
      nonAggregateQueryHandling "push_project_and_filters",
      columnInfos '[ {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "city",   "hllMetric" : "unique_city",   "sketchMetric" : "city_sketch" }, {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "latitude",   "spatialIndex" : {     "druidColumn" : "coordinates",     "spatialPosition" : 0,     "minValue" : -90.0,     "maxValue" : 90.0   } }, {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "longitude",   "spatialIndex" : {     "druidColumn" : "coordinates",     "spatialPosition" : 1,     "minValue" : -180.0,     "maxValue" : 180.0   } } ]',
      allowTopNRewrite "true")

Approx_Count_Distinct Examples

Now we can write queries that compute the approximate count for cities. Here are couple of examples:

select approx_count_distinct(city)  
from zipCodes;

select approx_count_distinct(city)
from zipCodesFull 
where substring(state,1,1) = 'N';

These get executed as hyperunique aggregations on Druid. For the second query the Druid Query is:

{  
   "jsonClass":"TimeSeriesQuerySpec",
   "queryType":"timeseries",
   "dataSource":"zipCodesAll",
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "granularity":"all",
   "filter":{  
      "jsonClass":"JavascriptFilterSpec",
      "type":"javascript",
      "dimension":"state",
      "function":"function (state) {\n            \n            \n              \n\n            return(((state).substr(0, 1))  ==  (\"N\"));\n            }"
   },
   "aggregations":[  
      {  
         "type":"hyperUnique",
         "name":"alias-1",
         "fieldName":"unique_city"
      }
   ],
   "context":{  
      "queryId":"query-1091728133402033"
   }
}

DataSketch Queries

Coming Soon

Spatial Queries

You can use Druid Spatial Indexing to build a Spatial Index of 1 or more dimension columns. You then have to describe in the DDL the source column association to the Spatial Index using DruidRelationColumnInfos. For the ZipCodes dataset containing latitude/longitude information for each fact, we describe this so:

[  
   ...,
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"latitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":0,
         "minValue":-90.0,
         "maxValue":90.0
      }
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"longitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":1,
         "minValue":-180.0,
         "maxValue":180.0
      }
   }
]

The complete DDL for the ZipCode dataset was given in the Approximate Count Queries section.

You can now write Select and GroupBy queries with filters on the latitude/longitude that get pushed as Druid Spatial Filters with Rectangular Bounds. For example the following query:

select latitude, longitude  
from zipCodes 
where latitude > 42.5 and longitude is not null

generates the following Druid Query:

{  
   "jsonClass":"SelectSpecWithIntervals",
   "queryType":"select",
   "dataSource":"zipCodes",
   "dimensions":[  
      "coordinates",
      "coordinates"
   ],
   "metrics":[  
      "count"
   ],
   "filter":{  
      "jsonClass":"SpatialFilterSpec",
      "dimension":"coordinates",
      "bound":{  
         "jsonClass":"RectangularBound",
         "minCoords":[  
            42.5,
            -180.0
         ],
         "maxCoords":[  
            90.0,
            180.0
         ],
         "type":"rectangular"
      },
      "type":"spatial"
   },
   "pagingSpec":{  
      "pagingIdentifiers":{  

      },
      "threshold":10000
   },
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "descending":false,
   "granularity":"all",
   "context":{  
      "queryId":"query-1154108458312940"
   }
}

And the following GroupBy query:

select approx_count_distinct(city)  
from zipCodes 
where latitude > 0 and longitude is not null 
      and latitude < 18 or (longitude > -80 and longitude < 10)

generates the following Druid Query:

{  
   "jsonClass":"TimeSeriesQuerySpec",
   "queryType":"timeseries",
   "dataSource":"zipCodes",
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "granularity":"all",
   "filter":{  
      "jsonClass":"LogicalFilterSpec",
      "type":"or",
      "fields":[  
         {  
            "jsonClass":"LogicalFilterSpec",
            "type":"and",
            "fields":[  
               {  
                  "jsonClass":"SpatialFilterSpec",
                  "dimension":"coordinates",
                  "bound":{  
                     "jsonClass":"RectangularBound",
                     "minCoords":[  
                        4.9E-324,
                        -180.0
                     ],
                     "maxCoords":[  
                        18.0,
                        180.0
                     ],
                     "type":"rectangular"
                  },
                  "type":"spatial"
               }
            ]
         },
         {  
            "jsonClass":"LogicalFilterSpec",
            "type":"and",
            "fields":[  
               {  
                  "jsonClass":"SpatialFilterSpec",
                  "dimension":"coordinates",
                  "bound":{  
                     "jsonClass":"RectangularBound",
                     "minCoords":[  
                        -90.0,
                        -80.0
                     ],
                     "maxCoords":[  
                        90.0,
                        10.0
                     ],
                     "type":"rectangular"
                  },
                  "type":"spatial"
               }
            ]
         }
      ]
   },
   "aggregations":[  
      {  
         "type":"hyperUnique",
         "name":"alias-1",
         "fieldName":"unique_city"
      }
   ],
   "context":{  
      "queryId":"query-1154332234832112"
   }
}

Note how where possible multiple SQL predicates on spatial dimensions are combined into a single Druid Spatial Filter

Overview
Quick Start
- Installing and Setup Druid
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
Dev. Guide
Reference Architectures
- Accelerating existing SQL Datasets
Releases
Cluster Spinup Tool
TPCH Benchmark
- Generating Denormalized TPCH Dataset
- Build TPCH Index for Benchmark

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Approximate Count and Spatial Queries

Approximate Count Queries

Approx_Count_Distinct Examples

DataSketch Queries

Spatial Queries

Clone this wiki locally