Skip to content
This repository has been archived by the owner on Mar 30, 2021. It is now read-only.

Approximate Count and Spatial Queries

hbutani edited this page Oct 14, 2016 · 2 revisions

We can push approximate count and spatial filters to Druid, if the index has the relevant HyperLogLog, Sketch or Spatial Index on the column.

Approximate Count Queries

Approximate Aggregations are pushed to Druid if the pushHLLTODruid option is true on the Druid DataSource(by default this is true). The approx_count_distinct is translated to a hyperUnique aggregation in Druid; whereas DataSketchhive udfs are pushed down as a ThetaSketch aggregator to druid.

When defining a Druid DataSource define DruidRelationColumnInfos to describe the association of source columns to the relevant HyperLogLog, Sketch or Spatial Index. For example consider a ZipCodes DataSet where each row captures the recording of presence(no other metric) at a given time, location(longitude, latitude) and place(state, county, city, zipcode). We build a Druid Index with a HLL and Sketch for the City column(i.e. the city dimension is replaced by an approximate sketch data-structure of it) and a Spatial Index based on the latitude/longitude fields. The Field definition will be:

[  
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"city",
      "hllMetric":"unique_city",
      "sketchMetric":"city_sketch"
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"latitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":0,
         "minValue":-90.0,
         "maxValue":90.0
      }
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"longitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":1,
         "minValue":-180.0,
         "maxValue":180.0
      }
   }
]

The entire DDL for the Druid DataSource is:

CREATE TABLE if not exists zipCodes
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "default.zipCodesBase",
      timeDimensionColumn "record_date",
      druidDatasource "zipCodes",
      druidHost 'localhost',
      zkQualifyDiscoveryNames "true",
      numProcessingThreadsPerHistorical '1',
      nonAggregateQueryHandling "push_project_and_filters",
      columnInfos '[ {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "city",   "hllMetric" : "unique_city",   "sketchMetric" : "city_sketch" }, {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "latitude",   "spatialIndex" : {     "druidColumn" : "coordinates",     "spatialPosition" : 0,     "minValue" : -90.0,     "maxValue" : 90.0   } }, {   "jsonClass" : "DruidRelationColumnInfo",   "column" : "longitude",   "spatialIndex" : {     "druidColumn" : "coordinates",     "spatialPosition" : 1,     "minValue" : -180.0,     "maxValue" : 180.0   } } ]',
      allowTopNRewrite "true")

Approx_Count_Distinct Examples

Now we can write queries that compute the approximate count for cities. Here are couple of examples:

select approx_count_distinct(city)  
from zipCodes;

select approx_count_distinct(city)
from zipCodesFull 
where substring(state,1,1) = 'N';

These get executed as hyperunique aggregations on Druid. For the second query the Druid Query is:

{  
   "jsonClass":"TimeSeriesQuerySpec",
   "queryType":"timeseries",
   "dataSource":"zipCodesAll",
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "granularity":"all",
   "filter":{  
      "jsonClass":"JavascriptFilterSpec",
      "type":"javascript",
      "dimension":"state",
      "function":"function (state) {\n            \n            \n              \n\n            return(((state).substr(0, 1))  ==  (\"N\"));\n            }"
   },
   "aggregations":[  
      {  
         "type":"hyperUnique",
         "name":"alias-1",
         "fieldName":"unique_city"
      }
   ],
   "context":{  
      "queryId":"query-1091728133402033"
   }
}

DataSketch Queries

Coming Soon

Spatial Queries

You can use Druid Spatial Indexing to build a Spatial Index of 1 or more dimension columns. You then have to describe in the DDL the source column association to the Spatial Index using DruidRelationColumnInfos. For the ZipCodes dataset containing latitude/longitude information for each fact, we describe this so:

[  
   ...,
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"latitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":0,
         "minValue":-90.0,
         "maxValue":90.0
      }
   },
   {  
      "jsonClass":"DruidRelationColumnInfo",
      "column":"longitude",
      "spatialIndex":{  
         "druidColumn":"coordinates",
         "spatialPosition":1,
         "minValue":-180.0,
         "maxValue":180.0
      }
   }
]

The complete DDL for the ZipCode dataset was given in the Approximate Count Queries section.

You can now write Select and GroupBy queries with filters on the latitude/longitude that get pushed as Druid Spatial Filters with Rectangular Bounds. For example the following query:

select latitude, longitude  
from zipCodes 
where latitude > 42.5 and longitude is not null 

generates the following Druid Query:

{  
   "jsonClass":"SelectSpecWithIntervals",
   "queryType":"select",
   "dataSource":"zipCodes",
   "dimensions":[  
      "coordinates",
      "coordinates"
   ],
   "metrics":[  
      "count"
   ],
   "filter":{  
      "jsonClass":"SpatialFilterSpec",
      "dimension":"coordinates",
      "bound":{  
         "jsonClass":"RectangularBound",
         "minCoords":[  
            42.5,
            -180.0
         ],
         "maxCoords":[  
            90.0,
            180.0
         ],
         "type":"rectangular"
      },
      "type":"spatial"
   },
   "pagingSpec":{  
      "pagingIdentifiers":{  

      },
      "threshold":10000
   },
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "descending":false,
   "granularity":"all",
   "context":{  
      "queryId":"query-1154108458312940"
   }
}

And the following GroupBy query:

select approx_count_distinct(city)  
from zipCodes 
where latitude > 0 and longitude is not null 
      and latitude < 18 or (longitude > -80 and longitude < 10)

generates the following Druid Query:

{  
   "jsonClass":"TimeSeriesQuerySpec",
   "queryType":"timeseries",
   "dataSource":"zipCodes",
   "intervals":[  
      "2016-01-01T00:00:00.000Z/2016-01-01T00:00:01.000Z"
   ],
   "granularity":"all",
   "filter":{  
      "jsonClass":"LogicalFilterSpec",
      "type":"or",
      "fields":[  
         {  
            "jsonClass":"LogicalFilterSpec",
            "type":"and",
            "fields":[  
               {  
                  "jsonClass":"SpatialFilterSpec",
                  "dimension":"coordinates",
                  "bound":{  
                     "jsonClass":"RectangularBound",
                     "minCoords":[  
                        4.9E-324,
                        -180.0
                     ],
                     "maxCoords":[  
                        18.0,
                        180.0
                     ],
                     "type":"rectangular"
                  },
                  "type":"spatial"
               }
            ]
         },
         {  
            "jsonClass":"LogicalFilterSpec",
            "type":"and",
            "fields":[  
               {  
                  "jsonClass":"SpatialFilterSpec",
                  "dimension":"coordinates",
                  "bound":{  
                     "jsonClass":"RectangularBound",
                     "minCoords":[  
                        -90.0,
                        -80.0
                     ],
                     "maxCoords":[  
                        90.0,
                        10.0
                     ],
                     "type":"rectangular"
                  },
                  "type":"spatial"
               }
            ]
         }
      ]
   },
   "aggregations":[  
      {  
         "type":"hyperUnique",
         "name":"alias-1",
         "fieldName":"unique_city"
      }
   ],
   "context":{  
      "queryId":"query-1154332234832112"
   }
}

Note how where possible multiple SQL predicates on spatial dimensions are combined into a single Druid Spatial Filter

Clone this wiki locally