Skip to content

Latest commit

 

History

History
35 lines (22 loc) · 2.92 KB

07e-geo_ip_matching.asciidoc

File metadata and controls

35 lines (22 loc) · 2.92 KB

Geo-IP Matching

You can learn a lot about your site’s audience in aggregate by mapping IP addresses to geolocation. Not just in itself, but joined against other datasets, like census data, store locations, weather and time. [1]

Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA)[2]. Out of the box, its columns are beg_ip, end_ip, location_id, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges.

This arrangement caters to range queries in a relational database, but isn’t suitable for our needs. A single IP-geo block can span thousands of addresses.

To get the right locality, take each range and break it at some block level. Instead of having 1.2.3.4 to 1.2.5.6 on one line, let’s use the first three quads (first 24 bits) and emit rows for 1.2.3.4 to 1.2.3.255, 1.2.4.0 to 1.2.4.255, and 1.2.5.0 to 1.2.5.6. This lets us use the first segment as the partition key, and the full ip address as the sort key.

    lines            bytes  description                  file
15_288_766   1_094_541_688  24-bit partition key         maxmind-geolite_city-20121002.tsv
 2_288_690     183_223_435  16-bit partition key         maxmind-geolite_city-20121002-16.tsv
 2_256_627      75_729_432  original (not denormalized)  GeoLiteCity-Blocks.csv

Range Queries

////Gently introduce the concept. "So, here’s what range queries are all about, in a nutshell…​" Amy////

This is a generally-applicable approach for doing range queries.

  • Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.

  • Whereever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.

  • Emit a compound key of [interval, join_handle, beg, end], where

    • interval is

    • join_handle identifies the originating table, so that records are grouped for a join (this is what ensures If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder: [interval, join_handle, beg_suffix, end_suffix].

  • Use the

In the geodata section, the "quadtile" scheme is (if you bend your brain right) something of an extension on this idea — instead of splitting ranges on regular intervals, we’ll split regions on a regular grid scheme.


1. These databases only impute a coarse-grained estimate of each visitor’s location — they hold no direct information about the persom. Please consult your priest/rabbi/spirit guide/grandmom or other appropriate moral compass before diving too deep into the world of unmasking your site’s guests.
2. For serious use, there are professional-grade datasets from Maxmind, Quova, Digital Element among others.