The following is a slightly contrived example of how you can use this extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions.
For a complete list of all supported functions, please have a look at the function reference.
Let's start by loading the spatial extension and the parquet extension so we can import the NYC taxi ride data in parquet format, and the accompanying taxi zone data from a shapefile, using the spatial ST_Read
gdal-based table function. We then create a table for the rides and a table for the zones. Note that ST_Read
produces a table with a wkb_geometry
column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY
type using the ST_GeomFromWKB
function.
LOAD spatial;
LOAD parquet;
CREATE TABLE rides AS SELECT *
FROM './test/data/nyc_taxi/yellow_tripdata_2010-01-limit1mil.parquet';
-- Load the NYC taxi zone data from a shapefile using the gdal-based st_read function
CREATE TABLE zones AS SELECT zone, LocationId, borough, geom
FROM st_read('./test/data/nyc_taxi/taxi_zones/taxi_zones.shp');
SELECT * FROM rides LIMIT 10;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VTS | 2010-01-01 00:00:17 | 2010-01-01 00:00:17 | 3 | 0.0 | -73.87105699999998 | 40.773522 | 1 | -73.871048 | 40.773545 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 | |
VTS | 2010-01-01 00:00:20 | 2010-01-01 00:00:20 | 1 | 0.05 | -73.97512999999998 | 40.789973 | 1 | -73.97498799999998 | 40.790598 | CAS | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 | |
CMT | 2010-01-01 00:00:23 | 2010-01-01 00:00:25 | 1 | 0.0 | -73.999431 | 40.71216 | 1 | 0 | -73.99915799999998 | 40.712421 | No | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
CMT | 2010-01-01 00:00:33 | 2010-01-01 00:00:55 | 1 | 0.0 | -73.97721699999998 | 40.749633 | 1 | 0 | -73.97732899999998 | 40.749629 | Cas | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
VTS | 2010-01-01 00:01:00 | 2010-01-01 00:01:00 | 1 | 0.0 | -73.942313 | 40.784332 | 1 | -73.942313 | 40.784332 | Cre | 10.0 | 0.0 | 0.5 | 2.0 | 0.0 | 12.5 | |
VTS | 2010-01-01 00:01:06 | 2010-01-01 00:01:06 | 2 | 0.38 | -73.97463 | 40.756687 | 1 | -73.979872 | 40.759143 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:07 | 2010-01-01 00:01:07 | 2 | 0.23 | -73.987358 | 40.718475 | 1 | -73.98518 | 40.720468 | CAS | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 | |
CMT | 2010-01-01 00:00:02 | 2010-01-01 00:01:08 | 1 | 0.1 | -73.992807 | 40.741418 | 1 | 0 | -73.995799 | 40.742596 | No | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 |
VTS | 2010-01-01 00:01:23 | 2010-01-01 00:01:23 | 1 | 0.6099999999999999 | -73.98003799999998 | 40.74306 | 1 | -73.974862 | 40.750387 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:34 | 2010-01-01 00:01:34 | 1 | 0.02 | -73.954122 | 40.801173 | 1 | -73.95431499999998 | 40.800897 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 |
SELECT * FROM zones LIMIT 10;
zone | LocationID | borough | geom |
---|---|---|---|
Newark Airport | 1 | EWR | POLYGON (...) |
Jamaica Bay | 2 | Queens | MULTIPOLYGON (...) |
Allerton/Pelham Gardens | 3 | Bronx | POLYGON (...) |
Alphabet City | 4 | Manhattan | POLYGON (...) |
Arden Heights | 5 | Staten Island | POLYGON (...) |
Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON (...) |
Astoria | 7 | Queens | POLYGON (...) |
Astoria Park | 8 | Queens | POLYGON (...) |
Auburndale | 9 | Queens | POLYGON (...) |
Baisley Park | 10 | Queens | POLYGON (...) |
Let's compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from WGS84 (EPSG:4326) (lat/lon) to the NAD83 / New York Long Island ftUS (ESRI:102718) projection and calculate the distance using the ST_Distance
function, which in this case gives the distance in feet, which we then convert to miles (5280 ft/mile). Trips that are smaller than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. Although this is not entirely accurate since the distance we use does not take into account the curvature of the earth, but it is a good enough approximation for our purposes.
CREATE TABLE cleaned_rides AS SELECT
st_point(pickup_latitude, pickup_longitude) as pickup_point,
st_point(dropoff_latitude, dropoff_longitude) as dropoff_point,
dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP as time,
trip_distance,
st_distance(
st_transform(pickup_point, 'EPSG:4326', 'ESRI:102718'),
st_transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280 as aerial_distance,
trip_distance - aerial_distance as diff
FROM rides
WHERE diff > 0
ORDER BY diff DESC;
SELECT * FROM cleaned_rides LIMIT 10;
pickup_point | dropoff_point | time | trip_distance | aerial_distance | diff |
---|---|---|---|---|---|
POINT (40.758149 -73.963267) | POINT (40.743807 -73.915763) | 01:49:25 | 47.4 | 2.6820365663951677 | 44.71796343360483 |
POINT (40.764592 -73.971798) | POINT (40.743878 -73.991015) | 01:09:29 | 45.9 | 1.7492118606943174 | 44.15078813930568 |
POINT (40.733306 -73.987289) | POINT (40.758895 -73.987341) | 02:15:53 | 45.2 | 1.7657013363262366 | 43.434298663673765 |
POINT (40.755965 -73.973138) | POINT (40.756137 -73.973535) | 02:48:19 | 41.9 | 0.02397481410159387 | 41.876025185898406 |
POINT (40.645337 -73.77656899999998) | POINT (40.645389 -73.77632699999998) | 00:50:09 | 41.4 | 0.013215531558232116 | 41.38678446844177 |
POINT (40.743983 -73.986063) | POINT (40.759483 -73.985377) | 00:00:18 | 41.9 | 1.070141742954722 | 40.829858257045274 |
POINT (40.782197 -73.98247) | POINT (40.713868 -74.03883) | 02:14:00 | 45.05 | 5.565739906339228 | 39.484260093660765 |
POINT (40.662826 -73.78962799999998) | POINT (40.656764 -73.794222) | 02:19:32 | 39.8 | 0.4829476946393737 | 39.317052305360626 |
POINT (40.644837 -73.781722) | POINT (40.646732 -73.801497) | 01:29:00 | 39.8 | 1.0475276883275062 | 38.75247231167249 |
POINT (40.754573 -73.98841199999998) | POINT (40.763428 -73.968002) | 01:31:00 | 39.74 | 1.2329411412165936 | 38.50705885878341 |
Now lets join the taxi rides with the taxi zones to get the start and end zone for each ride using the ST_Within
function to check if a point is within a polygon. Again we need to transform the coordinates from WGS84 to the NAD83 since the taxi zone data also use that projection.
-- Since we dont have spatial indexes yet, use a smaller dataset for the following example.
DELETE FROM cleaned_rides WHERE rowid > 5000;
CREATE TABLE joined AS
SELECT
pickup_point,
dropoff_point,
start_zone.zone as start_zone,
end_zone.zone as end_zone,
trip_distance,
time,
FROM cleaned_rides
JOIN zones as start_zone
ON ST_Within(st_transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom)
JOIN zones as end_zone
ON ST_Within(st_transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point | dropoff_point | start_zone | end_zone | trip_distance | time |
---|---|---|---|---|---|
POINT (40.722223 -73.98385299999998) | POINT (40.715507 -73.992438) | East Village | Lower East Side | 10.3 | 00:19:16 |
POINT (40.648687 -73.783522) | POINT (40.649567 -74.005812) | JFK Airport | Sunset Park West | 23.57 | 00:28:00 |
POINT (40.761603 -73.96661299999998) | POINT (40.760232 -73.96344499999998) | Upper East Side South | Sutton Place/Turtle Bay North | 17.6 | 00:27:05 |
POINT (40.697212 -73.937495) | POINT (40.652377 -73.93983299999998) | Stuyvesant Heights | East Flatbush/Farragut | 13.55 | 00:24:00 |
POINT (40.721462 -73.993583) | POINT (40.774205 -73.90441699999998) | Lower East Side | Steinway | 28.75 | 01:03:00 |
POINT (40.716955 -74.004328) | POINT (40.754688 -73.991612) | TriBeCa/Civic Center | Garment District | 18.4 | 00:46:12 |
POINT (40.740052 -73.994918) | POINT (40.75439 -73.98587499999998) | Flatiron | Garment District | 24.2 | 00:35:25 |
POINT (40.763017 -73.95949199999998) | POINT (40.763615 -73.959182) | Lenox Hill East | Lenox Hill West | 18.4 | 00:33:46 |
POINT (40.865663 -73.927458) | POINT (40.86537 -73.927352) | Washington Heights North | Washington Heights North | 10.47 | 00:27:00 |
POINT (40.738408 -73.980345) | POINT (40.696038 -73.955493) | Gramercy | Bedford | 16.4 | 00:21:47 |
We can export the joined table to a GeoJSONSeq file using the GDAL copy function, passing in a GDAL layer creation option.
Since GeoJSON only supports a single geometry per feature, we can use the ST_MakeLine
function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON is WGS84, but the coordinates are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates
function.
COPY (
SELECT
ST_AsWKB(ST_FlipCoordinates(ST_MakeLine(pickup_point, dropoff_point))) as wkb_geometry,
start_zone,
end_zone,
time::VARCHAR as trip_time
FROM joined)
TO 'joined.geojsonseq'
WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }
If you need to set some "Layer creation option" (format specific), i.e. those options that you set via CLI with -lco
, you can use this kind of syntax, in which the WRITE_BBOX
and RFC7946
- available in GeoJSON
format - options are set:
COPY (SELECT * from st_read('input.shp'))
TO 'output.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON',LAYER_CREATION_OPTIONS ('WRITE_BBOX=YES', 'RFC7946=YES'))