
Prune TOI adapted to postgres and standalone tileserver #204

Merged
merged 15 commits into tilezen:master on Jun 5, 2017

Conversation

ambientlight
Contributor

Overview:

  1. A tilequeue consume-tile-traffic command has been added; it parses tileserver output and inserts tile request records into tile_traffic_v4 with the structure required by prune-tiles-of-interest.
  2. tilequeue prune-tiles-of-interest has been modified to fall back to the postgresql configuration when database-uri is not specified; minor refactoring was done to use make_store's store instance, so that directory stores are supported in addition to S3.

consume-tile-traffic sample usage

nohup python tileserver/__init__.py config.yaml &
# add nohup.out path into config.yaml's toi-prune:tile-traffic-log-path
tilequeue consume-tile-traffic --config config.yaml
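
For reference, the relevant config.yaml keys might look roughly like this (only tile-traffic-log-path is named by this change; the store keys mirror the sample config quoted later in the review, and the exact layout is a guess):

toi-prune:
  # path to the captured tileserver output (nohup.out above)
  tile-traffic-log-path: nohup.out
  store:
    layer: all
    format: zip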

consume-tile-traffic details
A regex is used to extract tile request log lines from the tileserver output. tile_traffic_v4 is mimicked with a structure derived from the prune-tiles-of-interest code and is created if not already present in the database. DATEADD()/GETDATE() replicas are also added so that the prune-tiles-of-interest query written for Redshift remains compatible.
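
To give an idea of the parsing, here is a minimal sketch of the kind of regex-based extraction involved; the actual pattern and group layout in the PR differ, and the assumed log line format below is only an example of werkzeug-style output:

import re
from datetime import datetime

# assumed line format (an example, not the exact tileserver output):
#   127.0.0.1 - - [09/May/2017 11:12:11] "GET /osm/all/14/4823/6160.json HTTP/1.1" 200 -
LOG_LINE_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*\[(\d+/\w+/\d+ [\d:]+)\].*'
    r'"GET /\w+/\w+/(\d+)/(\d+)/(\d+)\.\w+ ')

def parse_tile_request(line):
    """Return (host, timestamp, zoom, x, y) for a tile request line, or None."""
    match = LOG_LINE_RE.search(line)
    if not match:
        return None
    host = match.group(1)
    # the timestamp format is the one used in the PR code
    timestamp = datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S')
    zoom, x, y = (int(match.group(i)) for i in (3, 4, 5))
    return host, timestamp, zoom, x, y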

DBAffinityConnectionsNoLimit is used for the postgres connection. A small extension was made to allow specifying a readonly parameter on init (which defaults to True).

tilesize is hard-coded to 512 since it is not available in the tileserver output. Please suggest how to handle this more gracefully.

prune-tiles-of-interest modification details
The DATEADD() interval-kind argument is a string in postgres, so single quotes are added around the unit, i.e. dateadd({opt_quote}day{opt_quote}, -{days}, getdate()), when using postgres.
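
For illustration, the two forms side by side (-30 days as in the example later in the thread; the quoted form is the one this PR emits on postgres):

-- Redshift's native form: the unit is a bare keyword
select dateadd(day, -30, getdate());
-- quoted form used by the postgres compat wrapper (later confirmed to also work on Redshift)
select dateadd('day', -30, getdate());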

I have a feeling that the S3-store-related tile deletion code is a bit hacky, with the toi-prune:s3 config being redundant; the original store configuration should probably be used instead, though I am not 100% sure about this.
As a suggestion, s3 has been renamed to store to keep the semantics correct, and the entries that collide with the original store entries have been removed; the original store configuration is used instead. The inline delete_from_s3 has been removed in favor of the implementation inside S3/TileDirectory, so make_store's store instance is used.
The format instance associated with zip metatiles has been added to extension_to_format/name_to_format, since I guess it was forgotten there.

In order not to break existing behavior, when a toi-prune:s3 configuration is specified everything behaves exactly as before, with the s3 entries overriding the original store config.

Member
@rmarianski rmarianski left a comment

👍

I'd prefer to sort out the questions around configuration before merging this in, even though it looks like it'll work as is to me.

# a reduced version of prev s3 entity
store:
  layer: all
  format: zip
Member

My thoughts about the configuration:

  • we can probably just use store throughout and grow any api needed to support any operations across all backends.
  • regarding the duplication, on the one hand it's nice to just have it in one place. But in practice, it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily. Yaml allows us to specify pointers, so maybe we can do something clever here to refer to other sections? (see the sketch below)

What do you think @iandees ?
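
For what it's worth, a minimal sketch of the YAML anchor/alias idea (key names copied from the examples in this thread; whether this suits the config loader used here is untested):

store: &main-store
  type: s3
  name: tiles
  bucket: mapzen-tiles-dev-us-east
  date-prefix: 20170322

toi-prune:
  # point at the same store section instead of duplicating it
  store: *main-store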

Contributor Author
@ambientlight ambientlight May 9, 2017

it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily

What does moments of transition mean here?

But different commands still operate on the same store, right? They would only differ across maturities like .alpha, .beta, .rc, .release, where different stores might be used? In that case, why not have a separate configuration like config.alpha.yaml for each maturity? Or do you mean having something like:

store:
  seed:
    type: directory
    name: tiles
    path: ../tiles
  prune-tiles-of-interest:
    type: s3
    name: tiles
    bucket: mapzen-tiles-dev-us-east
    reduced-redundancy: true
    date-prefix: 20170322

Member

What does moments of transition mean here?

Production deployments basically.

But different commands still operate on the same store, right? They can differ only like in .alpha .beta .rc .release where different stores might be used? In this circumstance why not having separate configuration like config.alpha.yaml for each maturity? Or you mean having something like:

I think this has to do with the mechanics of our deployments more than anything else. The problem is that during a deployment we have some services use new store / database configurations, and others use the previous ones. It just ended up being easier to manage the configuration this way for some cases.

To give you a more concrete idea, typically we have tilequeue (offline tile generation) updated first to use newer code and configuration, and do a seed run to put tiles in a new store location. Then when that is more or less finished, tileserver/tapalcatl/others will get updated to point to this location, as well as any code updates to correspond with the new tiles.

We can probably clean this up more, but we're planning on re-evaluating our running infrastructure shortly anyway, so we'll probably end up considering different configuration changes in that light.

cursor.execute("INSERT into tile_traffic_v4 (date, z, x, y, tilesize, service, host) VALUES ('%s', %d, %d, %d, %d, '%s', '%s')"
% (timestamp, coord.zoom, coord.column, coord.row, 512, 'vector-tiles', host))

logger.info('Inserted %d records' % len(iped_dated_coords_to_insert))
Member

Mind folding the filter into the loop for readability? Something like this?

n_coords_inserted = 0
for host, timestamp, coord_int in iped_dated_coords_to_insert:
    if not max_timestamp or timestamp > max_timestamp:
        coord = coord_unmarshall_int(coord_int)
        cursor.execute(...)
        n_coords_inserted += 1

logger.info('Inserted %d records' % n_coords_inserted)

Contributor Author

addressed in e95e234

@@ -1124,7 +1161,9 @@ def delete_from_s3(s3_parts, coord_ints):
len(toi_to_remove))

for coord_ints in grouper(toi_to_remove, 1000):
delete_from_s3(s3_parts, coord_ints)
removed = store.delete_tiles(map(lambda coord_int: coord_unmarshall_int(coord_int), coord_ints),
Member

Can just be: map(coord_unmarshall_int, coord_ints)

Contributor Author

addressed in bbfc011


iped_dated_coords = map(lambda match: (match.group(1),
datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S'),
coord_marshall_int(create_coord(match.group(6), match.group(7), match.group(5)))), matches)
Member

We take the less functional or "pythonic" (I recognize it's a loaded term) approach in the rest of the code base. Mind just converting these into a loop?
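
Roughly something like this, for illustration (names follow the snippet above; a sketch, not the exact code that landed):

iped_dated_coords = []
for match in matches:
    host = match.group(1)
    timestamp = datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S')
    coord_int = coord_marshall_int(
        create_coord(match.group(6), match.group(7), match.group(5)))
    iped_dated_coords.append((host, timestamp, coord_int))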

Contributor Author

addressed in e95e234

with conn.cursor() as cur:
cur.execute("""
select x, y, z, tilesize, count(*)
from tile_traffic_v4
where (date >= dateadd(day, -{days}, getdate()))
where (date >= dateadd({opt_quote}day{opt_quote}, -{days}, getdate()))
Member

Do we need to have this be different based on postgresql/redshift backends? @iandees: any chance that the redshift syntax would support the postgresql syntax with quotes?

Member

select dateadd('day', -30, getdate());

works on RedShift.

Contributor Author

addressed in bbfc011

cfg.store_type = 's3'
cfg.s3_bucket = store_parts['bucket']
cfg.s3_date_prefix = store_parts['date-prefix']
cfg.s3_path = store_parts['path']
Member

This will definitely work, but related to the configuration discussion, I'd rather we normalized that where possible to avoid needing this kind of code.

Contributor Author

Sure, waiting on a decision about the configuration approach.
This was done just to adapt the existing behavior to make_store's generated store instances.

Member

@rmarianski Do these changes work for you now?

Member

Yea, LGTM 👍

# Connection and query configuration for a RedShift database containing
# request information for tiles.
redshift:
# if database-uri is not specified, postgresql configuration above is used
Member

I'd rather see you change the redshift configuration topic to something more generic so that a user can specify a generic postgres-compatible database URI rather than point off to another place.

Contributor Author
@ambientlight ambientlight May 9, 2017

@iandees great suggestion. So maybe, in the context of its fields, we could rename redshift to traffic-history or just history?

@iandees, @rmarianski: So you guys prefer no default fallback to postgresql configuration?

Contributor Author

@iandees, @rmarianski: so I ditched the fallback to the default postgresql configuration in 2ca05fe.
A generic postgres URI like postgresql://localhost:5432/osm works as expected.

Also, I renamed the redshift entry to tile-history for clarity. What do you think?
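
For example, something along these lines (the exact key layout is my guess):

toi-prune:
  tile-history:
    # any postgres-compatible URI, including a Redshift endpoint
    database-uri: postgresql://localhost:5432/osm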

@@ -210,6 +217,10 @@ toi-prune:
path: osm
layer: all
format: zip
# a reduced version of prev s3 entity
Member

I'm not sure what this means.

Contributor Author

The primary store configuration specifies all entries from toi-prune:s3 except layer and format; the "reduced version" is the portion of the store-related configuration that specifies only the remaining layer and format.

logger.info('Consuming tile traffic logs ...')
logger.info(cfg.tile_traffic_log_path)

iped_dated_coords = None
Member

What's iped mean?

Contributor Author

I needed a name for a tuple that contains the client's IP address, the request timestamp, and the coord, so the 'iped'/'dated' suffixes were meant to carry that meaning. I couldn't come up with a better self-explanatory name, though I guess it is still confusing. Do you have any suggestions?

Member
@rmarianski rmarianski May 9, 2017

How about something to do with logs? log_entry, log_record, log_data or something like that?
Names are hard :)

Contributor Author

addressed in 8e4e864

Adopted your suggestion: tile_log_records. It definitely sounds better, though it's also less explicit about what it actually contains.

"be present in the config yaml")

is_postgres_conn_info = isinstance(db_conn_info, dict)
if is_postgres_conn_info:
Member

I'd like to see this block go away when the redshift section in config simply has a database URI.

Contributor Author

@iandees definitely, will do.

Contributor Author

@iandees addressed in 2ca05fe


@ambientlight
Contributor Author

@rmarianski, @iandees: it seems I have addressed all of your requests; please let me know if there is anything else needed on my side.

@zerebubuth
Member

@iandees does this look good now?

host inet not null
)''')

def postgres_add_compat_date_utils(cursor):
Member

Why not use the built-in postgres-compatible date functions?

Contributor Author

I didn't want to alter your Redshift query or keep two distinct queries in prune_tiles_of_interest. Since standard Postgres doesn't have dateadd() and getdate(), I just added implementations of them, so there is no need to alter the query or to differentiate between Postgres and Redshift.
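
For context, a rough sketch of what such shims can look like on plain Postgres (the actual function bodies in the PR may differ):

-- getdate()/dateadd() do not exist in stock Postgres; these wrappers let the
-- Redshift-flavoured query run unchanged
create or replace function getdate() returns timestamp as
  $$ select now()::timestamp $$ language sql;

create or replace function dateadd(unit text, amount integer, ts timestamp)
returns timestamp as
  $$ select ts + (amount::text || ' ' || unit)::interval $$ language sql;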

Member

I'd prefer to use date functions that are shared between the two so that wrapping functions like this aren't required.

Contributor Author

Alright, I haven't used Redshift before, though; can you suggest a preferable way?

Member

I thought we had found something that worked on both platforms in a previous discussion, but looking back I see that's not the case.

It looks like both RedShift and PostgreSQL support interval literals:

On PostgreSQL

iandees=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 11:12:11.184408-04
(1 row)

On RedShift

analytics=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 15:12:20.733685+00
(1 row)

Can you change to using this interval literal notation in both cases and remove the code to add a wrapping function?

Contributor Author

Amazing, sure. On it.

…tils supported both in postgres and redshift
@@ -1051,7 +1047,7 @@ def tilequeue_prune_tiles_of_interest(cfg, peripherals):
cur.execute("""
select x, y, z, tilesize, count(*)
from tile_traffic_v4
where (date >= dateadd('day', -{days}, getdate()))
where (date >= (current_timestamp - interval '{days} days'))
Contributor Author

@iandees this works well for me.

Member

Great, thanks.

@nvkelso
Member

nvkelso commented Jun 5, 2017

@ambientlight Thanks for your contribution!

Can you resolve the logging.conf.sample conflict and then we'll merge the PR? :)

@ambientlight
Contributor Author

@nvkelso done. Thanks!

@@ -206,6 +209,10 @@ toi-prune:
path: osm
layer: all
format: zip
# a reduced version of prev s3 entity
Member

Can you make this comment more generic (so it's less about what was and more about what the following config is for)?

Member

@ambientlight ☝️

Contributor Author
@ambientlight ambientlight Jun 5, 2017

@nvkelso yeah, actually forgot to change the .sample... done.

@nvkelso
Member

nvkelso commented Jun 5, 2017

Woot! 🎉

@nvkelso nvkelso merged commit 5f71f10 into tilezen:master Jun 5, 2017