
Prune TOI adapted to postgres and standalone tileserver #204

Merged
merged 15 commits into tilezen:master on Jun 5, 2017

Conversation

ambientlight
Contributor

Overview:

  1. A tilequeue consume-tile-traffic command has been added; it parses tileserver output and inserts tile request records into tile_traffic_v4 with the structure required by prune-tiles-of-interest.
  2. tilequeue prune-tiles-of-interest has been modified to fall back to the postgresql configuration when database-uri is not specified; minor refactoring was done to use make_store's store instance, so that directory stores are supported in addition to S3.

consume-tile-traffic sample usage

nohup python tileserver/__init__.py config.yaml &
# add nohup.out path into config.yaml's toi-prune:tile-traffic-log-path
tilequeue consume-tile-traffic --config config.yaml
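
For reference, the relevant config.yaml keys might look roughly like this (only tile-traffic-log-path is named by this change; the store keys mirror the sample config quoted later in the review, and the exact layout is a guess):

toi-prune:
  # path to the captured tileserver output (nohup.out above)
  tile-traffic-log-path: nohup.out
  store:
    layer: all
    format: zip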

consume-tile-traffic details
A regex is used to extract tile request log lines from the tileserver output. tile_traffic_v4 is mimicked with a structure derived from the prune-tiles-of-interest code and is created if not already present in the database. DATEADD()/GETDATE() replicas are also added so that the prune-tiles-of-interest query written for Redshift remains compatible.
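
To give an idea of the parsing, here is a minimal sketch of the kind of regex-based extraction involved; the actual pattern and group layout in the PR differ, and the assumed log line format below is only an example of werkzeug-style output:

import re
from datetime import datetime

# assumed line format (an example, not the exact tileserver output):
#   127.0.0.1 - - [09/May/2017 11:12:11] "GET /osm/all/14/4823/6160.json HTTP/1.1" 200 -
LOG_LINE_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+).*\[(\d+/\w+/\d+ [\d:]+)\].*'
    r'"GET /\w+/\w+/(\d+)/(\d+)/(\d+)\.\w+ ')

def parse_tile_request(line):
    """Return (host, timestamp, zoom, x, y) for a tile request line, or None."""
    match = LOG_LINE_RE.search(line)
    if not match:
        return None
    host = match.group(1)
    # the timestamp format is the one used in the PR code
    timestamp = datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S')
    zoom, x, y = (int(match.group(i)) for i in (3, 4, 5))
    return host, timestamp, zoom, x, y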

DBAffinityConnectionsNoLimit is used for the postgres connection. A small extension was made to allow specifying a readonly parameter on init (which defaults to True).

tilesize is hard-coded to 512 since it is not available in the tileserver output. Please suggest how to handle this more gracefully.

prune-tiles-of-interest modification details
The DATEADD() interval-kind argument is a string in postgres, so single quotes are added around the unit, i.e. dateadd({opt_quote}day{opt_quote}, -{days}, getdate()), when using postgres.
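
For illustration, the two forms side by side (-30 days as in the example later in the thread; the quoted form is the one this PR emits on postgres):

-- Redshift's native form: the unit is a bare keyword
select dateadd(day, -30, getdate());
-- quoted form used by the postgres compat wrapper (later confirmed to also work on Redshift)
select dateadd('day', -30, getdate());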

I have a feeling that the S3-store-related tile deletion code is a bit hacky, with the toi-prune:s3 config being redundant; the original store configuration should probably be used instead, though I am not 100% sure about this.
As a suggestion, s3 has been renamed to store to keep the semantics correct, and the entries that collide with the original store entries have been removed; the original store configuration is used instead. The inline delete_from_s3 has been removed in favor of the implementation inside S3/TileDirectory, so make_store's store instance is used.
The format instance associated with zip metatiles has been added to extension_to_format/name_to_format, since I guess it was forgotten there.

In order not to break existing behavior, when a toi-prune:s3 configuration is specified everything behaves exactly as before, with the s3 entries overriding the original store config.

Member
@rmarianski rmarianski left a comment

👍

I'd prefer to sort out the questions around configuration before merging this in, even though it looks like it'll work as is to me.

# a reduced version of prev s3 entity
store:
  layer: all
  format: zip
Member

My thoughts about the configuration:

  • we can probably just use store throughout and grow any api needed to support any operations across all backends.
  • regarding the duplication, on the one hand it's nice to just have it in one place. But in practice, it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily. Yaml allows us to specify pointers, so maybe we can do something clever here to refer to other sections? (see the sketch below)

What do you think @iandees ?
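
For what it's worth, a minimal sketch of the YAML anchor/alias idea (key names copied from the examples in this thread; whether this suits the config loader used here is untested):

store: &main-store
  type: s3
  name: tiles
  bucket: mapzen-tiles-dev-us-east
  date-prefix: 20170322

toi-prune:
  # point at the same store section instead of duplicating it
  store: *main-store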

Contributor Author
@ambientlight ambientlight May 9, 2017

it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily

What does moments of transition mean here?

But different commands still operate on the same store, right? They would only differ across maturities like .alpha, .beta, .rc, .release, where different stores might be used? In that case, why not have a separate configuration like config.alpha.yaml for each maturity? Or do you mean having something like:

store:
  seed:
    type: directory
    name: tiles
    path: ../tiles
  prune-tiles-of-interest:
    type: s3
    name: tiles
    bucket: mapzen-tiles-dev-us-east
    reduced-redundancy: true
    date-prefix: 20170322

Member

What does moments of transition mean here?

Production deployments basically.

But different commands still operate on the same store, right? They can differ only like in .alpha .beta .rc .release where different stores might be used? In this circumstance why not having separate configuration like config.alpha.yaml for each maturity? Or you mean having something like:

I think this has to do with the mechanics of our deployments more than anything else. The problem is that during a deployment we have some services use new store / database configurations, and others use the previous ones. It just ended up being easier to manage the configuration this way for some cases.

To give you a more concrete idea, typically we have tilequeue (offline tile generation) updated first to use newer code and configuration, and do a seed run to put tiles in a new store location. Then when that is more or less finished, tileserver/tapalcatl/others will get updated to point to this location, as well as any code updates to correspond with the new tiles.

We can probably clean this up more, but we're planning on re-evaluating our running infrastructure shortly anyway, so we'll probably end up considering different configuration changes in that light.

cursor.execute("INSERT into tile_traffic_v4 (date, z, x, y, tilesize, service, host) VALUES ('%s', %d, %d, %d, %d, '%s', '%s')"
% (timestamp, coord.zoom, coord.column, coord.row, 512, 'vector-tiles', host))

logger.info('Inserted %d records' % len(iped_dated_coords_to_insert))
Member

Mind folding the filter into the loop for readability? Something like this?

n_coords_inserted = 0
for host, timestamp, coord_int in iped_dated_coords_to_insert:
    if not max_timestamp or timestamp > max_timestamp:
        coord = coord_unmarshall_int(coord_int)
        cursor.execute(...)
        n_coords_inserted += 1

logger.info('Inserted %d records' % n_coords_inserted)

Contributor Author

addressed in e95e234

@@ -1124,7 +1161,9 @@ def delete_from_s3(s3_parts, coord_ints):
len(toi_to_remove))

for coord_ints in grouper(toi_to_remove, 1000):
delete_from_s3(s3_parts, coord_ints)
removed = store.delete_tiles(map(lambda coord_int: coord_unmarshall_int(coord_int), coord_ints),
Member

Can just be: map(coord_unmarshall_int, coord_ints)

Contributor Author

addressed in bbfc011


iped_dated_coords = map(lambda match: (match.group(1),
datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S'),
coord_marshall_int(create_coord(match.group(6), match.group(7), match.group(5)))), matches)
Member

We take the less functional or "pythonic" (I recognize it's a loaded term) approach in the rest of the code base. Mind just converting these into a loop?
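
Roughly something like this, for illustration (names follow the snippet above; a sketch, not the exact code that landed):

iped_dated_coords = []
for match in matches:
    host = match.group(1)
    timestamp = datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S')
    coord_int = coord_marshall_int(
        create_coord(match.group(6), match.group(7), match.group(5)))
    iped_dated_coords.append((host, timestamp, coord_int))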

Contributor Author

addressed in e95e234

with conn.cursor() as cur:
cur.execute("""
select x, y, z, tilesize, count(*)
from tile_traffic_v4
where (date >= dateadd(day, -{days}, getdate()))
where (date >= dateadd({opt_quote}day{opt_quote}, -{days}, getdate()))
Member

Do we need to have this be different based on postgresql/redshift backends? @iandees: any chance that the redshift syntax would support the postgresql syntax with quotes?

Member

select dateadd('day', -30, getdate());

works on RedShift.

Contributor Author

addressed in bbfc011

cfg.store_type = 's3'
cfg.s3_bucket = store_parts['bucket']
cfg.s3_date_prefix = store_parts['date-prefix']
cfg.s3_path = store_parts['path']
Member

This will definitely work, but related to the configuration discussion, I'd rather we normalized that where possible to avoid needing this kind of code.

Contributor Author

Sure, waiting on a decision about the configuration approach.
This was done just to adapt the existing behavior to make_store's generated store instances.

Member

@rmarianski Do these changes work for you now?

Member

Yea, LGTM 👍

# Connection and query configuration for a RedShift database containing
# request information for tiles.
redshift:
# if database-uri is not specified, postgresql configuration above is used
Member

I'd rather see you change the redshift configuration topic to something more generic so that a user can specify a generic postgres-compatible database URI rather than point off to another place.

Contributor Author
@ambientlight ambientlight May 9, 2017

@iandees great suggestion. So maybe, in the context of its fields, we could rename redshift to traffic-history or just history?

@iandees, @rmarianski: So you guys prefer no default fallback to postgresql configuration?

Contributor Author

@iandees, @rmarianski: so I ditched the fallback to the default postgresql configuration in 2ca05fe.
A generic postgres URI like postgresql://localhost:5432/osm works as expected.

Also, I renamed the redshift entry to tile-history for clarity. What do you think?
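
For example, something along these lines (the exact key layout is my guess):

toi-prune:
  tile-history:
    # any postgres-compatible URI, including a Redshift endpoint
    database-uri: postgresql://localhost:5432/osm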

@@ -210,6 +217,10 @@ toi-prune:
path: osm
layer: all
format: zip
# a reduced version of prev s3 entity
Member

I'm not sure what this means.

Contributor Author

The primary store configuration specifies all entries from toi-prune:s3 except layer and format; the "reduced version" is the portion of the store-related configuration that specifies only the remaining layer and format.

logger.info('Consuming tile traffic logs ...')
logger.info(cfg.tile_traffic_log_path)

iped_dated_coords = None
Member

What's iped mean?

Contributor Author

I needed a name for a tuple that contains the client's IP address, the request timestamp, and the coord, so the 'iped'/'dated' suffixes were meant to carry that meaning. I couldn't come up with a better self-explanatory name, though I guess it is still confusing. Do you have any suggestions?

Member
@rmarianski rmarianski May 9, 2017

How about something to do with logs? log_entry, log_record, log_data or something like that?
Names are hard :)

Contributor Author

addressed in 8e4e864

Adopted your suggestion: tile_log_records. It definitely sounds better, though it's also less explicit about what it actually contains.

"be present in the config yaml")

is_postgres_conn_info = isinstance(db_conn_info, dict)
if is_postgres_conn_info:
Member

I'd like to see this block go away when the redshift section in config simply has a database URI.

Contributor Author

@iandees definitely, will do.

Contributor Author

@iandees addressed in 2ca05fe


@ambientlight
Contributor Author

@rmarianski, @iandees: it seems I have addressed all of your requests; please let me know if there is anything else needed on my side.

@zerebubuth
Member

@iandees does this look good now?

host inet not null
)''')

def postgres_add_compat_date_utils(cursor):
Member

Why not use the built-in postgres-compatible date functions?

Contributor Author

I didn't want to alter your Redshift query or keep two distinct queries in prune_tiles_of_interest. Since standard Postgres doesn't have dateadd() and getdate(), I just added implementations of them, so there is no need to alter the query or to differentiate between Postgres and Redshift.
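
For context, a rough sketch of what such shims can look like on plain Postgres (the actual function bodies in the PR may differ):

-- getdate()/dateadd() do not exist in stock Postgres; these wrappers let the
-- Redshift-flavoured query run unchanged
create or replace function getdate() returns timestamp as
  $$ select now()::timestamp $$ language sql;

create or replace function dateadd(unit text, amount integer, ts timestamp)
returns timestamp as
  $$ select ts + (amount::text || ' ' || unit)::interval $$ language sql;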

Member

I'd prefer to use date functions that are shared between the two so that wrapping functions like this aren't required.

Contributor Author

Alright, I haven't used Redshift before, though; can you suggest a preferable way?

Member

I thought we had found something that worked on both platforms in a previous discussion, but looking back I see that's not the case.

It looks like both RedShift and PostgreSQL support interval literals:

On PostgreSQL

iandees=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 11:12:11.184408-04
(1 row)

On RedShift

analytics=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 15:12:20.733685+00
(1 row)

Can you change to using this interval literal notation in both cases and remove the code to add a wrapping function?

Contributor Author

Amazing, sure. On it.

…tils supported both in postgres and redshift
@@ -1051,7 +1047,7 @@ def tilequeue_prune_tiles_of_interest(cfg, peripherals):
cur.execute("""
select x, y, z, tilesize, count(*)
from tile_traffic_v4
where (date >= dateadd('day', -{days}, getdate()))
where (date >= (current_timestamp - interval '{days} days'))
Contributor Author

@iandees this works well for me.

Member

Great, thanks.

@nvkelso
Member

nvkelso commented Jun 5, 2017

@ambientlight Thanks for your contribution!

Can you resolve the logging.conf.sample conflict and then we'll merge the PR? :)

@ambientlight
Contributor Author

@nvkelso done. Thanks!

@@ -206,6 +209,10 @@ toi-prune:
path: osm
layer: all
format: zip
# a reduced version of prev s3 entity
Member

Can you make this comment more generic (so it's less about what was and more about what the following config is for)?

Member

@ambientlight ☝️

Contributor Author
@ambientlight ambientlight Jun 5, 2017

@nvkelso yeah, actually forgot to change the .sample... done.

@nvkelso
Member

nvkelso commented Jun 5, 2017

Woot! 🎉

@nvkelso nvkelso merged commit 5f71f10 into tilezen:master Jun 5, 2017