Prune TOI adapted to postgres and standalone tileserver #204
Conversation
…delete_tiles for S3/TileDirectory store, zip_metatile added into extension_to_format/name_to_format
👍
I'd prefer to sort out the questions around configuration before merging this in, even though it looks like it'll work as is to me.
config.yaml.sample
Outdated
```yaml
  # a reduced version of prev s3 entity
  store:
    layer: all
    format: zip
```
My thoughts about the configuration:
- we can probably just use `store` throughout and grow any API needed to support any operations across all backends.
- regarding the duplication, on the one hand it's nice to just have it in one place. But in practice, it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily. YAML allows us to specify pointers, so maybe we can do something clever here to refer to other sections, as sketched below?
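For illustration, a rough sketch of how YAML anchors and aliases could let one section refer to a shared store definition (the layout here is hypothetical, not the actual config.yaml.sample):

```yaml
# Define the store once and reference it elsewhere via an alias.
store: &default-store
  type: s3
  name: tiles
  bucket: mapzen-tiles-dev-us-east
  date-prefix: 20170322

toi-prune:
  store: *default-store  # alias resolves to the shared store mapping above
```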
What do you think @iandees ?
> it's very useful to have the configuration for the different operations separate, which allows us to handle production deploys and moments of transition more readily
What does moments of transition mean here?
But different commands still operate on the same store, right? They can only differ across maturities like `.alpha`, `.beta`, `.rc`, `.release`, where different stores might be used. In that case, why not have a separate configuration like `config.alpha.yaml` for each maturity? Or do you mean having something like:
```yaml
store:
  seed:
    type: directory
    name: tiles
    path: ../tiles
  prune-tiles-of-interest:
    type: s3
    name: tiles
    bucket: mapzen-tiles-dev-us-east
    reduced-redundancy: true
    date-prefix: 20170322
```
> What does moments of transition mean here?

Production deployments, basically.

> But different commands still operate on the same store, right? They can only differ across maturities like `.alpha`, `.beta`, `.rc`, `.release`, where different stores might be used. In that case, why not have a separate configuration like `config.alpha.yaml` for each maturity? Or do you mean having something like:
I think this has to do with the mechanics of our deployments more than anything else. The problem is that during a deployment we have some services use new store / database configurations, and others use the previous ones. It just ended up being easier to manage the configuration this way for some cases.
To give you a more concrete idea, typically we have tilequeue (offline tile generation) updated first to use newer code and configuration, and do a seed run to put tiles in a new store location. Then when that is more or less finished, tileserver/tapalcatl/others will get updated to point to this location, as well as any code updates to correspond with the new tiles.
We can probably clean this up more, but we're planning on re-evaluating our running infrastructure shortly anyway, so we'll probably end up considering different configuration changes in that light.
tilequeue/command.py
Outdated
```python
cursor.execute("INSERT into tile_traffic_v4 (date, z, x, y, tilesize, service, host) VALUES ('%s', %d, %d, %d, %d, '%s', '%s')"
               % (timestamp, coord.zoom, coord.column, coord.row, 512, 'vector-tiles', host))

logger.info('Inserted %d records' % len(iped_dated_coords_to_insert))
```
Mind folding the filter into the loop for readability? Something like this?
```python
n_coords_inserted = 0
for host, timestamp, coord_int in iped_dated_coords_to_insert:
    if not max_timestamp or timestamp > max_timestamp:
        coord = coord_unmarshall_int(coord_int)
        cursor.execute(...)
        n_coords_inserted += 1
logger.info('Inserted %d records' % n_coords_inserted)
```
addressed in e95e234
tilequeue/command.py
Outdated
```
@@ -1124,7 +1161,9 @@ def delete_from_s3(s3_parts, coord_ints):
            len(toi_to_remove))

    for coord_ints in grouper(toi_to_remove, 1000):
        delete_from_s3(s3_parts, coord_ints)
        removed = store.delete_tiles(map(lambda coord_int: coord_unmarshall_int(coord_int), coord_ints),
```
Can just be: map(coord_unmarshall_int, coord_ints)
addressed in bbfc011
tilequeue/utils.py
Outdated
```python
iped_dated_coords = map(lambda match: (match.group(1),
                        datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S'),
                        coord_marshall_int(create_coord(match.group(6), match.group(7), match.group(5)))), matches)
```
We take the less functional or "pythonic" (I recognize it's a loaded term) approach in the rest of the code base. Mind just converting these into a loop?
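For instance, a loop-based version might look roughly like this (a sketch reusing the names from the snippet above; the change actually made in the PR may differ):

```python
from datetime import datetime
# coord_marshall_int and create_coord are assumed to be the tilequeue.tile
# helpers already used in the snippet above; matches is the list of regex matches.

iped_dated_coords = []
for match in matches:
    iped_dated_coords.append((
        match.group(1),
        datetime.strptime(match.group(2), '%d/%B/%Y %H:%M:%S'),
        coord_marshall_int(create_coord(match.group(6), match.group(7), match.group(5))),
    ))
```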
addressed in e95e234
tilequeue/command.py
Outdated
```diff
 with conn.cursor() as cur:
     cur.execute("""
     select x, y, z, tilesize, count(*)
     from tile_traffic_v4
-    where (date >= dateadd(day, -{days}, getdate()))
+    where (date >= dateadd({opt_quote}day{opt_quote}, -{days}, getdate()))
```
Do we need to have this be different based on postgresql/redshift backends? @iandees: any chance that the redshift syntax would support the postgresql syntax with quotes?
`select dateadd('day', -30, getdate());` works on RedShift.
addressed in bbfc011
```python
cfg.store_type = 's3'
cfg.s3_bucket = store_parts['bucket']
cfg.s3_date_prefix = store_parts['date-prefix']
cfg.s3_path = store_parts['path']
```
This will definitely work, but related to the configuration discussion, I'd rather we normalized that where possible to avoid needing this kind of code.
Sure, I'm waiting to see which approach gets decided on regarding configuration.
This was done just to adapt the existing behavior to `make_store`'s generated store instances.
@rmarianski Do these changes work for you now?
Yea, LGTM 👍
config.yaml.sample
Outdated
```yaml
# Connection and query configuration for a RedShift database containing
# request information for tiles.
redshift:
  # if database-uri is not specified, postgresql configuration above is used
```
I'd rather see you change the `redshift` configuration topic to something more generic so that a user can specify a generic postgres-compatible database URI rather than point off to another place.
@iandees great suggestion. So maybe, given its fields, we can rename `redshift` to `traffic-history` or just `history`?
@iandees, @rmarianski: so do you guys prefer no default fallback to the postgresql configuration?
@iandees, @rmarianski: I ditched the fallback to the default postgresql configuration in 2ca05fe. A generic postgres URI like `postgresql://localhost:5432/osm` works as expected.
I also renamed the `redshift` entry to `tile-history` for clarity. What do you think?
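Presumably the renamed section then looks something like this (a sketch based on the description above; keys other than `database-uri` are not implied):

```yaml
# Generic history section: a single postgres-compatible database URI,
# usable for either RedShift or plain PostgreSQL.
tile-history:
  database-uri: postgresql://localhost:5432/osm
```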
config.yaml.sample
Outdated
```
@@ -210,6 +217,10 @@ toi-prune:
     path: osm
     layer: all
     format: zip
     # a reduced version of prev s3 entity
```
I'm not sure what this means.
The primary store configuration specifies all entries from `toi-prune:s3` except `layer` and `format`. The "reduced version" is the portion of the store-related configuration that specifies only the remaining `layer` and `format` (see the sketch below).
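To illustrate, roughly (a sketch assembled from config values mentioned elsewhere in this thread, not the literal config.yaml.sample contents):

```yaml
# Primary store configuration: everything needed to reach the tiles.
store:
  type: s3
  name: tiles
  bucket: mapzen-tiles-dev-us-east
  date-prefix: 20170322

toi-prune:
  # the "reduced version": only the entries the primary store block lacks
  store:
    layer: all
    format: zip
```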
tilequeue/command.py
Outdated
```python
logger.info('Consuming tile traffic logs ...')
logger.info(cfg.tile_traffic_log_path)

iped_dated_coords = None
```
What's `iped` mean?
I kinda needed to figure out how to name a tuple that contains the client's IP address, the request timestamp, and the coord, so the 'ed suffix carries that meaning. I couldn't figure out a better name that would be self-explanatory, though it is still confusing, I guess. Do you have any suggestions?
How about something to do with logs? `log_entry`, `log_record`, `log_data`, or something like that?
Names are hard :)
addressed in 8e4e864
I adopted your suggestion as `tile_log_records`; it definitely sounds better, though it's also less explicit about what it actually is.
tilequeue/command.py
Outdated
```python
    "be present in the config yaml")

is_postgres_conn_info = isinstance(db_conn_info, dict)
if is_postgres_conn_info:
```
I'd like to see this block go away when the `redshift` section in config simply has a database URI.
@iandees definitely, will do.
…ift/postgres, simplified map in delete_tiles
…tched the fallback to default postgres config, renamed .yaml redshift entry to a tile-history for clarity
@rmarianski, @iandees: it seems I have addressed all of your requests, guys. Please let me know if there is anything else needed on my side now.
@iandees does this look good now?
tilequeue/utils.py
Outdated
```python
        host inet not null
        )''')

def postgres_add_compat_date_utils(cursor):
```
Why not use the built-in postgres-compatible date functions?
I didn't want to alter your RedShift query or have two distinct queries in `prune_tiles_of_interest`. Since standard Postgres doesn't have `dateadd()` and `getdate()`, I just added their implementations so there is no need to alter the query or differentiate between Postgres and RedShift.
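For illustration, such wrappers could be created along these lines (a minimal sketch assuming plain SQL functions; the actual `postgres_add_compat_date_utils` in the PR may differ):

```python
def postgres_add_compat_date_utils(cursor):
    # Postgres stand-ins for RedShift's getdate()/dateadd(), so the
    # prune-tiles-of-interest query can run unchanged on both backends.
    cursor.execute("""
        CREATE OR REPLACE FUNCTION getdate() RETURNS timestamp AS $$
            SELECT now()::timestamp
        $$ LANGUAGE sql""")
    cursor.execute("""
        CREATE OR REPLACE FUNCTION dateadd(interval_kind text, amount int, ts timestamp)
        RETURNS timestamp AS $$
            SELECT ts + (amount || ' ' || interval_kind)::interval
        $$ LANGUAGE sql""")
```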
I'd prefer to use date functions that are shared between the two so that wrapping functions like this aren't required.
Alright, I haven't used RedShift before though; can you suggest a preferable way?
I thought we had found something that worked on both platforms in a previous discussion, but looking back I see that's not the case.
It looks like both RedShift and PostgreSQL support interval literals:
On PostgreSQL:
```
iandees=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 11:12:11.184408-04
(1 row)
```
On RedShift:
```
analytics=# select current_timestamp - interval '30 days' as dateplus;
           dateplus
-------------------------------
 2017-05-06 15:12:20.733685+00
(1 row)
```
Can you change to using this interval literal notation in both cases and remove the code to add a wrapping function?
Amazing, sure. On it.
…tils supported both in postgres and redshift
tilequeue/command.py
Outdated
```diff
@@ -1051,7 +1047,7 @@ def tilequeue_prune_tiles_of_interest(cfg, peripherals):
         cur.execute("""
         select x, y, z, tilesize, count(*)
         from tile_traffic_v4
-        where (date >= dateadd('day', -{days}, getdate()))
+        where (date >= (current_timestamp - interval '{days} days'))
```
@iandees this works well for me.
Great, thanks.
@ambientlight Thanks for your contribution! Can you resolve the …
@nvkelso done. Thanks!
config.yaml.sample
Outdated
```
@@ -206,6 +209,10 @@ toi-prune:
     path: osm
     layer: all
     format: zip
     # a reduced version of prev s3 entity
```
Can you make this comment more generic (so it's less about what was and more about what the following config is for)?
@nvkelso yeah, actually forgot to change the .sample... done.
Woot! 🎉
Overview:
- `tilequeue consume-tile-traffic` has been added; it parses tileserver output and inserts the results into `tile_traffic_v4` with the structure required by `prune-tiles-of-interest`.
- `tilequeue prune-tiles-of-interest` has been modified to fall back to the postgres configuration when `database-uri` is not specified; minor refactoring was done to use `make_store`'s store instance for the sake of directory store support in addition to s3.

consume-tile-traffic sample usage
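A sample invocation would presumably be along these lines (assuming the `--config` flag used by the other tilequeue commands; the config path is just an example):

```sh
tilequeue consume-tile-traffic --config config.yaml
```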
consume-tile-traffic details

A regex is used to extract tile request logs from the tileserver output. `tile_traffic_v4` has been mimicked with a structure derived from the `prune-tiles-of-interest` code, and will be created if not present in the database. DATEADD()/GETDATE() replicas are also added so the `prune-tiles-of-interest` code written for redshift stays compatible. `DBAffinityConnectionsNoLimit` has been utilized for the postgres connection; a small extension allows specifying a `readonly` parameter on init (which defaults to `True`).

tilesize has been mocked to 512 since it is not available in the tileserver output. Please suggest how to gracefully handle it here.
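For reference, the mimicked table presumably looks roughly like this (a sketch inferred from the INSERT statement and the `host inet` snippet above; the helper name is hypothetical and the exact column types may differ):

```python
def ensure_tile_traffic_table(cursor):
    # Columns mirror the INSERT used by consume-tile-traffic:
    # (date, z, x, y, tilesize, service, host).
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS tile_traffic_v4 (
            date timestamp not null,
            z int not null,
            x int not null,
            y int not null,
            tilesize int not null,
            service varchar(32),
            host inet not null
        )""")
```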
prune-tiles-of-interest modification details

The DATEADD() interval-kind argument is a string in postgres, so single quotes are added around it via `dateadd({opt_quote}day{opt_quote}, -{days}, getdate())` when using postgres.

I have a feeling the s3-store-related tile deletion code is a bit hacky, with the `toi-prune:s3` config being redundant; the original store configuration should probably have been used instead, though I am not 100% sure about this. As a suggestion, `s3` has been renamed to `store` to keep the correct semantic meaning, and the entries that collide with the original store entries have been removed, so the original store configuration is used instead. The inline `delete_from_s3` has been deleted in favor of its implementation inside S3/TileDirectory, and thus `make_store`'s store instance is used. The Zip Metatile format instance has been added to `extension_to_format`/`name_to_format`, since I guess it was forgotten there.

In order not to break the existing behavior, when a `toi-prune:s3` configuration is specified it behaves exactly as before, with s3's entries overriding the original store config.