Command to automatically prune tiles of interest #176
Changes from 2 commits
Commits: 0e2abca, 254ac0f, 88f6891, 071cc52, 4049b24, 654e4a1, a08bea5, 2aaa6d6, f186db9, f3ce544, 0164a64, e4138ed, c082775, 376d48f
```diff
@@ -953,49 +953,71 @@ def tilequeue_prune_tiles_of_interest(cfg, peripherals):
     logger = make_logger(cfg, 'prune_tiles_of_interest')
     logger.info('Pruning tiles of interest')
 
-    logger.info('Fetching tiles of interest ...')
-    tiles_of_interest = peripherals.redis_cache_index.fetch_tiles_of_interest()
-    n_toi = len(tiles_of_interest)
-    logger.info('Fetching tiles of interest ... done. %s found', n_toi)
-
     logger.info('Fetching tiles recently requested ...')
     import psycopg2
 
-    redshift_uri = cfg.yml.get('redshift_uri')
+    prune_cfg = cfg.yml.get('toi-prune', {})
+
+    redshift_uri = prune_cfg.get('redshift-uri')
     assert redshift_uri, ("A redshift connection URI must "
                           "be present in the config yaml")
 
-    redshift_days_to_query = cfg.yml.get('redshift_days')
+    redshift_days_to_query = prune_cfg.get('days')
     assert redshift_days_to_query, ("Number of days to query "
                                     "redshift is not specified")
 
-    tiles_recently_requested = set()
+    new_toi = set()
     with psycopg2.connect(redshift_uri) as conn:
         with conn.cursor() as cur:
             cur.execute("""
                 select x, y, z
                 from tile_traffic_v4
                 where (date >= dateadd(day, -{days}, current_date))
-                  and (z between 0 and 16)
+                  and (z between 10 and 16)
                   and (x between 0 and pow(2,z)-1)
                   and (y between 0 and pow(2,z)-1)
                 group by z, x, y
-                order by z, x, y
-                );""".format(days=redshift_days_to_query))
+                order by z, x, y;""".format(days=redshift_days_to_query))
             n_trr = cur.rowcount
             for (x, y, z) in cur:
                 coord = create_coord(x, y, z)
                 coord_int = coord_marshall_int(coord)
-                tiles_recently_requested.add(coord_int)
+                new_toi.add(coord_int)
 
     logger.info('Fetching tiles recently requested ... done. %s found', n_trr)
 
-    logger.info('Computing tiles of interest to remove ...')
-    toi_to_remove = tiles_of_interest - tiles_recently_requested
-    logger.info('Computing tiles of interest to remove ... done. %s found',
-                len(toi_to_remove))
+    for name, info in prune_cfg.get('always-include-bboxes', {}).items():
+        logger.info('Adding in tiles from %s ...', name)
+
+        bounds = map(float, info['bbox'].split(','))
+        bounds_tileset = set()
+        for coord in tile_generator_for_single_bounds(
+                bounds, info['min_zoom'], info['max_zoom']):
+            coord_int = coord_marshall_int(coord)
+            bounds_tileset.add(coord_int)
+        n_inc = len(bounds_tileset)
+        new_toi = new_toi.union(bounds_tileset)
+
+        logger.info('Adding in tiles from %s ... done. %s found', name, n_inc)
+
+    logger.info('New tiles of interest set includes %s tiles', len(new_toi))
+
+    logger.info('Fetching existing tiles of interest ...')
+    tiles_of_interest = peripherals.redis_cache_index.fetch_tiles_of_interest()
+    n_toi = len(tiles_of_interest)
+    logger.info('Fetching existing tiles of interest ... done. %s found',
+                n_toi)
+
+    logger.info('Computing tiles to remove ...')
+    toi_to_remove = tiles_of_interest - new_toi
+    logger.info('Computing tiles to remove ... done. %s found',
+                len(toi_to_remove))
 
-    logger.info('Removing tiles from TOI and S3 ...')
+    # Null out the reference to old TOI to save some memory
+    tiles_of_interest = None
+
+    logger.info('Removing %s tiles from TOI and S3 ...',
+                len(toi_to_remove))
 
     def delete_tile_of_interest(coord_int):
         # Remove from the redis toi set
```
```diff
@@ -1010,7 +1032,8 @@ def delete_tile_of_interest(coord_int):
         # FIXME: Think about doing this in a thread/process pool
         delete_tile_of_interest(coord_int)
```
Review thread on this line:

- We can also think about formalizing this a bit more and doing it out of process. What's the order of the amount that we've been managing in the past, several million?

- I haven't calculated the number of tiles that would be removed; I can do that now. I was thinking about putting together an SQS queue and a worker process to do the deletes, but that seemed heavy-handed. Maybe a Lambda task to do the delete?

- I was thinking the same. It's operationally heavier, but I think we'll need something like that if we want to scale past multiple processes/threads on a single instance. Maybe a good use case for Batch?

- Not sure if this is a comprehensive list of event sources for Lambda, but thinking about it more I think that's a reasonable option. One idea: we could split the list into groups of 10k or so, push those groups to a location on S3, and have Lambda listen to that. Lambda would perform the delete and then remove that object from S3.

- I just tested this on dev and each run of the …
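A rough sketch of the chunk-and-upload idea from the last full comment above, outside this PR: the bucket name, key prefix, manifest format, and 10k group size are illustrative assumptions; boto3's `put_object` writes one newline-delimited manifest per group for an S3-triggered Lambda to pick up.

```python
# Sketch of the "split into ~10k groups, push to S3, let Lambda delete" idea.
# Bucket, prefix, and manifest format are assumptions, not part of this PR.
import boto3

def chunks(seq, size):
    # Yield successive fixed-size slices of a list.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def enqueue_toi_deletes(toi_to_remove, bucket='tile-toi-prune-queue',
                        prefix='toi-to-delete', group_size=10000):
    s3 = boto3.client('s3')
    coords = sorted(toi_to_remove)
    for n, group in enumerate(chunks(coords, group_size)):
        # One newline-delimited manifest of marshalled coord ints per object;
        # the Lambda would delete these tiles, then remove the manifest object.
        body = '\n'.join(str(coord_int) for coord_int in group)
        s3.put_object(Bucket=bucket,
                      Key='{}/group-{:06d}.txt'.format(prefix, n),
                      Body=body.encode('utf-8'))
```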
```diff
 
-    logger.info('Removing tiles from TOI and S3 ... done')
+    logger.info('Removing %s tiles from TOI and S3 ... done',
+                len(toi_to_remove))
 
 
 def tilequeue_tile_sizes(cfg, peripherals):
```
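The FIXME in the hunk above suggests a thread/process pool for the per-tile deletes. A minimal sketch of the thread-pool option, assuming the `delete_tile_of_interest(coord_int)` signature from the diff; the worker count is an arbitrary assumption.

```python
# Sketch only: parallelizing the delete loop from the diff with a thread pool.
# Threads fit here because the work is I/O bound (Redis and S3 calls), so the
# GIL is not the bottleneck. For the several-million-tile scale discussed in
# the thread above, submissions would likely need to be batched as well.
from concurrent.futures import ThreadPoolExecutor, as_completed

def remove_tiles(toi_to_remove, delete_tile_of_interest, max_workers=32):
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(delete_tile_of_interest, coord_int): coord_int
                   for coord_int in toi_to_remove}
        for future in as_completed(futures):
            coord_int = futures[future]
            try:
                future.result()
            except Exception:
                # Collect failures so a retry pass (or manual cleanup) can handle them.
                failed.append(coord_int)
    return failed
```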
Review comment: should `16` here be a configurable max zoom? Right now that's 16, but with 2x2 metatiles it'd be 15.
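If the max zoom were made configurable as the comment suggests, one option would be to read it from the same `toi-prune` section and interpolate it into the zoom filter alongside `days`. Purely a sketch; the `max-zoom` key and its default are assumptions, not part of this PR.

```python
# Sketch only: a hypothetical 'max-zoom' key under toi-prune, defaulting to 16.
# With 2x2 metatiles the configured value would be 15, as the comment notes.
def build_traffic_query(prune_cfg):
    days = prune_cfg.get('days', 30)
    max_zoom = prune_cfg.get('max-zoom', 16)   # assumed key, not in this PR
    return """
        select x, y, z
        from tile_traffic_v4
        where (date >= dateadd(day, -{days}, current_date))
          and (z between 10 and {max_zoom})
          and (x between 0 and pow(2,z)-1)
          and (y between 0 and pow(2,z)-1)
        group by z, x, y
        order by z, x, y;""".format(days=days, max_zoom=max_zoom)

print(build_traffic_query({'days': 30, 'max-zoom': 15}))
```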