Faster "perform SQL updates" #1844

Merged 10 commits on Feb 25, 2019

Conversation

Member

@zerebubuth zerebubuth commented Feb 15, 2019

There are a few things going on here:

  • Execute one UPDATE per table, rather than several. This should reduce the number of passes over the table, especially since the lines and polygons tables touched every row when setting the label placement point.
  • Shard the updates 4x over osm_id for the OSM tables. This helps a lot when the UPDATE is CPU-bound, rather than disk-bound, as it seems to be when we're calculating a lot of ST_PointOnSurface() and min zoom PL/pgSQL functions.
  • Move index creation into one file per index, to be executed in parallel. PostgreSQL allows any number of index creations to run concurrently, and these can be very long-running. Also remove ANALYZE from apply-*.sql, so it can run after index creation.
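As an illustration of the third point, here is a minimal sketch of running one SQL file per index in parallel and waiting for all of them before moving on (for example, before running ANALYZE). It is not the code from this PR: `run_in_parallel` and the runner function passed to it are hypothetical names, and the runner is assumed to wrap something like `psql -f "$1"`.

```shell
#!/bin/sh
# Hypothetical helper: start one background job per file, then wait for all.
# $1 is the name of a command or function that processes a single file;
# the remaining arguments are the files (e.g. one SQL file per index).
run_in_parallel() {
    runner="$1"
    shift
    pids=""
    for sql_file in "$@"; do
        "$runner" "$sql_file" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # wait returns the exit status of that job; remember any failure.
        wait "$pid" || status=1
    done
    return "$status"
}

# Usage sketch (run_sql_file is an assumed wrapper around psql -f):
#   run_sql_file() { psql -f "$1"; }
#   run_in_parallel run_sql_file indexes/*.sql && echo "indexes built"
```

Starting every file at once is the simplest choice; a real script might want to cap the number of simultaneous jobs to suit the database server.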

mz_road_level IS NOT NULL OR
mz_transit_level IS NOT NULL OR
mz_water_min_zoom IS NOT NULL;
SHARDING;
Member

WHERE SHARDING; is new to me. Is it a Postgres-ism, or related to the script changes in data/perform-sql-updates.sh? Might be worth an inline comment?

Member Author

It's nothing in PostgreSQL, sadly. It comes from a hack where I'm replacing SHARDING with the range of osm_id to shard over. I would have used a proper Jinja2 template, but that was going to introduce a lot of incidental complexity.

I'm not totally sure whether an inline comment would work there, or if it would have weird side-effects... I'll have to test and make sure.
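For readers unfamiliar with this kind of substitution, a minimal sketch of the idea follows. It is an assumption about the general shape of the hack, not the script from this PR: `apply_shard` is a hypothetical name, and the osm_id range shown in the usage comment is made up.

```shell
#!/bin/sh
# Hypothetical helper: reads SQL on stdin and writes it to stdout with every
# occurrence of the bare word SHARDING replaced by the given predicate.
# Assumes the predicate contains no "/", "&" or backslash, which sed would
# treat specially in the replacement text.
apply_shard() {
    sed "s/SHARDING/$1/g"
}

# Usage sketch (the range is illustrative only):
#   apply_shard 'osm_id >= 0 AND osm_id < 1000' < apply-planet_osm_point.sql | psql
```

Because sed only touches the matching word, other lines, including SQL comments, pass through unchanged.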

Member Author

Added comments in e684cd8, looks like they're correctly preserved through all the string mangling we're doing.

Member

Is it a pain to name it something like $SHARDING and replace on that? Just thinking it might make it clearer, although the comment is good anyway :)

# guide the distribution of jobs, so hopefully they end up mostly evenly sized.
for tbl in polygon line point; do
sql_script="apply-planet_osm_${tbl}.sql"
for pct in 25 50 75; do
Member

Is the machine we run this on always 4 cores?

Member Author

Ideally, it would be related to the number of cores on the PostgreSQL server, not the machine we're running the script on. However, there are a couple of things that make matching the number of statements to the range of queries hard:

  1. Doing this in shell script is possible, but makes me uncomfortable. I find that getting shell code correct is a lot harder than it looks. Going to fixed concurrency seemed like a good compromise. We can go to variable concurrency, but I'd want to switch all of this into Python, or something else that has less magic than shell.
  2. I'm not aware of any call in PostgreSQL to tell us how many cores the server has, which is the important number to match. We could pass that in as an argument, but again: complexity.

Overall, it seemed to me like a good compromise to get something working quickly. If we want it to be more readable and more robust, then I can take a look at doing that next week?
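To make the fixed-concurrency compromise concrete: the `for pct in 25 50 75` loop quoted above implies three boundary values and therefore four shards. A minimal sketch of turning boundaries into range predicates is below. How the boundary values are obtained is not shown in this thread, so they are simply assumed to be passed in, and `shard_predicates` is a hypothetical name. Passing more or fewer boundaries changes the shard count, which is one way a core count could eventually be taken as an argument.

```shell
#!/bin/sh
# Hypothetical helper: given a column name and N ascending boundary values,
# print N+1 non-overlapping range predicates, one per line.
# Assumes the column is never NULL, since NULL rows would match no shard.
shard_predicates() {
    column="$1"
    shift
    if [ "$#" -eq 0 ]; then
        # No boundaries means a single shard covering every row.
        printf '%s\n' "TRUE"
        return 0
    fi
    lower=""
    for upper in "$@"; do
        if [ -z "$lower" ]; then
            printf '%s\n' "$column < $upper"
        else
            printf '%s\n' "$column >= $lower AND $column < $upper"
        fi
        lower="$upper"
    done
    printf '%s\n' "$column >= $lower"
}

# Usage sketch (boundary values are made up):
#   shard_predicates osm_id 100 200 300
```

Each output line can then be substituted into the SQL template and run as its own UPDATE.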

Member

@nvkelso nvkelso Feb 15, 2019

I'm fine with hardcoding it now... just as long as we're getting good use out of the PostgreSQL server we generally run this on. If that's 4 cores, so be it! :)

Member

In the comment, please also note that this is configured for a 4-core PostgreSQL machine and would need to be adapted for other core counts?

Member Author

Good idea. Fixed in dbebede.

@zerebubuth zerebubuth changed the title from WIP: Faster "perform SQL updates" to Faster "perform SQL updates" on Feb 22, 2019
@zerebubuth
Member Author

I've tested this out, and it seems to work - the min_zoom and indexing step was down from 54h before to 20h on this one test. Please have a look and let me know what you think.

Member

@nvkelso nvkelso left a comment

One comment nit, otherwise LGTM

@nvkelso nvkelso added this to the v1.8.0 milestone Feb 22, 2019

@zerebubuth
Member Author

Is it a pain to name it something like $SHARDING and replace on that?

$SHARDING might be tricky (and fragile) since $ is a special character in sed. How about {{SHARDING}} in a061e45? Hopefully that makes us think "templating".
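A minimal sketch of what a `{{SHARDING}}` substitution could look like, with a guard against placeholders that were left unreplaced. This is an illustration under assumptions, not the code in a061e45: `render_shard_template` is a hypothetical name. In sed's default (basic) regular expressions the braces match literally, and unlike `$SHARDING` the placeholder is not expanded by the shell inside double quotes.

```shell
#!/bin/sh
# Hypothetical helper: reads a SQL template on stdin, replaces {{SHARDING}}
# with the given predicate, and fails if any "{{" or "}}" is left over.
# Assumes the predicate contains no "/", "&" or backslash.
render_shard_template() {
    rendered="$(sed "s/{{SHARDING}}/$1/g")" || return 1
    case "$rendered" in
        *"{{"*|*"}}"*)
            echo "unreplaced placeholder in template" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$rendered"
}

# Usage sketch (the range is illustrative only):
#   render_shard_template 'osm_id < 1000' < apply-planet_osm_line.sql | psql
```

The leftover check is optional, but it turns a misspelled placeholder into an immediate failure instead of a SQL syntax error partway through a long run.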

@zerebubuth zerebubuth merged commit 76377eb into master Feb 25, 2019
@zerebubuth zerebubuth deleted the zerebubuth/test-faster-sql-updates branch February 25, 2019 16:51