Faster "perform SQL updates" #1844

zerebubuth · 2019-02-15T19:18:27Z

There's a few things going on here:

Execute one UPDATE per table, rather than several. This should reduce the number of passes over the table, especially since the lines and polygons tables touched every row when setting the label placement point.
Shard the updates 4x over osm_id for the OSM tables. This helps a lot when the UPDATE is CPU-bound, rather than disk-bound, as it seems to be when we're calculating a lot of ST_PointOnSurface() and min zoom pl/pgsql functions.
Move index creation into one file per index, to be executed in parallel. PostgreSQL allows any number of index creations to run concurrently, and these can be very long-running. Also remove ANALYZE from apply-*.sql, so it can run after index creation.
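As a rough illustration of the "one file per index, executed in parallel" idea, here is a minimal sketch rather than the script from this PR. The `RUN_SQL` hook and the `run_all_parallel` function are hypothetical names, and the default assumes `psql` picks up its connection settings from the environment.

```shell
#!/bin/sh
set -eu

# RUN_SQL is a hypothetical hook naming the command that executes one SQL
# file. The default assumes psql reads connection settings from the
# environment (PGHOST, PGDATABASE, ...).
RUN_SQL="${RUN_SQL:-psql -v ON_ERROR_STOP=1 -f}"

# Start one background job per SQL file, then wait for all of them.
# Returns non-zero if any job failed.
run_all_parallel() {
    pids=""
    for sql_file in "$@"; do
        # Unquoted on purpose so the command and its flags split into words.
        $RUN_SQL "$sql_file" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

With the index statements split into files under some directory (say a hypothetical `indexes/`), `run_all_parallel indexes/*.sql` would start them all at once, and ANALYZE could then run after the function returns.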

nvkelso · 2019-02-15T20:07:52Z

data/apply-planet_osm_line.sql

-    mz_road_level IS NOT NULL OR
-    mz_transit_level IS NOT NULL OR
-    mz_water_min_zoom IS NOT NULL;
+    SHARDING;


WHERE SHARDING; is new to me. Is it a Postgres-ism, or related to the script changes in data/perform-sql-updates.sh? Might be worth an inline comment?

It's nothing in PostgreSQL, sadly. It comes from a hack where I'm replacing SHARDING with the range of osm_id to shard over. I would have used a proper Jinja2 template, but that was going to introduce a lot of incidental complexity.

I'm not totally sure whether an inline comment would work there, or if it would have weird side-effects... I'll have to test and make sure.

Added comments in e684cd8; it looks like they're correctly preserved through all the string mangling we're doing.

nvkelso · 2019-02-15T20:08:59Z

data/perform-sql-updates.sh

+# guide the distribution of jobs, so hopefully they end up mostly evenly sized.
+for tbl in polygon line point; do
+    sql_script="apply-planet_osm_${tbl}.sql"
+    for pct in 25 50 75; do


Is the machine we run this on always 4 cores?

Ideally, it would be related to the number of cores on the PostgreSQL server, not the machine we're running the script on. However, there's a couple of things that make matching the number of statements to the range of queries hard:

Doing this in shell script is possible, but makes me uncomfortable. I find that getting shell code correct is a lot harder than it looks. Going to fixed concurrency seemed like a good compromise. We can go to variable concurrency, but I'd want to switch all of this into Python, or something else that has less magic than shell.

I'm not aware of any call in PostgreSQL to tell us how many cores the server has, which is the important number to match. We could pass that in as an argument, but again: complexity.

Overall, it seemed to me like a good compromise to get something working quickly. If we want it to be more readable and more robust, then I can take a look at doing that next week?

I'm fine with hardcoding it now... just as long as we're getting good use out of the PostgreSQL server we generally run this on. If that's 4 cores, so be it! :)

In the comment, could you also note that this is configured for a 4-core PostgreSQL machine and would need to be adapted for other core counts?

Good idea. Fixed in dbebede.
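For readers following along, the 25/50/75 loop in the diff above implies three cut points and therefore four osm_id ranges. The sketch below shows one way that could look. It is not the code from the PR: `percentile_query` and `shard_predicates` are hypothetical helpers, and using `percentile_disc` to find the cut points is an assumption.

```shell
#!/bin/sh
set -eu

# Build the query for one cut point. percentile_disc is a PostgreSQL
# ordered-set aggregate; using it here is an assumption about how the cut
# points might be found. pct is expected to be two digits (25, 50, 75).
percentile_query() {
    table="$1" pct="$2"
    echo "SELECT percentile_disc(0.${pct}) WITHIN GROUP (ORDER BY osm_id) FROM ${table};"
}

# Given ascending osm_id cut points, print one WHERE predicate per shard,
# one per line. N cut points give N+1 non-overlapping shards.
shard_predicates() {
    if [ "$#" -eq 0 ]; then
        echo "TRUE"   # no cut points: a single shard covering every row
        return
    fi
    lower=""
    for cut in "$@"; do
        if [ -z "$lower" ]; then
            echo "osm_id < $cut"
        else
            echo "osm_id >= $lower AND osm_id < $cut"
        fi
        lower="$cut"
    done
    echo "osm_id >= $lower"
}

# Made-up cut points, purely for illustration.
shard_predicates 1000 2000 3000
```

Changing the degree of parallelism would then mean changing the list of percentiles, which is roughly what would need adapting for a server with a different core count.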

zerebubuth · 2019-02-22T11:20:23Z

I've tested this out, and it seems to work: the min_zoom and indexing step went down from 54h to 20h on this one test. Please have a look and let me know what you think.

nvkelso

One comment nit, otherwise LGTM

rmarianski · 2019-02-22T15:56:15Z

data/apply-planet_osm_line.sql

-    mz_road_level IS NOT NULL OR
-    mz_transit_level IS NOT NULL OR
-    mz_water_min_zoom IS NOT NULL;
+    SHARDING;


Is it a pain to name it something like $SHARDING and replace on that? Just thinking it might make it clearer, although the comment is good anyway :)

zerebubuth · 2019-02-22T17:51:25Z

Is it a pain to name it something like $SHARDING and replace on that?

$SHARDING might be tricky (and fragile) since $ is a special character in sed. How about {{SHARDING}} in a061e45? Hopefully that makes us think "templating".
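To make the trade-off concrete, here is a minimal sketch of a `{{SHARDING}}` substitution with sed. It is illustrative only: `render_shard` is a hypothetical helper, and the UPDATE statement is a made-up stand-in for the real contents of the apply-*.sql files.

```shell
#!/bin/sh
set -eu

# Print the SQL template read from stdin with every {{SHARDING}} marker
# replaced by the given predicate. In sed's default (basic) regex syntax
# '{' and '}' are literal, so the marker needs no escaping. The predicate
# must not contain '|', '&' or '\', which are special in this replacement.
render_shard() {
    sed "s|{{SHARDING}}|$1|g"
}

# Made-up template line, purely for illustration.
printf '%s\n' 'UPDATE planet_osm_line SET mz_road_level = 1 WHERE {{SHARDING}};' |
    render_shard 'osm_id >= 1000 AND osm_id < 2000'
```

By contrast, a `$SHARDING` marker inside a double-quoted sed expression would be expanded by the shell before sed ever sees it, which is one of the ways that spelling could go wrong.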

zerebubuth added 6 commits February 15, 2019 19:04

Add debugging timing output to SQL table updates.

1b0fa5f

Move indexes to their own separate parallel step, as they can all be done in parallel with each other.

7d6c738

Use single update statements for planet_osm_* tables.

ab47afd

Add 4x sharding over osm_id for updates to planet_osm_* tables.

890dad2

Remove debugging timing.

1e5d5fb

Remove more debugging. Fix leading whitespace and lack of semicolon on new indexes SQL.

8254225

nvkelso reviewed Feb 15, 2019

zerebubuth added 2 commits February 21, 2019 16:56

Remove debug: don't echo queries in perform SQL updates.

810a5b8

Add comments about what SHARDING means in the min zoom update scripts.

e684cd8

zerebubuth changed the title ~~WIP: Faster "perform SQL updates"~~ Faster "perform SQL updates" Feb 22, 2019

zerebubuth requested a review from rmarianski February 22, 2019 11:19

nvkelso approved these changes Feb 22, 2019

nvkelso added this to the v1.8.0 milestone Feb 22, 2019

Add comment about why perform-sql-updates.sh has 4-way fixed parallelism in the osm_id loops.

dbebede

rmarianski approved these changes Feb 22, 2019

Use syntax more likely to suggest templating / replacement in 'sharding' over osm_id for perform-sql-updates.sh.

a061e45

zerebubuth merged commit 76377eb into master Feb 25, 2019

zerebubuth deleted the zerebubuth/test-faster-sql-updates branch February 25, 2019 16:51
