Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schedule B slow sql (65 to 800+ times faster by add a new index: idx_sched_b_2019_2020_disb_dt_sub_id #4531

Closed
2 tasks
dzhang-fec opened this issue Jul 31, 2020 · 7 comments · Fixed by #4534
Closed
2 tasks
Assignees
Milestone

Comments

@dzhang-fec
Copy link
Contributor

dzhang-fec commented Jul 31, 2020

Completion criteria

  • Add new composite index
  • Remove any unneeded COALESCE indexes?

I am not sure where to put my long investigation with test cases. So I put the link here just in case.
related tickets:
#2791
#4491

Query from @lbeaufort
Problem: too slow (timeout by 5+mins)
Suggestion and analysis:
We suggest to add a new index to let the ORDER DESC to use index
CREATE INDEX idx_sched_b_2019_2020_disb_dt_sub_id
ON disclosure.fec_fitem_sched_b_2019_2020 USING btree
(disb_dt DESC NULLS FIRST, sub_id DESC NULLS FIRST)
TABLESPACE pg_default;

" -> Index Scan using idx_sched_b_2019_2020_disb_dt_sub_id on fec_fitem_sched_b_2019_2020 (cost=0.57..31662387.50 rows=70824809 width=1353)"
Result: 14mins to 0.6 sec. (800+ times faster)
Tests: new index is at the "dev", no new index at stage for your further index.
Note: we need add these indexes at all of partitions (23)/

SELECT *
FROM disclosure.fec_fitem_sched_b LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON disclosure.fec_fitem_sched_b.cmte_id = ofec_committee_history_mv_1.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2 ON disclosure.fec_fitem_sched_b.clean_recipient_cmte_id = ofec_committee_history_mv_2.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle
WHERE disclosure.fec_fitem_sched_b.two_year_transaction_period IN (2020)
AND disclosure.fec_fitem_sched_b.disb_dt >= '1/1/2019'
AND disclosure.fec_fitem_sched_b.disb_dt <= '12/31/2020'
ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC, disclosure.fec_fitem_sched_b.sub_id DESC
LIMIT 20

-------old: at stage--------------------------
execution time: Total query runtime: 14 min 44 secs.

image
Successfully run. Total query runtime: 14 min 44 secs.
20 rows affected.

Execution plan:
"Limit (cost=52247515.16..52247515.21 rows=20 width=2852)"
" -> Sort (cost=52247515.16..52424579.40 rows=70825696 width=2852)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Gather (cost=41824.15..50362868.86 rows=70825696 width=2852)"
" Workers Planned: 2"
" -> Hash Left Join (cost=40824.15..43279299.26 rows=29510707 width=2852)"
" Hash Cond: ((fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle) AND ((fec_fitem_sched_b.clean_recipient_cmte_id)::text = (ofec_committee_history_mv_2.committee_id)::text))"
" -> Hash Left Join (cost=20412.08..22462791.97 rows=29510707 width=2078)"
" Hash Cond: ((fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle) AND ((fec_fitem_sched_b.cmte_id)::text = (ofec_committee_history_mv_1.committee_id)::text))"
" -> Append (cost=0.00..7051870.60 rows=29510707 width=1324)"
" -> Parallel Seq Scan on fec_fitem_sched_b (cost=0.00..0.00 rows=1 width=5322)"
" Filter: ((disb_dt >= '2019-01-01 00:00:00'::timestamp without time zone) AND (disb_dt <= '2020-12-31 00:00:00'::timestamp without time zone) AND (two_year_transaction_period = '2020'::numeric))"
" -> Parallel Seq Scan on fec_fitem_sched_b_2019_2020 (cost=0.00..7051870.60 rows=29510706 width=1324)"
" Filter: ((disb_dt >= '2019-01-01 00:00:00'::timestamp without time zone) AND (disb_dt <= '2020-12-31 00:00:00'::timestamp without time zone) AND (two_year_transaction_period = '2020'::numeric))"
" -> Hash (cost=18569.55..18569.55 rows=16635 width=754)"
" -> Bitmap Heap Scan on ofec_committee_history_mv ofec_committee_history_mv_1 (cost=387.34..18569.55 rows=16635 width=754)"
" Recheck Cond: (cycle = '2020'::numeric)"
" -> Bitmap Index Scan on idx_ofec_committee_history_mv_cycle_committee_id (cost=0.00..383.18 rows=16635 width=0)"
" Index Cond: (cycle = '2020'::numeric)"
" -> Hash (cost=18569.55..18569.55 rows=16635 width=754)"
" -> Bitmap Heap Scan on ofec_committee_history_mv ofec_committee_history_mv_2 (cost=387.34..18569.55 rows=16635 width=754)"
" Recheck Cond: (cycle = '2020'::numeric)"
" -> Bitmap Index Scan on idx_ofec_committee_history_mv_cycle_committee_id (cost=0.00..383.18 rows=16635 width=0)"
" Index Cond: (cycle = '2020'::numeric)"

---new: at dev add a new index:------------------------------------------------------------------------------------------
---add a new index:
CREATE INDEX idx_sched_b_2019_2020_disb_dt_sub_id
ON disclosure.fec_fitem_sched_b_2019_2020 USING btree
(disb_dt DESC NULLS FIRST, sub_id DESC )
TABLESPACE pg_default;

---execution time: 600msec
image

---execution plan: go tto use the new index
"Limit (cost=1.43..79.60 rows=20 width=2879)"
" -> Nested Loop Left Join (cost=1.43..276842326.65 rows=70824810 width=2879)"
" -> Nested Loop Left Join (cost=1.01..154517950.13 rows=70824810 width=2106)"
" -> Merge Append (cost=0.59..32193573.60 rows=70824810 width=1353)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Sort (cost=0.01..0.02 rows=1 width=5322)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Seq Scan on fec_fitem_sched_b (cost=0.00..0.00 rows=1 width=5322)"
" Filter: ((disb_dt >= '2019-01-01 00:00:00'::timestamp without time zone) AND (disb_dt <= '2020-12-31 00:00:00'::timestamp without time zone) AND (two_year_transaction_period = '2020'::numeric))"
" -> Index Scan using idx_sched_b_2019_2020_disb_dt_sub_id on fec_fitem_sched_b_2019_2020 (cost=0.57..31662387.50 rows=70824809 width=1353)"
" Index Cond: ((disb_dt >= '2019-01-01 00:00:00'::timestamp without time zone) AND (disb_dt <= '2020-12-31 00:00:00'::timestamp without time zone))"
" Filter: (two_year_transaction_period = '2020'::numeric)"
" -> Index Scan using idx_ofec_committee_history_mv_cycle_committee_id on ofec_committee_history_mv ofec_committee_history_mv_1 (cost=0.42..1.72 rows=1 width=753)"
" Index Cond: ((fec_fitem_sched_b.two_year_transaction_period = cycle) AND (cycle = '2020'::numeric) AND ((fec_fitem_sched_b.cmte_id)::text = (committee_id)::text))"
" -> Index Scan using idx_ofec_committee_history_mv_cycle_committee_id on ofec_committee_history_mv ofec_committee_history_mv_2 (cost=0.42..1.72 rows=1 width=753)"
" Index Cond: ((fec_fitem_sched_b.two_year_transaction_period = cycle) AND (cycle = '2020'::numeric) AND ((fec_fitem_sched_b.clean_recipient_cmte_id)::text = (committee_id)::text))"

@dzhang-fec dzhang-fec changed the title schedule B slow sql schedule B slow sql (800+ times faster by add a new index Jul 31, 2020
@dzhang-fec
Copy link
Contributor Author

==========case 2: same solution: 3mins to 0.9sec (65 times faster) ==============
Slow query: from @lbeaufort
SELECT *
FROM disclosure.fec_fitem_sched_b LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON disclosure.fec_fitem_sched_b.cmte_id = ofec_committee_history_mv_1.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2 ON disclosure.fec_fitem_sched_b.clean_recipient_cmte_id = ofec_committee_history_mv_2.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle
WHERE disclosure.fec_fitem_sched_b.cmte_id IN ('C00694323')
AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (2020)
AND disclosure.fec_fitem_sched_b.image_num >= '202007159245096984'
AND disclosure.fec_fitem_sched_b.image_num <= '202007159249651462'
AND (coalesce(disclosure.fec_fitem_sched_b.disb_dt, CAST('9999-12-31' AS DATE)), disclosure.fec_fitem_sched_b.sub_id) < ('2020-06-30', 4072720201808827193)
ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC, disclosure.fec_fitem_sched_b.sub_id DESC
LIMIT 100

------------------================================without xhnage

execution 3+mins at stage
image

execution plan: not use index
"Limit (cost=2314541.23..2314541.48 rows=100 width=2852)"
" -> Sort (cost=2314541.23..2317002.51 rows=984515 width=2852)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Nested Loop Left Join (cost=21412.50..2276913.77 rows=984515 width=2852)"
" Join Filter: (((fec_fitem_sched_b.cmte_id)::text = (ofec_committee_history_mv_1.committee_id)::text) AND (fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle))"
" -> Gather (cost=21412.08..2259680.32 rows=984515 width=2078)"
" Workers Planned: 2"
" -> Hash Left Join (cost=20412.08..2160228.82 rows=410215 width=2078)"
" Hash Cond: ((fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle) AND ((fec_fitem_sched_b.clean_recipient_cmte_id)::text = (ofec_committee_history_mv_2.committee_id)::text))"
" -> Append (cost=0.00..1924480.70 rows=410215 width=1324)"
" -> Parallel Seq Scan on fec_fitem_sched_b (cost=0.00..0.00 rows=1 width=5322)"
" Filter: (((image_num)::text >= '202007159245096984'::text) AND ((image_num)::text <= '202007159249651462'::text) AND ((cmte_id)::text = 'C00694323'::text) AND (two_year_transaction_period = '2020'::numeric) AND (ROW(COALESCE(disb_dt, '9999-12-31 00:00:00'::timestamp without time zone), sub_id) < ROW('2020-06-30 00:00:00'::timestamp without time zone, '4072720201808827193'::numeric)))"
" -> Parallel Bitmap Heap Scan on fec_fitem_sched_b_2019_2020 (cost=415620.70..1924480.70 rows=410214 width=1324)"
" Recheck Cond: (((image_num)::text >= '202007159245096984'::text) AND ((image_num)::text <= '202007159249651462'::text) AND ((cmte_id)::text = 'C00694323'::text))"
" Filter: ((two_year_transaction_period = '2020'::numeric) AND (ROW(COALESCE(disb_dt, '9999-12-31 00:00:00'::timestamp without time zone), sub_id) < ROW('2020-06-30 00:00:00'::timestamp without time zone, '4072720201808827193'::numeric)))"
" -> BitmapAnd (cost=415620.70..415620.70 rows=992772 width=0)"
" -> Bitmap Index Scan on idx_sched_b_2019_2020_image_num_disb_amt_sub_id (cost=0.00..149156.38 rows=5672981 width=0)"
" Index Cond: (((image_num)::text >= '202007159245096984'::text) AND ((image_num)::text <= '202007159249651462'::text))"
" -> Bitmap Index Scan on idx_sched_b_2019_2020_cmte_id_disb_amt_sub_id (cost=0.00..265971.81 rows=12493766 width=0)"
" Index Cond: ((cmte_id)::text = 'C00694323'::text)"
" -> Hash (cost=18569.55..18569.55 rows=16635 width=754)"
" -> Bitmap Heap Scan on ofec_committee_history_mv ofec_committee_history_mv_2 (cost=387.34..18569.55 rows=16635 width=754)"
" Recheck Cond: (cycle = '2020'::numeric)"
" -> Bitmap Index Scan on idx_ofec_committee_history_mv_cycle_committee_id (cost=0.00..383.18 rows=16635 width=0)"
" Index Cond: (cycle = '2020'::numeric)"
" -> Materialize (cost=0.42..4.44 rows=1 width=754)"
" -> Index Scan using idx_ofec_committee_history_mv_cycle_committee_id on ofec_committee_history_mv ofec_committee_history_mv_1 (cost=0.42..4.44 rows=1 width=754)"
" Index Cond: ((cycle = '2020'::numeric) AND ((committee_id)::text = 'C00694323'::text))"

---------------------------------------------------After new index change====================
------execution time: 0.9sec
image

------execution plan: using the new index ( Index Scan using idx_sched_b_2019_2020_disb_dt_sub_id on fec_fitem_sched_b_2019_2020
"Limit (cost=1.43..3523.85 rows=100 width=2879)"
" -> Nested Loop Left Join (cost=1.43..34161306.39 rows=969823 width=2879)"
" -> Nested Loop Left Join (cost=1.01..32458755.48 rows=969823 width=2106)"
" Join Filter: (((fec_fitem_sched_b.cmte_id)::text = (ofec_committee_history_mv_1.committee_id)::text) AND (fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle))"
" -> Merge Append (cost=0.59..32441779.13 rows=969823 width=1353)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Sort (cost=0.01..0.02 rows=1 width=5322)"
" Sort Key: fec_fitem_sched_b.disb_dt DESC, fec_fitem_sched_b.sub_id DESC"
" -> Seq Scan on fec_fitem_sched_b (cost=0.00..0.00 rows=1 width=5322)"
" Filter: (((image_num)::text >= '202007159245096984'::text) AND ((image_num)::text <= '202007159249651462'::text) AND ((cmte_id)::text = 'C00694323'::text) AND (two_year_transaction_period = '2020'::numeric) AND (ROW(COALESCE(disb_dt, '9999-12-31 00:00:00'::timestamp without time zone), sub_id) < ROW('2020-06-30 00:00:00'::timestamp without time zone, '4072720201808827193'::numeric)))"
" -> Index Scan using idx_sched_b_2019_2020_disb_dt_sub_id on fec_fitem_sched_b_2019_2020 (cost=0.57..32434505.44 rows=969822 width=1353)"
" Filter: (((image_num)::text >= '202007159245096984'::text) AND ((image_num)::text <= '202007159249651462'::text) AND ((cmte_id)::text = 'C00694323'::text) AND (two_year_transaction_period = '2020'::numeric) AND (ROW(COALESCE(disb_dt, '9999-12-31 00:00:00'::timestamp without time zone), sub_id) < ROW('2020-06-30 00:00:00'::timestamp without time zone, '4072720201808827193'::numeric)))"
" -> Materialize (cost=0.42..4.44 rows=1 width=753)"
" -> Index Scan using idx_ofec_committee_history_mv_cycle_committee_id on ofec_committee_history_mv ofec_committee_history_mv_1 (cost=0.42..4.44 rows=1 width=753)"
" Index Cond: ((cycle = '2020'::numeric) AND ((committee_id)::text = 'C00694323'::text))"
" -> Index Scan using idx_ofec_committee_history_mv_cycle_committee_id on ofec_committee_history_mv ofec_committee_history_mv_2 (cost=0.42..1.75 rows=1 width=753)"
" Index Cond: ((fec_fitem_sched_b.two_year_transaction_period = cycle) AND (cycle = '2020'::numeric) AND ((fec_fitem_sched_b.clean_recipient_cmte_id)::text = (committee_id)::text))"

@dzhang-fec dzhang-fec changed the title schedule B slow sql (800+ times faster by add a new index schedule B slow sql (65 to 800+ times faster by add a new index: idx_sched_b_2019_2020_disb_dt_sub_id Jul 31, 2020
@rfultz
Copy link
Contributor

rfultz commented Jul 31, 2020

Related to #4491

@lbeaufort
Copy link
Member

Thanks for your work on this, @dzhang-fec, this looks good so far. There is one query that seems slower than before that might need another index:

SELECT *
FROM disclosure.fec_fitem_sched_b LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON disclosure.fec_fitem_sched_b.cmte_id = ofec_committee_history_mv_1.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2 ON disclosure.fec_fitem_sched_b.clean_recipient_cmte_id = ofec_committee_history_mv_2.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle 
WHERE disclosure.fec_fitem_sched_b.cmte_id IN ('C00401224') 
AND disclosure.fec_fitem_sched_b.cmte_tp IN ('P') 
AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (2020) 
AND disclosure.fec_fitem_sched_b.disb_dt >= '1/1/19' 
AND disclosure.fec_fitem_sched_b.disb_dt <= '12/31/20'
ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC, disclosure.fec_fitem_sched_b.sub_id DESC 
 LIMIT 30

Examples: https://docs.google.com/spreadsheets/d/1J5QK4mw7HXbMnaDV_ClCGWEgbbI3VZZhXDqmjIFwBqI/edit#gid=0

@dzhang-fec
Copy link
Contributor Author

@lbeaufort the above sql slowness issue is different. We need to add another index. Can I deal this new issue at #4533


AND disclosure.fec_fitem_sched_b.cmte_tp IN ('P') 
AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (2020) 
AND disclosure.fec_fitem_sched_b.disb_dt >= '1/1/19' 
AND disclosure.fec_fitem_sched_b.disb_dt <= '12/31/20'
ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC, disclosure.fec_fitem_sched_b.sub_id DESC 
 LIMIT 30

Examples: https://docs.google.com/spreadsheets/d/1J5QK4mw7HXbMnaDV_ClCGWEgb

@lbeaufort the above sql slowness issue is different. We need to add another index. Can I deal this new issue at #4533

dzhang-fec added a commit that referenced this issue Aug 4, 2020
…: idx_sched_b__disb_dt_sub_id #4531  idx_sched_b_cmte_id_cmte_tp_disb_dt_sub_id index added.  #4533

Fix schedule B slow sql (65 to 800+ times faster by add 2 new indexes: idx_sched_b__disb_dt_sub_id #4531
idx_sched_b_cmte_id_cmte_tp_disb_dt_sub_id index added.  #4533
@lbeaufort
Copy link
Member

@

@lbeaufort the above sql slowness issue is different. We need to add another index. Can I deal this new issue at #4533


AND disclosure.fec_fitem_sched_b.cmte_tp IN ('P') 
AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (2020) 
AND disclosure.fec_fitem_sched_b.disb_dt >= '1/1/19' 
AND disclosure.fec_fitem_sched_b.disb_dt <= '12/31/20'
ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC, disclosure.fec_fitem_sched_b.sub_id DESC 
 LIMIT 30

Examples: https://docs.google.com/spreadsheets/d/1J5QK4mw7HXbMnaDV_ClCGWEgb

@lbeaufort the above sql slowness issue is different. We need to add another index. Can I deal this new issue at #4533

@dzhang-fec a new ticket is fine with me! I think after that we can monitor in production and add as needed. Is the best course of action to add a new migration file for the composite index in this issue, and removing any unneeded COALESCE indexes, or do you want a separate ticket for removing indexes?

@lbeaufort
Copy link
Member

@dzhang-fec I see you're way ahead of me with this PR #4534 thank you! My only remaining question then is whether we should remove any of the COALESCE indexes in place. Thanks again!

@lbeaufort
Copy link
Member

@dzhang-fec I see you're way ahead of me with this PR #4534 thank you! My only remaining question then is whether we should remove any of the COALESCE indexes in place. Thanks again!

@dzhang-fec and I discussed this and we're going to leave the indexes in place for now out of caution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants