[fix](load) fix broker load progress due to retry #42959

Merged (3 commits) on Nov 8, 2024

kaijchen
Contributor

Proposed changes

Currently, when a broker load is retried, the retry uses a new load_id but keeps the same job_id.
The total_scan_nums value in the load progress is accumulated per job_id.
As a result, each retry adds its scan count on top of the previous attempt's, so total_scan_nums becomes a multiple of the actual number of scans.

For example, suppose a broker load with 10 instances is retried once:

initial (0/10) 0% -> retry (0/20) 0% -> finish (10/20) 50%
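
The arithmetic in the example above can be reproduced with a small toy model. This is only a sketch: the `ProgressTracker` class, its method names, and the `reset_on_new_load_id` switch are hypothetical and are not taken from the Doris code base. The conversation does not describe how the PR fixes the problem, so the reset variant shown here is just one plausible way to keep the totals from stacking up, assuming each retry registers its scan count again under the same job_id.

```python
from collections import defaultdict


class ProgressTracker:
    """Toy model of per-job load progress (hypothetical, not a Doris API)."""

    def __init__(self, reset_on_new_load_id: bool) -> None:
        # When True, counters are cleared whenever a job registers a new load_id.
        self.reset_on_new_load_id = reset_on_new_load_id
        self.total = defaultdict(int)     # job_id -> total_scan_nums
        self.finished = defaultdict(int)  # job_id -> finished scan count
        self.current_load = {}            # job_id -> load_id of the latest attempt

    def register(self, job_id: int, load_id: str, scan_nums: int) -> None:
        """Record the scan count of one attempt under its job_id."""
        is_new_attempt = self.current_load.get(job_id) != load_id
        if self.reset_on_new_load_id and is_new_attempt:
            self.total[job_id] = 0
            self.finished[job_id] = 0
        self.current_load[job_id] = load_id
        self.total[job_id] += scan_nums

    def finish_scans(self, job_id: int, count: int) -> None:
        """Mark `count` scans of the job as finished."""
        self.finished[job_id] += count

    def percent(self, job_id: int) -> int:
        """Return the progress of the job as an integer percentage."""
        total = self.total[job_id]
        return 0 if total == 0 else 100 * self.finished[job_id] // total


def run_with_one_retry(tracker: ProgressTracker) -> int:
    """Simulate a 10-instance load that is retried once and then finishes."""
    tracker.register(job_id=1, load_id="attempt-1", scan_nums=10)  # initial (0/10)
    tracker.register(job_id=1, load_id="attempt-2", scan_nums=10)  # retry
    tracker.finish_scans(job_id=1, count=10)                       # finish
    return tracker.percent(1)


if __name__ == "__main__":
    print("accumulate by job_id:", run_with_one_retry(ProgressTracker(False)), "%")
    print("reset on new load_id:", run_with_one_retry(ProgressTracker(True)), "%")
```

With plain accumulation by job_id the finished job ends at 10/20, which matches the 50% in the example. With the hypothetical reset it ends at 10/10.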

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@kaijchen
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 42561 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c9ec9ecdf9ddbdcbb851548e1bc4dca47ad2c4d7, data reload: false

------ Round 1 ----------------------------------
q1	17869	7783	7532	7532
q2	2065	183	159	159
q3	10659	1126	1191	1126
q4	10361	919	942	919
q5	7799	3228	3179	3179
q6	270	145	145	145
q7	1049	600	605	600
q8	9356	2091	2104	2091
q9	6864	6605	6654	6605
q10	7123	2474	2486	2474
q11	455	256	257	256
q12	439	219	214	214
q13	17786	3002	3031	3002
q14	244	208	210	208
q15	567	508	521	508
q16	650	588	594	588
q17	1018	585	579	579
q18	7413	6749	6848	6749
q19	1371	1133	1075	1075
q20	486	181	182	181
q21	4154	3366	3392	3366
q22	1123	1028	1005	1005
Total cold run time: 109121 ms
Total hot run time: 42561 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7556	7445	8028	7445
q2	353	239	235	235
q3	3147	2985	3063	2985
q4	2122	1795	1865	1795
q5	5804	5858	5928	5858
q6	255	143	138	138
q7	2317	1798	1789	1789
q8	3621	3624	3751	3624
q9	9159	9066	9063	9063
q10	3704	3625	3640	3625
q11	619	495	491	491
q12	856	625	585	585
q13	9088	3161	3213	3161
q14	324	283	283	283
q15	578	529	520	520
q16	694	655	642	642
q17	1976	1735	1706	1706
q18	8403	7869	7562	7562
q19	1901	1851	1648	1648
q20	2166	1855	1872	1855
q21	5777	5650	5494	5494
q22	1137	1080	1081	1080
Total cold run time: 71557 ms
Total hot run time: 61584 ms

@doris-robot

TPC-DS: Total hot run time: 196639 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c9ec9ecdf9ddbdcbb851548e1bc4dca47ad2c4d7, data reload: false

query1	1280	1004	967	967
query2	6236	2035	2016	2016
query3	11461	4802	4707	4707
query4	34030	23738	23757	23738
query5	4830	445	431	431
query6	267	170	166	166
query7	3989	287	295	287
query8	303	216	218	216
query9	9384	2624	2614	2614
query10	503	250	253	250
query11	18199	15374	15440	15374
query12	157	102	100	100
query13	1567	417	409	409
query14	8788	7624	7103	7103
query15	260	170	177	170
query16	7893	455	450	450
query17	1464	564	559	559
query18	2081	302	287	287
query19	273	140	147	140
query20	125	110	112	110
query21	206	103	107	103
query22	4877	4543	4299	4299
query23	34896	33969	33914	33914
query24	7963	2821	2797	2797
query25	562	373	364	364
query26	719	153	155	153
query27	2047	274	278	274
query28	6500	2412	2403	2403
query29	727	404	415	404
query30	253	156	163	156
query31	1017	772	817	772
query32	91	54	56	54
query33	600	262	266	262
query34	919	489	508	489
query35	999	907	878	878
query36	1068	959	936	936
query37	123	73	71	71
query38	4385	4302	4282	4282
query39	1457	1420	1401	1401
query40	198	97	98	97
query41	47	45	47	45
query42	107	97	98	97
query43	527	476	471	471
query44	1151	793	797	793
query45	174	165	160	160
query46	1145	693	687	687
query47	1943	1877	1842	1842
query48	434	317	324	317
query49	903	386	404	386
query50	875	396	392	392
query51	7115	7079	7067	7067
query52	104	91	94	91
query53	254	175	179	175
query54	701	391	408	391
query55	74	75	78	75
query56	255	249	239	239
query57	1259	1207	1190	1190
query58	220	214	211	211
query59	3136	3026	2985	2985
query60	275	250	249	249
query61	121	119	119	119
query62	823	673	666	666
query63	216	186	180	180
query64	3679	722	698	698
query65	3283	3161	3260	3161
query66	813	311	312	311
query67	15966	15798	15607	15607
query68	4392	555	530	530
query69	438	262	257	257
query70	1221	1068	1121	1068
query71	314	256	249	249
query72	6358	3960	3955	3955
query73	751	353	351	351
query74	10038	9055	8939	8939
query75	3328	2651	2730	2651
query76	2331	1056	1042	1042
query77	383	277	263	263
query78	10645	9533	9512	9512
query79	1495	584	601	584
query80	1103	414	420	414
query81	570	238	245	238
query82	1019	117	112	112
query83	235	137	133	133
query84	226	67	70	67
query85	1265	309	285	285
query86	383	283	287	283
query87	4718	4886	4603	4603
query88	3177	2286	2155	2155
query89	412	300	278	278
query90	1973	182	181	181
query91	129	104	103	103
query92	58	47	47	47
query93	1821	536	519	519
query94	911	288	290	288
query95	352	242	245	242
query96	624	286	277	277
query97	2886	2704	2705	2704
query98	224	194	190	190
query99	1647	1320	1311	1311
Total cold run time: 293724 ms
Total hot run time: 196639 ms

@doris-robot

ClickBench: Total hot run time: 33.31 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c9ec9ecdf9ddbdcbb851548e1bc4dca47ad2c4d7, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.03
query3	0.23	0.06	0.06
query4	1.64	0.11	0.10
query5	0.42	0.40	0.40
query6	1.14	0.64	0.66
query7	0.02	0.02	0.01
query8	0.04	0.03	0.03
query9	0.56	0.49	0.48
query10	0.56	0.55	0.55
query11	0.14	0.10	0.10
query12	0.15	0.11	0.12
query13	0.60	0.61	0.60
query14	2.70	2.73	2.70
query15	0.90	0.82	0.83
query16	0.37	0.38	0.39
query17	1.07	1.05	1.04
query18	0.20	0.20	0.20
query19	1.97	1.88	1.93
query20	0.02	0.01	0.01
query21	15.37	0.59	0.59
query22	2.71	2.47	2.42
query23	16.94	0.90	0.80
query24	3.36	1.28	1.84
query25	0.19	0.17	0.06
query26	0.76	0.14	0.14
query27	0.04	0.05	0.04
query28	9.80	1.10	1.08
query29	12.54	3.22	3.23
query30	0.25	0.06	0.06
query31	2.87	0.38	0.37
query32	3.29	0.46	0.45
query33	2.99	2.97	3.03
query34	17.04	4.44	4.43
query35	4.51	4.52	4.44
query36	0.65	0.48	0.49
query37	0.08	0.06	0.05
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.15	0.13	0.14
query41	0.08	0.02	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.61 s
Total hot run time: 33.31 s

@kaijchen
Contributor Author

kaijchen commented Nov 6, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 41285 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c75cc6bb67211c7ad024d85a9625a90a487540e2, data reload: false

------ Round 1 ----------------------------------
q1	17581	7530	7325	7325
q2	2041	163	161	161
q3	10559	1111	1173	1111
q4	10228	852	791	791
q5	7733	3020	3048	3020
q6	237	145	143	143
q7	1015	597	595	595
q8	9366	1934	1989	1934
q9	6554	6424	6478	6424
q10	7050	2422	2394	2394
q11	452	249	246	246
q12	405	210	207	207
q13	17768	3016	3010	3010
q14	235	206	217	206
q15	568	537	498	498
q16	640	587	571	571
q17	962	576	551	551
q18	7442	6764	6761	6761
q19	1329	991	990	990
q20	478	180	181	180
q21	4071	3223	3169	3169
q22	1120	998	1011	998
Total cold run time: 107834 ms
Total hot run time: 41285 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7314	7235	7254	7235
q2	328	225	225	225
q3	2919	2811	2819	2811
q4	1921	1755	1729	1729
q5	5410	5452	5519	5452
q6	217	137	136	136
q7	2117	1707	1755	1707
q8	3248	3385	3379	3379
q9	8483	8535	8576	8535
q10	3530	3437	3405	3405
q11	591	486	506	486
q12	763	554	577	554
q13	8166	2971	3023	2971
q14	296	262	269	262
q15	563	505	511	505
q16	676	642	631	631
q17	1848	1610	1568	1568
q18	7854	7477	7483	7477
q19	1674	1610	1622	1610
q20	2054	1795	1815	1795
q21	5412	5215	5179	5179
q22	1128	999	1013	999
Total cold run time: 66512 ms
Total hot run time: 58651 ms

@doris-robot

TPC-DS: Total hot run time: 190695 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c75cc6bb67211c7ad024d85a9625a90a487540e2, data reload: false

query1	971	365	366	365
query2	6521	2030	2013	2013
query3	6778	213	209	209
query4	33938	23469	23447	23447
query5	4331	441	427	427
query6	259	172	168	168
query7	4605	295	298	295
query8	297	220	224	220
query9	9758	2617	2637	2617
query10	500	246	250	246
query11	18255	15317	15164	15164
query12	147	103	97	97
query13	1647	410	431	410
query14	10061	6485	6908	6485
query15	248	173	185	173
query16	8072	440	439	439
query17	1647	576	546	546
query18	2141	292	288	288
query19	371	144	143	143
query20	114	109	111	109
query21	208	100	102	100
query22	4695	4430	4555	4430
query23	35074	34022	34732	34022
query24	11564	2751	2739	2739
query25	695	409	402	402
query26	1791	159	163	159
query27	2809	269	278	269
query28	7910	2454	2419	2419
query29	1047	421	429	421
query30	324	167	166	166
query31	1029	825	827	825
query32	96	56	58	56
query33	766	276	274	274
query34	1244	512	520	512
query35	900	760	717	717
query36	1109	944	941	941
query37	145	74	73	73
query38	4493	4225	4245	4225
query39	1530	1423	1413	1413
query40	290	103	100	100
query41	49	47	49	47
query42	111	101	95	95
query43	530	481	479	479
query44	1203	802	809	802
query45	184	174	167	167
query46	1123	683	692	683
query47	1939	1826	1850	1826
query48	428	306	330	306
query49	1209	401	397	397
query50	807	371	385	371
query51	7321	7149	7219	7149
query52	98	88	89	88
query53	260	181	176	176
query54	1168	415	417	415
query55	76	77	80	77
query56	252	256	245	245
query57	1337	1223	1161	1161
query58	234	230	210	210
query59	3127	3045	2943	2943
query60	271	238	247	238
query61	106	103	100	100
query62	839	665	679	665
query63	208	186	184	184
query64	5128	658	603	603
query65	3256	3214	3197	3197
query66	1220	315	311	311
query67	15996	15765	15762	15762
query68	5021	560	538	538
query69	424	251	250	250
query70	1237	1144	1141	1141
query71	379	255	240	240
query72	6367	3998	3976	3976
query73	763	362	373	362
query74	10339	9064	9249	9064
query75	3452	2697	2654	2654
query76	2951	1106	1024	1024
query77	399	264	263	263
query78	10471	9386	9439	9386
query79	1716	587	594	587
query80	1035	424	401	401
query81	543	238	236	236
query82	952	114	125	114
query83	207	138	138	138
query84	232	66	65	65
query85	1225	312	298	298
query86	365	308	297	297
query87	4825	4677	4564	4564
query88	3289	2190	2132	2132
query89	382	290	289	289
query90	1944	185	181	181
query91	133	100	102	100
query92	57	48	49	48
query93	1278	540	536	536
query94	755	286	281	281
query95	341	243	245	243
query96	601	274	278	274
query97	2868	2699	2719	2699
query98	203	189	194	189
query99	1534	1304	1329	1304
Total cold run time: 303148 ms
Total hot run time: 190695 ms

@doris-robot

ClickBench: Total hot run time: 32.63 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c75cc6bb67211c7ad024d85a9625a90a487540e2, data reload: false

query1	0.04	0.03	0.04
query2	0.07	0.03	0.03
query3	0.25	0.06	0.06
query4	1.66	0.10	0.10
query5	0.41	0.42	0.41
query6	1.16	0.65	0.65
query7	0.01	0.02	0.02
query8	0.04	0.03	0.05
query9	0.56	0.50	0.49
query10	0.55	0.54	0.56
query11	0.14	0.11	0.10
query12	0.14	0.12	0.11
query13	0.60	0.59	0.59
query14	2.71	2.71	2.77
query15	0.89	0.82	0.82
query16	0.39	0.39	0.38
query17	1.08	1.06	1.02
query18	0.22	0.22	0.23
query19	1.97	1.86	2.05
query20	0.01	0.02	0.01
query21	15.37	0.61	0.58
query22	2.72	2.94	1.98
query23	16.86	1.13	0.78
query24	3.26	1.87	1.00
query25	0.19	0.04	0.05
query26	0.65	0.14	0.14
query27	0.05	0.05	0.05
query28	10.07	1.09	1.08
query29	12.54	3.18	3.18
query30	0.25	0.06	0.06
query31	2.89	0.39	0.38
query32	3.26	0.46	0.46
query33	2.97	3.01	3.06
query34	17.03	4.47	4.48
query35	4.48	4.48	4.46
query36	0.67	0.48	0.48
query37	0.09	0.06	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.02
query40	0.16	0.12	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.62 s
Total hot run time: 32.63 s

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

Contributor

github-actions bot commented Nov 8, 2024

PR approved by at least one committer and no changes requested.

@github-actions bot added the approved (indicates a PR has been approved by one committer) and reviewed labels on Nov 8, 2024
Contributor

github-actions bot commented Nov 8, 2024

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit 3e142d8 into apache:master Nov 8, 2024
26 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 8, 2024
## Proposed changes

Currently, when retrying a broker load, it will use a different
`load_id` with the same `job_id`.
The `total_scan_nums` in progress is accumulated by `job_id`.
This will cause the `total_scan_nums` progress to be multiple of the
actual scan nums.

For example, suppose a 10 instance broker load gets retried:

```
initial (0/10) 0% -> retry (0/20) 0% -> finish (10/20) 50%
```
dataroaring pushed a commit that referenced this pull request Nov 10, 2024
Cherry-picked from #42959

Co-authored-by: Kaijie Chen <ckj@apache.org>
kaijchen added a commit to kaijchen/doris that referenced this pull request Nov 18, 2024
Currently, when retrying a broker load, it will use a different
`load_id` with the same `job_id`.
The `total_scan_nums` in progress is accumulated by `job_id`.
This will cause the `total_scan_nums` progress to be multiple of the
actual scan nums.

For example, suppose a 10 instance broker load gets retried:

```
initial (0/10) 0% -> retry (0/20) 0% -> finish (10/20) 50%
```
Labels: approved (indicates a PR has been approved by one committer), dev/2.1.8-merged, dev/3.0.3-merged, reviewed
4 participants