Fix race-conditions in IdealStateGroupCommit #14214

dinoocch · 2024-10-12T00:28:57Z

This attempts to fix two issues I believe are introduced in #13976
I think both can cause a table's idealstate to be updated into an unintended state.

We luckily found this issue very quickly when upgrading thanks to consistent push failures.

My understanding of the code is that it intends to:

Read the idealstate once for a (potential) batch of updates
Apply all the outstanding updates to the znode
Write to zookeeper once for the whole batch

Issue 1 - Missing updates

Previously the code calculated the idealstate value and blindly set it in the update. Particularly, if the idealstate is updated between the first and second reads you may unintentionally clobber the update.

// In commit, any -> updatedIdealState is used
updateIdealState(helixManager, resourceName, anyIdealState -> finalUpdatedIdealState, retryPolicy, noChangeOk);

// Then in updateIdealState we read the znode again and use the version of that read in the update
IdealState idealState = dataAccessor.getProperty(idealStateKey);
...
if (dataAccessor.getBaseDataAccessor()
                  .set(idealStateKey.getPath(), updatedZNRecord, idealState.getRecord().getVersion(),
                      AccessOption.PERSISTENT))

Issue 2 - Typo in use of `resourceName` instead of `mergedResourceName`

We did not actually experience this, but in the original code:

updateIdealState(helixManager, resourceName, anyIdealState -> finalUpdatedIdealState,
              retryPolicy, noChangeOk);

I believe mergedResourceName should actually be used instead on the off-chance that we picked up some outstanding update before processing ours.

Luckily I think the risk of running into it is practically reduced by having 100 queues.

codecov-commenter · 2024-10-12T01:22:26Z

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 63.85%. Comparing base (59551e4) to head (83ff864).
Report is 1177 commits behind head on master.

Files with missing lines	Patch %	Lines
...inot/common/utils/helix/IdealStateGroupCommit.java	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14214      +/-   ##
============================================
+ Coverage     61.75%   63.85%   +2.09%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2623     +187     
  Lines        133233   144420   +11187     
  Branches      20636    22099    +1463     
============================================
+ Hits          82274    92215    +9941     
- Misses        44911    45398     +487     
- Partials       6048     6807     +759

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.81% <86.66%> (+2.10%)`	⬆️
java-21	`63.69% <86.66%> (+2.06%)`	⬆️
skip-bytebuffers-false	`63.84% <86.66%> (+2.09%)`	⬆️
skip-bytebuffers-true	`63.67% <86.66%> (+35.94%)`	⬆️
temurin	`63.85% <86.66%> (+2.09%)`	⬆️
unittests	`63.84% <86.66%> (+2.10%)`	⬆️
unittests1	`55.54% <0.00%> (+8.65%)`	⬆️
unittests2	`34.35% <86.66%> (+6.62%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jasperjiaguo

lgtm

jasperjiaguo · 2024-10-12T01:12:16Z

pinot-common/src/main/java/org/apache/pinot/common/utils/helix/IdealStateGroupCommit.java

@@ -124,34 +124,27 @@ public IdealState commit(HelixManager helixManager, String resourceName,
          Entry first = queue._pending.poll();
          processed.add(first);
          String mergedResourceName = first._resourceName;


Not quite related to the fix, but I think
Queue queue = getQueue(resourceName);
so queue should already only has resourceName?
Why do we pull mergedResourceName again?

Not quite --

private Queue getQueue(String resourceName) { return _queues[(resourceName.hashCode() & Integer.MAX_VALUE) % _queues.length]; }

So just it's a bit likely the queue only has 1 resource but not necessarily

jasperjiaguo · 2024-10-12T01:30:18Z

pinot-common/src/main/java/org/apache/pinot/common/utils/helix/IdealStateGroupCommit.java

+                  it.remove();
+                }
+                return updatedIdealState;
+              },
              retryPolicy, noChangeOk);


Not quite related to the fix, but if some max attempt failure happens, we would lost track of the series of IS updates removed from the queue?

If I understand the code+fix correctly,

We could have multiple threads adding different segments to the idealstate.

This case is fixed in line 142 by passing updatedIdealState to the updater, so that all updates are applied cumulatively.

The first thread that enters this logic may end updating all of the segments that other threads have added to the queue.

The other threads are busy-waiting, and will find the queue empty when their turn comes (i.e. they set queue._running to themselves and get to examine the queue), and will return SUCCESS to the caller (irrespective of whether the updating thread has succeeded or not).

Is my understanding right? If so, we should add a status of some sort to each entry, and have only the thread that enters the entry remove it, so that all threads can return errors. Otherwise, the inbound call (a segment push) that caused the first thread to happen will think that it could not push the segment, and a retry may succeed pushing that segment. The other segment pushers will think that their push succeeded and may never retry.

@dinoocch or @jasperjiaguo can you validate this?

In fact, things may get more interesting if the retries start to get added as new entries ... (can that happen?)

@mcvsubbu do you mean in the case of an exception writing to zk?

I'm not fully sure of the intent, but the current code has the following behavior:

The executing thread will throw the exception

Any grouped executions would return null

Particularly, the code uses Entry._sent to indicate an entry was already processed and this is the condition for the while loop. So I believe the worst case is that we might either (a) run one extra iteration of the processing or (b) wait 10ms.

We do lose the exception though in all other threads (I don't know if that is an issue or not though/if null is enough to imply an error).

Same question as @jasperjiaguo. If the retry eventually fails, the algorithm would have removed the changes from the queue and we would lose them. Perhaps we need to only remove from queue once we know that the update (or retry) has been successful.

Jackie-Jiang · 2024-10-14T18:25:45Z

Thanks for debugging this! Since this piece of code is borrowed from Helix GroupCommit.java, can you help also check if Helix has the same issue? This might explain apache/helix#2309

dinoocch · 2024-10-14T20:28:23Z

Thanks for debugging this! Since this piece of code is borrowed from Helix GroupCommit.java, can you help also check if Helix has the same issue? This might explain apache/helix#2309

I think the helix code linked in the original pr perhaps has a different, related issue --

https://github.com/apache/helix/blob/5b1a036bc11a893980ece40872feead3472f98d9/helix-core/src/main/java/org/apache/helix/GroupCommit.java#L158

success = accessor.set(mergedKey, merged, options);

I guess this is most likely using ZkBaseDataAccessor, which set is:

@Override
public boolean set(String path, T record, int options) {
  return set(path, record, -1, options);
}

-1 implies to ignore the version, so I do think this could cause the behavior of lost updates.

There's also this class HelixGroupCommit

This seems to do everything correctly so far as I can tell.

Since this class is much more recently modified, I'm assuming it is the one that is used in Helix? But don't know really without investigating.

xiangfu0 · 2024-10-14T21:44:20Z

pinot-common/src/main/java/org/apache/pinot/common/utils/helix/IdealStateGroupCommit.java

-          }
-          IdealState finalUpdatedIdealState = updatedIdealState;
-          updateIdealState(helixManager, resourceName, anyIdealState -> finalUpdatedIdealState,
+          updateIdealState(helixManager, mergedResourceName, idealState -> {


Good catch!
Can you try to update the IdealStateGroupCommitTest.java test file and see if it passes.

/** * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information * regarding copyright ownership. The ASF licenses this file * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, * software distributed under the License is distributed on an * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY * KIND, either express or implied. See the License for the * specific language governing permissions and limitations * under the License. */ package org.apache.pinot.controller.helix; import java.util.ArrayList; import java.util.List; import java.util.Random; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.function.Function; import org.apache.helix.HelixManager; import org.apache.helix.model.IdealState; import org.apache.pinot.common.metrics.ControllerMeter; import org.apache.pinot.common.metrics.ControllerMetrics; import org.apache.pinot.common.utils.helix.HelixHelper; import org.apache.pinot.common.utils.helix.IdealStateGroupCommit; import org.apache.pinot.spi.utils.retry.RetryPolicies; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.testng.Assert; import org.testng.annotations.AfterClass; import org.testng.annotations.BeforeClass; import org.testng.annotations.Test; public class IdealStateGroupCommitTest { private static final Logger LOGGER = LoggerFactory.getLogger(IdealStateGroupCommit.class); private static final ControllerTest TEST_INSTANCE = ControllerTest.getInstance(); private static final String TABLE_NAME = "potato_OFFLINE"; private static final int NUM_PROCESSORS = 10; private static final int NUM_UPDATES = 2400; @BeforeClass public void setUp() throws Exception { TEST_INSTANCE.setupSharedStateAndValidate(); IdealState idealState = new IdealState(TABLE_NAME); idealState.setStateModelDefRef("OnlineOffline"); idealState.setRebalanceMode(IdealState.RebalanceMode.CUSTOMIZED); idealState.setReplicas("1"); idealState.setNumPartitions(0); TEST_INSTANCE.getHelixAdmin() .addResource(TEST_INSTANCE.getHelixClusterName(), TABLE_NAME, idealState); } @AfterClass public void tearDown() { TEST_INSTANCE.cleanup(); } @Test public void testGroupCommit() throws InterruptedException { List<IdealStateGroupCommit> groupCommitList = new ArrayList<>(); for (int i = 0; i < NUM_PROCESSORS; i++) { groupCommitList.add(new IdealStateGroupCommit()); } ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(400); for (int i = 0; i < NUM_UPDATES; i++) { Runnable runnable = new IdealStateUpdater(TEST_INSTANCE.getHelixManager(), groupCommitList.get(new Random().nextInt(NUM_PROCESSORS)), TABLE_NAME, i); newFixedThreadPool.submit(runnable); } IdealState idealState = HelixHelper.getTableIdealState(TEST_INSTANCE.getHelixManager(), TABLE_NAME); while (idealState.getNumPartitions() < NUM_UPDATES) { Thread.sleep(500); idealState = HelixHelper.getTableIdealState(TEST_INSTANCE.getHelixManager(), TABLE_NAME); System.out.println("idealState.getNumPartitions() = " + idealState.getNumPartitions()); } Assert.assertEquals(idealState.getNumPartitions(), NUM_UPDATES); ControllerMetrics controllerMetrics = ControllerMetrics.get(); long idealStateUpdateSuccessCount = controllerMetrics.getMeteredTableValue(TABLE_NAME, ControllerMeter.IDEAL_STATE_UPDATE_SUCCESS).count(); Assert.assertTrue(idealStateUpdateSuccessCount < NUM_UPDATES); LOGGER.info("{} IdealState update are successfully commited with {} times zk updates.", NUM_UPDATES, idealStateUpdateSuccessCount); } } class IdealStateUpdater implements Runnable { private final HelixManager _helixManager; private final IdealStateGroupCommit _commit; private final String _tableName; private final int _i; public IdealStateUpdater(HelixManager helixManager, IdealStateGroupCommit commit, String tableName, int i) { _helixManager = helixManager; _commit = commit; _tableName = tableName; _i = i; } @Override public void run() { _commit.commit(_helixManager, _tableName, new Function<IdealState, IdealState>() { @Override public IdealState apply(IdealState idealState) { idealState.setPartitionState("test_id" + _i, "test_id" + _i, "ONLINE"); return idealState; } }, RetryPolicies.exponentialBackoffRetryPolicy(5, 1000L, 2.0f), false); HelixHelper.getTableIdealState(_helixManager, _tableName); } }

From this test, this is not a correct fix.

xiangfu0 · 2024-10-14T22:00:44Z

pinot-common/src/main/java/org/apache/pinot/common/utils/helix/IdealStateGroupCommit.java

+                  processed.add(ent);
+                  updatedIdealState = ent._updater.apply(updatedIdealState);
+                  ent._updatedIdealState = updatedIdealState;
+                  it.remove();


If the first retry failed, the second retry will have no entry to update.

Oops, very true. Let me fix this tomorrow.

xiangfu0 · 2024-10-15T09:37:01Z

I made some minor changes based on your diff: #14237

dinoocch · 2024-10-15T17:04:42Z

I made some minor changes based on your diff: #14237

Sure, we can use that pr instead. I left a couple questions

dinoocch added 2 commits October 11, 2024 17:11

Fix race-conditions in IdealStateGroupCommit

f705652

idealStateCopy -> updatedIdealState

83ff864

jasperjiaguo approved these changes Oct 12, 2024

View reviewed changes

dinoocch marked this pull request as ready for review October 13, 2024 23:03

Jackie-Jiang added the bugfix label Oct 14, 2024

xiangfu0 reviewed Oct 14, 2024

View reviewed changes

xiangfu0 mentioned this pull request Oct 15, 2024

Fix race condition in IdealStateGroupCommit #14237

Merged

dinoocch closed this Oct 15, 2024

dinoocch deleted the dino/idealstate-group-commit-version branch October 15, 2024 17:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race-conditions in IdealStateGroupCommit #14214

Fix race-conditions in IdealStateGroupCommit #14214

dinoocch commented Oct 12, 2024

codecov-commenter commented Oct 12, 2024 •

edited

Loading

jasperjiaguo left a comment

jasperjiaguo Oct 12, 2024

dinoocch Oct 12, 2024

jasperjiaguo Oct 12, 2024

mcvsubbu Oct 14, 2024

dinoocch Oct 14, 2024

mayankshriv Oct 15, 2024

Jackie-Jiang commented Oct 14, 2024

dinoocch commented Oct 14, 2024

xiangfu0 Oct 14, 2024

xiangfu0 Oct 14, 2024

xiangfu0 Oct 14, 2024

dinoocch Oct 15, 2024

xiangfu0 commented Oct 15, 2024

dinoocch commented Oct 15, 2024

Fix race-conditions in IdealStateGroupCommit #14214

Fix race-conditions in IdealStateGroupCommit #14214

Conversation

dinoocch commented Oct 12, 2024

Issue 1 - Missing updates

Issue 2 - Typo in use of resourceName instead of mergedResourceName

codecov-commenter commented Oct 12, 2024 • edited Loading

Codecov Report

jasperjiaguo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jackie-Jiang commented Oct 14, 2024

dinoocch commented Oct 14, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiangfu0 commented Oct 15, 2024

dinoocch commented Oct 15, 2024

Issue 2 - Typo in use of `resourceName` instead of `mergedResourceName`

codecov-commenter commented Oct 12, 2024 •

edited

Loading