Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-atomicity of operations #68

Closed
4 tasks
andrey-kuprianov opened this issue Nov 25, 2020 · 4 comments · Fixed by #107
Closed
4 tasks

Non-atomicity of operations #68

andrey-kuprianov opened this issue Nov 25, 2020 · 4 comments · Fixed by #107

Comments

@andrey-kuprianov
Copy link

Surfaced from Informal Systems IBC Audit of cosmos-sdk hash cosmos/cosmos-sdk@6cbbe0d

Description

In multiple places throughout the Cosmos-SDK, functions contain several sub-calls that may fail due to various reasons. At the same time, the sub-calls update the shared state, and the functions do not backtrack the state changes done by the first sub-calls if one of the subsequent sub-calls fails.

Consider this example of SubtractCoins():

for _, coin := range amt {
    balance := k.GetBalance(ctx, addr, coin.Denom)
    locked := sdk.NewCoin(coin.Denom, lockedCoins.AmountOf(coin.Denom))
    spendable := balance.Sub(locked)
    _, hasNeg := sdk.Coins{spendable}.SafeSub(sdk.Coins{coin})
    if hasNeg {
        return sdkerrors.Wrapf(sdkerrors.ErrInsufficientFunds, "%s is smaller than %s", spendable, coin)
    }
    newBalance := balance.Sub(coin)
    err := k.SetBalance(ctx, addr, newBalance)
    if err != nil {
        return err
    }
}

The function iterates over the set of coins and for each coin checks whether enough coins are available for the current denomination. The balance is updated for each coin as the loop progresses. Consider the scenario where the balance for the first coin is updated successfully, but for the second coin setting of the balance fails because of the negative amount of coins. The function returns the error, but the balance for the first coin remains updated in the shared state.

This violates one of the most basic assumptions of function atomicity, namely that either

  1. the function updates the state and succeeds, or
  2. the function returns an error, but the shared state is unmodified.

We have found multiple occasions of such non-atomic functions; here are some, besides the example above:

Bank:

  • AddCoins similar issue with the difference it panics when overflow happens for some denomination.
  • SendCoins first subtracts from the sender account and then adds to the receiver. If subtract is successful and add fails the state is changed.
  • DelegateCoins: delegation is processed denomination by denomination: the insufficient funds is check inside the loop allowing state updates even when an error is reached.
  • UndelegateCoins similar scenario.
  • BurnCoins and MintCoins use SubtractCoins and AddCoins.

ICS20:

The problem is that all above functions have implicit assumption on the behavior of the caller. This implicit assumption is that whenever any such function returns an error, the only correct behavior is to propagate this error up the stack.

While this assumption seems indeed to be satisfied by the present Cosmos SDK codebase (with one exception, see the problem scenario below), it is not documented anywhere in the Cosmos SDK developer documentation. There are only hints to this in the form similar to this paragraph in the Cosmos SDK handler documentation:

The Context contains all the necessary information needed to process the msg, as well as a cache-wrapped copy of the latest state. If the msg is succesfully processed, the modified version of the temporary state contained in the ctx will be written to the main state.

Such hints do not constitute enough developer guidance to avoid introducing severe bugs, especially for Cosmos SDK newcomers.

Problem Scenario

We have found one particular place where this non-atomicity has almost led to the real bug. Namely, in ICS20 OnRecvPacket we have two consecutive operations, minting and sending.

// mint new tokens if the source of the transfer is the same chain
if err := k.bankKeeper.MintCoins(
    ctx, types.ModuleName, sdk.NewCoins(voucher),
); err != nil {
    return err
}


// send to receiver
if err := k.bankKeeper.SendCoinsFromModuleToAccount(
    ctx, types.ModuleName, receiver, sdk.NewCoins(voucher),
); err != nil {
    return err
}

If the minting succeeds, but the sending fails, the function returns an error, while the funds are moved to the module account.

This is how this function is called in applications/transfer/module.go:

err := am.keeper.OnRecvPacket(ctx, packet, data)
if err != nil {
    acknowledgement = channeltypes.NewErrorAcknowledgement(err.Error())
}

We see that the error is turned into a negative acknowledgement. If sending of the coins above was to fail with an error, then the negative acknowledgement would be sent, but also the funds would be silently moved to the module account under the hood. We have carefully examined the code of SendCoinsFromModuleToAccount, and found out that its current implementation can only panic, e.g. when the receiver account overflows. But if there was a possibility for it to return an error this scenario would constitute a real problem. Please note that these two functions are probably written by two different developers, and it would be perfectly legitimate for SendCoinsFromModuleToAccount to return an error -- this is how close it comes to being a real bug. Please see also the https://github.com/cosmos/ics/issues/504 for more details on this issue.

Recommendation

  • Short term: properly explain in the developer documentation the implicit assumption of propagating all returned errors up the stack , as well as in the inline documentation for all functions exhibiting non-atomic behavior.
  • Long term: one of the following needs to be done:
    • either make the SDK functions atomic as described above;
    • or introduce a separate explicit step for handlers, say CommitState, that the handler will need to call to write state changes to the store.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@aaronc
Copy link
Member

aaronc commented Nov 25, 2020

Thanks @andrey-kuprianov. ADR 033 (#7459) could make the calls between different modules atomic. I can update the ADR to indicate that.

Within a single module, the module developer would still need to guarantee atomicity.

Your second suggestion could be addressed with something like Begin, Commit and Rollback transaction isolation methods on Context. It's not a bad idea to introduce those, but developers would still need to use them correctly.

I don't think we should require all methods to call Commit although maybe others have a different opinion.

Do you think that maybe the ADR 033 inter-module isolation would address this well enough?

@tac0turtle
Copy link
Member

duplicate of cosmos/cosmos-sdk#6370?

@aaronc
Copy link
Member

aaronc commented Nov 25, 2020

duplicate of cosmos/cosmos-sdk#6370?

It's different actually. cosmos/cosmos-sdk#6370 as I understand it is related to the multi-store design of using multiple IAVL trees (see cosmos/cosmos-sdk#6370 (comment)). This is around how we are committing state and only rolling back at the transaction level rather than any time a sub-method returns an error.

@colin-axner
Copy link
Contributor

@andrey-kuprianov thank you for the write up!

In relation to the problem scenario, a recent change in cosmos/cosmos-sdk#7968 should definitely prevent the bug (if the bank code changed to allow an error to be returned) from occurring since we now panic if that scenario occurs.

This is definitely something that should at least be well documented since the fix on our side had not come from consideration of this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants