relayer deployment failing due to unsupported value type #2330

Closed
conorsch opened this issue Apr 10, 2023 · 12 comments

@conorsch
Contributor

Describe the bug
The CI step to deploy the relayer has started failing, post merge of #2315.

To Reproduce
Easily reproducible locally:

cd deployments/relayer
just

Expected behavior
Relayer config scripts can build a connection between the Penumbra chains.

Additional context
The logs contain failure messages such as:

ts=2023-04-10T16:51:52.724670Z lvl=debug msg="Assembled connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-theb-ff3b9445 client_id=07-tendermint-143 msgError="unsupported value type"
ts=2023-04-10T16:51:52.724735Z lvl=debug msg="Will broadcast connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-theb-ff3b9445 client_id=07-tendermint-143 msgError="unsupported value type"
ts=2023-04-10T16:51:52.743759Z lvl=debug msg="Broadcasting penumbra tx" provider_type=penumbra
ts=2023-04-10T16:51:52.761052Z lvl=debug msg="Received response from sending messages" provider_type=penumbra response= responseError="unsupported value type"
ts=2023-04-10T16:51:52.761090Z lvl=debug msg="Successfully broadcasted connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-theb-ff3b9445 client_id=07-tendermint-143 msgError="unsupported value type"

Note the ubiquitous "unsupported value type". Let's figure out what broke and see if we can get it working again.

@conorsch
Contributor Author

conorsch commented Apr 11, 2023

Made some progress with the updated protobufs from the PR into upstream. With the latest code, we still encounter a failure, but the config-building step shows a different error:

rly --debug transact channel penumbra_path
Error: identifier cannot be blank: invalid identifier

Tracking that message down, it comes from ibc-go: https://github.com/cosmos/ibc-go/blob/a18f96af99e4cec3b75ac9f9399534427679597f/modules/core/24-host/validate.go#L41, during validation of either a path name or a client identifier; it's not clear from the docstring which.
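
For reference, here is a simplified paraphrase of that check (not the exact upstream code; the real validator also enforces length and character-set rules): any identifier that trims to the empty string is rejected immediately, which matches the error text above.

package main

import (
    "fmt"
    "strings"
)

// validateIdentifier loosely mirrors ibc-go's defaultIdentifierValidator:
// a blank identifier fails before any other checks run.
func validateIdentifier(id string) error {
    if strings.TrimSpace(id) == "" {
        return fmt.Errorf("identifier cannot be blank: invalid identifier")
    }
    // (length and character-set checks omitted in this sketch)
    return nil
}

func main() {
    // An empty identifier, e.g. a missing connection id, trips exactly this error.
    fmt.Println(validateIdentifier(""))
}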

Recall that the final step in config-building for the relayer is rly transact link penumbra_path, which decomposes into three steps:

  • rly transact clients penumbra_path
  • rly transact connection penumbra_path
  • rly transact channel penumbra_path

It's that last step that's failing immediately, on the "invalid identifier" error.

@conorsch
Contributor Author

The relayer reports that connections aren't established:

❯ rly paths list
 0: penumbra_path        -> chns(✔) clnts(✔) conn(✘) (penumbra-testnet-thebe<>penumbra-testnet-thebe-1d6a5abb)

But I'm not sure how to look up the corresponding values via pcli query ibc connection to verify that information.

@avahowell
Contributor

Hmm, I don't think that error is coming from ibc-go, because there's no ibc-go in the path for testing between two Penumbra testnets. In the past, when we had this error, it was due to the identifier not getting parsed and persisted correctly on the relayer side. Can we look at the events for connection creation to verify that they're emitting IDs correctly?

@conorsch
Contributor Author

Can we look at the events for connection creation to verify that they're emitting IDs correctly?

We sure can! Here's a gist containing the full logs from an end-to-end configure-and-try-run invocation: https://gist.github.com/conorsch/615e6002ae806a04ffcb02db6bea27cd. Grepping through that text, it looks to me like the client_ids are populated, but I still see some "unsupported value type" messages, e.g.:

374:ts=2023-04-11T23:54:21.367887Z lvl=debug msg="Assembled connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-thebe-c83c87af client_id=07-tendermint-43 msgError="unsupported value type"
375:ts=2023-04-11T23:54:21.367912Z lvl=debug msg="Will broadcast connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-thebe-c83c87af client_id=07-tendermint-43 msgError="unsupported value type"
379:ts=2023-04-11T23:54:21.645208Z lvl=debug msg="Successfully broadcasted connection handshake message" path_name=penumbra_path chain_id=penumbra-testnet-thebe-c83c87af client_id=07-tendermint-43 msgError="unsupported value type"

The steps I used to generate these logs were:

  1. Clone latest relayer code, make build, then copy the build/rly binary to $PATH
  2. Hop into deployments/relayer dir in the penumbra repo, run just (or each script individually)

We can at any point build a custom image and run that on the cluster, but for iterating on the debugging, local rebuilds will be fastest, I expect.

conorsch added a commit that referenced this issue Apr 12, 2023
Not currently working due to a mismatch in the protos between preview &
testnet. Refs #2330.
@conorsch
Contributor Author

We've tracked this down to breaking changes in the protos between preview & testnet. To resolve, we must:

  1. Wait until all breaking changes have landed for the current testnet.
  2. Submit another PR updating the generated protos in the relayer repo, as done in cosmos/relayer#1168 (penumbra provider: update generated protos).
  3. Redeploy the relayer against testnet and preview again.

It'll break again if the corresponding protos break. Let's also consider tackling #2349 for a bit more of a heads-up.

conorsch added a commit that referenced this issue Apr 14, 2023
Re-enabling relayer post merge of latest protos upstream in the relayer
repo. Refs #2330.
This reverts commit 2459462.
conorsch added a commit that referenced this issue Apr 14, 2023
Oops, too early: although the protos are updated in the relayer
code now, we still have a mismatch between preview & testnet.
Deferring update until release of Elara (#2360). Refs #2330.
This reverts commit 2483648.
erwanor pushed a commit that referenced this issue Apr 14, 2023
Re-enabling relayer post merge of latest protos upstream in the relayer
repo. Refs #2330.
This reverts commit 2459462.
erwanor pushed a commit that referenced this issue Apr 14, 2023
Oops, too early: although the protos are updated in the relayer
code now, we still have a mismatch between preview & testnet.
Deferring update until release of Elara (#2360). Refs #2330.
This reverts commit 2483648.
@conorsch conorsch self-assigned this Apr 20, 2023
@conorsch
Contributor Author

Confirmed the relayer is working between testnet and a local devnet, as described in 89fa953. The relayer repo has the most recent proto definitions as well. Will check again prior to the testnet release on Monday. Post-release, we'll reactivate the relayer deployment between testnet & preview, and then we can close this ticket.

conorsch added a commit that referenced this issue Apr 26, 2023
Re-enables relayer deployment on testnet 51. Refs #2330.
We want the relayer active so we can continue to evaluate
IBC behavior. If we make breaking changes to the associated protos,
we can click it off until next testnet, and use devnet/preview
relaying in the interim.

This reverts commit 3470db2.
@conorsch
Contributor Author

Redeployed the relayer, but it's failing again on the same "invalid identifier" error. I added some debug statements to a local build and traced the error to ibc-go's validation of the connection id string, specifically for the src.PathEnd, which is the testnet rather than preview:

Failed to verify host for path end 'penumbra-testnet-elara:cl(07-tendermint-48):co()' ; error was: 'identifier cannot be blank: invalid identifier'
Error: identifier cannot be blank: invalid identifier

The co() string should contain a connection id. It doesn't. Earlier in testnet 51, I was able to build a connection, connection-0, which is visible in chain state:

querying testnet connection-0 via rly
❯ rly q connection  penumbra-testnet connection-0  | jq
{
  "connection": {
    "client_id": "07-tendermint-0",
    "versions": [
      {
        "identifier": "1",
        "features": [
          "ORDER_ORDERED",
          "ORDER_UNORDERED"
        ]
      }
    ],
    "state": "STATE_OPEN",
    "counterparty": {
      "client_id": "07-tendermint-0",
      "connection_id": "connection-0",
      "prefix": {
        "key_prefix": "UGVudW1icmFBcHBIYXNo"
      }
    },
    "delay_period": "0"
  },
  "proof": "CrIJCq8JChhjb25uZWN0aW9ucy9jb25uZWN0aW9uLTASbAoPMDctdGVuZGVybWludC0wEiMKATESDU9SREVSX09SREVSRUQSD09SREVSX1VOT1JERVJFRBgDIjIKDzA3LXRlbmRlcm1pbnQtMBIMY29ubmVjdGlvbi0wGhEKD1BlbnVtYnJhQXBwSGFzaBoVCAEQARgBKg1KTVQ6OkxlYWZOb2RlIjQIARIwSk1UOjpJbnRybmFsTm9kZeNCY8jzBN0sIbrGNKeufpc9kQdF2GpjfMppnFe/YZm6IjYIARIQSk1UOjpJbnRybmFsTm9kZRogcsKvHYGLuA1XAdrmhr5gDA0gCezLpnoINMPbKPHRAz0iNAgBEjBKTVQ6OkludHJuYWxOb2RlKfLv8hvHkEKlqFDovC6H6d95TMAjWjYMe6eQtqwElEQiNggBEhBKTVQ6OkludHJuYWxOb2RlGiBYGd3qRSOStVKYL76SIYmqwoA+Kx/xl9L2+4S1MN2vYyI2CAESEEpNVDo6SW50cm5hbE5vZGUaINr++LmVMYBCqtVhdj4gSRXiKeaiaHTrNTm2iCXRshTiIjQIARIwSk1UOjpJbnRybmFsTm9kZeI9eZuH3kZe8Kmr9X46br021Mbv/o/rVbMyMfyNasjCIjQIARIwSk1UOjpJbnRybmFsTm9kZQzXY2LKgSi1sphOU4FM7hCEcTrHUZodIQ2Joy7DlvulIjQIARIwSk1UOjpJbnRybmFsTm9kZQ9peKXqlSDkkd7fiSAPH+P2oNqJSyiohGiYfS7f4pj1IjQIARIwSk1UOjpJbnRybmFsTm9kZepqL3/ULmgkyW3709jc5VuweleDnu0HkoFKKOXplU2XIjQIARIwSk1UOjpJbnRybmFsTm9kZRjVoOZe42rfFOgKbEyBaphzF+oCaCqmOtFig/Xp5s5UIjQIARIwSk1UOjpJbnRybmFsTm9kZbOLyc3m9+rNitNazXjxrMP1MtnPCJ7kWfXCd4Ilu6UUIjQIARIwSk1UOjpJbnRybmFsTm9kZYrpHtmB2oA3r4AWieJBxWOEZrps/kena9Q4HBL5ey6gIjQIARIwSk1UOjpJbnRybmFsTm9kZY7iabTCW04jiC3G9Pg0z6J6UzQqXEioJuTa1/rtn2DNIjQIARIwSk1UOjpJbnRybmFsTm9kZddZ/0HLPrjPTJau8tar06nyhoSVI1u1Je3Lbhr2JiP5IjYIARIQSk1UOjpJbnRybmFsTm9kZRogGwtVN69m6HuiBE10YX9f7DDHlNKXNWVzj2tBKBb0wbEiNggBEhBKTVQ6OkludHJuYWxOb2RlGiCDovxbhNg00ZNuZL38LuAw6JD6REm7LLIvC6ll/AwBlyI0CAESMEpNVDo6SW50cm5hbE5vZGUNtF6m5bdcRucaKHPi2ymmBaTervP13OEPCRcl2uX1ZyI2CAESEEpNVDo6SW50cm5hbE5vZGUaIIzCNy9PAAhcmqfydVQxDWldnntKuYsHSHcAqhriWtiLIjYIARIQSk1UOjpJbnRybmFsTm9kZRog5+EMqroPq6EmGJq9mcxrDwIexZ5dcMIKYYX5GVf4MSwKOQo3Cg9QZW51bWJyYUFwcEhhc2gSIJQjSMaQGWgWz1LyZenRCwVeygE7NgMIKxXNjMMofCtQGgIIAQ==",
  "proof_height": {
    "revision_number": "0",
    "revision_height": "19946"
  }
}
querying testnet connection-0 via pcli
❯ pcli q ibc connection connection-0
{
  "state": "Open",
  "client_id": "07-tendermint-0",
  "counterparty": {
    "client_id": "07-tendermint-0",
    "connection_id": "connection-0",
    "prefix": "PenumbraAppHash"
  },
  "versions": [
    {
      "identifier": "1",
      "features": [
        "ORDER_ORDERED",
        "ORDER_UNORDERED"
      ]
    }
  ],
  "delay_period": {
    "secs": 0,
    "nanos": 0
  }
}

But thereafter, i.e. all connections greater than connection-0, the counterparty connection string is empty:

❯ pcli q ibc connection connection-11 | jq .counterparty.connection_id
null

This is why the connection_id fails to validate: the ibc-go code ensures it isn't an empty string: https://github.com/cosmos/ibc-go/blob/87ac1a9ac3e28983ccf9fa5b6cefe6ff60ddee20/modules/core/24-host/validate.go#L40-L42. Note that the state for the later connections is stuck on "init", rather than "open" as for connection-0, which makes sense: the counterparty chain's connection id is empty, so the connection was never successfully opened.

Next, I'll try repeating these procedures between preview and a local devnet, to rule out recent code changes breaking things again.

conorsch added a commit that referenced this issue Apr 27, 2023
Shuts off the relayer again, because it's stuck in a crash loop. See
details in #2330. We should consider replacing the full relayer
deployment with a simpler CI check that tries to build the configs,
then reports an error and stays stopped if it fails. The current setup
will continue to retry forever, which junks up the chains with "init"
connections.

This reverts commit fbc80b7.
@conorsch
Contributor Author

Indeed, was able to build a path between preview and local devnet:

[...snip...]
2023-04-27T15:37:18.760565Z     debug   Context done, quitting PathProcessor    {"chain_id_1": "penumbra-devnet-local-3339493c", "chain_id_2": "penumbra-testnet-elara-085b99f8", "client_id_1": "07-tendermint-0", "client_id_2": "07-tendermint-0", "error": "context deadline exceeded"}
Finished! No errors encountered setting up link. Available paths:
 0: penumbra_path        -> chns(✔) clnts(✔) conn(✔) (penumbra-devnet-local-3339493c<>penumbra-testnet-elara-085b99f8)

Additionally, I was able to build several paths, all of which stayed open:

❯ for i in $(seq 0 2) ; do pcli -n http://localhost:8080 q ibc connection connection-$i | jq -r .state ; done
Open
Open
Open

However, there's still no evidence of successfully built channels yet. While it's possible I'm not querying them correctly (the pcli q ibc channel command needs a port, which is always transfer, and a channel_id, the value of which I don't yet know), I suspect the relayer isn't properly building the events, so the channel never gets created. From the pd logs of the local devnet:

2023-04-27T22:28:33.745353Z  INFO abci:Query{path="/ibc.core.channel.v1.Query/ConnectionChannels" height=block::Height(0) prove=false}: query=Query { data: b"\n\x0cconnection-2\x12\x05\x18\xe8\x07 \x01", path: "/ibc.core.channel.v1.Query/ConnectionChannels", height: block::Height(0), prove: false }
2023-04-27T22:28:33.746701Z  INFO abci:Query{path="/ibc.core.channel.v1.Query/Channels" height=block::Height(0) prove=false}: query=Query { data: b"\n\x05\x18\xe8\x07 \x01", path: "/ibc.core.channel.v1.Query/Channels", height: block::Height(0), prove: false }
2023-04-27T22:28:33.746947Z  INFO abci:Query{path="/ibc.core.connection.v1.Query/Connections" height=block::Height(0) prove=false}: query=Query { data: b"\n\x05\x18\xe8\x07 \x01", path: "/ibc.core.connection.v1.Query/Connections", height: block::Height(0), prove: false }

Compare the malformed data block with other successful queries, such as:

2023-04-27T22:28:17.123877Z  INFO abci:Query{path="state/key" height=block::Height(982) prove=true}: query=Query { data: b"clients/07-tendermint-0/clientState", path: "state/key", height: block::Height(982), prove: true }
2023-04-27T22:28:17.125102Z  INFO abci:Query{path="state/key" height=block::Height(982) prove=true}: query=Query { data: b"connections/connection-2", path: "state/key", height: block::Height(982), prove: true }
2023-04-27T22:28:17.125497Z  INFO abci:Query{path="state/key" height=block::Height(982) prove=true}: query=Query { data: b"clients/07-tendermint-0/consensusStates/0-1175", path: "state/key", height: block::Height(982), prove: true }

After some discussion with @avahowell, we expect we'll need to patch the golang relayer code and see if we can coax out channel construction. Will work on that tomorrow.

@conorsch
Contributor Author

The /ibc.core.channel.v1.Query/ConnectionChannels calls in the Penumbra provider (in the golang relayer code) appear to be making gRPC calls against a non-gRPC Tendermint API endpoint. See relevant code here: https://github.com/cosmos/relayer/blob/main/relayer/chains/penumbra/query.go#L585-L596

The client and connection calls use the abci.RequestQuery interface, and those queries work just fine, as shown in the log output above. We should therefore update the channel-related query calls in the Penumbra provider to use the abci query interface throughout. We're not sure when or how this broke in the relayer code, but the recent IBC overhaul in support of #2379 created ample opportunity for breakage on the Penumbra side.
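
To make that concrete, here's a minimal sketch of the two query shapes (using a local struct in place of abci.RequestQuery and the standard ICS-24 host key layout; the relayer's actual code will differ): the channel lookup would go through the same generic state/key path that the working client and connection lookups use, instead of the gRPC ConnectionChannels endpoint.

package main

import "fmt"

// requestQuery stands in for abci.RequestQuery; only the fields relevant
// to this sketch are included.
type requestQuery struct {
    Path  string
    Data  []byte
    Prove bool
}

func main() {
    // Mirrors the working connection lookup seen in the pd logs above.
    connQuery := requestQuery{
        Path:  "state/key",
        Data:  []byte("connections/connection-2"),
        Prove: true,
    }

    // A channel lookup routed the same way would use the ICS-24 channel key,
    // rather than a gRPC call to /ibc.core.channel.v1.Query/ConnectionChannels.
    chanQuery := requestQuery{
        Path:  "state/key",
        Data:  []byte("channelEnds/ports/transfer/channels/channel-0"),
        Prove: true,
    }

    fmt.Printf("%s %s\n", connQuery.Path, connQuery.Data)
    fmt.Printf("%s %s\n", chanQuery.Path, chanQuery.Data)
}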

@conorsch
Contributor Author

conorsch commented May 3, 2023

Spent some time on this with @avahowell today. We suspect that the client_id for the counterparty chain is not being passed correctly, but we didn't find any low-hanging fruit, like a simple inversion of arguments introduced by recent refactoring. One observation we made was that more use of anyhow::Context throughout the IBC handshake flow would be helpful; there are currently two places in the connection code where "client not found" can be emitted.

Regarding the querying logic above, that remains an issue and one we intend to solve, but it's lower priority than ensuring our Rust code conforms to the IBC spec. We'll focus on that near-term, and follow up on querying improvements once we have channels built successfully again.

@conorsch
Contributor Author

conorsch commented May 3, 2023

We should therefore update the channel-related query calls in the Penumbra provider to use the abci query interface throughout.

Today we determined that this change to the golang relayer code is indeed necessary, because the channel-creation logic is gated on the queries being populated correctly. We should update the relayer code to 1) make working query calls and 2) error out if they aren't made successfully. The fact that the relayer currently exits 0 prior to channel creation obscured this problem for too long.
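
To illustrate point 2, a minimal sketch of the intended behavior (all names here are hypothetical stand-ins, not the relayer's actual API): a failed or empty channel query becomes a hard error, so the process exits non-zero instead of continuing past channel creation.

package main

import (
    "errors"
    "fmt"
    "os"
)

// queryConnectionChannels is a hypothetical stand-in for the relayer's
// channel lookup; here it simulates the broken query.
func queryConnectionChannels(connID string) ([]string, error) {
    return nil, errors.New("unsupported value type")
}

// channelsForConnection propagates a failed or empty query as an error
// instead of letting the caller treat "no channels" as success.
func channelsForConnection(connID string) ([]string, error) {
    channels, err := queryConnectionChannels(connID)
    if err != nil {
        return nil, fmt.Errorf("querying channels for %s: %w", connID, err)
    }
    if len(channels) == 0 {
        return nil, fmt.Errorf("no channels found for %s", connID)
    }
    return channels, nil
}

func main() {
    if _, err := channelsForConnection("connection-0"); err != nil {
        fmt.Fprintln(os.Stderr, "Error:", err)
        os.Exit(1)
    }
}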

But thereafter, i.e. all connections greater than connection-0, the counterparty connection string is empty

Second, we also identified a logic bug in our IBC implementation, in which we mix up which client_id belongs to the source vs counterparty chain. This bug was hard to find, because much of our testing involved clean slates, such as a fresh preview and a fresh devnet, so the client_id values incremented in tandem, obscuring the bug. The explanation above in this thread shows the dawning realization that something was amiss. @avahowell is working on a fix for that pronto.

Finally, as to "how did channel creation ever work?" we now believe we had originally tested with the --override flag enabled on the relayer, which instructed it to blow past the failing queries and attempt channel creation no matter what. I wasn't using --override in our setup, which partly explains the regression. Once the client_id mix-up bug is addressed, we can confirm or disprove this hypothesis by trying with --override enabled, and we should be mostly back to working relaying. We still aim to clean up the relayer query logic, though.

@conorsch
Contributor Author

conorsch commented May 5, 2023

After the most recent changes landed in #2482, we have channel creation restored. Behold:

We got clients:

❯ pcli -n http://localhost:8080 q ibc client 07-tendermint-0 | jq -r .chain_id
penumbra-testnet-elara-a56dd8b8

We got connections:

❯ pcli -n http://localhost:8080 q ibc connection connection-0 | jq -r .state
Open

And—drumroll, please—we even have channels:

❯ pcli -n http://localhost:8080 q ibc channel transfer channel-0
{
  "state": "Open",
  "ordering": "Unordered",
  "remote": {
    "port_id": "transfer",
    "channel_id": "channel-0"
  },
  "connection_hops": [
    "connection-0"
  ],
  "version": "ics20-1"
}

As part of this push, we've been testing with locally built versions of the relayer, with the most recent protobuf definitions. Those are now PR'd into the upstream repo here: cosmos/relayer#1181
