Query currentProtocolParameters fails against 8.0.0 node #314
Comments
I suspect that this is the same issue that was going to be resolved by IntersectMBO/cardano-node#5100. It was closed only because it targeted a branch that was merged.
@JaredCorduan What is the path forward here? Should we fix all downstream tools like Ogmios, or wait for an 8.0.1 patch to the node that restores the miniprotocol version api contract? |
@AndrewWestberg I am personally working on node 8.1.0 myself now. See IntersectMBO/cardano-node#5243. I think I am through all the real integration work, and am just waiting on things like a new consensus release, updating packages, rebasing, etc. So I think it makes sense to wait on 8.1.0.
@AndrewWestberg can you test that PR out and see if it addresses your problems? |
I actually know the reason for this failure. Serialization for PParams was fixed in In hindsight, I think there was a graceful way to fix this bug by using a different serializer in the query implementation and adding a new query with fixed serialization, but I suspect it is too late now, since serialization has already changed. Downstream tools will need to adjust their deserializers |
@lehins is totally right, I was wrong. Folks will need to adjust to 8.0.0, apologies :( |
@JaredCorduan @lehins Even if the CBOR had bugs before, you should respect the protocol version contracts instead of having Ogmios, phyrhose, TxPipe, Goroboros, and everybody else downstream code workarounds.
Though we do care that the format has now changed in an incompatible manner.
I don't think this is an acceptable outcome. The reason being that the ouroboros mini-protocols and the entire framework around it are built to be multi-era and multi-codecs capable. There's an inherent complexity coming with that framework. Client applications are performing a handshake with the node to negotiate precise protocol versions. Specific codecs are associated to versions. So, it is reasonable from client applications to expect servers (i.e. the node) to honor the other part of the protocol. If not, then let's please ditch the entire mini-protocol framework and use something a lot simpler and more adapted where breaking changes like that are meant to happen.
I don't think you were wrong, no. Changing the serialization back to what it was is the right call. If the format was different from what it should've been, then it is actually too late to change it. Or, if changed, then it needs to go hand-in-hand with the multi-version codecs supported by the Ouroboros framework.
I agree that we need to fix the encoding. The whole contract of negotiating a given @KtorZ versioning we do is a very standard thing, and it doesn't even have anything to do with mini-protocols (/ typed-protocols). Furthermore, structure gives simplicity and that's been our aim since the beginning. |
Well, I think that from a client perspective, we kind of missed the aim for simplicity 😅 ... I've never encountered something as complicated as the mini protocol to use as a client application in my entire engineering career. I can fathom what I get from that complexity, the immutability, version negotiation and the correct-by-construction protocols so that's okay. But if I am now being told that I can't even get the one thing I was promised in exchange of that complexity, I am -- reasonably I believe -- a little disappointed.
I am not sure I follow your thoughts here? Surely, codecs are currently versioned -- or at least pretend to be? For example, this is how Ogmios currently selects codecs using methods and data-types from the ouroboros-network libraries:

    codecs
        :: forall m. (MonadST m)
        => EpochSlots
        -> NodeToClientVersion
        -> ClientCodecs (CardanoBlock StandardCrypto) m
    codecs epochSlots nodeToClientV =
        clientCodecs cfg (supportedVersions ! nodeToClientV) nodeToClientV
      where
        supportedVersions = supportedNodeToClientVersions (Proxy @(CardanoBlock StandardCrypto))
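The contract being discussed here (a handshake picks a version, and that version pins a wire layout) can be sketched without any of the ouroboros-network machinery. Everything in this block is hypothetical and for illustration only: `Version`, `Codec`, `codecFor`, and `negotiate` are made-up names, and the version numbers do not correspond to real `NodeToClientVersion` values.

```haskell
import Data.List (intersect)

-- Hypothetical protocol versions; NOT the real NodeToClientVersion type.
data Version = V1 | V2 | V3
  deriving (Eq, Ord, Show)

-- A toy "codec": how a protocol version (major, minor) is laid out
-- among the top-level fields of a response.
data Codec = Codec
  { codecName     :: String
  , encodeProtVer :: (Int, Int) -> [[Int]]
  }

-- Each negotiated version is pinned to exactly one wire layout.
-- Changing the layout means introducing a new version, not editing an old one.
codecFor :: Version -> Codec
codecFor v
  | v <= V2   = Codec "inlined" (\(major, minor) -> [[major], [minor]])
  | otherwise = Codec "nested"  (\(major, minor) -> [[major, minor]])

-- Handshake: pick the highest version both sides support, if any.
negotiate :: [Version] -> [Version] -> Maybe Version
negotiate client server =
  case client `intersect` server of
    []     -> Nothing
    common -> Just (maximum common)
```

Under this sketch, a client that only knows `V1` and `V2` keeps receiving the inlined layout no matter how new the server is, which is the guarantee client authors in this thread say they were relying on.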
The part which was incorrect CBOR was a list of length 4, but where the CBOR claimed it was a list of length 5. |
@AndrewWestberg It actually was invalid, that's how I discovered it. Generic cbor decoder was choking on the PParams, because number of elements was reported one smaller than there was actually number of elements in the list.
I totally agree with you. This breakage was unintentional. That's why I said: "I think there was a graceful way to fix this bug, but..."
@KtorZ This is true. It was wrong, not because protocol version was unrolled, but it was wrong because this unrolling affected the total number of elements in the list without adjusting the encoded value for the length of the list. I am actually surprised that no-one in the community discovered this. |
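To make the length mismatch concrete, here is a toy sketch of a CBOR array whose header claims one element fewer than is actually written. It is not the real PParams encoding: the element values and array sizes are invented, and `arrayHeader`, `decodeSmallArray`, `brokenEncoding`, and `fixedEncoding` are hypothetical helpers that only handle short arrays of small unsigned integers.

```haskell
import Data.Word (Word8)

-- Definite-length CBOR array header for short arrays (0..23 items):
-- major type 4, so the byte is 0x80 + length.
arrayHeader :: Int -> Word8
arrayHeader n = 0x80 + fromIntegral n

-- Decode one short array of small unsigned integers (0..23, one byte each).
-- Returns the decoded items together with any bytes left over.
decodeSmallArray :: [Word8] -> Maybe ([Int], [Word8])
decodeSmallArray (h : rest)
  | h >= 0x80 && h <= 0x97 =
      let n = fromIntegral (h - 0x80)
          (items, leftover) = splitAt n rest
      in if length items == n && all (<= 0x17) items
           then Just (map fromIntegral items, leftover)
           else Nothing
decodeSmallArray _ = Nothing

-- Header says 2 items, but 3 items follow (toy values).
brokenEncoding :: [Word8]
brokenEncoding = arrayHeader 2 : [8, 0, 5]

-- Same items, header bumped by one so it matches the contents.
fixedEncoding :: [Word8]
fixedEncoding = arrayHeader 3 : [8, 0, 5]
```

A generic decoder reading `brokenEncoding` stops after two items and finds an unexpected trailing byte, while a hand-written decoder that ignores the header and just reads fields in order would never notice the problem.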
@coot You mean that we need to revert to the broken encoding in order to support downstream users of the query. I have no problem with that. If that's what people need, I am not gonna stand in the way. However, if that's what we wanna do then we also have to deprecate this query and implement a new one that has the correct encoding.
Don't be so sensitive and get disappointed so quickly, nothing has been decided yet. We are having a discussion here. This query breakage was unintentional, I apologize for that. But the bug was real and needed to be fixed. Problem is that the serialization bug was in ledger, while the breakage happened in consensus. The take away from this should be that we need to invent a mechanism to protect us from this happening in the future. |
I mean, this isn't the first time a "bug is fixed" and things break down the line. I'd hope that by now, we had a better process in place to prevent this kind of situation from happening a lot earlier. A software version has been tagged and released, and people should reasonably expect to use it as is. This one particular issue is low impact because it only affects state queries as far as I can tell, but what if it had affected the block serialisation format? I am puzzled that discrepancies like this can surface all the way without being detected prior to a release. So, yes, I am a little sensitive about this now. I don't mean to sound aggressive or anything like that, but that's somehow hard to capture in written form. I am just puzzled and worried. Note that one way to have caught this earlier would have been to have a CDDL specification for the state queries, and to test against the specification. I believe we've been gently asking for these for a while in the community, but priorities have been elsewhere.
@lehins I believe that this particular issue can be "fixed" by simply bumping the size of the serialized map by one (and reverting the change). This makes the PParams serialization valid, yet preserves the on-the-wire format for client applications and existing decoders where the protocol version is inlined as two fields. The reason why no one bumped into this earlier is probably because they're either using the ledger as a library directly (e.g. Ogmios, Hydra, db-sync), or because no one has written a generic CBOR decoder for protocol parameters and people have written them by hand, thus likely ignoring the encoded length altogether. That doesn't quite solve #313, however, which I believe to be quite closely related, but for another encoder.
Ledger is fully responsible for [de]serialization of everything that goes on chain. That's why we have a CDDL spec and a whole lot of tests that ensure we don't break it. In fact, we recently introduced versioned serialization for all types that live on chain, so we can gracefully fix bugs and deprecate undesired features when going from one era to another. With respect to serialization of queries, we don't have the same scrutiny, unfortunately. In my opinion the problem lies in the separation of concerns: consensus is responsible for the queries, while serialization of the types returned by queries lives in ledger. This makes it incredibly hard for the ledger team to know which types from the ledger state are allowed to change their serialization and which ones are not. Now that it is on my radar, I'll make sure we have this problem resolved.
I 100% support you on this. Considering the issue that just happened, we should give it a higher priority.
I do like this suggestion. This would allow us to avoid deprecation of the old query, while fixing the bug of invalid CBOR. Technically it would still be a breaking change to the query, but the impact would be minimal, as you very well alluded. It does have another small downside of not matching the way protocol version is stored in PParamsUpdates, but I don't think it is terribly important. I am not quite sure what the reason for #313 is. That query hasn't changed, nor has its serialization. That requires some investigation.
@lehins we can change the encoding, as long as a new @KtorZ I won't argue but you're confusing simplicity with things being easy. We aim for the former not for the latter. And by the way in the context of |
From my perspective, reverting it doesn't solve the problem because end users still cannot use generic decoders. However, the fix in 8.0.0 is sub-optimal in that it changes the logic users have to use. The optimal solution is to keep the structure the same but prefix with 5 rather than 4, so it fixes the CBOR decoder issues and causes minimal impact to end users. Would you agree with doing that, @KtorZ? To elaborate, what we propose is we switch it to:
Per @coot we should also increase the network protocol version to denote a breaking change. |
@coot NodeToClient was not introduced in |
@disassembler I believe you and I are saying the same thing. |
This issue appears to be working with Ogmios 5.6.0 against cardano-node 8.1.1-pre. |
Cardano Preview network just hardforked to protocol version 9 so everybody is forced to upgrade to cardano-node v8 or higher, but this issue is still a blocker |
@SebastienGllmt This issue has been fixed in cardano-node-8.1, so I am not quite sure why it hasn't been closed yet. Also, the fact that SanchoNet (preview) is on protocol version 9, doesn't force anyone to upgrade, only those who want to play around with it can do that. The Cardano TestNet is still on protocol version 8, so if you use it for testing and such, you can continue doing so. |
@lehins I think 8.1 only fixed part of the problem. What I understood from @disassembler is that there are still changes with respect to how the ledger now expects definite vs indefinite CBOR structures. So it still fails. This will get fixed with Ogmios' next upgrade. I was hoping to delay the integration with 8.x to later, but it sounds necessary.
Oh yeah, that was a bug in older, pre-8 versions of the node. It was unable to handle definite length encoding, which is now fixed. So, I would definitely upgrade to at least 8.1 sooner rather than later. There are never guarantees on which length encoding will be used, so this was a definite bug.
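For readers unfamiliar with the two framings, here is a minimal sketch of the same array encoded with a definite length and with an indefinite length, plus a decoder that accepts either. The function names are hypothetical, and the sketch only handles short arrays of small unsigned integers (0..23), not general CBOR.

```haskell
import Data.Word (Word8)

-- Definite-length array: header 0x80 + item count, then the items.
encodeDefinite :: [Word8] -> [Word8]
encodeDefinite xs = (0x80 + fromIntegral (length xs)) : xs

-- Indefinite-length array: 0x9f, the items, then the 0xff "break" byte.
encodeIndefinite :: [Word8] -> [Word8]
encodeIndefinite xs = 0x9f : xs ++ [0xff]

-- A tolerant decoder accepts both framings and yields the same items.
decodeArray :: [Word8] -> Maybe [Word8]
decodeArray (0x9f : rest) =
  case break (== 0xff) rest of
    (items, [0xff]) | all (<= 0x17) items -> Just items
    _                                     -> Nothing
decodeArray (h : rest)
  | h >= 0x80 && h <= 0x97
  , length rest == fromIntegral (h - 0x80)
  , all (<= 0x17) rest = Just rest
decodeArray _ = Nothing
```

Both `encodeDefinite [1, 2, 3]` and `encodeIndefinite [1, 2, 3]` decode to the same items here; a decoder that only handles one of the two framings is the kind of bug described above.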
What Git revision / release tag are you using?
Ogmios 5.6.0
Do you use any client SDK? If yes, which one?
Kogmios
Describe what the problem is?
The underlying cbor of the response between node and Ogmios has changed.
maxCollateralInputs used to be in a CBOR integer value outside of the array returned from the node. This has been moved inside the array (bug fixed).
CBOR indexes for certain parameters coming back in the payload have changed.
Protocol Version is now an array instead of two separate integer numbers:
protocolMajorVersion was: index 12... now, index 12[0]
protocolMinorVersion was: index 13... now, index 12[1]
Any index that was over 13 is now off by 1. For example:
utxoCostPerByte was: index 15... now, index 14
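The index changes listed above can be summarised as a small mapping from an old field position to its new one. This is only a sketch of the three rules stated in this report; `NewPos` and `remapIndex` are hypothetical names, and it deliberately ignores the separate maxCollateralInputs change.

```haskell
-- Where a field from the old layout ends up in the 8.0.0 layout.
data NewPos
  = TopLevel Int     -- still a top-level element, at this index
  | Nested Int Int   -- inside a nested array: outer index, inner index
  deriving (Eq, Show)

-- Map an old top-level index to its position in the 8.0.0 response.
remapIndex :: Int -> NewPos
remapIndex old
  | old < 12  = TopLevel old         -- unaffected fields
  | old == 12 = Nested 12 0          -- protocolMajorVersion
  | old == 13 = Nested 12 1          -- protocolMinorVersion
  | otherwise = TopLevel (old - 1)   -- everything after shifts down by one
```

For instance, `remapIndex 15` gives `TopLevel 14`, matching the utxoCostPerByte line above.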
What should be the expected behavior?
Well, it's probably going to take a special Ogmios release to fix it. The node team obviously broke the contract of communicating with the node on these given mini-protocols.
If applicable, what are the logs from the server around the occurrence of the problem?