
feat: ADR 038 plugin system #11691

Closed

Conversation

@egaxhaj (Contributor) commented Apr 19, 2022

For #10096

Implements #11175

On top of #10639

This PR extends and refactors the existing ADR-038 streaming service work to introduce a plugin system to the SDK, so that streaming services are loaded through this system rather than built into the SDK binary.

The plugin system introduced here is meant to be extensible: other SDK components or features that want to be provided as plugins in the future can extend the base plugin interface defined here and reuse the plugin building and loading/preloading machinery.


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow-up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

Comment on lines +209 to +213
go func() {
	if err := streamingListener.ListenBeginBlock(app.deliverState.ctx, req, res); err != nil {
		app.logger.Error("BeginBlock listening hook failed", "height", req.Header.Height, "err", err)
	}
}()


Because this goroutine is spawned and executed asynchronously, it's possible that it's scheduled when app.deliverState has one value, and executes when app.deliverState has a different value, e.g. representing the next block. This would mean that ListenBeginBlock will be called with a ctx that doesn't correspond to its sibling req and res parameters.
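A minimal sketch of one way to narrow that race, assuming the surrounding BaseApp code: snapshot the context and height into locals before spawning the goroutine, so the closure no longer dereferences app.deliverState at execution time. This is only a partial mitigation; the plain-data event approach discussed later in this thread is the more thorough fix.

// Sketch only: capture deliverState-derived values up front so the goroutine
// uses the same block's context and height that req and res correspond to.
ctx := app.deliverState.ctx
height := req.Header.Height
go func() {
	if err := streamingListener.ListenBeginBlock(ctx, req, res); err != nil {
		app.logger.Error("BeginBlock listening hook failed", "height", height, "err", err)
	}
}()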

Contributor Author

I've been debugging an AppHash mismatch error when I test the plugins on our provenance blockchain. This is most likely the cause.

Our initial thinking here was that we wanted to make calls to multiple listeners asynchronous. I'll test this out with synchronized listeners and report my findings.


The primary consequence of making listeners synchronous is that slow listeners can impact node liveness. If HaltAppOnDeliveryError is true, then users are explicitly opting in to this risk, so it's (arguably) OK there. But if that setting is false, then I think it would be a mistake to make delivery synchronous.

The solution here is to produce the complete event when this method is called, and emit that event value to listeners. The event value would need to contain only plain-old-data, i.e. no pointers or references to anything else in the node or app or etc.
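For illustration, a plain-old-data event along those lines might look something like this; the type and field names are hypothetical, not part of the PR:

package streaming // hypothetical placement

// BeginBlockEvent carries only values copied at emit time: no Context and no
// pointers back into deliverState or the app, so a late-running listener
// cannot accidentally observe the next block's state.
type BeginBlockEvent struct {
	BlockHeight int64
	ReqBytes    []byte // marshaled abci.RequestBeginBlock
	ResBytes    []byte // marshaled abci.ResponseBeginBlock
}

The caller would build this event synchronously inside BeginBlock and then hand it to listeners (directly or via a channel) without retaining any reference to app state.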

Contributor Author

> I've been debugging an AppHash mismatch error when I test the plugins on our provenance blockchain. This is most likely the cause.
>
> Our initial thinking here was that we wanted to make calls to multiple listeners asynchronous. I'll test this out with synchronized listeners and report my findings.

I tested this and it is not the cause of the AppHash error I see on a two-node localnet. I disabled all the listeners in abci.go (thinking they were the cause) and was able to narrow it down to either Inject() or Start() as the cause. When you turn on the plugin system and enable any plugin, it results in an AppHash error. I also used the iaviewer tool but didn't see any differences in the IAVL tree. This leads me to think the state change is happening at runtime before it's persisted.

Something like:

panic: Failed to process committed block (11:AF09D0FF0AACA156034874C0A14F3C2A4EB4FF384D22201062AF5E4ADD2FAE4A): wrong Block.Header.AppHash.  Expected 6AC03C53EC307EFA46AA61B006833C9C6A6A37C1B54D8B184D2B1B468D8026AB, got 7003EC3FBBDEADC9D62E27C8DF75B27BDF7EF4B575E5653E8D7921571CB9C64C

@i-norden ^^^ perhaps you can take a look at the plugin system and maybe find what is changing the app state.

> The primary consequence of making listeners synchronous is that slow listeners can impact node liveness. If HaltAppOnDeliveryError is true, then users are explicitly opting in to this risk, so it's (arguably) OK there. But if that setting is false, then I think it would be a mistake to make delivery synchronous.
>
> The solution here is to produce the complete event when this method is called, and emit that event value to listeners. The event value would need to contain only plain-old-data, i.e. no pointers or references to anything else in the node or app or etc.

I agree ^^^. I was testing to see whether asynchronous vs. synchronous listeners was the cause of my AppHash error above. We'll also need to pass in any additional info required to keep track of which block is being emitted, etc.

So the new listener interface could look like this?

type ABCIListener interface {
	ListenBeginBlock(logger log.Logger, blockHeight int64, req []byte, res []byte) error
	ListenEndBlock(logger log.Logger, blockHeight int64, req []byte, res []byte) error
	ListenDeliverTx(logger log.Logger, blockHeight int64, req []byte, res []byte) error
	HaltAppOnDeliveryError() bool
}


An ABCIListener interface should

  1. Take a context.Context as a first parameter, and respect context cancellation
  2. Already know about any log.Logger it needs, and therefore not take a logger as a parameter
  3. Not need to signal HaltAppOnDeliveryError through a method, but instead either
    a. Halt the app directly via os.Exit or panic or whatever, or
    b. Signal that information via return error from each method, detected by the caller via errors.Is/As
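Put together, a revised interface along those points might look roughly like this; it is a sketch only, and the sentinel error name is illustrative, not part of the PR:

package streaming // hypothetical placement

import (
	"context"
	"errors"
)

// ErrHaltApp is a hypothetical sentinel a listener could wrap and return; the
// caller detects it with errors.Is and decides whether to halt, replacing the
// HaltAppOnDeliveryError() method.
var ErrHaltApp = errors.New("streaming: halt app on delivery error")

// ABCIListener takes a context first and respects cancellation, and it already
// owns whatever logger it needs, so no log.Logger is passed per call.
type ABCIListener interface {
	ListenBeginBlock(ctx context.Context, blockHeight int64, req, res []byte) error
	ListenEndBlock(ctx context.Context, blockHeight int64, req, res []byte) error
	ListenDeliverTx(ctx context.Context, blockHeight int64, req, res []byte) error
}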

// in fire-and-forget fashion.
//
// This behavior is controlled by a corresponding app config setting.
HaltAppOnDeliveryError() bool


If I understand correctly, implementations of ABCIListener which are configured to HaltAppOnDeliveryError should terminate the process if any of the Listen-class methods return a non-nil error. Is that correct? If so, would it not be simpler to have those method implementations halt the app directly, rather than delegating that responsibility to the caller?

Contributor Author

app.halt() is private so it would not be possible to halt the app from the listeners. Also, do we want to expose more of the internal API to the listeners?

func (app *BaseApp) halt() {

@i-norden ^^^

@peterbourgon commented Jun 15, 2022

It seems that app.halt just kills the process un-cleanly. Presuming that is the intended behavior, you don't need a special method to do it :) as you can call os.Exit(0) — or, rather, os.Exit(1), as 0 indicates success, which an app halt most certainly isn't — from anywhere.

Contributor Author

:) very true but it's an unclean way of exiting. Also, it will bypass error reporting in the listener methods. ListenBeginBlock, ListenEndBlock, ListenDeliverTx report errors up the stack, so their method signatures will need to change.

We should have the broader team chime in on this and provide feedback on how we want plugins to behave in this regard.

app.halt does eventually os.Exit(0) so perhaps we just fix its behavior?


@egaxhaj

> very true but it's an unclean way of exiting. Also, it will bypass error reporting in the listener methods . . .

Sure! Halting an app is (AFAIK) an unclean exit, by definition. I don't think there's any way to reliably issue a "halt" from within the SDK that (a) actually terminates the process, and (b) unwinds the call stack, and captures/logs errors, in a clean and predictable way. Happy to be corrected.

> app.halt does eventually os.Exit(0) so perhaps we just fix its behavior?

Is the current behavior not correct? What would be more correct?

Member

My 2 cents: this is a secondary system that should have no effect on the app and consensus. If the indexer is halting the app to avoid losing data, then a different approach needs to be found that doesn't affect the app.

Contributor Author

@marbar3778 - @i-norden and I have had discussions about how a secondary system that does not affect the app would work. However, we agreed to move forward with the current design to allow time to work through a secondary system design.

Comment on lines +38 to +40
on = false # turn the plugin system, as a whole, on or off
enabled = ["list", "of", "plugin", "names", "to", "enable"]
dir = "the directory to load non-preloaded plugins from; defaults to cosmos-sdk/plugin/plugins"
Collaborator

The other configuration pieces use enable as the name of the on/off field for a feature.
Examples: api.enable, grpc-web.enable, grpc.enable, rosetta.enable, and statesync.enable.
The only exception is telemetry which named the field enabled.

As such, I think it would be better to rename these fields:

  • plugins.on -> plugins.enable
  • plugins.enabled -> plugins.names (or maybe plugins.plugins, or something other than enabled, which is too easily confused with enable)

Comment on lines +1 to +14
###############################################################################
### Plugin system configuration ###
###############################################################################

[plugins]

# turn the plugin system, as a whole, on or off
on = true

# List of plugin names to enable from the plugin/plugins/*
enabled = ["kafka"]

# The directory to load non-preloaded plugins from; defaults $GOPATH/src/github.com/cosmos/cosmos-sdk/plugin/plugins
dir = ""
Collaborator

These configuration pieces should be in a Plugins field in the BaseConfig (defined in server/config/config.go). They should also have corresponding entries in the config file template defined in server/config/toml.go.

Also, since each plugin will require its own set of custom configuration parameters, I think it'd be best to have a separate plugins.toml file for the specific configuration of a plugin (or even a file for each plugin). That is, this [plugins] section would go in app.toml but the other configuration pieces (e.g. plugins.streaming.file or plugins.streaming.kafka) would go in separate config files.

The reason I say all this is that the app.toml can be updated programmatically, but it uses the BaseConfig struct template (defined in that toml.go file) to do so. If there are custom fields added to the file, those will be stomped on during such an update. Because of the nature of plugins, the BaseConfig struct cannot know all the configuration options available for any plugin. That then also applies to the app.toml config file. While you can put things in there manually that can be read and used, as far as I know, the templating system doesn't allow for custom maps like what would be needed for each plugin.

Basically, in app.toml we would configure the app to enable/use a plugin. But the config for that plugin should go in a separate file.

Maybe the enabled field could be renamed to config_files that point to the config files of each plugin.
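As a rough Go sketch of what that could look like, assuming the server/config layout described above and the field renames suggested earlier (none of this is final):

package config // hypothetical addition alongside server/config/config.go

// PluginsConfig is the app.toml-level [plugins] section. Per-plugin settings
// would live in separate config files referenced from here, keeping them out
// of the templated app.toml.
type PluginsConfig struct {
	// Enable turns the plugin system as a whole on or off.
	Enable bool `mapstructure:"enable"`
	// Names lists the plugins to load (the field previously called "enabled").
	Names []string `mapstructure:"names"`
	// Dir is the directory to load non-preloaded plugins from.
	Dir string `mapstructure:"dir"`
}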

@SpicyLemon (Collaborator) left a comment

I agree with @marbar3778's first comment up top. Since these are plugins, they should be optional additions instead of required tag-a-longs. This is primarily an issue with the kafka plugin; the kafka go library doesn't support ARM, and requires special external setup and build treatment on ARM machines. Really, this holds true for any plugin that would require importing a library not needed anywhere else in the SDK.

Plus, having the plugins in a separate repo would better demonstrate how someone could build and add their own plugin.

import (
	"errors"
	"fmt"
	"github.com/confluentinc/confluent-kafka-go/kafka"
Collaborator

First, this import is out of order (it should be just above "github.com/spf13/cast").

Second, this import is problematic for the ARM architecture. Currently, it doesn't support ARM natively and instead relies on a shared C library. That means that anyone trying to use Cosmos-SDK on an ARM machine will need to install that shared library, build using the dynamic tag, and make sure various environment variables are defined correctly. The problem is significant enough that this plugin probably should not be provided with the SDK by default.

Until these plugins are spun off into their own repos/modules (and only pulled in when desired), I suggest hiding this plugin behind a build tag (similar to what Tendermint does for some database types).

Basically, unless someone wants to use this kafka plugin, the "github.com/confluentinc/confluent-kafka-go/kafka" library should not be required by the SDK. That should be true of any plugin that imports something not needed anywhere else in the SDK. It's just more painful in this case due to all the extra build requirements needed on ARM machines.
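For illustration, a build-constrained file could look like this; the "kafka" tag name is only an example, not an established SDK convention:

//go:build kafka
// +build kafka

// Package kafka is compiled in only when the "kafka" build tag is supplied,
// e.g. `go build -tags kafka ./...`, keeping the confluent-kafka-go dependency
// and its cgo/ARM requirements out of default builds.
package kafka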

Comment on lines +6 to +7
"github.com/confluentinc/confluent-kafka-go/kafka"
"github.com/tendermint/tendermint/libs/log"
Collaborator

These imports are in the wrong spot.

require.Nil(t, err)

// validate data stored in Kafka
require.Equal(t, expectedBeginBlockReqBytes, getMessageValueForTopic(msgs, string(BeginBlockReqTopic), 0))
Collaborator

I can't get these to pass.

$ docker-compose -f plugin/plugins/kafka/docker-compose.yml up -d zookeeper broker
[+] Running 3/3
 ⠿ Network kafka_default  Created                                                                                                                0.0s
 ⠿ Container zookeeper    Started                                                                                                                0.4s
 ⠿ Container broker       Started                                                                                                                1.0s
$ go test -mod=readonly -tags "dynamic cgo ledger test_ledger_mock norace" ./... --run 'Kafka'
<snip>
--- FAIL: TestKafkaStreamingService (53.45s)
    service_test.go:217:
        	Error Trace:	service_test.go:217
        	            				service_test.go:165
        	Error:      	Not equal:
        	            	expected: []byte{0x7b, 0x22, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x22, 0x41, 0x51, 0x49, 0x44, 0x42, 0x41, 0x55, 0x47, 0x42, 0x77, 0x67, 0x4a, 0x22, 0x2c, 0x22, 0x68, 0x65, 0x61, 0x64, 0x65, 0x72, 0x22, 0x3a, 0x7b, 0x22, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x22, 0x3a, 0x7b, 0x22, 0x62, 0x6c, 0x6f, 0x63, 0x6b, 0x22, 0x3a, 0x22, 0x30, 0x22, 0x2c, 0x22, 0x61, 0x70, 0x70, 0x22, 0x3a, 0x22, 0x30, 0x22, 0x7d, 0x2c, 0x22, 0x63, 0x68, 0x61, 0x69, 0x6e, 0x5f, 0x69, 0x64, 0x22, 0x3a, 0x22, 0x22, 0x2c, 0x22, 0x68, 0x65, 0x69, 0x67, 0x68, 0x74, 0x22, 0x3a, 0x22, 0x31, 0x22, 0x2c, 0x22, 0x74, 0x69, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0x30, 0x30, 0x30, 0x31, 0x2d, 0x30, 0x31, 0x2d, 0x30, 0x31, 0x54, 0x30, 0x30, 0x3a, 0x30, 0x30, 0x3a, 0x30, 0x30, 0x5a, 0x22, 0x2c, 0x22, 0x6c, 0x61, 0x73, 0x74, 0x5f, 0x62, 0x6c, 0x6f, 0x63, 0x6b, 0x5f, 0x69, 0x64, 0x22, 0x3a, 0x7b, 0x22, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x70, 0x61, 0x72, 0x74, 0x5f, 0x73, 0x65, 0x74, 0x5f, 0x68, 0x65, 0x61, 0x64, 0x65, 0x72, 0x22, 0x3a, 0x7b, 0x22, 0x74, 0x6f, 0x74, 0x61, 0x6c, 0x22, 0x3a, 0x30, 0x2c, 0x22, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x7d, 0x7d, 0x2c, 0x22, 0x6c, 0x61, 0x73, 0x74, 0x5f, 0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x64, 0x61, 0x74, 0x61, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x76, 0x61, 0x6c, 0x69, 0x64, 0x61, 0x74, 0x6f, 0x72, 0x73, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x6e, 0x65, 0x78, 0x74, 0x5f, 0x76, 0x61, 0x6c, 0x69, 0x64, 0x61, 0x74, 0x6f, 0x72, 0x73, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x63, 0x6f, 0x6e, 0x73, 0x65, 0x6e, 0x73, 0x75, 0x73, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x61, 0x70, 0x70, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x6c, 0x61, 0x73, 0x74, 0x5f, 0x72, 0x65, 0x73, 0x75, 0x6c, 0x74, 0x73, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x65, 0x76, 0x69, 0x64, 0x65, 0x6e, 0x63, 0x65, 0x5f, 0x68, 0x61, 0x73, 0x68, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x70, 0x72, 0x6f, 0x70, 0x6f, 0x73, 0x65, 0x72, 0x5f, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x7d, 0x2c, 0x22, 0x6c, 0x61, 0x73, 0x74, 0x5f, 0x63, 0x6f, 0x6d, 0x6d, 0x69, 0x74, 0x5f, 0x69, 0x6e, 0x66, 0x6f, 0x22, 0x3a, 0x7b, 0x22, 0x72, 0x6f, 0x75, 0x6e, 0x64, 0x22, 0x3a, 0x31, 0x2c, 0x22, 0x76, 0x6f, 0x74, 0x65, 0x73, 0x22, 0x3a, 0x5b, 0x5d, 0x7d, 0x2c, 0x22, 0x62, 0x79, 0x7a, 0x61, 0x6e, 0x74, 0x69, 0x6e, 0x65, 0x5f, 0x76, 0x61, 0x6c, 0x69, 0x64, 0x61, 0x74, 0x6f, 0x72, 0x73, 0x22, 0x3a, 0x5b, 0x5d, 0x7d}
        	            	actual  : []byte{0xa, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x12, 0x15, 0xa, 0x0, 0x18, 0x1, 0x22, 0xb, 0x8, 0x80, 0x92, 0xb8, 0xc3, 0x98, 0xfe, 0xff, 0xff, 0xff, 0x1, 0x2a, 0x2, 0x12, 0x0, 0x1a, 0x2, 0x8, 0x1}

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,32 +1,5 @@
        	            	-([]uint8) (len=465) {
        	            	- 00000000  7b 22 68 61 73 68 22 3a  22 41 51 49 44 42 41 55  |{"hash":"AQIDBAU|
        	            	- 00000010  47 42 77 67 4a 22 2c 22  68 65 61 64 65 72 22 3a  |GBwgJ","header":|
        	            	- 00000020  7b 22 76 65 72 73 69 6f  6e 22 3a 7b 22 62 6c 6f  |{"version":{"blo|
        	            	- 00000030  63 6b 22 3a 22 30 22 2c  22 61 70 70 22 3a 22 30  |ck":"0","app":"0|
        	            	- 00000040  22 7d 2c 22 63 68 61 69  6e 5f 69 64 22 3a 22 22  |"},"chain_id":""|
        	            	- 00000050  2c 22 68 65 69 67 68 74  22 3a 22 31 22 2c 22 74  |,"height":"1","t|
        	            	- 00000060  69 6d 65 22 3a 22 30 30  30 31 2d 30 31 2d 30 31  |ime":"0001-01-01|
        	            	- 00000070  54 30 30 3a 30 30 3a 30  30 5a 22 2c 22 6c 61 73  |T00:00:00Z","las|
        	            	- 00000080  74 5f 62 6c 6f 63 6b 5f  69 64 22 3a 7b 22 68 61  |t_block_id":{"ha|
        	            	- 00000090  73 68 22 3a 6e 75 6c 6c  2c 22 70 61 72 74 5f 73  |sh":null,"part_s|
        	            	- 000000a0  65 74 5f 68 65 61 64 65  72 22 3a 7b 22 74 6f 74  |et_header":{"tot|
        	            	- 000000b0  61 6c 22 3a 30 2c 22 68  61 73 68 22 3a 6e 75 6c  |al":0,"hash":nul|
        	            	- 000000c0  6c 7d 7d 2c 22 6c 61 73  74 5f 63 6f 6d 6d 69 74  |l}},"last_commit|
        	            	- 000000d0  5f 68 61 73 68 22 3a 6e  75 6c 6c 2c 22 64 61 74  |_hash":null,"dat|
        	            	- 000000e0  61 5f 68 61 73 68 22 3a  6e 75 6c 6c 2c 22 76 61  |a_hash":null,"va|
        	            	- 000000f0  6c 69 64 61 74 6f 72 73  5f 68 61 73 68 22 3a 6e  |lidators_hash":n|
        	            	- 00000100  75 6c 6c 2c 22 6e 65 78  74 5f 76 61 6c 69 64 61  |ull,"next_valida|
        	            	- 00000110  74 6f 72 73 5f 68 61 73  68 22 3a 6e 75 6c 6c 2c  |tors_hash":null,|
        	            	- 00000120  22 63 6f 6e 73 65 6e 73  75 73 5f 68 61 73 68 22  |"consensus_hash"|
        	            	- 00000130  3a 6e 75 6c 6c 2c 22 61  70 70 5f 68 61 73 68 22  |:null,"app_hash"|
        	            	- 00000140  3a 6e 75 6c 6c 2c 22 6c  61 73 74 5f 72 65 73 75  |:null,"last_resu|
        	            	- 00000150  6c 74 73 5f 68 61 73 68  22 3a 6e 75 6c 6c 2c 22  |lts_hash":null,"|
        	            	- 00000160  65 76 69 64 65 6e 63 65  5f 68 61 73 68 22 3a 6e  |evidence_hash":n|
        	            	- 00000170  75 6c 6c 2c 22 70 72 6f  70 6f 73 65 72 5f 61 64  |ull,"proposer_ad|
        	            	- 00000180  64 72 65 73 73 22 3a 6e  75 6c 6c 7d 2c 22 6c 61  |dress":null},"la|
        	            	- 00000190  73 74 5f 63 6f 6d 6d 69  74 5f 69 6e 66 6f 22 3a  |st_commit_info":|
        	            	- 000001a0  7b 22 72 6f 75 6e 64 22  3a 31 2c 22 76 6f 74 65  |{"round":1,"vote|
        	            	- 000001b0  73 22 3a 5b 5d 7d 2c 22  62 79 7a 61 6e 74 69 6e  |s":[]},"byzantin|
        	            	- 000001c0  65 5f 76 61 6c 69 64 61  74 6f 72 73 22 3a 5b 5d  |e_validators":[]|
        	            	- 000001d0  7d                                                |}|
        	            	+([]uint8) (len=38) {
        	            	+ 00000000  0a 09 01 02 03 04 05 06  07 08 09 12 15 0a 00 18  |................|
        	            	+ 00000010  01 22 0b 08 80 92 b8 c3  98 fe ff ff ff 01 2a 02  |."............*.|
        	            	+ 00000020  12 00 1a 02 08 01                                 |......|
        	            	 }
        	Test:       	TestKafkaStreamingService
FAIL
FAIL	github.com/cosmos/cosmos-sdk/plugin/plugins/kafka/service	53.923s
<snip>

Collaborator

I tried changing the require.Equal calls to assert.Equal so that I could see the results of those test points as well. I did that in testListenBeginBlock, testListenDeliverTx1, testListenDeliverTx2, and testListenEndBlock. All of those pieces fail for me, with similar unreadable messages coming back from getMessageValueForTopic.

Relatedly, it's okay to use require when the test should immediately stop due to a failure (e.g. when setting up a test), but after setup, when doing several checks (like these msg value comparisons), it's better to use assert. That way, you know more about what's breaking and aren't constantly rerunning tests just to find the next thing you need to fix.
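A small, hypothetical illustration of that split (not the actual service_test.go):

package service_test

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestRequireVsAssert(t *testing.T) {
	// Setup: if this fails there is nothing meaningful left to check,
	// so require stops the test immediately.
	msgs := map[string][]byte{
		"begin_block_req": {0x0a},
		"end_block_res":   {0x12},
	}
	require.NotEmpty(t, msgs)

	// Independent checks: assert records each failure and keeps going,
	// so one bad value doesn't hide the results of the others.
	assert.Equal(t, []byte{0x0a}, msgs["begin_block_req"])
	assert.Equal(t, []byte{0x12}, msgs["end_block_res"])
}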

@@ -0,0 +1,560 @@
package service
Collaborator

The tests in this file look like they require a running kafka broker/zookeeper. If that's the case, they shouldn't be run by default (e.g. with make test). But they should still be runnable, possibly with a different make target. Furthermore, they should still be run as part of the github actions and checks, but will probably need a specific job to do so. Also, there should probably be a comment at the top with guidance on running these tests.
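One hedged way to do that, assuming a dedicated build tag and make target (both names are illustrative):

//go:build kafka_integration
// +build kafka_integration

// These tests need a running Kafka broker and ZooKeeper; see
// plugin/plugins/kafka/docker-compose.yml. They are excluded from the default
// `make test` run and would be invoked by a dedicated target, e.g.
// `go test -tags kafka_integration ./plugin/plugins/kafka/...`.
package service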

@egaxhaj egaxhaj closed this Aug 1, 2022
@egaxhaj egaxhaj reopened this Aug 1, 2022
@egaxhaj (Contributor Author) commented Aug 1, 2022

Closed by accident.

@egaxhaj (Contributor Author) commented Aug 2, 2022

> I agree with @marbar3778's first comment up top. Since these are plugins, they should be optional additions instead of required tag-a-longs. This is primarily an issue with the kafka plugin; the kafka go library doesn't support ARM, and requires special external setup and build treatment on ARM machines. Really, this holds true for any plugin that would require importing a library not needed anywhere else in the SDK.
>
> Plus, having the plugins in a separate repo would better demonstrate how someone could build and add their own plugin.

@marbar3778

Our intent is to host these in a separate repository, as discussed here. However, we would still face similar problems with other third-party libraries in the future, would we not?

Another area of concern is how the plugin injection system works. Underneath it all, the plugin system relies on Go's plugin.Plugin to open and load plugins. Go's plugin system comes with a few restrictions.

  • the pre-compiled plugins must have package main as their package name or they'll not compile.
  • the same Go version must be used between the pre-compiled plugins and the code loading them or they'll fail to load. This will require all plugins to be pre-compiled on each Go version bump of the SDK.
  • supported only on Linux, FreeBSD, and macOS
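To make the restrictions above concrete, here is a hypothetical host-side loading sketch with the stdlib plugin package (the path and symbol name are illustrative):

package main

import (
	"log"
	"plugin"
)

func main() {
	// Fails unless the .so was built from a main package with the exact same
	// Go toolchain and dependency versions as this binary, and only works on
	// Linux, FreeBSD, and macOS.
	p, err := plugin.Open("plugins/kafka.so")
	if err != nil {
		log.Fatal(err)
	}
	sym, err := p.Lookup("StreamingPlugin") // exported symbol the plugin must expose
	if err != nil {
		log.Fatal(err)
	}
	_ = sym // would be type-asserted to the shared plugin interface here
}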

Instead, why not take a similar approach as in the ABCI protocol and use gRPC to push the data out to another process? This other gRPC process can listen on localhost.

  • we are now able to leverage a common design theme that the SDK already uses
  • plugin authors can now leverage the language of their choice
  • no more having to solve issues about third party dependencies
  • no more separate repository to deal with
  • simple design

Take a look at hashicorp/go-plugin. We can borrow from their approach.

Our requirements are to push state-change and ABCI request/response events for BeginBlock, EndBlock, DeliverTx (FinalizeBlock in ABCI++) out to an external system. These are already processed events; our approach needs to be elegant and simple and, most importantly, should not run into any of the issues mentioned above.

A few questions have come up about overhead. If this is a huge deterrent for those concerned, then state listening, when enabled, can be a no-op if a node is running as a validator, similar to how enabling grpc-web is a no-op when grpc is disabled.

@peterbourgon commented Aug 7, 2022

> Underneath it all, the plugin system relies on Go's plugin.Plugin to open and load plugins. Go's plugin system comes with a few restrictions . . .

Not just restrictions! Like e.g. package netchan, Go's stdlib package plugin was always experimental, and has been effectively abandoned for years. It's not an appropriate choice for anything other than toy projects.

There are basically two viable approaches for plugins in Go: external processes and RPCs (i.e. hashicorp/go-plugin, as @egaxhaj mentions), or compile-time options (i.e. caddyserver/caddy). I'd consider this refactor a blocking requirement.

@tac0turtle (Member)

> Instead, why not take a similar approach as in the ABCI protocol and use gRPC to push the data out to another process? This other gRPC process can listen on localhost.

I'm a huge fan of this approach. I like it a lot better than the current approach.

@alexanderbez (Contributor)

I really prefer the gRPC option as well, as I feel it already fits nicely into the SDK architecture and would allow for a great range of flexibility in plugin implementations.

@tac0turtle (Member)

Could we use hashicorp/go-plugin and replace all the current code?

@egaxhaj (Contributor Author) commented Aug 16, 2022

> Could we use hashicorp/go-plugin and replace all the current code?

Yes, I'll work on a PR using hashicorp/go-plugin to see how well it fits with what we want to do.

@egaxhaj (Contributor Author) commented Aug 23, 2022

> Could we use hashicorp/go-plugin and replace all the current code?
>
> Yes, I'll work on a PR using hashicorp/go-plugin to see how well it fits with what we want to do.

Update on the timeline. I'm wrapping up some other work this week and will be out on PTO 8/25 - 9/5. I'll be free to focus my time on this when I'm back in the office.

@tac0turtle (Member)

Thank you for the update!! Enjoy the time off.

@tac0turtle (Member)

@egaxhaj following up on this. do you need some help in pushing this over the finish line?

@egaxhaj (Contributor Author) commented Sep 19, 2022

> @egaxhaj following up on this. do you need some help in pushing this over the finish line?

I was wrapping up work that prevented me from looking at this any sooner. I'll be focusing solely on this this week. I'll check in later in the week to let you know of my progress.

@egaxhaj (Contributor Author) commented Oct 6, 2022

replaced by #13472

@egaxhaj egaxhaj closed this Oct 6, 2022
@egaxhaj egaxhaj mentioned this pull request Oct 12, 2022
@egaxhaj egaxhaj mentioned this pull request Dec 7, 2022
Labels: C:Store, T: ADR, Type: ADR, Type: Build