
metricbeat: fix the consumergroup of kafka module incorrectly work #5880

Merged
merged 1 commit into from
Jan 17, 2018

Conversation

wangdisdu
Contributor

The consumergroup metricset fetches offset info by sending an OffsetFetch request to Kafka.
But the topic and partition parameters were missing from the sent OffsetFetch request.
These parameters are required, and Kafka will not respond with anything if they are missing.

See the Kafka protocol guide: http://kafka.apache.org/protocol.html#The_Messages_OffsetFetch
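The shape of the fix: pass an explicit topic-to-partition-IDs map into the offset fetch, so the request body is never empty. A stdlib-only sketch of the idea (`buildQueryTopics` is a made-up helper name for illustration; the real code lives in metricbeat's kafka module and hands the map to sarama, which encodes it into the OffsetFetch body):

```go
package main

import "fmt"

// buildQueryTopics flattens the known partition assignments into the
// topic -> partition-IDs map that the fixed FetchGroupOffsets signature
// requires. Topics without partitions are skipped, since Kafka returns
// nothing for an OffsetFetch request with no topic/partition entries.
func buildQueryTopics(assignments map[string]map[int32]struct{}) map[string][]int32 {
	queryTopics := map[string][]int32{}
	for topic, partitions := range assignments {
		ids := make([]int32, 0, len(partitions))
		for id := range partitions {
			ids = append(ids, id)
		}
		if len(ids) > 0 {
			queryTopics[topic] = ids
		}
	}
	return queryTopics
}

func main() {
	assignments := map[string]map[int32]struct{}{
		"wangdi-test-mb-topic": {0: {}, 1: {}, 2: {}},
	}
	fmt.Println(len(buildQueryTopics(assignments)["wangdi-test-mb-topic"])) // 3
}
```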

@elasticmachine
Collaborator

Can one of the admins verify this patch?

_partitions := make(map[string][]int32)
for topic, partitions := range topics {
  if topicsFilter == nil || topicsFilter(topic) {
    for partitionID, _ := range partitions {


should omit 2nd value from range; this loop is equivalent to for partitionID := range ...
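For illustration, the loop form the linter asks for, in a standalone sketch (not the beats code itself):

```go
package main

import (
	"fmt"
	"sort"
)

// partitionIDs shows the lint fix: `for partitionID, _ := range` is
// equivalent to `for partitionID := range`, so the blank second value
// should simply be omitted.
func partitionIDs(partitions map[int32]struct{}) []int32 {
	ids := make([]int32, 0, len(partitions))
	for partitionID := range partitions { // not: for partitionID, _ := range
		ids = append(ids, partitionID)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	fmt.Println(partitionIDs(map[int32]struct{}{2: {}, 0: {}, 1: {}})) // [0 1 2]
}
```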

@@ -230,11 +230,16 @@ func (b *Broker) DescribeGroups(
return groups, nil
}

func (b *Broker) FetchGroupOffsets(group string) (*sarama.OffsetFetchResponse, error) {
func (b *Broker) FetchGroupOffsets(group string, partitions map[string][]int32) (*sarama.OffsetFetchResponse, error) {


exported method Broker.FetchGroupOffsets should have comment or be unexported
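The lint rule wants a godoc comment that starts with the method name. A standalone sketch of the convention (the `Broker` type and string return value here are simplified stand-ins; the real method is on metricbeat's kafka Broker type and returns a *sarama.OffsetFetchResponse):

```go
package main

import "fmt"

// Broker is a simplified stand-in for metricbeat's kafka broker wrapper.
type Broker struct{ id int32 }

// FetchGroupOffsets queries the broker for the committed offsets of the
// given consumer group, restricted to the listed topic partitions.
// (Godoc convention: the comment begins with the exported method's name.)
func (b *Broker) FetchGroupOffsets(group string, partitions map[string][]int32) string {
	return fmt.Sprintf("OffsetFetch{group:%s topics:%d}", group, len(partitions))
}

func main() {
	b := &Broker{id: 2}
	fmt.Println(b.FetchGroupOffsets("wangdi-test-mb-group", map[string][]int32{"t": {0, 1}}))
}
```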


@wangdisdu wangdisdu changed the title metricbeat: fix the consumergroup of kafka module do not fetch anything metricbeat: fix the consumergroup of kafka module incorrectly work Dec 14, 2017
@wangdisdu wangdisdu force-pushed the metricbeat-kafka-consumergroup branch 2 times, most recently from 8a04335 to 03d7f64 Compare December 15, 2017 02:30
return err
for waiting > 0 {
  ret := <-results
  waiting -= 1


should replace waiting -= 1 with waiting--


wg.Add(1)
go func() {
  waiting += 1


should replace waiting += 1 with waiting++

@wangdisdu wangdisdu force-pushed the metricbeat-kafka-consumergroup branch from 03d7f64 to e4d1877 Compare December 15, 2017 02:44
@ruflin ruflin requested a review from urso December 17, 2017 10:36
@ruflin ruflin added the module label Dec 29, 2017
@ruflin
Contributor

ruflin commented Dec 29, 2017

@wangdisdu Just wanted to let you know that we saw the PR. As @urso has already been thinking a bit about how to fix Kafka, I'm hoping he can have a look at it.


@urso urso left a comment


Thanks for contributing to the kafka module. It's a great step forward. Unfortunately I don't think it fixes all the issues we have: basically, some consumer group state is held by another kafka broker.


waiting := 0
for group, topics := range assignments {
  _partitions := make(map[string][]int32)

@urso urso Jan 4, 2018


no underscore please.

in the beats codebase we prefer to use make/new only if there is no other syntax pattern. Use

partitions := map[string][]int32{}

How about (reduce key lookups and appends/alloc to slice):

  queryTopics := map[string][]int32{}
  for topic, partitions := range topics {
    if topicFilter != nil && !topicFilter(topic) {
      continue
    }

    // copy partition ids
    L := len(partitions)
    if L == 0 {
      continue
    }

    ids, i := make([]int32, L), 0
    for partition := range partitions {
      ids[i], i = partition, i+1
    }
    queryTopics[topic] = ids
  }

  if len(queryTopics) == 0 {
    continue
  }
 
  ...

Contributor Author

@wangdisdu wangdisdu Jan 5, 2018


fixed: renamed the underscored variable

for partitionID := range partitions {
  if _, ok := _partitions[topic]; !ok {
    _partitions[topic] = make([]int32, 0)
  }


This if-clause is not required: `_partitions[topic]` returns nil if the topic is not known, and `append(([]int32)(nil), value)` returns a new `[]int32{value}`.
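A standalone sketch of the zero-value behavior being described (`appendByTopic` is a made-up helper name, not beats code):

```go
package main

import "fmt"

// appendByTopic relies on Go zero values: indexing a map with a missing
// key yields the zero value, which for a slice type is nil, and append
// on a nil slice allocates a fresh one -- so no "does the key exist
// yet?" check is needed before appending.
func appendByTopic(dst map[string][]int32, topic string, id int32) {
	dst[topic] = append(dst[topic], id)
}

func main() {
	partitions := map[string][]int32{}
	appendByTopic(partitions, "topic-a", 0) // nil -> []int32{0}
	appendByTopic(partitions, "topic-a", 1)
	fmt.Println(partitions["topic-a"]) // [0 1]
}
```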

Contributor Author


fixed: optimized the code that generates the partition array

results := make(chan result, len(groups))
for _, group := range groups {
group := group
results := make(chan result, len(assignments))


As the loop collecting data no longer waits on wg.Wait but reads through all results, the wg is not really required anymore. We can even introduce an unbuffered channel, make(chan result), now.
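A runnable sketch of the WaitGroup-free pattern being suggested (`gather` and this `result` struct are simplified stand-ins for the metricset's types):

```go
package main

import "fmt"

type result struct {
	group string
	err   error
}

// gather launches one goroutine per group and drains exactly `waiting`
// results. Counting outstanding goroutines replaces the sync.WaitGroup,
// and an unbuffered channel suffices because the collector is always
// ready to receive.
func gather(groups []string) int {
	results := make(chan result)
	waiting := 0
	for _, group := range groups {
		group := group // capture loop variable (needed before Go 1.22)
		waiting++
		go func() { results <- result{group: group} }()
	}

	collected := 0
	for waiting > 0 {
		<-results
		waiting--
		collected++
	}
	return collected
}

func main() {
	fmt.Println(gather([]string{"group-a", "group-b", "group-c"})) // 3
}
```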

Contributor Author

@wangdisdu wangdisdu Jan 5, 2018


fixed: removed unnecessary waitgroup

ret := <-results
waiting--
if ret.err != nil {
  continue


Originally we returned an error -> we still want to return the first error we encounter:

for waiting > 0 {
  ret := <- results
  waiting--
  if ret.err != nil {
    if err == nil {
      err = ret.err
    }
    continue
  }

  ...
}
if err != nil {
  return err
}

Contributor Author


fixed: return the first error, if any

@wangdisdu wangdisdu force-pushed the metricbeat-kafka-consumergroup branch from e4d1877 to 664fb09 Compare January 5, 2018 03:45
@wangdisdu
Contributor Author

wangdisdu commented Jan 5, 2018

Hi @ruflin @urso Thanks for your review. I have adjusted this PR:

  1. removed the unnecessary waitgroup
  2. renamed the underscored variable
  3. optimized the code that generates the partition array
  4. return the first error, if any

As for the issue of some consumer group state being held by another Kafka broker: I do not think it is an issue, because metricbeat collects consumer group info only from the configured Kafka hosts. If someone wants to collect all info, they need to configure all Kafka hosts in the config file.

}
return err
continue

@urso urso Jan 5, 2018


we can stop processing data after having seen an error by doing this:

for waiting > 0 {
  ret := <-results
  waiting--
  if ret.err != nil && err == nil {
    err = ret.err
  }
  if err != nil {
    continue
  }

  ...
}

This will keep the waiting loop alive and assign only the first ret.err, but also stop processing any further results (results are discarded anyway).

Contributor Author


@urso I fixed it.

@urso

urso commented Jan 5, 2018

As for the issue of some consumer group state being held by another Kafka broker: I do not think it is an issue, because metricbeat collects consumer group info only from the configured Kafka hosts. If someone wants to collect all info, they need to configure all Kafka hosts in the config file.

This might still be a problem. The consumergroup metricset needs to correlate data which is potentially distributed between different Kafka nodes. The problem is that the metricsets in metricbeat are fully isolated. That is, even if you configure all brokers, the metricsets run in isolation per broker -> they cannot correlate meta-data distributed on different brokers.

See related ticket: #4285
The Discuss forum thread linked in the ticket contains quite a few more details.

@wangdisdu wangdisdu force-pushed the metricbeat-kafka-consumergroup branch from 664fb09 to f0c00d1 Compare January 5, 2018 12:27
@wangdisdu
Contributor Author

wangdisdu commented Jan 11, 2018

@urso Hi, I found out that it is not a problem.

The offsets for a given consumer group are maintained by a specific broker called the group coordinator. A consumer needs to fetch offsets from this specific broker. Link: kafka protocol guide

It means the metadata of a consumer group is stored on one specific broker, the coordinator. We get the group list from a coordinator first, and can then fetch all offsets of a group from it.

I tested and verified this on Kafka 0.10.2.
I got all offset info of group "wangdi-test-mb-group" from one broker, "192.168.31.31:9092":

2018-01-11T20:02:36.540+0800	DEBUG	[publish]	pipeline/processor.go:275	Publish event: {
  "@timestamp": "2018-01-11T12:02:36.531Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "doc",
    "version": "7.0.0-alpha1"
  },
  "metricset": {
    "name": "consumergroup",
    "module": "kafka",
    "host": "192.168.31.31:9092",
    "rtt": 8546
  },
  "kafka": {
    "consumergroup": {
      "offset": 16692,
      "meta": "",
      "broker": {
        "id": 2,
        "address": "192.168.31.31:9092"
      },
      "topic": "wangdi-test-mb-topic",
      "partition": 0,
      "client": {
        "id": "consumer-1",
        "host": "172.16.128.23",
        "member_id": "consumer-1-ed7ae7c7-0534-49b2-84c9-26b8c64063df"
      },
      "id": "wangdi-test-mb-group"
    }
  },
  "beat": {
    "version": "7.0.0-alpha1",
    "name": "localhost",
    "hostname": "localhost"
  }
}
2018-01-11T20:02:36.540+0800	DEBUG	[publish]	pipeline/processor.go:275	Publish event: {
  "@timestamp": "2018-01-11T12:02:36.531Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "doc",
    "version": "7.0.0-alpha1"
  },
  "kafka": {
    "consumergroup": {
      "id": "wangdi-test-mb-group",
      "offset": 16429,
      "topic": "wangdi-test-mb-topic",
      "partition": 1,
      "meta": "",
      "broker": {
        "id": 2,
        "address": "192.168.31.31:9092"
      },
      "client": {
        "id": "consumer-1",
        "host": "172.16.128.23",
        "member_id": "consumer-1-ed7ae7c7-0534-49b2-84c9-26b8c64063df"
      }
    }
  },
  "metricset": {
    "name": "consumergroup",
    "module": "kafka",
    "host": "192.168.31.31:9092",
    "rtt": 8566
  },
  "beat": {
    "name": "localhost",
    "hostname": "localhost",
    "version": "7.0.0-alpha1"
  }
}
2018-01-11T20:02:36.540+0800	DEBUG	[publish]	pipeline/processor.go:275	Publish event: {
  "@timestamp": "2018-01-11T12:02:36.531Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "doc",
    "version": "7.0.0-alpha1"
  },
  "metricset": {
    "name": "consumergroup",
    "module": "kafka",
    "host": "192.168.31.31:9092",
    "rtt": 8574
  },
  "kafka": {
    "consumergroup": {
      "partition": 2,
      "offset": 16499,
      "broker": {
        "id": 2,
        "address": "192.168.31.31:9092"
      },
      "id": "wangdi-test-mb-group",
      "meta": "",
      "client": {
        "member_id": "consumer-1-ed7ae7c7-0534-49b2-84c9-26b8c64063df",
        "id": "consumer-1",
        "host": "172.16.128.23"
      },
      "topic": "wangdi-test-mb-topic"
    }
  },
  "beat": {
    "name": "localhost",
    "hostname": "localhost",
    "version": "7.0.0-alpha1"
  }
}

@urso

urso commented Jan 11, 2018

@wangdisdu Cool. How many Kafka brokers have you used? Did you use one central metricbeat, or one per host? Normally we recommend installing metricbeat on each host and configuring the metricbeat kafka module to connect to the localhost broker only (metricbeat matches the meta-data with the hostname to figure out the actual broker's name from the metadata response). With one metricbeat instance per host, we can also collect CPU/memory/disk usage on that host. Even though state is distributed between Kafka nodes, each metricbeat will collect a subset of the overall Kafka cluster state.

PR looks good to me. Can you add a changelog entry about fixing the OffsetFetch request?
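The per-host setup described above is typically configured like this; a minimal sketch assuming the standard metricbeat kafka module keys (verify against the module reference for your metricbeat version):

```yaml
# modules.d/kafka.yml on each Kafka host: point the module at the
# local broker only; one metricbeat instance runs per host.
- module: kafka
  metricsets: ["consumergroup", "partition"]
  period: 10s
  hosts: ["localhost:9092"]
```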

@wangdisdu
Contributor Author

wangdisdu commented Jan 12, 2018

@urso For testing the kafka metricsets, I built a Kafka cluster with 3 brokers.
To produce test data, I used the standard Java Kafka client to produce and consume messages.
I tested metricbeat on each broker, configured with the local host IP (same as localhost).

And I added the changelog entry.

…ot fetch anything

The `consumergroup` metricset fetches offset info by sending an `OffsetFetch` request to Kafka.
But the `topic` and `partition` parameters were missing from the sent `OffsetFetch` request.
These parameters are required, and Kafka will not respond with anything if they are missing.
@wangdisdu wangdisdu force-pushed the metricbeat-kafka-consumergroup branch from f0c00d1 to f07636a Compare January 12, 2018 01:51
@wangdisdu
Contributor Author

@ruflin Will this PR be merged? So I can do more things based on it, like batching requests together for efficiency.

@urso

urso commented Jan 17, 2018

@wangdisdu How many partitions and consumer groups did you use? The ticket I mentioned uses 10 partitions and 3 consumer groups; sometimes, with fewer consumer groups, we could see all data. Only adding more groups/partitions to the picture showed us some missing consumer group data.

@urso urso merged commit da70593 into elastic:master Jan 17, 2018
@urso

urso commented Jan 17, 2018

Merged the PR. Thank you for taking the time to work on this long outstanding issue. Your help is highly appreciated ;)

@wangdisdu
Contributor Author

wangdisdu commented Jan 18, 2018

@urso It works well even after adding more groups. I tested and verified it:

  • I created topic wd-test-mb-cg with 10 partitions and a replication factor of 2
    (screenshot)

  • I used 5 consumer groups to fetch messages from topic wd-test-mb-cg
    (screenshot)

  • I checked that no data is missing; the partition count of every group is always 10
    (screenshot)

  • I increased the number of consumer groups to 50 fetching messages from topic wd-test-mb-cg
    (screenshot)

  • I checked again that no data is missing; the partition count of every group is always 10
    (screenshot)

@ruflin
Contributor

ruflin commented Jan 18, 2018

@wangdisdu I can only double down on what @urso said. We really appreciate your work here and being patient with us to get the changes in. ❤️

@urso urso added the needs_backport PR is waiting to be backported to other branches. label Feb 2, 2018
@urso urso removed the needs_backport PR is waiting to be backported to other branches. label Jun 18, 2018
@npredey

npredey commented Jun 20, 2018

Which version of metricbeat is this available on and will it work with Elasticsearch 5.3? Thanks.

@ruflin
Contributor

ruflin commented Jun 21, 2018

This is part of the 6.3 release.

6 participants