Drop elixir < 1.5 support #386

Merged: 8 commits merged into master from extra_applications on Apr 4, 2020
Conversation

@joshuawscott (Member) commented Nov 18, 2019

  • Switch to extra_applications so kayrock can be used in releases (see the sketch after this list)
  • Add an overhead timeout in server.ex so timeout failures happen where they should
  • Testing:
    • Use env vars in docker-compose to ease the configuration.
    • Use several listeners for the Kafka brokers so that we don't have to detect the broker IPs.
    • Use random topic names to make sure each test is independent.
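
For the first item, a minimal mix.exs sketch of what the extra_applications change looks like (the application list and the mod entry below are assumptions for illustration, not necessarily kafka_ex's exact values):

  # mix.exs — illustrative sketch, not the exact kafka_ex file.
  # With Elixir >= 1.4, runtime applications for deps such as kayrock are
  # inferred automatically; `extra_applications` only lists OTP apps that
  # are not deps, so release tooling starts kayrock correctly.
  def application do
    [
      extra_applications: [:logger, :ssl],  # assumed list
      mod: {KafkaEx, []}                    # assumed callback module
    ]
  end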

@dantswain (Collaborator) left a comment

I think we could get away with just testing 1.5 with OTP 19, but it's not that big of a deal either way. It's just robot time anyways.

@jbruggem (Collaborator) left a comment

👍

@dantswain (Collaborator)

Yeah no, I ain't approvin this until the build succeeds 😂

I totes approve in theory, tho.

@dantswain (Collaborator) left a comment

Step 1. Identify the problem.
Step 2. FIX IT

@jbruggem (Collaborator)

@joshuawscott do you mind if I take over?

@jbruggem (Collaborator)

I have run the tests locally many times, and there's a test that fails very frequently: the check on the number of open ports after shutting down consumer groups.

I believe those two checks to be unreliable and I'm in favor of removing them.

For the rest, generous timeouts seem to help.
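
For illustration, one way to grant such generous timeouts in ExUnit; the module name and value here are hypothetical, not the exact change in this branch:

  # Hypothetical sketch: raise ExUnit's per-test timeout (60_000 ms by
  # default) for the slow consumer group integration tests.
  defmodule KafkaEx.ConsumerGroupFlakyTest do
    use ExUnit.Case
    @moduletag timeout: 180_000  # three minutes instead of the 60s default

    # ... consumer group tests go here ...
  end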

@jbruggem force-pushed the extra_applications branch from 66add73 to 857fabd on March 10, 2020 12:20
@jbruggem (Collaborator) commented Mar 10, 2020

I made a force-push with the changes I needed locally to make this work.

@joshuawscott If needed, I pushed your original branch as a tag (joshuawscott_extra_applications)

I also took the opportunity to clean up the code around docker-compose a bit. All the custom scripts and config can be elegantly replaced by env vars.

@jbruggem force-pushed the extra_applications branch 10 times, most recently from d4b9a47 to dcddda3 on March 10, 2020 16:57
@jbruggem (Collaborator)

I've been trying to make the tests pass for a few days.

My conclusion is that the tests are simply unstable. Oftentimes it's due to timeouts in GenServer calls.

The consumer group tests in particular have frequent failures, but the failures are not limited to them.

I'm not sure what to do with this.

@bjhaid (Member) commented Mar 11, 2020

The failures are not consistent; I just re-kicked the failing builds in Travis... It's probably important to investigate and fix the flakes, but some of them have been there for a while.

@joshuawscott (Member, Author)

Thanks for looking!

@bjhaid (Member) commented Mar 11, 2020

1.9.4 as compared to the others seems to be failing consistently after ~5 runs, so it seems there's a genuine failure there.

@jbruggem (Collaborator) commented Mar 11, 2020

> 1.9.4 as compared to the others seems to be failing consistently after ~5 runs, so it seems there's a genuine failure there.

Indeed. I'm testing with a 1.9.4 build without credo and coveralls, to see if they're related.

I'm also testing 1.10, to see if the problem also happens there.

@jbruggem (Collaborator)

Can confirm:

  • the problem is not related to running coveralls
  • the problem appears with Elixir 1.9 and Elixir 1.10

@jbruggem (Collaborator)

Locally, the following test can reproduce the errors reliably. There's a timeout after a few hundred messages:

  test "produce 10_000 with an acq required returns offset each time" do
    for index <- 0..10_000 do
      {:ok, offset} =
        KafkaEx.produce(%Proto.Produce.Request{
          topic: "food",
          partition: 0,
          required_acks: 1,
          messages: [%Proto.Produce.Message{value: "hey_#{index}"}]
        })

      refute offset == nil
    end
  end
The resulting failure:

  1) test produce 1_000 with an acq required returns offset each time (KafkaEx.Integration.Test)
     test/integration/integration_test.exs:125
     ** (exit) exited in: GenServer.call(:kafka_ex, {:produce, %KafkaEx.Protocol.Produce.Request{api_version: 0, compression: :none, messages: [%KafkaEx.Protocol.Produce.Message{key: nil, timestamp: nil, value: "hey_440"}], partition: 0, required_acks: 1, timeout: 0, topic: "food"}}, 60000)
         ** (EXIT) time out
     code: for index <- 0..10_000 do
     stacktrace:
       (elixir 1.10.2) lib/gen_server.ex:1023: GenServer.call/3
       test/integration/integration_test.exs:128: anonymous fn/2 in KafkaEx.Integration.Test."test produce 1_000 with an acq required returns offset each time"/1
       (elixir 1.10.2) lib/enum.ex:3371: Enum.reduce_range_inc/4
       test/integration/integration_test.exs:126: (test)

     The following output was logged:
     
     23:31:06.684 [error] Receiving data from broker "localhost":9093 failed with :timeout

@jbruggem (Collaborator)

By retrying on failure, I don't get any errors anymore, even with 100k messages produced.

It seems that in NetworkClient the socket is closed upon failure, meaning (I guess) that a new socket is opened and that one does not time out.

(Off-topic: during my investigation I discovered another bug: the same :sync_timeout value is used both for the GenServer.call timeout and for the network request timeout, meaning the GenServer call wrapping the network request always times out before the request does. Instead, we should set the GenServer call timeout to always be a bit larger than the network request timeout.)
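
A sketch of that suggestion, and of the "overhead timeout" the PR description mentions for server.ex; the names @overhead and call_with_overhead are assumptions, not the actual code:

  # Sketch only: pad the GenServer.call timeout so the socket operation
  # times out first and the failure surfaces at the network layer, where
  # it actually happened, rather than as an exit in the caller.
  @overhead 2_000

  defp call_with_overhead(server, request, network_timeout) do
    GenServer.call(
      server,
      {:request, request, network_timeout},
      network_timeout + @overhead
    )
  end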

@jbruggem (Collaborator) commented Mar 12, 2020

Just realized this is probably related to #389:

  • disabling SSL (using PLAINTEXT between kafka_ex and the server) makes the problem disappear
  • reverting to OTP 21.3 makes the problem disappear

On #389, a possible root cause in Erlang was mentioned: https://github.com/erlang/otp/pull/2224/files
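
For reference, the PLAINTEXT-vs-SSL comparison above comes down to kafka_ex's documented use_ssl setting; a minimal config sketch, with an illustrative broker list:

  # config/config.exs — illustrative values
  import Config

  config :kafka_ex,
    brokers: [{"localhost", 9092}],
    # use_ssl: true (with ssl_options) reproduced the OTP 22 timeouts;
    # use_ssl: false (PLAINTEXT) made the problem disappear
    use_ssl: false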

@jbruggem (Collaborator)

Ideas on how to deal with this are welcome :).

@jbruggem (Collaborator)

I created an issue in the OTP project: https://bugs.erlang.org/browse/ERL-1213

@jbruggem force-pushed the extra_applications branch from 3868869 to d7194fe on March 31, 2020 14:47
@jbruggem force-pushed the extra_applications branch from d7194fe to b97a62b on March 31, 2020 14:57
@jbruggem (Collaborator) commented Mar 31, 2020

I'll leave the OTP 22 problem to another PR/issue. I don't think it should block this PR.

@jbruggem force-pushed the extra_applications branch 2 times, most recently from f6167c1 to b97a62b on April 1, 2020 08:57
@jbruggem (Collaborator) commented Apr 1, 2020

Well, the CI passes. I suggest we review this and get it merged :)

@jbruggem requested review from bjhaid and dantswain on April 1, 2020 13:39
@bjhaid (Member) commented Apr 1, 2020

@jbruggem thanks for seeing this through! Can you please update the description to include all the other things introduced in this PR, and make sure the new description is propagated to the commit message when this is merged? Thanks!

@joshuawscott (Member, Author)

👍 if I could approve my own PR I would 😂

@jbruggem (Collaborator) commented Apr 1, 2020

> @jbruggem thanks for seeing this through! Can you please update the description to include all the other things introduced in this PR, and make sure the new description is propagated to the commit message when this is merged? Thanks!

Done, and will do!

@jbruggem (Collaborator) commented Apr 3, 2020

I will merge today if there are no further remarks.

@jbruggem merged commit aaa0aa4 into master on Apr 4, 2020.
@bjhaid deleted the extra_applications branch on April 23, 2020.