Cluster panic #828

Closed
alexanderfefelov opened this issue Oct 15, 2020 · 18 comments · Fixed by #835

@alexanderfefelov

alexanderfefelov commented Oct 15, 2020

Some time after startup, all nodes of my cluster (two servers and two agents) crash with the same error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x948a8a]

goroutine 623378 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc00091e6a0, 0xc00b231640, 0x357c300, 0xc000e27820, 0xc0011bc980, 0x25bb801, 0xc0044c6150)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc0001295b8, 0xc0044c6060, 0x35e2620, 0xc003374e20, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2561d80, 0xc0001295b8, 0x35dcda0, 0xc0002186c0, 0x4ba5ae0, 0xc000e2a200)
        /home/runner/work/dkron/dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200, 0xc000b56150, 0x4b45e20, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc001ab37e0, 0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1
runtime: note: your Linux kernel may be buggy
runtime: note: see https://golang.org/wiki/LinuxKernelSignalVectorBug
runtime: note: mlock workaround for kernel bug failed with errno 12
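
The Stop(0x0) frame above shows (*grpc.Server).Stop being invoked on a nil receiver. As a minimal standalone sketch (not Dkron code), the same class of crash, and the guard that avoids it, looks like this:

    package main

    import "google.golang.org/grpc"

    func main() {
        // A nil server handle, e.g. one that was never assigned because the
        // gRPC server had not been created yet when shutdown ran.
        var s *grpc.Server

        // Calling s.Stop() on the nil handle dereferences internal fields and
        // panics with "invalid memory address or nil pointer dereference",
        // matching the Stop(0x0) frame in the trace above.
        // A nil guard avoids that crash:
        if s != nil {
            s.Stop()
        }
    }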

Environment

  • Dkron 3.0.5

  • Docker (servers: Dockerfile, run script, agents: Dockerfile, run script)

  • uname -a:

      Linux dkron-server-1.backpack.test 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    
@yvanoers
Collaborator

@alexanderfefelov
Is there any reason to assume the kernel bug mentioned in the log is not the cause?
Is it at all possible to upgrade the kernel and see if that alleviates the issue?

But on the other hand, there may be a timing issue with the executor. If so, I would expect this to occur with jobs that are very short-lived. Do you have such jobs in your cluster?
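
To make the timing theory concrete, here is a hypothetical sketch of the kind of race being described; the names below are illustrative only and are not the actual Dkron plugin code. A server handle gets assigned asynchronously while a very short-lived job finishes and stops it first:

    package main

    import "google.golang.org/grpc"

    func runShortJob() {
        var s *grpc.Server

        // The server handle is populated asynchronously, e.g. by a broker callback.
        go func() {
            s = grpc.NewServer()
            // ...register services and start serving...
        }()

        // A very short-lived job can reach this point before the goroutine above
        // has assigned s, so s is still nil and Stop panics exactly like the
        // traces in this issue. Waiting for the assignment (or guarding against
        // nil) before stopping would avoid the crash.
        s.Stop()
    }

    func main() {
        runShortJob()
    }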

@alexanderfefelov
Author

Is there any reason to assume the kernel bug mentioned in the log is not the cause?

Maybe you're right.

Is it at all possible to upgrade the kernel and see if that alleviates the issue?

I'll try it out.

Do you have such jobs in your cluster?

Yes, all my jobs are short-lived now.

alexanderfefelov added a commit to alexanderfefelov/docker-backpack that referenced this issue Oct 16, 2020
@BlackDex

I have encountered the same error running on a CentOS-based system with kernel 4.14.

@yvanoers
Collaborator

@alexanderfefelov @BlackDex,
I've created a pull request (#835) in an attempt to fix this. I have no decent way to test it myself, so it would be great if someone is willing and able to give it a try.

@piotrlipiarz-ef

We are observing a very similar issue with our cluster of 3 servers. We are using:

  • dkron/dkron:v3.0.4

  • uname -a:

      Linux dkron-cluster-0 4.15.0-1027-aws #27-Ubuntu SMP Fri Nov 2 15:14:20 UTC 2018 x86_64 Linux

The observed panic seems to originate from here:

s.Stop()

As a result, our cluster stops executing jobs, even though the panic happens on only one of the nodes.

The error in the logs looks like:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x9488fa]
{"service":"dkron","journald":{"_PID":"814","_SYSTEMD_INVOCATION_ID":"00e226322d7d42daa35605c8209f8cc9","_CAP_EFFECTIVE":"3fffffffff","_SOURCE_REALTIME_TIMESTAMP":"1604214389481076","PRIORITY":"3","CONTAINER_NAME":"k8s_dkron_dkron-cluster-0_tfprod_78ace7ae-1b7f-11eb-9a01-029fce2970d4_19","_SYSTEMD_CGROUP":"/system.slice/docker.service","_GID":"0","_MACHINE_ID":"3c325c220068e72790ee19b05f9d5d88","_SELINUX_CONTEXT":"unconfined\n","_COMM":"dockerd","CONTAINER_ID_FULL":"9845f9bc1c140eb037c22335b0d4bad8ddc3696f1d004b6dcff5ff9ee20d4241","CONTAINER_TAG":"9845f9bc1c14","_BOOT_ID":"2cc14745820148a4ab685328775b0890","_SYSTEMD_UNIT":"docker.service","_TRANSPORT":"journal","_CMDLINE":"/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock","SYSLOG_IDENTIFIER":"9845f9bc1c14","_SYSTEMD_SLICE":"system.slice","_UID":"0","_HOSTNAME":"kubeworker-7","CONTAINER_ID":"9845f9bc1c14","_EXE":"/usr/bin/dockerd-ce"}}

goroutine 2579 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc000ba8c20, 0xc000f34780, 0x3511180, 0xc0012c6d20, 0xc000e2a800, 0xc000104501, 0x101)
        /dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00000f040, 0xc000d384e0, 0x3577040, 0xc000cc8610, 0x0, 0x0)
        /dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x252a1c0, 0xc00000f040, 0x35717c0, 0xc0001ac600, 0x4b0ea40, 0xc000ef1800)
        /dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800, 0xc000cf4780, 0x4aaedc0, 0x0, 0x0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000edaa00, 0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1

Is there a known configuration setting that could prevent the cluster from going down in such a scenario? Or is there any timeline for when #835 could be merged and released?

vcastellm added the bug label Nov 9, 2020
@vcastellm
Member

🙏 if someone can give the fix in the PR a try

@yvanoers
Collaborator

yvanoers commented Nov 9, 2020

Maybe it would help if we provided a package or binary for testing purposes?

@alexanderfefelov
Author

Agent crash on 3.0.8:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x94baaa]

goroutine 20660 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc000c32280, 0xc00060cf80, 0x3581720, 0xc00000d5a0, 0xc00013ab00, 0x25c0201, 0xc00067f8c0)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00012a1c0, 0xc00067f710, 0x35e7c60, 0xc000cdebc0, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2566820, 0xc00012a1c0, 0x35e23e0, 0xc000b8acc0, 0x4bacc00, 0xc0000e2b00)
        /home/runner/work/dkron/dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00, 0xc000b1f1d0, 0x4b4cf20, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000055820, 0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1

@ArminNrz

ArminNrz commented Dec 29, 2020

Hi
When I run a Dkron server and set up 100 jobs, each of which runs every 10 seconds and sends an HTTP request:
Screenshot from 2020-12-29 18-01-11
then after about 1 minute I see this error:

Screenshot from 2020-12-29 18-03-51

@spy16

spy16 commented Feb 22, 2021

I have the exact same issue as well.

  • Dkron cluster of 3 x n2-standard-4 (4 vCPUs, 16 GB memory) (Google Cloud)
  • Jobs
    • Schedule: @every 1s
    • Executor: HTTP
    • HTTP Callback: POST http://job-worker-host:8081/job?job_id={jobID} (endpoint simply emits a statsd counter)
    • Number of Jobs: 100 (a rough registration sketch follows this list)
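
Here is a rough sketch of how a batch of jobs like this could be registered; the /v1/jobs endpoint and the executor_config keys follow the Dkron HTTP executor docs as I recall them, so treat them as assumptions to verify against your version:

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        for i := 0; i < 100; i++ {
            // Job definition roughly matching the setup above; the executor_config
            // keys ("method", "url") are assumptions, not verified field names.
            body := fmt.Sprintf(`{
      "name": "test_j%d",
      "schedule": "@every 1s",
      "executor": "http",
      "executor_config": {
        "method": "POST",
        "url": "http://job-worker-host:8081/job?job_id=test_j%d"
      }
    }`, i, i)

            // Assumes a Dkron server reachable on the default HTTP port.
            resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", bytes.NewBufferString(body))
            if err != nil {
                log.Fatal(err)
            }
            resp.Body.Close()
        }
    }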

The expectation was to see a sustained throughput of 100 rps on the /job endpoint, but one or all nodes crash with the following panic after some time (all jobs completed about 20 executions).

Note: sometimes they crash exactly when I try to open the /ui endpoint. I'm not entirely sure whether that is cause and effect or just a coincidence.

INFO[2021-02-22T21:30:29+07:00] grpc_agent: Starting job                      job=test_j98 node=p-dkron-node-a-01
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x168 pc=0x9535ca]

goroutine 423018 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1626 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc0003894a0, 0xc002105860, 0x3b177a0, 0xc000eff240, 0xc0005bd950, 0x1, 0x0)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x191
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00101bb20, 0xc0019675c0, 0x3b88be0, 0xc003f5c650, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:85 +0x6fb
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2969a20, 0xc00101bb20, 0x3b82be0, 0xc0023c9740, 0x535e610, 0xc0018d8c00)
        /home/runner/work/dkron/dkron/plugin/types/dkron_grpc.pb.go:540 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00, 0xc0009484b0, 0x52f76e0, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1464 +0xcbd
google.golang.org/grpc.(*Server).handleStream(0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1544 +0xc96
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc002514e50, 0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:878 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:876 +0x204

@yvanoers
Collaborator

I released a preproduction build of Dkron v3.1.4 with patch #835 built in.
A docker image is also available as yvanoers/dkron.

Hopefully this will make it easier for someone to test the PR (#835).

@ncsibra

ncsibra commented Aug 11, 2021

We are facing this issue too. The PR is already merged; @Victorcoder, when will this be released/tagged?

@vcastellm
Member

@ncsibra this is difficult to test. Did you check whether it works properly with the pre-release version?

@ncsibra

ncsibra commented Aug 11, 2021

I assumed it was tested because it had already been merged.
Do you mean the yvanoers version, based on the v3.1.4 release?
Or do you have an official pre-release Docker image for the latest version (that I didn't find)?

@yvanoers
Collaborator

@ncsibra please note that PR #835 has not yet been merged.
PR #1008 has, and that change is/was a part of #835.

@ncsibra

ncsibra commented Aug 11, 2021

@yvanoers You're right, sorry, I mixed them up somehow.

@ncsibra

ncsibra commented Aug 13, 2021

@yvanoers I tested your branch at commit f31c7f5f32e30424a7868922a61e9198da5c74ce.
I created a Docker image based on the default Dockerfile (not Dockerfile.hub); I assume that doesn't matter.
I deployed it to our dev environment and added 100 jobs, each calling an HTTP endpoint every 10s, as mentioned in this comment.
24 hours and 9,336 successful executions later (every job reached that number of executions), Dkron is still running and the panic has not occurred.
Based on this, I think it's fixed.

@vcastellm
Member

This is really good news @ncsibra, thanks for testing. I'm going to merge and include the fix in the next release.
