Cluster panic #828

Closed
alexanderfefelov opened this issue Oct 15, 2020 · 18 comments · Fixed by #835

@alexanderfefelov

alexanderfefelov commented Oct 15, 2020

Some time after startup, all nodes of my cluster (two servers and two agents) crash with the same error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x948a8a]

goroutine 623378 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc00091e6a0, 0xc00b231640, 0x357c300, 0xc000e27820, 0xc0011bc980, 0x25bb801, 0xc0044c6150)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc0001295b8, 0xc0044c6060, 0x35e2620, 0xc003374e20, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2561d80, 0xc0001295b8, 0x35dcda0, 0xc0002186c0, 0x4ba5ae0, 0xc000e2a200)
        /home/runner/work/dkron/dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200, 0xc000b56150, 0x4b45e20, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc001ab37e0, 0xc000a12b60, 0x35ea7e0, 0xc0019b4300, 0xc000e2a200)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1
runtime: note: your Linux kernel may be buggy
runtime: note: see https://golang.org/wiki/LinuxKernelSignalVectorBug
runtime: note: mlock workaround for kernel bug failed with errno 12
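
The Stop(0x0) frame above shows (*grpc.Server).Stop being invoked on a nil receiver. As a minimal standalone sketch (not Dkron code), the same class of crash, and the guard that avoids it, looks like this:

    package main

    import "google.golang.org/grpc"

    func main() {
        // A nil server handle, e.g. one that was never assigned because the
        // gRPC server had not been created yet when shutdown ran.
        var s *grpc.Server

        // Calling s.Stop() on the nil handle dereferences internal fields and
        // panics with "invalid memory address or nil pointer dereference",
        // matching the Stop(0x0) frame in the trace above.
        // A nil guard avoids that crash:
        if s != nil {
            s.Stop()
        }
    }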

Environment

  • Dkron 3.0.5

  • Docker (servers: Dockerfile, run script, agents: Dockerfile, run script)

  • uname -a:

      Linux dkron-server-1.backpack.test 5.4.0-48-generic #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    
@yvanoers
Collaborator

@alexanderfefelov
Is there any reason to assume the kernel bug mentioned in the log is not the cause?
Is it at all possible to upgrade the kernel and see if that alleviates the issue?

But on the other hand, there may be a timing issue with the executor. If so, I would expect this to occur with jobs that are very short-lived. Do you have such jobs in your cluster?
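
To make the timing theory concrete, here is a hypothetical sketch of the kind of race being described; the names below are illustrative only and are not the actual Dkron plugin code. A server handle gets assigned asynchronously while a very short-lived job finishes and stops it first:

    package main

    import "google.golang.org/grpc"

    func runShortJob() {
        var s *grpc.Server

        // The server handle is populated asynchronously, e.g. by a broker callback.
        go func() {
            s = grpc.NewServer()
            // ...register services and start serving...
        }()

        // A very short-lived job can reach this point before the goroutine above
        // has assigned s, so s is still nil and Stop panics exactly like the
        // traces in this issue. Waiting for the assignment (or guarding against
        // nil) before stopping would avoid the crash.
        s.Stop()
    }

    func main() {
        runShortJob()
    }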

@alexanderfefelov
Author

Is there any reason to assume the kernel bug mentioned in the log is not the cause?

Maybe you're right.

Is it at all possible to upgrade the kernel and see if that alleviates the issue?

I'll try it out.

Do you have such jobs in your cluster?

Yes, all my jobs are short-lived now.

alexanderfefelov added a commit to alexanderfefelov/docker-backpack that referenced this issue Oct 16, 2020
@BlackDex

I have encountered the same error running on a CentOS-based system with kernel 4.14.

@yvanoers
Collaborator

@alexanderfefelov @BlackDex,
I've created a pull request (#835) in an attempt to fix this. I have no decent way to test it myself, so it would be great if someone is willing and able to give it a try.

@piotrlipiarz-ef

We are observing a very similar issue with our cluster of 3 servers. We are using:

  • dkron/dkron:v3.0.4

  • uname -a:

      Linux dkron-cluster-0 4.15.0-1027-aws #27-Ubuntu SMP Fri Nov 2 15:14:20 UTC 2018 x86_64 Linux

The observed panic seems to originate from here:

s.Stop()

As a result, our cluster stops executing jobs, even though the panic happens on only one of the nodes.

The error in the logs looks like:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x9488fa]
{"service":"dkron","journald":{"_PID":"814","_SYSTEMD_INVOCATION_ID":"00e226322d7d42daa35605c8209f8cc9","_CAP_EFFECTIVE":"3fffffffff","_SOURCE_REALTIME_TIMESTAMP":"1604214389481076","PRIORITY":"3","CONTAINER_NAME":"k8s_dkron_dkron-cluster-0_tfprod_78ace7ae-1b7f-11eb-9a01-029fce2970d4_19","_SYSTEMD_CGROUP":"/system.slice/docker.service","_GID":"0","_MACHINE_ID":"3c325c220068e72790ee19b05f9d5d88","_SELINUX_CONTEXT":"unconfined\n","_COMM":"dockerd","CONTAINER_ID_FULL":"9845f9bc1c140eb037c22335b0d4bad8ddc3696f1d004b6dcff5ff9ee20d4241","CONTAINER_TAG":"9845f9bc1c14","_BOOT_ID":"2cc14745820148a4ab685328775b0890","_SYSTEMD_UNIT":"docker.service","_TRANSPORT":"journal","_CMDLINE":"/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock","SYSLOG_IDENTIFIER":"9845f9bc1c14","_SYSTEMD_SLICE":"system.slice","_UID":"0","_HOSTNAME":"kubeworker-7","CONTAINER_ID":"9845f9bc1c14","_EXE":"/usr/bin/dockerd-ce"}}

goroutine 2579 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc000ba8c20, 0xc000f34780, 0x3511180, 0xc0012c6d20, 0xc000e2a800, 0xc000104501, 0x101)
        /dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00000f040, 0xc000d384e0, 0x3577040, 0xc000cc8610, 0x0, 0x0)
        /dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x252a1c0, 0xc00000f040, 0x35717c0, 0xc0001ac600, 0x4b0ea40, 0xc000ef1800)
        /dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800, 0xc000cf4780, 0x4aaedc0, 0x0, 0x0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000edaa00, 0xc0003f2680, 0x357f1a0, 0xc00109ca80, 0xc000ef1800)
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1

Is there a known configuration setting that could prevent the cluster from going down in such a scenario? Or is there any timeline for when #835 could be merged and released?

vcastellm added the bug label Nov 9, 2020
@vcastellm
Member

🙏 if someone can give the fix in the PR a try

@yvanoers
Collaborator

yvanoers commented Nov 9, 2020

Maybe it would help if we provided a package or binary for testing purposes?

@alexanderfefelov
Author

Agent crash on 3.0.8:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x160 pc=0x94baaa]

goroutine 20660 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1482 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc000c32280, 0xc00060cf80, 0x3581720, 0xc00000d5a0, 0xc00013ab00, 0x25c0201, 0xc00067f8c0)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x190
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00012a1c0, 0xc00067f710, 0x35e7c60, 0xc000cdebc0, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:83 +0x6f4
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2566820, 0xc00012a1c0, 0x35e23e0, 0xc000b8acc0, 0x4bacc00, 0xc0000e2b00)
        /home/runner/work/dkron/dkron/plugin/types/dkron.pb.go:1862 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00, 0xc000b1f1d0, 0x4b4cf20, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1329 +0xcbc
google.golang.org/grpc.(*Server).handleStream(0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:1409 +0xc64
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000055820, 0xc0008f8340, 0x35efd60, 0xc0009d8f00, 0xc0000e2b00)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.29.1/server.go:744 +0xa1

@ArminNrz

ArminNrz commented Dec 29, 2020

Hi
When I run a Dkron server and set up 100 jobs, each of which runs every 10 seconds and sends an HTTP request:
Screenshot from 2020-12-29 18-01-11
then after about 1 minute I see this error:

Screenshot from 2020-12-29 18-03-51

@spy16

spy16 commented Feb 22, 2021

I have the exact same issue as well.

  • Dkron cluster of 3 x n2-standard-4 (4 vCPUs, 16 GB memory) (Google Cloud)
  • Jobs
    • Schedule: @every 1s
    • Executor: HTTP
    • HTTP Callback: POST http://job-worker-host:8081/job?job_id={jobID} (endpoint simply emits a statsd counter)
    • Number of Jobs: 100 (a rough registration sketch follows this list)
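
Here is a rough sketch of how a batch of jobs like this could be registered; the /v1/jobs endpoint and the executor_config keys follow the Dkron HTTP executor docs as I recall them, so treat them as assumptions to verify against your version:

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        for i := 0; i < 100; i++ {
            // Job definition roughly matching the setup above; the executor_config
            // keys ("method", "url") are assumptions, not verified field names.
            body := fmt.Sprintf(`{
      "name": "test_j%d",
      "schedule": "@every 1s",
      "executor": "http",
      "executor_config": {
        "method": "POST",
        "url": "http://job-worker-host:8081/job?job_id=test_j%d"
      }
    }`, i, i)

            // Assumes a Dkron server reachable on the default HTTP port.
            resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", bytes.NewBufferString(body))
            if err != nil {
                log.Fatal(err)
            }
            resp.Body.Close()
        }
    }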

The expectation was to see a sustained throughput of 100 rps on the /job endpoint, but one or all nodes crash with the following panic after some time (all jobs completed about 20 executions).

Note: sometimes they crash exactly when I try to open the /ui endpoint. I'm not entirely sure whether that is cause and effect or just a coincidence.

INFO[2021-02-22T21:30:29+07:00] grpc_agent: Starting job                      job=test_j98 node=p-dkron-node-a-01
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x168 pc=0x9535ca]

goroutine 423018 [running]:
google.golang.org/grpc.(*Server).Stop(0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1626 +0x4a
github.com/distribworks/dkron/v3/plugin.(*ExecutorClient).Execute(0xc0003894a0, 0xc002105860, 0x3b177a0, 0xc000eff240, 0xc0005bd950, 0x1, 0x0)
        /home/runner/work/dkron/dkron/plugin/executor.go:65 +0x191
github.com/distribworks/dkron/v3/dkron.(*AgentServer).AgentRun(0xc00101bb20, 0xc0019675c0, 0x3b88be0, 0xc003f5c650, 0x0, 0x0)
        /home/runner/work/dkron/dkron/dkron/grpc_agent.go:85 +0x6fb
github.com/distribworks/dkron/v3/plugin/types._Agent_AgentRun_Handler(0x2969a20, 0xc00101bb20, 0x3b82be0, 0xc0023c9740, 0x535e610, 0xc0018d8c00)
        /home/runner/work/dkron/dkron/plugin/types/dkron_grpc.pb.go:540 +0x109
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00, 0xc0009484b0, 0x52f76e0, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1464 +0xcbd
google.golang.org/grpc.(*Server).handleStream(0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1544 +0xc96
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc002514e50, 0xc000130c40, 0x3b91700, 0xc002f4f080, 0xc0018d8c00)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:878 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:876 +0x204

@yvanoers
Collaborator

I released a preproduction build of Dkron v3.1.4 with patch #835 built in.
A docker image is also available as yvanoers/dkron.

Hopefully this will make it easier for someone to test the PR (#835).

@ncsibra

ncsibra commented Aug 11, 2021

We are facing this issue too. The PR is already merged; @Victorcoder, when will this be released/tagged?

@vcastellm
Member

@ncsibra this is difficult to test. Did you check whether it works properly with the pre-release version?

@ncsibra

ncsibra commented Aug 11, 2021

I assumed it was tested because it had already been merged.
Do you mean the yvanoers version, based on the v3.1.4 release?
Or do you have an official pre-release Docker image for the latest version (that I didn't find)?

@yvanoers
Collaborator

@ncsibra please note that PR #835 has not yet been merged.
PR #1008 has, and that change is/was a part of #835.

@ncsibra

ncsibra commented Aug 11, 2021

@yvanoers You're right, sorry, I mixed them up somehow.

@ncsibra

ncsibra commented Aug 13, 2021

@yvanoers I tested your branch at commit f31c7f5f32e30424a7868922a61e9198da5c74ce.
I created a Docker image based on the default Dockerfile (not Dockerfile.hub); I assume that doesn't matter.
I deployed it to our dev environment and added 100 jobs, each calling an HTTP endpoint every 10s, as mentioned in this comment.
24 hours and 9,336 successful executions later (every job reached that number of executions), Dkron is still running and the panic has not occurred.
Based on this, I think it's fixed.

@vcastellm
Member

This is really good news @ncsibra, thanks for testing. I'm going to merge and include the fix in the next release.
