k8s-dqlite is stuck pinned at high 80s% usage #36

Open

alexmarshallces opened this issue Dec 14, 2022 · 23 comments

@alexmarshallces

Problem:

The k8s-dqlite process that synchronizes microk8s state is pinned at 80%+ CPU usage at all times on microk8s nodes configured for High Availability. This causes "database is locked" errors when attempting to deploy any resources to the microk8s cluster, e.g. Helm charts. On five of the six nodes the k8s-dqlite process shows single-digit CPU consumption, but on the sixth it sits constantly at 80%+.

Expected behavior: k8s-dqlite should not be consuming more than 25-30% CPU at any given time.

Environment:
Cluster: microk8s v1.25
Nodes: 6
OS: Ubuntu 20.04

Logs:

Dec 14 17:40:37 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:40:37Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:40:37 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:40:37Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:40:37 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:40:37Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:40:41 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:40:41Z" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
Dec 14 17:40:44 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:40:44Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
Dec 14 17:42:20 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:42:20Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:42:26 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:42:26Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:47:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:47:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:13Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:47:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:16Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:47:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:16Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:47:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:47:17 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:17Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:47:48 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:48Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:47:48 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:48Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:47:48 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:47:48Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:48:23 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:48:23Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:48:25 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:48:25Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:49:54 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:49:54Z" level=error msg="failed to delete revision 41374: exec (try: 500): database is locked"
Dec 14 17:50:31 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:31Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error while range on /registry/crd.projectcalico.org/caliconodestatuses/ /registry/crd.projectcalico.org/cali>
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: exec (try: 0): call exec-sql (budget 31.530053395s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: exec (try: 0): call exec-sql (budget 26.102840766s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: query (try: 0): call query-sql (budget 26.211612942s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: query (try: 0): call query-sql (budget 0s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="failed to record compact revision: exec (try: 0): call exec-sql (budget 0s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: query (try: 0): call query-sql (budget 10.778006512s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error in txn: query (try: 0): call query-sql (budget 6.398362368s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error while range on /registry/leases/kube-node-lease/ces-yeg1-k103 : query (try: 0): call query-sql (budget >
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="fail to list latest changes: query (try: 0): call query-sql (budget 0s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="fail to list latest changes: query (try: 0): call query-sql (budget 0s): receive: header: EOF"
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error while range on /registry/leases/kube-node-lease/ces-yeg1-k103 : query (try: 0): call query-sql (budget >
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:50:35Z" level=error msg="error while range on /registry/leases/kube-node-lease/ces-yeg1-k103 : query (try: 0): call query-sql (budget >
Dec 14 17:50:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:50:35.821917 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: read tcp 10.248.177.35:19001->10.248.177.33:54264: read: connecti>
Dec 14 17:51:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:13Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:21 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:21Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:21 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:21Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:51:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:35Z" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:51:38 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:51:38Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:52:21 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:21Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:52:24 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:52:24.711775 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:52:24 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:24Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:52:32 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:32Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:52:37 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:37Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:52:53 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:53Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:52:56 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:52:56Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:53:05 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:53:05.714134 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:53:05 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:05Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:53:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:15Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:53:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:15Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:53:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:53:15.715201 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:53:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:15Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:53:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:53:16.557336 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:53:17 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:17Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:53:17 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:53:17Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:53:52 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:53:52.281499 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:53:52 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:53:52.282150 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:54:46 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:54:46Z" level=error msg="error in txn: exec (try: 39): context canceled"
Dec 14 17:54:46 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:54:46Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:55:10 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:10Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:55:11 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:11Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:55:11 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:11Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:55:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:12Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:55:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:55:35.284487 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:55:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:55:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:35Z" level=error msg="error in txn: query (try: 0): failed to create dqlite connection: no available dqlite leader server found"
Dec 14 17:55:36 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:36Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:55:43 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:55:43.185950 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:55:47 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:55:47.744529 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:55:51 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:55:51Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:04 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:04Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:12 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:12Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:13Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:13Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:13Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:56:13 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:13Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:56:20 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:20Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
Dec 14 17:56:23 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:23Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
Dec 14 17:56:55 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:56:55Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:57:01 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:57:01Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:57:04 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:57:04.986049 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:57:05 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:57:05Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:57:06 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:57:06Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:57:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:57:16.080231 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:57:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:57:16Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:58:25 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:25Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:58:25 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:25Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:29 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:58:29.635744 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:35 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:35Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 17:58:39 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:39Z" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
Dec 14 17:58:42 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:42Z" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
Dec 14 17:58:44 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T17:58:44Z" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
Dec 14 17:59:59 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:59:59.748446 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 18:01:01 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:01:01Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:01:09 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:01:09Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 18:01:21 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 18:01:21.373190 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 18:02:18 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 18:02:18.952072 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 18:02:31 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 18:02:31.810114 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 18:03:27 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:03:27Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 18:03:58 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 18:03:58.233073 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC
Dec 14 18:04:06 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:06Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 18:04:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:15Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 18:04:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:15Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:15 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:15Z" level=error msg="error in txn: exec (try: 40): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: exec (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"
Dec 14 18:04:16 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: time="2022-12-14T18:04:16Z" level=error msg="error in txn: query (try: 0): context canceled"

@neoaggelos
Contributor

@MathieuBordere can you share some insight as to why this might be happening?

@MathieuBordere

MathieuBordere commented Dec 15, 2022

Hard to tell from the logs; the line Dec 14 17:52:24 ces-yeg1-k103 microk8s.daemon-k8s-dqlite[19886]: I1214 17:52:24.711775 19886 log.go:198] [ERROR] dqlite: proxy: first: remote -> local: local error: tls: bad record MAC looks a bit concerning.

Is there a way to enable more logging? There are the LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 environment variables that can be set to enable detailed logging; I believe you have a way to enable these in the snap? It would be interesting to see the output on the impacted node.

@MathieuBordere

CC @cole-miller

@MathieuBordere

MathieuBordere commented Dec 15, 2022

Is this an arm machine by any chance? There are issues with golang TLS performance and RSA certs on arm machines; it's advised to switch to an ECDSA cert. See e.g. canonical/go-dqlite#170.

@neoaggelos
Contributor

There are the LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 environment variables that can be set to enable detailed logging; I believe you have a way to enable these in the snap? It would be interesting to see the output on the impacted node.

Yes, these can be set in /var/snap/microk8s/current/args/k8s-dqlite-env. @alexmarshallces can you uncomment these lines and afterwards restart k8s-dqlite with the following command:

sudo snap restart microk8s.daemon-k8s-dqlite

Note that this restart will have the side-effect of moving the leader around, so do it with care.
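
For reference, after uncommenting, the relevant lines of /var/snap/microk8s/current/args/k8s-dqlite-env should look roughly like this (a sketch; the file's exact surrounding contents may differ):

LIBRAFT_TRACE=1
LIBDQLITE_TRACE=1

The trace output then lands in the same journal as the errors quoted above (e.g. journalctl -u snap.microk8s.daemon-k8s-dqlite).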

@alexmarshallces
Author

There are the LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 environment variables that can be set to enable detailed logging; I believe you have a way to enable these in the snap? It would be interesting to see the output on the impacted node.

Yes, these can be set in /var/snap/microk8s/current/args/k8s-dqlite-env. @alexmarshallces can you uncomment these lines and afterwards restart k8s-dqlite with the following command:

sudo snap restart microk8s.daemon-k8s-dqlite

Note that this restart will have the side-effect of moving the leader around, so do it with care.

@neoaggelos Yes, I'll uncomment these and send the results right away. Thank you for the warning about the leader relocation.

@MathieuBordere

Is this an arm machine by any chance? There are issues with golang TLS performance and RSA certs on arm machines; it's advised to switch to an ECDSA cert. See e.g. canonical/go-dqlite#170.

^ @alexmarshallces

@alexmarshallces
Author

Is this an arm machine by any chance? There are issues with golang TLS performance and RSA certs on arm machines; it's advised to switch to an ECDSA cert. See e.g. canonical/go-dqlite#170.

^ @alexmarshallces

No, this is an amd64 machine running Ubuntu 20.04 on top of a Windows Server 2019 Hyper-V hypervisor.

@alexmarshallces
Author

@neoaggelos Please find attached a compressed tarball containing a log file that captures the problem. It contains the debug output of dqlite and raft with the trace variables specified above enabled. Any insights from yourself and @MathieuBordere as to what could be happening are greatly appreciated.
dqlite-log.tar.gz

@MathieuBordere

MathieuBordere commented Dec 15, 2022

@alexmarshallces How would you describe the issues that you saw during the timeframe of the logs you provided? Just high CPU load or requests timing out?

@alexmarshallces
Author

@MathieuBordere Typically both: high CPU load on the dqlite leader node, along with frequent errors returned by calls to the API server, either database is locked responses or timeouts. e.g. Helm charts will fail to install with database is locked or context deadline exceeded errors. Various kubectl calls against the cluster fail with a variety of errors, including timeouts and context deadline exceeded: a call to query the pods in a namespace will work, repeating the same call 10 seconds later won't, and repeating it a minute later will work again. The same holds for even the simplest calls such as kubectl get nodes or kubectl get namespaces. Overall this leads to a high level of inconsistency and instability in the cluster. After some period of time I also start to see InvalidDiskCapacity errors show up in the event logs for the nodes; my suspicion is that processes trying to schedule onto nodes are failing due to synchronization timeouts with dqlite.

@MathieuBordere

How's the latency between the nodes, can you provide some measurements? (between the voters and the leader is ideal)
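
A quick way to gather this from the leader would be something like the following (a sketch; the k10x/k20x hostnames are this cluster's node names and are assumed to be resolvable):

# run on the leader; prints packet loss and min/avg/max RTT per voter
for host in k101 k102 k103 k201 k202; do echo "$host"; ping -c 20 -q "$host" | tail -2; done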

@MathieuBordere

I'm seeing a lot of writes that are blocked on writes that are still ongoing, in lines like vfsShmLock:1572 EXCLUSIVE lock contention ofst:0 n:1 exclusive[0]=1 shared[0]=0. We're investigating it.
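
(Those lines come from the trace output enabled earlier; on an affected node they can be pulled from the journal with e.g. journalctl -u snap.microk8s.daemon-k8s-dqlite | grep vfsShmLock, assuming the usual snap service naming.)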

@alexmarshallces
Author

alexmarshallces commented Dec 15, 2022

How's the latency between the nodes, can you provide some measurements? (between the voters and the leader is ideal)

@MathieuBordere

Six (6) nodes total:

Leader: k203
Voters: k101, k102, k103, k201, k202

Naming convention: k{x}0{y}
x => hardware instance
y => VM instance within hardware instance

i.e. We have two hardware instances that each host 3 VM nodes, so any connection k10{y} <==> k20{y} is going across physical hardware boundaries via a network switch in the same rack as both servers.

Using ping I see the ranges:
k203 => k101 : [0.2ms , 10ms]
k203 => k102 : [0.2ms , 10ms]
k203 => k103 : [0.2ms , 10ms]
k203 => k201 : [0.2ms , 4ms]
k203 => k202 : [0.2ms , 4ms]

@MathieuBordere

Thank you, we're going to try and reproduce your cluster and see if we can trigger the behavior. Could you also provide some numbers on how fast your storage is? I think @neoaggelos and @ktsakalozos have documented somewhere how to provide it.

@alexmarshallces
Author

Thank you, we're going to try and reproduce your cluster and see if we can trigger the behavior. Could you also provide some numbers on how fast your storage is? I think @neoaggelos and @ktsakalozos have documented somewhere how to provide it.

@MathieuBordere

Per the microk8s recommendations for disk performance, running sudo hdparm -Tt /dev/sda:

From our real hardware instances (some variation, but this is representative):

/dev/sda:
 Timing cached reads:   23656 MB in  1.99 seconds = 11910.41 MB/sec
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 Timing buffered disk reads: 370 MB in  3.03 seconds = 121.99 MB/sec

From an identically configured VM in Azure (same vCPU count, same amount of RAM, same software installation):

/dev/sda:
 Timing cached reads:   16248 MB in  1.99 seconds = 8169.00 MB/sec
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 Timing buffered disk reads: 538 MB in  3.01 seconds = 178.70 MB/sec

@alexmarshallces
Author

How's the latency between the nodes, can you provide some measurements? (between the voters and the leader is ideal)

@MathieuBordere

Six (6) nodes total:

Leader: k203
Voters: k101, k102, k103, k201, k202

Naming convention: k{x}0{y}
x => hardware instance
y => VM instance within hardware instance

i.e. We have two hardware instances that each host 3 VM nodes, so any connection k10{y} <==> k20{y} is going across physical hardware boundaries via a network switch in the same rack as both servers.

Using ping I see the ranges:
k203 => k101 : [0.2ms , 10ms]
k203 => k102 : [0.2ms , 10ms]
k203 => k103 : [0.2ms , 10ms]
k203 => k201 : [0.2ms , 4ms]
k203 => k202 : [0.2ms , 4ms]

@MathieuBordere @neoaggelos My earlier statement about the switch wasn't accurate, I need to correct myself:

Between the two servers, we have two switches operating in an Active-Active redundant configuration, so there are two paths between the two servers. I suspect this may be a cause of the synchronization issues and failed transactions in the dqlite database. Can you please confirm or rule out this impact?

@MathieuBordere

MathieuBordere commented Dec 16, 2022

How's the latency between the nodes, can you provide some measurements? (between the voters and the leader is ideal)

@MathieuBordere
Six (6) nodes total:
Leader: k203
Voters: k101, k102, k103, k201, k202
Naming convention: k{x}0{y}
x => hardware instance
y => VM instance within hardware instance
i.e. We have two hardware instances that each host 3 VM nodes, so any connection k10{y} <==> k20{y} is going across physical hardware boundaries via a network switch in the same rack as both servers.
Using ping I see the ranges:
k203 => k101 : [0.2ms , 10ms]
k203 => k102 : [0.2ms , 10ms]
k203 => k103 : [0.2ms , 10ms]
k203 => k201 : [0.2ms , 4ms]
k203 => k202 : [0.2ms , 4ms]

@MathieuBordere @neoaggelos My earlier statement about the switch wasn't accurate, I need to correct myself:

Between the two servers, we have two switches operating in an Active-Active redundant configuration, so there are two paths between the two servers. I suspect this may be a cause of the synchronization issues and failed transactions in the dqlite database. Can you please confirm or rule out this impact?

I don't have enough knowledge of switch configurations to be able to answer that question, but my gut feeling says it shouldn't have an impact (you could always test it).

@alexmarshallces
Author

alexmarshallces commented Dec 16, 2022

How's the latency between the nodes, can you provide some measurements? (between the voters and the leader is ideal)

@MathieuBordere
Six (6) nodes total:
Leader: k203
Voters: k101, k102, k103, k201, k202
Naming convention: k{x}0{y}
x => hardware instance
y => VM instance within hardware instance
i.e. We have two hardware instances that each host 3 VM nodes, so any connection k10{y} <==> k20{y} is going across physical hardware boundaries via a network switch in the same rack as both servers.
Using ping I see the ranges:
k203 => k101 : [0.2ms , 10ms]
k203 => k102 : [0.2ms , 10ms]
k203 => k103 : [0.2ms , 10ms]
k203 => k201 : [0.2ms , 4ms]
k203 => k202 : [0.2ms , 4ms]

@MathieuBordere @neoaggelos My earlier statement about the switch wasn't accurate, I need to correct myself:
Between the two servers, we have two switches operating in an Active-Active redundant configuration, so there are two paths between the two servers. I suspect this may be a cause of the synchronization issues and failed transactions in the dqlite database. Can you please confirm or rule out this impact?

I don't have enough knowledge of switch configurations to be able to answer that question, but my gut feeling says it shouldn't have an impact (you could always test it).

I think that's what we're doing right now, lol. If you or @neoaggelos have any further insight from your attempts to replicate the setup, it's gratefully received and welcome.

@MathieuBordere

From the logs I see that some writes on the leader take a very long time to complete: e.g. after the log line with 1 commands starting at 9485324, it takes approx. 4s until I see leader: written 1 entries starting at 9485324: status 0 and the command is committed.

Can you check if the disk is having write performance issues?
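
Note that hdparm measures reads only, while dqlite's pattern is many small synchronous writes. A fio job along these lines approximates that and reports fdatasync latency percentiles (a sketch; point --directory at the filesystem that holds the dqlite data, e.g. under /var/snap/microk8s):

fio --name=sync-write-test --directory=/var/snap/microk8s/common --size=64m --bs=4k --rw=write --ioengine=sync --fdatasync=1 --runtime=30 --time_based

The fsync/fdatasync percentiles in the output are the numbers of interest.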

@alexmarshallces
Author

From the logs I see that some writes on the leader take a very long time to complete: e.g. after the log line with 1 commands starting at 9485324, it takes approx. 4s until I see leader: written 1 entries starting at 9485324: status 0 and the command is committed.

Can you check if the disk is having write performance issues?

@MathieuBordere Based on the recommendations here for detecting disk performance bottlenecks in Hyper-V and a review of the VM host performance counters, I don't appear to have any disk write performance issues: the worst-case write latency I see during bursts is 3ms per write, well within the 50ms-per-write threshold discussed in the documentation, and average write latency is under 1ms.

Since you're much more familiar with dqlite's internal workings than I am, is there anything within that 4s transaction other than disk I/O that could be affecting its latency? Any chance there's a network operation in there?

@alexmarshallces
Author

@MathieuBordere Further to my previous message, I have:

  • Verified the support for Ubuntu 20.04 on Hyper-V here.
  • Enumerated the best practices for running Linux in Hyper-V and confirmed the following:
    • My block size is 16 MB rather than the recommended 1 MB. This value is set by a 3rd party script provided here but I can ensure this gets changed if you feel it's a significant enough concern.
    • Even though we're not using failover clustering for this particular application, we're still using static MAC addresses, so the IP addresses shouldn't be changing (though I've noticed that occasionally they do, which is solved by a restart or refresh of the network interface, eth0).
    • I'm unable to identify the disk scheduler, as the path given in the documentation is not present on my system.
    • NUMA shouldn't be an issue since we're using kernel version 5.4.0-105-generic, which is well past the mentioned version 2.6.37.
    • Using fdisk to verify my partition yields the following output:
      No errors detected.
      Header version: 1.0
      Using 3 out of 128 partitions.
      A total of 2014 free sectors is available in 1 segment.

@MathieuBordere

If my suspicion is correct that disk write performance degrades after a lot of writes (and dqlite currently writes a lot to disk), then lowering dqlite's snapshot frequency might help. I'm not sure to what extent microk8s has support for it, but it's worth experimenting with in this case.
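
If such a knob is exposed by the snap, lowering the snapshot frequency could look something like this (purely a hypothetical sketch: the tuning.yaml file name and the snapshot threshold/trailing keys are assumptions, not confirmed microk8s configuration):

sudo tee /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml <<'EOF'
snapshot:
  threshold: 8192
  trailing: 8192
EOF
sudo snap restart microk8s.daemon-k8s-dqlite

A higher threshold means snapshots are taken less often (less write amplification) at the cost of keeping more raft log on disk.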
