[Bug]: Unstable Jaeger Deployment with Cassandra ; Cassandra STS is failing #555

Open
yitzhtal opened this issue Feb 28, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@yitzhtal

What happened?

The Cassandra StatefulSet is unstable and keeps crashing.

Steps to reproduce

  1. Install the OpenTelemetry SDK in an application.
  2. Install the latest Jaeger Helm chart (1.0.0).

Expected behavior

Jaeger is available, with all pods running stably.

Relevant log output

INFO  [main] 2024-02-28 09:58:00,582 QueryProcessor.java:163 - Preloaded 0 prepared statements
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:657 - Cassandra version: 3.11.6
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:658 - Thrift API version: 20.1.0
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:659 - CQL supported versions: 3.4.4 (default: 3.4.4)
INFO  [main] 2024-02-28 09:58:00,582 StorageService.java:661 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO  [main] 2024-02-28 09:58:00,599 IndexSummaryManager.java:87 - Initializing index summary manager with a memory pool size of 99 MB and a resize interval of 60 minutes
INFO  [main] 2024-02-28 09:58:00,604 MessagingService.java:750 - Starting Messaging Service on /10.50.26.33:7000 (eth0)
INFO  [main] 2024-02-28 09:58:00,619 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
INFO  [HANDSHAKE-jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49] 2024-02-28 09:58:00,628 OutboundTcpConnection.java:561 - Handshaking version with jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49
INFO  [ScheduledTasks:1] 2024-02-28 09:58:03,885 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757)
ERROR [main] 2024-02-28 09:58:06,635 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 HintsService.java:209 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.50.26.33] 2024-02-28 09:58:06,638 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO  [StorageServiceShutdownHook] 2024-02-28 09:58:06,759 HintsService.java:209 - Paused hints dispatch

Screenshot

Screenshot 2024-02-28 at 11 58 37

Additional context

Running Jaeger on a dedicated namespace on EKS.

Jaeger backend version

1.53.0

SDK

OpenTelemetry SDK.

Pipeline

No response

Storage backend

Cassandra

Operating system

Linux

Deployment model

Kubernetes

Deployment configs

provisionDataStore:
  cassandra: true
  elasticsearch: false
  kafka: false
agent:
  enabled: false
query:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - jaeger-ui-solutions.internal.lightrun.com
  config: |-
    {
      "dependencies": {
        "dagMaxNumServices": 200,
        "menuEnabled": true
      },
      "archiveEnabled": true,
      "tracking": {
        "gaID": "UA-000000-2",
        "trackErrors": true
      }
    }
cassandra:
  resources:
    requests:
      memory: 10Gi
      cpu: 6
    limits:
      memory: 16Gi
      cpu: 10
collector:
  service:
    otlp:
      grpc:
        name: otlp-grpc
        port: 4317
      http:
        name: otlp-http
        port: 4318
@yitzhtal added the bug label on Feb 28, 2024
@Vivekgaddigi

Try the latest version 1.0.2

@yitzhtal
Author

yitzhtal commented Mar 9, 2024

I upgraded to 1.0.2 and used a node selector to schedule onto more stable nodes (not spot instances).
It works now. I'll see whether it stays stable and post an update.
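
For illustration only, a node-selector override of the kind described above might look like the following Helm values fragment. This is a sketch under assumptions, not the configuration actually used in this thread: it assumes the Cassandra subchart accepts a `nodeSelector` map under the `cassandra:` key, and that the nodes belong to an EKS managed node group, which labels nodes with `eks.amazonaws.com/capacityType`.

```yaml
cassandra:
  # Assumption: the Cassandra subchart exposes a `nodeSelector` value here.
  nodeSelector:
    # Label applied by EKS managed node groups; ON_DEMAND excludes spot capacity.
    # Self-managed nodes would need a different, cluster-specific label.
    eks.amazonaws.com/capacityType: ON_DEMAND
```

Whether the selector took effect can be checked by comparing the NODE column of `kubectl get pods -o wide` against the labeled nodes.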

@Vivekgaddigi

Close the issue if it is sorted.

@yitzhtal
Author

I still can't seem to make Jaeger stable. I get these errors:

ERROR [main] 2024-04-11 08:29:47,486 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.50.13.161). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
    at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:177) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1530) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1024) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:718) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 HintsService.java:209 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.50.10.10] 2024-04-11 08:29:47,489 MessagingService.java:1346 - MessagingService has terminated the accept() thread

@yurishkuro transferred this issue from jaegertracing/jaeger on Apr 11, 2024
@robertwenquan

robertwenquan commented Oct 27, 2024

Looks similar. I ran into this with the 3.0.10 chart: one of the pods keeps crashing.

jaeger-cassandra-0                  1/1     Running            0                  13d   10.0.3.24     c21   <none>           <none>
jaeger-cassandra-1                  0/1     CrashLoopBackOff   6 (2m7s ago)       12m   10.0.10.216   c34   <none>           <none>
jaeger-cassandra-2                  1/1     Running            0                  46d   10.0.0.47     p11   <none>           <none>

@Konrad-Smolko

Konrad-Smolko commented Dec 3, 2024

Same issue here. We run a K8s cluster on our own machines: 3 nodes, with 3 Cassandra pods deployed via Jaeger's chart. The nodes were recently upgraded and rebooted. Now the jaeger-cassandra-2 pod is in a crash loop, complaining about the lack of a Cassandra pod on the third node, which I assume was rebooted last.

Cassandra logs mention: (...) restart the node with -Dcassandra.consistent.rangemovement=false

I have yet to figure out where exactly to add such a flag. In Jaeger's chart, adding this:

storage:
    cassandra:
        cmdlineParams:
            cassandra.consistent.rangemovement: false

didn't seem to do anything.
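
For reference, one place a JVM flag like this can go is the Cassandra container's environment rather than the chart's storage settings. The fragment below is a sketch of a StatefulSet patch, not a verified chart value. It assumes the image's stock `cassandra-env.sh` appends `JVM_EXTRA_OPTS` to the JVM options (as the upstream Cassandra 3.11 script does) and that the chart does not override the container's startup command. The container name `jaeger-cassandra` is a placeholder.

```yaml
# Hypothetical strategic-merge patch for the Cassandra StatefulSet.
spec:
  template:
    spec:
      containers:
        - name: jaeger-cassandra   # placeholder: use the real container name
          env:
            # Assumption: cassandra-env.sh adds this variable's contents to JVM_OPTS.
            - name: JVM_EXTRA_OPTS
              value: "-Dcassandra.consistent.rangemovement=false"
```

As the log message itself warns, disabling consistent range movement lets the node stream from a potentially inconsistent replica, so it is better treated as a recovery step than a permanent setting. A manual patch may also be reverted by the next `helm upgrade`.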
