Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Database crashing into a degraded state #36217

Closed
1 task done
nairan-deshaw opened this issue Sep 12, 2024 · 24 comments
Closed
1 task done

[Bug]: Database crashing into a degraded state #36217

nairan-deshaw opened this issue Sep 12, 2024 · 24 comments
Assignees
Labels
kind/bug Issues or changes related a bug stale indicates no udpates for 30 days triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@nairan-deshaw
Copy link

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 CPU / 1TB K8s cluster
- GPU: NA
- Others: NA

Current Behavior

Since upgrading our Milvus deployment to v2.4.7 we are facing crashes where the database goes into a degraded state. This has happened twice in the last 30 days. The components also stay functional, but none of the DB operations work. The attu UI gives gRPC context deadline exceeded errors. Usually restarting the mixcoord component fixes this issue.

While we took a deeper look at the logs prior to the issue and during the issue, we could not find a trigger point and why the database was not able to recover from the failures. Attaching the logs for the window with this issue. Can you help with what could be causing this and why the database needs the mixcoord to be restarted for things to be functional again?

Expected Behavior

Database should continue working normally

Steps To Reproduce

Reproducers are unclear, but has been happening on v2.4.7

Milvus Log

lines.txt

Anything else?

From the logs, the errors continue to cascade and the database becomes unoperational. While we do see the same error logs every now and then, they do not cause db crashes every time.

@nairan-deshaw nairan-deshaw added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 12, 2024
@yanliang567
Copy link
Contributor

yanliang567 commented Sep 14, 2024

@nairan-deshaw quick questions:

  1. which version of milvus did you upgrade from?
  2. the logs you attached seems to only have rootcoord pod info after crash, but we need all tge completed milvus pods logs, so could you please help to collect the logs with the tool below after the crashing happening.
    please refer this doc to export the whole Milvus logs for investigation(with this script, it collects the logs before crashing, which helps us understand what causes the crash)

/assign @nairan-deshaw

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 14, 2024
@nairan-deshaw
Copy link
Author

  • We upgraded from v2.3.12
  • We might not be able to provide full logs, but can send logs in the 30min window before the crash from all the components. The attached logs have all the WARN logs from all pods in the said window.

@yanliang567
Copy link
Contributor

@nairan-deshaw logs in 30 min window before the crash is okay, but i suggest set the milvus log level to debug to provide more info to us. in case you need to know how to config the milvus config: https://milvus.io/docs/configure-helm.md?tab=component

@nairan-deshaw
Copy link
Author

@yanliang567 attaching a file with all the logs (DEBUG level) in the 30 min window. Let me know if that helps.
logs.tar.gz

@yanliang567
Copy link
Contributor

/assign @weiliu1031
please help to take a look
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Sep 18, 2024
@xiaofan-luan
Copy link
Collaborator

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 CPU / 1TB K8s cluster
- GPU: NA
- Others: NA

Current Behavior

Since upgrading our Milvus deployment to v2.4.7 we are facing crashes where the database goes into a degraded state. This has happened twice in the last 30 days. The components also stay functional, but none of the DB operations work. The attu UI gives gRPC context deadline exceeded errors. Usually restarting the mixcoord component fixes this issue.

While we took a deeper look at the logs prior to the issue and during the issue, we could not find a trigger point and why the database was not able to recover from the failures. Attaching the logs for the window with this issue. Can you help with what could be causing this and why the database needs the mixcoord to be restarted for things to be functional again?

Expected Behavior

Database should continue working normally

Steps To Reproduce

Reproducers are unclear, but has been happening on v2.4.7

Milvus Log

lines.txt

Anything else?

From the logs, the errors continue to cascade and the database becomes unoperational. While we do see the same error logs every now and then, they do not cause db crashes every time.

I think from the current log, there is no obvious reason from log except for some timeout .

What do you mean by "none of the DB operations work." does that mean all the coord crashed?

Could you try to pprof and catch goroutine info and cpu usage of coordinator so we can analyze on it?

@nairan-deshaw
Copy link
Author

What do you mean by "none of the DB operations work." does that mean all the coord crashed?

The inserts, queries etc. all time out and we end up seeing the below error on client side:

E0819 01:13:41.110673102 1719680 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:14:41.120204252 1720476 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:15:41.133475728 1721014 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:16:41.166404019 1721536 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:17:41.259244512 1722012 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
[20240819 01:17:41:259 decorators.py:100 WARNING] [release_partitions] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>

The mixcoord itself does not crash which is why k8s does not restart the pod, instead it goes into a degraded state from which it becomes non recoverable. In addition to that attu also becomes unresponsive and keeps loading for a long time before erroring out with Context deadline exceeded.

Could you try to pprof and catch goroutine info and cpu usage of coordinator so we can analyze on it?

Any steps / references on how to do this?

@xiaofan-luan
Copy link
Collaborator

To get pprof, check this scripts

curl "http://${ip}:9091/debug/pprof/goroutine?debug=10" -o ./$log_dir/goroutine/$pod.log || echo "export goroutine for pod $pod failed"

The first command goroutines, mutex, cpu profiling and block point, this will give us more details to analyze.
curl -o goroutine.txt http://10.28.191.52:9091/debug/pprof/goroutine
curl -o mutex.txt http://10.28.191.52:9091/debug/pprof/mutex
curl -o profile.txt http://10.28.191.52:9091/debug/pprof/profile
curl -o block.txt http://10.28.191.52:9091/debug/pprof/block

to check logs, check this scripts
https://github.com/milvus-io/milvus/tree/master/deployments/export-log

Most likely there is a very slow operation in coordinator, but we need to know what the operation is. most likely this will be etcd

@nairan-deshaw
Copy link
Author

Hi @xiaofan-luan, attaching the logs for the suggested endpoints from the current mixcoord component. Let me if that helps or more logs are needed:
logs.tar.gz

@xiaofan-luan
Copy link
Collaborator

from the log, there isn't any suspicious log except for one index node failed.
[datacoord/indexnode_manager.go:131] ["get IndexNode slots failed"] [nodeID=5293]

Is there any warn/error or any node crash?

From the inital log, the query is in many section_ids and this is not recommend(could be very slow). there are many timeout at that moment but we don't have many details so no idea about whether this is querynode timeout or due to some crash or what

@xiaofan-luan
Copy link
Collaborator

right now the datacoord works fine, cpu usage is high on doing
image

One guess is that we have a bug of leaking files , see #36161. And what might happens is after running for while there maybe many leakage cause the datacoord to be slower. This is fixed on 2.4.12

My question is how many collections/partitions are there in your use case? the file seems to be accumulating very fast, because we don't see this happen on even large clusters

@nairan-deshaw
Copy link
Author

nairan-deshaw commented Sep 27, 2024

Is there any warn/error or any node crash?

Checked that no other pod crashes right before this state happens

how many collections/partitions are there in your use case?

We have 12 collections about 1300 partitions which increases at regular intervals. We have currently under 150 million embeddings.

@xiaofan-luan
Copy link
Collaborator

Is there any warn/error or any node crash?

Checked that no other pod crashes right before this state happens

how many collections/partitions are there in your use case?

We have 12 collections about 1300 partitions which increases at regular intervals. We have currently under 150 million embeddings.

try to upgrade to 2.4.12 see if it helps. Maybe also try to increase the resources of coords. Did you see any increase on mixcoord's cpu usage?

@nairan-deshaw
Copy link
Author

Did you see any increase on mixcoord's cpu usage?

We did see some normal spikes on the pods, but not anything unusual. Currently we have allocated 4 CPU and 8GB for the mixcoord.

We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan

@xiaofan-luan
Copy link
Collaborator

Did you see any increase on mixcoord's cpu usage?

We did see some normal spikes on the pods, but not anything unusual. Currently we have allocated 4 CPU and 8GB for the mixcoord.

We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan

I'm doubting on adding a standby mixcoord could help. We probably need to figure out the reason of panic before do the next step.

Please check:

  1. if there are any warn/error log when coord exit
  2. is there any etcd slow message
  3. why cpu usage or memory usage is high. Is there a trend of something going up? that might be a signal of leakge

@nairan-deshaw
Copy link
Author

if there are any warn/error log when coord exit

Logs before and during the crash are attached in #36217 (comment)

is there any etcd slow message

Any specific filters for looking at these logs?

why cpu usage or memory usage is high. Is there a trend of something going up? that might be a signal of leakge

Below are the usage patterns for CPU and memory from our cluster during the crash window:

image

image

@xiaofan-luan
Copy link
Collaborator

if there are any warn/error log when coord exit

Logs before and during the crash are attached in #36217 (comment)

is there any etcd slow message

Any specific filters for looking at these logs?

why cpu usage or memory usage is high. Is there a trend of something going up? that might be a signal of leakge

Below are the usage patterns for CPU and memory from our cluster during the crash window:

image

image

Are you using bulk insertions for all the inserted data?

@nairan-deshaw
Copy link
Author

Yes, we are using bulk inserts.

@xiaofan-luan
Copy link
Collaborator

this might be a known bug since some of the imported segment info is leaked and the cpu accumulated for a while until coord reboot.
@bigsheeper is working on solving it

Copy link

stale bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Nov 9, 2024
@yanliang567 yanliang567 assigned bigsheeper and unassigned weiliu1031 Nov 11, 2024
@stale stale bot removed the stale indicates no udpates for 30 days label Nov 11, 2024
@yanliang567
Copy link
Contributor

@bigsheeper were all the fixed for this issue merged? is it related to a pr that we reverted recently?

@bigsheeper
Copy link
Contributor

@bigsheeper were all the fixed for this issue merged?

Yes, the imported segment leak issue has been resolved, which should help prevent high CPU usage in mixcoord.

is it related to a pr that we reverted recently?

Nope.

@nairan-deshaw
Copy link
Author

Thanks, does 2.4.15 have the potential fix in place?

Copy link

stale bot commented Dec 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Dec 21, 2024
@stale stale bot closed this as completed Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug stale indicates no udpates for 30 days triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants