[0.13] drop measurement/series taking a very long time #6669
Comments
It looks like your process is deadlocked. There was a fix for this in #6627.
@jwilder I tried this again using 0.14.0~n201605240800, and I'm not sure if it is doing any better. It's been running for quite a while, and I can't query _internal stats or use the CLI client. I see the HTTP writes for other databases going past in the logs, but I'm also seeing a lot of "failed to write point batch to database... timeout" for UDP writes. I was able to make the calls for the pprof data, which I'll include if it helps.
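For reference, a sketch of how pprof dumps like the ones attached below can be collected, assuming the default HTTP bind address of localhost:8086 and that the Go pprof endpoints are enabled on that port (paths follow the standard net/http/pprof layout):

```shell
# Sketch only: adjust host/port to your [http] bind-address.
# Dump all goroutine stacks (shows where writes/drops are blocked).
curl -s -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=2"

# Dump the blocking profile (only populated if block profiling is enabled in the build).
curl -s -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
```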
Another interesting bit is that trying to drop the entire database which holds the bad metrics causes influxd to lock up completely, use up all the memory on the box, and then get killed by the OOM killer. Shouldn't that just be a matter of removing some metadata and deleting the database directory?
@cheribral I see another problem. The
Apologies for the stream of comments, but the server came back up thinking the database was gone, even though the data files were still there. I was also seeing statistics for the database in _internal. I shut down influx, moved the data files for that database out of the data directory, removed the WAL files, and everything seems to be coherent again. I was then a bit suspicious, so I tried to drop a large measurement in another database, and that one worked fine. At this point I'm not sure whether the multiple failed attempts at removing data in that original database corrupted it in a way the server couldn't handle, so I'm not sure whether this issue still needs to stay open. I'll close it on the assumption that you don't need any more noise than necessary :)
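For anyone in a similar state, the manual cleanup described above looks roughly like the following. This is a sketch only, assuming a packaged install with the default /var/lib/influxdb layout; the database name baddb is a placeholder:

```shell
# Stop the daemon before touching any files on disk.
sudo systemctl stop influxdb

# Quarantine the TSM data for the problem database rather than deleting it outright.
sudo mkdir -p /var/tmp/influx-quarantine
sudo mv /var/lib/influxdb/data/baddb /var/tmp/influx-quarantine/

# Remove the corresponding WAL segments for that database.
sudo rm -rf /var/lib/influxdb/wal/baddb

sudo systemctl start influxdb
```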
@jwilder I have run into the same problem. I've also noticed that many drop actions, including drops of shards, databases, measurements, and retention policies, block all other write/read requests to InfluxDB. Is there any plan to narrow the scope of the write lock? It is common to want to query one database (or even one retention policy) while dropping another.
@ivanyu Yes. We're working on it.
@jwilder Any updates? We're also seeing this on InfluxDB 1.0.
@jwilder I'm guessing this will be no surprise, but we're seeing this in v1.1 too. Assuming the root cause is understood, is there some workaround while we wait for the fix? The reason I ask is that it interferes with our workflow when testing at scale and needing to reclaim space between iterations.
This issue is closed as it was related to a deadlock and some serial deletion code that was fixed. If you are having issues with deletes, please log a new issue with details.
Bug report
CentOS 7, AWS r3.2xlarge, provisioned IOPS, etc.
Steps to reproduce:
Expected behavior:
Unsure, but I would hope that a delete could finish in at least roughly the time it would take to read and rewrite the entire measurement.
Actual behavior:
It takes hours to finish without doing any significant work on the server.
I'm not sure if this is related to #6250, but it didn't seem like it should be included in that issue. I've attached the files @jwilder referenced in #6250. Although I know that deletes are expensive, this doesn't seem quite right. It also makes it very hard to recover from a mistake like the one we have here, where someone didn't realize their measurement's cardinality would be so high.
block.txt
goroutine.txt
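For context, the kind of statement that triggers this behaviour is an ordinary drop issued through the influx CLI; the database and measurement names below are placeholders for the real ones:

```shell
# Placeholder names; this is the sort of drop that runs for hours against a
# high-cardinality measurement while the server appears otherwise idle.
influx -database 'mydb' -execute 'DROP MEASUREMENT "high_cardinality_measurement"'
```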