InfluxDB takes 2 hours to start #6250
Would you be able to run:
There is a lot of lock contention related to loading the metadata index still. I suspect you are hitting that.
@undera Can you run
@undera Looks like a permissions issue. The file is owned by
Good to know. The error diagnostics could probably be improved to hint at the permissions issue. Here's the output with sudo:
@undera It shouldn't panic like that. That's a separate bug. How many shards do you have in
I have 217 subfolders there.
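(For reference, each numbered subdirectory under a retention policy's data directory is one shard, so a count like the one above can be reproduced with a quick one-liner; the path is taken from the logs quoted later in this issue.)

```sh
# Each numbered subdirectory under the retention policy is one shard; count them.
ls /var/lib/influxdb/data/Loadosophia/default | wc -l
```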
Hi, https://gist.github.com/aderumier/63921e366ac3eae5cc73965075f04370, startup takes around 1h.
@aderumier There are no specific fixes for this issue in 0.13. If you are able to test #6618, it would be helpful to know whether it improves things.
@jwilder I would like to help, but I have no idea how to build influxdb from source. Do you have a build somewhere?
@aderumier #6618 should be in tonight's nightly. If you could test it out, that would be great.
Removing my previous comment; it's very fast, but that's because it's only starting with a small fraction of our data. Same tombstone error reported above (and it's not permissions!). Filed #6641 to track.
@jwilder I tried the latest nightly on one of our bigger databases with 11.5 million series, and it brought the startup time down to 10 minutes from 34 minutes. There was still a lot of time, though, where most of the CPUs weren't doing anything and the disk was quiet, so I'd assume there is still something getting hung up somewhere.
@jwilder I have tried the latest nightly too; it takes around 10 min now instead of 1 hour previously. As for cheribral, CPU and disk usage are quite low during startup. https://gist.github.com/aderumier/cafae2fbcf57c45254f79cc5f3e98ea0
@aderumier @cheribral Could you run this after the server fully starts and attach the output to the issue?
Also, when you restart, can you monitor the process and see if you have a high amount of
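(For context, the block.txt and goroutine.txt files mentioned later in this thread are the standard Go pprof dumps served on InfluxDB's HTTP port; a sketch of how they are typically captured, since the exact command asked for above was truncated:)

```sh
# Grab the blocking and goroutine profiles from a running influxd (default HTTP port 8086).
curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"
```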
@jwilder: iowait was around 0.5%, CPU idle around 60% (of 16 cores). I can't restart it anymore, I have a TSM crash: https://gist.github.com/aderumier/7be9507af94fb11423ce2ce6335b292d I think it's related to #6641
@aderumier I created a separate issue for the interface panic: #6652. The latest nightly reverts the commit that introduced that panic.
@jwilder could this issue by any chance also cause "drop measurement" queries to take forever? I'm trying to clean up the database used to test this, and the queries have been running for hours, blocking writes, and occasionally spitting out some lines about compaction in the log.
If you grab these profiles, it might help identify the slowness.
OK, it's starting fine with the latest nightly (~10 min): https://gist.github.com/aderumier/6c906efbf4dd8be9d75fcfae23ea46a1 Debug output just after startup: block.txt goroutine.txt
I have a WIP branch that would be useful if someone could test on their dataset to see if it helps or not. This branch is still very experimental, so please do not test it on your production servers. The branch is: https://github.com/influxdata/influxdb/tree/jw-db-index. It should significantly reduce the database index lock contention at startup and make better use of available cores for loading shards.
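(For anyone wanting to try the branch without waiting for a nightly, a rough sketch of the GOPATH-style build used at the time; it assumes a working Go toolchain, and CONTRIBUTING.md in the repo is the authoritative reference:)

```sh
# Fetch the source and its dependencies into GOPATH, switch to the experimental branch, rebuild influxd.
go get github.com/influxdata/influxdb/cmd/influxd
cd "$GOPATH/src/github.com/influxdata/influxdb"
git checkout jw-db-index
go get ./...              # pick up any dependencies new to this branch
go install ./cmd/influxd
```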
@jwilder, I have tested the jw-db-index branch, no improvement for me (15 min startup), around 60% idle CPU of 16 cores: https://gist.github.com/aderumier/fcd4311f4fc42aa2e82b87b4601c91ba
@aderumier Thanks for testing it. Would you be able to grab the block and goroutine profiles while it's running? Is there much
iowait and sys time are around 0, I see only user CPU. dumptsmdev dump. For the block and goroutine profiles, do you want them just after start, or during the start?
Block at the end. Goroutine somewhere in the middle of loading.
I can't grab the goroutine profile while loading, because port 8086 is not listening.
Just after start:
@aderumier A few things I can see:
I'll take a look at the scanning, but you might want to see if you can reduce your series cardinality as well. If you don't need something as a tag, removing it will help future data.
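(To illustrate the tag-versus-field point with hypothetical measurement and key names: every distinct combination of tag values creates a new series in the index, while field values do not, so moving a high-cardinality value out of a tag keeps the series count down.)

```sh
# High cardinality: request_id as a tag creates one new series per distinct id.
curl -XPOST "http://localhost:8086/write?db=telegraf" \
  --data-binary 'http_requests,host=server01,request_id=abc123 duration=0.5'

# Lower cardinality: the same value stored as a string field adds no new series.
curl -XPOST "http://localhost:8086/write?db=telegraf" \
  --data-binary 'http_requests,host=server01 duration=0.5,request_id="abc123"'
```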
Do you know why my collectd database indexes are fast to load (around 30s)? Database volume sizes: collectd: 74G, telegraf: 98G
[store] 2016/05/19 06:46:35 /var/lib/influxdb/data/collectd/default/1 opened in 24.880605565s
[shard] 2016/05/19 06:50:32 /var/lib/influxdb/data/telegraf/default/66 database index loaded in 4m14.321419953s
@aderumier Can you run
Summary statistics: 220825 series for collectd vs 11387193 series for telegraf, so ~50x more series for telegraf.
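(The command asked for above was truncated; for reference, one rough way to compare series cardinality per database from the influx CLI. The output format of SHOW SERIES varies by version and includes headers, so treat the counts as approximate.)

```sh
# Approximate series count per database (line counts include header rows).
influx -database collectd -execute 'SHOW SERIES' | wc -l
influx -database telegraf -execute 'SHOW SERIES' | wc -l
```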
Is it possible to limit the number of keys in TSM files?
I also notice that I have only 4 shards (each 7 days): /var/lib/influxdb/data/telegraf/default# ls -lah Is it possible to create more, smaller shards to improve load parallelism?
I think that should do it?
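(For readers following along: the shard group duration is a property of the retention policy, so smaller shards come from altering it there. A sketch only, since the SHARD DURATION clause was added in later releases and only affects newly created shard groups; check your version's InfluxQL support.)

```sh
# Shrink the shard group duration for the default retention policy on telegraf (new shards only).
influx -execute 'ALTER RETENTION POLICY "default" ON "telegraf" SHARD DURATION 1d'
```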
As of #6743, this issue should be resolved. There is still more tuning and improvement to be made to startup, but the hour-long startup time seen in some cases should be resolved now.
This is a continuation of issue #5764. The machine and database I use are the same.
I have upgraded InfluxDB to 0.12 from Debian (0.12.0-1). Now the start takes slightly over 2 hours. A 33% increase is not bad, but it is still unacceptable to have 2 hours of downtime just to start. My dream is to replace my MySQL server with InfluxDB, because I see InfluxDB as better suited to storing my timeline data. MySQL starts within several seconds (I will keep the memory consumption comparison for a separate thread).
Can we improve InfluxDB to beat MySQL? :) My feeling is that the long startup (and the memory consumption too) is the price of being "schemaless", but in fact there is a schema, and it is kept in memory. Loading that schema at startup takes a long time and a lot of RAM (in certain cases). I might be wrong with my assumptions, sorry for being overconfident.
I've looked into the logs; it seems there are some pretty long shard loads happening. I've grepped from the log:
[tsm1] 2016/04/06 21:43:56 /var/lib/influxdb/data/Loadosophia/default/512 database index loaded in 11m2.839406822s
[tsm1] 2016/04/06 21:47:00 /var/lib/influxdb/data/Loadosophia/default/513 database index loaded in 14m7.359681031s
[tsm1] 2016/04/06 21:54:48 /var/lib/influxdb/data/Loadosophia/default/525 database index loaded in 21m12.196889089s
[tsm1] 2016/04/06 22:11:10 /var/lib/influxdb/data/Loadosophia/default/524 database index loaded in 27m14.029874898s
[tsm1] 2016/04/06 22:45:11 /var/lib/influxdb/data/Loadosophia/default/523 database index loaded in 1h12m5.557480759s
[tsm1] 2016/04/06 23:41:10 /var/lib/influxdb/data/Loadosophia/default/527 database index loaded in 1h54m9.795156493s
Attaching full startup log.
influxd.20160407.txt
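(For reference, the per-shard load times quoted above can be pulled from the attached log with a simple grep:)

```sh
# Extract the per-shard index load times from the attached startup log.
grep "database index loaded" influxd.20160407.txt
```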