High Service Latency: SQL in Cockroach DB v2.0.2 #26830
Hi @akshath, the logs look OK. What is the output of `crdb_internal.node_statement_statistics`? You need to run it against each node to get the full view; each node only records the queries which it received.
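A per-node query like this can be scripted; the following is a sketch, assuming an insecure (non-TLS) cluster, the `cockroach` binary on PATH, and placeholder node addresses:

```shell
# Sketch: run the statement-statistics query against every node in turn.
# The node list is a placeholder; --insecure assumes TLS is not enabled.
NODES="10.0.0.63:26257 10.0.0.37:26257"
QUERY="SELECT node_id, application_name, count, last_error, service_lat_avg
       FROM crdb_internal.node_statement_statistics
       ORDER BY service_lat_avg DESC LIMIT 10;"
for n in $NODES; do
  echo "--- $n ---"
  # split "host:port" into its two parts with parameter expansion
  cockroach sql --insecure --host="${n%:*}" --port="${n#*:}" -e "$QUERY"
done
```

Each node keeps only its own statistics, so the slowest statements cluster-wide only become visible once every node's output is collected side by side.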
Hi @tschottdorf, thanks for looking into this. Please find the output from 2 of our nodes (selected columns shown; both nodes report the same failing statements):

```
.63:26257> SELECT node_id, application_name, flags, anonymized, count, first_attempt_count,
           max_retries, last_error, rows_avg, rows_var, parse_lat_avg, parse_lat_var,
           plan_lat_avg, plan_lat_var, run_lat_avg, run_lat_var, service_lat_avg,
           service_lat_var, overhead_lat_avg, overhead_lat_var
           FROM crdb_internal.node_statement_statistics
           ORDER BY service_lat_avg DESC LIMIT 10;

| node_id | application_name       | statement                           | count | run_lat_avg | service_lat_avg |
|---------|------------------------|-------------------------------------|-------|-------------|-----------------|
| 3       | PostgreSQL JDBC Driver | UPSERT INTO _(...) VALUES ($1..$37) |   449 |       12.33 |           13.10 |
| 3       | PostgreSQL JDBC Driver | INSERT INTO _(...) VALUES ($1..$37) |    62 |       11.22 |           11.64 |

last_error (both statements):
  waiting on split that failed: split at key
  /Table/53/3/"Wibmo-User-Services"/1921-07-18T20:34:22.803999999Z/"IAP-Promocode-Failure"/0/0/"Wibmo-User-Services:201806160325:3a2170f5-44ab-4850-8ac7-21532f436ba8"
  failed: command is too large: 411135097 bytes (max: 67108864)

(2 rows)  Time: 30.33142ms

.37:26257> (same query)

| node_id | application_name       | statement                                     | count | run_lat_avg | service_lat_avg |
|---------|------------------------|-----------------------------------------------|-------|-------------|-----------------|
| 4       | PostgreSQL JDBC Driver | UPSERT INTO _(...) VALUES ($1..$37)           |   605 |       12.37 |           13.45 |
| 4       | PostgreSQL JDBC Driver | INSERT INTO _(...) VALUES ($1..$37)           |    88 |       11.13 |           11.56 |
| 4       | cockroach sql          | SELECT * FROM _.crdb_internal.node_build_info |     1 |    0.000022 |        0.000718 |

last_error for the UPSERT/INSERT: same split failure at the same key as node 3, with
  command is too large: 411176996 bytes (max: 67108864)

(3 rows)  Time: 31.613941ms
```
Please find attached the output of
Let me know if you need any more info from my side.
@akshath are you sure the PDF is for the right range? (Unfortunately the key boundaries are cut off.) I'm asking because I'd expect a larger range size (especially system bytes), but I'm not seeing that for the range you provided. If you're having trouble finding the right range, try
You can also hover over the table entry on the UI page to see the full keys.
Yes, I think it's the wrong range; I will try to get you the right one. In the meanwhile, I tried to increase the Raft command size limit with the command below:

```
SET CLUSTER SETTING kv.raft.command.max_size = '100000000';
```

But now one of my nodes keeps stopping a few minutes after starting. Please find the logs of that node here.

```
.37:26257> (same statement-statistics query as before)

| node_id | application_name       | statement                                                   | count | run_lat_avg | service_lat_avg |
|---------|------------------------|-------------------------------------------------------------|-------|-------------|-----------------|
| 4       | PostgreSQL JDBC Driver | UPSERT INTO _(...) VALUES ($1..$37)                         |   397 |     60.0006 |         60.0020 |
| 4       | PostgreSQL JDBC Driver | INSERT INTO _(...) VALUES ($1..$37)                         |    59 |     60.0011 |         60.0020 |
| 4       | cockroach sql          | SELECT * FROM _.crdb_internal.node_statement_statistics ... |     1 |    0.000053 |        0.000743 |
| 4       | cockroach sql          | SELECT * FROM _.crdb_internal.node_build_info               |     1 |    0.000017 |        0.000458 |

last_error for the UPSERT/INSERT:
  failed to send RPC: sending to all 2 replicas failed; last error: {
  initial connection heartbeat failed: rpc error: code = Unavailable desc
  = all SubConns are in TransientFailure}

(4 rows)  Time: 8.336417ms
```
@akshath Please restore it to the defaults. Unfortunately with this setting it is possible to add commands to the Raft log that exceed what can be held in memory. This is not easy to undo, so let's hope that it eventually stops crash-looping. |
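For scale: both numbers appear in the earlier error message ("command is too large: 411135097 bytes (max: 67108864)"). A quick check shows the failed split command exceeded the 64 MiB default limit roughly six-fold, so even the temporary 100 MB setting could not have let it through:

```shell
CMD_SIZE=411135097   # from "command is too large: 411135097 bytes"
MAX_SIZE=67108864    # default kv.raft.command.max_size per the error message

# integer ratio of command size to the default limit
echo "command is $((CMD_SIZE / MAX_SIZE))x the default limit"   # prints: command is 6x the default limit
echo "default limit in MiB: $((MAX_SIZE / 1024 / 1024))"        # prints: default limit in MiB: 64
```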
Thank you @tschottdorf for the quick reply. I have rolled back that setting with the following command: SET CLUSTER SETTING kv.raft.command.max_size = '62914560'; But still, one of my nodes is not coming up and keeps exiting. Please find the latest logs. Thanks
@akshath on the machine which keeps crashing, run this (you may have to adjust the url):
When it crashes again, upload the Thanks! |
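The stripped command above was presumably along these lines: a sketch that periodically snapshots the Go heap profile from the node's HTTP endpoint (CockroachDB serves Go's pprof handlers under /debug/pprof/; the host, port, plain-HTTP scheme, and file naming here are assumptions):

```shell
# Sketch: capture a heap profile every 10s until the process dies.
# Host/port and http-vs-https are assumptions -- adjust for your setup.
URL="http://localhost:8080/debug/pprof/heap"
while true; do
  # name each snapshot by Unix timestamp so the last files before
  # the crash are easy to identify
  wget -q "$URL" -O "heap.$(date +%s).pprof" || break
  sleep 10
done
```

When the node dies, the loop exits (wget fails), and the most recent `heap.*.pprof` files show what was on the heap just before the crash.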
We are suspecting that some workloads create a sufficient number of abort span records to cause split commands that are too large to be proposed to Raft, in effect rendering the range permanently unsplittable. This is a problem since the range will at some point start to backpressure writes (and even without that, it's a resource hog). Most of the problematic abort span records would very likely be removed during garbage collection; however, abort span records aren't counted in any quantity that triggers GC. Instead of copying the entirety of the abort span, restrict to the entries that would not be removed by a GC operation. In practice, this means that unless a high number of abort span records are created every hour, very few records will actually be copied, and in return the split command size should be small. See cockroachdb#26830. Release note (bug fix): Avoid a situation in which ranges repeatedly fail to perform a split.
Please do find the last two heap files. Thanks |
26934: storage: filter abort span during split copy r=nvanbenschoten a=tschottdorf (PR description as in the commit message quoted above.) See #26830. Release note (bug fix): Avoid a situation in which ranges repeatedly fail to perform a split. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@akshath this looks like what I expected. You have a long tail of uncommitted, very large Raft proposals in the Raft log (presumably due to raising the max proposal size as you did temporarily). During startup, the Raft group is elected leader (presumably because it calls elections for some high term and has the highest log, but @bdarnell, shouldn't there be a leader already, and shouldn't that prevent the candidate state in the first place?), and that means it has to load all of the entries from storage, which causes the out-of-memory error. @bdarnell, is this expected behavior? I'm surprised that this replica calls an election. I think the other replicas are online (please confirm, @akshath -- and please also supply the range info page).
Yeah, I'm confused about why we're seeing elections here. Maybe the range is dormant and for some reason this node is the only one that wakes it up? If the other nodes are up, one of them should become leader and invalidate the failing node's uncommitted entries. In any case, I thought this was fixed in 2.0, but looking back at the fix (etcd-io/etcd#9073), my change doesn't actually do what I meant for it to do.
Is it possible not to call
From the code it looks like all
@akshath, we'll need to see the Range status page for the correct RangeID to make further suggestions. It can be retrieved from one of the other nodes while your problematic node is down. I would leave that node down, by the way.
26944: backport-2.0: storage: filter abort span during split copy r=bdarnell a=tschottdorf Backport 1/1 commits from #26934. Needed slight adjustments to apply, but nothing in the split logic. /cc @cockroachdb/release (PR description as in the commit message quoted above.) See #26830. Release note (bug fix): Avoid a situation in which ranges repeatedly fail to perform a split. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Yes, we should be able to remove the call to
Right now only 3 of 6 nodes are up. The other 3 keep going down when I try to restart them. When I try to access the Admin UI I keep getting the 'Connection to CockroachDB node lost.' error. Can you please guide me on how to get the correct range info page? Also, the application is no longer able to connect to the DB. Please find the current output from the SQL window.

```
.37:26257> (same statement-statistics query as before)

| node_id | application_name       | statement                                                   | count | run_lat_avg | service_lat_avg |
|---------|------------------------|-------------------------------------------------------------|-------|-------------|-----------------|
| 4       | PostgreSQL JDBC Driver | INSERT INTO _(...) VALUES ($1..$37)                         |   154 |     60.0033 |         60.0046 |
| 4       | PostgreSQL JDBC Driver | SELECT * FROM _ WHERE _ = $1                                |     1 |     60.0035 |         60.0044 |
| 4       | PostgreSQL JDBC Driver | UPSERT INTO _(...) VALUES ($1..$37)                         |   718 |     59.9994 |         60.0015 |
| 4       | cockroach sql          | SELECT * FROM _.crdb_internal.node_statement_statistics ... |     1 |    0.000083 |        0.001573 |
| 4       | cockroach sql          | SELECT * FROM _.crdb_internal.node_build_info               |     1 |    0.000031 |        0.000455 |

last_error for the JDBC statements:
  failed to send RPC: sending to all 1 replicas failed; last error: {
  rpc error: code = Canceled desc = context canceled}

(5 rows)  Time: 12.07971ms
```
Hi Team, Any advice on how I should proceed? Thanks |
@bdarnell has an upstream fix merged, which we should be able to cherry-pick onto the release-2.0 branch today. You should try running with that binary when it's available; it should mitigate this specific out of memory error, but there is a possibility that you will see more problems due to the large Raft log that has been built up. |
Thanks @tschottdorf for getting back to me. Waiting for the new release.
@akshath v2.0.4 was released today, with the mitigation for this problem. Please give it a try! |
Thank you, will upgrade and report back.
@akshath, any news from your end? In other cases, we have seen that it was still difficult to revive the server due to other locations in the code that struggle with the large Raft log. |
Not much; 1 node of 6 keeps quitting.
Could you provide more information? This isn't expected. |
Closing as the known problem here was a large Raft log, and @bdarnell and @nvanbenschoten have since landed a series of fixes that we believe eliminates both the root cause and mitigates the effects. |
Hi Team,
I have been seeing very high 'Service Latency: SQL' on my CockroachDB cluster for a couple of days and am not able to find the cause.
The application is doing about 50 inserts/updates per second against a 6-node cluster in DigitalOcean. Total records in the system are around 22M.
Please find the attached debug info; it may help.
Also, I am unable to get 'profile?debug' as the request is timing out on the server when I issue wget for it.
cockroach-last500line-node2.log
cockroach-last500line-node3.log