TIDB Instantly runs out of memory when adding index #27687

shellderp · 2021-08-30T22:30:29Z

Bug Report

1. Minimal reproduce step (Required)

TiDB deployed with Kubernetes AWS deployment, 32gb memory dedicated node (https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-on-aws-eks/)

Created a table with the following schema

CREATE TABLE `outgoing_emails` (
  `id` bigint(20) NOT NULL,
  `token` varchar(32) COLLATE utf8mb4_bin NOT NULL,
  `user_id` bigint(20) DEFAULT NULL,
  `brand_id` bigint(20) DEFAULT NULL,
  `retailer_id` bigint(22) DEFAULT NULL,
  `template` varchar(255) COLLATE utf8mb4_general_ci NOT NULL,
  `subject` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `send_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `sent_at` timestamp NULL DEFAULT NULL,
  `bounced_at` timestamp NULL DEFAULT NULL,
  `delivered_at` timestamp NULL DEFAULT NULL,
  `first_opened_at` timestamp NULL DEFAULT NULL,
  `last_opened_at` timestamp NULL DEFAULT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `open_count` bigint(20) NOT NULL DEFAULT '0',
  `message_id` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `provider` varchar(50) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `promotional` tinyint(1) NOT NULL DEFAULT '0',
  `idempotence_token` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `provider_id` bigint(20) DEFAULT NULL,
  `brand_customer_id` bigint(20) DEFAULT NULL,
  `brand_lead_id` bigint(20) DEFAULT NULL,
  `retailer_customer_id` bigint(20) DEFAULT NULL,
  `non_user_email_address` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `idx_token` (`token`),
  UNIQUE KEY `unq_idempotence_token` (`idempotence_token`),
  KEY `idx_template__user_id__send_at` (`template`,`user_id`,`send_at`),
  KEY `idx_template__send_at` (`template`,`send_at`),
  KEY `idx_template__brand_id__send_at` (`template`,`brand_id`,`send_at`),
  KEY `idx_created_at` (`created_at`),
  KEY `idx_retailer_id` (`retailer_id`),
  KEY `idx_provider_message_id` (`provider`,`message_id`),
  KEY `idx_send_at__sent_at` (`send_at`,`sent_at`),
  KEY `idx_updated_at__template` (`updated_at`,`template`),
  KEY `idx_user_id` (`user_id`),
  KEY `idx_template__brand_id__retailer_id__send_at` (`template`,`brand_id`,`retailer_id`,`send_at`),
  KEY `idx_template__brand_customer_id__send_at` (`template`,`brand_customer_id`,`send_at`),
  KEY `idx_brand_id__template__send_at` (`brand_id`,`template`,`send_at`),
  KEY `idx_template__guest_non_user_email_address__send_at` (`template`,`non_user_email_address`,`send_at`),
  KEY `idx_non_user_email_address` (`non_user_email_address`),
  KEY `idx_brand_id__send_at` (`brand_id`,`send_at`),
  KEY `idx_retailer_id__send_at` (`retailer_id`,`send_at`),
  KEY `idx_user_id__send_at` (`user_id`,`send_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

Inserted 3 million records with random values (I don't know how to reproduce trivially)

select count(*) from outgoing_emails;
+----------+
| count(*) |
+----------+
|  3273930 |
+----------+

I ran:

alter table outgoing_emails add index idx_updated_at (updated_at);

No other queries are running on the cluster.

2. What did you expect to see? (Required)

Add index ok

3. What did you see instead (Required)

TiDB uses 30+ gb almost immediately and is OOMKilled repeatedly

Row count is much higher than actual rows in admin show ddl jobs output

+--------+----------------------+---------------------+--------------+----------------------+-----------+----------+-----------+---------------------+---------------------+---------------+
| JOB_ID | DB_NAME              | TABLE_NAME          | JOB_TYPE     | SCHEMA_STATE         | SCHEMA_ID | TABLE_ID | ROW_COUNT | START_TIME          | END_TIME            | STATE         |
+--------+----------------------+---------------------+--------------+----------------------+-----------+----------+-----------+---------------------+---------------------+---------------+
|     81 | email_sender_staging | outgoing_emails     | add index    | write reorganization |        53 |       57 |  10726792 | 2021-08-30 21:40:26 | NULL                | running       |

I see this log as well - is it related?

[2021/08/30 22:28:20.615 +00:00] [ERROR] [misc.go:117] ["panic in the recoverable goroutine"] [label=ddl] [funcInfo=backfillWorker.run] [r={}] [stack="goroutine 1463010 [running]:\ngithub.com/pingcap/tidb/util.GetStack(...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/misc.go:78\ngithub.com/pingcap/tidb/util.Recover(0x3affd76, 0x3, 0x3b26528, 0x12, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/misc.go:121 +0x2d5\npanic(0x38ffa80, 0xc002840120)\n\t/usr/local/go/src/runtime/panic.go:965 +0x1b9\ngithub.com/pingcap/tidb/util/codec.CutOne(0xc0029459eb, 0x9, 0x9, 0x612d5b84, 0x24b5132d, 0x223afda30da, 0xc00393cbe0, 0x2688e98, 0xc002840108, 0x13, ...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/codec/codec.go:900 +0x12a\ngithub.com/pingcap/tidb/kv.NewCommonHandle(0xc0029459eb, 0x9, 0x9, 0x3ef84cf03a, 0xc04374c124b5132d, 0xc0021d9ad0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/kv/key.go:247 +0xd4\ngithub.com/pingcap/tidb/tablecodec.DecodeRowKey(0xc0029459e0, 0x14, 0x14, 0x25dd4ec, 0xc00196dd80, 0xc0010ae8f0, 0xc002840108)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/tablecodec/tablecodec.go:273 +0x225\ngithub.com/pingcap/tidb/ddl.tryDecodeToHandleString(0xc0029459e0, 0x14, 0x14, 0x2, 0x13)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:427 +0x49\ngithub.com/pingcap/tidb/ddl.(*reorgBackfillTask).String(0xc00393cf68, 0x5ee64b5f7a40004, 0x1)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:187 +0x87\ngithub.com/pingcap/tidb/ddl.(*baseIndexWorker).fetchRowColVals(0xc0021d9ad0, 0x4072fe0, 0xc0023de4d0, 0x39, 0xc0029459e0, 0x14, 0x14, 0xc0023e22a0, 0x13, 0x18, ...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1008 +0x2fb\ngithub.com/pingcap/tidb/ddl.(*addIndexWorker).BackfillDataInTxn.func2(0x4030d40, 0xc0001a2000, 0x4072fe0, 0xc0023de4d0, 0x0, 0xc000cd4130)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1125 +0x13e\ngithub.com/pingcap/tidb/kv.RunInNewTxn(0x4030d40, 0xc0001a2000, 0x4069090, 0xc0009f28a0, 0x1, 0xc00393d538, 0x0, 0x0)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/kv/txn.go:46 +0x103\ngithub.com/pingcap/tidb/ddl.(*addIndexWorker).BackfillDataInTxn(0xc0021d9ad0, 0x39, 0xc0029459e0, 0x14, 0x14, 0xc0023e22a0, 0x13, 0x18, 0x0, 0x0, ...)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1120 +0x16f\ngithub.com/pingcap/tidb/ddl.(*backfillWorker).handleBackfillTask(0xc0029d6900, 0xc000b7a480, 0xc001872880, 0x400ac70, 0xc0021d9ad0, 0x2)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:244 +0x1df\ngithub.com/pingcap/tidb/ddl.(*backfillWorker).run(0xc0029d6900, 0xc000b7a480, 0x400ac70, 0xc0021d9ad0, 0xc0024e2000)\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:313 +0x2d7\ncreated by github.com/pingcap/tidb/ddl.(*worker).writePhysicalTableRecord\n\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:613 +0xa6f\n"]

Same issue as #22453

When DDL is cancelled, memory usage drops to < 1GB

I had no issue migrating a table with similar number of records but fewer columns and indexes. I don't know what is special about this table to cause an issue.

4. What is your TiDB version? (Required)

5.7.25-TiDB-v5.2.0

The text was updated successfully, but these errors were encountered:

ichn-hu · 2021-08-31T11:00:51Z

thanks for reporting this bug!

I unrolled the stack

goroutine 1463010 [running]:
github.com/pingcap/tidb/util.GetStack(...)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/misc.go:78
github.com/pingcap/tidb/util.Recover(0x3affd76, 0x3, 0x3b26528, 0x12, 0x0, 0x0)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/misc.go:121 +0x2d5
panic(0x38ffa80, 0xc002840120)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/pingcap/tidb/util/codec.CutOne(0xc0029459eb, 0x9, 0x9, 0x612d5b84, 0x24b5132d, 0x223afda30da, 0xc00393cbe0, 0x2688e98, 0xc002840108, 0x13, ...)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/util/codec/codec.go:900 +0x12a
github.com/pingcap/tidb/kv.NewCommonHandle(0xc0029459eb, 0x9, 0x9, 0x3ef84cf03a, 0xc04374c124b5132d, 0xc0021d9ad0)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/kv/key.go:247 +0xd4
github.com/pingcap/tidb/tablecodec.DecodeRowKey(0xc0029459e0, 0x14, 0x14, 0x25dd4ec, 0xc00196dd80, 0xc0010ae8f0, 0xc002840108)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/tablecodec/tablecodec.go:273 +0x225
github.com/pingcap/tidb/ddl.tryDecodeToHandleString(0xc0029459e0, 0x14, 0x14, 0x2, 0x13)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:427 +0x49
github.com/pingcap/tidb/ddl.(*reorgBackfillTask).String(0xc00393cf68, 0x5ee64b5f7a40004, 0x1)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:187 +0x87
github.com/pingcap/tidb/ddl.(*baseIndexWorker).fetchRowColVals(0xc0021d9ad0, 0x4072fe0, 0xc0023de4d0, 0x39, 0xc0029459e0, 0x14, 0x14, 0xc0023e22a0, 0x13, 0x18, ...)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1008 +0x2fb
github.com/pingcap/tidb/ddl.(*addIndexWorker).BackfillDataInTxn.func2(0x4030d40, 0xc0001a2000, 0x4072fe0, 0xc0023de4d0, 0x0, 0xc000cd4130)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1125 +0x13e
github.com/pingcap/tidb/kv.RunInNewTxn(0x4030d40, 0xc0001a2000, 0x4069090, 0xc0009f28a0, 0x1, 0xc00393d538, 0x0, 0x0)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/kv/txn.go:46 +0x103
github.com/pingcap/tidb/ddl.(*addIndexWorker).BackfillDataInTxn(0xc0021d9ad0, 0x39, 0xc0029459e0, 0x14, 0x14, 0xc0023e22a0, 0x13, 0x18, 0x0, 0x0, ...)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/index.go:1120 +0x16f
github.com/pingcap/tidb/ddl.(*backfillWorker).handleBackfillTask(0xc0029d6900, 0xc000b7a480, 0xc001872880, 0x400ac70, 0xc0021d9ad0, 0x2)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:244 +0x1df
github.com/pingcap/tidb/ddl.(*backfillWorker).run(0xc0029d6900, 0xc000b7a480, 0x400ac70, 0xc0021d9ad0, 0xc0024e2000)
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:313 +0x2d7
created by github.com/pingcap/tidb/ddl.(*worker).writePhysicalTableRecord
        /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/ddl/backfilling.go:613 +0xa6f

do you have trouble if the primary key is not clustered index?

shellderp · 2021-09-01T20:49:31Z

@ichn-hu thanks! I was able to drop the table, insert random data and have the same issue. So it appears reproducible but needs a large data set. I'm not sure how to easily reproduce. I'll try to upload the data set soon. I tried with not clustered index and I can't reproduce this issue, but since it is hard to reproduce I'm not 100% sure

zimulala · 2021-09-02T09:49:29Z

@shellderp
Did the cluster change the values of tidb_ddl_reorg_worker_cnt and tidb_ddl_reorg_batch_size? We can show it by executing the select @@global.tidb_ddl_reorg_worker_cnt; and select @@global.tidb_ddl_reorg_batch_size; statements.

In addition, could you please give all the logs during the execution of this "Add Index"?

shellderp · 2021-09-22T23:27:03Z

Did not change these:

mysql> select @@global.tidb_ddl_reorg_worker_cnt;
+------------------------------------+
| @@global.tidb_ddl_reorg_worker_cnt |
+------------------------------------+
|                                  4 |
+------------------------------------+
1 row in set (0.06 sec)

mysql> select @@global.tidb_ddl_reorg_batch_size;
+------------------------------------+
| @@global.tidb_ddl_reorg_batch_size |
+------------------------------------+
|                                256 |
+------------------------------------+
1 row in set (0.05 sec)

I'm attaching logs up to the OOM crash - but I didn't see anything interesting.
tidblogs.txt

Maybe this is caused by negative values, somehow? My clustered primary key contains random numbers from LONG_MIN to LONG_MAX, including large negative numbers. I just tested inserting only positive numbers for the clustered key in a new table, and I can't reproduce this issue, even at 40 million records in the table. Are negative integers supported in TIDB clustered primary keys? Unfortunately I have to support an old data set that includes negative primary keys.

github-actions · 2021-11-24T02:36:05Z

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

…) (#30123) close #27687

shellderp added the type/bug The issue is confirmed as a bug. label Aug 30, 2021

jingshanglu added the sig/migrate label Aug 31, 2021

aytrack added the severity/major label Oct 27, 2021

tangenta mentioned this issue Nov 18, 2021

ddl, util/codec: fix add index OOM and prevent panic in logging #29925

Merged

12 tasks

ti-chi-bot closed this as completed in #29925 Nov 24, 2021

github-actions bot added the needs-more-info label Nov 24, 2021

tangenta added affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. and removed needs-more-info labels Dec 16, 2021

ti-chi-bot pushed a commit that referenced this issue Feb 9, 2022

ddl, util/codec: fix add index OOM and prevent panic in logging (#29925…

2e28e12

…) (#30123) close #27687

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TIDB Instantly runs out of memory when adding index #27687

TIDB Instantly runs out of memory when adding index #27687

shellderp commented Aug 30, 2021 •

edited

Loading

ichn-hu commented Aug 31, 2021

shellderp commented Sep 1, 2021

zimulala commented Sep 2, 2021

shellderp commented Sep 22, 2021

github-actions bot commented Nov 24, 2021

TIDB Instantly runs out of memory when adding index #27687

TIDB Instantly runs out of memory when adding index #27687

Comments

shellderp commented Aug 30, 2021 • edited Loading

Bug Report

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

ichn-hu commented Aug 31, 2021

shellderp commented Sep 1, 2021

zimulala commented Sep 2, 2021

shellderp commented Sep 22, 2021

github-actions bot commented Nov 24, 2021

shellderp commented Aug 30, 2021 •

edited

Loading