Skip to content

Latest commit

 

History

History
453 lines (351 loc) · 17 KB

20221019_01.md

File metadata and controls

453 lines (351 loc) · 17 KB
typora-copy-images-to
..\images

PostgreSQL误删表数据文件,清理与恢复实验

作者

Sean

日期

2022-10-19

标签

PostgreSQL, tablespace, vacuum

背景

image-20221019083137282

源于一个实际的环境,有人居然把数据文件的segment当作垃圾文件,删了一大批(为了清理空间)。对应的表说不重要,担心系统用不起来。

实验过程

一、配置

整个系统是有归档日志的,关键时候能救命:

archive_mode = on # enables archiving; off, on, or always
archive_command = 'cp %p /pgccc/archive/d5432/%f'               # command to use to archive a logfile segment
#archive_timeout = 0            # force a logfile segment switch after this
archive_cleanup_command = 'pg_archivecleanup -d  /pgccc/archive/d5432 %r 2>>archive_cleanup.log'        # command to execute at

二、试验(1)

弄一个单独的表空间,准备测试表t1

mydb=# create tablespace myts location '/pgccc/myts';
CREATE TABLESPACE
mydb=# create table t1 (id int, col2 char(2000)) tablespace myts;
CREATE TABLE
mydb=# CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
mydb$#     SELECT string_agg(substring('abcdefghijiklmnopqrstuvwxyz', round(random() * 25)::integer, 1), '') FROM generate_series(1, $1); $$ language sql;
CREATE FUNCTION

插入1000万条记录,生成多个段。

mydb=# insert into t1 select n, random_string(1000) from generate_series(1, 10000000) as n;
INSERT 0 10000000

[21:12:58-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lirt
total 20005636
783370 -rw------- 1 postgres postgres       8192 Oct 18 14:33 16393
783369 -rw------- 1 postgres postgres          0 Oct 18 14:33 16392
783371 -rw------- 1 postgres postgres    5054464 Oct 18 15:25 16389_fsm
783368 -rw------- 1 postgres postgres 1073741824 Oct 18 15:27 16389
783436 -rw------- 1 postgres postgres 1073741824 Oct 18 15:27 16389.1
783502 -rw------- 1 postgres postgres 1073741824 Oct 18 15:28 16389.2
783567 -rw------- 1 postgres postgres 1073741824 Oct 18 15:28 16389.3
783633 -rw------- 1 postgres postgres 1073741824 Oct 18 15:29 16389.4
783699 -rw------- 1 postgres postgres 1073741824 Oct 18 15:29 16389.5
783765 -rw------- 1 postgres postgres 1073741824 Oct 18 15:30 16389.6
783830 -rw------- 1 postgres postgres 1073741824 Oct 18 15:30 16389.7
783896 -rw------- 1 postgres postgres 1073741824 Oct 18 15:31 16389.8
783962 -rw------- 1 postgres postgres 1073741824 Oct 18 15:31 16389.9
784027 -rw------- 1 postgres postgres 1073741824 Oct 18 15:32 16389.10
784093 -rw------- 1 postgres postgres 1073741824 Oct 18 15:32 16389.11
784159 -rw------- 1 postgres postgres 1073741824 Oct 18 15:33 16389.12
784224 -rw------- 1 postgres postgres 1073741824 Oct 18 15:33 16389.13
784290 -rw------- 1 postgres postgres 1073741824 Oct 18 15:34 16389.14
784356 -rw------- 1 postgres postgres 1073741824 Oct 18 15:35 16389.15
784422 -rw------- 1 postgres postgres 1073741824 Oct 18 15:35 16389.16
784487 -rw------- 1 postgres postgres 1073741824 Oct 18 15:36 16389.17
784553 -rw------- 1 postgres postgres 1073741824 Oct 18 15:41 16389.18
784619 -rw------- 1 postgres postgres   78905344 Oct 18 15:44 16389.19
784624 -rw------- 1 postgres postgres     630784 Oct 18 15:44 16389_vm

观察下表的情况.

mydb=# select relname, oid from pg_class where oid in (16393, 16392, 16389);
       relname        |  oid
----------------------+-------
 pg_toast_16389       | 16392
 pg_toast_16389_index | 16393
 t1                   | 16389
(3 rows)

把其中一个段移走之后:

[21:13:54-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ mv 16389.2 16389.2_

mydb=# select count(*) from t1;

  count
---------

 1048576
(1 row)

由1000万条,只统计出来100万条.

vacuum试一下

mydb=# vacuum t1;
VACUUM

ls -lirt 没变化 

mydb=# vacuum full verbose t1;
INFO:  vacuuming "public.t1"
INFO:  "t1": found 0 removable, 1048576 nonremovable row versions in 262144 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 1.52 s, system: 4.60 s, elapsed: 16.30 s.
VACUUM

再观察:

[21:19:05-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lirt
total 20000084
783502 -rw------- 1 postgres postgres 1073741824 Oct 18 15:28 16389.2_
783567 -rw------- 1 postgres postgres 1073741824 Oct 18 15:28 16389.3
783633 -rw------- 1 postgres postgres 1073741824 Oct 18 15:29 16389.4
783699 -rw------- 1 postgres postgres 1073741824 Oct 18 15:29 16389.5
783765 -rw------- 1 postgres postgres 1073741824 Oct 18 15:30 16389.6
783830 -rw------- 1 postgres postgres 1073741824 Oct 18 15:30 16389.7
783896 -rw------- 1 postgres postgres 1073741824 Oct 18 15:31 16389.8
783962 -rw------- 1 postgres postgres 1073741824 Oct 18 15:31 16389.9
784027 -rw------- 1 postgres postgres 1073741824 Oct 18 15:32 16389.10
784093 -rw------- 1 postgres postgres 1073741824 Oct 18 15:32 16389.11
784159 -rw------- 1 postgres postgres 1073741824 Oct 18 15:33 16389.12
784224 -rw------- 1 postgres postgres 1073741824 Oct 18 15:33 16389.13
784290 -rw------- 1 postgres postgres 1073741824 Oct 18 15:34 16389.14
784356 -rw------- 1 postgres postgres 1073741824 Oct 18 15:35 16389.15
784422 -rw------- 1 postgres postgres 1073741824 Oct 18 15:35 16389.16
784487 -rw------- 1 postgres postgres 1073741824 Oct 18 15:36 16389.17
784553 -rw------- 1 postgres postgres 1073741824 Oct 18 15:41 16389.18
784619 -rw------- 1 postgres postgres   78905344 Oct 18 15:44 16389.19
784637 -rw------- 1 postgres postgres       8192 Oct 18 21:18 16399
784636 -rw------- 1 postgres postgres          0 Oct 18 21:18 16398
784635 -rw------- 1 postgres postgres 1073741824 Oct 18 21:18 16395
784681 -rw------- 1 postgres postgres 1073741824 Oct 18 21:19 16395.1
783368 -rw------- 1 postgres postgres          0 Oct 18 21:19 16389
783370 -rw------- 1 postgres postgres          0 Oct 18 21:19 16393
783369 -rw------- 1 postgres postgres          0 Oct 18 21:19 16392

结论

会发现16389.2之前的文件以及toast文件都被删除了. 后边的文件因为断了档,直接保留,需要手动删除. 由此,甚至我们可以总结得到, 如果中间删除了一些,倒不如,把16389.1给删除,然后vacuum full, 最后也相当于保留一些数据,表也还在. 看看下边的,还有100条记录.

其实,drop table, 再checkpoint 落盘,则清理的非常干净. (此处略)

mydb=# select count(*) from t1;

  count
---------

 1048576
(1 row)

.

上边是不带index的情况. 如果带上index,是不是同样的结论?

三、试验 (2)

环境准备

带上primary index再试一下这个场景, 为省点空间,用200万条记录模拟即可。

mydb=# create table t1 (id int primary key using index tablespace myts, col2 char(2000)) tablespace myts;
CREATE TABLE
mydb=# insert into t1 select n, random_string(1000) from generate_series(1, 2000000) as n;
INSERT 0 2000000
[21:35:28-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lirt
total 4044924
783370 -rw------- 1 postgres postgres       8192 Oct 18 21:34 16404
783369 -rw------- 1 postgres postgres          0 Oct 18 21:34 16403
784812 -rw------- 1 postgres postgres 1073741824 Oct 18 21:40 16400.1
784879 -rw------- 1 postgres postgres 1073741824 Oct 18 21:43 16400.2
783567 -rw------- 1 postgres postgres    1024000 Oct 18 21:45 16400_fsm
783502 -rw------- 1 postgres postgres   44941312 Oct 18 21:45 16405
785003 -rw------- 1 postgres postgres       8192 Oct 18 21:45 16400_vm
784947 -rw------- 1 postgres postgres  874774528 Oct 18 21:45 16400.3
783368 -rw------- 1 postgres postgres 1073741824 Oct 18 21:45 16400

测试过程

文件分布情况:

mydb=# select relname, oid from pg_class where oid in (16403,16404,16405,16400);
       relname        |  oid
----------------------+-------
 pg_toast_16400       | 16403
 pg_toast_16400_index | 16404
 t1                   | 16400
 t1_pkey              | 16405
(4 rows)

同样,这次,只移走数据文件的一个seg. 16400.2 -> 16400.2.new,

select count出错

mydb=# select count(*) from t1;
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 484333): No such file or directory
CONTEXT:  writing block 484333 of relation pg_tblspc/16388/PG_14_202107181/16384/16400

这时读取count,要访问数据文件了,失败

mydb=# vacuum verbose analyze t1;
INFO:  vacuuming "public.t1"
INFO:  table "t1": found 0 removable, 4 nonremovable row versions in 1 out of 262144 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 747
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  vacuuming "pg_toast.pg_toast_16400"
INFO:  table "pg_toast_16400": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 747
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "public.t1"
INFO:  "t1": scanned 30000 of 262144 pages, containing 120000 live rows and 0 dead rows; 30000 rows in sample, 1048576 estimated total rows
VACUUM

再次查count: 报同样的错

mydb=# select count(*) from t1;
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485312): No such file or directory
CONTEXT:  writing block 485312 of relation pg_tblspc/16388/PG_14_202107181/16384/16400
parallel worker

做一次vacuum full, 失败:

mydb=# vacuum full verbose t1;
WARNING:  could not write block 485313 of pg_tblspc/16388/PG_14_202107181/16384/16400
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485313): No such file or directory
CONTEXT:  writing block 485313 of relation pg_tblspc/16388/PG_14_202107181/16384/16400

如果移掉index文件: 16405?

[21:51:41-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ mv 16405 16405.new
mydb=# select count(*) from t1;
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485314): No such file or directory
CONTEXT:  writing block 485314 of relation pg_tblspc/16388/PG_14_202107181/16384/16400
parallel worker

mydb=# select * from t1 limit 1;
WARNING:  could not write block 485315 of pg_tblspc/16388/PG_14_202107181/16384/16400
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485315): No such file or directory
CONTEXT:  writing block 485315 of relation pg_tblspc/16388/PG_14_202107181/16384/16400

此时发现drop table也不能了

mydb=# drop table t1;
WARNING:  could not write block 485316 of pg_tblspc/16388/PG_14_202107181/16384/16400
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485316): No such file or directory
CONTEXT:  writing block 485316 of relation pg_tblspc/16388/PG_14_202107181/16384/16400

因为这个tablespace很干净,直接清除所有文件(也就是t1表的所有文件), 再看下:

[21:52:56-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ rm -f *

mydb=# drop table t1;
WARNING:  could not write block 485317 of pg_tblspc/16388/PG_14_202107181/16384/16400
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16400.2" (target block 485317): No such file or directory
CONTEXT:  writing block 485317 of relation pg_tblspc/16388/PG_14_202107181/16384/16400

mydb=# checkpoint;
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.

直接stop 再 start 看看server log:

2022-10-18 22:00:26.554 UTC [32343] LOG:  starting PostgreSQL 14.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2022-10-18 22:00:26.554 UTC [32343] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-10-18 22:00:26.554 UTC [32343] LOG:  listening on IPv6 address "::", port 5432
2022-10-18 22:00:26.556 UTC [32343] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-18 22:00:26.559 UTC [32343] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-10-18 22:00:26.569 UTC [32345] LOG:  database system shutdown was interrupted; last known up at 2022-10-18 22:00:22 UTC
2022-10-18 22:00:26.583 UTC [32345] LOG:  database system was not properly shut down; automatic recovery in progress
2022-10-18 22:00:26.586 UTC [32345] LOG:  redo starts at 6/45048608
2022-10-18 22:00:34.130 UTC [32345] LOG:  redo done at 6/5EFA8880 system usage: CPU: user: 2.43 s, system: 5.08 s, elapsed: 7.54 s

连接再看:

mydb=# \d
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | t    | table | postgres
 public | t1   | table | postgres
(2 rows)

[22:02:38-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lrt
total 4045032
-rw------- 1 postgres postgres 1073741824 Oct 18 22:00 16400
-rw------- 1 postgres postgres 1073741824 Oct 18 22:00 16400.1
-rw------- 1 postgres postgres 1073741824 Oct 18 22:00 16400.2
-rw------- 1 postgres postgres    1024000 Oct 18 22:00 16400_fsm
-rw------- 1 postgres postgres   44941312 Oct 18 22:00 16405
-rw------- 1 postgres postgres     131072 Oct 18 22:00 16400_vm
-rw------- 1 postgres postgres  874774528 Oct 18 22:00 16400.3

这样看来,好像toast表是没给建,只有data和index.

mydb=# select count(*) from t1;
ERROR:  index "t1_pkey" contains unexpected zero page at block 0
HINT:  Please REINDEX it.

mydb=# reindex table t1;
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/16403": No such file or directory

下边干掉全部的,是可以的.

mydb=# drop table t1;
DROP TABLE
mydb=# checkpoint;
CHECKPOINT
全部干掉了

四、试验(3)

环境准备:

再试一下恢复:

mydb=# create table t1 (id int primary key using index tablespace myts, col2 char(2000)) tablespace myts;
CREATE TABLE
mydb=# insert into t1 select n, random_string(1000) from generate_series(1, 2000000) as n;
INSERT 0 2000000

[22:19:53-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lirt
total 4045044
783370 -rw------- 1 postgres postgres          0 Oct 18 22:17 24581
783502 -rw------- 1 postgres postgres       8192 Oct 18 22:17 24582
784812 -rw------- 1 postgres postgres    1024000 Oct 18 22:27 24578_fsm
783369 -rw------- 1 postgres postgres 1073741824 Oct 18 22:29 24578
785068 -rw------- 1 postgres postgres 1073741824 Oct 18 22:29 24578.1
785136 -rw------- 1 postgres postgres 1073741824 Oct 18 22:30 24578.2
785259 -rw------- 1 postgres postgres     131072 Oct 18 22:30 24578_vm
785204 -rw------- 1 postgres postgres  874774528 Oct 18 22:30 24578.3
783567 -rw------- 1 postgres postgres   44941312 Oct 18 22:31 24583

测试过程:

[22:41:55-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ mv 24578.2 24578.2_

mydb=# select count(*) from t1;
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/24578.2" (target block 484234): No such file or directory
CONTEXT:  writing block 484234 of relation pg_tblspc/16388/PG_14_202107181/16384/24578

mydb=# select * from t1 limit 1;
WARNING:  could not write block 484262 of pg_tblspc/16388/PG_14_202107181/16384/24578
DETAIL:  Multiple failures --- write error might be permanent.
ERROR:  could not open file "pg_tblspc/16388/PG_14_202107181/16384/24578.2" (target block 484262): No such file or directory
CONTEXT:  writing block 484262 of relation pg_tblspc/16388/PG_14_202107181/16384/24578

重启以后:

[22:42:59-postgres@sean-rh2.openstack.eu-nl-1.cloud.sap:/pgccc/myts/PG_14_202107181/16384]$ ls -lrt
total 5093624
-rw------- 1 postgres postgres          0 Oct 18 22:17 24581
-rw------- 1 postgres postgres       8192 Oct 18 22:17 24582
-rw------- 1 postgres postgres    1024000 Oct 18 22:27 24578_fsm
-rw------- 1 postgres postgres 1073741824 Oct 18 22:30 24578.2_
-rw------- 1 postgres postgres 1073741824 Oct 18 23:22 24578.2
-rw------- 1 postgres postgres   44941312 Oct 18 23:22 24583
-rw------- 1 postgres postgres 1073741824 Oct 18 23:22 24578
-rw------- 1 postgres postgres 1073741824 Oct 18 23:22 24578.1
-rw------- 1 postgres postgres     131072 Oct 18 23:22 24578_vm
-rw------- 1 postgres postgres  874774528 Oct 18 23:22 24578.3
mydb=# select count(*) from t1;

  count
---------

 2000000
(1 row)

归档的功劳? 直接能恢复。后边可以随便弄了。

Flag Counter