Update bad-disk docs
acelyc111 committed Feb 4, 2024
1 parent c4de635 commit acd6e10
Showing 2 changed files with 46 additions and 24 deletions.
34 changes: 33 additions & 1 deletion _docs/en/administration/bad-disk.md
@@ -2,4 +2,36 @@
permalink: administration/bad-disk
---

TRANSLATING
# Bad disk troubleshooting

A disk failure can usually be identified in one of the following ways:

- An `IO error` for a specific disk appears in the Replica Server logs.
- The latency of one server is significantly higher than that of the other servers. If further investigation shows that the _IO await_ of one of its disks is also significantly higher, that disk is very likely a _slow disk_.
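
The first check can be scripted. The following is a minimal sketch, not part of Pegasus: `count_io_errors` is a hypothetical helper, and it assumes only that a failing log line contains the text `IO error` together with a path under one of the data directories. The sample log line is invented for illustration; real Replica Server log lines may look different.

```python
from collections import Counter
from typing import Iterable, List


def count_io_errors(log_lines: Iterable[str], data_dirs: List[str]) -> Counter:
    """Count log lines that contain 'IO error' and mention each data directory."""
    counts: Counter = Counter()
    for line in log_lines:
        if "IO error" not in line:
            continue
        for data_dir in data_dirs:
            # Match the directory followed by '/' so that /home/work/ssd2
            # does not also match /home/work/ssd20.
            if data_dir.rstrip("/") + "/" in line:
                counts[data_dir] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "IO error: /home/work/ssd2/pegasus/replica/000012.sst: Input/output error",
        "some unrelated line",
    ]
    print(count_io_errors(sample, ["/home/work/ssd2", "/home/work/ssd3"]))
```

A disk whose count keeps growing is a candidate for the black list described in the next section.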

# Bad disk blacklist

Pegasus supports a _disk black list_. To take a bad disk offline, first add it to the _disk black list_ file on the Replica Server where the disk is located. The path of this file is set by the following configuration:

```ini
[replication]
data_dirs_black_list_file = /home/work/.pegasus_data_dirs_black_list
```

Then log in to the corresponding server and edit the file. For example, to disable `ssd2` and `ssd3`:
```txt
/home/work/ssd2
/home/work/ssd3
```
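
If the file is edited from a script rather than by hand, the edit can be made idempotent. The sketch below is only an illustration: `add_to_black_list` is a hypothetical helper, not a Pegasus tool. It assumes the one-path-per-line format shown above, and it treats paths that differ only by a trailing slash as the same disk, which is a choice of this sketch.

```python
from pathlib import Path
from typing import Iterable, List


def add_to_black_list(black_list_file: str, disks: Iterable[str]) -> List[str]:
    """Append disks to the black list file, one path per line, skipping duplicates.

    Returns the paths that were newly added.
    """
    path = Path(black_list_file)
    existing: List[str] = []
    if path.exists():
        existing = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    # Compare without trailing slashes so that /home/work/ssd2 and
    # /home/work/ssd2/ are treated as the same entry (assumption of this sketch).
    known = {entry.rstrip("/") for entry in existing}
    added: List[str] = []
    for disk in disks:
        normalized = disk.strip().rstrip("/")
        if normalized and normalized not in known:
            known.add(normalized)
            added.append(normalized)
    if added:
        # Rewrite the file with the existing entries followed by the new ones.
        path.write_text("\n".join(existing + added) + "\n")
    return added
```

For example, `add_to_black_list("/home/work/.pegasus_data_dirs_black_list", ["/home/work/ssd2", "/home/work/ssd3"])` writes both paths on the first call and changes nothing on a second call.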

## Restart service

After the bad disks are added to the black list, the Replica Server must be restarted for the change to take effect. It is recommended to restart the Replica Server process on the corresponding server by following the [High availability restart steps](rolling-update#high-availability-restart-steps).

After the restart, records like the following appear in the server log, indicating that the black list has been applied:

```log
data_dirs_black_list_file[/home/work/.pegasus_data_dirs_black_list] found, apply it
black_list[1] = [/home/work/ssd2/]
black_list[2] = [/home/work/ssd3/]
```
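
This check can also be automated. The following sketch assumes only the `black_list[N] = [path]` record format shown in the log excerpt above; `applied_black_list` and `missing_from_log` are hypothetical helper names, not part of Pegasus.

```python
import re
from typing import Iterable, List

# Matches records such as: black_list[1] = [/home/work/ssd2/]
BLACK_LIST_ENTRY = re.compile(r"black_list\[\d+\] = \[(?P<path>[^\]]+)\]")


def applied_black_list(log_lines: Iterable[str]) -> List[str]:
    """Return the disk paths reported in black_list[N] log records, in order."""
    paths = []
    for line in log_lines:
        match = BLACK_LIST_ENTRY.search(line)
        if match:
            paths.append(match.group("path"))
    return paths


def missing_from_log(expected: Iterable[str], log_lines: Iterable[str]) -> List[str]:
    """Return the expected disks that the log does not report as black-listed."""
    # The log prints paths with a trailing slash, so compare without it.
    applied = {p.rstrip("/") for p in applied_black_list(log_lines)}
    return [d for d in expected if d.rstrip("/") not in applied]
```

An empty result from `missing_from_log` means that every disk listed in the black list file was reported in the log after the restart.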
36 changes: 13 additions & 23 deletions _docs/zh/administration/bad-disk.md
@@ -2,46 +2,36 @@
permalink: administration/bad-disk
---

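Disk failures happen from time to time. They are usually checked in the following ways:
# Bad disk troubleshooting

- The latency of a certain node may be significantly higher than that of other nodes. When tracing the cause, if the IO await of a certain SSD is significantly higher,
  it basically indicates that the disk is a "slow disk".
Disk failures happen from time to time and can be checked by the following methods:

- Routine periodic maintenance checks can also easily reveal potential disk failures.
- An IO error for a certain disk is found in the Replica Server logs
- The latency of a certain node may be significantly higher than that of other nodes. On further investigation, if the IO await of a certain disk is found to be significantly higher, it basically proves that the disk is a _slow disk_

In Pegasus, how do we carry out bad disk repair?
# Bad disk blacklist

## Bad disk blacklist

Pegasus supports a disk black list. If you want to take a disk offline, you must first define it in the black list file of the Replica node where it is located.
The path of the black list file depends on the configuration:
Pegasus supports a _disk black list_. To take a bad disk offline, first define it in the _black list file_ on the Replica Server where it is located. The path of the black list file is determined by the configuration: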

```ini
[replication]
data_dirs_black_list_file = /home/work/.pegasus_data_dirs_black_list
```

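Next, log in to the corresponding node and edit /home/work/.pegasus_data_dirs_black_list:

Then log in to the corresponding server and edit the file, for example, marking disks ssd2 and ssd3 as disabled: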
```txt
/home/work/ssd2
/home/work/ssd3
```

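The above marks disks ssd2 and ssd3 to be taken offline.
## Restart service

## Restart node
After the bad disks are marked, a restart is required for the change to take effect. It is recommended to restart the Replica Server process on the corresponding server through [High availability restart](rolling-update#高可用重启).

After you have marked the bad disks, you can restart the service process of the corresponding node on its own through [High availability restart](rolling-update#高可用重启).
Usually you can see the following records in the program log, indicating that the disks in the black list are indeed ignored:
After restarting, the following records can be found in the program log, indicating that the disks marked in the black list have taken effect: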

```log
D2019-07-10 21:54:28.879 (1562766868879176673 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:177:initialize(): data_dirs_black_list_file[/home/work/.pegasus_data_dirs_black_list] found, apply it
D2019-07-10 21:54:28.879 (1562766868879300907 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:194:initialize(): black_list[1] = [/home/work/ssd2/]
D2019-07-10 21:54:28.879 (1562766868879312394 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:194:initialize(): black_list[2] = [/home/work/ssd3/]
W2019-07-10 21:54:28.879 (1562766868879404635 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:218:initialize(): replica data dir /home/work/ssd2/pegasus/c3tst-dup2/replica is in black list, ignore it
W2019-07-10 21:54:28.879 (1562766868879411121 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:218:initialize(): replica data dir /home/work/ssd3/pegasus/c3tst-dup2/replica is in black list, ignore it
D2019-07-10 21:54:28.879 (1562766868879415865 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:220:initialize(): data_dirs[0] = /home/work/ssd4/pegasus/c3tst-dup2/replica, tag = ssd4
D2019-07-10 21:54:28.879 (1562766868879422843 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:220:initialize(): data_dirs[1] = /home/work/ssd5/pegasus/c3tst-dup2/replica, tag = ssd5
D2019-07-10 21:54:28.879 (1562766868879428846 9e8d) replica.default0.00009e5b00010001: replication_common.cpp:220:initialize(): data_dirs[2] = /home/work/ssd6/pegasus/c3tst-dup2/replica, tag = ssd6
data_dirs_black_list_file[/home/work/.pegasus_data_dirs_black_list] found, apply it
black_list[1] = [/home/work/ssd2/]
black_list[2] = [/home/work/ssd3/]
```
