Update zk-migration docs (#78)
acelyc111 authored Feb 8, 2024
1 parent fc42ea1 commit 96f7e43
Showing 3 changed files with 171 additions and 64 deletions.
42 changes: 21 additions & 21 deletions _data/zh/translate.yml
@@ -4,7 +4,7 @@ title_build_pegasus: "编译构建"
title_downloads: "下载"
title_installation: "安装构建"
title_compile-from-source: "从源码编译"
-title_compile-by-docker: "使用Docker完成编译(推荐)"
+title_compile-by-docker: "使用 Docker 完成编译(推荐)"
title_architecture: "系统架构"
title_data-model: "数据模型"
title_documentation: "文档"
@@ -16,8 +16,8 @@ title_contact: "联系我们"
title_contribution: "参与贡献"
title_coding-guides: "编码指引"
title_roadmap: "路线图"
-title_bug-tracking: "Bug追踪"
-title_apache-proposal: "Apache提案"
+title_bug-tracking: "Bug 追踪"
+title_apache-proposal: "Apache 提案"
title_asf: "ASF"
title_asf_foundation: "Foundation"
title_asf_license: "License"
@@ -29,21 +29,21 @@ title_asf_thanks: "Thanks"
title_github: "Github"
title_releases: "版本发布"
title_benchmark: "性能测试"
-title_onebox: "体验Onebox集群"
+title_onebox: "体验 Onebox 集群"
title_shell: "Pegasus Shell 工具"
-title_java-client: "Java客户端"
-title_cpp-client: "C++客户端"
-title_go-client: "Golang客户端"
-title_python-client: "Python客户端"
-title_node-client: "NodeJS客户端"
-title_scala-client: "Scala客户端"
+title_java-client: "Java 客户端"
+title_cpp-client: "C++ 客户端"
+title_go-client: "Golang 客户端"
+title_python-client: "Python 客户端"
+title_node-client: "NodeJS 客户端"
+title_scala-client: "Scala 客户端"
title_clients: "客户端库"
title_api: "用户接口"
title_ttl: "TTL"
title_single-atomic: "单行原子操作"
-title_redis: "Redis适配"
-title_geo: "GEO支持"
-title_http: "HTTP接口"
+title_redis: "Redis 适配"
+title_geo: "GEO 支持"
+title_http: "HTTP 接口"
title_deployment: "集群部署"
title_config: "配置说明"
title_rebalance: "负载均衡"
@@ -53,23 +53,23 @@ title_scale-in-out: "集群扩容缩容"
title_resource-management: "资源管理"
title_cold-backup: "冷备份"
title_meta-recovery: "元数据恢复"
-title_replica-recovery: "Replica数据恢复"
-title_zk-migration: "Zookeeper迁移"
-title_table-migration: "Table迁移"
-title_table-soft-delete: "Table软删除"
-title_table-env: "Table环境变量"
+title_replica-recovery: "Replica 数据恢复"
+title_zk-migration: "Zookeeper 迁移"
+title_table-migration: "Table 迁移"
+title_table-soft-delete: "Table 软删除"
+title_table-env: "Table 环境变量"
title_remote-commands: "远程命令"
title_partition-split: "Partition-Split"
title_duplication: "跨机房同步"
title_compression: "数据压缩"
title_throttling: "流量控制"
title_experiences: "运维经验"
-title_manual-compact: "Manual Compact功能"
-title_usage-scenario: "Usage Scenario功能"
+title_manual-compact: "Manual Compact 功能"
+title_usage-scenario: "Usage Scenario 功能"
title_bad-disk: "坏盘检修"
title_whitelist: "Replica Server 白名单"
title_backup-request: "Backup Request"
-title_docs: "Pegasus产品文档"
+title_docs: "Pegasus 产品文档"
title_tools: "生态工具"
title_admin_cli: "集群管理命令行"
title_pegic: "数据访问命令行"
108 changes: 107 additions & 1 deletion _docs/en/administration/zk-migration.md
@@ -2,4 +2,110 @@
permalink: administration/zk-migration
---

-TRANSLATING
Pegasus's Meta Server uses Zookeeper to store metadata and to perform leader election, so instability in the Zookeeper service can make Pegasus unstable as well. When necessary, the Pegasus metadata can be migrated to another Zookeeper that is more stable or less heavily loaded.

There are two ways to migrate Zookeeper metadata: through metadata recovery, or through the `zkcopy` tool.

# Migration through metadata recovery

Pegasus provides a [Metadata Recovery](meta-recovery) function, which can also be used for Zookeeper migration. The basic idea is to configure the new Zookeeper list and trigger metadata recovery with the `recover` command, so that the metadata is written to the new Zookeeper.

1. Back up the table list

Use the `ls` command of the shell tools:
```
>>> ls -o apps.list
```

2. Back up the node list

Use the `nodes` command of the shell tools:
```
>>> nodes -d -o nodes.list
```

Generate the `recover_node_list` file required for metadata recovery:
```bash
grep ALIVE nodes.list | awk '{print $1}' > recover_node_list
```

3. Stop all Meta Servers

Stop all Meta Servers, then wait for a period of time (30 seconds by default, determined by the configuration `[replication]config_sync_interval_ms`) to ensure that all Replica Servers enter the `INACTIVE` state due to beacon timeout.
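
A minimal sketch of this step, assuming the Meta Servers run as a systemd service named `pegasus-meta` on hosts `meta1`, `meta2` and `meta3` (the service name and host names are placeholders; use your own deployment's stop command):
```bash
#!/bin/bash
# Placeholder host names and service name -- adapt to your deployment.
META_HOSTS="meta1 meta2 meta3"

# Stop every Meta Server in the cluster.
for h in ${META_HOSTS}; do
    ssh "${h}" "sudo systemctl stop pegasus-meta"
done

# Wait longer than [replication]config_sync_interval_ms (30000 ms by default)
# so that all Replica Servers miss their beacons and become INACTIVE.
sleep 35
```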

4. Modify the Meta Server configuration file

Make the following changes:
```
[meta_server]
recover_from_replica_server = true
[zookeeper]
hosts_list = {new Zookeeper host list}
```
These changes mean:
* Set `recover_from_replica_server` to `true`, enabling metadata recovery from the Replica Servers
* Update the Zookeeper configuration to the new service addresses

5. Start a Meta Server

Start one Meta Server in the cluster; it will become the cluster's leader Meta Server.

6. Use the `recover` command of the shell tools

```
>>> recover -f recover_node_list
```
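
After `recover` returns successfully, it is worth comparing the cluster state against the backups taken in steps 1 and 2. One possible check with the shell tools (the `.after` file names are only a suggestion):
```
>>> ls -o apps.list.after
>>> nodes -d -o nodes.list.after
```
Then diff `apps.list.after` against `apps.list`, and `nodes.list.after` against `nodes.list`, to confirm that no table or node has been lost.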

7. Modify the configuration file and restart the Meta Server

After a successful recovery, modify the Meta Server configuration file to switch back to the non-recovery state:
```
[meta_server]
recover_from_replica_server = false
```

8. Restart all Meta Servers; the cluster then returns to the normal state.

## Sample script

For the main flow of a Zookeeper metadata migration, refer to the sample script [pegasus_migrate_zookeeper.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_migrate_zookeeper.sh).

# Migration through the `zkcopy` tool

The basic idea is to use the [zkcopy tool](https://github.com/ksprojects/zkcopy) to copy the Pegasus metadata from the original Zookeeper to the target Zookeeper, then modify the Meta Server configuration file and restart the cluster.

1. Stop all follower Meta Servers

To prevent a follower Meta Server from acquiring the lock and becoming the new leader while the leader Meta Server is being restarted, which would cause metadata inconsistency, keep only the leader Meta Server alive and stop all other follower Meta Servers throughout the entire migration process.

2. Modify the leader Meta Server status to `blind`

Set the leader Meta Server's `meta_level` to `blind` to prohibit any update operations on the Zookeeper data and prevent metadata inconsistency during the migration:
```
>>> set_meta_level blind
```
> For an introduction to the Meta Server's `meta_level`, please refer to [Rebalance](rebalance).

3. Use the `zkcopy` tool to copy Zookeeper metadata

Obtain `zookeeper_root`, the path where the Pegasus metadata is stored on Zookeeper, via the `cluster_info` command of the shell tools, then use the `zkcopy` tool to copy all the data under this path to the new Zookeeper. Note that the copy must be recursive.
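
For example, if `cluster_info` reports `zookeeper_root` as `/pegasus/clusters/my_cluster` (a placeholder path, as are the host names below), a zkcopy invocation might look like the following; the `--source` and `--target` options are taken from the zkcopy README:
```bash
# zkcopy copies the given znode and, recursively, everything under it.
java -jar zkcopy.jar \
    --source old-zk1:2181/pegasus/clusters/my_cluster \
    --target new-zk1:2181/pegasus/clusters/my_cluster
```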

4. Modify the configuration file

Modify the configuration file of the Meta Servers and change the `hosts_list` value to the new service addresses:
```
[meta_server]
hosts_list = {new Zookeeper host list}
```

5. Restart the leader Meta Server

Restart the leader Meta Server and use the shell tools to [check](/administration/experiences#troubleshooting) that the cluster has returned to the normal state.
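
For instance, the `cluster_info` command used in step 3 can serve as the check here; after the restart, its output should show the new Zookeeper service addresses and the unchanged `zookeeper_root` path:
```
>>> cluster_info
```
Also note that the leader was set to `blind` in step 2; once the migration is verified, remember to restore the normal meta level (`steady` by default) with `set_meta_level`.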

6. Start all follower Meta Servers

Start all follower Meta Servers and check that the cluster returns to the normal state.

7. Clean up data on old Zookeepers

Use the `rmr` command of the [zookeepercli tool](https://github.com/openark/zookeepercli) to clean up the data on the old Zookeepers, for example:
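
A possible cleanup command, following the zookeepercli README syntax (the server addresses and the path are placeholders; double-check the path against `zookeeper_root` before deleting anything):
```bash
# Recursively delete the Pegasus subtree from the OLD ensemble only.
zookeepercli --servers old-zk1:2181,old-zk2:2181,old-zk3:2181 \
    -c rmr /pegasus/clusters/my_cluster
```
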
85 changes: 43 additions & 42 deletions _docs/zh/administration/zk-migration.md
@@ -2,109 +2,110 @@
permalink: administration/zk-migration
---

-由于Pegasus的meta server依赖Zookeeper存储元数据和抢主,所以Zookeeper服务的不稳定会造成Pegasus服务不稳定,有时就需要迁移到其他更稳定或者空闲的Zookeeper上。
+由于 Pegasus 的 Meta Server 使用 Zookeeper 来存储元数据和选主,所以 Zookeeper 服务的不稳定会造成 Pegasus 服务不稳定,必要时需要迁移元数据到其他更稳定或者空闲的 Zookeeper 上。

-Zookeeper迁移提供了两种办法:通过元数据恢复迁移;通过zkcopy工具迁移。
+Zookeeper 元数据迁移有两种方式:通过元数据恢复迁移,或通过 `zkcopy` 工具迁移。

# 通过元数据恢复迁移

-Pegasus提供了[元数据恢复](meta-recovery)功能,这个功能也可用于Zookeeper迁移。基本思路就是配置新的Zookeeper后,通过recover命令发起元数据恢复,这样元数据就写入新的Zookeeper上。
+Pegasus 提供了 [元数据恢复](meta-recovery) 功能,这个功能也可用于 Zookeeper 迁移。基本思路是配置新的 Zookeeper 后,通过 `recover` 命令发起元数据恢复,这样元数据就写入新的 Zookeeper 上。

-1. 备份app列表
+1. 备份 table 列表

-使用shell的`ls`命令:
+使用 shell 工具的 `ls` 命令:
```
>>> ls -o apps.list
```

-2. 备份node列表
+2. 备份 node 列表

-使用shell的`nodes`命令:
+使用 shell 工具的 `nodes` 命令:
```
>>> nodes -d -o nodes.list
```

-生成元数据恢复所需的`recover_node_list`文件:
+生成元数据恢复所需的 `recover_node_list` 文件:
```bash
-grep ALIVE nodes.list | awk '{print $1}' >recover_node_list
+grep ALIVE nodes.list | awk '{print $1}' > recover_node_list
```

-3. 停掉所有meta
+3. 停掉所有 Meta Server

-停掉所有meta server,并等待30秒以上,以保证所有replica server因为心跳超时进入INACTIVE状态。
+停掉所有 Meta Server,并等待一段时间(默认为 30 秒,取决于配置项 `[replication]config_sync_interval_ms`),以保证所有 Replica Server 因为心跳超时进入 `INACTIVE` 状态。

-4. 修改meta配置文件
+4. 修改 Meta Server 配置文件

-修改meta server的配置文件,如下:
+修改内容如下:
```
[meta_server]
recover_from_replica_server = true
[zookeeper]
-hosts_list = {new zookeeper host list}
+hosts_list = {new Zookeeper host list}
```
*`recover_from_replica_server`设置为true
* 将zookeeper的`hosts_lists`改为新的服务地址
即:
*`recover_from_replica_server` 设置为 `true`,开启从 Replica Server 恢复元数据的开关
* 更新 Zookeeper 配置更新为新的服务地址

-5. 启动一个meta
+5. 启动一个 Meta Server

-启动其中一个meta server。
+启动集群中的一个 Meta Server,它将成为集群的主 Meta Server。

-6. 通过shell发送recover命令
+6. 通过 shell 工具发送 `recover` 命令

```
>>> recover -f recover_node_list
```
-检查恢复结果,如果出错请参考[常见问题整理](meta-recovery#常见问题整理)排查问题。

-7. 修改配置文件并重启meta
+7. 修改配置文件并重启 Meta Server

-恢复成功后,需要修改配置文件,重新改回非recovery模式:
+恢复成功后,需要修改 Meta Server 的配置文件,重新改回非 recovery 模式:
```
[meta_server]
recover_from_replica_server = false
```

-重新启动所有的meta server,集群进入正常状态。
+8. 重新启动所有的 Meta Server,集群进入正常状态。

-注:[scripts/pegasus_migrate_zookeeper.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_migrate_zookeeper.sh)是我们在内部使用的迁移Zookeeper的脚本,虽然因为服务启停功能的兼容性不能直接使用,但是可以参考其中的流程,或者进行改造。
+## 示例脚本
+
+可以参考 Zookeeper 元数据迁移的示例脚本 [pegasus_migrate_zookeeper.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_migrate_zookeeper.sh) 中的主要流程。

-# 通过zkcopy工具迁移
+# 通过 `zkcopy` 工具迁移

-基本思路就是使用zkcopy工具将原始Zookeeper数据拷贝到目标Zookeeper上,修改meta server配置文件并重启。
+基本思路就是使用 [zkcopy 工具](https://github.com/ksprojects/zkcopy) 将原始 Zookeeper 上的 Pegasus 元数据拷贝到目标 Zookeeper 上,修改 Meta Server 配置文件并重启。

-1. 停掉所有的备meta server
+1. 停掉所有的备 Meta Server

-为了防止重启主meta server时有其他的备meta server抢到锁,造成状态混乱,在整个迁移过程中只保留一个主meta server,其他的备meta server全部停掉。
+为了防止重启主 Meta Server 时,其他的备 Meta Server 抢到锁而成为新的主,造成元数据不一致的问题,需要在整个迁移过程中只保留主 Meta Server 为存活状态,其他的备 Meta Server 全部停掉。

-2. 修改主meta server状态为blind
+2. 修改主 Meta Server 状态为 `blind`

-将主meta server的level设置为blind(关于meta server的level介绍请参见[负载均衡](rebalance#控制集群的负载均衡)),以禁止任何对Zookeeper数据的更新操作,防止在copy过程中出现不一致:
+将主 Meta Server 的 meta_level 设置为 `blind`,以禁止任何对 Zookeeper 数据的更新操作,防止在迁移过程中引起元数据不一致:
```
>>> set_meta_level blind
```
+> 关于 Meta Server 的 meta_level 介绍请参见 [负载均衡](rebalance#控制集群的负载均衡)。

-3. 使用zkcopy工具拷贝Zookeeper数据
+3. 使用 zkcopy 工具拷贝 Zookeeper 元数据

-通过shell的`cluster_info`命令获取Zookeeper元数据节点路径`zookeeper_root`,然后使用zkcopy工具将该节点的数据完全拷贝到新集群的节点上,注意需要递归拷贝。
+通过 shell 工具的 `cluster_info` 命令获取 Pegasus 元数据存储在 Zookeeper 上的路径 `zookeeper_root`,然后使用 zkcopy 工具将该路径的数据全部拷贝到新 Zookeeper 上,注意需要递归拷贝。

4. 修改配置文件

-修改meta server的配置文件,将zookeeper的`hosts_lists`改为新的服务地址:
+修改 Meta Server 的配置文件,将 `hosts_list` 配置值改为新的服务地址:
```
[meta_server]
-hosts_list = {new zookeeper host list}
+hosts_list = {new Zookeeper host list}
```

-5. 重启主meta server
+5. 重启主 Meta Server

-重新启动主meta server,通过shell工具检查集群进入正常状态。
+重新启动主 Meta Server,通过 shell 工具 [检查](/administration/experiences#问题排查) 集群进入正常状态。

-6. 启动所有备meta server
+6. 启动所有备 Meta Server

-启动所有备meta server,集群进入正常状态。
+启动所有备 Meta Server,集群进入正常状态。

-7. 清理旧Zookeeper上的数据
+7. 清理旧 Zookeeper 上的数据

-使用zookeepercli工具的rmr命令清理旧zookeeper上的数据。
+使用 [zookeepercli 工具](https://github.com/openark/zookeepercli) 的 `rmr` 命令清理旧 Zookeeper 上的数据。

-注:上面使用到的`zkcopy`和`zookeepercli`工具以后会提供出来。
