-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Docs] AndSonder Add 可视化静态图自动并行时序图使用文档 #53
Merged
Merged
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
477eacb
add Docs/10_StaticGraph_Semi-AutomaticParallel_ExecutionFrameworkUpgr…
AndSonder 1c6fb11
Merge branch 'main' of https://github.com/PFCCLab/Camp into doc2
AndSonder 31b9e7d
update
AndSonder 046f393
apply suggestions from review
AndSonder a6c017d
apply suggestions from code review
AndSonder 57881ce
Update 1-visualize-flow-parallel-timing-diagram-in-static-graph-mode.md
AndSonder File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
173 changes: 173 additions & 0 deletions
173
...rade/UsageDocs/1-visualize-flow-parallel-timing-diagram-in-static-graph-mode.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
# 可视化静态图流水并行时序图工具使用手册 | ||
|
||
飞桨框架提供了流水并行时序图可视化工具,可以对模型运行过程中流水并行子图的调度信息进行收集、统计和展示。 | ||
|
||
虽然在GPU设备上可以使用Nsight工具对模型进行性能分析和优化,但在分布式流水并行场景下Nsight工具存在以下不足: | ||
|
||
1. 由于GPU程序的异步运行特性,在CPU端针对流水并行子图添加的NVTX标记无法直接与GPU端实际的算子执行区间相对应。在分析流水并行任务时,往往需要从一长串的kernel执行中人工找出每个流水子图的起始和终止算子,才能还原模型实际的流水并行情况。 | ||
2. 流水并行任务往往跨多个机器执行,完整的模型被切分成多个子图,分配给多台机器运行。nsight工具只能单独对每台机器上的调度信息进行采集和展现,无法集中呈现多机之间流水并行的调度全貌。 | ||
|
||
新开发的可视化工具旨在填补这一空白,通过直接在cuda计算流上插入断点以精准统计GPU任务的运行区间。通过展示流水线并行的时序图,直观呈现了分布式训练中各个GPU设备上不同任务的运行区间。 | ||
|
||
本文主要介绍如何使用该工具可视化流水线并行时序图。 | ||
|
||
本工具使用分为两步,第一步为生成时序图数据,第二步为使用python脚本可视化时序图。接下来依次介绍 | ||
|
||
## 一、生成时序图数据 | ||
|
||
### 1. 导入 Guard 工具 | ||
|
||
``` | ||
import paddle | ||
from paddle.profiler.utils import job_schedule_profiler_range | ||
``` | ||
|
||
### 2. 开启性能分析器,收集数据 | ||
|
||
``` | ||
start_step = 0 | ||
stop_step = 5 | ||
|
||
for i in range(10): | ||
with job_schedule_profiler_range(i, 3, 6, True) as status: | ||
exe.run(program, status) | ||
``` | ||
|
||
其中 `job_schedule_profiler_range` 函数的参数依次为: | ||
|
||
- iter_id: 当前迭代的id,用于区分不同的迭代 | ||
- start_step: 可视化工具的起始时间 | ||
- stop_step: 可视化工具的结束时间 | ||
- exit_after_prof: 是否在可视化工具结束后退出程序,默认为 True | ||
|
||
开启性能分析器后,收集到的数据会存储在日志目录 `log_dir` 下,每个设备的数据会存储在 `workerlog.0`, `workerlog.1`, `workerlog.2`, ... 中。 | ||
|
||
### 3. PaddleNLP下开启性能分析器 | ||
|
||
PaddleNLP下,本工具将可视化功能集成到了命令行参数中,以下以 PaddleNLP 中 LLama 训练脚本为例进行说明。 | ||
|
||
PaddleNLP 中为了方便用户运行测试本模型,提供了处理好的100k条doc的训练样本: | ||
|
||
```bash | ||
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy | ||
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz | ||
``` | ||
|
||
将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: | ||
|
||
```bash | ||
mkdir data | ||
mv llama_openwebtext_100k_ids.npy ./data | ||
mv llama_openwebtext_100k_idx.npz ./data | ||
``` | ||
|
||
使用下面脚本,即可在llama-7b的基础上收集模型性能数据: | ||
|
||
```bash | ||
task_name="llama_7b_pp2_mp4_st" | ||
|
||
python -u -m paddle.distributed.launch \ | ||
--gpus "0,1,2,3" \ | ||
--log_dir "output/$task_name""_log" \ | ||
auto_parallel/run_pretrain_auto.py \ | ||
--model_type "llama" \ | ||
--model_name_or_path "meta-llama/Llama-2-7b" \ | ||
--tokenizer_name_or_path "meta-llama/Llama-2-7b" \ | ||
--input_dir "./data" \ | ||
--output_dir "output/$task_name" \ | ||
--split 949,50,1 \ | ||
--max_seq_length 2048 \ | ||
--per_device_train_batch_size 1 \ | ||
--per_device_eval_batch_size 1 \ | ||
--gradient_accumulation_steps 4 \ | ||
--use_flash_attention 0 \ | ||
--use_fused_rms_norm 0 \ | ||
--fp16 0 \ | ||
--fp16_opt_level "O2" \ | ||
--scale_loss 1024 \ | ||
--pipeline_parallel_degree 4 \ | ||
--tensor_parallel_degree 1 \ | ||
--sharding_parallel_degree 1 \ | ||
--sharding "stage1" \ | ||
--learning_rate 0.0001 \ | ||
--min_learning_rate 0.00001 \ | ||
--max_steps 10 \ | ||
--save_steps 5000 \ | ||
--weight_decay 0.01 \ | ||
--warmup_ratio 0.01 \ | ||
--max_grad_norm 1.0 \ | ||
--logging_steps 1\ | ||
--dataloader_num_workers 1 \ | ||
--sharding "" \ | ||
--eval_steps 1000 \ | ||
--report_to "visualdl" \ | ||
--disable_tqdm true \ | ||
--continue_training 0\ | ||
--recompute 1 \ | ||
--do_train \ | ||
--do_eval 0 \ | ||
--device "gpu" \ | ||
--data_impl "mmap" \ | ||
--parallel_mode "auto" \ | ||
--job_schedule_profiler_start 0 \ | ||
--job_schedule_profiler_end 5 \ | ||
``` | ||
|
||
其中,`--job_schedule_profiler_start 0` 和 `--job_schedule_profiler_end 5` 用于调控可视化的时间跨度。 | ||
|
||
在程序执行完毕后,将在 `log_dir` 目录下生成每个设备的时序图数据,储存在 `workerlog.0`、`workerlog.1`、`workerlog.2`、`workerlog.3` 等文件中。 | ||
|
||
## 二、可视化时序图 | ||
|
||
### 1. 单机训练下的可视化 | ||
|
||
在 Paddle 目录下有一个 `python/paddle/distributed/auto_parallel/static/profiler_helper_static.py` 脚本,用来可视化时序图。使用方法如下: | ||
|
||
```bash | ||
python python/paddle/distributed/auto_parallel/static/profiler_helper_static.py --devices 0,1,2,3 --log_dir /home/workspace/PaddleNLP/llm/llama/output/llama_7b_pp2_mp4_st_log | ||
``` | ||
|
||
其中 `--devices` 为需要可视化的设备,`--log_dir` 为时序图数据所在的目录。 | ||
|
||
![picture 4](images/c9bbd2b9f69872e3f761841eaedf18a5f4d35488c78412e9651f666fdadec11e.png) | ||
|
||
脚本会生成 `Chrome tracing` 格式的文件,可以使用 Chrome 浏览器打开,也可以使用 [perfetto](https://ui.perfetto.dev/) 打开 `pipeline_profile_perfetto.json`。perfetto 提供了更好看的界面以及更流畅的体验,更推荐使用。 | ||
|
||
perfetto 可视化效果如下: | ||
|
||
![picture 6](images/db4df88460aee707421616b99587554314637ddd0fc2541f1a4a6174aac4b29d.png) | ||
|
||
Chrome Tracing 可视化效果如下: | ||
|
||
![picture 7](images/7524f138147dab43c170668eb555c767c8e2c65750fae6d7d22332d43abb39b2.png) | ||
|
||
## 2. 多机环境下的可视化 | ||
|
||
由于多机环境下,每个设备的时序图数据会分别存储在不同的机器上,因此需要将时序图数据收集到一台机器上,再进行可视化。请用户在每台机器上运行训练命令,然后将每台机器上的时序图数据按照如下方式放在一台机器上: | ||
|
||
``` | ||
multi_machine_logs | ||
├── machine0 | ||
│ ├── workerlog.0 | ||
│ └── workerlog.1 | ||
├── machine1 | ||
│ ├── workerlog.0 | ||
│ └── workerlog.1 | ||
``` | ||
|
||
然后在任意一台机器上运行可视化脚本并指定 `--log_dir` 参数为 `log_dir` 目录以及开启 `--multi_machine` 参数即可。 | ||
|
||
```bash | ||
python python/paddle/distributed/auto_parallel/static/profiler_helper_static.py --devices 0,1 --log_dir /home/workspace/PaddleNLP/llm/llama/output/llama_7b_pp2_mp4_st_log/multi_machine_logs --multi_machine | ||
``` | ||
|
||
## 三、统计信息和呈现 | ||
|
||
上面已经介绍了如何使用可视化工具,本节将介绍可视化工具的统计信息和呈现。以perfetto为例,可视化效果如下: | ||
|
||
![picture 9](images/0738dff3dc81ad5659c1e54edfbb84836088c9d2f043ef145deba003d7c60447.png) | ||
|
||
目前 Timeline 提供以下特性: | ||
|
||
- 查看不同设备上的任务运行时间区间。 | ||
- 所展示的事件名字上标注事件所持续的时间,点击具体的事件,可在下方的说明栏中看到更详细的事件信息。通过按键 w、s 可进行放大和缩小,通过a、d可进行左移和右移。 |
Binary file added
BIN
+331 KB
...ocs/images/0738dff3dc81ad5659c1e54edfbb84836088c9d2f043ef145deba003d7c60447.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+190 KB
...ocs/images/20b7dd91579b7c3e28cf41a6175639ff23818f615e65c95e421537839ee171b7.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+184 KB
...ocs/images/4bcf95d2863035f8807e574f6d47c08eadcb331ac3477f04aafed0e6e534cdab.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+199 KB
...ocs/images/5d8a690e82933ae4f1da0fba04ab1cec37d010352b5ab8669157f2ce3e616f64.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+184 KB
...ocs/images/7524f138147dab43c170668eb555c767c8e2c65750fae6d7d22332d43abb39b2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+178 KB
...ocs/images/7fdbf2e01291b1a1ee2cae4e0962920bfdbbb32be9f7226532f370d275dc6e4f.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+181 KB
...ocs/images/c9bbd2b9f69872e3f761841eaedf18a5f4d35488c78412e9651f666fdadec11e.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+159 KB
...ocs/images/db4df88460aee707421616b99587554314637ddd0fc2541f1a4a6174aac4b29d.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个工具是面向流水并行场景,主要关注流水并行子图的实际编排和调度状况,在对子图进行分析后,下一步更细粒度的问题分析仍然需要通过nsight进行。在文档建议说明下使用场景和一般流程。