From c5f9b39a99b48644c903d9df1edd4a3f6a0dac6f Mon Sep 17 00:00:00 2001 From: Terry Kong Date: Tue, 1 Oct 2024 17:09:03 -0700 Subject: [PATCH 1/2] docs: adds a known_errors.rst to improve UX Signed-off-by: Terry Kong --- docs/user-guide/cai.rst | 7 ++++++- docs/user-guide/dpo.rst | 8 +++++--- docs/user-guide/draftp.rst | 4 +++- docs/user-guide/index.rst | 4 ++++ docs/user-guide/rlhf.rst | 7 +++++++ docs/user-guide/rs.rst | 6 ++++++ docs/user-guide/sft.rst | 3 +++ docs/user-guide/spin.rst | 2 ++ docs/user-guide/steerlm.rst | 4 +++- docs/user-guide/steerlm2.rst | 4 +++- 10 files changed, 42 insertions(+), 7 deletions(-) diff --git a/docs/user-guide/cai.rst b/docs/user-guide/cai.rst index 7f381d248..f0db72c30 100644 --- a/docs/user-guide/cai.rst +++ b/docs/user-guide/cai.rst @@ -190,7 +190,7 @@ Note that you would need to set up multi-node training run in your cluster env, trainer.sft.val_check_interval=50 \ trainer.sft.save_interval=50 - +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. Step 4: Generate the RL-CAI (preference) dataset for RM and PPO training @@ -277,6 +277,7 @@ Run the following command to train the RM: trainer.rm.val_check_interval=25 \ trainer.rm.limit_val_batches=100000 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. The trained RM checkpoint will be saved to output dir given by ``exp_manager.explicit_log_dir``. @@ -298,6 +299,8 @@ Run the following command in the background to launch a RM and PPO critic traini model.seed=1234 \ exp_manager.explicit_log_dir= +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + Run the following command to launch actor training and a reference policy server: .. code-block:: bash @@ -322,6 +325,8 @@ Run the following command to launch actor training and a reference policy server remote_critic_rm.critic.ip= \ remote_critic_rm.critic.port=5567 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The trained LLM policy checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``. Step 7: Inference diff --git a/docs/user-guide/dpo.rst b/docs/user-guide/dpo.rst index 62d66fcba..a419f45b4 100644 --- a/docs/user-guide/dpo.rst +++ b/docs/user-guide/dpo.rst @@ -17,7 +17,7 @@ For full-parameter DPO, there exists an actor and a reference model. The actor i For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights. RPO and IPO Variations -####################### +###################### Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO). @@ -26,7 +26,7 @@ The algorithm is identified with the ``dpo.preference_loss`` config variable. We To use the RPO algorithm, each dataset example should have chosen_reward and rejected_reward, which might come from human labelers or reward models. If chosen_reward and rejected_reward are not existent in the data, dpo.default_chosen_reward and dpo.default_rejected_reward are used. Obtain a Pretrained Model -############################ +######################### To start, we must first get a pretrained model to align. 
There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model. .. tab-set:: @@ -76,7 +76,7 @@ Instruction Following Taught by Supervised Fine-Tuning (SFT) For best DPO training performance, it is recommended that you start with a SFT model, rather than the base model. For a full guide on how to perform SFT on a Megatron GPT model, please refer to the :ref:`SFT guide `. DPO Model Training -##################### +################## Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below:: @@ -182,6 +182,8 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file `__. During DPO training, several metrics will be recorded in WandB, with the primary one being ``acc`` (representing the percentage by which the model’s chosen rewards exceed the rejected rewards). diff --git a/docs/user-guide/draftp.rst b/docs/user-guide/draftp.rst index b6fdb8d3c..93f40ea61 100644 --- a/docs/user-guide/draftp.rst +++ b/docs/user-guide/draftp.rst @@ -165,6 +165,8 @@ To launch reward model training, you must have checkpoints for `UNet `. + .. note:: For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively): @@ -264,5 +266,5 @@ AIG provides the inference-time flexibility to interpolate between the base Stab exp_manager.explicit_log_dir=${DIR_SAVE_CKPT_PATH} \ exp_manager.wandb_logger_kwargs.project=${PROJECT} +weight_type='draft,base,power_2.0' - +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst index 650d67a6e..436d0d358 100644 --- a/docs/user-guide/index.rst +++ b/docs/user-guide/index.rst @@ -14,6 +14,7 @@ spin.rst draftp.rst cai.rst + known_errors.rst :ref:`Prerequisite Obtaining a Pre-Trained Model ` This section provides instructions on how to download pre-trained LLMs in .nemo format. The following section will use these base LLMs for further fine-tuning and alignment. @@ -41,3 +42,6 @@ :ref:`Constitutional AI: Harmlessness from AI Feedback ` CAI, an alignment method developed by Anthropic, enables the incorporation of AI feedback for aligning LLMs. This feedback is grounded in a small set of principles (referred to as the ‘Constitution’) that guide the model toward desired behaviors, emphasizing helpfulness, honesty, and harmlessness. + +:ref:`Known Errors and Resolutions ` + This section details how to resolve common pitfalls that may arise during the alignment process. 
diff --git a/docs/user-guide/rlhf.rst b/docs/user-guide/rlhf.rst index 48906a79f..ec414afe3 100644 --- a/docs/user-guide/rlhf.rst +++ b/docs/user-guide/rlhf.rst @@ -139,6 +139,7 @@ To launch reward model training, you must start with a pretrained or SFT-trained srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. *Remark: Currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release.* @@ -196,6 +197,8 @@ To launch the server: ++model.offload_adam_states=True \ ++model.mcore_gpt=True +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above example launches the reward model Critic server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices``, ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. In addition, make sure to tune the `trainer.ppo.inference_micro_batch_size` argument as this determines the batch size the PPO Actor is allowed to send to the Critic per DP rank. Launching the Initial Policy and PPO Actor Training @@ -235,6 +238,8 @@ The PPO Actor training job contains the master controller that makes the HTTP ca ++model.ppo.entropy_bonus=0.0 \ remote_critic_rm.pad_to_length=2048 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above script launches the initial and Actor server on 1 node with 8 GPUs. .. note:: @@ -355,6 +360,8 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH wait +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above script runs the reward model Critic server on 1 node and the Actor on 1 node. It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other. diff --git a/docs/user-guide/rs.rst b/docs/user-guide/rs.rst index ac7ea30ee..384f414bf 100644 --- a/docs/user-guide/rs.rst +++ b/docs/user-guide/rs.rst @@ -42,6 +42,8 @@ To launch the server: ++model.tensor_model_parallel_size=4 \ rm_model_file=${RM_NEMO_FILE} +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.rs.inference_micro_batch_size argument. This argument sets the size of the batch the RS actor is allowed to send to the critic per DP rank. @@ -95,6 +97,8 @@ The RS Actor training job contains the master controller that makes the HTTP cal model.rs.num_rollouts_per_prompt=8 \ model.rs.top_n_rollouts=1 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above command launches the initial and actor server on 1 node with 8 GPUs. Launching Both Servers for Rejection Sampling training @@ -217,6 +221,8 @@ You can use slurm to launch the 2 jobs and get them to coordinate together in a wait +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + The above script runs the reward model server on 1 node and the actor on 1 node. It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other. 
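The essential coordination pattern behind that script is sketched below. This is a minimal illustration only: ``${cmd_rm}`` and ``${cmd_actor}`` stand in for the full reward model server and actor commands shown in the guide, and the output/error file variables are placeholders.

.. code-block:: bash

    # Launch both jobs in the background so neither srun call blocks the other,
    # then wait so the allocation stays alive until both jobs finish.
    srun -o $RM_OUTFILE -e $RM_ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd_rm}" &
    srun -o $ACTOR_OUTFILE -e $ACTOR_ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd_actor}" &
    wait
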
diff --git a/docs/user-guide/sft.rst b/docs/user-guide/sft.rst index 0bed1703a..d62cc8792 100644 --- a/docs/user-guide/sft.rst +++ b/docs/user-guide/sft.rst @@ -223,6 +223,8 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + If using sequence packing, replace the data paths with the paths to your packed datasets. For each packed dataset, you should also set ``packed_sequence=True`` in the config: .. code-block:: python @@ -382,6 +384,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. Compare srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. To scale to thousands of GPUs, adjust the ``trainer.num_nodes`` and ``trainer.devices`` accordingly based on the size of your machine. diff --git a/docs/user-guide/spin.rst b/docs/user-guide/spin.rst index c7e241533..4ab3b1b84 100644 --- a/docs/user-guide/spin.rst +++ b/docs/user-guide/spin.rst @@ -165,6 +165,8 @@ For the below parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds t srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. + During SPIN training, there will be several metrics recorded to WandB which you can monitor, chiefly acc (representing the percentage amount whereby the model's implicit reward for the ground truth response is greater than for the response generated by the reference policy). The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses. During training, the acc should generally be increasing, but don't worry if its absolute value remains low, as it doesn't correlate to finalised MTBench or MMLU scores. It should just be generally increasing. diff --git a/docs/user-guide/steerlm.rst b/docs/user-guide/steerlm.rst index 635abe5a3..8e16d9b1f 100644 --- a/docs/user-guide/steerlm.rst +++ b/docs/user-guide/steerlm.rst @@ -142,6 +142,7 @@ Note that you would need to set up multi-node training in your cluster env, depe model.reward_model_type="regression" \ model.regression.num_attributes=9 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. Step 4: Generate annotations ############################ @@ -158,6 +159,7 @@ To generate annotations, run the following command in the background to launch a inference.micro_batch_size=2 \ inference.port=1424 +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. Now execute: @@ -230,7 +232,7 @@ For the purposes of this tutorial, the Attribute-Conditioned SFT model is traine exp_manager.explicit_log_dir=/results/acsft_70b \ exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True - +For more information on handling potential errors, see :ref:`Known Errors and Resolutions `. 
Step 6: Inference
diff --git a/docs/user-guide/steerlm2.rst b/docs/user-guide/steerlm2.rst
index b3802f45f..8aa55275c 100644
--- a/docs/user-guide/steerlm2.rst
+++ b/docs/user-guide/steerlm2.rst
@@ -159,6 +159,8 @@ By organizing the data in this format, the SteerLM 2.0 model can be effectively
        exp_manager.explicit_log_dir=/results/acsft_70b \
        exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True
 
+For more information on handling potential errors, see :ref:`Known Errors and Resolutions `.
+
 Inference
 ------------------
 
@@ -169,4 +171,4 @@ References
 
 .. [1] Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2023). SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF.
 
-.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.
\ No newline at end of file
+.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.

From 3a94377768f3da73f4f7ace26b7fc1b76fc04439 Mon Sep 17 00:00:00 2001
From: Terry Kong 
Date: Tue, 1 Oct 2024 17:22:18 -0700
Subject: [PATCH 2/2] missing

Signed-off-by: Terry Kong 
---
 docs/user-guide/known_errors.rst | 63 ++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)
 create mode 100644 docs/user-guide/known_errors.rst

diff --git a/docs/user-guide/known_errors.rst b/docs/user-guide/known_errors.rst
new file mode 100644
index 000000000..7920bf19f
--- /dev/null
+++ b/docs/user-guide/known_errors.rst
@@ -0,0 +1,63 @@
+.. include:: /content/nemo.rsts
+
+.. _known_errors_and_resolutions:
+
+Known Errors and Resolutions
+@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+
+This section details how to resolve common pitfalls that may arise during the alignment process.
+
+Gated Huggingface Assets
+########################
+
+Some NeMo models pull gated assets, such as tokenizers, from Huggingface. Examples include the Llama3 and Llama3.1 tokenizers.
+
+Example error::
+
+    ValueError: Unable to instantiate HuggingFace AUTOTOKENIZER for meta-llama/Meta-Llama-3.1-8B. Exception: You are trying to access a gated repo.
+    Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B.
+    401 Client Error. (Request ID: Root=)
+
+    Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/config.json.
+    Access to model meta-llama/Llama-3.1-8B is restricted. You must have access to it and be authenticated to access it. Please log in.
+
+Resolution:
+
+1. Request access to the gated repo
+2. Create an `HF Personal Access Token `__
+3. Add the token to your environment:
+    a. (opt 1): Store it in ``~/.cache/huggingface/token``
+    b. (opt 2): In your script, set it in the environment with ``export HF_TOKEN=`` (see the shell sketch at the end of this page)
+
+
+Checkpoints with Missing or Unexpected Weights
+##############################################
+
+Some NeMo model checkpoints raise an error when loaded if weights are missing or unexpected.
+
+Example error::
+
+    Traceback (most recent call last):
+      File "/workspace/NeMo-Aligner/examples/nlp/gpt/serve_ppo_critic.py", line 119, in 
+        main()
+    ...
+
+    ...
+ File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py", line 528, in create_local_plan + return super().create_local_plan() + File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 196, in create_local_plan + return create_default_local_load_plan( + File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan + raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.") + RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24. + +Resolution: + +Add the following to your script invocation: + +.. code-block:: bash + + python .... \ + ++model.dist_ckpt_load_strictness=log_all + +Visit `Megatron-LM's docs `__ for more information on the options available.