docs: adds a known_errors.rst to improve UX #332

Draft · wants to merge 2 commits into main
7 changes: 6 additions & 1 deletion docs/user-guide/cai.rst
@@ -190,7 +190,7 @@ Note that you would need to set up multi-node training run in your cluster env,
trainer.sft.val_check_interval=50 \
trainer.sft.save_interval=50


For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


Step 4: Generate the RL-CAI (preference) dataset for RM and PPO training
@@ -277,6 +277,7 @@ Run the following command to train the RM:
trainer.rm.val_check_interval=25 \
trainer.rm.limit_val_batches=100000

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The trained RM checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

@@ -298,6 +299,8 @@ Run the following command in the background to launch a RM and PPO critic traini
model.seed=1234 \
exp_manager.explicit_log_dir=<path to critic output dir>

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Run the following command to launch actor training and a reference policy server:

.. code-block:: bash
@@ -322,6 +325,8 @@ Run the following command to launch actor training and a reference policy server
remote_critic_rm.critic.ip=<ip to critic service> \
remote_critic_rm.critic.port=5567

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The trained LLM policy checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

Step 7: Inference
8 changes: 5 additions & 3 deletions docs/user-guide/dpo.rst
@@ -17,7 +17,7 @@ For full-parameter DPO, there exists an actor and a reference model. The actor i
For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.

RPO and IPO Variations
#######################
######################

Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO).

@@ -26,7 +26,7 @@ The algorithm is identified with the ``dpo.preference_loss`` config variable. We
To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.

Obtain a Pretrained Model
############################
#########################
To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.

.. tab-set::
@@ -76,7 +76,7 @@ Instruction Following Taught by Supervised Fine-Tuning (SFT)
For best DPO training performance, it is recommended that you start with an SFT model, rather than the base model. For a full guide on how to perform SFT on a Megatron GPT model, please refer to the :ref:`SFT guide <sft>`.

DPO Model Training
#####################
##################

Before running the core DPO training, you must prepare your training and validation data in the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

@@ -182,6 +182,8 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file <https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/conf/gpt_dpo.yaml>`__.

During DPO training, several metrics will be recorded in WandB, with the primary one being ``acc`` (representing the percentage by which the model’s chosen rewards exceed the rejected rewards).
4 changes: 3 additions & 1 deletion docs/user-guide/draftp.rst
@@ -165,6 +165,8 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


.. note::
For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively):
@@ -264,5 +266,5 @@ AIG provides the inference-time flexibility to interpolate between the base Stab
exp_manager.explicit_log_dir=${DIR_SAVE_CKPT_PATH} \
exp_manager.wandb_logger_kwargs.project=${PROJECT} +weight_type='draft,base,power_2.0'


For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

4 changes: 4 additions & 0 deletions docs/user-guide/index.rst
@@ -14,6 +14,7 @@
spin.rst
draftp.rst
cai.rst
known_errors.rst

:ref:`Prerequisite Obtaining a Pre-Trained Model <prerequisite>`
This section provides instructions on how to download pre-trained LLMs in .nemo format. The following section will use these base LLMs for further fine-tuning and alignment.
@@ -41,3 +42,6 @@

:ref:`Constitutional AI: Harmlessness from AI Feedback <model-aligner-cai>`
CAI, an alignment method developed by Anthropic, enables the incorporation of AI feedback for aligning LLMs. This feedback is grounded in a small set of principles (referred to as the ‘Constitution’) that guide the model toward desired behaviors, emphasizing helpfulness, honesty, and harmlessness.

:ref:`Known Errors and Resolutions <known_errors_and_resolutions>`
This section details how to resolve common errors that may arise during the alignment process.
63 changes: 63 additions & 0 deletions docs/user-guide/known_errors.rst
@@ -0,0 +1,63 @@
.. include:: /content/nemo.rsts

.. _known_errors_and_resolutions:

Known Errors and Resolutions
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

This section details how to resolve common errors that may arise during the alignment process.

Gated Hugging Face Assets
#########################

Some NeMo models will pull gated assets, such as tokenizers, from Hugging Face. Examples include the Llama 3 and Llama 3.1 tokenizers.

Example error::

ValueError: Unable to instantiate HuggingFace AUTOTOKENIZER for meta-llama/Meta-Llama-3.1-8B. Exception: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B.
401 Client Error. (Request ID: Root=<redacted>)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-8B is restricted. You must have access to it and be authenticated to access it. Please log in.

Resolution:

1. Request access to the gated repo.
2. Create an `HF Personal Access Token <https://huggingface.co/settings/tokens>`__.
3. Add the token to your environment, using either of the options below:

   a. Store it in ``~/.cache/huggingface/token``.
   b. In your script, export it as an environment variable: ``export HF_TOKEN=<REDACTED_PAT>``.
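
A minimal shell sketch of both options (the token path is the default ``huggingface_hub`` location mentioned above; substitute your actual token for the placeholder):

.. code-block:: bash

    # Option 1: store the token where huggingface_hub looks for it by default
    mkdir -p ~/.cache/huggingface
    echo "<REDACTED_PAT>" > ~/.cache/huggingface/token

    # Option 2: export it as an environment variable for the current session or job
    export HF_TOKEN=<REDACTED_PAT>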


Checkpoints with Missing or Unexpected Weights
##############################################

Some NeMo model checkpoints will error when loaded if weights are missing or unexpected.

Example error::

Traceback (most recent call last):
File "/workspace/NeMo-Aligner/examples/nlp/gpt/serve_ppo_critic.py", line 119, in <module>
main()
...
<Traceback shorted for brevity>
...
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py", line 528, in create_local_plan
return super().create_local_plan()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 196, in create_local_plan
return create_default_local_load_plan(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan
raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.

Resolution:

Add the following to your script invocation:

.. code-block:: bash

python .... \
++model.dist_ckpt_load_strictness=log_all

See the `Megatron-LM source <https://github.com/NVIDIA/Megatron-LM/blob/85cd99bbf54acc1b188b28960155e5c6fcb06686/megatron/core/dist_checkpointing/validation.py#L44>`__ for the full list of available strictness options.
7 changes: 7 additions & 0 deletions docs/user-guide/rlhf.rst
@@ -139,6 +139,7 @@ To launch reward model training, you must start with a pretrained or SFT-trained
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

*Remark: Currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release.*

@@ -196,6 +197,8 @@ To launch the server:
++model.offload_adam_states=True \
++model.mcore_gpt=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above example launches the reward model Critic server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices`` and ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. In addition, make sure to tune the ``trainer.ppo.inference_micro_batch_size`` argument, as this determines the batch size the PPO Actor is allowed to send to the Critic per DP rank.

Launching the Initial Policy and PPO Actor Training
Expand Down Expand Up @@ -235,6 +238,8 @@ The PPO Actor training job contains the master controller that makes the HTTP ca
++model.ppo.entropy_bonus=0.0 \
remote_critic_rm.pad_to_length=2048

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script launches the initial and Actor server on 1 node with 8 GPUs.

.. note::
@@ -355,6 +360,8 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH

wait

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script runs the reward model Critic server on 1 node and the Actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
6 changes: 6 additions & 0 deletions docs/user-guide/rs.rst
@@ -42,6 +42,8 @@ To launch the server:
++model.tensor_model_parallel_size=4 \
rm_model_file=${RM_NEMO_FILE}

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices`` and ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. Also, make sure to tune the ``trainer.rs.inference_micro_batch_size`` argument. This argument sets the size of the batch the RS actor is allowed to send to the critic per DP rank.

@@ -95,6 +97,8 @@ The RS Actor training job contains the master controller that makes the HTTP cal
model.rs.num_rollouts_per_prompt=8 \
model.rs.top_n_rollouts=1

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above command launches the initial and actor server on 1 node with 8 GPUs.

Launching Both Servers for Rejection Sampling training
@@ -217,6 +221,8 @@ You can use slurm to launch the 2 jobs and get them to coordinate together in a

wait

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script runs the reward model server on 1 node and the actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
3 changes: 3 additions & 0 deletions docs/user-guide/sft.rst
@@ -223,6 +223,8 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner.
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

If using sequence packing, replace the data paths with the paths to your packed datasets. For each packed dataset, you should also set ``packed_sequence=True`` in the config:

.. code-block:: python
@@ -382,6 +384,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. Compare
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

To scale to thousands of GPUs, adjust ``trainer.num_nodes`` and ``trainer.devices`` according to the size of your machine.

2 changes: 2 additions & 0 deletions docs/user-guide/spin.rst
@@ -165,6 +165,8 @@ For the below parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds t
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

During SPIN training, several metrics are recorded to WandB for you to monitor, chiefly ``acc`` (the percentage of samples for which the model's implicit reward for the ground-truth response is greater than for the response generated by the reference policy).
The ``reward`` in this case is calculated as the difference between the model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground-truth and generated responses.
During training, ``acc`` should generally increase, but don't worry if its absolute value remains low, as it doesn't correlate with final MT-Bench or MMLU scores.
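
As a short sketch of the quantity being compared (notation is illustrative: :math:`\beta` is the KL penalty above, and :math:`\pi_\theta`, :math:`\pi_{\text{ref}}` are the policy and reference models):

.. math::

    r(y \mid x) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right)

``acc`` is then the fraction of samples for which the ground-truth response receives a higher :math:`r` than the generated response.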
4 changes: 3 additions & 1 deletion docs/user-guide/steerlm.rst
@@ -142,6 +142,7 @@ Note that you would need to set up multi-node training in your cluster env, depe
model.reward_model_type="regression" \
model.regression.num_attributes=9

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Step 4: Generate annotations
############################
@@ -158,6 +159,7 @@ To generate annotations, run the following command in the background to launch a
inference.micro_batch_size=2 \
inference.port=1424

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Now execute:

@@ -230,7 +232,7 @@ For the purposes of this tutorial, the Attribute-Conditioned SFT model is traine
exp_manager.explicit_log_dir=/results/acsft_70b \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


Step 6: Inference
4 changes: 3 additions & 1 deletion docs/user-guide/steerlm2.rst
@@ -159,6 +159,8 @@ By organizing the data in this format, the SteerLM 2.0 model can be effectively
exp_manager.explicit_log_dir=/results/acsft_70b \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Inference
------------------

@@ -169,4 +171,4 @@ References

.. [1] Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2023). SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF.

.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.
.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.