docs: adds a known_errors.rst to improve UX #332

Draft · wants to merge 2 commits into main
7 changes: 6 additions & 1 deletion docs/user-guide/cai.rst
@@ -190,7 +190,7 @@ Note that you would need to set up multi-node training run in your cluster env,
trainer.sft.val_check_interval=50 \
trainer.sft.save_interval=50


For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


Step 4: Generate the RL-CAI (preference) dataset for RM and PPO training
@@ -277,6 +277,7 @@ Run the following command to train the RM:
trainer.rm.val_check_interval=25 \
trainer.rm.limit_val_batches=100000

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The trained RM checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

@@ -298,6 +299,8 @@ Run the following command in the background to launch a RM and PPO critic traini
model.seed=1234 \
exp_manager.explicit_log_dir=<path to critic output dir>

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Run the following command to launch actor training and a reference policy server:

.. code-block:: bash
@@ -322,6 +325,8 @@ Run the following command to launch actor training and a reference policy server
remote_critic_rm.critic.ip=<ip to critic service> \
remote_critic_rm.critic.port=5567

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The trained LLM policy checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

Step 7: Inference
8 changes: 5 additions & 3 deletions docs/user-guide/dpo.rst
@@ -17,7 +17,7 @@ For full-parameter DPO, there exists an actor and a reference model. The actor i
For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.

RPO and IPO Variations
#######################
######################

Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO).

@@ -26,7 +26,7 @@ The algorithm is identified with the ``dpo.preference_loss`` config variable. We
To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.

Obtain a Pretrained Model
############################
#########################
To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.

.. tab-set::
@@ -76,7 +76,7 @@ Instruction Following Taught by Supervised Fine-Tuning (SFT)
For best DPO training performance, it is recommended that you start with an SFT model, rather than the base model. For a full guide on how to perform SFT on a Megatron GPT model, please refer to the :ref:`SFT guide <sft>`.

DPO Model Training
#####################
##################

Before running the core DPO training, you must prepare your training and validation data in the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

@@ -182,6 +182,8 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file <https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/conf/gpt_dpo.yaml>`__.

During DPO training, several metrics will be recorded in WandB, with the primary one being ``acc`` (representing the percentage by which the model’s chosen rewards exceed the rejected rewards).
4 changes: 3 additions & 1 deletion docs/user-guide/draftp.rst
@@ -165,6 +165,8 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


.. note::
For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively):
@@ -264,5 +266,5 @@ AIG provides the inference-time flexibility to interpolate between the base Stab
exp_manager.explicit_log_dir=${DIR_SAVE_CKPT_PATH} \
exp_manager.wandb_logger_kwargs.project=${PROJECT} +weight_type='draft,base,power_2.0'


For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

4 changes: 4 additions & 0 deletions docs/user-guide/index.rst
@@ -14,6 +14,7 @@
spin.rst
draftp.rst
cai.rst
known_errors.rst

:ref:`Prerequisite Obtaining a Pre-Trained Model <prerequisite>`
This section provides instructions on how to download pre-trained LLMs in .nemo format. The following section will use these base LLMs for further fine-tuning and alignment.
@@ -41,3 +42,6 @@

:ref:`Constitutional AI: Harmlessness from AI Feedback <model-aligner-cai>`
CAI, an alignment method developed by Anthropic, enables the incorporation of AI feedback for aligning LLMs. This feedback is grounded in a small set of principles (referred to as the ‘Constitution’) that guide the model toward desired behaviors, emphasizing helpfulness, honesty, and harmlessness.

:ref:`Known Errors and Resolutions <known_errors_and_resolutions>`
This section details how to resolve common errors that may arise during the alignment process.
63 changes: 63 additions & 0 deletions docs/user-guide/known_errors.rst
@@ -0,0 +1,63 @@
.. include:: /content/nemo.rsts

.. _known_errors_and_resolutions:

Known Errors and Resolutions
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

This section details how to resolve common errors that may arise during the alignment process.

Gated Hugging Face Assets
#########################

Some NeMo models will pull gated assets, such as tokenizers, from Hugging Face. Examples include the Llama 3 and Llama 3.1 tokenizers.

Example error::

ValueError: Unable to instantiate HuggingFace AUTOTOKENIZER for meta-llama/Meta-Llama-3.1-8B. Exception: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B.
401 Client Error. (Request ID: Root=<redacted>)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-8B is restricted. You must have access to it and be authenticated to access it. Please log in.

Resolution:

1. Request access to the gated repo.
2. Create an `HF Personal Access Token <https://huggingface.co/settings/tokens>`__.
3. Add the token to your environment, using either of the options below:

   a. Store it in ``~/.cache/huggingface/token``.
   b. In your script, export it as an environment variable: ``export HF_TOKEN=<REDACTED_PAT>``.
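
A minimal shell sketch of both options (the token path is the default ``huggingface_hub`` location mentioned above; substitute your actual token for the placeholder):

.. code-block:: bash

    # Option 1: store the token where huggingface_hub looks for it by default
    mkdir -p ~/.cache/huggingface
    echo "<REDACTED_PAT>" > ~/.cache/huggingface/token

    # Option 2: export it as an environment variable for the current session or job
    export HF_TOKEN=<REDACTED_PAT>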


Checkpoints with Missing or Unexpected Weights
##############################################

Some NeMo model checkpoints will error when loaded if weights are missing or unexpected.

Example error::

Traceback (most recent call last):
File "/workspace/NeMo-Aligner/examples/nlp/gpt/serve_ppo_critic.py", line 119, in <module>
main()
...
<Traceback shorted for brevity>
...
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/torch.py", line 528, in create_local_plan
return super().create_local_plan()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 196, in create_local_plan
return create_default_local_load_plan(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan
raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.

Resolution:

Add the following to your script invocation:

.. code-block:: bash

python .... \
++model.dist_ckpt_load_strictness=log_all

See the `Megatron-LM source <https://github.com/NVIDIA/Megatron-LM/blob/85cd99bbf54acc1b188b28960155e5c6fcb06686/megatron/core/dist_checkpointing/validation.py#L44>`__ for the full list of available strictness options.
7 changes: 7 additions & 0 deletions docs/user-guide/rlhf.rst
@@ -139,6 +139,7 @@ To launch reward model training, you must start with a pretrained or SFT-trained
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

*Remark: Currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release.*

@@ -196,6 +197,8 @@ To launch the server:
++model.offload_adam_states=True \
++model.mcore_gpt=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above example launches the reward model Critic server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices`` and ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. In addition, make sure to tune the ``trainer.ppo.inference_micro_batch_size`` argument, as this determines the batch size the PPO Actor is allowed to send to the Critic per DP rank.

Launching the Initial Policy and PPO Actor Training
Expand Down Expand Up @@ -235,6 +238,8 @@ The PPO Actor training job contains the master controller that makes the HTTP ca
++model.ppo.entropy_bonus=0.0 \
remote_critic_rm.pad_to_length=2048

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script launches the initial and Actor server on 1 node with 8 GPUs.

.. note::
@@ -355,6 +360,8 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH

wait

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script runs the reward model Critic server on 1 node and the Actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
6 changes: 6 additions & 0 deletions docs/user-guide/rs.rst
@@ -42,6 +42,8 @@ To launch the server:
++model.tensor_model_parallel_size=4 \
rm_model_file=${RM_NEMO_FILE}

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices`` and ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. Also, make sure to tune the ``trainer.rs.inference_micro_batch_size`` argument. This argument sets the size of the batch the RS actor is allowed to send to the critic per DP rank.

@@ -95,6 +97,8 @@ The RS Actor training job contains the master controller that makes the HTTP cal
model.rs.num_rollouts_per_prompt=8 \
model.rs.top_n_rollouts=1

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above command launches the initial and actor server on 1 node with 8 GPUs.

Launching Both Servers for Rejection Sampling training
@@ -217,6 +221,8 @@ You can use slurm to launch the 2 jobs and get them to coordinate together in a

wait

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

The above script runs the reward model server on 1 node and the actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
3 changes: 3 additions & 0 deletions docs/user-guide/sft.rst
@@ -223,6 +223,8 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner.
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

If using sequence packing, replace the data paths with the paths to your packed datasets. For each packed dataset, you should also set ``packed_sequence=True`` in the config:

.. code-block:: python
@@ -382,6 +384,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. Compare
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

To scale to thousands of GPUs, adjust ``trainer.num_nodes`` and ``trainer.devices`` according to the size of your machine.

2 changes: 2 additions & 0 deletions docs/user-guide/spin.rst
@@ -165,6 +165,8 @@ For the below parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds t
srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

During SPIN training, several metrics are recorded to WandB for you to monitor, chiefly ``acc`` (the percentage of samples for which the model's implicit reward for the ground-truth response is greater than for the response generated by the reference policy).
The ``reward`` in this case is calculated as the difference between the model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground-truth and generated responses.
During training, ``acc`` should generally increase, but don't worry if its absolute value remains low, as it doesn't correlate with final MT-Bench or MMLU scores.
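
As a short sketch of the quantity being compared (notation is illustrative: :math:`\beta` is the KL penalty above, and :math:`\pi_\theta`, :math:`\pi_{\text{ref}}` are the policy and reference models):

.. math::

    r(y \mid x) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right)

``acc`` is then the fraction of samples for which the ground-truth response receives a higher :math:`r` than the generated response.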
4 changes: 3 additions & 1 deletion docs/user-guide/steerlm.rst
@@ -142,6 +142,7 @@ Note that you would need to set up multi-node training in your cluster env, depe
model.reward_model_type="regression" \
model.regression.num_attributes=9

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Step 4: Generate annotations
############################
@@ -158,6 +159,7 @@ To generate annotations, run the following command in the background to launch a
inference.micro_batch_size=2 \
inference.port=1424

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Now execute:

@@ -230,7 +232,7 @@ For the purposes of this tutorial, the Attribute-Conditioned SFT model is traine
exp_manager.explicit_log_dir=/results/acsft_70b \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.


Step 6: Inference
4 changes: 3 additions & 1 deletion docs/user-guide/steerlm2.rst
@@ -159,6 +159,8 @@ By organizing the data in this format, the SteerLM 2.0 model can be effectively
exp_manager.explicit_log_dir=/results/acsft_70b \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True

For more information on handling potential errors, see :ref:`Known Errors and Resolutions <known_errors_and_resolutions>`.

Inference
------------------

@@ -169,4 +171,4 @@ References

.. [1] Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2023). SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF.

.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.
.. [2] Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Zhang, J.J., Sreedhar, M.N., Kuchaiev, O. (2024). HelpSteer2: Open-source dataset for training top-performing reward models.