docs: Apply other feedback from 24.09 VDR (#411)
Signed-off-by: Terry Kong <terryk@nvidia.com>
terrykong authored Nov 27, 2024
1 parent 716e503 commit 4247dc5
Showing 13 changed files with 262 additions and 127 deletions.
4 changes: 4 additions & 0 deletions docs/user-guide/aligner-algo-header.rst
@@ -0,0 +1,4 @@
.. important::
Before starting this tutorial, be sure to review the :ref:`introduction <nemo-aligner-getting-started>` for tips on setting up your NeMo-Aligner environment.

If you run into any problems, refer to NeMo's `Known Issues page <https://docs.nvidia.com/nemo-framework/user-guide/latest/knownissues.html>`__. The page enumerates known issues and provides suggested workarounds where appropriate.
78 changes: 37 additions & 41 deletions docs/user-guide/cai.rst
@@ -1,6 +1,8 @@
.. include:: /content/nemo.rsts

.. _model-aligner-cai:
.. include:: aligner-algo-header.rst

.. _nemo-aligner-cai:

Constitutional AI: Harmlessness from AI Feedback
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@ -14,12 +16,12 @@ CAI allows training a harmless, but non-evasive AI assistant that engages with h
.. _Constitutional AI (CAI): https://arxiv.org/abs/2212.08073

CAI
###############
The basic steps of CAI are described in this section and illustrated in the figure below (`Figure 1 <https://arxiv.org/abs/2212.08073>`_).
###
The basic steps of CAI are described in this section and illustrated in the :ref:`figure below <nemo-aligner-cai-flow-diagram>`.

(Supervised Stage) Critique → Revision → Supervised Learning: The AI generates responses to harmfulness prompts using a helpful-only AI assistant, then critiques and revises its own responses according to a principle in the constitution, and then fine-tunes the original model on the revised responses.

(RL Stage) AI Comparison Evaluations → Reward Model → Reinforcement Learning: The AI generates pairs of responses to harmfulness prompts using the finetuned model, then evaluates which response is better according to a principle in the constitution, and then trains a reward model based on this dataset of AI preferences and a human helpfulness preferences. The AI then trains with RL using the learned reward model.
(RL Stage) AI Comparison Evaluations → Reward Model → Reinforcement Learning: The AI generates pairs of responses to harmfulness prompts using the fine-tuned model, then evaluates which response is better according to a principle in the constitution, and then trains a reward model based on this dataset of AI preferences and human helpfulness preferences. The AI then trains with RL using the learned reward model.

.. image:: ../assets/cai_diagram.png
:alt: basic steps of the CAI process
@@ -29,25 +31,22 @@ The basic steps of CAI are described in this section and illustrated in the figu
Critiques, revisions, and AI harmlessness feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model. It gives some control over the initial behavior at the start of the RL phase, while addressing potential exploration problems. The RL stage significantly improves performance and reliability.

Motivation
###############
##########
The motivation behind Constitutional AI is to design AI systems whose objectives and behaviors are guided by a set of predefined rules or principles. It includes the following:

Scaling supervision: using AI to help humans supervise other AIs more efficiently and effectively, especially for tasks where AI capabilities may exceed human ones.

A harmless but non-evasive assistant: reducing the tension between helpfulness and harmlessness, and avoiding evasive responses that reduce transparency and helpfulness.

Simplicity and transparency: encoding the training goals in a simple list of natural language instructions or principles, and using chain-of-thought reasoning to make AI decision making explicit and understandable.
- Scaling Supervision: Use AI to assist humans in supervising other AIs more efficiently and effectively, particularly for tasks where AI capabilities may surpass human ones.
- A Harmless but Non-Evasive Assistant: Minimize the tension between helpfulness and harmlessness, and avoid evasive responses that reduce transparency and helpfulness.
- Simplicity and Transparency: Encode training goals in a straightforward list of natural language instructions or principles, and employ chain-of-thought reasoning to make AI decision-making explicit and understandable.
- Reducing Iteration Time: Eliminate the need to collect new human feedback labels when modifying objectives or testing different behaviors.

Reducing iteration time: obviating the need to collect new human feedback labels when altering the objective or testing different behaviors.

Train a CAI model
#####################
Train a CAI Model
#################

This section is a step-by-step tutorial that walks you through how to run a full CAI pipeline with a ``Mistral-7B`` LLM model. It includes the following:

1. Data download and preprocessing.
1. Download the models and datasets.

2. Generate responses to harmfulness prompts using a helpful-only AI assistant. Ask the model to critique its response according to a principle in the constitution, and then revise the original response in light of the critique.
2. Generate and revise responses to harmful prompts creating the SL-CAI dataset. Ask the model to critique its response according to a principle in the constitution, and then revise the original response in light of the critique.

3. Fine-tune ``Mistral-7B`` with SFT on the revised responses to create a ``Mistral-7B-SL-CAI`` model.

@@ -56,24 +55,22 @@ This section is a step-by-step tutorial that walks you through how to run a full
b. Formulate each prompt and pair into a multiple choice question, where we ask ``Mixtral-8x7B`` which response is best according to the constitution.
c. Blend the AI feedback preference dataset (prompts and pairs) with the human feedback helpfulness dataset.

5. Train a Reward Model (RM).
5. Train the Reward Model (RM).

6. Fine-tune the ``Mistral-7B-SL-CAI`` with Proximal Policy Optimization (PPO) and the RM to train a ``Mistral-7B-RL-CAI`` model.

7. Run inference.

.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

If you run into any problems, refer to NeMo's `Known Issues page <https://docs.nvidia.com/nemo-framework/user-guide/latest/knownissues.html>`__. The page enumerates known issues and provides suggested workarounds where appropriate.
.. _nemo-aligner-cai-flow-diagram:

.. image:: ../assets/cai_flow.png

Step 1: Download models and datasets
#############################################################################
1. Download ``Mistral-7B-Instruct`` and ``Mistral-7B`` LLM models from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 and https://huggingface.co/mistralai/Mistral-7B-v0.1 into the models folder.
Step 1: Download the models and datasets
########################################

1. Download the ``Mistral-7B-Instruct`` and ``Mistral-7B`` LLM models from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 and https://huggingface.co/mistralai/Mistral-7B-v0.1 into the models folder.
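
   If the checkpoints are not already available locally, one way to fetch them is with the Hugging Face CLI. The snippet below is a sketch rather than part of the official guide: it assumes the ``huggingface_hub`` CLI is installed and that you keep checkpoints under a local ``models`` folder.

   .. code-block:: bash

      # Sketch: download both Hugging Face checkpoints into the models folder.
      # Assumes `pip install -U huggingface_hub` and, if prompted, `huggingface-cli login`.
      huggingface-cli download mistralai/Mistral-7B-Instruct-v0.1 --local-dir models/Mistral-7B-Instruct-v0.1
      huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir models/Mistral-7B-v0.1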

Then, convert into .nemo format:
Then, convert them into .nemo format:

.. code-block:: bash
@@ -92,7 +89,7 @@ Step 1: Download models and datasets
This command will download the dataset to ``/path/to/anthropic_red_team_attempts_train.json``.


3. Download SFT helpfulness dataset:
3. Download the SFT helpfulness dataset:

.. code-block:: bash
@@ -101,7 +98,7 @@ Step 1: Download models and datasets
This command will download the dataset to ``/path/to/nvidia_sft_datablend_v1_train.json``.


4. Download and process preference helpfulness dataset:
4. Download and process the preference helpfulness dataset:

.. code-block:: bash
@@ -112,7 +109,7 @@ Step 1: Download models and datasets
Step 2: Generate and revise responses to harmful prompts creating the SL-CAI dataset
###################################################################################################
####################################################################################

Run an inference server in the background using the following command:

@@ -158,16 +155,16 @@ Please wait for the server to be ready before proceeding.
--apply_chat_template False \
--response_extract_pattern "[/INST]"
This will generate an SL-CAI dataset of prompts and revised responses as ``cai_revisions_aligner_chat_template.json``
This will generate an SL-CAI dataset of prompts and revised responses as ``cai_revisions_aligner_chat_template.json``.

The few-shot samples should be provided following the template in ``few_shot_samples_example.json`` (filling in the `content` tags, and choosing how many samples to use), and should include a red teaming prompt, a response from the helpful model (e.g. ``Mistral-7B`` in this tutorial), critique and revision requests and responses. An example is shown in the `Anthropic repo <https://github.com/anthropics/ConstitutionalHarmlessnessPaper/blob/main/prompts/CritiqueRevisionFewShotPrompts.json>`_.
The few-shot samples should be provided following the template in ``few_shot_samples_example.json``. Fill in the `content` tags and choose how many samples to use. The samples should include a red teaming prompt, a response from the helpful model (e.g., ``Mistral-7B`` in this tutorial), critique and revision requests, and responses. An example is shown in the `Anthropic repo <https://github.com/anthropics/ConstitutionalHarmlessnessPaper/blob/main/prompts/CritiqueRevisionFewShotPrompts.json>`_.

*NOTE: The tokenizer file can be found by extracting the .nemo checkpoint using `tar -xf /models/mistral/mistral-7b-Instruct.nemo`.
There are 2 tokenizer files that end with `.model` in the model checkpoint and they are the same, so you can use either one for data processing.*
.. note::
The tokenizer file can be found by extracting the .nemo checkpoint using `tar -xf /models/mistral/mistral-7b-Instruct.nemo`. There are two tokenizer files that end with `.model` in the model checkpoint, and they are identical. You can use either one for data processing.
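
To make the note above concrete, here is a small sketch; the scratch directory is an arbitrary choice:

.. code-block:: bash

   # Sketch: a .nemo checkpoint is a tar archive. Extract it into a scratch
   # directory and list the tokenizer files it contains.
   mkdir -p /tmp/mistral-7b-instruct-extracted
   tar -xf /models/mistral/mistral-7b-Instruct.nemo -C /tmp/mistral-7b-instruct-extracted
   ls /tmp/mistral-7b-instruct-extracted/*.model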


Step 3: Fine-tune Mistral-7B on the revised responses to create a Mistral-7B-SL-CAI model
######################################################################################################
#########################################################################################

Note that you will need to set up a multi-node training run in your cluster environment, depending on the type of cluster you use. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html.

@@ -199,10 +196,9 @@ Note that you would need to set up multi-node training run in your cluster env,
Step 4: Generate the RL-CAI (preference) dataset for RM and PPO training
##############################################################################################################
########################################################################

The following section runs an inference server with the SL-CAI model that we've previously trained, and queries it with red teaming prompts asking for several responses per prompt.
The responses will then be ranked by a judge LLM being run from NVIDIA's NGC. An NGC API key can be acquired `here`_.
The following section runs an inference server with the SL-CAI model that we've previously trained. It queries the server with red teaming prompts, requesting several responses per prompt. These responses will then be ranked by a judge LLM running from NVIDIA's NGC. You can acquire an NGC API key `here`_.

The following command will run the inference server:

@@ -257,8 +253,8 @@ Using a different terminal, run the following command to start the RL-CAI datase
This command will create the ``rl-cai`` dataset files in the defined output folder with the given output filename prefix.


Step 5: Train the RM
#####################
Step 5: Train the Reward Model (RM)
###################################

Run the following command to train the RM:

@@ -285,7 +281,7 @@ Run the following command to train the RM:
The trained RM checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

Step 6: Fine-tune Mistral-7B-SL-CAI with PPO and the RM to train a Mistral-7B-RL-CAI model
Step 6: Fine-tune the Mistral-7B-SL-CAI with PPO and the RM to train a Mistral-7B-RL-CAI model
##############################################################################################
Run the following command in the background to launch an RM and PPO critic training server:

@@ -329,8 +325,8 @@ Run the following command to launch actor training and a reference policy server
The trained LLM policy checkpoint will be saved to the output dir given by ``exp_manager.explicit_log_dir``.

Step 7: Inference
##################
Step 7: Run inference
#####################
To start inference, run an inference server in the background using the following command:

.. code-block:: bash
41 changes: 28 additions & 13 deletions docs/user-guide/dpo.rst
@@ -1,15 +1,12 @@
.. include:: /content/nemo.rsts

.. _model-aligner-dpo:
.. include:: aligner-algo-header.rst

.. _nemo-aligner-dpo:

Model Alignment by DPO, RPO, and IPO
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

If you run into any problems, refer to NeMo's `Known Issues page <https://docs.nvidia.com/nemo-framework/user-guide/latest/knownissues.html>`__. The page enumerates known issues and provides suggested workarounds where appropriate.

The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase.

All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa3) of any size.
@@ -22,7 +19,7 @@ In full-parameter DPO, there exists an actor and a reference model. The actor is
For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.

RPO and IPO Variations
#######################
######################

Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO).

@@ -31,7 +28,7 @@ The algorithm is identified with the ``dpo.preference_loss`` config variable. We
To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.
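
If you want to attach per-example rewards to an existing preference ``.jsonl`` file instead of relying on the defaults, the following sketch uses ``jq``; the reward values are placeholders, not recommendations, and the output path is hypothetical:

.. code-block:: bash

   # Sketch: add chosen_reward/rejected_reward fields to every line of a
   # preference .jsonl file. Replace the placeholder values with scores from
   # your labelers or reward model.
   jq -c '. + {chosen_reward: 5.0, rejected_reward: 1.0}' \
      /path/to/train_dpo_format.jsonl > /path/to/train_rpo_format.jsonl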

Obtain a Pretrained Model
############################
#########################
To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.

.. tab-set::
@@ -81,7 +78,7 @@ Instruction Following Taught by Supervised Fine-Tuning (SFT)
For best DPO training performance, it is recommended that you start with an SFT model, rather than the base model. For a full guide on how to perform SFT on a Megatron GPT model, please refer to the :ref:`SFT guide <sft>`.

DPO Model Training
#####################
##################

Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

@@ -100,6 +97,25 @@ Your JSONL file must contain at least as many samples as the Global Batch Size (
Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model.
For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``.

.. tip::

If you don't have a DPO dataset readily available, you can generate a toy one to get started. Here's
an example to generate ``NUM_EXAMPLES_TO_GENERATE`` examples. Ensure this value is larger than the
global_batch_size.

.. code-block:: bash
# Generates a dummy dataset in /path/to/train_dpo_format.jsonl /path/to/valid_dpo_format.jsonl
NUM_EXAMPLES_TO_GENERATE=200
mkdir -p /path/to
for i in $(seq 1 $NUM_EXAMPLES_TO_GENERATE); do
cat <<EOF
{"prompt": "<extra_id_0>System\n\n<extra_id_1>User\n${i}*10=?\n<extra_id_1>Assistant\n", "chosen_response": "$((i * 10))\n<extra_id_1>", "rejected_response": "I refuse to answer this question.\n<extra_id_1>"}
EOF
done | tee /path/to/train_dpo_format.jsonl /path/to/valid_dpo_format.jsonl >/dev/null
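
A quick sanity check on the generated files is sketched below; it only confirms that every line parses as JSON and reports the sample counts, which must satisfy the global batch size requirement described above:

.. code-block:: bash

   # Sketch: verify sample counts and that each line is valid JSON.
   wc -l /path/to/train_dpo_format.jsonl /path/to/valid_dpo_format.jsonl
   python -c "import json, sys; [json.loads(l) for l in open(sys.argv[1])]; print('ok')" /path/to/train_dpo_format.jsonl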
For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` corresponds to the beta parameter in the DPO paper.
.. tab-set::
@@ -111,7 +127,7 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
.. code-block:: bash
export GPFS="/path/to/nemo-aligner-repo"
export GPFS="/opt/NeMo-Aligner"
export TRAIN_DATA_PATH="/path/to/train_dpo_format.jsonl"
export VALID_DATA_PATH="/path/to/valid_dpo_format.jsonl"
@@ -147,7 +163,7 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
#SBATCH --exclusive
#SBATCH --overcommit
GPFS="/path/to/nemo-aligner-repo"
export GPFS="/opt/NeMo-Aligner"
PRETRAINED_CHECKPOINT_NEMO_FILE="/path/to/megatron_gpt_sft.nemo"
TRAIN_DATA_PATH="/path/to/train_comparisons.jsonl"
@@ -187,7 +203,6 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
EOF
srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file <https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/conf/gpt_dpo.yaml>`__.
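
As a minimal sketch, switching to LoRA-based DPO amounts to appending the override below to the training command from the tab-set above; any ``model.peft.lora_tuning`` values should be taken from the linked config file.

.. code-block:: bash

   # Sketch: the override that switches full-parameter DPO to LoRA-based DPO.
   # Additional knobs live under model.peft.lora_tuning in gpt_dpo.yaml.
   LORA_OVERRIDES=( model.peft.peft_scheme=lora )
   # Append "${LORA_OVERRIDES[@]}" to the python command shown in the tab-set above.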
@@ -204,4 +219,4 @@ However, the following list is a brief overview of which hyperparameters we have
* global_batch_size: Generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
* epochs: Highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
* learning rate: We tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5e-7.
* ref_policy_kl_penalty: We generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
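
Taken together, a reasonable starting configuration is sketched below. The override paths are assumptions based on the ``gpt_dpo.yaml`` config linked above and should be verified against it before use.

.. code-block:: bash

   # Sketch: a starting set of Hydra overrides reflecting the guidance above.
   # Append these to the DPO training command from the earlier tab-set.
   DPO_OVERRIDES=(
      model.global_batch_size=256           # 256-512 was the sweet spot
      model.optim.lr=9e-7                   # constant rate after a 10-step warmup
      trainer.dpo.max_epochs=1              # start with 1 epoch and add from there
      model.dpo.ref_policy_kl_penalty=0.1   # lower values generally performed better
   )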