
🦙 llama2 optimization #641

Merged
merged 10 commits into from
Oct 28, 2023

Conversation

trajepl
Contributor

@trajepl trajepl commented Oct 12, 2023

Describe your changes

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Format your code by running `pre-commit run --all-files`
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@trajepl trajepl changed the title llamav2 converter case 🦙 [llama2] converter + transformer optimization with GQA Oct 12, 2023
examples/llama2/user_script.py Fixed
@trajepl trajepl marked this pull request as ready for review October 24, 2023 03:36
@trajepl trajepl changed the title 🦙 [llama2] converter + transformer optimization with GQA 🦙 llama2 optimization Oct 24, 2023
examples/llama2/llama2.py Fixed
examples/llama2/README.md Resolved
examples/llama2/llama2_cpu.json Outdated Resolved
if config["use_gqa"]:
# Replace MultiHeadAttention with GroupQueryAttention and remove attention mask nodes
num_kv_heads = model.model_attributes.get("num_key_value_heads", None)
if num_kv_heads is None:
Contributor

@jambayk jambayk Oct 27, 2023

Not a blocker: should we make it possible to provide the value as a pass config parameter too?
There is an asymmetry here: num_heads and hidden_size can each come from either a pass config parameter or a model attribute.
Also, we don't usually provide model_attributes manually for non-automatic workflows; most examples put the model-specific values in the pass config.

Contributor Author

Yes, we can. But GQA currently only applies to llama-70B, since num_heads != num_kv_heads there.
We can extend the hf_mapping to provide a similar value for num_kv_heads.

@@ -202,3 +226,36 @@ def _run_for_config(

# save the model to the output path and return the model
return model_proto_to_olive_model(optimizer.model, output_model_path, config)

@staticmethod
def _replace_mha_with_gqa(model: "OnnxModel", past_seq_len: str = "past_sequence_length", kv_num_heads: int = 0):
Contributor

@jambayk jambayk Oct 27, 2023

Is there a reason we don't use onnxruntime.transformers.convert_generation.replace_mha_with_gqa directly? Do we want this option to still be usable for older versions of onnxruntime on the host?

Contributor Author

Yes — older versions of ORT do not contain the corresponding ops, so that is not the concern.

The actual reason: the scripts under convert_generation read like temporary solutions, and since this part of the logic is quite simple, I copied it here directly.
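As the discussion notes, the copied logic is a simple graph rewrite. A minimal sketch of the pattern, using a hypothetical dict-based node representation rather than onnxruntime's actual OnnxModel API:

```python
# Illustrative sketch of the MHA -> GQA rewrite pattern discussed above.
# The node representation (list of dicts) is a stand-in for an ONNX graph,
# not onnxruntime's real classes.

def replace_mha_with_gqa(nodes, kv_num_heads):
    """Rewrite every MultiHeadAttention node into GroupQueryAttention."""
    new_nodes = []
    for node in nodes:
        if node["op_type"] == "MultiHeadAttention":
            gqa = dict(node)
            gqa["op_type"] = "GroupQueryAttention"
            # GQA needs the number of key/value heads as an attribute;
            # for llama-70B this differs from num_heads.
            gqa["attributes"] = {**node.get("attributes", {}),
                                 "kv_num_heads": kv_num_heads}
            new_nodes.append(gqa)
        else:
            new_nodes.append(node)
    return new_nodes
```

The real helper in the pass also rewires past/present key-value inputs and removes attention-mask nodes; this sketch only shows the op-type substitution at the core of the rewrite.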

@trajepl trajepl merged commit 649e314 into main Oct 28, 2023
31 checks passed
@trajepl trajepl deleted the jiapli/llama_converter branch October 28, 2023 05:30
jambayk added a commit that referenced this pull request Oct 30, 2023
commit 093cfaf
Merge: d074e17 649e314
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Mon Oct 30 18:20:05 2023 +0000

    Merge branch 'main' into jambayk/lora

commit d074e17
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Mon Oct 30 18:11:28 2023 +0000

    merge main

commit 649e314
Author: trajep <jiapli@microsoft.com>
Date:   Sat Oct 28 13:30:40 2023 +0800

    🦙 llama2 optimization (#641)

    ## Describe your changes

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 819c25a
Author: Mike Guo <myguo@microsoft.com>
Date:   Thu Oct 26 13:43:31 2023 +0800

    raise ValueError if the model_type is None (#666)

    ## Describe your changes

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit b897b8c
Author: Mike Guo <myguo@microsoft.com>
Date:   Thu Oct 26 11:41:40 2023 +0800

    raise the known failure exceptions when do perf-tuning (#664)

    ## Describe your changes
    In perf-tuning, if the exception is a known ImportModuleError, we do not
    need to retry the next step, and we should raise the exception as the
    engine does.

    ## Checklist before requesting a review
    - [x] Add unit tests for this change.
    - [x] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit a54dda8
Author: trajep <jiapli@microsoft.com>
Date:   Mon Oct 23 12:10:54 2023 +0800

    🆓 Release GPU memory for torch model evaluation (#662)

    ## Describe your changes
    In the llamav2 tests, the torch model occupied GPU memory, which led to
    errors when evaluating the ONNX model.

    Release GPU memory after torch model evaluation:
    model.to("cpu") alone could not release GPU memory in time. Call
    torch.cuda.empty_cache() to clean up.
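The cleanup described above can be sketched as follows (a minimal sketch assuming PyTorch; the helper name is illustrative):

```python
import gc

import torch

def release_torch_model(model):
    """Move a model off the GPU and free cached memory.

    model.to("cpu") alone may not return memory to the driver in time,
    so also drop the reference and empty the CUDA allocator cache.
    """
    model.to("cpu")
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```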

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit ac6d0f7
Author: Mike Guo <myguo@microsoft.com>
Date:   Mon Oct 23 09:04:29 2023 +0800

    fix some pylint issues (#661)

    ## Describe your changes
    fix some pylint issues
    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 01296bc
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Fri Oct 20 20:30:37 2023 -0700

    Optional `evaluate_input_model` in no-search mode (#663)

    ## Describe your changes
    In `no-search` mode, the `evaluate_input_model=True` case is only used
    if an evaluator is provided. Otherwise, it requires the user to
    explicitly set `evaluate_input_model=False`.
    Now the input model evaluation behavior is the same as for the output
    model: evaluate if evaluator provided.
     
    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 91f1c45
Author: Mike Guo <myguo@microsoft.com>
Date:   Fri Oct 20 14:48:50 2023 +0800

    remove pre-commit CI pipeline (#660)

    ## Describe your changes
    remove pre-commit CI pipeline
    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit ece52f7
Author: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Date:   Thu Oct 19 23:01:01 2023 -0700

    Update contributing doc (#642)

    ## Describe your changes

    Update contributing doc

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 4df44a5
Author: Mike Guo <myguo@microsoft.com>
Date:   Fri Oct 20 13:43:00 2023 +0800

    add editorconfig rules & enable pylint (#659)

    ## Describe your changes

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [x] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 2b5aef1
Author: trajep <jiapli@microsoft.com>
Date:   Thu Oct 19 10:48:56 2023 +0800

    🚧 Consistent dataloader for benchmark (#657)

    ## Describe your changes
    Consistent dataloader for benchmark
    Transforms.RandomCrop/Clip generate different tensors every time the
    dataloader creation function is called, which leads to inconsistent model
    accuracy measurements.

    This PR is a workaround to ensure the input data is consistent between the
    torch model and the onnx model.
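The workaround amounts to re-seeding the random source before each dataloader creation so the random transforms produce identical tensors on every call. A stdlib sketch of the idea (`random.seed` stands in for the torch/transform seeding; names are illustrative):

```python
import random

def create_inputs(seed=0, n=4):
    """Re-seed before generating so every call yields identical 'random' crops."""
    random.seed(seed)  # stands in for torch.manual_seed / transform seeding
    return [random.random() for _ in range(n)]

# The torch model and the ONNX model now see the same inputs:
torch_inputs = create_inputs()
onnx_inputs = create_inputs()
assert torch_inputs == onnx_inputs
```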

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit ab70b6c
Author: Gaurav Garg <52341457+gaugarg-nv@users.noreply.github.com>
Date:   Thu Oct 19 06:34:33 2023 +0530

    Fix bugs in how free dimension session options are set: (#658)

    ## Describe your changes

    - The “unet_text_embeds_size” free dimension should always be overridden
    with a value of 1280, irrespective of image resolution; this value is the
    same as the text_encoder_2 hidden dimension for both the SD-XL base and
    refiner models. Currently, using a resolution other than 1024 results in
    an error.
    - With the optimum pipeline, the right keyword for passing session options
    is “session_options”, not “sess_options”. This change improves perf
    by 40% for 512x512 resolution and about 10% for 1024x1024 resolution.

commit 42524f5
Author: trajep <jiapli@microsoft.com>
Date:   Wed Oct 18 18:34:36 2023 +0800

    👁️‍🗨️ Disable qualcomm linkcheck (#655)

    ## Describe your changes

    The docs build failed with a broken link from Qualcomm. This PR disables
    the link check for Qualcomm.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 133d5d5
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Tue Oct 17 23:05:54 2023 -0700

    Text-gen: Optional attn mask. Loading args for hf model config. QLoRA: Update loading args  (#654)

    ## Describe your changes
    Some enhancements to the QLoRA pass and related components.

    Example use case scenario is for fine-tuning
    [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5). In order
    to load the model and config for this model non-interactively, we have
    to pass `trust_remote_code=True` to `model_loading_args`. However, the
    QLoRA pass and `get_hf_model_config` both ignore it.
    There are also cases where we have to provide `token` for models like
    [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b).
    Both components have been updated to use the model_loading_args
    correctly.

    `microsoft/phi-1_5` gives a warning about `attention_mask` during
    training. Text-gen preprocessing is updated to make `attention_mask`
    optional (opt out if needed).

    Bug Fix: Check `pad_token_id` for `None` since `not pad_token_id` is
    `True` for `pad_token_id == 0`.
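The bug fix above is the standard truthiness pitfall: a token id of `0` is falsy but valid. A minimal illustration (the helper name and eos fallback are hypothetical, not the actual code):

```python
def resolve_pad_token_id(pad_token_id, eos_token_id):
    """Fall back to the eos token only when pad_token_id is truly missing."""
    # Buggy version: `if not pad_token_id` treats a valid id of 0 as missing.
    # Correct version: compare against None explicitly.
    if pad_token_id is None:
        return eos_token_id
    return pad_token_id
```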

    Renamed `dataset_name` to `data_name` in qlora example to use consistent
    naming.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 169ffda
Author: trajep <jiapli@microsoft.com>
Date:   Wed Oct 18 13:31:41 2023 +0800

    🍮 Detect Data Component Reserved Arguments with Position info (#636)

    ## Describe your changes

    1. Fix arguments for customized data components. The first argument of
    pre/post process and dataloader functions must start with `_`.
    2. Update data configs documents.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 0bc3d5d
Author: Xiao Sheng <shengxiao0319@126.com>
Date:   Tue Oct 17 18:34:23 2023 +0800

    Update Vitis AI quantization to support ORT 1.16, support TensorData and QuantizationParams (#650)

    ## Describe your changes
    Update Vitis AI quantization to support ORT 1.16, support TensorData and
    QuantizationParams

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link
    #629

commit efd83e0
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Mon Oct 16 02:10:52 2023 -0700

    Fix `quant_preprocess` import in VitisAI Quantization pass  (#648)

    ## Describe your changes
    Follow up to #639 that missed
    this pass.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit e7eae08
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Sun Oct 15 22:16:08 2023 -0700

    ResNet: Update seed_everything and accuracy goal (#646)

    ## Describe your changes
    Updated the seed steps to make the model consistent across runs.

    Update the accuracy goal in `resnet_ptq_cpu_aml_dataset.json` to
    increase the tolerance. This was not increased last time along with
    `resnet_ptq_cpu_aml_dataset.json` and was causing the test to be
    flaky. Reduced the value for `resnet_ptq_cpu_aml_dataset.json` from
    `0.1` to `0.05` since the baseline is better now.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit b06a89a
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Sat Oct 14 01:32:12 2023 -0700

    Train original resnet for min 1 epoch (#644)

    ## Describe your changes
    The default value for `num_epochs` in `prepare_model_data.py` is `0`, so
    we were using a random model. The accuracy numbers and goals are
    meaningless for this. It also made the resnet test flaky, since the
    original accuracy is ~10% and doesn't have much freedom for movement.

    Train it for at least 1 epoch so that we are testing a good model.

     
    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 4a00cad
Author: Jambay Kinley <jambaykinley@microsoft.com>
Date:   Fri Oct 13 20:22:22 2023 -0700

    Clean up `no_search` mode to not use `search_strategy` (#643)

    ## Describe your changes
    We currently use a hacky method to run `no_search` using
    `search_strategy`. This makes the logic confusing and unnecessarily
    complicated.

    Clean up the logic by just iterating over the pass flows directly.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

commit 989964d
Author: lainey1570 <91394589+lainey1570@users.noreply.github.com>
Date:   Fri Oct 13 19:34:35 2023 -0400

    Implemented Performance Monitoring in Azure DevOps with Olive (#439)

    ## Describe your changes
    Contains Python scripts and Azure DevOps YAML files for performance
    monitoring of Olive models. The scripts and YAML files help you compare
    the performance of different models and ensure no regression occurs over
    time.
    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [x] Make sure all tests can pass.
    - [x] Update documents if necessary.
    - [x] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link

    ---------

    Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
    Co-authored-by: Xiaoyu <xiaoyuzhang@microsoft.com>

commit bb66c6c
Author: trajep <jiapli@microsoft.com>
Date:   Thu Oct 12 15:27:35 2023 +0800

    🎿 Support ORT 1.16.1 (#639)

    ## Describe your changes

    1. Skip vitis tests for ORT 1.16.1; wait for the VitisAI team to fix this,
    then add the tests back.
    2. Remove quant pre-process, as the bug is fixed in 1.16.1.
    3. Lower the metrics goal to allow at least one model in the output.

    ## Checklist before requesting a review
    - [ ] Add unit tests for this change.
    - [ ] Make sure all tests can pass.
    - [ ] Update documents if necessary.
    - [ ] Format your code by running `pre-commit run --all-files`
    - [ ] Is this a user-facing change? If yes, give a description of this
    change to be included in the release notes.

    ## (Optional) Issue link
jambayk added a commit that referenced this pull request Nov 2, 2023
## Describe your changes
`MatMul4BitsQuantizer` uses logging.basicConfig to set the level to INFO and
prints very verbose logs at INFO level. Suppress these logs by manually
setting those loggers' level to `ERROR`.
Update ort version requirement to `>=1.16.2` since the quantizer will be
added to 1.16.2.
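The suppression amounts to raising the relevant loggers' level explicitly, since `logging.basicConfig` only configures the root logger. A sketch (the logger name below is illustrative, not the quantizer's actual logger name):

```python
import logging

# The quantizer calls logging.basicConfig(level=logging.INFO) internally,
# so raise its logger to ERROR explicitly to silence the verbose output.
logger = logging.getLogger("matmul_4bits_quantizer")  # illustrative name
logger.setLevel(logging.ERROR)

# INFO records are now dropped for this logger, while errors still pass.
assert not logger.isEnabledFor(logging.INFO)
assert logger.isEnabledFor(logging.ERROR)
```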

Remove the save to tmp_dir -> load -> save steps since it is not needed.
We only need to sort the model topologically before saving the model to
file. Refer to this discussion for more context
#641 (comment).

`llama.py`: Correct description for `--only_config` option. Create a new
user_script file for the workflow instead of updating the original one.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Format your code by running `pre-commit run --all-files`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link
5 participants