Merge branch 'releases/2024/3' into updating-interactive-tutorials-070824-for-24.3
kblaszczak-intel authored Aug 12, 2024
2 parents eca0c99 + 99450a7 commit bd20779
Showing 6 changed files with 59 additions and 14 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -115,6 +115,7 @@ You can ask questions and get support on:
* OpenVINO channels on the [Intel DevHub Discord server](https://discord.gg/7pVRxUwdWG).
* The [`openvino`](https://stackoverflow.com/questions/tagged/openvino) tag on Stack Overflow\*.


## Additional Resources

* [Product Page](https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html)
@@ -123,6 +124,18 @@ You can ask questions and get support on:
* [OpenVINO™ toolkit on Medium](https://medium.com/@openvino)


## Telemetry

OpenVINO™ collects software performance and usage data to improve OpenVINO™ tools.
This data is collected directly by OpenVINO™ or through the use of Google Analytics 4.
You can opt out at any time by running the command:

``` bash
opt_in_out --opt_out
```

More information is available at [OpenVINO™ Telemetry](https://docs.openvino.ai/2024/about-openvino/additional-resources/telemetry.html).

## License

OpenVINO™ Toolkit is licensed under [Apache License Version 2.0](LICENSE).
@@ -1,13 +1,13 @@
Release Policy
=============================================================================

- OpenVINO offers releases of three different types, each targeting a different use case:
+ OpenVINO offers releases of four different types, each targeting a different use case:

* `Regular releases <#regular-releases>`__
* `Long-Term Support <#long-term-support-releases>`__
* `Pre-release releases <#pre-release-releases>`__
* `Nightly <#nightly-releases>`__


Regular releases
####################

@@ -60,6 +60,16 @@ The following elements are not guaranteed to receive updates:
* OpenVINO tools, such as NNCF and OVMS.
* Code samples used in component testing.

Pre-release releases
######################

An OpenVINO pre-release is an early version of a regular release that has not undergone full
release validation or qualification. Pre-release versions are more stable than nightly releases,
but no support is offered for pre-release software. The scope, functionality, and APIs/behavior
are subject to change. A pre-release **should NOT** be incorporated into any production
software/solution and should be used only for:

* Performing early testing and integration.
* Getting early feedback from the community.

Nightly releases
###########################
@@ -184,7 +184,7 @@ mark a conversation session, as you can see in these simple examples:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path)
- pipe.set_generation_cofnig({'max_new_tokens': 100)
+ pipe.set_generation_config({'max_new_tokens': 100})
pipe.start_chat()
while True:
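A minimal sketch of how such a chat session can be completed end to end (assuming the
``openvino_genai`` Python API with ``start_chat``, ``generate``, and ``finish_chat``, and a local
``model_path``; the loop body here is illustrative, not the exact code hidden in the diff above):

.. code-block:: python

   import openvino_genai as ov_genai

   pipe = ov_genai.LLMPipeline(model_path)             # model_path: directory with the converted model
   pipe.set_generation_config({'max_new_tokens': 100})

   pipe.start_chat()                                   # start_chat/finish_chat mark the conversation session
   while True:
       prompt = input("question:\n")
       if prompt == "Stop!":
           break
       print(pipe.generate(prompt))                    # the pipeline keeps the chat history between calls
   pipe.finish_chat()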
@@ -209,7 +209,7 @@ mark a conversation session, as you can see in these simple examples:
ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 100;
- pipe.set_generation_cofnig(config)
+ pipe.set_generation_config(config);
pipe.start_chat();
for (size_t i = 0; i < questions.size(); i++) {
@@ -223,6 +223,8 @@ as in OpenVINO native API:
model.to("GPU")
.. _enabling-runtime-optimizations:

Enabling OpenVINO Runtime Optimizations
############################################################

@@ -237,17 +239,28 @@ includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls an
insignificant deviation in generation accuracy. Quantization is performed in a group-wise
manner, with configurable group size. It means that values in a group share quantization
parameters. Larger group sizes lead to faster inference but lower accuracy. Recommended
- group size values are ``32``, ``64``, or ``128``. To enable Dynamic quantization, use
- the corresponding inference property as follows:
+ group size values are ``0``, ``32``, ``64``, or ``128``. Dynamic quantization is enabled **by
+ default** on the CPU device. To disable it, use one of the following options (a sketch follows the list):

* **(Primary Option)** Set ``DYNAMIC_QUANTIZATION_GROUP_SIZE`` to ``0``.
* Switch the execution mode from ``PERFORMANCE`` to ``ACCURACY``. Note that this option
  affects inference precision as well. You can learn more about both ``PERFORMANCE``
  and ``ACCURACY`` modes in the :ref:`Precision Control Guide <execution-mode>`.
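A short sketch of both options through ``optimum-intel`` (the ``EXECUTION_MODE_HINT`` key and
``model_path`` are assumptions for illustration; adjust them to your setup):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Option 1 (primary): disable dynamic quantization by setting the group size to 0
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "0"},
   )

   # Option 2: request ACCURACY mode (this also changes inference precision)
   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"EXECUTION_MODE_HINT": "ACCURACY"},
   )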

To change the group size value (e.g., to ``64``), run the following code:

.. code-block:: python
model = OVModelForCausalLM.from_pretrained(
model_path,
- ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
+ ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "64"}
)
.. note::

As of release 2024.3, dynamic quantization is not enabled for BF16 inference.


* **KV-cache quantization** allows lowering the precision of Key and Value cache in LLMs.
This helps reduce memory consumption during inference, improving latency and throughput.
KV-cache can be quantized into the following precisions: ``u8``, ``bf16``, ``f16``.
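For example, a hedged sketch of requesting a ``u8`` KV-cache through ``optimum-intel``
(the ``KV_CACHE_PRECISION`` key and ``model_path`` are assumptions for illustration):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   model = OVModelForCausalLM.from_pretrained(
       model_path,
       ov_config={"KV_CACHE_PRECISION": "u8"},   # store the Key/Value cache in 8-bit precision
   )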
@@ -1,5 +1,3 @@
- .. {#openvino_docs_OV_UG_Precision_Control}
Precision Control
=================

@@ -26,6 +24,7 @@ Advanced Matrix Extensions (AMX) on CPU do not support ``f32``). Also, I/O opera
requires less memory due to the smaller tensor byte size. This guide will focus on how
to control inference precision.

.. _execution-mode:

Execution Mode
##############
@@ -36,8 +35,9 @@ may lower the accuracy for performance reasons (**PERFORMANCE mode**)

* In **ACCURACY mode**, the device cannot convert floating point tensors to a smaller
floating point type, so devices try to keep the accuracy metrics as close as possible to
the original values obtained after training relative to the device's real capabilities.
This means that most devices will infer with ``f32`` precision if your device supports it.
In this mode, :ref:`Dynamic Quantization <enabling-runtime-optimizations>` is disabled.
* In **PERFORMANCE mode**, the device can convert to smaller data types and apply other
optimizations that may have some impact on accuracy rates, although we still try to
minimize accuracy loss and may use mixed precision execution in some cases.
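As a reference point, a minimal sketch of selecting the mode with the OpenVINO Python API
(``model.xml`` and the ``CPU`` device below are placeholders):

.. code-block:: python

   import openvino as ov
   import openvino.properties.hint as hints

   core = ov.Core()
   # Request ACCURACY mode; PERFORMANCE is the default on most devices.
   compiled_model = core.compile_model(
       "model.xml", "CPU",
       {hints.execution_mode: hints.ExecutionMode.ACCURACY},
   )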
@@ -80,14 +80,23 @@ to specify the exact precision the user wants, but is less portable. For example
CPU supports ``f32`` inference precision and ``bf16`` on some platforms, GPU supports ``f32``
and ``f16``, so if a user wants an application that uses multiple devices, they have
to handle all these combinations manually or let OV do it automatically by using the higher
- level ``execution_mode`` property. Another thing is that ``inference_precision`` is also
- a hint, so the value provided is not guaranteed to be used by Runtime (mainly in cases
- where the current device does not have the required hardware capabilities).
+ level ``execution_mode`` property.

.. note::

When using ``execution_mode``, be aware that **ACCURACY mode** enables ``f32`` inference
precision but also disables :ref:`dynamic quantization <enabling-runtime-optimizations>`.
This may significantly affect inference performance, especially on Intel® Xeon® platforms
and Intel® GPU devices.

Another thing is that ``inference_precision`` is also a hint, so the value provided is not guaranteed
to be used by Runtime (mainly in cases where the current device does not have the required hardware
capabilities).
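A hedged sketch of setting this hint directly (the device and precision below are examples only):

.. code-block:: python

   import openvino as ov
   import openvino.properties.hint as hints

   core = ov.Core()
   # Ask the GPU plugin for f16 inference; as a hint, the value may be ignored
   # if the device cannot execute the model in that precision.
   compiled_model = core.compile_model(
       "model.xml", "GPU",
       {hints.inference_precision: ov.Type.f16},
   )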

.. note::

All devices only support floating-point data types (``f32``, ``f16``, ``bf16``) as a value
- for ``inference_precision`` attribute, because quantization cannot be done in Runtime.
+ for the ``inference_precision`` attribute.


.. _limited_inference_precision:
Binary file not shown.
