[GPU] Update docs related to KV-cache quantization (#27821)

sshlyapn · web-flow · commit 76707155c3c6 · 2024-11-29T15:49:04.000Z
### Details:
- Update docs related to KV-cache quantization on GPU
- Allow `element::u8` as the data type for KV-cache quantization, to align with the CPU plugin
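
As a rough illustration of the user-facing effect, the sketch below builds an `ov_config` dictionary like the one in the updated docs. The helper names `gpu_kv_cache_config` and `compression_applies` are hypothetical and not part of OpenVINO. The accepted values are taken only from this change (`i8` and `u8` pass the matcher check, `undefined` disables quantization), so the plugin may accept other precisions that are not modeled here.

```python
# Hypothetical helpers for illustration only; they are not OpenVINO APIs.
# Types the GPU KV-cache compression matcher accepts after this change.
COMPRESSED_TYPES = {"i8", "u8"}


def compression_applies(precision: str) -> bool:
    """Mirror the matcher guard: compress only for i8 or u8."""
    return precision in COMPRESSED_TYPES


def gpu_kv_cache_config(precision: str = "u8") -> dict:
    """Build an ov_config dict with the given KV-cache precision.

    "undefined" disables KV-cache quantization, per the updated docs.
    Rejecting every other value is an assumption of this sketch.
    """
    if precision != "undefined" and not compression_applies(precision):
        raise ValueError(f"unsupported KV_CACHE_PRECISION in this sketch: {precision}")
    # DYNAMIC_QUANTIZATION_GROUP_SIZE is omitted on purpose: per the docs,
    # GPU ignores it for KV-cache quantization and uses group_size = head_size.
    return {"KV_CACHE_PRECISION": precision, "PERFORMANCE_HINT": "LATENCY"}


if __name__ == "__main__":
    print(gpu_kv_cache_config("u8"))
    print(gpu_kv_cache_config("undefined"))
```

The resulting dictionary would be passed as the `ov_config=` argument, as in the documentation snippet touched by the diff below.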
diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst
@@ -276,9 +276,10 @@ includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls an
          ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
      )
 
-.. note::
+  .. note::
+     Currently, for KV-cache quantization, GPU ignores the DYNAMIC_QUANTIZATION_GROUP_SIZE property, using ``group_size = head_size``. Additionally, it does not support the ``get_state()`` and ``set_state()`` APIs when KV-cache quantization is enabled.
 
-   Currently, both Dynamic quantization and KV-cache quantization are available for CPU device.
+     For GPU, KV-cache quantization is enabled by default on platforms without XMX support, and can be disabled by setting KV_CACHE_PRECISION to ``undefined``.
 
 
 Working with Models Tuned with LoRA
diff --git a/src/plugins/intel_gpu/src/plugin/transformations/kv_cache_compression.cpp b/src/plugins/intel_gpu/src/plugin/transformations/kv_cache_compression.cpp
@@ -133,7 +133,7 @@ class KVCacheCompressionMatcher : public ov::pass::MatcherPass {
 KVCacheCompressionMatcher::KVCacheCompressionMatcher(ov::element::Type compression_dt) {
     using namespace ov::pass::pattern;
 
-    if (compression_dt != element::i8)
+    if (compression_dt != element::i8 && compression_dt != element::u8)
         return;
 
     const auto quantization_type = ov::op::internal::DynamicQuantize::QuantizationType::Asymmetric;
