
Commit 7670715

[GPU] Update docs related to KV-cache quantization (#27821)
### Details:
- Update docs related to KV-cache quantization on GPU
- Allow `element::u8` as a data type for KV-cache quantization, to align with the CPU plugin
1 parent 71d9463 commit 7670715

2 files changed: +4 -3 lines changed

docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst

+3-2
@@ -276,9 +276,10 @@ includes **Dynamic quantization** of activations of 4/8-bit quantized MatMuls an
     ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"}
 )
 
-.. note::
+.. note::
+   Currently, for KV-cache quantization, GPU ignores the DYNAMIC_QUANTIZATION_GROUP_SIZE property, using ``group_size = head_size``. Additionally, it does not support the ``get_state()`` and ``set_state()`` APIs when KV-cache quantization is enabled.
 
-   Currently, both Dynamic quantization and KV-cache quantization are available for CPU device.
+   For GPU, KV-cache quantization is enabled by default on platforms without XMX support, and can be disabled by setting KV_CACHE_PRECISION to ``undefined``.
 
 
 Working with Models Tuned with LoRA
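
To illustrate the note added in this hunk, here is a minimal Python sketch that assembles the `ov_config` dictionary for two cases: explicit `u8` KV-cache quantization with a group size, and KV-cache quantization switched off. The helper `make_ov_config` is hypothetical and not part of OpenVINO. The property names and the `undefined` value are taken from the documentation text in the diff; whether a particular OpenVINO release accepts them is not verified here.

```python
def make_ov_config(kv_cache_precision=None, group_size=None, hint="LATENCY"):
    """Build an ov_config dict (hypothetical helper, not an OpenVINO API)."""
    config = {"PERFORMANCE_HINT": hint}
    if kv_cache_precision is not None:
        # Property name as it appears in the documented example.
        config["KV_CACHE_PRECISION"] = kv_cache_precision
    if group_size is not None:
        # Per the note in the diff, GPU ignores this value for KV-cache
        # quantization and uses group_size = head_size instead.
        config["DYNAMIC_QUANTIZATION_GROUP_SIZE"] = str(group_size)
    return config


# Same settings as the documented example.
explicit_u8 = make_ov_config("u8", 32)

# Per the updated note: turn off the GPU default on platforms without XMX.
kv_quant_disabled = make_ov_config("undefined")
```

Either dictionary would then be passed as the `ov_config` argument shown in the context lines of the hunk.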

src/plugins/intel_gpu/src/plugin/transformations/kv_cache_compression.cpp

+1-1
@@ -133,7 +133,7 @@ class KVCacheCompressionMatcher : public ov::pass::MatcherPass {
 KVCacheCompressionMatcher::KVCacheCompressionMatcher(ov::element::Type compression_dt) {
     using namespace ov::pass::pattern;
 
-    if (compression_dt != element::i8)
+    if (compression_dt != element::i8 && compression_dt != element::u8)
         return;
 
     const auto quantization_type = ov::op::internal::DynamicQuantize::QuantizationType::Asymmetric;
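
For readers unfamiliar with the terms in this hunk, the following pure-Python sketch shows what asymmetric 8-bit quantization with one group per head vector (`group_size = head_size`) can look like. It illustrates the general technique only and is not the GPU plugin's implementation. All function names, and the choice to store the group minimum as the offset, are assumptions made for the example.

```python
def quantize_group_u8(values):
    """Asymmetric quantization of one group to unsigned 8-bit codes."""
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 when every value in the group is identical.
    scale = (hi - lo) / 255.0 or 1.0
    codes = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group_u8(codes, scale, lo):
    """Map codes back to floats using the stored scale and offset."""
    return [c * scale + lo for c in codes]


def quantize_kv_rows(rows):
    """Quantize each head vector as one group (group_size = head_size)."""
    return [quantize_group_u8(row) for row in rows]


head_size = 8
rows = [
    [0.1 * i - 0.3 for i in range(head_size)],  # values spanning a range
    [2.0] * head_size,                          # constant vector
]
packed = quantize_kv_rows(rows)
restored = [dequantize_group_u8(*group) for group in packed]
```

Because each group carries its own scale and offset, the reconstruction error of any element is bounded by half of that group's scale, which is why the grouping granularity matters for accuracy.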
