graph: fix the intermediate data types in SDPA patterns #2894

Open · wants to merge 9 commits into base: main
6 changes: 3 additions & 3 deletions doc/graph/fusion_patterns/sdpa.md
@@ -128,9 +128,9 @@ platforms follow the general description in @ref dev_guide_data_types.
4. GPU
- Optimized implementation is available for 4D Q/K/V tensors with shape
defined as (N, H, S, D).
- Optimized implementation is available for floating-point SDPA with `f16`
data type and `D <= 256` on Intel Graphics Products with Intel(R) Xe Matrix
Extensions (Intel(R) XMX) support.
- Optimized implementation is available for `f16` or `bf16` SDPA with `f32`
intermediate data type and `D <= 256` on Intel Graphics Products with
Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.

Contributor: What does "intermediate data type" mean when users construct a graph? Do we need to describe it in text or in a picture?
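One way to read "intermediate data type" here: it is the data type users assign to the logical tensors that connect the ops inside the pattern (the Q·K output, the SoftMax output, and so on), while the graph inputs and outputs stay in `f16`/`bf16`. Below is a minimal, hypothetical C++ sketch of that construction with the Graph API; the shapes, tensor/op IDs, and variable names are made up for illustration and it is not part of this PR.

```cpp
// Hypothetical sketch: f16 SDPA inputs with f32 intermediate logical tensors.
// Shapes, tensor/op IDs, and names are illustrative only.
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    const logical_tensor::dims qk_shape {1, 16, 384, 64}; // (N, H, S, D)
    const logical_tensor::dims score_shape {1, 16, 384, 384};

    // Graph inputs stay in f16.
    logical_tensor query {0, logical_tensor::data_type::f16, qk_shape,
            logical_tensor::layout_type::strided};
    logical_tensor key {1, logical_tensor::data_type::f16, qk_shape,
            logical_tensor::layout_type::strided};

    // Intermediate results carry the f32 data type.
    logical_tensor score {2, logical_tensor::data_type::f32, score_shape,
            logical_tensor::layout_type::strided};
    logical_tensor prob {3, logical_tensor::data_type::f32, score_shape,
            logical_tensor::layout_type::strided};

    op qk_matmul {0, op::kind::MatMul, {query, key}, {score}, "qk_matmul"};
    qk_matmul.set_attr<bool>(op::attr::transpose_b, true);

    op softmax {1, op::kind::SoftMax, {score}, {prob}, "softmax"};
    softmax.set_attr<int64_t>(op::attr::axis, -1);

    graph g {dnnl::engine::kind::gpu};
    g.add_op(qk_matmul);
    g.add_op(softmax);
    g.finalize();
    auto partitions = g.get_partitions(); // ask the library for fusible partitions
    (void)partitions;
    return 0;
}
```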

## Example

12 changes: 7 additions & 5 deletions doc/graph/operations/Add.md
@@ -44,8 +44,10 @@ different and auto-broadcasting is allowed if `auto_broadcast` attributes is

Add operation supports the following data type combinations.

| Src_0 / Src_1 | Dst |
|:--------------|:-----|
| f32 | f32 |
| bf16 | bf16 |
| f16 | f16 |
| Src_0 | Src_1 | Dst |
|:----------|:----------|:-----|
| f32 | f32 | f32 |
| bf16 | bf16 | bf16 |
| f16 | f16 | f16 |
| f32 | bf16, f16 | f32 |
| bf16, f16 | f32 | f32 |
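The same widened combinations appear for Divide, Multiply, and Subtract below; one illustration covers them all. As a minimal sketch (not taken from this diff; IDs and shapes are invented), an Add with one `f32` and one `bf16` input producing an `f32` output could be declared like this:

```cpp
// Hypothetical sketch: Add with mixed f32 / bf16 inputs and an f32 output.
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    const logical_tensor::dims shape {1, 16, 384, 384};

    logical_tensor src0 {0, logical_tensor::data_type::f32, shape,
            logical_tensor::layout_type::strided};
    logical_tensor src1 {1, logical_tensor::data_type::bf16, shape,
            logical_tensor::layout_type::strided};
    logical_tensor dst {2, logical_tensor::data_type::f32, shape,
            logical_tensor::layout_type::strided};

    // Per the updated table, f32 + bf16 -> f32 is an allowed combination.
    op add {0, op::kind::Add, {src0, src1}, {dst}, "add"};

    graph g {dnnl::engine::kind::cpu};
    g.add_op(add);
    g.finalize();
    return 0;
}
```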
12 changes: 7 additions & 5 deletions doc/graph/operations/Divide.md
@@ -44,8 +44,10 @@ different and auto-broadcasting is allowed if `auto_broadcast` attributes is

Divide operation supports the following data type combinations.

| Src_0 / Src_1 | Dst |
|:--------------|:-----|
| f32 | f32 |
| bf16 | bf16 |
| f16 | f16 |
| Src_0 | Src_1 | Dst |
|:----------|:----------|:-----|
| f32 | f32 | f32 |
| bf16 | bf16 | bf16 |
| f16 | f16 | f16 |
| f32 | bf16, f16 | f32 |
| bf16, f16 | f32 | f32 |
10 changes: 5 additions & 5 deletions doc/graph/operations/MatMul.md
@@ -61,8 +61,8 @@ constructing an operation.

MatMul operation supports the following data type combinations.

| Src | Weights | Bias | Dst |
|:-----|:--------|:-----|:-----|
| f32 | f32 | f32 | f32 |
| bf16 | bf16 | bf16 | bf16 |
| f16 | f16 | f16 | f16 |
| Src | Weights | Bias | Dst |
|:-----|:--------|:-----|:----------|
| f32 | f32 | f32 | f32 |
| bf16 | bf16 | bf16 | f32, bf16 |
| f16 | f16 | f16 | f32, f16 |
12 changes: 7 additions & 5 deletions doc/graph/operations/Multiply.md
@@ -44,8 +44,10 @@ different and auto-broadcasting is allowed if `auto_broadcast` attributes is

Multiply operation supports the following data type combinations.

| Src_0 / Src_1 | Dst |
|:--------------|:-----|
| f32 | f32 |
| bf16 | bf16 |
| f16 | f16 |
| Src_0 | Src_1 | Dst |
|:----------|:----------|:-----|
| f32 | f32 | f32 |
| bf16 | bf16 | bf16 |
| f16 | f16 | f16 |
| f32 | bf16, f16 | f32 |
| bf16, f16 | f32 | f32 |
10 changes: 5 additions & 5 deletions doc/graph/operations/Softmax.md
@@ -36,8 +36,8 @@ constructing an operation.

SoftMax operation supports the following data type combinations.

| Src | Dst |
|:-----|:-----|
| f32 | f32 |
| bf16 | bf16 |
| f16 | f16 |
| Src | Dst |
|:-----|:----------------|
| f32 | f32, bf16, f16 |
| bf16 | bf16 |
| f16 | f16 |
12 changes: 7 additions & 5 deletions doc/graph/operations/Subtract.md
@@ -44,8 +44,10 @@ different and auto-broadcasting is allowed if `auto_broadcast` attributes is

Subtract operation supports the following data type combinations.

| Src_0 / Src_1 | Dst |
|:--------------|:-----|
| f32 | f32 |
| bf16 | bf16 |
| f16 | f16 |
| Src_0 | Src_1 | Dst |
|:----------|:----------|:-----|
| f32 | f32 | f32 |
| bf16 | bf16 | bf16 |
| f16 | f16 | f16 |
| f32 | bf16, f16 | f32 |
| bf16, f16 | f32 | f32 |
3 changes: 1 addition & 2 deletions doc/graph/programming_model/low_precision.md
@@ -52,7 +52,6 @@ Graph operations support bf16 and f16 data types.

A TypeCast operation performing down conversion should be inserted clearly to
indicate the use of low numeric precision. oneDNN Graph implementation fully
honors the API-specified numeric precision and only performs the computation
using the API-specified or higher numeric precision.
honors the API-specified numeric precision.
Contributor: Just to make sure we are aligned: this still allows using f32 values to store f16/bf16 data, as long as we respect rounding to f16/bf16 accuracy, right?

Contributor (author): Yes, in my understanding it's still allowed for backend implementations. From that perspective, it seems I need to keep the original statement. My intention here was to align the implementations, as the original statement sounds like different backends (e.g., DNNL & GC, CPU & GPU) can have different numerical behaviors.


@img{bf16_programming.jpg,Figure 2: Overview of bf16 programming model.,80%,}
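As a rough illustration of the TypeCast statement earlier in this hunk (assumed shapes and IDs, not part of this change), an explicit down conversion from an `f32` intermediate to `bf16` could be expressed as a standalone TypeCast op:

```cpp
// Hypothetical sketch: explicit TypeCast marking an f32 -> bf16 down conversion.
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    const logical_tensor::dims shape {1, 16, 384, 384};

    logical_tensor prob_f32 {0, logical_tensor::data_type::f32, shape,
            logical_tensor::layout_type::strided};
    logical_tensor prob_bf16 {1, logical_tensor::data_type::bf16, shape,
            logical_tensor::layout_type::strided};

    // The cast makes the reduced precision visible in the graph; the
    // implementation honors the data types given on the logical tensors.
    op cast {0, op::kind::TypeCast, {prob_f32}, {prob_bf16}, "cast_down"};

    graph g {dnnl::engine::kind::gpu};
    g.add_op(cast);
    g.finalize();
    return 0;
}
```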