Matrix docs update (shader-slang#1815)

jsmall-zzz · web-flow · commit b17701c489ce · 2021-04-26T15:17:54.000-04:00
* #include an absolute path didn't work - because paths were taken to always be relative.

* Update matrix documentation.

* Small fixes.

* Some small fixes.

* Fixes and improvements to matrix doc.

* Small fixes.

* Additional matrix doc layout clarification.
diff --git a/docs/user-guide/a1-01-matrix-layout.md b/docs/user-guide/a1-01-matrix-layout.md
@@ -3,26 +3,141 @@ layout: user-guide
 ---
 
 Handling Matrix Layout Differences on Different Platforms
-============================
+=========================================================
 
-The differences on default matrix layout or storage conventions between GLSL (OpenGL/Vulkan) and HLSL has been an issue that frequently causes confusion among developers. When writing applications that work on different targets, one important goal that developers frequently seek is to make it possible to pass the same matrix generated by host code to the same shader code, regardless of what graphics API is being used (e.g. Vulkan, OpenGL or Direct3D). As a solution to shader cross-compilation, Slang provides necessary tools for developers navigate around the differences between GLSL and HLSL targets.
+The differences in default matrix layout or storage conventions between GLSL (OpenGL/Vulkan) and HLSL have been an issue that frequently causes confusion among developers. When writing applications that work on different targets, one important goal that developers frequently seek is to make it possible to pass the same matrix generated by host code to the same shader code, regardless of what graphics API is being used (e.g. Vulkan, OpenGL or Direct3D). As a solution to shader cross-compilation, Slang provides the necessary tools for developers to navigate around the differences between GLSL and HLSL targets.
+
+A high level summary:
+
+* Default matrix **layout** in memory for Slang is `column-major`. 
+  * This default is for *legacy* reasons and may change in the future.
+* Row-major layout is the only *portable* layout to use across targets (with significant caveats for non-4x4 matrices)
+* Use `setMatrixLayoutMode`/`spSetMatrixLayoutMode`/`createSession` to set the default  
+* Use `-matrix-layout-row-major` or `-matrix-layout-column-major` for the command line 
+  * or via `spProcessCommandLineArguments`/`processCommandLineArguments`
+* Depending on your host maths library, matrix sizes and targets, it may be necessary to convert matrices at host/kernel boundary  
+  
+On the portability issue, some targets *ignore* the matrix layout mode, notably CUDA and CPU/C++. For this reason, to cover the widest breadth of targets it is recommended to use *row-major* matrix layout.
 
 Two conventions of matrix transform math
--------------------------
-Depending on the platform a developer is used to, a matrix-vector transform can be expressed as either `v*m` (`mul(v, m)` in HLSL), or `m*v` (`mul(m,v)` in HLSL). This convention, together with the matrix layout (column-major or row-major), determines how a matrix should be filled out in host code. One way to make things less ambiguous is to think about where the translation terms should be placed in memory when filling a typical 4x4 transform matrix.
+----------------------------------------
+
+Depending on the platform a developer is used to, a matrix-vector transform can be expressed as either `v*m` (`mul(v, m)` in HLSL), or `m*v` (`mul(m,v)` in HLSL). This convention, together with the matrix layout (column-major or row-major), determines how a matrix should be filled out in host code. 
+
+In HLSL/Slang the order of vector and matrix parameters to `mul` determines how the *vector* is interpreted. This interpretation is required because a vector does not in and of itself distinguish between being a row or a column.
+
+* `mul(v, m)` - v is interpreted as a row vector
+* `mul(m, v)` - v is interpreted as a column vector
+
+Through this mechanism a developer is able to write transforms in their preferred style. 
+
+These two styles are not directly interchangeable - for a given `v` and `m`, generally `mul(v, m) != mul(m, v)`. To get the same result the matrix needs to be transposed, so
+
+* `mul(v, m) == mul(transpose(m), v)`
+* `mul(m, v) == mul(v, transpose(m))`
+
+This behavior is *independent* of how a matrix is laid out in memory. Host code needs to be aware of how shader code will interpret a matrix stored in memory, its layout, as well as the vector interpretation convention used in shader code (i.e. `mul(v, m)` or `mul(m, v)`).
+
+[Matrix layout](https://en.wikipedia.org/wiki/Row-_and_column-major_order) can be either `row-major` or `column-major`. The difference just determines which elements are contiguous in memory. `Row-major` means the elements of each row are contiguous. `Column-major` means the elements of each column are contiguous.
+
+Another way to think about this difference is in terms of where translation terms should be placed in memory when filling a typical 4x4 transform matrix. When transforming a row vector (i.e. `mul(v, m)`) with a `row-major` matrix layout, translation will be at `m + 12, 13, 14`. For a `column-major` matrix layout, translation will be at `m + 3, 7, 11`.
+
+Note it is an *HLSL*/*Slang* convention that the parameter ordering of `mul(v, m)` means v is a *row* vector. A host maths library *could* have a transform function `SomeLib::transform(v, m)` such that `v` is interpreted as a *column* vector. For simplicity's sake the remainder of this discussion assumes that the equivalent of `mul(v, m)` in host code follows the interpretation that `v` is a *row* vector.
+
+Discussion
+----------
+
+There are four variables in play here:
+
+* Host vector interpretation (row or column) - and therefore effective transform order (column) `m * v` or (row) `v * m`
+* Host matrix memory layout
+* Shader vector interpretation (as determined via `mul(v, m)` or `mul(m, v)`)
+* Shader matrix memory layout 
+
+Since each item can be either `row` or `column` there are 16 possible combinations. For simplicity let's reduce the variable space by making some assumptions.
+
+1) The same vector convention will be used in host code as in shader code. 
+2) The host maths matrix layout is the same as the kernel.
+
+If we accept 1, then we can ignore the vector interpretation because as long as they are consistent then only matrix layout is significant.
+If we accept 2, then there are only two possible combinations - either both host and shader are using `row-major` matrix layout or `column-major` layout.
+
+This is simple, but is perhaps not the end of the story. First let's assume that we want our Slang code to be as portable as possible. As previously discussed, for CUDA and C++/CPU targets Slang ignores the matrix layout settings - the matrix layout is *always* `row-major`.
+
+Second let's consider performance. The matrix layout in a host maths library is not arbitrary from a performance point of view. A performant host maths library will want to use SIMD instructions. With both x86/x64 SSE and ARM NEON SIMD it makes a performance difference which layout is used, depending on whether `column` or `row` is the *preferred* vector interpretation. If the `row` vector interpretation is preferred, it is most performant to have `row-major` matrix layout. Conversely, if `column` vector interpretation is preferred, `column-major` matrix layout will be the most performant.
+
+The performance difference comes down to a SIMD implementation having to do a transpose if the layout doesn't match the preferred vector interpretation.
+
+If we put this all together - best performance, consistency between vector interpretation and platform independence - we get:
+
+1) Consistency : Same vector interpretation in shader and host code
+2) Platform independence: Kernel uses `row-major` matrix layout
+3) Performance: Host vector interpretation should match host matrix layout
+
+The only combination that fulfills all aspects is `row-major` matrix layout and `row` vector interpretation for both host and kernel.
+
+It's worth noting that for targets that honor the default matrix layout, that setting can act like a toggle transposing a matrix layout. If for some reason the combination of choices leads to inconsistent vector transforms, an implementation can perform this transpose in *host* code at the boundary between host and the kernel. This is not the most performant or convenient scenario, but if supported in an implementation it could be used for targets that do not support kernel matrix layout settings.
+
+If only targeting platforms that honor matrix layout, there is more flexibility. Our constraints are
 
-If the shader code writes `mul(m, v)`, then the last **column** of `m` defines the translation terms. If we use row-major matrix layout, then the host code should make sure the translation terms are filled in at `m + 4, 7, 11` locations in memory.
+1) Consistency : Same vector interpretation in shader and host code
+2) Performance: Host vector interpretation should match host matrix layout
 
-Alternatively, if the shader code writes `mul(v, m)`, then the last **row** of `m` defines the translation terms. When using row-major matrix layout, the host code should make sure the translation terms are filled in at `m + 12, 13, 14` locations in memory.
+Then there are two combinations that work
 
-By default, Slang assumes all matrices to be in **row-major** layout, since this is the most nature layout to work with in CPU code: each row of the matrix occupies contiguous space in memory. A user should stick to one of the above practices to get correct result. Note that this is different from `fxc` which assumes `column_major` layout by default. As an example, if the host code uses `glm` library to generate transform matrices, the translation terms will be stored in `[12], [13], [14]` locations in memory. Therefore, the shader code should stick to the `mul(v,m)` convention to ensure correctness.
+1) `row-major` matrix layout for host and kernel, and `row` vector interpretation.
+2) `column-major` matrix layout for host and kernel, and `column` vector interpretation.
 
-Slang automatically handles the convention differences when cross-compiling code to GLSL. For example, a `float3x4` matrix will be translated to `mat4x3` in the resulting GLSL. Correspondingly, `mul(v, m)` will be translated to `m*v` in GLSL. Therefore, as long as the user is sticking to the above practices consistently, they will get correct result with the same matrix value in memory regardless of what graphics API they are actually using.
+If the host maths library is not performance orientated, it may be arbitrary from a performance point of view whether a `row` or `column` vector interpretation is used. In that case, assuming shader and host vector interpretation is the same, it is only important that the kernel and maths library matrix layout match.
+
+Another way of thinking about these combinations is to think of each change in `row-major`/`column-major` matrix layout and `row`/`column` vector interpretation as a transpose. If there are an *even* number of flips then all the transposes cancel out. Therefore the following combinations work
+
+| Host Vector | Kernel Vector | Host Mat Layout | Kernel Mat Layout 
+|-------------|---------------|-----------------|------------------
+|   Row       |    Row        |    Row          |    Row
+|   Row       |    Row        |    Column       |    Column
+|   Column    |    Column     |    Row          |    Row
+|   Column    |    Column     |    Column       |    Column
+|   Row       |    Column     |    Row          |    Column
+|   Row       |    Column     |    Column       |    Row
+|   Column    |    Row        |    Row          |    Column
+|   Column    |    Row        |    Column       |    Row
+
+To be clear 'Kernel Mat Layout' is the shader matrix layout setting. As previously touched upon, if it is not possible to use the setting (say because it is not supported on a target), then doing a transpose at the host/kernel boundary can fix the issue. 
+
+Matrix Layout
+-------------
+
+The above discussion is largely around 4x4 32-bit element matrices. For graphics APIs such as Vulkan, GL, and D3D there are typically additional restrictions for matrix layout. One restriction is for 16 byte alignment between rows (for `row-major` layout) and columns (for `column-major` layout). 
+
+More CPU-like targets such as CUDA and C++/CPU do not have this restriction, and have all elements stored consecutively.
+
+This being the case only the following matrix types/matrix layouts will work across all targets. (Listed in the HLSL convention of RxC). 
+ 
+* 1x4 `row-major` matrix layout
+* 2x4 `row-major` matrix layout
+* 3x4 `row-major` matrix layout
+* 4x4 `row-major` matrix layout
+
+These are all `row-major` because, as previously discussed, only `row-major` matrix layout currently works across all targets.
+
+NOTE! This only applies to matrices that are trafficked between host and kernel - any matrix size will work appropriately for variables in shader/kernel code, for example.
+
+The host maths library also plays a part here. The library may hold all elements consecutively in memory. If that's the case it will match the CPU/CUDA kernels, but will only work on 'graphics'-like targets that match that layout for the size.
+
+For SIMD based host maths libraries it can be even more convoluted. If a SIMD library is being used that prefers `row` vector interpretation and therefore will have `row-major` layout, it may for many sizes *not* match the CPU-like consecutive layout. For example, a 4x3 matrix will likely be packed with 16 byte row alignment. Additionally, even if a matrix is packed in the same way it may not be the same size. For example a 3x2 matrix *may* hold the rows consecutively *but* be 16 bytes in size, as opposed to the 12 bytes that a CPU-like kernel will expect.
+
+If a SIMD based host maths library is being used with graphics-like APIs, there is a good chance (but certainly *not* guaranteed) that layout across non-4x4 sizes will match, because SIMD typically implies 16 byte alignment.
+
+If your application uses matrix sizes that are not 4x4 across the host/kernel boundary and it wants to work across all targets, it is *likely* that *some* matrices will have to be converted across the boundary. This being the case, having to handle transposing matrices at the boundary is a less significant issue. 
+
+In conclusion, if your application has to perform matrix conversion work at the host/kernel boundary, the previous observation that "best performance" implies `row-major` layout and `row` vector interpretation becomes somewhat moot.
 
 Overriding default matrix layout
---------------------------
+--------------------------------
+
+Slang allows users to override default matrix layout with a compiler flag. This compiler flag can be specified during the creation of a `Session`:
 
-While we do not recommend so, Slang allows users to override default matrix layout with a compiler flag. This compiler flag can be specified during the creation of a `Session`:
 ```
 slang::IGlobalSession* globalSession;
 ...
@@ -33,37 +148,12 @@ slang::ISession* session;
 globalSession->createSession(slangSessionDesc, &session);
 ```
 
-This make make Slang treat all matrices as in column-major layout, and emit `column_major` qualifier in resulting code.
-
-Note that if you choose to use column-major layout, you either need to flip the matrix multiplication order in shader code or fill in the matrix in transpose order in host code.
-
-Summary
--------------
-
-In summary, we put together all options you have to ensure correct result:
-
-**Option 1: using row-major matrix layout, and `mul(m, v)` math convention**
-
-- Make sure the host code fills in matrices in the odering so that translation terms are specified in `m[3], m[7], m[11]` elements.
-- Leave `defaultMatrixLayoutMode` as default value when creating a Slang session, or specify `SLANG_MATRIX_LAYOUT_ROW_MAJOR`.
-- Write `mul(Matrix, Vector)` in shader code to transform `Vector` by `Matrix`.
-
-**Option 2: using row-major matrix layout, and `mul(v, m)` math convention**
-
-- Make sure the host code fills in matrices so that translations terms are specified in `m[12], m[13], m[14]` elements. Matrices filled in this way are compatible with typical OpenGL applications.
-- Leave `defaultMatrixLayoutMode` as default value when creating a Slang session, or specify `SLANG_MATRIX_LAYOUT_ROW_MAJOR`.
-- Write `mul(Vector, Matrix)` in shader code.
-
-**Option 3: using column-major matrix layout, and `mul(m, v)` math convention**
+This makes Slang treat all matrices as being in `column-major` layout and, for example, emit the `column_major` qualifier in the resulting HLSL code.
 
-- Make sure the host code fills in matrices in the odering so that translation terms are specified in `m[12], m[13], m[14]` elements. Matrices filled in this way are compatible with typical OpenGL applications.
-- Set `defaultMatrixLayoutMode` to `SLANG_MATRIX_LAYOUT_COLUMN_MAJOR` when creating a Slang session.
-- Write `mul(Matrix, Vector)` in shader code to transform `Vector` by `Matrix`.
+Alternatively the default layout can be set via
 
-**Option 4: using column-major matrix layout, and `mul(v, m)` math convention**
+* `setMatrixLayoutMode`/`spSetMatrixLayoutMode` API calls
+* `-matrix-layout-row-major` or `-matrix-layout-column-major` command line options
+  * or via `spProcessCommandLineArguments`/`processCommandLineArguments`
 
-- Make sure the host code fills in matrices so that translations terms are specified in `m[3], m[7], m[11]` elements.
-- Set `defaultMatrixLayoutMode` to `SLANG_MATRIX_LAYOUT_COLUMN_MAJOR` when creating a Slang session.
-- Write `mul(Vector, Matrix)` in shader code.
 
-And that's all you need to pay attention to. Slang will make sure the remaining details are correctly handled when generating target HLSL/GLSL code.
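The translation offsets and transpose identities described in the updated document can be checked with a small host-side sketch. This is an illustration only, not part of the change and not Slang API: `Mat4`, `Vec4`, `Layout`, `offsetOf`, `makeRowVectorTranslation`, `mulRowVector`, `mulColumnVector` and `transposeFlat` are hypothetical names, and the matrix is assumed to be 16 consecutive 32-bit floats.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side types: 16 (or 4) floats stored consecutively.
using Mat4 = std::array<float, 16>;
using Vec4 = std::array<float, 4>;

enum class Layout { RowMajor, ColumnMajor };

// Flat offset of logical element (row, col) under the given memory layout.
constexpr std::size_t offsetOf(Layout layout, std::size_t row, std::size_t col)
{
    return layout == Layout::RowMajor ? row * 4 + col : col * 4 + row;
}

// Translation matrix for the row-vector convention (mul(v, m)):
// the translation terms sit in the last logical *row*.
Mat4 makeRowVectorTranslation(Layout layout, float tx, float ty, float tz)
{
    Mat4 m{};
    for (std::size_t i = 0; i < 4; ++i)
        m[offsetOf(layout, i, i)] = 1.0f; // identity diagonal
    m[offsetOf(layout, 3, 0)] = tx;
    m[offsetOf(layout, 3, 1)] = ty;
    m[offsetOf(layout, 3, 2)] = tz;
    return m;
}

// Host equivalent of mul(v, m): v is treated as a row vector.
Vec4 mulRowVector(const Vec4& v, const Mat4& m, Layout layout)
{
    Vec4 out{};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t r = 0; r < 4; ++r)
            out[c] += v[r] * m[offsetOf(layout, r, c)];
    return out;
}

// Host equivalent of mul(m, v): v is treated as a column vector.
Vec4 mulColumnVector(const Mat4& m, const Vec4& v, Layout layout)
{
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[r] += m[offsetOf(layout, r, c)] * v[c];
    return out;
}

// Transpose of the 16 stored floats. Reinterpreting the same floats
// with the other layout has the same effect as this transpose.
Mat4 transposeFlat(const Mat4& m)
{
    Mat4 t{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            t[c * 4 + r] = m[r * 4 + c];
    return t;
}
```

Under these assumptions, building a translation of (5, 6, 7) with `Layout::RowMajor` stores the terms at offsets 12, 13, 14, while `Layout::ColumnMajor` stores them at 3, 7, 11. The column-major buffer is the flat transpose of the row-major one, and `mulRowVector(v, m, layout)` gives the same result as `mulColumnVector(transposeFlat(m), v, layout)`, mirroring `mul(v, m) == mul(transpose(m), v)`.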