Implement batched serial gbtrf #2489

yasahi-hpc · 2025-01-27T18:42:51Z

This PR implements the batched serial gbtrf function.

The following files are added:

KokkosBatched_Gbtrf_Serial_Impl.hpp: Internal interfaces
KokkosBatched_Gbtrf_Serial_Internal.hpp: Implementation details
KokkosBatched_Gbtrf.hpp: APIs
Test_Batched_SerialGbtrf.hpp: Unit tests for gbtrf

Detailed description

It computes an LU factorization of a real general M-by-N band matrix A using partial pivoting with row interchanges.
The arguments have the following shapes and meanings.

A: (batch_count, ldab, n)
On entry, the M-by-N matrix A to be factored, in band storage. On exit, the factors L and U of the factorization: U is stored as an upper triangular band matrix with KL+KU superdiagonals in rows 0 to KL+KU, and the multipliers used during the factorization are stored in rows KL+KU+1 to 2*KL+KU.
IPIV: (batch_count, min(m, n))
The pivot indices; for 0 <= i < min(M,N), row i of the matrix was interchanged with row IPIV(i).
kl: The number of subdiagonals within the band of A. kl >= 0
ku: The number of superdiagonals within the band of A. ku >= 0
m: The number of rows of the matrix A. (optional)
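
To illustrate the band layout described above, here is a minimal sketch in plain C++ (no Kokkos) that packs a dense matrix into band storage with ldab = 2*kl + ku + 1 rows. It assumes the LAPACK-style mapping with 0-based indices, AB(kl + ku + i - j, j) = A(i, j), which is consistent with U ending up in rows 0 to KL+KU and the multipliers below it. The helper name `dense_to_band` is hypothetical and not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of the PR): pack a dense m-by-n matrix into
// band storage with ldab = 2*kl + ku + 1 rows. The first kl rows stay zero
// and serve as workspace for the fill-in created by row interchanges.
Matrix dense_to_band(const Matrix& a, int kl, int ku) {
  assert(kl >= 0 && ku >= 0);
  const int m    = static_cast<int>(a.size());
  const int n    = m > 0 ? static_cast<int>(a[0].size()) : 0;
  const int ldab = 2 * kl + ku + 1;

  Matrix ab(ldab, std::vector<double>(n, 0.0));
  for (int j = 0; j < n; ++j) {
    // Only entries inside the band of column j are stored.
    const int i_begin = std::max(0, j - ku);
    const int i_end   = std::min(m - 1, j + kl);
    for (int i = i_begin; i <= i_end; ++i) {
      ab[kl + ku + i - j][j] = a[i][j];  // assumed mapping AB(kl+ku+i-j, j) = A(i, j)
    }
  }
  return ab;
}
```

With this mapping the main diagonal of A lands in row kl + ku of the band array, the superdiagonals in the rows above it, and the subdiagonals in the rows below it.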

The batch can be parallelized as follows. This is efficient only when
A is given in LayoutLeft on GPUs and LayoutRight on CPUs (parallelized over the batch direction).

Kokkos::parallel_for("gbtrf",
    // One iteration per batch entry; m_a has shape (batch_count, ldab, n).
    Kokkos::RangePolicy<execution_space>(0, m_a.extent(0)),
    KOKKOS_LAMBDA(const int k) {
        // Band matrix and pivot vector of the k-th batch entry
        auto aa = Kokkos::subview(m_a, k, Kokkos::ALL(), Kokkos::ALL());
        auto ipiv = Kokkos::subview(m_ipiv, k, Kokkos::ALL());

        KokkosBatched::SerialGbtrf<AlgoTagType>::invoke(aa, ipiv, kl, ku);
    });

Tests

Random test: make a random band matrix from a random A and copy it to LU. Represent A in band storage AB and factorize it with gbtrf. Then convert AB back into full storage A and extract L and U. Compute the reference L and U by applying getrf to the LU matrix. Finally, confirm that the L and U from gbtrf match the reference.
Analytical test: a simple, small case, i.e. choose A as follows and confirm that LUB is factorized as expected.

A = [[ 1, -3, -2,  0],
     [-1,  1, -3, -2],
     [ 2, -1,  1, -3],
     [ 0,  2, -1,  1]]
LUB: [[0,       0,    0,    0],
      [0,       0,    0,   -3],
      [0,       0,    1,  1.5],
      [0,      -1, -2.5, -3.2],
      [2,    -2.5,   -3,  5.4],
      [-0.5, -0.2,    1,    0],
      [0.5,  -0.8,    0,    0]]
piv = [2 2 2 3]
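
The expected values above can be cross-checked with a small dense factorization. The following is a sketch in plain C++, not the PR's gbtrf implementation: the helper `lu_unblocked` is hypothetical, works on a square dense matrix, and assumes two conventions of the banded algorithm, namely that ties in the pivot search go to the first candidate and that row interchanges are applied only to column j and the columns to its right, so earlier multipliers stay where they were computed.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of the PR): in-place LU factorization of a
// square dense matrix with partial pivoting. On exit, the upper triangle of
// `a` holds U and the strictly lower triangle holds the multipliers.
// Returns the 0-based pivot indices.
std::vector<int> lu_unblocked(Matrix& a) {
  const int n = static_cast<int>(a.size());
  std::vector<int> ipiv(n);
  for (int j = 0; j < n; ++j) {
    // Pivot search: first entry of largest magnitude in column j, rows j..n-1.
    int p = j;
    for (int i = j + 1; i < n; ++i) {
      if (std::abs(a[i][j]) > std::abs(a[p][j])) p = i;
    }
    ipiv[j] = p;
    assert(a[p][j] != 0.0 && "zero pivot: matrix is singular");

    // Interchange rows j and p, but only in columns j..n-1 (assumption:
    // mirrors the banded algorithm, which leaves earlier multipliers alone).
    for (int c = j; c < n; ++c) std::swap(a[j][c], a[p][c]);

    // Compute the multipliers and update the trailing submatrix.
    for (int i = j + 1; i < n; ++i) {
      a[i][j] /= a[j][j];
      for (int c = j + 1; c < n; ++c) a[i][c] -= a[i][j] * a[j][c];
    }
  }
  return ipiv;
}
```

Applied to the matrix A above, the diagonal of the result corresponds to row KL+KU = 4 of LUB, and the entries below the diagonal correspond to rows 5 and 6.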

lucbv

Some small clean-up needed but nothing major

blas/impl/KokkosBlas_util.hpp

batched/dense/impl/KokkosBatched_Gbtrf_Serial_Impl.hpp

batched/dense/unit_test/Test_Batched_DenseUtils.hpp

batched/dense/unit_test/Test_Batched_SerialGbtrf.hpp

lucbv · 2025-03-11T15:41:17Z

batched/dense/unit_test/Test_Batched_SerialGbtrf.hpp

+  auto h_NL1 = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL1);
+  auto h_NL2 = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL2);
+
+  RealType eps = 1.0e1 * ats::epsilon();


This looks like an arbitrary number that happens to work... How about doing an error analysis that counts the number of round-off operations performed in gbtrf?

Could you please elaborate on this point?
The tolerance is the machine precision of fp32 or fp64 multiplied by 10.

For example, when you perform a gemv operation:

y = beta * y + alpha * A * x

for each value y(i) you perform numCols multiplications and numCols - 1 additions to compute A * x, then two more multiplications for alpha and beta, and one more addition between beta * y and alpha * A * x. In total that is numCols + 2 multiplications and numCols additions, so a check might look like this:

tol = (2 * numCols + 2) * maxVal * Kokkos::ArithTraits<Scalar>::epsilon()
Kokkos::abs(y(i) - y_ref(i)) < tol

Here maxVal is the maximum value an input can take; it is included because catastrophic cancellation could happen.
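
As a rough illustration of the operation-count tolerance suggested above, the check could be written as follows. This is a sketch in plain C++ that uses std::numeric_limits in place of Kokkos::ArithTraits; the helper names `gemv_tolerance` and `gemv_entry_matches` are hypothetical and do not exist in Kokkos Kernels.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical helper: tolerance for one entry of y = beta * y + alpha * A * x.
// The factor (2 * num_cols + 2) counts the num_cols + 2 multiplications and
// num_cols additions; max_val bounds the magnitude of the inputs.
template <typename Scalar>
Scalar gemv_tolerance(std::size_t num_cols, Scalar max_val) {
  const Scalar num_ops = static_cast<Scalar>(2 * num_cols + 2);
  return num_ops * max_val * std::numeric_limits<Scalar>::epsilon();
}

// Hypothetical helper: true if the computed entry y is within the tolerance
// of the reference entry y_ref.
template <typename Scalar>
bool gemv_entry_matches(Scalar y, Scalar y_ref, std::size_t num_cols, Scalar max_val) {
  return std::abs(y - y_ref) < gemv_tolerance(num_cols, max_val);
}
```

The tolerance then scales with the problem size and the input range instead of being a fixed multiple of epsilon.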

I understand your point.
The error analysis would be critical when it comes to fp16.

Can I start from simpler cases like gemv and gemm?

lucbv · 2025-03-11T15:42:07Z

batched/dense/unit_test/Test_Batched_SerialGbtrf.hpp

+  auto h_NL_ref   = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL_ref);
+  auto h_ipiv_ref = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), ipiv_ref);
+
+  RealType eps = 1.0e3 * ats::epsilon();


This one looks even more arbitrary than the previous one above :(

Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>

yasahi-hpc · 2025-03-12T14:04:16Z

@lucbv Thank you for your review.
I have addressed all the comments except for the error analysis.

For the error analysis, as I also commented in #2530,
I would like to start from the simpler cases.

cwpearson added the AT2-CI-APPROVAL Approve CI to run at SNL label Jan 28, 2025

yasahi-hpc force-pushed the implement-batched-serial-gbtrf branch 2 times, most recently from 2723819 to 507b3bd Compare February 6, 2025 07:40

lucbv self-requested a review February 19, 2025 02:28

lucbv assigned yasahi-hpc Feb 19, 2025

yasahi-hpc force-pushed the implement-batched-serial-gbtrf branch from 507b3bd to 1d0c1e2 Compare February 27, 2025 14:43

yasahi-hpc added AT2-CI-APPROVAL Approve CI to run at SNL and removed AT2-CI-APPROVAL Approve CI to run at SNL labels Mar 7, 2025

lucbv requested changes Mar 11, 2025

Yuuichi Asahi added 10 commits March 12, 2025 21:45
(each commit is Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>)

e993292: fix: conflicts
e14c1ef: improve gbtrf unit-test to deal with non-rectangular cases
8761c5c: improve gbtrf unit-test
c73f4bb: fix: errors from codeQL
5d0f400: remove unused View2DType
9cab97d: use ger internal to simplify the gbtrf implementation details
0d27a28: Add docstring and assertion for ArgAlgo parameter in gbtrf
3251836: format Test_Batched_Dense.hpp
ab48f34: fix check function for gbtrf
8e75aff: rename full matrix to dense matrix

yasahi-hpc force-pushed the implement-batched-serial-gbtrf branch from 01205a5 to 8e75aff Compare March 12, 2025 13:12

yasahi-hpc requested a review from lucbv March 12, 2025 14:04

yasahi-hpc added AT2-CI-APPROVAL Approve CI to run at SNL and removed AT2-CI-APPROVAL Approve CI to run at SNL labels Mar 12, 2025
