-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement batched serial gbtrf #2489
base: develop
Are you sure you want to change the base?
Implement batched serial gbtrf #2489
Conversation
2723819
to
507b3bd
Compare
507b3bd
to
1d0c1e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small clean-up needed but nothing major
auto h_NL1 = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL1); | ||
auto h_NL2 = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL2); | ||
|
||
RealType eps = 1.0e1 * ats::epsilon(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like an arbitrary number that happens to work... how about doing an error analysis to compute the number of round off operations performed in gbtrf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please detail this point?
The tolerance is numerical precision of fp32 or fp64 multiplied by 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example when you perform a gemv operation:
y = beta * y + alpha * A * x
for each value y(i)
you have performed numCols
multiplications and numCols - 1
additions to compute A * x
, then there is two more multiplications for alpha
and beta
and one more addition between beta * y
and alpha * A * x
so in total that's numCols + 2
multiplications and numCols
additions. So a check might look like this
tol = (2 * numCols + 2) * maxVal * Kokkos::ArithTraits<Scalar>::eps()
Kokkos::abs(y(i) - y_ref(i)) < tol
the maxVal is the maximum value an input can take as catastrophic cancelation could happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your point.
The error analysis would be critical, when it comes to fp16.
Can I start from simpler cases like gemv
and gemm
.
auto h_NL_ref = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), NL_ref); | ||
auto h_ipiv_ref = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), ipiv_ref); | ||
|
||
RealType eps = 1.0e3 * ats::epsilon(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one looks even more arbitrary than the previous one above : (
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
01205a5
to
8e75aff
Compare
This PR implements gbtrf function.
Following files are added:
KokkosBatched_Gbtrf_Serial_Impl.hpp
: Internal interfacesKokkosBatched_Gbtrf_Serial_Internal.hpp
: Implementation detailsKokkosBatched_Gbtrf.hpp
: APIsTest_Batched_SerialGbtrf.hpp
: Unit tests for thatDetailed description
It computes an LU factorization of a real general M-by-N band matrix
A
using partial pivoting with row interchanges.Here, the matrix has the following shape.
A
:(batch_count, ldab, n)
On entry, the matrix A in band storage. M-by-N matrix to be factored. On exit, the factors
L
andU
from the factorization whereU
is stored as an upper triangular band matrix with KL+KU superdiagonals in rows 0 to KL+KU,and the multipliers used during the factorization are stored in rows KL+KU+1 to 2*KL+KU.
IPIV
:(batch_count, min(m, n))
The pivot indices; for
0 <= i < min(M,N)
, rowi
of the matrix was interchanged with rowIPIV(i)
.kl
: The number of subdiagonals within the band of A. kl >= 0ku
: The number of superdiagonals within the band of A. ku >= 0m
: The number of rows of the matrix A. (optional)Parallelization would be made in the following manner. This is efficient only when
A is given in
LayoutLeft
for GPUs andLayoutRight
for CPUs (parallelized over batch direction).Tests
A
and copy it toLU
. RepresentA
in band storageAB
and factorize it withgbtrf
. Then, convertAB
back into full storageA
and extractL
andU
. Make a reference bygetrf
to get referenceL
andU
fromLU
matrix. Finally, we confirmL
andU
are the same.A
as follows to confirmLUB
is factorized as expected.