Nalgebra seems to run an order of magnitude slower than Numpy, with the same backend bindings #1468
You're pulling in
Why are you printing results during timing? IO can take up more time than you think. Besides, you should use criterion.rs rather than timing things yourself.
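For reference, a minimal Criterion sketch for just the matmul step might look something like the following. The bench file name, matrix size, and bench label are illustrative, and it assumes criterion is added as a dev-dependency with a [[bench]] entry that sets harness = false:

// benches/matmul.rs (illustrative)
use criterion::{criterion_group, criterion_main, Criterion};
use nalgebra as na;

fn matmul_bench(c: &mut Criterion) {
    // 500x500 keeps each sample cheap; scale up to match the original 2000x2000 test.
    let a = na::DMatrix::from_fn(500, 500, |_, _| rand::random::<f64>());
    c.bench_function("a * a^T (500x500)", |b| b.iter(|| &a * a.transpose()));
}

criterion_group!(benches, matmul_bench);
criterion_main!(benches);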
@Ralith ah, thanks! I hadn't realised that the nalgebra-lapack crate has its own, separate methods for computing the decompositions; I naively thought that importing it would give the main nalgebra crate access to the speed-ups. I've tried again now:

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;
extern crate nalgebra as na;
extern crate nalgebra_lapack as nl;
use std::time::Instant;
fn main() {
let start = Instant::now();
let rows = 2000;
let cols = 2000;
// Pre-allocate the vector with exact capacity
let mut data = Vec::with_capacity(rows * cols);
// Initialize random number generator and distribution
let uniform = Uniform::new(0.0, 1.0);
// Generate random numbers in parallel
data.par_extend(
(0..rows * cols)
.into_par_iter()
.map(|_|
{
let mut local_rng = thread_rng(); // Create a new RNG for each thread
uniform.sample(&mut local_rng)
})
);
// Create matrix from pre-generated data
let a = na::DMatrix::from_vec(rows, cols, data);
let creation_time = start.elapsed();
println!("Matrix creation: {:?}", creation_time);
println!("Matrix dimensions: {} x {}", a.nrows(), a.ncols());
// Compute C = A * A^T (note: a.transpose() still allocates a temporary transposed matrix)
let mult_start = Instant::now();
let c = &a * &a.transpose();
println!("Result dimensions: {} x {}", c.nrows(), c.ncols());
let mult_time = mult_start.elapsed();
println!("Matrix multiplication: {:?}", mult_time);
// Use symmetric eigenvalue computation since C = A * A^T is symmetric
let eig_start = Instant::now();
let _eigvals = na::linalg::SymmetricEigen::new(c.clone()).eigenvalues;
let eig_time = eig_start.elapsed();
println!("Eigenvalues computation native rust: {:?}", eig_time);
// Symmetric eigen method from nalgebra lapack fails here due to non-convergence.
// which is quite strange ...
// calculating in the general case here and just returning the real eigenvalues
let eig_start = Instant::now();
let _eigen_vals = nl::Eigen::new(c.clone(), false, false).unwrap().eigenvalues_re;
let eig_time = eig_start.elapsed();
println!("Eigenvalues computation nalgebra lapack: {:?}", eig_time);
// C is symmetric positive semidefinite; compute its singular values with nalgebra's general SVD
let svd_start = Instant::now();
// let svd = na::linalg::SVD::new(c.clone(), true, true);
let _svd = na::SVD::new(c.clone(), true, true).singular_values;
let svd_time = svd_start.elapsed();
println!("SVD calculation native rust: {:?}", svd_time);
// Same computation via nalgebra-lapack's SVD
let svd_start = Instant::now();
let svd = nl::SVD::new(c.clone());
let _singular_values = svd.unwrap().singular_values;
let svd_time = svd_start.elapsed();
println!("SVD calculation nalgebra lapack: {:?}", svd_time);
println!("Total time: {:?}", start.elapsed());
}

This gives much better results: Matrix creation: 9.137ms

@Ralith is there any scope for enabling the nalgebra-lapack speed-up as a feature of the main nalgebra crate? i.e. if an accelerator is registered the operation is accelerated, and if not, it isn't. That way there could be one standard way of doing the decompositions, rather than two similar-but-separate ways depending on the underlying architecture.

@Da1sypetals thanks for the recommendation! Criterion looks very useful and I'll use it in future. In this case, though, I just wanted to log to stdout as quickly and easily as possible, and since the time difference between the accelerated and non-accelerated operations is on the order of multiple seconds, the cost of logging to stdout is immaterial in this example.
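On the question of a single unified API above: until something like that exists in nalgebra itself, one way to approximate it in user code is a small feature-gated wrapper. A rough sketch, where the "lapack" feature name is hypothetical and would be declared in the consuming crate's own Cargo.toml (forwarding to nalgebra-lapack plus whichever lapack-src backend you pick):

use nalgebra as na;

// With the (hypothetical) "lapack" feature enabled, use the same general
// LAPACK eigensolver call as in the snippet above and return the real parts.
#[cfg(feature = "lapack")]
fn eigenvalues_re(m: na::DMatrix<f64>) -> na::DVector<f64> {
    nalgebra_lapack::Eigen::new(m, false, false)
        .expect("LAPACK eigendecomposition failed")
        .eigenvalues_re
}

// Otherwise fall back to the pure-Rust symmetric solver from the main crate
// (valid here because the input matrix is symmetric).
#[cfg(not(feature = "lapack"))]
fn eigenvalues_re(m: na::DMatrix<f64>) -> na::DVector<f64> {
    na::linalg::SymmetricEigen::new(m).eigenvalues
}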
Also, FYI, I achieved some pretty good results on this particular benchmark by using this great port of libtorch: https://github.com/LaurentMazare/tch-rs. Admittedly, in production the final binary ends up a lot larger (if you decide to statically link torch), but it let me take full advantage of the libtorch bindings for my device.

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;
use std::time::Instant;
use tch::{Device, Tensor};
fn main() {
let device = Device::Mps;
println!("Device: {:?}", device);
let start = Instant::now();
let rows = 2000;
let cols = 2000;
// Pre-allocate the vector with exact capacity
let mut data = Vec::with_capacity(rows * cols);
// Initialize random number generator and distribution
let uniform = Uniform::new(0.0f32, 1.0f32); // Changed to f32
// Generate random numbers in parallel
data.par_extend((0..rows * cols).into_par_iter().map(|_| {
let mut local_rng = thread_rng(); // Create a new RNG for each thread
uniform.sample(&mut local_rng)
}));
// t.print()
let t = Tensor::from_slice2(&data.chunks(cols).collect::<Vec<_>>()).to_device(device);
println!("Time taken to create matrix: {:?}", start.elapsed());
let start = Instant::now();
let transpose_t = t.transpose(0, 1);
println!("Time taken to transpose matrix: {:?}", start.elapsed());
let start = Instant::now();
let psd_t = t.matmul(&transpose_t);
println!("Time taken to matmul matrix: {:?}", start.elapsed());
let start = Instant::now();
let _eig = psd_t.linalg_eigvals();
println!("Time taken to calculate eigenvalues: {:?}", start.elapsed());
// let start = Instant::now();
// let _eig = psd_t.cholesky(true);
// println!(
// "Time taken to calculate cholesky decomp: {:?}",
// start.elapsed()
// );
let start = Instant::now();
let _svd = t.svd(true, true);
println!("Time taken to calculate SVD: {:?}", start.elapsed());
// t.print();
}

Results: Device: Mps

Doing it in torch also has the handy nicety that you can prototype things in torch in Python and then transfer them pretty trivially to tch-rs for production (it's all just torch at the end of the day).
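As a side note on the tensor construction above: assuming the pinned tch version exposes Tensor::from_slice (recent releases do; older ones named it of_slice), the chunks-into-row-slices step can likely be skipped by building from the flat buffer and reshaping. A small sketch:

use tch::{Device, Tensor};

// Sketch of an alternative to the from_slice2(&data.chunks(cols)...) construction:
// build the tensor from the flat row-major buffer, then view it as rows x cols.
fn matrix_from_flat(data: &[f32], rows: i64, cols: i64, device: Device) -> Tensor {
    Tensor::from_slice(data).view((rows, cols)).to_device(device)
}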
Hi,
I wanted to try out nalgebra and see how it compares to very vanilla operations in numpy. I think I've implemented everything the 'right' way according to the documentation, but I'm really struggling with the performance here:
My Rust code:
My cargo.toml:
And, for comparison, my Python code, a super simple script using numpy:
I find that when I run both, I get these results:
RUST:
❯ cargo run --release
Finished `release` profile [optimized] target(s) in 0.08s
Running `target/release/rust`
Matrix creation: 9.157083ms
Matrix dimensions: 2000 x 2000
Result dimensions: 2000 x 2000
Matrix multiplication: 364.901042ms
Eigenvalues computation: 4.586041791s
SVD calculation: 14.505883666s
Total time: 19.466049208s
and
PYTHON:
Total time: 3177.9 milliseconds
Matrix creation: 17.5 milliseconds
Matrix multiplication: 22.2 milliseconds
Eigenval calculation: 1377.0 milliseconds
SVD calculation: 1761.1 milliseconds
Cholesky decomposition: 27.5 milliseconds
You can see that Python is more than 10x faster at matmul, roughly 3x faster at calculating eigenvalues, and roughly 8x faster at computing the SVD.
Please could someone tell me where I'm going wrong here?
Some further Info:
❯ otool -L target/release/rust
target/release/rust:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
/usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
If someone could help me out or point me to where I'm going wrong, that would be great. I'm aware Python is going to be hard to beat here (this Python code is basically just running C/Fortran optimised for Apple chips), but I would expect the Rust to be at least on par with such simple Python, given that both target the same backend (Apple Accelerate).
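For anyone reproducing this: the Accelerate linkage shown in the otool output is normally wired up through nalgebra-lapack's backend features, which forward to lapack-src. The snippet below is not the poster's actual cargo.toml (that isn't shown here) and the version numbers are only illustrative; double-check the feature names against the nalgebra-lapack release you're using:

[dependencies]
nalgebra = "0.33"   # version illustrative
rand = "0.8"
rayon = "1"
# Select the LAPACK backend here; "accelerate" forwards to lapack-src/accelerate on macOS.
nalgebra-lapack = { version = "0.24", default-features = false, features = ["accelerate"] }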
Thanks!