Nalgebra seems to run an order of magnitude slower than Numpy, with the same backend bindings #1468

Open
rossamurphy opened this issue Dec 27, 2024 · 4 comments


rossamurphy commented Dec 27, 2024

Hi,

I wanted to try out nalgebra and see how it compares to very vanilla operations in numpy. I think I've implemented everything the 'right' way according to the documentation, but I'm really struggling with the performance here:

My Rust code:

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;

extern crate nalgebra as na;
extern crate nalgebra_lapack;

use std::time::Instant;

fn main() {

    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Uniform distribution on [0, 1); a fresh RNG is created per thread below
    let uniform = Uniform::new(0.0, 1.0);

    // Generate random numbers in parallel
    data.par_extend(
        (0..rows * cols)
            .into_par_iter()
            .map(|_|
                {
                    let mut local_rng = thread_rng(); // Create a new RNG for each thread
                    uniform.sample(&mut local_rng)
                })
    );

    // Create matrix from pre-generated data
    let a = na::DMatrix::from_vec(rows, cols, data);

    let creation_time = start.elapsed();
    println!("Matrix creation: {:?}", creation_time);
    println!("Matrix dimensions: {} x {}", a.nrows(), a.ncols());

    // Compute C = A * A^T (note: a.transpose() allocates a transposed copy)
    let mult_start = Instant::now();
    let c = &a * &a.transpose(); // use references to avoid moving the operands

    println!("Result dimensions: {} x {}", c.nrows(), c.ncols());
    let mult_time = mult_start.elapsed();
    println!("Matrix multiplication: {:?}", mult_time);

    // Use symmetric eigenvalue computation since C = A * A^T is symmetric
    let eig_start = Instant::now();
    let _eigvals = na::linalg::SymmetricEigen::new(c.clone()).eigenvalues;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation: {:?}", eig_time);

    // Compute a general SVD of C (C is symmetric positive semidefinite)
    let svd_start = Instant::now();
    let svd = na::linalg::SVD::new(c, true, true);
    let _singular_values = svd.singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation: {:?}", svd_time);

    println!("Total time: {:?}", start.elapsed());
}

my Cargo.toml:

[package]
name = "rust"
version = "0.1.0"
edition = "2021"

[dependencies]
nalgebra = "0.33.2"
nalgebra-lapack = { version = "0.25.0", default-features = false, features = ["accelerate"] }
rand = "0.8.5"
rayon = "1.10.0"

and my build.rs

fn main() {
    #[cfg(target_os = "macos")]
    {
        println!("cargo:rustc-link-lib=framework=Accelerate");
    }
}

and, for comparison, my Python code, a super simple script using numpy:

import numpy as np
import time

t1 = time.time()
a = np.random.rand(2_000, 2_000)
a.shape
print(a)
t2 = time.time()
b = a.T @ a
t3 = time.time()
# calculate the eigenvalues of b
result = np.linalg.eigvals(b)
# print(result)
t4 = time.time()
t5 = time.time()
result = np.linalg.svd(b)
t6 = time.time()
# cholesky decomp
result = np.linalg.cholesky(b)
t7 = time.time()
# np.show_config()
print(f"Total time: {(t6 - t1)*1e3:.1f} miliseconds")
print(f"Matrix creation: {(t2 - t1)*1e3:.1f} miliseconds")
print(f"Matrix multiplication: {(t3 - t2)*1e3:.1f} miliseconds")
print(f"Eigenval calculation: {(t4 - t3)*1e3:.1f} miliseconds")
print(f"SVD calculation: {(t6 - t5)*1e3:.1f} miliseconds")
print(f"Cholesky decomposition: {(t7 - t6)*1e3:.1f} miliseconds")

I find that when I run both, I get these results:

RUST:

❯ cargo run --release
Finished release profile [optimized] target(s) in 0.08s
Running target/release/rust
Matrix creation: 9.157083ms
Matrix dimensions: 2000 x 2000
Result dimensions: 2000 x 2000
Matrix multiplication: 364.901042ms
Eigenvalues computation: 4.586041791s
SVD calculation: 14.505883666s
Total time: 19.466049208s

and

PYTHON:

Total time: 3177.9 milliseconds
Matrix creation: 17.5 milliseconds
Matrix multiplication: 22.2 milliseconds
Eigenval calculation: 1377.0 milliseconds
SVD calculation: 1761.1 milliseconds
Cholesky decomposition: 27.5 milliseconds

You can see that Python is roughly 16x faster at matmul, about 3x faster at calculating eigenvalues, and roughly 8x faster at computing the SVD.

Please could someone tell me where I'm going wrong here?

Some further info:

  1. I'm running in 'release' mode.
  2. I know my binary is linked to Apple Accelerate because:
    ❯ otool -L target/release/rust
    target/release/rust:
    /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
    /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
  3. I've also tried with the default openblas feature, and performance is even worse.

If someone could please help me out / point me to where I'm going wrong, that would be great. I'm aware Python is going to be hard to beat here (this Python code is basically just running C / Fortran optimised for Apple chips), but I would expect the Rust to be at least on par with really simple Python, given that both target the same backend (Apple Accelerate).

Thanks!

Ralith (Collaborator) commented Dec 27, 2024

You're pulling in nalgebra_lapack, but you don't seem to actually be using it for anything.
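
The LAPACK-backed decompositions are separate types exposed by the nalgebra-lapack crate; constructing nalgebra's own decomposition types never touches Accelerate. A rough sketch of the distinction (the matrix here is just a placeholder):

use nalgebra::DMatrix;

fn main() {
    let m = DMatrix::<f64>::identity(4, 4);

    // Pure-Rust implementation from nalgebra itself; never calls into LAPACK.
    let _native = nalgebra::linalg::SVD::new(m.clone(), true, true).singular_values;

    // LAPACK-backed implementation from nalgebra-lapack; this is the one
    // that actually goes through Accelerate.
    let _lapack = nalgebra_lapack::SVD::new(m)
        .expect("LAPACK SVD failed")
        .singular_values;
}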

Da1sypetals commented

Why are you printing results during timing? IO can take up more time than you think. Besides, you should use criterion.rs rather than timing things yourself.
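
For example, a minimal Criterion sketch for the matmul step (assuming criterion is added as a dev-dependency and this lives in benches/matmul.rs; the benchmark name and setup are illustrative):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use nalgebra::DMatrix;

fn bench_matmul(c: &mut Criterion) {
    // Build the input once, outside the timed loop.
    let a = DMatrix::<f64>::from_fn(2000, 2000, |_, _| rand::random::<f64>());
    c.bench_function("2000x2000 A * A^T", |b| {
        b.iter(|| black_box(&a) * black_box(&a.transpose()))
    });
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);

Criterion handles warm-up, outliers, and statistical reporting for you, so IO never contaminates the measurement.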

rossamurphy (Author) commented

@Ralith ah, thanks! I hadn't realised that the nalgebra-lapack crate has its own separate methods for computing the decompositions. I naively thought that by importing it, the main nalgebra crate would get access to the speed-ups.

I've tried again now:

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;

extern crate nalgebra as na;
extern crate nalgebra_lapack as nl;

use std::time::Instant;

fn main() {

    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Uniform distribution on [0, 1); a fresh RNG is created per thread below
    let uniform = Uniform::new(0.0, 1.0);

    // Generate random numbers in parallel
    data.par_extend(
        (0..rows * cols)
            .into_par_iter()
            .map(|_|
                {
                    let mut local_rng = thread_rng(); // Create a new RNG for each thread
                    uniform.sample(&mut local_rng)
                })
    );

    // Create matrix from pre-generated data
    let a = na::DMatrix::from_vec(rows, cols, data);

    let creation_time = start.elapsed();
    println!("Matrix creation: {:?}", creation_time);
    println!("Matrix dimensions: {} x {}", a.nrows(), a.ncols());

    // Compute C = A * A^T (note: a.transpose() allocates a transposed copy)
    let mult_start = Instant::now();
    let c = &a * &a.transpose();

    println!("Result dimensions: {} x {}", c.nrows(), c.ncols());
    let mult_time = mult_start.elapsed();
    println!("Matrix multiplication: {:?}", mult_time);
    
    // Use symmetric eigenvalue computation since C = A * A^T is symmetric
    let eig_start = Instant::now();
    let _eigvals = na::linalg::SymmetricEigen::new(c.clone()).eigenvalues;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation native rust: {:?}", eig_time);
    
    // The SymmetricEigen method from nalgebra-lapack fails here due to
    // non-convergence, which is quite strange; computing the general case
    // instead and just returning the real parts of the eigenvalues.
    let eig_start = Instant::now();
    let _eigen_vals = nl::Eigen::new(c.clone(), false, false).unwrap().eigenvalues_re;
    let eig_time = eig_start.elapsed();
    println!("Eigenvalues computation nalgebra lapack: {:?}", eig_time);
    
    // Compute the SVD of C with the pure-Rust implementation
    let svd_start = Instant::now();
    let _svd = na::SVD::new(c.clone(), true, true).singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation native rust: {:?}", svd_time);
    
    // Compute the SVD of C with the LAPACK-backed implementation
    let svd_start = Instant::now();
    let svd = nl::SVD::new(c.clone());
    let _singular_values = svd.unwrap().singular_values;
    let svd_time = svd_start.elapsed();
    println!("SVD calculation nalgebra lapack: {:?}", svd_time);

    println!("Total time: {:?}", start.elapsed());
}

This gives much better results:

Matrix creation: 9.137ms
Matrix dimensions: 2000 x 2000
Result dimensions: 2000 x 2000
Matrix multiplication: 362.080209ms
Eigenvalues computation native rust: 4.609710542s
Eigenvalues computation nalgebra lapack: 1.521910792s
SVD calculation native rust: 13.960452334s
SVD calculation nalgebra lapack: 1.578839s
Total time: 22.042224s

@Ralith is there any scope for enabling the nalgebra-lapack speed-up as a feature of the main nalgebra crate? I.e., if an accelerated backend is available, the operation gets accelerated, and if not, it falls back to the pure-Rust path. That way there would be one standard way of doing the decompositions, rather than two similar-looking ways depending on the underlying architecture.
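
Something like this hypothetical sketch is what I have in mind (the "lapack" feature name and the duplicated function are purely illustrative; nothing like this exists in nalgebra today):

use nalgebra::{DMatrix, DVector};

// Hypothetical: when a "lapack" feature is enabled, dispatch to the
// LAPACK-backed SVD; otherwise fall back to the pure-Rust one.
#[cfg(feature = "lapack")]
pub fn singular_values(m: DMatrix<f64>) -> DVector<f64> {
    nalgebra_lapack::SVD::new(m)
        .expect("LAPACK SVD failed")
        .singular_values
}

#[cfg(not(feature = "lapack"))]
pub fn singular_values(m: DMatrix<f64>) -> DVector<f64> {
    nalgebra::linalg::SVD::new(m, false, false).singular_values
}

Callers would then write a single call site and get whichever backend the build enables.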

@Da1sypetals thanks for the recommendation! Criterion looks very useful and I'll use it in future. In this case, however, I just wanted to log to stdout in as quick and easy a way as possible. Given that the time difference between the operations (accelerated and not) is on the order of multiple seconds, I think the time incurred logging to stdout is immaterial in this example.

rossamurphy (Author) commented

Also, FYI, I achieved some pretty good results on this particular benchmark using these great Rust bindings for libtorch: https://github.com/LaurentMazare/tch-rs . Admittedly, the final production binary ends up a lot larger (if you decide to statically link torch), but it let me take full advantage of the libtorch backend for my device.

use rand::distributions::{Distribution, Uniform};
use rand::thread_rng;
use rayon::prelude::*;
use std::time::Instant;
use tch::{Device, Tensor};

fn main() {
    let device = Device::Mps;
    println!("Device: {:?}", device);
    let start = Instant::now();
    let rows = 2000;
    let cols = 2000;

    // Pre-allocate the vector with exact capacity
    let mut data = Vec::with_capacity(rows * cols);

    // Uniform distribution on [0, 1); f32 this time
    let uniform = Uniform::new(0.0f32, 1.0f32);

    // Generate random numbers in parallel
    data.par_extend((0..rows * cols).into_par_iter().map(|_| {
        let mut local_rng = thread_rng(); // Create a new RNG for each thread
        uniform.sample(&mut local_rng)
    }));

    let t = Tensor::from_slice2(&data.chunks(cols).collect::<Vec<_>>()).to_device(device);
    println!("Time taken to create matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let transpose_t = t.transpose(0, 1); // returns a view, so this is nearly free
    println!("Time taken to transpose matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let psd_t = t.matmul(&transpose_t);
    println!("Time taken to matmul matrix: {:?}", start.elapsed());

    let start = Instant::now();
    let _eig = psd_t.linalg_eigvals();
    println!("Time taken to calculate eigenvalues: {:?}", start.elapsed());

    // let start = Instant::now();
    // let _eig = psd_t.cholesky(true);
    // println!(
    //     "Time taken to calculate cholesky decomp: {:?}",
    //     start.elapsed()
    // );

    let start = Instant::now();
    // note: this is the SVD of t (i.e. A), not of psd_t (C) as in the earlier benchmarks
    let _svd = t.svd(true, true);
    println!("Time taken to calculate SVD: {:?}", start.elapsed());
    // t.print();
}

Results:

Device: Mps
Time taken to create matrix: 277.979625ms
Time taken to transpose matrix: 541.417µs
Time taken to matmul matrix: 21.853542ms
Time taken to calculate eigenvalues: 891.593208ms
Time taken to calculate SVD: 592.646416ms

Doing it in torch also has the handy property that you can prototype in Python torch and then port it pretty trivially to tch-rs for production (as it's all just torch at the end of the day).
