
Parallel Matrix Operations #298

Open

aujxn opened this issue Apr 22, 2022 · 3 comments

@aujxn

aujxn commented Apr 22, 2022

Currently, only the sparse-by-sparse products in the smmp module are parallel. Converting the current sparse-by-dense products to use ndarray::parallel should be straightforward. Here is an implementation of par_csr_mulacc_dense_colmaj that gives a significant speedup on my machine:

pub fn par_csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
    lhs: CsMatViewI<A, I, Iptr>,
    rhs: ArrayView<B, Ix2>,
    mut out: ArrayViewMut<'a, N, Ix2>,
) where
    A: Send + Sync,
    B: Send + Sync,
    N: 'a + crate::MulAcc<A, B> + Send + Sync,
    I: 'a + SpIndex,
    Iptr: 'a + SpIndex,
{
    assert_eq!(lhs.cols(), rhs.shape()[0], "Dimension mismatch");
    assert_eq!(lhs.rows(), out.shape()[0], "Dimension mismatch");
    assert_eq!(rhs.shape()[1], out.shape()[1], "Dimension mismatch");
    assert!(lhs.is_csr(), "Storage mismatch");

    // Parallelize over the columns of the dense right-hand side: each task
    // writes to its own output column, so no synchronization on `out` is needed.
    let axis1 = Axis(1);
    ndarray::Zip::from(out.axis_iter_mut(axis1))
        .and(rhs.axis_iter(axis1))
        .par_for_each(|mut ocol, rcol| {
            for (orow, lrow) in lhs.outer_iterator().enumerate() {
                let oval = &mut ocol[[orow]];
                for (rrow, lval) in lrow.iter() {
                    let rval = &rcol[[rrow]];
                    oval.mul_acc(lval, rval);
                }
            }
        });
}
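
For reference, a minimal usage sketch (not part of the patch itself) might look like the following, assuming the function above is in scope and ndarray is built with its rayon feature; the matrix values are only illustrative:

use ndarray::{Array2, ShapeBuilder};
use sprs::CsMat;

fn main() {
    // 2x3 CSR matrix: [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
    let lhs: CsMat<f64> = CsMat::new(
        (2, 3),
        vec![0, 2, 3],       // indptr
        vec![0, 2, 1],       // column indices
        vec![1.0, 2.0, 3.0], // values
    );
    let rhs = Array2::<f64>::ones((3, 4));
    // Column-major output, matching the function name.
    let mut out = Array2::<f64>::zeros((2, 4).f());
    par_csr_mulacc_dense_colmaj(lhs.view(), rhs.view(), out.view_mut());
    assert_eq!(out[[0, 0]], 3.0); // 1.0 * 1.0 + 2.0 * 1.0
}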

The only changes here are the parallel iterator, enabling the rayon feature for ndarray, and adding the Send and Sync trait bounds to the data types inside the matrices. My concern is that adding Send + Sync will result in these trait requirements being propagated unnecessarily in many places.

Looking at the Mul impl for CsMatBase and CsMatBase, I see that Sync + Send is required whether or not multi_thread is enabled. Is it okay to propagate these trait requirements all the way up to many of the trait impls for CsMatBase and then use conditional feature compilation only on the lowest-level functions in the prod module? Conditionally compiling at all the higher-level implementations sounds like it would get messy very quickly, especially as more parallel features get added.
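
Something like this rough sketch (illustrative names and bodies, not the actual sprs prod module) is what I mean by gating only at the lowest level: callers always see the same signature with Send + Sync in the bounds, and only the innermost kernel switches on the multi_thread feature.

// Parallel kernel, compiled only when the feature is on.
#[cfg(feature = "multi_thread")]
pub fn mulacc_kernel<N>(out: &mut [N], inputs: &[N])
where
    N: Copy + Send + Sync + std::ops::AddAssign,
{
    use rayon::prelude::*;
    out.par_iter_mut()
        .zip(inputs.par_iter())
        .for_each(|(o, i)| *o += *i);
}

// Sequential fallback with identical bounds, so callers never change.
#[cfg(not(feature = "multi_thread"))]
pub fn mulacc_kernel<N>(out: &mut [N], inputs: &[N])
where
    N: Copy + Send + Sync + std::ops::AddAssign,
{
    for (o, i) in out.iter_mut().zip(inputs) {
        *o += *i;
    }
}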

@mulimoen
Collaborator

This would be perfect as a PR! I see no problem adding the Send + Sync bounds; they apply to a matrix element, and it would be quite surprising for an element type not to satisfy them. We will just have to add the trait bounds where necessary, perhaps directly on CsMatBase?

@aujxn
Author

aujxn commented Apr 26, 2022

This implementation isn't ideal, but it should be faster for many sparse-by-dense products. In certain cases, such as when the right-hand side is a single dense column vector, it is actually still sequential. To optimize that it might be necessary to make a parallel iterator out of CsMatBase::outer_iterator, but I don't know whether it is possible to zip parallel iterators from sprs with parallel iterators from ndarray using their zip combinator. I don't have any experience implementing parallel iterators, but I'll look into what it would take.
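
As a rough illustration of the row-parallel alternative, something like the following could cover the dense-vector case by indexing CSR rows by position instead of zipping a sprs parallel iterator with an ndarray one. This is only a sketch, with the element type fixed to f64 for simplicity and the function name made up:

use ndarray::{ArrayView1, ArrayViewMut1};
use rayon::prelude::*;
use sprs::CsMatView;

pub fn par_csr_mul_dense_vec(
    lhs: CsMatView<f64>,
    rhs: ArrayView1<f64>,
    mut out: ArrayViewMut1<f64>,
) {
    assert!(lhs.is_csr(), "Storage mismatch");
    assert_eq!(lhs.cols(), rhs.len(), "Dimension mismatch");
    assert_eq!(lhs.rows(), out.len(), "Dimension mismatch");
    // Each output entry is written by exactly one task, so Send + Sync on the
    // element type is the only synchronization requirement.
    out.as_slice_mut()
        .expect("contiguous output")
        .par_iter_mut()
        .enumerate()
        .for_each(|(row, o)| {
            if let Some(lrow) = lhs.outer_view(row) {
                for (col, &lval) in lrow.iter() {
                    *o += lval * rhs[col];
                }
            }
        });
}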

It will probably also be worth building some more benchmarks to compare the difference, but I'm not sure whether Criterion can compare benchmarks across different conditional compilation directives.

@chetmurthy
Contributor

chetmurthy commented Jul 22, 2022

I needed a parallel {CSR matrix} x {dense vector} multiplication, so I coded it using rayon. It gave a nice speedup.

I'm not sufficiently experienced with Rust to know how to go about integrating this into sprs, but the code is so small that I think it wouldn't be difficult to rewrite it in the right way to integrate with what you've already done.

Pointer: https://github.com/chetmurthy/qrusty/blob/f7e47f3a2e9a3d6bec9599de5076e91690bc66c0/qrusty/src/accel.rs#L336

Disclaimer: I don't have a good grasp of how to work with and extend traits, so I didn't try to make this compatible with the existing traits, but I know that is something that needs doing.

Second disclaimer: the code isn't as efficient as it could be. It could assign results directly to their final locations, but instead it concatenates a number of fragment vectors.
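
Roughly, the pattern is something like this (a simplified reconstruction, not the actual qrusty code): each block of rows produces its own Vec and the fragments are concatenated at the end, which is the extra copy mentioned above.

use rayon::prelude::*;
use sprs::CsMatView;

pub fn csr_mul_vec_fragments(lhs: CsMatView<f64>, rhs: &[f64]) -> Vec<f64> {
    assert!(lhs.is_csr());
    assert_eq!(lhs.cols(), rhs.len());
    const BLOCK: usize = 1024;
    let nrows = lhs.rows();
    let nblocks = (nrows + BLOCK - 1) / BLOCK;
    (0..nblocks)
        .into_par_iter()
        .map(|b| {
            // Each task builds a fragment vector for its block of rows.
            let (start, end) = (b * BLOCK, ((b + 1) * BLOCK).min(nrows));
            let mut frag = vec![0.0; end - start];
            for row in start..end {
                if let Some(lrow) = lhs.outer_view(row) {
                    for (col, &lval) in lrow.iter() {
                        frag[row - start] += lval * rhs[col];
                    }
                }
            }
            frag
        })
        // Concatenating the fragments costs an extra pass over the output
        // compared to writing into the final buffer directly.
        .flatten()
        .collect()
}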
