-
I would like to ask how to speed up the following code to calculate the correlation according to the index. fn pearson_correlation(x: &ArrayView1<f64>, y: &ArrayView1<f64>) -> f64 {
    // Pearson correlation of two equal-length dense vectors:
    // r = sum((x-mx)(y-my)) / (sqrt(sum((x-mx)^2)) * sqrt(sum((y-my)^2))).
    let mean_x = x.mean().unwrap();
    let mean_y = y.mean().unwrap();
    // Accumulate the covariance term and both variance terms in a single
    // pass over the zipped pair of vectors.
    let (numerator, denominator_x, denominator_y) = x
        .iter()
        .zip(y.iter())
        .fold((0.0_f64, 0.0_f64, 0.0_f64), |(num, dx, dy), (&x_i, &y_i)| {
            let diff_x = x_i - mean_x;
            let diff_y = y_i - mean_y;
            (num + diff_x * diff_y, dx + diff_x * diff_x, dy + diff_y * diff_y)
        });
    // Keep the two square roots separate, matching the textbook formulation.
    let denominator = denominator_x.sqrt() * denominator_y.sqrt();
    // A zero denominator means at least one input has zero variance; the
    // correlation is undefined there, so report NaN instead of dividing by 0.
    if denominator == 0.0 {
        f64::NAN
    } else {
        numerator / denominator
    }
}
#[pyfunction]
fn corr_by_index_rs2(py: Python<'_>, mtx: &Bound<'_, PyAny>, col_idx_pairs: PyReadonlyArrayDyn<'_, usize>) -> PyResult<PyObject> {
    // Compute the Pearson correlation for each requested pair of matrix
    // columns and return the results as a 1-D NumPy array of f64, one value
    // per row of `col_idx_pairs`, in the same order.
    //
    // `mtx` is duck-typed as a SciPy CSC matrix via its `data` / `indices` /
    // `indptr` / `shape` attributes. `col_idx_pairs` is assumed to be an
    // (n, 2) array of column indices — TODO confirm with callers; `row[1]`
    // panics if a row has fewer than two entries.
    let col_idx_pairs = col_idx_pairs.as_array();
    // Rebuild the SciPy CSC matrix as an `sprs::CsMat` so columns can be
    // sliced cheaply on the Rust side; extraction copies the three arrays.
    let data: Vec<f64> = mtx.getattr("data")?.extract()?;
    let indices: Vec<usize> = mtx.getattr("indices")?.extract()?;
    let indptr: Vec<usize> = mtx.getattr("indptr")?.extract()?;
    let shape: (usize, usize) = mtx.getattr("shape")?.extract()?;
    let mtx = CsMat::new_csc((shape.0, shape.1), indptr, indices, data);
    // One correlation per index pair, evaluated in parallel with rayon.
    let corr_values: Vec<f64> = col_idx_pairs.axis_iter(ndarray::Axis(0))
        .into_par_iter()
        .map(|row| {
            let idx1 = row[0];
            let idx2 = row[1];
            // For CSC storage the outer axis is the column axis, so
            // `outer_view` yields a column; densify before correlating.
            let col1 = mtx.outer_view(idx1).expect("column index out of bounds").to_dense();
            let col2 = mtx.outer_view(idx2).expect("column index out of bounds").to_dense();
            pearson_correlation(&col1.view(), &col2.view())
        })
        .collect();
    Ok(corr_values.into_pyarray_bound(py).into())
} |
Beta Was this translation helpful? Give feedback.
Answered by
mulimoen
Jun 27, 2024
Replies: 2 comments
-
You should retain the sparse structure and try to use the cheapest way of calculating the coefficients. Something like below might be faster (OBS: have not tested this myself, this is merely an example, you need to test this throughly yourself) fn pearson_correlation(x: sprs::CsVecView<f64>, y: sprs::CsVecView<f64>) -> f64 {
    // Accumulate the sum and sum of squares over the stored (non-zero)
    // entries only; implicit zeros contribute nothing to either sum.
    fn sum_x_x2(x: sprs::CsVecView<f64>) -> (f64, f64) {
        let mut sum = 0.0;
        let mut sum_sq = 0.0;
        for &v in x.data().iter() {
            sum += v;
            sum_sq += v.powi(2);
        }
        (sum, sum_sq)
    }
    // Both vectors must describe the same logical dimension.
    assert_eq!(x.dim(), y.dim());
    let (sum_x, sum_x2) = sum_x_x2(x);
    let (sum_y, sum_y2) = sum_x_x2(y);
    // Sparse dot product: only overlapping stored positions are visited.
    let sum_xy = x.dot(y);
    let n = x.dim() as f64;
    // "Computational" form of Pearson's r built from raw sums, so no dense
    // pass over the full-length vectors is ever needed.
    let numerator = n * sum_xy - sum_x * sum_y;
    let denominator = (n * sum_x2 - sum_x.powi(2)).sqrt() * (n * sum_y2 - sum_y.powi(2)).sqrt();
    // Zero denominator => zero variance in an input; r is undefined (NaN).
    if denominator == 0.0 {
        f64::NAN
    } else {
        numerator / denominator
    }
} You can save a pass of x/y by using nnz_or_zip |
Beta Was this translation helpful? Give feedback.
0 replies
Answer selected by
Liripo
-
Thanks a lot for your help, this code made it 10 times faster. @mulimoen |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
You should retain the sparse structure and try to use the cheapest way of calculating the coefficients. Something like below might be faster (OBS: have not tested this myself, this is merely an example, you need to test this thoroughly yourself)