
Adjustments for Finch.jl#562 #57

Merged: 1 commit merged into main from wma-heuristic-fix on May 21, 2024

Conversation

@mtsokol (Collaborator) commented May 20, 2024

Hi @willow-ahrens,

Here's a PR with adjustments required by finch-tensor/Finch.jl#562.

One thing I'm concerned about is the reduced performance of code that uses PlusOneVector (the zero-copy SciPy constructors), which is much slower than the copying counterparts.

For example, in SDDMM, changing from a copied to a no-copy CSR matrix introduces a 10x slowdown.
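
For context, here is a minimal sketch of what a PlusOneVector-style wrapper does; the `PlusOneLikeVector` name and code below are my own illustration, not the actual Finch.jl implementation. The idea is to expose a 0-based SciPy index buffer as 1-based by adjusting each element on access, instead of copying the buffer up front:

```julia
# Illustrative sketch only; the real PlusOneVector lives in Finch.jl and may differ.
struct PlusOneLikeVector{T<:Integer, V<:AbstractVector{T}} <: AbstractVector{T}
    data::V  # 0-based buffer shared with Python, never copied
end
PlusOneLikeVector(data::AbstractVector{T}) where {T<:Integer} =
    PlusOneLikeVector{T, typeof(data)}(data)

Base.size(v::PlusOneLikeVector) = size(v.data)
Base.IndexStyle(::Type{<:PlusOneLikeVector}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(v::PlusOneLikeVector, i::Int) =
    v.data[i] + one(eltype(v.data))
Base.@propagate_inbounds Base.setindex!(v::PlusOneLikeVector, x, i::Int) =
    (v.data[i] = x - one(eltype(v.data)); v)

# Usage sketch: wrap 0-based indptr/indices (as SciPy stores them) without copying,
# so Finch sees 1-based positions while Python keeps owning the memory.
indptr0 = Int32[0, 2, 3]              # stand-in for a SciPy-owned buffer
indptr1 = PlusOneLikeVector(indptr0)  # reads back as [1, 3, 4]
```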

@mtsokol self-assigned this May 20, 2024
@mtsokol (Collaborator, Author) commented May 20, 2024

Overall I'm happy with the change; I would just like to investigate PlusOneVector a bit further.

To repeat one of my comments: I ran SDDMM with the csr sparse format. With an explicit copy, execution takes 5s:

Executing:
:(function var"##compute#316"(prgm)
      begin
          V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, Vector{Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A8 .= 0.0
                  for i49 = _
                      for i47 = _
                          for i48 = _
                              A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49])
                          end
                      end
                  end
                  return A8
              end
          return (swizzle(A8, 2, 1),)
      end
  end)

And with the no-copy constructor (the generated code differs only in the PlusOneVector index vectors; I compared the two with diffchecker.com), it takes 100s:

Executing:
:(function var"##compute#308"(prgm)
      begin
          V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_2 = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_3 = (((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A0 = V::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32}, PlusOneVector{Int32}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A2 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A5 = V_3::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A8 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseLevel{Int64, Finch.DictTable{Int64, Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A8 .= 0.0
                  for i49 = _
                      for i47 = _
                          for i48 = _
                              A8[i48, i47] << + >>= (*)((*)(A0[i48, i47], A2[i47, i49]), A5[i48, i49])
                          end
                      end
                  end
                  return A8
              end
          return (swizzle(A8, 2, 1),)
      end
  end)
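
To isolate whether the wrapper itself adds per-access overhead, independently of the full SDDMM kernel, one could microbenchmark raw versus wrapped index access along the following lines. This reuses the hypothetical `PlusOneLikeVector` sketch from above (not Finch's actual `PlusOneVector`) and assumes BenchmarkTools is installed:

```julia
using BenchmarkTools

raw     = collect(Int32, 1:1_000_000)
wrapped = PlusOneLikeVector(raw)   # the illustrative wrapper defined earlier

# Sum every index through getindex, the access pattern a kernel would use.
function sum_indices(v)
    s = 0
    @inbounds for i in eachindex(v)
        s += Int(v[i])
    end
    return s
end

@btime sum_indices($raw)      # baseline: plain Vector{Int32}
@btime sum_indices($wrapped)  # zero-copy wrapper with the +1 adjustment
```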

@hameerabbasi (Collaborator) commented:

Ouch -- that's a big perf hit. I would wait to merge this as PlusOneVector is used a lot.

@mtsokol (Collaborator, Author) commented May 21, 2024

> Ouch -- that's a big perf hit. I would wait to merge this as PlusOneVector is used a lot.

@hameerabbasi It turns out that PlusOneVector was already slow; it just wasn't really used internally before: finch-tensor/Finch.jl#562 (comment)

@mtsokol (Collaborator, Author) commented May 21, 2024

I created an issue for it: finch-tensor/Finch.jl#565

@mtsokol (Collaborator, Author) commented May 21, 2024

@hameerabbasi Would you have a spare moment to review it?

@hameerabbasi (Collaborator) commented:

It's fairly minimal code -- there isn't anything to review besides a dependency version bump. But if that dependency bump introduces a major regression, then we should wait for the regression to be fixed before bumping.

@mtsokol (Collaborator, Author) commented May 21, 2024

AFAIU it's not necessarily a regression, as PlusOneVectors used to be dropped internally. The latest release just exposed the slow performance of PlusOneVector, and therefore of the zero-copy SciPy constructors in general.

My current opinion is that it's better to merge this and ship zero-copy/PlusOneVector with low performance (and keep working on improving it, or drop the zero-copy SciPy constructors if that isn't feasible) than to stay blocked on the current version, which drops PlusOneVector anyway.

@hameerabbasi (Collaborator) commented:

A regression doesn't need to be a bug; it can be anything that practically renders the new version inferior to the old one. If such a situation arises, I'd re-evaluate after each release whether the newest version is better for my use case than whatever version I'm using now, and then upgrade.

@mtsokol (Collaborator, Author) commented May 21, 2024

Then I think I can open a PR that drops PlusOneVector from finch-tensor, acknowledging that it isn't used internally. Then this PR won't be a regression.

@willow-ahrens (Collaborator) commented:

I think PlusOneVector introduces a type instability, which we can fix pretty easily.
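
To unpack what such a type instability can look like, here is a hypothetical sketch (my own illustration, not Finch.jl's PlusOneVector) contrasting a wrapper whose buffer field is abstractly typed with one whose buffer type is a type parameter:

```julia
# Unstable variant: the field has an abstract type, so the compiler cannot
# infer the element access inside hot loops and falls back to dynamic dispatch.
struct UnstableIdxVector <: AbstractVector{Int}
    data::AbstractVector{Int32}
end
Base.size(w::UnstableIdxVector) = size(w.data)
Base.getindex(w::UnstableIdxVector, i::Int) = Int(w.data[i]) + 1

# Stable variant: the concrete buffer type is a type parameter, so the same
# access infers to Int32 and the `+ 1` compiles to a plain integer add.
struct StableIdxVector{V<:AbstractVector{Int32}} <: AbstractVector{Int}
    data::V
end
Base.size(w::StableIdxVector) = size(w.data)
Base.getindex(w::StableIdxVector, i::Int) = Int(w.data[i]) + 1

# In the REPL, @code_warntype makes the difference visible:
#   @code_warntype getindex(UnstableIdxVector(Int32[1, 2, 3]), 1)  # abstract return type
#   @code_warntype getindex(StableIdxVector(Int32[1, 2, 3]), 1)    # concrete Int
```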

@willow-ahrens (Collaborator) commented:

But I think we should merge this change because it strictly makes finch-tensor more robust. This is a bug fix, and we need to fix PlusOneVector in any case.

@willow-ahrens (Collaborator) commented:

Also, changes to the heuristic scheduler should be evaluated on the mean improvement across all benchmarks. There will certainly be performance regressions on individual benchmarks as we develop it, but we need to make these changes to eventually reach better performance across the variety of kernels.
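
As a concrete illustration of that evaluation strategy, here is a small sketch; the benchmark names and speedup numbers are invented, and the geometric mean is just one reasonable choice of aggregate for ratios:

```julia
using Statistics

# speedup = old_time / new_time per benchmark; > 1 means the new scheduler is faster.
speedups = Dict(
    "sddmm_csr"  => 0.9,   # an individual regression
    "spmv_csc"   => 1.4,
    "mttkrp_coo" => 1.2,
)

# The geometric mean treats a 2x speedup and a 2x slowdown as cancelling out.
geomean(xs) = exp(mean(log.(xs)))

println("geometric-mean speedup: ", round(geomean(collect(values(speedups))); digits = 3))
```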

@willow-ahrens (Collaborator) commented:

There's a one-line fix for the performance bug in finch-tensor/Finch.jl#567, with a 100x perf improvement.

@willow-ahrens (Collaborator) commented:

I bumped the version in there to 0.6.30.

@mtsokol merged commit a87481a into main on May 21, 2024 (5 checks passed).
@mtsokol deleted the wma-heuristic-fix branch on May 21, 2024 at 22:10.