
Add example notebook #709

Merged
merged 2 commits into main, Jun 26, 2024
Conversation

mtsokol
Collaborator

@mtsokol mtsokol commented Jun 21, 2024

Hi @willow-ahrens @hameerabbasi,

Here I add a short example notebook with benchmarks+plots for the latest alpha version released.

@mtsokol mtsokol self-assigned this Jun 21, 2024
@hameerabbasi
Collaborator

I'd love to see two things:

  1. Testing that a notebook can be executed (without checking the output). We can use e.g. nbmake for this. This is so the notebook doesn't go obsolete, and should be done in this PR.
  2. We can, as a follow-up, include notebooks in documentation. Maybe a task for the upcoming intern.
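Since nbmake ships as a pytest plugin, the check suggested in point 1 can be wired into CI with a single step. A minimal sketch of such a GitHub Actions step (the step name and the examples/ path are illustrative, not the repository's actual workflow):

```yaml
# Hypothetical CI step: execute every notebook under examples/ and fail
# the build if any cell raises, without comparing cell outputs.
- name: Test notebooks
  run: |
    pip install nbmake
    pytest --nbmake examples/
```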

Collaborator

@hameerabbasi hameerabbasi left a comment

Adding a notebook pre-commit hook, also extending the existing ruff lint to work for notebooks seems to be the only thing that's left here.

Edit: The ruff pre-commit hook already includes notebooks.

(review thread on ci/test_notebooks.sh — outdated, resolved)
@hameerabbasi
Collaborator

Does the notebook have large datasets/arrays or benchmarks? If so, we should remove those or convert them to be smaller for demonstration purposes, with deterministic data.
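On the deterministic-data point, seeding the generator is enough to make the demo inputs reproducible across runs. A minimal sketch using plain NumPy (pydata/sparse's `sparse.random` accepts a `random_state` argument for the same purpose; the shape and density here mirror the notebook's example):

```python
import numpy as np

# Build deterministic COO-style data: same seed -> same coords and values.
rng = np.random.default_rng(42)
shape, density = (100, 5), 0.08
nnz = int(shape[0] * shape[1] * density)  # 40 stored entries

coords = np.stack([rng.integers(0, s, size=nnz) for s in shape])
data = rng.random(nnz)

# A second generator with the same seed reproduces the data exactly.
rng2 = np.random.default_rng(42)
coords2 = np.stack([rng2.integers(0, s, size=nnz) for s in shape])
assert np.array_equal(coords, coords2)
```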

@willow-ahrens
Collaborator

Can we run benchmarks on large data without storing it in the notebook?

@willow-ahrens
Collaborator

also these notebooks look great Mateusz! Thanks!

@hameerabbasi
Collaborator

> Can we run benchmarks on large data without storing it in the notebook?

There are several in the examples/ folder, yes. Maybe we need a way to read the dimension length from an external env var so CI load is lower.
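The env-var idea can be as simple as reading the size with a fallback default, so CI exports a small value while local runs keep the full size. A sketch (the variable name `SPARSE_DEMO_SIDE` is made up for illustration):

```python
import os

# Read the benchmark size from the environment, falling back to the
# full size used for local runs; CI can export a much smaller value.
SIDE = int(os.environ.get("SPARSE_DEMO_SIDE", "1000"))
shape = (SIDE, SIDE)
print(f"benchmarking with shape {shape}")
```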

@hameerabbasi
Collaborator

Hmm. Seems like the notebook is flaky on CI. It failed in only one of the last two commits, with no changes.

@willow-ahrens
Collaborator

it looks like a very solid failure though, that csr is not meant to be (1, 0) ordered.

@hameerabbasi
Collaborator

> it looks like a very solid failure though, that csr is not meant to be (1, 0) ordered.

Well, asarray should convert to the right format if necessary.

@mtsokol
Collaborator Author

mtsokol commented Jun 24, 2024

Ok, so we have this code snippet from the notebook that uses the Finch backend:

```python
X = sparse.random((100, 5), density=0.08)  # creates a random COO matrix
X = sparse.asarray(X, format="csc")  # converts to CSC format
X_X = sparse.permute_dims(X, (1, 0)) @ X  # for me locally it densifies, as the result is: SwizzleArray(Tensor(Dense{Int64}(Dense{Int64}(Element{0.0, Float64, Int64}...

X_X.get_order()  # gives (1, 0) order, so I can only convert to CSR

X_X = sparse.asarray(X_X, format="csr")  # move back from dense to CSR format
```

So it looks like in the CI the result order of `permute_dims(X, (1, 0)) @ X` is sometimes `(0, 1)`, as that's the only explanation for this failure.

@willow-ahrens
Collaborator

Is this a finch version issue?

@hameerabbasi
Collaborator

> Ok, so we have this code snippet from the notebook that uses the Finch backend: […] So it looks like in the CI the result order of `permute_dims(X, (1, 0)) @ X` is sometimes `(0, 1)`, as that's the only explanation for this failure.

While the indeterminism here is definitely an issue, one other issue is that asarray should convert the format if necessary, similar to the order= or dtype= kwargs.

@hameerabbasi
Collaborator

> Is this a finch version issue?

We use the latest finch-tensor version, but I'm unsure whether that's behind, as it pins Finch.jl.

@hameerabbasi
Collaborator

@willow-ahrens Could it be that the output order depends on the data? sparse.random would produce different data/coordinates each time.

@willow-ahrens
Collaborator

It is possible to set the finch version we use with an environment variable. I am wondering whether Mateusz has an env var set locally?

@willow-ahrens
Collaborator

willow-ahrens commented Jun 24, 2024

Also, no, the output order and format should be reproducible. As far as I can tell, it always fails on CI so far?

@hameerabbasi
Collaborator

hameerabbasi commented Jun 24, 2024

@willow-ahrens It's indeterministic as to whether it passes: passing, failing, diff.

@willow-ahrens
Collaborator

Could the difference between these commits cause the problem?

@hameerabbasi
Collaborator

> Could the difference between these commits cause the problem?

Nope, it was just re-enabling a lint; it doesn't affect running Python code.

@willow-ahrens
Collaborator

I can't reproduce the flakiness locally (it always fails for me). I also get a sparse(sparse) format; I really do think something may also be up with our finch versions.

@mtsokol
Collaborator Author

mtsokol commented Jun 24, 2024

Hmm... now it passed with csr... I couldn't reproduce it locally on macOS or on a remote Linux machine.

But I think once we merge finch-tensor/finch-tensor-python#75 we don't need to worry about it anymore as any format will be convertible to csc/csr etc.

@willow-ahrens
Collaborator

I'll work towards reproducing this nondeterminism. Maybe there's a similar kernel that behaves differently. Could we start by running in verbose mode?

@willow-ahrens
Collaborator

willow-ahrens commented Jun 24, 2024 via email

@mtsokol
Collaborator Author

mtsokol commented Jun 24, 2024

> Would y’all be okay if I push a few test commits?

Sure!

@hameerabbasi
Collaborator

> Would y’all be okay if I push a few test commits?

I've just sent a collaborator invitation, if you accept it you should be able to push to Mateusz's branch.

@willow-ahrens
Collaborator

I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?

@mtsokol
Collaborator Author

mtsokol commented Jun 25, 2024

> I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?

The issue was that you defined X as a lazy variable and used it in sparse.compute(...), but then also in:

```python
b_hat = (inverted @ sparse.permute_dims(X, (1, 0))) @ y
```

without calling lazy on inverted or y, or calling compute. I updated it by defining X_lazy separately. Now you can execute the cell to get the execution plan.

The tensordot error comes from the fact that the arguments mix lazy and eager tensors here.

In the CI, notebooks are tested by executing the whole notebook: if the notebook passes, it's a ✅ without output; if there's an error, it reports the stack trace and points to the cell that failed.
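The lazy/eager mixing failure can be illustrated with a toy wrapper (this is not the finch-tensor API, just a sketch of the general rule: every operand in a lazy expression must itself be lifted into the lazy world):

```python
# Toy lazy wrapper: records an expression tree instead of computing.
class Lazy:
    def __init__(self, expr):
        self.expr = expr

    def __matmul__(self, other):
        # Combining a lazy operand with an eager one has no well-defined
        # meaning until everything is lifted into the lazy world.
        if not isinstance(other, Lazy):
            raise TypeError("cannot mix lazy and eager operands")
        return Lazy(("matmul", self.expr, other.expr))

X_lazy = Lazy("X")
plan = X_lazy @ Lazy("y")  # fine: both operands are lazy
try:
    X_lazy @ "y"  # eager operand leaks into a lazy expression
except TypeError as exc:
    print(exc)
```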

@mtsokol
Collaborator Author

mtsokol commented Jun 25, 2024

@hameerabbasi I'm not sure about using nbmake. Right now it fails with an internal nbmake error (https://github.com/pydata/sparse/actions/runs/9661787940/job/26650246800?pr=709) after upgrading finch-tensor to the latest patch (the version that passes everywhere else). I tried to debug it but didn't find anything informative. Is there an nbmake alternative? Maybe we could merge this notebook without running it in the CI.

@hameerabbasi
Collaborator

> @hameerabbasi I'm not sure about using nbmake. […] Maybe we could merge this notebook without running it in the CI.

Yes let's do that for now. I'll look for alternatives.

@mtsokol
Collaborator Author

mtsokol commented Jun 25, 2024

👍 Let me squash git history and hide that CI job for now.

@mtsokol
Collaborator Author

mtsokol commented Jun 25, 2024

@hameerabbasi Ok, I figured out what the issue was: an out-of-memory error. I commented out the largest test configurations in MTTKRP (1000x1000x1000). Let me clean up the notebook and squash.

I think we can stick with nbmake, but it could have reported the OOM more explicitly.

Collaborator

@hameerabbasi hameerabbasi left a comment

Thanks for all the work and investigations, @mtsokol!

@mtsokol
Collaborator Author

mtsokol commented Jun 25, 2024

@willow-ahrens You can force-pull the branch (I squashed commits), and the notebook should run and print the execution plan in sparse.compute(..., verbose=True).

I released a new finch-tensor, so you also need pip install --upgrade finch-tensor in your env to convert freely between formats.

@willow-ahrens
Collaborator

wait, I'm not sure why the oom on mttkrp is related to the nondeterministic transposition order of X

@hameerabbasi
Collaborator

> wait, I'm not sure why the oom on mttkrp is related to the nondeterministic transposition order of X

Would you prefer that be resolved in this PR? AFAICT, the code here has no issues related to that. We can, of course, file an issue to track it if you prefer.

@willow-ahrens
Collaborator

I'm happy to resolve in another PR! I'd like to understand though whether we have an explanation for that behavior, and whether it is still observed here. Is it explained by the OOM, for example? Was it just the finch-tensor version all along? Perhaps I can try to reproduce the problem with a simple test in finch-tensor.

@willow-ahrens
Collaborator

I was confused because it didn't seem like we found the original source of nondeterminism, and I wanted to check whether y'all felt like you had found it.

@mtsokol
Collaborator Author

mtsokol commented Jun 26, 2024

> wait, I'm not sure why the oom on mttkrp is related to the nondeterministic transposition order of X

@willow-ahrens They're unrelated. There's an issue where X.T @ X returned a different result order locally on my machine than in the CI. This caused sparse.asarray(arr, format="csr") to fail in the CI, as arr had a different order than CSR required. I implemented a fix in finch-tensor that refines order handling and passes SwizzleArrays to copyto! when we change storage; therefore any format can now be converted to any other format, as copyto! accepts SwizzleArrays as source and destination.

But there's an issue where copyto!(dense_source, any_target) also copies zeros, as if they were non-zero values, so for dense source formats I needed to add dropfills for this case, which creates another copy.

This fixed calling sparse.asarray(arr, format="csr") for any format of arr, but in MTTKRP we call sparse.asarray(dense, format="csf"), which for the largest dimension configuration caused an OOM, as dense was copied twice (in copyto! and dropfills). I disabled running all configurations in the CI (we only want to make sure the notebook generally runs), and now the job passes.
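For scale, a back-of-the-envelope check of why the 1000x1000x1000 configuration hits the CI memory limit when the dense tensor is materialized twice (assuming float64 elements; standard GitHub-hosted runners have roughly 7 GB of RAM):

```python
# One dense 1000x1000x1000 float64 tensor already exceeds a ~7 GB CI
# runner; copying it twice (copyto! plus dropfills) roughly doubles that.
shape = (1000, 1000, 1000)
bytes_per_elem = 8  # float64

one_copy_gib = shape[0] * shape[1] * shape[2] * bytes_per_elem / 2**30
print(f"one dense copy:   {one_copy_gib:.2f} GiB")    # ~7.45 GiB
print(f"two dense copies: {2 * one_copy_gib:.2f} GiB")  # ~14.90 GiB
```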

@willow-ahrens
Collaborator

feel free to merge though, you're right that this can go through.

@willow-ahrens
Collaborator

@mtsokol thanks! Yes, I think dropfills should probably support transposition like copyto! does. I'll add a PR.

@mtsokol
Collaborator Author

mtsokol commented Jun 26, 2024

Thank you! My two comments in finch-tensor/Finch.jl#609 describe these two issues.

@willow-ahrens
Collaborator

Also, it's good to know that the nondeterminism was related to the Finch version. I'll add a more direct test for this to finch-tensor so we can see if it ever fails again nondeterministically in CI.

@hameerabbasi hameerabbasi merged commit a73b20d into main Jun 26, 2024
14 checks passed
@hameerabbasi hameerabbasi deleted the example-notebook branch June 26, 2024 09:26
@hameerabbasi
Collaborator

Merging, thanks for the follow-ups @willow-ahrens and @mtsokol!
