Migrate from RAFT to CUVS #3549

Open · wants to merge 142 commits into base: main
Conversation

@tarang-jain (Contributor) commented on Jun 25, 2024:

Remove the dependency on raft::compiled and modify the GPU implementations to use the cuVS backend in place of RAFT.

A deeper look at the dependency:
FAISS gets ANN algorithm implementations such as IVF-Flat and IVF-PQ from cuVS. RAFT is a lightweight, header-only C++ template library that cuVS relies on for the more fundamental, low-level utilities: RAFT's device mdarray and mdspan objects, the RAFT resources object (raft::resources) that takes care of stream ordering for device functions, and linear algebra primitives such as map, reduce, and BLAS routines. Many cuVS functions take RAFT mdspan objects as arguments (for example raft::device_matrix_view), so FAISS relies on both cuVS and RAFT: it gets the RAFT headers through cuVS and uses them to construct the arguments that cuVS consumes. Note that FAISS is not explicitly linked against raft::raft or raft::compiled; only the required headers are included and compiled, rather than the whole RAFT shared library. This is why mentions of RAFT still appear in FAISS.
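
To make the division of labor concrete, here is a minimal sketch of the pattern: FAISS-side code wraps a raw device pointer in a RAFT mdspan view and hands it to a cuVS builder. This is illustrative only, not the actual FAISS code; the function and parameter names here are assumptions based on the public RAFT and cuVS C++ headers.

```cpp
// Illustrative sketch (not the actual FAISS code): a raw device pointer owned
// by the caller is wrapped in a non-owning RAFT mdspan view, which is then
// passed to a cuVS builder. Signatures assumed from the RAFT/cuVS headers.
#include <raft/core/device_mdspan.hpp>
#include <raft/core/device_resources.hpp>

#include <cuvs/neighbors/ivf_flat.hpp>

cuvs::neighbors::ivf_flat::index<float, int64_t> build_ivf_flat(
        raft::device_resources& res,  // RAFT handle: stream ordering, workspace
        const float* d_vectors,       // device pointer owned by the caller
        int64_t n_rows,
        int64_t dim) {
    // Non-owning view over the device data; no copy is made.
    auto dataset = raft::make_device_matrix_view<const float, int64_t>(
            d_vectors, n_rows, dim);

    cuvs::neighbors::ivf_flat::index_params params;
    params.n_lists = 1024;  // number of IVF cells (illustrative value)

    // cuVS consumes the RAFT view; nothing here links raft::compiled.
    return cuvs::neighbors::ivf_flat::build(res, params, dataset);
}
```

Only the RAFT headers needed to build these views are pulled in; the heavy lifting is compiled inside cuVS.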

@tarang-jain (Contributor, Author):

@asadoughi I was able to resolve the segfault. It was related to some bugs in the cuVS code paths of bfKnn.
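
For context, a minimal sketch of a brute-force kNN call that exercises the cuVS-backed path of bfKnn, which is where the segfault surfaced. Field names follow faiss/gpu/GpuDistance.h; the use_cuvs flag is assumed to be the switch this migration introduces for routing to the cuVS backend.

```cpp
// Minimal sketch: brute-force kNN through faiss::gpu::bfKnn on the cuVS path.
// Assumes the use_cuvs flag added in this migration; all pointers must be
// valid float/idx_t arrays of the stated shapes.
#include <faiss/MetricType.h>
#include <faiss/gpu/GpuDistance.h>
#include <faiss/gpu/StandardGpuResources.h>

void run_bfknn(
        const float* xb, int nb,      // database vectors, nb x d
        const float* xq, int nq,      // query vectors, nq x d
        int d, int k,
        float* out_distances,         // nq x k
        faiss::idx_t* out_indices) {  // nq x k
    faiss::gpu::StandardGpuResources res;

    faiss::gpu::GpuDistanceParams args;
    args.metric = faiss::METRIC_L2;
    args.k = k;
    args.dims = d;
    args.vectors = xb;
    args.numVectors = nb;
    args.queries = xq;
    args.numQueries = nq;
    args.outDistances = out_distances;
    args.outIndices = out_indices;
    args.use_cuvs = true;  // assumed flag name: route through the cuVS backend

    faiss::gpu::bfKnn(&res, args);
}
```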

@mfoerste4 left a comment:

@tarang-jain, as discussed offline, I added the changes required to get rid of the additional overhead and achieve performance for FAISS+cuVS comparable to plain cuVS.

Review threads (resolved) on:
faiss/gpu/GpuDistance.cu (outdated)
faiss/gpu/GpuIndex.cu (outdated)
faiss/gpu/StandardGpuResources.cpp
@facebook-github-bot (Contributor):

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -79,9 +79,9 @@ if(FAISS_ENABLE_GPU)
endif()
endif()

if(FAISS_ENABLE_RAFT AND NOT TARGET raft::raft)
find_package(raft COMPONENTS compiled distributed)
Contributor:

Is RAFT still needed for compilation, or do we need to install rmm explicitly?

https://github.com/facebookresearch/faiss/actions/runs/10889999572/job/30218385769?pr=3549

The link interface of target "cuvs::cuvs" contains:

    rmm::rmm

  but the target was not found.  Possible reasons include:

@tarang-jain (Contributor, Author):
rmm and RAFT should ideally be available transitively through cuVS, but I have also tried explicitly linking raft::raft.

@mfoerste4 mentioned this pull request on Sep 18, 2024.
@asadoughi (Contributor):

It looks like one test is still failing: TestTorchUtilsGPU.test_train_add_with_ids. Have you had a chance to troubleshoot?

@tarang-jain (Contributor, Author):

@asadoughi yes, I have been looking into it. It is quite strange. The following order of operations gives erroneous results:
create a GPU torch tensor --> train a GPU IVF-Flat index on it --> search the index --> reset the index --> train the index again with the same data, but this time as a CPU tensor --> search the index.

However, if both tensors are on the same device (either CPU or GPU), there is no failure and everything works fine.
