Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Support pygloo in collective communication #38

Merged
merged 18 commits into from
Jul 13, 2023
Merged

FEAT: Support pygloo in collective communication #38

merged 18 commits into from
Jul 13, 2023

Conversation

YibinLiu666
Copy link
Contributor

@YibinLiu666 YibinLiu666 commented Jul 5, 2023

What do these changes do?

Related issue number

Related #22

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass

@XprobeBot XprobeBot added this to the v0.0.6 milestone Jul 5, 2023
@ChengjieLi28 ChengjieLi28 changed the title FEATURE: Support pygloo in collective communication FEAT: Support pygloo in collective communication Jul 5, 2023
@codecov
Copy link

codecov bot commented Jul 5, 2023

Codecov Report

Merging #38 (02869a8) into main (892e1c5) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
- Coverage   93.85%   93.78%   -0.07%     
==========================================
  Files          43       42       -1     
  Lines        3399     3361      -38     
  Branches      675      672       -3     
==========================================
- Hits         3190     3152      -38     
  Misses        138      138              
  Partials       71       71              
Flag Coverage Δ
unittests 93.63% <ø> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

@XprobeBot XprobeBot modified the milestones: v0.0.6, v0.0.9 Jul 7, 2023
ChengjieLi28
ChengjieLi28 previously approved these changes Jul 7, 2023
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
cpp/collective/gloo/src/rendezvous.cc Outdated Show resolved Hide resolved
python/xoscar/collective/gloo/xoscar_pygloo.pyi Outdated Show resolved Hide resolved
python/xoscar/collective/gloo/xoscar_pygloo.pyi Outdated Show resolved Hide resolved
python/xoscar/collective/test/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/test/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/test/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/gloo/test/__init__.py Outdated Show resolved Hide resolved
@ChengjieLi28
Copy link
Contributor

Also, we need to support all_to_all bind.

@YibinLiu666
Copy link
Contributor Author

Also, we need to support all_to_all bind.

Implemented

CMakeLists.txt Outdated Show resolved Hide resolved
python/xoscar/tests/core.py Outdated Show resolved Hide resolved
python/xoscar/tests/core.py Outdated Show resolved Hide resolved
Copy link
Contributor

@codingl2k1 codingl2k1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep only one Store type and wrap the TCPStore to PrefixStore directly.

CMakeLists.txt Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
cpp/collective/gloo/src/rendezvous.cc Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
python/xoscar/collective/tests/test_pygloo_tcp_store.py Outdated Show resolved Hide resolved
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/collective/rendezvous/src/tcp_store.cpp Outdated Show resolved Hide resolved
cpp/collective/rendezvous/src/tcp_store.cpp Outdated Show resolved Hide resolved
cpp/collective/rendezvous/src/tcp_store.cpp Outdated Show resolved Hide resolved

std::vector<T> inputbuf(size);

memcpy(inputbuf.data(), input_ptr, size * sizeof(T));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mem copy required by gloo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gloo ReduceScatterHalvingDoubling algorithm accepts a vector with just one ptr: facebookincubator/gloo#303
I am not sure if we can remove this copy in the future.

@@ -1313,4 +1314,16 @@ void TCPStore::multiSet(const std::vector<std::string> &keys,

bool TCPStore::hasExtendedApi() const { return true; }

void TCPStore::set(const std::string &key, const std::vector<char> &data) {
std::vector<uint8_t> dataSet(data.begin(), data.end());
Copy link
Contributor

@codingl2k1 codingl2k1 Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here copying data to cast the vector<char> to vector<uint8_t>. But the send function accepts a const char * type here https://github.com/xorbitsai/xoscar/blob/main/cpp/collective/rendezvous/include/utils.hpp#L121:

::send(socket, (const char *) currentBytes, bytesToSend, flags)

We should unify the data types to char* to avoid copying data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here copying data to cast the vector<char> to vector<uint8_t>. But the send function accepts a const char * type here https://github.com/xorbitsai/xoscar/blob/main/cpp/collective/rendezvous/include/utils.hpp#L121:

::send(socket, (const char *) currentBytes, bytesToSend, flags)

We should unify the data types to char* to avoid copying data.

See issue pybind/pybind11#1807 in pybind11. Copying data here is just for pybind11.

Copy link
Contributor

@codingl2k1 codingl2k1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify mergify bot merged commit 7db36a2 into xorbitsai:main Jul 13, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants