Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: adding mcast allgather staging based algo #994

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Jun 27, 2024

What

add the algorithm for mcast-based allgather

Why

scalability and performance improvement over sw based allgather

How ?

Realizing the Allgather operations as N (team-size) concurrent Bcast operations (every process becomes the root). We use the one-sided design that was merged before for its reliability.

@MamziB MamziB requested review from samnordmann and Sergei-Lebedev and removed request for samnordmann June 27, 2024 18:54
@MamziB MamziB self-assigned this Jun 27, 2024
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have time to review everything yet, but here is a first quick round of review. I see several natural ways to make the PR smaller -- more than 1k lines is a large pr, so please try to break it down whenever possible.

src/components/tl/mlx5/tl_mlx5.c Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5.h Outdated Show resolved Hide resolved
@samnordmann
Copy link
Collaborator

The CI has relevant issues

@MamziB
Copy link
Collaborator Author

MamziB commented Jul 1, 2024

Hi @samnordmann Thanks for the constructive comments. I have addressed those and added a new commit (second commit). Please take a look and let me know your thoughts.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing the comment! Here is another round

src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
@MamziB
Copy link
Collaborator Author

MamziB commented Jul 3, 2024

Thanks @samnordmann for today's comments. I have added a new separate commit. Please take a look. Also, I have yet to cut something from this PR to make it shorter.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing the comments

@MamziB
Copy link
Collaborator Author

MamziB commented Jul 3, 2024

#994 (comment)
@samnordmann I would rather bail out gracefully, so let me throw an error if mcast req is NULL. Please look at the last commit.

@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 711f798 to 61b99b6 Compare July 3, 2024 22:05
src/components/tl/mlx5/tl_mlx5_coll.h Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5.c Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 13e6845 to 75576ee Compare July 16, 2024 18:23
@MamziB
Copy link
Collaborator Author

MamziB commented Jul 16, 2024

Thank you Sergey and Sam. All the comments have been resolved.

@MamziB
Copy link
Collaborator Author

MamziB commented Aug 20, 2024

@ Sergei-Lebedev can you also look at this PR and let me know if you have more comments? Thank you, Sergey!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants