
Cleans tensorParallelDegree with MultiDevice #1222

Open

zachgk wants to merge 1 commit into master
Conversation

@zachgk (Contributor) commented Oct 26, 2023

This is a refactor to simplify the handling of the tensor parallel degree. Before, it was read independently in three or more places in the code, and the behavior determining the tpDegree was hard to follow. This moves the reading to a single place and represents the result as a MultiDevice. It also makes the behavior fully visible: a worker group shows up with the tensor parallel devices that its workers will be using.
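A minimal sketch of the idea described above, assuming hypothetical names: `TpDeviceResolver`, the nested `MultiDevice` stand-in, and the property handling below are illustrations, not the actual DJL API or this PR's code.

```java
// Sketch: read tensor_parallel_degree once and expand it into the set of
// devices a worker group will use, instead of re-reading it in each caller.
// The MultiDevice class here is a stand-in for the real ai.djl type.
import java.util.ArrayList;
import java.util.List;

public final class TpDeviceResolver {

    /** Minimal stand-in for a multi-device container. */
    public static final class MultiDevice {
        private final List<String> devices;

        public MultiDevice(List<String> devices) {
            this.devices = devices;
        }

        public List<String> getDevices() {
            return devices;
        }

        @Override
        public String toString() {
            return devices.toString();
        }
    }

    /**
     * Resolves the tensor parallel degree in one place, then returns the
     * worker group's devices as a single MultiDevice.
     */
    public static MultiDevice resolve(String tpDegreeProp, int baseGpuId) {
        int tpDegree = tpDegreeProp == null ? 1 : Integer.parseInt(tpDegreeProp);
        List<String> devices = new ArrayList<>(tpDegree);
        for (int i = 0; i < tpDegree; i++) {
            devices.add("gpu" + (baseGpuId + i));
        }
        return new MultiDevice(devices);
    }

    public static void main(String[] args) {
        // e.g. tensor_parallel_degree=4 starting at gpu0 -> [gpu0, gpu1, gpu2, gpu3]
        System.out.println(resolve("4", 0));
    }
}
```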

@zachgk zachgk requested review from frankfliu and a team as code owners October 26, 2023 00:58
@sindhuvahinis (Contributor) commented
Now that we have MultiDevice, getVisibleDevices here could make use of this as well, right? https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/src/main/java/ai/djl/python/engine/Connection.java#L207

@zachgk (Contributor, Author) commented Oct 27, 2023

> Now that we have MultiDevice, getVisibleDevices here could make use of this as well, right? https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/src/main/java/ai/djl/python/engine/Connection.java#L207

Yeah. The engine would already have the list of devices that should be part of the connection. It doesn't use CUDA_VISIBLE_DEVICES, though.

Also, I talked with @frankfliu, and we may save this PR for post-re:Invent, as it is hard to get the behavior right and we don't test every case, even in integration. That said, I did add a bunch of tests, which may be sufficient.
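A hedged sketch of the getVisibleDevices suggestion quoted above: deriving the CUDA_VISIBLE_DEVICES value for a worker's connection from the device list the engine already holds. The helper name and the `List<Integer>` shape are assumptions for illustration, not the actual Connection.java signature.

```java
// Sketch: join the device ids the engine already knows about into a
// CUDA_VISIBLE_DEVICES-style string, rather than re-reading configuration.
import java.util.List;
import java.util.stream.Collectors;

public final class VisibleDevices {

    /** Joins GPU device ids into a CUDA_VISIBLE_DEVICES-style string. */
    static String getVisibleDevices(List<Integer> deviceIds) {
        return deviceIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // A worker group spanning gpus 2..5 -> "2,3,4,5"
        System.out.println(getVisibleDevices(List.of(2, 3, 4, 5)));
    }
}
```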
