
Add loadgen over the network support for bert (onnxruntime and pytorch) #1524

Merged: 14 commits into mlcommons:master on Dec 12, 2023

Conversation

arjunsuresh
Contributor

No description provided.

@arjunsuresh arjunsuresh requested a review from a team as a code owner November 25, 2023 12:24

github-actions bot commented Nov 25, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@nv-jinhosuh
Contributor

Thanks @arjunsuresh for the work. As we discussed briefly, I wonder if we could move from the current design of the QDL calling the SUT to having a single QDL per transport, with the SUT calling the QDL to 1) wait for incoming requests and 2) respond with inference results. The QDL could expose calls like wait_for_requests() and respond_back() (or reuse the current callable names) for the SUT to use. This would let the QDL act as an API to the LON/SUT, and we could add different QDLs (ethernet socket, InfiniBand, etc.) alongside the current RESTful QDL.
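
For illustration, a minimal sketch of the per-transport QDL interface proposed here, assuming a queue-backed base class; the method names wait_for_requests() and respond_back() come from this comment, while the class names and queue mechanics are assumptions rather than code from this PR.

# Sketch only: one QDL per transport, driven by the SUT through two calls.
import queue

class BaseNetworkQDL:
    def __init__(self):
        self._incoming = queue.Queue()

    def wait_for_requests(self, timeout=None):
        # Block until the transport delivers a query (sample id + features).
        return self._incoming.get(timeout=timeout)

    def respond_back(self, query_id, result):
        # Send the inference result back over the same transport.
        raise NotImplementedError

class RestQDL(BaseNetworkQDL):
    def respond_back(self, query_id, result):
        # e.g. POST the result back to the LON node's callback endpoint.
        ...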

@arjunsuresh
Contributor Author

Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.
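
For comparison, a rough sketch of the server-style design described in this reply: the SUT runs as an HTTP server and the QDL posts queries to it. The Flask app, the /predict route, and the field names are illustrative assumptions and may differ from the demo code; a QDL then only needs plain HTTP POSTs, as in the snippet reviewed later in this thread.

# Sketch only: SUT side acting as a server; the QDL on the LON node POSTs queries to it.
from flask import Flask, jsonify, request

app = Flask(__name__)

def backend_predict(query):
    # Stub standing in for the real onnxruntime/pytorch inference call.
    return query

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify({"result": backend_predict(payload["query"])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)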

input_mask = eval_features.input_mask
segment_ids = eval_features.segment_ids

if self.quantized:
Contributor


Maybe better to quantize in QSL?

Contributor Author


In the reference implementation, quantization is applicable only to onnxruntime, and the QSL implementation is shared by all the backends; that is probably why quantization is kept in the onnxruntime backend code. This code is shared by the network SUT and by the non-network onnxruntime backend.

@liorkhe
Contributor

liorkhe commented Dec 4, 2023

Thank you Arjun,
can you add a few words in the conversation on the testing done on the code in network and non-network modes?
Lior

@liorkhe
Contributor

liorkhe commented Dec 4, 2023

> Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.

If I understand correctly, the last point is that the current demo waits for the response in each thread, so receive and transmit are not two independent tasks that can queue separately and allow the system to reach its full performance. Let's discuss and clarify in our call.

@arjunsuresh
Contributor Author

> Thank you Arjun, can you add a few words in the conversation on the testing done on the code in network and non-network modes? Lior

Sure Lior. I'll do that shortly.

@arjunsuresh
Contributor Author

> Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.

> If I understand correctly, the last point is that the current demo waits for the response in each thread, so receive and transmit are not two independent tasks that can queue separately and allow the system to reach its full performance. Let's discuss and clarify in our call.

Yes Lior. This part is taken as is from the shared LON demo.

@nv-jinhosuh
Contributor

LGTM. @arjunsuresh It might be a good idea to put the CM_MAX_NUM_THREADS env var in the readme file. :)
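
For context, a hypothetical illustration of how a server-side worker pool might read such an env var; only the variable name comes from this comment, and the default and thread-pool usage are assumptions.

import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: cap the worker pool at CM_MAX_NUM_THREADS, falling back to the CPU count.
max_workers = int(os.environ.get("CM_MAX_NUM_THREADS", os.cpu_count() or 1))
executor = ThreadPoolExecutor(max_workers=max_workers)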

@arjunsuresh
Contributor Author

How to Run Loadgen over the Network

The CM command below will launch the SUT server:

cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch  \
--rerun --adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 --network=sut

Once the SUT server is launched, the command below can be run on the loadgen node to issue queries to the SUT nodes. In this command, --sut_servers has just the localhost address; it can be changed to a comma-separated list of any hostnames/IPs in the network.

cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch  --rerun \
--adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000  \
--sut_servers,=http://localhost:8000 --network=lon

If you are not using CM, just add --network=sut to your normal run command on the SUT side.
On the loadgen node, add the --network=lon option and --sut_server <IP1> <IP2> to the normal command to connect to SUT nodes at IP addresses IP1, IP2, etc.

Loadgen over the network works for the onnxruntime and pytorch backends. For onnxruntime, below are the numbers on a single Nvidia RTX 4090 GPU for the Offline scenario:

Native run: 189 QPS
LON run: 180 QPS

@arjunsuresh
Contributor Author

Thank you @nv-jinhosuh for your feedback. I have now added it to the README.

response = requests.post(url, json={'query': query, 'id': id})
return response.json()['result']
responses = []
response = requests.post(url, json={'query': query})
Contributor


Thank you Arjun for the update of the code.
If I am not mistaken, there can now be as many concurrent POSTs to the server as there are threads.
As we discussed in the Network call, this should increase performance and bring it close to the non-network performance of the current test.

I think another optimization could use an asynchronous API with async/await patterns, so a thread is not blocked and can queue multiple posts. This might be needed when implementing the additional rate optimizations we discussed yesterday (batching in the server rather than the client). Let's consider that improvement if we see a network rate limit. I have not tried it, and there may be other methods, so I am not certain.
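
To make the async/await idea concrete, a rough sketch assuming an aiohttp client and a JSON API shaped like the snippet above; the endpoint, field names, and helper names are illustrative, not code from this PR.

import asyncio
import aiohttp

async def post_query(session, url, query, query_id):
    # One non-blocking POST; the event loop can interleave many of these per thread.
    async with session.post(url, json={"query": query, "id": query_id}) as resp:
        body = await resp.json()
        return body["result"]

async def post_many(url, queries):
    async with aiohttp.ClientSession() as session:
        tasks = [post_query(session, url, q, i) for i, q in enumerate(queries)]
        return await asyncio.gather(*tasks)

# Example usage: results = asyncio.run(post_many("http://localhost:8000/predict", list_of_queries))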

Contributor

@pgmpablo157321 left a comment


@arjunsuresh Can you update this branch too?

@arjunsuresh
Contributor Author

Done @pgmpablo157321

@pgmpablo157321 pgmpablo157321 merged commit fea5b14 into mlcommons:master Dec 12, 2023
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2023