
Add loadgen over the network support for bert (onnxruntime and pytorch) #1524

Merged: 14 commits into mlcommons:master on Dec 12, 2023

Conversation

arjunsuresh
Contributor

No description provided.

@arjunsuresh arjunsuresh requested a review from a team as a code owner November 25, 2023 12:24

github-actions bot commented Nov 25, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@nv-jinhosuh
Contributor

Thanks @arjunsuresh for the work. As we discussed briefly, I wonder if we could move from the current design of the QDL calling the SUT to having a single QDL per transport, with the SUT calling the QDL to 1) wait for incoming requests and 2) respond with inference results. The QDL could expose calls like wait_for_requests() and respond_back() (or reuse the current callable names) for the SUT to use. This would let the QDL act as an API to the LON/SUT, and we could add different QDLs (ethernet socket, InfiniBand, etc.) alongside the current RESTful QDL.
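
For illustration, a minimal sketch of the per-transport QDL interface proposed here, assuming a queue-backed base class; the method names wait_for_requests() and respond_back() come from this comment, while the class names and queue mechanics are assumptions rather than code from this PR.

# Sketch only: one QDL per transport, driven by the SUT through two calls.
import queue

class BaseNetworkQDL:
    def __init__(self):
        self._incoming = queue.Queue()

    def wait_for_requests(self, timeout=None):
        # Block until the transport delivers a query (sample id + features).
        return self._incoming.get(timeout=timeout)

    def respond_back(self, query_id, result):
        # Send the inference result back over the same transport.
        raise NotImplementedError

class RestQDL(BaseNetworkQDL):
    def respond_back(self, query_id, result):
        # e.g. POST the result back to the LON node's callback endpoint.
        ...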

@arjunsuresh
Contributor Author

Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.
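
For comparison, a rough sketch of the server-style design described in this reply: the SUT runs as an HTTP server and the QDL posts queries to it. The Flask app, the /predict route, and the field names are illustrative assumptions and may differ from the demo code; a QDL then only needs plain HTTP POSTs, as in the snippet reviewed later in this thread.

# Sketch only: SUT side acting as a server; the QDL on the LON node POSTs queries to it.
from flask import Flask, jsonify, request

app = Flask(__name__)

def backend_predict(query):
    # Stub standing in for the real onnxruntime/pytorch inference call.
    return query

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify({"result": backend_predict(payload["query"])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)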

input_mask = eval_features.input_mask
segment_ids = eval_features.segment_ids

if self.quantized:
Contributor


Maybe better to quantize in QSL?

Contributor Author


In the reference implementation, quantization is applicable only to onnxruntime, and the QSL implementation is shared by all the backends; that is probably why quantization is kept in the onnxruntime backend code. This code is shared by the network SUT and by the non-network onnxruntime backend.

@liorkhe
Contributor

liorkhe commented Dec 4, 2023

Thank you Arjun,
can you add a few words in the conversation on the testing done on the code in network and non-network modes?
Lior

@liorkhe
Contributor

liorkhe commented Dec 4, 2023

> Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.

If I understand correctly, the last point is that the current demo waits for the response in each thread, so receive and transmit are not two independent tasks that can queue separately and allow the system to reach its full performance. Let's discuss and clarify in our call.

@arjunsuresh
Contributor Author

> Thank you Arjun, can you add a few words in the conversation on the testing done on the code in network and non-network modes? Lior

Sure Lior. I'll do that shortly.

@arjunsuresh
Contributor Author

> Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL as opposed to the SUT acting as a server and the QDL sending queries to it, as in the demo code. We can still provide different network implementations by supplying different implementations of the network SUT, with corresponding communication changes in the QDL.

> If I understand correctly, the last point is that the current demo waits for the response in each thread, so receive and transmit are not two independent tasks that can queue separately and allow the system to reach its full performance. Let's discuss and clarify in our call.

Yes Lior. This part is taken as is from the shared LON demo.

@nv-jinhosuh
Contributor

LGTM. @arjunsuresh It might be a good idea to put the CM_MAX_NUM_THREADS env var in the readme file. :)
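
For context, a hypothetical illustration of how a server-side worker pool might read such an env var; only the variable name comes from this comment, and the default and thread-pool usage are assumptions.

import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: cap the worker pool at CM_MAX_NUM_THREADS, falling back to the CPU count.
max_workers = int(os.environ.get("CM_MAX_NUM_THREADS", os.cpu_count() or 1))
executor = ThreadPoolExecutor(max_workers=max_workers)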

@arjunsuresh
Contributor Author

How to Run Loadgen over the Network

The CM command below will launch the SUT server:

cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch  \
--rerun --adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 --network=sut

Once the SUT server is launched, the command below can be run on the loadgen node to issue queries to the SUT nodes. In this command, --sut_servers has just the localhost address; it can be changed to a comma-separated list of any hostnames/IPs in the network.

cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch  --rerun \
--adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000  \
--sut_servers,=http://localhost:8000 --network=lon

If you are not using CM, just add --network=sut to your normal run command on the SUT side.
On the loadgen node, add the --network=lon option and --sut_server <IP1> <IP2> to the normal command to connect to SUT nodes at IP addresses IP1, IP2, etc.

Loadgen over the network works for the onnxruntime and pytorch backends. For onnxruntime, below are the numbers on a single Nvidia RTX 4090 GPU for the Offline scenario:

Native run: 189 QPS
LON run: 180 QPS

@arjunsuresh
Contributor Author

Thank you @nv-jinhosuh for your feedback. I have now added it to the README.

response = requests.post(url, json={'query': query, 'id': id})
return response.json()['result']
responses = []
response = requests.post(url, json={'query': query})
Contributor


Thank you Arjun for the update of the code.
If I am not mistaken, there can now be as many concurrent POSTs to the server as there are threads.
As we discussed in the Network call, this should increase performance and bring it close to the non-network performance of the current test.

I think another optimization could use an asynchronous API with async/await patterns, so a thread is not blocked and can queue multiple posts. This might be needed when implementing the additional rate optimizations we discussed yesterday (batching in the server rather than the client). Let's consider that improvement if we see a network rate limit. I have not tried it, and there may be other methods, so I am not certain.
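
To make the async/await idea concrete, a rough sketch assuming an aiohttp client and a JSON API shaped like the snippet above; the endpoint, field names, and helper names are illustrative, not code from this PR.

import asyncio
import aiohttp

async def post_query(session, url, query, query_id):
    # One non-blocking POST; the event loop can interleave many of these per thread.
    async with session.post(url, json={"query": query, "id": query_id}) as resp:
        body = await resp.json()
        return body["result"]

async def post_many(url, queries):
    async with aiohttp.ClientSession() as session:
        tasks = [post_query(session, url, q, i) for i, q in enumerate(queries)]
        return await asyncio.gather(*tasks)

# Example usage: results = asyncio.run(post_many("http://localhost:8000/predict", list_of_queries))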

Contributor

@pgmpablo157321 left a comment


@arjunsuresh Can you update this branch too?

@arjunsuresh
Contributor Author

Done @pgmpablo157321

@pgmpablo157321 pgmpablo157321 merged commit fea5b14 into mlcommons:master Dec 12, 2023
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2023