Add loadgen over the network support for bert (onnxruntime and pytorch) #1524
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Thanks @arjunsuresh for the work. As we discussed briefly, I wonder if we could move from the current design of the QDL calling the SUT to having a single QDL per transport, with the SUT calling the QDL for 1) waiting for incoming requests and 2) responding with inference results. The QDL could expose calls like wait_for_requests() and respond_back() (or reuse the current callable names) for the SUT to use. This would let the QDL act as an API to the LON/SUT, and we could add different QDLs (ethernet socket, InfiniBand, etc.) alongside the current RESTful QDL.
Thank you @nv-jinhosuh for your suggestions. I have now unified the QDLs for onnxruntime and pytorch. But I did not fully understand the benefit of the SUT calling the QDL, as opposed to the SUT acting as a server and the QDL sending queries to it (as in the demo code). We can provide different network implementations by giving different implementations of the network SUT, and this will need corresponding communication changes in the QDL too.
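For readers following along, here is a minimal sketch of the design Arjun describes, where the network SUT runs an HTTP server and the QDL posts queries to it. The Flask framework, the /predict endpoint, the port, and the payload fields are assumptions for illustration, not the exact code in this PR.

```python
# Minimal sketch of a network SUT acting as a server (assumed framework,
# endpoint, and payload; the real PR code may differ).
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(query):
    # Placeholder for the real backend call (onnxruntime or pytorch).
    return {"logits": []}

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    result = run_inference(payload["query"])
    # Echo the query id back so the QDL can match responses to samples.
    return jsonify({"id": payload["id"], "result": result})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

Under this design, swapping the transport means replacing both this server and the matching client code in the QDL, which is the coupling the comment above points at.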
input_mask = eval_features.input_mask
segment_ids = eval_features.segment_ids

if self.quantized:
Maybe better to quantize in QSL?
In the reference implementations, quantization is applicable only to onnxruntime, while the QSL implementation is shared by all the backends; that is probably why quantization is kept in the onnxruntime backend code. This code is shared by the network SUT and also by the non-network onnxruntime backend.
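To make the trade-off concrete, here is a rough sketch of why the check can live in the backend rather than the shared QSL: the QSL hands out the same features to every backend, and only the onnxruntime feed changes when a quantized model is used. The input-name mapping below is an assumption for illustration.

```python
import numpy as np

def build_feed(eval_features, quantized):
    # The shared QSL serves identical features regardless of backend.
    input_ids = np.array(eval_features.input_ids, dtype=np.int64)
    input_mask = np.array(eval_features.input_mask, dtype=np.int64)
    segment_ids = np.array(eval_features.segment_ids, dtype=np.int64)

    if quantized:
        # Assumed input names for a quantized ONNX graph.
        return {"input_ids": input_ids,
                "attention_mask": input_mask,
                "token_type_ids": segment_ids}
    return {"input_ids": input_ids,
            "input_mask": input_mask,
            "segment_ids": segment_ids}
```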
Thank you Arjun,
If I understand correctly, the last point is that the current demo waits for the response in each thread, so receive and transmit are not two unrelated tasks that could queue separately and allow the system to reach its full performance. Let's discuss and clarify in our call.
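One way to read that concern, sketched with assumed names and worker counts: rather than each issuing thread blocking on its own round trip, the client could push outgoing queries onto one queue and let dedicated workers transmit them, collecting responses on a second queue, so receive and transmit are decoupled.

```python
import queue
import threading

import requests

SUT_URL = "http://sut-node:8000/predict"  # assumed address and endpoint

request_q = queue.Queue()   # queries waiting to be sent to the SUT
response_q = queue.Queue()  # results waiting to be reported to loadgen

def transmit_worker():
    # Workers drain the request queue independently, so the threads that
    # enqueue queries never block on the network round trip themselves.
    while True:
        query_id, query = request_q.get()
        r = requests.post(SUT_URL, json={"id": query_id, "query": query})
        response_q.put((query_id, r.json()["result"]))
        request_q.task_done()

for _ in range(8):  # the worker count here is an arbitrary choice
    threading.Thread(target=transmit_worker, daemon=True).start()
```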
Sure Lior. I'll do that shortly.
Yes Lior. This part is taken as is from the shared LON demo.
LGTM. @arjunsuresh It might be a good idea putting these run instructions in the README.
How to Run Loadgen over the Network

The below CM command will launch the SUT server.

Once the SUT server is launched, the below command can be run on the loadgen node to issue queries to the SUT nodes. In this command

If you are not using CM, just add

Loadgen over the network works for

Native run: 189 QPS
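As a rough non-CM illustration of the loadgen-node side, the client could hold a list of SUT server addresses and spread queries across them; the addresses, the /predict endpoint, and the round-robin policy are assumptions rather than the actual commands referenced above.

```python
import itertools

import requests

# Assumed SUT server addresses; a real run would take these from the command line.
SUT_SERVERS = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]
_server_cycle = itertools.cycle(SUT_SERVERS)

def send_query(query_id, query):
    # Round-robin over the SUT nodes so each server stays busy.
    url = next(_server_cycle) + "/predict"
    r = requests.post(url, json={"id": query_id, "query": query})
    r.raise_for_status()
    return r.json()["result"]
```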
Thank you @nv-jinhosuh for your feedback. I have now added it in the README.
response = requests.post(url, json={'query': query, 'id': id})
return response.json()['result']

responses = []
response = requests.post(url, json={'query': query})
Thank you Arjun for the update to the code.
If I am not mistaken, it looks like there can now be as many concurrent POSTs to the server as there are threads.
As we discussed in the Network call, this allows performance to increase and reach the level of the current test's non-network operation.
I think another level of optimization could use an asynchronous API with async/await patterns, so that a thread is not blocked while queuing multiple POSTs. This might be needed when implementing the additional rate optimizations we discussed yesterday (batching in the server rather than the client). Let's consider this improvement if we see the network rate become the limit. I have not tried it, and there may be other methods, so I am not certain.
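A minimal sketch of the async/await idea, assuming aiohttp on the client side; the endpoint, payload shape, and concurrency cap are illustrative assumptions and have not been tested against this PR.

```python
import asyncio

import aiohttp

SUT_URL = "http://sut-node:8000/predict"  # assumed address and endpoint

async def post_query(session, query_id, query):
    # A single non-blocking POST; the event loop interleaves many of these.
    async with session.post(SUT_URL, json={"id": query_id, "query": query}) as resp:
        body = await resp.json()
        return query_id, body["result"]

async def post_all(queries, max_in_flight=64):
    sem = asyncio.Semaphore(max_in_flight)  # cap the number of in-flight requests

    async def bounded(session, qid, q):
        async with sem:
            return await post_query(session, qid, q)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded(session, qid, q) for qid, q in queries))

# Example: results = asyncio.run(post_all([(0, "sample text"), (1, "more text")]))
```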
@arjunsuresh Can you update this branch too?
Done @pgmpablo157321