Performance degradation on networks with large amount of nodes #1613
Hi @jleach, from my experience and benchmarking it's very likely that an outdated genesis file is the cause of the slowness you observe. The genesis file (for sov:staging) I see here is outdated. I have previously verified this behaviour with my indy network healthchecking tool over here.
You can tweak the genesis file to reproduce this: try running it a few times with the genesis file provided in my repo, then try deleting the last 2 lines of the genesis file.
@Patrik-Stas When I ran the test I recorded the IP addresses of the nodes IndyVDR (AFJ) was connecting to, and they matched the currently active nodes on Sovrin Staging/Test. Put another way, when you "Accept" a credential, IndyVDR only connects to active nodes. Also, reconciling the ledger (figuring out which nodes are active and which ones are not) takes place during initialization, not during credential acceptance. For example, test 1 above connected to these nodes, comprising the 28 connections (I removed duplicates):
All these nodes are active ledger nodes, except these two, which are removed in subsequent blocks. The genesis block I was using was two transactions behind; each of those transactions removed one of the nodes below.
The way I understand how the ledgers work is that IndyVDR uses the provided genesis block as a starting point to establish network connections. The genesis block may be behind, so it will fetch any new transactions and rebuild the current network state (which nodes are alive and which IPs they are on). Then it uses this list to make queries. Even if I remove a few IPs from the genesis block, it's still going to get them from other nodes, as the ledger is always the source of truth. Also, from what we know of IndyVDR, it does not automatically rebuild the network once it has done its initial startup and reconciliation.
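For illustration, here is roughly what that bootstrap-and-catch-up step looks like through indy-vdr's Python wrapper (a minimal sketch: the genesis file path is a placeholder, and `open_pool` may already refresh on open by default, so the explicit `refresh()` is shown only to make the catch-up step visible):

```python
import asyncio
from indy_vdr import open_pool

async def main():
    # Bootstrap from a (possibly stale) genesis file; indy-vdr uses it
    # only as a starting point for contacting nodes.
    pool = await open_pool(transactions_path="sovrin_staging_genesis.txn")

    # Catch up on the pool ledger: transactions newer than the genesis
    # file (e.g. ones that removed nodes) are applied to the node list.
    await pool.refresh()

    txns = await pool.get_transactions()
    print(f"pool ledger now holds {len(txns.splitlines())} transactions")

asyncio.run(main())
```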
With indy-vdr, the client is expected to manage refreshing and caching the pool transactions itself. In indy-sdk this caching is automatic, because the use case is a little more narrowly defined. ACA-Py uses its own implementation of client-side transaction caching. Normally the same set of connections should be used for all requests within a window of time; I believe the default is 5 seconds before it sends requests to a new connection pool. If the same Pool instance is used, then it should not be re-establishing connections to the nodes within that time.
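As a sketch of that "one Pool instance, many requests" pattern (assuming the Python wrapper; `fetch_credential_artifacts` is an illustrative helper, not an indy-vdr API):

```python
from indy_vdr import ledger

async def fetch_credential_artifacts(pool, schema_id: str, cred_def_id: str):
    # Two of the reads a wallet needs when accepting a credential.
    schema_req = ledger.build_get_schema_request(None, schema_id)
    cred_def_req = ledger.build_get_cred_def_request(None, cred_def_id)

    # Submitting both over the SAME Pool instance lets indy-vdr reuse
    # its current node connections (within its connection window)
    # instead of opening a fresh set per request.
    schema = await pool.submit_request(schema_req)
    cred_def = await pool.submit_request(cred_def_req)
    return schema, cred_def
```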
Based on the performance we are seeing, I wonder if AFJ is creating a new pool from scratch every time it does a request; that is, doing the entire genesis file handling and initial querying of the ledger on every request. I would think it should be able to cache enough info about each ledger to bypass that, such as what is done with indy-cli-rs.
@swcurran It's a good question. On one hand, to accept a credential AFJ makes 6 ledger calls; with each call being sent to two different nodes, I would expect to see a max of 12 network connections. Maybe recreating the pool would explain why I see 28-36 connections instead. But I also see it reaching out to these two nodes:
These were removed in the next two ledger transactions, which makes me think it's not rebuilding; otherwise it would grab the last two transactions and remove these nodes from the pool. @cvarjao is just testing the latest-and-greatest AFJ 0.4.2 to see if that improves performance.
AFAIK, when you connect to the ledger (create a pool), you use the genesis file to know which nodes are supposed to be there, and then process the rest of the "pool" ledger to know EXACTLY which nodes to use (in this case, process the last two transactions so you know not to use those nodes). So if you re-create the pool on every request, you go through that process every time. Painful! @WadeBarnes, can you give them a complete, working genesis file that adds those last two transactions to see if that makes any difference?
AFJ creates a pool instance for every ledger once, and that is used as long as the AFJ agent is initialized. These pool instances are shared between all tenants. You can configure in the indy-vdr config from AFJ whether you want to connect on startup. We never cache the genesis, though, or call refresh (so this process is done once on every startup).
Process to get an up-to-date genesis file for any network:
Latest copy:
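With the original linked steps not reproduced here, one plausible way to produce such a file via indy-vdr's Python wrapper (a sketch: the file names are placeholders, and it assumes `get_transactions` returns the pool transactions in the genesis JSON-lines format):

```python
import asyncio
from indy_vdr import open_pool

async def update_genesis(old_path: str, new_path: str) -> None:
    # Open from the stale genesis file and catch up on the pool ledger.
    pool = await open_pool(transactions_path=old_path)
    await pool.refresh()

    # Write the refreshed transaction list out as an up-to-date genesis
    # file that clients can use to skip the catch-up work.
    with open(new_path, "w") as f:
        f.write(await pool.get_transactions())

asyncio.run(update_genesis("genesis_stale.txn", "genesis_updated.txn"))
```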
From the descriptions from @andrewwhitehead, @jleach, and @TimoGlastra, it sounds like AFJ may not be initializing the pool completely/correctly. If it's trying to communicate with nodes that no longer exist after initializing the pool only once, then the pool has not been reconciled with the live network properly.

The genesis file is meant to bootstrap client and node connections to a given network. It reflects the exact state of a network at a moment in time. It is not meant to reflect the exact state of the network at the exact moment a given client or node is connecting; the network itself contains that state and is the source of truth for that information. Once an initial connection is established, it is the responsibility of the client or node to update its information based on the data contained on the live network.

Where the synchronization and reconciliation is done is up for debate. I think it would be convenient for this to all happen transparently in indy-vdr, since it is closest to the network and its purpose is to broker the communications with the network.
To be completely clear, I think what you are saying is that creating the pool instance only reads the genesis file and does not actually connect to the ledger to get the "latest" pool ledger data. Further, a refresh is never done after that. Solutions would then be for either Indy-VDR to do a refresh when the pool is created, or for AFJ to call refresh itself. This can be verified by running Aries Bifold with the latest genesis file that @WadeBarnes provided above, since with it, processing the genesis file and doing a refresh should yield the same set of nodes. I still don't understand what @jleach is seeing with 28-36 connections from the agent to the ledger when getting a credential offer. Perhaps it doesn't matter, but I find it curious that so many connections are opened. It would be good to know what calls AFJ is making to Indy-VDR at that time, and then what Indy-VDR is doing with each of those calls.
Your understanding seems correct to me. @jleach's comment about the 28-36 connections is more about the efficiency of the communications following the connection to the ledger. The fewer nodes the transactions are sent to, the less time you have to wait for responses (my understanding).
Some history on the genesis files and network operations that may provide better insight into some aspects of this issue:
These issues affect any network as time goes on. Both Sovrin and Indicio have had to publish updates to their networks' genesis files for these reasons.
FYI: I did some testing with the CLI (indy-cli-rs). When I use the Sovrin published file, the create took negligible time always, and the … When I use the file Wade provides above, the create took negligible time always, and the … Obviously, we don't know how the CLI and AFJ are handling things, but it does indicate that the handling can vary. A quick and dirty fix is to get Sovrin to update the genesis file, but that is not sustainable; things change. We need to fix the handling in AFJ, I think.
I just tested BC Wallet with Wade Barnes' Sovrin staging genesis, and I can confirm that the LSBC test credential seems to be back to normal speed with the new genesis file.
Good stuff! So it sounds like the issue is the need for a refresh in AFJ and/or Indy VDR when connecting the pool. Agreed?
We should then however also store the updated genesis file for later use, as indy-vdr doesn't store/update the genesis AFAIK. @andrewwhitehead is that correct?
I'm assuming (but could be wrong; @andrewwhitehead, input needed) that when you do the refresh, indy-vdr does not persist the updated pool transactions anywhere, so the client would need to store them itself.
I merged two PRs from @wadeking98 that addressed ledger-related issues:
After merging these changes, I conducted tests by accepting an LSBC credential and collected timing and network statistics. Tests 1-3 were done after the upgrade from the previous version. For tests 5 & 6 I performed a fresh installation of the wallet. The final group of tests, 7 & 8, were done without PR1013 or PR1015.
Tests 1-5 with the two fixes showed significantly fewer network connections (approximately 71% fewer on average) and much faster completion times (around 68% faster on average). The initial high network count for test 4 may be due to reconciliation (rebuilding the current state of the network pool) on first use. All connections for tests 1-5 were made to known active nodes, listed as follows:
TL;DR
Although ledger network performance doesn't appear to directly impact a wallet's ability to complete a transaction (such as accepting a credential), the number of network connections does matter: wallets that open fewer connections to the ledger network complete transactions noticeably faster. This highlights the importance of optimizing the number of network connections for efficiency.
Problem
When using Aries Bifold, BC Wallet, or AFJ to "Accept" a credential, the process can become frustratingly slow on ledgers with many nodes.
Analysis
The table below summarizes the test results. These tests were conducted on the `sovrin:staging` environment using the LSBC Test credential. The same tests were run three times for each platform; however, the table displays only the two best results for each platform, with the exception of Orbi Edge.
For Orbi Edge, the first test result is shown, even though it was slower. This initial test showed a notably higher number of network connections. Subsequent tests for Orbi Edge were more optimized.
In the table, you'll find two key columns:
During testing, both Trinsic and Lissi encountered issues when attempting to accept the credential. After waiting for ~60 seconds, an error message appeared.
It was noted via logging that when accepting a credential, AFJ makes a series of network calls, which are logged as follows:
To accurately assess network performance and duration, these calls were replicated using the IndyVDR Proxy and cURL. The results of these tests are documented in the table below.
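For example, a single timed read against a locally running indy-vdr-proxy might look like the following (a sketch: the host, port, and endpoint path here are assumptions about the deployment, to be replaced with the calls observed in the AFJ logs):

```python
import time
import urllib.request

# Hypothetical local indy-vdr-proxy endpoint fetching domain ledger
# transaction #1; adjust to mirror each call AFJ was observed making.
URL = "http://localhost:8080/txn/domain/1"

start = time.monotonic()
with urllib.request.urlopen(URL) as resp:
    body = resp.read()
print(f"fetched {len(body)} bytes in {time.monotonic() - start:.3f}s")
```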
It is not known what frameworks Orbi Edge, Lissi, or Trinsic use.
In the case of Lissi and Trinsic, our observations indicate that they scan all configured ledgers while in the process of accepting a credential. This thorough scanning approach likely played a role in reaching a timeout at approximately 60 seconds, resulting in a test failure.
NOTE: Preliminary testing with Trinsic showed it getting similar results to Orbi Edge; however, after a re-install this was not the case, and the above results were collected.
Conclusion
This slowness seems to stem from AFJ or IndyVDR making numerous network calls. While each call is quick on its own, the cumulative effect can lead to significant delays when considering response processing.
Our test results reveal some key insights:
Ledger Network Performance: It's worth noting that ledger network performance doesn't seem to significantly impact the results, as evidenced by the efficient performance of IndyVDR Linux and Orbi Edge. Orbi Edge impressively completes transactions in just 3 seconds. Even when we eliminate duplicate network calls from IndyVDR Linux, it achieves similar results.
Number of Network Connections: On the other hand, the number of network connections appears to be a notable factor in test results. Tests that establish fewer connections, possibly the minimum required, tend to complete significantly faster compared to those that create multiple network connections.
In light of these findings, we recommend that AFJ and IndyVDR consider optimizing their ledger network interactions by:
Removing Duplicate Network Calls: Identify and eliminate any duplicate network calls that are part of the same transaction to reduce redundancy.
Batching Queries: Implement a strategy to batch queries, sending them to the same two nodes over the same connection, rather than establishing multiple network connections for each query.
Network Reconciliation: Consider periodic network reconciliation or scheduling intervals for this process, separating it from critical transactions such as accepting a credential or processing proof requests.
Caching Transactions: Explore the possibility of caching ledger transactions, given their immutability, to improve efficiency (see the sketch after this list).
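A minimal sketch of such a cache, assuming the Python wrapper and a caller-supplied cache key (the helper name and keying scheme are illustrative, not an existing API):

```python
# In-memory cache keyed by request identity (e.g. a schema id or cred
# def id). Entries never need to expire, because these ledger objects
# are immutable once written.
_cache: dict = {}

async def cached_submit(pool, cache_key: str, request):
    # Only hit the ledger on the first request for a given object;
    # subsequent lookups are served from memory.
    if cache_key not in _cache:
        _cache[cache_key] = await pool.submit_request(request)
    return _cache[cache_key]
```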
These optimizations could enhance the overall performance of ledger network interactions. Thank you for your attention to these recommendations, which aim to streamline the process for a smoother user experience.
How To Reproduce
Use a demo on Sovrin Test/Staging. If you don't have one, use this email verification service. Watch your firewall logs and time the result. If you have a `pfSense`-based router with `pfTop`, you can run this filter: `tcp dst port 9700||9702||9744||9777||9778||9799 and out`.