Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node connection issues have significantly more impact on performance comparing to JS SDK #2070

Open
AlexanderShenshin opened this issue Nov 6, 2024 · 5 comments
Assignees
Milestone

Comments

@AlexanderShenshin
Copy link

AlexanderShenshin commented Nov 6, 2024

Recently I noticed performance inconsistency between Hedera Java SDK and JS SDK while having issues with node connections, with Java SDK working significantly worse.

I would like to share my findings, it may help to find potential issues in Java SDK and/or figure out what exactly work better in JS SDK for described case.

Environment

  • First of all, I have some network-related issue that prevents me from successfully connecting to certain Hedera Testnet (and Previewnet) nodes (at least intermittently)
  • I was working with Hedera identity libraries (did-sdk-js and community-made Python SDK) and not with SDKs directly, but my investigation comes down to SDKs
  • Hedera networks: testnet, previewnet
  • Hedera SDK versions:
    • Java SDK: 2.41.0, 2.43.0 (used both)
    • JS SDK: 2.28.0
  • JDK version: 17.0.1
  • OS: Windows 11

Problem description

The issues were initially discovered while running some tests against Hedera Testnet using Java SDK under the hood.
These tests are only using HCS-related API (topic creation, submitting & reading topic messages).

I tried to run the same tests in JS library (uses JS SDK) and it apparently worked fine.
To confirm my assumptions, I ran Java SDK integration tests against Testnet and observed the same performance/timeouts issues.

Summary of issues comparing to JS SDK:

  • Frequent timeouts (java.util.concurrent.TimeoutException) that randomly happen on transaction.execute() calls and HCS topic messages subscription
    • It was not really possible to pass 10 tests in a row (each does 2-4 calls to HCS)
    • With JS SDK, timeouts are possible, but very rare
  • Slower performance
    • Should be directly linked with timeouts, but just for comparison
    • Same set of operations for code that uses different SDKs
      • Java SDK: 1.5 - 3 minutes
      • JS SDK: around 45 seconds

Workaround

It was possible to work around the problem by doing the following:

  • Run Java SDK integration tests against Hedera Testnet
  • Observe output of specific can connect to testnet with TLS STANDARD_OUT:
AccountBalanceIntegrationTest > can connect to testnet with TLS STANDARD_OUT
    failed for 35.155.49.147:50212=0.0.8
    succeeded for 3.212.6.13:50212=0.0.4
    succeeded for 34.94.106.61:50212=0.0.3                                                                                                                                                               
    succeeded for 34.106.102.218:50212=0.0.8                                                                                                                                                             
    succeeded for 35.245.27.193:50212=0.0.5                                                                                                                                                              
    failed for 34.133.197.230:50212=0.0.9                                                                                                                                                                
    succeeded for 54.176.199.109:50212=0.0.7                                                                                                                                                             
    failed for 34.83.112.116:50212=0.0.6                                                                                                                                                                 
    failed for 34.94.160.4:50212=0.0.7                                                                                                                                                                   
    failed for 52.20.18.86:50212=0.0.5
    failed for 35.237.119.55:50212=0.0.4                                                                                                                                                                 
    failed for 50.18.132.211:50212=0.0.3                                                                                                                                                                 
    failed for 52.14.252.207:50212=0.0.9                                                                                                                                                                 
    failed for 54.70.192.33:50212=0.0.6
  • Select node addresses with successful connection status and create custom network config, for example:
{
   "network": {
       "0.0.4": "3.212.6.13:50212",
       "0.0.3": "34.94.106.61:50212",
       "0.0.8": "34.106.102.218:50212",
       "0.0.5": "35.245.27.193:50212",
       "0.0.7": "54.176.199.109:50212"
   },
   "mirrorNetwork": "testnet"
}
  • Use Client.fromConfig() with custom config instead of Client.forTestnet()

After moving to custom network map/config, timeouts and performance issues are gone. Same integration tests that I run with Java SDK are now even faster than with JS SDK.

@0xivanov
Copy link
Contributor

Hello @AlexanderShenshin , thank you for opening this issue.

I have a couple of questions / suggestions:

  1. You mentioned that you are using did-js-sdk , what does this have to do with the Java SDK? Are you using the Java SDK in a similar way as did-js-sdk? If so - please give more information about that and maybe some code snippet - configuring the client -> configuring the transaction -> executing the transaction -> getting the receipt, etc.
  2. Would you be able to set the SDK Client's log level to trace : client.setLogger(new Logger(LogLevel.TRACE));? Please attach the logs when executing the transaction.
  3. If java.util.concurrent.TimeoutException occurres, can you send me the SDK relevant stack trace as well?
  4. I am running the integration tests for testnet and previewnet right now and the connection is successful. I also checked sone github jobs from 5 days ago (when this issue was opened) and the connection was fine.

@0xivanov 0xivanov self-assigned this Nov 11, 2024
@0xivanov 0xivanov added this to the v2.45.0 milestone Nov 12, 2024
@AlexanderShenshin
Copy link
Author

Hello @0xivanov, thank you for the response.

Sure, I will try to address these points.

First of all, it's clear for me that connection issues that I encounter are not directly related to Java SDK - my team member do not encounter them using the same code and I noticed that same integration tests work well in GH pipelines (as you mentioned). What makes me curious is that SDKs perform differently.

So the point for raising this issue is that JS SDK seems to be performing significantly better in my case and this may uncover some potential improvements for Java SDK side.

Regarding did-sdk-js, I initially discovered problematic behavior by comparing integration tests performance of DID SDKs that correspondingly use Hedera Java SDK and Hedera JS SDK under the hood.

To isolate our context, I reproduced the behavior by comparing results of Java and JS SDKs integration tests:

  • I've done multiple runs of TopicMessageIntegrationTest that present in both Java and JS SDK
  • I also set TRACE log level how you suggested for both SDKs, by calling Client.setLogger
    • However, for Java SDK it doesn't actually print TRACE logs this way. I guess it can be related to org.slf4j.simple provider, so I replaced it with ch.qos.logback:logback-classic and was able to get at least DEBUG logs
    • If TRACE logs are necessary, please provide your suggestions on how properly enable them or what can be done wrong in my environment
  • I've got following results for test runs:
    • Java SDK - fails with timeout around 50% of the time
    • JS SDK - no failures

I've run Java SDK tests using following command:
./gradlew :sdk:testIntegration -POPERATOR_ID="{OPERATOR_ID}" -POPERATOR_KEY="{OPERATOR_KEY}" -PHEDERA_NETWORK="testnet" -PSDK_LOG_LEVEL="TRACE" --tests "TopicMessageIntegrationTest"

Here's the log output for failed Java SDK test case:

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_OUT
    01:12:51.543 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 8000 ms after channel connection failure with node 0.0.6 during attempt #1

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_ERROR
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_OUT
    01:13:03.976 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 7999 ms after channel connection failure with node 0.0.5 during attempt #1

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_ERROR
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_OUT
    01:13:16.465 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 8000 ms after channel connection failure with node 0.0.5 during attempt #1

TopicMessageIntegrationTest > Can receive a large topic message STANDARD_ERROR
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being


TopicMessageIntegrationTest > Can receive a large topic message STANDARD_OUT
    01:13:17.415 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 250 ms after failure with node 0.0.9 during attempt #1: OK
    01:13:17.427 [Test worker] DEBUG com.hedera.hashgraph.sdk.logger.Logger -- Sleeping for: 250 | Thread name: Test worker
    01:13:30.189 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 8000 ms after channel connection failure with node 0.0.9 during attempt #2                                       
    01:13:43.265 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 8000 ms after channel connection failure with node 0.0.9 during attempt #1
    01:13:43.266 [Test worker] DEBUG com.hedera.hashgraph.sdk.logger.Logger -- Sleeping for: 7999 | Thread name: Test worker
    01:14:03.702 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 16000 ms after channel connection failure with node 0.0.9 during attempt #2                                      
    01:14:03.702 [Test worker] DEBUG com.hedera.hashgraph.sdk.logger.Logger -- Sleeping for: 16000 | Thread name: Test worker
    01:14:32.142 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 32000 ms after channel connection failure with node 0.0.9 during attempt #3                                      
    01:14:32.143 [Test worker] DEBUG com.hedera.hashgraph.sdk.logger.Logger -- Sleeping for: 31999 | Thread name: Test worker
    01:15:16.590 [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger -- Retrying in 64000 ms after channel connection failure with node 0.0.9 during attempt #4
    01:15:16.593 [Test worker] DEBUG com.hedera.hashgraph.sdk.logger.Logger -- Sleeping for: 63997 | Thread name: Test worker

TopicMessageIntegrationTest > Can receive a large topic message FAILED
    java.util.concurrent.TimeoutException
        at com.hedera.hashgraph.sdk@2.43.0/com.hedera.hashgraph.sdk.Executable.execute(Executable.java:431)
        at com.hedera.hashgraph.sdk@2.43.0/com.hedera.hashgraph.sdk.Transaction.execute(Transaction.java:53)
        at com.hedera.hashgraph.sdk@2.43.0/com.hedera.hashgraph.sdk.Executable.execute(Executable.java:367)
        at com.hedera.hashgraph.sdk@2.43.0/com.hedera.hashgraph.sdk.Transaction.execute(Transaction.java:53)
        at com.hedera.hashgraph.sdk.test.integration/com.hedera.hashgraph.sdk.test.integration.TopicMessageIntegrationTest.canReceiveALargeTopicMessage(TopicMessageIntegrationTest.java:103)

And here's the output for successful test case:

TopicMessageIntegrationTest > Can receive a topic message STANDARD_ERROR
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 8000 ms after channel connection failure with node 0.0.3 during attempt #1
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 7999 ms after channel connection failure with node 0.0.9 during attempt #1
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 8000 ms after channel connection failure with node 0.0.9 during attempt #1                                                     
    com.hedera.hashgraph.sdk.MaxAttemptsExceededException: exceeded maximum attempts for request with last exception being
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 8000 ms after channel connection failure with node 0.0.5 during attempt #1                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 8000 ms after channel connection failure with node 0.0.5 during attempt #2
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 250 ms after failure with node 0.0.5 during attempt #1: OK                                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 500 ms after failure with node 0.0.5 during attempt #2: OK                                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 250 ms after failure with node 0.0.5 during attempt #1: OK                                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 500 ms after failure with node 0.0.5 during attempt #2: OK                                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 1000 ms after failure with node 0.0.5 during attempt #3: OK                                                                    
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 250 ms after failure with node 0.0.5 during attempt #1: OK                                                                     
    [Test worker] WARN com.hedera.hashgraph.sdk.logger.Logger - Retrying in 500 ms after failure with node 0.0.5 during attempt #2: OK                                                                     
    [hedera-sdk-5] WARN com.hedera.hashgraph.sdk.TopicMessageQuery - Call is cancelled for topic 0.0.5148534.                                                                                              

TopicMessageIntegrationTest > Can receive a topic message PASSED

For reference, I'm also posting full logs of a single test run for both SDKs:

@0xivanov
Copy link
Contributor

I've spent some time looking into this today and I have a couple of things to say:

  1. The test environment does client.setMaxNodesPerTransaction(1); thus making the sdk to retry only for 1 node.
  2. MaxAttemptsExceededException occurres when we've retried 10 times (10 is the default, can be changed with setMaxAttempts). In your logs we are seeing 3-4 attempts, so it's weird this exception gets printed.
  3. TimeoutException occurres when sum of the time of grpc requests exceeds 2 mins (can be changed with setRequestTimeout())

@0xivanov
Copy link
Contributor

I am still unable to reproduce this on testnet, I am conncting to node 9, which was failing for you.

@SimiHunjan SimiHunjan modified the milestones: v2.45.0, v2.46.0 Nov 26, 2024
@AlexanderShenshin
Copy link
Author

Got it, thank you for an update.

I tried to run the same test with 10 max nodes per transaction and extended timeout for grpc requests (10 minutes), problematic behavior is still here - I was able to get TimeoutException.

Regarding reproduction, I believe it's hard to reproduce because it seems to be environment/network specific.
I guess that the best option to proceed is for me to take a closer look at connection issues and come back to you with more details.

@rwalworth rwalworth modified the milestones: v2.46.0, v2.47.0 Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants