Performance degradation during load #3566

vitaliishandra · 2023-02-09T11:02:34Z

vitaliishandra
Feb 9, 2023

We have the operation on backend that performs few janusgraph queries. The total execution time of operation is ~100ms average. We use indexes on vertexes and not full graph scans.
When we start to perform multiple operations (100-200) in paralell the performance started to degradate (5000-9000ms).
We tried different configurations without any results.
All metrics and configurations are below. Are there any ideas what can be improved?

Version: 0.6.2
Storage Backend: cql (tested on Cassandra 3.10.11 and ScyllaDB)
Mixed Index Backend: none

Configuration:
3 pods of Janusgraph with following deployment configuration:
env:
- name: JAVA_OPTIONS
value: >-
-Xms256m -Xmx6144m -Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.rmi.port=9999
- name: gremlinserver.threadPoolWorker
value: '16'
- name: gremlinserver.gremlinPool
value: '32'
- name: gremlinserver.writeBufferHighWaterMark
value: '6553600'
- name: gremlinserver.writeBufferLowWaterMark
value: '627680'
- name: gremlinserver.maxContentLength
value: '365536'
- name: gremlinserver.resultIterationBatchSize
value: '128'
- name: gremlinserver.evaluationTimeout
value: '1800000'
- name: janusgraph.storage.cql.local-datacenter
value: northcentralus
- name: janusgraph.cassandra.keyspace
value: janusgraph
- name: JANUS_PROPS_TEMPLATE
value: cql
- name: janusgraph.storage.hostname
value: *****
- name: janusgraph.ids.block-size
value: '100000'
- name: janusgraph.storage.cql.read-consistency-level
value: QUORUM
- name: janusgraph.storage.cql.write-consistency-level
value: QUORUM
- name: janusgraph.storage.cql.replication-factor
value: '3'
- name: janusgraph.cache.db-cache
value: 'false'
- name: janusgraph.query.batch
value: 'true'
- name: janusgraph.query.batch-property-prefetch
value: 'true'
- name: janusgraph.query.smart-limit
value: 'false'
- name: janusgraph.metrics.enabled
value: 'true'
- name: janusgraph.metrics.prefix
value: janusgraph
- name: janusgraph.metrics.jmx.enabled
value: 'true'
- name: janusgraph.metrics.jmx.domain
value: janusgraph
- name: janusgraph.storage.batch-loading
value: 'false'
- name: janusgraph.storage.page-size
value: '500'
- name: janusgraph.storage.cql.batch-statement-size
value: '100'
resources:
limits:
cpu: 850m
memory: 6096Mi
requests:
cpu: 650m
memory: 5096Mi

CQL janusgraph.graphindex latency during load:

CQL janusgraph.edgestore latency during load:

Grmelin dashboard:

Janusgraph dashabords:

porunov · 2023-02-09T13:31:08Z

porunov
Feb 9, 2023
Maintainer

By default configuration isn't tuned and the throughput is limited. I think it's better to tune the default configuration for new JanusGraph users, but for now I would suggest taking a look at the following configuration options.

I would check out specific cql configuration options. Perhaps storage.cql.local-max-connections-per-host or storage.cql.max-requests-per-connection should be tuned.
If you find some CQL driver configuration options here which are not provided directly in JanusGraph you can always relay on storage.cql.internal to pass those configurations directly into the driver.
storage.cql.executor-service.enabled - I would use it as false because for now this thread pool isn't useful, but is used to limit the amount of parallel cql requests. This pool is enabled for historical reason, but it may be useful later when we switch to async JanusGraph queries execution. As for now I would recommend disabling it in most cases. This pool shouldn't limit performance much but it may take up additional resources.
Perhaps storage.parallel-backend-executor-service should be tuned for your queries. Be default it's fixed with the size of number of processors multiplied by 2 which might be too small for many use-cases. Try either increasing storage.parallel-backend-executor-service.core-pool-size to something meaningful in your case or even switch the implementation to some scalable executor pool. For example, you can reuse org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor by wrapping it into your own implementation for JanusGraph (notice that it should have the same package path. I.e. you can create org.elasticsearch.common.util.concurrent.EsThreadPoolExecutorWithJanusGraphSettings). Nevertheless, it's not necessary if your load is constant and you can simply increase storage.parallel-backend-executor-service.core-pool-size (try simply multiply initial value to the amount of parallel requests. i.e. should be 2000 or something in your case).
Generally check configurations which might influence performance for your usecase https://docs.janusgraph.org/configs/configuration-reference/ .

I don't know how heavy operations you are performing but usually it's not the bottleneck of Scylla nodes. Nevertheless you should check ScyllaDB metrics as well to verify there is no any limitations on that side. Usually if you have good network capping to handle all load it shouldn't be a problem with network throughput either (nevertheless it's always better to check just in case).
I would assume that most likely your current limitation is in JanusGraph nodes. It looks like your queries are fighting for the same resources but the task is to find what they are fighting for (it's either machine resources like processor time or queries execution resources like thread pool limits or else).

9 replies

li-boxuan Feb 13, 2023
Maintainer

I don't think it has much to do with the executor pool since your throughput is very low:

However when i run it in paralell (20 paralell queries)

And yeah you might have to do some profiling to figure out what happens.

vitaliishandra Feb 14, 2023
Author

@porunov @li-boxuan thank you for you reccomendations. We have experimented with different configurations without any success. Maybe you have any other ideas what can be wrong? Here are additional details:
We run multiple queries in paralell. In the same time we connected to Janusgraph pod and run simple query g.V().has('uid', 'Flow:1012309').out().profile() that return 80 results. We are waiting ~30sec to receive the result. However the profile shows that everything is ok and it took 3-4ms:

porunov Feb 14, 2023
Maintainer

30 seconds to return 80 elements doesn't sound right to me. I guess in your case the problem here could be that connection between your application / gremlin console and JanusGraph is too slow. You are probably measuring performance when you received data back into your application, but profile measures how much time it takes to retrieve data from storage backend / index backend and explains how much time it spend on each step.
As such, try to check your serialization config (use GraphBinary if possible), check latency between your app and JanusGraph.
That said, still, I don't imagine how small throughput and high latency should be to return 80 vertices without any properties (because .out() will return basically vertex ids with no properties). Thus, it could be that the issue is somewhere else.
The parallelization problem also doesn't make sense to me. If you applied the suggested configuration above it should work normally and you shouldn't see performance degradation if you run 1 or 100 queries like the above one.

The one thing comes to my mind: Are you using Gremlin Console to measure performance by any chance? If yes than it's better not to. I believe the default Gremlin Console thread pool is 1 which means all your queries will be processed one by one (and not in parallel).

That said, if you are using direct JanusGraph connection to measure performance then I would do the next:

Check how much time JanusGraph spends to compute the result (in the above profiling it looks like 3-4 ms).
Enable profiling on JanusGraph side and investigate how much time you spend before you send the result to your network (i.e. how much time it takes for serialization or maybe something else).
Enable profiling on your application side and check how much time it takes to deserialize the result.
Check latency and network throughput.

vitaliishandra Feb 17, 2023
Author

@porunov we are trying to perform test on separate VM with Janusgraph+Scylla. We would like to make sure we setup configuration correctly. We use cql and janusgraph-cql.properties file. We set some configuration there. Is it correct to override LOCAL configuration setting via this file (like storage.cql.executor-service.enabled) or it should be overrided in some other way? Is it possible to make sure it is set correctly as management doesn't show it?

hadoopmarc Feb 18, 2023

Overriding LOCAL properties in the properties file is OK, see here for more explanation.
Inspecting actual config properties of the current instance goes like this:

gremlin> graph.configuration().getKeys()
==>gremlin.graph
==>storage.backend
==>storage.hostname
==>storage.cql.keyspace
==>storage.cql.local-datacenter
==>cache.db-cache
==>cache.db-cache-clean-wait
==>cache.db-cache-time
==>cache.db-cache-size
gremlin> graph.configuration().getProperty('storage.backend')
==>cql

hadoopmarc · 2023-02-10T06:59:57Z

hadoopmarc
Feb 10, 2023

It strikes me that the reads on the graphindex take much longer than those on the edgestore. Are you sure that the indexed property values have a high cardinatlity? See the section "caveat" in this blog: https://li-boxuan.medium.com/janusgraph-deep-dive-part-2-demystify-indexing-d26e71edb386

2 replies

vitaliishandra Feb 10, 2023
Author

Yes. We have indexes on few props with high cardinality. We use just composite indexes. It executes very fast without load. However when we produce the paralell connections it starts to degradate.

li-boxuan Feb 13, 2023
Maintainer

It strikes me that the reads on the graphindex take much longer than those on the edgestore

Cannot agree more. I would guess it's a coincidence because apparently @vitaliishandra doesn't have too much traffic as of now. I would say it looks like your Cassandra/Scylla instances don't have enough machine resources. You may need to pull more metrics of them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance degradation during load #3566

{{title}}

Replies: 2 comments 11 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Performance degradation during load #3566

vitaliishandra Feb 9, 2023

Replies: 2 comments · 11 replies

porunov Feb 9, 2023 Maintainer

li-boxuan Feb 13, 2023 Maintainer

vitaliishandra Feb 14, 2023 Author

porunov Feb 14, 2023 Maintainer

vitaliishandra Feb 17, 2023 Author

hadoopmarc Feb 18, 2023

hadoopmarc Feb 10, 2023

vitaliishandra Feb 10, 2023 Author

li-boxuan Feb 13, 2023 Maintainer

vitaliishandra
Feb 9, 2023

Replies: 2 comments 11 replies

porunov
Feb 9, 2023
Maintainer

li-boxuan Feb 13, 2023
Maintainer

vitaliishandra Feb 14, 2023
Author

porunov Feb 14, 2023
Maintainer

vitaliishandra Feb 17, 2023
Author

hadoopmarc
Feb 10, 2023

vitaliishandra Feb 10, 2023
Author

li-boxuan Feb 13, 2023
Maintainer