
Cluster stress tests. #20

Open · gray380 opened this issue Jan 4, 2023 · 2 comments

gray380 commented Jan 4, 2023

Hi,

I'm testing Postgres and two Keycloak instances under Docker Swarm, with Traefik as a load balancer, all running in the same Docker overlay network.

Keycloak 1:

    environment:
      - PROXY_ADDRESS_FORWARDING=true
      - KC_DB=postgres
      - KC_DB_URL_HOST=keycloak-postgres
      - KC_DB_URL_DATABASE=keycloak
      - KC_DB_SCHEMA=clustered_jdbc
      - KC_CACHE_CONFIG_FILE=cache-ispn-jdbc-ping.xml
      - JGROUPS_DISCOVERY_EXTERNAL_IP=keycloak-jdbc1
      - KC_LOG_LEVEL=INFO,org.infinispan:DEBUG,org.jgroups:DEBUG

Keycloak 2:

    environment:
      - PROXY_ADDRESS_FORWARDING=true
      - KC_DB=postgres
      - KC_DB_URL_HOST=keycloak-postgres
      - KC_DB_URL_DATABASE=keycloak
      - KC_DB_SCHEMA=clustered_jdbc
      - KC_CACHE_CONFIG_FILE=cache-ispn-jdbc-ping.xml
      - JGROUPS_DISCOVERY_EXTERNAL_IP=keycloak-jdbc2
      - KC_LOG_LEVEL=INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
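For context, a minimal sketch of how these environment blocks might sit in the swarm stack file; the image name, network name, and Traefik labels here are illustrative assumptions, not taken from the original setup:

    services:
      keycloak-jdbc1:
        image: ivangfr/keycloak-clustered  # assumed image; adjust to the one actually used
        environment:
          # ...the Keycloak 1 variables listed above...
        networks:
          - common
        deploy:
          labels:
            - traefik.enable=true
            - traefik.http.services.keycloak.loadbalancer.server.port=8080
    networks:
      common:
        driver: overlay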

JGROUPSPING table

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated           |                                                ping_data                                                 
--------------------------------------+--------------+----------------+----------------------------+----------------------------------------------------------------------------------------------------------
 b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 | ISPN         | keycloak-jdbc2 | 2023-01-03 19:05:10.402106 | \x02b1d1c14d0bf38b35b1a481f296a94cf7030100146b6579636c6f616b2d6a646263322d343337353910040a0014b11e78ffff
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc2 | 2023-01-03 19:05:10.429296 | \x02a8110f90a76b99c1c4845a5e65264664010100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
(2 rows)
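Note that both rows above carry bind_addr = keycloak-jdbc2 even though they describe two different members (kochen picks up on this below). A quick hypothetical check for such duplicates, assuming direct psql access:

    keycloak=# SELECT bind_addr, count(*) FROM clustered_jdbc.JGROUPSPING GROUP BY bind_addr HAVING count(*) > 1;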

And I'm trying to run stress tests.
It's okay when one Keycloak leaves the cluster (Traefik sends requests to the surviving one):

docker service scale common_keycloak-jdbc2=0

Some logs:

Expected behavior:

DEBUG [org.jgroups.protocols.JDBC_PING] (Thread-15) Removed b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (Thread-4) Removed b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 for cluster ISPN from database

Unexpected behavior:

DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
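To watch this churn directly in the discovery table while the test runs, something like the following helps (a hypothetical one-liner, assuming psql can reach the keycloak database):

    # poll the discovery table once a second during the scale down/up test
    watch -n1 "psql -U keycloak -d keycloak -c \
      'SELECT own_addr, bind_addr, updated FROM clustered_jdbc.JGROUPSPING'"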

JGROUPSPING table:

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated          |                                                ping_data                                                 
--------------------------------------+--------------+----------------+---------------------------+----------------------------------------------------------------------------------------------------------
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:37:22.26674 | \x02a8110f90a76b99c1c4845a5e65264664030100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
(1 row)

but when it comes back, it takes some time to re-form the cluster:

docker service scale common_keycloak-jdbc2=1

Some logs:

DEBUG [org.jgroups.protocols.TCP] (TQ-Bundler-7,keycloak-jdbc2-35303) JGRP000034: keycloak-jdbc2-35303: failure sending message to keycloak-jdbc1-27890: java.net.ConnectException: Connection refused (Connection refused)
DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-10,keycloak-jdbc2-35303) keycloak-jdbc2-35303: broadcasting suspect(keycloak-jdbc1-27890)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-380,keycloak-jdbc1-27890) keycloak-jdbc1-27890: suspecting [keycloak-jdbc2-35303]
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-380,keycloak-jdbc1-27890) keycloak-jdbc1-27890: broadcasting unsuspect(keycloak-jdbc2-35303)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-21,keycloak-jdbc2-35303) keycloak-jdbc2-35303: suspecting [keycloak-jdbc1-27890]
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-25,keycloak-jdbc2-35303) keycloak-jdbc2-35303: broadcasting unsuspect(keycloak-jdbc1-27890)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-382,keycloak-jdbc1-27890) keycloak-jdbc1-27890: broadcasting unsuspect(keycloak-jdbc2-35303)
...
a series of Removed/Inserted for cluster ISPN log entries (from both nodes)

JGROUPSPING table

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated           |                                                ping_data                                                 
--------------------------------------+--------------+----------------+----------------------------+----------------------------------------------------------------------------------------------------------
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:48:34.328075 | \x02a8110f90a76b99c1c4845a5e65264664030100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
 262c1029-ab8a-481c-9b51-8e78c1906ef1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:48:34.347452 | \x029b518e78c1906ef1262c1029ab8a481c010100146b6579636c6f616b2d6a646263322d333533303310040a0014b41e78ffff
(2 rows)

So while all of this connection-refused, suspect/unsuspect, remove/insert activity is going on, the container is already up and running, and Traefik sends part of the requests to the not-yet-ready Keycloak instance.
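A health check on the Keycloak services should keep Swarm (and therefore Traefik) from routing to an instance that is still bootstrapping. A minimal sketch, assuming KC_HEALTH_ENABLED is set and curl exists in the image (the stock Keycloak image ships without curl, so a different probe command may be needed):

    healthcheck:
      # only report healthy once Keycloak answers its readiness probe;
      # on newer Keycloak releases the health endpoints moved to the
      # management port (9000) instead of 8080
      test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health/ready || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 60s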

And the worst part is that sometimes the cluster fails to re-form at all: I can see two different bind_addr values for the same cluster_name in JGROUPSPING.
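When that happens, one hedged workaround is to clear the stale discovery row by hand before restarting the stuck member; this is only safe while the affected node is down, and the own_addr value below is just an example taken from the table above:

    -- remove a stale member row by hand (only while that node is down)
    DELETE FROM clustered_jdbc.JGROUPSPING
     WHERE own_addr = 'b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35';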

best regards,
Serhiy.

kochen commented Jun 8, 2024

Hey @gray380, I see you also have the same bind_addr value for multiple instances.

Good idea about running a stress test!
Though before doing that, I attempted a much simpler test:

  • spin up 2 instances
  • spin up a reverse proxy (round robin both KC instances)
  • access the admin console
  • shut down one of the instances
  • refresh the admin console page (doesn't matter which)
    This results in the entire KC instance refreshing, and after a few long seconds it becomes available again (as if it were bootstrapping all over again).
    The expectation here would be that the cluster can easily survive a single-instance failure, but it seems it does not (a rough shell sketch of this sequence follows below).
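A rough shell sketch of that test sequence; tooling, service names, ports, and paths are all illustrative assumptions:

    # bring up both instances plus the proxy, then kill one and keep polling
    docker compose up -d keycloak-1 keycloak-2 proxy
    docker compose stop keycloak-2
    # repeated refreshes through the proxy should ideally keep answering;
    # instead the surviving instance stalls for a few seconds
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w '%{http_code}\n' http://localhost/admin/
      sleep 1
    done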

Were you able to sort this out?

kochen commented Jun 15, 2024

@ivangfr is this an expected behaviour (in general on a keycloak-cluster setup and/or with JDBC_PING)?
