
Cluster stress tests. #20

Open · gray380 opened this issue Jan 4, 2023 · 2 comments

gray380 commented Jan 4, 2023

Hi,

I'm testing Postgres and two Keycloak instances under Docker Swarm, with Traefik as a load balancer, all running in the same Docker overlay network.

Keycloak 1:

    environment:
      - PROXY_ADDRESS_FORWARDING=true
      - KC_DB=postgres
      - KC_DB_URL_HOST=keycloak-postgres
      - KC_DB_URL_DATABASE=keycloak
      - KC_DB_SCHEMA=clustered_jdbc
      - KC_CACHE_CONFIG_FILE=cache-ispn-jdbc-ping.xml
      - JGROUPS_DISCOVERY_EXTERNAL_IP=keycloak-jdbc1
      - KC_LOG_LEVEL=INFO,org.infinispan:DEBUG,org.jgroups:DEBUG

Keycloak 2:

    environment:
      - PROXY_ADDRESS_FORWARDING=true
      - KC_DB=postgres
      - KC_DB_URL_HOST=keycloak-postgres
      - KC_DB_URL_DATABASE=keycloak
      - KC_DB_SCHEMA=clustered_jdbc
      - KC_CACHE_CONFIG_FILE=cache-ispn-jdbc-ping.xml
      - JGROUPS_DISCOVERY_EXTERNAL_IP=keycloak-jdbc2
      - KC_LOG_LEVEL=INFO,org.infinispan:DEBUG,org.jgroups:DEBUG
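For context, a minimal sketch of how these environment blocks might sit in the swarm stack file; the image name, network name, and Traefik labels here are illustrative assumptions, not taken from the original setup:

    services:
      keycloak-jdbc1:
        image: ivangfr/keycloak-clustered  # assumed image; adjust to the one actually used
        environment:
          # ...the Keycloak 1 variables listed above...
        networks:
          - common
        deploy:
          labels:
            - traefik.enable=true
            - traefik.http.services.keycloak.loadbalancer.server.port=8080
    networks:
      common:
        driver: overlay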

JGROUPSPING table

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated           |                                                ping_data                                                 
--------------------------------------+--------------+----------------+----------------------------+----------------------------------------------------------------------------------------------------------
 b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 | ISPN         | keycloak-jdbc2 | 2023-01-03 19:05:10.402106 | \x02b1d1c14d0bf38b35b1a481f296a94cf7030100146b6579636c6f616b2d6a646263322d343337353910040a0014b11e78ffff
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc2 | 2023-01-03 19:05:10.429296 | \x02a8110f90a76b99c1c4845a5e65264664010100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
(2 rows)
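Note that both rows above carry bind_addr = keycloak-jdbc2 even though they describe two different members (kochen picks up on this below). A quick hypothetical check for such duplicates, assuming direct psql access:

    keycloak=# SELECT bind_addr, count(*) FROM clustered_jdbc.JGROUPSPING GROUP BY bind_addr HAVING count(*) > 1;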

And I'm trying to run stress tests.
It's okay when one Keycloak leaves the cluster (Traefik sends requests to the surviving one):

docker service scale common_keycloak-jdbc2=0

Some logs:

Expected behavior:

DEBUG [org.jgroups.protocols.JDBC_PING] (Thread-15) Removed b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (Thread-4) Removed b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35 for cluster ISPN from database

Unexpected behavior:

DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Removed c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN from database
DEBUG [org.jgroups.protocols.JDBC_PING] (jgroups-362,keycloak-jdbc1-27890) Inserted c4845a5e-6526-4664-a811-0f90a76b99c1 for cluster ISPN into database
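To watch this churn directly in the discovery table while the test runs, something like the following helps (a hypothetical one-liner, assuming psql can reach the keycloak database):

    # poll the discovery table once a second during the scale down/up test
    watch -n1 "psql -U keycloak -d keycloak -c \
      'SELECT own_addr, bind_addr, updated FROM clustered_jdbc.JGROUPSPING'"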

JGROUPSPING table:

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated          |                                                ping_data                                                 
--------------------------------------+--------------+----------------+---------------------------+----------------------------------------------------------------------------------------------------------
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:37:22.26674 | \x02a8110f90a76b99c1c4845a5e65264664030100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
(1 row)

but when it comes back, it takes some time to re-form the cluster:

docker service scale common_keycloak-jdbc2=1

Some logs:

DEBUG [org.jgroups.protocols.TCP] (TQ-Bundler-7,keycloak-jdbc2-35303) JGRP000034: keycloak-jdbc2-35303: failure sending message to keycloak-jdbc1-27890: java.net.ConnectException: Connection refused (Connection refused)
DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-10,keycloak-jdbc2-35303) keycloak-jdbc2-35303: broadcasting suspect(keycloak-jdbc1-27890)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-380,keycloak-jdbc1-27890) keycloak-jdbc1-27890: suspecting [keycloak-jdbc2-35303]
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-380,keycloak-jdbc1-27890) keycloak-jdbc1-27890: broadcasting unsuspect(keycloak-jdbc2-35303)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-21,keycloak-jdbc2-35303) keycloak-jdbc2-35303: suspecting [keycloak-jdbc1-27890]
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-25,keycloak-jdbc2-35303) keycloak-jdbc2-35303: broadcasting unsuspect(keycloak-jdbc1-27890)
DEBUG [org.jgroups.protocols.FD_SOCK] (jgroups-382,keycloak-jdbc1-27890) keycloak-jdbc1-27890: broadcasting unsuspect(keycloak-jdbc2-35303)
...
a series of Removed/Inserted for cluster ISPN log entries (from both nodes)

JGROUPSPING table

keycloak=# SELECT * FROM clustered_jdbc.JGROUPSPING;
               own_addr               | cluster_name |   bind_addr    |          updated           |                                                ping_data                                                 
--------------------------------------+--------------+----------------+----------------------------+----------------------------------------------------------------------------------------------------------
 c4845a5e-6526-4664-a811-0f90a76b99c1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:48:34.328075 | \x02a8110f90a76b99c1c4845a5e65264664030100146b6579636c6f616b2d6a646263312d323738393010040a0002ee1e78ffff
 262c1029-ab8a-481c-9b51-8e78c1906ef1 | ISPN         | keycloak-jdbc1 | 2023-01-04 09:48:34.347452 | \x029b518e78c1906ef1262c1029ab8a481c010100146b6579636c6f616b2d6a646263322d333533303310040a0014b41e78ffff
(2 rows)

So while all of this connection-refused, suspect/unsuspect, remove/insert activity is going on, the container is already up and running, and Traefik sends part of the requests to the not-yet-ready Keycloak instance.
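A health check on the Keycloak services should keep Swarm (and therefore Traefik) from routing to an instance that is still bootstrapping. A minimal sketch, assuming KC_HEALTH_ENABLED is set and curl exists in the image (the stock Keycloak image ships without curl, so a different probe command may be needed):

    healthcheck:
      # only report healthy once Keycloak answers its readiness probe;
      # on newer Keycloak releases the health endpoints moved to the
      # management port (9000) instead of 8080
      test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health/ready || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 60s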

And the worst part is that sometimes the cluster fails to re-form at all: I can see two different bind_addr values for the same cluster_name in JGROUPSPING.
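When that happens, one hedged workaround is to clear the stale discovery row by hand before restarting the stuck member; this is only safe while the affected node is down, and the own_addr value below is just an example taken from the table above:

    -- remove a stale member row by hand (only while that node is down)
    DELETE FROM clustered_jdbc.JGROUPSPING
     WHERE own_addr = 'b1a481f2-96a9-4cf7-b1d1-c14d0bf38b35';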

best regards,
Serhiy.

kochen commented Jun 8, 2024

Hey @gray380, I see you also have the same bind_addr value for multiple instances.

Good idea about running a stress test!
Though before doing that, I attempted a much simpler test:

  • spin up 2 instances
  • spin up a reverse proxy (round robin both KC instances)
  • access the admin console
  • shut down one of the instances
  • refresh the admin console page (doesn't matter which)
    This results in the entire KC instance refreshing, and after a few long seconds it becomes available again (as if it were bootstrapping all over again).
    The expectation here would be that the cluster can easily survive a single-instance failure, but it seems it does not (a rough shell sketch of this sequence follows below).
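A rough shell sketch of that test sequence; tooling, service names, ports, and paths are all illustrative assumptions:

    # bring up both instances plus the proxy, then kill one and keep polling
    docker compose up -d keycloak-1 keycloak-2 proxy
    docker compose stop keycloak-2
    # repeated refreshes through the proxy should ideally keep answering;
    # instead the surviving instance stalls for a few seconds
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w '%{http_code}\n' http://localhost/admin/
      sleep 1
    done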

Were you able to sort this out?

kochen commented Jun 15, 2024

@ivangfr is this an expected behaviour (in general on a keycloak-cluster setup and/or with JDBC_PING)?
