scylla-bench fails to reconnect after altering table #114

Open
soyacz opened this issue Dec 15, 2022 · 8 comments
soyacz commented Dec 15, 2022

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221209.6075e01312a5 with build-id 0e5d044b8f9e5bdf7f53cc3c1e959fab95bf027c

Cluster size: 9 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-counters-multidc-master-db-node-7785df01-9 (54.157.115.162 | 10.12.2.62) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-8 (3.238.92.3 | 10.12.2.95) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-7 (3.236.190.51 | 10.12.0.119) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-6 (54.212.64.38 | 10.15.0.77) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-5 (35.92.94.31 | 10.15.3.207) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-4 (34.219.193.110 | 10.15.3.94) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-3 (52.213.121.166 | 10.4.0.42) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-2 (54.229.18.181 | 10.4.2.143) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-1 (34.245.75.18 | 10.4.0.195) (shards: 7)

OS / Image: ami-0b85d6f35bddaff65 ami-0a1ff01b931943772 ami-08e5c2ae0089cade3 (aws: eu-west-1)

Test: longevity-counters-6h-multidc-test
Test id: 7785df01-a1fe-483a-beb7-2f63b9044b87
Test name: scylla-master/raft/longevity-counters-6h-multidc-test
Test config file(s):

Issue description

The counters test in the multidc scenario fails persistently after altering the table.
For example, after running ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45374057709882093; or ALTER TABLE scylla_bench.test_counters WITH read_repair_chance = 0.9; or even ALTER TABLE scylla_bench.test_counters WITH comment = 'IHQS6RAYS5VQ6CQZYBYEX1GP';
After such changes, scylla-bench fails the test with this error:

2022/12/09 15:26:29 error: failed to connect to "[HostInfo hostname=\"10.12.0.119\" connectAddress=\"10.12.0.119\" peer=\"<nil>\" rpc_address=\"10.12.0.119\" broadcast_address=\"10.12.0.119\" preferred_ip=\"<nil>\" connect_addr=\"10.12.0.119\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-eastscylla_node_east\" rack=\"1a\" host_id=\"ec773dfb-ef87-4ab8-abbf-190e3e082e4c\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout

Later the connection appears to recover, so the connection issues are not permanent, but the failure is enough to raise a critical error that ends the test.
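
For reference, a minimal Go sketch of the trigger, using the gocql driver that scylla-bench is built on (the contact point and comment value are placeholders; this illustrates the scenario, it is not scylla-bench's code):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; any node of the test cluster would do.
	cluster := gocql.NewCluster("10.12.0.119")
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()

	// A harmless schema change like the ones above; issued while
	// scylla-bench is running, it preceded the reconnect failures.
	err = session.Query(`ALTER TABLE scylla_bench.test_counters WITH comment = 'repro'`).Exec()
	if err != nil {
		log.Fatalf("alter failed: %v", err)
	}
}
```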

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7785df01-a1fe-483a-beb7-2f63b9044b87
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7785df01-a1fe-483a-beb7-2f63b9044b87

Logs:

| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-longevity-counters-6h-multidc-test-scylla-per-server-metrics-nemesis-20221209_161803-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-overview-20221209_161654-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_162553 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/db-cluster-7785df01.tar.gz |
| 20221209_162553 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/loader-set-7785df01.tar.gz |
| 20221209_162553 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/monitor-set-7785df01.tar.gz |
| 20221209_162553 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/sct-runner-7785df01.tar.gz |

Jenkins job URL

fruch (Contributor) commented Dec 15, 2022

Maybe

Timeout:		 5s

isn't enough for this test case?
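
If that 5s value is the driver-level timeout, note that gocql keeps the per-query deadline and the connection-startup deadline in separate knobs, and the error above comes from the latter. A sketch, assuming scylla-bench forwards -timeout to ClusterConfig.Timeout (the exact wiring would need checking in scylla-bench's source):

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func newClusterConfig() *gocql.ClusterConfig {
	cluster := gocql.NewCluster("10.12.0.119") // placeholder contact point
	// Per-query timeout; plausibly what scylla-bench's -timeout maps to.
	cluster.Timeout = 5 * time.Second
	// Deadline for the connection-startup handshake; this is the timer behind
	// "gocql: no response to connection startup within timeout".
	cluster.ConnectTimeout = 15 * time.Second
	return cluster
}
```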

soyacz added a commit to soyacz/scylla-cluster-tests that referenced this issue Dec 15, 2022
We face an issue with disconnecting from the cluster after simple table
modifications in scylla-bench (counters multidc test):
scylladb/scylla-bench#114

It was proven that s-b behaves correctly in version 0.1.3.
This commit pins s-b for the counters multidc test.
soyacz (Author) commented Dec 15, 2022

I'm not sure; the disconnections sometimes persisted for 2 minutes. We would need to test it.

soyacz (Author) commented Dec 15, 2022

I tried timeout settings like this: -timeout 15s -retry-interval=80ms,5s -retry-number=20, and it failed anyway.
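
For context, those flags plausibly correspond to gocql's exponential-backoff retry policy, roughly as sketched below (an assumption about scylla-bench's wiring, not a quote of its source). A driver retry policy governs query retries, though, so it would not cover a failure during connection startup, which may be why it made no difference here:

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func withRetries(cluster *gocql.ClusterConfig) {
	// Roughly: -retry-number=20 and -retry-interval=80ms,5s.
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 20,
		Min:        80 * time.Millisecond,
		Max:        5 * time.Second,
	}
}
```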

KnifeyMoloko commented:

While running a large-partitions test I encountered a similar problem. Not sure if it's tied to this, but it's a possibility. After the pre-write workload, when starting one of the stress workloads, we got:

2022-12-08 21:08:42.623: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=a5a01a5d-ef1f-4c96-9836-7b6b23c0d77e duration=10s: node=Node longevity-large-partitions-4d-maste-loader-node-a967ab57-2 [34.249.171.113 | 10.4.2.108] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000  -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.4.1.5,10.4.2.90,10.4.2.71,10.4.1.191
errors:
Stress command completed with bad status 1: 2022/12/08 21:08:42 gocql: unable to create session: unable to fetch peer host info: Operation timed

Running the same job with a pinned version of scylla-bench (0.1.14) did not reproduce this issue. Similarly, a run without Raft did not fail at this point, so there might be some flakiness involved here.

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221208.a076ceef97d5 with build-id 020ec076898a692651fd48edfb1920fc190cd81e

Cluster size: 4 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

  • longevity-large-partitions-4d-maste-db-node-a967ab57-4 (3.252.203.198 | 10.4.1.191) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-3 (18.203.69.233 | 10.4.2.71) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-2 (52.212.226.132 | 10.4.2.90) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-1 (54.194.73.19 | 10.4.1.5) (shards: 10)

OS / Image: ami-063cdd564cd2fbe46 (aws: eu-west-1)

Test: longevity-large-partition-4days-test
Test id: a967ab57-4860-4f31-8b0a-d940b857542e
Test name: scylla-master/raft/longevity-large-partition-4days-test
Test config file(s):


  • Restore Monitor Stack command: $ hydra investigate show-monitor a967ab57-4860-4f31-8b0a-d940b857542e
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a967ab57-4860-4f31-8b0a-d940b857542e

Logs:

Jenkins job URL

roydahan (Collaborator) commented:

@avelanarius, we suspect there is a regression, or at least a behavior change, in how s-b works for us with a later (latest?) gocql driver.
We're kind of lost here on how to debug it or how to make progress.
Can you please help us or advise us on how to debug it further?

fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jan 25, 2023
We face an issue with disconnecting from the cluster after simple table
modifications in scylla-bench (counters multidc test):
scylladb/scylla-bench#114

It was proven that s-b behaves correctly in version 0.1.3.
This commit pins s-b for the counters multidc test.
fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Jun 25, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
vponomaryov pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jun 27, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
vponomaryov pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jun 27, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
(cherry picked from commit b1b3fe0)
juliayakovlev commented:

scylla-bench failed with unable to create session: unable to fetch peer host info even though all nodes were up and healthy:

< t:2024-07-25 16:14:19,760 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,761 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec 5e5d3d02c589373354dd8ad087985ca17a7db44f6cd5f9a9d115641b82f41fb0 /bin/sh -c 'scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=750 -partition-offset=1251 -clustering-row-count=200000 -clustering-row-size=uniform:100..8192 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=90s -iterations=0 -duration=720m  -error-at-row-limit 1000 -nodes 10.142.0.207,10.142.0.236,10.142.0.240,10.142.0.242,10.142.0.248'"; Exit status: 1
< t:2024-07-25 16:14:19,761 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.

< t:2024-07-25 16:14:19,763 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 16:14:19.761: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=53d8c0a0-2870-4453-b7e3-7df585f03411 during_nemesis=RunUniqueSequence duration=18s: node=Node longevity-large-partitions-200k-pks-loader-node-53145d7f-0-1 [35.196.217.128 | 10.142.0.250]
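
One way to check whether a node is genuinely slow to answer the metadata query that fails here is to run it by hand with a generous timeout. A hedged diagnostic sketch (addresses taken from the command above; DisableInitialHostLookup is set so that session creation itself skips the system.peers fetch that is timing out):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.142.0.207")
	cluster.Timeout = 90 * time.Second
	cluster.Consistency = gocql.One
	// Skip the driver's own peer discovery so the session can come up
	// even while that internal query is slow.
	cluster.DisableInitialHostLookup = true

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("session: %v", err)
	}
	defer session.Close()

	// The same table the driver reads during host discovery.
	start := time.Now()
	var peer string
	iter := session.Query(`SELECT peer FROM system.peers`).Iter()
	for iter.Scan(&peer) {
		fmt.Println("peer:", peer)
	}
	if err := iter.Close(); err != nil {
		log.Fatalf("system.peers query failed: %v", err)
	}
	fmt.Println("system.peers answered in", time.Since(start))
}
```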

Packages

Scylla version: 2023.1.11-20240725.11a2022bd6ed with build-id a0cab71f78c44bb0b694d46800fbcaef02607251

Kernel Version: 5.15.0-1065-gcp


Installation details

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-8 (34.23.81.52 | 10.142.0.69) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-7 (35.237.229.97 | 10.142.0.12) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-6 (35.229.67.161 | 10.142.0.3) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-5 (35.196.146.159 | 10.142.0.248) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-4 (35.237.38.63 | 10.142.0.242) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-3 (35.196.86.69 | 10.142.0.240) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-2 (35.227.121.244 | 10.142.0.236) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-1 (35.227.87.19 | 10.142.0.207) (shards: 14)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6980420640571389317 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 53145d7f-6918-4728-acc6-6236916d8d08
Test name: enterprise-2023.1/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 53145d7f-6918-4728-acc6-6236916d8d08
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 53145d7f-6918-4728-acc6-6236916d8d08

Logs:

Jenkins job URL
Argus

roydahan (Collaborator) commented:

I'm trying to understand whether this is a scylla-bench issue; it looks like a gocql issue to me.
@sylwiaszunejko / @dkropachev can you please take a look at this one?

fruch (Contributor) commented Jul 28, 2024

> I'm trying to understand whether this is a scylla-bench issue; it looks like a gocql issue to me.
> @sylwiaszunejko / @dkropachev can you please take a look at this one?

It's probably caused by Scylla slowing down; the driver's internal queries might not have generous enough timeouts set up.

So, as always, it's a combination of a Scylla issue, how strict we want to be with timeouts, and how configurable those internal queries are.
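
To make the "how configurable" point concrete: at the gocql level, these are roughly the knobs a caller can turn; whether the driver's internal metadata queries honor all of them in every code path is exactly the open question (a sketch under that assumption, not a statement about gocql internals):

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func tolerantCluster(hosts ...string) *gocql.ClusterConfig {
	cluster := gocql.NewCluster(hosts...)
	// Per-query timeout; internal metadata queries are assumed to share it.
	cluster.Timeout = 90 * time.Second
	// Startup-handshake deadline for new connections.
	cluster.ConnectTimeout = 30 * time.Second
	// Last resort: skip the system.peers fetch at session creation entirely,
	// at the cost of relying solely on the listed contact points.
	cluster.DisableInitialHostLookup = true
	return cluster
}
```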
