scylla-bench fails to reconnect after altering table #114

Open
soyacz opened this issue Dec 15, 2022 · 8 comments
soyacz commented Dec 15, 2022

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221209.6075e01312a5 with build-id 0e5d044b8f9e5bdf7f53cc3c1e959fab95bf027c

Cluster size: 9 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-counters-multidc-master-db-node-7785df01-9 (54.157.115.162 | 10.12.2.62) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-8 (3.238.92.3 | 10.12.2.95) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-7 (3.236.190.51 | 10.12.0.119) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-6 (54.212.64.38 | 10.15.0.77) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-5 (35.92.94.31 | 10.15.3.207) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-4 (34.219.193.110 | 10.15.3.94) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-3 (52.213.121.166 | 10.4.0.42) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-2 (54.229.18.181 | 10.4.2.143) (shards: 7)
  • longevity-counters-multidc-master-db-node-7785df01-1 (34.245.75.18 | 10.4.0.195) (shards: 7)

OS / Image: ami-0b85d6f35bddaff65 ami-0a1ff01b931943772 ami-08e5c2ae0089cade3 (aws: eu-west-1)

Test: longevity-counters-6h-multidc-test
Test id: 7785df01-a1fe-483a-beb7-2f63b9044b87
Test name: scylla-master/raft/longevity-counters-6h-multidc-test
Test config file(s):

Issue description

The counters test in the multidc scenario fails persistently after altering the table.
For example, after running ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45374057709882093; or ALTER TABLE scylla_bench.test_counters WITH read_repair_chance = 0.9; or even ALTER TABLE scylla_bench.test_counters WITH comment = 'IHQS6RAYS5VQ6CQZYBYEX1GP';
After such changes, scylla-bench fails the test with this error:

2022/12/09 15:26:29 error: failed to connect to "[HostInfo hostname=\"10.12.0.119\" connectAddress=\"10.12.0.119\" peer=\"<nil>\" rpc_address=\"10.12.0.119\" broadcast_address=\"10.12.0.119\" preferred_ip=\"<nil>\" connect_addr=\"10.12.0.119\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-eastscylla_node_east\" rack=\"1a\" host_id=\"ec773dfb-ef87-4ab8-abbf-190e3e082e4c\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout

Later the connection appears to recover, so the connection issues are not permanent, but the failure is enough to raise a critical error that ends the test.
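
For reference, a minimal Go sketch of the trigger, using the gocql driver that scylla-bench is built on (the contact point and comment value are placeholders; this illustrates the scenario, it is not scylla-bench's code):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; any node of the test cluster would do.
	cluster := gocql.NewCluster("10.12.0.119")
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()

	// A harmless schema change like the ones above; issued while
	// scylla-bench is running, it preceded the reconnect failures.
	err = session.Query(`ALTER TABLE scylla_bench.test_counters WITH comment = 'repro'`).Exec()
	if err != nil {
		log.Fatalf("alter failed: %v", err)
	}
}
```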

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7785df01-a1fe-483a-beb7-2f63b9044b87
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7785df01-a1fe-483a-beb7-2f63b9044b87

Logs:

| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-longevity-counters-6h-multidc-test-scylla-per-server-metrics-nemesis-20221209_161803-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-overview-20221209_161654-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_162553 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/db-cluster-7785df01.tar.gz |
| 20221209_162553 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/loader-set-7785df01.tar.gz |
| 20221209_162553 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/monitor-set-7785df01.tar.gz |
| 20221209_162553 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/sct-runner-7785df01.tar.gz |

Jenkins job URL

fruch (Contributor) commented Dec 15, 2022

Maybe

Timeout:		 5s

isn't enough for this test case?
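
If that 5s value is the driver-level timeout, note that gocql keeps the per-query deadline and the connection-startup deadline in separate knobs, and the error above comes from the latter. A sketch, assuming scylla-bench forwards -timeout to ClusterConfig.Timeout (the exact wiring would need checking in scylla-bench's source):

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func newClusterConfig() *gocql.ClusterConfig {
	cluster := gocql.NewCluster("10.12.0.119") // placeholder contact point
	// Per-query timeout; plausibly what scylla-bench's -timeout maps to.
	cluster.Timeout = 5 * time.Second
	// Deadline for the connection-startup handshake; this is the timer behind
	// "gocql: no response to connection startup within timeout".
	cluster.ConnectTimeout = 15 * time.Second
	return cluster
}
```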

soyacz added a commit to soyacz/scylla-cluster-tests that referenced this issue Dec 15, 2022
We face an issue with disconnecting from the cluster after simple table
modifications in scylla-bench (counters multidc test):
scylladb/scylla-bench#114

It was proven that s-b behaves correctly in version 0.1.3.
This commit pins s-b for the counters multidc test.
soyacz (Author) commented Dec 15, 2022

I'm not sure; the disconnections sometimes persisted for 2 minutes. We would need to test it.

soyacz (Author) commented Dec 15, 2022

I tried timeout settings like this: -timeout 15s -retry-interval=80ms,5s -retry-number=20, and it failed anyway.
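
For context, those flags plausibly correspond to gocql's exponential-backoff retry policy, roughly as sketched below (an assumption about scylla-bench's wiring, not a quote of its source). A driver retry policy governs query retries, though, so it would not cover a failure during connection startup, which may be why it made no difference here:

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func withRetries(cluster *gocql.ClusterConfig) {
	// Roughly: -retry-number=20 and -retry-interval=80ms,5s.
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 20,
		Min:        80 * time.Millisecond,
		Max:        5 * time.Second,
	}
}
```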

KnifeyMoloko commented:

While running a large-partitions test I encountered a similar problem. Not sure if it's tied to this, but it's a possibility. After the pre-write workload, when starting one of the stress workloads, we got:

2022-12-08 21:08:42.623: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=a5a01a5d-ef1f-4c96-9836-7b6b23c0d77e duration=10s: node=Node longevity-large-partitions-4d-maste-loader-node-a967ab57-2 [34.249.171.113 | 10.4.2.108] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000  -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.4.1.5,10.4.2.90,10.4.2.71,10.4.1.191
errors:
Stress command completed with bad status 1: 2022/12/08 21:08:42 gocql: unable to create session: unable to fetch peer host info: Operation timed

Running the same job with a pinned version of scylla-bench (0.1.14) did not reproduce this issue. Similarly, a run without Raft did not fail at this point, so there might be some flakiness involved here.

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221208.a076ceef97d5 with build-id 020ec076898a692651fd48edfb1920fc190cd81e

Cluster size: 4 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

  • longevity-large-partitions-4d-maste-db-node-a967ab57-4 (3.252.203.198 | 10.4.1.191) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-3 (18.203.69.233 | 10.4.2.71) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-2 (52.212.226.132 | 10.4.2.90) (shards: 10)
  • longevity-large-partitions-4d-maste-db-node-a967ab57-1 (54.194.73.19 | 10.4.1.5) (shards: 10)

OS / Image: ami-063cdd564cd2fbe46 (aws: eu-west-1)

Test: longevity-large-partition-4days-test
Test id: a967ab57-4860-4f31-8b0a-d940b857542e
Test name: scylla-master/raft/longevity-large-partition-4days-test
Test config file(s):


  • Restore Monitor Stack command: $ hydra investigate show-monitor a967ab57-4860-4f31-8b0a-d940b857542e
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a967ab57-4860-4f31-8b0a-d940b857542e

Logs:

Jenkins job URL

roydahan (Collaborator) commented:

@avelanarius, we suspect there is a regression, or at least a behavior change, in how s-b works for us with a later (latest?) gocql driver.
We're kind of lost here on how to debug it or how to make progress.
Can you please help us or advise us on how to debug it further?

fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jan 25, 2023
We face an issue with disconnecting from the cluster after simple table
modifications in scylla-bench (counters multidc test):
scylladb/scylla-bench#114

It was proven that s-b behaves correctly in version 0.1.3.
This commit pins s-b for the counters multidc test.
fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Jun 25, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
vponomaryov pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jun 27, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
vponomaryov pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Jun 27, 2023
The case was pinned to an older s-b because of scylladb/scylla-bench#114,
but since then we implemented s-b retries that should fix
most of the timeouts observed in the 2023.1 run.

Ref: scylladb/scylla-bench#114
(cherry picked from commit b1b3fe0)
juliayakovlev commented:

scylla-bench failed with unable to create session: unable to fetch peer host info even though all nodes were up and healthy:

< t:2024-07-25 16:14:19,760 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,761 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec 5e5d3d02c589373354dd8ad087985ca17a7db44f6cd5f9a9d115641b82f41fb0 /bin/sh -c 'scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=750 -partition-offset=1251 -clustering-row-count=200000 -clustering-row-size=uniform:100..8192 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=90s -iterations=0 -duration=720m  -error-at-row-limit 1000 -nodes 10.142.0.207,10.142.0.236,10.142.0.240,10.142.0.242,10.142.0.248'"; Exit status: 1
< t:2024-07-25 16:14:19,761 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.

< t:2024-07-25 16:14:19,763 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 16:14:19.761: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=53d8c0a0-2870-4453-b7e3-7df585f03411 during_nemesis=RunUniqueSequence duration=18s: node=Node longevity-large-partitions-200k-pks-loader-node-53145d7f-0-1 [35.196.217.128 | 10.142.0.250]
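
One way to check whether a node is genuinely slow to answer the metadata query that fails here is to run it by hand with a generous timeout. A hedged diagnostic sketch (addresses taken from the command above; DisableInitialHostLookup is set so that session creation itself skips the system.peers fetch that is timing out):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.142.0.207")
	cluster.Timeout = 90 * time.Second
	cluster.Consistency = gocql.One
	// Skip the driver's own peer discovery so the session can come up
	// even while that internal query is slow.
	cluster.DisableInitialHostLookup = true

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("session: %v", err)
	}
	defer session.Close()

	// The same table the driver reads during host discovery.
	start := time.Now()
	var peer string
	iter := session.Query(`SELECT peer FROM system.peers`).Iter()
	for iter.Scan(&peer) {
		fmt.Println("peer:", peer)
	}
	if err := iter.Close(); err != nil {
		log.Fatalf("system.peers query failed: %v", err)
	}
	fmt.Println("system.peers answered in", time.Since(start))
}
```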

Packages

Scylla version: 2023.1.11-20240725.11a2022bd6ed with build-id a0cab71f78c44bb0b694d46800fbcaef02607251

Kernel Version: 5.15.0-1065-gcp


Installation details

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-8 (34.23.81.52 | 10.142.0.69) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-7 (35.237.229.97 | 10.142.0.12) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-6 (35.229.67.161 | 10.142.0.3) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-5 (35.196.146.159 | 10.142.0.248) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-4 (35.237.38.63 | 10.142.0.242) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-3 (35.196.86.69 | 10.142.0.240) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-2 (35.227.121.244 | 10.142.0.236) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-53145d7f-0-1 (35.227.87.19 | 10.142.0.207) (shards: 14)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6980420640571389317 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 53145d7f-6918-4728-acc6-6236916d8d08
Test name: enterprise-2023.1/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 53145d7f-6918-4728-acc6-6236916d8d08
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 53145d7f-6918-4728-acc6-6236916d8d08

Logs:

Jenkins job URL
Argus

roydahan (Collaborator) commented:

I'm trying to understand whether this is a scylla-bench issue; it looks like a gocql issue to me.
@sylwiaszunejko / @dkropachev can you please take a look at this one?

fruch (Contributor) commented Jul 28, 2024

> I'm trying to understand whether this is a scylla-bench issue; it looks like a gocql issue to me.
> @sylwiaszunejko / @dkropachev can you please take a look at this one?

It's probably caused by Scylla slowing down; the driver's internal queries might not have generous enough timeouts set up.

So, as always, it's a combination of a Scylla issue, how strict we want to be with timeouts, and how configurable those internal queries are.
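
To make the "how configurable" point concrete: at the gocql level, these are roughly the knobs a caller can turn; whether the driver's internal metadata queries honor all of them in every code path is exactly the open question (a sketch under that assumption, not a statement about gocql internals):

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

func tolerantCluster(hosts ...string) *gocql.ClusterConfig {
	cluster := gocql.NewCluster(hosts...)
	// Per-query timeout; internal metadata queries are assumed to share it.
	cluster.Timeout = 90 * time.Second
	// Startup-handshake deadline for new connections.
	cluster.ConnectTimeout = 30 * time.Second
	// Last resort: skip the system.peers fetch at session creation entirely,
	// at the cost of relying solely on the listed contact points.
	cluster.DisableInitialHostLookup = true
	return cluster
}
```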
