
Getting "gocql no hosts available in the pool" when we are continuously streaming data #1806

Open
sathyendranv opened this issue Aug 27, 2024 · 3 comments



sathyendranv commented Aug 27, 2024

Please answer these questions before submitting your issue. Thanks!

What version of Cassandra are you using?

4.1.5

What version of Gocql are you using?

1.6.0

What version of Go are you using?

1.22

What did you do?

We are streaming metadata (JSON as a string) and blob data (blob) to a Cassandra DB as a continuous write. Each stream produces around 20~30 data points per second, and with up to 4~6 concurrent streams the total is around 120 data points per second.

The metadata is around 200 KB per data point and the blob is ~500 KB. Also, it is a single-node Cassandra DB running in a container.

What did you expect to see?

All the writes to succeed.

What did you see instead?

Initially all the writes succeed, and after a few minutes they start failing with
gocql no response received from cassandra within timeout period, followed by
gocql no hosts available in the pool


If you are having connectivity related issues please share the following additional information

Describe your Cassandra cluster

Single-node cluster running in a Docker container

System setting

12th Gen Intel(R) Core(TM) i9-12900TE
Thread(s) per core: 2
Core(s) per socket: 16
RAM: 64 GB
Storage: SSD 1TB

Code Snippet

initialization:

		dbconn.cluster = gocql.NewCluster(container_name + ":9042") // port goes inside the host string
		dbconn.cluster.Consistency = gocql.Quorum
		dbconn.cluster.ProtoVersion = ProtoVersion
		if !devMode {
			dbconn.cluster.SslOpts = &gocql.SslOptions{
				secrets....
			}
		}
		var err error
		dbconn.cluster.DisableInitialHostLookup = false
		dbconn.cluster.Timeout = time.Second * 300
		dbconn.cluster.NumConns = 2
		dbconn.cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 2}
		dbconn.cluster.Keyspace = dbconn.dbMgr.DbInfo.Database
		dbconn.cluster.ReconnectInterval = time.Millisecond * 500
		dbconn.cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
			gocql.DCAwareRoundRobinPolicy("datacenter1"),
		)
		dbconn.session, err = dbconn.cluster.CreateSession()
		if err != nil {
			glog.Errorf("Not able to CreateSession: %v", err)
			return common.EFAIL
		}

Write code. It is called from multiple goroutines when streaming data is received:

ctx := context.Background()
err = dbconn.session.Query(qur,
	temp.TableName, temp.BlobIdentifier, temp.QueryStr, temp.BlobData).WithContext(ctx).Exec()
@caineblood

The comment above asking you to download a file is malware intended to steal your account; do not, under any circumstances, download or run it. The post needs to be removed. If you have attempted to run it, please have your system cleaned and your account secured immediately.


ribaraka commented Oct 7, 2024

@caineblood which file are you referring to?


@ribaraka
Copy link

ribaraka commented Oct 7, 2024

@sathyendranv, I was able to simulate the case you reported. I ran multiple iterations using the GoCQL client configuration you provided, and everything worked fine—no errors were generated by the GoCQL driver when using the same setup of 4-6 streams.

However, I did manage to reproduce the exact error you mentioned ("gocql no hosts available in the pool" and "no response received from cassandra within timeout period"), but only when there were 5000 concurrent streams. This leads me to believe that the issue is not with the driver itself but more likely related to resource limitations within your environment.

Running a single-node Cassandra cluster in a Docker container can lead to performance degradation under heavy load. Docker introduces additional overhead, and a single-node setup concentrates all of the write pressure on one machine. Docker's resource limits compound this: when CPU, memory, and I/O are constrained, the environment may struggle to meet Cassandra's high-throughput demands.

The setup of 120 data points per second at ~700 KB each (200 KB metadata + 500 KB blob) amounts to roughly 84 MB/s of writes, which causes significant CPU and memory usage and can lead to slow processing of write requests and timeouts. This load is substantial and could easily overwhelm one node, especially in a Docker environment with limited resources.

Disk I/O saturation is another likely cause. Cassandra's write-intensive workload involves commit log writes, memtable flushes, and SSTable compactions, all of which heavily utilize disk resources. If the disk cannot keep up with these operations, it will result in increased write latency and timeouts in the client application. High disk I/O from continuous writes can significantly degrade performance.

With limited resources, Cassandra takes longer to process each request, so connections stay occupied for longer and gocql's connection pool is eventually exhausted. The net result is that Cassandra slows down and fails to keep up with the incoming data load. Two client-side mitigations worth trying, sketched below, are a per-query deadline and a cap on the number of in-flight writes.
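For what it's worth, here is a minimal client-side sketch of both mitigations. It is an illustration under my own assumptions, not code from your application; the 10-second deadline and the cap of 32 in-flight writes are arbitrary values to tune.

package db // hypothetical package name, for illustration only

import (
	"context"
	"time"

	"github.com/gocql/gocql"
)

// inflight caps the number of concurrent writes; 32 is an illustrative value.
var inflight = make(chan struct{}, 32)

// boundedWrite applies back-pressure (it blocks once the cap is reached) and a
// per-query deadline, so a slow node fails queries quickly instead of holding
// pool connections for the full cluster timeout.
func boundedWrite(session *gocql.Session, stmt string, values ...interface{}) error {
	inflight <- struct{}{}        // acquire a slot
	defer func() { <-inflight }() // release it when the write finishes

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return session.Query(stmt, values...).WithContext(ctx).Exec()
}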

I have developed a program that mimics this case, which can be used for further troubleshooting.

package main

import (
	"context"
	"log"
	"sync"
	"time"

	"github.com/gocql/gocql"
)

const (
	NumStreams        = 5
	DataPointsPerSec  = 25
	StreamingDuration = 600 * time.Second
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Consistency = gocql.Quorum
	cluster.ProtoVersion = 4

	cluster.DisableInitialHostLookup = false
	cluster.Timeout = time.Second * 300
	cluster.NumConns = 2
	cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 2}
	cluster.Keyspace = "my_keyspace"
	cluster.ReconnectInterval = time.Millisecond * 500
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.DCAwareRoundRobinPolicy("datacenter1"),
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("Failed to connect to Cassandra: %v", err)
	}
	defer session.Close()

	var wg sync.WaitGroup

	for streamID := 0; streamID < NumStreams; streamID++ {
		wg.Add(1)

		go func(streamID int) {
			defer wg.Done()
			log.Printf("Starting stream %d", streamID)

			insertTicker := time.NewTicker(time.Second / DataPointsPerSec)
			defer insertTicker.Stop()

			timeoutTimer := time.NewTimer(StreamingDuration)
			defer timeoutTimer.Stop()

			for {
				select {
				case <-insertTicker.C:
					err := writeData(session, streamID)
					if err != nil {
						log.Printf("Stream %d: Write failed: %v", streamID, err)
					} else {
						log.Printf("Write succeeded. streamID: %v", streamID)
					}

				case <-timeoutTimer.C:
					log.Printf("Stopping stream %d after %v", streamID, StreamingDuration)
					return
				}
			}

		}(streamID)
	}

	wg.Wait()
}

func writeData(session *gocql.Session, streamID int) error {
	metaData := make([]byte, 200*1024) // 200 KB
	blobData := make([]byte, 500*1024) // 500 KB

	query := `INSERT INTO my_table (id, metadata, blob_data) VALUES (?, ?, ?)`
	ctx := context.Background()
	return session.Query(query, streamID, metaData, blobData).WithContext(ctx).Exec()
}

Default setup: the program runs 5 streams, each writing 25 data points per second, for a total of 125 writes per second over a duration of 10 minutes, but you can tune it as you wish. Tell me if the program is accurate (or close to it) and whether you are able to reproduce the issue.
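For anyone who wants to run the program, here is a schema that matches its INSERT statement. The table and column names come from the snippet above; the types and replication settings are my assumptions (replication_factor 1 to match the single-node setup).

-- Hypothetical schema matching the repro program's INSERT statement.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS my_keyspace.my_table (
    id        int PRIMARY KEY,
    metadata  blob,
    blob_data blob
);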

General recommendations:

  • Resource tuning:
    Increase CPU and memory. Cassandra benefits from more CPU cores and memory, especially when handling high throughput, so increasing the available resources will likely help.
    Disk I/O monitoring and optimization: monitor disk I/O using tools like iostat or nodetool cfstats. If disk I/O turns out to be the bottleneck, moving to SSD storage or optimizing disk access patterns (e.g., tuning memtable flush intervals and compaction thresholds) could alleviate the issue.
  • Scaling out:
    For high-write workloads, consider scaling out from a single-node Cassandra cluster to a multi-node cluster. Cassandra is designed to scale horizontally, and adding more nodes will distribute the write load and reduce the likelihood of resource exhaustion on a single node.
  • Connection pool:
    Increase NumConns in the GoCQL client configuration to allow more concurrent connections between the client and Cassandra; see the sketch after this list. This can help avoid connection pool exhaustion under heavy load.
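As a starting point, here is a config sketch with these knobs adjusted. The specific numbers are assumptions to experiment with, not measured recommendations; the host and keyspace are taken from the repro program above.

// Illustrative tuning, not a drop-in fix: more connections per host and a
// much shorter timeout so failures surface quickly instead of after 300 s.
cluster := gocql.NewCluster("127.0.0.1")
cluster.Keyspace = "my_keyspace"
cluster.NumConns = 8                        // up from 2
cluster.Timeout = 15 * time.Second          // fail fast rather than wait 300 s
cluster.ReconnectInterval = 5 * time.Second // 500 ms is aggressive for a struggling node
cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 2}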

Feel free to reach out if you need further help or specific adjustments. If you managed to recreate the errors but not with the extreme edge cases I used (like 5000 streams), let me know 😄

@github-staff github-staff deleted a comment from YeGop0218 Oct 28, 2024
@github-staff github-staff deleted a comment from YeGop0218 Oct 28, 2024