
The Frontend (FE) component of StarRocks in a Kubernetes environment fails to start properly. The container is in a ContainerNotReady state, and the logs indicate an error due to an existing FE instance with the same host name. #571

Open
617450941 opened this issue Jul 20, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@617450941
Describe the bug

The Frontend (FE) component of StarRocks in a Kubernetes environment fails to start properly. The container is in a ContainerNotReady state, and the logs indicate an error due to an existing FE instance with the same host name.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the StarRocks FE component in a Kubernetes cluster.
  2. Ensure that the deployment includes readiness, liveness, and startup probes set to:
    • Initial delay: 0 seconds
    • Timeout: 1 second
    • HTTP request: GET /api/health on port 8030
  3. Observe the container status and logs for any errors.

Expected behavior

The FE component should start successfully, pass the health probes, and join the StarRocks cluster without any hostname conflicts, becoming ready for operation.

Please complete the following information

  • Operator Version: [e.g. v1.8.6]
  • Chart Name: [e.g. kube-starrocks]
  • Chart Version: [e.g. v1.8.6]

Additional Information

Container State and Probes

  • Readiness Probe: HTTP GET /api/health on port 8030, initial delay 0s, timeout 1s
  • Liveness Probe: HTTP GET /api/health on port 8030, initial delay 0s, timeout 1s
  • Startup Probe: HTTP GET /api/health on port 8030, initial delay 0s, timeout 1s

Container Status

  • State: Waiting
  • Container Restarts: 3 times
  • Ports: 8030/TCP, 9020/TCP, 9030/TCP
  • Volumes:
    • fe-meta (EmptyDir, ReadWrite, /opt/starrocks/fe/meta)
    • fe-log (EmptyDir, ReadWrite, /opt/starrocks/fe/log)

Logs

[Sat Jul 20 08:23:42 UTC 2024] Empty $CONFIGMAP_MOUNT_PATH env var, skip it!
[Sat Jul 20 08:23:42 UTC 2024] first start fe with meta not exist.
[Sat Jul 20 08:23:42 UTC 2024] FE service is alive, check if has leader ...
[Sat Jul 20 08:23:42 UTC 2024] Find leader: starrocks-cluster-fe-0.starrocks-cluster-fe-search.starrocks.svc.cluster.local!
[Sat Jul 20 08:23:42 UTC 2024] Add myself(starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local:9010) to leader as follower ...
ERROR 1064 (HY000) at line 1: Unexpected exception: FE with the same host: starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local already exists
[Sat Jul 20 08:23:42 UTC 2024] first start with no meta run start_fe.sh with additional options: ' --host_type FQDN --helper starrocks-cluster-fe-0.starrocks-cluster-fe-search.starrocks.svc.cluster.local:9010'
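
The conflicting hostname can be pulled straight out of that error line for comparison against the frontend list. A minimal sketch (the log line is hard-coded here for illustration):

```shell
# Extract the conflicting FE hostname from the "FE with the same host" error.
err='ERROR 1064 (HY000) at line 1: Unexpected exception: FE with the same host: starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local already exists'

# Strip everything up to "host: " and the trailing " already exists".
host=${err##*host: }
host=${host%% already exists*}
echo "$host"
```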

Root Cause

The error message indicates that there is already an FE instance with the same hostname (starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local) in the cluster, which prevents the new instance from joining.

Resolution Steps

  1. Verify the existing FE instances in the cluster to identify and resolve any duplicates.
  2. Ensure that each FE instance has a unique hostname.
  3. Adjust the deployment configuration if necessary to prevent hostname conflicts.
  4. Restart the FE instance and monitor its status to ensure it joins the cluster successfully.
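
Steps 1 and 2 can be sketched with the StarRocks `SHOW FRONTENDS` and `ALTER SYSTEM DROP FOLLOWER` statements, run against the leader FE via the MySQL client. Hostnames and ports below are taken from the logs in this report; adjust them for your cluster, and review the generated statement before executing it, since dropping the wrong frontend is destructive:

```shell
# Assumed names from the logs above; the mysql invocations are left
# commented out because they require a reachable cluster.
LEADER=starrocks-cluster-fe-0.starrocks-cluster-fe-search.starrocks.svc.cluster.local
STALE=starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local

# Build the statement that removes the stale follower registration
# (9010 is the FE edit_log_port), so the rebuilt pod can re-join.
sql="ALTER SYSTEM DROP FOLLOWER \"${STALE}:9010\";"
echo "$sql"

# mysql -h "$LEADER" -P 9030 -u root -e 'SHOW FRONTENDS;'   # inspect first
# mysql -h "$LEADER" -P 9030 -u root -e "$sql"              # then drop
```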

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrocks-cluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:latest
    replicas: 3
  starRocksBeSpec:
    image: starrocks/be-ubuntu:latest
    replicas: 3

@617450941 617450941 added the bug Something isn't working label Jul 20, 2024
@cnmac
cnmac commented Sep 2, 2024

I also encountered this problem. I deployed multiple environments in the same way; two were normal, but one output the error ERROR 1064 (HY000) at line 1: Unexpected exception: FE with the same host, which made it impossible to add a new FE instance to the cluster.

@yandongxiao
Collaborator

After the FE Pod starts, it will always attempt to join the StarRocks cluster, so it is normal for this error to appear after the Pod is rebuilt. This error is not the root cause of the FE Pod restarts.

ERROR 1064 (HY000) at line 1: Unexpected exception: FE with the same host: starrocks-cluster-fe-1.starrocks-cluster-fe-search.starrocks.svc.cluster.local already exists

See: https://github.com/StarRocks/starrocks/blob/ad51b419363f0ed97987e68630765188ea1cee6a/docker/dockerfiles/fe/fe_entrypoint.sh#L221-L226

From the code context, it seems you did not use a PVC to persist the FE metadata. Please see https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/doc/mount_persistent_volume_howto.md for more information.
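
A sketch of what persisting the FE meta directory might look like, following the howto linked above. Field names follow the kube-starrocks StorageVolume spec as documented there; the storage class and size are placeholders you would replace for your cluster. The manifest is written to a file here for review rather than applied:

```shell
# Hypothetical StarRocksCluster spec with a PVC-backed FE meta volume.
# Replace storageClassName/storageSize with values valid in your cluster,
# then apply with: kubectl apply -f /tmp/fe-meta-pvc.yaml
cat > /tmp/fe-meta-pvc.yaml <<'EOF'
apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrocks-cluster
  namespace: starrocks
spec:
  starRocksFeSpec:
    image: starrocks/fe-ubuntu:latest
    replicas: 3
    storageVolumes:
      - name: fe-meta
        storageClassName: standard   # placeholder
        storageSize: 10Gi            # placeholder
        mountPath: /opt/starrocks/fe/meta
EOF
grep -q 'mountPath: /opt/starrocks/fe/meta' /tmp/fe-meta-pvc.yaml && echo ok
```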

@cnmac
Copy link

cnmac commented Sep 3, 2024

Thanks to yandongxiao for the tip. I found the real error message in the fe.log file in the PVC directory:
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.16) kube-starrocks-fe-1.kube-starrocks-fe.search.starrocks.svc.cluster.local_9010_1725331459701(-1):/opt/starrocks/fe/meta/bdb Clock delta: 339929 ms. between Feeder: kube-starrocks-fe-0.kube-starrocks-fe-search.starrocks.svc.cluster.local_9010_1725331350969 and this Replica exceeds max permissible delta: 5000 ms. HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed. Originally thrown by HA thread: RepNode kube-starrocks-fe-1.kube-starrocks-fe-search.starrocks.svc.cluster.local_9010_1725331459701(-1)

I encountered this problem because the clocks on the nodes were not synchronized.
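
Per the log above, BDB JE rejects the handshake when the clock delta between two FE nodes exceeds 5000 ms. A small sketch of that check, operating on epoch-millisecond timestamps; in a real cluster you would feed it readings such as `date +%s%3N` taken on each node and fix NTP/chrony if it fails:

```shell
# Maximum permissible clock delta between replication peers, from the log.
max_delta_ms=5000

# Return success if two epoch-ms readings are within the permitted delta.
clock_ok() {
  local a=$1 b=$2
  local d=$(( a - b ))
  [ "$d" -lt 0 ] && d=$(( 0 - d ))
  [ "$d" -le "$max_delta_ms" ]
}

clock_ok 1000 4500 && echo in-sync || echo out-of-sync   # delta 3500 ms
clock_ok 1000 6200 && echo in-sync || echo out-of-sync   # delta 5200 ms
```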
