Partial NFS Volume Mount Failures Following OpenShift 4.14 Upgrade (error: code = FailedPrecondition desc = open /var/lib/trident/tracking/pvc-xx-xx-xx-xx-xx.json: no such file or directory) #891
Comments
We had the same issue after updating from OpenShift 4.13.34 to 4.14.19. Kubernetes orchestrator: OpenShift 4.14.19. One pod was stuck in the ContainerCreating state. The events were:
After I restarted the pod of the DaemonSet trident-node-linux that was running on the same node as the failing pod, the error message was still in the log.
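For anyone following along, a minimal sketch of such a per-node restart, assuming Trident runs in a namespace called trident (the namespace, node name, and pod name are assumptions/placeholders, not confirmed by this thread):

```shell
# Find the trident-node-linux pod on the same node as the failing application pod.
oc get pods -n trident -o wide --field-selector spec.nodeName=<node-name>

# Deleting that pod lets the DaemonSet recreate it on the node.
oc delete pod -n trident <trident-node-linux-xxxxx>
```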
Hey @phhutter and @Xavier-0965, I'm looking into this bug. If possible, can you upload logs of both the controller and node pods with the log level set to debug, and also any steps to reproduce this? To set the log level to debug, you can use the following command. Thanks!
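The exact command was not captured in this thread. As a rough sketch, with an operator-based install the debug flag can be toggled on the TridentOrchestrator custom resource (the CR name trident, the trident namespace, and the trident-controller deployment name are assumptions that may differ per install):

```shell
# Sketch only: enable debug logging via the (cluster-scoped) TridentOrchestrator CR.
oc patch tridentorchestrator trident --type=merge -p '{"spec":{"debug":true}}'

# Once the operator has rolled the pods, collect controller and node logs, e.g.:
oc logs -n trident deploy/trident-controller --all-containers
oc logs -n trident <trident-node-linux-pod>
```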
Hey @shashank-netapp, I've fixed all affected PVCs by using the workaround mentioned in my initial comment, which now makes it nearly impossible to gather the requested debug logs. I did this because NetApp support told me that you already have a solution for it and that the fix will be delivered in the next release. ;-) Of course, I was puzzled because whenever I asked about the root cause, no answer was ever given. In the future I will probably rely on GitHub issues instead of NetApp support cases. Unfortunately, it doesn't seem reproducible to me. I've noticed the same issue on 4 clusters out of 30. What surprises me is that for @Xavier-0965, a restart of the DaemonSet supposedly solved the issue. I also tried this when the problem occurred: I restarted the controller, operator, and DaemonSet without any luck. So it could also be that the problem mentioned by Xavier is unrelated. Cheers
Hi @shashank-netapp, as the problem does not occur anymore, I do not have any current logs.
First, I deleted the pod to see if the problem would be solved. It was not. Here are the (same) events:
Here is a part of the log:
Then I deleted the pod, but the error message "no such file or directory" was still there. After I restarted the pod $pod (which was consuming the persistent volume), it worked. The PV was successfully mounted. Regards
Hi, first of all: thank you for providing the fix in the first post. We are on 24.02.0 and OpenShift 4.14.20 and had the same issue for one PV with two pods. Restarting (deleting) the pods did not help at all. Creating the JSON file by hand, empty, did not work. Copying it without removing the "publishedTargetPaths" did not work either.
Describe the bug
I have encountered an issue after upgrading from OpenShift 4.12.x to OpenShift 4.14.x. Following the upgrade, as the updated nodes came back online, I noticed that certain NFS volumes could not be mounted, leaving the corresponding application pods in a "Pending" state. Below, I have attached the log from a Linux Trident DaemonSet pod, which indicates that it is looking for a status/tracking file for the PVC in "/var/lib/trident/tracking/" but is unable to find it. This issue only affects some PVCs (5-10% of all PVCs); other PVCs from the same backend storage were mounted without any issues.
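To check whether the tracking file is really missing on the affected node, something like the following can be used (a sketch; the node name and PVC UID are placeholders, and oc debug node requires sufficient privileges):

```shell
# List Trident's volume tracking files on the affected node.
oc debug node/<node-name> -- chroot /host ls -l /var/lib/trident/tracking/

# Inspect the tracking file for a specific PVC, if it exists.
oc debug node/<node-name> -- chroot /host cat /var/lib/trident/tracking/pvc-<uid>.json
```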
As a workaround, I manually copied the missing JSON tracking file from another remaining CoreOS node and deleted the "publishedTargetPaths" value. This temporarily allowed Trident to remount the volume.
Steps to temporarily fix it (a hedged command sketch follows after this list):
1. Find and copy the tracking file /var/lib/trident/tracking/pvc-xxx.json from a remaining worker node.
2. Remove the value from publishedTargetPaths.
3. Let the corresponding Linux Trident node pod reconcile its value.

I also tried to delete the Operator/Controller and DaemonSet before manually creating the file, hoping this would resolve the issue. Unfortunately, this did not work.
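A rough sketch of that workaround, under the assumptions that the node names and PVC UID are placeholders, that jq is available on the workstation, and that the field is named publishedTargetPaths exactly as in the comment above (its type may vary by Trident version):

```shell
# 1. Read the tracking file from a node that still has it and save it locally.
oc debug node/<healthy-node> -- chroot /host cat /var/lib/trident/tracking/pvc-<uid>.json > pvc-<uid>.json

# 2. Empty the publishedTargetPaths value; depending on the Trident version this
#    may need to be a map ({}) or a list ([]) -- adjust accordingly.
jq '.publishedTargetPaths = {}' pvc-<uid>.json > pvc-<uid>-fixed.json

# 3. In a debug shell on the affected node, recreate the file with the fixed
#    contents under /host/var/lib/trident/tracking/, then delete the trident
#    node pod on that node so it reconciles the value.
oc debug node/<affected-node>
```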
Here is the log from the Trident DaemonSet pod:
Error-Message:
Environment
We have been using Trident for 3-4 years now and have never encountered this error before.
-- EDIT --
We also face the same issue with Trident Version 24.02.0.