fix(pipelinerun): resolve issue with PipelineRun not timing out successfully #8236

Open. l-qing wants to merge 1 commit into main.

@l-qing (Contributor) commented Sep 3, 2024

fix #8230

When a PipelineRun times out, validation errors returned when patching an already-completed TaskRun should be ignored.

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

fix(pipelinerun): resolve issue with PipelineRun not timing out successfully

/kind bug

@tekton-robot added the release-note and kind/bug labels on Sep 3, 2024
@tekton-robot (Collaborator) commented

Hi @l-qing. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot added the needs-ok-to-test and size/L labels on Sep 3, 2024
@tekton-robot (Collaborator) commented

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/errors/errors.go | 100.0% | 92.3% | -7.7 |
| pkg/reconciler/pipelinerun/timeout.go | 89.7% | 87.8% | -1.9 |

@vdemeester (Member) left a comment

/ok-to-test
/cc @chitrangpatel

@tekton-robot added the ok-to-test label and removed the needs-ok-to-test label on Sep 3, 2024
@tekton-robot (Collaborator) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label on Sep 3, 2024
@vdemeester (Member) commented

/cherry-pick release-v0.63.x

@tekton-robot (Collaborator) commented

@vdemeester: once the present PR merges, I will cherry-pick it on top of release-v0.63.x in a new PR and assign it to you.

@tekton-robot (Collaborator) commented

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/pipeline/errors/errors.go | 100.0% | 92.3% | -7.7 |
| pkg/reconciler/pipelinerun/timeout.go | 89.7% | 87.8% | -1.9 |

@l-qing (Contributor, Author) commented Sep 3, 2024

failMsg := "\"step-timeout\" exited because the step exceeded the specified timeout limit"
t.Logf("Waiting for %s in namespace %s to time out", "step-timeout", namespace)
if err := WaitForTaskRunState(ctx, c, taskRun.Name, FailedWithMessage(failMsg, taskRun.Name), "StepTimeout", v1Version); err != nil {
	t.Logf("Error in taskRun %s status: %s\n", taskRun.Name, err)
	t.Errorf("Expected: %s", failMsg)
}
tr, err := c.V1TaskRunClient.Get(ctx, taskRun.Name, metav1.GetOptions{})
if err != nil {
	t.Errorf("Error getting Taskrun: %v", err)
}
if tr == nil {
	t.Fatalf("no TaskRun details available")
}
if tr.Status.Steps[0].Terminated == nil {
	t.Errorf("step-no-timeout should have Completed.")
} else if tr.Status.Steps[0].Terminated.Reason != "Completed" {
	t.Errorf("step-no-timeout should not have been terminated")
}
if tr.Status.Steps[2].Terminated == nil {
	t.Errorf("step-canceled should have been canceled after step-timeout timed out")

Sometimes tr.Status.Steps[2].Terminated is nil. In the failing run, the step-canceled container is still reported as running:

          - container: step-canceled
            imageID: docker.io/library/busybox@sha256:82742949a3709938cbeb9cec79f5eaf3e48b255389f2dcedf2de29ef96fd841c
            name: canceled
            running:
              startedAt: "2024-09-03T13:30:04Z"

Ref: https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/8236/pull-tekton-pipeline-alpha-integration-tests/1830956965430300672

    timeout_test.go:217: step-canceled should have been canceled after step-timeout timed out
    timeout_test.go:221: ############################################
    timeout_test.go:221: ### Dumping objects from arendelle-sq24k ###
    timeout_test.go:221: ############################################
    timeout_test.go:221: 
        ---
        apiVersion: tekton.dev/v1
        kind: TaskRun
        metadata:
          annotations:
            pipeline.tekton.dev/release: 7499202-dirty
          creationTimestamp: "2024-09-03T13:30:00Z"
          generation: 1
          labels:
            app.kubernetes.io/managed-by: tekton-pipelines
          name: step-timeout-taevnmsh
          namespace: arendelle-sq24k
          resourceVersion: "9515"
          uid: 47de3643-fda7-416d-844a-8645e049adac
        spec:
          serviceAccountName: default
          taskSpec:
          timeout: 1h0m0s
        status:
          artifacts: {}
          completionTime: "2024-09-03T13:30:08Z"
          conditions:
          - lastTransitionTime: "2024-09-03T13:30:08Z"
            message: '"step-timeout" exited because the step exceeded the specified timeout
              limit'
            reason: Failed
            status: "False"
            type: Succeeded
          podName: step-timeout-taevnmsh-pod
          provenance:
          startTime: "2024-09-03T13:30:00Z"
          steps:
          - container: step-no-timeout
            imageID: docker.io/library/busybox@sha256:82742949a3709938cbeb9cec79f5eaf3e48b255389f2dcedf2de29ef96fd841c
            name: no-timeout
            terminated:
              containerID: containerd://3bf613441e8bd2c2451c65b72279f9082a4e03e175a735420f77a1a0192c8ad3
              exitCode: 0
              finishedAt: "2024-09-03T13:30:07Z"
              reason: Completed
              startedAt: "2024-09-03T13:30:06Z"
            terminationReason: Completed
          - container: step-timeout
            imageID: docker.io/library/busybox@sha256:82742949a3709938cbeb9cec79f5eaf3e48b255389f2dcedf2de29ef96fd841c
            name: timeout
            terminated:
              containerID: containerd://bebb5264ae4a7452217ed66c2aad4956f4e498bc035605c601f9d8c778d6c930
              exitCode: 1
              finishedAt: "2024-09-03T13:30:07Z"
              reason: Error
              startedAt: "2024-09-03T13:30:07Z"
            terminationReason: TimeoutExceeded
          - container: step-canceled
            imageID: docker.io/library/busybox@sha256:82742949a3709938cbeb9cec79f5eaf3e48b255389f2dcedf2de29ef96fd841c
            name: canceled
            running:
              startedAt: "2024-09-03T13:30:04Z"
          taskSpec:
            steps:

@l-qing (Contributor, Author) commented Sep 4, 2024

The automation often fails because of a known unstable e2e test:

fix(e2e): stabilize TestTaskRunFailure test #8174

@l-qing force-pushed the fix/pipelinerun-timeout-issue branch from 7e3a45b to 3856add on September 5, 2024 14:13

@l-qing force-pushed the fix/pipelinerun-timeout-issue branch from 3856add to 4737501 on September 6, 2024 03:59

…ssfully

fix tektoncd#8230

When a PipelineRun times out, validation errors returned when patching a
completed TaskRun should be ignored.
@l-qing force-pushed the fix/pipelinerun-timeout-issue branch from 4737501 to dac0c48 on September 13, 2024 02:11

@l-qing (Contributor, Author) commented Sep 13, 2024

/retest

@l-qing (Contributor, Author) commented Sep 13, 2024

/retest

@l-qing (Contributor, Author) commented Sep 13, 2024

The instability in this integration test is related to another PR #8171.

It's mainly caused by the informer cache not updating in a timely manner. I'll look into how to completely avoid this issue within a week.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

PipelineRun fails to timeout properly in v0.63.0 (hits PipelineRunCouldntTimeOut state)
3 participants