fix(e2e): stabilize TestTaskRunFailure test #8174

Merged


@l-qing (Contributor) commented Aug 3, 2024

Most of the time, when a step in the middle of a Task fails, the subsequent steps are immediately marked as skipped. In rare cases, however, a subsequent step has not yet been marked as skipped when the TaskRun status is checked, and it is still reported as running.

To keep the e2e test stable, we adapted the test to tolerate this scenario.
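A minimal sketch of the kind of tolerance described above. It assumes simplified, hypothetical stand-in types (`containerState`, `stepState`, `terminated`, `running`) instead of Tekton's real `v1.StepState` and `corev1.ContainerState`, and the helper `lastStepAcceptable` is hypothetical; it is not the code merged in this PR. It accepts the step after the failed one in either of the two states seen in the logs below.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for corev1.ContainerState and v1.StepState.
type terminated struct {
	ExitCode int32
	Reason   string
}

type running struct {
	StartedAt string
}

type containerState struct {
	Running    *running
	Terminated *terminated
}

type stepState struct {
	Name              string
	ContainerState    containerState
	TerminationReason string
}

// lastStepAcceptable (hypothetical) accepts the step that follows a failed step
// in either observed state:
//   - already marked as skipped: Terminated with exit code 1, reason "Error",
//     and TerminationReason "Skipped";
//   - not yet marked: still Running, with an empty TerminationReason.
func lastStepAcceptable(s stepState) bool {
	cs := s.ContainerState
	skipped := cs.Terminated != nil && cs.Running == nil &&
		cs.Terminated.ExitCode == 1 && cs.Terminated.Reason == "Error" &&
		s.TerminationReason == "Skipped"
	stillRunning := cs.Running != nil && cs.Terminated == nil &&
		s.TerminationReason == ""
	return skipped || stillRunning
}

func main() {
	skipped := stepState{
		Name:              "unnamed-2",
		ContainerState:    containerState{Terminated: &terminated{ExitCode: 1, Reason: "Error"}},
		TerminationReason: "Skipped",
	}
	notYetSkipped := stepState{
		Name:           "unnamed-2",
		ContainerState: containerState{Running: &running{StartedAt: "2024-08-03T08:32:00Z"}},
	}
	fmt.Println(lastStepAcceptable(skipped), lastStepAcceptable(notYetSkipped)) // true true
}
```

A completed step (exit code 0, reason "Completed") is rejected by this check, so a step that unexpectedly succeeds would still fail the test.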

Failed cases:

Fatal log

    taskrun_test.go:133: -got, +want:   []v1.StepState{
          	{ContainerState: {Terminated: &{Reason: "Completed"}}, Name: "unnamed-0", Container: "step-unnamed-0", TerminationReason: "Completed", ...},
          	{ContainerState: {Terminated: &{ExitCode: 1, Reason: "Error"}}, Name: "unnamed-1", Container: "step-unnamed-1", TerminationReason: "Error", ...},
          	{
          		ContainerState: v1.ContainerState{
          			Waiting:    nil,
        - 			Running:    s"&ContainerStateRunning{StartedAt:2024-08-03 02:18:01 +0000 UTC,}",
        + 			Running:    nil,
        - 			Terminated: nil,
        + 			Terminated: s"&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:0001-01-01 00:00:00 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,ContainerID:,}",
          		},
          		Name:      "unnamed-2",
          		Container: "step-unnamed-2",
          		... // 1 ignored field
          		Results:           nil,
          		Provenance:        nil,
        - 		TerminationReason: "",
        + 		TerminationReason: "Skipped",
          		Inputs:            nil,
          		Outputs:           nil,
          	},
          }

Code:

expectedStepState := []v1.StepState{{
	ContainerState: corev1.ContainerState{
		Terminated: &corev1.ContainerStateTerminated{
			ExitCode: 0,
			Reason:   "Completed",
		},
	},
	TerminationReason: "Completed",
	Name:              "unnamed-0",
	Container:         "step-unnamed-0",
}, {
	ContainerState: corev1.ContainerState{
		Terminated: &corev1.ContainerStateTerminated{
			ExitCode: 1,
			Reason:   "Error",
		},
	},
	TerminationReason: "Error",
	Name:              "unnamed-1",
	Container:         "step-unnamed-1",
}, {
	ContainerState: corev1.ContainerState{
		Terminated: &corev1.ContainerStateTerminated{
			ExitCode: 1,
			Reason:   "Error",
		},
	},
	TerminationReason: "Skipped",
	Name:              "unnamed-2",
	Container:         "step-unnamed-2",
}}
ignoreTerminatedFields := cmpopts.IgnoreFields(corev1.ContainerStateTerminated{}, "StartedAt", "FinishedAt", "ContainerID")
ignoreStepFields := cmpopts.IgnoreFields(v1.StepState{}, "ImageID")
if d := cmp.Diff(taskrun.Status.Steps, expectedStepState, ignoreTerminatedFields, ignoreStepFields); d != "" {
	t.Fatalf("-got, +want: %v", d)
}

Debug stepStatuses

// Convert the Pod's status to the equivalent TaskRun Status.
tr.Status, err = podconvert.MakeTaskRunStatus(ctx, logger, *tr, pod, c.KubeClientSet, rtr.TaskSpec)

pipeline/pkg/pod/status.go, lines 144 to 162 in 95fbf31:

var stepStatuses []corev1.ContainerStatus
var sidecarStatuses []corev1.ContainerStatus
for _, s := range pod.Status.ContainerStatuses {
	if IsContainerStep(s.Name) {
		stepStatuses = append(stepStatuses, s)
	} else if IsContainerSidecar(s.Name) {
		sidecarStatuses = append(sidecarStatuses, s)
	}
}
for _, s := range pod.Status.InitContainerStatuses {
	if IsContainerSidecar(s.Name) {
		sidecarStatuses = append(sidecarStatuses, s)
	}
}
var merr *multierror.Error
if err := setTaskRunStatusBasedOnStepStatus(ctx, logger, stepStatuses, &tr, pod.Status.Phase, kubeclient, ts); err != nil {
	merr = multierror.Append(merr, err)
}
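The loop above splits the pod's container statuses into steps and sidecars. A stdlib-only sketch of that partitioning, assuming step containers are named with a `step-` prefix (as in `step-unnamed-0` above) and sidecars with a `sidecar-` prefix. The lowercase helpers and the `containerStatus` type are hypothetical stand-ins, not Tekton's `IsContainerStep`/`IsContainerSidecar` or `corev1.ContainerStatus`.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed naming convention: step containers are "step-<name>",
// sidecar containers are "sidecar-<name>".
func isContainerStep(name string) bool    { return strings.HasPrefix(name, "step-") }
func isContainerSidecar(name string) bool { return strings.HasPrefix(name, "sidecar-") }

// containerStatus is a hypothetical stand-in for corev1.ContainerStatus.
type containerStatus struct {
	Name string
}

// partitionStatuses mirrors the loop above: regular container statuses are split
// into steps and sidecars, while init-container statuses only contribute sidecars.
func partitionStatuses(containers, initContainers []containerStatus) (steps, sidecars []containerStatus) {
	for _, s := range containers {
		if isContainerStep(s.Name) {
			steps = append(steps, s)
		} else if isContainerSidecar(s.Name) {
			sidecars = append(sidecars, s)
		}
	}
	for _, s := range initContainers {
		if isContainerSidecar(s.Name) {
			sidecars = append(sidecars, s)
		}
	}
	return steps, sidecars
}

func main() {
	// Container names here are illustrative only.
	steps, sidecars := partitionStatuses(
		[]containerStatus{{"step-unnamed-0"}, {"step-unnamed-1"}, {"sidecar-logs"}},
		[]containerStatus{{"prepare"}, {"sidecar-native"}},
	)
	fmt.Println(len(steps), len(sidecars)) // 2 2
}
```

Only `stepStatuses` is passed on to `setTaskRunStatusBasedOnStepStatus`, so the step-state race discussed in this PR concerns the step entries, not the sidecars.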

pipeline/pkg/pod/status.go, lines 267 to 309 in 95fbf31:

// Continue with extraction of termination messages
for _, s := range stepStatuses {
	// Avoid changing the original value by modifying the pointer value.
	state := s.State.DeepCopy()
	taskRunStepResults := []v1.TaskRunStepResult{}
	// Identify Step Results
	stepResults := []v1.StepResult{}
	if ts != nil {
		for _, step := range ts.Steps {
			if GetContainerName(step.Name) == s.Name {
				stepResults = append(stepResults, step.Results...)
			}
		}
	}
	// Identify StepResults needed by the Task Results
	neededStepResults, err := findStepResultsFetchedByTask(s.Name, specResults)
	if err != nil {
		merr = multierror.Append(merr, err)
	}
	// populate step results from sidecar logs
	stepResultsFromSidecarLogs, err := getStepResultsFromSidecarLogs(sidecarLogResults, s.Name)
	if err != nil {
		merr = multierror.Append(merr, err)
	}
	_, stepRunRes, _ := filterResults(stepResultsFromSidecarLogs, specResults, stepResults)
	if tr.IsDone() {
		taskRunStepResults = append(taskRunStepResults, stepRunRes...)
		// Set TaskResults from StepResults
		trs.Results = append(trs.Results, createTaskResultsFromStepResults(stepRunRes, neededStepResults)...)
	}
	var sas v1.Artifacts
	err = setStepArtifactsValueFromSidecarLogResult(sidecarLogResults, s.Name, &sas)
	if err != nil {
		logger.Errorf("Failed to set artifacts value from sidecar logs: %v", err)
		merr = multierror.Append(merr, err)
	}
	// Parse termination messages
	terminationReason := ""
	if state.Terminated != nil && len(state.Terminated.Message) != 0 {

Normal (passing) e2e run

stepStatuses:
  - name: "step-unnamed-0"
    state:
      terminated:
        exitCode: 0
        reason: "Completed"
        message: '[{"key":"StartedAt","value":"2024-08-03T08:32:01.924Z","type":3}]'
        startedAt: "2024-08-03T08:31:58Z"
        finishedAt: "2024-08-03T08:32:01Z"
        containerID: "containerd://0626499f8e35d6179dcb67c2b7bbba2f78cc14d9430106125eb3f4be5a435fc4"
    lastState: {}
    ready: false
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://0626499f8e35d6179dcb67c2b7bbba2f78cc14d9430106125eb3f4be5a435fc4"
    started: false
  - name: "step-unnamed-1"
    state:
      terminated:
        exitCode: 1
        reason: "Error"
        message: '[{"key":"StartedAt","value":"2024-08-03T08:32:02.781Z","type":3}]'
        startedAt: "2024-08-03T08:31:59Z"
        finishedAt: "2024-08-03T08:32:02Z"
        containerID: "containerd://a04c5360b640cbb82ffc2f04d12bc7b38b1851add06a19df5c603659eee36262"
    lastState: {}
    ready: false
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://a04c5360b640cbb82ffc2f04d12bc7b38b1851add06a19df5c603659eee36262"
    started: false
  - name: "step-unnamed-2"
    state:
      terminated:
        exitCode: 1
        reason: "Error"
        message: '[{"key":"StartedAt","value":"2024-08-03T08:32:03.705Z","type":3},{"key":"Reason","value":"Skipped","type":3}]'
        startedAt: "2024-08-03T08:32:00Z"
        finishedAt: "2024-08-03T08:32:03Z"
        containerID: "containerd://8e14be369ce0f62b95ccdf785ced891c5979ce901f887a68bede87dd8b2538d3"
    lastState: {}
    ready: false
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://8e14be369ce0f62b95ccdf785ced891c5979ce901f887a68bede87dd8b2538d3"
    started: false
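In the passing run, the termination message of `step-unnamed-2` contains a `{"key":"Reason","value":"Skipped"}` entry, which is where `TerminationReason: "Skipped"` comes from; the completed and failed steps carry no such entry. A simplified, hypothetical parser for that message shape (the `terminationEntry` type and `terminationReason` helper are illustrative assumptions, not Tekton's parsing code, which the `status.go` excerpt above cuts off at the "Parse termination messages" comment):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// terminationEntry (hypothetical) matches the JSON entries seen in the
// termination messages above: {"key": ..., "value": ..., "type": ...}.
type terminationEntry struct {
	Key   string `json:"key"`
	Value string `json:"value"`
	Type  int    `json:"type"`
}

// terminationReason returns the value of the "Reason" entry, or "" when the
// message has no such entry.
func terminationReason(message string) (string, error) {
	var entries []terminationEntry
	if err := json.Unmarshal([]byte(message), &entries); err != nil {
		return "", err
	}
	for _, e := range entries {
		if e.Key == "Reason" {
			return e.Value, nil
		}
	}
	return "", nil
}

func main() {
	msg := `[{"key":"StartedAt","value":"2024-08-03T08:32:03.705Z","type":3},{"key":"Reason","value":"Skipped","type":3}]`
	reason, err := terminationReason(msg)
	fmt.Println(reason, err) // Skipped <nil>
}
```

In the intermittent failure shown next, `step-unnamed-2` is still `running`, so it has no `terminated` block and no message to parse, and its `TerminationReason` stays empty.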

Intermittent (flaky) failure

stepStatuses:
  - name: "step-unnamed-0"
    state:
      terminated:
        exitCode: 0
        reason: "Completed"
        message: '[{"key":"StartedAt","value":"2024-08-03T08:32:01.924Z","type":3}]'
        startedAt: "2024-08-03T08:31:58Z"
        finishedAt: "2024-08-03T08:32:01Z"
        containerID: "containerd://0626499f8e35d6179dcb67c2b7bbba2f78cc14d9430106125eb3f4be5a435fc4"
    lastState: {}
    ready: false
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://0626499f8e35d6179dcb67c2b7bbba2f78cc14d9430106125eb3f4be5a435fc4"
    started: false
  - name: "step-unnamed-1"
    state:
      terminated:
        exitCode: 1
        reason: "Error"
        message: '[{"key":"StartedAt","value":"2024-08-03T08:32:02.781Z","type":3}]'
        startedAt: "2024-08-03T08:31:59Z"
        finishedAt: "2024-08-03T08:32:02Z"
        containerID: "containerd://a04c5360b640cbb82ffc2f04d12bc7b38b1851add06a19df5c603659eee36262"
    lastState: {}
    ready: false
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://a04c5360b640cbb82ffc2f04d12bc7b38b1851add06a19df5c603659eee36262"
    started: false
  - name: "step-unnamed-2"
    state:
      running:
        startedAt: "2024-08-03T08:32:00Z"
    lastState: {}
    ready: true
    restartCount: 0
    image: "docker.io/library/busybox:latest"
    imageID: "docker.io/library/busybox@sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
    containerID: "containerd://8e14be369ce0f62b95ccdf785ced891c5979ce901f887a68bede87dd8b2538d3"
    started: true

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

/kind bug

Release Notes

NONE

@tekton-robot added the kind/bug and release-note-none labels Aug 3, 2024
@tekton-robot (Collaborator)

Hi @l-qing. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot added the needs-ok-to-test and size/S labels Aug 3, 2024
@l-qing (Contributor Author) commented Aug 6, 2024

/auto-cc

@chitrangpatel (Member)

/ok-to-test

@tekton-robot added the ok-to-test label and removed the needs-ok-to-test label Aug 7, 2024
@l-qing (Contributor Author) commented Aug 7, 2024

/retest-required

@l-qing force-pushed the fix/e2e-test-task-run-failure-stability branch 2 times, most recently from 08e312e to 8ad46e2, August 14, 2024 02:19
@l-qing force-pushed the fix/e2e-test-task-run-failure-stability branch from 8ad46e2 to 927613c, August 24, 2024 09:41
@l-qing (Contributor Author) commented Aug 24, 2024

/auto-cc

@l-qing force-pushed the fix/e2e-test-task-run-failure-stability branch from 927613c to 54fdb19, August 28, 2024 10:31
@tekton-robot added the approved label Aug 30, 2024
@l-qing force-pushed the fix/e2e-test-task-run-failure-stability branch 4 times, most recently from 5919d77 to 3e1d622, September 2, 2024 06:26
ignoreStepFields := cmpopts.IgnoreFields(v1.StepState{}, "ImageID")
if d := cmp.Diff(taskrun.Status.Steps, expectedStepState, ignoreTerminatedFields, ignoreStepFields); d != "" {
t.Fatalf("-got, +want: %v", d)
ignoreStepFields := cmpopts.IgnoreFields(v1.StepState{}, "ImageID", "Running")
@l-qing (Contributor Author) commented:

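Ignore the Running field:

This case failed because the Running field was inconsistent: step-unnamed-2 was still reported as running, and the second comparison below (taskrun_test.go:155) differed only in Running.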

=== NAME  TestTaskRunFailure
    taskrun_test.go:152: -got, +want:   v1.StepState{
          	ContainerState: v1.ContainerState{
          		Waiting:    nil,
        - 		Running:    s"&ContainerStateRunning{StartedAt:2024-09-02 04:01:41 +0000 UTC,}",
        + 		Running:    nil,
        - 		Terminated: nil,
        + 		Terminated: s"&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:0001-01-01 00:00:00 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,ContainerID:,}",
          	},
          	Name:      "unnamed-2",
          	Container: "step-unnamed-2",
          	... // 1 ignored field
          	Results:           nil,
          	Provenance:        nil,
        - 	TerminationReason: "",
        + 	TerminationReason: "Skipped",
          	Inputs:            nil,
          	Outputs:           nil,
          }
    taskrun_test.go:155: -got, +want:   v1.StepState{
          	ContainerState: v1.ContainerState{
          		Waiting:    nil,
        - 		Running:    s"&ContainerStateRunning{StartedAt:2024-09-02 04:01:41 +0000 UTC,}",
        + 		Running:    nil,
          		Terminated: nil,
          	},
          	Name:      "unnamed-2",
          	Container: "step-unnamed-2",
          	... // 1 ignored and 5 identical fields
          }
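A stdlib-only sketch of why ignoring `Running` makes the second comparison pass. It uses hypothetical, simplified stand-in types and a hypothetical helper `equalIgnoringRunning` that clears `Running` on copies before calling `reflect.DeepEqual`; this only approximates what adding `"Running"` to `cmpopts.IgnoreFields` does with go-cmp and is not the test code from this PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// Hypothetical, simplified stand-ins for corev1.ContainerState and v1.StepState.
type running struct{ StartedAt string }

type terminated struct {
	ExitCode int32
	Reason   string
}

type containerState struct {
	Running    *running
	Terminated *terminated
}

type stepState struct {
	Name              string
	Container         string
	ContainerState    containerState
	TerminationReason string
}

// equalIgnoringRunning compares two step states after clearing
// ContainerState.Running. The arguments are copies, so callers are unaffected.
func equalIgnoringRunning(got, want stepState) bool {
	got.ContainerState.Running = nil
	want.ContainerState.Running = nil
	return reflect.DeepEqual(got, want)
}

func main() {
	got := stepState{
		Name:           "unnamed-2",
		Container:      "step-unnamed-2",
		ContainerState: containerState{Running: &running{StartedAt: "2024-09-02T04:01:41Z"}},
	}
	want := stepState{Name: "unnamed-2", Container: "step-unnamed-2"}
	fmt.Println(reflect.DeepEqual(got, want), equalIgnoringRunning(got, want)) // false true
}
```

Differences in `Terminated` or `TerminationReason` are still reported, so ignoring `Running` does not hide a step that was wrongly marked as completed.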

@afrittoli (Member) left a comment:

Thank you! /lgtm

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@l-qing force-pushed the fix/e2e-test-task-run-failure-stability branch from 3e1d622 to 1b6d193, September 5, 2024 14:13
@l-qing (Contributor Author) commented Sep 5, 2024

/retest

2 similar comments

@vdemeester (Member)

/lgtm

@tekton-robot added the lgtm label Sep 5, 2024
@tekton-robot merged commit 2704e08 into tektoncd:main Sep 5, 2024
11 of 12 checks passed
@l-qing deleted the fix/e2e-test-task-run-failure-stability branch September 6, 2024 01:03
Labels: approved, kind/bug, lgtm, ok-to-test, release-note-none, size/S

5 participants