Catch flaky tests (go) #94
Comments
Very interesting idea! Alternatively, we might also get away with a
Further problem: `go test -list '.*'` works, but not for sub-tests. Related: https://stackoverflow.com/questions/36938520/use-go-test-to-list-all-tests-case/44626375
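To make that limitation concrete, here is a minimal sketch (the test and sub-test names are hypothetical): `go test -list '.*'` can report `TestParent`, because it is a top-level test function the tool can find without executing anything, but the sub-test names only come into existence when `t.Run` is called at runtime.

```go
package example

import "testing"

// TestParent is a top-level test function, so `go test -list '.*'` can
// report it without running anything.
func TestParent(t *testing.T) {
	// These sub-test names only exist once TestParent executes; they could
	// just as well be computed from data loaded at runtime, which is why no
	// static listing can enumerate them.
	for _, name := range []string{"case_a", "case_b"} {
		t.Run(name, func(t *testing.T) {
			// real assertions would go here
		})
	}
}
```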
You can't list sub-tests because that implies running the tests themselves, i.e. arbitrary code. If you really want a full list, I think the best way might be to use

There isn't a "just run the tests up until they run sub-tests and then stop" because... halting problem :) How do you know when a test is done spawning sub-tests?

As for the goal of this issue, I agree it would be nice to rely more on CI to catch flakes. But I'm less convinced that we can be clever about this. For example: what if a test was added months ago, and today it got modified to be flaky? What if a test hasn't been touched at all in months, but updating a Go version made it flaky? Or if the non-test code that it tests became flaky?

So I don't think we can safely determine which tests we need to hammer. Why not hammer all of them? After all, a project that gets regular pushes and merges is already running all tests easily hundreds of times per day. GitHub Actions for open source only limits the number of concurrent jobs, not the total amount of resources used over a month. We could set up something like

Another solution is to do nothing, and assume that flakes will pop up with time. File issues and investigate as they come. That's what I do personally, and it works pretty well. If a flake is so rare that I never see it without CI hammering my tests, does the flake really matter at all?

From experience, I should also note that it's somewhat common for
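For the "hammer all of them" option, a scheduled job could simply rerun the whole suite many times. Here is a rough sketch of the idea; the round count is arbitrary and a shell loop around `go test` would work just as well, this is only an illustration:

```go
// Hypothetical "hammer everything" job: rerun the full test suite many times
// and stop at the first failure.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const rounds = 20 // arbitrary; a nightly job could afford far more
	for i := 1; i <= rounds; i++ {
		// -count=1 defeats the test cache so every round actually runs,
		// and -race tends to surface timing-dependent flakes sooner.
		cmd := exec.Command("go", "test", "-race", "-count=1", "./...")
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "failure on round %d of %d: %v\n", i, rounds, err)
			os.Exit(1)
		}
	}
	fmt.Printf("all %d rounds passed\n", rounds)
}
```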
I'm mostly concerned about adding new flaky tests. Specifically, ones with strong timing dependence. Once we merge, we'll naturally catch flakes as we go. The goal here was to catch them before we merge so we're less likely to run into an "oh well, we merged a flaky test, better go fix it later" issue. But this just may not be worth fixing (given the complexity and the fact that we can't catch everything).
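If we did want to stress only the tests a PR adds, one sketch, within the listing limitation discussed above, would be to capture `go test -list '.*' ./...` output on the base branch and on the PR branch, diff the top-level names, and hammer just the new ones with a high `-count`. The file names and the `-count=100` value below are made up for illustration, and sub-tests added inside existing test functions would slip through:

```go
// Hypothetical sketch: compare `go test -list '.*' ./...` output captured on
// the base branch and on the PR branch, and print the newly added top-level
// tests as a -run regexp. CI could then run something like
// `go test -run "$(newtests base.txt head.txt)" -count=100 ./...`.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// listedTests reads one test name per line, ignoring the "ok <pkg>" lines
// that `go test -list` also prints for each package.
func listedTests(path string) (map[string]bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	tests := make(map[string]bool)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		name := strings.TrimSpace(sc.Text())
		for _, prefix := range []string{"Test", "Benchmark", "Example", "Fuzz"} {
			if strings.HasPrefix(name, prefix) {
				tests[name] = true
				break
			}
		}
	}
	return tests, sc.Err()
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: newtests <base-list> <head-list>")
		os.Exit(2)
	}
	base, err := listedTests(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	head, err := listedTests(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var added []string
	for name := range head {
		if !base[name] {
			// Anchor each name so -run matches exactly this test.
			added = append(added, "^"+name+"$")
		}
	}
	if len(added) == 0 {
		fmt.Fprintln(os.Stderr, "no new top-level tests")
		return
	}
	sort.Strings(added)
	fmt.Println(strings.Join(added, "|"))
}
```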
As a first pass, I would suggest that this be solved with engineering discipline rather than tooling (though tooling can certainly help later). If a PR merges a test that's soon discovered to be flaky, revert the PR, no questions asked. Note that I say revert the PR rather than just disable the test, in order to avoid accumulating tricky but untested code.
While code coverage tools are valuable during development, having code coverage aggressively enforced in CI is not a good idea: removing a bunch of well-tested but ultimately unnecessary code from a project would be regarded as an improvement by all developers, yet a code coverage tool would only see a drop in coverage and fail the status check on the PR. We can use the "non-blocking status checks" config from the Codecov manual: https://docs.codecov.io/docs/common-recipe-list#set-non-blocking-status-checks. What we want the code coverage tool to do is report the changes, but not affect the PR status. See here for a real-world example of how this will look: marten-seemann/codecov-test#5
We'll be aiming to make the existing tooling for flaky test detection easier to reuse across PL as part of pl-strflt/ipdx#89. I'm going to close this issue now, but I'm linking it because the discussion here will be really helpful.
On PR:
This is very much not an easy thing to implement, but it should really help guard us from checking in new flaky tests by stress-testing new tests.