Flaky tests, or why not to ignore mysteries
I spent a few weeks earlier this year tracking down a set of flaky
end-to-end tests where bpftrace would occasionally cease to print output. I
had gotten as far as figuring out std::cout
had badbit set after a write
but had run out of ideas on how to debug it. At the time, because I could not
reproduce it locally, I had assumed it was an oddity with pipes and CI and
given up.
Except bugs never go away. They only lay dormant.
The Return of the King Bug
This week the issue re-appeared. Except in the form of new flakiness after a seemingly harmless test reliability change. I'll spare you the details - just tedious printf bisection through many CI runs - but I essentially picked up where I left off in March and isolated the buggy code down to:
LOG(DEBUG) << "before std::cout good=" << std::cout.good();
std::cout << fmt.format_str(arg_values) << std::endl;
LOG(DEBUG) << "after std::cout good=" << std::cout.good();
Note stderr
(LOG(DEBUG)
) continued to work while stdout
did not. The
above code would show the following:
before std::cout good=1
after std::cout good=0
I'm not sure why I didn't think to drop down to syscall level and actually read the errno last time, but fortunately I had a fresh chance to redeem myself:
auto ret = write(STDOUT_FILENO, fmted.c_str(), fmted.size());
if (ret > 0) {
LOG(DEBUG) << "write was ok";
} else if (ret == 0) {
LOG(DEBUG) << "write wrote nothing?";
} else {
LOG(DEBUG) << "write failed: errno=" << errno << ": " << strerror(errno);
}
The above revealed:
write failed: errno=9: Bad file descriptor
EBADF
is interesting. According to the write(2)
man page:
EBADF fd is not a valid file descriptor or is not open for writing.
To not rule out either case out of hand, I opted to lsof
on bpftrace startup
as well as right before the write(2)
. I used this chunk of code:
char buf[256];
snprintf(buf, sizeof(buf), "lsof -p %d 1>&2", getpid());
system(buf);
On startup, we see:
[..]
bpftrace 7526 root 0r FIFO 0,14 0t0 16880 pipe
bpftrace 7526 root 1w FIFO 0,14 0t0 28011 pipe
bpftrace 7526 root 2w FIFO 0,14 0t0 28011 pipe
[..]
but later at the write:
[..]
bpftrace 7526 root 0r FIFO 0,14 0t0 16880 pipe
bpftrace 7526 root 2w FIFO 0,14 0t0 28011 pipe
[..]
This can only mean something inside the process is closing FD 1.