Flaky tests, or why not to ignore mysteries

January 17, 2025 · 7 min read

Software Engineer

I spent a few weeks earlier this year tracking down a set of flaky end-to-end tests where bpftrace would occasionally cease to print output. I had gotten as far as figuring out std::cout had badbit set after a write but had run out of ideas on how to debug it. At the time, because I could not reproduce it locally, I had assumed it was an oddity with pipes and CI and given up.

Except bugs never go away. They only lay dormant.

The Return of the King Bug

This week the issue re-appeared. Except in the form of new flakiness after a seemingly harmless test reliability change. I'll spare you the details - just tedious printf bisection through many CI runs - but I essentially picked up where I left off in March and isolated the buggy code down to:

LOG(DEBUG) << "before std::cout good=" << std::cout.good();
std::cout << fmt.format_str(arg_values) << std::endl;
LOG(DEBUG) << "after std::cout good=" << std::cout.good();

Note stderr (LOG(DEBUG)) continued to work while stdout did not. The above code would show the following:

before std::cout good=1
after std::cout good=0

I'm not sure why I didn't think to drop down to syscall level and actually read the errno last time, but fortunately I had a fresh chance to redeem myself:

auto ret = write(STDOUT_FILENO, fmted.c_str(), fmted.size());
if (ret > 0) {
    LOG(DEBUG) << "write was ok";
} else if (ret == 0) {
    LOG(DEBUG) << "write wrote nothing?";
} else {
    LOG(DEBUG) << "write failed: errno=" << errno << ": " << strerror(errno);
}

The above revealed:

write failed: errno=9: Bad file descriptor

EBADF is interesting. According to the write(2) man page:

EBADF fd is not a valid file descriptor or is not open for writing.

To not rule out either case out of hand, I opted to lsof on bpftrace startup as well as right before the write(2). I used this chunk of code:

char buf[256];
snprintf(buf, sizeof(buf), "lsof -p %d 1>&2", getpid());
system(buf);

On startup, we see:

[..]
bpftrace 7526 root   0r  FIFO   0,14       0t0   16880 pipe
bpftrace 7526 root   1w  FIFO   0,14       0t0   28011 pipe
bpftrace 7526 root   2w  FIFO   0,14       0t0   28011 pipe
[..]

but later at the write:

[..]
bpftrace 7526 root   0r     FIFO   0,14       0t0   16880 pipe
bpftrace 7526 root   2w     FIFO   0,14       0t0   28011 pipe
[..]

This can only mean something inside the process is closing FD 1.

Clear and present data

bpftrace works with raw file descriptors quite often. So it wasn't really practical to audit every close(2) callsite. Not to mention all the dependencies. Therefore, I opted for some runtime instrumentation by scripting GDB. I originally had bpftrace to do this, but due to all the debug logs the output started getting confusing - multiple copies of everything was getting logged.

Regardless, I switched the end-to-end test invocation to this:

RUN gdb -q -ex "set confirm off" -ex "break close if \$rdi==1" -ex "commands;backtrace;kill;end" -ex run --args {{BPFTRACE}} -e 'uprobe:/proc/{{BEFORE_PID}}/root/uprobe_test:uprobeFunction1 { printf("func %s\n", func); exit() }'

What this snippet does is set a breakpoint on close(2) but only dump backtrace if the argument value is 1.

Running that in CI reveals:

Attaching 1 probe...
__BPFTRACE_NOTIFY_PROBES_ATTACHED

Thread 1 "bpftrace" hit Breakpoint 1.2, 0x00007fffe4504910 in close ()
   from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
(gdb)

The lack of a stack seemed to indicate missing debug information. However, these CI jobs configure bpftrace to be built with DWARF. Therefore, I started to suspect one of bpftrace's dependencies.

Phantom thread

The idea that it could be one of bpftrace's dependencies sparked an interesting question: what do all these failing tests have in common? Given that the above test case was really only using the func builtin, I checked how that feature is implemented. To my surprise, it's implemented through userspace symbolication!

That makes the common thread userspace symbolication, as the original 4 disabled tests involve symbolizing userspace stacks. So the obvious next place to look is bcc.

Upon inspection, there's indeed a bit of suspicious looking code in ProcSyms::ModulePath::fd_ where fd_ is declared without a default value and ModulePath() has an early return that does not set fd_. In turn, ~ModulePath() does if (fd_ > 0) close(fd_) which would be buggy if fd_ is not initialized to -1.

It's also more clear why the above reliability change started triggering flakiness. It’s b/c before the change, uprobe_test probably exited before symbolization occurred, so we were not tickling the bad codepath in bcc. After the change the process hangs around, so symbolization does occur.

The Validator

Continuing with the GDB approach, I thought if I could turn on debug information for bcc it could give me a second opinion on the source. Through our Nix based CI, this was as easy as adding to the bcc package overlay:

cmakeFlags = old.cmakeFlags ++ [ "-DCMAKE_BUILD_TYPE=Debug" ];

Unfortunately, stacks still did not work. I'm guessing it's b/c glibc doesn't have frame pointers enabled. And changing that would probably take too much compute to rebuild everything. So it was a matter of luck that I started thinking about dependencies - the lack of stacks is unrelated.

So I opted to check a different way: through printf debugging the dependency. Through the power of Nix, I applied a patch to bcc:

From e5609182cf0ede7c9bb32bf54670cbfba754a6c0 Mon Sep 17 00:00:00 2001
Message-ID: <e5609182cf0ede7c9bb32bf54670cbfba754a6c0.1737135820.git.dxu@dxuuu.xyz>
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 17 Jan 2025 10:43:33 -0700
Subject: [PATCH] XXX: Log ~ModulePath close()

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 src/cc/syms.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/cc/syms.h b/src/cc/syms.h
index 3dffdda7..415fe28f 100644
--- a/src/cc/syms.h
+++ b/src/cc/syms.h
@@ -16,6 +16,7 @@
 #pragma once

 #include <algorithm>
+#include <iostream>
 #include <memory>
 #include <string>
 #include <sys/types.h>
@@ -144,6 +145,7 @@ class ProcSyms : SymbolCache {
       return proc_root_path_.c_str();
     }
     ~ModulePath() {
+      std::cerr << "XXX BCC: trying to ~ModulePath() close fd_=" << fd_ << std::endl;
       if (fd_ > 0)
         close(fd_);
     }
--
2.47.1

And to the flake:

patches = old.patches ++ [
    # Need to use pkgs.fetchpatch for remote patches or pkgs.writeText for local ones
    (pkgs.writeText "my-patch" (builtins.readFile ./0001-XXX-Log-ModulePath-close.patch))
];

From this, we finally get:

XXX BCC: trying to ~ModulePath() close fd_=1

So mystery solved. To validate, I applied this patch through the same mechanism:

From 02fffe05dec4eca98da053d8f6d4741a758c184d Mon Sep 17 00:00:00 2001
Message-ID: <02fffe05dec4eca98da053d8f6d4741a758c184d.1737137000.git.dxu@dxuuu.xyz>
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 17 Jan 2025 11:02:55 -0700
Subject: [PATCH] XXX: FIX

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 src/cc/syms.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/cc/syms.h b/src/cc/syms.h
index 415fe28f..b9563e87 100644
--- a/src/cc/syms.h
+++ b/src/cc/syms.h
@@ -131,7 +131,7 @@ class ProcSyms : SymbolCache {
     // process by storing a file descriptor created from openat(2) if possible
     // if openat fails, falls back to process-dependent path with /proc/.../root
    private:
-    int fd_;
+    int fd_ = -1;
     std::string proc_root_path_;
     std::string path_;

--
2.47.1

A 90 minute bake (failures usually appeared after 5 minutes) shows no failures.

Conclusions

This is a really nasty bug. Arbitrarily closing file descriptors as a library dependency is very not nice. Anything can happen. For bpftrace, this is particularly risky, as bcc could've been closing probe file descriptors which would lead to silent data loss.

One takeaway I have is that as a tool's developer, we are best positioned to debug crazy mysteries like this. Our users likely have very little chance of figuring this stuff out. So when we find an alarming mystery (like truncated output!), we should be doing our best to chase it to its conclusion.

Also that making puns out of movie titles got pretty hard after 2 titles.

The Return of the King Bug​

Clear and present data​

Phantom thread​

The Validator​

Conclusions​

The Return of the King Bug

Clear and present data

Phantom thread

The Validator

Conclusions