5. Solutions to lab exercises

Quite often there will a number of possible solutions to the lab exercises. Below are suggested solutions (yours may well be better!).

The bpftrace language

Solution for 1.1

Although we haven't been formally introduced to any bpftrace details, can you guess what the above action block does?

This action block keeps track of the number of bytes written to each file descriptor for the bash process (more accurately, any thread with the thread name of "bash").

Solution for 1.2

tracepoint:syscalls:sys_enter_*
{
  @[comm] = count();
}

tracepoint:syscalls:sys_enter_*
{
  @[comm] = sum(1);
}

Warning: This is racy because ++ is not atomic. See docs for more details.

tracepoint:syscalls:sys_enter_*
{
  @[comm]++;
}

Solution for 1.3

Expand the above script to display the per-process system call counts every 10 seconds:

tracepoint:syscalls:sys_enter_*
{
  @[comm] = count();
}

interval:s:10
{
  print(@);
}

Add the ability to only display the top 10 per process counts:

tracepoint:syscalls:sys_enter_*
{
  @[comm] = count();
}

interval:s:10
{
  print(@, 10);
}

Delete all per-process syscall stats every 10 secs:

tracepoint:syscalls:sys_enter_*
{
  @[comm] = count();
}

interval:s:10
{
  print(@);
  clear(@);
}

Finally, exit the script after 3 iterations:

The simplest way is as above with the addition of the following action block:

interval:s:30
{
  exit();
}

Alternatively, for the more adventurous we can use a global variable to count the number of times we have hit the 10 second interval timer:

BEGIN
{
  @i = 0;
}

tracepoint:syscalls:sys_enter_*
{
  @[comm] = count();
}

interval:s:10
{
  @i++;
  print(@);
  clear(@);

  if (@i == 3)
  {
    exit();
  }
}

Solution for 1.4

Count the syscalls made by each <pid, tid,comm> tuple for every thread in the core process:

tracepoint:syscalls:sys_enter_*
/pid == $1/
{
  @[pid,tid,comm] = count();
}

Note: The $1 is a positional parameter which allows us to pass the pid of core as a parameter to the script. You could just hard code the pid of the process if you wanted.

Target a particular tid discovered previously and keep a count of the individual syscalls it makes:

tracepoint:syscalls:sys_enter_*
/pid == $1 && tid == $2/
{
  @[probe] = count();
}

Solution for 1.5

Pick a thread from core and also one of the system calls that it makes. Write a script to time the calls and print the result using printf.

t:syscalls:sys_enter_write
/tid == $1/
{
        @ts[tid] = nsecs;
}

t:syscalls:sys_exit_write
/@ts[tid]/
{
        printf("time taken for write: %lld", nsecs - @ts[tid]);
        delete(@ts, tid);
}

Use the hist() aggregating function to track the range of times taken by this syscall.

This is the same as the solution above but with the printf() replaced:

t:syscalls:sys_enter_write
/tid == 1414349/
{
        @ts[tid] = nsecs;
}

t:syscalls:sys_exit_write
/@ts[tid]/
{
        @histwrite[probe] = hist(nsecs - @ts[tid]);
        delete(@ts, tid);
}

Now add the max() and min() functions in to track the lowest to highest times.

t:syscalls:sys_enter_write
/tid == 1414349/
{
        @ts[tid] = nsecs;
}

t:syscalls:sys_exit_write
/@ts[tid]/
{
        $ts = nsecs - @ts[tid];
        @h[probe] = hist($ts);
        @maxwrite[probe] = max($ts);
        @minwrite[probe] = min($ts);
        delete(@ts, tid);
}

Can you think of how you might dump the stack of a thread when it hits a new highest time value? Implement it.

BEGIN
{
        @curmaxval[tid] = 0;
}

t:syscalls:sys_enter_write
/tid == 1414349/
{
        @ts[tid] = nsecs;
}

t:syscalls:sys_exit_write
/@ts[tid]/
{
        $ts = nsecs - @ts[tid];
        $maxval = @curmaxval[tid];

        if ($ts > $maxval)
        {
                @curmaxval[tid] = $ts;
                printf("New max value hit: %llu nsecs\n", $ts);
                printf("%s\n", ustack());
        }

        @h[probe] = hist($ts);
        @maxwrite[probe] = max($ts);
        @minwrite[probe] = min($ts);
        delete(@ts, tid);
}

Solution for 1.6

Write a script which uses a 10 millisecond profile probe (profile:ms:10) to count the number of times a non-root thread (uid != 0) was running.

profile:ms:10
{
  if (uid != 0)
  {
    @[cpu, comm]++;
  }
}

System Call Tracing

Solution for 2.1

Does t:syscalls:sys_exit_exit exist? If so, when will it fire?

The tracepoint does exists, but will never fire because the exit(2) syscall never returns.

Solution for 2.2: mmap

Locate the process doing the most mmap calls in a 30 second period.

t:syscalls:sys_enter_mmap
{
  @[comm] = count();
}

interval:s:30
{
  print(@, 10);
}

What are the sizes of the segments being created?

Segments from 4k -> 32k in steps of 4k.

t:syscalls:sys_enter_mmap
/comm == "mapit"/
{
  @[args->len/1024] = count();
}

interval:s:30
{
  print(@);
}

Can you tell what percentage of the created mappings are private to the process and which are shared?

#include <sys/mman.h>

BEGIN
{
  @total = 0;
  @shared = 0;
  @private = 0;
}

t:syscalls:sys_enter_mmap
/comm == "mapit"/
{
  @total++;

  if (args->flags & MAP_SHARED)
  {
    @shared++;
  }

  if (args->flags & MAP_PRIVATE)
  {
    @private++;
  }
}

interval:s:30
{
  printf("Total mmap: Total %llu, shared %llu private  %llu\n",
          @total, @shared, @private);
}

The above gives the following output for me:

Total mmap: Total 150000, shared 112500 private  37500

So, 75% of the segments are MAP_SHARED and 25% are MAP_PRIVATE.

Solution for 2.3: open/openat

Warning: Modern systems are using openat(2) over open(2).

Write a script to show which files are being opened.

t:syscalls:sys_enter_openat,
t:syscalls:sys_enter_open
{
  @[str(args->filename)] = count();
}

Extend that script to show which processes are opening which file.

t:syscalls:sys_enter_open,
t:syscalls:sys_enter_openat
{
  @[comm, str(args->filename)] = count();
}

Change that script to only show open calls that are creating temp files.

In the depths of the open(2) man page it specifies that you need to define _GNU_SOURCE to obtain the definitions of a few Linux specific flags, O_TMPFILE being one of them.

#define _GNU_SOURCE
#include <fcntl.h>

tracepoint:syscalls:sys_enter_openat
{
  if ((args->flags & O_TMPFILE) == O_TMPFILE)
  {
    printf("tmpfile: %s\n", str(args->filename));
  }
}

Solution for 2.4: close

#include <errno.h>

t:syscalls:sys_enter_close
{
    @[tid] = args.fd;
}

t:syscalls:sys_exit_close
{
    if (args.ret == -EBADF)
    {
        print(@[tid]);
    }

    delete(@, tid);
}

kprobe exercises

Solution for 3.1

Expand the above script to print the parent directory for the file being unlinked. Warning: advanced challenge - you may want to skip.

kfunc:vfs_unlink
{
  printf("%-16s %s/%s\n", comm, str(args.dentry->d_parent->d_name.name), str(args.dentry->d_name.name));
}

Solution for 3.2

Make the block duration threshold time (currently set to 100 microsecs) into a parameter passed into the script.

tracepoint:lock:contention_begin
{
  @[tid, args.lock_addr] = nsecs;
}

tracepoint:lock:contention_end
/@[tid, args.lock_addr] != 0/
{
  $lock_duration = nsecs - @[tid, args.lock_addr];
  if ($lock_duration > $1) { /* block duration threshold time from bpftrace's arguments */
    printf("contended time: %lld\n", $lock_duration);
  }
  delete(@, (tid, args.lock_addr));
}

For locks that exceed the threshold duration, instead of printing information, store the blocking time into a map named long_block_times and use the hist() action and indexed using the lock addressed.

tracepoint:lock:contention_begin
{
  @[tid, args.lock_addr] = nsecs;
}

tracepoint:lock:contention_end
/@[tid, args.lock_addr] != 0/
{
  $lock_duration = nsecs - @[tid, args.lock_addr];
  if ($lock_duration > $1) { /* block duration threshold time from bpftrace's arguments */
    @long_block_times[args.lock_addr] = hist($lock_duration);
  }
  delete(@, (tid, args.lock_addr));
}

Add an END probe where you print out the @long_block_times map but not the anonymous map used to calculate the results.

/* Same as above, with this END probe: */
END
{
  clear(@);
}

Solution for 3.3

The kprobeme process open(2)s and read(2)s a file once a second. Which file is it?

The file is /proc/uptime.

$ sudo bpftrace -p $(pidof kprobeme) -e 't:syscalls:sys_enter_open { print(str(args.filename)) }'
Attaching 1 probe...
/proc/sys/vm/overcommit_memory
/sys/kernel/mm/transparent_hugepage/enabled
/proc/sys/vm/overcommit_memory
/sys/kernel/mm/transparent_hugepage/enabled
/proc/sys/vm/overcommit_memory
/sys/kernel/mm/transparent_hugepage/enabled
/proc/uptime
/proc/uptime
/proc/uptime
/proc/uptime
^C

Is kprobeme opening a new file descriptor every time?

It re-use the same file descriptor 5 times, then use a new one and repeat.

$ sudo bpftrace -p $(pidof kprobeme) -e 't:syscalls:sys_enter_open { @[tid] = str(args.filename) } t:syscalls:sys_exit_open /@[tid] != ""/ { printf("%s: %d\n", @[tid], args.ret); delete(@, tid); }'
Attaching 2 probes...
[...]
/proc/uptime: 19
/proc/uptime: 20
/proc/uptime: 20
/proc/uptime: 20
/proc/uptime: 20
/proc/uptime: 20
/proc/uptime: 21
/proc/uptime: 21
[...]

Is kprobeme leaking file descriptors? (HINT: use an associative map)

Yes.

TODO

uprobe exercises

Solution for 4.3

TODO

The bpftrace language

Solution for 1.1​

Solution for 1.2​

Solution for 1.3​

Solution for 1.4​

Solution for 1.5​

Solution for 1.6​

System Call Tracing

Solution for 2.1​

Solution for 2.2: mmap​

Solution for 2.3: open/openat​

Solution for 2.4: close​