Why do it yourself when there is automation?

If you ever tried to configure systemd’s SystemCallFilter= directive to harden some systemd unit… When its quite a pain, how are you supposed to know that?

Of systemd developers did notice that very few people ever will know all the system calls any process would use, so they gave us system call groups that each roughly correspond to one subsystem in the kernel.

But there is no information on which groups to use, so its still not really better than trial-and-error or wasting a lot of time on viewing strace logs.

Collecting a list on all the system calls used

Fortunately strace has a handy feature called “summary-only” mode that reduces the entire infinite log of system calls into just a list of system calls that occured and how often each system call was used.

Since this already simplifies things a lot run your process like this and make sure to trigger as many features as you can:

$ strace -f -c -o /tmp/strace.stats <progname> [<args>, ]

It will be somewhat slower, but after it finishes (you may use Ctrl+C, it’s safe!) it will create a the meantioned /tmp/strace.stats file with information regarding all the system calls that occurred:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- -------------------
 92,54   34,660383        5779      5997       608 futex
  1,77    0,664396       33219        20           clock_nanosleep
  1,18    0,442909          15     28667           clock_gettime
  1,05    0,393020          17     22346     21769 readlink
  0,44    0,166252         112      1478           munmap
  0,40    0,150442          46      3245           mprotect
  0,36    0,133489          21      6254           read
  0,36    0,133216          27      4788      2218 openat
  0,34    0,127973          18      6963      2451 statx
  0,17    0,063905          83       761         1 epoll_wait
  0,16    0,060857          15      4022           fcntl
  0,14    0,054210          61       879           mmap
  0,14    0,053680          83       641           sched_yield
  0,14    0,050997          18      2720           close
  0,11    0,041341          11      3637       340 recvfrom
[ more lines ]
  0,00    0,000004           4         1           arch_prctl
  0,00    0,000004           4         1           set_tid_address
  0,00    0,000000           0         1           listen
  0,00    0,000000           0         1           execve
  0,00    0,000000           0         1           rename
  0,00    0,000000           0         1         1 pkey_alloc
------ ----------- ----------- --------- --------- -------------------
100,00   37,454788         346    108137     27505 total

The rightmost column here is what we’re interested to pass to systemd.

In theory we could use that column as-is, but its quite likely that minor updates of glibc, the application runtime or the application itself will end up using slightly different (but very likely related) system calls, so it’s not a good idea.

Instead we want to map this list to systemd’s more dynamic system call groups!

Mapping the list

First save the following ZSH shell script based on this simpler version by SjonHortensius:

#!/usr/bin/env zsh
# Generate SystemCallFilter from list of syscalls
#
# Run this script, type or paste a list of syscalls and this script will return
# the possible @call2groups based on the list of groups returned by
# `systemd-analyze` on the current system.
## Sjon Hortensius, 12020
## Erin of Yukis, 12025
set -eu

# Dynamically initialize call2groups (${syscall} → ${group})
declare -A call2group
declare -A group2call
maxgrouplen=0
while IFS= read -r line
do
    [[ ${#line} -eq 0 ]] && continue

    if [[ $line == @* ]]
    then
        group=${line}
        group2call[${group}]=
        if [ ${#group} -gt ${maxgrouplen} ];
        then
            maxgrouplen=${#group}
        fi
    elif [[ $line != \ *\#* && -n ${group+set} && ${group} != @known ]]
    then
        syscall=${line##    }
        call2group[${syscall}]=${call2group[$syscall]:-}${call2group[$syscall]:+,}${group}
        group2call[${group}]+=${group2call[$syscall]:-}${group2call[$syscall]:+,}${syscall}
    fi
done < <(systemd-analyze syscall-filter)

# Expand group references
for name group in ${(kv)group2call[@]};
do
    if [[ ${name} == @.* ]];
    then
        for syscall in ${(s:,:)group2call[${name}]};
        do
            call2group[${syscall}]=${call2group[$syscall]:-}${call2group[$syscall]:+,}${name}
        done
    fi
done
unset group2call

# Read used syscalls, eg. from strace -c, and build forward mappings (${group} → ${syscalls})
declare -A groupuse
while read -r syscall;
do
    if [[ -n ${call2group[${syscall}]+set} ]];
    then
        for group in ${(s:,:)call2group[${syscall}]};
        do
            groupuse[${group}]=${groupuse[${group}]-}${groupuse[${group}]+,}${syscall}
        done
    else
        groupuse[${syscall}]=${syscall}
    fi
done

# Drop groups entirely subsumed (strict subset) by other groups
for group syscalls in ${(kv)groupuse[@]};
do
    for group2 syscalls2 in ${(kv)groupuse[@]};
    do
        all_found=true
        for syscall in ${(s:,:)syscalls};
        do
            if [[ ",${syscalls2}," != .*,${syscall},.* ]];
            then
                all_found=false
                break
            fi
        done

        # Check if all substrings where found AND the number of items in the
        # reference is strictly greater than in our list
        if ${all_found} \
        && [ ${(ws:,:)#syscalls} -lt ${(ws:,:)#syscalls2} ] \
        && [ -v ${groupuse[${group}]} ];
        then
            unset groupuse[${group}]
        fi
    done
done

# Pretty print and sort each group and used syscall therein
for group syscalls in ${(kv)groupuse[@]};
do
    printf "%-$((maxgrouplen+1))s %s\n" "${group}:" ${(j:, :)${(os:,:)syscalls}}
done | sort

Review it if you want then use it like this:

$ tail -n+3 /tmp/strace.stats | head -n-2 | cut -b52- | ./systemd-callgroups.sh

That’s doing some cutting out of the part of the strace summary we care about (just the list of system calls), then pass it to the script which will match the system calls with all the system call groups present on the current system and generate a report on the possible groups to use:

@basic-io:       close, lseek, pread64, pwrite64, read, write
@default:        arch_prctl, brk, clock_gettime, clock_nanosleep, execve, futex, geteuid, getpid, getrandom, gettid, gettimeofday, mmap, mprotect, munmap, prlimit64, rseq, sched_getaffinity, sched_yield, set_robust_list, set_tid_address
@file-system:    access, close, fcntl, fstat, ftruncate, getcwd, getdents64, mkdir, newfstatat, openat, readlink, rename, statx, unlink, unlinkat
@io-event:       epoll_create1, epoll_ctl, epoll_wait, eventfd2, poll, pselect6
@network-io:     accept4, bind, connect, getpeername, getsockname, getsockopt, listen, recvfrom, recvmsg, sendto, setsockopt, shutdown, socket, socketpair
@pkey:           pkey_alloc
@process:        clone3, prctl
@signal:         rt_sigaction, rt_sigprocmask, sigaltstack
@system-service: flock, ioctl, madvise, mremap, sched_getparam, sched_getscheduler, sched_yield, sysinfo, uname

Note that some system calls may show up in multiple groups! This just matches systemd making some system-calls available in multiple groups as well. (Groups entirely subsumed by other groups are removed however.)

Review which groups make the most sense for your use-case (and are least-likely to break!) and add them to the SystemCallFilter= directive.

Much better!