RFE: Add support for maximum supported kernel version #457

Open
wants to merge 9 commits into
base: main

drakenclimber
Member

@drakenclimber drakenclimber commented Feb 12, 2025

This patchset proposes to solve issue #11 - RFE: support "maximum kernel version".

Significant changes in this patchset

  • Updates syscalls.csv with the kernel versions in which syscalls were added for x86, x86_64, and x32. (See the discussion heading below for why I only did these three architectures.)
  • Adds two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVERMAX, for managing the maximum supported kernel version and what to do with syscalls that are newer than that version.
  • If this feature is enabled by the user, then libseccomp will add a rule for every single known syscall up to the maximum supported kernel version. These rules will perform the DEFAULT action. (See the discussion below for more info.)
  • Adds supporting documentation and a test

Fixes: #11
CC: @kolyshkin @cyphar

Finally, I am hoping to discuss this issue at Linux Security Summit 2025 in Denver, Colorado USA on June 26th and 27th. I would love to get community feedback about the problem, the proposed solution, etc.

@drakenclimber drakenclimber added this to the v2.7.0 milestone Feb 12, 2025
@drakenclimber drakenclimber self-assigned this Feb 12, 2025
@hrw
Contributor

hrw commented Feb 13, 2025

According to my system calls table, there are holes in the syscall numbering on several architectures (I looked at arm64, arm, armoabi, x86-64, x32 and i386). New-style architectures share syscall numbering, and new entries are added at the end of the table.

Your syscalls.csv showed me that I missed the "parisc64" architecture. I will have to add support for it. (Edit: DONE)

When it comes to LTS/stable kernels, I think one of their rules is "no new stuff", which in this case means no new system calls. Distribution kernels may add them, and many have done so in the past, so an "is the syscall present" check may need to be more complex than "is the kernel version high enough".

Since you already have support for syscall.tbl for the x86 variants, as a start it can be expanded to other architectures too. It will not cover all system calls, but you will get data for many.

I used these scripts for a quick check with my syscalls-table project:

#!/bin/bash

KERNELDIR=~/devel/sources/linux/

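# Walk every mainline release from v3.0 to v6.13
for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
do
    echo "$kernel_version"
    # Only check out the tag if the cd succeeded
    (cd "$KERNELDIR" && git checkout "v${kernel_version}")
    bash scripts/update-tables.sh "$KERNELDIR"
    pip install .
    python examples/tables-to-yaml.py "$kernel_version"
    # Keep a per-version snapshot of the tables and the YAML file
    cp -r data/tables "data/tables-${kernel_version}"
    cp syscalls.yml "syscalls-${kernel_version}.yml"
done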

The examples/tables-to-yaml.py script:

#!/usr/bin/python3

import sys
import system_calls
import yaml

kernel_version = ""

if len(sys.argv) > 1:
    kernel_version = sys.argv[1]

syscalls = system_calls.syscalls()

with open("syscalls.yml", "r") as yf:
    yml = yaml.safe_load(yf)

for syscall_name in yml["syscalls"]:

    if not yml["syscalls"][syscall_name]["from"]:
        yml["syscalls"][syscall_name]["from"] = kernel_version

    for arch in syscalls.archs():
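        try:
            number = syscalls.get(syscall_name, arch)
        except system_calls.NotSupportedSystemCall:
            # Leave the number empty when this arch lacks the syscall
            number = ""
        yml["syscalls"][syscall_name]["archs"][arch]["number"] = number
        # Compare against "" so that syscall number 0 still counts as present
        if number != "" and not yml["syscalls"][syscall_name]["archs"][arch]["from"]: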
            yml["syscalls"][syscall_name]["archs"][arch]["from"] = kernel_version


with open("syscalls.yml", "w") as yf:
    yaml.dump(yml, yf)

I have not checked the results for correctness yet.

@coveralls

coveralls commented Feb 13, 2025

Coverage Status

coverage: 90.645% (+0.4%) from 90.252%
when pulling 425defc on drakenclimber:issues/11
into 7db46d7 on seccomp:main.

Promote the scmp_kver enumeration to the public header file,
seccomp.h.in.  Add enumerations for all kernel versions from 4.0 to 6.12

Signed-off-by: Tom Hromatka <[email protected]>

A placeholder, KV_UNDEF, was added for when each syscall was added to
the kernel for each architecture, but the C code has defined this enum
value as SCMP_KV_UNDEF.  Find and replace all instances of KV_UNDEF with
SCMP_KV_UNDEF.

Signed-off-by: Tom Hromatka <[email protected]>
@drakenclimber
Member Author

drakenclimber commented Feb 18, 2025

Moved the discussion list to the v3 comment

Here's a side-by-side diff between v1 of this patchset's syscalls.csv and v2's syscalls.csv.

@hrw
Contributor

hrw commented Feb 19, 2025

Isn't it the case that 'kernel-wide' new system calls are added at the end, and 'new on this architecture' ones are added where they were supposed to be?

I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel-wide' system calls were added in the meantime, then it looked like some were added in the middle of the table.

@hrw
Contributor

hrw commented Feb 19, 2025

Please note that "afs_syscall, break, fattach, fdetach, ftime, getmsg, getpmsg, gtty, isastream, lock, madvise1, mpx, prof, profil, putmsg, putpmsg, security, stty, tuxcall, ulimit, vserver" are officially unimplemented system calls. My syscalls-table has them on its ignore list, which may be why you see some differences.

The problem with x32 is that you need the x32 headers installed on the system to get its syscalls handled properly. Otherwise you get the x86-64 ones. My GitHub action which updates the syscalls-table data has an extra step to make sure that they are present.

@hrw
Contributor

hrw commented Feb 19, 2025

Posted on mastodon about it: https://society.oftrolls.com/@hrw/114030254556485861 as some other people may find it useful too.

@drakenclimber
Member Author

Isn't it the case that 'kernel-wide' new system calls are added at the end, and 'new on this architecture' ones are added where they were supposed to be?

I remember system calls which were added on a subset of architectures in kernel X (and got the highest number), and then kernels X+1 and X+2 added them for other architectures. If any new 'kernel-wide' system calls were added in the meantime, then it looked like some were added in the middle of the table.

Yes, that was my recollection as well, but I wanted data to back it up. I expect this model to continue going forward.

For libseccomp I think that means that we can't rely on a "less than" rule for unknown syscalls. We'll either need an explicit rule for each syscall or a series of ranges.

Thanks for the verification, @hrw
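
To illustrate the "explicit rule for each syscall or a series of ranges" trade-off mentioned above, here is a minimal Python sketch that collapses a sparse set of syscall numbers into inclusive ranges. The numbers are hypothetical and the helper name to_ranges is an assumption made for this example; it is not part of libseccomp or of this patchset.

```python
from typing import Iterable, List, Tuple


def to_ranges(numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse syscall numbers into inclusive (first, last) ranges."""
    ranges: List[Tuple[int, int]] = []
    for num in sorted(set(numbers)):
        if ranges and num == ranges[-1][1] + 1:
            # Extend the current range by one
            ranges[-1] = (ranges[-1][0], num)
        else:
            # A hole in the numbering starts a new range
            ranges.append((num, num))
    return ranges


# Hypothetical syscall table with holes at 4, 7-9 and 13-19
known = [0, 1, 2, 3, 5, 6, 10, 11, 12, 20]
print(to_ranges(known))
```

Every hole in the table adds one more range, so a single "less than" comparison only works when the table has no holes at all.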

@hrw
Contributor

hrw commented Feb 19, 2025

https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html lets you disable and reorder columns, which can be handy when you want to compare numbers between architectures.

I recommend sorting by the arm64 or riscv64 column to see how new system calls are present on each architecture.

Note that everything from 'avr32' rightward no longer exists in the current Linux kernel; those columns are kept for historical purposes.

Add a tool to populate the syscalls.csv table.  It parses the data
output from the syscalls-table [1] tool.  The following script was used
to build the directories and files with the relevant syscall data:

	#!/bin/bash

	KERNELDIR=~/devel/sources/linux/

	for kernel_version in 3.{0..19} 4.{0..20} 5.{0..19} 6.{0..13}
	do
			echo $kernel_version
			(cd $KERNELDIR; git checkout v${kernel_version})
			bash scripts/update-tables.sh $KERNELDIR
			pip install .
			python examples/tables-to-yaml.py $kernel_version
			cp -r data/tables data/tables-${kernel_version}
			cp syscalls.yml syscalls-${kernel_version}.yml
	done

Note that the inlined script above takes quite a bit of time to run :)

[1] https://github.com/hrw/syscalls-table

Signed-off-by: Tom Hromatka <[email protected]>

Using the script from the previous commit, populate the syscalls.csv
table for all architectures.

Signed-off-by: Tom Hromatka <[email protected]>

Add a tool, scmp_get_max_syscall_num.py, that can calculate the largest
current syscall number.

As of this commit, the largest syscall number is 547 via pwritev2() in
the x32 architecture.

Signed-off-by: Tom Hromatka <[email protected]>
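
As a rough illustration of what a "largest syscall number" calculation could look like, the sketch below scans a small CSV excerpt. The column layout, the sample rows, and the function name max_syscall are assumptions made for this example; the real syscalls.csv has more columns, and this is not the code of scmp_get_max_syscall_num.py.

```python
import csv
import io
from typing import Optional, Tuple

# Hypothetical, simplified excerpt; "PNR" marks a syscall that an
# architecture does not provide.
SAMPLE_CSV = """\
#syscall,x86,x86_64,x32
read,3,0,0
pwritev2,379,328,547
example_missing,PNR,PNR,PNR
"""


def max_syscall(csv_text: str) -> Optional[Tuple[int, str, str]]:
    """Return (number, syscall name, arch) for the largest syscall number."""
    reader = csv.reader(io.StringIO(csv_text))
    archs = next(reader)[1:]  # header row: first column is the syscall name
    best: Optional[Tuple[int, str, str]] = None
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        for arch, cell in zip(archs, row[1:]):
            if not cell.isdigit():
                continue  # skip PNR and other non-numeric cells
            if best is None or int(cell) > best[0]:
                best = (int(cell), row[0], arch)
    return best


print(max_syscall(SAMPLE_CSV))
```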
Add two new filter attributes, SCMP_FLTATR_ACT_ENOSYS and
SCMP_FLTATR_CTL_KVERMAX.  When SCMP_FLTATR_CTL_KVERMAX is set, then
libseccomp will handle syscalls as follows:

* syscalls with explicit actions set by the user will behave as
  before
* syscalls that are not explicitly called out by the user's filter
  but are valid for the specified kernel version will return the
  default filter action (SCMP_FLTATR_ACT_DEFAULT).
* syscalls that are newer than the specified kernel version will
  return the unknown filter action (SCMP_FLTATR_ACT_ENOSYS)

Note that setting the SCMP_FLTATR_CTL_KVERMAX can result in large
seccomp BPF filters.  It's recommended to also enable the binary
tree optimization (SCMP_FLTATR_CTL_OPTIMIZE = 2) to speed up
filter traversal in the kernel.

Signed-off-by: Tom Hromatka <[email protected]>

Add support for an application to specify the maximum kernel version it
currently supports.  Any syscalls that have been added to a kernel
version newer than this specified version will return the unknown
action.  The unknown action defaults to returning ENOSYS, but it can be
overridden via the filter attribute SCMP_FLTATR_ACT_ENOSYS.

When the maximum supported kernel version is enabled, libseccomp will
create a filter as follows:
	* Users explicitly declare rules for syscalls.  No changes here
	  from previous behavior
	* The default action provided via seccomp_init() will still be
	  used for all syscalls that existed as of the user-specified
	  supported kernel
	* Any syscalls that did not exist at the time of the
	  user-specified supported kernel will return the unknown
	  action.  By default libseccomp sets this to return ENOSYS, but
	  it can be overridden via the filter attribute
	  SCMP_FLTATR_ACT_ENOSYS.

Below is a rough pseudo-code outline of a typical usage of this feature:
	seccomp_init()
	seccomp_add_rules()

	(optional but recommended) seccomp_attr_set( binary tree )
	seccomp_attr_set( max supported kernel version, e.g. SCMP_KV_6_5 )
	(optional) seccomp_attr_set( default unknown action )

	seccomp_load()
	seccomp_release()

Fixes: seccomp#11
Signed-off-by: Tom Hromatka <[email protected]>
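
The three-way decision described in the commit message above can be modelled in a few lines of Python. This is only a sketch of the described behavior, not libseccomp code: the function names, the action strings, and the "added in" table are assumptions made for illustration, and treating a syscall that is missing from the table as "newer than the maximum version" is also an assumption.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]


def parse_kver(text: str) -> Version:
    """Turn "6.12" into (6, 12) so that versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def resolve_action(name: str,
                   user_rules: Dict[str, str],
                   added_in: Dict[str, str],
                   max_kver: str,
                   default_action: str,
                   enosys_action: str = "ERRNO(ENOSYS)") -> str:
    # 1. Explicit user rules behave as before
    if name in user_rules:
        return user_rules[name]
    # 2. Syscalls that existed as of max_kver get the default action
    introduced = added_in.get(name)
    if introduced is not None and parse_kver(introduced) <= parse_kver(max_kver):
        return default_action
    # 3. Anything newer (or unknown to the table) gets the unknown action
    return enosys_action


# Illustrative table; "3.0" stands for "present at the oldest tracked version"
ADDED_IN = {"read": "3.0", "openat2": "5.6", "cachestat": "6.5", "mseal": "6.10"}
USER_RULES = {"read": "ALLOW"}

for syscall in ADDED_IN:
    print(syscall, resolve_action(syscall, USER_RULES, ADDED_IN, "6.5", "KILL"))
```

Comparing versions as integer tuples matters here: as plain strings, "6.10" sorts before "6.5", which would wrongly classify a newer syscall as supported.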
Add a test, 63-sim-kernel_version.[c|py], to test the kernel version
logic.

Signed-off-by: Tom Hromatka <[email protected]>

Add documentation for SCMP_FLTATR_ACT_ENOSYS and SCMP_FLTATR_CTL_KVERMAX.

Signed-off-by: Tom Hromatka <[email protected]>
@drakenclimber
Member Author

drakenclimber commented Feb 19, 2025

Changes for v3:

  • Fixed the x32 syscall numbers. Thanks to @hrw for the guidance here

Discussion

  • Should we support every architecture from the start?
    • This patchset only adds kernel versions for x86, x86_64 (e2b42b6), and x32. They have had a consistent syscall.tbl since 2015 (kernel version 4.0), so they were an easy initial candidate to prove out the logic. I would prefer to support all architectures from the start, but I'm not certain how easy or hard it will be to flesh out the remainder of syscalls.csv.
    • Patch 55bf2ea adds kernel versions for all syscalls on all architectures
  • libseccomp has been around since kernel version 3.7.10 or so. Do we need to go that far back with our kernel version table?
    • This patchset only goes back to 2015 (linux kernel version 4.0)
    • Patch 55bf2ea b424f57 now lists kernel versions all the way back to kernel v3.0
  • One thing that has kept me up at night with this patchset - did I get the correct kernel versions in which a syscall was added?
    • I wrote a simple Python script to populate the x86-ish syscall kernel versions, and I'm reasonably confident the numbers are right, but "reasonably confident" is insufficient when security is concerned. @hrw has written a tool to determine syscall kernel versions, and it could be used to populate our table (or perhaps verify my numbers)
    • Patch 005280d 9b285ef uses the syscalls-table tool to populate syscalls.csv. libseccomp's kernel versions (prior to this patch) align very, very closely to the output from the syscalls-table tool with the exception of x32.
      • Here's a side-by-side diff of before and after this patchset (v3)
      • There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR
      • As mentioned above, we need to figure out what's up with x32
      • x32 syscall numbers now largely match our previous numbers
  • Can we simplify the logic and shrink the filter? I don't think so
    • @pcmoore has wondered if we could simplify the logic to only return -ENOSYS for syscalls greater than the maximum supported number. (Again, this patchset explicitly creates a rule for every known syscall rather than a single if syscall_num > max_num rule.) Note that most (all?) architectures have several holes in their syscall table. It looks like syscalls are typically added to the end of the list, but is this always true? And will it always be true in the future? And what about long-term stable kernels?
    • Running this script as follows, ./tools/scmp_populate_syscalls_csv.py -d ~/git/other/syscalls-table/data -v, shows that syscalls have been added in the middle of the table 112 times since kernel v3.0. arm, s390, x86_64, parisc, x32, and more have all historically done it. Unfortunately, I don't think we can safely rely on new syscalls being added to the end of the list :(
  • As written, SCMP_FLTATR_CTL_KVERMAX must be set at the end of creating the libseccomp context. Any seccomp_arch_add() after setting the maximum kernel version will result in -EINVAL.
    • Aside - libseccomp doesn't allow overwriting of existing rules, and (regardless of this patchset) silently ignores the "new" rule and doesn't add it to the filter. Thus as currently implemented, we must populate the known rules logic at the very end of the filter construction.
    • Do we consider changing the existing behavior of silently ignoring new rules, and instead overwrite the existing rules? That would simplify this patchset
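
One way to check the "syscalls added in the middle of the table" claim from the discussion above is to compare per-version tables for a single architecture and flag any new syscall whose number is below the previous version's maximum. The sketch below does that on hypothetical data. It is not the logic of scmp_populate_syscalls_csv.py, which is not shown here; the function name, the syscall names, and the version labels are all assumptions.

```python
from typing import Dict, List, Tuple

Table = Dict[str, int]  # syscall name -> syscall number for one architecture


def mid_table_additions(tables: List[Tuple[str, Table]]) -> List[Tuple[str, str, int]]:
    """Return (version, name, number) for syscalls added below the prior max.

    `tables` must be ordered from the oldest kernel version to the newest.
    """
    found: List[Tuple[str, str, int]] = []
    for (_, old), (new_ver, new) in zip(tables, tables[1:]):
        if not old:
            continue  # nothing to compare against
        old_max = max(old.values())
        for name, num in sorted(new.items(), key=lambda item: item[1]):
            if name not in old and num < old_max:
                found.append((new_ver, name, num))
    return found


# Hypothetical history: "epsilon" is appended, "gamma" fills a hole
history = [
    ("6.0", {"alpha": 0, "beta": 1, "delta": 5}),
    ("6.1", {"alpha": 0, "beta": 1, "delta": 5, "epsilon": 6}),
    ("6.2", {"alpha": 0, "beta": 1, "gamma": 3, "delta": 5, "epsilon": 6}),
]
print(mid_table_additions(history))
```

A non-empty result for any architecture means a single syscall_num > max_num comparison would let a newer syscall through, which is why explicit per-syscall rules are the safer choice.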

@hrw
Contributor

hrw commented Feb 19, 2025

There are ~25 syscalls that we need to dig deeper into. For example, afs_syscall() was syscall number 137 prior to this patchset, and is now a PNR

As I wrote above: afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

@drakenclimber
Member Author

As I wrote above: afs_syscall() and a bunch of others are listed in the kernel's system call tables but are not implemented. My code ignores them.

Ack. That's on my todo list :)

Successfully merging this pull request may close these issues.

RFE: support "maximum kernel version"
3 participants