TL;DR: If the return values of set*id() syscalls are not checked, security issues such as privilege escalation can follow. The conditions that make these syscalls fail are less likely on modern kernels than in the past, but return values should always be checked.
setuid() system call
As man 2 setuid states:
setuid() sets the effective user ID of the calling process. If the calling process is privileged (more precisely: if the process has the CAP_SETUID capability in its user namespace), the real UID and saved set-user-ID are also set.
setuid() is widely used in SUID/SGID binaries to drop privileges, as follows:
setuid(getuid())
If the above setuid() call succeeds, the effective UID of the process takes the value of the real UID, hence “cancelling” the SUID bit given to the binary.
The RETURN VALUE section of the man page also states that not checking the return value is a security issue:
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and _errno_ is set to indicate the error. _Note_: there are cases where **setuid**() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from **setuid**().
Imagine the previous setuid() call fails and its return value is not checked. A root-SUID binary making such a call would keep its privileges while believing it had dropped them, opening the door to privilege escalation in the rest of its execution.
The goal of an attacker is thus to provoke such an error in programs that do not enforce return value checks.
The RLIMIT_NPROC case
There are different cases where setuid() fails, but one is of particular interest to us:
ERRORS
**EAGAIN** _uid_ does not match the real user ID of the caller and this call would bring the number of processes belonging to the real user ID _uid_ over the caller's **RLIMIT_NPROC** resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of **EAGAIN** in execve(2)
For now, let’s ignore the Linux 3.1 fix described above. This means that if one were able to provoke an EAGAIN error by reaching RLIMIT_NPROC, the setuid() call would fail, and the privileges of a hypothetical SUID binary that does not check the return value would not be dropped. Great.
The kernel 2.6 modification
The previous behaviour exists because an additional check on RLIMIT_NPROC was added to the setuid() syscall back in Linux 2.6. It ensured that if RLIMIT_NPROC was reached, the call failed with EAGAIN.
It was added because programs could bypass the per-user process limit by starting a daemon as root and then dropping privileges to run as another user, to whom the limit was not applied. Unfortunately, it opened the door to the security issue described above.
The kernel 3.1 fix
The issue introduced by the 2.6 patch was addressed in Linux 3.1, after a discussion on the kernel mailing list (thread: "RLIMIT_NPROC check in set_user()"). The basic idea is not to enforce the RLIMIT_NPROC check in set*id() syscalls but to move it to subsequent fork() or execve() calls, in effect “delegating” the check to them.
A full explanation can be found in man 2 execve:
execve() and EAGAIN
A more detailed explanation of the EAGAIN error that can occur (since Linux 3.1) when calling execve() is as follows.
The **EAGAIN** error can occur when a _preceding_ call to setuid(2) setreuid(2), or setresuid(2) caused the real user ID of the process to change, and that change caused the process to exceed its **RLIMIT_NPROC** resource limit (i.e., the number of processes belonging to the new real UID exceeds the resource limit). From Linux 2.6.0 to Linux 3.0, this caused the **set*uid**() call to fail. (Before Linux 2.6, the resource limit was not imposed on processes that changed their user IDs.) Since Linux 3.1, the scenario just described no longer causes the **set*uid**() call to fail, because it too often led to security holes where buggy applications didn't check the return status and assumed that—if the caller had root privileges—the call would always succeed. Instead, the **set*uid**() calls now successfully change the real UID, but the kernel sets an internal flag, named **PF_NPROC_EXCEEDED**, to note that the **RLIMIT_NPROC** resource limit has been exceeded. If the **PF_NPROC_EXCEEDED** flag is set and the resource limit is still exceeded at the time of a subsequent **execve**() call, that call fails with the error **EAGAIN**. This kernel logic ensures that the **RLIMIT_NPROC** resource limit is still enforced for the common privileged daemon workflow—namely, fork(2) + **set*uid**() + **execve**().
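The daemon workflow named at the end of that excerpt, fork() + set*uid() + execve(), could be sketched as follows with every return value checked. The helper name run_as_uid() and the exit codes 126 and 127 are assumptions made for this illustration, and group IDs are left untouched to keep it short:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` as user `uid` in a child process.
 * Returns the child's exit status, or -1 if it could not be obtained.
 */
int run_as_uid(uid_t uid, const char *path, char *const argv[], char *const envp[])
{
    int status;
    pid_t child = fork();

    if (child == -1) {
        fprintf(stderr, "fork() failed: %s\n", strerror(errno));
        return -1;
    }
    if (child == 0) {
        /* Child: never carry on with the old identity if the drop fails. */
        if (setuid(uid) == -1) {
            fprintf(stderr, "setuid() failed: %s\n", strerror(errno));
            _exit(126);
        }
        execve(path, argv, envp);
        /* Only reached on failure; since Linux 3.1 errno may be EAGAIN here. */
        fprintf(stderr, "execve() failed: %s\n", strerror(errno));
        _exit(127);
    }
    /* Parent: collect the child's exit status. */
    if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With this shape, an RLIMIT_NPROC excess shows up as a failed execve() in the child, which the parent sees as exit status 127, and the child never runs the target program with the wrong identity.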
The corresponding source code in the Linux kernel (kernel/sys.c) says the same:
static void flag_nproc_exceeded(struct cred *new)
{
    if (new->ucounts == current_ucounts())
        return;

    /*
     * We don't fail in case of NPROC limit excess here because too many
     * poorly written programs don't check set*uid() return code, assuming
     * it never fails if called by root. We may still enforce NPROC limit
     * for programs doing set*uid()+execve() by harmlessly deferring the
     * failure to the execve() stage.
     */
    if (is_rlimit_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
            new->user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;
}
This is effective: a program that reaches RLIMIT_NPROC and then calls setuid() still has its privileges dropped, while its subsequent fork() or execve() calls fail instead.
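The deferred check can be pictured with a small userspace toy model. This is not kernel code: the struct, the function names and the simplified "count > limit" comparison are all assumptions made for illustration, with a plain boolean standing in for PF_NPROC_EXCEEDED and UID 0 standing in for INIT_USER:

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a process; the field names are invented for this sketch. */
struct toy_proc {
    unsigned int uid;       /* real UID of the process */
    bool nproc_exceeded;    /* plays the role of PF_NPROC_EXCEEDED */
};

/*
 * Model of set*uid() since Linux 3.1: the UID change always succeeds and
 * an excess over the limit is only recorded in the flag.
 * `new_uid_nproc` is the number of processes owned by the new UID.
 */
int toy_setuid(struct toy_proc *p, unsigned int new_uid,
               unsigned long new_uid_nproc, unsigned long limit)
{
    if (new_uid == p->uid)
        return 0;                       /* nothing changes, flag untouched */
    p->uid = new_uid;
    /* UID 0 is exempt here, mirroring the INIT_USER test in the kernel. */
    p->nproc_exceeded = (new_uid != 0 && new_uid_nproc > limit);
    return 0;
}

/*
 * Model of execve(): fail with EAGAIN only if the flag is set and the
 * limit is still exceeded now; otherwise clear the flag and proceed.
 */
int toy_execve(struct toy_proc *p, unsigned long nproc, unsigned long limit)
{
    if (p->nproc_exceeded && nproc > limit)
        return -EAGAIN;
    p->nproc_exceeded = false;
    return 0;
}
```

In this model toy_setuid() never returns an error because of the limit, which is the whole point of the fix: the failure moves to a later call where a program that ignores it cannot end up keeping root privileges.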
Proof of concept
That the fix works, and that setuid() can no longer be made to fail with EAGAIN using the technique described above, can be demonstrated with the following piece of code:
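#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RLIMIT_PROC_LOW 0

/* Print the real and effective user IDs of the calling process. */
static void print_uids(void)
{
    printf("uid: %d euid: %d\n", (int)getuid(), (int)geteuid());
}

/* Print the soft (current) and hard (max) RLIMIT_NPROC values. */
static void print_nproc_limits(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NPROC, &rlim) == -1)
    {
        printf("getrlimit() failed: %s\n", strerror(errno));
        return;
    }
    /* rlim_t is unsigned and its width varies, so cast before printing. */
    printf("RLIMIT_NPROC current: %llu max: %llu\n",
           (unsigned long long)rlim.rlim_cur,
           (unsigned long long)rlim.rlim_max);
}

int main(void)
{
    printf("Testing setuid()\n");
    print_uids();
    print_nproc_limits();

    struct rlimit rlim;
    rlim.rlim_cur = RLIMIT_PROC_LOW;
    rlim.rlim_max = RLIMIT_PROC_LOW;
    printf("Setting the limit to a low value: %d\n", RLIMIT_PROC_LOW);
    if (setrlimit(RLIMIT_NPROC, &rlim) == -1)
    {
        printf("setrlimit() failed: %s\n", strerror(errno));
        return 1;
    }
    print_nproc_limits();

    printf("Calling setuid to drop privileges\n");
    if (setuid(getuid()) == -1)
    {
        /* Deliberately keep going so that the resulting UIDs are printed. */
        printf("setuid() failed: %s\n", strerror(errno));
    }
    print_uids();

    char *binary = "/usr/bin/ls";
    char *child_argv[] = { binary, NULL };
    char *child_envp[] = { NULL };
    printf("executing %s\n", binary);
    fflush(stdout); /* avoid duplicating buffered output in the child */

    pid_t child = fork();
    if (child == -1)
    {
        printf("fork() failed: %s\n", strerror(errno));
    }
    else if (child == 0)
    {
        /* child: execve() only returns on failure */
        execve(binary, child_argv, child_envp);
        printf("execve() failed: %s\n", strerror(errno));
        fflush(stdout);
        _exit(127);
    }
    else
    {
        /* parent: reap the child so it does not linger as a zombie */
        waitpid(child, NULL, 0);
    }
    return 0;
}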
This code:
- Reduces the number of processes the current user can create (RLIMIT_NPROC) to a low value (e.g. 0)
- Drops privileges by setting the effective user ID to the real one via setuid()
- Creates a child process executing a given binary (e.g. /usr/bin/ls)
It basically does a setuid() + fork() + execve() chain of system calls.
The program has to be compiled and set as a root-SUID binary:
$ sudo gcc -o setuid_poc setuid_poc.c
$ sudo chmod +s setuid_poc
$ ls -l setuid_poc
-rwsr-sr-x 1 root root 16512 Jul 20 12:57 setuid_poc
The fork() call fails with EAGAIN as expected, but the preceding setuid() call succeeds regardless of RLIMIT_NPROC:
$ ./setuid_poc
Testing setuid()
uid: 1000 euid: 0
RLIMIT_NPROC current: 31507 max: 31507
Setting the limit to a low value: 0
RLIMIT_NPROC current: 0 max: 0
Calling setuid to drop privileges
uid: 1000 euid: 1000
executing /usr/bin/ls
fork() failed: Resource temporarily unavailable
History
Examples of that particular issue have been found in a large variety of binaries throughout the years:
- CVE-2006-4447: Xorg
- CVE-2011-2490: opielogin
- CVE-2011-2910: AX.25 daemon (ax25-tools)
- CVE-2013-4559: lighttpd
- CVE-2014-0470: Super
- CVE-2017-6964: dmcrypt-get-device
The most recent of these vulnerabilities were reported after the kernel 3.1 RLIMIT_NPROC fix (2011).
Detection
Clang’s static analyzer (scan-build) detects this behaviour with its security.insecureAPI.UncheckedReturn checker; see "Available Checkers" in the Clang 17.0.0git documentation (llvm.org).
It can also be searched for in source code with a simple regex that matches C lines containing only a set*id() call and nothing else: /^[ ]*sete?[ug]id[ ]*\([^0]+\);/
Unchecked calls of this kind are common.
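As a rough illustration, the same pattern can be applied line by line with the POSIX regex API. The function name is_bare_setid_call() is hypothetical. Note that the pattern as written only covers setuid(), setgid(), seteuid() and setegid(), only allows spaces (not tabs) as indentation, and does not match calls whose argument contains the character 0:

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: apply the pattern above to one line of C source.
 * Returns 1 if the line looks like a bare set*id() call, 0 if it does not,
 * and -1 if the pattern could not be compiled.
 */
int is_bare_setid_call(const char *line)
{
    regex_t re;
    int rc;

    /* Same expression as above, written as a POSIX extended regex. */
    if (regcomp(&re, "^[ ]*sete?[ug]id[ ]*\\([^0]+\\);",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A match is only a hint that a return value is being ignored; every hit still needs a manual look, and the static analyzer remains the more reliable option.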
Conclusion
Not checking the return values of set*id() syscalls is not a good idea, but the RLIMIT_NPROC attack path is largely mitigated by the Linux 3.1 fix. It seems to be a security risk only on kernel versions 2.6.0 through 3.0. set*id() calls can still fail for other reasons, though, so one should always check the return values of these functions.