Baggy bounds (continued)

Note: These lecture notes were slightly modified from the ones posted on the 6.858 course website from 2014.

Example code: (assume that slot_size = 16):

    char *p = malloc(44); // Note that the nearest power of 2 (i.e.,
                          // 64 bytes) are allocated. So, there are
                          // 64/(slot_size) = 4 bounds table entries
                          // that are set to log_2(64) = 6.
    char *q = p + 60;     // This access is ok: It's past p's object
                          // size of 44, but still within the baggy
                          // bounds of 64.
    char *r = q + 16;     // ERROR: r is now at an offset of 60+16=76
                          // from p. This means that r is (76-64)=12
                          // beyond the end of p. This is more than
                          // half a slot away, so baggy bounds will
                          // raise an error.
    char *s = q + 8;      // s is now at an offset of 60+8=68 from p.
                          // So, s is only 4 bytes beyond the baggy
                          // bounds, which is less than half a slot
                          // away. No error is raised, but the OOB
                          // high-order bit is set in s, so that s
                          // cannot be derefernced.
    char *t = s - 32;     // t is now back inside the bounds, so
                          // the OOB bit is cleared.

For OOB pointers, the high bit is set (if OOB within half a slot).

So what's the answer to the homework problem?

    char *p = malloc(255);
    char *q = p + 256;
    char ch = *q;  // Does this raise an exception?
                   // Hint: How big is the baggy bound for p?

Does baggy bounds checking have to instrument every memory address computation and access?

Handling function call arguments is a bit tricky, because the x86 calling convention is fixed, i.e., the hardware expects certain things to be in certain places on the stack.

How does baggy bounds checking ensure binary compatibility with existing libraries?

In particular, how does baggy bounds code interact with pointers to memory that was allocated by uninstrumented code?

Example:

    Contiguous range of
    memory used for the
    heap

    +-------------------+
    |                   |
    |                   |
    | Heap allocated by |
    |   uninstrumented  |---+
    |       code        |    \      Bounds table
    |                   |     \
    +-------------------+      \   +-----------+
    |                   |       +->|           |
    |                   |          | Always 31 |
    | Heap allocated by |          |           |
    | instrumented code |          +-----------+
    |                   |          | Set using |
    |                   |--------->| baggy bnds|
    +-------------------+          +-----------+

How can baggy bits leverage 64-bit address spaces?

Can get rid of the table storing bounds information, and put it in the pointer.

  Regular pointer       
    +---------------+-------+------------------------+
    |      zero     |  size |   supported addr space |
    +---------------+-------+------------------------+
            21          5             38

  OOB pointer
    +--------+------+-------+------------------------+
    | offset | size |  zero |   supported addr space |
    +--------+------+-------+------------------------+
        13      5       8             38

This is similar to a fat pointer, but has the advantages that...

  1. tagged pointers are the same size as regular pointers
  2. writes to them are atomic

...so programmer expectations are not broken, and data layouts stay the same.

Also note that, using tagged pointers, we can now keep track of OOB pointers that go much further out-of-bounds. This is because now we can tag pointers with an offset indicating how far they are from their base pointer. In the 32-bit world, we couldn't track OOB offsets without having an additional data structure!

Can you still launch a buffer overflow attack in a baggy bounds system?

Yes, because the world is filled with sadness.

Example:

    struct {
        char buf[256];
        void (*f) (void);
    } my_type;

Note that *f is not an allocated type, so there are no bounds checks associated with its dereference during invocation. Thus, if s.buf is overflowed (e.g., by a bug in an uninstrumented library) and s.f is corrupted, the invocation of f will not cause a bounds error!

Would re-ordering f and buf help?

In general, what are the costs of bounds checking?

So, baggy bounds checking is an approach for mitigating buffer overflows in buggy code.

More approaches for implementing bounds checking

Approach 4: non-executable memory (AMD's NX bit, Windows DEP, W^X, ...)

Approach 5: randomized memory addresses (ASLR, stack randomization, ...)

Which buffer overflow defenses are used in practice?

Return-oriented programming (ROP)

ASLR and DEP are very powerful defensive techniques.

These kinds of attacks are called return-oriented programming, or ROP. To understand how ROP works, let's examine a simple C program that has a security vulnerability. Example adapted from here.

    void run_shell(){
        system("/bin/bash");
    }

    void process_msg(){
        char buf[128];
        gets(buf);
    }

Let's imagine that the system does not use ASLR or stack canaries, but it does use DEP. process_msg() has an obvious buffer overflow, but the attacker can't use this overflow to execute shellcode in buf, since DEP makes the stack non-executable. However, that run_shell() function looks tempting... how can the attacker execute it?

Example:

                     +------------------+
    entry %ebp ----> | .. prev frame .. |
                     |                  |  
                     |                  |
                     +------------------+
    entry %esp ----> |  return address  | ^    <--Gets overwritten 
                     +------------------+ |       with address of
    new %ebp ------> |    saved %ebp    | |       run_shell()
                     +------------------+ |
                     |     buf[127]     | |
                     |       ...        | |
                     |      buf[0]      | |
    new %esp ------> +------------------+

That's a straightforward extension of the buffer overflows that we've already looked at. But how can we pass arguments to the function that we're jumping to?

   char *bash_path = "/bin/bash";

   void run_cmd(){
       system("/something/boring");
   }

   void process_msg(){
       char buf[128];
       gets(buf);
   }

In this case, the argument that we want to pass to is already located in the program code. There's also a preexisting call to system(), but that call isn't passing the argument that we want.

We know that system() must be getting linked to our program. So, using our trusty friend gdb, we can find where the system() function is located, and where bash_path is located.

To call system() with the bash_path argument, we have to set up the stack in the way that system() expects when we jump to it. Right after we jump to system(), system() expects this to be on the stack:

                 |        ...       |
                 +------------------+
                 |     argument     |  The system() argument.
                 +------------------+
    %esp ---->   |    return addr   |  Where system() should 
                 +------------------+  ret after it has
                                       finished.

So, the buffer overflow needs to set up a stack that looks like this:

                     +------------------+
    entry %ebp ----> | .. prev frame .. |
                     |                  |
                     |                  |
                     | *  - - - - - - - | ^
                     |                  | | Address of bash_path 
                     + *  - - - - - - - | |
                     |                  | | Junk return addr for system()
                     +------------------+ |
    entry %esp ----> |  return address  | | Address of system()
                     +------------------+ | 
    new %ebp ------> |    saved %ebp    | | Junk
                     +------------------+ |
                     |     buf[127]     | |
                     |       ...        | | Junk
                     |      buf[0]      | |
    new %esp ------> +------------------+ |

In essence, what we've done is set up a fake calling frame for the system() call! In other words, we've simulated what the compiler would do if it actually wanted to setup a call to system().

What if the string "/bin/bash" was not in the program?

Note that, in these examples, I've been assuming that the attacker used a junk return address from system(). However, the attacker could set it to something useful.

In fact, by setting it to something useful, the attacker can chain calls together!

Goal: We want to call system("/bin/bash") multiple times. Assume that we've found three addresses:

These opcodes are an example of a "gadget." Gadgets are preexisting instruction sequences that can be strung together to create an exploit. Note that there are user-friendly tools to help you extract gadgets from preexisting binaries (e.g., msfelfscan).

                     |                  | ^
                     + *  - - - - - - - + |
                     |                  | | Address of bash_path -+ Fake calling
                     + *  - - - - - - - + |                       | frame for
         (4)         |                  | | Address of pop/ret  * + system()
                     + *  - - - - - - - + | 
         (3)         |                  | | Address of system()
                     + *  - - - - - - - + |
         (2)         |                  | | Address of bash_path -+ Fake calling
                     + *  - - - - - - - + |                       | frame for
         (1)         |                  | | Address of pop/ret  * + system()
                     +------------------+ |
    entry %esp ----> |  return address  | | Address of system()
                     +------------------+ | 
    new %ebp ------> |    saved %ebp    | | Junk
                     +------------------+ |
                     |     buf[127]     | |
                     |       ...        | | Junk
    new %esp ------> |      buf[0]      | |
                     +------------------+ |

So, how does this work? Remember that the return instruction pops the top of the stack and puts it into %eip.

And so on and so forth.

Basically, we've created a new type of machine that is driven by the stack pointer instead of the regular instruction pointer! As the stack pointer moves down the stack, it executes gadgets whose code comes from preexisting program code, and whose data comes from stack data created by the buffer overflow.

This attack evades DEP protections--we're not generating any new code, just invoking preexisting code!

Stack reading: defeating canaries

Assumptions:

So, to determine an 8-byte canary value:

    char canary[8];
    for(int i = 1; i <= 8; i++){  //For each canary byte . . .
        for(char c = 0; c < 256; c++){  //. . . guess the value.
            canary[i-1] = c;
            server_crashed = try_i_byte_overflow(i, canary);
            if(!server_crashed){
                //We've discovered i-th byte of the
                //the canary!
                break;
            }
        }
    }

At this point we have the canary, but remember that the attack assumes that the server uses the same canary after a crash.

Guessing the correct value for a byte takes 128 guesses on average, so on a 32-bit system, we only need 4*128=512 guesses to determine the canary (on a 64-bit system, we need 8*128=1024).

So, we've discussed how we can defeat randomized canaries if canaries are not changed when a server regenerates. We've also shown how to use gdb and gadgets to execute preexisting functions in the program using arguments that the attacker controls. But what if the server DOES use ASLR? This prevents you from using offline analysis to find where the preexisting functions are?

This is what the paper for today's lecture discussed. That paper assumed that we're using a 64-bit machine, so that's what we'll assume in this lecture from now on. For the purposes of this discussion, the main change is that function arguments are now passed in registers instead of on the stack.

Blind return-oriented programming

Step 1: Find a stop gadget

Step 2: Find gadgets that pop stack entries

Example: Find a gadget that pops one thing off the stack.

                          sleep(10)
                          ^       ^
    +--- pop rax         /         \
    |    ret            /           \
    |     \--->[stop]  0x5....     0x5....
    |          [trap]  0x0         0x0    <-----------------+
    +----------[probe] 0x4...8     0x4...c -->xor rax, rax  | Crash!
                                              ret           |
                                                 \__________|

After you do this a bunch of times, you'll have a collection of gadgets that pop one thing from the stack and then return. However, you won't know which register those gadgets store the popped value in.

Step 3: Find syscall() and determine which registers the pop gadgets use

So, at the end of this chain, the pop gadgets have placed the syscall number for pause() in a bunch of registers, hopefully including rax, which is the one that syscall() looks in to find the syscall number.

Once this mega-gadget induces a pause, we know that we've determined the location of syscall(). Now we need to determine which gadget pops the top-of-the stack into rax. The attacker can figure this out by process-of-elimination: iteratively try just one gadget and see if you can invoke pause().

To identify arbitrary "pop x; ret" gadgets, you can use tricks with other system calls that use the x register that you're trying to find.

So, the outcome of this phase is knowledge of "pop x; ret" gadgets, location of syscall().

Step 4: Invoke write()

Now we want to invoke the write call on the network socket that the server has with the attacker's client. We need the following gadgets:

    pop rdi; ret (socket)
    pop rsi; ret (buffer)
    pop rdx; ret (length)
    pop rax; ret (write syscall number)
    syscall

We have to guess the socket value, but that's fairly easy to do, since Linux restricts processes to 1024 simultaneously open file descriptors, and new file descriptors have to be the lowest one available (so guessing a small file descriptor works well in practice).

To test whether we've guessed the correct file descriptor, simply try the write and see if we receive anything!

Once we have the socket number, we issue a write, and for the data to send, we send a pointer to the program's .text segment! This allows the attacker to read the program's code (which was randomized but now totally known to the attacker!). Now the attacker can find more powerful gadgets directly, and leverage those gadgets to open a shell.

Defenses against BROP

More info on ROP and x86 calling conventions