Massachusetts Institute of Technology, Computer Science and AI Lab, Cambridge, MA 02139 (smcc@csail.mit.edu)
Harvard University, Division of Engineering and Applied Sciences, Cambridge, MA 02138 (greg@eecs.harvard.edu)
Abstract: Executing untrusted code while preserving security requires that the code be prevented from modifying memory or executing instructions except as explicitly allowed. Software-based fault isolation (SFI) or "sandboxing" enforces such a policy by rewriting the untrusted code at the instruction level. However, the original sandboxing technique of Wahbe et al. is applicable only to RISC architectures, and most other previous work is either insecure or has not been described in enough detail to give confidence in its security properties. We present a new sandboxing technique that can be applied to a CISC architecture like the IA-32, and whose application can be checked at load time to minimize the TCB. We describe an implementation that provides a robust security guarantee and has low runtime overheads (an average of 21% on the SPECint2000 benchmarks). We evaluate the utility of the technique by applying it to untrusted decompression modules in an archive tool, and its safety by constructing a machine-checked proof that any program approved by the verification algorithm will respect the desired safety property.
Many systems programming applications can benefit from a code isolation mechanism that is efficient, robust, and easily applicable to existing code.

A useful technique to improve the reliability of operating systems is to isolate device drivers so that their failures (which may include arbitrary memory writes) do not corrupt the rest of the kernel. The Nooks system [26] achieves such isolation with hardware mechanisms that are robust, but impose a high overhead when many short cross-boundary calls are made; SFI could provide similar protection without high per-call overheads.

To reduce the damage caused by attacks on network servers, they should be designed to minimize the amount of code that requires high (e.g., root) privileges; but retrofitting such a design onto an existing server is difficult. Dividing servers along these lines by using separate OS-level processes [23, 14] is effective at containing the damage from vulnerabilities, but is far from trivial because of the need to serialize data for transport and to prevent an untrusted process from making damaging system calls. SFI could make such separation easier by automatically preventing system calls and making memory sharing more transparent.

Section 8 discusses VXA [11], an architecture that ensures compressed archives will be readable in the future by embedding an appropriate decompressor directly in an archive. Applying our SFI tool to VXA, we see that it very easily obtains a strong security guarantee, without imposing prohibitive runtime overheads.

Note that all of these examples involve large existing code bases written in C or C++, which would be impractical to rewrite in a new language; the language neutrality of SFI techniques is key to their applicability.
    void f(int arg, int arg2, int arg3, int arg4) {
        return;
    }

    void poke(int *loc, int val) {
        int local;
        unsigned diff = &local - loc - 2;
        for (diff /= 4; diff; diff--)
            alloca(16);
        f(val, val, val, val);
    }
Figure 1: Example attack against SFI systems which depend on the compiler's management of the stack for safety. The function poke modifies the stack pointer by repeatedly allocating unused buffers with alloca until it points near an arbitrary target location loc, which is then overwritten by one of the arguments to f. MiSFIT [25] and Erlingsson's x86 SASI tool [10] allow this unsafe write because they incorrectly assume that the stack pointer always points to the legal data region.
Suppose, for instance, that the data region starts at the address 0xda000000 and extends 16 megabytes to 0xdaffffff. With such a choice, an address can be efficiently checked to point inside the region using bitwise operations: in this case, we could check whether the bitwise AND of an address and the constant 0xff000000 was equal to 0xda000000. We'll use the term tag to refer to the portion of the address that is the same for every address in a region, such as 0xda above.
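To make the check concrete, here is a minimal sketch in C (our illustration, not code from any of the systems discussed; the constants are the ones from the example above):

    #include <stdint.h>

    /* Illustration only: a 16MB region starting at 0xda000000 is exactly
       the set of addresses whose top byte (the tag) is 0xda. */
    #define TAG_MASK 0xff000000u  /* selects the tag bits */
    #define DATA_TAG 0xda000000u  /* tag of the example data region */

    static int in_data_region(uint32_t addr) {
        return (addr & TAG_MASK) == DATA_TAG;
    }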
Indirect jumps, whose targets are computed at run time, are needed to implement switch statements, function pointers, and object dispatch tables, among other language features (Deutsch and Grant's system did not allow them). Indirect jumps must also be checked to see that their target address is in the allowed code region, but how can we also exclude the addresses of unsafe instructions, while allowing safe instruction addresses? The solution of Wahbe et al. is to use a dedicated register, say %rs, which is reserved for checked stores and is guaranteed, as an invariant, to always hold an address whose tag is 0xda.
Code can only be allowed to store an arbitrary value into %rs if it
immediately guarantees that the stored value really is appropriate.
If we know that this invariant holds whenever the code jumps, we can
see that even if the code jumps directly to an instruction that stores
to the address in %rs, all that will occur is a write to the data
region, which is safe (allowed by the security policy).
Of course, there is no reason why a correct program would jump
directly to an unsafe store instruction; the technique is needed for
incorrect or maliciously designed programs.

The approach of Wahbe et al. depends on the fact that RISC instructions are fixed-length and aligned, so a jump can never land in the middle of an instruction. On an architecture with variable-length instructions like the x86, by contrast, a jump into the middle of an instruction would cause the following bytes to be decoded as a different, unchecked instruction stream. To avoid this problem, our PittSFIeld tool artificially enforces its own alignment constraints on the x86 architecture. Conceptually, we divide memory into segments we call chunks, whose size and location is a power of two, say 16 bytes. PittSFIeld inserts no-op instructions as padding so that no instruction crosses a chunk boundary; every 16-byte aligned address holds a valid instruction. Instructions that are targets of jumps are put at the beginning of chunks; call instructions go at the ends of chunks, because the instructions after them are the targets of returns. This alignment is illustrated schematically in Figure 2.

Figure 2: Illustration of the instruction alignment enforced by our technique. Black filled rectangles represent instructions of various lengths present in the original program. Gray outline rectangles represent added no-op instructions. Instructions are not packed as tightly as possible into chunks because jump targets must be aligned, and because the rewriter cannot always predict the length of an instruction. Call instructions (gray filled box) go at the end of chunks, so that the addresses following them are aligned.
Furthermore, jump instructions are checked so that their target
addresses always have their low 4 bits zero.
This transformation means that each chunk is an atomic unit of
execution with respect to incoming jumps: it is impossible to execute
the second instruction of a chunk without executing the first.
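To make the alignment arithmetic concrete, here is a small sketch (hypothetical code, not taken from the PittSFIeld rewriter) of the padding computations involved:

    #include <stdint.h>

    #define CHUNK 16u  /* chunk size: a power of two */

    /* No-op padding needed so that an instruction of length len starting
       at addr does not cross a chunk boundary: if it does not fit in the
       remainder of the current chunk, pad to the start of the next one. */
    static uint32_t pad_to_fit(uint32_t addr, uint32_t len) {
        uint32_t room = CHUNK - (addr % CHUNK);
        return (len <= room) ? 0 : room;
    }

    /* Padding needed before a jump target, which must begin a chunk. */
    static uint32_t pad_to_align(uint32_t addr) {
        return (CHUNK - (addr % CHUNK)) % CHUNK;
    }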
To ensure that an otherwise unsafe operation and the check of its
operand cannot be separated, PittSFIeld additionally enforces that
such pairs of instructions do not cross chunk boundaries, making
them atomic.
Thus, our technique does not need dedicated registers as in classic
SFI.
A scratch register is still required to hold the effective address of
an operation while it is being checked, but it isn't required that the
same register be used consistently, or that other uses of the register
be prohibited.
(For reasons of implementation simplicity, though, our current system
consistently uses %ebx.)%ebp + 10
,
but this would not be the case if %ebp were already near the end of
the data region.
To handle this case efficiently, we follow Wahbe et al. in using guard
regions, areas in the address space directly before and after the
data region that are also safe for the sandboxed code to attempt to
write to.

In the layout our implementation uses, the code and data regions start at the addresses 0x10000000 and 0x20000000 respectively.
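The guard-region reasoning can be sketched as follows (a hypothetical illustration; the 64KB guard size is the one our implementation uses, described in Section 6):

    #include <stdint.h>

    #define GUARD_SIZE 0x10000u  /* 64KB unmapped guard on each side */

    /* If a base register is known to point into the data region, then an
       access at base + off needs no runtime check when off is smaller
       than the guard size: at worst it lands in a guard region, where it
       faults harmlessly instead of corrupting memory elsewhere. */
    static int offset_needs_check(uint32_t off) {
        return off >= GUARD_SIZE;
    }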
The code sequence to ensure that an address in %ebx is legal for the data region is:

    and $0x20ffffff, %ebx

This instruction turns off all of the bits in the tag except possibly the third from the top, so the address will be either in the data region or the zero-tag region starting at address 0. Because the zero-tag region is kept unmapped, an attempted write there faults harmlessly; allowing it as a target saves the additional instruction that would otherwise be needed to force the tag to be exactly 0x20. On examples such as the set of larger programs appearing in a previous report [18], disabling this optimization increases PittSFIeld's overhead over normal execution by about 10%.
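As a numeric illustration (our own sketch, not part of the tool), the effect of this mask on a few addresses:

    #include <stdio.h>
    #include <stdint.h>

    /* The mask clears every tag bit except bit 29, so the result's tag is
       either 0x20 (the data region) or 0x00 (the unmapped zero-tag region). */
    static uint32_t sandbox_data(uint32_t addr) {
        return addr & 0x20ffffffu;
    }

    int main(void) {
        printf("%08x\n", (unsigned) sandbox_data(0x20123456u)); /* 20123456: in-region address unchanged */
        printf("%08x\n", (unsigned) sandbox_data(0x3ead1234u)); /* 20ad1234: forced into the data region */
        printf("%08x\n", (unsigned) sandbox_data(0xdead1234u)); /* 00ad1234: lands in the unmapped zero-tag region */
        return 0;
    }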
A straightforward way to handle procedure returns would be to treat a ret as a pop plus an indirect jump, rewriting each ret instruction into (in this example and the next, the final two instructions must be in a single chunk):

    popl %ebx
    and  $0x10fffff0, %ebx
    jmp  *%ebx

An indirect jump of this kind is predicted using the processor's branch target buffer, which remembers only the most recent target of each jump. However, if a procedure is called from multiple locations, the single buffer slot will not be effective at predicting the return address, and performance will suffer. In order to deal more efficiently with returns, modern x86 processors keep a shadow stack of return addresses in a separate cache, and use this to predict the destinations of returns. To allow the processor to use this cache, we would like PittSFIeld to return from procedures using a real
ret
instruction.
Thus PittSFIeld modifies the return address and writes it back to the
stack before using a regular ret.
In fact, this can be done without a scratch register:

    andl $0x10fffff0, (%esp)
    ret

On a worst-case example, like a recursive implementation of the Fibonacci function, this optimization makes an enormous difference, reducing 95% overhead to 40%. In more realistic examples, the difference is smaller but still significant; for the SPECint2000 benchmarks discussed in Section 7, disabling this optimization increases the average overhead from 21% to 27%, and almost doubles the overhead for one program, 255.vortex.
The safety argument rests on a set of properties that can be checked to hold at each position in the rewritten code. For instance, directly after an and instruction not at a chunk boundary, we might know that the contents of the target register are appropriately sandboxed for use in accessing the data region.
The major part of the safety proof is to show that these properties
are sound for any possible execution; it is then easy to see that if
the properties always hold, no unsafe executions will be possible.
An important aspect of the soundness is that it is inductive over the
steps in the execution of the rewritten code: for instance, it is
important that none of the instructions in the code region change
during execution, as new instructions would not necessarily match the
static properties.
We can be confident of this only because, in the execution up to any given point, we can assume we were successful in preventing writes outside the data region.
In program verification terminology, the soundness property is an invariant that the verifier checks as being preserved by each instruction step.

The register %ebx is reserved (via the -ffixed-ebx
flag to
GCC), and used to hold the sandboxed address for accesses to both the
data and code regions.
The effective address of an unsafe operation is computed in %ebx
using a lea
instruction.
The value in %ebx is required to be checked or sandboxed directly
before each data write or indirect code jump (reads are
unrestricted).
Both direct and indirect jumps are constrained to chunk-aligned
targets.
Guard regions are 64KB in size, which allows %ebp and %esp to be treated as usually-sandboxed.
Accesses are allowed at an offset of up to 64KB from %ebp, and of up to 255 bytes from %esp; %esp is also allowed to be modified up to 255 times, by as much as 255 bytes each time, between checks. (These %esp bounds are sound because an unchecked access can then stray at most 255 x 255 + 255 = 65280 bytes past a checked value, which still falls within a 64KB guard region.) Both %ebp and %esp must be restored to safe values before a jump.
A safe value in %esp may be copied to %ebp or vice-versa without a
check.
Chunks are padded using standard no-op instructions of lengths 1, 2, 3, 4, 6, and 7 bytes, to a size of 16 or 32 bytes.

The rewriting phase of PittSFIeld is implemented as a text-processing tool of about 720 lines of code, operating on input to the GNU assembler gas.
Before (GCC output):

    f:  push %ebp
        mov  %esp, %ebp
        mov  8(%ebp), %edx
        mov  48(%edx), %eax
        lea  1(%eax), %ecx
        mov  %ecx, 48(%edx)
        pop  %ebp
        ret

After (PittSFIeld rewriting; rules mark chunk boundaries):

    f:  push %ebp
        mov  %esp, %ebp
        mov  8(%ebp), %edx
        mov  48(%edx), %eax
        lea  1(%eax), %ecx
        lea  0(%esi), %esi
        -------------------------------
        lea  48(%edx), %ebx
        lea  0(%esi), %esi
        lea  0(%edi), %edi
        -------------------------------
        and  $0x20ffffff, %ebx
        mov  %ecx, (%ebx)
        pop  %ebp
        lea  0(%esi), %esi
        -------------------------------
        and  $0x20ffffff, %ebp
        andl $0x10fffff0, (%esp)
        ret
Figure 3: Before and after example of code transformation. f is a function that takes an array of integers, increments the 12th, and returns (in %eax) the value before the increment. The assembly code on the left is produced by GCC; that on the right shows the results of the PittSFIeld rewriter after assembly. Rules separate the chunks, and no-op instructions are shown in gray. (Though they look the same here, the first three no-ops use different instruction encodings so as to take 4, 6, and 7 bytes respectively).
In most cases, alignment is achieved using the .p2align
directive to the assembler, which computes the correct number of
no-ops to add; the rewriter uses a conservative estimate of
instruction length to decide when to emit a .p2align.
The rewriter adds no-ops itself for aligning call instructions,
because they need to go at the end rather than the beginning of a
chunk.
The rewriter notices instructions that are likely to be used for their
effect on the processor status flags (e.g., comparisons), and saves
and restores the flags register around sandboxing operations when the
flags are deemed live.
However, such save and restore operations can be costly on modern
architectures, so to prevent GCC from moving comparisons away from
their corresponding branches, we disable instruction scheduling with
the
-fno-schedule-insns2
option when compiling for PittSFIeld.
An example of the rewriter's operation on a small function is shown in Figure 3.

We have implemented two verifiers. The first examines a disassembly listing of the rewritten code (produced by the GNU tool objdump), and represents about 500 lines of code.
Our second verifier is implemented directly in the program that loads
and executes sandboxed code objects, using a pre-existing disassembly
library; this allows for a better assessment of the performance
overheads of verification.
Though it does not yet check the complete safety policy, the second
verifier is complete enough to give rough performance estimates: for
instance, it can verify the 2.7MB of rewritten code for GCC, the
largest of the programs from Section 7, in about
half a second.
Both of our verifiers are much smaller than the disassemblers they
use, so the total amount of trusted code could be reduced by
disassembling only to the extent needed for verification, but using
existing disassemblers takes advantage of other users' testing.
Performing more targeted disassembly in this way would also be
a way to further improve performance.
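As a rough illustration of the flavor of these checks, the following hypothetical C sketch (not PittSFIeld's actual verifier; decode_insn is an assumed stand-in for a disassembly library) verifies two of the properties for a code region: no instruction crosses a chunk boundary, and every store or jump through %ebx is immediately preceded, in the same chunk, by the appropriate AND:

    #include <stdint.h>
    #include <stdbool.h>

    #define CHUNK     16u
    #define DATA_MASK 0x20ffffffu  /* sandboxing constant for data writes */
    #define CODE_MASK 0x10fffff0u  /* sandboxing constant for jump targets */

    /* A decoded instruction; decode_insn must set len > 0 or fail. */
    typedef struct {
        uint32_t addr;           /* address of the instruction */
        uint32_t len;            /* length in bytes */
        bool     store_via_ebx;  /* e.g. mov %eax, (%ebx) */
        bool     jump_via_ebx;   /* jmp *%ebx */
        bool     and_ebx;        /* and $imm, %ebx */
        uint32_t and_imm;        /* the immediate, when and_ebx is set */
    } insn_t;

    extern bool decode_insn(const uint8_t *bytes, uint32_t addr, insn_t *out);

    bool verify(const uint8_t *code, uint32_t start, uint32_t size) {
        insn_t prev = {0}, cur;
        for (uint32_t a = start; a < start + size; a += cur.len) {
            if (!decode_insn(code + (a - start), a, &cur))
                return false;  /* unrecognized instruction: reject */
            if (a / CHUNK != (a + cur.len - 1) / CHUNK)
                return false;  /* instruction crosses a chunk boundary */
            bool same_chunk = (prev.addr / CHUNK) == (a / CHUNK);
            bool data_ok = same_chunk && prev.and_ebx && prev.and_imm == DATA_MASK;
            bool code_ok = same_chunk && prev.and_ebx && prev.and_imm == CODE_MASK;
            if (cur.store_via_ebx && !data_ok)
                return false;  /* store not preceded by data sandboxing */
            if (cur.jump_via_ebx && !code_ok)
                return false;  /* jump not preceded by target sandboxing */
            prev = cur;
        }
        return true;
    }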
PittSFIeld supports a large subset of the x86 32-bit protected mode
instruction set, but supervisor mode instructions, string
instructions, and multimedia SIMD (e.g. MMX, SSE) instructions are not
supported; the verifier will reject any program containing an
instruction it does not recognize.

To evaluate PittSFIeld's performance, we used the SPECint2000 benchmark suite, compiling the programs with GCC at the -O3 optimization level.
The test system was a 3.06GHz Pentium 4 "Northwood", with 512KB of
cache and 2GB of main memory, running Debian Linux 3.1 with kernel
version 2.4.28 and C library version 2.3.2.
We changed the layout of the code and data sandbox areas to allow a
larger data area.
Each test was run five times with the reference inputs using the
standard SPEC scripts; we discarded the slowest and fastest runtimes
and took the average of the remaining three.
In order to measure the effect on performance of different aspects of PittSFIeld's rewriting, we ran the programs using a number of treatments, representing increasing subsets of the transformation the real tool performs. Figure 4 shows the increase in runtime overhead as each transformation is enabled, from bottom to top.

The base treatment uses PittSFIeld's program loader, but compiles the programs with normal optimization and uses none of the features of the rewriter. The measurements of Figure 4 are all measured as percentage overhead relative to the base treatment. The first (bottom) set of bars in Figure 4 represents disabling instruction scheduling with an option to GCC. Disabling this optimization has a small performance penalty, but avoids higher overheads later by reducing the need to save and restore the EFLAGS register, as discussed in Section 6. The next set of bars represents the effect of directing GCC to avoid using the %ebx register in its generated code, reducing the number of general-purpose registers from 6 to 5; PittSFIeld requires %ebx to be available to hold the effective address of indirect writes and jumps.

The next treatment, labelled "padding", reflects the basic cost of requiring chunk alignment: the rewriter adds enough no-op instructions so that no instruction crosses a 16-byte boundary, and every jump target is 16-byte aligned. The next set of bars, labelled "NOP sandboxing", attempts to measure all of the additional overheads related to PittSFIeld's code size increase, beyond those measured in "padding". To achieve this, this treatment adds just as many bytes of new instructions as PittSFIeld normally would, but makes all of them no-ops: this includes both sandboxing instructions, and additional padding required for the new instructions and to keep some instruction pairs in the same chunk. Finally, the last set of bars represents the complete PittSFIeld transformation: exactly the same number of instructions as "NOP sandboxing", but with AND instructions instead of no-ops as appropriate.

For the last two treatments, we also considered another subset of PittSFIeld's rewriting: the left half of each bar shows the overhead when PittSFIeld is used to protect both writes to memory and indirect jumps; the right half shows the overhead for protecting jumps only. For some combinations of programs and treatments, we actually measured a small performance improvement relative to the previous treatment, either because of the inaccuracy of our runtime measurement or because of unpredictable performance effects such as instruction cache conflicts. In these cases the corresponding bars are omitted from the figure.
Figure 4: Runtime overheads of PittSFIeld for the SPECint2000 benchmarks, by source. The left half of each bar represents overhead when both jumps and memory writes are protected; the right half shows the overhead of protecting jumps only. The programs are listed in decreasing order of binary size. See the body of the text, Section 7, for details on the meaning of each type of overhead.
Figure 5 shows how PittSFIeld's transformation affects the size of the code. The row labelled "Size" shows the size of a binary rewritten by PittSFIeld, in bytes (K = 2^10, M = 2^20). This size includes the program and the stub library, both rewritten by PittSFIeld in its default mode (16-byte chunks, both memory and jump protection). The row "Ratio" shows the ratio of the size of an object file processed by PittSFIeld to that of an unmodified program. The row "Compressed" is analogous, except that both the transformed and original object files were first compressed with bzip2.
    Program     gcc    perl   vortex  eon    gap    crafty  twolf  parser  vpr    gzip   bzip2  mcf
    Size        2.7M   1.2M   1010K   923K   853K   408K    390K   276K    267K   109K   108K   50K
    Ratio       1.84   1.96   1.63    1.72   1.84   1.62    1.80   1.92    1.67   1.65   1.63   1.74
    Compressed  1.05   1.07   0.98    1.05   1.05   1.06    1.08   1.06    1.07   1.10   1.09   1.13
Figure 5: PittSFIeld space overheads for the SPECint2000 benchmarks. "Size" is the size of the PittSFIeld-rewritten binary. "Ratio" is the ratio of the size of the rewritten binary to the size of a binary generated without rewriting. "Compressed" is like "Ratio", except with both binaries compressed with bzip2 before comparing their sizes.
Which of these measurements is relevant depends on the application.
The most important effect of PittSFIeld's size increase in most
applications is likely its effect on performance, discussed in the
previous paragraph.
Uncompressed size is relevant for memory usage, for instance on
smaller devices.
Compressed size is more relevant, for instance, to the cost of storing
and distributing software; the compressed ratios are smaller because
the added instructions tend to be repetitive.

The author of VXA was not aware of PittSFIeld at the time it was designed, but to examine whether PittSFIeld would be a suitable replacement for VX32, we used it to protect the execution of the six VXA decompression modules demonstrated in [11]. We used VX32's virtual C library rather than the one used in Section 7; this required implementing VXA's four virtual system calls (read, write, _exit, and sbrk). We also used VX32's library of math functions, but compiled to use the x87-style floating point instructions supported by PittSFIeld rather than the SSE2 ones VX32 uses.

The runtime overheads of VX32 and PittSFIeld are compared in Figure 6. Zlib and BZip2 are decompressors for the same general-purpose compression formats as the SPECint2000 gzip and bzip2 programs (which also include compression); JPEG and JPEG2000 are lossy image compression formats, and FLAC and Vorbis are lossless and lossy audio formats respectively. In each case the programs decompressed large single files of each format. To minimize I/O overhead, we made sure the input files were cached in memory, and sent the decompressed output to /dev/null; measurements are based on elapsed time.
                          Zlib   BZip2  JPEG   JPEG2000  FLAC   Vorbis  Geom. Mean
    VX32                  1.006  0.975  1.034  1.283     0.954  0.948   1.028
    PittSFIeld jump-only  1.238  1.018  1.134  1.114     1.142  1.239   1.145
    PittSFIeld full       1.398  1.072  1.328  1.211     1.241  1.458   1.278
Figure 6: Run time ratios for VX32 and PittSFIeld on the VXA decompressors, compared to natively compiled decompressors.
The system was the same Pentium 4 machine described in
Section 7, except that VX32 uses a specially
compiled version of GCC 4.0, and the native build uses Debian's GCC
4.0 to match.

We have constructed the proof using ACL2 [13]. ACL2 is a theorem-proving system that combines a restricted subset of Common Lisp, used to model a system, with a sophisticated engine for semi-automatically proving theorems about those models. We use the programming language (which is first-order and purely functional) to construct a simplified model of our verifier, and a simulator for the x86 instruction set. Then, we give a series of lemmas about the behavior of the model, culminating in the statement of the desired safety theorem. The lemmas are chosen to be sufficiently elementary that ACL2 can automatically prove each from the model and the preceding lemmas.

The proof required less than two months of effort by a user with no previous experience with proof assistants (the first author). An experienced ACL2 user could likely have produced a more elegant proof in less time; our inexperience in choosing abstractions also made the going more difficult as the proof size increased. An example of a function from the executable model and a lemma we have proved about it are shown as the first two parts of Figure 7.

A disadvantage of ACL2 compared to some other theorem provers is that its proofs cannot be automatically checked by a simpler proof checker. However, ACL2 has been well tested by other academic and industrial users, and its underlying logic is simple, so we still consider it trustworthy.
    (defun seq-reachable-rec (mem eip k)
      (if (zp k)
          (if (= eip (code-start)) 0 nil)
        (let ((kth-insn (kth-insn-from mem (code-start) k)))
          (or (and kth-insn (= eip kth-insn) k)
              (seq-reachable-rec mem eip (- k 1))))))

    (defthm if-reach-in-k-then-bound-by-kth-insn
      (implies (and (mem-p mem) (natp k) (natp eip)
                    (kth-insn-from mem (code-start) k)
                    (seq-reachable-rec mem eip k))
               (<= eip (kth-insn-from mem (code-start) k))))

    (defthm safety
      (implies (and (mem-p mem) (mem-sandbox-ok mem)
                    (addr-p eax) (addr-p ebx) (addr-p ecx) (addr-p edx)
                    (addr-p esi) (addr-p edi) (addr-p ebp) (addr-p esp)
                    (addr-p eflags) (data-region-p ebp))
               (consp (step-for-k (x86-state (code-start) eflags
                                             eax ebx ecx edx esi edi
                                             ebp esp mem)
                                  k))))
Figure 7: From top to bottom, a typical function definition, a typical lemma, and the final safety result from our formal ACL2 proof. seq-reachable-rec is a recursive procedure that checks whether the instruction at location eip is among the first k instructions reachable from the beginning of the sandboxed code region in a memory image mem. The lemma states that if eip is among the first k instructions, then its address is at most that of the kth instruction. The safety theorem states that if a memory image mem passes the verifier mem-sandbox-ok, then whatever the initial state of the registers, execution can proceed for any number of steps (the free variable k) without causing a safety violation (represented by a nil return value from step-for-k, which would not satisfy the predicate consp).
The safety theorem is stated in terms of the predicate mem-sandbox-ok on the state of the code region before
execution: if the verifier approves the rewritten code, then for any
inputs (modelled as the initial contents of registers and the data
region), execution of the code will continue forever without
performing an unsafe operation.
(Unlike the real system, the model has no exit()
function.)
Note that the rewriter does not appear in the formal proof, so the
proof makes no claims about it: for instance, we have not proved that
the code produced by the rewriter has the same behavior as the
original code.
Though a statement like that could be formalized, it would require a number of additional hypotheses; in particular, because the rewriter changes the addresses of instructions, code that, say, examined the numeric values of function pointers would not behave identically.
One aspect of the proof to note is that it deals with a subset of the instructions handled by the real tool: this applies both to which instructions are allowed by the simulated verifier, and to which can be executed by the x86 simulator. The subset used in the current version of the proof appears in Figure 8. The instructions were chosen to exercise all of the aspects of the security policy; for instance,
    nop
    mov  addr, %eax
    xchg %eax, %ebx
    inc  %eax
    mov  %eax, addr
    xchg %eax, %ebp
    jmp  addr
    and  $immed, %ebx
    mov  %eax, (%ebx)
    jmp  *%ebx
    and  $immed, %ebp
    mov  %eax, (%ebp)
Figure 8: List of instructions in the subset considered in the proof of Section 9.
jmp *%ebx
is included to demonstrate an indirect jump.
Though small compared to the number of instructions allowed by the
real tool, this set is similar to the instruction sets used in recent
similar proofs [2, 29].
We constructed the proof by beginning with a minimal set of
instructions and then adding additional ones: adding a new instruction
similar to an existing one required few changes, but additions that
required a more complex safety invariant often involved extensive
modifications.
The simulator is structured so that an attempt to execute any
un-modelled instruction causes an immediate failure, so safety for a
program written in the subset that is treated in the proof extends to
the complete system.
A related concern is whether the simulated x86 semantics match those
of a real processor: though the description of the subset used in the
current proof can be easily checked by hand, this would be impractical
for a more complete model.
To facilitate proofs like ours in the future, as well as for applications such as foundational proof-carrying code (see Section 10.6), it should be possible to generate a description of the encoding and semantics of instructions from a concise, declarative, and proof-environment-neutral specification.

In concurrent work [1], the Gleipnir project at Microsoft Research has investigated a binary-rewriting security technique called Control-Flow Integrity, or CFI. As suggested by the name, CFI differs from SFI in focusing solely on constraining a program's jumps: in the CFI implementation, each potential jump target is labelled by a 32-bit value encoded in a no-op instruction, and an indirect jump checks for the presence of an appropriate tag before transferring control. This approach gives finer control of jump destinations than the SFI techniques of Wahbe et al. or PittSFIeld, though the ideal precision could only be obtained with a careful static analysis of, for instance, which function pointers might be used at which indirect call sites. In the basic presentation, CFI relies on an external mechanism (such as hardware) to prevent changes to code or jumps to a data region, but it can also be combined with inserted memory-operation checks, as in SFI, to enforce these constraints simultaneously.
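As a hypothetical illustration of the shape of such a check (our sketch, not the Gleipnir implementation, which encodes the label in a no-op instruction at the target):

    #include <stdint.h>
    #include <string.h>

    /* Before an indirect transfer, compare the 32-bit label stored at the
       destination with the label expected for this call site; transfer
       control only if they match. */
    static int cfi_target_ok(const uint8_t *target, uint32_t expected_label) {
        uint32_t label;
        memcpy(&label, target, sizeof label);  /* read the embedded label */
        return label == expected_label;
    }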
Figure 9: Runtime overheads for PittSFIeld in the default mode (black bars), PittSFIeld in jump-only mode (gray bars), and CFI (white bars) for the SPECint2000 benchmarks. PittSFIeld results are the same as those in Figure 4, but not broken down by cause. CFI results are taken from Figure 4 of [1], which does not include results for Perl. Because these were separate experiments with other variables not held constant, care should be used in comparing them directly.