## Memory-Mapped Transactions

by

### Jim Sukha

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

#### MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2005

© Massachusetts Institute of Technology 2005. All rights reserved.

| Certified by |                                               |
|--------------|-----------------------------------------------|
|              | Charles E. Leiserson                          |
|              | Professor of Computer Science and Engineering |
|              | Thesis Supervisor                             |
| Certified by |                                               |
| v            | Bradley C. Kuszmaul                           |
|              | Research Scientist                            |
|              | Thesis Supervisor                             |
|              |                                               |

Accepted by ...... Arthur C. Smith

Chairman, Department Committee on Graduate Theses

#### Memory-Mapped Transactions by Jim Sukha

Submitted to the Department of Electrical Engineering and Computer Science on May 19, 2005, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

#### Abstract

Memory-mapped transactions combine the advantages of both memory mapping and transactions to provide a programming interface for concurrently accessing data on disk without explicit I/O or locking operations. This interface enables a programmer to design a complex serial program that accesses only main memory, and with little to no modification, convert the program into correct code with multiple processes that can simultaneously access disk.

I implemented LIBXAC, a prototype for an efficient and portable system supporting memorymapped transactions. LIBXAC is a C library that supports atomic transactions on memory-mapped files. LIBXAC guarantees that transactions are serializable, and it uses a multiversion concurrency control algorithm to ensure that all transactions, even aborted transactions, always see a consistent view of a memory-mapped file. LIBXAC was tested on Linux, and it is portable because it is written as a user-space library, and because it does not rely on special operating system support for transactions.

With LIBXAC, I was easily able to convert existing serial, memory-mapped implementations of a  $B^+$ -tree and a cache-oblivious B-tree into parallel versions that support concurrent searches and insertions. To test the performance of memory-mapped transactions, I ran several experiments inserting elements with random keys into the LIBXAC B<sup>+</sup>-tree and LIBXAC cache-oblivious B-tree. When a single process performed each insertion as a durable transaction, the LIBXAC search trees ran between 4% slower and 67% faster than the B-tree for Berkeley DB, a high-quality transaction system. Memory-mapped transactions have the potential to greatly simplify the programming of concurrent data structures for databases.

Thesis Supervisor: Charles E. Leiserson Title: Professor of Computer Science and Engineering

Thesis Supervisor: Bradley C. Kuszmaul Title: Research Scientist

#### Acknowledgments

These acknowledgments are presented in a random order:

I would like to thank my advisor, Charles E. Leiserson for his helpful comments on this thesis, and on his helpful advice in general.

I would also like to thank Bradley C. Kuszmaul, who has been deeply involved in this project from day one. Bradley's initial idea started me on this project that eventually became LIBXAC, and I have had many helpful discussions with him on this topic ever since.

I'd like to thank Bradley for giving me an implementation of a B<sup>+</sup>-tree, and Zardosht Kasheff for the code for the cache-oblivious B-tree. The experimental results on search trees without LIBXAC are primarily the work of Bradley, Michael A. Bender, and Martin Farach-Colton.

All of the people in the SuperTech group deserve a special round of acknowledgments. Those people in SuperTech, also in random order, include Gideon, Kunal, Jeremy, Angelina, John, Tim, Yuxiong, Vicky, and Tushara. Everyone has been wonderfully patient in listening to my ramblings about transactions, names for LIBXAC, research, and life in general. They have all kept me (relatively) sane throughout the past year. In particular, I'd like to Kunal, Angelina, and Jeremy for their feedback on parts of my thesis draft, and more generally for choosing to serve the same sentence. I'd also like to thank Ian and Elizabeth, who are not in the SuperTech group, but have also been subjected to conversations about transactions and other research topics. Thanks to Leigh Deacon, who kept SuperTech running administratively.

Thanks to Akamai and MIT for the Presidential fellowship that funded my stay here this past year. Also, thanks to the people I met through SMA for their helpful comments on the presentation I gave in Singapore.

Thank you to my family and to my friends, here at MIT and elsewhere. Without their support, none of this would have been possible.

I apologize to all those other people I am sure I have missed that deserve acknowledgments. I will try to include you all when the next thesis rolls around.

This work was partially supported by NSF Grant Numbers 0305606d and 0324974, and by the Singapore MIT Alliance. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

## Contents

| 1 | Introduction         1.1       Explicit I/O vs. Memory Mapping                                                                                                                                                                                                                                                    | <b>13</b><br>14                                                                                |
|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
|   | 1.2       Locks vs. Transactions         1.3       Concurrent Disk-Based Programs         1.4       Thesis Overview                                                                                                                                                                                               | $     \begin{array}{r}       17 \\       21 \\       22     \end{array} $                      |
| 2 | The Libxac Interface         2.1 Programming with LIBXAC         2.2 Semantics of Aborted Transactions         2.3 Optimizations for LIBXAC         2.4 Related Work         2.5 Advantages of the LIBXAC Interface                                                                                               | <b>25</b><br>26<br>30<br>32<br>33<br>35                                                        |
| 3 | The Libxac Implementation         3.1       Overview         3.2       Transactions on a Single Process         3.3       Concurrent Transactions         3.4       Conclusion                                                                                                                                    | <b>37</b><br>37<br>40<br>43<br>46                                                              |
| 4 | A Performance Study         4.1 The Hardware         4.2 Performance of Nondurable Transactions         4.2.1 Page-Touch Experiment         4.2.2 Page-Touch With an Advisory Function         4.2.3 Decomposing the Per-Page Overhead         4.3 Durable Transactions         4.4 Testing Concurrency in LIBXAC | <ol> <li>49</li> <li>50</li> <li>51</li> <li>51</li> <li>53</li> <li>58</li> <li>59</li> </ol> |
| 5 | Search Trees Using Libxac         5.1 Introduction                                                                                                                                                                                                                                                                | <ul> <li>63</li> <li>63</li> <li>66</li> <li>68</li> <li>68</li> <li>71</li> </ul>             |
| 6 | Conclusion         6.1       Ideas for Future Work         6.2       LIBXAC and Transactional Memory                                                                                                                                                                                                              | <b>73</b><br>73<br>74                                                                          |
| A | Restrictions to the Libxac Interface         A.1 Restrictions to LIBXAC                                                                                                                                                                                                                                           | <b>79</b><br>79                                                                                |
| в | Transaction Recovery in Libxac                                                                                                                                                                                                                                                                                    | 81                                                                                             |

| $\mathbf{C}$ | Deta | ailed Experimental Results          | <b>85</b> |
|--------------|------|-------------------------------------|-----------|
|              | C.1  | Timer Resolution                    | 85        |
|              | C.2  | Page-Touch Experiments              | 86        |
|              | C.3  | Experiments on Various System Calls | 86        |
|              | C.4  | Durable Transactions                | 90        |
|              | C.5  | Concurrency Tests                   | 92        |
|              | C.6  | Search Trees using LIBXAC           | 92        |

# List of Figures

| 1-1        | Two versions of a simple C program that reads the first 4-byte integer from each of 5 randomly selected pages of a file, computes their sum, and stores this value as the first 4-byte integer on a randomly selected 6th page. In Program A, the entire file is brought into memory using <b>read</b> , the selected pages are modified, and the entire file is written out to disk using <b>write</b> . On the first 5 pages, Program B does an explicit disk seek to read the first integer. Then B does a seek and a write to modify the 6th | 1 5             |
|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| 1-2        | page                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | $15 \\ 16$      |
| 1-2<br>1-3 | Concurrent processes sharing data through a memory mapped file.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 10              |
| 1-4        | An interleaving of instructions from the execution of the programs in Figure 1-3 that causes a data race.                                                                                                                                                                                                                                                                                                                                                                                                                                        | 18              |
| 1-5        | Two different locking protocols for the program in Figure 1-2. Program D acquires a global lock, while Program E acquires a different lock for every page. For simplicity, only the body of the code is shown here.                                                                                                                                                                                                                                                                                                                              | 18              |
| 1-6        | Program F. This version of the program from Figure 1-2 is written with memory-<br>mapped transactions.                                                                                                                                                                                                                                                                                                                                                                                                                                           | 20              |
| 1-7        | An illustration of the possible space of programs that can concurrently access data on disk.                                                                                                                                                                                                                                                                                                                                                                                                                                                     | 21              |
| 2-1        | LIBXAC program that increments the first integer of a memory-mapped file                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 27              |
| 2-2        | A side-by-side comparison of the program in Figure 1-2 and the parallel version written with LIBXAC.                                                                                                                                                                                                                                                                                                                                                                                                                                             | 28              |
| 2-3<br>2-4 | A recursive function that uses nested transactions                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | $\frac{29}{31}$ |
| 3-1<br>3-2 | Changes to the memory map for a simple transaction                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | $\frac{39}{44}$ |
| 4-1        | A simple transaction that (a) reads from $n$ consecutive pages, and (b) writes to $n$ consecutive pages.                                                                                                                                                                                                                                                                                                                                                                                                                                         | 51              |
| 4-2        | Average time per page to execute the transactions shown in Figure 4-1 on Machine 1. For each value of $n$ , each transaction was repeated 1000 times                                                                                                                                                                                                                                                                                                                                                                                             | 52              |
| 4-3        | The transactions in Figure 4-1 written with the advisory function. The transaction in (a) reads from $n$ consecutive pages, the transaction in (b) writes to $n$ consecutive                                                                                                                                                                                                                                                                                                                                                                     |                 |
|            | pages                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 53              |
| 4-4        | Average time per page to execute the transactions shown in Figure 4-3 on Machine 1. For each value of $n$ , each transaction was repeated 1000 times                                                                                                                                                                                                                                                                                                                                                                                             | 54              |
| 4-5        | Concurrency Test A: Each transaction increments the first integer on a page 10,000 times.                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 60              |
| 4-6        | Concurrency Tests B and C: Test B increments every integer on the page. Test C repeats the transaction in Test B 1,000 times. I omit the outermost for-loop, but as                                                                                                                                                                                                                                                                                                                                                                              |                 |
|            | in Figure 4-5, each transaction is repeated 10,000 times                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 60              |
| 5-1        | An illustration of the Disk Access Machine (DAM) model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 64              |

|     | The van Emde Boas layout (left) in general and (right) of a tree of height $5$<br>Machine 3: Time for kth most expensive insert operation.           | $\begin{array}{c} 65 \\ 70 \end{array}$ |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
| B-1 | An example of a LIBXAC log file when transactions execute (a) on one process, and (b) on two processes                                               | 82                                      |
|     | Machine 1:Distribution of Delay Times Between Successive gettimeofday Calls                                                                          | 86                                      |
| C-2 | Average time per page to execute the transactions shown in Figure 4-3 on Machine 2. For each value of $n$ , each transaction was repeated 1000 times | 87                                      |
| C-3 | Average time per page to execute the transactions shown in Figure 4-3 on Machine                                                                     |                                         |
|     | 3. For each value of $n$ , each transaction was repeated 1000 times                                                                                  | 88                                      |
| C-4 | Average time per page to execute the transactions shown in Figure 4-3 on Machine                                                                     |                                         |
|     | 4. For each value of $n$ , each transaction was repeated 1000 times                                                                                  | 89                                      |

# List of Tables

| 2.1               | The LIBXAC functions for nondurable transactions.                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 26                                     |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|
| 4.1<br>4.2        | Processor speeds and time per clock cycle for the test machines<br>Average $\#$ of clock cycles per page access for transactions touching 1024 pages, with<br>and without the advisory function. Numbers are in thousands of cycles. Percent                                                                                                                                                                                                                                                                       | 50                                     |
| 4.3               | speedup is calculated as $100 \left( \frac{\text{Normal - With Adv}}{\text{Normal}} \right)$                                                                                                                                                                                                                                                                                                                                                                                                                       | 53                                     |
| 4.4               | (average of 10,000 repetitions)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 55                                     |
| $4.5 \\ 4.6$      | page. Each experiment was repeated 5 times. $\dots$ Clock cycles required for a memcpy between two 4K character arrays in memory. $\dots$ Average $\#$ of clock cycles per page access for transactions touching 1024 pages. All                                                                                                                                                                                                                                                                                   | 56<br>56                               |
| 4.7               | numbers are in thousands of clock cycles. $\dots$ Average Access Time ( $\mu$ s) per Page, for Transactions Touching 1024 Pages. $\dots$                                                                                                                                                                                                                                                                                                                                                                           | $\frac{57}{58}$                        |
| 4.8<br>4.9        | Time required to call msync and fsync on a 10,000 page file with one random page modified, 1000 repetitions. All times are in ms. $\dots \dots \dots$                                                                                                                                                                                                                                                              | 59                                     |
|                   | is calculated as time on 1 processor over time on 2 processors. $\dots \dots \dots \dots$<br>Concurrency tests for durable transactions. Times are milliseconds per transaction.                                                                                                                                                                                                                                                                                                                                   | $\begin{array}{c} 59\\ 61 \end{array}$ |
| 5.1<br>5.2        | Performance measurements of 1000 random searches on static trees. Both trees use 128-byte keys. In both cases, we chose enough data so that each machine would have to swap. On the small machine, the CO B-tree had $2^{23}$ (8M) keys for a total of 1GB. On the large machine, the CO B-tree had $2^{29}$ (512M) keys for a total of 64GB Timings for memory-mapped dynamic trees. The keys are 128 bytes long. The range query is a scan of the entire data set after the insert. Berkeley DB was run with the | 66                                     |
| <b>F</b> 0        | default buffer pool size (256KB), and with a customized loader that uses 64MB of<br>buffer pool. These experiments were performed on the small machine.                                                                                                                                                                                                                                                                                                                                                            | 67                                     |
| 5.3<br>5.4<br>5.5 | The time to insert a sorted sequence 450,000 keys. Inserting sorted sequence is the most expensive operation on the packed memory array used in the dynamic CO B-tree. Changes in Code Length Converting B <sup>+</sup> -tree and CO B-tree to Use LIBXAC Time for 250,000 durable insertions into LIBXAC search trees. All times are in ms.                                                                                                                                                                       | 67<br>68                               |
| 5.6               | Percent speedup is calculated as $\frac{100(t_1-t_2)}{t_2}$ , where $t_1$ and $t_2$ are the running times on 1 and 2 processors, respectively                                                                                                                                                                                                                                                                                                                                                                      | 69                                     |
|                   | Berkeley DB tree, respectively. Speedup is $t_1/t_2$ .                                                                                                                                                                                                                                                                                                                                                                                                                                                             | 69                                     |
| C.1<br>C.2        | Delay (in clock cycles) between successive calls to timer using rdtsc instruction,<br>10,000 repetitions. $\dots \dots \dots$                                                                                                                                                                                                                                                                                      | 85<br>86                               |
| C.3               | Timing data for entering SIGSEGV handler, calling mmap, and leaving handler, 10,000 repetitions. All times are processor cycles.                                                                                                                                                                                                                                                                                                                                                                                   | 90                                     |

| 90 |
|----|
| 91 |
|    |
| 91 |
| 91 |
|    |
| 92 |
| 93 |
| 94 |
| 95 |
|    |
| 95 |
|    |