Memory-mapped (MMAP) file I/O is an OS-provided feature that
maps the contents of a file on secondary storage into a program’s
address space. The program then accesses pages via pointers as if
the file resided entirely in memory. The OS transparently loads
pages only when the program references them and automatically evicts
pages if memory fills up.
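To make the mechanism concrete, the following is a minimal sketch using Python's standard `mmap` module; it is not taken from the paper. The helper names `make_demo_file` and `read_byte_via_mmap` are hypothetical and exist only for this illustration. The program maps a file read-only and then reads one byte by indexing, leaving it to the OS to decide when the underlying page is actually brought into memory.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size as reported by the mmap module


def make_demo_file(num_pages: int) -> str:
    """Hypothetical helper: write num_pages pages, page i filled with byte i % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_pages):
            f.write(bytes([i % 256]) * PAGE)
    return path


def read_byte_via_mmap(path: str, offset: int) -> int:
    """Hypothetical helper: map the file read-only and return the byte at offset."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file into the address space.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Indexing looks like a plain memory access; the program issues
            # no explicit read() call for the page that holds this byte.
            return m[offset]


if __name__ == "__main__":
    path = make_demo_file(4)
    try:
        # Read the first byte of the third page.
        print(read_byte_via_mmap(path, 2 * PAGE))
    finally:
        os.remove(path)
```

Note that the program never requests a specific page from storage or releases one; both decisions are left to the OS, which is the convenience (and the loss of control) that the rest of this abstract is concerned with.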
MMAP’s perceived ease of use has seduced database management system
(DBMS) developers for decades as a viable alternative to
implementing a buffer pool. There are, however, severe correctness
and performance issues with MMAP that are not immediately apparent.
Such problems make it difficult, if not impossible, to use MMAP
correctly and efficiently in a modern DBMS. In fact, several popular
DBMSs initially used MMAP to support larger-than-memory databases
but soon encountered these hidden perils, forcing them to switch to
managing file I/O themselves after significant engineering costs.
In this way, MMAP and DBMSs are like coffee and spicy food: an
unfortunate combination that becomes obvious after the fact.
Since developers keep trying to use MMAP in new DBMSs, we wrote this
paper to provide a warning to others that MMAP is not a suitable
replacement for a traditional buffer pool. We discuss the main
shortcomings of MMAP in detail, and our experimental analysis
demonstrates clear performance limitations. Based on these findings,
we conclude with a prescription for when DBMS developers might
consider using MMAP for file I/O.
Recommended Music for this Paper:
Dr. Dre – High Powered (featuring RBX)
Citation
@inproceedings{crotty22-mmap💩,
  author = {Crotty, Andrew and Leis, Viktor and Pavlo, Andrew},
  title = {Are You Sure You Want to Use MMAP in Your Database Management System?},
  booktitle = {{CIDR} 2022, Conference on Innovative Data Systems Research},
  year = {2022},
}
Acknowledgments
This paper is the culmination of an unhealthy, years-long obsession with the idea of developers incorrectly using mmap in their DBMSs. The authors would like to thank everyone who contributed and provided helpful feedback: Chenyao Lou (PKU), David “Greasy” Andersen (CMU), Michael Kaminsky (BrdgAI), Thomas Neumann (TUM), Christian Dietrich (TUHH), Todd Lipcon (lipcon.org), and Sasha Fedorova (UBC).
This work was supported (in part) by the NSF (IIS-1846158, III-1423210, DGE-1252522), research grants from Google and Snowflake, and the Alfred P. Sloan Research Fellowship program.