Efficient Fine-grained OS Support for Huge Pages

HawkEye demonstrates fine-grained OS support for huge pages, not fine-grained huge pages. The focus of this paper is on when, where, and how to promote huge pages.

For dynamic size of huge pages, see: Diverse Contiguity of Memory Mapping.

Memory Bloat

There are two main memory issues:

  • Memory Safety: access to unallocated (or already freed) memory [MICRO 2019]
  • Memory Bloat: excessive memory usage

Memory leakage is one of the primary causes of memory bloat. Data fragmentation can be another cause, even when applications have no leakage issue.

To demonstrate this issue, the authors designed a three-phase workload:

  1. The client inserts 11 million key-value pairs of size (10B, 4KB) for an in-memory dataset of 45GB.
  2. The client deletes 80% randomly selected keys, leaving Redis with a sparsely populated address space.
  3. After a time gap, the client inserts 17K (10B, 2MB) key-value pairs so that the dataset again reaches 45GB.

When the dataset first reaches 45GB, the resident set size (RSS), i.e., physical memory usage, is ~45GB. When the dataset reaches 45GB for the second time, there should therefore be enough memory if Redis manages its allocations correctly (and of course it does). However, both Linux and Ingens run into out-of-memory errors. This is caused by data fragmentation: because both designs employ huge pages, the fragmented memory usage occupies even more physical memory than the live data itself.
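The effect can be reproduced with a toy simulation (the sizes and the counting are illustrative, not the paper's measurements): when live 4KB objects are scattered randomly across 2MB huge pages, freeing 80% of them almost never empties a whole huge page, so RSS stays near 100% while the live data shrinks to 20%.

```python
import random

PAGES_PER_HUGE = 512  # a 2MB huge page spans 512 base pages of 4KB

def rss_after_random_free(num_objects, free_fraction, seed=0):
    """Simulate RSS (in 4KB-page units) when each object occupies one
    base page and the OS backs memory with 2MB huge pages."""
    rng = random.Random(seed)
    live = set(range(num_objects))
    for obj in rng.sample(range(num_objects), int(num_objects * free_fraction)):
        live.discard(obj)
    # A huge page stays resident if any of its 512 base pages is live.
    resident_huge = {obj // PAGES_PER_HUGE for obj in live}
    return len(resident_huge) * PAGES_PER_HUGE, len(live)

rss, live = rss_after_random_free(1 << 20, 0.80)
print(f"live base pages: {live}, resident base pages: {rss}")
```

With 2^20 objects and 80% freed randomly, every 2MB region still holds ~100 live pages on average, so the simulated RSS remains at the full dataset size.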

Memory bloat for huge pages is an interesting finding.

Another well-known problem of huge pages is high allocation overhead. More specifically, zeroing a huge page is considerably more expensive than zeroing a regular page.

The fairness of allocating huge pages across multiple processes is another often-ignored problem.

The benefits of using huge pages include lower overall page-fault time and faster address translation.

Asynchronous page pre-zeroing

Asynchronous page pre-zeroing aims to remove the initialization delay from the allocation path. It zeroes huge pages in a background kernel thread and uses non-temporal store hints to avoid cache pollution, thus significantly reducing both cache contention and the double cache-miss problem.
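A minimal user-space sketch of the idea (HawkEye's actual implementation is a kernel thread using non-temporal stores; the class and names here are made up for illustration): a background thread keeps a pool of already-zeroed buffers, so the allocation path never pays the zeroing cost.

```python
import queue
import threading

HUGE_PAGE = 2 * 1024 * 1024  # 2MB
POOL_SIZE = 4

class PreZeroPool:
    """Background thread pre-zeroes 'huge pages' so allocation is cheap."""
    def __init__(self):
        self._dirty = queue.Queue()             # buffers awaiting zeroing
        self._zeroed = queue.Queue(POOL_SIZE)   # ready-to-hand-out buffers
        for _ in range(POOL_SIZE):
            self._dirty.put(bytearray(HUGE_PAGE))
        threading.Thread(target=self._zero_loop, daemon=True).start()

    def _zero_loop(self):
        while True:
            buf = self._dirty.get()
            buf[:] = bytes(HUGE_PAGE)  # in the kernel: non-temporal stores
            self._zeroed.put(buf)

    def alloc(self):
        return self._zeroed.get()  # already zero-filled, no zeroing delay

    def free(self, buf):
        self._dirty.put(buf)       # will be re-zeroed in the background

pool = PreZeroPool()
page = pool.alloc()
print(all(b == 0 for b in page))  # True
```

The allocation fast path is just a queue pop; the expensive memset happens off the critical path, which is exactly the trade the paper's background pre-zeroing thread makes.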

Although pre-zeroing does not necessarily improve performance with 4KB pages, it enables non-negligible performance improvements with huge pages.

Managing bloat vs. performance

To resolve memory bloat, HawkEye scans for zero-filled base pages and demotes huge pages containing many of them back to regular pages. HawkEye then maps the zero-filled base pages copy-on-write (CoW) to a shared zero page, reclaiming the physical memory.
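A rough sketch of this recovery policy (the threshold and function names are illustrative assumptions, not the paper's exact parameters): scan each 4KB sub-page of a huge page, and demote when enough of them are zero-filled.

```python
BASE = 4096          # 4KB base page
BASES_PER_HUGE = 512  # base pages per 2MB huge page

def zero_filled(page: bytes) -> bool:
    """True if every byte of the base page is zero."""
    return not any(page)

def should_demote(huge_page: bytes, threshold: float = 0.5) -> bool:
    """Demote a huge page to base pages when enough of its 4KB
    sub-pages are zero-filled; the zero-filled base pages can then be
    mapped CoW to a shared zero page, freeing the memory."""
    zeros = sum(
        zero_filled(huge_page[i * BASE:(i + 1) * BASE])
        for i in range(BASES_PER_HUGE)
    )
    return zeros / BASES_PER_HUGE >= threshold

mostly_zero = bytearray(BASES_PER_HUGE * BASE)
mostly_zero[0] = 1  # one dirty byte in the first base page
print(should_demote(bytes(mostly_zero)))  # True
```

Demotion trades translation performance for memory: only pages that are mostly zero-filled are worth breaking up, which is why a threshold rather than any-zero triggers it.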

Fine-grained huge page promotion

HawkEye promotes huge-page-sized regions that are more frequently accessed. It implements access-coverage-based promotion using a per-process data structure, where access coverage denotes the number of base pages in a region that are accessed within short sampling intervals. It searches for promotion candidates across multiple processes.
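The selection logic can be sketched as follows (a simplified assumption of the paper's per-process structure; the dictionary layout and names are made up): each huge-page-sized region carries a bitmap of which of its 512 base pages were recently touched, and promotion picks the region with the highest coverage across all processes.

```python
BASES_PER_HUGE = 512

def access_coverage(access_bitmap):
    """Number of base pages accessed in a huge-page-sized region."""
    return sum(access_bitmap)

def next_promotion(process_regions):
    """Pick the (pid, region) with the highest access coverage across
    all processes. process_regions maps pid -> {region_addr: bitmap
    of 512 bools, one per base page}."""
    best = max(
        ((access_coverage(bm), pid, addr)
         for pid, regions in process_regions.items()
         for addr, bm in regions.items()),
        default=None,
    )
    if best is None:
        return None
    cov, pid, addr = best
    return pid, addr, cov

regions = {
    1: {0x200000: [True] * 400 + [False] * 112},
    2: {0x400000: [True] * 500 + [False] * 12},
}
print(next_promotion(regions))  # pid 2's region wins with coverage 500
```

Ranking by coverage instead of per-process round-robin is what makes the promotion both fine-grained and cross-process fair: the region whose promotion eliminates the most base-page translations is promoted first, regardless of which process owns it.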