On any modern system, applications don’t use physical memory directly. Instead, they use a virtual memory address model to make it easier to handle memory allocation and avoid the complexity of computing and mapping physical addresses into the application memory space. The virtual memory address model, however, requires the computer to translate virtual addresses into their corresponding physical memory addresses as soon as any application reads or writes to or from memory. This mapping is organized in the so-called page tables, a hierarchically-organized lookup table.

The TLB (Translation Lookaside Buffer) and Performance

Memory allocations on any common operating system are done by allocating single or multiple pages of available memory. These pages have a specific size, mostly depending on the operating system. On Linux, the default page size used by memory allocation are 4kB. So someone can imagine that allocating very large amounts of memory requires a high number of pages to be held in the page table, in order to map them to their corresponding physical addresses. Looking up virtual addresses requires multiple memory accesses, which are very costly when compared to CPU speed. Thus, modern CPUs maintain a so-called Translation Lookaside Buffer (TLB), which is basically a cache that speeds up translation between virtual and physical memory.

The size of a TLB is constrained to a few thousand entries, and especially on systems with high memory-specific memory workload, very soon becomes a bottleneck. If, for example, a database like PostgreSQL uses a shared buffer pool to implement Inter Process Communication (IPC), block layer access, and other functionality over it, this very quickly puts a high amount of pressure on the TLB, since the number of virtual address mappings is limited. Especially when using dozens of gigabytes for the shared buffer pool, such instances will suffer from performance hits because of this, depending on their workload.

The solution: Huge Pages

On modern operating systems like Linux this problem can be mitigated by using Huge Pages. Since page tables are organized hierarchically, they enable us to summarize allocations in much larger pages than the default. The size of huge pages are architecture-dependent, on x86 systems we can usually expect 2MB or 1GB sizes, IBM POWER allows 64kB, 16MB and 16 GB.
Lookups for specific virtual addresses of memory allocations are then much faster and hopefully more independent of the entries found in the TLB.

On x86, when configuring huge pages, the 2MB page size is the default. You can easily get your system current settings via /proc/meminfo:

% cat/proc/meminfo| grep-i HugePageAnonHugePages: 161792 kBShmemHugePages: 0 kBFileHugePages: 0 kBHugePages_Total: 0HugePages_Free: 0HugePages_Rsvd: 0HugePages_Surp: 0Hugepagesize: 2048 kB

HugePages_Total says no huge pages are configured at the moment, accordingly HugePages_Rsvd,HugePages_Free and HugePages_Surp are equal to zero. Hugepagesize shows the currently configured default size for huge pages in the kernel. When configuring huge pages later, we will revisit the value of these settings.

PostgreSQL usage of huge pages

PostgreSQL uses huge page allocations only for its shared buffer pool. Dynamic memory allocation for e.g. work_memmaintenance_work_memautovacuum_work_mem doesn’t use them. Also, dynamic shared memory segments used to transfer data and communication between parallel query workers don’t request them automatically, with one exception, which I’ll describe in the next paragraph about configuring PostgreSQL for huge pages on Linux. It’s also necessary to have shared buffers allocated in so-called mmap'ed segments, which are controlled by the shared_memory_type parameter. This is the default setting on Linux. System V shared memory is only used by PostgreSQL to initialize a very small shared memory area; the buffer pool itself is allocated via mmap.

Configure your system

Huge Pages aren’t configured on recent Linux Distributions out of the box. There are usually two ways to teach your Linux system to provide them to your applications:

– Transparent Huge Pages (THP)
– Explicitely configured pools of huge pages for applications

The first possibility via THP isn’t recommended for PostgreSQL on Linux, since it doesn’t allocate huge pages directly but provides them to applications as they need them. THP are swappable and considered a slower option compared to explicit configured huge pages. Other operating systems like FreeBSD provide better support in this case.

An easy example

So we are going to focus on explicit huge pages support for PostgreSQL. In the following sections, examples are all done on a Rocky Linux 9 system, but should apply to all other Linux systems accordingly. To configure explicit huge pages, we need to know the computed size of shared memory and additional shared allocations in advance. As of PostgreSQL 15, this is very easy (but unfortunately requires a PostgreSQL instance to be down). With the PostgreSQL 15 Community RPM installation from it looks as follows:

% su- postgres% /usr/pgsql-15/bin/postgres--shared-buffers=20GB -D $PGDATA -C shared_memory_size_in_huge_pages10475

Again, the above code doesn’t start PostgreSQL, but calculates the value of shared_memory_size_in_huge_pages and prints the result to the terminal. This also gives us some flexibility to calculate this value according to specific configuration settings we can provide to the postgresbinary as command line options. The example above specifies --shared-buffers as a command line option, overriding any other setting in the configuration files (like postgresql.conf).

If PostgreSQL cannot allocate the requested amount of shared buffers with huge pages, it falls back to normal page size allocations. This is controlled by the configuration parameter huge_pages, which can have the following values:

  • try: Try to allocate huge pages, fall back to normal page allocation if not possible (the default setting currently)
  • off: Never try to request huge page allocation
  • on: Force shared buffer pool allocations with huge pages

If we have finally found our settings to start with, we need to reserve the required pool of huge pages by the kernel. As user root and using the default huge page size, this can be done during runtime by the following command:

% echo10475 > /proc/sys/vm/nr_hugepages

The memory settings then can be used to confirm what we have in mind:

grep-i hugepage /proc/meminfoAnonHugePages:    124928 kBShmemHugePages:        0 kBFileHugePages:         0 kBHugePages_Total:   10475HugePages_Free:    10475HugePages_Rsvd:        0HugePages_Surp:        0Hugepagesize:       2048 kB

Pools for specific huge pages must be reserved as a continuous memory area, so the kernel needs a memory area which is available for the reservation. If that’s not possible, far less memory might be available to the pool. In this case, the HugePages_Total number is lower than we requested. Experimenting with values that are too large can also drive a system into an out-of-memory situation, rendering it unavailable in the worst case. So I recommend carefully checking memory usage if the system hosts other applications or users.

Setting the number of pages like this, though, is not persistent across system reboots. To survive them, the easiest way is to put a configuration into /etc/sysctl.d/ with the right settings and reboot:

% cat/etc/sysctl.d/01-nr-hugepages.confvm.nr_hugepages=10475

PostgreSQL is then configured to request huge pages. In this example, we adjust the postgresql.conf configuration file to force PostgreSQL to perform allocations accordingly:

shared_buffers = 20GBhuge_pages = onhuge_page_size = 0 #use default kernel setting

PostgreSQL must be started (or restarted) and should allocate its shared buffer pool via the default huge page size. /proc/meminfo changes its figures. Now it looks like it does here in the following output:

grep-i hugepage /proc/meminfoAnonHugePages:    176128 kBShmemHugePages:        0 kBFileHugePages:         0 kBHugePages_Total:   10475HugePages_Free:    10264HugePages_Rsvd:    10264HugePages_Surp:        0Hugepagesize:       2048 kB

Note the HugePages_Rsvd figure: It reflects the number of pages reserved due to the allocations requested. You can use the pmap tool with the PID of the postmaster process to get the allocated memory mapped buffer pool segment:

pmap  $(head-n1  /var/lib/pgsql/15/data/ | grephuge00007f11a5400000 21452800K rw-s- anon_hugepage (deleted)

This output shows the address where the memory mapped buffer pool area is mapped, its size and permissions. The (deleted) indicates this memory mapped allocation is already unlinked, which is a common trick to make it disappear as soon as the process owning it exits.

Dynamically allocated shared memory and huge pages

In the section on how PostgreSQL uses huge pages, we owe an explanation to the exception mentioned there. In general, dynamic shared memory allocations are done on purpose, for example when parallel query workers process a parallelized query. These dynamically allocated memory areas aren’t done by mmap by default, instead PostgreSQL uses the posix API via shm_open() on Linux by default. In this case, allocations don’t use huge pages. However, when having parallel queries, memory pressure can arise there as well, especially when large amount of data needs to be processed.

PostgreSQL starting with version 14 provides the configuration parameter min_dynamic_shared_memory. When configured, the specified value defines the additional space in the shared memory created at server startup that should be used for parallel query workers. Since it is allocated along with the buffer pool, it also uses huge pages for allocation. We can extend the example above and re-calculate the setting with 2GB of additional shared memory for use for parallel queries:

/usr/pgsql-15/bin/postgres-D $PGDATA --min-dynamic-shared-memory=2GB --shared-buffers=20GB -C shared_memory_size_in_huge_pages11499

Again, the PostgreSQL instance needs to be stopped to execute the command successfully. Compared to the previous example, the number for huge pages to be reserved has now increased by an additional 2GB.

What about PostgreSQL versions older than 15?

These versions are a little bit more difficult to handle. Particularly calculations on how many huge pages need to be reserved are much trickier. The best solution so far is just trying to start PostgreSQL manually with huge_pages=on and some debugging parameters. In this case PostgreSQL prints the following error message if it is not able to allocate the requested shared memory size:

FATAL:  could not map anonymous shared memory: Cannot allocate memoryHINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 21965570048 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.LOG:  database system is shut down

The interesting thing is the HINT part of the message, telling us that PostgreSQL wants to allocate 21965570048 bytes of shared memory. This value can be used to calculate the number of pages ourselves. The problem with this number is that this is more than we requested when specifying shared buffers to 20GB. This is because PostgreSQL also needs additional shared memory for other internal processes, like wal_buffers or max_locks_per_transactions. Extension might require additional shared
memory, too. PostgreSQL internally uses a formula that corresponds approximately to the following

((shm size + (BLKSZ - (shm size % BLKSZ))) / (huge page size * 1024)) + 1 = num huge pages

Standard PostgreSQL installations are always compiled with BLKSZ equal to 8192, specifying the requested huge page size gives then the following result:

((21965570048 + (8192 - (21965570048 % 8192))) / (2048 * 1024)) + 1 = 10475

Configuration with gigantic huge pages

Things get even more complicated if you want PostgreSQL to use a page size of 1GB. Since this is not the default on Linux, configuration requires more steps and more careful thought in such a case. This is left to the next blog post, where I’ll show you how to deal with such a requirement.

Find out more about allocating memory in Hans-Jürgen Schönig’s blog, fixing track activity query size in PostgreSQL conf.