.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

The page table hierarchy in Linux currently has up to five levels::

        +-----+
        | PGD |
        +-----+
            |
            |   +-----+
            +-->| P4D |
                +-----+
                    |
                    |   +-----+
                    +-->| PUD |
                        +-----+
                            |
                            |   +-----+
                            +-->| PMD |
                                +-----+
                                    |
                                    |   +-----+
                                    +-->| PTE |
                                        +-----+

Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type,
  each mapping a single page of virtual memory to a single page of physical
  memory. The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being
  some architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top-level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy
  right above the *pte*, with `PTRS_PER_PMD` references to *pte* tables.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced
  after the other levels to handle 4-level page tables.
  It is potentially unused, or *folded*, as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. By then it was
  clear that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure
  indicating the directory level, and that we cannot go on with ad hoc names
  any more. This is only used on systems which actually have 5 levels of page
  tables; otherwise it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the main page
  table of the Linux kernel. The PGD for kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct`, which
  in turn is referenced by each `struct task_struct`.
  So tasks have a memory context in the form of a `struct mm_struct`, and
  this in turn has a `pgd_t *pgd` pointer to the corresponding page global
  directory.

To repeat: each level in the page table hierarchy is an *array of pointers*,
so the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, the
**p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
number of pointers on each level is architecture-defined::

        PMD --> +-----+           PTE
                | ptr |-------> +-----+
                | ptr |-        | ptr |-------> PAGE
                | ptr |  \      | ptr |
                | ptr |   \        ...
                | ... |    \
                | ptr |     \    PTE
                +-----+      +-> +-----+
                                 | ptr |-------> PAGE
                                 | ptr |
                                    ...

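To make the arithmetic concrete, here is a small sketch of how a virtual
address selects one pointer from each of these arrays. Everything prefixed
`EX_` is an assumption for illustration (a 64-bit architecture with 4 KB
pages and 512 pointers per level, as on x86-64); the kernel's real helpers
for this are `pgd_index()`, `pud_index()`, `pmd_index()` and `pte_index()`::

        /* Illustrative only: 9 bits of the address index each level. */
        #define EX_PAGE_SHIFT   12      /* 4 KB pages: bits 0-11 are the in-page offset */
        #define EX_PTRS_MASK    511     /* 512 pointers per table, 9 bits per level */

        /* index into the lowermost table, the pte array */
        static unsigned long ex_pte_index(unsigned long addr)
        {
                return (addr >> EX_PAGE_SHIFT) & EX_PTRS_MASK;
        }

        /* index into the pmd, one 9-bit step further up ... */
        static unsigned long ex_pmd_index(unsigned long addr)
        {
                return (addr >> (EX_PAGE_SHIFT + 9)) & EX_PTRS_MASK;
        }

        /* ... and likewise at +18 (pud) and +27 (pgd) bits */
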
Page Table Folding
==================

If the architecture does not use all the page table levels, they can be
*folded*, which means skipped, and all operations performed on page tables
will be compile-time augmented to just skip a level when accessing the next
lower level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of
the currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.

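To see what folding amounts to in practice, here is a sketch modelled on the
generic `include/asm-generic/pgtable-nop4d.h` header: on an architecture with
no real p4d level, the p4d helpers become pass-throughs, so generic
five-level code still compiles and runs with no runtime cost::

        /* The folded "p4d table" is just the pgd entry itself. */
        #define PTRS_PER_P4D    1

        static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
        {
                return (p4d_t *)pgd;    /* step down a level without moving */
        }
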
MMU, TLB, and Page Faults
=========================

The `Memory Management Unit (MMU)` is a hardware component that handles
virtual to physical address translations. It may use relatively small caches
in hardware called `Translation Lookaside Buffers (TLBs)` and `Page Walk
Caches` to speed up these translations.

When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether an existing translation is present in the TLB or in
the Page Walk Caches (on architectures that support them). If no translation
is found, the MMU performs a page table walk to determine the physical
address and create the mapping.

Each page of memory has associated permission and dirty bits. The dirty bit
for a page is set (i.e., turned on) when the page is written to, and
indicates that the page has been modified since it was loaded into memory.

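The kernel reads and updates these per-entry bits through
architecture-implemented helpers operating on `pte_t` values; a few examples
of the pattern (see the architecture headers for the full set)::

        pte_dirty(pte);         /* has the page been written to? */
        pte_young(pte);         /* has the page been accessed? */
        pte_write(pte);         /* is the mapping writable? */
        pte = pte_mkdirty(pte); /* return a copy with the dirty bit set */
        pte = pte_mkclean(pte); /* return a copy with the dirty bit cleared */
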
If nothing prevents it, eventually the physical memory can be accessed and
the requested operation on the physical frame is performed.

There are several reasons why the MMU can't find certain translations. It
could happen because the CPU is trying to access memory that the current task
is not permitted to, or because the data is not present in physical memory.

When these conditions happen, the MMU triggers page faults, which are types
of exceptions that signal the CPU to pause the current execution and run a
special function to handle the mentioned exceptions.

There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped
out to persistent storage (swap partition or file) and evicted from their
physical locations.

These techniques improve memory efficiency, reduce latency, and minimize
space occupation. This document won't go deeper into the details of "Lazy
Allocation" and "Copy-on-Write" because these subjects are out of scope as
they belong to Process Address Management.

Swapping differentiates itself from the other mentioned techniques because
it's undesirable: it is performed only as a means to free memory under heavy
pressure.

Swapping can't work for memory mapped by kernel logical addresses. These are
a subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is
determined with simple arithmetic on an offset. Accesses to logical addresses
are fast because they avoid the need for complex page table lookups, at the
expense of the frames not being evictable or pageable out.

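A minimal sketch of that "simple arithmetic", in the spirit of the kernel's
`__pa()`/`__va()` helpers for the direct map; the `EX_PAGE_OFFSET` value here
is only an assumed example (the real base is per-architecture and may be
randomized by KASLR)::

        #define EX_PAGE_OFFSET  0xffff888000000000UL    /* assumed direct-map base */

        static inline unsigned long ex_virt_to_phys(unsigned long vaddr)
        {
                return vaddr - EX_PAGE_OFFSET;  /* constant offset, no table walk */
        }

        static inline unsigned long ex_phys_to_virt(unsigned long paddr)
        {
                return paddr + EX_PAGE_OFFSET;
        }
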
If the kernel fails to make room for the data that must be present in the
physical frames, the kernel invokes the out-of-memory (OOM) killer to make
room by terminating lower priority processes until pressure reduces under a
safe threshold.

Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong
to its own address space, or could try to execute an instruction that wants
to write to a read-only location.

If the above-mentioned conditions happen in user-space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal
usually causes the termination of the thread and of the process it belongs
to.

This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults, creates tables and table entries,
checks if memory is present and, if not, requests to load data from
persistent storage or from other devices, and updates the MMU and its caches.

The first steps are architecture-dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.

Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out
the actual work of allocating the page tables.

The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are
not permitted to be accessed (at least from the current context). This
condition resolves to the kernel sending the above-mentioned SIGSEGV signal
to the process and leads to the consequences already explained.

`__handle_mm_fault()` carries out its work by calling several functions to
find the entry offsets of the upper layers of the page tables and allocate
the tables that it may need.

The functions that look for the offsets have names like `*_offset()`, where
the "*" stands for pgd, p4d, pud, pmd, or pte; instead, the functions to
allocate the corresponding tables, layer by layer, are called `*_alloc`,
using the above-mentioned convention to name them after the corresponding
types of tables in the hierarchy.

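Put together, here is a minimal sketch (not the kernel's actual fault path)
of an architecture-neutral walk down the offset helpers; locking and the
huge-page cases are omitted, and the caller must release the returned entry
with `pte_unmap()`::

        static pte_t *ex_walk_to_pte(struct mm_struct *mm, unsigned long addr)
        {
                pgd_t *pgd = pgd_offset(mm, addr);      /* index the top level */
                p4d_t *p4d;
                pud_t *pud;
                pmd_t *pmd;

                if (pgd_none(*pgd) || pgd_bad(*pgd))
                        return NULL;
                p4d = p4d_offset(pgd, addr);    /* a no-op where p4d is folded */
                if (p4d_none(*p4d) || p4d_bad(*p4d))
                        return NULL;
                pud = pud_offset(p4d, addr);
                if (pud_none(*pud) || pud_bad(*pud))
                        return NULL;
                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd) || pmd_bad(*pmd))
                        return NULL;
                return pte_offset_map(pmd, addr);       /* caller must pte_unmap() */
        }
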
The page table walk may end at one of the middle or upper layers (PMD, PUD).

Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these kinds of larger pages, higher-level page
entries can map them directly, with no need to use lower-level page entries
(PTE). Huge pages contain large contiguous physical regions that usually
span from 2MB to 1GB. They are respectively mapped by the PMD and PUD page
entries.

The huge pages bring with them several benefits like reduced TLB pressure,
reduced page table overhead, memory allocation efficiency, and performance
improvement for certain workloads. However, these benefits come with
trade-offs, like wasted memory and allocation challenges.

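In walk code this early end shows up as a test on the middle and upper
entries; a sketch, assuming a kernel that provides `pmd_leaf()` and
`pud_leaf()` (older code tests e.g. `pmd_trans_huge()` instead)::

        /* Sketch: does the walk stop above the PTE level? The caller
         * must hold the locks appropriate to its context. */
        static bool ex_is_huge_mapping(pud_t *pud, pmd_t *pmd)
        {
                if (pud_leaf(*pud))
                        return true;    /* PUD-mapped huge page (1 GB scale) */
                if (pmd_leaf(*pmd))
                        return true;    /* PMD-mapped huge page (2 MB scale) */
                return false;
        }
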
At the very end of the walk with allocations, if it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons and
the kind of fault it's handling.

The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.

To conclude this high-altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.

Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.
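As a usage illustration, a sketch of the common pairing; `ex_probe_user()`
and its parameters are made up, but `pagefault_disable()`,
`pagefault_enable()` and `__copy_from_user_inatomic()` are real interfaces
(this is essentially what `copy_from_user_nofault()` does)::

        /* Probe user memory from a context that must not trap into the
         * page fault handler: a faulting access fails fast instead. */
        static bool ex_probe_user(void *dst, const void __user *src, size_t size)
        {
                unsigned long left;

                pagefault_disable();
                left = __copy_from_user_inatomic(dst, src, size);
                pagefault_enable();

                return left == 0;       /* nonzero: not everything was copied */
        }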