.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

The page table hierarchy in Linux currently has up to five levels::

        +-----+
        | PGD |
        +-----+
            |
            |   +-----+
            +-->| P4D |
                +-----+
                    |
                    |   +-----+
                    +-->| PUD |
                        +-----+
                            |
                            |   +-----+
                            +-->| PMD |
                                +-----+
                                    |
                                    |   +-----+
                                    +-->| PTE |
                                        +-----+

Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type,
  each mapping a single page of virtual memory to a single page of physical
  memory. The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being
  some architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top-level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy
  right above the *pte*, with `PTRS_PER_PMD` references to *pte* tables.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced
  after the other levels to handle 4-level page tables.
  It is potentially unused, or *folded*, as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. By then it was
  clear that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure
  indicating the directory level, and that we cannot go on with ad hoc names
  any more. This is only used on systems which actually have 5 levels of page
  tables; otherwise it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the main page
  table of the Linux kernel. The PGD for kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct`, which
  in turn is referenced by each `struct task_struct`.
  So tasks have a memory context in the form of a `struct mm_struct`, and
  this in turn has a `pgd_t *pgd` pointer to the corresponding page global
  directory.

To repeat: each level in the page table hierarchy is an *array of pointers*,
so the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, the
**p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
number of pointers on each level is architecture-defined::

        PMD --> +-----+           PTE
                | ptr |-------> +-----+
                | ptr |-        | ptr |-------> PAGE
                | ptr |  \      | ptr |
                | ptr |   \        ...
                | ... |    \
                | ptr |     \    PTE
                +-----+      +-> +-----+
                                 | ptr |-------> PAGE
                                 | ptr |
                                    ...

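To make the arithmetic concrete, here is a small sketch of how a virtual
address selects one pointer from each of these arrays. Everything prefixed
`EX_` is an assumption for illustration (a 64-bit architecture with 4 KB
pages and 512 pointers per level, as on x86-64); the kernel's real helpers
for this are `pgd_index()`, `pud_index()`, `pmd_index()` and `pte_index()`::

        /* Illustrative only: 9 bits of the address index each level. */
        #define EX_PAGE_SHIFT   12      /* 4 KB pages: bits 0-11 are the in-page offset */
        #define EX_PTRS_MASK    511     /* 512 pointers per table, 9 bits per level */

        /* index into the lowermost table, the pte array */
        static unsigned long ex_pte_index(unsigned long addr)
        {
                return (addr >> EX_PAGE_SHIFT) & EX_PTRS_MASK;
        }

        /* index into the pmd, one 9-bit step further up ... */
        static unsigned long ex_pmd_index(unsigned long addr)
        {
                return (addr >> (EX_PAGE_SHIFT + 9)) & EX_PTRS_MASK;
        }

        /* ... and likewise at +18 (pud) and +27 (pgd) bits */
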
Page Table Folding
==================

If the architecture does not use all the page table levels, they can be
*folded*, which means skipped, and all operations performed on page tables
will be compile-time augmented to just skip a level when accessing the next
lower level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of
the currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.

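To see what folding amounts to in practice, here is a sketch modelled on the
generic `include/asm-generic/pgtable-nop4d.h` header: on an architecture with
no real p4d level, the p4d helpers become pass-throughs, so generic
five-level code still compiles and runs with no runtime cost::

        /* The folded "p4d table" is just the pgd entry itself. */
        #define PTRS_PER_P4D    1

        static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
        {
                return (p4d_t *)pgd;    /* step down a level without moving */
        }
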
MMU, TLB, and Page Faults
=========================

The `Memory Management Unit (MMU)` is a hardware component that handles
virtual to physical address translations. It may use relatively small caches
in hardware called `Translation Lookaside Buffers (TLBs)` and `Page Walk
Caches` to speed up these translations.

When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether an existing translation is present in the TLB or in
the Page Walk Caches (on architectures that support them). If no translation
is found, the MMU performs a page table walk to determine the physical
address and create the mapping.

Each page of memory has associated permission and dirty bits. The dirty bit
for a page is set (i.e., turned on) when the page is written to, and
indicates that the page has been modified since it was loaded into memory.

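The kernel reads and updates these per-entry bits through
architecture-implemented helpers operating on `pte_t` values; a few examples
of the pattern (see the architecture headers for the full set)::

        pte_dirty(pte);         /* has the page been written to? */
        pte_young(pte);         /* has the page been accessed? */
        pte_write(pte);         /* is the mapping writable? */
        pte = pte_mkdirty(pte); /* return a copy with the dirty bit set */
        pte = pte_mkclean(pte); /* return a copy with the dirty bit cleared */
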
If nothing prevents it, eventually the physical memory can be accessed and
the requested operation on the physical frame is performed.

There are several reasons why the MMU can't find certain translations. It
could happen because the CPU is trying to access memory that the current task
is not permitted to, or because the data is not present in physical memory.

When these conditions happen, the MMU triggers page faults, which are types
of exceptions that signal the CPU to pause the current execution and run a
special function to handle the mentioned exceptions.

There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped
out to persistent storage (swap partition or file) and evicted from their
physical locations.

These techniques improve memory efficiency, reduce latency, and minimize
space occupation. This document won't go deeper into the details of "Lazy
Allocation" and "Copy-on-Write" because these subjects are out of scope as
they belong to Process Address Management.

Swapping differentiates itself from the other mentioned techniques because
it's undesirable: it is performed only as a means to free memory under heavy
pressure.

Swapping can't work for memory mapped by kernel logical addresses. These are
a subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is
determined with simple arithmetic on an offset. Accesses to logical addresses
are fast because they avoid the need for complex page table lookups, at the
expense of the frames not being evictable or pageable out.

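A minimal sketch of that "simple arithmetic", in the spirit of the kernel's
`__pa()`/`__va()` helpers for the direct map; the `EX_PAGE_OFFSET` value here
is only an assumed example (the real base is per-architecture and may be
randomized by KASLR)::

        #define EX_PAGE_OFFSET  0xffff888000000000UL    /* assumed direct-map base */

        static inline unsigned long ex_virt_to_phys(unsigned long vaddr)
        {
                return vaddr - EX_PAGE_OFFSET;  /* constant offset, no table walk */
        }

        static inline unsigned long ex_phys_to_virt(unsigned long paddr)
        {
                return paddr + EX_PAGE_OFFSET;
        }
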
If the kernel fails to make room for the data that must be present in the
physical frames, the kernel invokes the out-of-memory (OOM) killer to make
room by terminating lower priority processes until pressure reduces under a
safe threshold.

Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong
to its own address space, or could try to execute an instruction that wants
to write to a read-only location.

If the above-mentioned conditions happen in user-space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal
usually causes the termination of the thread and of the process it belongs
to.

This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults, creates tables and table entries,
checks if memory is present and, if not, requests to load data from
persistent storage or from other devices, and updates the MMU and its caches.

The first steps are architecture-dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.

Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out
the actual work of allocating the page tables.

The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are
not permitted to be accessed (at least from the current context). This
condition resolves to the kernel sending the above-mentioned SIGSEGV signal
to the process and leads to the consequences already explained.

`__handle_mm_fault()` carries out its work by calling several functions to
find the entry offsets of the upper layers of the page tables and allocate
the tables that it may need.

The functions that look for the offsets have names like `*_offset()`, where
the "*" stands for pgd, p4d, pud, pmd, or pte; instead, the functions to
allocate the corresponding tables, layer by layer, are called `*_alloc`,
using the above-mentioned convention to name them after the corresponding
types of tables in the hierarchy.

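Put together, here is a minimal sketch (not the kernel's actual fault path)
of an architecture-neutral walk down the offset helpers; locking and the
huge-page cases are omitted, and the caller must release the returned entry
with `pte_unmap()`::

        static pte_t *ex_walk_to_pte(struct mm_struct *mm, unsigned long addr)
        {
                pgd_t *pgd = pgd_offset(mm, addr);      /* index the top level */
                p4d_t *p4d;
                pud_t *pud;
                pmd_t *pmd;

                if (pgd_none(*pgd) || pgd_bad(*pgd))
                        return NULL;
                p4d = p4d_offset(pgd, addr);    /* a no-op where p4d is folded */
                if (p4d_none(*p4d) || p4d_bad(*p4d))
                        return NULL;
                pud = pud_offset(p4d, addr);
                if (pud_none(*pud) || pud_bad(*pud))
                        return NULL;
                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd) || pmd_bad(*pmd))
                        return NULL;
                return pte_offset_map(pmd, addr);       /* caller must pte_unmap() */
        }
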
The page table walk may end at one of the middle or upper layers (PMD, PUD).

Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these kinds of larger pages, higher-level page
entries can map them directly, with no need to use lower-level page entries
(PTE). Huge pages contain large contiguous physical regions that usually
span from 2MB to 1GB. They are respectively mapped by the PMD and PUD page
entries.

The huge pages bring with them several benefits like reduced TLB pressure,
reduced page table overhead, memory allocation efficiency, and performance
improvement for certain workloads. However, these benefits come with
trade-offs, like wasted memory and allocation challenges.

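In walk code this early end shows up as a test on the middle and upper
entries; a sketch, assuming a kernel that provides `pmd_leaf()` and
`pud_leaf()` (older code tests e.g. `pmd_trans_huge()` instead)::

        /* Sketch: does the walk stop above the PTE level? The caller
         * must hold the locks appropriate to its context. */
        static bool ex_is_huge_mapping(pud_t *pud, pmd_t *pmd)
        {
                if (pud_leaf(*pud))
                        return true;    /* PUD-mapped huge page (1 GB scale) */
                if (pmd_leaf(*pmd))
                        return true;    /* PMD-mapped huge page (2 MB scale) */
                return false;
        }
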
At the very end of the walk with allocations, if it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons and
the kind of fault it's handling.

The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.

To conclude this high-altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.

Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.
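As a usage illustration, a sketch of the common pairing; `ex_probe_user()`
and its parameters are made up, but `pagefault_disable()`,
`pagefault_enable()` and `__copy_from_user_inatomic()` are real interfaces
(this is essentially what `copy_from_user_nofault()` does)::

        /* Probe user memory from a context that must not trap into the
         * page fault handler: a faulting access fails fast instead. */
        static bool ex_probe_user(void *dst, const void __user *src, size_t size)
        {
                unsigned long left;

                pagefault_disable();
                left = __copy_from_user_inatomic(dst, src, size);
                pagefault_enable();

                return left == 0;       /* nonzero: not everything was copied */
        }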