sphinx.addnodesdocument)}( rawsourcechildren]( translations LanguagesNode)}(hhh](h pending_xref)}(hhh]docutils.nodesTextChinese (Simplified)}parenthsba attributes}(ids]classes]names]dupnames]backrefs] refdomainstdreftypedoc reftarget#/translations/zh_CN/mm/multigen_lrumodnameN classnameN refexplicitutagnamehhh ubh)}(hhh]hChinese (Traditional)}hh2sbah}(h]h ]h"]h$]h&] refdomainh)reftypeh+ reftarget#/translations/zh_TW/mm/multigen_lrumodnameN classnameN refexplicituh1hhh ubh)}(hhh]hItalian}hhFsbah}(h]h ]h"]h$]h&] refdomainh)reftypeh+ reftarget#/translations/it_IT/mm/multigen_lrumodnameN classnameN refexplicituh1hhh ubh)}(hhh]hJapanese}hhZsbah}(h]h ]h"]h$]h&] refdomainh)reftypeh+ reftarget#/translations/ja_JP/mm/multigen_lrumodnameN classnameN refexplicituh1hhh ubh)}(hhh]hKorean}hhnsbah}(h]h ]h"]h$]h&] refdomainh)reftypeh+ reftarget#/translations/ko_KR/mm/multigen_lrumodnameN classnameN refexplicituh1hhh ubh)}(hhh]hSpanish}hhsbah}(h]h ]h"]h$]h&] refdomainh)reftypeh+ reftarget#/translations/sp_SP/mm/multigen_lrumodnameN classnameN refexplicituh1hhh ubeh}(h]h ]h"]h$]h&]current_languageEnglishuh1h hh _documenthsourceNlineNubhcomment)}(h SPDX-License-Identifier: GPL-2.0h]h SPDX-License-Identifier: GPL-2.0}hhsbah}(h]h ]h"]h$]h&] xml:spacepreserveuh1hhhhhh=/var/lib/git/docbuild/linux/Documentation/mm/multigen_lru.rsthKubhsection)}(hhh](htitle)}(h Multi-Gen LRUh]h Multi-Gen LRU}(hhhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhhhhhKubh paragraph)}(hXThe multi-gen LRU is an alternative LRU implementation that optimizes page reclaim and improves performance under memory pressure. Page reclaim decides the kernel's caching policy and ability to overcommit memory. It directly impacts the kswapd CPU usage and RAM efficiency.h]hXThe multi-gen LRU is an alternative LRU implementation that optimizes page reclaim and improves performance under memory pressure. Page reclaim decides the kernel’s caching policy and ability to overcommit memory. It directly impacts the kswapd CPU usage and RAM efficiency.}(hhhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhhhhubh)}(hhh](h)}(hDesign overviewh]hDesign overview}(hhhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhhhhhK ubh)}(hhh](h)}(h Objectivesh]h Objectives}(hhhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhhhhhKubh)}(hThe design objectives are:h]hThe design objectives are:}(hhhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhhhhubh bullet_list)}(hhh](h list_item)}(h%Good representation of access recencyh]h)}(hjh]h%Good representation of access recency}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjubah}(h]h ]h"]h$]h&]uh1jhj hhhhhNubj)}(h#Try to profit from spatial localityh]h)}(hj)h]h#Try to profit from spatial locality}(hj+hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj'ubah}(h]h ]h"]h$]h&]uh1jhj hhhhhNubj)}(h"Fast paths to make obvious choicesh]h)}(hj@h]h"Fast paths to make obvious choices}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj>ubah}(h]h ]h"]h$]h&]uh1jhj hhhhhNubj)}(h"Simple self-correcting heuristics h]h)}(h!Simple self-correcting heuristicsh]h!Simple self-correcting heuristics}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjUubah}(h]h ]h"]h$]h&]uh1jhj hhhhhNubeh}(h]h ]h"]h$]h&]bullet*uh1j hhhKhhhhubh)}(hXThe representation of access recency is at the core of all LRU implementations. In the multi-gen LRU, each generation represents a group of pages with similar access recency. Generations establish a (time-based) common frame of reference and therefore help make better choices, e.g., between different memcgs on a computer or different computers in a data center (for job scheduling).h]hXThe representation of access recency is at the core of all LRU implementations. In the multi-gen LRU, each generation represents a group of pages with similar access recency. Generations establish a (time-based) common frame of reference and therefore help make better choices, e.g., between different memcgs on a computer or different computers in a data center (for job scheduling).}(hjuhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhhhhubh)}(hXjExploiting spatial locality improves efficiency when gathering the accessed bit. A rmap walk targets a single page and does not try to profit from discovering a young PTE. A page table walk can sweep all the young PTEs in an address space, but the address space can be too sparse to make a profit. The key is to optimize both methods and use them in combination.h]hXjExploiting spatial locality improves efficiency when gathering the accessed bit. A rmap walk targets a single page and does not try to profit from discovering a young PTE. A page table walk can sweep all the young PTEs in an address space, but the address space can be too sparse to make a profit. The key is to optimize both methods and use them in combination.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhhhhubh)}(hXFast paths reduce code complexity and runtime overhead. Unmapped pages do not require TLB flushes; clean pages do not require writeback. These facts are only helpful when other conditions, e.g., access recency, are similar. With generations as a common frame of reference, additional factors stand out. But obvious choices might not be good choices; thus self-correction is necessary.h]hXFast paths reduce code complexity and runtime overhead. Unmapped pages do not require TLB flushes; clean pages do not require writeback. These facts are only helpful when other conditions, e.g., access recency, are similar. With generations as a common frame of reference, additional factors stand out. But obvious choices might not be good choices; thus self-correction is necessary.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK$hhhhubh)}(hXyThe benefits of simple self-correcting heuristics are self-evident. Again, with generations as a common frame of reference, this becomes attainable. Specifically, pages in the same generation can be categorized based on additional factors, and a feedback loop can statistically compare the refault percentages across those categories and infer which of them are better choices.h]hXyThe benefits of simple self-correcting heuristics are self-evident. Again, with generations as a common frame of reference, this becomes attainable. Specifically, pages in the same generation can be categorized based on additional factors, and a feedback loop can statistically compare the refault percentages across those categories and infer which of them are better choices.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK+hhhhubeh}(h] objectivesah ]h"] objectivesah$]h&]uh1hhhhhhhhKubh)}(hhh](h)}(h Assumptionsh]h Assumptions}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjhhhhhK3ubh)}(hThe protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels:h]hThe protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels:}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK4hjhhubj )}(hhh](j)}(hAccesses through page tablesh]h)}(hjh]hAccesses through page tables}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK7hjubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubj)}(h"Accesses through file descriptors h]h)}(h!Accesses through file descriptorsh]h!Accesses through file descriptors}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK8hjubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubeh}(h]h ]h"]h$]h&]jsjtuh1j hhhK7hjhhubh)}(hCThe protection of the former channel is by design stronger because:h]hCThe protection of the former channel is by design stronger because:}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK:hjhhubhenumerated_list)}(hhh](j)}(hThe uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.h]h)}(hThe uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.h]hThe uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.}(hj#hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj7ubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubj)}(hXThe penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads. h]h)}(hXThe penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads.h]hXThe penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhK@hjOubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubeh}(h]h ]h"]h$]h&]enumtypearabicprefixhsuffix.uh1jhjhhhhhKmax_seq`` for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in ``lrugen->min_seq[]`` separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.h](h?Evictable pages are divided into multiple generations for each }(hj hhhNhNubj)}(h ``lruvec``h]hlruvec}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj ubh.. The youngest generation number is stored in }(hj hhhNhNubj)}(h``lrugen->max_seq``h]hlrugen->max_seq}(hj'hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj ubhp for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in }(hj hhhNhNubj)}(h``lrugen->min_seq[]``h]hlrugen->min_seq[]}(hj9hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj ubh separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.}(hj hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKRhjhhubh)}(hXGeneration numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` bits in order to fit into the gen counter in ``folio->flags``. Each truncated generation number is an index to ``lrugen->folios[]``. The sliding window technique is used to track at least ``MIN_NR_GENS`` and at most ``MAX_NR_GENS`` generations. The gen counter stores a value within ``[1, MAX_NR_GENS]`` while a page is on one of ``lrugen->folios[]``; otherwise it stores zero.h](h&Generation numbers are truncated into }(hjQhhhNhNubj)}(h``order_base_2(MAX_NR_GENS+1)``h]horder_base_2(MAX_NR_GENS+1)}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh. bits in order to fit into the gen counter in }(hjQhhhNhNubj)}(h``folio->flags``h]h folio->flags}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh2. Each truncated generation number is an index to }(hjQhhhNhNubj)}(h``lrugen->folios[]``h]hlrugen->folios[]}(hj}hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh9. The sliding window technique is used to track at least }(hjQhhhNhNubj)}(h``MIN_NR_GENS``h]h MIN_NR_GENS}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh and at most }(hjQhhhNhNubj)}(h``MAX_NR_GENS``h]h MAX_NR_GENS}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh4 generations. The gen counter stores a value within }(hjQhhhNhNubj)}(h``[1, MAX_NR_GENS]``h]h[1, MAX_NR_GENS]}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh while a page is on one of }(hjQhhhNhNubj)}(h``lrugen->folios[]``h]hlrugen->folios[]}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh; otherwise it stores zero.}(hjQhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKZhjhhubh)}(hXEach generation is divided into multiple tiers. A page accessed ``N`` times through file descriptors is in tier ``order_base_2(N)``. Unlike generations, tiers do not have dedicated ``lrugen->folios[]``. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves atomic operations on ``folio->flags`` and therefore has a negligible cost. A feedback loop modeled after the PID controller monitors refaults over all the tiers from anon and file types and decides which tiers from which types to evict or protect. The desired effect is to balance refault percentages between anon and file types proportional to the swappiness level.h](h@Each generation is divided into multiple tiers. A page accessed }(hjhhhNhNubj)}(h``N``h]hN}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh+ times through file descriptors is in tier }(hjhhhNhNubj)}(h``order_base_2(N)``h]horder_base_2(N)}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh2. Unlike generations, tiers do not have dedicated }(hjhhhNhNubj)}(h``lrugen->folios[]``h]hlrugen->folios[]}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves atomic operations on }(hjhhhNhNubj)}(h``folio->flags``h]h folio->flags}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhXI and therefore has a negligible cost. A feedback loop modeled after the PID controller monitors refaults over all the tiers from anon and file types and decides which tiers from which types to evict or protect. The desired effect is to balance refault percentages between anon and file types proportional to the swappiness level.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKbhjhhubh)}(hThere are two conceptually independent procedures: the aging and the eviction. They form a closed-loop system, i.e., the page reclaim.h]hThere are two conceptually independent procedures: the aging and the eviction. They form a closed-loop system, i.e., the page reclaim.}(hj3hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKmhjhhubh)}(hhh](h)}(hAgingh]hAging}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhKqubh)}(hXQThe aging produces young generations. Given an ``lruvec``, it increments ``max_seq`` when ``max_seq-min_seq+1`` approaches ``MIN_NR_GENS``. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments ``max_seq``. The aging uses page table walks and rmap walks to find young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()`` with each ``mm_struct`` on this list to scan PTEs, and after each iteration, it increments ``max_seq``. For the latter, when the eviction walks the rmap and finds a young PTE, the aging scans the adjacent PTEs. For both, on finding a young PTE, the aging clears the accessed bit and updates the gen counter of the page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.h](h/The aging produces young generations. Given an }(hjRhhhNhNubj)}(h ``lruvec``h]hlruvec}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh, it increments }(hjRhhhNhNubj)}(h ``max_seq``h]hmax_seq}(hjlhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh when }(hjRhhhNhNubj)}(h``max_seq-min_seq+1``h]hmax_seq-min_seq+1}(hj~hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh approaches }(hjRhhhNhNubj)}(h``MIN_NR_GENS``h]h MIN_NR_GENS}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments }(hjRhhhNhNubj)}(h ``max_seq``h]hmax_seq}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubha. The aging uses page table walks and rmap walks to find young PTEs. For the former, it iterates }(hjRhhhNhNubj)}(h``lruvec_memcg()->mm_list``h]hlruvec_memcg()->mm_list}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh and calls }(hjRhhhNhNubj)}(h``walk_page_range()``h]hwalk_page_range()}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh with each }(hjRhhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubhD on this list to scan PTEs, and after each iteration, it increments }(hjRhhhNhNubj)}(h ``max_seq``h]hmax_seq}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh. For the latter, when the eviction walks the rmap and finds a young PTE, the aging scans the adjacent PTEs. For both, on finding a young PTE, the aging clears the accessed bit and updates the gen counter of the page mapped by this PTE to }(hjRhhhNhNubj)}(h``(max_seq%MAX_NR_GENS)+1``h]h(max_seq%MAX_NR_GENS)+1}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubh.}(hjRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKrhjAhhubeh}(h]agingah ]h"]agingah$]h&]uh1hhjhhhhhKqubh)}(hhh](h)}(hEvictionh]hEviction}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjhhhhhKubh)}(hXThe eviction consumes old generations. Given an ``lruvec``, it increments ``min_seq`` when ``lrugen->folios[]`` indexed by ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to evict from, it first compares ``min_seq[]`` to select the older type. If both types are equally old, it selects the one whose first tier has a lower refault percentage. The first tier contains single-use unmapped clean pages, which are the best bet. The eviction sorts a page according to its gen counter if the aging has found this page accessed through page tables and updated its gen counter. It also moves a page to the next generation, i.e., ``min_seq+1``, if this page was accessed multiple times through file descriptors and the feedback loop has detected outlying refaults from the tier this page is in. To this end, the feedback loop uses the first tier as the baseline, for the reason stated earlier.h](h0The eviction consumes old generations. Given an }(hj-hhhNhNubj)}(h ``lruvec``h]hlruvec}(hj5hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubh, it increments }(hj-hhhNhNubj)}(h ``min_seq``h]hmin_seq}(hjGhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubh when }(hj-hhhNhNubj)}(h``lrugen->folios[]``h]hlrugen->folios[]}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubh indexed by }(hj-hhhNhNubj)}(h``min_seq%MAX_NR_GENS``h]hmin_seq%MAX_NR_GENS}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubhM becomes empty. To select a type and a tier to evict from, it first compares }(hj-hhhNhNubj)}(h ``min_seq[]``h]h min_seq[]}(hj}hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubhX to select the older type. If both types are equally old, it selects the one whose first tier has a lower refault percentage. The first tier contains single-use unmapped clean pages, which are the best bet. The eviction sorts a page according to its gen counter if the aging has found this page accessed through page tables and updated its gen counter. It also moves a page to the next generation, i.e., }(hj-hhhNhNubj)}(h ``min_seq+1``h]h min_seq+1}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj-ubh, if this page was accessed multiple times through file descriptors and the feedback loop has detected outlying refaults from the tier this page is in. To this end, the feedback loop uses the first tier as the baseline, for the reason stated earlier.}(hj-hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjhhubeh}(h]evictionah ]h"]evictionah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(hWorking set protectionh]hWorking set protection}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjhhhhhKubh)}(hXqEach generation is timestamped at birth. If ``lru_gen_min_ttl`` is set, an ``lruvec`` is protected from the eviction when its oldest generation was born within ``lru_gen_min_ttl`` milliseconds. In other words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory.h](h,Each generation is timestamped at birth. If }(hjhhhNhNubj)}(h``lru_gen_min_ttl``h]hlru_gen_min_ttl}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh is set, an }(hjhhhNhNubj)}(h ``lruvec``h]hlruvec}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhK is protected from the eviction when its oldest generation was born within }(hjhhhNhNubj)}(h``lru_gen_min_ttl``h]hlru_gen_min_ttl}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh> milliseconds. In other words, it prevents the working set of }(hjhhhNhNubj)}(h``lru_gen_min_ttl``h]hlru_gen_min_ttl}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhm milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(h6This time-based approach has the following advantages:h]h6This time-based approach has the following advantages:}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjhhubj)}(hhh](j)}(hRIt is easier to configure because it is agnostic to applications and memory sizes.h]h)}(hRIt is easier to configure because it is agnostic to applications and memory sizes.h]hRIt is easier to configure because it is agnostic to applications and memory sizes.}(hj+hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj'ubah}(h]h ]h"]h$]h&]uh1jhj$hhhhhNubj)}(hDIt is more reliable because it is directly wired to the OOM killer. h]h)}(hCIt is more reliable because it is directly wired to the OOM killer.h]hCIt is more reliable because it is directly wired to the OOM killer.}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj?ubah}(h]h ]h"]h$]h&]uh1jhj$hhhhhNubeh}(h]h ]h"]h$]h&]jmjnjohjpjquh1jhjhhhhhKubeh}(h]working-set-protectionah ]h"]working set protectionah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(h``mm_struct`` listh](j)}(h ``mm_struct``h]h mm_struct}(hjlhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjhubh list}(hjhhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhjehhhhhKubh)}(hAn ``mm_struct`` list is maintained for each memcg, and an ``mm_struct`` follows its owner task to the new memcg when this task is migrated.h](hAn }(hjhhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh+ list is maintained for each memcg, and an }(hjhhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhD follows its owner task to the new memcg when this task is migrated.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjehhubh)}(hXA page table walker iterates ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()`` with each ``mm_struct`` on this list to scan PTEs. When multiple page table walkers iterate the same list, each of them gets a unique ``mm_struct``, and therefore they can run in parallel.h](hA page table walker iterates }(hjhhhNhNubj)}(h``lruvec_memcg()->mm_list``h]hlruvec_memcg()->mm_list}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh and calls }(hjhhhNhNubj)}(h``walk_page_range()``h]hwalk_page_range()}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh with each }(hjhhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubho on this list to scan PTEs. When multiple page table walkers iterate the same list, each of them gets a unique }(hjhhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh), and therefore they can run in parallel.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjehhubh)}(hX Page table walkers ignore any misplaced pages, e.g., if an ``mm_struct`` was migrated, pages left in the previous memcg will be ignored when the current memcg is under reclaim. Similarly, page table walkers will ignore pages from nodes other than the one under reclaim.h](h;Page table walkers ignore any misplaced pages, e.g., if an }(hj hhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj ubh was migrated, pages left in the previous memcg will be ignored when the current memcg is under reclaim. Similarly, page table walkers will ignore pages from nodes other than the one under reclaim.}(hj hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjehhubh)}(hThis infrastructure also tracks the usage of ``mm_struct`` between context switches so that page table walkers can skip processes that have been sleeping since the last iteration.h](h-This infrastructure also tracks the usage of }(hj,hhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hj4hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj,ubhy between context switches so that page table walkers can skip processes that have been sleeping since the last iteration.}(hj,hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjehhubeh}(h]mm-struct-listah ]h"]mm_struct listah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(hRmap/PT walk feedbackh]hRmap/PT walk feedback}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjThhhhhKubh)}(hX>Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path.h]hX>Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path.}(hjehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjThhubh)}(hXH``lru_gen_look_around()`` exploits spatial locality to reduce the trips into the rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If the scan was done cacheline efficiently, it adds the PMD entry pointing to the PTE table to the Bloom filter. This forms a feedback loop between the eviction and the aging.h](j)}(h``lru_gen_look_around()``h]hlru_gen_look_around()}(hjwhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjsubhX/ exploits spatial locality to reduce the trips into the rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If the scan was done cacheline efficiently, it adds the PMD entry pointing to the PTE table to the Bloom filter. This forms a feedback loop between the eviction and the aging.}(hjshhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjThhubeh}(h]rmap-pt-walk-feedbackah ]h"]rmap/pt walk feedbackah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(h Bloom filtersh]h Bloom filters}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjhhhhhKubh)}(hBloom filters are a space and memory efficient data structure for set membership test, i.e., test if an element is not in the set or may be in the set.h]hBloom filters are a space and memory efficient data structure for set membership test, i.e., test if an element is not in the set or may be in the set.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(hIn the eviction path, specifically, in ``lru_gen_look_around()``, if a PMD has a sufficient number of hot pages, its address is placed in the filter. In the aging path, set membership means that the PTE range will be scanned for young pages.h](h'In the eviction path, specifically, in }(hjhhhNhNubj)}(h``lru_gen_look_around()``h]hlru_gen_look_around()}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh, if a PMD has a sufficient number of hot pages, its address is placed in the filter. In the aging path, set membership means that the PTE range will be scanned for young pages.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(hXNote that Bloom filters are probabilistic on set membership. If a test is false positive, the cost is an additional scan of a range of PTEs, which may yield hot pages anyway. Parameters of the filter itself can control the false positive rate in the limit.h]hXNote that Bloom filters are probabilistic on set membership. If a test is false positive, the cost is an additional scan of a range of PTEs, which may yield hot pages anyway. Parameters of the filter itself can control the false positive rate in the limit.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjhhubeh}(h] bloom-filtersah ]h"] bloom filtersah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(hPID controllerh]hPID controller}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhjhhhhhKubh)}(hA feedback loop modeled after the Proportional-Integral-Derivative (PID) controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.h]hA feedback loop modeled after the Proportional-Integral-Derivative (PID) controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(hXThe PID controller uses generations rather than the wall clock as the time domain because a CPU can scan pages at different rates under varying memory pressure. It calculates a moving average for each new generation to avoid being permanently locked in a suboptimal state.h]hXThe PID controller uses generations rather than the wall clock as the time domain because a CPU can scan pages at different rates under varying memory pressure. It calculates a moving average for each new generation to avoid being permanently locked in a suboptimal state.}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjhhubeh}(h]pid-controllerah ]h"]pid controllerah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(h Memcg LRUh]h Memcg LRU}(hj$hhhNhNubah}(h]h ]h"]h$]h&]uh1hhj!hhhhhKubh)}(hXPAn memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see ``mem_cgroup_lruvec()``). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg LRU only applies to global reclaim.h](hAn memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see }(hj2hhhNhNubj)}(h``mem_cgroup_lruvec()``h]hmem_cgroup_lruvec()}(hj:hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj2ubh). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg LRU only applies to global reclaim.}(hj2hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhj!hhubh)}(hkThe basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios):h]hkThe basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios):}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj!hhubj)}(hhh](j)}(hbIt has the young and the old (generations), i.e., the counterparts to the active and the inactive;h]h)}(hbIt has the young and the old (generations), i.e., the counterparts to the active and the inactive;h]hbIt has the young and the old (generations), i.e., the counterparts to the active and the inactive;}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjcubah}(h]h ]h"]h$]h&]uh1jhj`hhhhhNubj)}(hUThe increment of ``max_seq`` triggers promotion, i.e., the counterpart to activation;h]h)}(hUThe increment of ``max_seq`` triggers promotion, i.e., the counterpart to activation;h](hThe increment of }(hjhhhNhNubj)}(h ``max_seq``h]hmax_seq}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh9 triggers promotion, i.e., the counterpart to activation;}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhj{ubah}(h]h ]h"]h$]h&]uh1jhj`hhhhhNubj)}(h|Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation. h]h)}(h{Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation.h]h{Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjubah}(h]h ]h"]h$]h&]uh1jhj`hhhhhNubeh}(h]h ]h"]h$]h&]jmjnjohjpjquh1jhj!hhhhhKubh)}(h9In terms of global reclaim, it has two distinct features:h]h9In terms of global reclaim, it has two distinct features:}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj!hhubj)}(hhh](j)}(hoSharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism;h]h)}(hoSharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism;h]hoSharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism;}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubj)}(hEventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time. h]h)}(hEventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time.h]hEventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhjubah}(h]h ]h"]h$]h&]uh1jhjhhhhhNubeh}(h]h ]h"]h$]h&]jmjnjohjpjquh1jhj!hhhhhKubh)}(hIn terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity.h]hIn terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity.}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj!hhubeh}(h] memcg-lruah ]h"] memcg lruah$]h&]uh1hhjhhhhhKubh)}(hhh](h)}(hSummaryh]hSummary}(hj# hhhNhNubah}(h]h ]h"]h$]h&]uh1hhj hhhhhKubh)}(hKThe multi-gen LRU (of folios) can be disassembled into the following parts:h]hKThe multi-gen LRU (of folios) can be disassembled into the following parts:}(hj1 hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhKhj hhubj )}(hhh](j)}(h Generationsh]h)}(hjD h]h Generations}(hjF hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjB ubah}(h]h ]h"]h$]h&]uh1jhj? hhhhhNubj)}(h Rmap walksh]h)}(hj[ h]h Rmap walks}(hj] hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjY ubah}(h]h ]h"]h$]h&]uh1jhj? hhhhhNubj)}(h'Page table walks via ``mm_struct`` listh]h)}(hjr h](hPage table walks via }(hjt hhhNhNubj)}(h ``mm_struct``h]h mm_struct}(hj{ hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjt ubh list}(hjt hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjp ubah}(h]h ]h"]h$]h&]uh1jhj? hhhhhNubj)}(h'Bloom filters for rmap/PT walk feedbackh]h)}(hj h]h'Bloom filters for rmap/PT walk feedback}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj ubah}(h]h ]h"]h$]h&]uh1jhj? hhhhhNubj)}(h$PID controller for refault feedback h]h)}(h#PID controller for refault feedbackh]h#PID controller for refault feedback}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj ubah}(h]h ]h"]h$]h&]uh1jhj? hhhhhNubeh}(h]h ]h"]h$]h&]jsjtuh1j hhhMhj hhubh)}(hX{The aging and the eviction form a producer-consumer model; specifically, the latter drives the former by the sliding window over generations. Within the aging, rmap walks drive page table walks by inserting hot densely populated page tables to the Bloom filters. Within the eviction, the PID controller uses refaults as the feedback to select types to evict and tiers to protect.h]hX{The aging and the eviction form a producer-consumer model; specifically, the latter drives the former by the sliding window over generations. Within the aging, rmap walks drive page table walks by inserting hot densely populated page tables to the Bloom filters. Within the eviction, the PID controller uses refaults as the feedback to select types to evict and tiers to protect.}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj hhubeh}(h]summaryah ]h"]summaryah$]h&]uh1hhjhhhhhKubeh}(h]workflow-overviewah ]h"]workflow overviewah$]h&]uh1hhhhhhhhKQubeh}(h] multi-gen-lruah ]h"] multi-gen lruah$]h&]uh1hhhhhhhhKubeh}(h]h ]h"]h$]h&]sourcehuh1hcurrent_sourceN current_lineNsettingsdocutils.frontendValues)}(hN generatorN datestampN source_linkN source_urlN toc_backlinksentryfootnote_backlinksK sectnum_xformKstrip_commentsNstrip_elements_with_classesN strip_classesN report_levelK halt_levelKexit_status_levelKdebugNwarning_streamN tracebackinput_encoding utf-8-siginput_encoding_error_handlerstrictoutput_encodingutf-8output_encoding_error_handlerj error_encodingutf-8error_encoding_error_handlerbackslashreplace language_codeenrecord_dependenciesNconfigN id_prefixhauto_id_prefixid dump_settingsNdump_internalsNdump_transformsNdump_pseudo_xmlNexpose_internalsNstrict_visitorN_disable_configN_sourceh _destinationN _config_files]7/var/lib/git/docbuild/linux/Documentation/docutils.confafile_insertion_enabled raw_enabledKline_length_limitM'pep_referencesN pep_base_urlhttps://peps.python.org/pep_file_url_templatepep-%04drfc_referencesN rfc_base_url&https://datatracker.ietf.org/doc/html/ tab_widthKtrim_footnote_reference_spacesyntax_highlightlong smart_quotessmartquotes_locales]character_level_inline_markupdoctitle_xform docinfo_xformKsectsubtitle_xform image_loadinglinkembed_stylesheetcloak_email_addressessection_self_linkenvNubreporterNindirect_targets]substitution_defs}substitution_names}refnames}refids}nameids}(j j jjjjjjj j jjjjjbj_jQjNjjjjjjj j j j u nametypes}(j jjjj jjjbjQjjjj j uh}(j hjhjhjjj jjjAjjj_jjNjejjTjjjjj j!j j u footnote_refs} citation_refs} autofootnotes]autofootnote_refs]symbol_footnotes]symbol_footnote_refs] footnotes] citations]autofootnote_startKsymbol_footnote_startK id_counter collectionsCounter}Rparse_messages]transform_messages] transformerN include_log] decorationNhhub.