mm: gup: FOLL_UNSHARE and COR fault - kernel/git/andrea/aa.git


author	Andrea Arcangeli <aarcange@redhat.com>	2023-06-11 22:32:08 -0400
committer	Andrea Arcangeli <aarcange@redhat.com>	2023-06-11 22:32:08 -0400
commit	53d0d1eee7e268ea60678a4cd245751d28ab70e8 (patch)
tree	94d96d7c67709441dc4a05109b9626b6b0eb973c
parent	785f2d5f5a1a8b239ba89c3a1555aed9085644e9 (diff)
parent	888ec6e86aed7426c86a216c451da4416ece4a46 (diff)
download	aa-mapcount_unshare.tar.gz

mm: gup: FOLL_UNSHARE and COR fault (mapcount_unshare)

The primary objective of this patchset is to research a zero-cons, only-pros alternative to commit 178 (17839856fd588f4ab6b789f482ed3ffd7c403e1f) and to the new approach in commit 098 (09854ba94c6aad7886996bfbee2530b3d8a7f4f4).

Overall this patchset resolves all issues described in:

https://lkml.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

17839856fd588f4ab6b789f482ed3ffd7c403e1f was a secure and complete fix for CVE-2020-29374, however it caused various regressions: ptrace on pmem crashes, uffd-wp instability, reduced KSM density. This patchset fixes CVE-2020-29374 as well as 178 did, but without introducing those regressions.

The difference between 178 and FOLL_UNSHARE/COR is that the unsharing of the page in GUP is done without setting FOLL_WRITE.

With this solution GUP detects when the page has to be "unshared" and in that case it sets FOLL_UNSHARE before taking any readonly GUP pin. FOLL_UNSHARE then invokes the COR fault to execute the "unsharing".

This ultimately allows delivering full MM coherency to all GUP pins, short or long term, no matter which memory is being pinned and without requiring MMU Notifiers.

MMU Notifiers are still required for the GUP pins not to disable all the advanced VM features (i.e. swapping, NUMA balancing, KSM, memory hotunplug, compaction, CMA, etc.), but they're no longer required just to deliver full MM coherency.

To review the patchset it's best to start from the end of it, so from the below matrix:

+ * The reason short term pins don't lose coherency even if they don't
+ * unshare those three kinds of readonly page types, is that they will
+ * refresh the physaddr from the pgtable before each DMA (or even
+ * if they issue many DMA from the same GUP they are ok that the
+ * snapshot of the data payload is taken at GUP time and that the
+ * coherency with the CPU may be temporarily lost until the next GUP
+ * invocation). So the MM coherency won't break for them even if a COW
+ * happens after the short term pin has been taken by GUP.
+ *
+ * The COR fault will then disambiguate a long term pin requested with
+ * FOLL_MM_SYNC from a short term pin using the
+ * FAULT_FLAG_UNSHARE_MM_SYNC flag.
+ *
+ * The below table assumes !FOLL_WRITE (readonly GUP pins) and answers
+ * when FOLL_UNSHARE is activated during GUP to generate exclusive
+ * anonymous memory mapped readonly through the COR fault.
+ *
+ * page/mapping type      | short term   | FOLL_MM_SYNC
+ * ------------------------------------------------------
+ * PageAnon && !PageKsm   | mapcount > 1 | mapcount > 1
+ * PageAnon && PageKsm    | never        | always
+ * zeropage               | never        | always
+ * MAP_PRIVATE !PageAnon  | never        | always
+ * MAP_SHARED !PageAnon   | never        | never

FOLL_MM_SYNC in the above code comment can be interpreted as FOLL_LONGTERM for simplicity (it's supposed to be set along with it).

By the time the COW fault triggers, it's no longer possible to disambiguate what an extra refcount on the page actually means. It could be a GUP pin or a temporary refcount (like a speculative pagecache lookup) or even a swapcache refcount.

The GUP pin information is lost by the time GUP returns, so it should be acted upon within GUP itself, in a way that will prevent the COW fault from ever encountering a COW page that was GUP pinned by another process.

This patchset can be split in two parts.

The objective of the first part, up to but not including the patch introducing FOLL_MM_SYNC, is:

- fix CVE-2020-29374: the GUP security issue that allows the fork() child to read the memory of the parent.

- fix the ABI break for sub-PAGE_SIZE short term GUP pins using FOLL_GET, such as O_DIRECT read(), showing userland MM corruption (despite the read syscall returning success) in combination with swap or clear_refs.

- fix FOLL_LONGTERM/RDMA+swap or FOLL_LONGTERM+clear_refs silent memory corruption that may happen if pin-fast elevates the page_count after page_maybe_dma_pinned() already returned a false negative.

- fix the ABI break for readonly long term GUP pins that may lose coherency with the MM (which can result in silent data loss on the receive end) if the pinned pages are wrprotected in the pagetable. Example: readonly FOLL_LONGTERM pins used to deliver dirty data written by the CPU and tracked with clear_refs. Another example: mprotect(PROT_READ) followed by mprotect(PROT_READ|PROT_WRITE) on pages that have been previously pinned with a readonly FOLL_LONGTERM.

- clear_refs and uffd-wp both optimized not to require the mmap_write_lock.

- pmbench swapcache 24.9% performance boost against the 09854ba94c6aad7886996bfbee2530b3d8a7f4f4~ baseline (more than the 23.8% improvement provided by f4c4a3f48480730214c4f02ffa480f6bf5b0718f when applied on top of 09854ba94c6aad7886996bfbee2530b3d8a7f4f4). It avoids not only generating orphaned swapcache, but also the spurious allocations and copies that generated it in the first place.

The second part, starting from and including the patch introducing FOLL_MM_SYNC:

- allows marking writable all exclusive COW anon pages during mprotect to avoid spurious COW faults, even if the pages are GUP pinned.

- reuses the previously introduced FOLL_UNSHARE COR logic to deliver FOLL_LONGTERM MM coherency by the specs to readonly long term pins (FOLL_MM_SYNC supersedes FOLL_WRITE|FOLL_FORCE).

- adds other THP mapcount and GUP related optimizations.

-----------------------------------------------------------------------

Patches that originated in this patchset and were successfully merged upstream:

292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
a458b76a4171 ("mm: gup: pack has_pinned in MMF_HAS_PINNED")

which provided these benefits to upstream:

- >4000% SMP scalability performance improvement to pin_user_pages_fast() with 1 thread per CPU and 2 NUMA nodes with 64 cores each.

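For reviewers who prefer code to tables, the FOLL_UNSHARE matrix quoted earlier in this message can be restated as a small userspace model. This is only a sketch under stated assumptions: the enum, the `model_must_unshare()` helper and the `mm_sync` parameter are hypothetical names made up for illustration, they are not the kernel's gup_must_unshare() implementation, and like the table they assume !FOLL_WRITE (readonly GUP pins).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the rows of the matrix; not kernel types. */
enum model_page_type {
	MODEL_ANON,		/* PageAnon && !PageKsm */
	MODEL_ANON_KSM,		/* PageAnon && PageKsm */
	MODEL_ZEROPAGE,		/* zeropage */
	MODEL_PRIVATE_FILE,	/* MAP_PRIVATE !PageAnon */
	MODEL_SHARED_FILE,	/* MAP_SHARED !PageAnon */
};

/*
 * Returns true when the model says a readonly GUP pin should go
 * through the COR fault first. mm_sync stands for the FOLL_MM_SYNC
 * column, false stands for the short term column.
 */
static inline bool model_must_unshare(enum model_page_type type,
				      int mapcount, bool mm_sync)
{
	switch (type) {
	case MODEL_ANON:
		/* same answer in both columns: mapcount > 1 */
		return mapcount > 1;
	case MODEL_ANON_KSM:
	case MODEL_ZEROPAGE:
	case MODEL_PRIVATE_FILE:
		/* never for short term pins, always with FOLL_MM_SYNC */
		return mm_sync;
	case MODEL_SHARED_FILE:
	default:
		/* MAP_SHARED pagecache is never unshared */
		return false;
	}
}

/* Spot-check a few cells of the table. */
static inline void model_must_unshare_selftest(void)
{
	assert(model_must_unshare(MODEL_ANON, 2, false));
	assert(!model_must_unshare(MODEL_ANON, 1, true));
	assert(model_must_unshare(MODEL_ANON_KSM, 1, true));
	assert(!model_must_unshare(MODEL_ZEROPAGE, 1, false));
	assert(!model_must_unshare(MODEL_SHARED_FILE, 3, true));
}
```

The model deliberately ignores everything the table does not mention (swapcount, THP, hugetlbfs, locking), so it only reflects the decision matrix itself.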
Changelog:

785f2d5f5a1a8b239ba89c3a1555aed9085644e9

- sync with main-5.15.y

488c4509c7d892db6b5c7170095bc6bf651beb2b

- Cleanup FOLL_UNSHARE definition from code and commit headers and other no-op cleanups. It's a further sync-up with the cleanups from the v1 "mm: COW fixes part 1: fix the COW security issue for THP and hugetlb" submit.

1b972f2a40824f35709dd8b730aedfd0eaff4ff8

- Tentative fix for the false positive BUILD_BUG_ON build error on some arches reported by the kernel test robot.

e9a49c91f01d445e3be0b527a05eb097b81b8741

- The mprotect optimization that was proposed upstream to skip spurious COW faults had a bug in not checking the swapcount, which could result in erroneously skipping the COW fault with swap enabled. This implementation inherited the same bug that the original upstream posted patch had. The bug has been found by source review and it has been fixed: in this implementation the swapcount is now taken into account as required for safety.

3346c14187913954a0e50d0069b231b55568a6af

- optimized wp_page_unshare() with can_read_pin_swap_page(); in addition this change is a dependency for the PageKsm FOLL_MM_SYNC rework.

- reworked from scratch PageKsm FOLL_MM_SYNC using can_read_pin_swap_page(). Enforcing that no FOLL_LONGTERM read pin can ever be taken on any PageKsm feels simpler in comparison to enforcing that no PageAnon can be converted to PageKsm if there's any outstanding pin and that no wrprotected PageAnon can be replaced by an equal PageKsm if the PageAnon had any outstanding FOLL_LONGTERM pins. Both guarantees are required for FOLL_MM_SYNC to deliver full synchronicity to FOLL_LONGTERM pins on VM_MERGEABLE vmas too.

6865ab35a3b9720b1820c8016f20c589a0293e0e

- gup_must_unshare() optimized with can_read_pin_swap_page().

- added the page lock in the hugetlbfs gup_must_unshare() path to protect against page migration. It'd be ideal if page migration could be improved to count how many migration entries it installed and then drop the mapcount accordingly only after the refcount freezing.

- Improved FOLL_MM_SYNC for PageKsm: KSM code should cooperate with GUP and make sure to never de-dup pages with GUP pins. GUP already does its part in unsharing PageKsm pages with the COR fault before taking readonly FOLL_LONGTERM pins (with FOLL_MM_SYNC implicitly set).

d0de2b3fef4a4e94207020c560158b2ccc97baa0

- More noop cleanups.

- Added a missing update_mmu_tlb() which is also a noop for all arches except mips.

5db5863e2c773f347ded84c98d6f28f692cd2def

- A solution based on the FOLL_UNSHARE+COR solution that originated in this tree has been proposed upstream and the review showed that gup_must_unshare() didn't properly take into account the swapcount. The lack of swapcount calculation reported upstream is a minor implementation issue and requires no change in design to fix. In fact it has been fixed in less than 48 hours, as demonstrated by this quick hotfix update.

  It's worth pointing out that the lack of swapcount calculation in the previous version caused zero regressions compared to upstream v5.7 and in fact the previous version was preferable to v5.7. By contrast, upstream still randomly corrupts memory when swap is enabled: with O_DIRECT when using a 64k PAGE_SIZE on aarch64 and a 4k db blocksize, with io_uring, and with all FOLL_LONGTERM users. It also causes various horizontal regressions (for example all swapcache is COWed unconditionally even if it's exclusive).

  At the time of this writing, this is the only known solution that resolves all known security issues, introduces zero user ABI regressions compared to v5.7 and retains the full power of the MM. In fact this goes beyond what v5.7 could do: with FOLL_MM_SYNC, for the first time this solution provides full POSIX semantics to all FOLL_LONGTERM and short term pins by leveraging the COR (Copy On Read) fault.

bbb329566757030dec3960ffbc8465a8092a17fd

- Peter Xu discovered that the THP path of __page_mapcount was reading the first tail page instead of the right tail page in a doublemap. This has been corrected.

- David Hildenbrand reported that __page_mapcount and gup_must_unshare shared some code paths between THP and hugetlbfs, but the mapcount seqcount wasn't initialized in hugetlbfs, which could result in a softlockup. This has been corrected and the hugetlbfs paths in __page_mapcount and gup_must_unshare don't share the same code paths anymore.

- Merged a permutation from David Hildenbrand that simplifies __split_huge_pmd_locked() and reduces the page_trans_huge_mapcount_lock() hold time as well.

- Merged FOLL_NOUNSHARE from David Hildenbrand to "deactivate" the COR fault in follow_page(). follow_page() is special because the kernel is the "user" and the kernel intends to work on the real thing, not on the post-COR copy. Obtaining a (post-COR) copy of the page is functionally harmless from the userland point of view, but it'd defeat various kernel MM optimizations.

b2bf3165cf3269a2b6b4e52a14f275b212f0bbe9

- rebase on top of THP folio.

89ab75761d378e4f3c165ff8c125f7bf9f433ab4

- added the COR fault and the FAULT_FLAG_UNSHARE support to hugetlbfs.

3e14641933648eecdb49c3f949d09097e733413b

- Improved the 3771dc26618494d2fca1f8489cc1581a63a51ce8 commit header.

236f0d6d3eb19f81d53352dcb7f73d5cbb1ac1eb

- cleanup gup_must_unshare(): added is_fast_only_in_irq() to document and deduplicate the irq_count() check.

1bc34a65718685b8caa980185de49dc04377eacd

- Added feb889fb40fafc6933339cf1cca8f770126819fb to the list of reverts since it's unnecessary after reverting 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.

- Documented more details on the SMP race against pin-fast of feb889fb40fafc6933339cf1cca8f770126819fb and 9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623 at the end of the commit header of 5f3f91f23e41359338a41991fe19e4735d7e56e4 ("mm: COW: restore full accuracy in page reuse").

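The swapcount fixes mentioned in the changelog (the mprotect optimization and gup_must_unshare()) share one idea: a mapcount of 1 alone does not prove that an anon page is exclusive while swap entries may still reference it. The following userspace sketch illustrates that idea only. The struct and both helpers are hypothetical names invented here, and the exact condition used by the kernel code in this tree is assumed, not quoted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters discussed in the changelog. */
struct model_anon_page {
	int mapcount;	/* pagetable mappings of the page */
	int swapcount;	/* swap entries still referencing the page */
};

/* The buggy check as described: the swapcount is ignored. */
static inline bool model_exclusive_mapcount_only(const struct model_anon_page *p)
{
	return p->mapcount == 1;
}

/* Assumed fixed check: mapped once and no swap entry references it. */
static inline bool model_exclusive(const struct model_anon_page *p)
{
	return p->mapcount == 1 && p->swapcount == 0;
}

/* Show the case where the two checks disagree. */
static inline void model_exclusive_selftest(void)
{
	struct model_anon_page in_swap = { .mapcount = 1, .swapcount = 1 };
	struct model_anon_page private_page = { .mapcount = 1, .swapcount = 0 };

	/* A swap entry can still lead to this page: not exclusive. */
	assert(model_exclusive_mapcount_only(&in_swap));
	assert(!model_exclusive(&in_swap));

	/* No other reference through mappings or swap: exclusive. */
	assert(model_exclusive(&private_page));
}
```

In this model the page with a pending swap entry is where the mapcount-only check would wrongly skip the COW fault, which matches the failure described for the mprotect optimization with swap enabled.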
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Diffstat

0 files changed, 0 insertions, 0 deletions

