From 51e7849cd6e4050599ef4e255e85195788841bb4 Mon Sep 17 00:00:00 2001 From: Vishal Verma Date: Wed, 24 Jan 2024 12:03:48 -0800 Subject: Documentation/ABI: add ABI documentation for sysfs-bus-dax Add the missing sysfs ABI documentation for the device DAX subsystem. Various ABI attributes under this have been present since v5.1, and more have been added over time. In preparation for adding a new attribute, add this file with the historical details. Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-3-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma Cc: Dan Williams Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Huang Ying Cc: Jonathan Cameron Cc: Li Zhijian Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-bus-dax | 136 ++++++++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-bus-dax (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax new file mode 100644 index 0000000000000..6359f7bc9bf43 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -0,0 +1,136 @@ +What: /sys/bus/dax/devices/daxX.Y/align +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RW) Provides a way to specify an alignment for a dax device. + Values allowed are constrained by the physical address ranges + that back the dax device, and also by arch requirements. + +What: /sys/bus/dax/devices/daxX.Y/mapping +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (WO) Provides a way to allocate a mapping range under a dax + device. Specified in the format <start>-<end>. + +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RO) A dax device may have multiple constituent discontiguous + address ranges. These are represented by the different + 'mappingX' subdirectories. The 'start' attribute indicates the + start physical address for the given range. The 'end' attribute + indicates the end physical address for the given range. The + 'page_offset' attribute indicates the offset of the current + range in the dax device. + +What: /sys/bus/dax/devices/daxX.Y/resource +Date: June, 2019 +KernelVersion: v5.3 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The resource attribute indicates the starting physical + address of a dax device. In case of a device with multiple + constituent ranges, it indicates the starting address of the + first range. + +What: /sys/bus/dax/devices/daxX.Y/size +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RW) The size attribute indicates the total size of a dax + device. For creating subdivided dax devices, or for resizing + an existing device, the new size can be written to this as + part of the reconfiguration process. + +What: /sys/bus/dax/devices/daxX.Y/numa_node +Date: November, 2019 +KernelVersion: v5.5 +Contact: nvdimm@lists.linux.dev +Description: + (RO) If NUMA is enabled and the platform has affinitized the + backing device for this dax device, emit the CPU node + affinity for this device.
+ +What: /sys/bus/dax/devices/daxX.Y/target_node +Date: February, 2019 +KernelVersion: v5.1 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The target-node attribute is the Linux numa-node that a + device-dax instance may create when it is online. Prior to + being online the device's 'numa_node' property reflects the + closest online cpu node which is the typical expectation of a + device 'numa_node'. Once it is online it becomes its own + distinct numa node. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The available_size attribute tracks available dax region + capacity. This only applies to volatile hmem devices, not pmem + devices, since pmem devices are defined by nvdimm namespace + boundaries. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size +Date: July, 2017 +KernelVersion: v5.1 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The size attribute indicates the size of a given dax region + in bytes. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The align attribute indicates alignment of the dax region. + Changes on align may not always be valid, when say certain + mappings were created with 2M and then we switch to 1G. This + validates all ranges against the new value being attempted, post + resizing. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/seed +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The seed device is a concept for dynamic dax regions to be + able to split the region amongst multiple sub-instances. The + seed device, similar to libnvdimm seed devices, is a device + that starts with zero capacity allocated and unbound to a + driver. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/create +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (RW) The create interface to the dax region provides a way to + create a new unconfigured dax device under the given region, which + can then be configured (with a size etc.) and then probed. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/delete +Date: October, 2020 +KernelVersion: v5.10 +Contact: nvdimm@lists.linux.dev +Description: + (WO) The delete interface for a dax region provides for deletion + of any 0-sized and idle dax devices. + +What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/id +Date: July, 2017 +KernelVersion: v5.1 +Contact: nvdimm@lists.linux.dev +Description: + (RO) The id attribute indicates the region id of a dax region. -- cgit 1.2.3-korg From 73954d379efd176e9b011142c55aa8da93ac740a Mon Sep 17 00:00:00 2001 From: Vishal Verma Date: Wed, 24 Jan 2024 12:03:50 -0800 Subject: dax: add a sysfs knob to control memmap_on_memory behavior Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. 
Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-5-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma Tested-by: Li Zhijian Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Reviewed-by: Huang, Ying Reviewed-by: Alison Schofield Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Greg Kroah-Hartman Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Oscar Salvador Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-bus-dax | 17 +++++++++++++ drivers/dax/bus.c | 43 +++++++++++++++++++++++++++++++++ 2 files changed, 60 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax index 6359f7bc9bf43..b34266bfae49a 100644 --- a/Documentation/ABI/testing/sysfs-bus-dax +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -134,3 +134,20 @@ KernelVersion: v5.1 Contact: nvdimm@lists.linux.dev Description: (RO) The id attribute indicates the region id of a dax region. + +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory +Date: January, 2024 +KernelVersion: v6.8 +Contact: nvdimm@lists.linux.dev +Description: + (RW) Control the memmap_on_memory setting if the dax device + were to be hotplugged as system memory. This determines whether + the 'altmap' for the hotplugged memory will be placed on the + device being hotplugged (memmap_on_memory=1) or if it will be + placed on regular memory (memmap_on_memory=0). This attribute + must be set before the device is handed over to the 'kmem' + driver (i.e. hotplugged into system-ram). Additionally, this + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled + memmap_on_memory parameter for memory_hotplug. This is + typically set on the kernel command line - + memory_hotplug.memmap_on_memory set to 'true' or 'force'." 
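A minimal sketch of the intended flow, assuming a device named dax0.0 that has not yet been handed over to the 'kmem' driver:

    # cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
    0
    # echo 1 > /sys/bus/dax/devices/dax0.0/memmap_on_memory
    # cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
    1

Once the device is bound to 'kmem', writes to the attribute fail with EBUSY, per the store handler in the bus.c hunk below.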
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 0fd948a4443e3..27c86d0ca7118 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1349,6 +1349,48 @@ static ssize_t numa_node_show(struct device *dev, } static DEVICE_ATTR_RO(numa_node); +static ssize_t memmap_on_memory_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sysfs_emit(buf, "%d\n", dev_dax->memmap_on_memory); +} + +static ssize_t memmap_on_memory_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + bool val; + int rc; + + rc = kstrtobool(buf, &val); + if (rc) + return rc; + + if (val == true && !mhp_supports_memmap_on_memory()) { + dev_dbg(dev, "memmap_on_memory is not available\n"); + return -EOPNOTSUPP; + } + + rc = down_write_killable(&dax_dev_rwsem); + if (rc) + return rc; + + if (dev_dax->memmap_on_memory != val && dev->driver && + to_dax_drv(dev->driver)->type == DAXDRV_KMEM_TYPE) { + up_write(&dax_dev_rwsem); + return -EBUSY; + } + + dev_dax->memmap_on_memory = val; + up_write(&dax_dev_rwsem); + + return len; +} +static DEVICE_ATTR_RW(memmap_on_memory); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -1375,6 +1417,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, + &dev_attr_memmap_on_memory.attr, NULL, }; -- cgit 1.2.3-korg From 5af28560fe4f091c215580b16672004fdf08f304 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 29 Jan 2024 17:35:40 -0800 Subject: Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints example Patch series "mm/damon: make DAMON debugfs interface deprecation unignorable". The DAMON debugfs interface was deprecated in February 2023, by commit 5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs interface deprecation notice"). Make that fact harder to ignore by removing an example usage from the document (patch 1), renaming the config (patch 2), adding a deprecation notice file to the debugfs directory (patches 3-5), and renaming the debugfs file that is essential for real use of DAMON (patches 6-9). This patch (of 9): The DAMON tracepoints example in the DAMON usage document uses the DAMON debugfs interface, which is deprecated. Use its alternative, the DAMON sysfs interface. Link: https://lkml.kernel.org/r/20240130013549.89538-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240130013549.89538-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Shuah Khan Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 9d23144bf9850..f2feabb4bd35c 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -579,11 +579,11 @@ monitoring results recording. While the monitoring is turned on, you could record the tracepoint events and show results using tracepoint supporting tools like ``perf``.
For example:: - # echo on > monitor_on + # echo on > kdamonds/0/state # perf record -e damon:damon_aggregated & # sleep 5 # kill 9 $(pidof perf) - # echo off > monitor_on + # echo off > kdamonds/0/state # perf script kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864 [...] -- cgit 1.2.3-korg From cf3810cc317c069259915770f4ed43d61eaf85bf Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 29 Jan 2024 17:35:44 -0800 Subject: Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON debugfs interface Document the newly added DAMON debugfs interface deprecation notice file in the usage document. Link: https://lkml.kernel.org/r/20240130013549.89538-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Shuah Khan Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index f2feabb4bd35c..5d3df18dfb9fc 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -628,9 +628,16 @@ debugfs Interface (DEPRECATED!) move, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. -DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, -``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and -``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``. +DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, +``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` +and ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``. + + +``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation +notice. Reading it returns the deprecation notice, as below:: + + # cat DEPRECATED + DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. Attributes -- cgit 1.2.3-korg From ec28cf530cdf390ba5f8529b5767eab0c615b918 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 29 Jan 2024 17:35:47 -0800 Subject: Docs/admin-guide/mm/damon/usage: update for monitor_on renaming Update DAMON debugfs interface sections in the usage document to reflect the fact that the 'monitor_on' file has been renamed to 'monitor_on_DEPRECATED'. Link: https://lkml.kernel.org/r/20240130013549.89538-9-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Shuah Khan Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 29 ++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 5d3df18dfb9fc..58c34e66b31b2 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -629,8 +629,9 @@ debugfs Interface (DEPRECATED!) linux-mm@kvack.org. DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, -``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
+``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, +``mk_contexts`` and ``rm_contexts`` under its debugfs directory, +``<debugfs>/damon/``. ``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation @@ -855,16 +856,16 @@ Turning On/Off Setting the files as described above doesn't incur effect unless you explicitly start the monitoring. You can start, stop, and check the current status of the -monitoring by writing to and reading from the ``monitor_on`` file. Writing -``on`` to the file starts the monitoring of the targets with the attributes. -Writing ``off`` to the file stops those. DAMON also stops if every target -process is terminated. Below example commands turn on, off, and check the -status of DAMON:: +monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file. +Writing ``on`` to the file starts the monitoring of the targets with the +attributes. Writing ``off`` to the file stops those. DAMON also stops if +every target process is terminated. Below example commands turn on, off, and +check the status of DAMON:: # cd <debugfs>/damon - # echo on > monitor_on - # echo off > monitor_on - # cat monitor_on + # echo on > monitor_on_DEPRECATED + # echo off > monitor_on_DEPRECATED + # cat monitor_on_DEPRECATED off Please note that you cannot write to the above-mentioned debugfs files while @@ -880,11 +881,11 @@ can get the pid of the thread by reading the ``kdamond_pid`` file. When the monitoring is turned off, reading the file returns ``none``. :: # cd <debugfs>/damon - # cat monitor_on + # cat monitor_on_DEPRECATED off # cat kdamond_pid none - # echo on > monitor_on + # echo on > monitor_on_DEPRECATED # cat kdamond_pid 18594 @@ -914,5 +915,5 @@ directory by putting the name of the context to the ``rm_contexts`` file. :: # ls foo # ls: cannot access 'foo': No such file or directory -Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the -root directory only. +Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files +are in the root directory only. -- cgit 1.2.3-korg From 87beb00404b712da545fad55ed3c3060099eea2f Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 29 Jan 2024 17:35:48 -0800 Subject: Docs/translations/damon/usage: update for monitor_on renaming Update DAMON debugfs interface sections in the translated usage documents to reflect the fact that the 'monitor_on' file has been renamed to 'monitor_on_DEPRECATED'. Link: https://lkml.kernel.org/r/20240130013549.89538-10-sj@kernel.org Signed-off-by: SeongJae Park Reviewed-by: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Shuah Khan Cc: Yanteng Si Signed-off-by: Andrew Morton --- .../zh_CN/admin-guide/mm/damon/usage.rst | 20 ++++++++++---------- .../zh_TW/admin-guide/mm/damon/usage.rst | 20 ++++++++++---------- 2 files changed, 20 insertions(+), 20 deletions(-) (limited to 'Documentation') diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst index 17b9949d9b435..da2745464ece4 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst @@ -344,7 +344,7 @@ debugfs接口 :ref:`sysfs接口`。 DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``, -``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` 和 +``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, ``mk_contexts`` 和 ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
@@ -521,15 +521,15 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``, 开关 ---- -除非你明确地启动监测,否则如上所述的文件设置不会产生效果。你可以通过写入和读取 ``monitor_on`` +除非你明确地启动监测,否则如上所述的文件设置不会产生效果。你可以通过写入和读取 ``monitor_on_DEPRECATED`` 文件来启动、停止和检查监测的当前状态。写入 ``on`` 该文件可以启动对有属性的目标的监测。写入 ``off`` 该文件则停止这些目标。如果每个目标进程被终止,DAMON也会停止。下面的示例命令开启、关 闭和检查DAMON的状态:: # cd /damon - # echo on > monitor_on - # echo off > monitor_on - # cat monitor_on + # echo on > monitor_on_DEPRECATED + # echo off > monitor_on_DEPRECATED + # cat monitor_on_DEPRECATED off 请注意,当监测开启时,你不能写到上述的debugfs文件。如果你在DAMON运行时写到这些文件,将会返 @@ -543,11 +543,11 @@ DAMON通过一个叫做kdamond的内核线程来进行请求监测。你可以 得该线程的 ``pid`` 。当监测被 ``关闭`` 时,读取该文件不会返回任何信息:: # cd /damon - # cat monitor_on + # cat monitor_on_DEPRECATED off # cat kdamond_pid none - # echo on > monitor_on + # echo on > monitor_on_DEPRECATED # cat kdamond_pid 18594 @@ -574,7 +574,7 @@ DAMON通过一个叫做kdamond的内核线程来进行请求监测。你可以 # ls foo # ls: cannot access 'foo': No such file or directory -注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on`` 文件只在根目录下。 +注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on_DEPRECATED`` 文件只在根目录下。 监测结果的监测点 @@ -583,9 +583,9 @@ DAMON通过一个叫做kdamond的内核线程来进行请求监测。你可以 DAMON通过一个tracepoint ``damon:damon_aggregated`` 提供监测结果. 当监测开启时,你可 以记录追踪点事件,并使用追踪点支持工具如perf显示结果。比如说:: - # echo on > monitor_on + # echo on > monitor_on_DEPRECATED # perf record -e damon:damon_aggregated & # sleep 5 # kill 9 $(pidof perf) - # echo off > monitor_on + # echo off > monitor_on_DEPRECATED # perf script diff --git a/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst index 6dee719a32ea6..7464279f9b7de 100644 --- a/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst +++ b/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst @@ -344,7 +344,7 @@ debugfs接口 :ref:`sysfs接口`。 DAMON導出了八個文件, ``attrs``, ``target_ids``, ``init_regions``, -``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` 和 +``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, ``mk_contexts`` 和 ``rm_contexts`` under its debugfs directory, ``/damon/``. @@ -521,15 +521,15 @@ DAMON導出了八個文件, ``attrs``, ``target_ids``, ``init_regions``, 開關 ---- -除非你明確地啓動監測,否則如上所述的文件設置不會產生效果。你可以通過寫入和讀取 ``monitor_on`` +除非你明確地啓動監測,否則如上所述的文件設置不會產生效果。你可以通過寫入和讀取 ``monitor_on_DEPRECATED`` 文件來啓動、停止和檢查監測的當前狀態。寫入 ``on`` 該文件可以啓動對有屬性的目標的監測。寫入 ``off`` 該文件則停止這些目標。如果每個目標進程被終止,DAMON也會停止。下面的示例命令開啓、關 閉和檢查DAMON的狀態:: # cd /damon - # echo on > monitor_on - # echo off > monitor_on - # cat monitor_on + # echo on > monitor_on_DEPRECATED + # echo off > monitor_on_DEPRECATED + # cat monitor_on_DEPRECATED off 請注意,當監測開啓時,你不能寫到上述的debugfs文件。如果你在DAMON運行時寫到這些文件,將會返 @@ -543,11 +543,11 @@ DAMON通過一個叫做kdamond的內核線程來進行請求監測。你可以 得該線程的 ``pid`` 。當監測被 ``關閉`` 時,讀取該文件不會返回任何信息:: # cd /damon - # cat monitor_on + # cat monitor_on_DEPRECATED off # cat kdamond_pid none - # echo on > monitor_on + # echo on > monitor_on_DEPRECATED # cat kdamond_pid 18594 @@ -574,7 +574,7 @@ DAMON通過一個叫做kdamond的內核線程來進行請求監測。你可以 # ls foo # ls: cannot access 'foo': No such file or directory -注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on`` 文件只在根目錄下。 +注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on_DEPRECATED`` 文件只在根目錄下。 監測結果的監測點 @@ -583,10 +583,10 @@ DAMON通過一個叫做kdamond的內核線程來進行請求監測。你可以 DAMON通過一個tracepoint ``damon:damon_aggregated`` 提供監測結果. 
當監測開啓時,你可 以記錄追蹤點事件,並使用追蹤點支持工具如perf顯示結果。比如說:: - # echo on > monitor_on + # echo on > monitor_on_DEPRECATED # perf record -e damon:damon_aggregated & # sleep 5 # kill 9 $(pidof perf) - # echo off > monitor_on + # echo off > monitor_on_DEPRECATED # perf script -- cgit 1.2.3-korg From dce41f5ae2539d1c20ae8de4e039630aec3c3f3c Mon Sep 17 00:00:00 2001 From: Rakie Kim Date: Fri, 2 Feb 2024 12:02:35 -0500 Subject: mm/mempolicy: implement the sysfs-based weighted_interleave interface MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs extension", v5. Weighted interleave is a new interleave policy intended to make use of heterogeneous memory environments appearing with CXL. The existing interleave mechanism does an even round-robin distribution of memory across all nodes in a nodemask, while weighted interleave distributes memory across nodes according to a provided weight. (Weight = # of page allocations per round) Weighted interleave is intended to reduce average latency when bandwidth is pressured - therefore increasing total throughput. In other words: It allows greater use of the total available bandwidth in a heterogeneous hardware environment (different hardware provides different bandwidth capacity). As bandwidth is pressured, latency increases - first linearly and then exponentially. By keeping bandwidth usage distributed according to available bandwidth, we therefore can reduce the average latency of a cacheline fetch. A good explanation of the bandwidth vs latency response curve: https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/ From the article: ``` Constant region: The latency response is fairly constant for the first 40% of the sustained bandwidth. Linear region: In between 40% to 80% of the sustained bandwidth, the latency response increases almost linearly with the bandwidth demand of the system due to contention overhead by numerous memory requests. Exponential region: Between 80% to 100% of the sustained bandwidth, the memory latency is dominated by the contention latency which can be as much as twice the idle latency or more. Maximum sustained bandwidth : Is 65% to 75% of the theoretical maximum bandwidth. ``` As a general rule of thumb: * If bandwidth usage is low, latency does not increase. It is optimal to place data in the nearest (lowest latency) device. * If bandwidth usage is high, latency increases. It is optimal to place data such that bandwidth use is optimized per-device. This is the top line goal: Provide a user a mechanism to target using the "maximum sustained bandwidth" of each hardware component in a heterogenous memory system. For example, the stream benchmark demonstrates that 1:1 (default) interleave is actively harmful, while weighted interleave can be beneficial. Default interleave distributes data such that too much pressure is placed on devices with lower available bandwidth. Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device) Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) Targeted weights : +2.5% to +4% (consistently better than DRAM) Global means the task-policy was set (set_mempolicy), while targeted means VMA policies were set (mbind2). We see weighted interleave is not always beneficial when applied globally, but is always beneficial when applied to bandwidth-driving memory regions. 
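(A worked check of the weight arithmetic, using hypothetical weights rather than the benchmark figures above: weights (2,1) give a total weight of 3, so of every 3 pages allocated, 2 land on node0 and 1 on node1, i.e. node0 receives 2/3 of pages. The ~9:1 Stream weights derived below likewise place 9 of every 10 pages on the DRAM node.)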
There are 4 patches in this set: 1) Implement system-global interleave weights as sysfs extension in mm/mempolicy.c. These weights are RCU protected, and a default weight set is provided (all weights are 1 by default). In future work, we intend to expose an interface for HMAT/CDAT code to set reasonable default values based on the memory configuration of the system discovered at boot/hotplug. 2) A mild refactor of some interleave-logic for re-use in the new weighted interleave logic. 3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind 4) Protect interleave logic (weighted and normal) with the mems_allowed seq cookie. If the nodemask changes while accessing it during a rebind, just retry the access. Included below are some performance and LTP test information, and a sample numactl branch which can be used for testing. = Performance summary = (tests may have different configurations, see extended info below) 1) MLC (W2) : +38% over DRAM. +264% over default interleave. MLC (W5) : +40% over DRAM. +226% over default interleave. 2) Stream : -6% to +4% over DRAM, +430% over default interleave. 3) XSBench : +19% over DRAM. +47% over default interleave. = LTP Testing Summary = existing mempolicy & mbind tests: pass mempolicy & mbind + weighted interleave (global weights): pass = version history v5: - style fixes - mems_allowed cookie protection to detect rebind issues, prevents spurious allocation failures and/or mis-allocations - sparse warning fixes related to __rcu on local variables ===================================================================== Performance tests - MLC From - Ravi Jonnalagadda Hardware: Single-socket, multiple CXL memory expanders. Workload: W2 Data Signature: 2:1 read:write DRAM only bandwidth (GBps): 298.8 DRAM + CXL (default interleave) (GBps): 113.04 DRAM + CXL (weighted interleave)(GBps): 412.5 Gain over DRAM only: 1.38x Gain over default interleave: 2.64x Workload: W5 Data Signature: 1:1 read:write DRAM only bandwidth (GBps): 273.2 DRAM + CXL (default interleave) (GBps): 117.23 DRAM + CXL (weighted interleave)(GBps): 382.7 Gain over DRAM only: 1.4x Gain over default interleave: 2.26x ===================================================================== Performance test - Stream From - Gregory Price Hardware: Single socket, single CXL expander numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) mbind2 weights : +2.5% to +4% (consistently better than DRAM) dram only: numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Function Direction BestRateMBs AvgTime MinTime MaxTime Copy: 0->0 200923.2 0.032662 0.031853 0.033301 Scale: 0->0 202123.0 0.032526 0.031664 0.032970 Add: 0->0 208873.2 0.047322 0.045961 0.047884 Triad: 0->0 208523.8 0.047262 0.046038 0.048414 CXL-only: numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 22209.7 0.288661 0.288162 0.289342 Scale: 0->0 22288.2 0.287549 0.287147 0.288291 Add: 0->0 24419.1 0.393372 0.393135 0.393735 Triad: 0->0 24484.6 0.392337 0.392083 0.394331 Based on the above, the optimal weights are ~9:1 echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1 echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2 default interleave: numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc 
Copy: 0->0 44666.2 0.143671 0.143285 0.144174 Scale: 0->0 44781.6 0.143256 0.142916 0.143713 Add: 0->0 48600.7 0.197719 0.197528 0.197858 Triad: 0->0 48727.5 0.197204 0.197014 0.197439 global weighted interleave: numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 190085.9 0.034289 0.033669 0.034645 Scale: 0->0 207677.4 0.031909 0.030817 0.033061 Add: 0->0 202036.8 0.048737 0.047516 0.053409 Triad: 0->0 217671.5 0.045819 0.044103 0.046755 targted regions w/ global weights (modified stream to mbind2 malloc'd regions)) numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc Copy: 0->0 205827.0 0.031445 0.031094 0.031984 Scale: 0->0 208171.8 0.031320 0.030744 0.032505 Add: 0->0 217352.0 0.045087 0.044168 0.046515 Triad: 0->0 216884.8 0.045062 0.044263 0.046982 ===================================================================== Performance tests - XSBench From - Hyeongtak Ji Hardware: Single socket, Single CXL memory Expander NUMA node 0: 56 logical cores, 128 GB memory NUMA node 2: 96 GB CXL memory Threads: 56 Lookups: 170,000,000 Summary: +19% over DRAM. +47% over default interleave. Performance tests - XSBench 1. dram only $ numactl -m 0 ./XSBench -s XL –p 5000000 Runtime: 36.235 seconds Lookups/s: 4,691,618 2. default interleave $ numactl –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 55.243 seconds Lookups/s: 3,077,293 3. weighted interleave numactl –w –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 29.262 seconds Lookups/s: 5,809,513 ===================================================================== LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2 = Existing tests set_mempolicy, get_mempolicy, mbind MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but did not adjust tests for weighting. Basically the weights were set to 1, which is the default, and it should behave the same as MPOL_INTERLEAVE if logic is correct. 
== set_mempolicy01 : passed 18, failed 0 == set_mempolicy02 : passed 10, failed 0 == set_mempolicy03 : passed 64, failed 0 == set_mempolicy04 : passed 32, failed 0 == set_mempolicy05 - n/a on non-x86 == set_mempolicy06 : passed 10, failed 0 this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE == set_mempolicy07 : passed 32, failed 0 set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE == get_mempolicy01 : passed 12, failed 0 change: added MPOL_WEIGHTED_INTERLEAVE == get_mempolicy02 : passed 2, failed 0 == mbind01 : passed 15, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind02 : passed 4, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind03 : passed 16, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind04 : passed 48, failed 0 added MPOL_WEIGHTED_INTERLEAVE ===================================================================== numactl (set_mempolicy) w/ global weighting test numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master command: numactl -w --interleave=0,1 ./eatmem result (weights 1:1): 0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4 7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4 50% distribution is correct result (weights 5:1): 01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4 7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4 16.666% distribution is correct result (weights 1:5): 01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4 7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4 16.666% distribution is correct #include <stdio.h> #include <stdlib.h> #include <string.h> int main (void) { char* mem = malloc(1024*1024*256); memset(mem, 1, 1024*1024*256); for (int i = 0; i < ((1024*1024*256)/4096); i++) { mem = malloc(4096); mem[0] = 1; } printf("done\n"); getchar(); return 0; } This patch (of 4): This patch provides a way to set interleave weight information under sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN. The sysfs structure is designed as follows. $ tree /sys/kernel/mm/mempolicy/ /sys/kernel/mm/mempolicy/ [1] └── weighted_interleave [2] ├── node0 [3] └── node1 Each file above can be explained as follows. [1] mm/mempolicy: configuration interface for mempolicy subsystem [2] weighted_interleave/: config interface for weighted interleave policy [3] weighted_interleave/nodeN: weight for nodeN If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1.
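A short shell sketch of the interface (node0 is an example node; semantics per the description above):

    # cat /sys/kernel/mm/mempolicy/weighted_interleave/node0
    1
    # echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
    # cat /sys/kernel/mm/mempolicy/weighted_interleave/node0
    5
    # echo 0 > /sys/kernel/mm/mempolicy/weighted_interleave/node0

Writing 0 (or an empty string) falls back to the system default, which is 1 as of this patch.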
Link: https://lkml.kernel.org/r/20240202170238.90004-1-gregory.price@memverge.com Link: https://lkml.kernel.org/r/20240202170238.90004-2-gregory.price@memverge.com Suggested-by: "Huang, Ying" Signed-off-by: Rakie Kim Signed-off-by: Honggyu Kim Co-developed-by: Gregory Price Signed-off-by: Gregory Price Co-developed-by: Hyeongtak Ji Signed-off-by: Hyeongtak Ji Reviewed-by: "Huang, Ying" Cc: Dan Williams Cc: Gregory Price Cc: Hasan Al Maruf Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Srinivasulu Thanneeru Signed-off-by: Andrew Morton --- .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + .../sysfs-kernel-mm-mempolicy-weighted-interleave | 25 +++ mm/mempolicy.c | 223 +++++++++++++++++++++ 3 files changed, 252 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy new file mode 100644 index 0000000000000..8ac327fd7fb6e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy @@ -0,0 +1,4 @@ +What: /sys/kernel/mm/mempolicy/ +Date: January 2024 +Contact: Linux memory management mailing list +Description: Interface for Mempolicy diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave new file mode 100644 index 0000000000000..0b7972de04e93 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -0,0 +1,25 @@ +What: /sys/kernel/mm/mempolicy/weighted_interleave/ +Date: January 2024 +Contact: Linux memory management mailing list +Description: Configuration Interface for the Weighted Interleave policy + +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN +Date: January 2024 +Contact: Linux memory management mailing list +Description: Weight configuration interface for nodeN + + The interleave weight for a memory node (N). These weights are + utilized by tasks which have set their mempolicy to + MPOL_WEIGHTED_INTERLEAVE. + + These weights only affect new allocations, and changes at runtime + will not cause migrations on already allocated pages. + + The minimum weight for a node is always 1. + + Minimum weight: 1 + Maximum weight: 255 + + Writing an empty string or `0` will reset the weight to the + system default. The system default may be set by the kernel + or drivers at boot or during hotplug events. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 5e519163c4dcb..b4fccc921b623 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -131,6 +131,32 @@ static struct mempolicy default_policy = { static struct mempolicy preferred_node_policy[MAX_NUMNODES]; +/* + * iw_table is the sysfs-set interleave weight table, a value of 0 denotes + * system-default value should be used. A NULL iw_table also denotes that + * system-default values should be used. Until the system-default table + * is implemented, the system-default is always 1. + * + * iw_table is RCU protected + */ +static u8 __rcu *iw_table; +static DEFINE_MUTEX(iw_table_lock); + +static u8 get_il_weight(int node) +{ + u8 *table; + u8 weight; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* if no iw_table, use system default */ + weight = table ? table[node] : 1; + /* if value in iw_table is 0, use system default */ + weight = weight ? 
weight : 1; + rcu_read_unlock(); + return weight; +} + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -3063,3 +3089,200 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) p += scnprintf(p, buffer + maxlen - p, ":%*pbl", nodemask_pr_args(&nodes)); } + +#ifdef CONFIG_SYSFS +struct iw_node_attr { + struct kobj_attribute kobj_attr; + int nid; +}; + +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct iw_node_attr *node_attr; + u8 weight; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + weight = get_il_weight(node_attr->nid); + return sysfs_emit(buf, "%d\n", weight); +} + +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct iw_node_attr *node_attr; + u8 *new; + u8 *old; + u8 weight = 0; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + if (count == 0 || sysfs_streq(buf, "")) + weight = 0; + else if (kstrtou8(buf, 0, &weight)) + return -EINVAL; + + new = kzalloc(nr_node_ids, GFP_KERNEL); + if (!new) + return -ENOMEM; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + if (old) + memcpy(new, old, nr_node_ids); + new[node_attr->nid] = weight; + rcu_assign_pointer(iw_table, new); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + return count; +} + +static struct iw_node_attr **node_attrs; + +static void sysfs_wi_node_release(struct iw_node_attr *node_attr, + struct kobject *parent) +{ + if (!node_attr) + return; + sysfs_remove_file(parent, &node_attr->kobj_attr.attr); + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); +} + +static void sysfs_wi_release(struct kobject *wi_kobj) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) + sysfs_wi_node_release(node_attrs[i], wi_kobj); + kobject_put(wi_kobj); +} + +static const struct kobj_type wi_ktype = { + .sysfs_ops = &kobj_sysfs_ops, + .release = sysfs_wi_release, +}; + +static int add_weight_node(int nid, struct kobject *wi_kobj) +{ + struct iw_node_attr *node_attr; + char *name; + + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); + if (!node_attr) + return -ENOMEM; + + name = kasprintf(GFP_KERNEL, "node%d", nid); + if (!name) { + kfree(node_attr); + return -ENOMEM; + } + + sysfs_attr_init(&node_attr->kobj_attr.attr); + node_attr->kobj_attr.attr.name = name; + node_attr->kobj_attr.attr.mode = 0644; + node_attr->kobj_attr.show = node_show; + node_attr->kobj_attr.store = node_store; + node_attr->nid = nid; + + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); + pr_err("failed to add attribute to weighted_interleave\n"); + return -ENOMEM; + } + + node_attrs[nid] = node_attr; + return 0; +} + +static int add_weighted_interleave_group(struct kobject *root_kobj) +{ + struct kobject *wi_kobj; + int nid, err; + + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); + if (!wi_kobj) + return -ENOMEM; + + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, + "weighted_interleave"); + if (err) { + kfree(wi_kobj); + return err; + } + + for_each_node_state(nid, N_POSSIBLE) { + err = add_weight_node(nid, wi_kobj); + if (err) { + pr_err("failed to add sysfs [node%d]\n", nid); + break; + } + } + if (err) + kobject_put(wi_kobj); + return 0; +} + +static void mempolicy_kobj_release(struct kobject *kobj) +{ + u8 *old; + + mutex_lock(&iw_table_lock); + old = 
rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + rcu_assign_pointer(iw_table, NULL); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + kfree(node_attrs); + kfree(kobj); +} + +static const struct kobj_type mempolicy_ktype = { + .release = mempolicy_kobj_release +}; + +static int __init mempolicy_sysfs_init(void) +{ + int err; + static struct kobject *mempolicy_kobj; + + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); + if (!mempolicy_kobj) { + err = -ENOMEM; + goto err_out; + } + + node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), + GFP_KERNEL); + if (!node_attrs) { + err = -ENOMEM; + goto mempol_out; + } + + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, + "mempolicy"); + if (err) + goto node_out; + + err = add_weighted_interleave_group(mempolicy_kobj); + if (err) { + pr_err("mempolicy sysfs structure failed to initialize\n"); + kobject_put(mempolicy_kobj); + return err; + } + + return err; +node_out: + kfree(node_attrs); +mempol_out: + kfree(mempolicy_kobj); +err_out: + pr_err("failed to add mempolicy kobject to the system\n"); + return err; +} + +late_initcall(mempolicy_sysfs_init); +#endif /* CONFIG_SYSFS */ -- cgit 1.2.3-korg From fa3bea4e1f8202d787709b7e3654eb0a99aed758 Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Fri, 2 Feb 2024 12:02:37 -0500 Subject: mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. 
Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com Suggested-by: Hasan Al Maruf Signed-off-by: Gregory Price Co-developed-by: Rakie Kim Signed-off-by: Rakie Kim Co-developed-by: Honggyu Kim Signed-off-by: Honggyu Kim Co-developed-by: Hyeongtak Ji Signed-off-by: Hyeongtak Ji Co-developed-by: Srinivasulu Thanneeru Signed-off-by: Srinivasulu Thanneeru Co-developed-by: Ravi Jonnalagadda Signed-off-by: Ravi Jonnalagadda Reviewed-by: "Huang, Ying" Cc: Dan Williams Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Signed-off-by: Andrew Morton --- .../admin-guide/mm/numa_memory_policy.rst | 9 + include/linux/sched.h | 1 + include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 218 ++++++++++++++++++++- 4 files changed, 225 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f9..a70f20ce1ffb4 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node. +MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. 
+ NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/sched.h b/include/linux/sched.h index ffe8f618ab869..b9ce285d8c9c8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1259,6 +1259,7 @@ struct task_struct { /* Protected by alloc_lock: */ struct mempolicy *mempolicy; short il_prev; + u8 il_weight; short pref_node_fork; #endif #ifdef CONFIG_NUMA_BALANCING diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c27..1f9bb10d1a473 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1bdc7d0d1b0b2..a8db92c236974 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. * FIXME: memory is allocated starting with the first node @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -858,8 +869,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) { current->il_prev = MAX_NUMNODES-1; + current->il_weight = 0; + } task_unlock(current); mpol_put(old); ret = 0; @@ -884,6 +898,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -968,6 +983,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { + if (current->il_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1332,7 +1354,8 @@ static long do_mbind(unsigned long start, unsigned long len, * VMAs, the nodes will still be interleaved from the targeted * nodemask, but one by one may be selected differently. 
*/ - if (new->mode == MPOL_INTERLEAVE) { + if (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE) { struct page *page; unsigned int order; unsigned long addr = -EFAULT; @@ -1780,7 +1803,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * @vma: virtual memory area whose policy is sought * @addr: address in @vma for shared policy lookup * @order: 0, or appropriate huge_page_order for interleaving - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or + * MPOL_WEIGHTED_INTERLEAVE * * Returns effective policy for a VMA at specified address. * Falls back to current->mempolicy or system default policy, as necessary. @@ -1797,7 +1821,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1847,6 +1872,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; } +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int node = current->il_prev; + + if (!current->il_weight || !node_isset(node, policy->nodes)) { + node = next_node_in(node, policy->nodes); + /* can only happen if nodemask is being rebound */ + if (node == MAX_NUMNODES) + return node; + current->il_prev = node; + current->il_weight = get_il_weight(node); + } + current->il_weight--; + return node; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1881,6 +1922,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1919,6 +1963,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, return nodes_weight(*mask); } +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask; + unsigned int target, nr_nodes; + u8 *table; + unsigned int weight_total = 0; + u8 weight; + int nid; + + nr_nodes = read_once_policy_nodemask(pol, &nodemask); + if (!nr_nodes) + return numa_node_id(); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* calculate the total weight */ + for_each_node_mask(nid, nodemask) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + weight_total += weight; + } + + /* Calculate the node offset based on totals */ + target = ilx % weight_total; + nid = first_node(nodemask); + while (target) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + if (target < weight) + break; + target -= weight; + nid = next_node_in(nid, nodemask); + } + rcu_read_unlock(); + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1979,6 +2062,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, *nid = (ilx == NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? 
+ weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } return nodemask; @@ -2040,6 +2128,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2140,6 +2229,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * node in its nodemask, we allocate the standard way. */ if (pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2275,6 +2365,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated = 0; + unsigned long rounds; + unsigned long node_pages, delta; + u8 *table, *weights, weight; + unsigned int weight_total = 0; + unsigned long rem_pages = nr_pages; + nodemask_t nodes; + int nnodes, node; + int resume_node = MAX_NUMNODES - 1; + u8 resume_weight = 0; + int prev_node; + int i; + + if (!nr_pages) + return 0; + + nnodes = read_once_policy_nodemask(pol, &nodes); + if (!nnodes) + return 0; + + /* Continue allocating from most recent node and adjust the nr_pages */ + node = me->il_prev; + weight = me->il_weight; + if (weight && node_isset(node, nodes)) { + node_pages = min(rem_pages, weight); + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <= weight) { + me->il_weight -= rem_pages; + return total_allocated; + } + /* Otherwise we adjust remaining pages, continue from there */ + rem_pages -= weight; + } + /* clear active weight in case of an allocation failure */ + me->il_weight = 0; + prev_node = node; + + /* create a local copy of node weights to operate on outside rcu */ + weights = kzalloc(nr_node_ids, GFP_KERNEL); + if (!weights) + return total_allocated; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + if (table) + memcpy(weights, table, nr_node_ids); + rcu_read_unlock(); + + /* calculate total, detect system default usage */ + for_each_node_mask(node, nodes) { + if (!weights[node]) + weights[node] = 1; + weight_total += weights[node]; + } + + /* + * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. + * Track which node weighted interleave should resume from. + * + * if (rounds > 0) and (delta == 0), resume_node will always be + * the node following prev_node and its weight. 
+ */ + rounds = rem_pages / weight_total; + delta = rem_pages % weight_total; + resume_node = next_node_in(prev_node, nodes); + resume_weight = weights[resume_node]; + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[node]; + node_pages = weight * rounds; + /* If a delta exists, add this node's portion of the delta */ + if (delta > weight) { + node_pages += weight; + delta -= weight; + } else if (delta) { + /* when delta is depleted, resume from that node */ + node_pages += delta; + resume_node = node; + resume_weight = weight - delta; + delta = 0; + } + /* node_pages can be 0 if an allocation fails and rounds == 0 */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + if (total_allocated == nr_pages) + break; + prev_node = node; + } + me->il_prev = resume_node; + me->il_weight = resume_weight; + kfree(weights); + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2315,6 +2513,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave( + gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2390,6 +2592,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2526,6 +2729,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2900,6 +3107,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2959,6 +3167,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3069,6 +3278,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default: -- cgit 1.2.3-korg From d8310914848223de7ec04d55bd15f013f0dad803 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Mon, 5 Feb 2024 14:09:21 +0800 Subject: kasan: docs: update descriptions about test file and module After commit f7e01ab828fd ("kasan: move tests to mm/kasan/"), the test file is renamed to mm/kasan/kasan_test.c and the test module is renamed to kasan_test.ko, so update the descriptions in the document. While at it, update the line number and testcase number when the tests kmalloc_large_oob_right and kmalloc_double_kzfree failed to sync with the current code in mm/kasan/kasan_test.c. 
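As a paraphrased illustration (not the verbatim kernel source) of where the two failure messages quoted in the patch below come from, a KASAN KUnit test in mm/kasan/kasan_test.c is structured roughly like this: a KUNIT_ASSERT_* failure fires when the setup ``kmalloc()`` itself fails, while ``KUNIT_EXPECT_KASAN_FAIL`` (a helper macro local to that file) complains when an access that should trigger a KASAN report does not::

    #include <kunit/test.h>
    #include <linux/slab.h>

    static void kmalloc_oob_sketch(struct kunit *test)
    {
            char *ptr;
            size_t size = 128;

            ptr = kmalloc(size, GFP_KERNEL);
            /* a failed kmalloc() is reported as "ASSERTION FAILED ...
             * Expected ptr is not null, but is" at this line */
            KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);

            /* if the out-of-bounds write does not produce a KASAN report,
             * this expectation prints "KASAN failure expected ... but
             * none occurred" */
            KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size] = 'x');

            kfree(ptr);
    }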
Link: https://lkml.kernel.org/r/20240205060925.15594-2-yangtiezhu@loongson.cn Signed-off-by: Tiezhu Yang Acked-by: Marco Elver Reviewed-by: Andrey Konovalov Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/dev-tools/kasan.rst | 20 ++++++++++---------- Documentation/translations/zh_CN/dev-tools/kasan.rst | 20 ++++++++++---------- Documentation/translations/zh_TW/dev-tools/kasan.rst | 20 ++++++++++---------- 3 files changed, 30 insertions(+), 30 deletions(-) (limited to 'Documentation') diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 858c77fe7dc46..a5a6dbe9029f4 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -169,7 +169,7 @@ Error reports A typical KASAN report looks like this:: ================================================================== - BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan] + BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [kasan_test] Write of size 1 at addr ffff8801f44ec37b by task insmod/2760 CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698 @@ -179,8 +179,8 @@ A typical KASAN report looks like this:: print_address_description+0x73/0x280 kasan_report+0x144/0x187 __asan_report_store1_noabort+0x17/0x20 - kmalloc_oob_right+0xa8/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0xa8/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -200,8 +200,8 @@ A typical KASAN report looks like this:: save_stack+0x43/0xd0 kasan_kmalloc+0xa7/0xd0 kmem_cache_alloc_trace+0xe1/0x1b0 - kmalloc_oob_right+0x56/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0x56/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -510,15 +510,15 @@ When a test passes:: When a test fails due to a failed ``kmalloc``:: - # kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163 + # kmalloc_large_oob_right: ASSERTION FAILED at mm/kasan/kasan_test.c:245 Expected ptr is not null, but is - not ok 4 - kmalloc_large_oob_right + not ok 5 - kmalloc_large_oob_right When a test fails due to a missing KASAN report:: - # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:974 + # kmalloc_double_kzfree: EXPECTATION FAILED at mm/kasan/kasan_test.c:709 KASAN failure expected in "kfree_sensitive(ptr)", but none occurred - not ok 44 - kmalloc_double_kzfree + not ok 28 - kmalloc_double_kzfree At the end the cumulative status of all KASAN tests is printed. On success:: @@ -534,7 +534,7 @@ There are a few ways to run KUnit-compatible KASAN tests. 1. Loadable module With ``CONFIG_KUNIT`` enabled, KASAN-KUnit tests can be built as a loadable - module and run by loading ``test_kasan.ko`` with ``insmod`` or ``modprobe``. + module and run by loading ``kasan_test.ko`` with ``insmod`` or ``modprobe``. 2. 
Built-In diff --git a/Documentation/translations/zh_CN/dev-tools/kasan.rst b/Documentation/translations/zh_CN/dev-tools/kasan.rst index 8fdb20c9665b4..2b1e8f74904b0 100644 --- a/Documentation/translations/zh_CN/dev-tools/kasan.rst +++ b/Documentation/translations/zh_CN/dev-tools/kasan.rst @@ -137,7 +137,7 @@ KASAN受到通用 ``panic_on_warn`` 命令行参数的影响。当它被启用 典型的KASAN报告如下所示:: ================================================================== - BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan] + BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [kasan_test] Write of size 1 at addr ffff8801f44ec37b by task insmod/2760 CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698 @@ -147,8 +147,8 @@ KASAN受到通用 ``panic_on_warn`` 命令行参数的影响。当它被启用 print_address_description+0x73/0x280 kasan_report+0x144/0x187 __asan_report_store1_noabort+0x17/0x20 - kmalloc_oob_right+0xa8/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0xa8/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -168,8 +168,8 @@ KASAN受到通用 ``panic_on_warn`` 命令行参数的影响。当它被启用 save_stack+0x43/0xd0 kasan_kmalloc+0xa7/0xd0 kmem_cache_alloc_trace+0xe1/0x1b0 - kmalloc_oob_right+0x56/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0x56/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -421,15 +421,15 @@ KASAN连接到vmap基础架构以懒清理未使用的影子内存。 当由于 ``kmalloc`` 失败而导致测试失败时:: - # kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163 + # kmalloc_large_oob_right: ASSERTION FAILED at mm/kasan/kasan_test.c:245 Expected ptr is not null, but is - not ok 4 - kmalloc_large_oob_right + not ok 5 - kmalloc_large_oob_right 当由于缺少KASAN报告而导致测试失败时:: - # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:974 + # kmalloc_double_kzfree: EXPECTATION FAILED at mm/kasan/kasan_test.c:709 KASAN failure expected in "kfree_sensitive(ptr)", but none occurred - not ok 44 - kmalloc_double_kzfree + not ok 28 - kmalloc_double_kzfree 最后打印所有KASAN测试的累积状态。成功:: @@ -445,7 +445,7 @@ KASAN连接到vmap基础架构以懒清理未使用的影子内存。 1. 可加载模块 启用 ``CONFIG_KUNIT`` 后,KASAN-KUnit测试可以构建为可加载模块,并通过使用 - ``insmod`` 或 ``modprobe`` 加载 ``test_kasan.ko`` 来运行。 + ``insmod`` 或 ``modprobe`` 加载 ``kasan_test.ko`` 来运行。 2. 
内置 diff --git a/Documentation/translations/zh_TW/dev-tools/kasan.rst b/Documentation/translations/zh_TW/dev-tools/kasan.rst index 979eb84bc58f1..ed342e67d8ed0 100644 --- a/Documentation/translations/zh_TW/dev-tools/kasan.rst +++ b/Documentation/translations/zh_TW/dev-tools/kasan.rst @@ -137,7 +137,7 @@ KASAN受到通用 ``panic_on_warn`` 命令行參數的影響。當它被啓用 典型的KASAN報告如下所示:: ================================================================== - BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan] + BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [kasan_test] Write of size 1 at addr ffff8801f44ec37b by task insmod/2760 CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698 @@ -147,8 +147,8 @@ KASAN受到通用 ``panic_on_warn`` 命令行參數的影響。當它被啓用 print_address_description+0x73/0x280 kasan_report+0x144/0x187 __asan_report_store1_noabort+0x17/0x20 - kmalloc_oob_right+0xa8/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0xa8/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -168,8 +168,8 @@ KASAN受到通用 ``panic_on_warn`` 命令行參數的影響。當它被啓用 save_stack+0x43/0xd0 kasan_kmalloc+0xa7/0xd0 kmem_cache_alloc_trace+0xe1/0x1b0 - kmalloc_oob_right+0x56/0xbc [test_kasan] - kmalloc_tests_init+0x16/0x700 [test_kasan] + kmalloc_oob_right+0x56/0xbc [kasan_test] + kmalloc_tests_init+0x16/0x700 [kasan_test] do_one_initcall+0xa5/0x3ae do_init_module+0x1b6/0x547 load_module+0x75df/0x8070 @@ -421,15 +421,15 @@ KASAN連接到vmap基礎架構以懶清理未使用的影子內存。 當由於 ``kmalloc`` 失敗而導致測試失敗時:: - # kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163 + # kmalloc_large_oob_right: ASSERTION FAILED at mm/kasan/kasan_test.c:245 Expected ptr is not null, but is - not ok 4 - kmalloc_large_oob_right + not ok 5 - kmalloc_large_oob_right 當由於缺少KASAN報告而導致測試失敗時:: - # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:974 + # kmalloc_double_kzfree: EXPECTATION FAILED at mm/kasan/kasan_test.c:709 KASAN failure expected in "kfree_sensitive(ptr)", but none occurred - not ok 44 - kmalloc_double_kzfree + not ok 28 - kmalloc_double_kzfree 最後打印所有KASAN測試的累積狀態。成功:: @@ -445,7 +445,7 @@ KASAN連接到vmap基礎架構以懶清理未使用的影子內存。 1. 可加載模塊 啓用 ``CONFIG_KUNIT`` 後,KASAN-KUnit測試可以構建爲可加載模塊,並通過使用 - ``insmod`` 或 ``modprobe`` 加載 ``test_kasan.ko`` 來運行。 + ``insmod`` 或 ``modprobe`` 加載 ``kasan_test.ko`` 來運行。 2. 內置 -- cgit 1.2.3-korg From b9ad003af13a1fe34319da6c2082038bce833831 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Tue, 6 Feb 2024 10:27:31 +0530 Subject: mm/cma: add sysfs file 'release_pages_success' This adds the following new sysfs file tracking the number of successfully released pages from a given CMA heap area. This file will be available via CONFIG_CMA_SYSFS and help in determining active CMA pages available on the CMA heap area. This adds a new 'nr_pages_released' (CONFIG_CMA_SYSFS) into 'struct cma' which gets updated during cma_release(). /sys/kernel/mm/cma//release_pages_success After this change, an user will be able to find active CMA pages available in a given CMA heap area via the following method. Active pages = alloc_pages_success - release_pages_success That's valuable information for both software designers, and system admins as it allows them to tune the number of CMA pages available in the system. This increases user visibility for allocated CMA area and its utilization. 
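The "active pages" arithmetic above is easy to script against the two counters; a minimal user-space sketch (the heap name ``reserved`` is a placeholder for an actual CMA heap name on the system)::

    #include <stdio.h>

    static unsigned long long read_counter(const char *path)
    {
            unsigned long long val = 0;
            FILE *f = fopen(path, "r");

            if (f) {
                    if (fscanf(f, "%llu", &val) != 1)
                            val = 0;
                    fclose(f);
            }
            return val;
    }

    int main(void)
    {
            /* placeholder heap name; see /sys/kernel/mm/cma/ for real ones */
            const char *base = "/sys/kernel/mm/cma/reserved";
            char path[256];
            unsigned long long alloc, release;

            snprintf(path, sizeof(path), "%s/alloc_pages_success", base);
            alloc = read_counter(path);
            snprintf(path, sizeof(path), "%s/release_pages_success", base);
            release = read_counter(path);

            printf("active CMA pages: %llu\n", alloc - release);
            return 0;
    }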
Link: https://lkml.kernel.org/r/20240206045731.472759-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-cma | 6 ++++++ mm/cma.c | 1 + mm/cma.h | 5 +++++ mm/cma_sysfs.c | 15 +++++++++++++++ 4 files changed, 27 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma b/Documentation/ABI/testing/sysfs-kernel-mm-cma index 02b2bb60c2969..dfd755201142f 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-cma +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma @@ -23,3 +23,9 @@ Date: Feb 2021 Contact: Minchan Kim Description: the number of pages CMA API failed to allocate + +What: /sys/kernel/mm/cma//release_pages_success +Date: Feb 2024 +Contact: Anshuman Khandual +Description: + the number of pages CMA API succeeded to release diff --git a/mm/cma.c b/mm/cma.c index 4902bbfe24f12..01f5a8f71ddfa 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -562,6 +562,7 @@ bool cma_release(struct cma *cma, const struct page *pages, free_contig_range(pfn, count); cma_clear_bitmap(cma, pfn, count); + cma_sysfs_account_release_pages(cma, count); trace_cma_release(cma->name, pfn, pages, count); return true; diff --git a/mm/cma.h b/mm/cma.h index 88a0595670b76..ad61cc6dd4396 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -27,6 +27,8 @@ struct cma { atomic64_t nr_pages_succeeded; /* the number of CMA page allocation failures */ atomic64_t nr_pages_failed; + /* the number of CMA page released */ + atomic64_t nr_pages_released; /* kobject requires dynamic object */ struct cma_kobject *cma_kobj; #endif @@ -44,10 +46,13 @@ static inline unsigned long cma_bitmap_maxno(struct cma *cma) #ifdef CONFIG_CMA_SYSFS void cma_sysfs_account_success_pages(struct cma *cma, unsigned long nr_pages); void cma_sysfs_account_fail_pages(struct cma *cma, unsigned long nr_pages); +void cma_sysfs_account_release_pages(struct cma *cma, unsigned long nr_pages); #else static inline void cma_sysfs_account_success_pages(struct cma *cma, unsigned long nr_pages) {}; static inline void cma_sysfs_account_fail_pages(struct cma *cma, unsigned long nr_pages) {}; +static inline void cma_sysfs_account_release_pages(struct cma *cma, + unsigned long nr_pages) {}; #endif #endif diff --git a/mm/cma_sysfs.c b/mm/cma_sysfs.c index 56347d15b7e8b..f50db39731718 100644 --- a/mm/cma_sysfs.c +++ b/mm/cma_sysfs.c @@ -24,6 +24,11 @@ void cma_sysfs_account_fail_pages(struct cma *cma, unsigned long nr_pages) atomic64_add(nr_pages, &cma->nr_pages_failed); } +void cma_sysfs_account_release_pages(struct cma *cma, unsigned long nr_pages) +{ + atomic64_add(nr_pages, &cma->nr_pages_released); +} + static inline struct cma *cma_from_kobj(struct kobject *kobj) { return container_of(kobj, struct cma_kobject, kobj)->cma; @@ -48,6 +53,15 @@ static ssize_t alloc_pages_fail_show(struct kobject *kobj, } CMA_ATTR_RO(alloc_pages_fail); +static ssize_t release_pages_success_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct cma *cma = cma_from_kobj(kobj); + + return sysfs_emit(buf, "%llu\n", atomic64_read(&cma->nr_pages_released)); +} +CMA_ATTR_RO(release_pages_success); + static void cma_kobj_release(struct kobject *kobj) { struct cma *cma = cma_from_kobj(kobj); @@ -60,6 +74,7 @@ static void cma_kobj_release(struct kobject *kobj) static struct attribute *cma_attrs[] = { &alloc_pages_success_attr.attr, &alloc_pages_fail_attr.attr, + &release_pages_success_attr.attr, NULL, }; ATTRIBUTE_GROUPS(cma); -- cgit 1.2.3-korg From 
0a1ebc17a710011486e919983e03837d450ff2ee Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Fri, 16 Feb 2024 16:58:38 -0800
Subject: Docs/mm/damon/maintainer-profile: fix reference links for mm-[un]stable tree

Patch series "Docs/mm/damon: misc readability improvements".

Fix trivial mistakes and improve the layout of information on different DAMON documents.

This patch (of 5):

A couple of sentences on maintainer-profile.rst have reference links for the mm-unstable and mm-stable trees with wrong rst markup.  Fix those.

Link: https://lkml.kernel.org/r/20240217005842.87348-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240217005842.87348-2-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/maintainer-profile.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst
index a84c14e590530..5a306e4de22e5 100644
--- a/Documentation/mm/damon/maintainer-profile.rst
+++ b/Documentation/mm/damon/maintainer-profile.rst
@@ -21,8 +21,8 @@ be queued in mm-stable [3]_ , and finally pull-requested to the mainline by the
 memory management subsystem maintainer.
 
 Note again the patches for review should be made against the mm-unstable
-tree[1] whenever possible.  damon/next is only for preview of others' works in
-progress.
+tree [1]_ whenever possible.  damon/next is only for preview of others' works
+in progress.
 
 Submit checklist addendum
 -------------------------
@@ -41,8 +41,8 @@ Further doing below and putting the results will be helpful.
 Key cycle dates
 ---------------
 
-Patches can be sent anytime.  Key cycle dates of the mm-unstable[1] and
-mm-stable[3] trees depend on the memory management subsystem maintainer.
+Patches can be sent anytime.  Key cycle dates of the mm-unstable [1]_ and
+mm-stable [3]_ trees depend on the memory management subsystem maintainer.
 
 Review cadence
 --------------
-- cgit 1.2.3-korg

From 5b7708e6a85574db9fd0d82caba1c38e01723d64 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Fri, 16 Feb 2024 16:58:39 -0800
Subject: Docs/mm/damon: move the list of DAMOS actions to design doc

DAMOS operation actions are explained nearly twice in the DAMON usage document: once for the sysfs interface, and then again for the debugfs interface.  Duplication is bad.  Also, it would be better to keep this kind of concept-level detail in the design document, keeping the usage document small and focused only on usage.  Move the list to the design document and update the usage document to reference it.

Link: https://lkml.kernel.org/r/20240217005842.87348-3-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 47 ++++++++--------------------
 Documentation/mm/damon/design.rst            | 26 +++++++++++++--
 2 files changed, 36 insertions(+), 37 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 58c34e66b31b2..0335d584956b5 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -302,27 +302,8 @@ In each scheme directory, five directories (``access_pattern``, ``quotas``,
 
 The ``action`` file is for setting and getting the scheme's :ref:`action
 <damon_design_damos_action>`.  The keywords that can be written to and read
-from the file and their meaning are as below.
-
-Note that support of each action depends on the running DAMON operations set
-:ref:`implementation <damon_operations_set>`.
-
- - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
-   Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
-   Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
-   Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
- - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
-   Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
-   Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``lru_prio``: Prioritize the region on its LRU lists.
-   Supported by ``paddr`` operations set.
- - ``lru_deprio``: Deprioritize the region on its LRU lists.
-   Supported by ``paddr`` operations set.
- - ``stat``: Do nothing but count the statistics.
-   Supported by all operations sets.
+from the file and their meaning are the same as those of the list on the
+:ref:`design doc <damon_design_damos_action>`.
 
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval ` in microseconds.
@@ -763,19 +744,17 @@ Action
 ~~~~~~
 
 The ``<action>`` is a predefined integer for memory management :ref:`actions
-<damon_design_damos_action>`.  The supported numbers and their meanings are as
-below.
-
- - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``.  Ignored if
-   ``target`` is ``paddr``.
- - 1: Call ``madvise()`` for the region with ``MADV_COLD``.  Ignored if
-   ``target`` is ``paddr``.
- - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
- - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.  Ignored if
-   ``target`` is ``paddr``.
- - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.  Ignored if
-   ``target`` is ``paddr``.
- - 5: Do nothing but count the statistics
+<damon_design_damos_action>`.  The mapping between the ``<action>`` values and
+the memory management actions is as below.  For the detailed meaning of each
+action and the DAMON operations sets supporting it, please refer to the list
+on the :ref:`design doc <damon_design_damos_action>`.
+
+ - 0: ``willneed``
+ - 1: ``cold``
+ - 2: ``pageout``
+ - 3: ``hugepage``
+ - 4: ``nohugepage``
+ - 5: ``stat``
 
 Quota
 ~~~~~
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 1bb69524a62ea..9f16c4e62e724 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -294,9 +294,29 @@ not mandated to support all actions of the list.  Hence, the availability of
 specific DAMOS action depends on what operations set is selected to be used
 together.
 
-Applying an action to a region is considered as changing the region's
-characteristics.  Hence, DAMOS resets the age of regions when an action is
-applied to those.
+The list of the supported actions, their meaning, and the DAMON operations
+sets that support each action are as below.
+
+ - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
+   Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
+   Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+   Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
+ - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
+   Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
+   Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``lru_prio``: Prioritize the region on its LRU lists.
+   Supported by ``paddr`` operations set.
+ - ``lru_deprio``: Deprioritize the region on its LRU lists.
+   Supported by ``paddr`` operations set.
+ - ``stat``: Do nothing but count the statistics.
+   Supported by all operations sets.
+
+Applying the actions except ``stat`` to a region is considered as changing the
+region's characteristics.  Hence, DAMOS resets the age of regions when any
+such actions are applied to those.
 
 .. _damon_design_damos_access_pattern:
-- cgit 1.2.3-korg

From 669971b406f0d4a6ffe0816ec1281cbf8f99e307 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Fri, 16 Feb 2024 16:58:40 -0800
Subject: Docs/mm/damon: move DAMON operation sets list from the usage to the design document

The list of DAMON operation sets and their explanations is written in the usage document, though it would fit better in the design document.  Move the detail to the design document and make the usage document only reference the design document.

[sj@kernel.org: fix a typo on a reference link]
Link: https://lkml.kernel.org/r/20240221170852.55529-2-sj@kernel.org
Link: https://lkml.kernel.org/r/20240217005842.87348-4-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 19 +++++++------------
 Documentation/mm/damon/design.rst            | 12 ++++++++++--
 2 files changed, 17 insertions(+), 14 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 0335d584956b5..68d00e8f01400 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -180,19 +180,14 @@ In each context directory, two files (``avail_operations`` and ``operations``)
 and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
 exist.
 
-DAMON supports multiple types of monitoring operations, including those for
-virtual address space and the physical address space.  You can get the list of
-available monitoring operations set on the currently running kernel by reading
+DAMON supports multiple types of :ref:`monitoring operations
+<damon_operations_set>`, including those for virtual address
+space and the physical address space.  You can get the list of available
+monitoring operations set on the currently running kernel by reading
 ``avail_operations`` file.  Based on the kernel configuration, the file will
-list some or all of below keywords.
-
- - vaddr: Monitor virtual address spaces of specific processes
- - fvaddr: Monitor fixed virtual address ranges
- - paddr: Monitor the physical address space of the system
-
-Please refer to :ref:`regions sysfs directory ` for detailed
-differences between the operations sets in terms of the monitoring target
-regions.
+list different available operation sets.  Please refer to the :ref:`design
+<damon_operations_set>` for the list of all available operation sets and
+their brief explanations.
 
 You can set and get what type of monitoring operations DAMON will use for the
 context by writing one of the keywords listed in ``avail_operations`` file and
 
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 9f16c4e62e724..6abf976dd71fd 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -31,6 +31,8 @@ DAMON subsystem is configured with three layers including interfaces for the
 user space, on top of the core layer.
 
+.. _damon_design_configurable_operations_set:
+
 Configurable Operations Set
 ---------------------------
 
@@ -63,6 +65,8 @@ modules that built on top of the core layer using the API, which can be easily
 used by the user space end users.
 
+.. _damon_operations_set:
+
 Operations Set Layer
 ====================
 
@@ -71,8 +75,12 @@ The monitoring operations are defined in two parts:
 1. Identification of the monitoring target address range for the address space.
 2. Access check of specific address range in the target space.
 
-DAMON currently provides the implementations of the operations for the physical
-and virtual address spaces.  Below two subsections describe how those work.
+DAMON currently provides the below three operation sets.  The below two
+subsections describe how those work.
+
+ - vaddr: Monitor virtual address spaces of specific processes
+ - fvaddr: Monitor fixed virtual address ranges
+ - paddr: Monitor the physical address space of the system
 
 
 VMA-based Target Address Range Construction
-- cgit 1.2.3-korg

From 2d89957c93667d160de6b43771dc20946bbe9805 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Fri, 16 Feb 2024 16:58:41 -0800
Subject: Docs/mm/damon: move monitoring target regions setup detail from the usage to the design document

The design doc is aimed at holding all concept-level details, while the usage doc focuses only on how the features can be used.  Some details about the monitoring target regions construction are in the usage doc.  Move the details about the monitoring target regions construction differences between the DAMON operations sets from the usage to the design doc.

Link: https://lkml.kernel.org/r/20240217005842.87348-5-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 16 +++++-----------
 Documentation/mm/damon/design.rst            | 12 +++++++++---
 2 files changed, 14 insertions(+), 14 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 68d00e8f01400..ae5b986a59762 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -242,17 +242,11 @@ process to the ``pid_target`` file.
 
 targets/<N>/regions
 -------------------
 
-When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
-the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
-monitoring target regions so that entire memory mappings of target processes
-can be covered.  However, users could want to set the initial monitoring region
-to specific address ranges.
-
-In contrast, DAMON do not automatically sets and updates the monitoring target
-regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
-(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
-Therefore, users should set the monitoring target regions by themselves in the
-cases.
+In case of ``fvaddr`` or ``paddr`` monitoring operations sets, users are
+required to set the monitoring target address ranges.  In case of ``vaddr``
+operations set, it is not mandatory, but users can optionally set the initial
+monitoring region to specific address ranges.  Please refer to the
+:ref:`design <damon_design_vaddr_target_regions_construction>` for more
+details.
 
 For such cases, users can explicitly set the initial monitoring target regions
 as they want, by writing proper values to the files under this directory.
 
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 6abf976dd71fd..2bd0c203dcfb7 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -83,12 +83,18 @@ describe how those work.
 
 - paddr: Monitor the physical address space of the system
 
+.. _damon_design_vaddr_target_regions_construction:
+
 VMA-based Target Address Range Construction
 -------------------------------------------
 
-This is only for the virtual address space monitoring operations
-implementation.  That for the physical address space simply asks users to
-manually set the monitoring target address ranges.
+A mechanism of the ``vaddr`` DAMON operations set that automatically
+initializes and updates the monitoring target address regions so that the
+entire memory mappings of the target processes can be covered.
+
+This mechanism is only for the ``vaddr`` operations set.  In cases of the
+``fvaddr`` and ``paddr`` operation sets, users are asked to manually set the
+monitoring target address ranges.
 
 Only small parts in the super-huge virtual address space of the processes are
 mapped to the physical memory and accessed.  Thus, tracking the unmapped
-- cgit 1.2.3-korg

From 7d8cebb9630af71f04cb27314a4effbc0f4f8648 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Fri, 16 Feb 2024 16:58:42 -0800
Subject: Docs/admin-guide/mm/damon/usage: fix wrong quotas disabling condition

After the introduction of DAMOS quota goals, DAMOS quotas are no longer disabled when both the size and time quotas are zero but a quota goal is set.  The new rule also applies to the DAMON sysfs interface, but the usage doc has not been updated.  Update it.

Link: https://lkml.kernel.org/r/20240217005842.87348-6-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index ae5b986a59762..22254997723c3 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -329,7 +329,8 @@ respectively.  Then, DAMON tries to use only up to ``time quota`` milliseconds
 for applying the ``action`` to memory regions of the ``access_pattern``, and
 to apply the action to only up to ``bytes`` bytes of memory regions within the
 ``reset_interval_ms``.  Setting both ``ms`` and ``bytes`` zero disables the
-quota limits.
+quota limits unless at least one :ref:`goal ` is
+set.
 
 Under ``weights`` directory, three files (``sz_permil``,
 ``nr_accesses_permil``, and ``age_permil``) exist.
-- cgit 1.2.3-korg

From ba6fe53772447968194a9c182f082b33ac1c8daa Mon Sep 17 00:00:00 2001
From: Oscar Salvador
Date: Thu, 15 Feb 2024 22:59:07 +0100
Subject: mm,page_owner: update Documentation regarding page_owner_stacks

Update the page_owner documentation to include the new page_owner_stacks feature and show how it can be used.
Link: https://lkml.kernel.org/r/20240215215907.20121-8-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: Vlastimil Babka Reviewed-by: Marco Elver Acked-by: Andrey Konovalov Cc: Alexander Potapenko Cc: Michal Hocko Signed-off-by: Andrew Morton --- Documentation/mm/page_owner.rst | 45 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) (limited to 'Documentation') diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst index 62e3f7ab23cc1..0d0334cd51798 100644 --- a/Documentation/mm/page_owner.rst +++ b/Documentation/mm/page_owner.rst @@ -24,6 +24,11 @@ fragmentation statistics can be obtained through gfp flag information of each page. It is already implemented and activated if page owner is enabled. Other usages are more than welcome. +It can also be used to show all the stacks and their outstanding +allocations, which gives us a quick overview of where the memory is going +without the need to screen through all the pages and match the allocation +and free operation. + page owner is disabled by default. So, if you'd like to use it, you need to add "page_owner=on" to your boot cmdline. If the kernel is built with page owner and page owner is disabled in runtime due to not enabling @@ -68,6 +73,46 @@ Usage 4) Analyze information from page owner:: + cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt + cat stacks.txt + prep_new_page+0xa9/0x120 + get_page_from_freelist+0x7e6/0x2140 + __alloc_pages+0x18a/0x370 + new_slab+0xc8/0x580 + ___slab_alloc+0x1f2/0xaf0 + __slab_alloc.isra.86+0x22/0x40 + kmem_cache_alloc+0x31b/0x350 + __khugepaged_enter+0x39/0x100 + dup_mmap+0x1c7/0x5ce + copy_process+0x1afe/0x1c90 + kernel_clone+0x9a/0x3c0 + __do_sys_clone+0x66/0x90 + do_syscall_64+0x7f/0x160 + entry_SYSCALL_64_after_hwframe+0x6c/0x74 + stack_count: 234 + ... + ... + echo 7000 > /sys/kernel/debug/page_owner_stacks/count_threshold + cat /sys/kernel/debug/page_owner_stacks/show_stacks> stacks_7000.txt + cat stacks_7000.txt + prep_new_page+0xa9/0x120 + get_page_from_freelist+0x7e6/0x2140 + __alloc_pages+0x18a/0x370 + alloc_pages_mpol+0xdf/0x1e0 + folio_alloc+0x14/0x50 + filemap_alloc_folio+0xb0/0x100 + page_cache_ra_unbounded+0x97/0x180 + filemap_fault+0x4b4/0x1200 + __do_fault+0x2d/0x110 + do_pte_missing+0x4b0/0xa30 + __handle_mm_fault+0x7fa/0xb70 + handle_mm_fault+0x125/0x300 + do_user_addr_fault+0x3c9/0x840 + exc_page_fault+0x68/0x150 + asm_exc_page_fault+0x22/0x30 + stack_count: 8248 + ... + cat /sys/kernel/debug/page_owner > page_owner_full.txt ./page_owner_sort page_owner_full.txt sorted_page_owner.txt -- cgit 1.2.3-korg From 55c49fee57af99f3c663e69dedc5b85e691bbe50 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 2 Jan 2024 19:46:27 +0100 Subject: mm/vmalloc: remove vmap_area_list Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile get the base address of vmalloc area. Now, vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list. [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()] Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com Signed-off-by: Baoquan He Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Lorenzo Stoakes Cc: Christoph Hellwig Cc: Dave Chinner Cc: Joel Fernandes (Google) Cc: Kazuhito Hagio Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Oleksiy Avramchenko Cc: Paul E. 
McKenney
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
 arch/arm64/kernel/crash_core.c                 | 1 -
 arch/riscv/kernel/crash_core.c                 | 1 -
 include/linux/vmalloc.h                        | 1 -
 kernel/crash_core.c                            | 4 +---
 kernel/kallsyms_selftest.c                     | 1 -
 mm/nommu.c                                     | 2 --
 mm/vmalloc.c                                   | 2 --
 8 files changed, 5 insertions(+), 15 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index bced9e4b6e089..0f714fc945acf 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
 the kernel start address. Used to convert a virtual address from the
 direct kernel map to a physical address.
 
-vmap_area_list
---------------
+VMALLOC_START
+-------------
 
-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of the vmalloc area. makedumpfile gets this value
+since it is necessary for vmalloc translation.
 
 mem_map
 -------
diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index 66cde752cd740..2a24199a9b81e 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
 	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
 	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
 	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
index 8706736fd4e2d..d18d529fd9b98 100644
--- a/arch/riscv/kernel/crash_core.c
+++ b/arch/riscv/kernel/crash_core.c
@@ -8,7 +8,6 @@ void arch_crash_save_vmcoreinfo(void)
 	VMCOREINFO_NUMBER(phys_ram_base);
 
 	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 #ifdef CONFIG_MMU
 	VMCOREINFO_NUMBER(VA_BITS);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8ddd..91810b4e95107 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
 
 /*
  * Internals.  Don't use..
 */
-extern struct list_head vmap_area_list;
 extern __init void vm_area_add_early(struct vm_struct *vm);
 extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 75cd6a736d030..49b31e59d3ccd 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -748,7 +748,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
 #endif
 	VMCOREINFO_SYMBOL(_stext);
-	VMCOREINFO_SYMBOL(vmap_area_list);
+	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", (unsigned long) VMALLOC_START);
 
 #ifndef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(mem_map);
@@ -789,8 +789,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(free_area, free_list);
 	VMCOREINFO_OFFSET(list_head, next);
 	VMCOREINFO_OFFSET(list_head, prev);
-	VMCOREINFO_OFFSET(vmap_area, va_start);
-	VMCOREINFO_OFFSET(vmap_area, list);
 	VMCOREINFO_LENGTH(zone.free_area, NR_PAGE_ORDERS);
 	log_buf_vmcoreinfo_setup();
 	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index b4cac76ea5e98..8a689b4ff4f98 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -89,7 +89,6 @@ static struct test_item test_items[] = {
 	ITEM_DATA(kallsyms_test_var_data_static),
 	ITEM_DATA(kallsyms_test_var_bss),
 	ITEM_DATA(kallsyms_test_var_data),
-	ITEM_DATA(vmap_area_list),
 #endif
 };
diff --git a/mm/nommu.c b/mm/nommu.c
index b6dc558d31440..5ec8f44e7ce97 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL(follow_pfn);
 
-LIST_HEAD(vmap_area_list);
-
 void vfree(const void *addr)
 {
 	kfree(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 86efebf0e0c8a..b5882790da008 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,8 +729,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 
 static DEFINE_SPINLOCK(free_vmap_area_lock);
-/* Export for kexec only */
-LIST_HEAD(vmap_area_list);
 static bool vmap_initialized __read_mostly;
 
 static struct rb_root purge_vmap_area_root = RB_ROOT;
-- cgit 1.2.3-korg

From 68c4905bba24691eed0cfb5c19f106f6c162ce02 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:15 -0800
Subject: Docs/ABI/damon: document effective_bytes sysfs file

Update the DAMON ABI doc for the effective_bytes sysfs file and the kdamond state file input command for updating the content of the file.

Link: https://lkml.kernel.org/r/20240219194431.159606-5-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/ABI/testing/sysfs-kernel-mm-damon | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
(limited to 'Documentation')

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index bfa5b8288d8d1..a1e4fdb04f951 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -34,7 +34,9 @@ Description:	Writing 'on' or 'off' to this file makes the kdamond starts or
		kdamond.  Writing 'update_schemes_tried_bytes' to the file
		updates only '.../tried_regions/total_bytes' files of this
		kdamond.  Writing 'clear_schemes_tried_regions' to the file
-		removes contents of the 'tried_regions' directory.
+		removes contents of the 'tried_regions' directory.  Writing
+		'update_schemes_effective_bytes' to the file updates
+		'.../quotas/effective_bytes' files of this kdamond.
 
What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date:		Mar 2022
@@ -208,6 +210,12 @@ Contact:	SeongJae Park
 Description:	Writing to and reading from this file sets and gets the size
		quota of the scheme in bytes.
 
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/effective_bytes
+Date:		Feb 2024
+Contact:	SeongJae Park
+Description:	Reading from this file gets the effective size quota of the
+		scheme in bytes, which is adjusted for the time quota and
+		goals.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/reset_interval_ms
 Date:		Mar 2022
 Contact:	SeongJae Park
-- cgit 1.2.3-korg

From a6068d6dfa2f53bdee9d48f32d9e39cdeb74b372 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:16 -0800
Subject: Docs/admin-guide/mm/damon/usage: document effective_bytes file

Update the DAMON usage document for the effective quota file of the DAMON sysfs interface.

Link: https://lkml.kernel.org/r/20240219194431.159606-6-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 22254997723c3..a88cda52b0959 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -83,7 +83,7 @@ comma (",").
     │ │ │ │ │ │ │ │ sz/min,max
     │ │ │ │ │ │ │ │ nr_accesses/min,max
     │ │ │ │ │ │ │ │ age/min,max
-    │ │ │ │ │ │ │ :ref:`quotas `/ms,bytes,reset_interval_ms
+    │ │ │ │ │ │ │ :ref:`quotas `/ms,bytes,reset_interval_ms,effective_bytes
     │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
     │ │ │ │ │ │ │ │ :ref:`goals `/nr_goals
     │ │ │ │ │ │ │ │ │ 0/target_value,current_value
@@ -153,6 +153,9 @@ Users can write below commands for the kdamond to the ``state`` file.
 - ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
   action tried regions directory for each DAMON-based operation scheme of the
   kdamond.
+- ``update_schemes_effective_bytes``: Update the contents of
+  ``effective_bytes`` files for each DAMON-based operation scheme of the
+  kdamond.  For more details, refer to :ref:`quotas directory `.
 
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 
@@ -320,8 +323,9 @@ schemes/<N>/quotas/
 
 The directory for the :ref:`quotas ` of the given DAMON-based operation
 scheme.
 
-Under ``quotas`` directory, three files (``ms``, ``bytes``,
-``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.
+Under ``quotas`` directory, four files (``ms``, ``bytes``,
+``reset_interval_ms``, ``effective_bytes``) and two directories (``weights``
+and ``goals``) exist.
 
 You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
 ``reset interval`` in milliseconds by writing the values to the three files,
@@ -332,6 +336,15 @@ apply the action to only up to ``bytes`` bytes of memory regions within the
 quota limits unless at least one :ref:`goal ` is
 set.
 
+The time quota is internally transformed to a size quota.  Between the
+transformed size quota and the user-specified size quota, the smaller one is
+applied.  Based on the user-specified :ref:`goal `, the effective size
+quota is further adjusted.  Reading ``effective_bytes`` returns the current
+effective size quota.
+The file is not updated in real time, so users should ask the DAMON sysfs
+interface to update the content of the file for the stats by writing a
+special keyword, ``update_schemes_effective_bytes``, to the relevant
+``kdamonds/<N>/state`` file.
+
 Under ``weights`` directory, three files (``sz_permil``,
 ``nr_accesses_permil``, and ``age_permil``) exist.
 You can set the :ref:`prioritization weights
-- cgit 1.2.3-korg

From 3c17174f64fe4900a3c5eadaa6e9116b11a9bd33 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:26 -0800
Subject: Docs/mm/damon/design: document quota goal self-tuning

Update the DAMON design doc to explain quota goal self-tuning, which can be used by setting the goal's metric to one of the metrics that the kernel can self-retrieve.

Link: https://lkml.kernel.org/r/20240219194431.159606-16-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/design.rst | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 2bd0c203dcfb7..8c89d26f0baa1 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -398,12 +398,28 @@ Aim-oriented Feedback-driven Auto-tuning
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Automatic feedback-driven quota tuning.  Instead of setting the absolute quota
-value, users can repeatedly provide numbers representing how much of their goal
-for the scheme is achieved as feedback.  DAMOS then automatically tunes the
+value, users can specify the metric of their interest, and what target value
+they want the metric value to be.  DAMOS then automatically tunes the
 aggressiveness (the quota) of the corresponding scheme.  For example, if DAMOS
 is under achieving the goal, DAMOS automatically increases the quota.  If
 DAMOS is over achieving the goal, it decreases the quota.
 
+The goal can be specified with three parameters, namely ``target_metric``,
+``target_value``, and ``current_value``.  The auto-tuning mechanism tries to
+make the ``current_value`` of ``target_metric`` the same as ``target_value``.
+Currently, two ``target_metric`` values are provided.
+
+- ``user_input``: User-provided value.  Users could use any metric that they
+  have interest in for the value.  User space main workload's latency or
+  throughput, or system metrics like free memory ratio or memory pressure
+  stall time (PSI) could be examples.  Note that users should explicitly set
+  ``current_value`` on their own in this case.  In other words, users should
+  repeatedly provide the feedback.
+- ``some_mem_psi_us``: System-wide ``some`` memory pressure stall information
+  in microseconds, measured from the last quota reset to the next quota
+  reset.  DAMOS does the measurement on its own, so only ``target_value``
+  needs to be set by users at the initial time.  In other words, DAMOS does
+  self-feedback.
 
 .. _damon_design_damos_watermarks:
-- cgit 1.2.3-korg

From adc3908b3ccfd1ee5282e5ba75a24d9b536777c6 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:27 -0800
Subject: Docs/ABI/damon: document quota goal metric file

Update the DAMON ABI document for the quota goal target_metric file.
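As a hedged sketch of the ``some_mem_psi_us`` setup described above (directory indices assume a single already-configured kdamond, context, scheme, and goal; the sysfs layout is per the ABI entries that follow)::

    #include <stdio.h>

    #define GOAL "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals/0/"

    static int write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fputs(val, f);
            return fclose(f);
    }

    int main(void)
    {
            /* aim at 10ms of 'some' memory PSI per quota reset interval */
            if (write_str(GOAL "target_metric", "some_mem_psi_us") ||
                write_str(GOAL "target_value", "10000"))
                    return 1;
            /* let the running kdamond pick up the new goal */
            return write_str("/sys/kernel/mm/damon/admin/kdamonds/0/state",
                             "commit_schemes_quota_goals") ? 1 : 0;
    }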
Link: https://lkml.kernel.org/r/20240219194431.159606-17-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/ABI/testing/sysfs-kernel-mm-damon | 6 ++++++
 1 file changed, 6 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index a1e4fdb04f951..dad4d5ffd7865 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -229,6 +229,12 @@ Description:	Writing a number 'N' to this file creates the number of
		directories for setting automatic tuning of the scheme's
		aggressiveness named '0' to 'N-1' under the goals/ directory.
 
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_metric
+Date:		Feb 2024
+Contact:	SeongJae Park
+Description:	Writing to and reading from this file sets and gets the quota
+		auto-tuning goal metric.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
 Date:		Nov 2023
 Contact:	SeongJae Park
-- cgit 1.2.3-korg

From 57e88e86a167698978621a09719d716771430c48 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:28 -0800
Subject: Docs/admin-guide/mm/damon/usage: document quota goal metric file

Update the DAMON usage document for the quota goal target_metric file.

[sj@kernel.org: fix a typo on the auto-tuning design reference link]
Link: https://lkml.kernel.org/r/20240221170852.55529-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20240219194431.159606-18-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index a88cda52b0959..6fce035fdbf5c 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -86,7 +86,7 @@ comma (",").
     │ │ │ │ │ │ │ :ref:`quotas `/ms,bytes,reset_interval_ms,effective_bytes
     │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
     │ │ │ │ │ │ │ │ :ref:`goals `/nr_goals
-    │ │ │ │ │ │ │ │ │ 0/target_value,current_value
+    │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
     │ │ │ │ │ │ │ :ref:`watermarks `/metric,interval_us,high,mid,low
     │ │ │ │ │ │ │ :ref:`filters `/nr_filters
     │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
@@ -366,11 +366,11 @@ number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``.  Each directory represents each goal and current achievement.
 Among the multiple feedback, the best one is used.
 
-Each goal directory contains two files, namely ``target_value`` and
-``current_value``.  Users can set and get any number to those files to set the
-feedback.  User space main workload's latency or throughput, system metrics
-like free memory ratio or memory pressure stall time (PSI) could be example
-metrics for the values.  Note that users should write
+Each goal directory contains three files, namely ``target_metric``,
+``target_value`` and ``current_value``.  Users can set and get the three
+parameters for the quota auto-tuning goals that are specified in the
+:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
+reading from each of the files.  Note that users should further write
 ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
 directory ` to pass the feedback to DAMON.
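For the ``user_input`` metric, the feedback described above is a loop: measure a workload metric, publish it as ``current_value``, and ask the kdamond to re-read the goals. A minimal sketch, with ``measure_workload_score()`` standing in for whatever metric the administrator cares about::

    #include <stdio.h>
    #include <unistd.h>

    #define ADMIN "/sys/kernel/mm/damon/admin/kdamonds/0/"
    #define GOAL  ADMIN "contexts/0/schemes/0/quotas/goals/0/"

    /* placeholder: replace with a real latency/throughput/PSI probe */
    static unsigned long measure_workload_score(void)
    {
            return 10000;
    }

    static void write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (f) {
                    fputs(val, f);
                    fclose(f);
            }
    }

    int main(void)
    {
            char buf[32];

            for (;;) {
                    snprintf(buf, sizeof(buf), "%lu", measure_workload_score());
                    write_str(GOAL "current_value", buf);
                    /* pass the fresh feedback to DAMON */
                    write_str(ADMIN "state", "commit_schemes_quota_goals");
                    sleep(1);       /* feedback cadence is up to the admin */
            }
    }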
-- cgit 1.2.3-korg

From 75c40c2509e797830dd90d92568262ba69a89c9c Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 19 Feb 2024 11:44:31 -0800
Subject: Docs/admin-guide/mm/damon/reclaim: document auto-tuning parameters

Update the DAMON_RECLAIM usage document for the user/self feedback based auto-tuning of the quota.

Link: https://lkml.kernel.org/r/20240219194431.159606-21-sj@kernel.org
Signed-off-by: SeongJae Park
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/reclaim.rst | 27 ++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
index 343e25b252f43..af05ae6170184 100644
--- a/Documentation/admin-guide/mm/damon/reclaim.rst
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -117,6 +117,33 @@ milliseconds.
 
 1 second by default.
 
+quota_mem_pressure_us
+---------------------
+
+Desired level of memory pressure-stall time in microseconds.
+
+While keeping the caps that are set by the other quotas, DAMON_RECLAIM
+automatically increases and decreases the effective level of the quota,
+aiming to incur this level of memory pressure.  System-wide ``some`` memory
+PSI in microseconds per quota reset interval (``quota_reset_interval_ms``) is
+collected and compared to this value to see if the aim is satisfied.  Value
+zero means disabling this auto-tuning feature.
+
+Disabled by default.
+
+quota_autotune_feedback
+-----------------------
+
+User-specifiable feedback for auto-tuning of the effective quota.
+
+While keeping the caps that are set by the other quotas, DAMON_RECLAIM
+automatically increases and decreases the effective level of the quota,
+aiming to receive this feedback of value ``10,000`` from the user.
+DAMON_RECLAIM assumes the feedback value and the quota are positively
+proportional.  Value zero means disabling this auto-tuning feature.
+
+Disabled by default.
+
 wmarks_interval
 ---------------
-- cgit 1.2.3-korg

From ff0b5905a9c9712f36fb7f9dd1be17564483b5d4 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Sun, 25 Feb 2024 11:47:51 +1300
Subject: Docs/mm/damon/design: remove the details for pageout as paddr doesn't use MADV_PAGEOUT

The doc needs a fix: only in the case of virtual addresses do we call madvise() with MADV_PAGEOUT, while in the case of physical addresses we call reclaim_pages() directly.  MADV_PAGEOUT on a virtual address is much more aggressive in reclaiming memory compared to reclaim_pages() on a paddr region.  This patch removes the details so that the description can apply to both cases, and we don't need to couple with the implementation details.

Link: https://lkml.kernel.org/r/20240224224751.4673-1-21cnbao@gmail.com
Signed-off-by: Barry Song
Reviewed-by: SeongJae Park
Cc: Minchan Kim
Cc: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
---
 Documentation/mm/damon/design.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(limited to 'Documentation')

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 8c89d26f0baa1..5620aab9b3850 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -315,7 +315,7 @@ that support each action are as below.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
-- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+- ``pageout``: Reclaim the region.
Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set. - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Supported by ``vaddr`` and ``fvaddr`` operations set. -- cgit 1.2.3-korg
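For reference, the reclaim path that makes ``paddr`` pageout different from ``MADV_PAGEOUT`` can be sketched as below; this is a paraphrase of the idea behind mm/damon/paddr.c, not its verbatim source: the physical-address implementation batches the region's folios and hands them straight to the reclaim code, with no ``madvise()`` involved::

    /* paraphrased sketch, not verbatim mm/damon/paddr.c */
    static unsigned long pa_pageout_sketch(struct damon_region *r)
    {
            unsigned long addr, applied;
            LIST_HEAD(folio_list);

            for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
                    struct folio *folio = damon_get_folio(PHYS_PFN(addr));

                    if (!folio)
                            continue;
                    folio_clear_referenced(folio);
                    folio_test_clear_young(folio);
                    if (!folio_isolate_lru(folio)) {
                            folio_put(folio);
                            continue;
                    }
                    list_add(&folio->lru, &folio_list);
                    folio_put(folio);
            }
            /* direct reclaim of the batch; no madvise()/MADV_PAGEOUT */
            applied = reclaim_pages(&folio_list);
            return applied * PAGE_SIZE;
    }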