author | Jann Horn <jannh@google.com> | 2020-10-27 14:35:35 +0100 |
---|---|---|
committer | Michael Kerrisk <mtk.manpages@gmail.com> | 2020-10-27 14:52:08 +0100 |
commit | 20e43cd69485c683f2b7cf473cc93c216fc59d03 (patch) | |
tree | b87970c7c6625631e1843069844b4eb15eb71546 | |
parent | 14948ad6ecd669a9f56a8281a48f9b0fb87eb399 (diff) | |
download | man-pages-20e43cd69485c683f2b7cf473cc93c216fc59d03.tar.gz |
proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
[mtk: Manually applied patch, because of conflicts with other
merged changes; also added an edit suggested by Jann; see the
thread at
https://lore.kernel.org/linux-man/20201012114940.1317510-1-jannh@google.com/]
Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
v2.6.34), Linux uses per-thread RSS counters to reduce cache
contention on the per-mm counters. With a 4K page size, that means
that you can end up with the counters off by up to 252KiB per
thread.
Example:
$ cat rsstest.c
#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=0; i<40; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=40; i<80; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    exit(0);
  }
  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);
  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);
  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);
  exit(0);
}
$ gcc -o rsstest rsstest.c && ./rsstest
VmRSS: 68 kB
Rss: 616 kB
VmRSS: 68 kB
Rss: 776 kB
VmRSS: 812 kB
Rss: 936 kB
$
Let's document that those counters aren't entirely accurate.
Reported-by: Mark Mossberg <mark.mossberg@gmail.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
-rw-r--r-- | man5/proc.5 | 36 |
1 files changed, 34 insertions, 2 deletions
diff --git a/man5/proc.5 b/man5/proc.5
index 213fc004e3..48d0b5e06a 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -2265,6 +2265,9 @@ This is just the pages which
 count toward text, data, or stack space.
 This does not include pages
 which have not been demand-loaded in, or which are swapped out.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+below.
 .TP
 (25) \fIrsslim\fP \ %lu
 Current soft limit in bytes on the rss of the process;
@@ -2409,10 +2412,11 @@ The columns are:
 size (1) total program size
  (same as VmSize in \fI/proc/[pid]/status\fP)
 resident (2) resident set size
- (same as VmRSS in \fI/proc/[pid]/status\fP)
+ (inaccurate; same as VmRSS in \fI/proc/[pid]/status\fP)
 shared (3) number of resident shared pages
  (i.e., backed by a file)
- (same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
+ (inaccurate; same as RssFile+RssShmem in
+ \fI/proc/[pid]/status\fP)
 text (4) text (code)
 .\" (not including libs; broken, includes data segment)
 lib (5) library (unused since Linux 2.6; always 0)
@@ -2421,6 +2425,16 @@ data (6) data + stack
 dt (7) dirty pages (unused since Linux 2.6; always 0)
 .EE
 .in
+.IP
+.\" See SPLIT_RSS_COUNTING in the kernel.
+.\" Inaccuracy is bounded by TASK_RSS_EVENTS_THRESH.
+Some of these values are inaccurate because
+of a kernel-internal scalability optimization.
+If accurate values are required, use
+.I /proc/[pid]/smaps
+or
+.I /proc/[pid]/smaps_rollup
+instead, which are much slower but provide accurate, detailed information.
 .TP
 .I /proc/[pid]/status
 Provides much of the information in
@@ -2597,6 +2611,9 @@ directly access physical memory.
 .TP
 .IR VmHWM
 Peak resident set size ("high water mark").
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR VmRSS
 Resident set size.
@@ -2605,16 +2622,25 @@ Note that the value here is the sum of
 .IR RssFile ,
 and
 .IR RssShmem .
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssAnon
 Size of resident anonymous memory.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssFile
 Size of resident file mappings.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR RssShmem
 Size of resident shared memory (includes System V shared memory,
@@ -2626,6 +2652,9 @@ and shared anonymous mappings).
 .TP
 .IR VmData ", " VmStk ", " VmExe
 Size of data, stack, and text segments.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR VmLib
 Shared library code size.
@@ -2641,6 +2670,9 @@ Size of second-level page tables (added in Linux 4.0; removed in Linux 4.15).
 .\" commit b084d4353ff99d824d3bc5a5c2c22c70b1fba722
 Swapped-out virtual memory size by anonymous private pages;
 shmem swap usage is not included (since Linux 2.6.34).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .TP
 .IR HugetlbPages
 Size of hugetlb memory portions