aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2014-08-07Merge tag 'microblaze-3.17-rc1' of git://git.monstr.eu/linux-2.6-microblazeHEADmasterLinus Torvalds4-18/+28
Pull microblaze updates from Michal Simek: - add new syscall and fix comment - fix udelay implementation - fix libgcc for modules * tag 'microblaze-3.17-rc1' of git://git.monstr.eu/linux-2.6-microblaze: microblaze: Change libgcc-style functions from lib-y to obj-y microblaze: Wire-up renameat2 syscall microblaze: Add syscall number comment microblaze: delay.h fix udelay and ndelay for 8 bit args
2014-08-07Merge branch 'for-linus' of ↵Linus Torvalds1-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 Pull avr32 fix from Hans-Christian Egtvedt. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32: avr32: fix error return code
2014-08-07Merge branch 'next' of ↵Linus Torvalds161-3284/+5477
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc new goodies for 3.17. The short story: The biggest bit is Michael removing all of pre-POWER4 processor support from the 64-bit kernel. POWER3 and rs64. This gets rid of a ton of old cruft that has been bitrotting in a long while. It was broken for quite a few versions already and nobody noticed. Nobody uses those machines anymore. While at it, he cleaned up a bunch of old dusty cabinets, getting rid of a skeletton or two. Then, we have some base VFIO support for KVM, which allows assigning of PCI devices to KVM guests, support for large 64-bit BARs on "powernv" platforms, support for HMI (Hardware Management Interrupts) on those same platforms, some sparse-vmemmap improvements (for memory hotplug), There is the usual batch of Freescale embedded updates (summary in the merge commit) and fixes here or there, I think that's it for the highlights" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (102 commits) powerpc/eeh: Export eeh_iommu_group_to_pe() powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_API powerpc: Reduce scariness of interrupt frames in stack traces powerpc: start loop at section start of start in vmemmap_populated() powerpc: implement vmemmap_free() powerpc: implement vmemmap_remove_mapping() for BOOK3S powerpc: implement vmemmap_list_free() powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTE powerpc/book3s: Fix endianess issue for HMI handling on napping cpus. powerpc/book3s: handle HMIs for cpus in nap mode. powerpc/powernv: Invoke opal call to handle hmi. powerpc/book3s: Add basic infrastructure to handle HMI in Linux. powerpc/iommu: Fix comments with it_page_shift powerpc/powernv: Handle compound PE in config accessors powerpc/powernv: Handle compound PE for EEH powerpc/powernv: Handle compound PE powerpc/powernv: Split ioda_eeh_get_state() powerpc/powernv: Allow to freeze PE powerpc/powernv: Enable M64 aperatus for PHB3 powerpc/eeh: Aux PE data for error log ...
2014-08-07Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds156-3036/+5486
Pull MIPS updates from Ralf Baechle: "This is the main pull request for 3.17. It contains: - misc Cavium Octeon, BCM47xx, BCM63xx and Alchemy updates - MIPS ptrace updates and cleanups - various fixes that will also go to -stable - a number of cleanups and small non-critical fixes. - NUMA support for the Loongson 3. - more support for MSA - support for MAAR - various FP enhancements and fixes" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (139 commits) MIPS: jz4740: remove unnecessary null test before debugfs_remove MIPS: Octeon: remove unnecessary null test before debugfs_remove_recursive MIPS: ZBOOT: implement stack protector in compressed boot phase MIPS: mipsreg: remove duplicate MIPS_CONF4_FTLBSETS_SHIFT MIPS: Bonito64: remove a duplicate define MIPS: Malta: initialise MAARs MIPS: Initialise MAARs MIPS: detect presence of MAARs MIPS: define MAAR register accessors & bits MIPS: mark MSA experimental MIPS: Don't build MSA support unless it can be used MIPS: consistently clear MSA flags when starting & copying threads MIPS: 16 byte align MSA vector context MIPS: disable preemption whilst initialising MSA MIPS: ensure MSA gets disabled during boot MIPS: fix read_msa_* & write_msa_* functions on non-MSA toolchains MIPS: fix MSA context for tasks which don't use FP first MIPS: init upper 64b of vector registers when MSA is first used MIPS: save/disable MSA in lose_fpu MIPS: preserve scalar FP CSR when switching vector context ...
2014-08-07Merge branch 'for-linus' of ↵Linus Torvalds26-443/+619
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "Mostly cleanups and bug-fixes, with two exceptions. The first is lazy flushing of I/O-TLBs for PCI to improve performance, the second is software dirty bits in the pmd for the madvise-free implementation" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (24 commits) s390/locking: Reenable optimistic spinning s390/mm: implement dirty bits for large segment table entries KVM: s390/mm: Fix page table locking vs. split pmd lock s390/dasd: fix camel case s390/3215: fix hanging console issue s390/irq: improve displayed interrupt order in /proc/interrupts s390/seccomp: fix error return for filtered system calls s390/pci: introduce lazy IOTLB flushing for DMA unmap dasd: fix error recovery for alias devices during format dasd: fix list_del corruption during format dasd: fix unresponsive device during format dasd: use aliases for formatted devices during format s390/pci: fix kmsg component s390/kdump: Return NOTIFY_OK for all actions other than MEM_GOING_OFFLINE s390/watchdog: Fix module name in Kconfig help text s390/dasd: replace seq_printf by seq_puts s390/dasd: replace pr_warning by pr_warn s390/dasd: Move EXPORT_SYMBOL after function/variable s390/dasd: remove unnecessary null test before debugfs_remove s390/zfcp: use qdio buffer helpers ...
2014-08-07avr32: fix error return codeJulia Lawall1-3/+8
Convert a zero return value on error to a negative one, as returned elsewhere in the function. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier ret; expression e1,e2; @@ ( if (\(ret < 0\|ret != 0\)) { ... return ret; } | ret = 0 ) ... when != ret = e1 when != &ret *if(...) { ... when != ret = e2 when forall return ret; } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
2014-08-06Merge branch 'akpm' (patchbomb from Andrew Morton)Linus Torvalds156-2919/+3930
Merge incoming from Andrew Morton: - Various misc things. - arch/sh updates. - Part of ocfs2. Review is slow. - Slab updates. - Most of -mm. - printk updates. - lib/ updates. - checkpatch updates. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits) checkpatch: update $declaration_macros, add uninitialized_var checkpatch: warn on missing spaces in broken up quoted checkpatch: fix false positives for --strict "space after cast" test checkpatch: fix false positive MISSING_BREAK warnings with --file checkpatch: add test for native c90 types in unusual order checkpatch: add signed generic types checkpatch: add short int to c variable types checkpatch: add for_each tests to indentation and brace tests checkpatch: fix brace style misuses of else and while checkpatch: add --fix option for a couple OPEN_BRACE misuses checkpatch: use the correct indentation for which() checkpatch: add fix_insert_line and fix_delete_line helpers checkpatch: add ability to insert and delete lines to patch/file checkpatch: add an index variable for fixed lines checkpatch: warn on break after goto or return with same tab indentation checkpatch: emit a warning on file add/move/delete checkpatch: add test for commit id formatting style in commit log checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings checkpatch: improve "no space after cast" test checkpatch: allow multiple const * types ...
2014-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds15-44/+86
Pull networking fixes from David Miller: "This fixes the most immediate fallout from yesterday's networking merge: 1) sock_tx_timestamp() must not clear the passed in tx_flags, but rather add to them. Fix from Eric Dumazet. 2) The hyperv driver sendbuf region increase needs to be decreased slightly to handle older backends. From KY Srinivasan. 3) Fix RCU lockdep splats in netlink diag after recent hashing changes, from Thomas Graf. 4) The new IPV6_FLOWLABEL was given a socket option number that overlapped with an existing IP6 tables one, breaking ip6_tables. Fixed by Pablo Neira Ayuso" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: netlink: hold nl_sock_hash_lock during diag dump tcp: md5: check md5 signature without socket lock net: fix USB network driver config option. net: reallocate new socket option number for IPV6_AUTOFLOWLABEL vmxnet3: fix decimal printf format specifiers prefixed with 0x net-timestamp: cumulative tcp timestamping fixes hyperv: Adjust the size of sendbuf region to support ws2008r2 cxgb4: Fix for SR-IOV VF initialization net-timestamp: sock_tx_timestamp() fix
2014-08-06Merge branch 'for-linus' of ↵Linus Torvalds18-37/+26
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree changes from Jiri Kosina: "Summer edition of trivial tree updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: fix two typos in watchdog-api.txt irq-gic: remove file name from heading comment MAINTAINERS: Add miscdevice.h to file list for char/misc drivers. scsi: mvsas: mv_sas.c: Fix for possible null pointer dereference doc: replace "practise" with "practice" in Documentation befs: remove check for CONFIG_BEFS_RW scsi: doc: fix 'SCSI_NCR_SETUP_MASTER_PARITY' drivers/usb/phy/phy.c: remove a leading space mfd: fix comment cpuidle: fix comment doc: hpfall.c: fix missing null-terminate after strncpy call usb: doc: hotplug.txt code typos kbuild: fix comment in Makefile.modinst SH: add proper prompt to SH_MAGIC_PANEL_R2_VERSION ARM: msm: Remove MSM_SCM crypto: Remove MPILIB_EXTRA doc: CN: remove dead link, kerneltrap.org no longer works media: update reference, kerneltrap.org no longer works hexagon: update reference, kerneltrap.org no longer works doc: LSM: update reference, kerneltrap.org no longer works ...
2014-08-06Merge branch 'for-linus' of ↵Linus Torvalds18-642/+1194
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: "Some highlights: - hid-sony improvements of Sixaxis device support by Antonio Ospite - hid-hyperv driven devices can now be used as wakeup source, by Dexuan Cui - hid-lenovo driver is now more generic and supports more devices, by Jamie Lentin - hid-huion now supports wider range of tablets, by Nikolai Kondrashov - other various unsorted fixes and device ID additions" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (30 commits) HID: hyperv: register as a wakeup source HID: sony: Default initialize all elements of the LED max_brightness array to 1 HID: huion: Fix sparse warnings HID: usbhid: Use flag HID_DISCONNECTED when a usb device is removed HID: ignore jabra gn9350e HID: cp2112: add I2C mode HID: use multi input quirk for 22b9:2968 HID: rmi: only bind the hid-rmi driver to the mouse interface of composite USB devices HID: rmi: check that report ids exist in the report_id_hash before accessing their size HID: lenovo: Add support for Compact (BT|USB) keyboard HID: lenovo: Don't call function in condition, show error codes HID: lenovo: Prepare support for adding other devices HID: lenovo: Rename hid-lenovo-tpkbd to hid-lenovo HID: huion: Handle tablets with UC-Logic vendor ID HID: huion: Switch to generating report descriptor HID: huion: Don't ignore other interfaces HID: huion: Use "tablet" instead of specific model HID: add quirk for 0x04d9:0xa096 device HID: i2c-hid: call the hid driver's suspend and resume callbacks HID: rmi: change logging level of log messages related to unexpected reports ...
2014-08-06Merge git://www.linux-watchdog.org/linux-watchdogLinus Torvalds7-13/+43
Pull watchdog updates from Wim Van Sebroeck: - remove unnecessary checks after platform_get_resource() - fix watchdog api documentation typo's - imx2_wdt: adds big endianness support - move restart code to the sunxi watchdog driver * git://www.linux-watchdog.org/linux-watchdog: wdt: sunxi: Move restart code to the watchdog driver Documentation: fix two typos in watchdog-api.txt watchdog: imx2_wdt: adds big endianness support. watchdog: shwdt: Remove the unnecessary check of resource after platform_get_resource() watchdog: lantiq_wdt: Remove the un-necessary check of resource after platform_get_resource() watchdog: dw_wdt: Remove the un-necessary check after platform_get_resource()
2014-08-06Merge tag 'pm+acpi-3.17-rc1' of ↵Linus Torvalds120-1495/+5084
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "Again, ACPICA leads the pack (47 commits), followed by cpufreq (18 commits) and system suspend/hibernation (9 commits). From the new code perspective, the ACPICA update brings ACPI 5.1 to the table, including a new device configuration object called _DSD (Device Specific Data) that will hopefully help us to operate device properties like Device Trees do (at least to some extent) and changes related to supporting ACPI on ARM. Apart from that we have hibernation changes making it use radix trees to store memory bitmaps which should speed up some operations carried out by it quite significantly. We also have some power management changes related to suspend-to-idle (the "freeze" sleep state) support and more preliminary changes needed to support ACPI on ARM (outside of ACPICA). The rest is fixes and cleanups pretty much everywhere. Specifics: - ACPICA update to upstream version 20140724. That includes ACPI 5.1 material (support for the _CCA and _DSD predefined names, changes related to the DMAR and PCCT tables and ARM support among other things) and cleanups related to using ACPICA's header files. A major part of it is related to acpidump and the core code used by that utility. Changes from Bob Moore, David E Box, Lv Zheng, Sascha Wildner, Tomasz Nowicki, Hanjun Guo. - Radix trees for memory bitmaps used by the hibernation core from Joerg Roedel. - Support for waking up the system from suspend-to-idle (also known as the "freeze" sleep state) using ACPI-based PCI wakeup signaling (Rafael J Wysocki). - Fixes for issues related to ACPI button events (Rafael J Wysocki). - New device ID for an ACPI-enumerated device included into the Wildcat Point PCH from Jie Yang. - ACPI video updates related to backlight handling from Hans de Goede and Linus Torvalds. - Preliminary changes needed to support ACPI on ARM from Hanjun Guo and Graeme Gregory. - ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui. - Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros (Rafael J Wysocki). - ACPI-based device hotplug cleanups from Wei Yongjun and Rafael J Wysocki. - Cleanups and improvements related to system suspend from Lan Tianyu, Randy Dunlap and Rafael J Wysocki. - ACPI battery cleanup from Wei Yongjun. - cpufreq core fixes from Viresh Kumar. - Elimination of a deadband effect from the cpufreq ondemand governor and intel_pstate driver cleanups from Stratos Karafotis. - 350MHz CPU support for the powernow-k6 cpufreq driver from Mikulas Patocka. - Fix for the imx6 cpufreq driver from Anson Huang. - cpuidle core and governor cleanups from Daniel Lezcano, Sandeep Tripathy and Mohammad Merajul Islam Molla. - Build fix for the big_little cpuidle driver from Sachin Kamat. - Configuration fix for the Operation Performance Points (OPP) framework from Mark Brown. - APM cleanup from Jean Delvare. - cpupower utility fixes and cleanups from Peter Senna Tschudin, Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas Renninger" * tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (118 commits) ACPI / LPSS: add LPSS device for Wildcat Point PCH ACPI / PNP: Replace faulty is_hex_digit() by isxdigit() ACPICA: Update version to 20140724. ACPICA: ACPI 5.1: Update for PCCT table changes. ACPICA/ARM: ACPI 5.1: Update for GTDT table changes. ACPICA/ARM: ACPI 5.1: Update for MADT changes. ACPICA/ARM: ACPI 5.1: Update for FADT changes. ACPICA: ACPI 5.1: Support for the _CCA predifined name. ACPICA: ACPI 5.1: New notify value for System Affinity Update. ACPICA: ACPI 5.1: Support for the _DSD predefined name. ACPICA: Debug object: Add current value of Timer() to debug line prefix. ACPICA: acpihelp: Add UUID support, restructure some existing files. ACPICA: Utilities: Fix local printf issue. ACPICA: Tables: Update for DMAR table changes. ACPICA: Remove some extraneous printf arguments. ACPICA: Update for comments/formatting. No functional changes. ACPICA: Disassembler: Add support for the ToUUID opererator (macro). ACPICA: Remove a redundant cast to acpi_size for ACPI_OFFSET() macro. ACPICA: Work around an ancient GCC bug. ACPI / processor: Make it possible to get local x2apic id via _MAT ...
2014-08-06Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds231-6265/+4936
Pull SCSI updates from James Bottomley: "This patch set consists of the usual driver updates (ufs, storvsc, pm8001 hpsa). It also has removal of the user space target driver code (everyone is using LIO now), a partial PCI MSI-X update, more multi-queue updates, conversion to 64 bit LUNs (so we could theoretically cope with any LUN returned by a device) and placeholder support for the ZBC device type (Shingle drives), plus an assortment of minor updates and bug fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (143 commits) scsi: do not issue SCSI RSOC command to Promise Vtrak E610f vmw_pvscsi: Use pci_enable_msix_exact() instead of pci_enable_msix() pm8001: Fix invalid return when request_irq() failed lpfc: Remove superfluous call to pci_disable_msix() isci: Use pci_enable_msix_exact() instead of pci_enable_msix() bfa: Use pci_enable_msix_exact() instead of pci_enable_msix() bfa: Cleanup bfad_setup_intr() function bfa: Do not call pci_enable_msix() after it failed once fnic: Use pci_enable_msix_exact() instead of pci_enable_msix() scsi: use short driver name for per-driver cmd slab caches scsi_debug: support scsi-mq, queues and locks Drivers: add blist flags scsi: ufs: fix endianness sparse warnings scsi: ufs: make undeclared functions static bnx2i: Update driver version to 2.7.10.1 pm8001: fix a memory leak in nvmd_resp pm8001: fix update_flash pm8001: fix a memory leak in flash_update pm8001: Cleaning up uninitialized variables pm8001: Fix to remove null pointer checks that could never happen ...
2014-08-06Merge tag 'sound-3.17-rc1' of ↵Linus Torvalds294-7462/+17264
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "There've been many updates in ASoC side at this time, especially the framework enhancement for multiple CODECs on a single DAI and more componentization works. The only major change in ALSA core is the addition of timestamp type in sw_params field. This should behave in backward compatible way. Other than that, there are lots of small changes and new drivers in wide range, including a large code cut in HD-audio driver for deprecated static quirks. Some highlights are below: ALSA Core: - Add the new timestamp type field to sw_params to choose MONOTONIC_RAW type HD-audio: - Continued conversion to standard printk macros, generic code cleanups - Removal of obsoleted static quirk codes for Conexant and C-Media codecs - Fixups for HP Envy TS, Dell XPS 15, HP and Dell mute/mic LED, Gigabyte BXBT-2807 mobo - Intel Braswell support ASoC: - Support for multiple CODECs attached to a single DAI, enabling systems with for example multiple DAC/speaker drivers on a single link, contributed by Benoit Cousson based on work from Misael Lopez Cruz - Support for byte controls larger than 256 bytes based on the use of TLVs contributed by Omair Mohammed Abdullah - More componentisation work from Lars-Peter Clausen - The remainder of the conversions of CODEC drivers to params_width() by Mark Brown - Drivers for Cirrus Logic CS4265, Freescale i.MX ASRC blocks, Realtek RT286 and RT5670, Rockchip RK3xxx I2S controllers and Texas Instruments TAS2552 - Lots of updates and fixes, especially to the DaVinci, Intel, Freescale, Realtek, and rcar drivers" * tag 'sound-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (402 commits) ALSA: usb-audio: Whitespace cleanups for sound/usb/midi.* ALSA: usb-audio: Respond to suspend and resume callbacks for MIDI input sound/oss/pss: Remove typedefs pss_mixerdata and pss_confdata sound/oss/opl3: Remove typedef opl_devinfo ALSA: fireworks: fix specifiers in format strings for propper output ASoC: imx-audmux: Use uintptr_t for port numbers ASoC: davinci: Enable menuconfig entry for McASP ASoC: fsl_asrc: Don't access members of config before checking it ASoC: fsl_sarc_dma: Check pair before using it ASoC: adau1977: Fix truncation warning on 64 bit architectures ALSA: virtuoso: add Xonar Essence STX II support ALSA: riptide: fix %d confusingly prefixed with 0x in format strings ALSA: fireworks: fix %d confusingly prefixed with 0x in format strings ALSA: hda - add codec ID for Braswell display audio codec ALSA: hda - add PCI IDs for Intel Braswell ALSA: usb-audio: Adjust Gamecom 780 volume level ALSA: usb-audio: improve dmesg source grepability ASoC: rt5670: Fix duplicate const warnings ASoC: rt5670: Staticise non-exported symbols ASoC: Intel: update stream only on stream IPC msgs ...
2014-08-06Merge tag 'hsi-for-3.17' of ↵Linus Torvalds3-15/+16
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi Pull HSI changes from Sebastian Reichel: "Misc fixes in SSI related drivers" * tag 'hsi-for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi: HSI: omap_ssi: Fix return value check in ssi_debug_add_ctrl() HSI: omap_ssi_port: Fix return value check in ssi_debug_add_port() HSI: ssi_protocol: Fix sparse non static symbol warning drivers/hsi/controllers/omap_ssi{,_port}.c: fix failure checks
2014-08-06Merge tag 'for-v3.17' of git://git.infradead.org/battery-2.6Linus Torvalds18-99/+673
Pull power supply changes from Sebastian Reichel: - Added iPaq h3xxx battery driver - Added Broadcom STB reset driver - DT support for rx51-battery - misc. fixes * tag 'for-v3.17' of git://git.infradead.org/battery-2.6: ipaq_micro_battery: fix sparse non static symbol warning power: add driver for battery reading on iPaq h3xxx power: twl4030_charger: detect battery presence prior to enabling charger power: reset: Add reboot driver for brcmstb power_supply: Fix sparse non static symbol warning power_supply: Add inlmt,iterm, min/max temp props charger: tps65090: Allow charger module to be used when no irq power/reset: Fix GPL v2 license string typo power: poweroff: gpio: convert to use descriptors bq27000: report missing device better. bq27x00_battery: Introduce the use of the managed version of kzalloc Documentation: DT: Document rx51-battery binding rx51_battery: convert to iio consumer bq2415x_charger: Fix Atomic Sleep Bug
2014-08-07powerpc/eeh: Export eeh_iommu_group_to_pe()Gavin Shan1-0/+1
The function is used by VFIO driver, which might be built as a dynamic module. So it should be exported. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-06netlink: hold nl_sock_hash_lock during diag dumpThomas Graf3-0/+5
Although RCU protection would be possible during diag dump, doing so allows for concurrent table mutations which can render the in-table offset between individual Netlink messages invalid and thus cause legitimate sockets to be skipped in the dump. Since the diag dump is relatively low volume and consistency is more important than performance, the table mutex is held during dump. Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Fixes: e341694e3eb57fc ("netlink: Convert netlink_lookup() to use RCU protected hash table") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06checkpatch: update $declaration_macros, add uninitialized_varJoe Perches1-5/+6
Using uninitialized_var reports a false positive for "Missing blank line after declarations". Fix it by adding uninitialized_var to the $declaration_macros exceptions list. Move the macro list after $Type is declared. Add optional prefixes to DECLARE_<FOO> and DEFINE_<BAR> macro declarations to allow forms like: MLX4_DECLARE_DOORBELL_LOCK Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Dotan Barak <dotanb@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: warn on missing spaces in broken up quotedDan Carpenter1-0/+6
Checkpatch already complains when people break up quoted strings but it's still pretty common. One mistake that people often make is they leave out the space character between the two strings. This check adds around 450 new warnings and has a low rate of false positives. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andy Whitcroft <apw@canonical.com> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: fix false positives for --strict "space after cast" testJoe Perches1-1/+1
Commit 89da401f6cff ("checkpatch: improve "no space after cast" test") in -next improved the cast test for non pointer types, but also introduced false positives for some types of static inlines. Add a test for an open brace to the exclusions to avoid these false positives. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Hartley Sweeten <HartleyS@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: fix false positive MISSING_BREAK warnings with --fileJoe Perches1-2/+2
Using --file mode can give false positives with MISSING_BREAK fall-through warnings on simple but long multiple consecutive case statements. Look for all lines before a case statement for a switch or a statement when using --file mode. Fix a misspelling of preceded while there. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Lee Jones <lee.jones@linaro.org> Acked-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add test for native c90 types in unusual orderJoe Perches1-0/+44
c90 section "6.7.2 Type Specifiers" says: "type specifiers may occur in any order" That means that: short int is the same as int short unsigned short int is the same as int unsigned short etc... checkpatch currently parses only a subset of these allowed types. For instance: "unsigned short" and "signed short" are found by checkpatch as a specific type, but none of the or "int short" or "int signed short" variants are found. Add another table for the "kernel style misordered" variants. Add this misordered table to the findable types. Warn when the misordered style is used. This improves the "Missing a blank line after declarations" test as it depends on the correct parsing of the $Declare variable which looks for "$Type $Ident;" (ie: declarations like "int foo;"). Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add signed generic typesJoe Perches1-9/+9
Current generic types are unsigned or unspecified. Add signed to the types. Reorder the types to find the longest match first. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add short int to c variable typesJoe Perches1-0/+1
short int is one of the 6.7.2 c90 types. Find it appropriately. This fixes a defect in checkpatch where it suggests that a line break after declaration is required using an input like: int foo; short int bar; Without this change, it warns on the short int line. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Hartley Sweeten <HartleyS@visionengravers.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add for_each tests to indentation and brace testsJoe Perches1-2/+2
All the various for_each loop macros were not tested for trailing brace on the following lines and for bad indentation. Add them. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Greg KH <gregkh@linuxfoundation.org> Cc: Andy Whitcroft <apw@canonical.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: fix brace style misuses of else and whileJoe Perches1-8/+30
Add --fix corrections for ELSE_AFTER_BRACE and WHILE_AFTER_BRACE misuses. if (x) { ... } else { ... } is corrected to if (x) { ... } else { ... } and do { ... } while (x); is corrected to do { ... } while (x); Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add --fix option for a couple OPEN_BRACE misusesJoe Perches1-5/+28
Style misuses of these types are corrected: typedef struct foo { int bar; }; int foo(int bar) { return bar+1; } int foo(int bar) { return bar+1; } Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: use the correct indentation for which()Joe Perches1-6/+6
I copied the which subroutine from get_maintainer.pl. Unfortunately, get_maintainer uses a 4 space indentation so use the proper tab indentation instead. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add fix_insert_line and fix_delete_line helpersJoe Perches1-36/+29
Neaten the uses of patch/file line insertions or deletions. Hide the mechanism used. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add ability to insert and delete lines to patch/fileJoe Perches1-11/+130
This can be valuable to insert or delete blank lines as well as fix misplaced brace or else uses. Store indexes of lines to be added/deleted and the new lines. When creating the --fix file, insert or delete the appropriate lines and update the patch range information. Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add an index variable for fixed linesJoe Perches1-54/+60
Make the fix code a bit easier to read. This should also start to allow an easier mechanism to insert/delete lines eventually too. Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: warn on break after goto or return with same tab indentationJoe Perches1-0/+10
Using break; after a goto or return is unnecessary so emit a warning when the break is at the same indent level. So this emits a warning on: switch (foo) { case 1: goto err; break; } but not on: switch (foo) { case 1: if (bar()) goto err; break; } Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: emit a warning on file add/move/deleteJoe Perches1-1/+12
Whenever files are added, moved, or deleted, the MAINTAINERS file patterns can be out of sync or outdated. To try to keep MAINTAINERS more up-to-date, add a one-time warning whenever a patch does any of those. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add test for commit id formatting style in commit logJoe Perches1-0/+54
Commit logs have various forms of commit id references. Try to standardize on a 12 character long lower case commit id along with a description of parentheses and the quoted subject line. ie: commit 0123456789ab ("commit description") If git and a git tree exists, look up the commit id and emit the appropriate line as part of the message. Signed-off-by: Joe Perches <joe@perches.com> Requested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: emit fewer kmalloc_array/kcalloc conversion warningsJoe Perches1-8/+9
Avoid matching allocs that appear to be known small multiplications of a sizeof with a constant because gcc as of 4.8 cannot optimize the code in a calloc() exactly the same way as an alloc(). Look for numeric constants or what appear to be upper case only macro #defines. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Theodore Ts'o <tytso@mit.edu> Original-patch-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: improve "no space after cast" testJoe Perches1-3/+3
This --strict test previously worked only for what appeared to be cast to pointer types. Make it work for all casts. Also, there's no reason to show the previous line for this type of message, so don't. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: allow multiple const * typesJoe Perches1-1/+1
checkpatch's $Type variable does not match declarations of multiple const * types. This can produce false positives for things like: $ ./scripts/checkpatch.pl -f drivers/staging/comedi/comedidev.h WARNING: Missing a blank line after declarations #60: FILE: drivers/staging/comedi/comedidev.h:60: + const struct comedi_lrange *range_table; + const struct comedi_lrange *const *range_table_list; Fix the $Type variable to support matching multiple "* const" uses. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Hartley Sweeten <HartleyS@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: warn on unnecessary parentheses around references of foo->barJoe Perches1-0/+8
Parentheses around &(foo->bar) and *(foo->bar) are unnecessary. Emit a --strict only message on these uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: quiet Kconfig help message checkingJoe Perches1-3/+4
Editing Kconfig dependencies can emit unnecessary messages about missing or too short help entries. Only emit the message when adding help sections to Kconfig files. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Jean Delvare <jdelvare@suse.de> Tested-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: change blank line after declaration type to "LINE_SPACING"Joe Perches1-1/+1
Make it consistent with the other missing or multiple blank line tests. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add a multiple blank lines testJoe Perches1-0/+11
Multiple consecutive blank lines waste screen space. Emit a --strict only message with these blank lines. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: add test for blank lines after function/struct/union/enumJoe Perches1-0/+16
Add a --strict test asking for a blank line after function/struct/union/enum declarations. Allow exceptions for several attributes and macro uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch.pl: also suggest 'else if' when if follows braceRasmus Villemoes1-1/+1
This might help a kernel hacker think twice before blindly adding a newline. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: ignore email headers betterJoe Perches1-2/+3
There are some patches created by git format-patch that when scanned by checkpatch report errors on lines like To: address.tld This is a checkpatch false positive. Improve the logic a bit to ignore folded email headers to avoid emitting these messages. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: fix function pointers in blank line needed after declarations testJoe Perches1-0/+4
Add a function pointer declaration check to the test for blank line needed after declarations. Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Bruce W Allan <bruce.w.allan@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: fix complex macro false positive for escaped constant charJoe Perches1-1/+1
A single escaped constant char is not a complex macro. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: warn on unnecessary else after return or breakJoe Perches1-0/+10
Using an else following a break or return can unnecessarily indent code blocks. ie: for (i = 0; i < 100; i++) { int foo = bar(); if (foo < 1) break; else usleep(1); } is generally better written as: for (i = 0; i < 100; i++) { int foo = bar(); if (foo < 1) break; usleep(1); } Warn when a bare else statement is preceded by a break or return indented 1 tab more than the else. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06checkpatch: attempt to find unnecessary 'out of memory' messagesJoe Perches1-0/+17
Logging messages that show some type of "out of memory" error are generally unnecessary as there is a generic message and a stack dump done by the memory subsystem. These messages generally increase kernel size without much added value. Emit a warning on these types of messages. This test looks for any inserted message function, then looks at the previous line for an "if (!foo)" or "if (foo == NULL)" test and then looks at the preceding statement for an allocation function like "foo = kmalloc()" ie: this code matches: foo = kmalloc(); if (foo == NULL) { printk("Out of memory\n"); return -ENOMEM; } This test is very crude and incomplete. This test can miss quite a lot of of OOM messages that do not have this specific form. ie: this code does not match: foo = kmalloc(); if (!foo) { rtn = -ENOMEM; printk("Out of memory!\n"); goto out; } This test could also be a false positive when the logging message itself does not specify anything about memory, but I did not find any false positives in my limited testing. spatch could be a better solution but correctness seems non-trivial for that tool too. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: add missing mask in bitmap_andnotRasmus Villemoes2-3/+6
Apparently, bitmap_andnot is supposed to return whether the new bitmap is empty. But it didn't take potential garbage bits in the last word into account. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: add missing mask in bitmap_andRasmus Villemoes2-3/+6
Apparently, bitmap_and is supposed to return whether the new bitmap is empty. But it didn't take potential garbage bits in the last word into account. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: add missing mask in bitmap_shift_rightRasmus Villemoes1-1/+1
There is no guarantee that *src does not contain garbage bits outside the lower nbits, so we need to mask it before the shift-and-assign. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: micro-optimize bitmap_allocate_regionRasmus Villemoes1-2/+1
__reg_op(..., REG_OP_ALLOC) always returns 0, so we might as well use that and save an instruction. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: change parameter of bitmap_*_region to unsignedRasmus Villemoes2-9/+9
Changing the pos parameter of __reg_op to unsigned allows the compiler to generate slightly smaller and simpler code. Also update its callers bitmap_*_region to receive and pass unsigned int. The return types of bitmap_find_free_region and bitmap_allocate_region are still int to allow a negative error code to be returned. An int is certainly capable of representing any realistic return value. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: fix typo in kerneldoc for bitmap_pos_to_ordRasmus Villemoes1-1/+1
A few lines above, it was stated that positions for non-set bits are mapped to -1, which is obviously also what the code does. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: simplify bitmap_parselistRasmus Villemoes1-7/+2
We want len to be the index of the first '\n', or the length of the string if there is no newline. This is a good example of the usefulness of strchrnul(). Use that instead, thus eliminating a branch and a call to strlen(). Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make the start index of bitmap_clear unsignedRasmus Villemoes2-6/+6
The compiler can generate slightly smaller and simpler code when it knows that "start" is non-negative. Also, use the names "start" and "len" for the two parameters for consistency with bitmap_set. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make the start index of bitmap_set unsignedRasmus Villemoes2-6/+6
The compiler can generate slightly smaller and simpler code when it knows that "start" is non-negative. Also, use the names "start" and "len" for the two parameters in both header file and implementation, instead of the previous mix. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_weight unsignedRasmus Villemoes2-4/+5
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. I didn't change the return type, since that might change the semantics of some expression containing a call to bitmap_weight(). Certainly an int is capable of holding the result. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_subset unsignedRasmus Villemoes2-4/+4
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_intersects unsignedRasmus Villemoes2-4/+4
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_{and,or,xor,andnot} unsignedRasmus Villemoes2-20/+20
This change is only for consistency with the changes to the other bitmap_* functions; it doesn't change the size of the generated code: inside BITS_TO_LONGS there is a sizeof(long), which causes bits to be interpreted as unsigned anyway. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: remove unnecessary mask from bitmap_complementRasmus Villemoes2-2/+2
Since the extra bits are "don't care", there is no reason to mask the last word to the used bits when complementing. This shaves off yet a few bytes. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_complement unsignedRasmus Villemoes2-5/+5
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_equal unsignedRasmus Villemoes2-3/+3
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_full unsignedRasmus Villemoes2-4/+4
The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: bitmap: make nbits parameter of bitmap_empty unsignedRasmus Villemoes2-4/+4
Many functions in lib/bitmap.c start with an expression such as lim = bits/BITS_PER_LONG. Since bits has type (signed) int, and since gcc cannot know that it is in fact non-negative, it generates worse code than it could. These patches, mostly consisting of changing various parameters to unsigned, gives a slight overall code reduction: add/remove: 1/1 grow/shrink: 8/16 up/down: 251/-414 (-163) function old new delta tick_device_uses_broadcast 335 425 +90 __irq_alloc_descs 498 554 +56 __bitmap_andnot 73 115 +42 __bitmap_and 70 101 +31 bitmap_weight - 11 +11 copy_hugetlb_page_range 752 762 +10 follow_hugetlb_page 846 854 +8 hugetlb_init 1415 1417 +2 hugetlb_nrpages_setup 130 131 +1 hugetlb_add_hstate 377 376 -1 bitmap_allocate_region 82 80 -2 select_task_rq_fair 2202 2191 -11 hweight_long 66 55 -11 __reg_op 230 219 -11 dm_stats_message 2849 2833 -16 bitmap_parselist 92 74 -18 __bitmap_weight 115 97 -18 __bitmap_subset 153 129 -24 __bitmap_full 128 104 -24 __bitmap_empty 120 96 -24 bitmap_set 179 149 -30 bitmap_clear 185 155 -30 __bitmap_equal 136 105 -31 __bitmap_intersects 148 108 -40 __bitmap_complement 109 67 -42 tick_device_setup_broadcast_func.isra 81 - -81 [The increases in __bitmap_and{,not} are due to bug fixes 17/18,18/18. No idea why bitmap_weight suddenly appears.] While 163 bytes treewide is insignificant, I believe the bitmap functions are often called with locks held, so saving even a few cycles might be worth it. While making these changes, I found a few other things that might be worth including. 16,17,18 are actual bug fixes. The rest shouldn't change the behaviour of any of the functions, provided no-one passed negative nbits values. If something should come up, it should be fairly bisectable. A few issues I thought about, but didn't know what to do with: * Many of the functions misbehave if nbits is compile-time 0; the out-of-line functions generally handle 0 correctly. bitmap_fill() is particularly bad, whether the 0 is known at compile time or not. It would probably be nice to add detection of at least compile-time 0 and handle that appropriately. * I didn't change __bitmap_shift_{left,right} to use unsigned because I want to fully understand why the algorithm works before making that change. However, AFAICT, they behave correctly for all (positive) shift amounts. This is not the case for the small_const_nbits versions. If for example nbits = n = BITS_PER_LONG, the shift operators turn into no-ops (at least on x86), so one get *dst = *src, whereas one would expect to get *dst=0. That difference in behaviour is somewhat annoying. This patch (of 18): The compiler can generate slightly smaller and simpler code when it knows that "nbits" is non-negative. Since no-one passes a negative bit-count, this shouldn't affect the semantics. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib/list_sort.c: convert to pr_fooAndrew Morton1-28/+21
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: list_sort.c: Limit number of unused cmp callbacksRasmus Villemoes1-1/+3
The helper merge_and_restore_back_links() makes sure to call the caller's cmp function during the final ->prev pointer fixup, so that the cmp function may call cond_resched(). However, if the cmp function does not call cond_resched() at all, this is entirely redundant. If it does, doing at least two function calls for every two pointer assignments is a bit excessive. This patch limits the calls to once for every 256 iterations. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: list_sort_test(): simplify and harden cleanupRasmus Villemoes1-7/+5
There is no reason to maintain the list structure while freeing the debug elements. Aside from the redundant pointer manipulations, it is also inefficient from a locality-of-reference viewpoint, since they are visited in a random order (wrt. the order they were allocated). Furthermore, if we jumped to exit: after detecting list corruption, it is actually dangerous. So just free the elements in the order they were allocated, using the backing array elts. Allocate that using kcalloc(), so that if allocation of one of the debug element fails, we just end up calling kfree(NULL) for the trailing elements. Minor details: Use sizeof(*elts) instead of sizeof(void *), and return err immediately when allocation of elts fails, to avoid introducing another label just before the final return statement. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: list_sort_test(): add extra corruption checkRasmus Villemoes1-0/+5
Add a check to make sure that the prev pointer of the list head points to the last element on the list. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: list_sort_test(): return -ENOMEM when allocation failsRasmus Villemoes1-1/+2
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06kernel.h: remove deprecated pack_hex_byteJoe Perches1-5/+0
It's been nearly 3 years now since commit 55036ba76b2d ("lib: rename pack_hex_byte() to hex_byte_pack()") so it's time to remove this deprecated and unused static inline. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib/test-kstrtox.c: use ARRAY_SIZE instead of sizeof/sizeof[0]Fabian Frederick1-1/+1
Use kernel.h definition. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib/string_helpers.c: constify static arraysMathias Krause1-6/+9
Complement commit 68aecfb97978 ("lib/string_helpers.c: make arrays static") by making the arrays const -- not only pointing to const strings. This moves them out of the data section to the r/o data section: text data bss dec hex filename 1150 176 0 1326 52e lib/string_helpers.old.o 1326 0 0 1326 52e lib/string_helpers.new.o Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib/cmdline.c: add size unit t/p/e to memparseGui Hecheng1-5/+10
For modern filesystems such as btrfs, t/p/e size level operations are common. add size unit t/p/e parsing to memparse Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06libata: Use glob_match from lib/glob.cGeorge Spelvin2-69/+4
The function may be useful for other drivers, so export it. (Suggested by Tejun Heo.) Note that I inverted the return value of glob_match; returning true on match seemed to make more sense. Signed-off-by: George Spelvin <linux@horizon.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib/glob.c: add CONFIG_GLOB_SELFTESTGeorge Spelvin2-0/+178
This was useful during development, and is retained for future regression testing. GCC appears to have no way to place string literals in a particular section; adding __initconst to a char pointer leaves the string itself in the default string section, where it will not be thrown away after module load. Thus all string constants are kept in explicitly declared and named arrays. Sorry this makes printk a bit harder to read. At least the tests are more compact. Signed-off-by: George Spelvin <linux@horizon.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06lib: add lib/glob.cGeorge Spelvin4-0/+153
This is a helper function from drivers/ata/libata_core.c, where it is used to blacklist particular device models. It's being moved to lib/ so other drivers may use it for the same purpose. This implementation in non-recursive, so is safe for the kernel stack. [akpm@linux-foundation.org: fix sparse warning] Signed-off-by: George Spelvin <linux@horizon.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zlib: clean up some dead codeSergey Senozhatsky3-393/+0
Cleanup unused `if 0'-ed functions, which have been dead since 2006 (commits 87c2ce3b9305 ("lib/zlib*: cleanups") by Adrian Bunk and 4f3865fb57a0 ("zlib_inflate: Upgrade library code to a recent version") by Richard Purdie): - zlib_deflateSetDictionary - zlib_deflateParams - zlib_deflateCopy - zlib_inflateSync - zlib_syncsearch - zlib_inflateSetDictionary - zlib_inflatePrime Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06klist: use same naming scheme as hlist for klist_add_after()Ken Helias2-4/+4
The name was modified from hlist_add_after() to hlist_add_behind() when adjusting the order of arguments to match the one with klist_add_after(). This is necessary to break old code when it would use it the wrong way. Make klist follow this naming scheme for consistency. Signed-off-by: Ken Helias <kenhelias@firemail.de> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06list: fix order of arguments for hlist_add_after(_rcu)Ken Helias15-21/+21
All other add functions for lists have the new item as first argument and the position where it is added as second argument. This was changed for no good reason in this function and makes using it unnecessary confusing. The name was changed to hlist_add_behind() to cause unconverted code to generate a compile error instead of using the wrong parameter order. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Ken Helias <kenhelias@firemail.de> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits] Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06list: make hlist_add_after() argument names match hlist_add_after_rcu()Ken Helias1-7/+7
The argument names for hlist_add_after() are poorly chosen because they look the same as the ones for hlist_add_before() but have to be used differently. hlist_add_after_rcu() has made a better choice. Signed-off-by: Ken Helias <kenhelias@firemail.de> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06kernel/printk/printk.c: fix bool assignementsNeil Zhang1-3/+3
Fix coccinelle warnings. Signed-off-by: Neil Zhang <zhangwm@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: enable interrupts before calling console_trylock_for_printk()Jan Kara1-10/+21
We need interrupts disabled when calling console_trylock_for_printk() only so that cpu id we pass to can_use_console() remains valid (for other things console_sem provides all the exclusion we need and deadlocks on console_sem due to interrupts are impossible because we use down_trylock()). However if we are rescheduled, we are guaranteed to run on an online cpu so we can easily just get the cpu id in can_use_console(). We can lose a bit of performance when we enable interrupts in vprintk_emit() and then disable them again in console_unlock() but OTOH it can somewhat reduce interrupt latency caused by console_unlock(). We differ from (reverted) commit 939f04bec1a4 in that we avoid calling console_unlock() from vprintk_emit() with lockdep enabled as that has unveiled quite some bugs leading to system freezes during boot (e.g. https://lkml.org/lkml/2014/5/30/242, https://lkml.org/lkml/2014/6/28/521). Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Andreas Bombe <aeb@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: miscellaneous cleanupsAlex Elder1-13/+13
Some small cleanups to kernel/printk/printk.c. None of them should cause any change in behavior. - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX. - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX definition; delete it. - Pull an assignment out of a conditional expression in console_setup(). - Use isdigit() in console_setup() rather than open coding it. - In update_console_cmdline(), drop a NUL-termination assignment; the strlcpy() call that precedes it guarantees it's not needed. - Simplify some logic in printk_timed_ratelimit(). Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Kara <jack@suse.cz> Cc: John Stultz <john.stultz@linaro.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: use a clever macroAlex Elder1-10/+2
Use the IS_ENABLED() macro rather than #ifdef blocks to set certain global values. Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Borislav Petkov <bp@suse.de> Reviewed-by: Petr Mladek <pmladek@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: John Stultz <john.stultz@linaro.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: fix some commentsAlex Elder1-11/+12
Fix a few comments that don't accurately describe their corresponding code. It also fixes some minor typographical errors. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Kara <jack@suse.cz> Cc: John Stultz <john.stultz@linaro.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: rename DEFAULT_MESSAGE_LOGLEVELAlex Elder3-3/+3
Commit a8fe19ebfbfd ("kernel/printk: use symbolic defines for console loglevels") makes consistent use of symbolic values for printk() log levels. The naming scheme used is different from the one used for DEFAULT_MESSAGE_LOGLEVEL though. Change that symbol name to be MESSAGE_LOGLEVEL_DEFAULT for consistency. And because the value of that symbol comes from a similarly-named config option, rename CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jan Kara <jack@suse.cz> Cc: John Stultz <john.stultz@linaro.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: tweak do_syslog() to match commentsAlex Elder1-1/+1
In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that only needs to know whether there's any data available to read (and not its size). These callers only check for non-zero return. As a shortcut, do_syslog() returns the difference between what has been logged and what has been "seen." The comments say that the "count of records" should be returned but it's not. Instead it returns (log_next_idx - syslog_idx), which is a difference between buffer offsets--and the result could be negative. The behavior is the same (it'll be zero or not in the same cases), but the count of records is more meaningful and it matches what the comments say. So change the code to return that. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: allow increasing the ring buffer depending on the number of CPUsLuis R. Rodriguez3-6/+82
The default size of the ring buffer is too small for machines with a large amount of CPUs under heavy load. What ends up happening when debugging is the ring buffer overlaps and chews up old messages making debugging impossible unless the size is passed as a kernel parameter. An idle system upon boot up will on average spew out only about one or two extra lines but where this really matters is on heavy load and that will vary widely depending on the system and environment. There are mechanisms to help increase the kernel ring buffer for tracing through debugfs, and those interfaces even allow growing the kernel ring buffer per CPU. We also have a static value which can be passed upon boot. Relying on debugfs however is not ideal for production, and relying on the value passed upon bootup is can only used *after* an issue has creeped up. Instead of being reactive this adds a proactive measure which lets you scale the amount of contributions you'd expect to the kernel ring buffer under load by each CPU in the worst case scenario. We use num_possible_cpus() to avoid complexities which could be introduced by dynamically changing the ring buffer size at run time, num_possible_cpus() lets us use the upper limit on possible number of CPUs therefore avoiding having to deal with hotplugging CPUs on and off. This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT which is used to specify the maximum amount of contributions to the kernel ring buffer in the worst case before the kernel ring buffer flips over, the size is specified as a power of 2. The total amount of contributions made by each CPU must be greater than half of the default kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger an increase upon bootup. The kernel ring buffer is increased to the next power of two that would fit the required minimum kernel ring buffer size plus the additional CPU contribution. For example if LOG_BUF_SHIFT is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs in order to trigger an increase of the kernel ring buffer. With a LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64 possible CPUs to trigger an increase. If you had 128 possible CPUs the amount of minimum required kernel ring buffer bumps to: ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB Since we require the ring buffer to be a power of two the new required size would be 1024 KB. This CPU contributions are ignored when the "log_buf_len" kernel parameter is used as it forces the exact size of the ring buffer to an expected power of two value. [pmladek@suse.cz: fix build] Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.cz> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Tested-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: make dynamic units clear for the kernel ring bufferLuis R. Rodriguez1-1/+1
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Suggested-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: move power of 2 practice of ring buffer size to a helperLuis R. Rodriguez1-4/+10
In practice the power of 2 practice of the size of the kernel ring buffer remains purely historical but not a requirement, specially now that we have LOG_ALIGN and use it for both static and dynamic allocations. It could have helped with implicit alignment back in the days given the even the dynamically sized ring buffer was guaranteed to be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would be allowed only if it was > __LOG_BUF_LEN and we always ended up rounding the log_buf_len=n to the next power of 2 with roundup_pow_of_two(), any multiple of 2 then should be also architecture aligned. These assumptions of course relied heavily on CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always change this. We now have precise alignment requirements set for the log buffer size for both static and dynamic allocations, but lets upkeep the old practice of using powers of 2 for its size to help with easy expected scalable values and the allocators for dynamic allocations. We'll reuse this later so move this into a helper. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06printk: make dynamic kernel ring buffer alignment explicitLuis R. Rodriguez1-2/+3
We have to consider alignment for the ring buffer both for the default static size, and then also for when an dynamic allocation is made when the log_buf_len=n kernel parameter is passed to set the size specifically to a size larger than the default size set by the architecture through CONFIG_LOG_BUF_SHIFT. The default static kernel ring buffer can be aligned properly if architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned value it can be reduced to a non aligned value. Commit 6ebb017de9 ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew Lunn ensures the static buffer is always aligned and the decision of alignment is done by the compiler by using __alignof__(struct log). When log_buf_len=n is used we allocate the ring buffer dynamically. Dynamic allocation varies, for the early allocation called before setup_arch() memblock_virt_alloc() requests a page aligment and for the default kernel allocation memblock_virt_alloc_nopanic() requests no special alignment, which in turn ends up aligning the allocation to SMP_CACHE_BYTES, which is L1 cache aligned. Since we already have the required alignment for the kernel ring buffer though we can do better and request explicit alignment for LOG_ALIGN. This does that to be safe and make dynamic allocation alignment explicit. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Tested-by: Petr Mladek <pmladek@suse.cz> Acked-by: Petr Mladek <pmladek@suse.cz> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06include/linux/byteorder/generic.h: minor comment fixGeoff Levand1-1/+1
Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06fs.h, drivers/hwmon/asus_atk0110.c: fix DEFINE_SIMPLE_ATTRIBUTE semicolon ↵Joe Perches2-2/+2
definition and use The DEFINE_SIMPLE_ATTRIBUTE macro should not end in a ; Fix the one use in the kernel tree that did not have a semicolon. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06./Makefile: tell gcc optimizer to never introduce new data racesJiri Kosina1-0/+3
We have been chasing a memory corruption bug, which turned out to be caused by very old gcc (4.3.4), which happily turned conditional load into a non-conditional one, and that broke correctness (the condition was met only if lock was held) and corrupted memory. This particular problem with that particular code did not happen when never gccs were used. I've brought this up with our gcc folks, as I wanted to make sure that this can't really happen again, and it turns out it actually can. Quoting Martin Jambor <mjambor@suse.cz>: "More current GCCs are more careful when it comes to replacing a conditional load with a non-conditional one, most notably they check that a store happens in each iteration of _a_ loop but they assume loops are executed. They also perform a simple check whether the store cannot trap which currently passes only for non-const variables. A simple testcase demonstrating it on an x86_64 is for example the following: $ cat cond_store.c int g_1 = 1; int g_2[1024] __attribute__((section ("safe_section"), aligned (4096))); int c = 4; int __attribute__ ((noinline)) foo (void) { int l; for (l = 0; (l != 4); l++) { if (g_1) return l; for (g_2[0] = 0; (g_2[0] >= 26); ++g_2[0]) ; } return 2; } int main (int argc, char* argv[]) { if (mprotect (g_2, sizeof(g_2), PROT_READ) == -1) { int e = errno; error (e, e, "mprotect error %i", e); } foo (); __builtin_printf("OK\n"); return 0; } /* EOF */ $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=0 $ ./a.out OK $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=1 $ ./a.out Segmentation fault The testcase fails the same at least with 4.9, 4.8 and 4.7. Therefore I would suggest building kernels with this parameter set to zero. I also agree with Jikos that the default should be changed for -O2. I have run most of the SPEC 2k6 CPU benchmarks (gamess and dealII failed, at -O2, not sure why) compiled with and without this option and did not see any real difference between respective run-times" Hopefully the default will be changed in newer gccs, but let's force it for kernel builds so that we are on a safe side even when older gcc are used. The code in question was out-of-tree printk-in-NMI (yeah, surprise suprise, once again) patch written by Petr Mladek, let me quote his comment from our internal bugzilla: "I have spent few days investigating inconsistent state of kernel ring buffer. It went out that it was caused by speculative store generated by gcc-4.3.4. The problem is in assembly generated for make_free_space(). The functions is called the following way: + vprintk_emit(); + log = MAIN_LOG; // with logbuf_lock or log = NMI_LOG; // with nmi_logbuf_lock cont_add(log, ...); + cont_flush(log, ...); + log_store(log, ...); + log_make_free_space(log, ...); If called with log = NMI_LOG then only nmi_log_* global variables are safe to modify but the generated code does store also into (main_)log_* global variables: <log_make_free_space>: 55 push %rbp 89 f6 mov %esi,%esi 48 8b 05 03 99 51 01 mov 0x1519903(%rip),%rax # ffffffff82620868 <nmi_log_next_id> 44 8b 1d ec 98 51 01 mov 0x15198ec(%rip),%r11d # ffffffff82620858 <log_next_idx> 8b 35 36 60 14 01 mov 0x1146036(%rip),%esi # ffffffff8224cfa8 <log_buf_len> 44 8b 35 33 60 14 01 mov 0x1146033(%rip),%r14d # ffffffff8224cfac <nmi_log_buf_len> 4c 8b 2d d0 98 51 01 mov 0x15198d0(%rip),%r13 # ffffffff82620850 <log_next_seq> 4c 8b 25 11 61 14 01 mov 0x1146111(%rip),%r12 # ffffffff8224d098 <log_buf> 49 89 c2 mov %rax,%r10 48 21 c2 and %rax,%rdx 48 8b 1d 0c 99 55 01 mov 0x155990c(%rip),%rbx # ffffffff826608a0 <nmi_log_buf> 49 c1 ea 20 shr $0x20,%r10 48 89 55 d0 mov %rdx,-0x30(%rbp) 44 29 de sub %r11d,%esi 45 29 d6 sub %r10d,%r14d 4c 8b 0d 97 98 51 01 mov 0x1519897(%rip),%r9 # ffffffff82620840 <log_first_seq> eb 7e jmp ffffffff81107029 <log_make_free_space+0xe9> [...] 85 ff test %edi,%edi # edi = 1 for NMI_LOG 4c 89 e8 mov %r13,%rax 4c 89 ca mov %r9,%rdx 74 0a je ffffffff8110703d <log_make_free_space+0xfd> 8b 15 27 98 51 01 mov 0x1519827(%rip),%edx # ffffffff82620860 <nmi_log_first_id> 48 8b 45 d0 mov -0x30(%rbp),%rax 48 39 c2 cmp %rax,%rdx # end of loop 0f 84 da 00 00 00 je ffffffff81107120 <log_make_free_space+0x1e0> [...] 85 ff test %edi,%edi # edi = 1 for NMI_LOG 4c 89 0d 17 97 51 01 mov %r9,0x1519717(%rip) # ffffffff82620840 <log_first_seq> ^^^^^^^^^^^^^^^^^^^^^^^^^^ KABOOOM 74 35 je ffffffff81107160 <log_make_free_space+0x220> It stores log_first_seq when edi == NMI_LOG. This instructions are used also when edi == MAIN_LOG but the store is done speculatively before the condition is decided. It is unsafe because we do not have "logbuf_lock" in NMI context and some other process migh modify "log_first_seq" in parallel" I believe that the best course of action is both - building kernel (and anything multi-threaded, I guess) with that optimization turned off - persuade gcc folks to change the default for future releases Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Martin Jambor <mjambor@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marek Polacek <polacek@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Richard Biener <richard.guenther@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zpool: update zswap to use zpoolDan Streetman2-31/+46
Change zswap to use the zpool api instead of directly using zbud. Add a boot-time param to allow selecting which zpool implementation to use, with zbud as the default. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zpool: zbud/zsmalloc implement zpoolDan Streetman2-0/+179
Update zbud and zsmalloc to implement the zpool api. [fengguang.wu@intel.com: make functions static] Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zpool: implement common zpool api to zbud/zsmallocDan Streetman5-18/+495
Add zpool api. zpool provides an interface for memory storage, typically of compressed memory. Users can select what backend to use; currently the only implementations are zbud, a low density implementation with up to two compressed pages per storage page, and zsmalloc, a higher density implementation with multiple compressed pages per storage page. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/zbud: change zbud_alloc size type to size_tDan Streetman2-3/+3
Change the type of the zbud_alloc() size param from unsigned int to size_t. Technically, this should not make any difference, as the zbud implementation already restricts the size to well within either type's limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use size_t, this brings the size parameter type in line with zsmalloc/zpool. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: replace global tb_lock with fine grain lockWeijie Yang2-33/+60
Currently, we use a rwlock tb_lock to protect concurrent access to the whole zram meta table. However, according to the actual access model, there is only a small chance for upper user to access the same table[index], so the current lock granularity is too big. The idea of optimization is to change the lock granularity from whole meta table to per table entry (table -> table[index]), so that we can protect concurrent access to the same table[index], meanwhile allow the maximum concurrency. With this in mind, several kinds of locks which could be used as a per-entry lock were tested and compared: Test environment: x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04, kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO. iozone test: iozone -t 4 -R -r 16K -s 200M -I +Z (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------------- Initial write 1381094 1425435 1422860 1423075 1421521 Rewrite 1529479 1641199 1668762 1672855 1654910 Read 8468009 11324979 11305569 11117273 10997202 Re-read 8467476 11260914 11248059 11145336 10906486 Reverse Read 6821393 8106334 8282174 8279195 8109186 Stride read 7191093 8994306 9153982 8961224 9004434 Random read 7156353 8957932 9167098 8980465 8940476 Mixed workload 4172747 5680814 5927825 5489578 5972253 Random write 1483044 1605588 1594329 1600453 1596010 Pwrite 1276644 1303108 1311612 1314228 1300960 Pread 4324337 4632869 4618386 4457870 4500166 To enhance the possibility of access the same table[index] concurrently, set zram a small disksize(10MB) and let threads run with large loop count. fio test: fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4 --filename=/dev/zram0 --name=seq-write --rw=write --stonewall --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall (10MB zram raw block device, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------- seq-write 933789 999357 1003298 995961 1001958 seq-read 5634130 6577930 6380861 6243912 6230006 seq-rw 1405687 1638117 1640256 1633903 1634459 rand-rw 1386119 1614664 1617211 1609267 1612471 All the optimization methods show a higher performance than the base, however, it is hard to say which method is the most appropriate. On the other hand, zram is mostly used on small embedded system, so we don't want to increase any memory footprint. This patch pick the bit_spinlock method, pack object size and page_flag into an unsigned long table.value, so as to not increase any memory overhead on both 32-bit and 64-bit system. On the third hand, even though different kinds of locks have different performances, we can ignore this difference, because: if zram is used as zram swapfile, the swap subsystem can prevent concurrent access to the same swapslot; if zram is used as zram-blk for set up filesystem on it, the upper filesystem and the page cache also prevent concurrent access of the same block mostly. So we can ignore the different performances among locks. Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: use size_t instead of u16Minchan Kim1-1/+1
Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16 or more. In these cases u16 is not sufficiently large to represent a compressed page's size so use size_t. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: remove unused SECTOR_SIZE defineSergey Senozhatsky1-1/+0
Drop SECTOR_SIZE define, because it's not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06zram: rename struct `table' to `zram_table_entry'Sergey Senozhatsky1-2/+2
Andrew Morton has recently noted that `struct table' actually represents table entry and, thus, should be renamed. Rename to `zram_table_entry'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/highmem: make kmap cache coloring awareMax Filippov1-11/+75
User-visible effect: Architectures that choose this method of maintaining cache coherency (MIPS and xtensa currently) are able to use high memory on cores with aliasing data cache. Without this fix such architectures can not use high memory (in case of xtensa it means that at most 128 MBytes of physical memory is available). The problem: VIPT cache with way size larger than MMU page size may suffer from aliasing problem: a single physical address accessed via different virtual addresses may end up in multiple locations in the cache. Virtual mappings of a physical address that always get cached in different cache locations are said to have different colors. L1 caching hardware usually doesn't handle this situation leaving it up to software. Software must avoid this situation as it leads to data corruption. What can be done: One way to handle this is to flush and invalidate data cache every time page mapping changes color. The other way is to always map physical page at a virtual address with the same color. Low memory pages already have this property. Giving architecture a way to control color of high memory page mapping allows reusing of existing low memory cache alias handling code. How this is done with this patch: Provide hooks that allow architectures with aliasing cache to align mapping address of high pages according to their color. Such architectures may enforce similar coloring of low- and high-memory page mappings and reuse existing cache management functions to support highmem. This code is based on the implementation of similar feature for MIPS by Leonid Yegoshin. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Marc Gauthier <marc@cadence.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Hill <Steven.Hill@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mmu_notifier: add call_srcu and sync function for listener to delay call and ↵Peter Zijlstra2-1/+45
sync When kernel device drivers or subsystems want to bind their lifespan to t= he lifespan of the mm_struct, they usually use one of the following methods: 1. Manually calling a function in the interested kernel module. The funct= ion call needs to be placed in mmput. This method was rejected by several ker= nel maintainers. 2. Registering to the mmu notifier release mechanism. The problem with the latter approach is that the mmu_notifier_release cal= lback is called from__mmu_notifier_release (called from exit_mmap). That functi= on iterates over the list of mmu notifiers and don't expect the release call= back function to remove itself from the list. Therefore, the callback function= in the kernel module can't release the mmu_notifier_object, which is actuall= y the kernel module's object itself. As a result, the destruction of the kernel module's object must to be done in a delayed fashion. This patch adds support for this delayed callback, by adding a new mmu_notifier_call_srcu function that receives a function ptr and calls th= at function with call_srcu. In that function, the kernel module releases its object. To use mmu_notifier_call_srcu, the calling module needs to call b= efore that a new function called mmu_notifier_unregister_no_release that as its= name implies, unregisters a notifier without calling its notifier release call= back. This patch also adds a function that will call barrier_srcu so those kern= el modules can sync with mmu_notifier. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Oded Gabbay <oded.gabbay@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: BUG when __kmap_atomic_idx equals KM_TYPE_NRChintan Pandya1-1/+1
__kmap_atomic_idx is per_cpu variable. Each CPU can use KM_TYPE_NR entries from FIXMAP i.e. from 0 to KM_TYPE_NR - 1. Allowing __kmap_atomic_idx to over- shoot to KM_TYPE_NR can mess up with next CPU's 0th entry which is a bug. Hence BUG_ON if __kmap_atomic_idx >= KM_TYPE_NR. Fix the off-by-on in this test. Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: clean up reclaim size variable use in try_charge()Johannes Weiner1-3/+3
Charge reclaim and OOM currently use the charge batch variable, but batching is already disabled at that point. To simplify the charge logic, the batch variable is reset to the original request size when reclaim is entered, so it's functionally equal, but it's misleading. Switch reclaim/OOM to nr_pages, which is the original request size. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06kernel/smp.c:on_each_cpu_cond(): fix warning in fallback pathSasha Levin1-1/+1
The rarely-executed memry-allocation-failed callback path generates a WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably it's supposed to warn on failures. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <htejun@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: change confusing #ifdef use in __access_remote_vmRik van Riel1-2/+4
This patch changes confusing #ifdef use in __access_remote_vm into merely ugly #ifdef use. Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651 Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: softdirty: respect VM_SOFTDIRTY in PTE holesPeter Feiner1-6/+21
After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap should report that the VMA's virtual pages are soft-dirty until VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag for virtual addresses that fall in PTE holes (i.e., virtual addresses that don't have a PMD, PUD, or PGD allocated yet). To observe this bug, use mmap to create a VMA large enough such that there's a good chance that the VMA will occupy an unused PMD, then test the soft-dirty bit on its pages. In practice, I found that a VMA that covered a PMD's worth of address space was big enough. This patch adds the necessary VMA lookup to the PTE hole callback in /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs' VM_SOFTDIRTY flag. Signed-off-by: Peter Feiner <pfeiner@google.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: mark fault_around_bytes __read_mostlyKirill A. Shutemov1-1/+2
fault_around_bytes can only be changed via debugfs. Let's mark it read-mostly. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: close race between do_fault_around() and fault_around_bytes_set()Kirill A. Shutemov1-14/+7
Things can go wrong if fault_around_bytes will be changed under do_fault_around(): between fault_around_mask() and fault_around_pages(). Let's read fault_around_bytes only once during do_fault_around() and calculate mask based on the reading. Note: fault_around_bytes can only be updated via debug interface. Also I've tried but was not able to trigger a bad behaviour without the patch. So I would not consider this patch as urgent. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memcg, vmscan: Fix forced scan of anonymous pagesJerome Marchand1-7/+13
When memory cgoups are enabled, the code that decides to force to scan anonymous pages in get_scan_count() compares global values (free, high_watermark) to a value that is restricted to a memory cgroup (file). It make the code over-eager to force anon scan. For instance, it will force anon scan when scanning a memcg that is mainly populated by anonymous page, even when there is plenty of file pages to get rid of in others memcgs, even when swappiness == 0. It breaks user's expectation about swappiness and hurts performance. This patch makes sure that forced anon scan only happens when there not enough file pages for the all zone, not just in one random memcg. [hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, vmscan: fix an outdated comment still mentioning get_scan_ratioJerome Marchand1-1/+1
Quite a while ago, get_scan_ratio() has been renamed get_scan_count(), however a comment in shrink_active_list() still mention it. This patch fixes the outdated comment. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, oom: remove unnecessary exit_state checkDavid Rientjes2-2/+1
The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: fix potential infinite loop in dissolve_free_huge_pages()Li Zhong1-0/+3
It is possible for some platforms, such as powerpc to set HPAGE_SHIFT to 0 to indicate huge pages not supported. When this is the case, hugetlbfs could be disabled during boot time: hugetlbfs: disabling because there are no supported hugepage sizes Then in dissolve_free_huge_pages(), order is kept maximum (64 for 64bits), and the for loop below won't end: for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order) As suggested by Naoya, below fix checks hugepages_supported() before calling dissolve_free_huge_pages(). [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()] Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, thp: restructure thp avoidance of light synchronous migrationDavid Rientjes1-8/+9
__GFP_NO_KSWAPD, once the way to determine if an allocation was for thp or not, has gained more users. Their use is not necessarily wrong, they are trying to do a memory allocation that can easily fail without disturbing kswapd, so the bit has gained additional usecases. This restructures the check to determine whether MIGRATE_SYNC_LIGHT should be used for memory compaction in the page allocator. Rather than testing solely for __GFP_NO_KSWAPD, test for all bits that must be set for thp allocations. This also moves the check to be done only after the page allocator is aborted for deferred or contended memory compaction since setting migration_mode for this case is pointless. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, oom: rename zonelist locking functionsDavid Rientjes3-22/+18
try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, oom: ensure memoryless node zonelist always includes zonesDavid Rientjes3-3/+12
With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: sh: suitable memory should go to ZONE_MOVABLEWang Nan1-2/+3
This patch introduces zone_for_memory() to arch_add_memory() on sh to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: ppc: suitable memory should go to ZONE_MOVABLEWang Nan1-1/+2
This patch introduces zone_for_memory() to arch_add_memory() on powerpc to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLEWang Nan1-1/+2
This patch introduces zone_for_memory() to arch_add_memory() on ia64 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLEWang Nan1-1/+2
This patch introduces zone_for_memory() to arch_add_memory() on x86_32 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLEWang Nan1-1/+2
This patch introduces zone_for_memory() to arch_add_memory() on x86_64 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06memory-hotplug: add zone_for_memory() for selecting zone for new memoryWang Nan2-0/+29
This series of patches fixes a problem when adding memory in bad manner. For example: for a x86_64 machine booted with "mem=400M" and with 2GiB memory installed, following commands cause problem: # echo 0x40000000 > /sys/devices/system/memory/probe [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff] # echo 0x48000000 > /sys/devices/system/memory/probe [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff] # echo online_movable > /sys/devices/system/memory/memory9/state # echo 0x50000000 > /sys/devices/system/memory/probe [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff] # echo 0x58000000 > /sys/devices/system/memory/probe [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff] # echo online_movable > /sys/devices/system/memory/memory11/state # echo online> /sys/devices/system/memory/memory8/state # echo online> /sys/devices/system/memory/memory10/state # echo offline> /sys/devices/system/memory/memory9/state [ 30.558819] Offlined Pages 32768 # free total used free shared buffers cached Mem: 780588 18014398509432020 830552 0 0 51180 -/+ buffers/cache: 18014398509380840 881732 Swap: 0 0 0 This is because the above commands probe higher memory after online a section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE. After the second online_movable, the problem can be observed from zoneinfo: # cat /proc/zoneinfo ... Node 0, zone Movable pages free 65491 min 250 low 312 high 375 scanned 0 spanned 18446744073709518848 present 65536 managed 65536 ... This series of patches solve the problem by checking ZONE_MOVABLE when choosing zone for new memory. If new memory is inside or higher than ZONE_MOVABLE, makes it go there instead. After applying this series of patches, following are free and zoneinfo result (after offlining memory9): bash-4.2# free total used free shared buffers cached Mem: 780956 80112 700844 0 0 51180 -/+ buffers/cache: 28932 752024 Swap: 0 0 0 bash-4.2# cat /proc/zoneinfo Node 0, zone DMA pages free 3389 min 14 low 17 high 21 scanned 0 spanned 4095 present 3998 managed 3977 nr_free_pages 3389 ... start_pfn: 1 inactive_ratio: 1 Node 0, zone DMA32 pages free 73724 min 341 low 426 high 511 scanned 0 spanned 98304 present 98304 managed 92958 nr_free_pages 73724 ... start_pfn: 4096 inactive_ratio: 1 Node 0, zone Normal pages free 32630 min 120 low 150 high 180 scanned 0 spanned 32768 present 32768 managed 32768 nr_free_pages 32630 ... start_pfn: 262144 inactive_ratio: 1 Node 0, zone Movable pages free 65476 min 241 low 301 high 361 scanned 0 spanned 98304 present 65536 managed 65536 nr_free_pages 65476 ... start_pfn: 294912 inactive_ratio: 1 This patch (of 7): Introduce zone_for_memory() in arch independent code for arch_add_memory() use. Many arch_add_memory() function simply selects ZONE_HIGHMEM or ZONE_NORMAL and add new memory into it. However, with the existance of ZONE_MOVABLE, the selection method should be carefully considered: if new, higher memory is added after ZONE_MOVABLE is setup, the default zone and ZONE_MOVABLE may overlap each other. should_add_memory_movable() checks the status of ZONE_MOVABLE. If it has already contain memory, compare the address of new memory and movable memory. If new memory is higher than movable, it should be added into ZONE_MOVABLE instead of default zone. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06slub: remove kmemcg id from create_unique_idVladimir Davydov1-6/+0
This function is never called for memcg caches, because they are unmergeable, so remove the dead code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, writeback: prevent race when calculating dirty limitsDavid Rientjes1-4/+1
Setting vm_dirty_bytes and dirty_background_bytes is not protected by any serialization. Therefore, it's possible for either variable to change value after the test in global_dirty_limits() to determine whether available_memory needs to be initialized or not. Always ensure that available_memory is properly initialized. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_modeDavid Rientjes1-0/+26
Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") improved the previous khugepaged logic which allocated a transparent hugepages from the node of the first page being collapsed. However, it is still possible to collapse pages to remote memory which may suffer from additional access latency. With the current policy, it is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely if the majority are allocated from that node. When zone_reclaim_mode is enabled, it means the VM should make every attempt to allocate locally to prevent NUMA performance degradation. In this case, we do not want to collapse hugepages to remote nodes that would suffer from increased access latency. Thus, when zone_reclaim_mode is enabled, only allow collapsing to nodes with RECLAIM_DISTANCE or less. There is no functional change for systems that disable zone_reclaim_mode. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/shmem.c: remove the unused gfp arg to shmem_add_to_page_cache()Wang Sheng-Hui1-4/+4
The gfp arg is not used in shmem_add_to_page_cache. Remove this unused arg. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: describe mmap_sem rules for __lock_page_or_retry() and callersPaul Cassella6-8/+82
Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: page_alloc: reduce cost of the fair zone allocation policyMel Gorman2-48/+59
The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: page_alloc: abort fair zone allocation policy when remotes nodes are ↵Mel Gorman1-1/+1
encountered The purpose of numa_zonelist_order=zone is to preserve lower zones for use with 32-bit devices. If locality is preferred then the numa_zonelist_order=node policy should be used. Unfortunately, the fair zone allocation policy overrides this by skipping zones on remote nodes until the lower one is found. While this makes sense from a page aging and performance perspective, it breaks the expected zonelist policy. This patch restores the expected behaviour for zone-list ordering. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: only update per-cpu thresholds for online CPUMel Gorman1-1/+1
When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered to get more accurate counts to avoid breaching watermarks. This threshold update iterates over all possible CPUs which is unnecessary. Only online CPUs need to be updated. If a new CPU is onlined, refresh_zone_stat_thresholds() will set the thresholds correctly. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: move zone->pages_scanned into a vmstat counterMel Gorman4-8/+16
zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: rearrange zone fields into read-only, page alloc, statistics and page ↵Mel Gorman3-109/+113
reclaim lines The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: pagemap: avoid unnecessary overhead when tracepoints are deactivatedMel Gorman2-11/+9
This was formerly the series "Improve sequential read throughput" which noted some major differences in performance of tiobench since 3.0. While there are a number of factors, two that dominated were the introduction of the fair zone allocation policy and changes to CFQ. The behaviour of fair zone allocation policy makes more sense than tiobench as a benchmark and CFQ defaults were not changed due to insufficient benchmarking. This series is what's left. It's one functional fix to the fair zone allocation policy when used on NUMA machines and a reduction of overhead in general. tiobench was used for the comparison despite its flaws as an IO benchmark as in this case we are primarily interested in the overhead of page allocator and page reclaim activity. On UMA, it makes little difference to overhead 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 383.61 386.77 System 403.83 401.74 Elapsed 5411.50 5413.11 On a 4-socket NUMA machine it's a bit more noticable 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 746.94 802.00 System 65336.22 40852.33 Elapsed 27553.52 27368.46 This patch (of 6): The LRU insertion and activate tracepoints take PFN as a parameter forcing the overhead to the caller. Move the overhead to the tracepoint fast-assign method to ensure the cost is only incurred when the tracepoint is active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: trace-vmscan-postprocess.pl: report the number of file/anon pages ↵Chen Yucong1-0/+53
respectively Until now, the reporting from trace-vmscan-postprocess.pl is not very useful because we cannot directly use this script for checking the file/anon ratio of scanning. This patch aims to report respectively the number of file/anon pages which were scanned/reclaimed by kswapd or direct-reclaim. Sample output is usually something like the following. Summary Direct reclaims: 8823 Direct reclaim pages scanned: 2438797 Direct reclaim file pages scanned: 1315200 Direct reclaim anon pages scanned: 1123597 Direct reclaim pages reclaimed: 446139 Direct reclaim file pages reclaimed: 378668 Direct reclaim anon pages reclaimed: 67471 Direct reclaim write file sync I/O: 0 Direct reclaim write anon sync I/O: 0 Direct reclaim write file async I/O: 0 Direct reclaim write anon async I/O: 4240 Wake kswapd requests: 122310 Time stalled direct reclaim: 13.78 seconds Kswapd wakeups: 25817 Kswapd pages scanned: 170779115 Kswapd file pages scanned: 162725123 Kswapd anon pages scanned: 8053992 Kswapd pages reclaimed: 129065738 Kswapd file pages reclaimed: 128500930 Kswapd anon pages reclaimed: 564808 Kswapd reclaim write file sync I/O: 0 Kswapd reclaim write anon sync I/O: 0 Kswapd reclaim write file async I/O: 36 Kswapd reclaim write anon async I/O: 730730 Time kswapd awake: 1015.50 seconds Signed-off-by: Chen Yucong <slaoub@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: update the description for vm_total_pagesWang Sheng-Hui1-1/+5
vm_total_pages is calculated by nr_free_pagecache_pages(), which counts the number of pages which are beyond the high watermark within all zones. So vm_total_pages is not equal to total number of pages which the VM controls. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/memory.c: don't forget to set softdirty on file mapped faultCyrill Gorcunov1-1/+1
Otherwise we may not notice that pte was softdirty because pte_mksoft_dirty helper _returns_ new pte but doesn't modify the argument. In case if page fault happend on dirty filemapping the newly created pte may loose softdirty bit thus if a userspace program is tracking memory changes with help of a memory tracker (CONFIG_MEM_SOFT_DIRTY) it might miss modification of a memory page (which in worts case may lead to data inconsistency). Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06drivers/firmware/memmap.c: don't allocate firmware_map_entry of same memory ↵Yasuaki Ishimatsu1-0/+4
range When limiting memory by mem= and ACPI DSDT table has PNP0C80, firmware_map_entrys of same memory range are allocated and memmap X sysfses which have same memory range are created as follows: # cat /sys/firmware/memmap/0/* 0x407ffffffff 0x40000000000 System RAM # cat /sys/firmware/memmap/33/* 0x407ffffffff 0x40000000000 System RAM # cat /sys/firmware/memmap/35/* 0x407ffffffff 0x40000000000 System RAM In this case, when hot-removing memory, kernel panic occurs, showing following call trace: BUG: unable to handle kernel paging request at 00000001003e000b IP: sysfs_open_file+0x46/0x2b0 PGD 203a89fe067 PUD 0 Oops: 0000 [#1] SMP ... Call Trace: do_dentry_open+0x1ef/0x2a0 finish_open+0x31/0x40 do_last+0x57c/0x1220 path_openat+0xc2/0x4c0 do_filp_open+0x4b/0xb0 do_sys_open+0xf3/0x1f0 SyS_open+0x1e/0x20 system_call_fastpath+0x16/0x1b The problem occurs as follows: When calling e820_reserve_resources(), firmware_map_entrys of all e820 memory map are allocated. And all firmware_map_entrys is added map_entries list as follows: map_entries -> +--- entry A --------+ -> ... | start 0x407ffffffff| | end 0x40000000000| | type System RAM | +--------------------+ After that, if ACPI DSDT table has PNP0C80 and the memory range is limited by mem=, the PNP0C80 is hot-added. Then firmware_map_entry of PNP0C80 is allocated and added map_entries list as follows: map_entries -> +--- entry A --------+ -> ... -> +--- entry B --------+ | start 0x407ffffffff| | start 0x407ffffffff| | end 0x40000000000| | end 0x40000000000| | type System RAM | | type System RAM | +--------------------+ +--------------------+ Then memmap 0 sysfs for entry B is created. After that, firmware_memmap_init() creates memmap sysfses of all firmware_map_entrys in map_entries list. As a result, memmap 33 sysfs for entry A and memmap 35 sysfs for entry B are created. But kobject of entry B has been used by memmap 0 sysfs. So when creating memmap 35 sysfs, the kobject is broken. If hot-removing memory, memmap 0 sysfs is destroyed and kobject of memmap 0 sysfs is freed. But the kobject can be accessed via memmap 35 sysfs. So when open memmap 35 sysfs, kernel panic occurs. This patch checks whether there is firmware_map_entry of same memory range in map_entries list and don't allocate firmware_map_entry of same memroy range. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06drivers/firmware/memmap.c: pass the correct argument to ↵Yasuaki Ishimatsu1-1/+1
firmware_map_find_entry_bootmem() firmware_map_add_hotplug() calls firmware_map_find_entry_bootmem() to get free firmware_map_entry. But end arguments is not correct. So firmware_map_find_entry_bootmem() cannot not find firmware_map_entry. The patch passes the correct end argument to firmware_map_find_entry_bootmem(). Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/vmalloc.c: clean up map_vm_area third argumentWANG Chao6-20/+11
Currently map_vm_area() takes (struct page *** pages) as third argument, and after mapping, it moves (*pages) to point to (*pages + nr_mappped_pages). It looks like this kind of increment is useless to its caller these days. The callers don't care about the increments and actually they're trying to avoid this by passing another copy to map_vm_area(). The caller can always guarantee all the pages can be mapped into vm_area as specified in first argument and the caller only cares about whether map_vm_area() fails or not. This patch cleans up the pointer movement in map_vm_area() and updates its callers accordingly. Signed-off-by: WANG Chao <chaowang@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: make copy_pte_range static againJerome Marchand2-5/+1
Commit 71e3aac0724f ("thp: transparent hugepage core") adds copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this function have been used outside of memory.c, but it currently isn't. This patch makes copy_pte_range() static again. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, hugetlb: remove hugetlb_zero and hugetlb_infinityDavid Rientjes3-8/+3
They are unnecessary: "zero" can be used in place of "hugetlb_zero" and passing extra2 == NULL is equivalent to infinity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, hugetlb: generalize writes to nr_hugepagesDavid Rientjes1-32/+26
Three different interfaces alter the maximum number of hugepages for an hstate: - /proc/sys/vm/nr_hugepages for global number of hugepages of the default hstate, - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages for global number of hugepages for a specific hstate, and - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages/mempolicy for number of hugepages for a specific hstate over the set of allowed nodes. Generalize the code so that a single function handles all of these writes instead of duplicating the code in two different functions. This decreases the number of lines of code, but also reduces the size of .text by about half a percent since set_max_huge_pages() can be inlined. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06hwpoison: fix race with changing page during offliningAndi Kleen1-0/+10
When a hwpoison page is locked it could change state due to parallel modifications. The original compound page can be torn down and then this 4k page becomes part of a differently-size compound page is is a standalone regular page. Check after the lock if the page is still the same compound page. We could go back, grab the new head page and try again but it should be quite rare, so I thought this was safest. A retry loop would be more difficult to test and may have more side effects. The hwpoison code by design only tries to handle cases that are reasonably common in workloads, as visible in page-flags. I'm not really that concerned about handling this (likely rare case), just not crashing on it. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm,hugetlb: simplify error handling in hugetlb_cow()Davidlohr Bueso1-19/+16
When returning from hugetlb_cow(), we always (1) put back the refcount for each referenced page -- always 'old', and 'new' if allocation was successful. And (2) retake the page table lock right before returning, as the callers expects. This logic can be simplified and encapsulated, as proposed in this patch. In addition to cleaner code, we also shave a few bytes off the instruction text: text data bss dec hex filename 28399 462 41328 70189 1122d mm/hugetlb.o-baseline 28367 462 41328 70157 1120d mm/hugetlb.o-patched Passes libhugetlbfs testcases. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm,hugetlb: make unmap_ref_private() return voidDavidlohr Bueso1-18/+14
This function always returns 1, thus no need to check return value in hugetlb_cow(). By doing so, we can get rid of the unnecessary WARN_ON call. While this logic perhaps existed as a way of identifying future unmap_ref_private() mishandling, reality is it serves no apparent purpose. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: replace init_page_accessed by __SetPageReferencedHugh Dickins4-15/+6
Do we really need an exported alias for __SetPageReferenced()? Its callers better know what they're doing, in which case the page would not be already marked referenced. Kill init_page_accessed(), just __SetPageReferenced() inline. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/hwpoison-inject.c: remove unnecessary null test before ↵Fabian Frederick1-2/+1
debugfs_remove_recursive Fix checkpatch warning: "WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required" Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfacesRafael Aquini3-3/+4
Historically, we exported shared pages to userspace via sysinfo(2) sharedram and /proc/meminfo's "MemShared" fields. With the advent of tmpfs, from kernel v2.4 onward, that old way for accounting shared mem was deemed inaccurate and we started to export a hard-coded 0 for sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got re-introduced to /proc/meminfo re-branded as "Shmem", but we're still reporting sysinfo.sharedmem as that old hard-coded zero, which makes the "shared memory" report inconsistent across interfaces. This patch leverages the addition of explicit accounting for pages used by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in order to make the users of sysinfo(2) and si_meminfo*() friends aware of that vmstat entry and make them report it consistently across the interfaces, as well to make sysinfo(2) returned data consistent with our current API documentation states. Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: catch memory commitment underflowKonstantin Khlebnikov1-0/+5
Print a warning (if CONFIG_DEBUG_VM=y) when memory commitment becomes too negative. This shouldn't happen any more - the previous two patches fixed the committed_as underflow issues. [akpm@linux-foundation.org: use VM_WARN_ONCE, per Dave] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06shmem: update memory reservation on truncateKonstantin Khlebnikov1-0/+17
A shared anonymous mapping created without MAP_NORESERVE holds memory reservation for whole range of shmem segment. Usually there is no way to change its size, but /proc/<pid>/map_files/... (available if CONFIG_CHECKPOINT_RESTORE=y) allows that. This patch adjusts the memory reservation in shmem_setattr(). Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06shmem: fix double uncharge in __shmem_file_setup()Konstantin Khlebnikov1-6/+6
If __shmem_file_setup() fails on struct file allocation it uncharges memory commitment twice: first by shmem_unacct_size() and second time implicitly in shmem_evict_inode() when it kills the newly created inode. This patch removes shmem_unacct_size() from error path if the inode was already there. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06include/linux/mmdebug.h: add VM_WARN_ONCE()Andrew Morton1-0/+2
It was missing... Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, vmalloc: constify allocation maskDavid Rientjes1-4/+4
tmp_mask in the __vmalloc_area_node() iteration never changes so it can be moved into function scope and marked with const. This causes the movl and orl to only be done once per call rather than area->nr_pages times. nested_gfp can also be marked const. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/vmalloc.c: add a schedule point to vmalloc()Eric Dumazet1-0/+2
It is not uncommon on busy servers to get stuck hundred of ms in vmalloc() calls (like file descriptor expansions). Add a cond_resched() to __vmalloc_area_node() to be gentle to other tasks. [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David] Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: update the description for madvise_removeWang Sheng-Hui1-3/+0
Currently, we have more filesystems supporting fallocate, e.g ext4/btrfs. Remove the outdated comment for madvise_remove. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm tracing: tell mm_migrate_pages event about numa_misplacedMax Asbock1-0/+1
The mm_migrate_pages trace event reports a reason for the migration, typically as a symbolic string. The exception is the reason MR_NUMA_MISPLACED for which it just displays the numeric value: mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=0x5 This patch makes the output consistent by introducing a string value for MR_NUMA_MISPLACED. The event is then reported as: mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=numa_misplaced Signed-off-by: Max Asbock <masbock@linux.vnet.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: clean up struct scan_controlJohannes Weiner1-53/+46
Reorder the members by input and output, then turn the individual integers for may_writepage, may_unmap, may_swap, compaction_ready, hibernation_mode into bit fields to save stack space: +72/-296 -224 kswapd 104 176 +72 try_to_free_pages 80 56 -24 try_to_free_mem_cgroup_pages 80 56 -24 shrink_all_memory 88 64 -24 reclaim_clean_pages_from_list 168 144 -24 mem_cgroup_shrink_node_zone 104 80 -24 __zone_reclaim 176 152 -24 balance_pgdat 152 - -152 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: move swappiness out of scan_controlJohannes Weiner1-14/+13
Swappiness is determined for each scanned memcg individually in shrink_zone() and is not a parameter that applies throughout the reclaim scan. Move it out of struct scan_control to prevent accidental use of a stale value. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: remove all_unreclaimable()Johannes Weiner1-25/+24
Direct reclaim currently calls shrink_zones() to reclaim all members of a zonelist, and if that wasn't successful it does another pass through the same zonelist to check overall reclaimability. Just check reclaimability in shrink_zones() directly and propagate the result through the return value. Then remove all_unreclaimable(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: rework compaction-ready signaling in direct reclaimJohannes Weiner1-38/+32
Page reclaim for a higher-order page runs until compaction is ready, then aborts and signals this situation through the return value of shrink_zones(). This is an oddly specific signal to encode in the return value of shrink_zones(), though, and can be quite confusing. Introduce sc->compaction_ready and signal the compactability of the zones out-of-band to free up the return value of shrink_zones() for actual zone reclaimability. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: vmscan: remove remains of kswapd-managed zone->all_unreclaimableJohannes Weiner1-8/+0
shrink_zones() has a special branch to skip the all_unreclaimable() check during hibernation, because a frozen kswapd can't mark a zone unreclaimable. But ever since commit 6e543d5780e3 ("mm: vmscan: fix do_try_to_free_pages() livelock"), determining a zone to be unreclaimable is done by directly looking at its scan history and no longer relies on kswapd setting the per-zone flag. Remove this branch and let shrink_zones() check the reclaimability of the target zones regardless of hibernation state. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <Kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mem-hotplug: improve zone_movable_is_highmem logicWang Nan1-0/+2
In original code, zone_movable_is_highmem() assumes ZONE_MOVABLE not highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages, it extracts pages from the previous zone before ZONE_MOVABLE. Which is logically inconsistent: If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on, zone_movable_is_highmem() makes movable zone not highmem, but online_pages() extracts pages from ZONE_HIGHMEM. This inconsistency doesn't cause real problem currently, because all architectures support online_pages also have HAVE_MEMBLOCK_NODE_MAP. However, fixing it makes code clear, and also helps futher coding. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Zhen <zhangzhen@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiang Liu <liuj97@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/mem-hotplug: replace simple_strtoull() with kstrtoull()Zhang Zhen1-1/+3
Use the newer and more pleasant kstrtoull() to replace simple_strtoull(), because simple_strtoull() is marked for obsoletion. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: do not acquire page_cgroup lock for kmem pagesJohannes Weiner1-14/+7
Kmem page charging and uncharging is serialized by means of exclusive access to the page. Do not take the page_cgroup lock and don't set pc->flags atomically. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsedJohannes Weiner1-9/+0
There is a write barrier between setting pc->mem_cgroup and PageCgroupUsed, which was added to allow LRU operations to lookup the memcg LRU list of a page without acquiring the page_cgroup lock. But ever since commit 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"), pages are ensured to be off-LRU while charging, so nobody else is changing LRU state while pc->mem_cgroup is being written, and there are no read barriers anymore. Remove the unnecessary write barrier. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: use root_mem_cgroup res_counterJohannes Weiner1-108/+44
Due to an old optimization to keep expensive res_counter changes at a minimum, the root_mem_cgroup res_counter is never charged; there is no limit at that level anyway, and any statistics can be generated on demand by summing up the counters of all other cgroups. However, with per-cpu charge caches, res_counter operations do not even show up in profiles anymore, so this optimization is no longer necessary. Remove it to simplify the code. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: catch root bypass in move prechargeJohannes Weiner1-1/+8
When mem_cgroup_try_charge() returns -EINTR, it bypassed the charge to the root memcg. But move precharging does not catch this and treats this case as if no charge had happened, thus leaking a charge against root. Because of an old optimization, the root memcg's res_counter is not actually charged right now, but it's still an imbalance and subsequent patches will charge the root memcg again. Catch those bypasses to the root memcg and properly cancel them before giving up the move. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: simplify move precharge functionJohannes Weiner1-33/+15
The move precharge function does some baroque things: it tries raw res_counter charging of the entire amount first, and then falls back to a loop of one-by-one charges, with checks for pending signals and cond_resched() batching. Just use mem_cgroup_try_charge() without __GFP_WAIT for the first bulk charge attempt. In the one-by-one loop, remove the signal check (this is already checked in try_charge), and simply call cond_resched() after every charge - it's not that expensive. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: remove explicit OOM parameter in charge pathMichal Hocko1-22/+10
For the page allocator, __GFP_NORETRY implies that no OOM should be triggered, whereas memcg has an explicit parameter to disable OOM. The only callsites that want OOM disabled are THP charges and charge moving. THP already uses __GFP_NORETRY and charge moving can use it as well - one full reclaim cycle should be plenty. Switch it over, then remove the OOM parameter. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL chargesJohannes Weiner1-4/+4
There is no reason why oom-disabled and __GFP_NOFAIL charges should try to reclaim only once when every other charge tries several times before giving up. Make them all retry the same number of times. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: huge_memory: use GFP_TRANSHUGE when charging huge pagesJohannes Weiner1-3/+3
Transparent huge page charges prefer falling back to regular pages rather than spending a lot of time in direct reclaim. Desired reclaim behavior is usually declared in the gfp mask, but THP charges use GFP_KERNEL and then rely on the fact that OOM is disabled for THP charges, and that OOM-disabled charges don't retry reclaim. Needless to say, this is anything but obvious and quite error prone. Convert THP charges to use GFP_TRANSHUGE instead, which implies __GFP_NORETRY, to indicate the low-latency requirement. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: reclaim at least once for __GFP_NORETRYJohannes Weiner1-3/+3
Currently, __GFP_NORETRY tries charging once and gives up before even trying to reclaim. Bring the behavior on par with the page allocator and reclaim at least once before giving up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: rearrange charging fast pathJohannes Weiner1-16/+17
The charging path currently starts out with OOM condition checks when OOM is the rarest possible case. Rearrange this code to run OOM/task dying checks only after trying the percpu charge and the res_counter charge and bail out before entering reclaim. Attempting a charge does not hurt an (oom-)killed task as much as every charge attempt having to check OOM conditions. Also, only check __GFP_NOFAIL when the charge would actually fail. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: memcontrol: fold mem_cgroup_do_charge()Johannes Weiner1-102/+64
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 13): This function was split out because mem_cgroup_try_charge() got too big. But having essentially one sequence of operations arbitrarily split in half is not good for reworking the code. Fold it back in. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: page-flags: clean up the page flag test, set, clear macrosJohannes Weiner1-7/+14
- PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as well, analogous to PAGEFLAG. - Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG. - Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG - Make PG_mlocked accessors the same on both MMU and !MMU setups Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomicWaiman Long1-1/+1
In some architectures like x86, atomic_add() is a full memory barrier. In that case, an additional smp_mb() is just a waste of time. This patch replaces that smp_mb() by smp_mb__after_atomic() which will avoid the redundant memory barrier in some architectures. With a 3.16-rc1 based kernel, this patch reduced the execution time of breaking 1000 transparent huge pages from 38,245us to 30,964us. A reduction of 19% which is quite sizeable. It also reduces the %cpu time of the __split_huge_page_refcount function in the perf profile from 2.18% to 1.15%. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, thp: move invariant bug check out of loop in __split_huge_page_mapWaiman Long1-2/+2
In __split_huge_page_map(), the check for page_mapcount(page) is invariant within the for loop. Because of the fact that the macro is implemented using atomic_read(), the redundant check cannot be optimized away by the compiler leading to unnecessary read to the page structure. This patch moves the invariant bug check out of the loop so that it will be done only once. On a 3.16-rc1 based kernel, the execution time of a microbenchmark that broke up 1000 transparent huge pages using munmap() had an execution time of 38,245us and 38,548us with and without the patch respectively. The performance gain is about 1%. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, CMA: clean-up log messageJoonsoo Kim1-2/+2
We don't need explicit 'CMA:' prefix, since we already define prefix 'cma:' in pr_fmt. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, CMA: change cma_declare_contiguous() to obey coding conventionJoonsoo Kim4-10/+11
Conventionally, we put output param to the end of param list and put the 'base' ahead of 'size', but cma_declare_contiguous() doesn't look like that, so change it. Additionally, move down cma_areas reference code to the position where it is really needed. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm, CMA: clean-up CMA allocation error pathJoonsoo Kim1-3/+4
We can remove one call sites for clear_cma_bitmap() if we first call it before checking error number. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06PPC, KVM, CMA: use general CMA reserved area management frameworkJoonsoo Kim5-277/+14
Now, we have general CMA reserved area management framework, so use it for future maintainabilty. There is no functional change. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06CMA: generalize CMA reserved area management functionalityJoonsoo Kim8-291/+383
Currently, there are two users on CMA functionality, one is the DMA subsystem and the other is the KVM on powerpc. They have their own code to manage CMA reserved area even if they looks really similar. From my guess, it is caused by some needs on bitmap management. KVM side wants to maintain bitmap not for 1 page, but for more size. Eventually it use bitmap where one bit represents 64 pages. When I implement CMA related patches, I should change those two places to apply my change and it seem to be painful to me. I want to change this situation and reduce future code management overhead through this patch. This change could also help developer who want to use CMA in their new feature development, since they can use CMA easily without copying & pasting this reserved area management code. In previous patches, we have prepared some features to generalize CMA reserved area management and now it's time to do it. This patch moves core functions to mm/cma.c and change DMA APIs to use these functions. There is no functional change in DMA APIs. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06DMA, CMA: support arbitrary bitmap granularityJoonsoo Kim1-24/+53
PPC KVM's CMA area management requires arbitrary bitmap granularity, since they want to reserve very large memory and manage this region with bitmap that one bit for several pages to reduce management overheads. So support arbitrary bitmap granularity for following generalization. [akpm@linux-foundation.org: s/1/1UL/] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06DMA, CMA: support alignment constraint on CMA regionJoonsoo Kim1-8/+18
PPC KVM's CMA area management needs alignment constraint on CMA region. So support it to prepare generalization of CMA area management functionality. Additionally, add some comments which tell us why alignment constraint is needed on CMA region. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06DMA, CMA: separate core CMA management codes from DMA APIsJoonsoo Kim1-48/+77
To prepare future generalization work on CMA area management code, we need to separate core CMA management codes from DMA APIs. We will extend these core functions to cover requirements of PPC KVM's CMA area management functionality in following patches. This separation helps us not to touch DMA APIs while extending core functions. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/internal.h: use nth_pageFabian Frederick1-1/+1
Use nth_page instead of pfn_to_page(page_to_pfn Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm: page_alloc: simplify drain_zone_pages by using min()Michal Nazarewicz1-6/+2
Instead of open-coding getting minimal value of two, just use min macro. That is why it is there for. While changing the function also change type of batch local variable to match type of per_cpu_pages::batch (which is int). Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mem-hotplug: introduce MMOP_OFFLINE to replace the hard coding -1Tang Chen3-16/+20
In store_mem_state(), we have: ... 334 else if (!strncmp(buf, "offline", min_t(int, count, 7))) 335 online_type = -1; ... 355 case -1: 356 ret = device_offline(&mem->dev); 357 break; ... Here, "offline" is hard coded as -1. This patch does the following renaming: ONLINE_KEEP -> MMOP_ONLINE_KEEP ONLINE_KERNEL -> MMOP_ONLINE_KERNEL ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE and introduces MMOP_OFFLINE = -1 to avoid hard coding. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Hu Tao <hutao@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mem-hotplug: avoid illegal state prefixed with legal state when changing ↵Tang Chen1-4/+4
state of memory_block We use the following command to online a memory_block: echo online|online_kernel|online_movable > /sys/devices/system/memory/memoryXXX/state But, if we do the following: echo online_fhsjkghfkd > /sys/devices/system/memory/memoryXXX/state the block will also be onlined. This is because the following code in store_mem_state() does not compare the whole string, but only the prefix of the string. store_mem_state() { ...... 328 if (!strncmp(buf, "online_kernel", min_t(int, count, 13))) Here, only compare the first 13 letters of the string. If we give "online_kernelXXXXXX", it will be recognized as online_kernel, which is incorrect. 329 online_type = ONLINE_KERNEL; 330 else if (!strncmp(buf, "online_movable", min_t(int, count, 14))) We have the same problem here, 331 online_type = ONLINE_MOVABLE; 332 else if (!strncmp(buf, "online", min_t(int, count, 6))) here, (Here is more problematic. If we give online_movalbe, which is a typo of online_movable, it will be recognized as online without noticing the author.) 333 online_type = ONLINE_KEEP; 334 else if (!strncmp(buf, "offline", min_t(int, count, 7))) and here. 335 online_type = -1; 336 else { 337 ret = -EINVAL; 338 goto err; 339 } ...... } This patch fixes this problem by using sysfs_streq() to compare the whole string. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reported-by: Hu Tao <hutao@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()Hugh Dickins1-1/+1
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte upon which all subsequent decisions and pte_same() tests will be made. I have no evidence that its lack is responsible for the mm/filemap.c:202 BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity, and I am not optimistic that it will fix it. But I have found no other explanation, and ACCESS_ONCE() here will surely not hurt. If gcc does re-access the pte before passing it down, then that would be disastrous for correct page fault handling, and certainly could explain the page_mapped() BUGs seen (concurrent fault causing page to be mapped in a second time on top of itself: mapcount 2 for a single pte). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06vmalloc: use rcu list iterator to reduce vmap_area_lock contentionJoonsoo Kim1-3/+3
Richard Yao reported a month ago that his system have a trouble with vmap_area_lock contention during performance analysis by /proc/meminfo. Andrew asked why his analysis checks /proc/meminfo stressfully, but he didn't answer it. https://lkml.org/lkml/2014/4/10/416 Although I'm not sure that this is right usage or not, there is a solution reducing vmap_area_lock contention with no side-effect. That is just to use rcu list iterator in get_vmalloc_info(). rcu can be used in this function because all RCU protocol is already respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite vmap layer") back in linux-2.6.28 Specifically : insertions use list_add_rcu(), deletions use list_del_rcu() and kfree_rcu(). Note the rb tree is not used from rcu reader (it would not be safe), only the vmap_area_list has full RCU protection. Note that __purge_vmap_area_lazy() already uses this rcu protection. rcu_read_lock(); list_for_each_entry_rcu(va, &vmap_area_list, list) { if (va->flags & VM_LAZY_FREE) { if (va->va_start < *start) *start = va->va_start; if (va->va_end > *end) *end = va->va_end; nr += (va->va_end - va->va_start) >> PAGE_SHIFT; list_add_tail(&va->purge_list, &valist); va->flags |= VM_LAZY_FREEING; va->flags &= ~VM_LAZY_FREE; } } rcu_read_unlock(); Peter: : While rcu list traversal over the vmap_area_list is safe, this may : arrive at different results than the spinlocked version. The rcu list : traversal version will not be a 'snapshot' of a single, valid instant : of the entire vmap_area_list, but rather a potential amalgam of : different list states. Joonsoo: : Yes, you are right, but I don't think that we should be strict here. : Meminfo is already not a 'snapshot' at specific time. While we try to get : certain stats, the other stats can change. And, although we may arrive at : different results than the spinlocked version, the difference would not be : large and would not make serious side-effect. [edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06include/linux/memblock.h: add __init to memblock_set_bottom_up()Fabian Frederick1-2/+2
memblock_set_bottom_up() is only called by __init cmdline_parse_movable_node() and __init numa_init(). Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/page_alloc.c: unexport alloc_pages_exact_nid()Andrew Morton1-1/+0
It is only called by mm/page_cgroup.c whcih cannot be modular. Reported-by: David Rientjes <rientjes@google.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/page_alloc.c: add __meminit to alloc_pages_exact_nid()Fabian Frederick2-2/+2
alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup() Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06mm/memory_hotplug.c: add __meminit to grow_zone_span/grow_pgdat_spanFabian Frederick1-4/+4
grow_zone_span and grow_pgdat_span are only called by __meminit __add_zone Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Toshi Kani <toshi.kani@hp.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>