aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2015-04-15Merge branches 'cve-fixup', 'ipoib', 'iser', 'misc-4.1', 'or-mlx4' and 'srp' ↵rdma-for-linusfor-nextDoug Ledford27-1132/+1508
into for-4.1
2015-04-15mlx5: wrong page mask if CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for 32Bit ↵Honggang LI1-4/+6
architectures If CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for x86 systems and physical memory is more than 4GB, dma_map_page may return a valid memory address which greater than 0xffffffff. As a result, the mlx5 device page allocator RB tree will be initialized with valid addresses greater than 0xfffffff. However, (addr & PAGE_MASK) set the high four bytes to zeros. So, it's impossible for the function, free_4k, to release the pages whose addresses greater than 4GB. Memory leaks. And mlx5_ib module can't release the pages when user try to remove the module, as a result, system hang. [root@rdma05 root]# dmesg | grep addr | head addr = 3fe384000 addr & PAGE_MASK = fe384000 [root@rdma05 root]# rmmod mlx5_ib <---- hang on ---------------------- cosnole log ----------------- mlx5_ib 0000:04:00.0: irq 138 for MSI/MSI-X alloc irq_desc for 139 on node -1 alloc kstat_irqs on node -1 mlx5_ib 0000:04:00.0: irq 139 for MSI/MSI-X 0000:04:00.0:free_4k:221:(pid 1519): page not found 0000:04:00.0:free_4k:221:(pid 1519): page not found 0000:04:00.0:free_4k:221:(pid 1519): page not found 0000:04:00.0:free_4k:221:(pid 1519): page not found ---------------------- cosnole log ----------------- Fixes: bf0bf77f6519 ('mlx5: Support communicating arbitrary host page size to firmware') Signed-off-by: Honggang Li <honli@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Rewrite bounce buffer code pathSagi Grimberg3-88/+137
In some rare cases, IO operations may be not aligned to page boundaries. This prevents iser from performing fast memory registration. In order to overcome that iser uses a bounce buffer to carry the transaction. We basically allocate a buffer in the size of the transaction and perform a copy. The buffer allocation using kmalloc is too restrictive since it requires higher order (atomic) allocations for large transactions (which may result in memory exhaustion fairly fast for some workloads). We rewrite the bounce buffer code path to allocate scattered pages and perform a copy between the transaction sg and the bounce sg. Reported-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Bump version to 1.6Sagi Grimberg1-1/+1
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Remove code duplication for a single DMA entrySagi Grimberg1-27/+21
In singleton scatterlists, DMA memory registration code is taken both for Fastreg and FMR code paths. Move it to a function. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Pass struct iser_mem_reg to iser_fast_reg_mr and iser_reg_sig_mrSagi Grimberg1-43/+32
Instead of passing ib_sge as output variable, we pass the mem_reg pointer to have the routines fill the rkey as well. This reduces code duplication and extra assignments. This is a preparation step to unify some registration logics together. Also, pass iser_fast_reg_mr the fastreg descriptor directly. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Modify struct iser_mem_reg membersSagi Grimberg3-33/+29
No need to keep lkey, va, len variables, we can keep them as struct ib_sge. This will help when we change the memory registration logic. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Make fastreg pool cache friendlySagi Grimberg1-1/+1
Memory regions are resources that are saved in the device caches. Increase the probability for a cache hit by adding the MRU descriptor to pool head. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Move PI context alloc/free to routinesSagi Grimberg1-55/+63
Make iser_[create|destroy]_fastreg_desc shorter, more readable and easily extendable. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Move fastreg descriptor pool get/put to helper functionsSagi Grimberg2-18/+37
Instead of open-coding connection fastreg pool get/put, we introduce iser_reg_desc[get|put] helpers. We aren't setting these static as this will be a per-device routine later on. Also, cleanup iser_unreg_rdma_mem_fastreg a bit. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Merge build page-vec into register page-vecSagi Grimberg1-58/+33
No need for these two separate. Keep it in a single routine like in the fastreg case. This will also make iser_reg_page_vec closer to iser_fast_reg_mr arguments. This is a preparation step for registration flow refactor. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Get rid of struct iser_rdma_regdSagi Grimberg4-70/+53
This struct members other than struct iser_mem_reg are unused, so remove it altogether. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Remove redundant assignments in iser_reg_page_vecSagi Grimberg1-5/+2
Buffer length was assigned twice, and no reason to set va to io_addr and then add the offset, just set va to io_addr + offset. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Move memory reg/dereg routines to iser_memory.cSagi Grimberg3-91/+88
As memory registration/de-registration methods, lets move them to their natural location. While we're at it, make iser_reg_page_vec routine static. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Don't pass ib_device to fall_to_bounce_buff routineSagi Grimberg1-6/+6
No need to pass that, we can take it from the task. In a later stage, this function will be invoked according to a device capability. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Remove a redundant struct iser_data_bufSagi Grimberg3-52/+34
No need to keep two iser_data_buf structures just in case we use mem copy. We can avoid that just by adding a pointer to the original sg. So keep only two iser_data_buf per command (data and protection) and pass the relevant data_buf to bounce buffer routine. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Remove redundant cmd_data_len calculationSagi Grimberg1-4/+1
This code was added before we had protection data length calculation (in iser_send_command), so we needed to calc the sg data length from the sg itself. This is not needed anymore. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Adir Lev <adirl@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Fix wrong calculation of protection buffer lengthSagi Grimberg1-2/+2
This length miss-calculation may cause a silent data corruption in the DIX case and cause the device to reference unmapped area. Fixes: d77e65350f2d ('libiscsi, iser: Adjust data_length to include protection information') Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Handle fastreg/local_inv completion errorsSagi Grimberg1-5/+6
Fast registration and local invalidate work requests can also fail. We should call error completion handler for them. Reported-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/iser: Fix unload during ep_poll wrong dereferenceSagi Grimberg1-1/+1
In case the user unloaded ib_iser while ep_connect is in progress, we need to destroy the endpoint although ep_disconnect wasn't invoked (we detect this by the iser conn state != DOWN). However, if we got an REJECTED/UNREACHABLE CM event we move the connection state to DOWN which will prevent us from destroying the endpoint in the module unload stage. Fix this by setting the connection state to TERMINATING in iser_conn_error so we can still destroy the endpoint at unload stage. Reported-by: Ariel Nahum <arieln@mellanox.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15ib_srpt: convert printk's to pr_* functionsDoug Ledford1-97/+91
The driver already defined the pr_format, it just hadn't been converted to use pr_info, pr_warn, and pr_err instead of the equivalent printks. Convert so that messages from the driver are now properly tagged with their driver name and can be more easily debugged. In addition, a number of these printk's were not newline terminated, so fix that at the same time. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/srp: Use P_Key cache for P_Key lookupsBart Van Assche1-4/+5
This change slightly reduces the time needed to log in. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: David Dillow <dave@thedillows.org> Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15infiniband/mlx4: check for mapping errorSebastian Ott2-0/+13
Since ib_dma_map_single can fail use ib_dma_mapping_error to check for errors. Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15MAINTAINERS: Adding list of maintainers for ocrdmaSelvin Xavier1-0/+9
Updating the MAINTAINERS file with ocrdma maintainers and their email ids Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15rdma: replace deprecated ifconfig in docStephen Hemminger1-3/+6
The ifconfig command has been deprecated for many years. To encourage new users not to continue using it and learning iproute2; the ifconfig should not be used in examples. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15ib_uverbs: Fix pages leak when using XRC SRQsSébastien Dugué1-11/+11
Hello, When an application using XRCs abruptly terminates, the mmaped pages of the CQ buffers are leaked. This comes from the fact that when resources are released in ib_uverbs_cleanup_ucontext(), we fail to release the CQs because their refcount is not 0. When creating an XRC SRQ, we increment the associated CQ refcount. This refcount is only decremented when the SRQ is released. Therefore we need to release the SRQs prior to the CQs to make sure that all references to the CQs are gone before trying to release these. Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Fix WQE LSO segment calculationErez Shitrit1-2/+1
The current code decreases from the mss size (which is the gso_size from the kernel skb) the size of the packet headers. It shouldn't do that because the mss that comes from the stack (e.g IPoIB) includes only the tcp payload without the headers. The result is indication to the HW that each packet that the HW sends is smaller than what it could be, and too many packets will be sent for big messages. An easy way to demonstrate one more aspect of the problem is by configuring the ipoib mtu to be less than 2*hlen (2*56) and then run app sending big TCP messages. This will tell the HW to send packets with giant (negative value which under unsigned arithmetics becomes a huge positive one) length and the QP moves to SQE state. Fixes: b832be1e4007 ('IB/mlx4: Add IPoIB LSO support') Reported-by: Matthew Finlay <matt@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Remove IPOIB_MCAST_RUN bitErez Shitrit2-5/+2
After Doug Ledford's changes there is no need in that bit, it's semantic becomes subset of the IPOIB_FLAG_OPER_UP bit. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Save only IPOIB_MAX_PATH_REC_QUEUE skb'sErez Shitrit1-3/+10
Whenever there is no path->ah to the destination, keep only defined number of skb's. Otherwise there are cases that the driver can keep infinite list of skb's. For example, when one device want to send unicast arp to the destination, and from some reason the SM doesn't respond, the driver currently keeps all the skb's. If that unicast arp traffic stopped, all these skb's are kept by the path object till the interface is down. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Handle QP in SQE stateErez Shitrit2-1/+63
As the result of a completion error the QP can moved to SQE state by the hardware. Since it's not the Error state, there are no flushes and hence the driver doesn't know about that. The fix creates a task that after completion with error which is not a flush tracks the QP state and if it is in SQE state moves it back to RTS. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Update broadcast record values after each successful join requestErez Shitrit1-1/+17
Update the cached broadcast record in the priv object after every new join of this broadcast domain group. These values are needed for the port configuration (MTU size) and to all the new multicast (non-broadcast) join requests initial parameters. For example, SM starts with 2K MTU for all the fabric, and after that it restarts (or handover to new SM) with new port configuration of 4K MTU. Without using the new values, the driver will keep its old configuration of 2K and will not apply the new configuration of 4K. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Use one linear skb in RX flowErez Shitrit3-72/+13
The current code in the RX flow uses two sg entries for each incoming packet, the first one was for the IB headers and the second for the rest of the data, that causes two dma map/unmap and two allocations, and few more actions that were done at the data path. Use only one linear skb on each incoming packet, for the data (IB headers and payload), that reduces the packet processing in the data-path (only one skb, no frags, the first frag was not used anyway, less memory allocations) and the dma handling (only one dma map/unmap over each incoming packet instead of two map/unmap per each incoming packet). After commit 73d3fe6d1c6d ("gro: fix aggregation for skb using frag_list") from Eric Dumazet, we will get full aggregation for large packets. When running bandwidth tests before and after the (over the card's numa node), using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results before are ~12Gbs before and after ~16Gbs on my setup (Mellanox's ConnectX3). Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: drop mcast_mutex usageDoug Ledford1-38/+32
We needed the mcast_mutex when we had to prevent the join completion callback from having the value it stored in mcast->mc overwritten by a delayed return from ib_sa_join_multicast. By storing the return of ib_sa_join_multicast in an intermediate variable, we prevent a delayed return from ib_sa_join_multicast overwriting the valid contents of mcast->mc, and we no longer need a mutex to force the join callback to run after the return of ib_sa_join_multicast. This allows us to do away with the mutex entirely and protect our critical sections with a just a spinlock instead. This is highly desirable as there were some places where we couldn't use a mutex because the code was not allowed to sleep, and so we were currently using a mix of mutex and spinlock to protect what we needed to protect. Now we only have a spin lock and the locking complexity is greatly reduced. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: deserialize multicast joinsDoug Ledford1-168/+82
Allow the ipoib layer to attempt to join all outstanding multicast groups at once. The ib_sa layer will serialize multiple attempts to join the same group, but will process attempts to join different groups in parallel. Take advantage of that. In order to make this happen, change the mcast_join_thread to loop through all needed joins, sending a join request for each one that we still need to join. There are a few special cases we handle though: 1) Don't attempt to join anything but the broadcast group until the join of the broadcast group has succeeded. 2) No longer restart the join task at the end of completion handling. If we completed successfully, we are done. The join task now needs kicked either by mcast_send or mcast_restart_task or mcast_start_thread, but should not need started anytime else except when scheduling a backoff attempt to rejoin. 3) No longer use separate join/completion routines for regular and sendonly joins, pass them all through the same routine and just do the right thing based on the SENDONLY join flag. 4) Only try to join a SENDONLY join twice, then drop the packets and quit trying. We leave the mcast group in the list so that if we get a new packet, all that we have to do is queue up the packet and restart the join task and it will automatically try to join twice and then either send or flush the queue again. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: fix MCAST_FLAG_BUSY usageDoug Ledford2-128/+238
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. And when we did complete it, sometimes we continued to touch the mcast entry after the completion, opening us up to possible use after free issues. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 1) Remove the MCAST_JOIN_STARTED flag entirely 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight 3) Test mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL) 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct 5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush. 7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that we have a running task. We know when we need to reschedule our join task thread and don't need a flag to tell us. 9) Add a helper for rescheduling the join task thread A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this: Thread 1 Thread 2 ib_sa_join_multicast (as part of running restart mcast task) alloc member call callback ifconfig ib0 down wait_for_completion callback call completes wait_for_completion in mcast_dev_flush completes mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast return from callback return from ib_sa_join_multicast set mcast->mc = return from ib_sa_multicast We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this: Thread 1 Thread 2 Thread 3 ib_sa_multicast_join ifconfig ib0 down priv->broadcast = NULL join_complete wait_for_completion mcast->mc is not yet set, so don't clear return from ib_sa_join_multicast and set mcast->mc complete return -EAGAIN (making mcast->mc invalid) call ib_sa_multicast_leave on invalid mcast->mc, hang forever By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: No longer use flush as a parameterDoug Ledford4-30/+39
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Use dedicated workqueues per interfaceDoug Ledford6-38/+66
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue means that we would also have to prevent all races between different devices (for instance, ipoib_mcast_restart_task on interface ib0 can race with ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on the rtnl_lock). There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Make the carrier_on_task race awareDoug Ledford1-8/+17
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_OPER_UP has been cleared, in which case we don't need to do anything anyway and we just return. While investigating which flag should be used, FLAG_ADMIN_UP or FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more appropriate flag to use. However, there was a mix of these two flags in use in the existing code. So while we check for that flag here as part of this race fix, also cleanup the two places that had used the less appropriate flag for their tests. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: Consolidate rtnl_lock tasks in workqueueDoug Ledford1-6/+2
The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and needs to keep it held. It also needs to call flush_workqueue() to flush out any outstanding work. In the past, we've had to try and make sure that we didn't flush out any outstanding join completions because they also wanted to grab rtnl_lock() and that would deadlock. It turns out that the only thing in the join completion handler that needs this lock can be safely moved to our carrier_on_task, thereby reducing the potential for the join completion code and the flush code to deadlock against each other. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: change init sequence orderingDoug Ledford1-7/+17
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/ipoib: factor out ah flushingDoug Ledford1-18/+28
Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at appropriate times to flush out all remaining ah entries before we shut the device down. Because neighbors and mcast entries can each have a reference on any given ah, we must make sure to free all of those first before our ah will actually have a 0 refcount and be able to be reaped. This factoring is needed in preparation for having per-device work queues. The original per-device workqueue code resulted in the following error message: <ibdev>: ib_dealloc_pd failed That error was tracked down to this issue. With the changes to which workqueues were flushed when, there were no flushes of the per device workqueue after the last ah's were freed, resulting in an attempt to dealloc the pd with outstanding resources still allocated. This code puts the explicit flushes in the needed places to avoid that problem. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/core: don't disallow registering region starting at 0x0Yann Droneaud1-2/+2
In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic") will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). There's two possible overflows, in (addr + size) and in PAGE_ALIGN(addr + size), this patch keep ensuring none of them happen while allowing to pin memory at address 0x0. Anyway, the case of size equal 0 is no more (partially) handled as 0-length memory region are disallowed by an earlier check. Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com Cc: <stable@vger.kernel.org> # 8494057ab5e4 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic") Cc: Shachar Raindel <raindel@mellanox.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/core: disallow registering 0-sized memory regionYann Droneaud1-0/+3
If ib_umem_get() is called with a size equal to 0 and an non-page aligned address, one page will be pinned and a 0-sized umem will be returned to the caller. This should not be allowed: it's not expected for a memory region to have a size equal to 0. This patch adds a check to explicitly refuse to register a 0-sized region. Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com Cc: <stable@vger.kernel.org> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Jack Morgenstein <jackm@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Change alias guids default to be host assignedYishai Hadas1-2/+2
Change the default mode to be HOST assigned instead of SM assigned. This is the expected operational mode, because it doesn't depend on SM availability. As PF generates random GUIDs as the initial admin values, this gives out of the box experience. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15net/mlx4_core: Return the admin alias GUID upon host view requestYishai Hadas1-14/+27
Return the admin alias GUID value upon a GET request via HOST. We do this so that the GUID value requested by the admin is returned even if the SM has not yet approved this GUID (e.g. the SM is down). Note that this does not create a problem, since the virtual port will remain down until the SM does ACK the requested GUID value. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15net/mlx4_core: Raise slave shutdown event upon FLRYishai Hadas1-0/+2
There might be cases that PF doesn't get a "reset" command upon slave down (e.g. virsh destroy). In these cases, however, an FLR event is issued. Therefore, when the PF receives an FLR event for a slave, it should also generate a shutdown event on the PF for that slave, to let the PF upper layers (mlx4_ib, eth) perform any required cleanup/actions associated with slave shutdown. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Request alias GUID on demandYishai Hadas3-0/+75
Request GIDs from the SM on demand, i.e., when a VF actually needs them, and release them when the GIDs are no longer in use. In cloud environments, this is useful for GID migrations, in which a GID is assigned to a VF on the destination HCA, while the VF on the source HCA is shutdown (but the GID was not administratively released). Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Change init flow to request alias GUIDs for active VFsYishai Hadas3-79/+50
Change the init flow to ask GUIDs only for active VFs. This is done for both SM & HOST modes so that there is no need any more to maintain the ownership record type. In case SM mode is used, the initial value will be 0, ask the SM to assign, for the HOST mode the initial value will be the HOST generated GUID. This will enable out of the box experience for both probed and attached VFs. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Manage admin alias GUID upon admin requestYishai Hadas2-9/+15
Set the admin alias GUID per the administrator's request via the sysfs mechanism into the core layer. The "get" request returns the current value. However, if the administrator requests the SM to assign a new value by requesting 0, the SM assigned GUID is returned. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15net/mlx4_core: Set initial admin GUIDs for VFsYishai Hadas3-0/+17
To have out of the box experience, the PF generates random GUIDs who serve as the initial admin values. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15net/mlx4_core: Manage alias GUID per VFYishai Hadas3-0/+20
Manages alias GUIDs per VF per port in the core layer. This is a pre-step for managing alias GUIDs in a mode that the admin GUID is returned via ib_query_gid() regardless of whether the SM has approved it or not. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15IB/mlx4: Alias GUID adding persistency supportYishai Hadas3-84/+245
If the SM rejects an alias GUID request the PF driver keeps trying to acquire the specified GUID indefinitely, utilizing an exponential backoff scheme. Retrying is managed per GUID entry. Each entry that wasn't applied holds its next retry information. Retry requests to the SM consist of records of 8 consecutive GUIDS. Each record that contains GUIDs requiring retries holds its next time-to-run based on the retry information of all its GUID entries. The record having the lowest retry time will run first when that retry time arrives. Since the method (SET or DELETE) as sent to the SM applies to all the GUIDs in the record, we must handle SET requests and DELETE requests in separate SM messages (one for SETs and the other for DELETEs). To avoid race conditions where a GUID entry request (set or delete) was modified after the SM request was sent, we save the method and the requested indices as part of the callback's context -- thus, only the requested indexes are evaluated when the response is received. When an GUID entry is approved we turn off its retry-required bit, this prevents redundant SM retries from occurring on that record. The port down event should be sent only when previously it was up. Likewise, the port up event should be sent only if previously the port was down. Synchronization was added around the flows that change entries and record state to prevent race conditions. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15Merge branch 'kconfig' of ↵Linus Torvalds15-198/+174
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig updates from Michal Marek: "Here is the kconfig stuff for v4.1-rc1: - fixes for mergeconfig (used by make kvmconfig/tinyconfig) - header cleanup - make -s *config is silent now" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kconfig: Do not print status messages in make -s mode kconfig: Simplify Makefile kbuild: add generic mergeconfig target, %.config merge_config.sh: rename MAKE to RUNMAKE merge_config.sh: improve indentation kbuild: mergeconfig: remove redundant $(objtree) kbuild: mergeconfig: move an error check to merge_config.sh kbuild: mergeconfig: fix "jobserver unavailable" warning kconfig: Remove unnecessary prototypes from headers kconfig: Remove dead code kconfig: Get rid of the P() macro in headers kconfig: fix a misspelling in scripts/kconfig/merge_config.sh
2015-04-15Merge branch 'kbuild' of ↵Linus Torvalds11-48/+50
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: "Here is the first round of kbuild changes for v4.1-rc1: - kallsyms fix for ARM and cleanup - make dep(end) removed (developers have no sense of nostalgia these days...) - include Makefiles by relative path - stop useless rebuilds of asm-offsets.h and bounds.h" * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: Kbuild: kallsyms: drop special handling of pre-3.0 GCC symbols Kbuild: kallsyms: ignore veneers emitted by the ARM linker kbuild: ia64: use $(src)/Makefile.gate rather than particular path kbuild: include $(src)/Makefile rather than $(obj)/Makefile kbuild: use relative path more to include Makefile kbuild: use relative path to include Makefile kbuild: do not add $(bounds-file) and $(offsets-file) to targets kbuild: remove warning about "make depend" kbuild: Don't reset timestamps in include/generated if not needed
2015-04-15Merge branch 'next' of ↵Linus Torvalds31-1156/+1953
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights for this window: - improved AVC hashing for SELinux by John Brooks and Stephen Smalley - addition of an unconfined label to Smack - Smack documentation update - TPM driver updates" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (28 commits) lsm: copy comm before calling audit_log to avoid race in string printing tomoyo: Do not generate empty policy files tomoyo: Use if_changed when generating builtin-policy.h tomoyo: Use bin2c to generate builtin-policy.h selinux: increase avtab max buckets selinux: Use a better hash function for avtab selinux: convert avtab hash table to flex_array selinux: reconcile security_netlbl_secattr_to_sid() and mls_import_netlbl_cat() selinux: remove unnecessary pointer reassignment Smack: Updates for Smack documentation tpm/st33zp24/spi: Add missing device table for spi phy. tpm/st33zp24: Add proper wait for ordinal duration in case of irq mode smack: Fix gcc warning from unused smack_syslog_lock mutex in smackfs.c Smack: Allow an unconfined label in bringup mode Smack: getting the Smack security context of keys Smack: Assign smack_known_web as default smk_in label for kernel thread's socket tpm/tpm_infineon: Use struct dev_pm_ops for power management MAINTAINERS: Add Jason as designated reviewer for TPM tpm: Update KConfig text to include TPM2.0 FIFO chips tpm/st33zp24/dts/st33zp24-spi: Add dts documentation for st33zp24 spi phy ...
2015-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds168-2202/+18223
Pull crypto update from Herbert Xu: "Here is the crypto update for 4.1: New interfaces: - user-space interface for AEAD - user-space interface for RNG (i.e., pseudo RNG) New hashes: - ARMv8 SHA1/256 - ARMv8 AES - ARMv8 GHASH - ARM assembler and NEON SHA256 - MIPS OCTEON SHA1/256/512 - MIPS img-hash SHA1/256 and MD5 - Power 8 VMX AES/CBC/CTR/GHASH - PPC assembler AES, SHA1/256 and MD5 - Broadcom IPROC RNG driver Cleanups/fixes: - prevent internal helper algos from being exposed to user-space - merge common code from assembly/C SHA implementations - misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (169 commits) crypto: arm - workaround for building with old binutils crypto: arm/sha256 - avoid sha256 code on ARMv7-M crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer crypto: x86/sha256_ssse3 - move SHA-224/256 SSSE3 implementation to base layer crypto: x86/sha1_ssse3 - move SHA-1 SSSE3 implementation to base layer crypto: arm64/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer crypto: arm64/sha1-ce - move SHA-1 ARMv8 implementation to base layer crypto: arm/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer crypto: arm/sha256 - move SHA-224/256 ASM/NEON implementation to base layer crypto: arm/sha1-ce - move SHA-1 ARMv8 implementation to base layer crypto: arm/sha1_neon - move SHA-1 NEON implementation to base layer crypto: arm/sha1 - move SHA-1 ARM asm implementation to base layer crypto: sha512-generic - move to generic glue implementation crypto: sha256-generic - move to generic glue implementation crypto: sha1-generic - move to generic glue implementation crypto: sha512 - implement base layer for SHA-512 crypto: sha256 - implement base layer for SHA-256 crypto: sha1 - implement base layer for SHA-1 crypto: api - remove instance when test failed crypto: api - Move alg ref count init to crypto_check_alg ...
2015-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1451-32244/+55301
Pull networking updates from David Miller: 1) Add BQL support to via-rhine, from Tino Reichardt. 2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers can support hw switch offloading. From Floria Fainelli. 3) Allow 'ip address' commands to initiate multicast group join/leave, from Madhu Challa. 4) Many ipv4 FIB lookup optimizations from Alexander Duyck. 5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel Borkmann. 6) Remove the ugly compat support in ARP for ugly layers like ax25, rose, etc. And use this to clean up the neigh layer, then use it to implement MPLS support. All from Eric Biederman. 7) Support L3 forwarding offloading in switches, from Scott Feldman. 8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed up route lookups even further. From Alexander Duyck. 9) Many improvements and bug fixes to the rhashtable implementation, from Herbert Xu and Thomas Graf. In particular, in the case where an rhashtable user bulk adds a large number of items into an empty table, we expand the table much more sanely. 10) Don't make the tcp_metrics hash table per-namespace, from Eric Biederman. 11) Extend EBPF to access SKB fields, from Alexei Starovoitov. 12) Split out new connection request sockets so that they can be established in the main hash table. Much less false sharing since hash lookups go direct to the request sockets instead of having to go first to the listener then to the request socks hashed underneath. From Eric Dumazet. 13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk. 14) Support stable privacy address generation for RFC7217 in IPV6. From Hannes Frederic Sowa. 15) Hash network namespace into IP frag IDs, also from Hannes Frederic Sowa. 16) Convert PTP get/set methods to use 64-bit time, from Richard Cochran. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits) fm10k: Bump driver version to 0.15.2 fm10k: corrected VF multicast update fm10k: mbx_update_max_size does not drop all oversized messages fm10k: reset head instead of calling update_max_size fm10k: renamed mbx_tx_dropped to mbx_tx_oversized fm10k: update xcast mode before synchronizing multicast addresses fm10k: start service timer on probe fm10k: fix function header comment fm10k: comment next_vf_mbx flow fm10k: don't handle mailbox events in iov_event path and always process mailbox fm10k: use separate workqueue for fm10k driver fm10k: Set PF queues to unlimited bandwidth during virtualization fm10k: expose tx_timeout_count as an ethtool stat fm10k: only increment tx_timeout_count in Tx hang path fm10k: remove extraneous "Reset interface" message fm10k: separate PF only stats so that VF does not display them fm10k: use hw->mac.max_queues for stats fm10k: only show actual queues, not the maximum in hardware fm10k: allow creation of VLAN on default vid fm10k: fix unused warnings ...
2015-04-14Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds93-604/+2429
Pull ARM updates from Russell King: "Included in this update are both some long term fixes and some new features. Fixes: - An integer overflow in the calculation of ELF_ET_DYN_BASE. - Avoiding OOMs for high-order IOMMU allocations - SMP requires the data cache to be enabled for synchronisation primitives to work, so prevent the CPU_DCACHE_DISABLE option being visible on SMP builds. - A bug going back 10+ years in the noMMU ARM94* CPU support code, where it corrupts registers. Found by folk getting Linux running on their cameras. - Versatile Express needs an errata workaround enabled for CPU hot-unplug to work. Features: - Clean up module linker by handling out of range relocations separately from relocation cases we don't handle. - Fix a long term bug in the pci_mmap_page_range() code, which we hope won't impact userspace (we hope there's no users of the existing broken interface.) - Don't map DMA coherent allocations when we don't have a MMU. - Drop experimental status for SMP_ON_UP. - Warn when DT doesn't specify ePAPR mandatory cache properties. - Add documentation concerning how we find the start of physical memory for AUTO_ZRELADDR kernels, detailing why we have chosen the mask and the implications of changing it. - Updates from Ard Biesheuvel to address some issues with large kernels (such as allyesconfig) failing to link. - Allow hibernation to work on modern (ARMv7) CPUs - this appears to have never worked in the past on these CPUs. - Enable IRQ_SHOW_LEVEL, which changes the /proc/interrupts output format (hopefully without userspace breaking... let's hope that if it causes someone a problem, they tell us.) - Fix tegra-ahb DT offsets. - Rework ARM errata 643719 code (and ARMv7 flush_cache_louis()/ flush_dcache_all()) code to be more efficient, and enable this errata workaround by default for ARMv7+SMP CPUs. This complements the Versatile Express fix above. - Rework ARMv7 context code for errata 430973, so that only Cortex A8 CPUs are impacted by the branch target buffer flush when this errata is enabled. Also update the help text to indicate that all r1p* A8 CPUs are impacted. - Switch ARM to the generic show_mem() implementation, it conveys all the information which we were already reporting. - Prevent slow timer sources being used for udelay() - timers running at less than 1MHz are not useful for this, and can cause udelay() to return immediately, without any wait. Using such a slow timer is silly. - VDSO support for 32-bit ARM, mainly for gettimeofday() using the ARM architected timer. - Perf support for Scorpion performance monitoring units" vdso semantic conflict fixed up as per linux-next. * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (52 commits) ARM: update errata 430973 documentation to cover Cortex A8 r1p* ARM: ensure delay timer has sufficient accuracy for delays ARM: switch to use the generic show_mem() implementation ARM: proc-v7: avoid errata 430973 workaround for non-Cortex A8 CPUs ARM: enable ARM errata 643719 workaround by default ARM: cache-v7: optimise test for Cortex A9 r0pX devices ARM: cache-v7: optimise branches in v7_flush_cache_louis ARM: cache-v7: consolidate initialisation of cache level index ARM: cache-v7: shift CLIDR to extract appropriate field before masking ARM: cache-v7: use movw/movt instructions ARM: allow 16-bit instructions in ALT_UP() ARM: proc-arm94*.S: fix setup function ARM: vexpress: fix CPU hotplug with CT9x4 tile. ARM: 8276/1: Make CPU_DCACHE_DISABLE depend on !SMP ARM: 8335/1: Documentation: DT bindings: Tegra AHB: document the legacy base address ARM: 8334/1: amba: tegra-ahb: detect and correct bogus base address ARM: 8333/1: amba: tegra-ahb: fix register offsets in the macros ARM: 8339/1: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL ARM: 8338/1: kexec: Relax SMP validation to improve DT compatibility ARM: 8337/1: mm: Do not invoke OOM for higher order IOMMU DMA allocations ...
2015-04-14Merge branch 'for-linus' of ↵Linus Torvalds119-7255/+1337
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "The major change in this merge is the removal of the support for 31-bit kernels. Naturally 31-bit user space will continue to work via the compat layer. And then some cleanup, some improvements and bug fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (23 commits) s390/smp: wait until secondaries are active & online s390/hibernate: fix save and restore of kernel text section s390/cacheinfo: add missing facility check s390/syscalls: simplify syscall_get_arch() s390/irq: enforce correct irqclass_sub_desc array size s390: remove "64" suffix from mem64.S and swsusp_asm64.S s390/ipl: cleanup macro usage s390/ipl: cleanup shutdown_action attributes s390/ipl: cleanup bin attr usage s390/uprobes: fix address space annotation s390: add missing arch_release_task_struct() declaration s390: make couple of functions and variables static s390/maccess: improve s390_kernel_write() s390/maccess: remove potentially broken probe_kernel_write() s390/watchdog: support for KVM hypervisors and delete pr_info messages s390/watchdog: enable KEEPALIVE for /dev/watchdog s390/dasd: remove setting of scheduler from driver s390/traps: panic() instead of die() on translation exception s390: remove test_facility(2) (== z/Architecture mode active) checks s390/cmpxchg: simplify cmpxchg_double ...
2015-04-14Merge tag 'pm+acpi-4.1-rc1' of ↵Linus Torvalds96-928/+1797
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI updates from Rafael Wysocki: "These are mostly fixes and cleanups all over, although there are a few items that sort of fall into the new feature category. First off, we have new callbacks for PM domains that should help us to handle some issues related to device initialization in a better way. There also is some consolidation in the unified device properties API area allowing us to use that inferface for accessing data coming from platform initialization code in addition to firmware-provided data. We have some new device/CPU IDs in a few drivers, support for new chips and a new cpufreq driver too. Specifics: - Generic PM domains support update including new PM domain callbacks to handle device initialization better (Russell King, Rafael J Wysocki, Kevin Hilman) - Unified device properties API update including a new mechanism for accessing data provided by platform initialization code (Rafael J Wysocki, Adrian Hunter) - ARM cpuidle update including ARM32/ARM64 handling consolidation (Daniel Lezcano) - intel_idle update including support for the Silvermont Core in the Baytrail SOC and for the Airmont Core in the Cherrytrail and Braswell SOCs (Len Brown, Mathias Krause) - New cpufreq driver for Hisilicon ACPU (Leo Yan) - intel_pstate update including support for the Knights Landing chip (Dasaratharaman Chandramouli, Kristen Carlson Accardi) - QorIQ cpufreq driver update (Tang Yuantian, Arnd Bergmann) - powernv cpufreq driver update (Shilpasri G Bhat) - devfreq update including Tegra support changes (Tomeu Vizoso, MyungJoo Ham, Chanwoo Choi) - powercap RAPL (Running-Average Power Limit) driver update including support for Intel Broadwell server chips (Jacob Pan, Mathias Krause) - ACPI device enumeration update related to the handling of the special PRP0001 device ID allowing DT-style 'compatible' property to be used for ACPI device identification (Rafael J Wysocki) - ACPI EC driver update including limited _DEP support (Lan Tianyu, Lv Zheng) - ACPI backlight driver update including a new mechanism to allow native backlight handling to be forced on non-Windows 8 systems and a new quirk for Lenovo Ideapad Z570 (Aaron Lu, Hans de Goede) - New Windows Vista compatibility quirk for Sony VGN-SR19XN (Chen Yu) - Assorted ACPI fixes and cleanups (Aaron Lu, Martin Kepplinger, Masanari Iida, Mika Westerberg, Nan Li, Rafael J Wysocki) - Fixes related to suspend-to-idle for the iTCO watchdog driver and the ACPI core system suspend/resume code (Rafael J Wysocki, Chen Yu) - PM tracing support for the suspend phase of system suspend/resume transitions (Zhonghui Fu) - Configurable delay for the system suspend/resume testing facility (Brian Norris) - PNP subsystem cleanups (Peter Huewe, Rafael J Wysocki)" * tag 'pm+acpi-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (74 commits) ACPI / scan: Fix NULL pointer dereference in acpi_companion_match() ACPI / scan: Rework modalias creation when "compatible" is present intel_idle: mark cpu id array as __initconst powercap / RAPL: mark rapl_ids array as __initconst powercap / RAPL: add ID for Broadwell server intel_pstate: Knights Landing support intel_pstate: remove MSR test cpufreq: fix qoriq uniprocessor build ACPI / scan: Take the PRP0001 position in the list of IDs into account ACPI / scan: Simplify acpi_match_device() ACPI / scan: Generalize of_compatible matching device property: Introduce firmware node type for platform data device property: Make it possible to use secondary firmware nodes PM / watchdog: iTCO: stop watchdog during system suspend cpufreq: hisilicon: add acpu driver ACPI / EC: Call acpi_walk_dep_device_list() after installing EC opregion handler cpufreq: powernv: Report cpu frequency throttling intel_idle: Add support for the Airmont Core in the Cherrytrail and Braswell SOCs intel_idle: Update support for Silvermont Core in Baytrail SOC PM / devfreq: tegra: Register governor on module init ...
2015-04-14Merge branch 'master' of ↵David S. Miller12-138/+215
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2015-04-14 This series contains updates to fm10k only. Fixed transmit statistics which was actually using values from the receive ring, instead of the transmit ring. Fixed up spelling mistakes in code comments and resolved unused argument warnings. Added support for netconsole. Fixed up statistic reporting so that we are only reporting from actual queues as well as display PF only stats for just the PF and not the VF. Also fixed an issue that when returning virtualization queues from the VF back to the PF, we were retaining the VF rate limiter. Fixed up the driver to use a separate workqueue, which helps reduce and stabilize latency between scheduling the work in our interrupt and actually performing the work. Fixed a bug where the VF tried to set a multicast address before requesting the required xcast mode. Fix VF multicast update since VFs were being improperly added to the switch's mutlicast group. The error stems from the fact that incorrect arguments were passed to the update_mc_addr(). Thanks to Alex Duyck for the extensive review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14Merge branch 'for-linus' of ↵Linus Torvalds59-239/+3965
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input subsystem updates from Dmitry Torokhov: "You will get the following new drivers: - Qualcomm PM8941 power key drver - ChipOne icn8318 touchscreen controller driver - Broadcom iProc touchscreen and keypad drivers - Semtech SX8654 I2C touchscreen controller driver ALPS driver now supports newer SS4 devices; Elantech got a fix that should make it work on some ASUS laptops; and a slew of other enhancements and random fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (51 commits) Input: alps - non interleaved V2 dualpoint has separate stick button bits Input: alps - fix touchpad buttons getting stuck when used with trackpoint Input: atkbd - document "no new force-release quirks" policy Input: ALPS - make alps_get_pkt_id_ss4_v2() and others static Input: ALPS - V7 devices can report 5-finger taps Input: ALPS - add support for SS4 touchpad devices Input: ALPS - refactor alps_set_abs_params_mt() Input: elantech - fix absolute mode setting on some ASUS laptops Input: atmel_mxt_ts - split out touchpad initialisation logic Input: atmel_mxt_ts - implement support for T100 touch object Input: cros_ec_keyb - fix clearing keyboard state on wakeup Input: gscps2 - drop pci_ids dependency Input: synaptics - allocate 3 slots to keep stability in image sensors Input: Revert "Revert "synaptics - use dmax in input_mt_assign_slots"" Input: MT - make slot assignment work for overcovered solutions mfd: tc3589x: enforce device-tree only mode Input: tc3589x - localize platform data Input: tsc2007 - Convert msecs to jiffies only once Input: edt-ft5x06 - remove EV_SYN event report Input: edt-ft5x06 - allow to setting the maximum axes value through the DT ...
2015-04-14Merge branch 'i2c/for-next' of ↵Linus Torvalds46-276/+2483
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "Most notable: - introducing the i2c_quirk infrastructure. Now, flaws of I2C controllers can be described and the core will check if the flaws collide with the messages to be sent - wait_for_completion return type cleanup series - new drivers for Digicolor, Netlogic XLP, Ingenic JZ4780 - updates to the I2C slave framework which include API changes. Its only user was updated, too. Documentation was finally added - changed dynamic bus numbering for the DT case. This could change bus numbers for users. However, it fixes a collision where dynamic and static busses request the same id. - driver bugfixes, cleanups" * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits) i2c: xlp9xx: Driver for Netlogic XLP9XX/5XX I2C controller of: Add vendor prefix 'netlogic' i2c: davinci: use ICPFUNC to toggle I2C as gpio for bus recovery i2c: davinci: use bus recovery infrastructure i2c: change input parameter to i2c_adapter for prepare/unprepare_recovery i2c: i2c-mux-gpio: remove error messages for probe deferrals i2c: jz4780: Add i2c bus controller driver for Ingenic JZ4780 i2c: dln2: set the device tree node of the adapter i2c: davinci: fixup wait_for_completion_timeout handling i2c: mpc: Fix ISR return value i2c: slave-eeprom: add more info when to increase the pointer i2c: slave: add documentation for i2c-slave-eeprom Documentation: i2c: describe the new slave mode i2c: slave: rework the slave API i2c: add support for the Digicolor I2C controller i2c: busses with dynamic ids should start after fixed ids for DT of: base: add function to get highest id of an alias stem i2c: designware: Suppress error message if platform_get_irq() < 0 i2c: mpc: assign the correct prescaler from SVR i2c: img-scb: fixup of wait_for_completion_timeout return handling ...
2015-04-14Merge tag 'vfio-v4.1-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds18-259/+1640
Pull VFIO updates from Alex Williamson: - VFIO platform bus driver support (Baptiste Reynal, Antonios Motakis, testing and review by Eric Auger) - Split VFIO irqfd support to separate module (Alex Williamson) - vfio-pci VGA arbiter client (Alex Williamson) - New vfio-pci.ids= module option (Alex Williamson) - vfio-pci D3 power state support for idle devices (Alex Williamson) * tag 'vfio-v4.1-rc1' of git://github.com/awilliam/linux-vfio: (30 commits) vfio-pci: Fix use after free vfio-pci: Move idle devices to D3hot power state vfio-pci: Remove warning if try-reset fails vfio-pci: Allow PCI IDs to be specified as module options vfio-pci: Add VGA arbiter client vfio-pci: Add module option to disable VGA region access vgaarb: Stub vga_set_legacy_decoding() vfio: Split virqfd into a separate module for vfio bus drivers vfio: virqfd_lock can be static vfio: put off the allocation of "minor" in vfio_create_group vfio/platform: implement IRQ masking/unmasking via an eventfd vfio: initialize the virqfd workqueue in VFIO generic code vfio: move eventfd support code for VFIO_PCI to a separate file vfio: pass an opaque pointer on virqfd initialization vfio: add local lock for virqfd instead of depending on VFIO PCI vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export vfio/platform: support for level sensitive interrupts vfio/platform: trigger an interrupt via eventfd vfio/platform: initial interrupts support code ...
2015-04-14Merge tag 'pinctrl-v4.1-1' of ↵Linus Torvalds96-957/+15323
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pincontrol updates from Linus Walleij: "This is the bulk of pin control changes for the v4.1 development cycle. Nothing really exciting this time: we basically added a few new drivers and subdrivers and stabilized them in linux-next. Some cleanups too. With sunrisepoint Intel has a real fine fully featured pin control driver for contemporary hardware, and the AMD driver is also for large deployments. Most of the others are ARM devices. New drivers: - Intel Sunrisepoint - AMD KERNCZ GPIO - Broadcom Cygnus IOMUX New subdrivers: - Marvell MVEBU Armada 39x SoCs - Samsung Exynos 5433 - nVidia Tegra 210 - Mediatek MT8135 - Mediatek MT8173 - AMLogic Meson8b - Qualcomm PM8916 On top of this cleanups and development history for the above drivers as issues were fixed after merging" * tag 'pinctrl-v4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (71 commits) pinctrl: sirf: move sgpio lock into state container pinctrl: Add support for PM8916 GPIO's and MPP's pinctrl: bcm2835: Fix support for threaded level triggered IRQs sh-pfc: r8a7790: add EtherAVB pin groups pinctrl: Document "function" + "pins" pinmux binding pinctrl: intel: Add Intel Sunrisepoint pin controller and GPIO support pinctrl: fsl: imx: Check for 0 config register pinctrl: Add support for Meson8b documentation: Extend pinctrl docs for Meson8b pinctrl: Cleanup Meson8 driver Fix inconsistent spinlock of AMD GPIO driver which can be recognized by static analysis tool smatch. Declare constant Variables with Sparse's suggestion. pinctrl: at91: convert __raw to endian agnostic IO pinctrl: constify of_device_id array pinctrl: pinconf-generic: add dt node names to error messages pinctrl: pinconf-generic: scan also referenced phandle node pinctrl: mvebu: add suspend/resume support to Armada XP pinctrl driver pinctrl: st: Display pin's function when printing pinctrl debug information pinctrl: st: Show correct pin direction also in GPIO mode pinctrl: st: Supply a GPIO get_direction() call-back pinctrl: st: Move st_get_pio_control() further up the source file ...
2015-04-14Merge tag 'backlight-for-linus-4.1' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Changes to existing drivers: - Use of_get_child_by_name() instead of refcount; 88pm860x_bl - Terminate array with NULL element; da9052_bl" * tag 'backlight-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: da9052_bl: Terminate da9052_wled_ids array with empty element backlight: 88pm860x_bl: Use of_get_child_by_name() instead of refcount hack
2015-04-14Merge tag 'mfd-for-linus-4.1' of ↵Linus Torvalds82-1185/+4256
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Changes to existing drivers: - Rename child driver [axp288_battery => axp288_fuel_gauge]; axp20x - Rename child driver [max77693-flash => max77693-led]; max77693 - Error handling fixes; intel_soc_pmic - GPIO tweaking; intel_soc_pmic - Remove non-DT code; vexpress-sysreg, tc3589x - Remove unused/legacy code; ti_am335x_tscadc, rts5249, rtsx_gops, rtsx_pcr, rtc-s5m, sec-core, max77693, menelaus, wm5102-tables - Trivial fixups; rtsx_pci, da9150-core, sec-core, max7769, max77693, mc13xxx-core, dln2, hi6421-pmic-core, rk808, twl4030-power, lpc_ich, menelaus, twl6040 - Update register/address values; rts5227, rts5249 - DT and/or binding document fixups; arizona, da9150, mt6397, axp20x, qcom-rpm, qcom-spmi-pmic - Couple of trivial core Kconfig fixups - Remove use of seq_printf return value; ab8500-debugfs - Remove __exit markups; menelaus, tps65010 - Fix platform-device name collisions; mfd-core New drivers/supported devices: - Add support for wm8280/wm8281 into arizona - Add support for COMe-cBL6 into kempld-core - Add support for rts524a and rts525a into rts5249 - Add support for ipq8064 into qcom_rpm - Add support for extcon into axp20x - New MediaTek MT6397 PMIC driver - New Maxim MAX77843 PMIC dirver - New Intel Quark X1000 I2C-GPIO driver - New Skyworks SKY81452 driver" * tag 'mfd-for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (76 commits) mfd: sec: Fix RTC alarm interrupt number on S2MPS11 mfd: wm5102: Remove registers for output 3R from readable list mfd: tps65010: Remove incorrect __exit markups mfd: devicetree: bindings: Add Qualcomm RPM regulator subnodes mfd: axp20x: Add support for extcon cell mfd: lpc_ich: Sort IDs mfd: twl6040: Remove wrong and unneeded "platform:twl6040" modalias mfd: qcom-spmi-pmic: Add specific compatible strings for Qualcomm's SPMI PMIC's mfd: axp20x: Fix duplicate const for model names mfd: menelaus: Use macro for magic number mfd: menelaus: Drop support for SW controller VCORE mfd: menelaus: Delete omap_has_menelaus mfd: arizona: Correct type of gpio_defaults mfd: lpc_ich: Sort IDs mfd: Fix a typo in Kconfig mfd: qcom_rpm: Add support for IPQ8064 mfd: devicetree: qcom_rpm: Document IPQ8064 resources mfd: core: Fix platform-device name collisions mfd: intel_quark_i2c_gpio: Don't crash if !DMI dt-bindings: Add vendor-prefix for X-Powers ...
2015-04-15lsm: copy comm before calling audit_log to avoid race in string printingRichard Guy Briggs1-6/+9
When task->comm is passed directly to audit_log_untrustedstring() without getting a copy or using the task_lock, there is a race that could happen that would output a NULL (\0) in the middle of the output string that would effectively truncate the rest of the report text after the comm= field in the audit log message, losing fields. Using get_task_comm() to get a copy while acquiring the task_lock to prevent this and to prevent the result from being a mixture of old and new values of comm would incur potentially unacceptable overhead, considering that the value can be influenced by userspace and therefore untrusted anyways. Copy the value before passing it to audit_log_untrustedstring() ensures that a local copy is used to calculate the length *and* subsequently printed. Even if this value contains a mix of old and new values, it will only calculate and copy up to the first NULL, preventing the rest of the audit log message being truncated. Use a second local copy of comm to avoid a race between the first and second calls to audit_log_untrustedstring() with comm. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2015-04-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds153-1419/+2312
Merge first patchbomb from Andrew Morton: - arch/sh updates - ocfs2 updates - kernel/watchdog feature - about half of mm/ * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits) Documentation: update arch list in the 'memtest' entry Kconfig: memtest: update number of test patterns up to 17 arm: add support for memtest arm64: add support for memtest memtest: use phys_addr_t for physical addresses mm: move memtest under mm mm, hugetlb: abort __get_user_pages if current has been oom killed mm, mempool: do not allow atomic resizing memcg: print cgroup information when system panics due to panic_on_oom mm: numa: remove migrate_ratelimited mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE mm: split ET_DYN ASLR from mmap ASLR s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE mm: expose arch_mmap_rnd when available s390: standardize mmap_rnd() usage powerpc: standardize mmap_rnd() usage mips: extract logic for mmap_rnd() arm64: standardize mmap_rnd() usage x86: standardize mmap_rnd() usage arm: factor out mmap ASLR into mmap_rnd ...
2015-04-14Documentation: update arch list in the 'memtest' entryVladimir Murzin1-1/+1
Since arm64/arm support memtest command line option update the "memtest" entry. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14Kconfig: memtest: update number of test patterns up to 17Vladimir Murzin1-1/+1
Additional test patterns for memtest were introduced since commit 63823126c221 ("x86: memtest: add additional (regular) test patterns"), but looks like Kconfig was not updated that time. Update Kconfig entry with the actual number of maximum test patterns. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm: add support for memtestVladimir Murzin1-0/+3
Add support for memtest command line option. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm64: add support for memtestVladimir Murzin1-0/+2
Add support for memtest command line option. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14memtest: use phys_addr_t for physical addressesVladimir Murzin2-10/+10
Since memtest might be used by other architectures pass input parameters as phys_addr_t instead of long to prevent overflow. Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: move memtest under mmVladimir Murzin7-21/+21
Memtest is a simple feature which fills the memory with a given set of patterns and validates memory contents, if bad memory regions is detected it reserves them via memblock API. Since memblock API is widely used by other architectures this feature can be enabled outside of x86 world. This patch set promotes memtest to live under generic mm umbrella and enables memtest feature for arm/arm64. It was reported that this patch set was useful for tracking down an issue with some errant DMA on an arm64 platform. This patch (of 6): There is nothing platform dependent in the core memtest code, so other platforms might benefit from this feature too. [linux@roeck-us.net: MEMTEST depends on MEMBLOCK] Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, hugetlb: abort __get_user_pages if current has been oom killedDavid Rientjes1-0/+9
If __get_user_pages() is faulting a significant number of hugetlb pages, usually as the result of mmap(MAP_LOCKED), it can potentially allocate a very large amount of memory. If the process has been oom killed, this will cause a lot of memory to potentially deplete memory reserves. In the same way that commit 4779280d1ea4 ("mm: make get_user_pages() interruptible") aborted for pending SIGKILLs when faulting non-hugetlb memory, based on the premise of commit 462e00cc7151 ("oom: stop allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now terminate when the process has been oom killed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, mempool: do not allow atomic resizingDavid Rientjes4-11/+11
Allocating a large number of elements in atomic context could quickly deplete memory reserves, so just disallow atomic resizing entirely. Nothing currently uses mempool_resize() with anything other than GFP_KERNEL, so convert existing callers to drop the gfp_mask. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Steffen Maier <maier@linux.vnet.ibm.com> [zfcp] Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14memcg: print cgroup information when system panics due to panic_on_oomBalasubramani Vivekanandan3-11/+15
If kernel panics due to oom, caused by a cgroup reaching its limit, when 'compulsory panic_on_oom' is enabled, then we will only see that the OOM happened because of "compulsory panic_on_oom is enabled" but this doesn't tell the difference between mempolicy and memcg. And dumping system wide information is plain wrong and more confusing. This patch provides the information of the cgroup whose limit triggerred panic Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: numa: remove migrate_ratelimitedMel Gorman2-25/+0
This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE scanning even if migrate rate limited") so remove it. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZEKees Cook9-25/+14
The arch_randomize_brk() function is used on several architectures, even those that don't support ET_DYN ASLR. To avoid bulky extern/#define tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for the architectures that support it, while still handling CONFIG_COMPAT_BRK. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: split ET_DYN ASLR from mmap ASLRKees Cook9-33/+6
This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips, powerpc, and x86. The problem is that if there is a leak of ASLR from the executable (ET_DYN), it means a leak of shared library offset as well (mmap), and vice versa. Further details and a PoC of this attack is available here: http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html With this patch, a PIE linked executable (ET_DYN) has its own ASLR region: $ ./show_mmaps_pie 54859ccd6000-54859ccd7000 r-xp ... /tmp/show_mmaps_pie 54859ced6000-54859ced7000 r--p ... /tmp/show_mmaps_pie 54859ced7000-54859ced8000 rw-p ... /tmp/show_mmaps_pie 7f75be764000-7f75be91f000 r-xp ... /lib/x86_64-linux-gnu/libc.so.6 7f75be91f000-7f75beb1f000 ---p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb1f000-7f75beb23000 r--p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb23000-7f75beb25000 rw-p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb25000-7f75beb2a000 rw-p ... 7f75beb2a000-7f75beb4d000 r-xp ... /lib64/ld-linux-x86-64.so.2 7f75bed45000-7f75bed46000 rw-p ... 7f75bed46000-7f75bed47000 r-xp ... 7f75bed47000-7f75bed4c000 rw-p ... 7f75bed4c000-7f75bed4d000 r--p ... /lib64/ld-linux-x86-64.so.2 7f75bed4d000-7f75bed4e000 rw-p ... /lib64/ld-linux-x86-64.so.2 7f75bed4e000-7f75bed4f000 rw-p ... 7fffb3741000-7fffb3762000 rw-p ... [stack] 7fffb377b000-7fffb377d000 r--p ... [vvar] 7fffb377d000-7fffb377f000 r-xp ... [vdso] The change is to add a call the newly created arch_mmap_rnd() into the ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR, as was already done on s390. Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE, which is no longer needed. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14s390: redefine randomize_et_dyn for ELF_ET_DYN_BASEKees Cook2-12/+7
In preparation for moving ET_DYN randomization into the ELF loader (which requires a static ELF_ET_DYN_BASE), this redefines s390's existing ET_DYN randomization in a call to arch_mmap_rnd(). This refactoring results in the same ET_DYN randomization on s390. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: expose arch_mmap_rnd when availableKees Cook14-14/+37
When an architecture fully supports randomizing the ELF load location, a per-arch mmap_rnd() function is used to find a randomized mmap base. In preparation for randomizing the location of ET_DYN binaries separately from mmap, this renames and exports these functions as arch_mmap_rnd(). Additionally introduces CONFIG_ARCH_HAS_ELF_RANDOMIZE for describing this feature on architectures that support it (which is a superset of ARCH_BINFMT_ELF_RANDOMIZE_PIE, since s390 already supports a separated ET_DYN ASLR from mmap ASLR without the ARCH_BINFMT_ELF_RANDOMIZE_PIE logic). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14s390: standardize mmap_rnd() usageKees Cook1-11/+23
In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm and x86, and extracts the checking of PF_RANDOMIZE. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14powerpc: standardize mmap_rnd() usageKees Cook1-11/+15
In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm and x86. (Can mmap ASLR be safely enabled in the legacy mmap case here? Other archs use "mm->mmap_base = TASK_UNMAPPED_BASE + random_factor".) Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mips: extract logic for mmap_rnd()Kees Cook1-8/+16
In preparation for splitting out ET_DYN ASLR, extract the mmap ASLR selection into a separate function. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm64: standardize mmap_rnd() usageKees Cook2-8/+11
In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm and x86. This additionally enables mmap ASLR on legacy mmap layouts, which appeared to be missing on arm64, and was already supported on arm. Additionally removes a copy/pasted declaration of an unused function. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86: standardize mmap_rnd() usageKees Cook1-16/+20
In preparation for splitting out ET_DYN ASLR, this refactors the use of mmap_rnd() to be used similarly to arm, and extracts the checking of PF_RANDOMIZE. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm: factor out mmap ASLR into mmap_rndKees Cook1-4/+12
To address the "offset2lib" ASLR weakness[1], this separates ET_DYN ASLR from mmap ASLR, as already done on s390. The architectures that are already randomizing mmap (arm, arm64, mips, powerpc, s390, and x86), have their various forms of arch_mmap_rnd() made available via the new CONFIG_ARCH_HAS_ELF_RANDOMIZE. For these architectures, arch_randomize_brk() is collapsed as well. This is an alternative to the solutions in: https://lkml.org/lkml/2015/2/23/442 I've been able to test x86 and arm, and the buildbot (so far) seems happy with building the rest. [1] http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html This patch (of 10): In preparation for splitting out ET_DYN ASLR, this moves the ASLR calculations for mmap on ARM into a separate routine, similar to x86. This also removes the redundant check of personality (PF_RANDOMIZE is already set before calling arch_pick_mmap_layout). Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14fs/binfmt_elf.c: fix bug in loading of PIE binariesMichael Davidson1-1/+8
With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down address allocation strategy, load_elf_binary() will attempt to map a PIE binary into an address range immediately below mm->mmap_base. Unfortunately, load_elf_ binary() does not take account of the need to allocate sufficient space for the entire binary which means that, while the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are that is supposed to be the "gap" between the stack and the binary. Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this means that binaries with large data segments > 128MB can end up mapping part of their data segment over their stack resulting in corruption of the stack (and the data segment once the binary starts to run). Any PIE binary with a data segment > 128MB is vulnerable to this although address randomization means that the actual gap between the stack and the end of the binary is normally greater than 128MB. The larger the data segment of the binary the higher the probability of failure. Fix this by calculating the total size of the binary in the same way as load_elf_interp(). Signed-off-by: Michael Davidson <md@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: memcontrol: let mem_cgroup_move_account() have effect only if MMU enabledChen Gang1-86/+86
When !MMU, it will report warning. The related warning with allmodconfig under c6x: CC mm/memcontrol.o mm/memcontrol.c:2802:12: warning: 'mem_cgroup_move_account' defined but not used [-Wunused-function] static int mem_cgroup_move_account(struct page *page, ^ Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86, mm: support huge KVA mappings on x86Toshi Kani3-3/+69
Implement huge KVA mapping interfaces on x86. On x86, MTRRs can override PAT memory types with a 4KB granularity. When using a huge page, MTRRs can override the memory type of the huge page, which may lead a performance penalty. The processor can also behave in an undefined manner if a huge page is mapped to a memory range that MTRRs have mapped with multiple different memory types. Therefore, the mapping code falls back to use a smaller page size toward 4KB when a mapping range is covered by non-WB type of MTRRs. The WB type of MTRRs has no affect on the PAT memory types. pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see if a given range is covered by MTRRs. MTRR_TYPE_WRBACK indicates that the range is either covered by WB or not covered and the MTRR default value is set to WB. 0xFF indicates that MTRRs are disabled. HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE is set. X86_32 without X86_PAE is not supported since such config can unlikey be benefited from this feature, and there was an issue found in testing. [fengguang.wu@intel.com: ioremap_pud_capable can be static] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86, mm: support huge I/O mapping capability I/FToshi Kani2-2/+23
Implement huge I/O mapping capability interfaces for ioremap() on x86. IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and PMD_SHIFT on x86/32, which overrides the default value defined in <linux/vmalloc.h>. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: change vunmap to tear down huge KVA mappingsToshi Kani2-0/+14
Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA mappings when they are set. pud_clear_huge() and pmd_clear_huge() return zero when no-operation is performed, i.e. huge page mapping was not used. These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on the architecture. [akpm@linux-foundation.org: use consistent code layout] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: change ioremap to set up huge I/O mappingsToshi Kani2-0/+31
ioremap_pud_range() and ioremap_pmd_range() are changed to create huge I/O mappings when their capability is enabled, and a request meets required conditions -- both virtual & physical addresses are aligned by their huge page size, and a requested range fufills their huge page size. When pud_set_huge() or pmd_set_huge() returns zero, i.e. no-operation is performed, the code simply falls back to the next level. The changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on the architecture. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14lib/ioremap.c: add huge I/O map capability interfacesToshi Kani5-0/+52
Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which return 1 when I/O mappings with pud/pmd are enabled on the kernel. ioremap_huge_init() calls arch_ioremap_pud_supported() and arch_ioremap_pmd_supported() to initialize the capabilities at boot-time. A new kernel option "nohugeiomap" is also added, so that user can disable the huge I/O map capabilities when necessary. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: change __get_vm_area_node() to use fls_long()Toshi Kani1-1/+3
ioremap() and its related interfaces are used to create I/O mappings to memory-mapped I/O devices. The mapping sizes of the traditional I/O devices are relatively small. Non-volatile memory (NVM), however, has many GB and is going to have TB soon. It is not very efficient to create large I/O mappings with 4KB. This patchset extends the ioremap() interfaces to transparently create I/O mappings with huge pages whenever possible. ioremap() continues to use 4KB mappings when a huge page does not fit into a requested range. There is no change necessary to the drivers using ioremap(). A requested physical address must be aligned by a huge page size (1GB or 2MB on x86) for using huge page mapping, though. The kernel huge I/O mapping will improve performance of NVM and other devices with large memory, and reduce the time to create their mappings as well. On x86, MTRRs can override PAT memory types with a 4KB granularity. When using a huge page, MTRRs can override the memory type of the huge page, which may lead a performance penalty. The processor can also behave in an undefined manner if a huge page is mapped to a memory range that MTRRs have mapped with multiple different memory types. Therefore, the mapping code falls back to use a smaller page size toward 4KB when a mapping range is covered by non-WB type of MTRRs. The WB type of MTRRs has no affect on the PAT memory types. The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch supports huge KVA mappings for ioremap(). User may specify a new kernel option "nohugeiomap" to disable the huge I/O mapping capability of ioremap() when necessary. Patch 1-4 change common files to support huge I/O mappings. There is no change in the functinalities unless HAVE_ARCH_HUGE_VMAP is defined on the architecture of the system. Patch 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set HAVE_ARCH_HUGE_VMAP on x86. This patch (of 6): __get_vm_area_node() takes unsigned long size, which is a 64-bit value on a 64-bit kernel. However, fls(size) simply ignores the upper 32-bit. Change to use fls_long() to handle the size properly. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/page_alloc.c: clean up commentYaowei Bai1-1/+1
Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14sparc: clarify __GFP_NOFAIL allocationMichal Hocko1-11/+11
Commit 920c3ed74134 ("[SPARC64]: Add basic infrastructure for MD add/remove notification") has added __GFP_NOFAIL for the allocation request but it hasn't mentioned why is this strict requirement really needed. The code was handling an allocation failure and propagated it properly up the callchain so it is not clear why it is needed. Dave has clarified the intention when I tried to remove the flag as not being necessary: : It is a serious failure. : : If we miss an MDESC update due to this allocation failure, the update : is not an event which gets retransmitted so we will lose the updated : machine description forever. : : We really need this allocation to succeed. So add a comment to clarify the nofail flag and get rid of the failure check because __GFP_NOFAIL allocation doesn't fail. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vipul Pandya <vipul@chelsio.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: clarify __GFP_NOFAIL deprecation statusMichal Hocko1-2/+4
__GFP_NOFAIL is documented as a deprecated flag since commit 478352e789f5 ("mm: add comment about deprecation of __GFP_NOFAIL"). This has discouraged people from using it but in some cases an opencoded endless loop around allocator has been used instead. So the allocator is not aware of the de facto __GFP_NOFAIL allocation because this information was not communicated properly. Let's make clear that if the allocation context really cannot afford failure because there is no good failure policy then using __GFP_NOFAIL is preferable to opencoding the loop outside of the allocator. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vipul Pandya <vipul@chelsio.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: cma: constify and use correct signness in mm/cma.cSasha Levin2-16/+20
Constify function parameters and use correct signness where needed. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Acked-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14kernel, cpuset: remove exception for __GFP_THISNODEDavid Rientjes1-13/+5
Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so remove the obscure comment about it and its special-case exception. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, thp: really limit transparent hugepage allocation to local nodeDavid Rientjes2-3/+9
Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") restructured alloc_hugepage_vma() with the intent of only allocating transparent hugepages locally when there was not an effective interleave mempolicy. alloc_pages_exact_node() does not limit the allocation to the single node, however, but rather prefers it. This is because __GFP_THISNODE is not set which would cause the node-local nodemask to be passed. Without it, only a nodemask that prefers the local node is passed. Fix this by passing __GFP_THISNODE and falling back to small pages when the allocation fails. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") suffers from a similar problem for khugepaged, which is also fixed. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: remove GFP_THISNODEDavid Rientjes4-31/+27
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even it was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT is set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, mempolicy: migrate_to_node should only migrate to nodeDavid Rientjes1-1/+2
migrate_to_node() is intended to migrate a page from one source node to a target node. Today, migrate_to_node() could end up migrating to any node, not only the target node. This is because the page migration allocator, new_node_page() does not pass __GFP_THISNODE to alloc_pages_exact_node(). This causes the target node to be preferred but allows fallback to any other node in order of affinity. Prevent this by allocating with __GFP_THISNODE. If memory is not available, -ENOMEM will be returned as appropriate. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cleancache: remove limit on the number of cleancache enabled filesystemsVladimir Davydov3-182/+94
The limit equals 32 and is imposed by the number of entries in the fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient, because with containers on board a Linux host can have hundreds of active fs mounts. These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache: lazy initialization to allow tmem backends to build/run as modules") in order to allow compiling cleancache drivers as modules. Real pool ids are stored in these maps while super_block->cleancache_poolid points to an entry in the map, so that on cleancache registration we can walk over all (if there are <= 32 of them, of course) cleancache-enabled super blocks and assign real pool ids. Actually, there is absolutely no need in these maps, because we can iterate over all super blocks immediately using iterate_supers. This is not racy, because cleancache_init_ops is called from mount_fs with super_block->s_umount held for writing, while iterate_supers takes this semaphore for reading, so if we call iterate_supers after setting cleancache_ops, all super blocks that had been created before cleancache_register_ops was called will be assigned pool ids by the action function of iterate_supers while all newer super blocks will receive it in cleancache_init_fs. This patch therefore removes the maps and hence the artificial limit on the number of cleancache enabled filesystems. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cleancache: forbid overriding cleancache_opsVladimir Davydov4-17/+18
Currently, cleancache_register_ops returns the previous value of cleancache_ops to allow chaining. However, chaining, as it is implemented now, is extremely dangerous due to possible pool id collisions. Suppose, a new cleancache driver is registered after the previous one assigned an id to a super block. If the new driver assigns the same id to another super block, which is perfectly possible, we will have two different filesystems using the same id. No matter if the new driver implements chaining or not, we are likely to get data corruption with such a configuration eventually. This patch therefore disables the ability to override cleancache_ops altogether as potentially dangerous. If there is already cleancache driver registered, all further calls to cleancache_register_ops will return EBUSY. Since no user of cleancache implements chaining, we only need to make minor changes to the code outside the cleancache core. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cleancache: zap uuid arg of cleancache_init_shared_fsVladimir Davydov3-7/+7
Use super_block->s_uuid instead. Every shared filesystem using cleancache must now initialize super_block->s_uuid before calling cleancache_init_shared_fs. The only one on the tree, ocfs2, already meets this requirement. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: copy fs uuid to superblockVladimir Davydov1-0/+2
Currently, maximal number of cleancache enabled filesystems equals 32, which is insufficient nowadays, because a Linux host can have hundreds of containers on board, each of which might want its own filesystem. This patch set targets at removing this limitation - see patch 4 for more details. Patches 1-3 prepare the code for this change. This patch (of 4): This will allow us to remove the uuid argument from cleancache_init_shared_fs. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: refactor do_wp_page handling of shared vma into a functionShachar Raindel1-38/+48
The do_wp_page function is extremely long. Extract the logic for handling a page belonging to a shared vma into a function of its own. This helps the readability of the code, without doing any functional change in it. Signed-off-by: Shachar Raindel <raindel@mellanox.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Haggai Eran <haggaie@mellanox.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Michel Lespinasse <walken@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: refactor do_wp_page, extract the page copy flowShachar Raindel1-118/+147
In some cases, do_wp_page had to copy the page suffering a write fault to a new location. If the function logic decided that to do this, it was done by jumping with a "goto" operation to the relevant code block. This made the code really hard to understand. It is also against the kernel coding style guidelines. This patch extracts the page copy and page table update logic to a separate function. It also clean up the naming, from "gotten" to "wp_page_copy", and adds few comments. Signed-off-by: Shachar Raindel <raindel@mellanox.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Haggai Eran <haggaie@mellanox.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Michel Lespinasse <walken@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: refactor do_wp_page - rewrite the unlock flowShachar Raindel1-9/+12
When do_wp_page is ending, in several cases it needs to unlock the pages and ptls it was accessing. Currently, this logic was "called" by using a goto jump. This makes following the control flow of the function harder. Readability was further hampered by the unlock case containing large amount of logic needed only in one of the 3 cases. Using goto for cleanup is generally allowed. However, moving the trivial unlocking flows to the relevant call sites allow deeper refactoring in the next patch. Signed-off-by: Shachar Raindel <raindel@mellanox.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Haggai Eran <haggaie@mellanox.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Michel Lespinasse <walken@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: refactor do_wp_page, extract the reuse caseShachar Raindel1-49/+68
Currently do_wp_page contains 265 code lines. It also contains 9 goto statements, of which 5 are targeting labels which are not cleanup related. This makes the function extremely difficult to understand. The following patches are an attempt at breaking the function to its basic components, and making it easier to understand. The patches are straight forward function extractions from do_wp_page. As we extract functions, we remove unneeded parameters and simplify the code as much as possible. However, the functionality is supposed to remain completely unchanged. The patches also attempt to document the functionality of each extracted function. In patch 2, we split the unlock logic to the contain logic relevant to specific needs of each use case, instead of having huge number of conditional decisions in a single unlock flow. This patch (of 4): When do_wp_page is ending, in several cases it needs to reuse the existing page. This is achieved by making the page table writable, and possibly updating the page-cache state. Currently, this logic was "called" by using a goto jump. This makes following the control flow of the function harder. It is also against the coding style guidelines for using goto. As the code can easily be refactored into a specialized function, refactor it out and simplify the code flow in do_wp_page. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Haggai Eran <haggaie@mellanox.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Michel Lespinasse <walken@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: do not add nr_pmds into mm_struct if PMD is foldedKirill A. Shutemov1-0/+2
CONFIG_PGTABLE_LEVELS is now available on every architecture and we can use it to check if we need to add nr_pmds into mm_struct. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Howells <dhowells@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: define default PGTABLE_LEVELS to twoKirill A. Shutemov2-0/+9
By this time all architectures which support more than two page table levels should be covered. This patch add default definiton of PGTABLE_LEVELS equal 2. We also add assert to detect inconsistence between CONFIG_PGTABLE_LEVELS and __PAGETABLE_PMD_FOLDED/__PAGETABLE_PUD_FOLDED. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Howells <dhowells@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14x86: expose number of page table levels on Kconfig levelKirill A. Shutemov13-40/+42
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14um: expose number of page table levelsKirill A. Shutemov1-0/+5
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Richard Weinberger <richard@nod.at> Cc: Jeff Dike <jdike@addtoit.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14tile: expose number of page table levelsKirill A. Shutemov1-0/+5
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Chris Metcalf <cmetcalf@ezchip.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14sparc: expose number of page table levelsKirill A. Shutemov1-0/+4
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14sh: expose number of page table levelsKirill A. Shutemov1-0/+4
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: linux-sh@vger.kernel.org Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14s390: expose number of page table levelsKirill A. Shutemov1-0/+5
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Core mm expects __PAGETABLE_{PUD,PMD}_FOLDED to be defined if these page table levels folded. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14powerpc: expose number of page table levels on Kconfig levelKirill A. Shutemov1-0/+6
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14parisc: expose number of page table levels on Kconfig levelKirill A. Shutemov6-15/+18
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mips: expose number of page table levels on Kconfig levelKirill A. Shutemov1-0/+5
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14m68k: mark PMD folded and expose number of page table levelsKirill A. Shutemov1-0/+4
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Core mm expects __PAGETABLE_{PUD,PMD}_FOLDED to be defined if these page table levels folded. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ia64: expose number of page table levels on Kconfig levelKirill A. Shutemov6-31/+23
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. We need to define PGTABLE_LEVELS before sourcing init/Kconfig: arch/Kconfig will define default value and it's sourced from init/Kconfig. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm: expose number of page table levels on Kconfig levelKirill A. Shutemov1-0/+5
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Russell King <linux@arm.linux.org.uk> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arm64: expose number of page table levels on Kconfig levelKirill A. Shutemov9-32/+32
We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. ARM64_PGTABLE_LEVELS is renamed to PGTABLE_LEVELS and defined before sourcing init/Kconfig: arch/Kconfig will define default value and it's sourced from init/Kconfig. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14alpha: expose number of page table levels on Kconfig levelKirill A. Shutemov1-0/+4
I've implemented accounting for pmd page tables as we have for pte (see mm->nr_ptes). It's requires a new counter in mm_struct: mm->nr_pmds. But the feature doesn't make any sense if an architecture has PMD level folded and it would be nice get rid of the counter in this case. The problem is that we cannot use __PAGETABLE_PMD_FOLDED in <linux/mm_types.h> due to circular dependencies: <linux/mm_types> -> <asm/pgtable.h> -> <linux/mm_types.h> In most cases <asm/pgtable.h> wants <linux/mm_types.h> to get definition of struct page and struct vm_area_struct. I've tried to split mm_struct into separate header file to be able to user <asm/pgtable.h> there. But it doesn't fly on some architectures, like ARM: it wants mm_struct <asm/pgtable.h> to implement tlb flushing. I don't see how to fix it without massive de-inlining or coverting a lot for inline functions to macros. This is other approach: expose number of page tables in use via Kconfig and use it in <linux/mm_types.h> instead of __PAGETABLE_PMD_FOLDED from <asm/pgtable.h>. This patch (of 19): We would want to use number of page table level to define mm_struct. Let's expose it as CONFIG_PGTABLE_LEVELS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Howells <dhowells@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: completely remove dumping per-cpu lists from show_mem()Konstantin Khlebnikov2-21/+2
It seems nobody needs this. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: hide per-cpu lists in output of show_mem()Konstantin Khlebnikov2-9/+31
This makes show_mem() much less verbose on huge machines. Instead of huge and almost useless dump of counters for each per-zone per-cpu lists this patch prints the sum of these counters for each zone (free_pcp) and size of per-cpu list for current cpu (local_pcp). The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode. [akpm@linux-foundation.org: update show_free_areas comment] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14page_writeback: clean up mess around cancel_dirty_page()Konstantin Khlebnikov9-49/+41
This patch replaces cancel_dirty_page() with a helper function account_page_cleaned() which only updates counters. It's called from truncate_complete_page() and from try_to_free_buffers() (hack for ext3). Page is locked in both cases, page-lock protects against concurrent dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from ongoing truncation"). Delete_from_page_cache() shouldn't be called for dirty pages, they must be handled by caller (either written or truncated). This patch treats final dirty accounting fixup at the end of __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE() around it. If something removes dirty pages without proper handling that might be a bug and unwritten data might be lost. Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough here. cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is helper for nfs_invalidate_page() and it's called only in case complete invalidation. The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up and make try_to_free_buffers() not race with dirty pages") and 3e67c0987d75 ("truncate: clear page dirtiness before running try_to_free_buffers()") first was reverted right in v2.6.20 in commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3 data=journal"). Custom fixes were introduced between these points. NFS in v2.6.23, commit 1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()"). Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do dirty page accounting when removing a page from the page cache"). Since v2.6.25 all of them are redundant. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: incorporate zero pages into transparent huge pagesEbru Akagunduz1-8/+20
This patch improves THP collapse rates, by allowing zero pages. Currently THP can collapse 4kB pages into a THP when there are up to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This patch counts pte none and mapped zero pages with the same variable. The patch was tested with a program that allocates 800MB of memory, and performs interleaved reads and writes, in a pattern that causes some 2MB areas to first see read accesses, resulting in the zero pfn being mapped there. To simulate memory fragmentation at allocation time, I modified do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults. Without the patch, only %50 of the program was collapsed into THP and the percentage did not increase over time. With this patch after 10 minutes of waiting khugepaged had collapsed %99 of the program's memory. [aarcange@redhat.com: fix bogus BUG()] Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/compaction: enhance compaction finish conditionJoonsoo Kim3-7/+29
Compaction has anti fragmentation algorithm. It is that freepage should be more than pageblock order to finish the compaction if we don't find any freepage in requested migratetype buddy list. This is for mitigating fragmentation, but, there is a lack of migratetype consideration and it is too excessive compared to page allocator's anti fragmentation algorithm. Not considering migratetype would cause premature finish of compaction. For example, if allocation request is for unmovable migratetype, freepage with CMA migratetype doesn't help that allocation and compaction should not be stopped. But, current logic regards this situation as compaction is no longer needed, so finish the compaction. Secondly, condition is too excessive compared to page allocator's logic. We can steal freepage from other migratetype and change pageblock migratetype on more relaxed conditions in page allocator. This is designed to prevent fragmentation and we can use it here. Imposing hard constraint only to the compaction doesn't help much in this case since page allocator would cause fragmentation again. To solve these problems, this patch borrows anti fragmentation logic from page allocator. It will reduce premature compaction finish in some cases and reduce excessive compaction work. stress-highalloc test in mmtests with non movable order 7 allocation shows considerable increase of compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 31.82 : 42.20 I tested it on non-reboot 5 runs stress-highalloc benchmark and found that there is no more degradation on allocation success rate than before. That roughly means that this patch doesn't result in more fragmentations. Vlastimil suggests additional idea that we only test for fallbacks when migration scanner has scanned a whole pageblock. It looked good for fragmentation because chance of stealing increase due to making more free pages in certain pageblock. So, I tested it, but, it results in decreased compaction success rate, roughly 38.00. I guess the reason that if system is low memory condition, watermark check could be failed due to not enough order 0 free page and so, sometimes, we can't reach a fallback check although migrate_pfn is aligned to pageblock_nr_pages. I can insert code to cope with this situation but it makes code more complicated so I don't include his idea at this patch. [akpm@linux-foundation.org: fix CONFIG_CMA=n build] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/page_alloc: factor out fallback freepage checkingJoonsoo Kim1-52/+91
This is preparation step to use page allocator's anti fragmentation logic in compaction. This patch just separates fallback freepage checking part from fallback freepage management part. Therefore, there is no functional change. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/cma: change fallback behaviour for CMA freepageJoonsoo Kim1-17/+19
Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they should not be expanded to other migratetype buddy list to protect them from unmovable/reclaimable allocation. Implementing these requirements in __rmqueue_fallback(), that is, finding largest possible block of freepage has bad effect that high order freepage with MIGRATE_CMA are broken continually although there are suitable order CMA freepage. Reason is that they are not be expanded to other migratetype buddy list and next __rmqueue_fallback() invocation try to finds another largest block of freepage and break it again. So, MIGRATE_CMA fallback should be handled separately. This patch introduces __rmqueue_cma_fallback(), that just wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if migratetype == MIGRATE_MOVABLE. This results in unintended behaviour change that MIGRATE_CMA freepage is always used first rather than other migratetype as movable allocation's fallback. But, as already mentioned above, MIGRATE_CMA can be used only for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as much as possible. Otherwise, we needlessly take up precious freepages with other migratetype and increase chance of fragmentation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, hotplug: fix concurrent memory hot-add deadlockDavid Rientjes3-28/+30
There's a deadlock when concurrently hot-adding memory through the probe interface and switching a memory block from offline to online. When hot-adding memory via the probe interface, add_memory() first takes mem_hotplug_begin() and then device_lock() is later taken when registering the newly initialized memory block. This creates a lock dependency of (1) mem_hotplug.lock (2) dev->mutex. When switching a memory block from offline to online, dev->mutex is first grabbed in device_online() when the write(2) transitions an existing memory block from offline to online, and then online_pages() will take mem_hotplug_begin(). This creates a lock inversion between mem_hotplug.lock and dev->mutex. Vitaly reports that this deadlock can happen when kworker handling a probe event races with systemd-udevd switching a memory block's state. This patch requires the state transition to take mem_hotplug_begin() before dev->mutex. Hot-adding memory via the probe interface creates a memory block while holding mem_hotplug_begin(), there is no way to take dev->mutex first in this case. online_pages() and offline_pages() are only called when transitioning memory block state. We now require that mem_hotplug_begin() is taken before calling them -- this requires exporting the mem_hotplug_begin() and mem_hotplug_done() to generic code. In all hot-add and hot-remove cases, mem_hotplug_begin() is done prior to device_online(). This is all that is needed to avoid the deadlock. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Wang Nan <wangnan0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cma: debug: document new debugfs interfaceSasha Levin1-0/+21
Document the structure and files under the new debugfs interface. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm-cma-allocation-trigger-fixAndrew Morton1-1/+1
s/CONFIG_CMA_ALIGNMENT/0/, per Joonsoo Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: cma: release triggerSasha Levin1-0/+58
Provides a userspace interface to trigger a CMA release. Usage: echo [pages] > free This would provide testing/fuzzing access to the CMA release paths. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.cz: fix build] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: cma: allocation triggerSasha Levin3-1/+63
Provides a userspace interface to trigger a CMA allocation. Usage: echo [pages] > alloc This would provide testing/fuzzing access to the CMA allocation paths. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: cma: debugfs interfaceSasha Levin5-15/+91
I've noticed that there is no interfaces exposed by CMA which would let me fuzz what's going on in there. This small patchset exposes some information out to userspace, plus adds the ability to trigger allocation and freeing from userspace. This patch (of 3): Implement a simple debugfs interface to expose information about CMA areas in the system. Useful for testing/sanity checks for CMA since it was impossible to previously retrieve this information in userspace. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14memory hotplug: use macro to switch between section and pfnSheng Yong2-2/+2
Use macro section_nr_to_pfn() to switch between section and pfn, instead of open-coding it. No semantic changes. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: memcontrol: update copyright noticeJohannes Weiner1-0/+6
Add myself to the list of copyright holders. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/memblock.c: rename local variable of memblock_type to `type'Baoquan He1-2/+2
A small cleanup. Seems in e3239ff9 ("memblock: Rename memblock_region to memblock_type and memblock_property to memblock_region") this one was missed. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: move mm_populate()-related code to mm/gup.cKirill A. Shutemov2-118/+118
It's odd that we have populate_vma_page_range() and __mm_populate() in mm/mlock.c. It's implementation of generic memory population and mlocking is one of possible side effect, if VM_LOCKED is set. __get_user_pages() is core of the implementation. Let's move the code into mm/gup.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: move gup() -> posix mlock() error conversion out of __mm_populateKirill A. Shutemov1-4/+7
This is praparation to moving mm_populate()-related code out of mm/mlock.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename __mlock_vma_pages_range() to populate_vma_page_range()Kirill A. Shutemov4-27/+17
__mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on vma flags. The same codepath is used for MAP_POPULATE. Let's rename __mlock_vma_pages_range() to populate_vma_page_range(). This patch also drops mlock_vma_pages_range() references from documentation. It has gone in cea10a19b797 ("mm: directly use __mlock_vma_pages_range() in find_extend_vma()"). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename FOLL_MLOCK to FOLL_POPULATEKirill A. Shutemov4-6/+6
After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in get_user_pages only for mlock") FOLL_MLOCK has lost its original meaning: we don't necessarily mlock the page if the flags is set -- we also take VM_LOCKED into consideration. Since we use the same codepath for __mm_populate(), let's rename FOLL_MLOCK to FOLL_POPULATE. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14slob: make slob_alloc_node() static and remove EXPORT_SYMBOL()Fabian Frederick1-2/+1
slob_alloc_node() is only used in slob.c. Remove the EXPORT_SYMBOL and make slob_alloc_node() static. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14slub: use bool function return values of true/false not 1/0Joe Perches1-6/+6
Use the normal return values for bool functions Signed-off-by: Joe Perches <joe@perches.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm, slab: correct config option in commentDavid Rientjes1-1/+1
CONFIG_SLAB_DEBUG doesn't exist, CONFIG_DEBUG_SLAB does. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/slub.c: parse slub_debug O option in switch statementChris J Arges1-9/+7
By moving the O option detection into the switch statement, we allow this parameter to be combined with other options correctly. Previously options like slub_debug=OFZ would only detect the 'o' and use DEBUG_DEFAULT_FLAGS to fill in the rest of the flags. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/migrate: mark unmap_and_move() "noinline" to avoid ICE in gcc 4.7.3Geert Uytterhoeven1-3/+14
With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) : mm/migrate.c: In function `migrate_pages': mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500 Please submit a full bug report, with preprocessed source if appropriate. See <file:///usr/share/doc/gcc-4.7/README.Bugs> for instructions. Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport. make[1]: *** [mm/migrate.o] Error 1 make: *** [mm/migrate.o] Error 2 Mark unmap_and_move() (which is used in a single place only) "noinline" to work around this compiler bug. [akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm] [khilman@kernel.org: fine-tune compiler versions] [akpm@linux-foundation.org: fix comment] Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reported-by: Kevin Hilman <khilman@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Tested-by: Kevin Hilman <khilman@linaro.org> Tested-by: Lina Iyer <lina.iyer@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: introduce the hardlockup_detector_disable() functionUlrich Obergfell3-27/+5
Have kvm_guest_init() use hardlockup_detector_disable() instead of watchdog_enable_hardlockup_detector(false). Remove the watchdog_hardlockup_detector_is_enabled() and the watchdog_enable_hardlockup_detector() function which are no longer needed. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: clean up some function names and argumentsUlrich Obergfell1-8/+12
Rename the update_timers*() functions to update_watchdog*(). Remove the boolean argument from watchdog_enable_all_cpus() because update_watchdog_all_cpus() is now a generic function to change the run state of the lockup detectors and to have the lockup detectors use a new sample period. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: enable the new user interface of the watchdog mechanismUlrich Obergfell5-89/+97
With the current user interface of the watchdog mechanism it is only possible to disable or enable both lockup detectors at the same time. This series introduces new kernel parameters and changes the semantics of some existing kernel parameters, so that the hard lockup detector and the soft lockup detector can be disabled or enabled individually. With this series applied, the user interface is as follows. - parameters in /proc/sys/kernel . soft_watchdog This is a new parameter to control and examine the run state of the soft lockup detector. . nmi_watchdog The semantics of this parameter have changed. It can now be used to control and examine the run state of the hard lockup detector. . watchdog This parameter is still available to control the run state of both lockup detectors at the same time. If this parameter is examined, it shows the logical OR of soft_watchdog and nmi_watchdog. . watchdog_thresh The semantics of this parameter are not affected by the patch. - kernel command line parameters . nosoftlockup The semantics of this parameter have changed. It can now be used to disable the soft lockup detector at boot time. . nmi_watchdog=0 or nmi_watchdog=1 Disable or enable the hard lockup detector at boot time. The patch introduces '=1' as a new option. . nowatchdog The semantics of this parameter are not affected by the patch. It is still available to disable both lockup detectors at boot time. Also, remove the proc_dowatchdog() function which is no longer needed. [dzickus@redhat.com: wrote changelog] [dzickus@redhat.com: update documentation for kernel params and sysctl] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: implement error handling for failure to set up hardware perf eventsUlrich Obergfell1-0/+30
If watchdog_nmi_enable() fails to set up the hardware perf event of one CPU, the entire hard lockup detector is deemed unreliable. Hence, disable the hard lockup detector and shut down the hardware perf events on all CPUs. [dzickus@redhat.com: update comments to explain some code] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: introduce separate handlers for parameters in /proc/sys/kernelUlrich Obergfell2-0/+67
Separate handlers for each watchdog parameter in /proc/sys/kernel replace the proc_dowatchdog() function. Three of those handlers merely call proc_watchdog_common() with one different argument. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: introduce proc_watchdog_common()Ulrich Obergfell1-0/+65
Three of four handlers for the watchdog parameters in /proc/sys/kernel essentially have to do the same thing. if the parameter is being read { return the state of the corresponding bit(s) in 'watchdog_enabled' } else { set/clear the state of the corresponding bit(s) in 'watchdog_enabled' update the run state of the lockup detector(s) } Hence, introduce a common function that can be called by those handlers. The callers pass a 'bit mask' to this function to indicate which bit(s) should be set/cleared in 'watchdog_enabled'. This function handles an uncommon race with watchdog_nmi_enable() where a concurrent update of 'watchdog_enabled' is possible. We use 'cmpxchg' to detect the concurrency. [This avoids introducing a new spinlock or a mutex to synchronize updates of 'watchdog_enabled'. Using the same lock or mutex in watchdog thread context and in system call context needs to be considered carefully because it can make the code prone to deadlock situations in connection with parking/unparking the watchdog threads.] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: move definition of 'watchdog_proc_mutex' outside of proc_dowatchdog()Ulrich Obergfell1-1/+2
This series removes proc_dowatchdog(). Since multiple new functions need the 'watchdog_proc_mutex' to serialize access to the watchdog parameters in /proc/sys/kernel, move the mutex outside of any function. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: introduce the proc_watchdog_update() functionUlrich Obergfell1-0/+23
This series introduces a separate handler for each watchdog parameter in /proc/sys/kernel. The separate handlers need a common function that they can call to update the run state of the lockup detectors, or to have the lockup detectors use a new sample period. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14watchdog: new definitions and variables, initializationUlrich Obergfell2-1/+28
The hardlockup and softockup had always been tied together. Due to the request of KVM folks, they had a need to have one enabled but not the other. Internally rework the code to split things apart more cleanly. There is a bunch of churn here, but the end result should be code that should be easier to maintain and fix without knowing the internals of what is going on. This patch (of 9): Introduce new definitions and variables to separate the user interface in /proc/sys/kernel from the internal run state of the lockup detectors. The internal run state is represented by two bits in a new variable that is named 'watchdog_enabled'. This helps simplify the code, for example: - In order to check if any of the two lockup detectors is enabled, it is sufficient to check if 'watchdog_enabled' is not zero. - In order to enable/disable one or both lockup detectors, it is sufficient to set/clear one or both bits in 'watchdog_enabled'. - Concurrent updates of 'watchdog_enabled' need not be synchronized via a spinlock or a mutex. Updates can either be atomic or concurrency can be detected by using 'cmpxchg'. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: make mlog_errno return the errnoAndrew Morton1-2/+3
ocfs2 does mlog_errno(v); return v; in many places. Change mlog_errno() so we can do return mlog_errno(v); For some weird reason this patch reduces the size of ocfs2 by 6k: akpm3:/usr/src/25> size fs/ocfs2/ocfs2.ko text data bss dec hex filename 1146613 82767 832192 2061572 1f7504 fs/ocfs2/ocfs2.ko-before 1140857 82767 832192 2055816 1f5e88 fs/ocfs2/ocfs2.ko-after [dan.carpenter@oracle.com: double evaluation concerns in mlog_errno()] Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: check if the ocfs2 lock resource has been initialized before calling ↵alex chen1-0/+5
ocfs2_dlm_lock If ocfs2 lockres has not been initialized before calling ocfs2_dlm_lock, the lock won't be dropped and then will lead umount hung. The case is described below: ocfs2_mknod ocfs2_mknod_locked __ocfs2_mknod_locked ocfs2_journal_access_di Failed because of -ENOMEM or other reasons, the inode lockres has not been initialized yet. iput(inode) ocfs2_evict_inode ocfs2_delete_inode ocfs2_inode_lock ocfs2_inode_lock_full_nested __ocfs2_cluster_lock Succeeds and allocates a new dlm lockres. ocfs2_clear_inode ocfs2_open_unlock ocfs2_drop_inode_locks ocfs2_drop_lock Since lockres has not been initialized, the lock can't be dropped and the lockres can't be migrated, thus umount will hang forever. Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: logging: remove static buffer, use vsprintf extension %pVJoe Perches1-15/+18
Use the vsprintf %pV extension to avoid using a static buffer and remove the now unnecessary buffer. Signed-off-by: Joe Perches <joe@perches.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: incorrect check for debugfs returnsChengyu Song3-16/+37
debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs is not configured, so the return value should be checked against ERROR_VALUE as well, otherwise the later dereference of the dentry pointer would crash the kernel. This patch tries to solve this problem by fixing certain checks. However, I have that found other call sites are protected by #ifdef CONFIG_DEBUG_FS. In current implementation, if CONFIG_DEBUG_FS is defined, then the above two functions will never return any ERROR_VALUE. So another possibility to fix this is to surround all the buggy checks/functions with the same #ifdef CONFIG_DEBUG_FS. But I'm not sure if this would break any functionality, as only OCFS2_FS_STATS declares dependency on DEBUG_FS. Signed-off-by: Chengyu Song <csong84@gatech.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: fix a typo in the copyright statementJakub Wilk1-1/+1
Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Reviewed-by: Eric Ren <zren@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: fix possible uninitialized variable accessJoseph Qi3-2/+10
In ocfs2_local_alloc_find_clear_bits and ocfs2_get_dentry, variable numfound and set may be uninitialized and then used in tracepoint. In ocfs2_xattr_block_get and ocfs2_delete_xattr_in_bucket, variable block_off and xv may be uninitialized and then used in the following logic due to unchecked return value. This patch fixes these possible issues. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: remove goto statement in ocfs2_check_dir_for_entry()Daeseok Youn1-8/+5
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: rollback the cleared bits if error occurs after ↵Joseph Qi1-0/+2
ocfs2_block_group_clear_bits ocfs2_block_group_clear_bits will clear bits in block group bitmap. Once it succeeds but fails in the following step, it will cause block group bitmap mismatch the corresponding count recorded in dinode. So rollback the cleared bits if error occurs. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: use ENOENT instead of EEXIST when get system file failsJoseph Qi2-3/+3
When ocfs2_get_system_file_inode fails, it is obscure to set the return value to -EEXIST. So change it to -ENOENT. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: use actual name length when find entry in ocfs2_orphan_del()Joseph Qi1-2/+2
If the namelen is 20 and name only has actual length 16, it will fail in ocfs2_find_entry because of mismatch. So use actual name length when find entry. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: dereferencing freed pointers in ocfs2_reflink()Dan Carpenter1-1/+1
The code at the "out" label assumes that "default_acl" and "acl" are NULL, but actually the pointers can be NULL, unitialized, or freed. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: fix typo in ocfs2_reserve_local_alloc_bitsJoseph Qi1-1/+1
In ocfs2_reserve_local_alloc_bits, it calls ocfs2_error if local alloc inode bitmap used bits mismatch, but the log mistakes it as free bits. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: do not use ocfs2_zero_extend during direct IOJoseph Qi1-8/+130
In ocfs2_direct_IO_write, we use ocfs2_zero_extend to zero allocated clusters in case of cluster not aligned. But ocfs2_zero_extend uses page cache, this may happen that it clears the data which blockdev_direct_IO has already written. We should use blkdev_issue_zeroout instead of ocfs2_zero_extend during direct IO. So fix this issue by introducing ocfs2_direct_IO_zero_extend and ocfs2_direct_IO_extend_no_holes. Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Tested-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: take inode lock when get clustersJoseph Qi1-1/+10
We need take inode lock when calling ocfs2_get_clusters. And use GFP_NOFS instead of GFP_KERNEL. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: no need get dinode bh when zeroing extendJoseph Qi1-5/+1
Since di_bh won't be used when zeroing extend, set it to NULL. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: fix a typing error in ocfs2_direct_IO_writeJoseph Qi1-1/+1
Only when direct IO succeeds we need consider zeroing out in case of cluster not aligned. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: avoid a pointless delay in o2cb_cluster_check()Daeseok Youn1-1/+1
Fix an off-by-one when attempting to avoid an msleep() on the final loop iteration. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: one function call less in user_cluster_connect() after error detectionMarkus Elfring1-4/+2
kfree() was called by user_cluster_connect() even if a previous call of the kzalloc() function failed. Return from this implementation directly after failure detection. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: one function call less in ocfs2_init_slot_info() after error detectionMarkus Elfring1-1/+1
__ocfs2_free_slot_info() was called by ocfs2_init_slot_info() even if a call of the kzalloc() function failed. Return from this implementation directly after corresponding exception handling. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: one function call less in ocfs2_merge_rec_right() after error detectionMarkus Elfring1-1/+1
ocfs2_free_path() was called by ocfs2_merge_rec_right() even if a call of the ocfs2_get_right_path() function failed. Return from this implementation directly after corresponding exception handling. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: one function call less in ocfs2_merge_rec_left() after error detectionMarkus Elfring1-1/+1
ocfs2_free_path() was called by ocfs2_merge_rec_left() even if a call of the ocfs2_get_left_path() function failed. Return from this implementation directly after corresponding exception handling. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: less function calls in ocfs2_figure_merge_contig_type() after error ↵Markus Elfring1-11/+13
detection ocfs2_free_path() was called in some cases by ocfs2_figure_merge_contig_type() during error handling even if the passed variables "left_path" and "right_path" contained still a null pointer. Corresponding implementation details could be improved by adjustments for jump labels according to the current Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: less function calls in ocfs2_convert_inline_data_to_extents() after ↵Markus Elfring1-2/+3
error detection kfree() was called in a few cases by ocfs2_convert_inline_data_to_extents() during error handling even if the passed variable "pages" contained a null pointer. * Return from this implementation directly after failure detection for the function call "kcalloc". * Corresponding details could be improved by the introduction of another jump label. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: delete unnecessary checks before three function callsMarkus Elfring3-14/+7
kfree(), ocfs2_free_path() and __ocfs2_free_slot_info() test whether their argument is NULL and then return immediately. Thus the test around their calls is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arch/sh/kernel/dwarf.c: use mempool_create_slab_pool()David Rientjes1-8/+4
Mempools created for slab caches should use mempool_create_slab_pool(). Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14arch/sh/kernel/dwarf.c: destroy mempools on cleanupDavid Rientjes1-1/+5
dwarf_reg_pool and dwarf_frame_pool are not properly destroyed when cleaning up the dwarf unwinder. Destroy them with mempool_destroy(). Also mark dwarf_unwinder_cleanup() as __init. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14scripts/coccinelle/misc/bugon.cocci: update bug_on conversion warningFabian Frederick1-1/+1
if()/BUG conversion to BUG_ON must be avoided when there's side effect in condition. The reason being BUG_ON won't execute the condition when CONFIG_BUG is not defined. With inspiration from Bruce Fields. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Julia Lawall <Julia.Lawall@lip6.fr> Cc: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm/hugetlb: use pmd_page() in follow_huge_pmd()Gerald Schaefer1-2/+1
Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte layout differ and using pte_page() on a huge pmd will return wrong results. Using pmd_page() instead fixes this. All architectures that were touched by that commit have pmd_page() defined, so this should not break anything on other architectures. Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*" Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz>, Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14Merge tag 'gfs2-merge-window' of ↵Linus Torvalds13-97/+186
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Bob Peterson: "Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. Most of the patches fix GFS2 quotas, which were not properly enforced. There's another that adds me as a GFS2 co-maintainer, and a couple patches that fix a kernel panic doing splice_write on GFS2 as well as a few correctness patches" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: fix quota refresh race in do_glock() gfs2: incorrect check for debugfs returns gfs2: allow fallocate to max out quotas/fs efficiently gfs2: allow quota_check and inplace_reserve to return available blocks gfs2: perform quota checks against allocation parameters GFS2: Move gfs2_file_splice_write outside of #ifdef GFS2: Allocate reservation during splice_write GFS2: gfs2_set_acl(): Cache "no acl" as well Add myself (Bob Peterson) as a maintainer of GFS2
2015-04-14Merge tag 'jfs-4.1' of git://github.com/kleikamp/linux-shaggyLinus Torvalds1-1/+1
Pull jfs update from David Kleikamp: "Not much this time. Just a one-liner format fix" * tag 'jfs-4.1' of git://github.com/kleikamp/linux-shaggy: jfs: %pf is only for function pointers
2015-04-14Merge tag 'please-pull-pstore' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore fix from Tony Luck: "Just one pstore fix this release" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Fix the ramoops module parameters update
2015-04-14fm10k: Bump driver version to 0.15.2Jeff Kirsher1-1/+1
With the recent driver changes, bump the version. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
2015-04-14fm10k: corrected VF multicast updateJeff Kirsher1-2/+5
VFs were being improperly added to the switch's multicast group. The error stems from the fact that incorrect arguments were passed to the "update_mc_addr" function. It would seem to be a copy paste error since the parameters are similar to the "update_uc_addr" function. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com> Acked-by: Matthew Vick <matthew.vick@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-04-14fm10k: mbx_update_max_size does not drop all oversized messagesJeff Kirsher1-3/+6
When we call update_max_size it does not drop all oversized messages. This is due to the difficulty in performing this operation, since it is a FIFO which makes updating anything other than head or tail very difficult. To fix this, modify validate_msg_size to ensure that we error out later when trying to transmit the message that could be oversized. This will generally be a rare condition, as it requires the FIFO to include a message larger than the max_size negotiated during mailbox connect. Note that max_size is always smaller than rx.size so it should be safe to use here. Also, update the update_max_size function header comment to clearly indicate that it does not drop all oversized messages, but only those at the head of the FIFO. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Matthew Vick <matthew.vick@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-04-14fm10k: reset head instead of calling update_max_sizeJeff Kirsher1-2/+16
When we forcefully shutdown the mailbox, we then go about resetting max size to 0, and clearing all messages in the FIFO. Instead, we should just reset the head pointer so that the FIFO becomes empty, rather than changing the max size to 0. This helps prevent increment in tx_dropped counter during mailbox negotiation, which is confusing to viewers of Linux ethtool statistics output. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Matthew Vick <matthew.vick@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-04-14fm10k: renamed mbx_tx_dropped to mbx_tx_oversizedJeff Kirsher1-1/+1
The use of dropped doesn't really mean dropped mailbox messages, but rather specifically messages which were too large to fit in the remote Rx FIFO. Rename the stat to more clearly indicate what it means. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Matthew Vick <matthew.vick@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-04-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller36-579/+739
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next A final pull request, I know it's very late but this time I think it's worth a bit of rush. The following patchset contains Netfilter/nf_tables updates for net-next, more specifically concatenation support and dynamic stateful expression instantiation. This also comes with a couple of small patches. One to fix the ebtables.h userspace header and another to get rid of an obsolete example file in tree that describes a nf_tables expression. This time, I decided to paste the original descriptions. This will result in a rather large commit description, but I think these bytes to keep. Patrick McHardy says: ==================== netfilter: nf_tables: concatenation support The following patches add support for concatenations, which allow multi dimensional exact matches in O(1). The basic idea is to split the data registers, currently consisting of 4 registers of 16 bytes each, into smaller units, 16 registers of 4 bytes each, and making sure each register store always leaves the full 32 bit in a well defined state, meaning smaller stores will zero the remaining bits. Based on that, we can load multiple adjacent registers with different values, thereby building a concatenated bigger value, and use that value for set lookups. Sets are changed to use variable sized extensions for their key and data values, removing the fixed limit of 16 bytes while saving memory if less space is needed. As a side effect, these patches will allow some nice optimizations in the future, like using jhash2 in nft_hash, removing the masking in nft_cmp_fast, optimized data comparison using 32 bit word size etc. These are not done so far however. The patches are split up as follows: * the first five patches add length validation to register loads and stores to make sure we stay within bounds and prepare the validation functions for the new addressing mode * the next patches prepare for changing to 32 bit addressing by introducing a struct nft_regs, which holds the verdict register as well as the data registers. The verdict members are moved to a new struct nft_verdict to allow to pull struct nft_data out of the stack. * the next patches contain preparatory conversions of expressions and sets to use 32 bit addressing * the next patch introduces so far unused register conversion helpers for parsing and dumping register numbers over netlink * following is the real conversion to 32 bit addressing, consisting of replacing struct nft_data in struct nft_regs by an array of u32s and actually translating and validating the new register numbers. * the final two patches add support for variable sized data items and variable sized keys / data in set elements The patches have been verified to work correctly with nft binaries using both old and new addressing. ==================== Patrick McHardy says: ==================== netfilter: nf_tables: dynamic stateful expression instantiation The following patches are the grand finale of my nf_tables set work, using all the building blocks put in place by the previous patches to support something like iptables hashlimit, but a lot more powerful. Sets are extended to allow attaching expressions to set elements. The dynset expression dynamically instantiates these expressions based on a template when creating new set elements and evaluates them for all new or updated set members. In combination with concatenations this effectively creates state tables for arbitrary combinations of keys, using the existing expression types to maintain that state. Regular set GC takes care of purging expired states. We currently support two different stateful expressions, counter and limit. Using limit as a template we can express the functionality of hashlimit, but completely unrestricted in the combination of keys. Using counter we can perform accounting for arbitrary flows. The following examples from patch 5/5 show some possibilities. Userspace syntax is still WIP, especially the listing of state tables will most likely be seperated from normal set listings and use a more structured format: 1. Limit the rate of new SSH connections per host, similar to iptables hashlimit: flow ip saddr timeout 60s \ limit 10/second \ accept 2. Account network traffic between each set of /24 networks: flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \ counter 3. Account traffic to each host per user: flow skuid . ip daddr \ counter 4. Account traffic for each combination of source address and TCP flags: flow ip saddr . tcp flags \ counter The resulting set content after a Xmas-scan look like this: { 192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040, 192.168.122.1 . ack : counter packets 74 bytes 3848, 192.168.122.1 . psh | ack : counter packets 35 bytes 3144 } In the future the "expressions attached to elements" will be extended to also support user created non-stateful expressions to allow to efficiently select beween a set of parameter sets, f.i. a set of log statements with different prefixes based on the interface, which currently require one rule each. This will most likely have to wait until the next kernel version though. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>