commit 946de0e6b6ed49eacb03e3cddfcb1d774d6378ed Author: Greg Kroah-Hartman Date: Thu Aug 14 09:38:34 2014 +0800 Linux 3.14.17 commit 23647e98786630a3d0ef65cc0304dc75f2da0eb7 Author: Dave Chinner Date: Tue May 20 08:18:09 2014 +1000 xfs: log vector rounding leaks log space commit 110dc24ad2ae4e9b94b08632fe1eb2fcdff83045 upstream. The addition of direct formatting of log items into the CIL linear buffer added alignment restrictions that the start of each vector needed to be 64 bit aligned. Hence padding was added in xlog_finish_iovec() to round up the vector length to ensure the next vector started with the correct alignment. This adds a small number of bytes to the size of the linear buffer that is otherwise unused. The issue is that we then use the linear buffer size to determine the log space used by the log item, and this includes the unused space. Hence when we account for space used by the log item, it's more than is actually written into the iclogs, and hence we slowly leak this space. This results on log hangs when reserving space, with threads getting stuck with these stack traces: Call Trace: [] schedule+0x29/0x70 [] xlog_grant_head_wait+0xa2/0x1a0 [] xlog_grant_head_check+0xbd/0x140 [] xfs_log_reserve+0x103/0x220 [] xfs_trans_reserve+0x2f5/0x310 ..... The 4 bytes is significant. Brain Foster did all the hard work in tracking down a reproducable leak to inode chunk allocation (it went away with the ikeep mount option). His rough numbers were that creating 50,000 inodes leaked 11 log blocks. This turns out to be roughly 800 inode chunks or 1600 inode cluster buffers. That works out at roughly 4 bytes per cluster buffer logged, and at that I started looking for a 4 byte leak in the buffer logging code. What I found was that a struct xfs_buf_log_format structure for an inode cluster buffer is 28 bytes in length. This gets rounded up to 32 bytes, but the vector length remains 28 bytes. Hence the CIL ticket reservation is decremented by 32 bytes (via lv->lv_buf_len) for that vector rather than 28 bytes which are written into the log. The fix for this problem is to separately track the bytes used by the log vectors in the item and use that instead of the buffer length when accounting for the log space that will be used by the formatted log item. Again, thanks to Brian Foster for doing all the hard work and long hours to isolate this leak and make finding the bug relatively simple. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Brian Foster Signed-off-by: Dave Chinner Cc: Bill Signed-off-by: Greg Kroah-Hartman commit a7129df1c448ee418dd53c00404bfc5a168aca53 Author: Andrey Utkin Date: Mon Aug 4 23:47:41 2014 +0300 arch/sparc/math-emu/math_32.c: drop stray break operator [ Upstream commit 093758e3daede29cb4ce6aedb111becf9d4bfc57 ] This commit is a guesswork, but it seems to make sense to drop this break, as otherwise the following line is never executed and becomes dead code. And that following line actually saves the result of local calculation by the pointer given in function argument. So the proposed change makes sense if this code in the whole makes sense (but I am unable to analyze it in the whole). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641 Reported-by: David Binderman Signed-off-by: Andrey Utkin Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit bb4c9762d94625b1acd8a5232ff334c601ee0278 Author: Sowmini Varadhan Date: Fri Aug 1 09:50:40 2014 -0400 sparc64: ldc_connect() should not return EINVAL when handshake is in progress. [ Upstream commit 4ec1b01029b4facb651b8ef70bc20a4be4cebc63 ] The LDC handshake could have been asynchronously triggered after ldc_bind() enables the ldc_rx() receive interrupt-handler (and thus intercepts incoming control packets) and before vio_port_up() calls ldc_connect(). If that is the case, ldc_connect() should return 0 and let the state-machine progress. Signed-off-by: Sowmini Varadhan Acked-by: Karl Volz Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit f8cfa5d87b4367e7501610256581a17531ae158c Author: Christopher Alexander Tobias Schulze Date: Sun Aug 3 16:01:53 2014 +0200 sunsab: Fix detection of BREAK on sunsab serial console [ Upstream commit fe418231b195c205701c0cc550a03f6c9758fd9e ] Fix detection of BREAK on sunsab serial console: BREAK detection was only performed when there were also serial characters received simultaneously. To handle all BREAKs correctly, the check for BREAK and the corresponding call to uart_handle_break() must also be done if count == 0, therefore duplicate this code fragment and pull it out of the loop over the received characters. Patch applies to 3.16-rc6. Signed-off-by: Christopher Alexander Tobias Schulze Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit eb191d89b4db51c12e0a9c60059cd2bfce857d8d Author: Christopher Alexander Tobias Schulze Date: Sun Aug 3 15:44:52 2014 +0200 bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000 [ Upstream commit 5cdceab3d5e02eb69ea0f5d8fa9181800baf6f77 ] Fix regression in bbc i2c temperature and fan control on some Sun systems that causes the driver to refuse to load due to the bbc_i2c_bussel resource not being present on the (second) i2c bus where the temperature sensors and fan control are located. (The check for the number of resources was removed when the driver was ported to a pure OF driver in mid 2008.) Signed-off-by: Christopher Alexander Tobias Schulze Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 49e6ec05a28fa71d131923fb02d0116833635f7f Author: David S. Miller Date: Mon Aug 4 20:07:37 2014 -0700 sparc64: Guard against flushing openfirmware mappings. [ Upstream commit 4ca9a23765da3260058db3431faf5b4efd8cf926 ] Based almost entirely upon a patch by Christopher Alexander Tobias Schulze. In commit db64fe02258f1507e13fe5212a989922323685ce ("mm: rewrite vmap layer") lazy VMAP tlb flushing was added to the vmalloc layer. This causes problems on sparc64. Sparc64 has two VMAP mapped regions and they are not contiguous with eachother. First we have the malloc mapping area, then another unrelated region, then the vmalloc region. This "another unrelated region" is where the firmware is mapped. If the lazy TLB flushing logic in the vmalloc code triggers after we've had both a module unload and a vfree or similar, it will pass an address range that goes from somewhere inside the malloc region to somewhere inside the vmalloc region, and thus covering the openfirmware area entirely. The sparc64 kernel learns about openfirmware's dynamic mappings in this region early in the boot, and then services TLB misses in this area. But openfirmware has some locked TLB entries which are not mentioned in those dynamic mappings and we should thus not disturb them. These huge lazy TLB flush ranges causes those openfirmware locked TLB entries to be removed, resulting in all kinds of problems including hard hangs and crashes during reboot/reset. Besides causing problems like this, such huge TLB flush ranges are also incredibly inefficient. A plea has been made with the author of the VMAP lazy TLB flushing code, but for now we'll put a safety guard into our flush_tlb_kernel_range() implementation. Since the implementation has become non-trivial, stop defining it as a macro and instead make it a function in a C source file. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit b4a8febea9f14c69ec3038cb10c1fcfd2d44401d Author: David S. Miller Date: Mon Aug 4 16:34:01 2014 -0700 sparc64: Do not insert non-valid PTEs into the TSB hash table. [ Upstream commit 18f38132528c3e603c66ea464727b29e9bbcb91b ] The assumption was that update_mmu_cache() (and the equivalent for PMDs) would only be called when the PTE being installed will be accessible by the user. This is not true for code paths originating from remove_migration_pte(). There are dire consequences for placing a non-valid PTE into the TSB. The TLB miss frramework assumes thatwhen a TSB entry matches we can just load it into the TLB and return from the TLB miss trap. So if a non-valid PTE is in there, we will deadlock taking the TLB miss over and over, never satisfying the miss. Just exit early from update_mmu_cache() and friends in this situation. Based upon a report and patch from Christopher Alexander Tobias Schulze. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit a8973758f3bd08a2d102ee96ec28b2fe9242d77b Author: David S. Miller Date: Sat May 17 11:28:05 2014 -0700 sparc64: Add membar to Niagara2 memcpy code. [ Upstream commit 5aa4ecfd0ddb1e6dcd1c886e6c49677550f581aa ] This is the prevent previous stores from overlapping the block stores done by the memcpy loop. Based upon a glibc patch by Jose E. Marchesi Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit c184f95bbe3e613461c17ded62206c921a6e64b7 Author: David S. Miller Date: Wed May 7 14:07:32 2014 -0700 sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus. [ Upstream commit b18eb2d779240631a098626cb6841ee2dd34fda0 ] Access to the TSB hash tables during TLB misses requires that there be an atomic 128-bit quad load available so that we fetch a matching TAG and DATA field at the same time. On cpus prior to UltraSPARC-III only virtual address based quad loads are available. UltraSPARC-III and later provide physical address based variants which are easier to use. When we only have virtual address based quad loads available this means that we have to lock the TSB into the TLB at a fixed virtual address on each cpu when it runs that process. We can't just access the PAGE_OFFSET based aliased mapping of these TSBs because we cannot take a recursive TLB miss inside of the TLB miss handler without risking running out of hardware trap levels (some trap combinations can be deep, such as those generated by register window spill and fill traps). Without huge pages it's working perfectly fine, but when the huge TSB got added another chunk of fixed virtual address space was not allocated for this second TSB mapping. So we were mapping both the 8K and 4MB TSBs to the same exact virtual address, causing multiple TLB matches which gives undefined behavior. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 8f0a3c34dd189a5318c23dbb82b3c675669e10e0 Author: David S. Miller Date: Tue May 6 21:27:37 2014 -0700 sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses. [ Upstream commit e5c460f46ae7ee94831cb55cb980f942aa9e5a85 ] This was found using Dave Jone's trinity tool. When a user process which is 32-bit performs a load or a store, the cpu chops off the top 32-bits of the effective address before translating it. This is because we run 32-bit tasks with the PSTATE_AM (address masking) bit set. We can't run the kernel with that bit set, so when the kernel accesses userspace no address masking occurs. Since a 32-bit process will have no mappings in that region we will properly fault, so we don't try to handle this using access_ok(), which can safely just be a NOP on sparc64. Real faults from 32-bit processes should never generate such addresses so a bug check was added long ago, and it barks in the logs if this happens. But it also barks when a kernel user access causes this condition, and that _can_ happen. For example, if a pointer passed into a system call is "0xfffffffc" and the kernel access 4 bytes offset from that pointer. Just handle such faults normally via the exception entries. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 89e69c2e74e8b45bd4ae90201e68f1762dc6edca Author: David S. Miller Date: Tue Apr 29 13:28:23 2014 -0700 sparc64: Give more detailed information in {pgd,pmd}_ERROR() and kill pte_ERROR(). [ Upstream commit fe866433f843b080246ce729b5e6b27b5f5d9a58 ] pte_ERROR() is not used anywhere, delete it. For pgd_ERROR() and pmd_ERROR(), output something similar to x86, giving the address of the pgd/pmd as well as it's value. Also provide the caller, since these macros are invoked from pgd_clear_bad() and pmd_clear_bad() which provides little context as to what high level operation was occuring when the BAD state was detected. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit e4f2eef20cc69dfcb98636db108786e1fc26f3b9 Author: David S. Miller Date: Tue Apr 29 13:03:27 2014 -0700 sparc64: Add basic validations to {pud,pmd}_bad(). [ Upstream commit 26cf432551d749e7d581db33529507a711c6eaab ] Instead of returning false we should at least check the most basic things, otherwise page table corruptions will be very difficult to debug. PMD and PTE tables are of size PAGE_SIZE, so none of the sub-PAGE_SIZE bits should be set. We also complement this with a check that the physical address the pud/pmd points to is valid memory. PowerPC was used as a guide while implementating this. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit eb8ab8736fb6f66081841e407a37f6c1222f4d81 Author: David S. Miller Date: Sat May 3 22:52:50 2014 -0700 sparc64: Use 'ILOG2_4MB' instead of constant '22'. [ Upstream commit 0eef331a3d0ee970dcbebd1bd5fcb57ca33ece01 ] Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 97f73178c946059fc540bbf839ffbd63eeeb7b20 Author: David S. Miller Date: Tue Apr 29 12:58:03 2014 -0700 sparc64: Fix range check in kern_addr_valid(). [ Upstream commit ee73887e92a69ae0a5cda21c68ea75a27804c944 ] In commit b2d438348024b75a1ee8b66b85d77f569a5dfed8 ("sparc64: Make PAGE_OFFSET variable."), the MAX_PHYS_ADDRESS_BITS value was increased (to 47). This constant reference to '41UL' was missed. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit b97e18681f5881ca566221501fd6d7de44dd0724 Author: David S. Miller Date: Mon Apr 28 23:52:11 2014 -0700 sparc64: Fix top-level fault handling bugs. [ Upstream commit 70ffc6ebaead783ac8dafb1e87df0039bb043596 ] Make get_user_insn() able to cope with huge PMDs. Next, make do_fault_siginfo() more robust when get_user_insn() can't actually fetch the instruction. In particular, use the MMU announced fault address when that happens, instead of calling compute_effective_address() and computing garbage. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit d5ad2c01bc531cc0e4310813d99211ff49c98c04 Author: David S. Miller Date: Mon Apr 28 23:50:08 2014 -0700 sparc64: Handle 32-bit tasks properly in compute_effective_address(). [ Upstream commit d037d16372bbe4d580342bebbb8826821ad9edf0 ] If we have a 32-bit task we must chop off the top 32-bits of the 64-bit value just as the cpu would. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 83ad186cc68bd775e88d857be50c602d3de13eac Author: David S. Miller Date: Mon Apr 28 19:11:27 2014 -0700 sparc64: Don't use _PAGE_PRESENT in pte_modify() mask. [ Upstream commit eaf85da82669b057f20c4e438dc2566b51a83af6 ] Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 4f5e9cf3886f090981995875b51cb3c8884cc39c Author: David S. Miller Date: Sun Apr 27 21:01:56 2014 -0700 sparc64: Fix hex values in comment above pte_modify(). [ Upstream commit c2e4e676adb40ea764af79d3e08be954e14a0f4c ] When _PAGE_SPECIAL and _PAGE_PMD_HUGE were added to the mask, the comment was not updated. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit b38b6f2820dc4da30e018f36aa064344e7182139 Author: David S. Miller Date: Fri Apr 25 10:21:12 2014 -0700 sparc64: Fix bugs in get_user_pages_fast() wrt. THP. [ Upstream commit 04df419de34104d8818b8c5cffaa062fa36d20ea ] The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to decide if it needs to bail and return 0. pmd_large() should therefore just check _PAGE_PMD_HUGE. Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we just need to add a valid bit check. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit a73d6bccb870b403451cfc325520a424bda135fe Author: David S. Miller Date: Thu Apr 24 13:58:02 2014 -0700 sparc64: Fix huge PMD invalidation. [ Upstream commit 51e5ef1bb7ab0e5fa7de4e802da5ab22fe35f0bf ] On sparc64 "present" and "valid" are seperate PTE bits, this allows us to naturally distinguish between the user explicitly asking for PROT_NONE with mprotect() and other situations. However we weren't handling this properly in the huge PMD paths. First of all, the page table walker in the TSB miss path only checks for _PAGE_PMD_HUGE. So the generic pmdp_invalidate() would clear _PAGE_PRESENT but the TLB miss paths would still load it into the TLB as a valid huge PMD. Fix this by clearing the valid bit in pmdp_invalidate(), and also checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez" since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 24924bc7e27730ce45cfdcda8333249e3af38c41 Author: David S. Miller Date: Sun Apr 20 21:55:01 2014 -0400 sparc64: Fix executable bit testing in set_pmd_at() paths. [ Upstream commit 5b1e94fa439a3227beefad58c28c17f68287a8e9 ] This code was mistakenly using the exec bit from the PMD in all cases, even when the PMD isn't a huge PMD. If it's not a huge PMD, test the exec bit in the individual ptes down in tlb_batch_pmd_scan(). Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 7f6073dcb4268b6a848894c8dd07832b39992375 Author: Kirill Tkhai Date: Thu Apr 17 00:45:24 2014 +0400 sparc64: Make itc_sync_lock raw [ Upstream commit 49b6c01f4c1de3b5e5427ac5aba80f9f6d27837a ] One more place where we must not be able to be preempted or to be interrupted in RT. Always actually disable interrupts during synchronization cycle. Signed-off-by: Kirill Tkhai Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 4e81df2a60624359b40ed435804e4b7ea31b545c Author: David S. Miller Date: Wed Apr 30 19:37:48 2014 -0700 sparc64: Fix argument sign extension for compat_sys_futex(). [ Upstream commit aa3449ee9c87d9b7660dd1493248abcc57769e31 ] Only the second argument, 'op', is signed. Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 662a0e381b140f8e7b52d77950d97f958946e877 Author: Eric Dumazet Date: Tue Aug 5 16:49:52 2014 +0200 sctp: fix possible seqlock seadlock in sctp_packet_transmit() [ Upstream commit 757efd32d5ce31f67193cc0e6a56e4dffcc42fb1 ] Dave reported following splat, caused by improper use of IP_INC_STATS_BH() in process context. BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33 ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207 0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40 ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3 Call Trace: [] dump_stack+0x4e/0x7a [] check_preemption_disabled+0xfa/0x100 [] __this_cpu_preempt_check+0x13/0x20 [] sctp_packet_transmit+0x692/0x710 [sctp] [] sctp_outq_flush+0x2a2/0xc30 [sctp] [] ? mark_held_locks+0x7c/0xb0 [] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [] sctp_outq_uncork+0x1a/0x20 [sctp] [] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp] [] sctp_do_sm+0xdb/0x330 [sctp] [] ? preempt_count_sub+0xab/0x100 [] ? sctp_cname+0x70/0x70 [sctp] [] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp] [] sctp_sendmsg+0x88f/0xe30 [sctp] [] ? lock_release_holdtime.part.28+0x9a/0x160 [] ? put_lock_stats.isra.27+0xe/0x30 [] inet_sendmsg+0x104/0x220 [] ? inet_sendmsg+0x5/0x220 [] sock_sendmsg+0x9e/0xe0 [] ? might_fault+0xb9/0xc0 [] ? might_fault+0x5e/0xc0 [] SYSC_sendto+0x124/0x1c0 [] ? syscall_trace_enter+0x250/0x330 [] SyS_sendto+0xe/0x10 [] tracesys+0xdd/0xe2 This is a followup of commits f1d8cba61c3c4b ("inet: fix possible seqlock deadlocks") and 7f88c6b23afbd315 ("ipv6: fix possible seqlock deadlock in ip6_finish_output2") Signed-off-by: Eric Dumazet Cc: Hannes Frederic Sowa Reported-by: Dave Jones Acked-by: Neil Horman Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 1ae7e76c6177eeed8ac1437ece4afde1183745f5 Author: Sven Eckelmann Date: Mon May 26 17:21:39 2014 +0200 batman-adv: Fix out-of-order fragmentation support [ Upstream commit d9124268d84a836f14a6ead54ff9d8eee4c43be5 ] batadv_frag_insert_packet was unable to handle out-of-order packets because it dropped them directly. This is caused by the way the fragmentation lists is checked for the correct place to insert a fragmentation entry. The fragmentation code keeps the fragments in lists. The fragmentation entries are kept in descending order of sequence number. The list is traversed and each entry is compared with the new fragment. If the current entry has a smaller sequence number than the new fragment then the new one has to be inserted before the current entry. This ensures that the list is still in descending order. An out-of-order packet with a smaller sequence number than all entries in the list still has to be added to the end of the list. The used hlist has no information about the last entry in the list inside hlist_head and thus the last entry has to be calculated differently. Currently the code assumes that the iterator variable of hlist_for_each_entry can be used for this purpose after the hlist_for_each_entry finished. This is obviously wrong because the iterator variable is always NULL when the list was completely traversed. Instead the information about the last entry has to be stored in a different variable. This problem was introduced in 610bfc6bc99bc83680d190ebc69359a05fc7f605 ("batman-adv: Receive fragmented packets and merge"). Signed-off-by: Sven Eckelmann Signed-off-by: Marek Lindner Signed-off-by: Antonio Quartulli Signed-off-by: Greg Kroah-Hartman commit b79cc879e13e5be871583fa67df959066c94c028 Author: Sasha Levin Date: Thu Jul 31 23:00:35 2014 -0400 iovec: make sure the caller actually wants anything in memcpy_fromiovecend [ Upstream commit 06ebb06d49486676272a3c030bfeef4bd969a8e6 ] Check for cases when the caller requests 0 bytes instead of running off and dereferencing potentially invalid iovecs. Signed-off-by: Sasha Levin Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit aa85c9d440546cb8e03ce94492f67816af107f70 Author: Vlad Yasevich Date: Thu Jul 31 10:33:06 2014 -0400 net: Correctly set segment mac_len in skb_segment(). [ Upstream commit fcdfe3a7fa4cb74391d42b6a26dc07c20dab1d82 ] When performing segmentation, the mac_len value is copied right out of the original skb. However, this value is not always set correctly (like when the packet is VLAN-tagged) and we'll end up copying a bad value. One way to demonstrate this is to configure a VM which tags packets internally and turn off VLAN acceleration on the forwarding bridge port. The packets show up corrupt like this: 16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0, 0x0000: 8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d. 0x0010: 0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0.. 0x0020: 29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3 0x0030: 000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp 0x0040: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp 0x0050: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp 0x0060: 6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp ... This also leads to awful throughput as GSO packets are dropped and cause retransmissions. The solution is to set the mac_len using the values already available in then new skb. We've already adjusted all of the header offset, so we might as well correctly figure out the mac_len using skb_reset_mac_len(). After this change, packets are segmented correctly and performance is restored. CC: Eric Dumazet Signed-off-by: Vlad Yasevich Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 6d7a7d3623a1732deff22faeb4a70322794d70f5 Author: Vlad Yasevich Date: Thu Jul 31 10:30:25 2014 -0400 macvlan: Initialize vlan_features to turn on offload support. [ Upstream commit 081e83a78db9b0ae1f5eabc2dedecc865f509b98 ] Macvlan devices do not initialize vlan_features. As a result, any vlan devices configured on top of macvlans perform very poorly. Initialize vlan_features based on the vlan features of the lower-level device. Signed-off-by: Vlad Yasevich Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 672fcd4d4631dc45c650cad3576f880c0907e2e3 Author: Daniel Borkmann Date: Tue Jul 22 15:22:45 2014 +0200 net: sctp: inherit auth_capable on INIT collisions [ Upstream commit 1be9a950c646c9092fb3618197f7b6bfb50e82aa ] Jason reported an oops caused by SCTP on his ARM machine with SCTP authentication enabled: Internal error: Oops: 17 [#1] ARM CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1 task: c6eefa40 ti: c6f52000 task.ti: c6f52000 PC is at sctp_auth_calculate_hmac+0xc4/0x10c LR is at sg_init_table+0x20/0x38 pc : [] lr : [] psr: 40000013 sp : c6f538e8 ip : 00000000 fp : c6f53924 r10: c6f50d80 r9 : 00000000 r8 : 00010000 r7 : 00000000 r6 : c7be4000 r5 : 00000000 r4 : c6f56254 r3 : c00c8170 r2 : 00000001 r1 : 00000008 r0 : c6f1e660 Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user Control: 0005397f Table: 06f28000 DAC: 00000015 Process sctp-test (pid: 104, stack limit = 0xc6f521c0) Stack: (0xc6f538e8 to 0xc6f54000) [...] Backtrace: [] (sctp_auth_calculate_hmac+0x0/0x10c) from [] (sctp_packet_transmit+0x33c/0x5c8) [] (sctp_packet_transmit+0x0/0x5c8) from [] (sctp_outq_flush+0x7fc/0x844) [] (sctp_outq_flush+0x0/0x844) from [] (sctp_outq_uncork+0x24/0x28) [] (sctp_outq_uncork+0x0/0x28) from [] (sctp_side_effects+0x1134/0x1220) [] (sctp_side_effects+0x0/0x1220) from [] (sctp_do_sm+0xac/0xd4) [] (sctp_do_sm+0x0/0xd4) from [] (sctp_assoc_bh_rcv+0x118/0x160) [] (sctp_assoc_bh_rcv+0x0/0x160) from [] (sctp_inq_push+0x6c/0x74) [] (sctp_inq_push+0x0/0x74) from [] (sctp_rcv+0x7d8/0x888) While we already had various kind of bugs in that area ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable") and b14878ccb7fa ("net: sctp: cache auth_enable per endpoint"), this one is a bit of a different kind. Giving a bit more background on why SCTP authentication is needed can be found in RFC4895: SCTP uses 32-bit verification tags to protect itself against blind attackers. These values are not changed during the lifetime of an SCTP association. Looking at new SCTP extensions, there is the need to have a method of proving that an SCTP chunk(s) was really sent by the original peer that started the association and not by a malicious attacker. To cause this bug, we're triggering an INIT collision between peers; normal SCTP handshake where both sides intent to authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO parameters that are being negotiated among peers: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- RFC4895 says that each endpoint therefore knows its own random number and the peer's random number *after* the association has been established. The local and peer's random number along with the shared key are then part of the secret used for calculating the HMAC in the AUTH chunk. Now, in our scenario, we have 2 threads with 1 non-blocking SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling sctp_bindx(3), listen(2) and connect(2) against each other, thus the handshake looks similar to this, e.g.: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- <--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------- -------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------> ... Since such collisions can also happen with verification tags, the RFC4895 for AUTH rather vaguely says under section 6.1: In case of INIT collision, the rules governing the handling of this Random Number follow the same pattern as those for the Verification Tag, as explained in Section 5.2.4 of RFC 2960 [5]. Therefore, each endpoint knows its own Random Number and the peer's Random Number after the association has been established. In RFC2960, section 5.2.4, we're eventually hitting Action B: B) In this case, both sides may be attempting to start an association at about the same time but the peer endpoint started its INIT after responding to the local endpoint's INIT. Thus it may have picked a new Verification Tag not being aware of the previous Tag it had sent this endpoint. The endpoint should stay in or enter the ESTABLISHED state but it MUST update its peer's Verification Tag from the State Cookie, stop any init or cookie timers that may running and send a COOKIE ACK. In other words, the handling of the Random parameter is the same as behavior for the Verification Tag as described in Action B of section 5.2.4. Looking at the code, we exactly hit the sctp_sf_do_dupcook_b() case which triggers an SCTP_CMD_UPDATE_ASSOC command to the side effect interpreter, and in fact it properly copies over peer_{random, hmacs, chunks} parameters from the newly created association to update the existing one. Also, the old asoc_shared_key is being released and based on the new params, sctp_auth_asoc_init_active_key() updated. However, the issue observed in this case is that the previous asoc->peer.auth_capable was 0, and has *not* been updated, so that instead of creating a new secret, we're doing an early return from the function sctp_auth_asoc_init_active_key() leaving asoc->asoc_shared_key as NULL. However, we now have to authenticate chunks from the updated chunk list (e.g. COOKIE-ACK). That in fact causes the server side when responding with ... <------------------ AUTH; COOKIE-ACK ----------------- ... to trigger a NULL pointer dereference, since in sctp_packet_transmit(), it discovers that an AUTH chunk is being queued for xmit, and thus it calls sctp_auth_calculate_hmac(). Since the asoc->active_key_id is still inherited from the endpoint, and the same as encoded into the chunk, it uses asoc->asoc_shared_key, which is still NULL, as an asoc_key and dereferences it in ... crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len) ... causing an oops. All this happens because sctp_make_cookie_ack() called with the *new* association has the peer.auth_capable=1 and therefore marks the chunk with auth=1 after checking sctp_auth_send_cid(), but it is *actually* sent later on over the then *updated* association's transport that didn't initialize its shared key due to peer.auth_capable=0. Since control chunks in that case are not sent by the temporary association which are scheduled for deletion, they are issued for xmit via SCTP_CMD_REPLY in the interpreter with the context of the *updated* association. peer.auth_capable was 0 in the updated association (which went from COOKIE_WAIT into ESTABLISHED state), since all previous processing that performed sctp_process_init() was being done on temporary associations, that we eventually throw away each time. The correct fix is to update to the new peer.auth_capable value as well in the collision case via sctp_assoc_update(), so that in case the collision migrated from 0 -> 1, sctp_auth_asoc_init_active_key() can properly recalculate the secret. This therefore fixes the observed server panic. Fixes: 730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing") Reported-by: Jason Gunthorpe Signed-off-by: Daniel Borkmann Tested-by: Jason Gunthorpe Cc: Vlad Yasevich Acked-by: Vlad Yasevich Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit e4a6605a6f2be752a67bd0005b059c57b6bc1a28 Author: Ivan Vecera Date: Tue Jul 29 16:29:30 2014 +0200 bna: fix performance regression [ Upstream commit c36c9d50cc6af5c5bfcc195f21b73f55520c15f9 ] The recent commit "e29aa33 bna: Enable Multi Buffer RX" is causing a performance regression. It does not properly update 'cmpl' pointer at the end of the loop in NAPI handler bnad_cq_process(). The result is only one packet / per NAPI-schedule is processed. Signed-off-by: Ivan Vecera Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit c796399ce155470156af1b4eadb92fd7b48ef370 Author: Christoph Paasch Date: Tue Jul 29 13:40:57 2014 +0200 tcp: Fix integer-overflow in TCP vegas [ Upstream commit 1f74e613ded11517db90b2bd57e9464d9e0fb161 ] In vegas we do a multiplication of the cwnd and the rtt. This may overflow and thus their result is stored in a u64. However, we first need to cast the cwnd so that actually 64-bit arithmetic is done. Then, we need to do do_div to allow this to be used on 32-bit arches. Cc: Stephen Hemminger Cc: Neal Cardwell Cc: Eric Dumazet Cc: David Laight Cc: Doug Leith Fixes: 8d3a564da34e (tcp: tcp_vegas cong avoid fix) Signed-off-by: Christoph Paasch Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 11eea6013cab0a09a8a7ede2f5b097fa77fcc4ba Author: Christoph Paasch Date: Tue Jul 29 12:07:27 2014 +0200 tcp: Fix integer-overflows in TCP veno [ Upstream commit 45a07695bc64b3ab5d6d2215f9677e5b8c05a7d0 ] In veno we do a multiplication of the cwnd and the rtt. This may overflow and thus their result is stored in a u64. However, we first need to cast the cwnd so that actually 64-bit arithmetic is done. A first attempt at fixing 76f1017757aa0 ([TCP]: TCP Veno congestion control) was made by 159131149c2 (tcp: Overflow bug in Vegas), but it failed to add the required cast in tcp_veno_cong_avoid(). Fixes: 76f1017757aa0 ([TCP]: TCP Veno congestion control) Signed-off-by: Christoph Paasch Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit ec7e72757898d577df4667baa7dd1c7781d0fa9b Author: Dmitry Popov Date: Tue Jul 29 03:07:52 2014 +0400 ip_tunnel(ipv4): fix tunnels with "local any remote $remote_ip" [ Upstream commit 95cb5745983c222867cc9ac593aebb2ad67d72c0 ] Ipv4 tunnels created with "local any remote $ip" didn't work properly since 7d442fab0 (ipv4: Cache dst in tunnels). 99% of packets sent via those tunnels had src addr = 0.0.0.0. That was because only dst_entry was cached, although fl4.saddr has to be cached too. Every time ip_tunnel_xmit used cached dst_entry (tunnel_rtable_get returned non-NULL), fl4.saddr was initialized with tnl_params->saddr (= 0 in our case), and wasn't changed until iptunnel_xmit(). This patch adds saddr to ip_tunnel->dst_cache, fixing this issue. Reported-by: Sergey Popov Signed-off-by: Dmitry Popov Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 3c9d360016a04e00ccbbba02a1845eac49567e97 Author: Florian Fainelli Date: Mon Jul 28 16:28:07 2014 -0700 net: phy: re-apply PHY fixups during phy_register_device [ Upstream commit d92f5dec6325079c550889883af51db1b77d5623 ] Commit 87aa9f9c61ad ("net: phy: consolidate PHY reset in phy_init_hw()") moved the call to phy_scan_fixups() in phy_init_hw() after a software reset is performed. By the time phy_init_hw() is called in phy_device_register(), no driver has been bound to this PHY yet, so all the checks in phy_init_hw() against the PHY driver and the PHY driver's config_init function will return 0. We will therefore never call phy_scan_fixups() as we should. Fix this by calling phy_scan_fixups() and check for its return value to restore the intended functionality. This broke PHY drivers which do register an early PHY fixup callback to intercept the PHY probing and do things like changing the 32-bits unique PHY identifier when a pseudo-PHY address has been used, as well as board-specific PHY fixups that need to be applied during driver probe time. Reported-by: Hauke Merthens Reported-by: Jonas Gorski Signed-off-by: Florian Fainelli Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 4ae821221c14ba88e348e42458e425c348521764 Author: Andrey Ryabinin Date: Sat Jul 26 21:26:58 2014 +0400 net: sendmsg: fix NULL pointer dereference [ Upstream commit 40eea803c6b2cfaab092f053248cbeab3f368412 ] Sasha's report: > While fuzzing with trinity inside a KVM tools guest running the latest -next > kernel with the KASAN patchset, I've stumbled on the following spew: > > [ 4448.949424] ================================================================== > [ 4448.951737] AddressSanitizer: user-memory-access on address 0 > [ 4448.952988] Read of size 2 by thread T19638: > [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813 > [ 4448.956823] ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40 > [ 4448.958233] ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d > [ 4448.959552] 0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000 > [ 4448.961266] Call Trace: > [ 4448.963158] dump_stack (lib/dump_stack.c:52) > [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184) > [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352) > [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339) > [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339) > [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555) > [ 4448.970103] sock_sendmsg (net/socket.c:654) > [ 4448.971584] ? might_fault (mm/memory.c:3741) > [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740) > [ 4448.973596] ? verify_iovec (net/core/iovec.c:64) > [ 4448.974522] ___sys_sendmsg (net/socket.c:2096) > [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254) > [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273) > [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1)) > [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188) > [ 4448.980535] __sys_sendmmsg (net/socket.c:2181) > [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600) > [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607) > [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2)) > [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600) > [ 4448.986754] SyS_sendmmsg (net/socket.c:2201) > [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542) > [ 4448.988929] ================================================================== This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0. After this report there was no usual "Unable to handle kernel NULL pointer dereference" and this gave me a clue that address 0 is mapped and contains valid socket address structure in it. This bug was introduced in f3d3342602f8bcbf37d7c46641cb9bca7618eb1c (net: rework recvmsg handler msg_name and msg_namelen logic). Commit message states that: "Set msg->msg_name = NULL if user specified a NULL in msg_name but had a non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't affect sendto as it would bail out earlier while trying to copy-in the address." But in fact this affects sendto when address 0 is mapped and contains socket address structure in it. In such case copy-in address will succeed, verify_iovec() function will successfully exit with msg->msg_namelen > 0 and msg->msg_name == NULL. This patch fixes it by setting msg_namelen to 0 if msg_name == NULL. Cc: Hannes Frederic Sowa Cc: Eric Dumazet Cc: Reported-by: Sasha Levin Signed-off-by: Andrey Ryabinin Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit b733ea82c2a264c244fea37edda2268c44d76074 Author: Eric Dumazet Date: Sat Jul 26 08:58:10 2014 +0200 ip: make IP identifiers less predictable [ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ] In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and Jedidiah describe ways exploiting linux IP identifier generation to infer whether two machines are exchanging packets. With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we changed IP id generation, but this does not really prevent this side-channel technique. This patch adds a random amount of perturbation so that IP identifiers for a given destination [1] are no longer monotonically increasing after an idle period. Note that prandom_u32_max(1) returns 0, so if generator is used at most once per jiffy, this patch inserts no hole in the ID suite and do not increase collision probability. This is jiffies based, so in the worst case (HZ=1000), the id can rollover after ~65 seconds of idle time, which should be fine. We also change the hash used in __ip_select_ident() to not only hash on daddr, but also saddr and protocol, so that ICMP probes can not be used to infer information for other protocols. For IPv6, adds saddr into the hash as well, but not nexthdr. If I ping the patched target, we can see ID are now hard to predict. 21:57:11.008086 IP (...) A > target: ICMP echo request, seq 1, length 64 21:57:11.010752 IP (... id 2081 ...) target > A: ICMP echo reply, seq 1, length 64 21:57:12.013133 IP (...) A > target: ICMP echo request, seq 2, length 64 21:57:12.015737 IP (... id 3039 ...) target > A: ICMP echo reply, seq 2, length 64 21:57:13.016580 IP (...) A > target: ICMP echo request, seq 3, length 64 21:57:13.019251 IP (... id 3437 ...) target > A: ICMP echo reply, seq 3, length 64 [1] TCP sessions uses a per flow ID generator not changed by this patch. Signed-off-by: Eric Dumazet Reported-by: Jeffrey Knockel Reported-by: Jedidiah R. Crandall Cc: Willy Tarreau Cc: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 265459c3d15b39826d09b3380eef7572fb7649a2 Author: Eric Dumazet Date: Mon Jun 2 05:26:03 2014 -0700 inetpeer: get rid of ip_id_count [ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ] Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 910c396e1c58e0f6f6c817edf9292f245779ec65 Author: Dmitry Kravkov Date: Thu Jul 24 18:54:47 2014 +0300 bnx2x: fix crash during TSO tunneling [ Upstream commit fe26566d8a05151ba1dce75081f6270f73ec4ae1 ] When TSO packet is transmitted additional BD w/o mapping is used to describe the packed. The BD needs special handling in tx completion. kernel: Call Trace: kernel: [] dump_stack+0x19/0x1b kernel: [] warn_slowpath_common+0x61/0x80 kernel: [] warn_slowpath_fmt+0x5c/0x80 kernel: [] ? find_iova+0x4d/0x90 kernel: [] intel_unmap_page.part.36+0x142/0x160 kernel: [] intel_unmap_page+0x26/0x30 kernel: [] bnx2x_free_tx_pkt+0x157/0x2b0 [bnx2x] kernel: [] bnx2x_tx_int+0xac/0x220 [bnx2x] kernel: [] ? read_tsc+0x9/0x20 kernel: [] bnx2x_poll+0xbb/0x3c0 [bnx2x] kernel: [] net_rx_action+0x15a/0x250 kernel: [] __do_softirq+0xf7/0x290 kernel: [] call_softirq+0x1c/0x30 kernel: [] do_softirq+0x55/0x90 kernel: [] irq_exit+0x115/0x120 kernel: [] do_IRQ+0x58/0xf0 kernel: [] common_interrupt+0x6d/0x6d kernel: [] ? clockevents_notify+0x127/0x140 kernel: [] ? cpuidle_enter_state+0x4f/0xc0 kernel: [] cpuidle_idle_call+0xc5/0x200 kernel: [] arch_cpu_idle+0xe/0x30 kernel: [] cpu_startup_entry+0xf5/0x290 kernel: [] start_secondary+0x265/0x27b kernel: ---[ end trace 11aa7726f18d7e80 ]--- Fixes: a848ade408b ("bnx2x: add CSUM and TSO support for encapsulation protocols") Reported-by: Yulong Pei Cc: Michal Schmidt Signed-off-by: Dmitry Kravkov Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 2a36368133b394784d89aa4806dda4478bed38e9 Author: Tobias Brunner Date: Thu Jun 26 15:12:45 2014 +0200 xfrm: Fix installation of AH IPsec SAs [ Upstream commit a0e5ef53aac8e5049f9344857d8ec5237d31e58b ] The SPI check introduced in ea9884b3acf3311c8a11db67bfab21773f6f82ba was intended for IPComp SAs but actually prevented AH SAs from getting installed (depending on the SPI). Fixes: ea9884b3acf3 ("xfrm: check user specified spi for IPComp") Cc: Fan Du Signed-off-by: Tobias Brunner Signed-off-by: Steffen Klassert Signed-off-by: Greg Kroah-Hartman