path: root/net
Age | Commit message | Author | Files | Lines
2015-03-22 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf | David S. Miller | 9 | -13/+51
Pablo Neira Ayuso says:
====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree, they are:
1) Fix missing initialization of tuple structure in nfnetlink_cthelper to avoid mismatches when looking up to attach userspace helpers to flows, from Ian Wilson.
2) Fix potential crash in nft_hash when we hit -EAGAIN in nft_hash_walk(), from Herbert Xu.
3) We don't need to indicate the hook information to update the basechain default policy in nf_tables.
4) Restore tracing over nfnetlink_log due to recent rework to accommodate logging infrastructure into nf_tables.
5) Fix wrong IP6T_INV_PROTO check in xt_TPROXY.
6) Set IP6T_F_PROTO flag in nft_compat so we can use SYNPROXY6 and REJECT6 from xt over nftables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-22 | netfilter: nft_compat: set IP6T_F_PROTO flag if protocol is set | Pablo Neira Ayuso | 1 | -0/+6
ip6tables extensions check for this flag to restrict match/target to a given protocol. Without this flag set, SYNPROXY6 returns an error. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Patrick McHardy <kaber@trash.net>
2015-03-20 | net: validate the range we feed to iov_iter_init() in sys_sendto/sys_recvfrom | Al Viro | 1 | -0/+4
Cc: stable@vger.kernel.org # v3.19 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 | net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour | Catalin Marinas | 1 | -0/+7
Commit db31c55a6fb2 (net: clamp ->msg_namelen instead of returning an error) introduced the clamping of msg_namelen when the unsigned value was larger than sizeof(struct sockaddr_storage). This caused a msg_namelen of -1 to be valid. The native code was subsequently fixed by commit dbb490b96584 (net: socket: error on a negative msg_namelen). In addition, the native code sets msg_namelen to 0 when msg_name is NULL. This was done in commit (6a2a2b3ae075 net:socket: set msg_namelen to 0 if msg_name is passed as NULL in msghdr struct from userland) and subsequently updated by 08adb7dabd48 (fold verify_iovec() into copy_msghdr_from_user()). This patch brings the get_compat_msghdr() in line with copy_msghdr_from_user(). Fixes: db31c55a6fb2 (net: clamp ->msg_namelen instead of returning an error) Cc: David S. Miller <davem@davemloft.net> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
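The three rules this message describes (NULL name forces a zero length, a negative length is an error, an oversized length is clamped) can be sketched in userspace C. This is only an illustration under stated assumptions: normalize_msg_name() is a hypothetical helper, not the kernel's copy_msghdr_from_user() or get_compat_msghdr().

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not a kernel function): normalises a user-supplied
 * msg_name/msg_namelen pair following the rules listed in the commit message.
 * Returns 0 on success or -EINVAL for a negative length.
 */
int normalize_msg_name(const void *msg_name, int *msg_namelen)
{
    /* No address buffer supplied: the length is forced to 0. */
    if (msg_name == NULL)
        *msg_namelen = 0;

    /* A negative length is an error, not a huge unsigned value. */
    if (*msg_namelen < 0)
        return -EINVAL;

    /* Oversized lengths are clamped instead of rejected. */
    if ((size_t)*msg_namelen > sizeof(struct sockaddr_storage))
        *msg_namelen = (int)sizeof(struct sockaddr_storage);

    return 0;
}
```

Checking the sign before clamping matters: once the value is compared as unsigned, -1 looks like a very large length and would be clamped to a valid one.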
2015-03-20 | tcp: fix tcp fin memory accounting | Josh Hunt | 1 | -5/+1
tcp_send_fin() does not account for the memory it allocates properly, so sk_forward_alloc can be negative in cases where we've sent a FIN: ss example output (ss -amn | grep -B1 f4294): tcp FIN-WAIT-1 0 1 192.168.0.1:45520 192.0.2.1:8080 skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0) Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 | ipv6: fix backtracking for throw routes | Steven Barth | 1 | -0/+1
For throw routes to trigger evaluation of other policy rules, EAGAIN needs to be propagated up to fib_rules_lookup, similar to how it's done for IPv4.
A simple testcase for verification is:
ip -6 rule add lookup 33333 priority 33333
ip -6 route add throw 2001:db8::1
ip -6 route add 2001:db8::1 via fe80::1 dev wlan0 table 33333
ip route get 2001:db8::1
Signed-off-by: Steven Barth <cyrus@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 | ipv6: call ipv6_proxy_select_ident instead of ipv6_select_ident in udp6_ufo_fragment | Sabrina Dubroca | 1 | -5/+3
Matt Grant reported frequent crashes in ipv6_select_ident when udp6_ufo_fragment is called from openvswitch on a skb that doesn't have a dst_entry set. ipv6_proxy_select_ident generates the frag_id without using the dst associated with the skb. This approach was suggested by Vladislav Yasevich. Fixes: 0508c07f5e0c ("ipv6: Select fragment id during UFO segmentation if not set.") Cc: Vladislav Yasevich <vyasevic@redhat.com> Reported-by: Matt Grant <matt@mattgrant.net.nz> Tested-by: Matt Grant <matt@mattgrant.net.nz> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 | netfilter: xt_TPROXY: fix invflags check in tproxy_tg6_check() | Pablo Neira Ayuso | 1 | -2/+2
We have to check for IP6T_INV_PROTO in invflags, instead of flags. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Balazs Scheidler <bazsi@balabit.hu>
2015-03-19 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | Linus Torvalds | 26 | -86/+189
Pull networking fixes from David Miller:
1) Fix packet header offset calculation in _decode_session6(), from Hajime Tazaki.
2) Fix route leak in error paths of xfrm_lookup(), from Huaibin Wang.
3) Be sure to clear state properly when scans fail in iwlwifi mvm code, from Luciano Coelho.
4) iwlwifi tries to stop scans that aren't actually running, also from Luciano Coelho.
5) mac80211 should drop mesh frames that are not encrypted, fix from Bob Copeland.
6) Add new device ID to b43 wireless driver for BCM432228 chips, from Rafał Miłecki.
7) Fix accidental addition of members after variable sized array in struct tc_u_hnode, from WANG Cong.
8) Don't re-enable interrupts until after we call napi_complete() in ibmveth and WIZnet drivers, from Yongbae Park.
9) Fix regression in vlan tag handling of fec driver, from Fugang Duan.
10) If a network namespace change fails during rtnl_newlink(), we don't unwind the device registry properly.
11) Fix two TCP regressions, from Neal Cardwell:
- Don't allow snd_cwnd_cnt to accumulate huge values due to missing test in tcp_cong_avoid_ai().
- Restore CUBIC back to advancing cwnd by 1.5x packets per RTT.
12) Fix performance regression in xen-netback involving push TX notifications, from David Vrabel.
13) __skb_tstamp_tx() can be called with a NULL sk pointer, do not dereference blindly. From Willem de Bruijn.
14) Fix potential stack overflow in RDS protocol stack, from Arnd Bergmann.
15) VXLAN_VID_MASK used incorrectly in new remote checksum offload support of VXLAN driver. Fix from Alexey Kodanev.
16) Fix too small netlink SKB allocation in inet_diag layer, from Eric Dumazet.
17) ieee80211_check_combinations() does not count interfaces correctly, from Andrei Otcheretianski.
18) Hardware feature determination in bnx2x driver references a piece of software state that actually isn't initialized yet, fix from Michal Schmidt.
19) inet_csk_wait_for_connect() needs a sched_annotate_sleep() annotation, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
Revert "net: cx82310_eth: use common match macro"
net/mlx4_en: Set statistics bitmap at port init
IB/mlx4: Saturate RoCE port PMA counters in case of overflow
net/mlx4_en: Fix off-by-one in ethtool statistics display
IB/mlx4: Verify net device validity on port change event
act_bpf: allow non-default TC_ACT opcodes as BPF exec outcome
Revert "smc91x: retrieve IRQ and trigger flags in a modern way"
inet: Clean up inet_csk_wait_for_connect() vs. might_sleep()
ip6_tunnel: fix error code when tunnel exists
netdevice.h: fix ndo_bridge_* comments
bnx2x: fix encapsulation features on 57710/57711
mac80211: ignore CSA to same channel
nl80211: ignore HT/VHT capabilities without QoS/WMM
mac80211: ask for ECSA IE to be considered for beacon parse CRC
mac80211: count interfaces correctly for combination checks
isdn: icn: use strlcpy() when parsing setup options
rxrpc: bogus MSG_PEEK test in rxrpc_recvmsg()
caif: fix MSG_OOB test in caif_seqpkt_recvmsg()
bridge: reset bridge mtu after deleting an interface
can: kvaser_usb: Fix tx queue start/stop race conditions
...
2015-03-19 | netfilter: restore rule tracing via nfnetlink_log | Pablo Neira Ayuso | 4 | -10/+34
Since fab4085 ("netfilter: log: nf_log_packet() as real unified interface"), the loginfo structure that is passed to nf_log_packet() is used to explicitly indicate the logger type you want to use. This is a problem for people tracing rules through nfnetlink_log since packets are always routed to the NF_LOG_TYPE logger after the aforementioned patch. We can fix this by removing the trace loginfo structures, but that still changes the log level from 4 to 5 for tracing messages and there may be someone relying on this out there. So let's just introduce a new nf_log_trace() function that restores the former behaviour. Reported-by: Markus Kötter <koetter@rrzn.uni-hannover.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-17 | act_bpf: allow non-default TC_ACT opcodes as BPF exec outcome | Daniel Borkmann | 1 | -8/+28
Revisiting commit d23b8ad8ab23 ("tc: add BPF based action") with regards to eBPF support, I was thinking that it might be better to improve return semantics from a BPF program invoked through BPF_PROG_RUN(). Currently, in case filter_res is 0, we overwrite the default action opcode with TC_ACT_SHOT. A default action opcode configured through tc's m_bpf can be: TC_ACT_RECLASSIFY, TC_ACT_PIPE, TC_ACT_SHOT, TC_ACT_UNSPEC, TC_ACT_OK. In cls_bpf, we have the possibility to overwrite the default class associated with the classifier in case filter_res is _not_ 0xffffffff (-1). That allows us to fold multiple [e]BPF programs into a single one, where they would otherwise need to be defined as a separate classifier with its own classid, needlessly redoing parsing work, etc. Similarly, we could do better in act_bpf: Since above TC_ACT* opcodes are exported to UAPI anyway, we reuse them for return-code-to-tc-opcode mapping, where we would allow above possibilities. Thus, like in cls_bpf, a filter_res of 0xffffffff (-1) means that the configured _default_ action is used. Any unknown return code from the BPF program would fail in tcf_bpf() with TC_ACT_UNSPEC. Should we one day want to make use of TC_ACT_STOLEN or TC_ACT_QUEUED, which both have the same semantics, we have the option to either use that as a default action (filter_res of 0xffffffff) or non-default BPF return code. All that will allow us to transparently use tcf_bpf() for both BPF flavours. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
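The return-code mapping described above can be sketched as a small standalone function. This is a userspace illustration, not the tcf_bpf() code: bpf_res_to_action() is a hypothetical name, and the enum values are local copies assumed to mirror the TC_ACT_* opcodes in the UAPI header linux/pkt_cls.h.

```c
/* Local copies of the opcodes; assumed to match linux/pkt_cls.h. */
enum {
    TC_ACT_UNSPEC     = -1,
    TC_ACT_OK         = 0,
    TC_ACT_RECLASSIFY = 1,
    TC_ACT_SHOT       = 2,
    TC_ACT_PIPE       = 3,
};

/* Hypothetical helper: map a BPF program's return value to a tc action. */
int bpf_res_to_action(int filter_res, int default_action)
{
    switch (filter_res) {
    case TC_ACT_OK:
    case TC_ACT_RECLASSIFY:
    case TC_ACT_SHOT:
    case TC_ACT_PIPE:
        return filter_res;      /* the program chose the opcode itself */
    case TC_ACT_UNSPEC:         /* 0xffffffff (-1): use the configured default */
        return default_action;
    default:
        return TC_ACT_UNSPEC;   /* unknown return code */
    }
}
```

For example, a program returning -1 with a configured default of TC_ACT_PIPE yields TC_ACT_PIPE, while an unrecognised value such as 42 yields TC_ACT_UNSPEC.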
2015-03-17 | inet: Clean up inet_csk_wait_for_connect() vs. might_sleep() | Eric Dumazet | 1 | -0/+1
I got the following trace with current net-next kernel :
[14723.885290] WARNING: CPU: 26 PID: 22658 at kernel/sched/core.c:7285 __might_sleep+0x89/0xa0()
[14723.885325] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810e8734>] prepare_to_wait_exclusive+0x34/0xa0
[14723.885355] CPU: 26 PID: 22658 Comm: netserver Not tainted 4.0.0-dbg-DEV #1379
[14723.885359] ffffffff81a223a8 ffff881fae9e7ca8 ffffffff81650b5d 0000000000000001
[14723.885364] ffff881fae9e7cf8 ffff881fae9e7ce8 ffffffff810a72e7 0000000000000000
[14723.885367] ffffffff81a57620 000000000000093a 0000000000000000 ffff881fae9e7e64
[14723.885371] Call Trace:
[14723.885377] [<ffffffff81650b5d>] dump_stack+0x4c/0x65
[14723.885382] [<ffffffff810a72e7>] warn_slowpath_common+0x97/0xe0
[14723.885386] [<ffffffff810a73e6>] warn_slowpath_fmt+0x46/0x50
[14723.885390] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[14723.885393] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0
[14723.885396] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0
[14723.885399] [<ffffffff810ccdc9>] __might_sleep+0x89/0xa0
[14723.885403] [<ffffffff81581846>] lock_sock_nested+0x36/0xb0
[14723.885406] [<ffffffff815829a3>] ? release_sock+0x173/0x1c0
[14723.885411] [<ffffffff815ea1f7>] inet_csk_accept+0x157/0x2a0
[14723.885415] [<ffffffff810e8900>] ? abort_exclusive_wait+0xc0/0xc0
[14723.885419] [<ffffffff8161b96d>] inet_accept+0x2d/0x150
[14723.885424] [<ffffffff8157db6f>] SYSC_accept4+0xff/0x210
[14723.885428] [<ffffffff8165a451>] ? retint_swapgs+0xe/0x44
[14723.885431] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[14723.885437] [<ffffffff81369c0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[14723.885441] [<ffffffff8157ef40>] SyS_accept+0x10/0x20
[14723.885444] [<ffffffff81659872>] system_call_fastpath+0x12/0x17
[14723.885447] ---[ end trace ff74cd83355b1873 ]---
In commit 26cabd31259ba43f68026ce3f62b78094124333f Peter added a sched_annotate_sleep() in sk_wait_event() Is the following patch needed as well ? Alternative would be to use sk_wait_event() from inet_csk_wait_for_connect()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 | ip6_tunnel: fix error code when tunnel exists | Nicolas Dichtel | 1 | -16/+17
After commit 2b0bb01b6edb, the kernel returns -ENOBUFS when user tries to add an existing tunnel with ioctl API:
$ ip -6 tunnel add ip6tnl1 mode ip6ip6 dev eth1
add tunnel "ip6tnl0" failed: No buffer space available
It's confusing, the right error is EEXIST. This patch also changes the returned codes slightly:
- ENOBUFS -> ENOMEM
- ENOENT -> ENODEV
Fixes: 2b0bb01b6edb ("ip6_tunnel: Return an error when adding an existing tunnel.")
CC: Steffen Klassert <steffen.klassert@secunet.com>
Reported-by: Pierre Cheynier <me@pierre-cheynier.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 | Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux | Linus Torvalds | 1 | -4/+20
Pull virtio fixes from Rusty Russell: "Not entirely surprising: the ongoing QEMU work on virtio 1.0 has revealed more minor issues with our virtio 1.0 drivers just introduced in the kernel. (I would normally use my fixes branch for this, but there were a batch of them...)"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
virtio_mmio: fix access width for mmio
uapi/virtio_scsi: allow overriding CDB/SENSE size
virtio_mmio: generation support
virtio_rpmsg: set DRIVER_OK before using device
9p/trans_virtio: fix hot-unplug
virtio-balloon: do not call blocking ops when !TASK_RUNNING
virtio_blk: fix comment for virtio 1.0
virtio_blk: typo fix
virtio_balloon: set DRIVER_OK before using device
virtio_console: avoid config access from irq
virtio_console: init work unconditionally
2015-03-17 | netfilter: nf_tables: allow to change chain policy without hook if it exists | Pablo Neira Ayuso | 1 | -1/+4
If there's an existing base chain, we have to allow changing the default policy without indicating the hook information. However, if the chain doesn't exist, we have to enforce the presence of the hook attribute. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-16 | Merge tag 'mac80211-for-davem-2015-03-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 | David S. Miller | 5 | -8/+47
Johannes Berg says:
====================
Here are a few fixes that I'd like to still get in:
* disable U-APSD for better interoperability, from Michal Kazior
* drop unencrypted frames in mesh forwarding, from Bob Copeland
* treat non-QoS/WMM HT stations as non-HT, to fix confusion when they connect and then get QoS packets anyway due to HT
* fix counting interfaces for combination checks, otherwise the interface combinations aren't properly enforced (from Andrei)
* fix pure ECSA by reacting to the IE change
* ignore erroneous (E)CSA to the current channel which sometimes happens due to AP/GO bugs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec | David S. Miller | 4 | -8/+9
Steffen Klassert says:
====================
pull request (net): ipsec 2015-03-16
1) Fix the network header offset in _decode_session6 when multiple IPv6 extension headers are present. From Hajime Tazaki.
2) Fix an interfamily tunnel crash. We set outer mode protocol too early and may dispatch to the wrong address family. Move the setting of the outer mode protocol behind the last accessing of the inner mode to fix the crash.
3) Most callers of xfrm_lookup() expect that dst_orig is released on error. But xfrm_lookup_route() may need dst_orig to handle certain error cases. So introduce a flag that tells what should be done in case of error. From Huaibin Wang.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 | mac80211: ignore CSA to same channel | Johannes Berg | 2 | -0/+14
If the AP is confused and starts doing a CSA to the same channel, just ignore that request instead of trying to act it out since it was likely sent in error anyway. In the case of the bug I was investigating the GO was misbehaving and sending out a beacon with CSA IEs still included after having actually done the channel switch. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-16 | nl80211: ignore HT/VHT capabilities without QoS/WMM | Johannes Berg | 1 | -0/+10
As HT/VHT depend heavily on QoS/WMM, it's not a good idea to let userspace add clients that have HT/VHT but not QoS/WMM. Since it does so in certain cases we've observed (client is using HT IEs but not QoS/WMM) just ignore the HT/VHT info at this point and don't pass it down to the drivers which might unconditionally use it. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-16 | mac80211: ask for ECSA IE to be considered for beacon parse CRC | Johannes Berg | 1 | -1/+2
When a beacon from the AP contains only the ECSA IE, and not a CSA IE as well, this ECSA IE is not considered for calculating the CRC and the beacon might be dropped as not being interesting. This is clearly wrong, it should be handled and the channel switch should be executed. Fix this by including the ECSA IE ID in the bitmap of interesting IEs. Reported-by: Gil Tribush <gil.tribush@intel.com> Reviewed-by: Luciano Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-16 | mac80211: count interfaces correctly for combination checks | Andrei Otcheretianski | 1 | -1/+1
Since moving the interface combination checks to mac80211, it's broken because it now only considers interfaces with an assigned channel context, so for example any interface that isn't active can still be up, which is clearly an issue; also, in particular P2P-Device wdevs are an issue since they never have a chanctx. Fix this by counting running interfaces instead of the ones with a channel context assigned. Cc: stable@vger.kernel.org [3.16+] Fixes: 73de86a38962b ("cfg80211/mac80211: move interface counting for combination check to mac80211") Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> [rewrite commit message, dig out the commit it fixes] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-15 | rxrpc: bogus MSG_PEEK test in rxrpc_recvmsg() | Al Viro | 1 | -1/+1
[I would really like an ACK on that one from dhowells; it appears to be quite straightforward, but...] MSG_PEEK isn't passed to ->recvmsg() via msg->msg_flags; as a matter of fact, neither the kernel users of rxrpc, nor the syscalls ever set that bit in there. It gets passed via flags; in fact, another such check in the same function is done correctly - as flags & MSG_PEEK. It had been that way (effectively disabled) for 8 years, though, so the patch needs beating up - that case had never been tested. If it is correct, it's -stable fodder. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-15 | caif: fix MSG_OOB test in caif_seqpkt_recvmsg() | Al Viro | 1 | -1/+1
It should be checking flags, not msg->msg_flags. It's ->sendmsg() instances that need to look for that in ->msg_flags, ->recvmsg() ones (including the other ->recvmsg() instance in that file, as well as unix_dgram_recvmsg() this one claims to be imitating) check in flags. Braino had been introduced in commit dcda13 ("caif: Bugfix - use MSG_TRUNC in receive") back in 2010, so it goes quite a while back. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
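The distinction drawn here, between the flags argument and msg->msg_flags on the receive path, can be illustrated with a userspace sketch. demo_recvmsg() is a hypothetical stand-in for a ->recvmsg() method, and the -EOPNOTSUPP return value is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical receive handler, loosely modelled on a ->recvmsg() method. */
int demo_recvmsg(struct msghdr *msg, int flags)
{
    /* On receive, msg->msg_flags is not where the caller's request lives. */
    (void)msg;

    /* The caller's request flags arrive in the separate 'flags' argument. */
    if (flags & MSG_OOB)
        return -EOPNOTSUPP;

    return 0;
}
```

A test written against msg->msg_flags would never fire for a caller that passes MSG_OOB, which matches the kind of dead check both this commit and the rxrpc one describe.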
2015-03-14 | bridge: reset bridge mtu after deleting an interface | Venkat Venkatsubra | 1 | -0/+2
On adding an interface br_add_if() sets the MTU to the min of all the interfaces. Do the same thing on removing an interface too in br_del_if. Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
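The "min of all the interfaces" rule can be sketched as a standalone function that is rerun whenever a port is added or removed. demo_min_mtu() and DEMO_DEFAULT_MTU are hypothetical names, and the 1500-byte fallback for a bridge with no ports is an assumption made for this example.

```c
#include <stddef.h>

/* Assumption: value reported when the bridge has no ports left. */
#define DEMO_DEFAULT_MTU 1500u

/* Hypothetical helper: smallest MTU among the current ports. */
unsigned int demo_min_mtu(const unsigned int *port_mtus, size_t n)
{
    unsigned int mtu = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (mtu == 0 || port_mtus[i] < mtu)
            mtu = port_mtus[i];
    }
    return mtu ? mtu : DEMO_DEFAULT_MTU;
}
```

With ports at 1500, 9000 and 1400 the result is 1400; once the 1400 port is deleted, recomputing over the remaining two gives 1500, which is the update the commit adds on removal.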
2015-03-13 | inet_diag: fix possible overflow in inet_diag_dump_one_icsk() | Eric Dumazet | 1 | -3/+15
inet_diag_dump_one_icsk() allocates too small an skb. Add inet_sk_attr_size() helper right before inet_sk_diag_fill() so that it can be updated if/when new attributes are added. iproute2/ss currently does not use this dump_one() interface, which might explain why nobody noticed this problem yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 | netfilter: Fix potential crash in nft_hash walker | Herbert Xu | 1 | -0/+2
When we get back an EAGAIN from rhashtable_walk_next we were treating it as a valid object which obviously doesn't work too well. Luckily this is hard to trigger so it seems nobody has run into it yet. This patch fixes it by redoing the next call when we get an EAGAIN. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
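The pattern of redoing the next call on EAGAIN, rather than treating the error pointer as an object, can be sketched with a mock iterator. Everything below is a userspace stand-in: the demo_* helpers imitate the kernel's error-pointer convention and are not the rhashtable API.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for kernel-style error pointers (hypothetical names). */
#define DEMO_MAX_ERRNO 4095
void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
int demo_is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO; }
long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }

/* Mock iterator over a fixed array that may contain error pointers. */
struct demo_iter { void **items; size_t n, pos; };

void *demo_walk_next(struct demo_iter *it)
{
    if (it->pos >= it->n)
        return NULL;            /* end of walk */
    return it->items[it->pos++];
}

/* Count real objects; -EAGAIN means "call next again", not "an object". */
long demo_walk_count(struct demo_iter *it)
{
    long count = 0;
    void *obj;

    while ((obj = demo_walk_next(it)) != NULL) {
        if (demo_is_err(obj)) {
            if (demo_ptr_err(obj) == -EAGAIN)
                continue;       /* redo the next call */
            return demo_ptr_err(obj);
        }
        count++;
    }
    return count;
}
```

Without the demo_is_err() test, the loop would count (and in real code dereference) the -EAGAIN marker as if it were an element.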
2015-03-13 | 9p/trans_virtio: fix hot-unplug | Michael S. Tsirkin | 1 | -4/+20
On device hot-unplug, 9p/virtio currently will kfree channel while it might still be in use. Of course, it might stay used forever, so it's an extremely ugly hack, but it seems better than use-after-free that we have now. [ Unused variable removed, whitespace cleanup, msg single-lined --RR ] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-03-12 | netfilter: Zero the tuple in nfnl_cthelper_parse_tuple() | Ian Wilson | 1 | -0/+3
nfnl_cthelper_parse_tuple() is called from nfnl_cthelper_new(), nfnl_cthelper_get() and nfnl_cthelper_del(). In each case they pass a pointer to an nf_conntrack_tuple data structure local variable: struct nf_conntrack_tuple tuple; ... ret = nfnl_cthelper_parse_tuple(&tuple, tb[NFCTH_TUPLE]); The problem is that this local variable is not initialized, and nfnl_cthelper_parse_tuple() only initializes two fields: src.l3num and dst.protonum. This leaves all other fields with undefined values based on whatever is on the stack: tuple->src.l3num = ntohs(nla_get_be16(tb[NFCTH_TUPLE_L3PROTONUM])); tuple->dst.protonum = nla_get_u8(tb[NFCTH_TUPLE_L4PROTONUM]); The symptom observed was that when the rpc and tns helpers were added then traffic to port 1536 was being sent to user-space. Signed-off-by: Ian Wilson <iwilson@brocade.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
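The fix named in the title, zeroing the tuple before filling in the two parsed fields, can be sketched as follows. struct demo_tuple is a deliberately simplified, hypothetical layout and not the real nf_conntrack_tuple.

```c
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical tuple: only l3num and protonum are ever parsed. */
struct demo_tuple {
    uint16_t l3num;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  protonum;
};

void demo_parse_tuple(struct demo_tuple *t, uint16_t l3num, uint8_t protonum)
{
    /* Clear every field first so unparsed members never hold stack garbage. */
    memset(t, 0, sizeof(*t));

    t->l3num = l3num;
    t->protonum = protonum;
}
```

Because callers pass an uninitialised local variable, doing the memset() inside the parser covers all three call sites at once.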
2015-03-12 | rds: avoid potential stack overflow | Arnd Bergmann | 1 | -18/+22
The rds_iw_update_cm_id function stores a large 'struct rds_sock' object on the stack in order to pass a pair of addresses. This happens to just fit within the 1024 byte stack size warning limit on x86, but just exceeds that limit on ARM, which gives us this warning: net/rds/iw_rdma.c:200:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=] As the use of this large variable is basically bogus, we can rearrange the code to not do that. Instead of passing an rds socket into rds_iw_get_device, we now just pass the two addresses that we have available in rds_iw_update_cm_id, and we change rds_iw_get_mr accordingly, to create two address structures on the stack there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 | sock: fix possible NULL sk dereference in __skb_tstamp_tx | Willem de Bruijn | 1 | -2/+6
Test that sk != NULL before reading sk->sk_tsflags. Fixes: 49ca0d8bfaf3 ("net-timestamp: no-payload option") Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | xps: must clear sender_cpu before forwarding | Eric Dumazet | 3 | -1/+3
John reported that my previous commit added a regression on his router. This is because sender_cpu & napi_id share a common location, so get_xps_queue() can see garbage and perform an out of bound access. We need to make sure sender_cpu is cleared before doing the transmit, otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger this bug. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: John <jw@nuclearfallout.net> Bisected-by: John <jw@nuclearfallout.net> Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | net: sysctl_net_core: check SNDBUF and RCVBUF for min length | Alexey Kodanev | 1 | -4/+6
sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be set to incorrect values. Given that 'struct sk_buff' allocates from rcvbuf, incorrectly set buffer length could result in memory allocation failures. For example, set them as follows:
# sysctl net.core.rmem_default=64
net.core.wmem_default = 64
# sysctl net.core.wmem_default=64
net.core.wmem_default = 64
# ping localhost -s 1024 -i 0 > /dev/null
This could result in the following failure:
skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32 head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:<NULL>
kernel BUG at net/core/skbuff.c:102!
invalid opcode: 0000 [#1] SMP
...
task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000
RIP: 0010:[<ffffffff8155fbd1>] [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
RSP: 0018:ffff88003ae8bc68 EFLAGS: 00010296
RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000
RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8
RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900
FS: 00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0
Stack:
ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d
ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550
ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00
Call Trace:
[<ffffffff81628db4>] unix_stream_sendmsg+0x2c4/0x470
[<ffffffff81556f56>] sock_write_iter+0x146/0x160
[<ffffffff811d9612>] new_sync_write+0x92/0xd0
[<ffffffff811d9cd6>] vfs_write+0xd6/0x180
[<ffffffff811da499>] SyS_write+0x59/0xd0
[<ffffffff81651532>] system_call_fastpath+0x12/0x17
Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 c0 e8 4f a8 0e 00 <0f> 0b eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
RIP [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
RSP <ffff88003ae8bc68>
Kernel panic - not syncing: Fatal exception
Moreover, the possible minimum is 1, so we can get another kernel panic:
...
BUG: unable to handle kernel paging request at ffff88013caee5c0
IP: [<ffffffff815604cf>] __alloc_skb+0x12f/0x1f0
...
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | tcp: restore 1.5x per RTT limit to CUBIC cwnd growth in congestion avoidance | Neal Cardwell | 1 | -2/+4
Commit 814d488c6126 ("tcp: fix the timid additive increase on stretch ACKs") fixed a bug where tcp_cong_avoid_ai() would either credit a connection with an increase of snd_cwnd_cnt, or increase snd_cwnd, but not both, resulting in cwnd increasing by 1 packet on at most every alternate invocation of tcp_cong_avoid_ai(). Although the commit correctly implemented the CUBIC algorithm, which can increase cwnd by as much as 1 packet per 1 packet ACKed (2x per RTT), in practice that could be too aggressive: in tests on network paths with small buffers, YouTube server retransmission rates nearly doubled. This commit restores CUBIC to a maximum cwnd growth rate of 1 packet per 2 packets ACKed (1.5x per RTT). In YouTube tests this restored retransmit rates to low levels. Testing: This patch has been tested in datacenter netperf transfers and live youtube.com and google.com servers. Fixes: 9cd981dcf174 ("tcp: fix stretch ACK bugs in CUBIC") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 | tcp: fix tcp_cong_avoid_ai() credit accumulation bug with decreases in w | Neal Cardwell | 1 | -0/+6
The recent change to tcp_cong_avoid_ai() to handle stretch ACKs introduced a bug where snd_cwnd_cnt could accumulate a very large value while w was large, and then if w was reduced snd_cwnd could be incremented by a large delta, leading to a large burst and high packet loss. This was tickled when CUBIC's bictcp_update() sets "ca->cnt = 100 * cwnd". This bug crept in while preparing the upstream version of 814d488c6126. Testing: This patch has been tested in datacenter netperf transfers and live youtube.com and google.com servers. Fixes: 814d488c6126 ("tcp: fix the timid additive increase on stretch ACKs") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
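The accumulation problem and its guard can be sketched in userspace. This is an illustration of the behaviour described in the message, under stated assumptions: struct demo_tp and demo_cong_avoid_ai() are hypothetical, the field names are borrowed from the commit text, and w is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical subset of the per-connection state named in the message. */
struct demo_tp {
    uint32_t snd_cwnd;        /* congestion window, in packets */
    uint32_t snd_cwnd_cnt;    /* ACK credits accumulated so far */
    uint32_t snd_cwnd_clamp;  /* upper bound on snd_cwnd */
};

/* Additive increase: grow cwnd by one packet per w packets ACKed. */
void demo_cong_avoid_ai(struct demo_tp *tp, uint32_t w, uint32_t acked)
{
    /* Credits piled up under a larger w: spend them as one packet only. */
    if (tp->snd_cwnd_cnt >= w) {
        tp->snd_cwnd_cnt = 0;
        tp->snd_cwnd++;
    }

    tp->snd_cwnd_cnt += acked;
    if (tp->snd_cwnd_cnt >= w) {
        uint32_t delta = tp->snd_cwnd_cnt / w;

        tp->snd_cwnd_cnt -= delta * w;
        tp->snd_cwnd += delta;
    }

    if (tp->snd_cwnd > tp->snd_cwnd_clamp)
        tp->snd_cwnd = tp->snd_cwnd_clamp;
}
```

With snd_cwnd = 10 and 1000 stale credits, a call with w = 10 and one ACKed packet raises cwnd to 11. Without the first block the same call would compute delta = 1001 / 10 = 100 and jump cwnd to 110, which is the burst the message warns about.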
2015-03-10 | net: Handle unregister properly when netdev namespace change fails. | David S. Miller | 1 | -13/+13
If rtnl_newlink() fails on its call to dev_change_net_namespace(), we have to make use of the ->dellink() method, if present, just like we do when rtnl_configure_link() fails. Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 | net: add comment for sock_efree() usage | Oliver Hartkopp | 1 | -0/+4
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 | net_sched: fix struct tc_u_hnode layout in u32 | WANG Cong | 1 | -1/+4
We dynamically allocate divisor+1 entries for ->ht[] in tc_u_hnode: ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL); So ->ht is supposed to be the last field of this struct, however this is broken, since an rcu head is appended after it. Fixes: 1ce87720d456 ("net: sched: make cls_u32 lockless") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
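Why the variable-sized array has to be the final member can be shown with a userspace sketch. struct demo_hnode is hypothetical and uses a C99 flexible array member, so the allocation adds divisor + 1 slots explicitly; rcu_placeholder merely stands in for the rcu head mentioned above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node: every fixed-size member sits before the bucket array. */
struct demo_hnode {
    unsigned int divisor;
    void *rcu_placeholder;  /* stand-in for the rcu head; must precede ht[] */
    void *ht[];             /* flexible array member: has to be last */
};

/* Allocate a zeroed node with divisor + 1 buckets. */
struct demo_hnode *demo_hnode_alloc(unsigned int divisor)
{
    size_t buckets = (size_t)divisor + 1;
    struct demo_hnode *h = calloc(1, sizeof(*h) + buckets * sizeof(void *));

    if (h != NULL)
        h->divisor = divisor;
    return h;
}
```

The extra buckets live in the memory directly after the struct, so any member declared after the array would overlap bucket storage. With the `ht[]` spelling a C compiler rejects such a declaration outright.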
2015-03-09 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | Linus Torvalds | 13 | -81/+143
Pull networking fixes from David Miller:
1) nft_compat accidentally truncates ethernet protocol to 8-bits, from Arturo Borrero.
2) Memory leak in ip_vs_proc_conn(), from Julian Anastasov.
3) Don't allow the space required for nftables rules to exceed the maximum value representable in the dlen field. From Patrick McHardy.
4) bcm63xx_enet can accidentally leave interrupts permanently disabled due to errors in the NAPI polling exit logic. Fix from Nicolas Schichan.
5) Fix OOPSes triggerable by the ping protocol module, due to missing address family validations etc. From Lorenzo Colitti.
6) Don't use RCU locking in sleepable context in team driver, from Jiri Pirko.
7) xen-netback miscalculates statistic offset pointers when reporting the stats to userspace. From David Vrabel.
8) Fix a leak of up to 256 pages per VIF destroy in xen-netback, also from David Vrabel.
9) ip_check_defrag() cannot assume that skb_network_offset() is zero, particularly when it is used by the AF_PACKET fanout defrag code. From Alexander Drozdov.
10) gianfar driver doesn't query OF node names properly when trying to determine the number of hw queues available. Fix it to explicitly check for OF nodes named queue-group. From Tobias Waldekranz.
11) MID field in macb driver should be 12 bits, not 16. From Punnaiah Choudary Kalluri.
12) Fix unintentional regression in traceroute due to timestamp socket option changes. Empty ICMP payloads should be allowed in non-timestamp cases. From Willem de Bruijn.
13) When devices are unregistered, we have to get rid of AF_PACKET multicast list entries that point to it via ifindex. Fix from Francesco Ruggeri.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
tipc: fix bug in link failover handling
net: delete stale packet_mclist entries
net: macb: constify macb configuration data
MAINTAINERS: add Marc Kleine-Budde as co maintainer for CAN networking layer
MAINTAINERS: linux-can moved to github
can: kvaser_usb: Read all messages in a bulk-in URB buffer
can: kvaser_usb: Avoid double free on URB submission failures
can: peak_usb: fix missing ctrlmode_ init for every dev
can: add missing initialisations in CAN related skbuffs
ip: fix error queue empty skb handling
bgmac: Clean warning messages
tcp: align tcp_xmit_size_goal() on tcp_tso_autosize()
net: fec: fix unbalanced clk disable on driver unbind
net: macb: Correct the MID field length value
net: gianfar: correctly determine the number of queue groups
ipv4: ip_check_defrag should not assume that skb_network_offset is zero
net: bcmgenet: properly disable password matching
net: eth: xgene: fix booting with devicetree
bnx2x: Force fundamental reset for EEH recovery
xen-netback: refactor xenvif_handle_frag_list()
...
2015-03-09tipc: fix bug in link failover handlingJon Paul Maloy1-3/+4
In commit c637c1035534867b85b78b453c38c495b58e2c5a ("tipc: resolve race problem at unicast message reception") we introduced a new mechanism for delivering buffers upwards from link to socket layer. That code contains a bug in how we handle the new link input queue during failover. When a link is reset, some of its users may be blocked because of congestion, and in order to resolve this, we add any pending wakeup pseudo messages to the link's input queue, and deliver them to the socket. This misses the case where the other, remaining link also may have congested users. Currently, the owner node's reference to the remaining link's input queue is unconditionally overwritten by the reset link's input queue. This has the effect that wakeup events from the remaining link may be unduly delayed (but not lost) for a potentially long period. We fix this by adding the pending events from the reset link to the input queue that is currently referenced by the node, whichever one it is. This commit should be applied to both net and net-next. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09net: delete stale packet_mclist entriesFrancesco Ruggeri1-8/+14
When an interface is deleted from a net namespace the ifindex in the corresponding entries in PF_PACKET sockets' mclists becomes stale. This can create inconsistencies if later an interface with the same ifindex is moved from a different namespace (not that unlikely since ifindexes are per-namespace). In particular we saw problems with dev->promiscuity, resulting in "promiscuity touches roof, set promiscuity failed. promiscuity feature of device might be broken" warnings and EOVERFLOW failures of setsockopt(PACKET_ADD_MEMBERSHIP). This patch deletes the mclist entries for interfaces that are deleted. Since this now causes setsockopt(PACKET_DROP_MEMBERSHIP) to fail with EADDRNOTAVAIL if called after the interface is deleted, also make packet_mc_drop not fail. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09can: add missing initialisations in CAN related skbuffsOliver Hartkopp1-0/+3
When accessing CAN network interfaces with AF_PACKET sockets e.g. by dhclient this can lead to a skb_under_panic due to missing skb initialisations. Add the missing initialisations at the CAN skbuff creation times on driver level (rx path) and in the network layer (tx path). Reported-by: Austin Schuh <austin@peloton-tech.com> Reported-by: Daniel Steer <daniel.steer@mclaren.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2015-03-08ip: fix error queue empty skb handlingWillem de Bruijn3-23/+53
When reading from the error queue, msg_name and msg_control are only populated for some errors. A new exception for empty timestamp skbs added a false positive on icmp errors without payload. `traceroute -M udpconn` only displayed gateways that return payload with the icmp error: the embedded network headers are pulled before sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise. Fix this regression by refining when msg_name and msg_control branches are taken. The solutions for the two fields are independent. msg_name only makes sense for errors that configure serr->port and serr->addr_offset. Test the first instead of skb->len. This also fixes another issue. saddr could hold the wrong data, as serr->addr_offset is not initialized in some code paths, pointing to the start of the network header. It is only valid when serr->port is set (non-zero). msg_control support differs between IPv4 and IPv6. IPv4 only honors requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The skb->len test can simply be removed, because skb->dev is also tested and never true for empty skbs. IPv6 honors requests for all errors aside from local errors and timestamps on empty skbs. In both cases, make the policy more explicit by moving this logic to a new function that decides whether to process msg_control and that optionally prepares the necessary fields in skb->cb[]. After this change, the IPv4 and IPv6 paths are more similar. The last case is rxrpc. Here, simply refine to only match timestamps. Fixes: 49ca0d8bfaf3 ("net-timestamp: no-payload option") Reported-by: Jan Niehusmann <jan@gondor.com> Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes v1->v2 - fix local origin test inversion in ip6_datagram_support_cmsg - make v4 and v6 code paths more similar by introducing analogous ipv4_datagram_support_cmsg - fix compile bug in rxrpc Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08sunrpc: fix braino in ->poll()Al Viro1-1/+1
POLL_OUT isn't what callers of ->poll() are expecting to see; it's actually __SI_POLL | 2 and it's a siginfo code, not a poll bitmap bit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Cc: Bruce Fields <bfields@fieldses.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
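The distinction between the two constants can be seen from userspace as well. The sketch below is only an illustration under the assumption of a Linux/glibc-style environment; `writable_mask()` and `is_writable()` are hypothetical helpers, not the sunrpc code.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <poll.h>
#include <signal.h>

/* POLLOUT is a bit in the poll event bitmap; POLL_OUT is a siginfo si_code
 * value. The two are distinct on Linux, which this check records. */
_Static_assert(POLL_OUT != POLLOUT, "siginfo code is not a poll bit");

/* Event mask a ->poll()-style routine should report for a writable
 * endpoint: the poll bit, never the siginfo code. */
static short writable_mask(void)
{
    return POLLOUT;
}

/* Nonzero if 'revents' reports writability in the poll bitmap. */
static int is_writable(short revents)
{
    return (revents & POLLOUT) != 0;
}
```

A caller testing for `POLLOUT` does not see writability if the routine returned the siginfo code by mistake, which is the kind of mix-up the commit corrects.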
2015-03-08Merge tag 'tty-4.0-rc3' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some tty and serial driver fixes for 4.0-rc3. Along with the atime fix that you know about, here are some other serial driver bugfixes as well. Most notable is a wait_until_sent bugfix that was traced back to being around since before 2.6.12 that Johan has fixed up. All have been in linux-next successfully" * tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: TTY: fix tty_wait_until_sent maximum timeout TTY: fix tty_wait_until_sent on 64-bit machines USB: serial: fix infinite wait_until_sent timeout TTY: bfin_jtag_comm: remove incorrect wait_until_sent operation net: irda: fix wait_until_sent poll timeout serial: uapi: Declare all userspace-visible io types serial: core: Fix iotype userspace breakage serial: sprd: Fix missing spin_unlock in sprd_handle_irq() console: Fix console name size mismatch tty: fix up atime/mtime mess, take four serial: 8250_dw: Fix get_mctrl behaviour serial:8250:8250_pci: delete unneeded quirk entries serial:8250:8250_pci: fix redundant entry report for WCH_CH352_2S Change email address for 8250_pci serial: 8250: Revert "tty: serial: 8250_core: read only RX if there is something in the FIFO" Revert "tty/serial: of_serial: add DT alias ID handling"
2015-03-07net: irda: fix wait_until_sent poll timeoutJohan Hovold1-1/+3
In case an infinite timeout (0) is requested, the irda wait_until_sent implementation would use a zero poll timeout rather than the default 200ms. Note that wait_until_sent is currently never called with a 0-timeout argument due to a bug in tty_wait_until_sent. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable <stable@vger.kernel.org> # v2.6.12 Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
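The timeout rule described above can be sketched as a small helper. This is an assumption-laden illustration, not the irda code: `pick_poll_time()` is a hypothetical name, and the units are whatever the caller uses (jiffies in the kernel, with the default corresponding to 200ms).

```c
#include <assert.h>

/* Choose the polling interval for a wait loop.
 * timeout == 0 means "wait forever", so it must not shrink the interval;
 * a nonzero timeout shorter than the default caps the interval. */
static unsigned long pick_poll_time(unsigned long timeout,
                                    unsigned long default_poll)
{
    assert(default_poll > 0);

    if (timeout != 0 && timeout < default_poll)
        return timeout;
    return default_poll;
}
```

The bug corresponds to dropping the `timeout != 0` test, in which case an infinite timeout would yield a zero poll interval.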
2015-03-06Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2-2/+3
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Fix a regression in the NFSv4 open state recovery code - Fix a regression in the NFSv4 close code - Fix regressions and side-effects of the loop-back mounted NFS fixes in 3.18, that cause the NFS read() syscall to return EBUSY. - Fix regressions around the readdirplus code and how it interacts with the VFS lazy unmount changes that went into v3.18. - Fix issues with out-of-order RPC call replies replacing updated attributes with stale ones (particularly after a truncate()). - Fix an underflow checking issue with RPC/RDMA credits - Fix a number of issues with the NFSv4 delegation return/free code. - Fix issues around stale NFSv4.1 leases when doing a mount" * tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits) NFSv4.1: Clear the old state by our client id before establishing a new lease NFSv4: Fix a race in NFSv4.1 server trunking discovery NFS: Don't write enable new pages while an invalidation is proceeding NFS: Fix a regression in the read() syscall NFSv4: Ensure we skip delegations that are already being returned NFSv4: Pin the superblock while we're returning the delegation NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation() NFSv4: Ensure that we don't reap a delegation that is being returned NFS: Fix stateid used for NFS v4 closes NFSv4: Don't call put_rpccred() under the rcu_read_lock() NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache() NFSv3: Use the readdir fileid as the mounted-on-fileid NFS: Don't invalidate a submounted dentry in nfs_prime_dcache() NFSv4: Set a barrier in the update_changeattr() helper NFS: Fix nfs_post_op_update_inode() to set an attribute barrier NFS: Remove size hack in nfs_inode_attrs_need_update() NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit NFS: Add attribute update barriers to NFS writebacks NFS: Set an attribute barrier on all updates NFS: Add 
attribute update barriers to nfs_setattr_update_inode() ...
2015-03-05tcp: align tcp_xmit_size_goal() on tcp_tso_autosize()Eric Dumazet1-7/+3
With some mss values, it is possible tcp_xmit_size_goal() puts one segment more in a TSO packet than tcp_tso_autosize(). We then send one TSO packet followed by one single MSS. It is not a serious bug, but we can do slightly better, especially for drivers using netif_set_gso_max_size() to lower gso_max_size. Using the same formula avoids these corner cases and makes tcp_xmit_size_goal() a bit faster. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 605ad7f184b6 ("tcp: refine TSO autosizing") Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller3-32/+46
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Don't truncate ethernet protocol type to u8 in nft_compat, from Arturo Borrero. 2) Fix several problems in the addition/deletion of elements in nf_tables. 3) Fix module refcount leak in ip_vs_sync, from Julian Anastasov. 4) Fix a race condition in the abort path in the nf_tables transaction infrastructure. Basically aborted rules can show up as active rules until changes are unrolled, oneliner from Patrick McHardy. 5) Check for overflows in the data area of the rule, also from Patrick. 6) Fix off-by-one in the per-rule user data size field. This introduces a new nft_userdata structure that is placed at the beginning of the user data area that contains the length to save some bits from the rule and we only need one bit to indicate its presence, from Patrick. 7) Fix rule replacement error path, the replaced rule is deleted on error instead of leaving it in place. This has been fixed by relying on the abort path to undo the incomplete replacement. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05ipv4: ip_check_defrag should not assume that skb_network_offset is zeroAlexander Drozdov1-4/+7
ip_check_defrag() may be used by af_packet to defragment outgoing packets. skb_network_offset() of af_packet's outgoing packets is not zero. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04net: ping: Return EAFNOSUPPORT when appropriate.Lorenzo Colitti2-4/+13
1. For an IPv4 ping socket, ping_check_bind_addr does not check the family of the socket address that's passed in. Instead, make it behave like inet_bind, which enforces either that the address family is AF_INET, or that the family is AF_UNSPEC and the address is 0.0.0.0. 2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL if the socket family is not AF_INET6. Return EAFNOSUPPORT instead, for consistency with inet6_bind. 3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT instead of EINVAL if an incorrect socket address structure is passed in. 4. Make IPv6 ping sockets be IPv6-only. The code does not support IPv4, and it cannot easily be made to support IPv4 because the protocol numbers for ICMP and ICMPv6 are different. This makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead of making the socket unusable. Among other things, this fixes an oops that can be triggered by: int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP); struct sockaddr_in6 sin6 = { .sin6_family = AF_INET6, .sin6_addr = in6addr_any, }; bind(s, (struct sockaddr *) &sin6, sizeof(sin6)); Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
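The IPv4 rule from point 1 can be sketched in userspace as follows. `check_v4_bind_family()` is a hypothetical helper written for illustration; it mirrors the rule as stated in the message (AF_INET, or AF_UNSPEC with address 0.0.0.0) and is not the kernel's ping_check_bind_addr().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Validate the address passed to bind() on an IPv4 socket.
 * Returns 0 if acceptable, -EINVAL if too short, -EAFNOSUPPORT if the
 * address family does not fit an IPv4 socket. */
static int check_v4_bind_family(const struct sockaddr *uaddr, size_t addr_len)
{
    const struct sockaddr_in *addr = (const struct sockaddr_in *)uaddr;

    assert(uaddr != NULL);

    if (addr_len < sizeof(*addr))
        return -EINVAL;
    if (addr->sin_family == AF_INET)
        return 0;
    /* AF_UNSPEC is tolerated only together with the wildcard address. */
    if (addr->sin_family == AF_UNSPEC &&
        addr->sin_addr.s_addr == htonl(INADDR_ANY))
        return 0;
    return -EAFNOSUPPORT;
}
```

Passing a `struct sockaddr_in6` with `sin6_family = AF_INET6`, as in the reproducer quoted in the message, is rejected with -EAFNOSUPPORT by this check instead of being interpreted as an IPv4 address.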
2015-03-04netfilter: nf_tables: fix error handling of rule replacementPablo Neira Ayuso1-6/+0
In general, if a transaction object is added to the list successfully, we can rely on the abort path to undo what we've done. This allows us to simplify the error handling of the rule replacement path in nf_tables_newrule(). This implicitly fixes an unnecessary removal of the old rule, which needs to be left in place if we fail to replace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-04netfilter: nf_tables: fix userdata length overflowPatrick McHardy1-9/+19
The NFT_USERDATA_MAXLEN is defined to 256, however we only have a u8 to store its size. Introduce a struct nft_userdata which contains a length field and indicate its presence using a single bit in the rule. The length field of struct nft_userdata is also a u8, however we don't store zero sized data, so the actual length is udata->len + 1. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
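The length encoding can be sketched as below. The helper names and the `USERDATA_MAXLEN` macro are hypothetical; only the "stored value is length minus one" rule comes from the message.

```c
#include <assert.h>
#include <stdint.h>

#define USERDATA_MAXLEN 256 /* mirrors the 256 quoted in the message */

/* A u8 holds 0..255, but valid lengths are 1..256 because zero-sized data
 * is never stored. Storing len - 1 makes the whole range fit. */
static uint8_t udata_encode_len(unsigned int len)
{
    assert(len >= 1 && len <= USERDATA_MAXLEN);
    return (uint8_t)(len - 1);
}

/* Recover the actual length from the stored field. */
static unsigned int udata_decode_len(uint8_t stored)
{
    return (unsigned int)stored + 1;
}
```

Storing the raw length instead would wrap 256 to 0 in the u8, which is the overflow the commit title refers to.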
2015-03-04netfilter: nf_tables: check for overflow of rule dlen fieldPatrick McHardy1-0/+4
Check that the space required for the expressions doesn't exceed the size of the dlen field, which would lead to the iterators crashing. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-04netfilter: nf_tables: fix transaction race conditionPatrick McHardy1-1/+1
A race condition exists in the rule transaction code for rules that get added and removed within the same transaction. The new rule starts out as inactive in the current and active in the next generation and is inserted into the ruleset. When it is deleted, it is additionally set to inactive in the next generation as well. On commit the next generation is begun, then the actions are finalized. For the new rule this would mean clearing out the inactive bit for the previously current, now next generation. However nft_rule_clear() clears out the bits for *both* generations, activating the rule in the current generation, where it should be deactivated due to being deleted. The rule will thus be active until the deletion is finalized, removing the rule from the ruleset. Similarly, when aborting a transaction for the same case, the undo of insertion will remove it from the RCU protected rule list, the deletion will clear out all bits. However until the next RCU synchronization after all operations have been undone, the rule is active on CPUs which can still see the rule on the list. Generally, there may never be any modifications of the current generation's inactive bit since this defeats the entire purpose of atomicity. Change nft_rule_clear() to only touch the next generation's bit to fix this. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
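The two-generation scheme can be modelled with a two-bit mask. The sketch below is a simplified model under stated assumptions: `rule_sketch`, `gen_next()`, `rule_clear_buggy()` and `rule_clear_fixed()` are hypothetical names, and a set bit means "inactive in that generation".

```c
#include <assert.h>

/* Bit g of genmask set means the rule is inactive in generation g (0 or 1). */
struct rule_sketch {
    unsigned int genmask;
};

/* With two generations, the next one is simply the other one. */
static unsigned int gen_next(unsigned int cur)
{
    assert(cur <= 1);
    return cur ^ 1u;
}

static int rule_is_active(const struct rule_sketch *r, unsigned int gen)
{
    return (r->genmask & (1u << gen)) == 0;
}

/* Behaviour described as broken: both bits are cleared, so the rule also
 * becomes active in the current generation. */
static void rule_clear_buggy(struct rule_sketch *r)
{
    r->genmask = 0;
}

/* Behaviour described as the fix: only the next generation's bit is touched. */
static void rule_clear_fixed(struct rule_sketch *r, unsigned int cur)
{
    r->genmask &= ~(1u << gen_next(cur));
}
```

Starting from a rule that is inactive in both generations, the fixed variable-width clear leaves the current generation's bit alone, while the buggy one makes the rule visible immediately.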
2015-03-03Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2-0/+4
Pull nfsd fixes from Bruce Fields: "Three miscellaneous bugfixes, most importantly the clp->cl_revoked bug, which we've seen several reports of people hitting" * 'for-4.0' of git://linux-nfs.org/~bfields/linux: sunrpc: integer underflow in rsc_parse() nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd svcrpc: fix memory leak in gssp_accept_sec_context_upcall
2015-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds40-80/+192
Pull networking fixes from David Miller: 1) If an IPVS tunnel is created with a mixed-family destination address, it cannot be removed. Fix from Alexey Andriyanov. 2) Fix module refcount underflow in netfilter's nft_compat, from Pablo Neira Ayuso. 3) Generic statistics infrastructure can reference variables sitting on a released function stack, therefore use dynamic allocation always. Fix from Ignacy Gawędzki. 4) skb_copy_bits() return value test is inverted in ip_check_defrag(). 5) Fix network namespace exit in openvswitch, we have to release all of the per-net vports. From Pravin B Shelar. 6) Fix signedness bug in CAIF's cfpkt_iterate(), from Dan Carpenter. 7) Fix rhashtable grow/shrink behavior, only expand during inserts and shrink during deletes. From Daniel Borkmann. 8) Netdevice names with semicolons should never be allowed, because they serve as a separator. From Matthew Thode. 9) Use {,__}set_current_state() where appropriate, from Fabian Frederick. 10) Revert byte queue limits support in r8169 driver, it's causing regressions we can't figure out. 11) tcp_should_expand_sndbuf() erroneously uses tp->packets_out to measure packets in flight, properly use tcp_packets_in_flight() instead. From Neal Cardwell. 12) Fix accidental removal of support for bluetooth in CSR based Intel wireless cards. From Marcel Holtmann. 13) We accidentally added a behavioral change between native and compat tasks, wrt testing the MSG_CMSG_COMPAT bit. Just ignore it if the user happened to set it in a native binary as that was always the behavior we had. From Catalin Marinas. 14) Check genlmsg_unicast() return value in hwsim netlink tx frame handling, from Bob Copeland. 15) Fix stale ->radar_required setting in mac80211 that can prevent starting new scans, from Eliad Peller. 16) Fix memory leak in nl80211 monitor, from Johannes Berg. 17) Fix race in TX index handling in xen-netback, from David Vrabel. 18) Don't enable interrupts in amd-xgbe driver until all software et al. 
state is ready for the interrupt handler to run. From Thomas Lendacky. 19) Add missing netlink_ns_capable() checks to rtnl_newlink(), from Eric W Biederman. 20) The amount of header space needed in macvtap was not calculated properly, fix it otherwise we splat past the beginning of the packet. From Eric Dumazet. 21) Fix bcmgenet TCP TX perf regression, from Jaedon Shin. 22) Don't raw initialize or mod timers, use setup_timer() and mod_timer() instead. From Vaishali Thakkar. 23) Fix software maintained statistics in bcmgenet and systemport drivers, from Florian Fainelli. 24) DMA descriptor updates in sh_eth need proper memory barriers, from Ben Hutchings. 25) Don't do UDP Fragmentation Offload on RAW sockets, from Michal Kubecek. 26) Openvswitch's non-masked set actions aren't constructed properly into netlink messages, fix from Joe Stringer. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits) openvswitch: Fix serialization of non-masked set actions. gianfar: Reduce logging noise seen due to phy polling if link is down ibmveth: Add function to enable live MAC address changes net: bridge: add compile-time assert for cb struct size udp: only allow UFO for packets from SOCK_DGRAM sockets sh_eth: Really fix padding of short frames on TX Revert "sh_eth: Enable Rx descriptor word 0 shift for r8a7790" sh_eth: Fix RX recovery on R-Car in case of RX ring underrun sh_eth: Ensure proper ordering of descriptor active bit write/read net/mlx4_en: Disbale GRO for incoming loopback/selftest packets net/mlx4_core: Fix wrong mask and error flow for the update-qp command net: systemport: fix software maintained statistics net: bcmgenet: fix software maintained statistics rxrpc: don't multiply with HZ twice rxrpc: terminate retrans loop when sending of skb fails net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface. 
net: pasemi: Use setup_timer and mod_timer net: stmmac: Use setup_timer and mod_timer net: 8390: axnet_cs: Use setup_timer and mod_timer net: 8390: pcnet_cs: Use setup_timer and mod_timer ...
2015-03-03openvswitch: Fix serialization of non-masked set actions.Joe Stringer1-1/+7
Set actions consist of a regular OVS_KEY_ATTR_* attribute nested inside of a OVS_ACTION_ATTR_SET action attribute. When converting masked actions back to regular set actions, the inner attribute length was not changed, i.e., double the length was being serialized. This patch fixes the bug. Fixes: 83d2b9b ("net: openvswitch: Support masked set actions.") Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03net: bridge: add compile-time assert for cb struct sizeFlorian Westphal1-0/+2
Make the build fail if the structure no longer fits into ->cb storage. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
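A userspace analogue of such a compile-time check might look like the sketch below. The types `skb_sketch` and `my_cb` and the helpers are hypothetical; `_Static_assert` stands in for the kernel's build-time assertion, and the 48-byte scratch area mirrors the size of sk_buff's cb[] array.

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE 48 /* size of the per-packet scratch area in this sketch */

/* Hypothetical packet buffer with a fixed-size control block. */
struct skb_sketch {
    char cb[CB_SIZE];
};

/* Hypothetical private state a subsystem keeps in the control block. */
struct my_cb {
    void *dev;
    unsigned short frag_max_size;
    int flags;
};

/* Compilation stops here if my_cb ever grows beyond the scratch area. */
_Static_assert(sizeof(struct my_cb) <= sizeof(((struct skb_sketch *)0)->cb),
               "my_cb does not fit into cb[]");

/* Copy private state into the control block. */
static void cb_store(struct skb_sketch *skb, const struct my_cb *val)
{
    assert(skb != NULL && val != NULL);
    memcpy(skb->cb, val, sizeof(*val));
}

/* Read private state back out of the control block. */
static struct my_cb cb_load(const struct skb_sketch *skb)
{
    struct my_cb val;

    assert(skb != NULL);
    memcpy(&val, skb->cb, sizeof(val));
    return val;
}
```

The benefit of the compile-time form is that an oversized structure is caught when the code is built, not when a packet's neighbouring fields are silently overwritten at run time.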
2015-03-03mac80211: disable u-APSD queues by defaultMichal Kazior1-6/+17
Some APs experience problems when working with U-APSD. Decreasing the probability of that happening by using legacy mode for all ACs but VO isn't enough. Cisco 4410N originally forced us to enable VO by default only because it treated non-VO ACs as legacy. However some APs (notably Netgear R7000) silently reclassify packets to different ACs. Since u-APSD ACs require trigger frames for frame retrieval clients would never see some frames (e.g. ARP responses) or would fetch them accidentally after a long time. It makes little sense to enable u-APSD queues by default because it needs userspace applications to be aware of it to actually take advantage of the possible additional powersavings. Implicitly depending on driver autotrigger frame support doesn't make much sense. Cc: stable@vger.kernel.org Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-03mac80211: drop unencrypted frames in mesh fwdingBob Copeland1-0/+3
The mesh forwarding path was not checking that data frames were protected when running an encrypted network; add the necessary check. Cc: stable@vger.kernel.org Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-02udp: only allow UFO for packets from SOCK_DGRAM socketsMichal Kubeček2-2/+4
If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer checksum is to be computed on segmentation. However, in this case, skb->csum_start and skb->csum_offset are never set as the raw socket transmit path bypasses udp_send_skb() where they are usually set. As a result, the driver may access invalid memory when trying to calculate the checksum and store the result (as observed in virtio_net driver). Moreover, the very idea of modifying the userspace provided UDP header is IMHO against raw socket semantics (I wasn't able to find a document clearly stating this or the opposite, though). And while allowing CHECKSUM_NONE in the UFO case would be more efficient, it would be too intrusive a change just to handle a corner case like this. Therefore disallowing UFO for packets that do not come from SOCK_DGRAM sockets seems to be the best option. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01rxrpc: don't multiply with HZ twiceFlorian Westphal1-1/+1
rxrpc_resend_timeout has an initial value of 4 * HZ; use it as-is. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01rxrpc: terminate retrans loop when sending of skb failsFlorian Westphal1-3/+4
Typo, 'stop' is never set to true. Seems intent is to not attempt to retransmit more packets after sendmsg returns an error. This change is based on code inspection only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR ↵Arvid Brodin3-3/+14
interface. To repeat: $ sudo ip link del hsr0 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff8187f495>] hsr_del_port+0x15/0xa0 etc... Bug description: As part of the hsr master device destruction, hsr_del_port() is called for each of the hsr ports. At each such call, the master device is updated regarding features and mtu. When the master device is freed before the slave interfaces, master will be NULL in hsr_del_port(), which led to a NULL pointer dereference. Additionally, dev_put() was called on the master device itself in hsr_del_port(), causing a refcnt error. A third bug in the same code path was that the rtnl lock was not taken before hsr_del_port() was called as part of hsr_dev_destroy(). The reporter (Nicolas Dichtel) also said: "hsr_netdev_notify() supposes that the port will always be available when the notification is for an hsr interface. It's wrong. For example, netdev_wait_allrefs() may resend NETDEV_UNREGISTER.". As a precaution against this, a check for port == NULL was added in hsr_dev_notify(). Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Fixes: 51f3c605318b056a ("net/hsr: Move slave init to hsr_slave.c.") Signed-off-by: Arvid Brodin <arvid.brodin@alten.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-01net: do not use rcu in rtnl_dump_ifinfo()Eric Dumazet1-3/+1
We did a failed attempt in the past to only use rcu in rtnl dump operations (commit e67f88dd12f6 "net: dont hold rtnl mutex during netlink dump callbacks") Now that dumps are holding RTNL anyway, there is no need to also use rcu locking, as it forbids any scheduling ability, like GFP_KERNEL allocations that controlling path should use instead of GFP_ATOMIC whenever possible. This should fix following splat Cong Wang reported : [ INFO: suspicious RCU usage. ] 3.19.0+ #805 Tainted: G W include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ip/771: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c #1: (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e stack backtrace: CPU: 3 PID: 771 Comm: ip Tainted: G W 3.19.0+ #805 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6 ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd 00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758 Call Trace: [<ffffffff81a27457>] dump_stack+0x4c/0x65 [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110 [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47 [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb [<ffffffff8109e67d>] __might_sleep+0x78/0x80 [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1 [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101 [<ffffffff817c1383>] alloc_netid+0x61/0x69 [<ffffffff817c14c3>] __peernet2id+0x79/0x8d [<ffffffff817c1ab7>] peernet2id+0x13/0x1f [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20 [<ffffffff810b17d9>] ? 
__lock_is_held+0x39/0x52 [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213 [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5 [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59 [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51 [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9 [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37 [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72 [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58 [<ffffffff811ac556>] ? __fget_light+0x2d/0x52 [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60 [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 0c7aecd4bde4b7302 ("netns: add rtnl cmd to add and get peer netns ids") Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28Merge tag 'mac80211-for-davem-2015-02-27' of ↵David S. Miller6-9/+14
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A few patches have accumulated, among them the fix for Linus's four-way-handshake problem. The others are various small fixes for problems all over, nothing really stands out. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28net: Verify permission to link_net in newlinkEric W. Biederman1-0/+3
When applicable verify that the caller has permission to the underlying network namespace for a newly created network device. Similar checks exist for the network namespace a network device will be created in. Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28net: Verify permission to dest_net in newlinkEric W. Biederman1-0/+4
When applicable verify that the caller has permission to create a network device in another network namespace. This check is already present when moving a network device between network namespaces in setlink so all that is needed is to duplicate that check in newlink. This change almost backports cleanly, but there are context conflicts as the code that follows was added in v4.0-rc1 Fixes: b51642f6d77b net: Enable a userns root rtnl calls that are safe for unprivilged users Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27rhashtable: remove indirection for grow/shrink decision functionsDaniel Borkmann3-6/+0
Currently, all real users of rhashtable default their grow and shrink decision functions to rht_grow_above_75() and rht_shrink_below_30(), so that there's currently no need to have this explicitly selectable. It can/should be generic and private inside rhashtable until a real use case pops up. Since we can make this private, we'll save us this additional indirection layer and can improve insertion/deletion time as well. Reference: http://patchwork.ozlabs.org/patch/443040/ Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
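The two decision functions named above can be approximated by simple load-factor checks. The sketch below is an assumption-based illustration of the 75% and 30% thresholds only; the function names are hypothetical and any minimum or maximum table size limits the real functions may apply are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Grow when the table holds more elements than 75% of its bucket count. */
static int grow_above_75(size_t nelems, size_t size)
{
    assert(size > 0);
    return nelems > size / 4 * 3;
}

/* Shrink when the table holds fewer elements than 30% of its bucket count. */
static int shrink_below_30(size_t nelems, size_t size)
{
    assert(size > 0);
    return nelems < size * 3 / 10;
}
```

Calling such checks directly, rather than through per-table function pointers, is the kind of indirection removal the message describes.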
2015-02-26sunrpc: integer underflow in rsc_parse()Dan Carpenter1-0/+2
If we call groups_alloc() with invalid values then it might lead to memory corruption. For example, with a negative value we might not allocate enough for sizeof(struct group_info). (We're doing this in the caller for consistency with other callers of groups_alloc(). The other alternative might be to move the check out of all the callers into groups_alloc().) Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
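The kind of caller-side range check described above can be sketched as follows. This is a hedged user-space illustration: `SKETCH_NGROUPS_MAX` and `group_count_is_valid()` are hypothetical names, and the limit of 65536 is an assumption chosen for the example rather than a value taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed upper bound on the number of supplementary groups. */
#define SKETCH_NGROUPS_MAX 65536

/*
 * Reject counts parsed from untrusted input before they reach an
 * allocator that sizes its buffer from the count.
 */
static bool group_count_is_valid(int n)
{
	return n >= 0 && n <= SKETCH_NGROUPS_MAX;
}
```

A caller would bail out when the check fails instead of passing the count on to the allocation routine.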
2015-02-26mac80211: Send EAPOL frames at lowest rateJouni Malinen1-0/+1
The current minstrel_ht rate control behavior is somewhat optimistic in trying to find the optimum TX rate. While this is usually fine for normal Data frames, there are cases where a more conservative set of retry parameters would be beneficial to make the connection more robust. EAPOL frames are critical to the authentication and especially the EAPOL-Key message 4/4 (the last message in the 4-way handshake) is important to get through to the AP. If that message is lost, the only recovery mechanism in many cases is to reassociate with the AP and start from scratch. This can often be avoided by trying to send the frame with a more conservative rate and/or with more link layer retries. In most cases, minstrel_ht is currently using the initial EAPOL-Key frames for probing higher rates and this results in only five link layer transmission attempts (one at high(ish) MCS and four at MCS0). While this works with most APs, it looks like there are some deployed APs that may have issues with the EAPOL frames using HT MCS immediately after association. Similarly, there may be issues in cases where the signal strength or radio environment is not good enough to be able to get frames through even with a couple of MCS 0 tries. The best approach for this would likely be to reduce the TX rate for the last rate (3rd rate parameter in the set) to a low basic rate (say, 6 Mbps on 5 GHz and 2 or 5.5 Mbps on 2.4 GHz), but doing that cleanly requires some more effort. For now, we can start with a simple one-liner that forces the minimum rate to be used for EAPOL frames, similarly to how the TX rate is selected for the IEEE 802.11 Management frames. This does result in a small extra latency added to the cases where the AP would be able to receive the higher rate, but taking into account how few EAPOL frames are used, this is likely to be insignificant. 
A future optimization in the minstrel_ht design can also allow this patch to be reverted to get back to the more optimized initial TX rate. It should also be noted that many drivers that do not use minstrel as the rate control algorithm are already doing similar workarounds by forcing the lowest TX rate to be used for EAPOL frames. Cc: stable@vger.kernel.org Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-24af_packet: don't pass empty blocks for PACKET_V3Alexander Drozdov1-1/+9
Before da413eec729d ("packet: Fixed TPACKET V3 to signal poll when block is closed rather than every packet") a poll listening on an af_packet socket was not signaled if there were no packets to process. After that patch, poll is signaled every time the block retire timer expires. That happens because af_packet closes the current block on timeout even if the block is empty. Passing empty blocks to the user not only wastes CPU but also wastes ring buffer space, increasing the probability of packet drops with small timeouts. Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Cc: Dan Collins <dan@dcollins.co.nz> Cc: Willem de Bruijn <willemb@google.com> Cc: Guy Harris <guy@alum.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24rtnetlink: avoid 0 sized arraysSasha Levin1-2/+2
Arrays (when not in a struct) "shall have a value greater than zero". GCC complains when it's not the case here. Fixes: ba7d49b1f0 ("rtnetlink: provide api for getting and setting slave info") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24mac80211/minstrel: fix !x!=0 confusionJiri Slaby1-1/+1
Commit 06d961a8e210 ("mac80211/minstrel: use the new rate control API") inverted the condition 'if (msr->sample_limit != 0)' to 'if (!msr->sample_limit != 0)'. But it is confusing both to people and compilers (gcc5):

net/mac80211/rc80211_minstrel.c: In function 'minstrel_get_rate':
net/mac80211/rc80211_minstrel.c:376:26: warning: logical not is only applied to the left hand side of comparison
   if (!msr->sample_limit != 0)
                          ^

Let there be only 'if (!msr->sample_limit)'.

Fixes: 06d961a8e210 ("mac80211/minstrel: use the new rate control API")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
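To see why the two spellings behave the same, note that unary `!` binds tighter than `!=`. The sketch below is a stand-alone illustration with hypothetical function names; the first function writes the precedence out with explicit parentheses, which is how the compiler reads the confusing form.

```c
#include <assert.h>
#include <stdbool.h>

/* How the compiler parses '!limit != 0': the '!' is applied first. */
static bool as_parsed(int limit)
{
	return (!limit) != 0;
}

/* The simplified condition the commit message settles on. */
static bool simplified(int limit)
{
	return !limit;
}
```

Both functions return true only when `limit` is zero, so the simplification changes readability and silences the warning without changing behavior.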
2015-02-24Merge https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvsPablo Neira Ayuso1-0/+3
Simon Horman says: ==================== Second Round of IPVS Fixes for v3.20 This patch resolves some memory leaks in connection synchronisation code that date back to v2.6.39. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-24cfg80211: calls nl80211_exit on errorJunjie Mao1-0/+1
nl80211_exit should be called in cfg80211_init if nl80211_init succeeds but regulatory_init or create_singlethread_workqueue fails. Signed-off-by: Junjie Mao <junjie_mao@yeah.net> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-24cfg80211: fix n_reg_rules to match world_regdomJason Abele1-1/+1
There are currently 8 rules in the world_regdom, but only the first 6 are applied due to an incorrect value for n_reg_rules. This causes channels 149-165 and 60GHz to be disabled. Signed-off-by: Jason Abele <jason@aether.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-24nl80211: fix memory leak in monitor flags parsingJohannes Berg1-4/+4
If monitor flags parsing results in active monitor but that isn't supported, the already allocated message is leaked. Fix this by moving the allocation after this check. Reported-by: Christian Engelmayer <cengelma@gmx.at> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-24nl80211: use loop index as type for net detect frequency resultsSamuel Tan1-3/+1
We currently add nested members of the NL80211_ATTR_SCAN_FREQUENCIES as NLA_U32 attributes of type NL80211_ATTR_WIPHY_FREQ in cfg80211_net_detect_results. However, since there can be an arbitrary number of frequency results, we should use the loop index of the loop used to add the frequency results to NL80211_ATTR_SCAN_FREQUENCIES as the type (i.e. nla_type) for each result attribute, rather than a fixed type. This change is in line with how nested members are added to NL80211_ATTR_SCAN_FREQUENCIES in the functions nl80211_send_wowlan_nd and nl80211_add_scan_req. Signed-off-by: Samuel Tan <samueltan@chromium.org> Signed-off-by: Luciano Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-24mac80211: clear sdata->radar_requiredEliad Peller1-0/+5
If ieee80211_vif_use_channel() fails, we have to clear sdata->radar_required (which we might have just set). Failing to do it results in stale radar_required field which prevents starting new scan requests. Reported-by: Jouni Malinen <j@w1.fi> Signed-off-by: Eliad Peller <eliad@wizery.com> [use false instead of 0] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-23Merge tag 'nfs-rdma-for-4.0-3' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust2-2/+3
NFS: RDMA Client Sparse Fix #2 This patch fixes another sparse fix found by Dan Carpenter's tool. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-4.0-3' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Store RDMA credits in unsigned variables
2015-02-23ipv6: addrconf: validate new MTU before applying itMarcelo Leitner1-1/+16
Currently we don't check whether the new MTU is valid, which allows one to configure an MTU smaller than the minimum allowed by the RFCs or even bigger than the interface's own MTU. This is a problem as it may lead to packet drops. If you have a daemon like NetworkManager running, this may be exploited by remote attackers by forging RA packets with an invalid MTU, possibly leading to a DoS. (NetworkManager currently only validates for values too small, but not for too big ones.) The fix is just to make sure the new value is valid. That is, between IPV6_MIN_MTU and the interface's MTU. Note that a similar check is already performed in ndisc_router_discovery(), for when the kernel itself parses the RA. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
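The range check described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the patch itself: `ipv6_mtu_is_valid()` and `SKETCH_IPV6_MIN_MTU` are hypothetical names, and the value 1280 is the minimum IPv6 link MTU from RFC 2460.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimum IPv6 link MTU (RFC 2460); stands in for IPV6_MIN_MTU. */
#define SKETCH_IPV6_MIN_MTU 1280

/* Accept a new MTU only if it lies in [minimum, device MTU]. */
static bool ipv6_mtu_is_valid(int new_mtu, int dev_mtu)
{
	return new_mtu >= SKETCH_IPV6_MIN_MTU && new_mtu <= dev_mtu;
}
```

For a device with an MTU of 1500, values such as 1279 or 9000 would be rejected while 1280 and 1500 would be accepted.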
2015-02-23net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msgCatalin Marinas1-9/+0
With commit a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg), the MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points, changing the kernel compat behaviour from the one before the commit it was trying to fix (1be374a0518a, net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg). On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native 32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by the kernel). However, on a 64-bit kernel, the compat ABI is different with commit a7526eb5d06b. This patch changes the compat_sys_{send,recv}msg behaviour to the one prior to commit 1be374a0518a. The problem was found running 32-bit LTP (sendmsg01) binary on an arm64 kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg() but the general rule is not to break user ABI (even when the user behaviour is not entirely sane). Fixes: a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg) Cc: Andy Lutomirski <luto@amacapital.net> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-23irda: replace current->state by set_current_state()Fabian Frederick2-3/+3
Use helper functions to access current->state. Direct assignments are prone to races and therefore buggy. current->state = TASK_RUNNING can be replaced by __set_current_state() Thanks to Peter Zijlstra for the exact definition of the problem. Suggested-By: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-23xprtrdma: Store RDMA credits in unsigned variablesChuck Lever2-2/+3
Dan Carpenter's static checker pointed out:

  net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler() warn: can 'credits' be negative?

"credits" is defined as an int. The credits value comes from the server as a 32-bit unsigned integer. A malicious or broken server can plant a large unsigned integer in that field which would result in an underflow in the following logic, potentially triggering a deadlock of the mount point by blocking the client from issuing more RPC requests.

net/sunrpc/xprtrdma/rpc_rdma.c:

  876         credits = be32_to_cpu(headerp->rm_credit);
  877         if (credits == 0)
  878                 credits = 1;    /* don't deadlock */
  879         else if (credits > r_xprt->rx_buf.rb_max_requests)
  880                 credits = r_xprt->rx_buf.rb_max_requests;
  881
  882         cwnd = xprt->cwnd;
  883         xprt->cwnd = credits << RPC_CWNDSHIFT;
  884         if (xprt->cwnd > cwnd)
  885                 xprt_release_rqst_cong(rqst->rq_task);

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: eba8ff660b2d ("xprtrdma: Move credit update to RPC . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
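The difference between the signed and unsigned variants of the clamp can be shown with a small user-space sketch. The function names are hypothetical and the limit of 32 used in the usage note is an arbitrary assumption; the logic mirrors the quoted lines 877 to 880 but is not the kernel code. On common two's-complement platforms a wire value of 0xFFFFFFFF stored in an `int` reads as -1, which is the input used for the signed case.

```c
#include <assert.h>
#include <stdint.h>

/* Signed variant: a huge wire value that lands in an int looks negative. */
static int32_t clamp_credits_signed(int32_t credits, int32_t max_requests)
{
	if (credits == 0)
		credits = 1;	/* don't deadlock */
	else if (credits > max_requests)
		credits = max_requests;
	return credits;		/* a negative input passes through unclamped */
}

/* Unsigned variant: the same bit pattern compares as large and is clamped. */
static uint32_t clamp_credits_unsigned(uint32_t credits, uint32_t max_requests)
{
	if (credits == 0)
		credits = 1;	/* don't deadlock */
	else if (credits > max_requests)
		credits = max_requests;
	return credits;
}
```

With a limit of 32, the signed variant lets -1 survive both checks, while the unsigned variant clamps 0xFFFFFFFF down to 32.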
2015-02-23decnet: Fix obvious o/0 typoRasmus Villemoes1-1/+1
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22tcp: fix tcp_should_expand_sndbuf() to use tcp_packets_in_flight()Neal Cardwell1-1/+1
tcp_should_expand_sndbuf() does not expand the send buffer if we have filled the congestion window. However, it should use tcp_packets_in_flight() instead of tp->packets_out to make this check. Testing has established that the difference matters a lot if there are many SACKed packets, causing a needless performance shortfall. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
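The effect of switching counters can be sketched outside the kernel. The struct below is a hypothetical subset of the per-connection state, and the in-flight formula (packets sent, minus those SACKed or marked lost, plus retransmissions) is stated here as an assumption about what tcp_packets_in_flight() computes; the helper names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of per-connection counters, all in packets. */
struct sketch_tcp {
	unsigned int packets_out;	/* sent, not yet cumulatively ACKed */
	unsigned int sacked_out;	/* selectively ACKed by the receiver */
	unsigned int lost_out;		/* marked as lost */
	unsigned int retrans_out;	/* retransmitted, still outstanding */
	unsigned int snd_cwnd;		/* congestion window */
};

static unsigned int sketch_packets_in_flight(const struct sketch_tcp *tp)
{
	return tp->packets_out - (tp->sacked_out + tp->lost_out) +
	       tp->retrans_out;
}

/* Old check: counts SACKed packets as if they still occupied the window. */
static bool cwnd_filled_old(const struct sketch_tcp *tp)
{
	return tp->packets_out >= tp->snd_cwnd;
}

/* New check: only packets actually in the network count. */
static bool cwnd_filled_new(const struct sketch_tcp *tp)
{
	return sketch_packets_in_flight(tp) >= tp->snd_cwnd;
}
```

With 10 packets out, 6 of them SACKed and a window of 10, the old check reports the window as full while the new one sees only 4 packets in flight, so the send buffer may still be expanded.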
2015-02-22net: pktgen: disable xmit_clone on virtual devicesEric Dumazet1-0/+3
Trying to use burst capability (aka xmit_more) on a virtual device like bonding is not supported. For example, skb might be queued multiple times on a qdisc, with various list corruptions. Fixes: 38b2cf2982dc ("net: pktgen: packet bursting via skb->xmit_more") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22ipvs: add missing ip_vs_pe_put in sync codeJulian Anastasov1-0/+3
ip_vs_conn_fill_param_sync() gets in param.pe a module reference for persistence engine from __ip_vs_pe_getbyname() but forgets to put it. Problem occurs in backup for sync protocol v1 (2.6.39). Also, pe_data usually comes in sync messages for connection templates and ip_vs_conn_new() copies the pointer only in this case. Make sure pe_data is not leaked if it comes unexpectedly for normal connections. Leak can happen only if bogus messages are sent to backup server. Fixes: fe5e7a1efb66 ("IPVS: Backup, Adding Version 1 receive capability") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-22netfilter: nf_tables: fix addition/deletion of elements from commit/abortPablo Neira Ayuso1-9/+12
We have several problems in this path:

1) There is a use-after-free when removing individual elements from the commit path.
2) We have to uninit() the data part of the element from the abort path to avoid a chain refcount leak.
3) We have to check for set->flags to see if there's a mapping, instead of the element flags.
4) We have to check for !(flags & NFT_SET_ELEM_INTERVAL_END) to skip elements that are part of the interval that have no data part, so they don't need to be uninit().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-22netfilter: nft_compat: don't truncate ethernet protocol type to u8Arturo Borrero1-7/+7
Use u16 for protocol and then cast it to __be16

>> net/netfilter/nft_compat.c:140:37: sparse: incorrect type in assignment (different base types)
   net/netfilter/nft_compat.c:140:37: expected restricted __be16 [usertype] ethproto
   net/netfilter/nft_compat.c:140:37: got unsigned char [unsigned] [usertype] proto
>> net/netfilter/nft_compat.c:351:37: sparse: incorrect type in assignment (different base types)
   net/netfilter/nft_compat.c:351:37: expected restricted __be16 [usertype] ethproto
   net/netfilter/nft_compat.c:351:37: got unsigned char [unsigned] [usertype] proto

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-21af_packet: allow packets defragmentation not only for hash fanout typeAlexander Drozdov1-5/+5
Packets defragmentation was introduced for PACKET_FANOUT_HASH only, see 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts") It may be useful to have defragmentation enabled regardless of fanout type. Without that, the AF_PACKET user may have to: 1. Collect fragments from different rings 2. Defragment by itself Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-21net: reject creation of netdev names with colonsMatthew Thode1-1/+1
Colons are used as a separator in netdev device lookup in dev_ioctl.c. Specific functions are SIOCGIFTXQLEN, SIOCETHTOOL and SIOCSIFNAME. Signed-off-by: Matthew Thode <mthode@mthode.org> Signed-off-by: David S. Miller <davem@davemloft.net>
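A name check that rejects colons can be sketched as below. This is a user-space illustration, not the kernel's validation routine: `netdev_name_is_valid()` and `SKETCH_IFNAMSIZ` are hypothetical, and the other rejected characters (slash and whitespace) and the 16-byte limit are assumptions added to make the example complete.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Assumed maximum name buffer size, including the terminating NUL. */
#define SKETCH_IFNAMSIZ 16

static bool netdev_name_is_valid(const char *name)
{
	if (*name == '\0' || strlen(name) >= SKETCH_IFNAMSIZ)
		return false;

	for (; *name; name++) {
		/* ':' is refused because lookups treat it as a separator. */
		if (*name == '/' || *name == ':' ||
		    isspace((unsigned char)*name))
			return false;
	}
	return true;
}
```

Under these assumptions a name such as "eth0" passes, while "eth0:1" is refused at creation time so that later lookups cannot misinterpret it.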
2015-02-21Merge tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+4
Pull more NFS client updates from Trond Myklebust: "Highlights include: - Fix a use-after-free in decode_cb_sequence_args() - Fix a compile error when #undef CONFIG_PROC_FS - NFSv4.1 backchannel spinlocking issue - Cleanups in the NFS unstable write code requested by Linus - NFSv4.1 fix issues when the server denies our backchannel request - Cleanups in create_session and bind_conn_to_session" * tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Clean up bind_conn_to_session NFSv4.1: Always set up a forward channel when binding the session NFSv4.1: Don't set up a backchannel if the server didn't agree to do so NFSv4.1: Clean up create_session pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit NFSv4: Kill unused nfs_inode->delegation_state field NFS: struct nfs_commit_info.lock must always point to inode->i_lock nfs: Can call nfs_clear_page_commit() instead nfs: Provide and use helper functions for marking a page as unstable SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lock SUNRPC: Fix a compile error when #undef CONFIG_PROC_FS NFSv4.1: Convert open-coded array allocation calls to kmalloc_array() NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args
2015-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller4-18/+28
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains updates for your net tree, they are: 1) Fix removal of destination in IPVS when the new mixed family support is used, from Alexey Andriyanov via Simon Horman. 2) Fix module refcount underflow in nft_compat when reusing a match / target. 3) Fix iptables-restore when the recent match is used with a new hitcount that exceeds threshold, from Florian Westphal. 4) Fix stack corruption in xt_socket due to using stack storage to save the inner IPv6 header, from Eric Dumazet. I'll follow up soon with another batch with more fixes that are still cooking. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20caif: fix a signedness bug in cfpkt_iterate()Dan Carpenter2-4/+4
The cfpkt_iterate() function can return -EPROTO on error, but the function returns a u16 so the negative value gets truncated to a positive unsigned short. This causes a static checker warning. The only caller which might care is cffrml_receive(), when it's checking the frame checksum. I modified cffrml_receive() so that it never says -EPROTO is a valid checksum. Also this isn't ever going to be inlined so I removed the "inline". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
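The truncation can be reproduced in a few lines of C. The sketch below uses hypothetical function names and a locally defined `SKETCH_EPROTO` (71, the value assumed for Linux); returning a wider signed type is shown as one possible remedy and is not claimed to be the exact change made by the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EPROTO 71	/* assumed errno value, as on Linux */

/* A u16 return type silently wraps the negative error code. */
static uint16_t iterate_u16(int fail)
{
	return fail ? (uint16_t)-SKETCH_EPROTO : 0x1234;
}

/* A signed int return type keeps the error distinguishable. */
static int iterate_int(int fail)
{
	return fail ? -SKETCH_EPROTO : 0x1234;
}
```

In the u16 version the error comes back as 65465, which a caller comparing checksums cannot tell apart from a legitimate 16-bit result.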
2015-02-20ethtool: Add hw-switch-offload to netdev_features_strings.Rami Rosen1-0/+1
Commit aafb3e98b279 (netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads) added a new feature without adding it to the netdev_features_strings array; this patch fixes this. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20sock: sock_dequeue_err_skb() needs hard irq safetyEric Dumazet1-2/+3
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb() from hard IRQ context. Therefore, sock_dequeue_err_skb() needs to block hard irq or corruptions or hangs can happen. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 364a9e93243d1 ("sock: deduplicate errqueue dequeue") Fixes: cb820f8e4b7f7 ("net: Provide a generic socket error queue delivery method for Tx time stamps.") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20openvswitch: Fix net exit.Pravin B Shelar2-2/+45
Open vSwitch allows moving an internal vport to a different namespace while it is still connected to the bridge. But when the namespace is deleted OVS does not detach these vports, which results in a dangling pointer to the netdevice and causes a kernel panic as follows. This issue is fixed by detaching all ovs ports from the deleted namespace at net-exit.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa0aadaa5>] ovs_vport_locate+0x35/0x80 [openvswitch]
Oops: 0000 [#1] SMP
Call Trace:
 [<ffffffffa0aa6391>] lookup_vport+0x21/0xd0 [openvswitch]
 [<ffffffffa0aa65f9>] ovs_vport_cmd_get+0x59/0xf0 [openvswitch]
 [<ffffffff8167e07c>] genl_family_rcv_msg+0x1bc/0x3e0
 [<ffffffff8167e319>] genl_rcv_msg+0x79/0xc0
 [<ffffffff8167d919>] netlink_rcv_skb+0xb9/0xe0
 [<ffffffff8167deac>] genl_rcv+0x2c/0x40
 [<ffffffff8167cffd>] netlink_unicast+0x12d/0x1c0
 [<ffffffff8167d3da>] netlink_sendmsg+0x34a/0x6b0
 [<ffffffff8162e140>] sock_sendmsg+0xa0/0xe0
 [<ffffffff8162e5e8>] ___sys_sendmsg+0x408/0x420
 [<ffffffff8162f541>] __sys_sendmsg+0x51/0x90
 [<ffffffff8162f592>] SyS_sendmsg+0x12/0x20
 [<ffffffff81764ee9>] system_call_fastpath+0x12/0x17

Reported-by: Assaf Muller <amuller@redhat.com>
Fixes: 46df7b81454("openvswitch: Add support for network namespaces.")
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20ematch: Fix auto-loading of ematch modules.Ignacy Gawędzki1-0/+1
In tcf_em_validate(), after calling request_module() to load the kind-specific module, set em->ops to NULL before returning -EAGAIN, so that module_put() is not called again by tcf_em_tree_destroy(). Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20ipv4: ip_check_defrag should correctly check return value of skb_copy_bitsAlexander Drozdov1-1/+1
skb_copy_bits() returns zero on success and a negative value on error, so the condition in ip_check_defrag() needs to be inverted. Fixes: 1bf3751ec90c ("ipv4: ip_check_defrag must not modify skb before unsharing") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds6-162/+54
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...
2015-02-19gen_stats.c: Duplicate xstats buffer for later useIgnacy Gawędzki1-1/+14
The gnet_stats_copy_app() function gets called, more often than not, with its second argument a pointer to an automatic variable in the caller's stack. Therefore, to avoid copying garbage afterwards when calling gnet_stats_finish_copy(), this data is better copied to a dynamically allocated memory that gets freed after use. [xiyou.wangcong@gmail.com: remove a useless kfree()] Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-19Merge branch 'kconfig' of ↵Linus Torvalds3-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig updates from Michal Marek: "Yann E Morin was supposed to take over kconfig maintainership, but this hasn't happened. So I'm sending a few kconfig patches that I collected: - Fix for missing va_end in kconfig - merge_config.sh displays usage if given too few arguments - s/boolean/bool/ in Kconfig files for consistency, with the plan to only support bool in the future" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kconfig: use va_end to match corresponding va_start merge_config.sh: Display usage if given too few arguments kconfig: use bool instead of boolean for type definition attributes
2015-02-19libceph: kfree() in put_osd() shouldn't depend on authorizerIlya Dryomov1-2/+3
a255651d4cad ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19libceph: fix double __remove_osd() problemIlya Dryomov1-8/+18
It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows:

            <osd3 is idle, on the osd lru list>
<con reset - osd3>
con_fault_finish()
  osd_reset()
                              <osdmap - osd3 down>
                              ceph_osdc_handle_map()
                                <takes map_sem>
                                kick_requests()
                                  <takes request_mutex>
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  <releases request_mutex>
                                <releases map_sem>
    <takes map_sem>
    <takes request_mutex>
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safeguard around __remove_osd().

Fixes: http://tracker.ceph.com/issues/8087
Cc: Sage Weil <sage@redhat.com>
Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd()
Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19libceph: tcp_nodelay supportChaitanya Huilgol2-2/+28
TCP_NODELAY socket option set on connection sockets, disables Nagle’s algorithm and improves latency characteristics. tcp_nodelay(default)/notcp_nodelay option flags provided to enable/disable setting the socket option. Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com> [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19libceph: use mon_client.c/put_generic_request() moreIlya Dryomov1-2/+2
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19libceph: nuke pool op infrastructureIlya Dryomov3-148/+3
On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 22 Dec 2014, Ilya Dryomov wrote:
>> Actually, pool op stuff has been unused for over two years - looks like
>> it was added for rbd create_snap and that got ripped out in 2012. It's
>> unlikely we'd ever need to manage pools or snaps from the kernel client
>> so I think it makes sense to nuke it. Sage?
>
> Yep!

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-18Merge tag 'virtio-next-for-linus' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio updates from Rusty Russell: "OK, this has the big virtio 1.0 implementation, as specified by OASIS. On top of that is the major rework of lguest, to use PCI and virtio 1.0, to double-check the implementation. Then come the inevitable fixes and cleanups from that work" * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits) virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice. virtio_net: unconditionally define struct virtio_net_hdr_v1. tools/lguest: don't use legacy definitions for net device in example launcher. virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined. tools/lguest: use common error macros in the example launcher. tools/lguest: give virtqueues names for better error messages tools/lguest: more documentation and checking of virtio 1.0 compliance. lguest: don't look in console features to find emerg_wr. tools/lguest: don't start devices until DRIVER_OK status set. tools/lguest: handle indirect partway through chain. tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: rename virtio_pci_cfg_cap field to match spec. tools/lguest: fix features_accepted logic in example launcher. tools/lguest: handle device reset correctly in example launcher. virtual: Documentation: simplify and generalize paravirt_ops.txt lguest: remove NOTIFY call and eventfd facility. lguest: remove NOTIFY facility from demonstration launcher. lguest: use the PCI console device's emerg_wr for early boot messages. lguest: always put console in PCI slot #1. ...
2015-02-18Merge branch 'cleanups'Trond Myklebust359-12061/+19176
Merge cleanups requested by Linus. * cleanups: (3 commits) pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit nfs: Can call nfs_clear_page_commit() instead nfs: Provide and use helper functions for marking a page as unstable
2015-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds19-48/+173
Pull networking updates from David Miller: 1) Missing netlink attribute validation in nft_lookup, from Patrick McHardy. 2) Restrict ipv6 partial checksum handling to UDP, since that's the only case it works for. From Vlad Yasevich. 3) Clear out silly device table sentinel macros used by SSB and BCMA drivers. From Joe Perches. 4) Make sure the remote checksum code never creates a situation where the remote checksum is applied yet the tunneling metadata describing the remote checksum transformation is still present. Otherwise an external entity might see this and apply the checksum again. From Tom Herbert. 5) Use msecs_to_jiffies() where applicable, from Nicholas Mc Guire. 6) Don't explicitly initialize timer struct fields, use setup_timer() and mod_timer() instead. From Vaishali Thakkar. 7) Don't invoke tg3_halt() without the tp->lock held, from Jun'ichi Nomura. 8) Missing __percpu annotation in ipvlan driver, from Eric Dumazet. 9) Don't potentially perform skb_get() on shared skbs, also from Eric Dumazet. 10) Fix COW'ing of metrics for non-DST_HOST routes in ipv6, from Martin KaFai Lau. 11) Fix merge resolution error between the iov_iter changes in vhost and some bug fixes that occurred at the same time. From Jason Wang. 12) If rtnl_configure_link() fails we have to perform a call to ->dellink() before unregistering the device. From WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits) net: dsa: Set valid phy interface type rtnetlink: call ->dellink on failure when ->newlink exists com20020-pci: add support for eae single card vhost_net: fix wrong iter offset when setting number of buffers net: spelling fixes net/core: Fix warning while make xmldocs caused by dev.c net: phy: micrel: disable NAND-tree for KSZ8021, KSZ8031, KSZ8051, KSZ8081 ipv6: fix ipv6_cow_metrics for non DST_HOST case openvswitch: Fix key serialization. 
r8152: restore hw settings hso: fix rx parsing logic when skb allocation fails tcp: make sure skb is not shared before using skb_get() bridge: netfilter: Move sysctl-specific error code inside #ifdef ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc ipvlan: add a missing __percpu pcpu_stats tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one() bgmac: fix device initialization on Northstar SoCs (condition typo) qlcnic: Delete existing multicast MAC list before adding new net/mlx5_core: Fix configuration of log_uar_page_sz sunvnet: don't change gso data on clones ...
2015-02-17svcrpc: fix memory leak in gssp_accept_sec_context_upcallDavid Ramos1-0/+2
Our UC-KLEE tool found a kernel memory leak of 512 bytes (on x86_64) for each call to gssp_accept_sec_context_upcall() (net/sunrpc/auth_gss/gss_rpc_upcall.c). Since it appears that this call can be triggered by remote connections (at least, from a cursory glance at the call chain), it may be exploitable to cause kernel memory exhaustion. We found the bug in kernel 3.16.3, but it appears to date back to commit 9dfd87da1aeb0fd364167ad199f40fe96a6a87be (2013-08-20). The gssp_accept_sec_context_upcall() function performs a pair of calls to gssp_alloc_receive_pages() and gssp_free_receive_pages(). The first allocates memory for arg->pages. The second then frees the pages pointed to by the arg->pages array, but not the array itself. Reported-by: David A. Ramos <daramos@stanford.edu> Fixes: 9dfd87da1aeb ("rpc: fix huge kmalloc's in gss-proxy") Signed-off-by: David A. Ramos <daramos@stanford.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
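The alloc/free pairing described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: `struct recv_buf`, `recv_buf_alloc()` and `recv_buf_free()` are hypothetical names, and `malloc()`/`free()` stand in for the kernel's page and array allocators. It is not the code from gss_rpc_upcall.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the arg->pages bookkeeping. */
struct recv_buf {
    void **pages;   /* array of buffer pointers */
    size_t npages;  /* number of slots in the array */
};

/* Free every buffer AND the array that holds the pointers. */
void recv_buf_free(struct recv_buf *rb)
{
    assert(rb != NULL);
    for (size_t i = 0; i < rb->npages; i++)
        free(rb->pages[i]);
    free(rb->pages);  /* the step the leaking version left out */
    rb->pages = NULL;
    rb->npages = 0;
}

/* Allocate the pointer array, then one buffer per slot. Returns 0 or -1. */
int recv_buf_alloc(struct recv_buf *rb, size_t npages, size_t page_size)
{
    assert(rb != NULL);
    rb->npages = 0;
    rb->pages = calloc(npages, sizeof(*rb->pages));
    if (!rb->pages)
        return -1;
    rb->npages = npages;
    for (size_t i = 0; i < npages; i++) {
        rb->pages[i] = malloc(page_size);
        if (!rb->pages[i]) {
            recv_buf_free(rb);  /* unwinds partial allocation, array included */
            return -1;
        }
    }
    return 0;
}
```

Resetting `pages` and `npages` after the free is a choice made for the sketch so that a second call to `recv_buf_free()` is harmless.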
2015-02-17net: dsa: Set valid phy interface typeGuenter Roeck1-2/+7
If the phy interface mode is not found in devicetree, or if devicetree is not configured, of_get_phy_mode returns -ENODEV. The current code sets the phy interface mode to the return value from of_get_phy_mode without checking if it is valid. This invalid phy interface mode is passed as parameter to of_phy_connect or to phy_connect_direct. This sets the phy interface mode to the invalid value, which in turn causes problems for any code using phydev->interface. Fixes: b31f65fb4383 ("net: dsa: slave: Fix autoneg for phys on switch MDIO bus") Fixes: 0d8bcdd383b8 ("net: dsa: allow for more complex PHY setups") Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
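The validation this commit message calls for might look roughly like the following userspace sketch. The enum, its values and `pick_phy_mode()` are hypothetical, and falling back to a "not applicable" mode is an assumption about one reasonable policy, not a statement of what the patch does.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical interface modes; the kernel's phy_interface_t has many more. */
enum phy_if_sketch { PHY_IF_NA = 0, PHY_IF_MII, PHY_IF_GMII, PHY_IF_RGMII };

/*
 * lookup_result models a helper that returns either a valid mode (>= 0)
 * or a negative errno such as -ENODEV. Never store the errno as a mode.
 */
int pick_phy_mode(int lookup_result)
{
    assert(lookup_result <= PHY_IF_RGMII);
    if (lookup_result < 0)
        return PHY_IF_NA;  /* assumed fallback when no mode is configured */
    return lookup_result;
}
```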
2015-02-16netfilter: xt_socket: fix a stack corruption bugEric Dumazet1-9/+12
As soon as extract_icmp6_fields() returns, its local storage (automatic variables) is deallocated and can be overwritten. Let's add an additional parameter to make sure storage is valid long enough. While we are at it, add some const qualifiers. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: b64c9256a9b76 ("tproxy: added IPv6 support to the socket match") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
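The pattern of passing caller-owned storage, so that a returned pointer never refers to a dead stack frame, can be sketched as below. `struct addr6` and `extract_addr()` are hypothetical and much simpler than extract_icmp6_fields(); the sketch only shows the lifetime rule.

```c
#include <assert.h>
#include <string.h>

struct addr6 { unsigned char b[16]; };

/*
 * Buggy shape (not shown): copy into a local struct and hand back its
 * address, which dangles once the function returns.
 *
 * Safer shape: the caller supplies 'storage', which outlives the call,
 * and '*out' points into it. The input packet is only read, hence const.
 */
int extract_addr(const unsigned char *pkt, size_t len,
                 const struct addr6 **out, struct addr6 *storage)
{
    assert(pkt != NULL && out != NULL && storage != NULL);
    if (len < sizeof(storage->b))
        return -1;  /* packet too short to hold an address */
    memcpy(storage->b, pkt, sizeof(storage->b));
    *out = storage;
    return 0;
}
```

A caller would declare `struct addr6 storage;` in its own frame and keep it alive for as long as it uses the returned pointer.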
2015-02-16netfilter: xt_recent: don't reject rule if new hitcount exceeds table maxFlorian Westphal1-6/+5
given: -A INPUT -m recent --update --seconds 30 --hitcount 4 and iptables-save > foo then iptables-restore < foo will fail with: kernel: xt_recent: hitcount (4) is larger than packets to be remembered (4) for table DEFAULT Even when the check is fixed, the restore won't work if the hitcount is increased to e.g. 6, since by the time checkentry runs it will find the 'old' incarnation of the table. We can avoid this by increasing the maximum threshold silently; we only have to rm all the current entries of the table (these entries would not have enough room to handle the increased hitcount). This even makes (not-very-useful) -A INPUT -m recent --update --seconds 30 --hitcount 4 -A INPUT -m recent --update --seconds 30 --hitcount 42 work. Fixes: abc86d0f99242b7f142b (netfilter: xt_recent: relax ip_pkt_list_tot restrictions) Tracked-down-by: Chris Vine <chris@cvine.freeserve.co.uk> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-16netfilter: nft_compat: fix module refcount underflowPablo Neira Ayuso1-2/+10
Feb 12 18:20:42 nfdev kernel: ------------[ cut here ]------------ Feb 12 18:20:42 nfdev kernel: WARNING: CPU: 4 PID: 4359 at kernel/module.c:963 module_put+0x9b/0xba() Feb 12 18:20:42 nfdev kernel: CPU: 4 PID: 4359 Comm: ebtables-compat Tainted: G W 3.19.0-rc6+ #43 [...] Feb 12 18:20:42 nfdev kernel: Call Trace: Feb 12 18:20:42 nfdev kernel: [<ffffffff815fd911>] dump_stack+0x4c/0x65 Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e6f7>] warn_slowpath_common+0x9c/0xb6 Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] ? module_put+0x9b/0xba Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e726>] warn_slowpath_null+0x15/0x17 Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] module_put+0x9b/0xba Feb 12 18:20:42 nfdev kernel: [<ffffffff813ecf7c>] nft_match_destroy+0x45/0x4c Feb 12 18:20:42 nfdev kernel: [<ffffffff813e683f>] nf_tables_rule_destroy+0x28/0x70 Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
2015-02-15rtnetlink: call ->dellink on failure when ->newlink existsWANG Cong1-1/+8
Ignacy reported that when eth0 is down and we add a vlan device on top of it like: ip link add link eth0 name eth0.1 up type vlan id 1 we will get a refcount leak: unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2 The problem is when rtnl_configure_link() fails in rtnl_newlink(), we simply call unregister_device(), but for a stacked device like vlan, we almost do nothing when we unregister the upper device, more work is done when we unregister the lower device, so call its ->dellink(). Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14net: spelling fixesStephen Hemminger3-3/+3
Spelling errors caught by codespell. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14net/core: Fix warning while make xmldocs caused by dev.cMasanari Iida1-1/+1
This patch fixes the following warning while running make xmldocs. Warning(.//net/core/dev.c:5345): No description found for parameter 'bonding_info' Warning(.//net/core/dev.c:5345): Excess function parameter 'netdev_bonding_info' description in 'netdev_bonding_info_change' This warning started to appear after the following patch was added to Linus's tree during the merge period. commit 61bd3857ff2c7daf756d49b41e6277bbdaa8f789 net/core: Add event for a change in slave state Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14ipv6: fix ipv6_cow_metrics for non DST_HOST caseMartin KaFai Lau1-1/+1
ipv6_cow_metrics() currently assumes only DST_HOST routes require dynamic metrics allocation from inetpeer. The assumption breaks when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric. Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set() is called after the route is created. This patch creates the metrics array (by calling dst_cow_metrics_generic) in ipv6_cow_metrics(). Test: radvd.conf: interface qemubr0 { AdvLinkMTU 1300; AdvCurHopLimit 30; prefix fd00:face:face:face::/64 { AdvOnLink on; AdvAutonomous on; AdvRouterAddr off; }; }; Before: [root@qemu1 ~]# ip -6 r show | egrep -v unreachable fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec fe80::/64 dev eth0 proto kernel metric 256 default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec After: [root@qemu1 ~]# ip -6 r show | egrep -v unreachable fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec mtu 1300 fe80::/64 dev eth0 proto kernel metric 256 mtu 1300 default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec mtu 1300 hoplimit 30 Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.) Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14openvswitch: Fix key serialization.Pravin B Shelar1-1/+1
Fix typo where mask is used rather than key. Fixes: 74ed7ab9264("openvswitch: Add support for unique flow IDs.") Reported-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-13net: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo2-22/+8
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lockChuck Lever1-1/+4
Other code that accesses rq_bc_pa_list holds xprt->bc_pa_lock. xprt_complete_bc_request() should do the same. Fixes: 2ea24497a1b3 ("SUNRPC: RPC callbacks may be split . . .") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-13tcp: make sure skb is not shared before using skb_get()Eric Dumazet1-8/+24
IPv6 can keep a copy of the SYN message using skb_get() in tcp_v6_conn_request() so that the caller won't free the skb when calling kfree_skb() later. Therefore TCP fast open has to clone the skb it is queuing in child->sk_receive_queue, as all skbs consumed from receive_queue are freed using __kfree_skb() (i.e. assuming skb->users == 1) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-12memcg: cleanup static keys decrementVladimir Davydov1-0/+4
Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of wrapper functions. Although this patch moves static keys decrement from __mem_cgroup_free to mem_cgroup_css_free, it does not introduce any functional changes, because the keys are incremented on setting the limit (tcp or kmem), which can only happen after successful mem_cgroup_css_online. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Glauber Costa <glommer@parallels.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linuxLinus Torvalds6-151/+209
Pull nfsd updates from Bruce Fields: "The main change is the pNFS block server support from Christoph, which allows an NFS client connected to shared disk to do block IO to the shared disk in place of NFS reads and writes. This also requires xfs patches, which should arrive soon through the xfs tree, barring unexpected problems. Support for other filesystems is also possible if there's interest. Thanks also to Chuck Lever for continuing work to get NFS/RDMA into shape" * 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd: default NFSv4.2 to on nfsd: pNFS block layout driver exportfs: add methods for block layout exports nfsd: add trace events nfsd: update documentation for pNFS support nfsd: implement pNFS layout recalls nfsd: implement pNFS operations nfsd: make find_any_file available outside nfs4state.c nfsd: make find/get/put file available outside nfs4state.c nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c nfsd: add fh_fsid_match helper nfsd: move nfsd_fh_match to nfsfh.h fs: add FL_LAYOUT lease type fs: track fl_owner for leases nfs: add LAYOUT_TYPE_MAX enum value nfsd: factor out a helper to decode nfstime4 values sunrpc/lockd: fix references to the BKL nfsd: fix year-2038 nfs4 state problem svcrdma: Handle additional inline content svcrdma: Move read list XDR round-up logic ...
2015-02-12bridge: netfilter: Move sysctl-specific error code inside #ifdefGeert Uytterhoeven1-5/+2
If CONFIG_SYSCTL=n: net/bridge/br_netfilter.c: In function ‘br_netfilter_init’: net/bridge/br_netfilter.c:996: warning: label ‘err1’ defined but not used Move the label and the code after it inside the existing #ifdef to get rid of the warning. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-12ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gcJan Stancek1-2/+2
Use spin_lock_bh in ip6_fl_purge() to prevent the following potential deadlock scenario between ip6_fl_purge() and the ip6_fl_gc() timer. ================================= [ INFO: inconsistent lock state ] 3.19.0 #1 Not tainted --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (ip6_fl_lock){+.?...}, at: [<ffffffff8171155d>] ip6_fl_gc+0x2d/0x180 {SOFTIRQ-ON-W} state was registered at: [<ffffffff810ee9a0>] __lock_acquire+0x4a0/0x10b0 [<ffffffff810efd54>] lock_acquire+0xc4/0x2b0 [<ffffffff81751d2d>] _raw_spin_lock+0x3d/0x80 [<ffffffff81711798>] ip6_flowlabel_net_exit+0x28/0x110 [<ffffffff815f9759>] ops_exit_list.isra.1+0x39/0x60 [<ffffffff815fa320>] cleanup_net+0x100/0x1e0 [<ffffffff810ad80a>] process_one_work+0x20a/0x830 [<ffffffff810adf4b>] worker_thread+0x11b/0x460 [<ffffffff810b42f4>] kthread+0x104/0x120 [<ffffffff81752bfc>] ret_from_fork+0x7c/0xb0 irq event stamp: 84640 hardirqs last enabled at (84640): [<ffffffff81752080>] _raw_spin_unlock_irq+0x30/0x50 hardirqs last disabled at (84639): [<ffffffff81751eff>] _raw_spin_lock_irq+0x1f/0x80 softirqs last enabled at (84628): [<ffffffff81091ad1>] _local_bh_enable+0x21/0x50 softirqs last disabled at (84629): [<ffffffff81093b7d>] irq_exit+0x12d/0x150 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(ip6_fl_lock); <Interrupt> lock(ip6_fl_lock); *** DEADLOCK *** Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-12xfrm: release dst_orig in case of error in xfrm_lookup()huaibin Wang1-6/+6
dst_orig should be released on error. Function like __xfrm_route_forward() expects that behavior. Since a recent commit, xfrm_lookup() may also be called by xfrm_lookup_route(), which expects the opposite. Let's introduce a new flag (XFRM_LOOKUP_KEEP_DST_REF) to tell what should be done in case of error. Fixes: f92ee61982d("xfrm: Generate blackhole routes only from route lookup functions") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-02-11Merge branch 'next' of ↵Linus Torvalds2-26/+40
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "Highlights: - Smack adds secmark support for Netfilter - /proc/keys is now mandatory if CONFIG_KEYS=y - TPM gets its own device class - Added TPM 2.0 support - Smack file hook rework (all Smack users should review this!)" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits) cipso: don't use IPCB() to locate the CIPSO IP option SELinux: fix error code in policydb_init() selinux: add security in-core xattr support for pstore and debugfs selinux: quiet the filesystem labeling behavior message selinux: Remove unused function avc_sidcmp() ima: /proc/keys is now mandatory Smack: Repair netfilter dependency X.509: silence asn1 compiler debug output X.509: shut up about included cert for silent build KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y MAINTAINERS: email update tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device smack: fix possible use after frees in task_security() callers smack: Add missing logging in bidirectional UDS connect check Smack: secmark support for netfilter Smack: Rework file hooks tpm: fix format string error in tpm-chip.c char/tpm/tpm_crb: fix build error smack: Fix a bidirectional UDS connect check typo smack: introduce a special case for tmpfs in smack_d_instantiate() ...
2015-02-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-5/+3
Merge second set of updates from Andrew Morton: "More of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() vmstat: Reduce time interval to stat update on idle cpu mm/page_owner.c: remove unnecessary stack_trace field Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files mm: incorporate read-only pages into transparent huge pages vmstat: do not use deferrable delayed work for vmstat_update mm: more aggressive page stealing for UNMOVABLE allocations mm: always steal split buddies in fallback allocations mm: when stealing freepages, also take pages created by splitting buddy page mincore: apply page table walker on do_mincore() mm: /proc/pid/clear_refs: avoid split_huge_page() mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) mempolicy: apply page table walker on queue_pages_range() arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() memcg: cleanup preparation for page table walk numa_maps: remove numa_maps->vma numa_maps: fix typo in gather_hugetbl_stats pagemap: use walk->vma instead of calling find_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() ...
2015-02-11Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds10-517/+628
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0) makes for a significant performance improvement in metadata intensive workloads. - Full support for the pNFS "flexible files" layout type - Further RPC/RDMA client improvements from Chuck Bugfixes: - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, - Stable fix: Ensure we reference the inode for return-on-close in delegreturn - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the same source address/port combination during a disconnect/ reconnect event. This is a requirement imposed by most NFSv3 server duplicate reply cache implementations. Optimisations: - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT Other: - Add Anna Schumaker as co-maintainer for the NFS client" * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits) SUNRPC: Cleanup to remove xs_tcp_close() pnfs: delete an unintended goto pnfs/flexfiles: Do not dprintk after the free SUNRPC: Fix stupid typo in xs_sock_set_reuseport SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG SUNRPC: Handle connection reset more efficiently. 
SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT SUNRPC: Remove TCP socket linger code SUNRPC: Remove TCP client connection reset hack SUNRPC: TCP/UDP always close the old socket before reconnecting SUNRPC: Add helpers to prevent socket create from racing SUNRPC: Ensure xs_reset_transport() resets the close connection flags SUNRPC: Do not clear the source port in xs_reset_transport SUNRPC: Handle EADDRINUSE on connect SUNRPC: Set SO_REUSEPORT socket option for TCP connections NFSv4.1: Fix pnfs_put_lseg races NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS ...
2015-02-11mm: gup: use get_user_pages_unlockedAndrea Arcangeli1-4/+2
This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: page_counter: pull "-1" handling out of page_counter_memparse()Johannes Weiner1-1/+1
The unified hierarchy interface for memory cgroups will no longer use "-1" to mean maximum possible resource value. In preparation for this, make the string an argument and let the caller supply it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11gue: Use checksum partial with remote checksum offloadTom Herbert1-6/+22
Change remote checksum handling to set checksum partial as default behavior. Added an iflink parameter to configure not using checksum partial (calling csum_partial to update checksum). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offloadTom Herbert2-2/+3
This patch adds infrastructure so that remote checksum offload can set CHECKSUM_PARTIAL instead of calling csum_partial and writing the modified checksum field. Add skb_remcsum_adjust_partial function to set an skb for using CHECKSUM_PARTIAL with remote checksum offload. Changed skb_remcsum_process and skb_gro_remcsum_process to take a boolean argument to indicate if checksum partial can be set or the checksum needs to be modified using the normal algorithm. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11udp: Set SKB_GSO_UDP_TUNNEL* in UDP GRO pathTom Herbert2-2/+17
Properly set GSO types and skb->encapsulation in the UDP tunnel GRO complete so that packets are properly represented for GSO. This sets SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if the remote checksum option was processed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11net: Fix remcsum in GRO path to not change packetTom Herbert1-10/+10
Remote checksum offload processing is currently the same for both the GRO and non-GRO path. When the remote checksum offload option is encountered, the checksum field referred to is modified in the packet. So in the GRO case, the packet is modified in the GRO path and then the operation is skipped when the packet goes through the normal path based on skb->remcsum_offload. There is a problem in that the packet may be modified in the GRO path, but then forwarded off host still containing the remote checksum option. A remote host will again perform RCO but now the checksum verification will fail since GRO RCO already modified the checksum. To fix this, we ensure that GRO restores a packet to its original state before returning. In this model, when GRO processes a remote checksum option it still changes the checksum per the algorithm but on return from lower layer processing the checksum is restored to its original value. In this patch we define a gro_remcsum structure which is passed to skb_gro_remcsum_process to save offset and delta for the checksum being changed. After lower layer processing, skb_gro_remcsum_cleanup is called to restore the checksum before returning from GRO. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
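The save-then-restore idea behind this commit can be sketched in userspace C. Everything here is hypothetical: `struct remcsum_state`, `remcsum_apply()` and `remcsum_cleanup()` are illustrative names, and plain wrap-around 16-bit addition stands in for the kernel's ones-complement checksum arithmetic. The sketch only shows that recording the offset and delta is enough to undo the change.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What was changed and where, so it can be undone later. */
struct remcsum_state {
    size_t   offset;  /* byte offset of the 16-bit field in the packet */
    uint16_t delta;   /* amount added to the field */
    int      active;  /* nonzero while a change is outstanding */
};

/* Adjust the field in place and remember how to revert it. */
void remcsum_apply(unsigned char *pkt, size_t len, size_t offset,
                   uint16_t delta, struct remcsum_state *st)
{
    uint16_t v;

    assert(offset + sizeof(v) <= len);
    memcpy(&v, pkt + offset, sizeof(v));
    v = (uint16_t)(v + delta);  /* simplified: wraps modulo 2^16 */
    memcpy(pkt + offset, &v, sizeof(v));
    st->offset = offset;
    st->delta = delta;
    st->active = 1;
}

/* Restore the original field value before the packet leaves this layer. */
void remcsum_cleanup(unsigned char *pkt, struct remcsum_state *st)
{
    uint16_t v;

    if (!st->active)
        return;  /* nothing was modified */
    memcpy(&v, pkt + st->offset, sizeof(v));
    v = (uint16_t)(v - st->delta);
    memcpy(pkt + st->offset, &v, sizeof(v));
    st->active = 0;
}
```

Keeping the state in a small caller-owned struct means the packet itself carries no trace of the temporary change once cleanup has run.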
2015-02-11openvswitch: Add missing initialization in validate_and_copy_set_tun()Geert Uytterhoeven1-1/+1
net/openvswitch/flow_netlink.c: In function ‘validate_and_copy_set_tun’: net/openvswitch/flow_netlink.c:1749: warning: ‘err’ may be used uninitialized in this function If ipv4_tun_from_nlattr() returns a different positive value than OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS, err will be uninitialized, and validate_and_copy_set_tun() may return an undefined value instead of a zero success indicator. Initialize err to zero to fix this. Fixes: 1dd144cf5b4b47e1 ("openvswitch: Support VXLAN Group Policy extension") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11openvswitch: Reset key metadata for packet execution.Pravin B Shelar1-0/+2
The userspace packet execute command passes down the flow key for a given packet, but userspace can skip parameters whose value is zero. The kernel therefore needs to initialize the key metadata to zero. Fixes: 0714812134 ("openvswitch: Eliminate memset() from flow_extract.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11rds: rds_cong_queue_updates needs to defer the congestion update transmissionSowmini Varadhan1-1/+15
When the RDS transport is TCP, we cannot inline the call to rds_send_xmit from rds_cong_queue_update because (a) we are already holding the sock_lock in the recv path, and will deadlock when tcp_setsockopt/tcp_sendmsg try to get the sock lock (b) cong_queue_update does an irqsave on the rds_cong_lock, and this will trigger warnings (for a good reason) from functions called out of sock_lock. This patch reverts the change introduced by 2fa57129d ("RDS: Bypass workqueue when queueing cong updates"). The patch has been verified for both RDS/TCP as well as RDS/RDMA to ensure that there are no regressions for either transport: - for verification of RDS/TCP a client-server unit-test was used, with the server blocked in gdb and thus unable to drain its rcvbuf, eventually triggering a RDS congestion update. - for RDS/RDMA, the standard IB regression tests were used Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11ipv6: Partial checksum only UDP packetsVlad Yasevich1-1/+1
ip6_append_data is used by other protocols and some of them can't be partially checksummed. Only partially checksum UDP protocol. Fixes: 32dce968dd987a (ipv6: Allow for partial checksums on non-ufo packets) Reported-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-6/+58
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains two small Netfilter updates for your net-next tree, they are: 1) Add ebtables support to nft_compat, from Arturo Borrero. 2) Fix missing validation of the SET_ID attribute in the lookup expressions, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11cipso: don't use IPCB() to locate the CIPSO IP optionPaul Moore2-26/+40
Using the IPCB() macro to get the IPv4 options is convenient, but unfortunately NetLabel often needs to examine the CIPSO option outside of the scope of the IP layer in the stack. While historically IPCB() worked above the IP layer, due to the inclusion of the inet_skb_param struct at the head of the {tcp,udp}_skb_cb structs, recent commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses") reordered the tcp_skb_cb struct and invalidated this IPCB() trick. This patch fixes the problem by creating a new function, cipso_v4_optptr(), which locates the CIPSO option inside the IP header without calling IPCB(). Unfortunately, this isn't as fast as a simple lookup so some additional tweaks were made to limit the use of this new function. Cc: <stable@vger.kernel.org> # 3.18 Reported-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2015-02-10SUNRPC: Cleanup to remove xs_tcp_close()Trond Myklebust1-6/+1
xs_tcp_close() is now just a call to xs_tcp_shutdown(), so remove it, and replace the entry in xs_tcp_ops. Suggested-by: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09ipv4: Namespecify TCP PMTU mechanismFan Du4-21/+23
Packetization Layer Path MTU Discovery works separately from Path MTU Discovery at the IP level, and different net namespaces have different requirements on which one to choose. For example, a virtualized container instance would require TCP PMTU to probe a usable effective MTU for the underlying tunnel, while the host would employ classical ICMP-based PMTU. Hence make the TCP PMTU mechanism per net namespace to decouple the two. Furthermore, the probe base MSS should also be configured separately for each namespace. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-54/+71
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09SUNRPC: Fix stupid typo in xs_sock_set_reuseportTrond Myklebust1-2/+3
Yes, kernel_setsockopt() hates you for using a char argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09tcp: don't include Fast Open option in SYN-ACK on pure SYN-dataYuchung Cheng1-5/+8
If a server has enabled Fast Open and it receives a pure SYN-data packet (without a Fast Open option), it won't accept the data but it incorrectly returns a SYN-ACK with a Fast Open cookie and also increments the SNMP stat LINUX_MIB_TCPFASTOPENPASSIVEFAIL. This patch makes the server include a Fast Open cookie in SYN-ACK only if the SYN has some Fast Open option (i.e., when client requests or presents a cookie). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is setThomas Graf1-1/+1
This avoids setting TUNNEL_VXLAN_OPT for VXLAN frames which don't have any GBP metadata set. It is not invalid to set it but unnecessary. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipv6: Make __ipv6_select_ident staticVlad Yasevich1-1/+2
Make __ipv6_select_ident() static as it isn't used outside the file. Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipv6: Fix fragment id assignment on LE arches.Vlad Yasevich1-1/+1
Recent commit: 0508c07f5e0c94f38afd5434e8b2a55b84553077 Author: Vlad Yasevich <vyasevich@gmail.com> Date: Tue Feb 3 16:36:15 2015 -0500 ipv6: Select fragment id during UFO segmentation if not set. Introduced a bug on LE in how ipv6 fragment id is assigned. This was caught by the nightly sparse check. Resolve the following sparse error: net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment (different base types) net/ipv6/output_core.c:57:38: expected restricted __be32 [usertype] ip6_frag_id net/ipv6/output_core.c:57:38: got unsigned int [unsigned] [assigned] [usertype] id Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
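The class of bug sparse reports here, a host-order integer stored into a big-endian field, can be illustrated in userspace with the POSIX `htonl()`/`ntohl()` helpers. `struct frag_hdr_sketch` and the two accessors are hypothetical; this is a sketch of the byte-order rule, not the code in net/ipv6/output_core.c.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Field that is defined to hold a network-order (big-endian) value. */
struct frag_hdr_sketch {
    uint32_t ident_be;
};

/* Convert explicitly on the way in; a raw assignment is wrong on LE hosts. */
void frag_set_id(struct frag_hdr_sketch *fh, uint32_t id)
{
    assert(fh != NULL);
    fh->ident_be = htonl(id);
}

/* Convert back to host order on the way out. */
uint32_t frag_get_id(const struct frag_hdr_sketch *fh)
{
    assert(fh != NULL);
    return ntohl(fh->ident_be);
}
```

With the conversion in place, the bytes in memory are the same on little-endian and big-endian hosts, which is what a wire format requires.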
2015-02-09bridge: Fix inability to add non-vlan fdb entryToshiaki Makita1-7/+5
Bridge's default_pvid adds a vid by default, by which we cannot add a non-vlan fdb entry by default, because br_fdb_add() adds fdb entries for all vlans instead of a non-vlan one when any vlan is configured. # ip link add br0 type bridge # ip link set eth0 master br0 # bridge fdb add 12:34:56:78:90:ab dev eth0 master temp # bridge fdb show brport eth0 | grep 12:34:56:78:90:ab 12:34:56:78:90:ab dev eth0 vlan 1 static We expect a non-vlan fdb entry as well as vlan 1: 12:34:56:78:90:ab dev eth0 static To fix this, we need to insert a non-vlan fdb entry if vlan is not specified, even when any vlan is configured. Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09net: dsa: Remove redundant phy_attach()Andrew Lunn1-13/+0
dsa_slave_phy_setup() finds the phy for the port via device tree and using of_phy_connect(), or it uses the fall back of taking a phy from the switch internal mdio bus and calling phy_connect_direct(). Either way, if a phy is found, phy_attach_direct() is called to attach the phy to the slave device. In dsa_slave_create(), a second call to phy_attach() is made. This results in the warning "PHY already attached". Remove this second, redundant attaching of the phy. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: remove tipc_snprintfRichard Alpe4-60/+4
tipc_snprintf() was heavily utilized by the old netlink API which no longer exists (now netlink compat). In this patch we swap tipc_snprintf() to the identical scnprintf() in the only remaining occurrence. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: nl compat add noop and remove legacy nl frameworkRichard Alpe12-301/+13
Add TIPC_CMD_NOOP to compat layer and remove the old framework. All legacy nl commands are now converted to the compat layer in netlink_compat.c. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl stats show to nl compatRichard Alpe2-35/+15
Convert TIPC_CMD_SHOW_STATS to compat layer. This command does not have any counterpart in the new API, meaning it now solely exists as a function in the compat layer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl net id get to nl compatRichard Alpe2-23/+18
Convert TIPC_CMD_GET_NETID to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl net id set to nl compatRichard Alpe2-26/+14
Convert TIPC_CMD_SET_NETID to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl node addr set to nl compatRichard Alpe3-27/+27
Convert TIPC_CMD_SET_NODE_ADDR to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl node dump to nl compatRichard Alpe4-60/+22
Convert TIPC_CMD_GET_NODES to compat dumpit and remove global node counter solely used by the legacy API. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl media dump to nl compatRichard Alpe4-24/+19
Convert TIPC_CMD_GET_MEDIA_NAMES to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl socket dump to nl compatRichard Alpe4-89/+111
Convert socket (port) listing to compat dumpit call. If a socket (port) has publications, a second dumpit call is issued to collect them and format them into the legacy buffer before continuing to process the sockets (ports). Command converted in this patch: TIPC_CMD_SHOW_PORTS Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl name table dump to nl compatRichard Alpe4-190/+101
Add functionality for printing a dump header and convert TIPC_CMD_SHOW_NAME_TABLE to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link stat reset to nl compatRichard Alpe4-41/+27
Convert TIPC_CMD_RESET_LINK_STATS to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link prop set to nl compatRichard Alpe4-153/+48
Convert setting of link properties to compat doit calls. Commands converted in this patch: TIPC_CMD_SET_LINK_TOL TIPC_CMD_SET_LINK_PRI TIPC_CMD_SET_LINK_WINDOW Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link dump to nl compatRichard Alpe4-79/+23
Convert TIPC_CMD_GET_LINKS to compat dumpit and remove global link counter solely used by the legacy API. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link stat to nl compatRichard Alpe6-192/+205
Add functionality for safely appending string data to a TLV without keeping write count in the caller. Convert TIPC_CMD_SHOW_LINK_STATS to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
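Appending string data without the caller tracking a write count can be sketched in plain C. This is an illustration only, not the TIPC helper: `struct tlv_buf_sketch` and `tlv_append()` are hypothetical names, and the fixed 64-byte buffer is an assumption made to keep the example small. The buffer remembers how much it holds, and each append is clamped to the remaining room, in the spirit of `scnprintf()`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size string payload; the real TLV layout differs. */
struct tlv_buf_sketch {
    char data[64];
    size_t used; /* bytes of text stored, not counting the trailing NUL */
};

/* Append formatted text. Returns the number of bytes actually stored,
 * which is smaller than the formatted length when the text is truncated. */
static size_t tlv_append(struct tlv_buf_sketch *tb, const char *fmt, ...)
{
    size_t room;
    va_list ap;
    int n;

    assert(tb != NULL && tb->used < sizeof(tb->data));
    room = sizeof(tb->data) - tb->used; /* always leaves space for the NUL */

    va_start(ap, fmt);
    n = vsnprintf(tb->data + tb->used, room, fmt, ap);
    va_end(ap);

    if (n < 0)
        return 0;
    if ((size_t)n >= room) /* output was truncated to room - 1 bytes */
        n = (int)(room - 1);
    tb->used += (size_t)n;
    return (size_t)n;
}
```

Callers chain `tlv_append()` calls and never compute offsets themselves; once the buffer is full, further appends store nothing and return 0.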
2015-02-09tipc: convert legacy nl bearer enable/disable to nl compatRichard Alpe4-57/+149
Introduce a framework for transcoding legacy nl actions into action (.doit) calls from the new nl API. This is done by converting the incoming TLV data into netlink data with nested netlink attributes. Unfortunately, due to the randomness of the legacy API, we can't do this generically, so each legacy netlink command requires a specific transcoding recipe. In this case for bearer enable and bearer disable. Convert TIPC_CMD_ENABLE_BEARER and TIPC_CMD_DISABLE_BEARER into doit compat calls. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl bearer dump to nl compatRichard Alpe4-34/+273
Introduce a framework for dumping netlink data from the new netlink API and formatting it to the old legacy API format. This is done by looping the dump data and calling a format handler for each entity, in this case a bearer. We dump until either all data is dumped or we reach the limited buffer size of the legacy API. Remember, the legacy API doesn't scale. In this commit we convert TIPC_CMD_GET_BEARER_NAMES to use the compat layer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: move and rename the legacy nl api to "nl compat"Richard Alpe13-76/+130
The new netlink API is no longer "v2" but rather the standard API and the legacy API is now "nl compat". We split them into separate start/stop and put them in different files in order to further distinguish them. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUGTrond Myklebust1-2/+2
Now that the linger code is gone, the xs_tcp_fin_timeout variable has no real function. Keep it for now, since it is part of the /proc interface, but only define it if that /proc interface is enabled. Suggested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Handle connection reset more efficiently.Trond Myklebust1-16/+18
If the connection reset is due to an active call on our side, then the state change is sometimes not reported. Catch those instances using xs_error_report() instead. Also remove the xs_tcp_shutdown() call in xs_tcp_send_request() as the change in behaviour makes it redundant. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flagTrond Myklebust2-2/+0
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_releaseTrond Myklebust1-5/+1
Use of socket shutdown() means that we monitor the shutdown process through the xs_tcp_state_change() callback, so it is preferable to a full close in all cases unless we're destroying the transport. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connectionTrond Myklebust1-2/+2
The previous behaviour left the connection half-open in order to try to scrape the last replies from the socket. Now that we have more reliable reconnection, change the behaviour to close down the socket faster. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORTTrond Myklebust1-3/+0
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Remove TCP socket linger codeTrond Myklebust1-35/+0
Now that we no longer use the partial shutdown code when closing the socket, we no longer need to worry about the TCP linger2 state. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09xfrm: Fix local error reporting crash with interfamily tunnelsSteffen Klassert2-2/+2
We set the outer mode protocol too early. As a result, the local error handler might dispatch to the wrong address family and report the error to a wrong socket type. We fix this by setting the outer protocol to the skb after we accessed the inner mode for the last time, right before we do the actual encapsulation where we finally switch to the outer mode. Reported-by: Chris Ruehl <chris.ruehl@gtsys.com.hk> Tested-by: Chris Ruehl <chris.ruehl@gtsys.com.hk> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-02-08net:rfs: adjust table size checkingEric Dumazet1-1/+1
Make sure root user does not try something stupid. Also make sure mask field in struct rps_sock_flow_table does not share a cache line with the potentially often dirtied flow table. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 567e4b79731c ("net: rfs: add hash collision detection") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipvs: fix inability to remove a mixed-family RSAlexey Andriyanov1-1/+1
The current code prevents any operation with a mixed-family dest unless IP_VS_CONN_F_TUNNEL flag is set. The problem is that it's impossible for the client to follow this rule, because ip_vs_genl_parse_dest does not even read the destination conn_flags when cmd = IPVS_CMD_DEL_DEST (need_full_dest = 0). Also, not every client can pass this flag when removing a dest. ipvsadm, for example, does not support the "-i" command line option together with the "-d" option. This change disables any checks for mixed-family on IPVS_CMD_DEL_DEST command. Signed-off-by: Alexey Andriyanov <alan@al-an.info> Fixes: bc18d37f676f ("ipvs: Allow heterogeneous pools now that we support them") Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-08SUNRPC: Remove TCP client connection reset hackTrond Myklebust1-66/+1
Instead we rely on SO_REUSEPORT to provide the reconnection semantics that we need for NFSv2/v3. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: TCP/UDP always close the old socket before reconnectingTrond Myklebust1-2/+3
It is not safe to call xs_reset_transport() from inside xs_udp_setup_socket() or xs_tcp_setup_socket(), since they do not own the correct locks. Instead, do it in xs_connect(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Add helpers to prevent socket create from racingTrond Myklebust2-6/+38
The socket lock is currently held by the task that is requesting the connection be established. While that is efficient in the case where the connection happens quickly, it is racy in the case where it doesn't. What we really want is for the connect helper to be able to block access to the socket while it is being set up. This patch does so by arranging to transfer the socket lock from the task that is requesting the connect attempt, and then releasing that lock once everything is done. This scheme also gives us automatic protection against collisions with the RPC close code, so we can kill the cancel_delayed_work_sync() call in xs_close(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Ensure xs_reset_transport() resets the close connection flagsTrond Myklebust1-16/+13
Otherwise, we may end up looping. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Do not clear the source port in xs_reset_transportTrond Myklebust1-2/+0
Now that we can reuse bound ports after a close, we never really want to clear the transport's source port after it has been set. Doing so really messes up the NFSv3 DRC on the server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Handle EADDRINUSE on connectTrond Myklebust2-0/+5
Now that we're setting SO_REUSEPORT, we still need to handle the case where a connect() is attempted, but the old socket is still lingering. Essentially, all we want to do here is handle the error by waiting a few seconds and then retrying. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08net: rfs: add hash collision detectionEric Dumazet3-24/+28
Receive Flow Steering is a nice solution but suffers from hash collisions when a mix of connected and unconnected traffic is received on the host, when the flow hash table is populated. Also, clearing the flow in inet_release() makes RFS not very good for short lived flows, as many packets can follow close() (FIN, ACK packets, ...). This patch extends the information stored in the global hash table to include not only the cpu number, but also the upper part of the hash value. I use a 32bit value, and dynamically split it in two parts. For a host with less than 64 possible cpus, this gives 6 bits for the cpu number, and 26 (32-6) bits for the upper part of the hash. Since hash bucket selection uses the low order bits of the hash, we have a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big enough. If the hash found in the flow table does not match, we fall back to RPS (if it is enabled for the rxqueue). This means that a packet for a non-connected flow can avoid the IPI through an unrelated/victim CPU. This also means we no longer have to clear the table at socket close time, and this helps short lived flows performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
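The 32-bit split described above can be sketched as follows. This is a userspace illustration under stated assumptions, not the kernel code: `cpu_bits()`, `cpu_mask()`, `flow_entry_encode()` and `flow_entry_lookup()` are hypothetical names, and the mask is assumed to be derived from the number of possible cpus rounded up to a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to hold cpu ids 0..nr_cpus-1 (at least one bit). */
static unsigned int cpu_bits(unsigned int nr_cpus)
{
    unsigned int bits = 1;

    assert(nr_cpus >= 1 && nr_cpus <= 65536);
    while ((1u << bits) < nr_cpus)
        bits++;
    return bits;
}

/* Low-order mask selecting the cpu part of a table entry. */
static uint32_t cpu_mask(unsigned int nr_cpus)
{
    return (1u << cpu_bits(nr_cpus)) - 1;
}

/* Pack the upper hash bits and the cpu number into one 32-bit entry. */
static uint32_t flow_entry_encode(uint32_t hash, uint32_t cpu, unsigned int nr_cpus)
{
    uint32_t mask = cpu_mask(nr_cpus);

    assert(cpu < nr_cpus);
    return (hash & ~mask) | cpu;
}

/* Return the recorded cpu, or -1 when the upper hash bits differ, in
 * which case the caller would fall back to plain RPS. */
static int flow_entry_lookup(uint32_t entry, uint32_t hash, unsigned int nr_cpus)
{
    uint32_t mask = cpu_mask(nr_cpus);

    if ((entry ^ hash) & ~mask)
        return -1;
    return (int)(entry & mask);
}
```

Only the upper bits are compared because the low-order bits of the hash are already consumed by bucket selection, so nothing is lost by overwriting them with the cpu number.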
2015-02-08gre/ipip: use be16 variants of netlink functionsSabrina Dubroca2-12/+12
encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead of nla_{get,put}_u16. Fixes the sparse warnings: warning: incorrect type in assignment (different base types) expected restricted __be32 [addressable] [usertype] o_key got restricted __be16 [addressable] [usertype] i_flags warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] sport got unsigned short warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] dport got unsigned short warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] sport warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] dport Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08SUNRPC: Set SO_REUSEPORT socket option for TCP connectionsTrond Myklebust1-4/+49
When using TCP, we need the ability to reuse port numbers after a disconnection, so that the NFSv3 server knows that we're the same client. Currently we use a hack to work around the TCP socket's TIME_WAIT: we send an RST instead of closing, which doesn't always work... The SO_REUSEPORT option added in Linux 3.9 allows us to bind multiple TCP connections to the same source address+port combination, and thus to use ordinary TCP close() instead of the current hack. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08tipc: fix bug in socket reception functionJon Paul Maloy1-3/+2
In commit c637c1035534867b85b78b453c38c495b58e2c5a ("tipc: resolve race problem at unicast message reception") we introduced a time limit for how long the function tipc_sk_enqueue() would be allowed to execute its loop. Unfortunately, the test for when this limit is passed was put in the wrong place, resulting in a lost message when the test is true. We fix this by moving the test to before we dequeue the next buffer from the input queue. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08rt6_probe_deferred: Do not depend on struct orderingMichael Büsch1-1/+1
rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred. But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work. This works, because struct work_struct is the first element of struct __rt6_probe_work. Change it to kfree struct __rt6_probe_work to not implicitly depend on struct work_struct being the first element. This does not affect the generated code. Signed-off-by: Michael Buesch <m@bues.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
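The difference between freeing the embedded member and freeing the containing object can be sketched in userspace. All names here (`container_of_sketch`, `struct work_sketch`, `struct probe_work_sketch`, `probe_alloc()`, `probe_deferred()`) are hypothetical stand-ins, and `malloc()`/`free()` stand in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Recover a pointer to the containing object from a pointer to a member. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_sketch {
    int pending;
};

struct probe_work_sketch {
    struct work_sketch work; /* happens to be first, but need not stay so */
    int target;
};

/* Allocate the outer object; the handler later receives only &p->work. */
static struct probe_work_sketch *probe_alloc(int target)
{
    struct probe_work_sketch *p = malloc(sizeof(*p));

    if (p != NULL) {
        p->work.pending = 1;
        p->target = target;
    }
    return p;
}

/* Deferred handler: recover the outer object and free that pointer,
 * instead of freeing w and relying on it being at offset zero. */
static int probe_deferred(struct work_sketch *w)
{
    struct probe_work_sketch *p =
        container_of_sketch(w, struct probe_work_sketch, work);
    int target = p->target;

    assert(w != NULL);
    free(p);
    return target;
}
```

While the member sits at offset zero both pointers are numerically equal, which is why the change does not affect the generated code; the benefit is that reordering the struct later cannot turn the free into a bug.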
2015-02-08Merge tag 'nfs-rdma-for-3.20-part-2' of ↵Trond Myklebust1-3/+4
git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: RDMA Client Sparse Fixes This patch fixes a sparse warning in the initial submission. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Address sparse complaint in rpcr_to_rdmar()
2015-02-08tcp: mitigate ACK loops for connections as tcp_timewait_sockNeal Cardwell1-5/+24
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is represented by a tcp_timewait_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_sockNeal Cardwell2-7/+23
Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challenge ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_request_sockNeal Cardwell1-1/+5
In the SYN_RECV state, where the TCP connection is represented by tcp_request_sock, we now rate-limit SYNACKs in response to a client's retransmitted SYNs: we do not send a SYNACK in response to a client SYN if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a SYNACK in response to a client's retransmitted SYN. This allows the vast majority of legitimate client connections to proceed unimpeded, even for the most aggressive platforms, iOS and MacOS, which actually retransmit SYNs at 1-second intervals several times in a row. They use SYN RTO timeouts following the progression: 1,1,1,1,1,2,4,8,16,32. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacksNeal Cardwell3-0/+14
Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08openvswitch: Initialize unmasked key and uid lenPravin B Shelar1-0/+2
Flow alloc needs to initialize unmasked key pointer. Otherwise it can crash kernel trying to free random unmasked-key pointer. general protection fault: 0000 [#1] SMP 3.19.0-rc6-net-next+ #457 Hardware name: Supermicro X7DWU/X7DWU, BIOS 1.1 04/30/2008 RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196 Call Trace: [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch] [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch] [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch] [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch] [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9 [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89 [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9 [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95 [<ffffffff81399898>] ? genl_rcv+0x18/0x37 [<ffffffff813998a7>] genl_rcv+0x27/0x37 [<ffffffff81399033>] netlink_unicast+0x103/0x191 [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310 [<ffffffff811007ad>] ? might_fault+0x50/0xa0 [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a [<ffffffff8135c799>] sock_sendmsg+0xb/0xd [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218 [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86 [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348 [<ffffffff8115a720>] ? fsnotify+0x7c/0x348 [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86 [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16 [<ffffffff81411852>] system_call_fastpath+0x12/0x17 Fixes: 74ed7ab9264("openvswitch: Add support for unique flow IDs.") CC: Joe Stringer <joestringer@nicira.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07bridge: add missing bridge port check for offloadsRoopa Prabhu1-2/+2
This patch fixes a missing bridge port check caught by smatch. setlink/dellink of attributes like vlans can come for a bridge device and there is no need to offload those today. So, this patch adds a bridge port check. (In these cases however, the BRIDGE_SELF flags will always be set and we may not hit a problem with the current code). smatch complaint: The patch 68e331c785b8: "bridge: offload bridge port attributes to switch asic if feature flag set" from Jan 29, 2015, leads to the following Smatch complaint: net/bridge/br_netlink.c:552 br_setlink() error: we previously assumed 'p' could be null (see line 518) net/bridge/br_netlink.c 517 518 if (p && protinfo) { ^ Check for NULL. Reported-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>