aboutsummaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2015-02-21Merge tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+4
Pull more NFS client updates from Trond Myklebust: "Highlights include: - Fix a use-after-free in decode_cb_sequence_args() - Fix a compile error when #undef CONFIG_PROC_FS - NFSv4.1 backchannel spinlocking issue - Cleanups in the NFS unstable write code requested by Linus - NFSv4.1 fix issues when the server denies our backchannel request - Cleanups in create_session and bind_conn_to_session" * tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Clean up bind_conn_to_session NFSv4.1: Always set up a forward channel when binding the session NFSv4.1: Don't set up a backchannel if the server didn't agree to do so NFSv4.1: Clean up create_session pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit NFSv4: Kill unused nfs_inode->delegation_state field NFS: struct nfs_commit_info.lock must always point to inode->i_lock nfs: Can call nfs_clear_page_commit() instead nfs: Provide and use helper functions for marking a page as unstable SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lock SUNRPC: Fix a compile error when #undef CONFIG_PROC_FS NFSv4.1: Convert open-coded array allocation calls to kmalloc_array() NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds6-162/+54
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...
2015-02-19Merge branch 'kconfig' of ↵Linus Torvalds3-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig updates from Michal Marek: "Yann E Morin was supposed to take over kconfig maintainership, but this hasn't happened. So I'm sending a few kconfig patches that I collected: - Fix for missing va_end in kconfig - merge_config.sh displays used if given too few arguments - s/boolean/bool/ in Kconfig files for consistency, with the plan to only support bool in the future" * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kconfig: use va_end to match corresponding va_start merge_config.sh: Display usage if given too few arguments kconfig: use bool instead of boolean for type definition attributes
2015-02-19libceph: kfree() in put_osd() shouldn't depend on authorizerIlya Dryomov1-2/+3
a255651d4cad ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19libceph: fix double __remove_osd() problemIlya Dryomov1-8/+18
It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: <osd3 is idle, on the osd lru list> <con reset - osd3> con_fault_finish() osd_reset() <osdmap - osd3 down> ceph_osdc_handle_map() <takes map_sem> kick_requests() <takes request_mutex> reset_changed_osds() __reset_osd() __remove_osd() <releases request_mutex> <releases map_sem> <takes map_sem> <takes request_mutex> __kick_osd_requests() __reset_osd() __remove_osd() <-- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: stable@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
2015-02-19libceph: tcp_nodelay supportChaitanya Huilgol2-2/+28
TCP_NODELAY socket option set on connection sockets, disables Nagle’s algorithm and improves latency characteristics. tcp_nodelay(default)/notcp_nodelay option flags provided to enable/disable setting the socket option. Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com> [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19libceph: use mon_client.c/put_generic_request() moreIlya Dryomov1-2/+2
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19libceph: nuke pool op infrastructureIlya Dryomov3-148/+3
On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 22 Dec 2014, Ilya Dryomov wrote: >> Actually, pool op stuff has been unused for over two years - looks like >> it was added for rbd create_snap and that got ripped out in 2012. It's >> unlikely we'd ever need to manage pools or snaps from the kernel client >> so I think it makes sense to nuke it. Sage? > > Yep! Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-18Merge tag 'virtio-next-for-linus' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio updates from Rusty Russell: "OK, this has the big virtio 1.0 implementation, as specified by OASIS. On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to double-check the implementation. Then comes the inevitable fixes and cleanups from that work" * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits) virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice. virtio_net: unconditionally define struct virtio_net_hdr_v1. tools/lguest: don't use legacy definitions for net device in example launcher. virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined. tools/lguest: use common error macros in the example launcher. tools/lguest: give virtqueues names for better error messages tools/lguest: more documentation and checking of virtio 1.0 compliance. lguest: don't look in console features to find emerg_wr. tools/lguest: don't start devices until DRIVER_OK status set. tools/lguest: handle indirect partway through chain. tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI) tools/lguest: rename virtio_pci_cfg_cap field to match spec. tools/lguest: fix features_accepted logic in example launcher. tools/lguest: handle device reset correctly in example launcher. virtual: Documentation: simplify and generalize paravirt_ops.txt lguest: remove NOTIFY call and eventfd facility. lguest: remove NOTIFY facility from demonstration launcher. lguest: use the PCI console device's emerg_wr for early boot messages. lguest: always put console in PCI slot #1. ...
2015-02-18Merge branch 'cleanups'Trond Myklebust359-12061/+19176
Merge cleanups requested by Linus. * cleanups: (3 commits) pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit nfs: Can call nfs_clear_page_commit() instead nfs: Provide and use helper functions for marking a page as unstable
2015-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds19-48/+173
Pull networking updates from David Miller: 1) Missing netlink attribute validation in nft_lookup, from Patrick McHardy. 2) Restrict ipv6 partial checksum handling to UDP, since that's the only case it works for. From Vlad Yasevich. 3) Clear out silly device table sentinal macros used by SSB and BCMA drivers. From Joe Perches. 4) Make sure the remote checksum code never creates a situation where the remote checksum is applied yet the tunneling metadata describing the remote checksum transformation is still present. Otherwise an external entity might see this and apply the checksum again. From Tom Herbert. 5) Use msecs_to_jiffies() where applicable, from Nicholas Mc Guire. 6) Don't explicitly initialize timer struct fields, use setup_timer() and mod_timer() instead. From Vaishali Thakkar. 7) Don't invoke tg3_halt() without the tp->lock held, from Jun'ichi Nomura. 8) Missing __percpu annotation in ipvlan driver, from Eric Dumazet. 9) Don't potentially perform skb_get() on shared skbs, also from Eric Dumazet. 10) Fix COW'ing of metrics for non-DST_HOST routes in ipv6, from Martin KaFai Lau. 11) Fix merge resolution error between the iov_iter changes in vhost and some bug fixes that occurred at the same time. From Jason Wang. 12) If rtnl_configure_link() fails we have to perform a call to ->dellink() before unregistering the device. From WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits) net: dsa: Set valid phy interface type rtnetlink: call ->dellink on failure when ->newlink exists com20020-pci: add support for eae single card vhost_net: fix wrong iter offset when setting number of buffers net: spelling fixes net/core: Fix warning while make xmldocs caused by dev.c net: phy: micrel: disable NAND-tree for KSZ8021, KSZ8031, KSZ8051, KSZ8081 ipv6: fix ipv6_cow_metrics for non DST_HOST case openvswitch: Fix key serialization. r8152: restore hw settings hso: fix rx parsing logic when skb allocation fails tcp: make sure skb is not shared before using skb_get() bridge: netfilter: Move sysctl-specific error code inside #ifdef ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc ipvlan: add a missing __percpu pcpu_stats tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one() bgmac: fix device initialization on Northstar SoCs (condition typo) qlcnic: Delete existing multicast MAC list before adding new net/mlx5_core: Fix configuration of log_uar_page_sz sunvnet: don't change gso data on clones ...
2015-02-17net: dsa: Set valid phy interface typeGuenter Roeck1-2/+7
If the phy interface mode is not found in devicetree, or if devicetree is not configured, of_get_phy_mode returns -ENODEV. The current code sets the phy interface mode to the return value from of_get_phy_mode without checking if it is valid. This invalid phy interface mode is passed as parameter to of_phy_connect or to phy_connect_direct. This sets the phy interface mode to the invalid value, which in turn causes problems for any code using phydev->interface. Fixes: b31f65fb4383 ("net: dsa: slave: Fix autoneg for phys on switch MDIO bus") Fixes: 0d8bcdd383b8 ("net: dsa: allow for more complex PHY setups") Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-15rtnetlink: call ->dellink on failure when ->newlink existsWANG Cong1-1/+8
Ignacy reported that when eth0 is down and add a vlan device on top of it like: ip link add link eth0 name eth0.1 up type vlan id 1 We will get a refcount leak: unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2 The problem is when rtnl_configure_link() fails in rtnl_newlink(), we simply call unregister_device(), but for stacked device like vlan, we almost do nothing when we unregister the upper device, more work is done when we unregister the lower device, so call its ->dellink(). Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14net: spelling fixesStephen Hemminger3-3/+3
Spelling errors caught by codespell. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14net/core: Fix warning while make xmldocs caused by dev.cMasanari Iida1-1/+1
This patch fix following warning wile make xmldocs. Warning(.//net/core/dev.c:5345): No description found for parameter 'bonding_info' Warning(.//net/core/dev.c:5345): Excess function parameter 'netdev_bonding_info' description in 'netdev_bonding_info_change' This warning starts to appear after following patch was added into Linus's tree during merger period. commit 61bd3857ff2c7daf756d49b41e6277bbdaa8f789 net/core: Add event for a change in slave state Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14ipv6: fix ipv6_cow_metrics for non DST_HOST caseMartin KaFai Lau1-1/+1
ipv6_cow_metrics() currently assumes only DST_HOST routes require dynamic metrics allocation from inetpeer. The assumption breaks when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric. Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set() is called after the route is created. This patch creates the metrics array (by calling dst_cow_metrics_generic) in ipv6_cow_metrics(). Test: radvd.conf: interface qemubr0 { AdvLinkMTU 1300; AdvCurHopLimit 30; prefix fd00:face:face:face::/64 { AdvOnLink on; AdvAutonomous on; AdvRouterAddr off; }; }; Before: [root@qemu1 ~]# ip -6 r show | egrep -v unreachable fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec fe80::/64 dev eth0 proto kernel metric 256 default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec After: [root@qemu1 ~]# ip -6 r show | egrep -v unreachable fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec mtu 1300 fe80::/64 dev eth0 proto kernel metric 256 mtu 1300 default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec mtu 1300 hoplimit 30 Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.) Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14openvswitch: Fix key serialization.Pravin B Shelar1-1/+1
Fix typo where mask is used rather than key. Fixes: 74ed7ab9264("openvswitch: Add support for unique flow IDs.") Reported-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-13net: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo2-22/+8
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lockChuck Lever1-1/+4
Other code that accesses rq_bc_pa_list holds xprt->bc_pa_lock. xprt_complete_bc_request() should do the same. Fixes: 2ea24497a1b3 ("SUNRPC: RPC callbacks may be split . . .") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-13tcp: make sure skb is not shared before using skb_get()Eric Dumazet1-8/+24
IPv6 can keep a copy of SYN message using skb_get() in tcp_v6_conn_request() so that caller wont free the skb when calling kfree_skb() later. Therefore TCP fast open has to clone the skb it is queuing in child->sk_receive_queue, as all skbs consumed from receive_queue are freed using __kfree_skb() (ie assuming skb->users == 1) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-12memcg: cleanup static keys decrementVladimir Davydov1-0/+4
Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of wrapper functions. Although this patch moves static keys decrement from __mem_cgroup_free to mem_cgroup_css_free, it does not introduce any functional changes, because the keys are incremented on setting the limit (tcp or kmem), which can only happen after successful mem_cgroup_css_online. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Glauber Costa <glommer@parallels.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linuxLinus Torvalds6-151/+209
Pull nfsd updates from Bruce Fields: "The main change is the pNFS block server support from Christoph, which allows an NFS client connected to shared disk to do block IO to the shared disk in place of NFS reads and writes. This also requires xfs patches, which should arrive soon through the xfs tree, barring unexpected problems. Support for other filesystems is also possible if there's interest. Thanks also to Chuck Lever for continuing work to get NFS/RDMA into shape" * 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd: default NFSv4.2 to on nfsd: pNFS block layout driver exportfs: add methods for block layout exports nfsd: add trace events nfsd: update documentation for pNFS support nfsd: implement pNFS layout recalls nfsd: implement pNFS operations nfsd: make find_any_file available outside nfs4state.c nfsd: make find/get/put file available outside nfs4state.c nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c nfsd: add fh_fsid_match helper nfsd: move nfsd_fh_match to nfsfh.h fs: add FL_LAYOUT lease type fs: track fl_owner for leases nfs: add LAYOUT_TYPE_MAX enum value nfsd: factor out a helper to decode nfstime4 values sunrpc/lockd: fix references to the BKL nfsd: fix year-2038 nfs4 state problem svcrdma: Handle additional inline content svcrdma: Move read list XDR round-up logic ...
2015-02-12bridge: netfilter: Move sysctl-specific error code inside #ifdefGeert Uytterhoeven1-5/+2
If CONFIG_SYSCTL=n: net/bridge/br_netfilter.c: In function ‘br_netfilter_init’: net/bridge/br_netfilter.c:996: warning: label ‘err1’ defined but not used Move the label and the code after it inside the existing #ifdef to get rid of the warning. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-12ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gcJan Stancek1-2/+2
Use spin_lock_bh in ip6_fl_purge() to prevent following potentially deadlock scenario between ip6_fl_purge() and ip6_fl_gc() timer. ================================= [ INFO: inconsistent lock state ] 3.19.0 #1 Not tainted --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (ip6_fl_lock){+.?...}, at: [<ffffffff8171155d>] ip6_fl_gc+0x2d/0x180 {SOFTIRQ-ON-W} state was registered at: [<ffffffff810ee9a0>] __lock_acquire+0x4a0/0x10b0 [<ffffffff810efd54>] lock_acquire+0xc4/0x2b0 [<ffffffff81751d2d>] _raw_spin_lock+0x3d/0x80 [<ffffffff81711798>] ip6_flowlabel_net_exit+0x28/0x110 [<ffffffff815f9759>] ops_exit_list.isra.1+0x39/0x60 [<ffffffff815fa320>] cleanup_net+0x100/0x1e0 [<ffffffff810ad80a>] process_one_work+0x20a/0x830 [<ffffffff810adf4b>] worker_thread+0x11b/0x460 [<ffffffff810b42f4>] kthread+0x104/0x120 [<ffffffff81752bfc>] ret_from_fork+0x7c/0xb0 irq event stamp: 84640 hardirqs last enabled at (84640): [<ffffffff81752080>] _raw_spin_unlock_irq+0x30/0x50 hardirqs last disabled at (84639): [<ffffffff81751eff>] _raw_spin_lock_irq+0x1f/0x80 softirqs last enabled at (84628): [<ffffffff81091ad1>] _local_bh_enable+0x21/0x50 softirqs last disabled at (84629): [<ffffffff81093b7d>] irq_exit+0x12d/0x150 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(ip6_fl_lock); <Interrupt> lock(ip6_fl_lock); *** DEADLOCK *** Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11Merge branch 'next' of ↵Linus Torvalds2-26/+40
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "Highlights: - Smack adds secmark support for Netfilter - /proc/keys is now mandatory if CONFIG_KEYS=y - TPM gets its own device class - Added TPM 2.0 support - Smack file hook rework (all Smack users should review this!)" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits) cipso: don't use IPCB() to locate the CIPSO IP option SELinux: fix error code in policydb_init() selinux: add security in-core xattr support for pstore and debugfs selinux: quiet the filesystem labeling behavior message selinux: Remove unused function avc_sidcmp() ima: /proc/keys is now mandatory Smack: Repair netfilter dependency X.509: silence asn1 compiler debug output X.509: shut up about included cert for silent build KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y MAINTAINERS: email update tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device smack: fix possible use after frees in task_security() callers smack: Add missing logging in bidirectional UDS connect check Smack: secmark support for netfilter Smack: Rework file hooks tpm: fix format string error in tpm-chip.c char/tpm/tpm_crb: fix build error smack: Fix a bidirectional UDS connect check typo smack: introduce a special case for tmpfs in smack_d_instantiate() ...
2015-02-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-5/+3
Merge second set of updates from Andrew Morton: "More of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() vmstat: Reduce time interval to stat update on idle cpu mm/page_owner.c: remove unnecessary stack_trace field Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files mm: incorporate read-only pages into transparent huge pages vmstat: do not use deferrable delayed work for vmstat_update mm: more aggressive page stealing for UNMOVABLE allocations mm: always steal split buddies in fallback allocations mm: when stealing freepages, also take pages created by splitting buddy page mincore: apply page table walker on do_mincore() mm: /proc/pid/clear_refs: avoid split_huge_page() mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) mempolicy: apply page table walker on queue_pages_range() arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() memcg: cleanup preparation for page table walk numa_maps: remove numa_maps->vma numa_maps: fix typo in gather_hugetbl_stats pagemap: use walk->vma instead of calling find_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() ...
2015-02-11Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds10-517/+628
Pull NFS client updates from Trond Myklebust: "Highlights incluse: Features: - Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0) makes for a significant performance improvement in metadata intensive workloads. - Full support for the pNFS "flexible files" layout type - Further RPC/RDMA client improvements from Chuck Bugfixes: - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, - Stable fix: Ensure we reference the inode for return-on-close in delegreturn - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the same source address/port combination during a disconnect/ reconnect event. This is a requirement imposed by most NFSv3 server duplicate reply cache implementations. Optimisations: - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT Other: - Add Anna Schumaker as co-maintainer for the NFS client" * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits) SUNRPC: Cleanup to remove xs_tcp_close() pnfs: delete an unintended goto pnfs/flexfiles: Do not dprintk after the free SUNRPC: Fix stupid typo in xs_sock_set_reuseport SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG SUNRPC: Handle connection reset more efficiently. SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT SUNRPC: Remove TCP socket linger code SUNRPC: Remove TCP client connection reset hack SUNRPC: TCP/UDP always close the old socket before reconnecting SUNRPC: Add helpers to prevent socket create from racing SUNRPC: Ensure xs_reset_transport() resets the close connection flags SUNRPC: Do not clear the source port in xs_reset_transport SUNRPC: Handle EADDRINUSE on connect SUNRPC: Set SO_REUSEPORT socket option for TCP connections NFSv4.1: Fix pnfs_put_lseg races NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS ...
2015-02-11mm: gup: use get_user_pages_unlockedAndrea Arcangeli1-4/+2
This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: page_counter: pull "-1" handling out of page_counter_memparse()Johannes Weiner1-1/+1
The unified hierarchy interface for memory cgroups will no longer use "-1" to mean maximum possible resource value. In preparation for this, make the string an argument and let the caller supply it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11gue: Use checksum partial with remote checksum offloadTom Herbert1-6/+22
Change remote checksum handling to set checksum partial as default behavior. Added an iflink parameter to configure not using checksum partial (calling csum_partial to update checksum). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offloadTom Herbert2-2/+3
This patch adds infrastructure so that remote checksum offload can set CHECKSUM_PARTIAL instead of calling csum_partial and writing the modfied checksum field. Add skb_remcsum_adjust_partial function to set an skb for using CHECKSUM_PARTIAL with remote checksum offload. Changed skb_remcsum_process and skb_gro_remcsum_process to take a boolean argument to indicate if checksum partial can be set or the checksum needs to be modified using the normal algorithm. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11udp: Set SKB_GSO_UDP_TUNNEL* in UDP GRO pathTom Herbert2-2/+17
Properly set GSO types and skb->encapsulation in the UDP tunnel GRO complete so that packets are properly represented for GSO. This sets SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if the remote checksum option was processed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11net: Fix remcsum in GRO path to not change packetTom Herbert1-10/+10
Remote checksum offload processing is currently the same for both the GRO and non-GRO path. When the remote checksum offload option is encountered, the checksum field referred to is modified in the packet. So in the GRO case, the packet is modified in the GRO path and then the operation is skipped when the packet goes through the normal path based on skb->remcsum_offload. There is a problem in that the packet may be modified in the GRO path, but then forwarded off host still containing the remote checksum option. A remote host will again perform RCO but now the checksum verification will fail since GRO RCO already modified the checksum. To fix this, we ensure that GRO restores a packet to it's original state before returning. In this model, when GRO processes a remote checksum option it still changes the checksum per the algorithm but on return from lower layer processing the checksum is restored to its original value. In this patch we add define gro_remcsum structure which is passed to skb_gro_remcsum_process to save offset and delta for the checksum being changed. After lower layer processing, skb_gro_remcsum_cleanup is called to restore the checksum before returning from GRO. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11openvswitch: Add missing initialization in validate_and_copy_set_tun()Geert Uytterhoeven1-1/+1
net/openvswitch/flow_netlink.c: In function ‘validate_and_copy_set_tun’: net/openvswitch/flow_netlink.c:1749: warning: ‘err’ may be used uninitialized in this function If ipv4_tun_from_nlattr() returns a different positive value than OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS, err will be uninitialized, and validate_and_copy_set_tun() may return an undefined value instead of a zero success indicator. Initialize err to zero to fix this. Fixes: 1dd144cf5b4b47e1 ("openvswitch: Support VXLAN Group Policy extension") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11openvswitch: Reset key metadata for packet execution.Pravin B Shelar1-0/+2
Userspace packet execute command pass down flow key for given packet. But userspace can skip some parameter with zero value. Therefore kernel needs to initialize key metadata to zero. Fixes: 0714812134 ("openvswitch: Eliminate memset() from flow_extract.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11rds: rds_cong_queue_updates needs to defer the congestion update transmissionSowmini Varadhan1-1/+15
When the RDS transport is TCP, we cannot inline the call to rds_send_xmit from rds_cong_queue_update because (a) we are already holding the sock_lock in the recv path, and will deadlock when tcp_setsockopt/tcp_sendmsg try to get the sock lock (b) cong_queue_update does an irqsave on the rds_cong_lock, and this will trigger warnings (for a good reason) from functions called out of sock_lock. This patch reverts the change introduced by 2fa57129d ("RDS: Bypass workqueue when queueing cong updates"). The patch has been verified for both RDS/TCP as well as RDS/RDMA to ensure that there are not regressions for either transport: - for verification of RDS/TCP a client-server unit-test was used, with the server blocked in gdb and thus unable to drain its rcvbuf, eventually triggering a RDS congestion update. - for RDS/RDMA, the standard IB regression tests were used Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11ipv6: Partial checksum only UDP packetsVlad Yasevich1-1/+1
ip6_append_data is used by other protocols and some of them can't be partially checksummed. Only partially checksum UDP protocol. Fixes: 32dce968dd987a (ipv6: Allow for partial checksums on non-ufo packets) Reported-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-6/+58
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains two small Netfilter updates for your net-next tree, they are: 1) Add ebtables support to nft_compat, from Arturo Borrero. 2) Fix missing validation of the SET_ID attribute in the lookup expressions, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11cipso: don't use IPCB() to locate the CIPSO IP optionPaul Moore2-26/+40
Using the IPCB() macro to get the IPv4 options is convenient, but unfortunately NetLabel often needs to examine the CIPSO option outside of the scope of the IP layer in the stack. While historically IPCB() worked above the IP layer, due to the inclusion of the inet_skb_param struct at the head of the {tcp,udp}_skb_cb structs, recent commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses") reordered the tcp_skb_cb struct and invalidated this IPCB() trick. This patch fixes the problem by creating a new function, cipso_v4_optptr(), which locates the CIPSO option inside the IP header without calling IPCB(). Unfortunately, this isn't as fast as a simple lookup so some additional tweaks were made to limit the use of this new function. Cc: <stable@vger.kernel.org> # 3.18 Reported-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>
2015-02-10SUNRPC: Cleanup to remove xs_tcp_close()Trond Myklebust1-6/+1
xs_tcp_close() is now just a call to xs_tcp_shutdown(), so remove it, and replace the entry in xs_tcp_ops. Suggested-by: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09ipv4: Namespecify TCP PMTU mechanismFan Du4-21/+23
Packetization Layer Path MTU Discovery works separately beside Path MTU Discovery at IP level, different net namespace has various requirements on which one to chose, e.g., a virutalized container instance would require TCP PMTU to probe an usable effective mtu for underlying tunnel, while the host would employ classical ICMP based PMTU to function. Hence making TCP PMTU mechanism per net namespace to decouple two functionality. Furthermore the probe base MSS should also be configured separately for each namespace. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-54/+71
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09SUNRPC: Fix stupid typo in xs_sock_set_reuseportTrond Myklebust1-2/+3
Yes, kernel_setsockopt() hates you for using a char argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09tcp: don't include Fast Open option in SYN-ACK on pure SYN-dataYuchung Cheng1-5/+8
If a server has enabled Fast Open and it receives a pure SYN-data packet (without a Fast Open option), it won't accept the data but it incorrectly returns a SYN-ACK with a Fast Open cookie and also increments the SNMP stat LINUX_MIB_TCPFASTOPENPASSIVEFAIL. This patch makes the server include a Fast Open cookie in SYN-ACK only if the SYN has some Fast Open option (i.e., when client requests or presents a cookie). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is setThomas Graf1-1/+1
This avoids setting TUNNEL_VXLAN_OPT for VXLAN frames which don't have any GBP metadata set. It is not invalid to set it but unnecessary. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipv6: Make __ipv6_select_ident staticVlad Yasevich1-1/+2
Make __ipv6_select_ident() static as it isn't used outside the file. Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipv6: Fix fragment id assignment on LE arches.Vlad Yasevich1-1/+1
Recent commit: 0508c07f5e0c94f38afd5434e8b2a55b84553077 Author: Vlad Yasevich <vyasevich@gmail.com> Date: Tue Feb 3 16:36:15 2015 -0500 ipv6: Select fragment id during UFO segmentation if not set. Introduced a bug on LE in how ipv6 fragment id is assigned. This was cought by nightly sparce check: Resolve the following sparce error: net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment (different base types) net/ipv6/output_core.c:57:38: expected restricted __be32 [usertype] ip6_frag_id net/ipv6/output_core.c:57:38: got unsigned int [unsigned] [assigned] [usertype] id Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09bridge: Fix inability to add non-vlan fdb entryToshiaki Makita1-7/+5
Bridge's default_pvid adds a vid by default, by which we cannot add a non-vlan fdb entry by default, because br_fdb_add() adds fdb entries for all vlans instead of a non-vlan one when any vlan is configured. # ip link add br0 type bridge # ip link set eth0 master br0 # bridge fdb add 12:34:56:78:90:ab dev eth0 master temp # bridge fdb show brport eth0 | grep 12:34:56:78:90:ab 12:34:56:78:90:ab dev eth0 vlan 1 static We expect a non-vlan fdb entry as well as vlan 1: 12:34:56:78:90:ab dev eth0 static To fix this, we need to insert a non-vlan fdb entry if vlan is not specified, even when any vlan is configured. Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09net: dsa: Remove redundant phy_attach()Andrew Lunn1-13/+0
dsa_slave_phy_setup() finds the phy for the port via device tree and using of_phy_connect(), or it uses the fall back of taking a phy from the switch internal mdio bus and calling phy_connect_direct(). Either way, if a phy is found, phy_attach_direct() is called to attach the phy to the slave device. In dsa_slave_create(), a second call to phy_attach() is made. This results in the warning "PHY already attached". Remove this second, redundant attaching of the phy. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: remove tipc_snprintfRichard Alpe4-60/+4
tipc_snprintf() was heavily utilized by the old netlink API which no longer exists (now netlink compat). In this patch we swap tipc_snprintf() to the identical scnprintf() in the only remaining occurrence. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: nl compat add noop and remove legacy nl frameworkRichard Alpe12-301/+13
Add TIPC_CMD_NOOP to compat layer and remove the old framework. All legacy nl commands are now converted to the compat layer in netlink_compat.c. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl stats show to nl compatRichard Alpe2-35/+15
Convert TIPC_CMD_SHOW_STATS to compat layer. This command does not have any counterpart in the new API, meaning it now solely exists as a function in the compat layer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl net id get to nl compatRichard Alpe2-23/+18
Convert TIPC_CMD_GET_NETID to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl net id set to nl compatRichard Alpe2-26/+14
Convert TIPC_CMD_SET_NETID to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl node addr set to nl compatRichard Alpe3-27/+27
Convert TIPC_CMD_SET_NODE_ADDR to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl node dump to nl compatRichard Alpe4-60/+22
Convert TIPC_CMD_GET_NODES to compat dumpit and remove global node counter solely used by the legacy API. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl media dump to nl compatRichard Alpe4-24/+19
Convert TIPC_CMD_GET_MEDIA_NAMES to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl socket dump to nl compatRichard Alpe4-89/+111
Convert socket (port) listing to compat dumpit call. If a socket (port) has publications a second dumpit call is issued to collect them and format then into the legacy buffer before continuing to process the sockets (ports). Command converted in this patch: TIPC_CMD_SHOW_PORTS Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl name table dump to nl compatRichard Alpe4-190/+101
Add functionality for printing a dump header and convert TIPC_CMD_SHOW_NAME_TABLE to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link stat reset to nl compatRichard Alpe4-41/+27
Convert TIPC_CMD_RESET_LINK_STATS to compat doit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link prop set to nl compatRichard Alpe4-153/+48
Convert setting of link proprieties to compat doit calls. Commands converted in this patch: TIPC_CMD_SET_LINK_TOL TIPC_CMD_SET_LINK_PRI TIPC_CMD_SET_LINK_WINDOW Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link dump to nl compatRichard Alpe4-79/+23
Convert TIPC_CMD_GET_LINKS to compat dumpit and remove global link counter solely used by the legacy API. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl link stat to nl compatRichard Alpe6-192/+205
Add functionality for safely appending string data to a TLV without keeping write count in the caller. Convert TIPC_CMD_SHOW_LINK_STATS to compat dumpit. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl bearer enable/disable to nl compatRichard Alpe4-57/+149
Introduce a framework for transcoding legacy nl action into actions (.doit) calls from the new nl API. This is done by converting the incoming TLV data into netlink data with nested netlink attributes. Unfortunately due to the randomness of the legacy API we can't do this generically so each legacy netlink command requires a specific transcoding recipe. In this case for bearer enable and bearer disable. Convert TIPC_CMD_ENABLE_BEARER and TIPC_CMD_DISABLE_BEARER into doit compat calls. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl bearer dump to nl compatRichard Alpe4-34/+273
Introduce a framework for dumping netlink data from the new netlink API and formatting it to the old legacy API format. This is done by looping the dump data and calling a format handler for each entity, in this case a bearer. We dump until either all data is dumped or we reach the limited buffer size of the legacy API. Remember, the legacy API doesn't scale. In this commit we convert TIPC_CMD_GET_BEARER_NAMES to use the compat layer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: move and rename the legacy nl api to "nl compat"Richard Alpe13-76/+130
The new netlink API is no longer "v2" but rather the standard API and the legacy API is now "nl compat". We split them into separate start/stop and put them in different files in order to further distinguish them. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUGTrond Myklebust1-2/+2
Now that the linger code is gone, the xs_tcp_fin_timeout variable has no real function. Keep it for now, since it is part of the /proc interface, but only define it if that /proc interface is enabled. Suggested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Handle connection reset more efficiently.Trond Myklebust1-16/+18
If the connection reset is due to an active call on our side, then the state change is sometimes not reported. Catch those instances using xs_error_report() instead. Also remove the xs_tcp_shutdown() call in xs_tcp_send_request() as the change in behaviour makes it redundant. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flagTrond Myklebust2-2/+0
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_releaseTrond Myklebust1-5/+1
Use of socket shutdown() means that we monitor the shutdown process through the xs_tcp_state_change() callback, so it is preferable to a full close in all cases unless we're destroying the transport. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connectionTrond Myklebust1-2/+2
The previous behaviour left the connection half-open in order to try to scrape the last replies from the socket. Now that we have more reliable reconnection, change the behaviour to close down the socket faster. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORTTrond Myklebust1-3/+0
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09SUNRPC: Remove TCP socket linger codeTrond Myklebust1-35/+0
Now that we no longer use the partial shutdown code when closing the socket, we no longer need to worry about the TCP linger2 state. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08net:rfs: adjust table size checkingEric Dumazet1-1/+1
Make sure root user does not try something stupid. Also make sure mask field in struct rps_sock_flow_table does not share a cache line with the potentially often dirtied flow table. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 567e4b79731c ("net: rfs: add hash collision detection") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08SUNRPC: Remove TCP client connection reset hackTrond Myklebust1-66/+1
Instead we rely on SO_REUSEPORT to provide the reconnection semantics that we need for NFSv2/v3. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: TCP/UDP always close the old socket before reconnectingTrond Myklebust1-2/+3
It is not safe to call xs_reset_transport() from inside xs_udp_setup_socket() or xs_tcp_setup_socket(), since they do not own the correct locks. Instead, do it in xs_connect(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Add helpers to prevent socket create from racingTrond Myklebust2-6/+38
The socket lock is currently held by the task that is requesting the connection be established. While that is efficient in the case where the connection happens quickly, it is racy in the case where it doesn't. What we really want is for the connect helper to be able to block access to the socket while it is being set up. This patch does so by arranging to transfer the socket lock from the task that is requesting the connect attempt, and then releasing that lock once everything is done. This scheme also gives us automatic protection against collisions with the RPC close code, so we can kill the cancel_delayed_work_sync() call in xs_close(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Ensure xs_reset_transport() resets the close connection flagsTrond Myklebust1-16/+13
Otherwise, we may end up looping. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Do not clear the source port in xs_reset_transportTrond Myklebust1-2/+0
Now that we can reuse bound ports after a close, we never really want to clear the transport's source port after it has been set. Doing so really messes up the NFSv3 DRC on the server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08SUNRPC: Handle EADDRINUSE on connectTrond Myklebust2-0/+5
Now that we're setting SO_REUSEPORT, we still need to handle the case where a connect() is attempted, but the old socket is still lingering. Essentially, all we want to do here is handle the error by waiting a few seconds and then retrying. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08net: rfs: add hash collision detectionEric Dumazet3-24/+28
Receive Flow Steering is a nice solution but suffers from hash collisions when a mix of connected and unconnected traffic is received on the host, when flow hash table is populated. Also, clearing flow in inet_release() makes RFS not very good for short lived flows, as many packets can follow close(). (FIN , ACK packets, ...) This patch extends the information stored into global hash table to not only include cpu number, but upper part of the hash value. I use a 32bit value, and dynamically split it in two parts. For host with less than 64 possible cpus, this gives 6 bits for the cpu number, and 26 (32-6) bits for the upper part of the hash. Since hash bucket selection use low order bits of the hash, we have a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big enough. If the hash found in flow table does not match, we fallback to RPS (if it is enabled for the rxqueue). This means that a packet for an non connected flow can avoid the IPI through a unrelated/victim CPU. This also means we no longer have to clear the table at socket close time, and this helps short lived flows performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08gre/ipip: use be16 variants of netlink functionsSabrina Dubroca2-12/+12
encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead of nla_{get,put}_u16. Fixes the sparse warnings: warning: incorrect type in assignment (different base types) expected restricted __be32 [addressable] [usertype] o_key got restricted __be16 [addressable] [usertype] i_flags warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] sport got unsigned short warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] dport got unsigned short warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] sport warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] dport Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08SUNRPC: Set SO_REUSEPORT socket option for TCP connectionsTrond Myklebust1-4/+49
When using TCP, we need the ability to reuse port numbers after a disconnection, so that the NFSv3 server knows that we're the same client. Currently we use a hack to work around the TCP socket's TIME_WAIT: we send an RST instead of closing, which doesn't always work... The SO_REUSEPORT option added in Linux 3.9 allows us to bind multiple TCP connections to the same source address+port combination, and thus to use ordinary TCP close() instead of the current hack. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08tipc: fix bug in socket reception functionJon Paul Maloy1-3/+2
In commit c637c1035534867b85b78b453c38c495b58e2c5a ("tipc: resolve race problem at unicast message reception") we introduced a time limit for how long the function tipc_sk_eneque() would be allowed to execute its loop. Unfortunately, the test for when this limit is passed was put in the wrong place, resulting in a lost message when the test is true. We fix this by moving the test to before we dequeue the next buffer from the input queue. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08rt6_probe_deferred: Do not depend on struct orderingMichael Büsch1-1/+1
rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred. But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work. This works, because struct work_struct is the first element of struct __rt6_probe_work. Change it to kfree struct __rt6_probe_work to not implicitly depend on struct work_struct being the first element. This does not affect the generated code. Signed-off-by: Michael Buesch <m@bues.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08Merge tag 'nfs-rdma-for-3.20-part-2' of ↵Trond Myklebust1-3/+4
git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: RDMA Client Sparse Fixes This patch fixes a sparse warning in the initial submission. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Address sparse complaint in rpcr_to_rdmar()
2015-02-08tcp: mitigate ACK loops for connections as tcp_timewait_sockNeal Cardwell1-5/+24
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is represented by a tcp_timewait_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_sockNeal Cardwell2-7/+23
Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challence ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: mitigate ACK loops for connections as tcp_request_sockNeal Cardwell1-1/+5
In the SYN_RECV state, where the TCP connection is represented by tcp_request_sock, we now rate-limit SYNACKs in response to a client's retransmitted SYNs: we do not send a SYNACK in response to client SYN if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a SYNACK in response to a client's retransmitted SYN. This allows the vast majority of legitimate client connections to proceed unimpeded, even for the most aggressive platforms, iOS and MacOS, which actually retransmit SYNs 1-second intervals for several times in a row. They use SYN RTO timeouts following the progression: 1,1,1,1,1,2,4,8,16,32. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacksNeal Cardwell3-0/+14
Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to thus elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing analogous knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08openvswitch: Initialize unmasked key and uid lenPravin B Shelar1-0/+2
Flow alloc needs to initialize unmasked key pointer. Otherwise it can crash kernel trying to free random unmasked-key pointer. general protection fault: 0000 [#1] SMP 3.19.0-rc6-net-next+ #457 Hardware name: Supermicro X7DWU/X7DWU, BIOS 1.1 04/30/2008 RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196 Call Trace: [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch] [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch] [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch] [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch] [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9 [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89 [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9 [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95 [<ffffffff81399898>] ? genl_rcv+0x18/0x37 [<ffffffff813998a7>] genl_rcv+0x27/0x37 [<ffffffff81399033>] netlink_unicast+0x103/0x191 [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310 [<ffffffff811007ad>] ? might_fault+0x50/0xa0 [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a [<ffffffff8135c799>] sock_sendmsg+0xb/0xd [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218 [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86 [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348 [<ffffffff8115a720>] ? fsnotify+0x7c/0x348 [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86 [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16 [<ffffffff81411852>] system_call_fastpath+0x12/0x17 Fixes: 74ed7ab9264("openvswitch: Add support for unique flow IDs.") CC: Joe Stringer <joestringer@nicira.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07bridge: add missing bridge port check for offloadsRoopa Prabhu1-2/+2
This patch fixes a missing bridge port check caught by smatch. setlink/dellink of attributes like vlans can come for a bridge device and there is no need to offload those today. So, this patch adds a bridge port check. (In these cases however, the BRIDGE_SELF flags will always be set and we may not hit a problem with the current code). smatch complaint: The patch 68e331c785b8: "bridge: offload bridge port attributes to switch asic if feature flag set" from Jan 29, 2015, leads to the following Smatch complaint: net/bridge/br_netlink.c:552 br_setlink() error: we previously assumed 'p' could be null (see line 518) net/bridge/br_netlink.c 517 518 if (p && protinfo) { ^ Check for NULL. Reported-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07net: use netif_rx_ni() from process contextEric Dumazet1-2/+2
Hotpluging a cpu might be rare, yet we have to use proper handlers when taking over packets found in backlog queues. dev_cpu_callback() runs from process context, thus we should call netif_rx_ni() to properly invoke softirq handler. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07rds: Make rds_message_copy_from_user() return 0 on success.Sowmini Varadhan1-4/+4
Commit 083735f4b01b ("rds: switch rds_message_copy_from_user() to iov_iter") breaks rds_message_copy_from_user() semantics on success, and causes it to return nbytes copied, when it should return 0. This commit fixes that bug. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07net: rds: Remove repeated function names from debug outputRasmus Villemoes3-6/+6
The macro rdsdebug is defined as pr_debug("%s(): " fmt, __func__ , ##args) Hence it doesn't make sense to include the name of the calling function explicitly in the format string passed to rdsdebug. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07net: openvswitch: Support masked set actions.Jarno Rajahalme2-172/+362
OVS userspace already probes the openvswitch kernel module for OVS_ACTION_ATTR_SET_MASKED support. This patch adds the kernel module implementation of masked set actions. The existing set action sets many fields at once. When only a subset of the IP header fields, for example, should be modified, all the IP fields need to be exact matched so that the other field values can be copied to the set action. A masked set action allows modification of an arbitrary subset of the supported header bits without requiring the rest to be matched. Masked set action is now supported for all writeable key types, except for the tunnel key. The set tunnel action is an exception as any input tunnel info is cleared before action processing starts, so there is no tunnel info to mask. The kernel module converts all (non-tunnel) set actions to masked set actions. This makes action processing more uniform, and results in less branching and duplicating the action processing code. When returning actions to userspace, the fully masked set actions are converted back to normal set actions. We use a kernel internal action code to be able to tell the userspace provided and converted masked set actions apart. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07Merge tag 'nfc-next-3.20-2' of ↵David S. Miller9-42/+1146
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next NFC: 3.20 second pull request This is the second NFC pull request for 3.20. It brings: - NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support. In order to communicate with an NFCEE, we also added NCI's logical connections support to the NCI stack. - HCI over NCI protocol support. Some secure elements only understand HCI and thus we need to send them HCI frames when they're part of an NCI chipset. - NFC_EVT_TRANSACTION userspace API addition. Whenever an application running on a secure element needs to notify its host counterpart, we send an NFC_EVENT_SE_TRANSACTION event to userspace through the NFC netlink socket. - Secure element and HCI transaction event support for the st21nfcb chipset. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARYDaniel Borkmann1-12/+6
ifla_vf_policy[] is wrong in advertising its individual member types as NLA_BINARY since .type = NLA_BINARY in combination with .len declares the len member as *max* attribute length [0, len]. The issue is that when do_setvfinfo() is being called to set up a VF through ndo handler, we could set corrupted data if the attribute length is less than the size of the related structure itself. The intent is exactly the opposite, namely to make sure to pass at least data of minimum size of len. Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink") Cc: Mitch Williams <mitch.a.williams@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07dsa: correctly determine the number of switches in a systemTobias Waldekranz1-1/+1
The number of connected switches was sourced from the number of children to the DSA node, change it to the number of available children, skipping any disabled switches. Fixes: 5e95329b701c4 ("dsa: add device tree bindings to register DSA switches") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06ipv6: addrconf: add missing validate_link_af handlerDaniel Borkmann1-0/+17
We still need a validate_link_af() handler with an appropriate nla policy, similarly as we have in IPv4 case, otherwise size validations are not being done properly in that case. Fixes: f53adae4eae5 ("net: ipv6: add tokenized interface identifier support") Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: eliminate race condition at multicast receptionJon Paul Maloy8-59/+114
In a previous commit in this series we resolved a race problem during unicast message reception. Here, we resolve the same problem at multicast reception. We apply the same technique: an input queue serializing the delivery of arriving buffers. The main difference is that here we do it in two steps. First, the broadcast link feeds arriving buffers into the tail of an arrival queue, which head is consumed at the socket level, and where destination lookup is performed. Second, if the lookup is successful, the resulting buffer clones are fed into a second queue, the input queue. This queue is consumed at reception in the socket just like in the unicast case. Both queues are protected by the same lock, -the one of the input queue. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: simplify socket multicast receptionJon Paul Maloy6-101/+84
The structure 'tipc_port_list' is used to collect port numbers representing multicast destination socket on a receiving node. The list is not based on a standard linked list, and is in reality optimized for the uncommon case that there are more than one multicast destinations per node. This makes the list handling unecessarily complex, and as a consequence, even the socket multicast reception becomes more complex. In this commit, we replace 'tipc_port_list' with a new 'struct tipc_plist', which is based on a standard list. We give the new list stack (push/pop) semantics, someting that simplifies the implementation of the function tipc_sk_mcast_rcv(). Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: simplify connection abort notifications when links breakJon Paul Maloy1-40/+29
The new input message queue in struct tipc_link can be used for delivering connection abort messages to subscribing sockets. This makes it possible to simplify the code for such cases. This commit removes the temporary list in tipc_node_unlock() used for transforming abort subscriptions to messages. Instead, the abort messages are now created at the moment of lost contact, and then added to the last failed link's generic input queue for delivery to the sockets concerned. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: resolve race problem at unicast message receptionJon Paul Maloy11-241/+372
TIPC handles message cardinality and sequencing at the link layer, before passing messages upwards to the destination sockets. During the upcall from link to socket no locks are held. It is therefore possible, and we see it happen occasionally, that messages arriving in different threads and delivered in sequence still bypass each other before they reach the destination socket. This must not happen, since it violates the sequentiality guarantee. We solve this by adding a new input buffer queue to the link structure. Arriving messages are added safely to the tail of that queue by the link, while the head of the queue is consumed, also safely, by the receiving socket. Sequentiality is secured per socket by only allowing buffers to be dequeued inside the socket lock. Since there may be multiple simultaneous readers of the queue, we use a 'filter' parameter to reduce the risk that they peek the same buffer from the queue, hence also reducing the risk of contention on the receiving socket locks. This solves the sequentiality problem, and seems to cause no measurable performance degradation. A nice side effect of this change is that lock handling in the functions tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that will enable future simplifications of those functions. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: use existing sk_write_queue for outgoing packet chainJon Paul Maloy1-18/+13
The list for outgoing traffic buffers from a socket is currently allocated on the stack. This forces us to initialize the queue for each sent message, something costing extra CPU cycles in the most critical data path. Later in this series we will introduce a new safe input buffer queue, something that would force us to initialize even the spinlock of the outgoing queue. A closer analysis reveals that the queue always is filled and emptied within the same lock_sock() session. It is therefore safe to use a queue aggregated in the socket itself for this purpose. Since there already exists a queue for this in struct sock, sk_write_queue, we introduce use of that queue in this commit. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: split up function tipc_msg_eval()Jon Paul Maloy3-43/+48
The function tipc_msg_eval() is in reality doing two related, but different tasks. First it tries to find a new destination for named messages, in case there was no first lookup, or if the first lookup failed. Second, it does what its name suggests, evaluating the validity of the message and its destination, and returning an appropriate error code depending on the result. This is confusing, and in this commit we choose to break it up into two functions. A new function, tipc_msg_lookup_dest(), first attempts to find a new destination, if the message is of the right type. If this lookup fails, or if the message should not be subject to a second lookup, the already existing tipc_msg_reverse() is called. This function performs prepares the message for rejection, if applicable. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: enqueue arrived buffers in socket in separate functionJon Paul Maloy1-15/+31
The code for enqueuing arriving buffers in the function tipc_sk_rcv() contains long code lines and currently goes to two indentation levels. As a cosmetic preparaton for the next commits, we break it out into a separate function. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: simplify message forwarding and rejection in socket layerJon Paul Maloy1-62/+58
Despite recent improvements, the handling of error codes and return values at reception of messages in the socket layer is still confusing. In this commit, we try to make it more comprehensible. First, we separate between the return values coming from the functions called by tipc_sk_rcv(), -those are TIPC specific error codes, and the return values returned by tipc_sk_rcv() itself. Second, we don't use the returned TIPC error code as indication for whether a buffer should be forwarded/rejected or not; instead we use the buffer pointer passed along with filter_msg(). This separation is necessary because we sometimes want to forward messages even when there is no error (i.e., protocol messages and successfully secondary looked up data messages). Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: reduce usage of context info in socket and linkJon Paul Maloy10-91/+98
The most common usage of namespace information is when we fetch the own node addess from the net structure. This leads to a lot of passing around of a parameter of type 'struct net *' between functions just to make them able to obtain this address. However, in many cases this is unnecessary. The own node address is readily available as a member of both struct tipc_sock and tipc_link, and can be fetched from there instead. The fact that the vast majority of functions in socket.c and link.c anyway are maintaining a pointer to their respective base structures makes this option even more compelling. In this commit, we introduce the inline functions tsk_own_node() and link_own_node() to make it easy for functions to fetch the node address from those structs instead of having to pass along and dereference the namespace struct. In particular, we make calls to the msg_xx() functions in msg.{h,c} context independent by directly passing them the own node address as parameter when needed. Those functions should be regarded as leaves in the code dependency tree, and it is hence desirable to keep them namspace unaware. Apart from a potential positive effect on cache behavior, these changes make it easier to introduce the changes that will follow later in this series. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05pktgen: fix UDP checksum computationSabrina Dubroca1-8/+8
This patch fixes two issues in UDP checksum computation in pktgen. First, the pseudo-header uses the source and destination IP addresses. Currently, the ports are used for IPv4. Second, the UDP checksum covers both header and data. So we need to generate the data earlier (move pktgen_finalize_skb up), and compute the checksum for UDP header + data. Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05net: ipv6: allow explicitly choosing optimistic addressesErik Kline2-3/+20
RFC 4429 ("Optimistic DAD") states that optimistic addresses should be treated as deprecated addresses. From section 2.1: Unless noted otherwise, components of the IPv6 protocol stack should treat addresses in the Optimistic state equivalently to those in the Deprecated state, indicating that the address is available for use but should not be used if another suitable address is available. Optimistic addresses are indeed avoided when other addresses are available (i.e. at source address selection time), but they have not heretofore been available for things like explicit bind() and sendmsg() with struct in6_pktinfo, etc. This change makes optimistic addresses treated more like deprecated addresses than tentative ones. Signed-off-by: Erik Kline <ek@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05flowcache: Fix kernel panic in flow_cache_flush_taskMiroslav Urbanek1-1/+1
flow_cache_flush_task references a structure member flow_cache_gc_work where it should reference flow_cache_flush_task instead. Kernel panic occurs on kernels using IPsec during XFRM garbage collection. The garbage collection interval can be shortened using the following sysctl settings: net.ipv4.xfrm4_gc_thresh=4 net.ipv6.xfrm6_gc_thresh=4 With the default settings, our productions servers crash approximately once a week. With the settings above, they crash immediately. Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware") Reported-by: Tomáš Charvát <tc@excello.cz> Tested-by: Jan Hejl <jh@excello.cz> Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller28-204/+274
Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05xprtrdma: Address sparse complaint in rpcr_to_rdmar()Chuck Lever1-3/+4
With "make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__": linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: warning: incorrect type in initializer (different base types) linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: expected restricted __be32 [usertype] *buffer linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: got unsigned int [usertype] *rq_buffer As far as I can tell this is a false positive. Reported-by: kbuild-all@01.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-02-05sit: fix some __be16/u16 mismatchesEric Dumazet1-4/+4
Fixes following sparse warnings : net/ipv6/sit.c:1509:32: warning: incorrect type in assignment (different base types) net/ipv6/sit.c:1509:32: expected restricted __be16 [usertype] sport net/ipv6/sit.c:1509:32: got unsigned short net/ipv6/sit.c:1514:32: warning: incorrect type in assignment (different base types) net/ipv6/sit.c:1514:32: expected restricted __be16 [usertype] dport net/ipv6/sit.c:1514:32: got unsigned short net/ipv6/sit.c:1711:38: warning: incorrect type in argument 3 (different base types) net/ipv6/sit.c:1711:38: expected unsigned short [unsigned] [usertype] value net/ipv6/sit.c:1711:38: got restricted __be16 [usertype] sport net/ipv6/sit.c:1713:38: warning: incorrect type in argument 3 (different base types) net/ipv6/sit.c:1713:38: expected unsigned short [unsigned] [usertype] value net/ipv6/sit.c:1713:38: got restricted __be16 [usertype] dport Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05net: remove some sparse warningsEric Dumazet1-3/+3
netdev_adjacent_add_links() and netdev_adjacent_del_links() are static. queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05ip6_gre: fix endianness errors in ip6gre_errSabrina Dubroca1-2/+2
info is in network byte order, change it back to host byte order before use. In particular, the current code sets the MTU of the tunnel to a wrong (too big) value. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04Revert "bridge: Let bridge not age 'externally' learnt FDB entries, they are ↵David S. Miller1-1/+1
removed when 'external' entity notifies the aging" This reverts commit 9a05dde59a35eee5643366d3d1e1f43fc9069adb. Requested by Scott Feldman. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04pkt_sched: fq: better control of DDOS trafficEric Dumazet1-2/+17
FQ has a fast path for skb attached to a socket, as it does not have to compute a flow hash. But for other packets, FQ being non stochastic means that hosts exposed to random Internet traffic can allocate million of flows structure (104 bytes each) pretty easily. Not only host can OOM, but lookup in RB trees can take too much cpu and memory resources. This patch adds a new attribute, orphan_mask, that is adding possibility of having a stochastic hash for orphaned skb. Its default value is 1024 slots, to mimic SFQ behavior. Note: This does not apply to locally generated TCP traffic, and no locally generated traffic will share a flow structure with another perfect or stochastic flow. This patch also handles the specific case of SYNACK messages: They are attached to the listener socket, and therefore all map to a single hash bucket. If listener have set SO_MAX_PACING_RATE, hoping to have new accepted socket inherit this rate, SYNACK might be paced and even dropped. This is very similar to an internal patch Google have used more than one year. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04Merge branch 'for-davem' of ↵David S. Miller15-382/+192
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04tcp: do not pace pure ack packetsEric Dumazet2-3/+17
When we added pacing to TCP, we decided to let sch_fq take care of actual pacing. All TCP had to do was to compute sk->pacing_rate using simple formula: sk->pacing_rate = 2 * cwnd * mss / rtt It works well for senders (bulk flows), but not very well for receivers or even RPC : cwnd on the receiver can be less than 10, rtt can be around 100ms, so we can end up pacing ACK packets, slowing down the sender. Really, only the sender should pace, according to its own logic. Instead of adding a new bit in skb, or call yet another flow dissection, we tweak skb->truesize to a small value (2), and we instruct sch_fq to use new helper and not pace pure ack. Note this also helps TCP small queue, as ack packets present in qdisc/NIC do not prevent sending a data packet (RPC workload) This helps to reduce tx completion overhead, ack packets can use regular sock_wfree() instead of tcp_wfree() which is a bit more expensive. This has no impact in the case packets are sent to loopback interface, as we do not coalesce ack packets (were we would detect skb->truesize lie) In case netem (with a delay) is used, skb_orphan_partial() also sets skb->truesize to 1. This patch is a combination of two patches we used for about one year at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04netfilter: Use rhashtable walk iteratorHerbert Xu1-17/+36
This patch gets rid of the manual rhashtable walk in nft_hash which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. Note that I'm leaving nft_hash_destroy alone since it's only invoked on shutdown and it shouldn't be affected by changes to rhashtable internals (or at least not what I'm planning to change). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04netlink: Use rhashtable walk iteratorHerbert Xu1-66/+64
This patch gets rid of the manual rhashtable walk in netlink which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. In fact the existing code was very buggy. Some sockets weren't shown at all while others were shown more than once. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04cls_api.c: Fix dumping of non-existing actions' stats.Ignacy Gawędzki1-3/+4
In tcf_exts_dump_stats(), ensure that exts->actions is not empty before accessing the first element of that list and calling tcf_action_copy_stats() on it. This fixes some random segvs when adding filters of type "basic" with no particular action. This also fixes the dumping of those "no-action" filters, which more often than not made calls to tcf_action_copy_stats() fail and consequently netlink attributes added by the caller to be removed by a call to nla_nest_cancel(). Fixes: 33be62715991 ("net_sched: act: use standard struct list_head") Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04pkt_sched: fq: avoid hang when quantum 0Kenneth Klette Jonassen1-2/+8
Configuring fq with quantum 0 hangs the system, presumably because of a non-interruptible infinite loop. Either way quantum 0 does not make sense. Reproduce with: sudo tc qdisc add dev lo root fq quantum 0 initial_quantum 0 ping 127.0.0.1 Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04net/core: Add event for a change in slave stateMoni Shoua2-0/+21
Add event which provides an indication on a change in the state of a bonding slave. The event handler should cast the pointer to the appropriate type (struct netdev_bonding_info) in order to get the full info about the slave. Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04tipc: separate link starting event from link timeout eventJon Paul Maloy1-1/+3
When a new link instance is created, it is trigged to start by sending it a TIPC_STARTING_EVT, whereafter a regular link reset is applied to it. The starting event is codewise treated as a timeout event, and prompts a link RESET message to be sent to the peer node, carrying a link session identifier. The later link_reset() call nudges this session identifier, whereafter all subsequent RESET messages will be sent out with the new identifier. The latter session number overrides the former, causing the peer to unconditionally accept it irrespective of its current working state. We don't think that this causes any problem, but it is not in accordance with the protocol spec, and may cause confusion when debugging TIPC sessions. To avoid this, we make the starting event distinct from the subsequent timeout events, by not allowing the former to send out any RESET message. This eliminates the described problem. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04tipc: eliminate race during node creationJon Paul Maloy2-13/+7
Instances of struct node are created in the function tipc_disc_rcv() under the assumption that there is no race between received discovery messages arriving from the same node. This assumption is wrong. When we use more than one bearer, it is possible that discovery messages from the same node arrive at the same moment, resulting in creation of two instances of struct tipc_node. This may later cause confusion during link establishment, and may result in one of the links never becoming activated. We fix this by making lookup and potential creation of nodes atomic. Instead of first looking up the node, and in case of failure, create it, we now start with looking up the node inside node_link_create(), and return a reference to that one if found. Otherwise, we go ahead and create the node as we did before. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04tipc: avoid stale link after aborted failoverJon Paul Maloy2-0/+5
During link failover it may happen that the remaining link goes down while it is still in the process of taking over traffic from a previously failed link. When this happens, we currently abort the failover procedure and reset the first failed link to non-failover mode, so that it will be ready to re-establish contact with its peer when it comes available. However, if the first link goes down because its bearer was manually disabled, it is not enough to reset it; it must also be deleted; which is supposed to happen when the failover procedure is finished. Otherwise it will remain a zombie link: attached to the owner node structure, in mode LINK_STOPPED, and permanently blocking any re- establishing of the link to the peer via the interface in question. We fix this by amending the failover abort procedure. Apart from resetting the link to non-failover state, we test if the link is also in LINK_STOPPED mode. If so, we delete it, using the conditional tipc_link_delete() function introduced in the previous commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04tipc: add reference count to struct tipc_linkJon Paul Maloy2-31/+50
When a bearer is disabled, all pertaining links will be reset and deleted. However, if there is a second active link towards a killed link's destination, the delete has to be postponed until the failover is finished. During this interval, we currently put the link in zombie mode, i.e., we take it out of traffic, delete its timer, but leave it attached to the owner node structure until all missing packets have been received. When this is done, we detach the link from its node and delete it, assuming that the synchronous timer deletion that was initiated earlier in a different thread has finished. This is unsafe, as the failover may finish before del_timer_sync() has returned in the other thread. We fix this by adding an atomic reference counter of type kref in struct tipc_link. The counter keeps track of the references kept to the link by the owner node and the timer. We then do a conditional delete, based on the reference counter, both after the failover has been finished and when the timer expires, if applicable. Whoever comes last, will actually delete the link. This approach also implies that we can make the deletion of the timer asynchronous. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04net: rds: use correct size for max unacked packets and bytesSasha Levin1-2/+2
Max unacked packets/bytes is an int while sizeof(long) was used in the sysctl table. This means that when they were getting read we'd also leak kernel memory to userspace along with the timeout values. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04Merge tag 'mac80211-next-for-davem-2015-02-03' of ↵David S. Miller32-160/+1325
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Last round of updates for net-next: * revert a patch that caused a regression with mesh userspace (Bob) * fix a number of suspend/resume related races (from Emmanuel, Luca and myself - we'll look at backporting later) * add software implementations for new ciphers (Jouni) * add a new ACPI ID for Broadcom's rfkill (Mika) * allow using netns FD for wireless (Vadim) * some other cleanups (various) Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04Merge branch 'for-upstream' of ↵David S. Miller9-165/+532
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2015-02-03 Here's what's likely the last bluetooth-next pull request for 3.20. Notable changes include: - xHCI workaround + a new id for the ath3k driver - Several new ids for the btusb driver - Support for new Intel Bluetooth controllers - Minor cleanups to ieee802154 code - Nested sleep warning fix in socket accept() code path - Fixes for Out of Band pairing handling - Support for LE scan restarting for HCI_QUIRK_STRICT_DUPLICATE_FILTER - Improvements to data we expose through debugfs - Proper handling of Hardware Error HCI events Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04net: add skb functions to process remote checksum offloadTom Herbert1-16/+2
This patch adds skb_remcsum_process and skb_gro_remcsum_process to perform the appropriate adjustments to the skb when receiving remote checksum offload. Updated vxlan and gue to use these functions. Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did not see any change in performance. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04bridge: Let bridge not age 'externally' learnt FDB entries, they are removed ↵Siva Mannem1-1/+1
when 'external' entity notifies the aging When 'learned_sync' flag is turned on, the offloaded switch port syncs learned MAC addresses to bridge's FDB via switchdev notifier (NETDEV_SWITCH_FDB_ADD). Currently, FDB entries learnt via this mechanism are wrongly being deleted by bridge aging logic. This patch ensures that FDB entries synced from offloaded switch ports are not deleted by bridging logic. Such entries can only be deleted via switchdev notifier (NETDEV_SWITCH_FDB_DEL). Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04xps: fix xps for stacked devicesEric Dumazet2-1/+10
A typical qdisc setup is the following : bond0 : bonding device, using HTB hierarchy eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc XPS allows to spread packets on specific tx queues, based on the cpu doing the send. Problem is that dequeues from bond0 qdisc can happen on random cpus, due to the fact that qdisc_run() can dequeue a batch of packets. CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1 CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0 CPUB -> dequeue packet P1 from bond0 enqueue packet on eth1/eth2 CPUC -> dequeue packet P2 from bond0 enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0) get_xps_queue() then might select wrong queue for P1, since current cpu might be different than CPUA. P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping), if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock) Effect of this bug is TCP reorders, and more generally not optimal TX queue placement. (A victim bulk flow can be migrated to the wrong TX queue for a while) To fix this, we have to record sender cpu number the first time dev_queue_xmit() is called for one tx skb. We can union napi_id (used on receive path) and sender_cpu, granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for this union idea) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04NFC: nci: Move NFCEE discovery logicChristophe Ricard1-1/+7
NFCEE_DISCOVER_CMD is a specified NCI command used to discover NFCEE IDs. Move nci_nfcee_discover() call to nci_discover_se() in order to guarantee: - NFCEE_DISCOVER_CMD run when the NCI state machine is initialized - NFCEE_DISCOVER_CMD is not run in case there is not discover_se hook defined by a NFC device driver. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-04NFC: nci: Move logical connection structure allocationChristophe Ricard3-26/+34
conn_info is currently allocated only after nfcee_discovery_ntf which is not generic enough for logical connection other than NFCEE. The corresponding conn_info is now created in nci_core_conn_create_rsp(). Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-04NFC: nci: Change credits field to credits_cntChristophe Ricard1-1/+1
For consistency sake change nci_core_conn_create_rsp structure credits field to credits_cnt. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-04NFC: nci: Support all destinations type when creating a connectionChristophe Ricard1-15/+33
The current implementation limits nci_core_conn_create_req() to only manage NCI_DESTINATION_NFCEE. Add new parameters to nci_core_conn_create() to support all destination types described in the NCI specification. Because there are some parameters with variable size dynamic buffer allocation is needed. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-04NFC: nci: Add reference to the RF logical connectionChristophe Ricard3-7/+5
The NCI_STATIC_RF_CONN_ID logical connection is the most used connection. Keeping it directly accessible in the nci_dev structure will simplify and optimize the access. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-03ipv6: Select fragment id during UFO segmentation if not set.Vlad Yasevich3-21/+44
If the IPv6 fragment id has not been set and we perform fragmentation due to UFO, select a new fragment id. We now consider a fragment id of 0 as unset and if id selection process returns 0 (after all the pertrubations), we set it to 0x80000000, thus giving us ample space not to create collisions with the next packet we may have to fragment. When doing UFO integrity checking, we also select the fragment id if it has not be set yet. This is stored into the skb_shinfo() thus allowing UFO to function correclty. This patch also removes duplicate fragment id generation code and moves ipv6_select_ident() into the header as it may be used during GSO. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04net: switch sockets to ->read_iter/->write_iterAl Viro1-29/+27
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04net/socket.c: fold do_sock_{read,write} into callersAl Viro1-35/+21
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04net: bury net/core/iovec.c - nothing in there is used anymoreAl Viro2-138/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04tipc: tipc ->sendmsg() conversionAl Viro2-7/+14
This one needs to copy the same data from user potentially more than once. Sadly, MTU changes can trigger that ;-/ Cc: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter()Al Viro5-14/+12
That takes care of the majority of ->sendmsg() instances - most of them via memcpy_to_msg() or assorted getfrag() callbacks. One place where we still keep memcpy_fromiovecend() is tipc - there we potentially read the same data over and over; separate patch, that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04ip: convert tcp_sendmsg() to iov_iter primitivesAl Viro3-131/+115
patch is actually smaller than it seems to be - most of it is unindenting the inner loop body in tcp_sendmsg() itself... the bit in tcp_input.c is going to get reverted very soon - that's what memcpy_from_msg() will become, but not in this commit; let's keep it reasonably contained... There's one potentially subtle change here: in case of short copy from userland, mainline tcp_send_syn_data() discards the skb it has allocated and falls back to normal path, where we'll send as much as possible after rereading the same data again. This patch trims SYN+data skb instead - that way we don't need to copy from the same place twice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04ip: stash a pointer to msghdr in struct ping_fakehdrAl Viro2-6/+4
... instead of storing its ->mgs_iter.iov there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04rxrpc: make the users of rxrpc_kernel_send_data() set kvec-backed msg_iter ↵Al Viro1-3/+0
properly Use iov_iter_kvec() there, get rid of set_fs() games - now that rxrpc_send_data() uses iov_iter primitives, it'll handle ITER_KVEC just fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04rxrpc: switch rxrpc_send_data() to iov_iter primitivesAl Viro1-33/+10
Convert skb_add_data() to iov_iter; allows to get rid of the explicit messing with iovec in its only caller - skb_add_data() will keep advancing ->msg_iter for us, so there's no need to similate that manually. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04vmci: propagate msghdr all way down to __qp_memcpy_to_queue()Al Viro1-2/+1
Switch from passing msg->iov_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04ipv6: rawv6_send_hdrinc(): pass msghdrAl Viro1-4/+3
Switch from passing msg->iov_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04ipv4: raw_send_hdrinc(): pass msghdrAl Viro1-4/+3
Switch from passing msg->iov_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04netlink: make the check for "send from tx_ring" deterministicAl Viro1-0/+5
As it is, zero msg_iovlen means that the first iovec in the kernel array of iovecs is left uninitialized, so checking if its ->iov_base is NULL is random. Since the real users of that thing are doing sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and msg_iov[0] = {NULL, 0}, which is what this test is trying to catch. As suggested by davem, let's just check that msg_iovlen was 1 and msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches what we want to catch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-03netlabel: Less function calls in netlbl_mgmt_add_common() after error detectionMarkus Elfring1-25/+24
The functions "cipso_v4_doi_putdef" and "kfree" could be called in some cases by the netlbl_mgmt_add_common() function during error handling even if the passed variables contained still a null pointer. * This implementation detail could be improved by adjustments for jump labels. * Let us return immediately after the first failed function call according to the current Linux coding style convention. * Let us delete also an unnecessary check for the variable "entry" there. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-03netlabel: Deletion of an unnecessary check before the function call ↵Markus Elfring1-2/+1
"cipso_v4_doi_free" The cipso_v4_doi_free() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-03netlabel: Deletion of an unnecessary check before the function call ↵Markus Elfring1-2/+1
"cipso_v4_doi_putdef" The cipso_v4_doi_putdef() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-03SUNRPC: NULL utsname dereference on NFS umount during namespace cleanupTrond Myklebust2-7/+13
Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, which now apparently happens after the utsname has been freed. Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: stable@vger.kernel.org # 3.18 Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-03Merge branch 'flexfiles'Trond Myklebust1-7/+19
* flexfiles: (53 commits) pnfs: lookup new lseg at lseg boundary nfs41: .init_read and .init_write can be called with valid pg_lseg pnfs: Update documentation on the Layout Drivers pnfs/flexfiles: Add the FlexFile Layout Driver nfs: count DIO good bytes correctly with mirroring nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags nfs/flexfiles: send layoutreturn before freeing lseg nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE nfs41: allow async version layoutreturn nfs41: add range to layoutreturn args pnfs: allow LD to ask to resend read through pnfs nfs: add nfs_pgio_current_mirror helper nfs: only reset desc->pg_mirror_idx when mirroring is supported nfs41: add a debug warning if we destroy an unempty layout pnfs: fail comparison when bucket verifier not set nfs: mirroring support for direct io nfs: add mirroring support to pgio layer pnfs: pass ds_commit_idx through the commit path ... Conflicts: fs/nfs/pnfs.c fs/nfs/pnfs.h
2015-02-03sunrpc: add rpc_count_iostats_idxWeston Andros Adamson1-7/+19
Add a call to tally stats for a task under a different statsidx than what's contained in the task structure. This is needed to properly account for pnfs reads/writes when the DS nfs version != the MDS version. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03Merge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust4-344/+468
NFS: Client side changes for RDMA These patches improve the scalability of the NFSoRDMA client and take large variables off of the stack. Additionally, the GFP_* flags are updated to match what TCP uses. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (21 commits) xprtrdma: Update the GFP flags used in xprt_rdma_allocate() xprtrdma: Clean up after adding regbuf management xprtrdma: Allocate zero pad separately from rpcrdma_buffer xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req xprtrdma: Add struct rpcrdma_regbuf and helpers xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy() xprtrdma: Simplify synopsis of rpcrdma_buffer_create() xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack xprtrdma: Take struct ib_device_attr off the stack xprtrdma: Free the pd if ib_query_qp() fails xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt xprtrdma: Move credit update to RPC reply handler xprtrdma: Remove rl_mr field, and the mr_chunk union xprtrdma: Remove rpcrdma_ep::rep_ia xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt xprtrdma: Clean up hdrlen xprtrdma: Display XIDs in host byte order xprtrdma: Modernize htonl and ntohl ...
2015-02-03net: rfkill: Add Broadcom BCM2E40 bluetooth ACPI IDMika Westerberg1-0/+1
This is yet another Broadcom bluetooth chip with ACPI ID BCM2E40. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-02-03Bluetooth: Fix potential NULL dereferenceJohan Hedberg1-4/+3
The bnep_get_device function may be triggered by an ioctl just after a connection has gone down. In such a case the respective L2CAP chan->conn pointer will get set to NULL (by l2cap_chan_del). This patch adds a missing NULL check for this case in the bnep_get_device() function. Reported-by: Patrik Flykt <patrik.flykt@linux.intel.com> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller6-63/+118
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Validate hooks for nf_tables NAT expressions, otherwise users can crash the kernel when using them from the wrong hook. We already got one user trapped on this when configuring masquerading. 2) Fix a BUG splat in nf_tables with CONFIG_DEBUG_PREEMPT=y. Reported by Andreas Schultz. 3) Avoid unnecessary reroute of traffic in the local input path in IPVS that triggers a crash in in xfrm. Reported by Florian Wiessner and fixes by Julian Anastasov. 4) Fix memory and module refcount leak from the error path of nf_tables_newchain(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net: sctp: Deletion of an unnecessary check before the function call "kfree"Markus Elfring1-2/+1
The kfree() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-By: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: Allow for partial checksums on non-ufo packetsVlad Yasevich1-1/+10
Currntly, if we are not doing UFO on the packet, all UDP packets will start with CHECKSUM_NONE and thus perform full checksum computations in software even if device support IPv6 checksum offloading. Let's start start with CHECKSUM_PARTIAL if the device supports it and we are sending only a single packet at or below mtu size. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02udpv6: Add lockless sendmsg() supportVlad Yasevich1-4/+20
This commit adds the same functionaliy to IPv6 that commit 903ab86d195cca295379699299c5fc10beba31c7 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue Mar 1 02:36:48 2011 +0000 udp: Add lockless transmit path added to IPv4. UDP transmit path can now run without a socket lock, thus allowing multiple threads to send to a single socket more efficiently. This is only used when corking/MSG_MORE is not used. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: Introduce udpv6_send_skb()Vlad Yasevich1-27/+40
Now that we can individually construct IPv6 skbs to send, add a udpv6_send_skb() function to populate the udp header and send the skb. This allows udp_v6_push_pending_frames() to re-use this function as well as enables us to add lockless sendmsg() support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: introduce ipv6_make_skbVlad Yasevich1-19/+84
This commit is very similar to commit 1c32c5ad6fac8cee1a77449f5abf211e911ff830 Author: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue Mar 1 02:36:47 2011 +0000 inet: Add ip_make_skb and ip_finish_skb It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb. The job of ip6_make_skb is to collect messages into an ipv6 packet and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit the generated skb. Together they replicated the job of ip6_push_pending_frames() while also provide the capability to be called independently. This will be needed to add lockless UDP sendmsg support. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: Append sending data to arbitrary queueVlad Yasevich1-39/+67
Add the ability to append data to arbitrary queue. This will be needed later to implement lockless UDP sends. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02ipv6: pull cork initialization into its own function.Vlad Yasevich1-69/+89
Pull IPv6 cork initialization into its own function that can be re-used. IPv6 specific cork data did not have an explicit data structure. This patch creats eone so that just ipv6 cork data can be as arguemts. Also, since IPv6 tries to save the flow label into inet_cork_full tructure, pass the full cork. Adjust ip6_cork_release() to take cork data structures. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net: dctcp: loosen requirement to assert ECT(0) during 3WHSFlorian Westphal1-9/+5
One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on switch side is to split DCTCP and TCP traffic in two queues per switch port based on the DSCP: one queue soley intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's f.e. a classic tail drop queue. As already explained in e3118e8359bb, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences as for non-ECT(0) packets, traversing the RED marking DCTCP queue will result in a severe reduction of connection probability. This is due to the DCTCP queue being dominated by ECT(0) traffic and switches handle non-ECT traffic in the RED marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is being considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit e3118e8359bb implements this on Linux as ECT(0) marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD hit also their mainline kernel [2]. In order to let them play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS as not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD. Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net-timestamp: no-payload only sysctlWillem de Bruijn3-1/+32
Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload options SOF_TIMESTAMPING_OPT_TSONLY is set. Add a sysctl that optionally drops looped timestamp with data. This only affects processes without CAP_NET_RAW. The policy is checked when timestamps are generated in the stack. It is possible for timestamps with data to be reported after the sysctl is set, if these were queued internally earlier. No vulnerability is immediately known that exploits knowledge gleaned from packet headers, but it may still be preferable to allow administrators to lock down this path at the cost of possible breakage of legacy applications. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (v1 -> v2) - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW) (rfc -> v1) - document the sysctl in Documentation/sysctl/net.txt - fix access control race: read .._OPT_TSONLY only once, use same value for permission check and skb generation. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02net-timestamp: no-payload optionWillem de Bruijn4-11/+25
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit timestamps, this loops timestamps on top of empty packets. Doing so reduces the pressure on SO_RCVBUF. Payload inspection and cmsg reception (aside from timestamps) are no longer possible. This works together with a follow on patch that allows administrators to only allow tx timestamping if it does not loop payload or metadata. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (rfc -> v1) - add documentation - remove unnecessary skb->len test (thanks to Richard Cochran) Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-02NFC: nci: Change NCI state machine to LISTEN_ACTIVEChristophe Ricard1-0/+8
When receiving an interface activation notification, if the RF interface is NCI_RF_INTERFACE_NFCEE_DIRECT, we need to ignore the following parameters and change the NCI state machine to NCI_LISTEN_ACTIVE. According to the NCI specification, the parameters should be 0 and shall be ignored. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add RF NFCEE action notification supportChristophe Ricard1-0/+11
The NFCC sends an NCI_OP_RF_NFCEE_ACTION_NTF notification to the host (DH) to let it know that for example an RF transaction with a payment reader is done. For now the notification handler is empty. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: Forward NFC_EVT_TRANSACTION to user spaceChristophe Ricard3-0/+70
NFC_EVT_TRANSACTION is sent through netlink in order for a specific application running on a secure element to notify userspace of an event. Typically the secure element application counterpart on the host could interpret that event and act upon it. Forwarded information contains: - SE host generating the event - Application IDentifier doing the operation - Applications parameters Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add HCI over NCI protocol supportChristophe Ricard4-18/+722
According to the NCI specification, one can use HCI over NCI to talk with specific NFCEE. The HCI network is viewed as one logical NFCEE. This is needed to support secure element running HCI only firmwares embedded on an NCI capable chipset, like e.g. the st21nfcb. There is some duplication between this piece of code and the HCI core code, but the latter would need to be abstracted even more to be able to use NCI as a logical transport for HCP packets. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Support logical connections managementChristophe Ricard2-0/+85
In order to communicate with an NFCEE, we need to open a logical connection to it, by sending the NCI_OP_CORE_CONN_CREATE_CMD command to the NFCC. It's left up to the drivers to decide when to close an already opened logical connection. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add NFCEE enabling and disabling supportChristophe Ricard2-0/+34
NFCEEs can be enabled or disabled by sending the NCI_OP_NFCEE_MODE_SET_CMD command to the NFCC. This patch provides an API for drivers to enable and disable e.g. their NCI discoveredd secure elements. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add NFCEE discover supportChristophe Ricard3-0/+68
NFCEEs (NFC Execution Environment) have to be explicitly discovered by sending the NCI_OP_NFCEE_DISCOVER_CMD command. The NFCC will respond to this command by telling us how many NFCEEs are connected to it. Then the NFCC sends a notification command for each and every NFCEE connected. Here we implement support for sending NCI_OP_NFCEE_DISCOVER_CMD command, receiving the response and the potential notifications. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02NFC: nci: Add dynamic logical connections supportChristophe Ricard4-33/+127
The current NCI core only support the RF static connection. For other NFC features such as Secure Element communication, we may need to create logical connections to the NFCEE (Execution Environment. In order to track each logical connection ID dynamically, we add a linked list of connection info pointers to the nci_dev structure. Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-02-02Bluetooth: Remove mgmt_rp_read_local_oob_ext_data structJohan Hedberg1-16/+9
This extended return parameters struct conflicts with the new Read Local OOB Extended Data command definition. To avoid the conflict simply rename the old "extended" version to the normal one and update the code appropriately to take into account the two possible response PDU sizes. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02Merge branch 'locks-3.20' of git://git.samba.org/jlayton/linux into for-3.20J. Bruce Fields22-60/+123
Christoph's block pnfs patches have some minor dependencies on these lock patches.
2015-02-02Bluetooth: Add restarting to service discoveryJakub Pawlowski1-5/+48
When using LE_SCAN_FILTER_DUP_ENABLE, some controllers would send advertising report from each LE device only once. That means that we don't get any updates on RSSI value, and makes Service Discovery very slow. This patch adds restarting scan when in Service Discovery, and device with filtered uuid is found, but it's not in RSSI range to send event yet. This way if device moves into range, we will quickly get RSSI update. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-02Bluetooth: Add le_scan_restart work for LE scan restartingJakub Pawlowski2-1/+92
Currently there is no way to restart le scan, and it's needed in service scan method. The way it work: it disable, and then enable le scan on controller. During the restart, we must remember when the scan was started, and it's duration, to later re-schedule the le_scan_disable work, that was stopped during the stop scan phase. Signed-off-by: Jakub Pawlowski <jpawlowski@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-02-01bridge: offload bridge port attributes to switch asic if feature flag setRoopa Prabhu1-3/+23
This patch adds support to set/del bridge port attributes in hardware from the bridge driver. With this, when the user sends a bridge setlink message with no flags or master flags set, - the bridge driver ndo_bridge_setlink handler sets settings in the kernel - calls the swicthdev api to propagate the attrs to the switchdev hardware You can still use the self flag to go to the switch hw or switch port driver directly. With this, it also makes sure a notification goes out only after the attributes are set both in the kernel and hw. The patch calls switchdev api only if BRIDGE_FLAGS_SELF is not set. This is because the offload cases with BRIDGE_FLAGS_SELF are handled in the caller (in rtnetlink.c). Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01swdevice: add new apis to set and del bridge port attributesRoopa Prabhu1-0/+110
This patch adds two new api's netdev_switch_port_bridge_setlink and netdev_switch_port_bridge_dellink to offload bridge port attributes to switch port (The names of the apis look odd with 'switch_port_bridge', but am more inclined to change the prefix of the api to something else. Will take any suggestions). The api's look at the NETIF_F_HW_SWITCH_OFFLOAD feature flag to pass bridge port attributes to the port device. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellinkRoopa Prabhu3-8/+10
bridge flags are needed inside ndo_bridge_setlink/dellink handlers to avoid another call to parse IFLA_AF_SPEC inside these handlers This is used later in this series Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01ipv4: tcp: get rid of ugly unicast_sockEric Dumazet2-32/+35
In commit be9f4a44e7d41 ("ipv4: tcp: remove per net tcp_sock") I tried to address contention on a socket lock, but the solution I chose was horrible : commit 3a7c384ffd57e ("ipv4: tcp: unicast_sock should not land outside of TCP stack") addressed a selinux regression. commit 0980e56e506b ("ipv4: tcp: set unicast_sock uc_ttl to -1") took care of another regression. commit b5ec8eeac46 ("ipv4: fix ip_send_skb()") fixed another regression. commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate") was another shot in the dark. Really, just use a proper socket per cpu, and remove the skb_orphan() call, to re-enable flow control. This solves a serious problem with FQ packet scheduler when used in hostile environments, as we do not want to allocate a flow structure for every RST packet sent in response to a spoofed packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01Bluetooth: Fix OOB data present for BR/EDR Secure Connections Only modeMarcel Holtmann1-17/+21
When using Secure Connections Only mode, then only P-256 OOB data is valid and should be provided. In case userspace provides P-192 and P-256 OOB data, then the P-192 values will be set to zero. However the present value of the IO capability exchange still mentioned that both values would be available. Fix this by telling the controller clearly that only the P-256 OOB data is present. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose remote OOB information as debugfs entryMarcel Holtmann1-0/+31
For debugging purposes it is good to know which OOB data is actually currently loaded for each controller. So expose that list via debugfs. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose hardware error code as debugfs entryMarcel Holtmann1-0/+3
When the Hardware Error event is send by the controller, the Bluetooth core stores the error code. Expose it via debugfs so it can be retrieved later on. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose debug keys usage setting via debugfsMarcel Holtmann1-0/+22
To allow easier debugging when debug keys are generated, provide debugfs entry for checking the setting of debug keys usage. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Track changes from HCI Write Simple Pairing Debug Mode commandMarcel Holtmann1-0/+19
When the HCI Write Simple Pairing Debug Mode command has been issued, the result needs to be tracked and stored. The hdev->ssp_debug_mode variable is already present, but was never updated when the mode in the controller was actually changed. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-02-01Bluetooth: Expose Secure Simple Pairing debug mode setting in debugfsMarcel Holtmann1-1/+22
The value of the ssp_debug_mode should be accessible via debugfs to be able to determine if a BR/EDR controller generates debugs keys or not. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-01-31ipv4: icmp: use percpu allocationEric Dumazet1-9/+8
Get rid of nr_cpu_ids and use modern percpu allocation. Note that the sockets themselves are not yet allocated using NUMA affinity. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-31tcp: use SACK RTTs for CCKenneth Klette Jonassen1-2/+4
Current behavior only passes RTTs from sequentially acked data to CC. If sender gets a combined ACK for segment 1 and SACK for segment 3, then the computed RTT for CC is the time between sending segment 1 and receiving SACK for segment 3. Pass the minimum computed RTT from any acked data to CC, i.e. time between sending segment 3 and receiving SACK for segment 3. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-31Bluetooth: Allow remote OOB data to only provide P-192 or P-256 valuesMarcel Holtmann1-4/+25
In case the remote only provided P-192 or P-256 data for OOB pairing, then make sure that the data value pointers are correctly set. That way the core can provide correct information when remote OOB data present information have to be communicated. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>