Age  Commit message  Author  Files  Lines
2022-06-29compat: drop CentOS 8 Stream supportHEADmasterJason A. Donenfeld2-7/+1
Nobody uses this and it's impossible to maintain given the current CI situation. The RHEL 7 and 8 releases remain supported for now, though that might not always be the case. See the link for details. Link: https://lists.zx2c4.com/pipermail/wireguard/2022-June/007664.html Suggested-by: Philip J. Perry <phil@elrepo.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-28compat: do not backport ktime_get_coarse_boottime_ns to c8sJason A. Donenfeld1-2/+2
Also bump the c8s version stamp. Reported-by: Vladimír Beneš <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-27version: bumpv1.0.20220627Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-06-22compat: handle backported rng and blake2sJason A. Donenfeld2-6/+8
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05qemu: give up on RHEL8 in CIJason A. Donenfeld1-6/+0
They keep breaking their kernel and being difficult when I send patches to fix it, so just give up on trying to support this in the CI. It'll bitrot and people will complain and we'll see what happens at that point. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05qemu: set panic_on_warn=1 from cmdlineJason A. Donenfeld14-19/+13
Rather than setting this once init is running, set panic_on_warn from the kernel command line, so that it catches splats from WireGuard initialization code and the various crypto selftests. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05qemu: use vports on armJason A. Donenfeld5-6/+25
Rather than having to hack up QEMU, just use the virtio serial device. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05netns: limit parallelism to $(nproc) tests at onceJason A. Donenfeld1-10/+10
The parallel tests were added to catch queueing issues from multiple cores. But what happens in reality when testing tons of processes is that these separate threads wind up fighting with the scheduler, and we wind up with contention in places we don't care about, which decreases the chances of hitting a bug. So just do a test with the number of CPU cores, rather than trying to scale up arbitrarily. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-05netns: make routing loop test non-fatalJason A. Donenfeld1-1/+13
I hate to do this, but I still do not have a good solution to actually fix this bug across architectures. So just disable it for now, so that the CI can still deliver actionable results. This commit adds a large red warning, so that at least the failure isn't lost forever, and hopefully this can be revisited down the line. Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/ Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/ Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-14device: check for metadata_dst with skb_valid_dst()Nikolay Aleksandrov3-1/+9
When we try to transmit an skb with md_dst attached through wireguard we hit a null pointer dereference in wg_xmit() due to the use of dst_mtu() which calls into dst_blackhole_mtu() which in turn tries to dereference dst->dev. Since wireguard doesn't use md_dsts we should use skb_valid_dst(), which checks for DST_METADATA flag, and if it's set, then falls back to wireguard's device mtu. That gives us the best chance of transmitting the packet; otherwise if the blackhole netdev is used we'd get ETH_MIN_MTU. [ 263.693506] BUG: kernel NULL pointer dereference, address: 00000000000000e0 [ 263.693908] #PF: supervisor read access in kernel mode [ 263.694174] #PF: error_code(0x0000) - not-present page [ 263.694424] PGD 0 P4D 0 [ 263.694653] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 263.694876] CPU: 5 PID: 951 Comm: mausezahn Kdump: loaded Not tainted 5.18.0-rc1+ #522 [ 263.695190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [ 263.695529] RIP: 0010:dst_blackhole_mtu+0x17/0x20 [ 263.695770] Code: 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 10 48 83 e0 fc 8b 40 04 85 c0 75 09 48 8b 07 <8b> 80 e0 00 00 00 c3 66 90 0f 1f 44 00 00 48 89 d7 be 01 00 00 00 [ 263.696339] RSP: 0018:ffffa4a4422fbb28 EFLAGS: 00010246 [ 263.696600] RAX: 0000000000000000 RBX: ffff8ac9c3553000 RCX: 0000000000000000 [ 263.696891] RDX: 0000000000000401 RSI: 00000000fffffe01 RDI: ffffc4a43fb48900 [ 263.697178] RBP: ffffa4a4422fbb90 R08: ffffffff9622635e R09: 0000000000000002 [ 263.697469] R10: ffffffff9b69a6c0 R11: ffffa4a4422fbd0c R12: ffff8ac9d18b1a00 [ 263.697766] R13: ffff8ac9d0ce1840 R14: ffff8ac9d18b1a00 R15: ffff8ac9c3553000 [ 263.698054] FS: 00007f3704c337c0(0000) GS:ffff8acaebf40000(0000) knlGS:0000000000000000 [ 263.698470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 263.698826] CR2: 00000000000000e0 CR3: 0000000117a5c000 CR4: 00000000000006e0 [ 263.699214] Call Trace: [ 263.699505] <TASK> [ 263.699759] 
wg_xmit+0x411/0x450 [ 263.700059] ? bpf_skb_set_tunnel_key+0x46/0x2d0 [ 263.700382] ? dev_queue_xmit_nit+0x31/0x2b0 [ 263.700719] dev_hard_start_xmit+0xd9/0x220 [ 263.701047] __dev_queue_xmit+0x8b9/0xd30 [ 263.701344] __bpf_redirect+0x1a4/0x380 [ 263.701664] __dev_queue_xmit+0x83b/0xd30 [ 263.701961] ? packet_parse_headers+0xb4/0xf0 [ 263.702275] packet_sendmsg+0x9a8/0x16a0 [ 263.702596] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 263.702933] sock_sendmsg+0x5e/0x60 [ 263.703239] __sys_sendto+0xf0/0x160 [ 263.703549] __x64_sys_sendto+0x20/0x30 [ 263.703853] do_syscall_64+0x3b/0x90 [ 263.704162] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 263.704494] RIP: 0033:0x7f3704d50506 [ 263.704789] Code: 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 [ 263.705652] RSP: 002b:00007ffe954b0b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 263.706141] RAX: ffffffffffffffda RBX: 0000558bb259b490 RCX: 00007f3704d50506 [ 263.706544] RDX: 000000000000004a RSI: 0000558bb259b7b2 RDI: 0000000000000003 [ 263.706952] RBP: 0000000000000000 R08: 00007ffe954b0b90 R09: 0000000000000014 [ 263.707339] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe954b0b90 [ 263.707735] R13: 000000000000004a R14: 0000558bb259b7b2 R15: 0000000000000001 [ 263.708132] </TASK> [ 263.708398] Modules linked in: bridge netconsole bonding [last unloaded: bridge] [ 263.708942] CR2: 00000000000000e0 Link: https://github.com/cilium/cilium/issues/19428 Reported-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> [Jason: polyfilled for < 4.3] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
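The fallback described in this commit message can be pictured with a small userspace sketch. The `demo_*` types and the `DEMO_DST_METADATA` flag are hypothetical stand-ins for the kernel's `dst_entry`, `net_device` and `DST_METADATA`, not the real definitions. The sketch only shows the decision: trust the dst's MTU when the dst is present and is not a metadata dst, otherwise use the device MTU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (assumptions for illustration). */
#define DEMO_DST_METADATA 0x1u

struct demo_dst {
	unsigned int flags;
	unsigned int mtu;
};

struct demo_dev {
	unsigned int mtu;
};

/* Same idea as skb_valid_dst(): usable only if present and not a metadata dst. */
static bool demo_valid_dst(const struct demo_dst *dst)
{
	return dst != NULL && !(dst->flags & DEMO_DST_METADATA);
}

/* Pick the MTU for transmission, falling back to the device MTU. */
static unsigned int demo_xmit_mtu(const struct demo_dst *dst,
				  const struct demo_dev *dev)
{
	return demo_valid_dst(dst) ? dst->mtu : dev->mtu;
}
```

With this shape, a metadata dst never reaches the MTU lookup that dereferenced `dst->dev` in the trace above.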
2022-04-06qemu: enable ACPI for SMPJason A. Donenfeld2-0/+2
It turns out that by having CONFIG_ACPI=n, we've been failing to boot additional CPUs, and so these systems were functionally UP. The code bloat is unfortunate for build times, but I don't see an alternative. So this commit sets CONFIG_ACPI=y for x86_64 and i686 configs. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-06socket: ignore v6 endpoints when ipv6 is disabledJason A. Donenfeld1-2/+2
The previous commit fixed a memory leak on the send path in the event that IPv6 is disabled at compile time, but how did a packet even arrive there to begin with? It turns out we have previously allowed IPv6 endpoints even when IPv6 support is disabled at compile time. This is awkward and inconsistent. Instead, let's just ignore all things IPv6, the same way we do other malformed endpoints, in the case where IPv6 is disabled. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-04-06socket: free skb in send6 when ipv6 is disabledWang Hai1-0/+1
I got a memory leak report: unreferenced object 0xffff8881191fc040 (size 232): comm "kworker/u17:0", pid 23193, jiffies 4295238848 (age 3464.870s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814c3ef4>] slab_post_alloc_hook+0x84/0x3b0 [<ffffffff814c8977>] kmem_cache_alloc_node+0x167/0x340 [<ffffffff832974fb>] __alloc_skb+0x1db/0x200 [<ffffffff82612b5d>] wg_socket_send_buffer_to_peer+0x3d/0xc0 [<ffffffff8260e94a>] wg_packet_send_handshake_initiation+0xfa/0x110 [<ffffffff8260ec81>] wg_packet_handshake_send_worker+0x21/0x30 [<ffffffff8119c558>] process_one_work+0x2e8/0x770 [<ffffffff8119ca2a>] worker_thread+0x4a/0x4b0 [<ffffffff811a88e0>] kthread+0x120/0x160 [<ffffffff8100242f>] ret_from_fork+0x1f/0x30 In wg_socket_send_buffer_as_reply_to_skb() and wg_socket_send_buffer_to_peer(), the semantics of send6() require it to free the skb. But when CONFIG_IPV6 is disabled, the kfree_skb() call is missing. This patch adds it to fix this bug. Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
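The ownership rule behind this fix is that the send function consumes its buffer on every path, including the stub that is compiled in when IPv6 is unavailable. A minimal userspace sketch of that rule follows. `demo_buf`, `demo_send6` and `DEMO_HAVE_IPV6` are hypothetical names, and returning `-EAFNOSUPPORT` from the stub is an assumption made for the illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical buffer type; the counter lets a caller observe that it was freed. */
struct demo_buf {
	int *freed_counter;
};

static void demo_free_buf(struct demo_buf *buf)
{
	++*buf->freed_counter;
	free(buf);
}

/* Assumption: 0 models a build without IPv6 support. */
#define DEMO_HAVE_IPV6 0

/* Contract: demo_send6() always takes ownership of buf and releases it. */
static int demo_send6(struct demo_buf *buf)
{
#if DEMO_HAVE_IPV6
	/* ... transmit the buffer here ... */
	demo_free_buf(buf);
	return 0;
#else
	demo_free_buf(buf);	/* the release that was missing in the stub path */
	return -EAFNOSUPPORT;
#endif
}
```

Callers can then hand the buffer over unconditionally and never need to know which variant was compiled in.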
2022-03-03qemu: simplify RNG seedingJason A. Donenfeld1-18/+8
We don't actually need to write anything in the pool. Instead, we just force the total over 128, and we should be good to go for all old kernels. We also only need this on getrandom() kernels, which simplifies things too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-03-02queueing: use CFI-safe ptr_ring cleanup functionJason A. Donenfeld3-1/+17
We make too nuanced use of ptr_ring to entirely move to the skb_array wrappers, but we at least should avoid the naughty function pointer cast when cleaning up skbs. Otherwise RAP/CFI will honk at us. This patch uses the __skb_array_destroy_skb wrapper for the cleanup, rather than directly providing kfree_skb, which is what other drivers in the same situation do too. Reported-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-13crypto: curve25519-x86_64: use in/out register constraints more preciselyJason A. Donenfeld1-293/+504
Rather than passing all variables as modified, pass ones that are only read into that parameter. This helps with old gcc versions when alternatives are additionally used, and lets gcc's codegen be a little bit more efficient. This also syncs up with the latest Vale/EverCrypt output. This also forward ports 3c9f3b6 ("crypto: curve25519-x86_64: solve register constraints with reserved registers"). Cc: Aymeric Fromherz <aymeric.fromherz@inria.fr> Cc: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/wireguard/1554725710.1290070.1639240504281.JavaMail.zimbra@inria.fr/ Link: https://github.com/project-everest/hacl-star/pull/501 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-13compat: drop Ubuntu 14.04Jason A. Donenfeld1-6/+4
It's been over a year since we announced sunsetting this. Link: https://lore.kernel.org/wireguard/CAHmME9rckipsdZYW+LA=x6wCMybdFFA+VqoogFXnR=kHYiCteg@mail.gmail.com/T Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-08version: bumpv1.0.20211208Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-06crypto: curve25519-x86_64: solve register constraints with reserved registersMathias Krause1-4/+4
The register constraints for the inline assembly in fsqr() and fsqr2() are pretty tight on what the compiler may assign to the remaining three register variables. The clobber list only allows the following to be used: RDI, RSI, RBP and R12. With RAP reserving R12 and a kernel having CONFIG_FRAME_POINTER=y, claiming RBP, there are only two registers left so the compiler rightfully complains about impossible constraints. Provide alternatives that'll allow a memory reference for 'out' to solve the allocation constraint dilemma for this configuration. Also make 'out' an input-only operand as it is only used as such. This not only allows gcc to optimize its usage further, but also works around older gcc versions, apparently failing to handle multiple alternatives correctly, as in failing to initialize the 'out' operand with its input value. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-06compat: udp_tunnel: don't take reference to non-init namespaceJason A. Donenfeld1-5/+7
The comment to sk_change_net is instructive: Kernel sockets, f.e. rtnl or icmp_socket, are a part of a namespace. They should not hold a reference to a namespace in order to allow to stop it. Sockets after sk_change_net should be released using sk_release_kernel We weren't following these rules before, and were instead using __sock_create, which means we kept a reference to the namespace, which in turn meant that interfaces were not cleaned up on namespace exit. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03compat: siphash: use _unaligned version by defaultArnd Bergmann2-34/+28
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS because the ordinary load/store instructions (ldr, ldrh, ldrb) can tolerate any misalignment of the memory address. However, load/store double and load/store multiple instructions (ldrd, ldm) may still only be used on memory addresses that are 32-bit aligned, and so we have to use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we may end up with a severe performance hit due to alignment traps that require fixups by the kernel. Testing shows that this currently happens with clang-13 but not gcc-11. In theory, any compiler version can produce this bug or other problems, as we are dealing with undefined behavior in C99 even on architectures that support this in hardware, see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363. Fortunately, the get_unaligned() accessors do the right thing: when building for ARMv6 or later, the compiler will emit unaligned accesses using the ordinary load/store instructions (but avoid the ones that require 32-bit alignment). When building for older ARM, those accessors will emit the appropriate sequence of ldrb/mov/orr instructions. And on architectures that can truly tolerate any kind of misalignment, the get_unaligned() accessors resolve to the leXX_to_cpup accessors that operate on aligned addresses. Since the compiler will in fact emit ldrd or ldm instructions when building this code for ARM v6 or later, the solution is to use the unaligned accessors unconditionally on architectures where this is known to be fast. The _aligned version of the hash function is however still needed to get the best performance on architectures that cannot do any unaligned access in hardware. This new version avoids the undefined behavior and should produce the fastest hash on all architectures we support. Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03ratelimiter: use kvcalloc() instead of kvzalloc()Gustavo A. R. Silva2-2/+24
Use 2-factor argument form kvcalloc() instead of kvzalloc(). Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
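The point of the two-factor form is that the count and the element size are passed separately, so the multiplication can be checked for overflow before any memory is requested. A rough userspace analogue is sketched below. `demo_zalloc_array` is a hypothetical helper, not a kernel API, and it leans on `calloc()` for the zeroing.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a two-factor zeroing allocator:
 * refuse the request if n * size would overflow size_t.
 */
static void *demo_zalloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would wrap around */
	return calloc(1, n * size);
}
```

With a single-argument allocator the caller computes `n * size` first, and a wrapped product silently yields a buffer that is far too small.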
2021-12-03receive: drop handshakes if queue lock is contendedJason A. Donenfeld1-3/+13
If we're being delivered packets from multiple CPUs so quickly that the ring lock is contended for CPU tries, then it's safe to assume that the queue is near capacity anyway, so just drop the packet rather than spinning. This helps deal with multicore DoS that can interfere with data path performance. It _still_ does not completely fix the issue, but it again chips away at it. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
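The "drop instead of spin" policy can be sketched in userspace with a try-lock built from a C11 `atomic_flag`. Everything named `demo_*` is hypothetical, and the fixed-size array is only a placeholder for a real ring; the sketch shows the policy, not the driver's actual locking.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4

/* Hypothetical ring guarded by a single producer lock. */
struct demo_ring {
	atomic_flag lock;
	void *slots[DEMO_RING_SIZE];
	size_t count;
};

/*
 * Try to enqueue once. Returns false, meaning "drop the packet", when the
 * lock is already held or the ring is full, instead of spinning.
 */
static bool demo_try_enqueue(struct demo_ring *ring, void *pkt)
{
	bool queued = false;

	if (atomic_flag_test_and_set_explicit(&ring->lock, memory_order_acquire))
		return false;	/* contended: assume the queue is near capacity */

	if (ring->count < DEMO_RING_SIZE) {
		ring->slots[ring->count++] = pkt;
		queued = true;
	}
	atomic_flag_clear_explicit(&ring->lock, memory_order_release);
	return queued;
}
```

The trade-off is that a packet may occasionally be dropped while there is still room, in exchange for producers never burning CPU time waiting on the lock.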
2021-12-03receive: use ring buffer for incoming handshakesJason A. Donenfeld5-43/+37
Apparently the spinlock on incoming_handshake's skb_queue is highly contended, and a torrent of handshake or cookie packets can bring the data plane to its knees, simply by virtue of enqueueing the handshake packets to be processed asynchronously. So, we try switching this to a ring buffer to hopefully have less lock contention. This alleviates the problem somewhat, though it still isn't perfect, so future patches will have to improve this further. However, it at least doesn't completely diminish the data plane. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Reported-by: Joel Wanner <joel.wanner@inf.ethz.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03device: reset peer src endpoint when netns exitsJason A. Donenfeld5-2/+60
Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Test-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03main: rename 'mod_init' & 'mod_exit' functions to be module-specificRandy Dunlap1-4/+4
Rename module_init & module_exit functions that are named "mod_init" and "mod_exit" so that they are unique in both the System.map file and in initcall_debug output instead of showing up as almost anonymous "mod_init". This is helpful for debugging and in determining how long certain module_init calls take to execute. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03netns: actually test for routing loopsJason A. Donenfeld1-1/+5
We previously removed the restriction on looping to self, and then added a test to make sure the kernel didn't blow up during a routing loop. The kernel didn't blow up, thankfully, but on certain architectures where skb fragmentation is easier, such as ppc64, the skbs weren't actually being discarded after a few rounds through. But the test wasn't catching this. So actually test explicitly for massive increases in tx to see if we have a routing loop. Note that the actual loop problem will need to be addressed in a different commit. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-12-03compat: update for RHEL 8.5Peter Georg2-4/+4
RHEL 8.5 has been released. Replace all ISCENTOS8S checks with ISRHEL8. Increase RHEL_MINOR for CentOS 8 Stream detection to 6. Signed-off-by: Peter Georg <peter.georg@physik.uni-regensburg.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-08-08compat: account for grsecurity backports and changesMathias Krause2-3/+9
grsecurity kernels tend to carry additional backports and changes, like commit b60b87fc2996 ("netlink: add ethernet address policy types") or the SYM_FUNC_* changes. RAP nowadays hooks the latter, therefore no diversion to RAP_ENTRY is needed any more. Instead of relying on the kernel version test, also test for the macros we're about to define to not already be defined to account for these additional changes in the grsecurity patch without breaking compatibility to the older public ones. Also test for CONFIG_PAX instead of RAP_PLUGIN for the timer API related changes as these don't depend on the RAP plugin to be enabled but just a PaX/grsecurity patch to be applied. While there is no preprocessor knob for the latter, use CONFIG_PAX as this will likely be enabled in every kernel that uses the patch. Signed-off-by: Mathias Krause <minipli@grsecurity.net> [zx2c4: small changes to include a header nearby a macro def test] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-15compat: account for latest c8s backportsJason A. Donenfeld1-3/+3
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06version: bumpv1.0.20210606Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06qemu: increase default dmesg log sizeJason A. Donenfeld1-0/+1
The selftests currently parse the kernel log at the end to track potential memory leaks. With these tests now reading off the end of the buffer, due to recent optimizations, some creation messages were lost, making the tests think that there was a free without an alloc. Fix this by increasing the kernel log size. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06qemu: add disgusting hacks for RHEL 8Jason A. Donenfeld1-1/+7
Red Hat does awful things to their kernel for RHEL 8, such that it doesn't even compile in most configurations. This is utter craziness, and their response to me sending patches to fix this stuff has been to stonewall for months on end and then do nothing. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: add missing __rcu annotation to satisfy sparseJason A. Donenfeld1-1/+1
A __rcu annotation got lost during refactoring, which caused sparse to become enraged. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: free empty intermediate nodes when removing single nodeJason A. Donenfeld3-131/+137
When removing single nodes, it's possible that that node's parent is an empty intermediate node, in which case, it too should be removed. Otherwise the trie fills up and never is fully emptied, leading to gradual memory leaks over time for tries that are modified often. There was originally code to do this, but it was removed during refactoring in 2016 and never reworked. Now that we have proper parent pointers from the previous commits, we can implement this properly. In order to reduce branching and expensive comparisons, we want to keep the double pointer for parent assignment (which lets us easily chain up to the root), but we still need to actually get the parent's base address. So encode the bit number into the last two bits of the pointer, and pack and unpack it as needed. This is a little bit clumsy but is the fastest and least memory-wasteful of the compromises. Note that we align the root struct here to a minimum of 4, because it's embedded into a larger struct, and we're relying on having the bottom two bits for our flag, which would only be 16-bit aligned on m68k. The existing macro-based helpers were a bit unwieldy for adding the bit packing to, so this commit replaces them with safer and clearer ordinary functions. We add a test to the randomized/fuzzer part of the selftests, to free the randomized tries by-peer, refuzz it, and repeat, until it's supposed to be empty, and then see if that actually resulted in the whole thing being emptied. That combined with kmemcheck should hopefully make sure this commit is doing what it should. Along the way this resulted in various other cleanups of the tests and fixes for recent graphviz. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
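The pointer packing described here can be sketched in userspace roughly as follows. `demo_node` and its helpers are hypothetical and much simpler than the real trie node. The sketch stores the address of the parent's child slot with the child index in the low bits, then recovers both the slot and the parent's base address from that one word.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trie node: two child slots plus one packed back-reference. */
struct demo_node {
	struct demo_node *bit[2];
	uintptr_t parent_bit_packed;	/* address of parent's slot | child index */
};

/* The low two bits of any slot address must be free for the flag. */
_Static_assert(alignof(struct demo_node) >= 4, "need two spare low bits");

/* Link child under parent->bit[bit] and remember where it was linked. */
static void demo_connect(struct demo_node *parent, unsigned int bit,
			 struct demo_node *child)
{
	struct demo_node **slot = &parent->bit[bit];

	child->parent_bit_packed = (uintptr_t)slot | bit;
	*slot = child;
}

/* Recover the double pointer (the slot in the parent that points at node). */
static struct demo_node **demo_parent_slot(const struct demo_node *node)
{
	return (struct demo_node **)(node->parent_bit_packed & ~(uintptr_t)3);
}

/* Recover the parent's base address from the slot address and child index. */
static struct demo_node *demo_parent(const struct demo_node *node)
{
	unsigned int bit = node->parent_bit_packed & 1;
	char *slot = (char *)demo_parent_slot(node);

	return (struct demo_node *)(slot - offsetof(struct demo_node, bit)
				    - bit * sizeof(struct demo_node *));
}
```

Keeping the slot address makes relinking a single store through the double pointer, while the packed index is what allows walking back up to the parent node itself.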
2021-06-04allowedips: allocate nodes in kmem_cacheJason A. Donenfeld3-13/+38
The previous commit moved from O(n) to O(1) for removal, but in the process introduced an additional pointer member to a struct that increased the size from 60 to 68 bytes, putting nodes in the 128-byte slab. With deployed systems having as many as 2 million nodes, this represents a significant doubling in memory usage (128 MiB -> 256 MiB). Fix this by using our own kmem_cache, that's sized exactly right. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
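The memory figures in this message follow from simple rounding arithmetic, sketched below. The sketch assumes allocation buckets that are strictly powers of two, as the message describes, and reads "2 million nodes" as 2 * 1024 * 1024 so that the totals come out in whole MiB. Both helper names are hypothetical.

```c
#include <stddef.h>

/* Round an object size up to the next power-of-two bucket (assumed model). */
static size_t demo_pow2_bucket(size_t size)
{
	size_t bucket = 8;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Total footprint in MiB for count objects of per_object bytes each. */
static size_t demo_total_mib(size_t per_object, size_t count)
{
	return per_object * count / (1024 * 1024);
}
```

Under these assumptions a 60-byte node lands in a 64-byte bucket and a 68-byte node in a 128-byte bucket, which gives the 128 MiB and 256 MiB totals quoted above. A dedicated cache sized to the object avoids the rounding entirely.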
2021-06-04allowedips: remove nodes in O(1)Jason A. Donenfeld2-84/+57
Previously, deleting peers would require traversing the entire trie in order to rebalance nodes and safely free them. This meant that removing 1000 peers from a trie with a half million nodes would take an extremely long time, during which we're holding the rtnl lock. Large-scale users were reporting 200ms latencies added to the networking stack as a whole every time their userspace software would queue up significant removals. That's a serious situation. This commit fixes that by maintaining a double pointer to the parent's bit pointer for each node, and then using the already existing node list belonging to each peer to go directly to the node, fix up its pointers, and free it with RCU. This means removal is O(1) instead of O(n), and we don't use gobs of stack. The removal algorithm has the same downside as the code that it fixes: it won't collapse needlessly long runs of fillers. We can enhance that in the future if it ever becomes a problem. This commit documents that limitation with a TODO comment in code, a small but meaningful improvement over the prior situation. Currently the biggest flaw, which the next commit addresses, is that this increases the node size on 64-bit machines from 60 bytes to 68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up using twice as much memory per node, because of power-of-two allocations, which is a big bummer. We'll need to figure something out there. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: initialize list head in selftestJason A. Donenfeld1-1/+2
The randomized trie tests weren't initializing the dummy peer list head, resulting in a NULL pointer dereference when used. Fix this by initializing it in the randomized trie test, just like we do for the static unit test. While we're at it, all of the other strings like this have the word "self-test", so add it to the missing place here. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04peer: allocate in kmem_cacheJason A. Donenfeld3-4/+27
With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02global: use synchronize_net rather than synchronize_rcuJason A. Donenfeld2-4/+4
Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02kbuild: do not use -O3Jason A. Donenfeld1-3/+2
Apparently, various versions of gcc have O3-related miscompiles. Looking at the difference between -O2 and -O3 for gcc 11 doesn't indicate miscompiles, but the difference also doesn't seem so significant for performance that it's worth risking. Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/ Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02netns: make sure rp_filter is disabled on vethcJason A. Donenfeld1-0/+1
Some distros may enable strict rp_filter by default, which will prevent vethc from receiving the packets with an unroutable reverse path address. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-24version: bumpv1.0.20210424Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-23Revert "compat: skb_mark_not_on_list will be backported to Ubuntu 18.04"Thadeu Lima de Souza Cascardo1-1/+1
This reverts commit cad80597c7947f0def83caf8cb56aff0149c83a8, because the kernel commit it anticipated has not been backported so far, due to the implications of building Ubuntu's backport of wireguard in a timely manner. For now, reverting this fix allows the wireguard-linux-compat CI to work on Ubuntu 18.04. A different fix or the same one can be applied again when the time is right. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-22compat: update and improve detection of CentOS Stream 8Peter Georg2-2/+2
CentOS Stream 8 by now (4.18.0-301.1.el8) reports RHEL_MINOR=5. The current RHEL 8 minor release is still 3. RHEL 8.4 is in beta. Replace the equality comparison with a greater-than-or-equal comparison to (hopefully) be a little bit more future proof. Signed-off-by: Peter Georg <peter.georg@physik.uni-regensburg.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-03-07compat: icmp_ndo_send functions were backported extensivelyJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19version: bumpv1.0.20210219Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19qemu: bump default kernel versionJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19compat: zero out skb->cb before icmpJason A. Donenfeld1-4/+16
This corresponds to the fancier upstream commit that's still on lkml, which passes a zeroed ip_options struct to __icmp_send. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18compat: skb_mark_not_on_list will be backported to Ubuntu 18.04Thadeu Lima de Souza Cascardo1-1/+1
linux commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ("net: use skb_list_del_init() to remove from RX sublists") will be backported to Ubuntu 18.04 default kernel, which is based on linux 4.15. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18queueing: get rid of per-peer ring buffersJason A. Donenfeld9-93/+160
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. 
However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
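The multi-producer, single-consumer design described above can be illustrated with a small user-space sketch of an intrusive node-based queue in the style of the algorithm cited as [1]. This is not the WireGuard code: the names `mpsc_node`, `mpsc_queue`, `mpsc_push`, and `mpsc_pop` are hypothetical, C11 atomics stand in for the kernel primitives, and a plain `next` pointer is used where the commit uses skb->prev.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive node; the commit links sk_buffs via skb->prev instead. */
struct mpsc_node {
	_Atomic(struct mpsc_node *) next;
	int value;
};

struct mpsc_queue {
	_Atomic(struct mpsc_node *) head; /* producers exchange themselves in here */
	struct mpsc_node *tail;           /* touched only by the single consumer */
	struct mpsc_node stub;            /* placeholder so the list is never empty */
};

static void mpsc_init(struct mpsc_queue *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->head, &q->stub);
	q->tail = &q->stub;
}

/* Safe to call from many producers at once; never blocks or spins. */
static void mpsc_push(struct mpsc_queue *q, struct mpsc_node *n)
{
	struct mpsc_node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
	/* Between the exchange above and the store below the chain is briefly
	 * broken; a consumer that looks now sees NULL and stops early. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. Returns NULL when empty or when a push is half done. */
static struct mpsc_node *mpsc_pop(struct mpsc_queue *q)
{
	struct mpsc_node *tail = q->tail;
	struct mpsc_node *next = atomic_load_explicit(&tail->next, memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL; /* a producer is mid-push: the "terminate early" case */
	mpsc_push(q, &q->stub); /* re-insert the stub so the last node can be detached */
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

In this sketch an early NULL from `mpsc_pop` is harmless for the same reason the commit message gives: the producer that was mid-push is expected to wake the consumer again once its node is fully linked.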
2021-02-18device: do not generate ICMP for non-IP packetsJason A. Donenfeld1-3/+4
If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18selftests: test multiple parallel streamsJason A. Donenfeld1-1/+17
In order to test ndo_start_xmit being called in parallel, explicitly add separate tests, which should all run on different cores. This should help tease out bugs associated with queueing up packets from different cores in parallel. Currently, it hasn't found those types of bugs, but given future planned work, this is a useful regression to avoid. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-08peer: put frequently used members above cache linesJason A. Donenfeld1-2/+2
The is_dead boolean is checked for every single packet, while the internal_id member is used basically only for pr_debug messages. So it makes sense to hoist up is_dead into some space formerly unused by a struct hole, while demoting internal_id to below the lowest struct cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-07compat: redefine version constants for sublevel>=256Jason A. Donenfeld2-0/+11
With the 4.4.256 and 4.9.256 kernels, the previous calculation for integer comparison overflowed. This commit redefines the broken constants to have more space for the sublevel. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
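The overflow mentioned above can be reproduced with the classic 8-bit-per-field version encoding. The sketch below is illustrative only: `CLASSIC_KERNEL_VERSION` mirrors the well-known `(a << 16) + (b << 8) + c` formula, while `WIDE_KERNEL_VERSION` is a hypothetical wider layout chosen for this example and is not claimed to be the exact definition this commit introduced.

```c
/* Classic encoding: only 8 bits of room for the sublevel. */
#define CLASSIC_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical wider encoding (an assumption for illustration):
 * 16 bits of room for the sublevel, so 256 and above no longer carry over. */
#define WIDE_KERNEL_VERSION(a, b, c) (((a) << 24) + ((b) << 16) + (c))

/* Returns nonzero when version x.y.z sorts strictly before a.b.c. */
static int classic_before(int x, int y, int z, int a, int b, int c)
{
	return CLASSIC_KERNEL_VERSION(x, y, z) < CLASSIC_KERNEL_VERSION(a, b, c);
}

static int wide_before(int x, int y, int z, int a, int b, int c)
{
	return WIDE_KERNEL_VERSION(x, y, z) < WIDE_KERNEL_VERSION(a, b, c);
}
```

With the classic macro, 4.4.256 encodes to the same integer as 4.5.0, so `classic_before(4, 4, 256, 4, 5, 0)` is false, whereas the wider layout keeps the two versions ordered.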
2021-02-07compat: remove unused version.h headersJason A. Donenfeld2-2/+0
We don't need this in all files, and it just complicates things. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-24version: bumpv1.0.20210124Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-24compat: skb_mark_not_on_list was backported to 4.14Jason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-13compat: SYM_FUNC_* was backported to c8sJason A. Donenfeld1-1/+12
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21version: bumpv1.0.20201221Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21socket: remove bogus __be32 annotationJann Horn1-2/+2
The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove the bogus annotation. This was introduced in https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82 in the historical WireGuard repo because the old code used to zero-initialize multiple members as follows: endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0; Because fl.saddr is fixed-endian and an assignment returns a value with the type of its left operand, this meant that sparse detected an assignment between values of different endianness. Since then, this assignment was already split up into separate statements; just the cast survived. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
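The remark that an assignment yields a value with the type of its left operand can be shown in plain C. In this sketch the variable names are made up, and a narrower integer type stands in for the fixed-endian type that sparse tracks; the point is only that the value flowing leftwards through a chained assignment has already taken on the type of the inner left operand.

```c
#include <stdint.h>

/* Chained assignment: wide = (narrow = value).
 * The inner assignment has type uint16_t, so the outer one receives the
 * already-truncated value, not the original one. */
static uint32_t chain_assign(uint32_t value)
{
	uint16_t narrow;
	uint32_t wide;

	wide = narrow = value;
	return wide;
}
```

Calling `chain_assign(0x12345)` returns `0x2345`, which is why a checker such as sparse treats the intermediate value as carrying the inner operand's type annotations.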
2020-12-21global: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli2-3/+3
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
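As a rough user-space illustration of why the outer annotation is redundant, the following sketch approximates the kernel's error-pointer helpers. It assumes GCC or Clang for `__builtin_expect`; the definitions are simplified stand-ins, not copies of the kernel headers.

```c
#include <stdbool.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The branch hint already lives inside the check itself. */
#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}
```

Given these definitions, `if (unlikely(IS_ERR(p)))` and `if (IS_ERR(p))` express the same hint, so the shorter form is enough.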
2020-12-19simd: detect -rt kernels >= 5.4Jason A. Donenfeld1-1/+1
The 5.4 series of -rt kernels moved from PREEMPT_RT_BASE/PREEMPT_RT_FULL to PREEMPT_RT, so we have to account for it here. Otherwise users get scheduling-while-atomic splats. Reported-by: Erik Schuitema <erik@essd.nl> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-16gitignore: ignore intermediary build fileL.W.Reek1-0/+1
Signed-off-by: L.W.Reek <syphyr@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-14compat: drop rhel 8.2, add rhel 8.4 supportJason A. Donenfeld1-8/+5
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12version: bumpv1.0.20201112Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12qemu: bump default testing versionJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12compat: SYM_FUNC_{START,END} were backported to 5.4Jason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-04qemu: drop build support for rhel 8.2Jason A. Donenfeld1-1/+0
This reverts commit feb89cab65c6ab1a6cbeeaaeb11b1a174772cea8. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-10-29netns: check that route_me_harder packets use the right skJason A. Donenfeld2-0/+11
If netfilter changes the packet mark, the packet is rerouted. The ip_route_me_harder family of functions fails to use the right sk, opting to instead use skb->sk, resulting in a routing loop when used with tunnels. Fixing this inside of the compat layer with skb_orphan would work but would cause other problems, by disabling TSQ, so instead we warn if the calling kernel hasn't yet backported the fix for this. Reported-by: Chen Minqiang <ptpt52@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-09noise: take lock when removing handshake entry from tableJason A. Donenfeld1-4/+1
Eric reported that syzkaller found a race of this variety:

    CPU 1                                       CPU 2
    -------------------------------------------|---------------------------------------
    wg_index_hashtable_replace(old, ...)       |
      if (hlist_unhashed(&old->index_hash))    |
                                               | wg_index_hashtable_remove(old)
                                               |   hlist_del_init_rcu(&old->index_hash)
                                               |     old->index_hash.pprev = NULL
      hlist_replace_rcu(&old->index_hash, ...) |
        *old->index_hash.pprev                 |

Syzbot wasn't actually able to reproduce this more than once or create a reproducer, because the race window between checking "hlist_unhashed" and calling "hlist_replace_rcu" is just so small. Adding an mdelay(5) or similar there helps make this demonstrable using this simple script:

    #!/bin/bash
    set -ex
    trap 'kill $pid1; kill $pid2; ip link del wg0; ip link del wg1' EXIT
    ip link add wg0 type wireguard
    ip link add wg1 type wireguard
    wg set wg0 private-key <(wg genkey) listen-port 9999
    wg set wg1 private-key <(wg genkey) peer $(wg show wg0 public-key) endpoint 127.0.0.1:9999 persistent-keepalive 1
    wg set wg0 peer $(wg show wg1 public-key)
    ip link set wg0 up
    yes link set wg1 up | ip -force -batch - &
    pid1=$!
    yes link set wg1 down | ip -force -batch - &
    pid2=$!
    wait

The fundamental underlying problem is that we permit calls to wg_index_hashtable_remove(handshake.entry) without requiring the caller to take the handshake mutex that is intended to protect members of handshake during mutations. This is consistently the case with calls to wg_index_hashtable_insert(handshake.entry) and wg_index_hashtable_replace(handshake.entry), but it's missing from a pertinent callsite of wg_index_hashtable_remove(handshake.entry). So, this patch makes sure that mutex is taken. 
The original code was a little bit funky though, in the form of: remove(handshake.entry) lock(), memzero(handshake.some_members), unlock() remove(handshake.entry) The original intention of that double removal pattern outside the lock appears to be some attempt to prevent insertions that might happen while locks are dropped during expensive crypto operations, but actually, all callers of wg_index_hashtable_insert(handshake.entry) take the write lock and then explicitly check handshake.state, as they should, which the aforementioned memzero clears, which means an insertion should already be impossible. And regardless, the original intention was necessarily racy, since it wasn't guaranteed that something else would run after the unlock() instead of after the remove(). So, from a soundness perspective, it seems positive to remove what looks like a hack at best. The crash from both syzbot and from the script above is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 
R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Note that this fixes the same issue as the previous commit, but in a more direct way. Upstream, the commit message of that previous commit has been changed to: wireguard: peerlookup: take lock before checking hash in replace operation Eric's suggested fix for the previous commit's mentioned race condition was to simply take the table->lock in wg_index_hashtable_replace(). The table->lock of the hash table is supposed to protect the bucket heads, not the entries, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, as a defense in depth measure, so that it no longer races with deletions, regardless of what other locks are protecting individual entries. This is sensible from a performance perspective because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. 
More generally, I like the idea of locking consistency across table mutator functions, and this might let me rest slightly easier at night. Since we've already tagged it, we're not going to change it at this point, but I include mention of it here for reference. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-08version: bumpv1.0.20200908Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-08peerlookup: take lock before checking hash in replace operationJason A. Donenfeld1-3/+8
Eric reported that syzkaller found a race of this variety: CPU 1 CPU 2 -------------------------------------------|--------------------------------------- wg_index_hashtable_replace(old, ...) | if (hlist_unhashed(&old->index_hash)) | | wg_index_hashtable_remove(old) | hlist_del_init_rcu(&old->index_hash) | old->index_hash.pprev = NULL hlist_replace_rcu(&old->index_hash, ...) | *old->index_hash.pprev | The table->lock of the hash table is supposed to protect the bucket heads, not the entries, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, so that it no longer races with deletions. This is fine because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. The syzkaller crash is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 
R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Modules linked in: ---[ end trace 0d737db78b72da84 ]--- RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Reported-by: 
Eric Dumazet <edumazet@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
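The idea of moving the "is it still hashed?" test inside the table lock can be sketched in user space as follows. Everything here is hypothetical: `table_sketch` and `entry_sketch` are single-slot stand-ins for the real hash table, and a C11 `atomic_flag` spinlock replaces the kernel's `table->lock`. The sketch shows only the check-then-act ordering, not RCU.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry_sketch {
	bool hashed; /* stand-in for !hlist_unhashed() */
};

struct table_sketch {
	atomic_flag lock;          /* stand-in for table->lock */
	struct entry_sketch *slot; /* single bucket, for brevity */
};

static void table_init(struct table_sketch *t)
{
	atomic_flag_clear(&t->lock);
	t->slot = NULL;
}

static void table_lock(struct table_sketch *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void table_unlock(struct table_sketch *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

static void table_insert(struct table_sketch *t, struct entry_sketch *e)
{
	table_lock(t);
	t->slot = e;
	e->hashed = true;
	table_unlock(t);
}

static void table_remove(struct table_sketch *t, struct entry_sketch *e)
{
	table_lock(t);
	if (e->hashed) {
		if (t->slot == e)
			t->slot = NULL;
		e->hashed = false;
	}
	table_unlock(t);
}

/* The hashed test happens under the lock, so a concurrent table_remove()
 * cannot slip in between the test and the replacement. */
static bool table_replace(struct table_sketch *t, struct entry_sketch *old,
			  struct entry_sketch *new_entry)
{
	bool ret;

	table_lock(t);
	ret = old->hashed;
	if (ret) {
		t->slot = new_entry;
		new_entry->hashed = true;
		old->hashed = false;
	}
	table_unlock(t);
	return ret;
}
```

Because every mutator already takes the lock, the extra cost of testing under it is confined to the replace path, which matches the reasoning given in the commit message.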
2020-08-27compat: backport NLA policy macrosJason A. Donenfeld1-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27netlink: consistently use NLA_POLICY_MIN_LEN()Johannes Berg1-2/+2
Change places that open-code NLA_POLICY_MIN_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27netlink: consistently use NLA_POLICY_EXACT_LEN()Johannes Berg1-5/+5
Change places that open-code NLA_POLICY_EXACT_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27compat: backport kfree_sensitive and switch to itJason A. Donenfeld3-3/+7
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29compat: drop support for SUSE 15.1Jason A. Donenfeld1-10/+7
Now that WireGuard is properly supported by 15.2 and people have had sufficient time to upgrade, we can drop support for 15.1 in this compat module. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29version: bumpv1.0.20200729Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29compat: add missing headers for ip_tunnel_parse_protocolJason A. Donenfeld1-0/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29compat: ipv6_dst_lookup_flow was ported to rhel 7.9 betaJason A. Donenfeld1-1/+4
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29compat: allow override of depmod basedirRicardo Mendoza1-1/+2
When building in an environment with a different modules install path we need to be able to also override the depmod basedir flag. Signed-off-by: Ricardo Mendoza <ricmm@pantacor.com> [zx2c4: changed name of env var and added quotes to argument] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-29compat: rhel 8.3 beta removed nf_nat_core.hJason A. Donenfeld1-1/+1
Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-07-12version: bumpv1.0.20200712Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-30compat: backport ip_tunnel_parse_protocol and ip_tunnel_header_opsJason A. Donenfeld1-0/+22
These are required for moving wg_examine_packet_protocol out of wireguard and into upstream. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-30queueing: make use of ip_tunnel_parse_protocolJason A. Donenfeld2-18/+3
Now that wg_examine_packet_protocol has been added for general consumption as ip_tunnel_parse_protocol, it's possible to remove wg_examine_packet_protocol and simply use the new ip_tunnel_parse_protocol function directly. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-30device: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld1-0/+1
WireGuard uses skb->protocol to determine packet type, and bails out if it's not set or set to something it's not expecting. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and wireguard then rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
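To make the role of parse_protocol more concrete, here is a user-space sketch of how a layer 3 tunnel might infer the protocol from the first bytes of a packet. The function name `parse_l3_protocol_sketch` and its buffer-based signature are assumptions for illustration; the kernel handler operates on an sk_buff and returns a big-endian value, whereas this sketch returns host-order constants.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP   0x0800 /* IPv4 EtherType */
#define ETH_P_IPV6 0x86DD /* IPv6 EtherType */

/* Inspect the IP version nibble, but only if a full fixed header is present.
 * Returns 0 when the packet is too short or is not IPv4/IPv6, which is the
 * case the commit describes as leading to the skb being rejected. */
static uint16_t parse_l3_protocol_sketch(const uint8_t *data, size_t len)
{
	if (len >= 20 && (data[0] >> 4) == 4)
		return ETH_P_IP;
	if (len >= 40 && (data[0] >> 4) == 6)
		return ETH_P_IPV6;
	return 0;
}
```

A caller that receives 0 has no valid protocol to place in skb->protocol, which corresponds to the AF_PACKET failure mode described above.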
2020-06-29compat: SUSE 15.1 is the final SUSE we need to supportJason A. Donenfeld1-8/+8
>=15.2 is in SUSE's kernel now. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-29compat: rhel 8.3 backported skb_reset_redirectJason A. Donenfeld1-1/+4
Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-29receive: account for napi_gro_receive never returning GRO_DROPJason A. Donenfeld1-8/+2
The napi_gro_receive function no longer returns GRO_DROP ever, making handling GRO_DROP dead code. This commit removes that dead code. Further, it's not even clear that device drivers have any business in taking action after passing off received packets; that's arguably out of their hands. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-23version: bumpv1.0.20200623Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-22netns: workaround bad 5.2.y backportJason A. Donenfeld1-1/+2
ca7a03c4175 was backported to 5.2 to fix 7d9e5f422150, but 7d9e5f422150 wasn't added until 5.3, so this fix for a reference underflow in 5.3 becomes a memory leak in 5.2. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-22device: avoid circular netns referencesJason A. Donenfeld6-46/+71
Before, we took a reference to the creating netns if the new netns was different. This caused issues with circular references, with two wireguard interfaces swapping namespaces. The solution is to not take any extra references at all, but instead simply invalidate the creating netns pointer when that netns is deleted. In order to prevent this from happening again, this commit improves the rough object leak tracking by allowing it to account for created and destroyed interfaces, aside from just peers and keys. That then makes it possible to check for the object leak when having two interfaces take a reference to each other's namespaces. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-21noise: do not assign initiation time in if conditionFrank Werner-Krippendorf1-2/+2
Fixes an error condition reported by checkpatch.pl which was caused by assigning a variable in an if condition in wg_noise_handshake_consume_initiation(). Signed-off-by: Frank Werner-Krippendorf <mail@hb9fxq.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-18Kbuild: remove -fvisibility=hidden from cflagsJason A. Donenfeld1-1/+1
This was originally done in 2015 as a means of decreasing module size, but it has the effect of creating JUMP11 relocations on ARM when compiled in THUMB2 mode without CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11=y, which results in `B ...` instructions being generated with jumps that are too far, rather than `B.W ...` instructions, which can handle the larger sized jump. Get rid of the old hack, which had minimum utility anyway. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-15compat: drop centos 8.1 support as 8.2 is now outJason A. Donenfeld1-7/+4
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-11version: bumpv1.0.20200611Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-04compat: remove stale suse supportJason A. Donenfeld1-11/+3
The 42.x series is no longer supported, and the 15.2 kernel is getting a proper backport, so at the moment, we only care about supporting 15.1. Eventually we'll drop that too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28compat: bionic-hwe-5.0/disco kernel backported skb_reset_redirect and ipv6 flowJason A. Donenfeld1-2/+4
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28qemu: mark per_cpu_load_addr as static for gcc-10Jason A. Donenfeld1-0/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28qemu: work around broken centos8 kernelJason A. Donenfeld1-0/+1
RHEL needs to apply https://lore.kernel.org/patchwork/patch/974664/ before we can revert this monstrosity. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28compat: ubuntu appears to have backported ipv6_dst_lookup_flowJason A. Donenfeld1-1/+3
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28qemu: patch in UTS_UBUNTU_RELEASE_ABI for Ubuntu detectionJason A. Donenfeld1-0/+1
This kind of thing really makes me queasy and upset, but there's little that can be done about such situations when dealing with Canonical's kernel. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-28qemu: support fetching kernels for arbitrary URLsJason A. Donenfeld1-1/+11
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: backport iptunnel_xmit to 3.11Jason A. Donenfeld1-4/+11
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: narrow the breadth of iptunnel_xmit backportJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: widen breadth of prandom_u32_max backportJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: backport skb_scrub_packet to 3.11Jason A. Donenfeld1-0/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: widen breadth of memzero_explicit backportJason A. Donenfeld1-3/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21compat: widen breadth of integer constantsJason A. Donenfeld1-1/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: add extra fill in idt handler for newer binutilsJason A. Donenfeld1-0/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: use cbuild gcc for avx512 exclusionJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: force 2MB pages for binutils 2.31Jason A. Donenfeld1-0/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: patch kernels that rely on ancient makeJason A. Donenfeld1-0/+1
Kernels without 9feeb638cde0 ("tools build: fix # escaping in .cmd files for future Make") face problems when building with more recent make, so patch these to avoid issues. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: remove -Werror in order to build ancient kernels betterJason A. Donenfeld1-0/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-21qemu: always use cbuild gcc rather than system gccJason A. Donenfeld1-3/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20version: bumpv1.0.20200520Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20compat: support CentOS 8 explicitlyJason A. Donenfeld1-4/+7
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20compat: RHEL7 backported the skb hash renamingsJason A. Donenfeld1-3/+3
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20compat: ip6_dst_lookup_flow was backported to 4.14, 4.9, and 4.4Jason A. Donenfeld1-1/+1
Also remove the confusing 119/118 distinction from the Debian clause, which is no longer as important. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20compat: backport renamed/missing skb hash membersJason A. Donenfeld2-2/+15
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19noise: separate receive counter from send counterJason A. Donenfeld5-54/+50
In "queueing: preserve flow hash across packet scrubbing", we were required to slightly increase the size of the receive replay counter to something still fairly small, but an increase nonetheless. It turns out that we can recoup some of the additional memory overhead by splitting up the prior union type into two distinct types. Before, we used the same "noise_counter" union for both sending and receiving, with sending just using a simple atomic64_t, while receiving used the full replay counter checker. This meant that most of the memory being allocated for the sending counter was being wasted. Since the old "noise_counter" type increased in size in the prior commit, now is a good time to split up that union type into a distinct "noise_replay_counter" for receiving and a boring atomic64_t for sending, each using neither more nor less memory than required. Also, since sometimes the replay counter is accessed without necessitating additional accesses to the bitmap, we can reduce cache misses by hoisting the always-necessary lock above the bitmap in the struct layout. We also change a "noise_replay_counter" stack allocation to kmalloc in a -DDEBUG selftest so that KASAN doesn't trigger a stack frame warning. All in all, removing a bit of abstraction in this commit makes the code simpler and smaller, in addition to the motivating memory usage recuperation. For example, passing around raw "noise_symmetric_key" structs is something that really only makes sense within noise.c, in the one place where the sending and receiving keys can safely be thought of as the same type of object; subsequent to that, it's important that we uniformly access these through keypair->{sending,receiving}, where their distinct roles are always made explicit. So this patch allows us to draw that distinction clearly as well. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
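The memory argument can be made concrete with a small sizeof comparison. The types below are simplified, hypothetical stand-ins (no lock member, `uint64_t` in place of atomic64_t, and an assumed 8192-bit window taken from the preceding commit's description); they are meant only to show why a union forces the send side to pay for the receive side's bitmap.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_COUNTER_BITS 8192 /* assumed window size, for illustration */
#define SKETCH_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Receive side: a counter plus a bitmap for replay detection. */
struct replay_counter_sketch {
	uint64_t counter;
	unsigned long backtrack[SKETCH_COUNTER_BITS / SKETCH_BITS_PER_LONG];
};

/* Old layout: one union used for both directions, so a send counter
 * occupies as much memory as a full receive counter. */
union old_counter_sketch {
	uint64_t send;
	struct replay_counter_sketch receive;
};

/* New layout: the send side is just a 64-bit counter. */
typedef uint64_t send_counter_sketch;

/* Bytes saved per sending key by no longer embedding the bitmap. */
static size_t send_side_savings(void)
{
	return sizeof(union old_counter_sketch) - sizeof(send_counter_sketch);
}
```

Under these assumptions the union is as large as the receive structure, so splitting the types saves the whole bitmap for every sending key.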
2020-05-19queueing: preserve flow hash across packet scrubbingJason A. Donenfeld4-4/+17
It's important that we clear most header fields during encapsulation and decapsulation, because the packet is substantially changed, and we don't want any info leak or logic bug due to an accidental correlation. But, for encapsulation, it's wrong to clear skb->hash, since it's used by fq_codel and flow dissection in general. Without it, classification does not proceed as usual. This change might make it easier to estimate the number of inner flows by examining clustering of out of order packets, but this shouldn't open up anything that can't already be inferred otherwise (e.g. syn packet size inference), and fq_codel can be disabled anyway. Furthermore, it might be the case that the hash isn't used or queried at all until after wireguard transmits the encrypted UDP packet, which means skb->hash might still be zero at this point, and thus no hash taken over the inner packet data. In order to address this situation, we force a calculation of skb->hash before encrypting packet data. Of course this means that fq_codel might transmit packets slightly more out of order than usual. Toke did some testing on beefy machines with high quantities of parallel flows and found that increasing the replay-attack counter to 8192 takes care of the most pathological cases pretty well. Reported-by: Dave Taht <dave.taht@gmail.com> Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19noise: read preshared key while taking lockJason A. Donenfeld1-1/+5
Previously, we read the preshared key after dropping the handshake lock, which isn't an actual crypto issue if it races, but it's still not quite correct. So copy that part of the state into a temporary like we do with the rest of the handshake state variables. Then we can release the lock, operate on the temporary, and zero it out at the end of the function. In performance tests, the impact of this was entirely unnoticeable, probably because those bytes are coming from the same cacheline as other things that are being copied out in the same manner. Reported-by: Matt Dunwoodie <ncon@noconroy.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19compat: support RHEL 8 as 8.2, drop 8.1 supportJason A. Donenfeld1-9/+4
This should help with 8.3 beta rolls being recognized as 8.1 instead of 8.2 quirks. Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19qemu: add -fcommon for compiling ping with gcc-10Jason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-08qemu: use newer iproute2 for gcc-10Jason A. Donenfeld1-1/+1
gcc-10 switched to defaulting to -fno-common, which broke iproute2-5.4. This was fixed in iproute2-5.6, so switch to that. Because we're after a stable testing surface, we generally don't like to bump these unnecessarily, but in this case, being able to actually build is a basic necessity. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-06version: bumpv1.0.20200506Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-05send/receive: use explicit unlikely branch instead of implicit coalescingJason A. Donenfeld2-16/+12
It's very unlikely that send will become true. It's nearly always false between 0 and 120 seconds of a session, and in most cases becomes true only between 120 and 121 seconds before becoming false again. So, unlikely(send) is clearly the right option here. What happened before was that we had this complex boolean expression with multiple likely and unlikely clauses nested. Since this is evaluated left-to-right anyway, the whole thing got converted to unlikely. So, we can clean this up to better represent what's going on. The generated code is the same. Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
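As a rough userspace illustration of the single explicit hint, the sketch below computes the boolean first and then wraps only the final test in unlikely(). The macro definitions use the GCC/Clang builtin, and the function name, its parameters, and the simplified condition are assumptions rather than the module's actual expression.

```c
#include <stdbool.h>

/* Userspace equivalents of the kernel branch hints (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The 120 second figure comes from the message above. */
#define REKEY_AFTER_TIME_SEC_SKETCH 120u

/* Returns 1 when a new handshake would be started, 0 otherwise. */
static int keep_key_fresh_sketch(unsigned int key_age_sec, bool i_am_initiator)
{
	/* Plain boolean expression, no nested hints. */
	bool send = i_am_initiator && key_age_sec >= REKEY_AFTER_TIME_SEC_SKETCH;

	/* One explicit hint on the rarely taken branch. */
	if (unlikely(send))
		return 1; /* the real code would queue a handshake initiation here */
	return 0;
}
```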
2020-05-05selftests: initalize ipv6 members to NULL to squelch clang warningJason A. Donenfeld1-2/+2
Without setting these to NULL, clang complains in certain configurations that have CONFIG_IPV6=n: In file included from drivers/net/wireguard/ratelimiter.c:223: drivers/net/wireguard/selftest/ratelimiter.c:173:34: error: variable 'skb6' is uninitialized when used here [-Werror,-Wuninitialized] ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count); ^~~~ drivers/net/wireguard/selftest/ratelimiter.c:123:29: note: initialize the variable 'skb6' to silence this warning struct sk_buff *skb4, *skb6; ^ = NULL drivers/net/wireguard/selftest/ratelimiter.c:173:40: error: variable 'hdr6' is uninitialized when used here [-Werror,-Wuninitialized] ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count); ^~~~ drivers/net/wireguard/selftest/ratelimiter.c:125:22: note: initialize the variable 'hdr6' to silence this warning struct ipv6hdr *hdr6; ^ We silence this warning by setting the variables to NULL as the warning suggests. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-04compat: Ubuntu 19.10 and 18.04-hwe backported skb_reset_redirectJason A. Donenfeld1-2/+4
Reported-by: Pascal Ernster <pascal.ernster@rub.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-04send: cond_resched() when processing tx ringbuffersJason A. Donenfeld1-0/+2
Users with pathological hardware reported CPU stalls on CONFIG_PREEMPT_VOLUNTARY=y, because the ringbuffers would stay full, meaning these workers would never terminate. That turned out not to be okay on systems without forced preemption. This commit adds a cond_resched() to the bottom of each loop iteration, so that these workers don't hog the core. We don't do this on encryption/decryption because the compat module here uses simd_relax, which already includes a call to schedule in preempt_enable. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-04socket: remove errant restriction on looping to selfJason A. Donenfeld2-15/+51
It's already possible to create two different interfaces and loop packets between them. This has always been possible with tunnels in the kernel, and isn't specific to wireguard. Therefore, the networking stack already needs to deal with that. At the very least, the packet winds up exceeding the MTU and is discarded at that point. So, since this is already something that happens, there's no need to forbid the not very exceptional case of routing a packet back to the same interface; this loop is no different than others, and we shouldn't special case it, but rather rely on generic handling of loops in general. This also makes it easier to do interesting things with wireguard such as onion routing. At the same time, we add a selftest for this, ensuring that both onion routing works and infinite routing loops do not crash the kernel. We also add a test case for wireguard interfaces nesting packets and sending traffic between each other, as well as the loop in this case too. We make sure to send some throughput-heavy traffic for this use case, to stress out any possible recursion issues with the locks around workqueues. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-03qemu: use normal kernel stack size on ppc64Jason A. Donenfeld1-0/+1
While at some point it might have made sense to be running these tests on ppc64 with 4k stacks, the kernel hasn't actually used 4k stacks on 64-bit powerpc in a long time, and more interesting things that we test don't really work when we deviate from the default (16k). So, we stop pushing our luck in this commit, and return to the default instead of the minimum. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-03compat: use bash instead of bc for HZ-->USEC calculationJason A. Donenfeld1-5/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-03compat: detect Debian's backport of ip6_dst_lookup_flow into 4.19.118Jason A. Donenfeld2-1/+5
Link: https://bugs.debian.org/959157 Reported-by: Luca Filipozzi <lfilipoz@debian.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-03qemu: loop entropy adding until getrandom doesn't blockJason A. Donenfeld1-1/+4
Before the 256 was just a guess, which was made wrong by qemu 5.0, so instead actually query whether or not we're all set. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-30compat: timeconst.h is a generated artifactJason A. Donenfeld1-1/+1
Before we were trying to check for timeconst.h by looking in the kernel source directory. This isn't quite correct on configurations in which the object directory is separate from the kernel source directory, for example when using O="elsewhere" as a make option when building the kernel. The correct fix is to use $(CURDIR), which should point to where we want. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-29version: bumpv1.0.20200429Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-29compat: ip6_dst_lookup_flow was backported to 4.19.119Jason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-29compat: ip6_dst_lookup_flow was backported to 3.16.83Jason A. Donenfeld1-5/+6
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-28receive: use tunnel helpers for decapsulating ECN markingsToke Høiland-Jørgensen2-16/+2
WireGuard currently only propagates ECN markings on tunnel decap according to the old RFC3168 specification. However, the spec has since been updated in RFC6040 to recommend slightly different decapsulation semantics. This was implemented in the kernel as a set of common helpers for ECN decapsulation, so let's just switch over WireGuard to using those, so it can benefit from this enhancement and any future tweaks. We do not drop packets with invalid ECN marking combinations, because WireGuard is frequently used to work around broken ISPs, which could be doing that. Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Rodney W. Grimes <ietf@gndrsh.dnsmgr.net> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
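For readers unfamiliar with the RFC 6040 decapsulation table, the following sketch maps an outer and inner ECN field to the field of the forwarded packet. It follows the table in the RFC rather than the kernel helpers themselves, the function name is hypothetical, and the one "drop" cell of the table is replaced by forwarding the inner packet unchanged, in line with the non-dropping behaviour described above.

```c
#include <stdint.h>

/* ECN codepoints as they appear in the low two bits of the TOS/traffic class. */
enum {
	ECN_NOT_ECT_SKETCH = 0,
	ECN_ECT_1_SKETCH = 1,
	ECN_ECT_0_SKETCH = 2,
	ECN_CE_SKETCH = 3
};

/* Returns the ECN field for the decapsulated inner packet. */
static uint8_t ecn_decap_sketch(uint8_t outer, uint8_t inner)
{
	outer &= 3;
	inner &= 3;

	/* A non-ECN inner packet is never marked (and, in this sketch, never dropped). */
	if (inner == ECN_NOT_ECT_SKETCH)
		return ECN_NOT_ECT_SKETCH;

	/* Congestion experienced on the outer header propagates inward. */
	if (outer == ECN_CE_SKETCH)
		return ECN_CE_SKETCH;

	/* RFC 6040 also propagates an outer ECT(1) onto an inner ECT(0). */
	if (outer == ECN_ECT_1_SKETCH && inner == ECN_ECT_0_SKETCH)
		return ECN_ECT_1_SKETCH;

	/* Every other combination keeps the inner marking. */
	return inner;
}
```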
2020-04-26version: bumpv1.0.20200426Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-26compat: prefix icmp[v6]_ndo_send with __compatJason A. Donenfeld1-4/+6
Some distros that backported icmp[v6]_ndo_send still try to build the compat module in some corner case circumstances, resulting in errors. Work around this with the usual __compat games. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-24main: mark as in-treeJason A. Donenfeld1-0/+1
We've been merged upstream and should no longer taint kernels. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-22queueing: cleanup ptr_ring in error path of packet_queue_initJason A. Donenfeld1-1/+3
Prior, if the alloc_percpu of packet_percpu_multicore_worker_alloc failed, the previously allocated ptr_ring wouldn't be freed. This commit adds the missing call to ptr_ring_cleanup in the error case. Reported-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
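The shape of the fix can be sketched in plain C as follows. The struct, the function, and the fail_second switch (used only to simulate the second allocation failing) are hypothetical; calloc and free stand in for the ring and per-CPU allocations.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical queue with two separately allocated parts. */
struct queue_sketch {
	void **ring;     /* stands in for the ptr_ring */
	int *per_worker; /* stands in for the per-CPU worker data */
};

/* fail_second is a test knob that simulates the second allocation failing. */
static int queue_init_sketch(struct queue_sketch *q, size_t ring_len,
			     size_t workers, int fail_second)
{
	q->ring = calloc(ring_len, sizeof(*q->ring));
	if (!q->ring)
		return -ENOMEM;

	q->per_worker = fail_second ? NULL : calloc(workers, sizeof(*q->per_worker));
	if (!q->per_worker) {
		/* The previously missing step: release the first allocation. */
		free(q->ring);
		q->ring = NULL;
		return -ENOMEM;
	}
	return 0;
}
```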
2020-04-22compat: kvmalloc_array is not required anywayJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-22compat: don't assume READ_ONCE barriers on old kernelsJason A. Donenfeld1-4/+4
76ebbe78f7390aee075a7f3768af197ded1bdfbb didn't come until 4.15. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-22compat: import latest fixes for ptr_ringJason A. Donenfeld1-36/+70
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-16compat: include sch_generic.h header for skb_reset_tcJason A. Donenfeld1-0/+1
Reported-by: King DuckZ <dev00@gmx.it> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-14crypto: do not export symbolsJason A. Donenfeld5-19/+0
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-14version: bumpv1.0.20200413Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-14compat: backport hsiphash_1u32 for testsJason A. Donenfeld1-0/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-14compat: error out if bc is missingJason A. Donenfeld1-1/+5
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-14compat: support RHEL 7.8's faulty siphash backportJason A. Donenfeld1-1/+1
Reported-by: Christian Weiss <cwei@gmx.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-08git: add gitattributes so tarball doesn't have gitignore filesJason A. Donenfeld1-0/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-07compat: support latest suse 15.1 and 15.2Jason A. Donenfeld1-4/+7
Contributed-by: Martin Hauke <mardnh@gmx.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-01version: bumpv1.0.20200401Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-01qemu: bump default kernel to 5.5.14Jason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-04-01compat: queueing: skb_reset_redirect change has been backported to 5.[45]Christian Hesse1-1/+1
This is a follow up to 2d4fa2a6e7903ec3340f1b075456cbd84ba6a744. Upstream commit 2c64605b590edadb3fb46d1ec6badb49e940b479 has been backported to 5.4.29 and 5.5.14. Signed-off-by: Christian Hesse <mail@eworm.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-30version: bumpv1.0.20200330Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-28queueing: backport skb_reset_redirect change from 5.6Jason A. Donenfeld2-1/+11
This backports upstream commit 2c64605b590edadb3fb46d1ec6badb49e940b479. It makes no difference for us, but it's nice to keep this code in sync with upstream as much as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-18version: bumpv0.0.20200318Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-18send: use normaler alignment formula from upstreamJason A. Donenfeld1-1/+1
Slightly more meh, but upstream likes it better, and I'd rather minimize the delta between trees. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-18noise: error out precomputed DH during handshake rather than configJason A. Donenfeld5-48/+49
We precompute the static-static ECDH during configuration time, in order to save an expensive computation later when receiving network packets. However, not all ECDH computations yield a contributory result. Prior, we were just not letting those peers be added to the interface. However, this creates a strange inconsistency, since it was still possible to add other weird points, like a valid public key plus a low-order point, and, like points that result in zeros, a handshake would not complete. In order to make the behavior more uniform and less surprising, simply allow all peers to be added. Then, we'll error out later when doing the crypto if there's an issue. This also adds more separation between the crypto layer and the configuration layer. Discussed-with: Mathias Hall-Andersen <mathias@hall-andersen.dk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
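A minimal sketch of the "error out later" idea: configuration accepts any peer, and the handshake path refuses to proceed when the precomputed shared secret is all zeros. The names and the 32-byte length constant are assumptions for illustration, and the actual mixing of the secret into the handshake state is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DH_LEN_SKETCH 32

/* Checks for an all-zero buffer without an early exit on the first nonzero byte. */
static bool is_all_zero_sketch(const uint8_t *buf, size_t len)
{
	uint8_t acc = 0;
	size_t i;

	for (i = 0; i < len; ++i)
		acc |= buf[i];
	return acc == 0;
}

/* Returns false when the handshake has to fail because the DH was not contributory. */
static bool mix_precomputed_dh_sketch(const uint8_t precomputed[DH_LEN_SKETCH])
{
	if (is_all_zero_sketch(precomputed, DH_LEN_SKETCH))
		return false;
	/* The real code would mix the secret into the handshake state here. */
	return true;
}
```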
2020-03-17receive: remove dead code from default packet type caseJason A. Donenfeld1-2/+1
The situation in which we wind up hitting the default case here indicates a major bug in earlier parsing code. It is not a usual thing that should ever happen, which means a "friendly" message for it doesn't make sense. Rather, replace this with a WARN_ON, just like we do earlier in the file for a similar situation, so that somebody sends us a bug report and we can fix it. Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-17wireguard: queueing: account for skb->protocol==0Jason A. Donenfeld3-4/+10
We carry out checks to the effect of: if (skb->protocol != wg_examine_packet_protocol(skb)) goto err; By having wg_skb_examine_untrusted_ip_hdr return 0 on failure, this means that the check above still passes in the case where skb->protocol is zero, which is possible to hit with AF_PACKET: struct sockaddr_pkt saddr = { .spkt_device = "wg0" }; unsigned char buffer[5] = { 0 }; sendto(socket(AF_PACKET, SOCK_PACKET, /* skb->protocol = */ 0), buffer, sizeof(buffer), 0, (const struct sockaddr *)&saddr, sizeof(saddr)); Additional checks mean that this isn't actually a problem in the code base, but I could imagine it becoming a problem later if the function is used more liberally. I would prefer to fix this by having wg_examine_packet_protocol return a 32-bit ~0 value on failure, which will never match any value of skb->protocol, which would simply change the generated code from a mov to a movzx. However, sparse complains, and adding __force casts doesn't seem like a good idea, so instead we just add a simple helper function to check for the zero return value. Since wg_examine_packet_protocol itself gets inlined, this winds up not adding an additional branch to the generated code, since the 0 return value already happens in a mergeable branch. Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
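The helper described above might look roughly like this in a userspace sketch. The function names, the constants, and the simplified header parsing are illustrative assumptions, and values are kept in host byte order for brevity, unlike the kernel's __be16.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPV4_SKETCH 0x0800
#define PROTO_IPV6_SKETCH 0x86DD

/* Inspects the packet bytes; returns 0 when they are not a plausible IP packet. */
static uint16_t examine_packet_protocol_sketch(const uint8_t *pkt, size_t len)
{
	if (len >= 20 && (pkt[0] >> 4) == 4)
		return PROTO_IPV4_SKETCH;
	if (len >= 40 && (pkt[0] >> 4) == 6)
		return PROTO_IPV6_SKETCH;
	return 0;
}

/* True only if the packet parses AND the claimed protocol matches it. */
static bool check_packet_protocol_sketch(uint16_t claimed, const uint8_t *pkt,
					 size_t len)
{
	uint16_t real = examine_packet_protocol_sketch(pkt, len);

	/* The explicit nonzero test is what rejects claimed == 0 on garbage input. */
	return real && claimed == real;
}
```

A bare `claimed != examine(...)` comparison would wrongly accept the five zero bytes from the AF_PACKET reproducer, since both sides are zero.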
2020-03-03compat: RHEL 8.2 backported ipv6_dst_lookup_flowJason A. Donenfeld1-1/+1
Reported-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-19curve25519-x86_64: avoid use of r12Jason A. Donenfeld1-55/+55
This causes problems with RAP and KERNEXEC for PaX, as r12 is a reserved register. It also leads to a more compact instruction encoding, saving about 100 cycles. Suggested-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-15compat: RHEL 7 backported skb_ensure_writable()Luis Ressel1-1/+1
Reported-by: chotaire <chotaire@chotaire.net> Signed-off-by: Luis Ressel <aranea@aixah.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-15version: bumpv0.0.20200215Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-14socket: remove useless synchronize_netJason A. Donenfeld1-1/+0
Utter nonsense from way back when. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Fixes: 8906775b ("socket: synchronize net on socket tear down")
2020-02-14send: cleanup skb padding calculationJason A. Donenfeld1-6/+11
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-14version: bumpv0.0.20200214Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-14send: account for mtu=0 devicesJason A. Donenfeld3-4/+12
It turns out there's an easy way to get packets queued up while still having an MTU of zero, and that's via persistent keep alive. This commit makes sure that in whatever condition, we don't wind up dividing by zero. Note that an MTU of zero for a wireguard interface is something quasi-valid, so I don't think the correct fix is to limit it via min_mtu. This can be reproduced easily with: ip link add wg0 type wireguard ip link add wg1 type wireguard ip link set wg0 up mtu 0 ip link set wg1 up wg set wg0 private-key <(wg genkey) wg set wg1 listen-port 1 private-key <(wg genkey) peer $(wg show wg0 public-key) wg set wg0 peer $(wg show wg1 public-key) persistent-keepalive 1 endpoint 127.0.0.1:1 However, while min_mtu=0 seems fine, it makes sense to restrict the max_mtu. This commit also restricts the maximum MTU to the greatest number for which rounding up to the padding multiple won't overflow a signed integer. Packets this large were always rejected anyway eventually, due to checks deeper in, but it seems more sound not to even let the administrator configure something that won't work anyway. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
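To make the division-by-zero concern concrete, here is a small sketch of a padding calculation that guards the MTU before taking a remainder. The function names and the padding multiple of 16 are assumptions for illustration; it is not a copy of the module's function.

```c
/* Padding granularity assumed for this sketch. */
#define PADDING_MULTIPLE_SKETCH 16u

/* Rounds x up to the next multiple of the padding granularity. */
static unsigned int align_up_sketch(unsigned int x)
{
	return (x + PADDING_MULTIPLE_SKETCH - 1) & ~(PADDING_MULTIPLE_SKETCH - 1);
}

/* Returns how many padding bytes to append to a packet of length len. */
static unsigned int padding_sketch(unsigned int len, unsigned int mtu)
{
	unsigned int last_unit = len, padded;

	/* MTU of zero: pad to the multiple and never touch the modulo below. */
	if (mtu == 0)
		return align_up_sketch(last_unit) - last_unit;

	/* Only the final MTU-sized unit of an oversized packet gets padded. */
	if (last_unit > mtu)
		last_unit %= mtu;

	/* Never pad beyond the MTU. */
	padded = align_up_sketch(last_unit);
	if (padded > mtu)
		padded = mtu;
	return padded - last_unit;
}
```

A keepalive has length zero, so with an MTU of zero the early return yields zero bytes of padding and the remainder operation is never reached.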
2020-02-13receive: reset last_under_load to zeroJason A. Donenfeld1-2/+5
This is a small optimization that prevents more expensive comparisons from happening when they are no longer necessary, by clearing the last_under_load variable whenever we wind up in a state where we were under load but we no longer are. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Suggested-by: Matt Dunwoodie <ncon@noconroy.net>
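The optimization can be sketched like this, with the clock passed in as a parameter so the logic is easy to follow. The names and the one-second grace period are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ull

/* Timestamp of the last time we were under load; 0 means "not recently". */
static uint64_t last_under_load_sketch;

/* queue_is_busy: current load signal; now_ns: current time in nanoseconds. */
static bool under_load_sketch(bool queue_is_busy, uint64_t now_ns)
{
	bool under_load = queue_is_busy;

	if (under_load) {
		last_under_load_sketch = now_ns;
	} else if (last_under_load_sketch) {
		/* Still inside the grace period after the last busy moment? */
		under_load = now_ns - last_under_load_sketch < NSEC_PER_SEC_SKETCH;
		/* Once it has expired, clear the stamp so later calls skip the comparison. */
		if (!under_load)
			last_under_load_sketch = 0;
	}
	return under_load;
}
```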
2020-02-12netns: ensure that icmp src address is correct with natJason A. Donenfeld3-36/+101
This is a small test to ensure that icmp_ndo_send is actually doing the right thing with regard to the source address. It tests this by ensuring that the error comes back along the right path. Also, backport the new ndo function for this. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-06chacha20poly1305: defensively protect against large inputsJason A. Donenfeld1-1/+3
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-05version: bumpv0.0.20200205Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-05netns: ensure non-addition of peers with failed precomputationJason A. Donenfeld1-0/+6
Ensure that peers with low order points are ignored, both in the case where we already have a device private key and in the case where we do not. This adds points that naturally give a zero output. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-05netns: tie socket waiting to target pidJason A. Donenfeld1-9/+8
Without this, we wind up proceeding too early sometimes when the previous process has just used the same listening port. So, we tie the listening socket query to the specific pid we're interested in. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-05noise: reject peers with low order public keysJason A. Donenfeld2-7/+9
Our static-static calculation returns a failure if the public key is of low order. We check for this when peers are added, and don't allow them to be added if they're low order, except in the case where we haven't yet been given a private key. In that case, we would defer the removal of the peer until we're given a private key, since at that point we're doing new static-static calculations which incur failures we can act on. This meant, however, that we wound up removing peers rather late in the configuration flow. Syzkaller points out that peer_remove calls flush_workqueue, which in turn might then wait for sending a handshake initiation to complete. Since handshake initiation needs the static identity lock, holding the static identity lock while calling peer_remove can result in a rare deadlock. We have precisely this case in this situation of late-stage peer removal based on an invalid public key. We can't drop the lock when removing, because then incoming handshakes might interact with a bogus static-static calculation. While the band-aid patch for this would involve breaking up the peer removal into two steps like wg_peer_remove_all does, in order to solve the locking issue, there's actually a much more elegant way of fixing this: If the static-static calculation succeeds with one private key, it *must* succeed with all others, because all 32-byte strings map to valid private keys, thanks to clamping. That means we can get rid of this silly dance and locking headaches of removing peers late in the configuration flow, and instead just reject them early on, regardless of whether the device has yet been assigned a private key. For the case where the device doesn't yet have a private key, we safely use zeros just for the purposes of checking for low order points by way of checking the output of the calculation. 
The following PoC will trigger the deadlock: ip link add wg0 type wireguard ip addr add 10.0.0.1/24 dev wg0 ip link set wg0 up ping -f 10.0.0.2 & while true; do wg set wg0 private-key /dev/null peer AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA= allowed-ips 10.0.0.0/24 endpoint 10.0.0.3:1234 wg set wg0 private-key <(echo AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=) done [ 0.949105] ====================================================== [ 0.949550] WARNING: possible circular locking dependency detected [ 0.950143] 5.5.0-debug+ #18 Not tainted [ 0.950431] ------------------------------------------------------ [ 0.950959] wg/89 is trying to acquire lock: [ 0.951252] ffff8880333e2128 ((wq_completion)wg-kex-wg0){+.+.}, at: flush_workqueue+0xe3/0x12f0 [ 0.951865] [ 0.951865] but task is already holding lock: [ 0.952280] ffff888032819bc0 (&wg->static_identity.lock){++++}, at: wg_set_device+0x95d/0xcc0 [ 0.953011] [ 0.953011] which lock already depends on the new lock. [ 0.953011] [ 0.953651] [ 0.953651] the existing dependency chain (in reverse order) is: [ 0.954292] [ 0.954292] -> #2 (&wg->static_identity.lock){++++}: [ 0.954804] lock_acquire+0x127/0x350 [ 0.955133] down_read+0x83/0x410 [ 0.955428] wg_noise_handshake_create_initiation+0x97/0x700 [ 0.955885] wg_packet_send_handshake_initiation+0x13a/0x280 [ 0.956401] wg_packet_handshake_send_worker+0x10/0x20 [ 0.956841] process_one_work+0x806/0x1500 [ 0.957167] worker_thread+0x8c/0xcb0 [ 0.957549] kthread+0x2ee/0x3b0 [ 0.957792] ret_from_fork+0x24/0x30 [ 0.958234] [ 0.958234] -> #1 ((work_completion)(&peer->transmit_handshake_work)){+.+.}: [ 0.958808] lock_acquire+0x127/0x350 [ 0.959075] process_one_work+0x7ab/0x1500 [ 0.959369] worker_thread+0x8c/0xcb0 [ 0.959639] kthread+0x2ee/0x3b0 [ 0.959896] ret_from_fork+0x24/0x30 [ 0.960346] [ 0.960346] -> #0 ((wq_completion)wg-kex-wg0){+.+.}: [ 0.960945] check_prev_add+0x167/0x1e20 [ 0.961351] __lock_acquire+0x2012/0x3170 [ 0.961725] lock_acquire+0x127/0x350 [ 0.961990] 
flush_workqueue+0x106/0x12f0 [ 0.962280] peer_remove_after_dead+0x160/0x220 [ 0.962600] wg_set_device+0xa24/0xcc0 [ 0.962994] genl_rcv_msg+0x52f/0xe90 [ 0.963298] netlink_rcv_skb+0x111/0x320 [ 0.963618] genl_rcv+0x1f/0x30 [ 0.963853] netlink_unicast+0x3f6/0x610 [ 0.964245] netlink_sendmsg+0x700/0xb80 [ 0.964586] __sys_sendto+0x1dd/0x2c0 [ 0.964854] __x64_sys_sendto+0xd8/0x1b0 [ 0.965141] do_syscall_64+0x90/0xd9a [ 0.965408] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 0.965769] [ 0.965769] other info that might help us debug this: [ 0.965769] [ 0.966337] Chain exists of: [ 0.966337] (wq_completion)wg-kex-wg0 --> (work_completion)(&peer->transmit_handshake_work) --> &wg->static_identity.lock [ 0.966337] [ 0.967417] Possible unsafe locking scenario: [ 0.967417] [ 0.967836] CPU0 CPU1 [ 0.968155] ---- ---- [ 0.968497] lock(&wg->static_identity.lock); [ 0.968779] lock((work_completion)(&peer->transmit_handshake_work)); [ 0.969345] lock(&wg->static_identity.lock); [ 0.969809] lock((wq_completion)wg-kex-wg0); [ 0.970146] [ 0.970146] *** DEADLOCK *** [ 0.970146] [ 0.970531] 5 locks held by wg/89: [ 0.970908] #0: ffffffff827433c8 (cb_lock){++++}, at: genl_rcv+0x10/0x30 [ 0.971400] #1: ffffffff82743480 (genl_mutex){+.+.}, at: genl_rcv_msg+0x642/0xe90 [ 0.971924] #2: ffffffff827160c0 (rtnl_mutex){+.+.}, at: wg_set_device+0x9f/0xcc0 [ 0.972488] #3: ffff888032819de0 (&wg->device_update_lock){+.+.}, at: wg_set_device+0xb0/0xcc0 [ 0.973095] #4: ffff888032819bc0 (&wg->static_identity.lock){++++}, at: wg_set_device+0x95d/0xcc0 [ 0.973653] [ 0.973653] stack backtrace: [ 0.973932] CPU: 1 PID: 89 Comm: wg Not tainted 5.5.0-debug+ #18 [ 0.974476] Call Trace: [ 0.974638] dump_stack+0x97/0xe0 [ 0.974869] check_noncircular+0x312/0x3e0 [ 0.975132] ? print_circular_bug+0x1f0/0x1f0 [ 0.975410] ? __kernel_text_address+0x9/0x30 [ 0.975727] ? unwind_get_return_address+0x51/0x90 [ 0.976024] check_prev_add+0x167/0x1e20 [ 0.976367] ? 
graph_lock+0x70/0x160 [ 0.976682] __lock_acquire+0x2012/0x3170 [ 0.976998] ? register_lock_class+0x1140/0x1140 [ 0.977323] lock_acquire+0x127/0x350 [ 0.977627] ? flush_workqueue+0xe3/0x12f0 [ 0.977890] flush_workqueue+0x106/0x12f0 [ 0.978147] ? flush_workqueue+0xe3/0x12f0 [ 0.978410] ? find_held_lock+0x2c/0x110 [ 0.978662] ? lock_downgrade+0x6e0/0x6e0 [ 0.978919] ? queue_rcu_work+0x60/0x60 [ 0.979166] ? netif_napi_del+0x151/0x3b0 [ 0.979501] ? peer_remove_after_dead+0x160/0x220 [ 0.979871] peer_remove_after_dead+0x160/0x220 [ 0.980232] wg_set_device+0xa24/0xcc0 [ 0.980516] ? deref_stack_reg+0x8e/0xc0 [ 0.980801] ? set_peer+0xe10/0xe10 [ 0.981040] ? __ww_mutex_check_waiters+0x150/0x150 [ 0.981430] ? __nla_validate_parse+0x163/0x270 [ 0.981719] ? genl_family_rcv_msg_attrs_parse+0x13f/0x310 [ 0.982078] genl_rcv_msg+0x52f/0xe90 [ 0.982348] ? genl_family_rcv_msg_attrs_parse+0x310/0x310 [ 0.982690] ? register_lock_class+0x1140/0x1140 [ 0.983049] netlink_rcv_skb+0x111/0x320 [ 0.983298] ? genl_family_rcv_msg_attrs_parse+0x310/0x310 [ 0.983645] ? netlink_ack+0x880/0x880 [ 0.983888] genl_rcv+0x1f/0x30 [ 0.984168] netlink_unicast+0x3f6/0x610 [ 0.984443] ? netlink_detachskb+0x60/0x60 [ 0.984729] ? find_held_lock+0x2c/0x110 [ 0.984976] netlink_sendmsg+0x700/0xb80 [ 0.985220] ? netlink_broadcast_filtered+0xa60/0xa60 [ 0.985533] __sys_sendto+0x1dd/0x2c0 [ 0.985763] ? __x64_sys_getpeername+0xb0/0xb0 [ 0.986039] ? sockfd_lookup_light+0x17/0x160 [ 0.986397] ? __sys_recvmsg+0x8c/0xf0 [ 0.986711] ? __sys_recvmsg_sock+0xd0/0xd0 [ 0.987018] __x64_sys_sendto+0xd8/0x1b0 [ 0.987283] ? 
lockdep_hardirqs_on+0x39b/0x5a0 [ 0.987666] do_syscall_64+0x90/0xd9a [ 0.987903] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 0.988223] RIP: 0033:0x7fe77c12003e [ 0.988508] Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 4 [ 0.989666] RSP: 002b:00007fffada2ed58 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 0.990137] RAX: ffffffffffffffda RBX: 00007fe77c159d48 RCX: 00007fe77c12003e [ 0.990583] RDX: 0000000000000040 RSI: 000055fd1d38e020 RDI: 0000000000000004 [ 0.991091] RBP: 000055fd1d38e020 R08: 000055fd1cb63358 R09: 000000000000000c [ 0.991568] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000002c [ 0.992014] R13: 0000000000000004 R14: 000055fd1d38e020 R15: 0000000000000001 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reported-by: syzbot <syzkaller@googlegroups.com>
2020-02-05allowedips: remove previously added list item when OOM failEric Dumazet1-0/+1
In the unlikely case a new node could not be allocated, we need to remove @newnode from @peer->allowedips_list before freeing it. syzbot reported: BUG: KASAN: use-after-free in __list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54 Read of size 8 at addr ffff88809881a538 by task syz-executor.4/30133 CPU: 0 PID: 30133 Comm: syz-executor.4 Not tainted 5.5.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135 __list_del_entry_valid+0xdc/0xf5 lib/list_debug.c:54 __list_del_entry include/linux/list.h:132 [inline] list_del include/linux/list.h:146 [inline] root_remove_peer_lists+0x24f/0x4b0 drivers/net/wireguard/allowedips.c:65 wg_allowedips_free+0x232/0x390 drivers/net/wireguard/allowedips.c:300 wg_peer_remove_all+0xd5/0x620 drivers/net/wireguard/peer.c:187 wg_set_device+0xd01/0x1350 drivers/net/wireguard/netlink.c:542 genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline] genl_family_rcv_msg net/netlink/genetlink.c:717 [inline] genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 genl_rcv+0x29/0x40 net/netlink/genetlink.c:745 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:672 ____sys_sendmsg+0x753/0x880 net/socket.c:2343 ___sys_sendmsg+0x100/0x170 net/socket.c:2397 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430 __do_sys_sendmsg net/socket.c:2439 [inline] __se_sys_sendmsg net/socket.c:2437 [inline] 
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x45b399 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f99a9bcdc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f99a9bce6d4 RCX: 000000000045b399 RDX: 0000000000000000 RSI: 0000000020001340 RDI: 0000000000000003 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004 R13: 00000000000009ba R14: 00000000004cb2b8 R15: 0000000000000009 Allocated by task 30103: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:513 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:527 kmem_cache_alloc_trace+0x158/0x790 mm/slab.c:3551 kmalloc include/linux/slab.h:556 [inline] kzalloc include/linux/slab.h:670 [inline] add+0x70a/0x1970 drivers/net/wireguard/allowedips.c:236 wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320 set_allowedip drivers/net/wireguard/netlink.c:343 [inline] set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468 wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591 genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline] genl_family_rcv_msg net/netlink/genetlink.c:717 [inline] genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 genl_rcv+0x29/0x40 net/netlink/genetlink.c:745 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 
net/socket.c:672
 ____sys_sendmsg+0x753/0x880 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
 __do_sys_sendmsg net/socket.c:2439 [inline]
 __se_sys_sendmsg net/socket.c:2437 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 30103:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:335 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x10a/0x2c0 mm/slab.c:3757
 add+0x12d2/0x1970 drivers/net/wireguard/allowedips.c:266
 wg_allowedips_insert_v4+0xf6/0x160 drivers/net/wireguard/allowedips.c:320
 set_allowedip drivers/net/wireguard/netlink.c:343 [inline]
 set_peer+0xfb9/0x1150 drivers/net/wireguard/netlink.c:468
 wg_set_device+0xbd4/0x1350 drivers/net/wireguard/netlink.c:591
 genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
 genl_rcv_msg+0x67d/0xea0 net/netlink/genetlink.c:734
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 genl_rcv+0x29/0x40 net/netlink/genetlink.c:745
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 ____sys_sendmsg+0x753/0x880 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
 __do_sys_sendmsg net/socket.c:2439 [inline]
 __se_sys_sendmsg net/socket.c:2437 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88809881a500
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 56 bytes inside of
 64-byte region [ffff88809881a500, ffff88809881a540)
The buggy address belongs to the page:
page:ffffea0002620680 refcount:1 mapcount:0 mapping:ffff8880aa400380 index:0x0
raw: 00fffe0000000200 ffffea000250b748 ffffea000254bac8 ffff8880aa400380
raw: 0000000000000000 ffff88809881a000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88809881a400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88809881a480: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>ffff88809881a500: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                                        ^
 ffff88809881a580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88809881a600: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc

Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: wireguard@lists.zx2c4.com
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-30compat: remove RHEL-7.6 workaroundJason A. Donenfeld1-1/+1
We only support the latest RHEL-7 and RHEL-8. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-30compat: support building for RHEL-8.2Ilie Halip1-3/+8
Red Hat backported some more changes, now released as kernel 4.18.0-168.el8. To maintain compatibility with the -147 kernel, a new macro is introduced: ISRHEL82. Compile-tested with the -168 and -147 kernels. Signed-off-by: Ilie Halip <ilie.halip@gmail.com> [zx2c4: we normally only support the latest RHEL, but having some beta support for the time being sounds like a good plan, given that there may be interest from Red Hat in actually merging this into their kernels.] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-28version: bumpv0.0.20200128Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-28compat: account for frankenzinc being in 5.5Jason A. Donenfeld2-0/+79
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-28compat: refuse to build on >= 5.6Jason A. Donenfeld1-0/+4
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-28qemu: bump kernelJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-21version: bumpv0.0.20200121Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-21curve25519: x86_64: replace with formally verified implementationJason A. Donenfeld3-2308/+1300
This comes from INRIA's HACL*/Vale. It implements the same algorithm and implementation strategy as the code it replaces, only this code has been formally verified, sans the base point multiplication, which uses code similar to the prior implementation, only it uses the formally verified field arithmetic alongside reproducible ladder generation steps. This doesn't have a pure-bmi2 version, which means Haswell no longer benefits, but the increased (doubled) code complexity is not worth it for a single generation of chips that's already old.

Performance-wise, this is around 1% slower on older microarchitectures, and slightly faster on newer microarchitectures, mainly 10nm ones or backports of 10nm to 14nm. This implementation is "everest" below:

Xeon E5-2680 v4 (Broadwell)
  armfazh: 133340 cycles per call
  everest: 133436 cycles per call
Xeon Gold 5120 (Sky Lake Server)
  armfazh: 112636 cycles per call
  everest: 113906 cycles per call
Core i5-6300U (Sky Lake Client)
  armfazh: 116810 cycles per call
  everest: 117916 cycles per call
Core i7-7600U (Kaby Lake)
  armfazh: 119523 cycles per call
  everest: 119040 cycles per call
Core i7-8750H (Coffee Lake)
  armfazh: 113914 cycles per call
  everest: 113650 cycles per call
Core i9-9880H (Coffee Lake Refresh)
  armfazh: 112616 cycles per call
  everest: 114082 cycles per call
Core i3-8121U (Cannon Lake)
  armfazh: 113202 cycles per call
  everest: 111382 cycles per call
Core i7-8265U (Whiskey Lake)
  armfazh: 127307 cycles per call
  everest: 127697 cycles per call
Core i7-8550U (Kaby Lake Refresh)
  armfazh: 127522 cycles per call
  everest: 127083 cycles per call
Xeon Platinum 8275CL (Cascade Lake)
  armfazh: 114380 cycles per call
  everest: 114656 cycles per call

Achieving these kinds of results with formally verified code is quite remarkable, especially considering that performance is favorable for newer chips.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-11device: skb_list_walk_safe moved upstreamJason A. Donenfeld2-8/+9
This won't be ported to 5.6, of course, but it's still cleaner to get this out of the way. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-11Makefile: strip prefixed v from version.hJason A. Donenfeld3-22/+12
We also no longer do anything dynamic with dkms.conf, and we don't rewrite any files at all, but rather pass this through as a cflag to the compiler optionally. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reported-by: Egbert Verhage <egbert@eggiecode.org>
2020-01-05version: bumpv0.0.20200105Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-02qemu: only compare archs when deciding whether to use kvmJason A. Donenfeld1-14/+15
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-02qemu: re-add dependency on wireguard sourcesJason A. Donenfeld1-1/+1
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-01-02socket: mark skbs as not on list when receiving via groJason A. Donenfeld1-0/+1
Certain drivers will pass gro skbs to udp, at which point the udp driver simply iterates through them and passes them off to encap_rcv, which is where we pick up. At the moment, we're not attempting to coalesce these into bundles, but we also don't want to wind up having cascaded lists of skbs treated separately. The right behavior here, then, is to just mark each incoming one as not on a list. This can be seen in practice, for example, with Qualcomm's rmnet_perf driver. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Yaroslav Furman <yaro330@gmail.com>
2020-01-01qemu: bump packages and support m68k properlyJason A. Donenfeld5-17/+22
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-12-26version: bumpv0.0.20191226Jason A. Donenfeld2-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-12-26dkms: set maximum kernel to 5.5Jason A. Donenfeld1-2/+2
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>