aboutsummaryrefslogtreecommitdiffstats
path: root/net/netfilter/nft_set_pipapo.h
AgeCommit message (Collapse)AuthorFilesLines
10 daysnetfilter: nft_set_pipapo: remove dirty flagFlorian Westphal1-2/+0
After previous change: ->clone exists: ->dirty is always true ->clone == NULL ->dirty is always false So remove this flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-21netfilter: nft_set_pipapo: speed up bulk element insertionsFlorian Westphal1-0/+2
Insertions into the set are slow when we try to add many elements. For 800k elements I get: time nft -f pipapo_800k real 19m34.849s user 0m2.390s sys 19m12.828s perf stats: --95.39%--nft_pipapo_insert |--76.60%--pipapo_insert | --76.37%--pipapo_resize | |--72.87%--memcpy_orig | |--1.88%--__free_pages_ok | | --0.89%--free_tail_page_prepare | --1.38%--kvmalloc_node .. --18.56%--pipapo_get.isra.0 |--13.91%--__bitmap_and |--3.01%--pipapo_refill |--0.81%--__kmalloc | --0.74%--__kmalloc_large_node | --0.66%--__alloc_pages .. --0.52%--memset_orig So lots of time is spent in copying exising elements to make space for the next one. Instead of allocating to the exact size of the new rule count, allocate extra slack to reduce alloc/copy/free overhead. After: time nft -f pipapo_800k real 1m54.110s user 0m2.515s sys 1m51.377s --80.46%--nft_pipapo_insert |--73.45%--pipapo_get.isra.0 |--57.63%--__bitmap_and | |--8.52%--pipapo_refill |--3.45%--__kmalloc | --3.05%--__kmalloc_large_node | --2.58%--__alloc_pages --2.59%--memset_orig |--6.51%--pipapo_insert --5.96%--pipapo_resize |--3.63%--memcpy_orig --2.13%--kvmalloc_node The new @rules_alloc fills a hole, so struct size doesn't go up. Also make it so rule removal doesn't shrink unless the free/extra space exceeds two pages. This should be safe as well: When a rule gets removed, the attempt to lower the allocated size is already allowed to fail. Exception: do exact allocations as long as set is very small (less than one page needed). v2: address comments from Stefano: kdoc comment formatting changes remove redundant assignment switch back to PAGE_SIZE Link: https://lore.kernel.org/netfilter-devel/20240213141753.17ef27a6@elisabeth/ Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2024-02-21netfilter: nft_set_pipapo: shrink data structuresFlorian Westphal1-19/+10
The set uses a mix of 'int', 'unsigned int', and size_t. The rule count limit is NFT_PIPAPO_RULE0_MAX, which cannot exceed INT_MAX (a few helpers use 'int' as return type). Add a compile-time assertion for this. Replace size_t usage in structs with unsigned int or u8 where the stored values are smaller. Replace signed-int arguments for lengths with 'unsigned int' where possible. Last, remove lt_aligned member: its set but never read. struct nft_pipapo_match 40 bytes -> 32 bytes struct nft_pipapo_field 56 bytes -> 32 bytes Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2024-02-21netfilter: nft_set_pipapo: constify lookup fn args where possibleFlorian Westphal1-3/+3
Those get called from packet path, content must not be modified. No functional changes intended. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2024-02-15netfilter: nft_set_pipapo: fix missing : in kdocPablo Neira Ayuso1-2/+2
Add missing : in kdoc field names. Fixes: 8683f4b9950d ("nft_set_pipapo: Prepare for vectorised implementation: helpers") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-08netfilter: nft_set_pipapo: remove scratch_aligned pointerFlorian Westphal1-4/+2
use ->scratch for both avx2 and the generic implementation. After previous change the scratch->map member is always aligned properly for AVX2, so we can just use scratch->map in AVX2 too. The alignoff delta is stored in the scratchpad so we can reconstruct the correct address to free the area again. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-08netfilter: nft_set_pipapo: store index in scratch mapsFlorian Westphal1-2/+12
Pipapo needs a scratchpad area to keep state during matching. This state can be large and thus cannot reside on stack. Each set preallocates percpu areas for this. On each match stage, one scratchpad half starts with all-zero and the other is inited to all-ones. At the end of each stage, the half that starts with all-ones is always zero. Before next field is tested, pointers to the two halves are swapped, i.e. resmap pointer turns into fill pointer and vice versa. After the last field has been processed, pipapo stashes the index toggle in a percpu variable, with assumption that next packet will start with the all-zero half and sets all bits in the other to 1. This isn't reliable. There can be multiple sets and we can't be sure that the upper and lower half of all set scratch map is always in sync (lookups can be conditional), so one set might have swapped, but other might not have been queried. Thus we need to keep the index per-set-and-cpu, just like the scratchpad. Note that this bug fix is incomplete, there is a related issue. avx2 and normal implementation might use slightly different areas of the map array space due to the avx2 alignment requirements, so m->scratch (generic/fallback implementation) and ->scratch_aligned (avx) may partially overlap. scratch and scratch_aligned are not distinct objects, the latter is just the aligned address of the former. After this change, write to scratch_align->map_index may write to scratch->map, so this issue becomes more prominent, we can set to 1 a bit in the supposedly-all-zero area of scratch->map[]. A followup patch will remove the scratch_aligned and makes generic and avx code use the same (aligned) area. Its done in a separate change to ease review. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: expose opaque set element as struct nft_elem_privPablo Neira Ayuso1-1/+3
Add placeholder structure and place it at the beginning of each struct nft_*_elem for each existing set backend, instead of exposing elements as void type to the frontend which defeats compiler type checks. Use this pointer to this new type to replace void *. This patch updates the following set backend API to use this new struct nft_elem_priv placeholder structure: - update - deactivate - flush - get as well as the following helper functions: - nft_set_elem_ext() - nft_set_elem_init() - nft_set_elem_destroy() - nf_tables_set_elem_destroy() This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to native element representation from the corresponding set backend. BUILD_BUG_ON() makes sure this .priv placeholder is always at the top of the opaque set element representation. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-12netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_byKees Cook1-1/+1
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct nft_pipapo_match. Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Cc: netdev@vger.kernel.org Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2021-05-29netfilter: nf_tables: prefer direct calls for set lookupsFlorian Westphal1-2/+0
Extend nft_set_do_lookup() to use direct calls when retpoline feature is enabled. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-14netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to ↵Stefano Brivio1-0/+2
non-AVX2 version Arturo reported this backtrace: [709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0 [709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod [709732.358941] pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common [709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1 [709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020 [709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0 [709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb [709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202 [709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001 [709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003 [709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462 [709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960 [709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660 [709732.358990] FS: 0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000 [709732.358993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0 [709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [709732.359003] PKRU: 55555554 [709732.359005] Call Trace: [709732.359009] <IRQ> [709732.359035] nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables] [709732.359046] ? sched_clock+0x5/0x10 [709732.359054] ? sched_clock_cpu+0xc/0xb0 [709732.359061] ? record_times+0x16/0x80 [709732.359068] ? plist_add+0xc1/0x100 [709732.359073] ? psi_group_change+0x47/0x230 [709732.359079] ? skb_clone+0x4d/0xb0 [709732.359085] ? enqueue_task_rt+0x22b/0x310 [709732.359098] ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en] [709732.359102] ? packet_rcv+0x40/0x4a0 [709732.359121] nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359133] nft_do_chain+0x350/0x500 [nf_tables] [709732.359152] ? nft_lookup_eval+0x59/0x160 [nf_tables] [709732.359163] ? nft_do_chain+0x364/0x500 [nf_tables] [709732.359172] ? fib4_rule_action+0x6d/0x80 [709732.359178] ? fib_rules_lookup+0x107/0x250 [709732.359184] nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat] [709732.359193] nf_nat_inet_fn+0xea/0x210 [nf_nat] [709732.359202] nf_nat_ipv4_out+0x14/0xa0 [nf_nat] [709732.359207] nf_hook_slow+0x44/0xc0 [709732.359214] ip_output+0xd2/0x100 [709732.359221] ? __ip_finish_output+0x210/0x210 [709732.359226] ip_forward+0x37d/0x4a0 [709732.359232] ? ip4_key_hashfn+0xb0/0xb0 [709732.359238] ip_sublist_rcv_finish+0x4f/0x60 [709732.359243] ip_sublist_rcv+0x196/0x220 [709732.359250] ? ip_rcv_finish_core.isra.22+0x400/0x400 [709732.359255] ip_list_rcv+0x137/0x160 [709732.359264] __netif_receive_skb_list_core+0x29b/0x2c0 [709732.359272] netif_receive_skb_list_internal+0x1a6/0x2d0 [709732.359280] gro_normal_list.part.156+0x19/0x40 [709732.359286] napi_complete_done+0x67/0x170 [709732.359298] bnxt_poll+0x105/0x190 [bnxt_en] [709732.359304] ? irqentry_exit+0x29/0x30 [709732.359309] ? asm_common_interrupt+0x1e/0x40 [709732.359315] net_rx_action+0x144/0x3c0 [709732.359322] __do_softirq+0xd5/0x29c [709732.359329] asm_call_irq_on_stack+0xf/0x20 [709732.359332] </IRQ> [709732.359339] do_softirq_own_stack+0x37/0x40 [709732.359346] irq_exit_rcu+0x9d/0xa0 [709732.359353] common_interrupt+0x78/0x130 [709732.359358] asm_common_interrupt+0x1e/0x40 [709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel] [709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d [709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246 [709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50 [709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0 [709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000 [709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58 [709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38 [709732.359395] ? crc_43+0x1e/0x1e [crc32c_intel] [709732.359403] ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel] [709732.359419] ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2] [709732.359425] ? irq_exit_rcu+0x3e/0xa0 [709732.359447] ? kjournald2+0xbd/0x270 [jbd2] [709732.359454] ? finish_wait+0x80/0x80 [709732.359470] ? commit_timeout+0x10/0x10 [jbd2] [709732.359476] ? kthread+0x116/0x130 [709732.359481] ? kthread_park+0x80/0x80 [709732.359488] ? ret_from_fork+0x1f/0x30 [709732.359494] ---[ end trace 081a19978e5f09f5 ]--- that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq that interrupted a kthread, also using the FPU. That's exactly the reason why irq_fpu_usable() is there: use it, and if we can't use the FPU, fall back to the non-AVX2 version of the lookup operation, i.e. nft_pipapo_lookup(). Reported-by: Arturo Borrero Gonzalez <arturo@netfilter.org> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Prepare for single ranged field usageStefano Brivio1-0/+3
A few adjustments in nft_pipapo_init() are needed to allow usage of this set back-end for a single, ranged field. Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently makes sure that the rbtree back-end is selected instead, for sets with a single field. This finally allows a fair comparison with rbtree sets, by defining NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation: ---------------.--------------------------.-------------------------. AMD Epyc 7402 | baselines, Mpps | Mpps, % over rbtree | 1 thread |__________________________|_________________________| 3.35GHz | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | pipapo | ---------------| hook | no | single | pipapo |single field| type entries | drop | ranges | field |single field| AVX2 | ---------------|--------|--------|--------|------------|------------| net,port | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 6.0 +58% | 9.6 +153% | ---------------|--------|--------|--------|------------|------------| port,net | | | | | | 100 | 18.8 | 10.3 | 5.8 | 9.1 +57% |11.6 +100% | ---------------|--------|--------|--------|------------|------------| net6,port | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 2.8 +55% | 6.5 +261% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | [1] | [1] | 30000 | 19.6 | 11.6 | 3.9 | 0.9 -77% | 2.7 -31% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 10000 | 19.6 | 11.6 | 4.4 | 2.1 -52% | 5.6 +27% | ---------------|--------|--------|--------|------------|------------| port,proto | | | | | | 4 threads 10000| 77.9 | 45.1 | 17.4 | 8.3 -52% |22.4 +29% | ---------------|--------|--------|--------|------------|------------| net6,port,mac | | | | | | 10 | 16.5 | 5.4 | 4.3 | 4.5 +5% | 8.2 +91% | ---------------|--------|--------|--------|------------|------------| net6,port,mac, | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 2.8 +47% | 6.6 +247% | ---------------|--------|--------|--------|------------|------------| net,mac | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 6.0 +54% | 9.9 +154% | ---------------'--------'--------'--------'------------'------------' [1] Causes switch of lookup table buckets for 'port' to 4-bit groups Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15nft_set_pipapo: Prepare for vectorised implementation: helpersStefano Brivio1-0/+277
Move most macros and helpers to a header file, so that they can be conveniently used by related implementations. No functional changes are intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>