Commit Graph

863 Commits

Author SHA1 Message Date
Linus Torvalds
454cb97726 This update includes the following changes:
API:
 
 - Remove physical address skcipher walking.
 - Fix boot-up self-test race.
 
 Algorithms:
 
 - Optimisations for x86/aes-gcm.
 - Optimisations for x86/aes-xts.
 - Remove VMAC.
 - Remove keywrap.
 
 Drivers:
 
 - Remove n2.
 
 Others:
 
 - Fixes for padata UAF.
 - Fix potential rhashtable deadlock by moving schedule_work outside lock.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmeSIvwACgkQxycdCkmx
 i6dkYw//bJ6OxIXdtsDWVtJF4GnfxLYSU33GGGMWrbwxS/EihL12rkB3JPw2avJb
 oFBP8rWl5Qv9tDF2gjn6TyBaydVnKMA9nUbsqKN6m/DZ/RcCpHigQ21HVzny3bhw
 rHsZcWoy14TXMuni1DhLnYPftbF+7qZ/pdT5WYr4MEchQhzQc6XWaS2T5by16bjn
 HHsPHNZj+kFDf4kKYab3jmnly8Qo0wpTMvuX1tsiUqt7YABcg3dobIisMPatxg8A
 CIgdBZJRivC55Cqm4JT7P+y63PsJVGCyoLXOAGoZN5CLwdTSGND12DJ1awEcOswc
 7fMlCk0gDrhniUTUzP8VsP8EUCezIIpaIfne9v/0OERo6DbiuX+NeEwxWJNdIHeS
 vZocY5a6hS84iBdsuPrUaPqZI6oUSYFIwKPJUwbyaY4j1cfowHz8zbgmmPO5TUV7
 NAI7/QpoMA3GNWn3p+64eeXekT2DcU5o3i14dbJ31FQhlFbzVWA7/2Z5ydu18Fex
 ntTEplPCzYrsqwuxmFDb/3dsk3Z98RquZZJzIKAXKSXTNBOYJaFOCTyugdkn18Nq
 p6dJNXEvl6lnjylgILa0ltv6TI8h7IRpuqi+FAqExOXR3H3gelVXUjMXnC0fmjrd
 +ARAzq223xPWwsKEd00Rb3FEoq0XyChvxh4n3BqM4XhSenWggOc=
 =/75o
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Remove physical address skcipher walking
   - Fix boot-up self-test race

  Algorithms:
   - Optimisations for x86/aes-gcm
   - Optimisations for x86/aes-xts
   - Remove VMAC
   - Remove keywrap

  Drivers:
   - Remove n2

  Others:
   - Fixes for padata UAF
   - Fix potential rhashtable deadlock by moving schedule_work outside
     lock"

* tag 'v6.14-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (75 commits)
  rhashtable: Fix rhashtable_try_insert test
  dt-bindings: crypto: qcom,inline-crypto-engine: Document the SM8750 ICE
  dt-bindings: crypto: qcom,prng: Document SM8750 RNG
  dt-bindings: crypto: qcom-qce: Document the SM8750 crypto engine
  crypto: asymmetric_keys - Remove unused key_being_used_for[]
  padata: avoid UAF for reorder_work
  padata: fix UAF in padata_reorder
  padata: add pd get/put refcnt helper
  crypto: skcipher - call cond_resched() directly
  crypto: skcipher - optimize initializing skcipher_walk fields
  crypto: skcipher - clean up initialization of skcipher_walk::flags
  crypto: skcipher - fold skcipher_walk_skcipher() into skcipher_walk_virt()
  crypto: skcipher - remove redundant check for SKCIPHER_WALK_SLOW
  crypto: skcipher - remove redundant clamping to page size
  crypto: skcipher - remove unnecessary page alignment of bounce buffer
  crypto: skcipher - document skcipher_walk_done() and rename some vars
  crypto: omap - switch from scatter_walk to plain offset
  crypto: powerpc/p10-aes-gcm - simplify handling of linear associated data
  crypto: bcm - Drop unused setting of local 'ptr' variable
  crypto: hisilicon/qm - support new function communication
  ...
2025-01-24 07:48:10 -08:00
Eric Biggers
3cd46a78ee crypto: x86/aes-xts - additional optimizations
Reduce latency by taking advantage of the property vaesenclast(key, a) ^
b == vaesenclast(key ^ b, a), like I did in the AES-GCM code.

Also replace a vpand and vpxor with a vpternlogd.

On AMD Zen 5 this improves performance by about 3%.  Intel performance
remains about the same, with a 0.1% improvement being seen on Icelake.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
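
The identity above holds because AESENCLAST(state, rk) computes
SubBytes(ShiftRows(state)) ^ rk, so XOR-ing a value into the output is
the same as XOR-ing it into the round key first.  A minimal user-space
check with AES-NI intrinsics (a hypothetical test program, not the
kernel's assembly; build with gcc -maes):

    #include <immintrin.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        __m128i a = _mm_set_epi32(0x01234567, 0x76543210, 0x0f1e2d3c, 0x4b5a6978);
        __m128i k = _mm_set_epi32(0x13579bdf, 0x2468ace0, 0x11223344, 0x55667788);
        __m128i b = _mm_set_epi32(0x0badf00d, 0x7ea5e5e5, 0x600dcafe, 0x01020304);

        /* vaesenclast(key, a) ^ b ... */
        __m128i lhs = _mm_xor_si128(_mm_aesenclast_si128(a, k), b);
        /* ... equals vaesenclast(key ^ b, a) */
        __m128i rhs = _mm_aesenclast_si128(a, _mm_xor_si128(k, b));

        printf("identity holds: %d\n", !memcmp(&lhs, &rhs, sizeof(lhs)));
        return 0;
    }
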
Eric Biggers
68e95f5c64 crypto: x86/aes-xts - more code size optimizations
Prefer immediates of -128 to 128, since the former fits in a signed
byte, saving 3 bytes per instruction.  Also prefer VEX-coded
instructions to EVEX where this is easy to do.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
77a4b5675b crypto: x86/aes-xts - change len parameter to int
The AES-XTS assembly code currently treats the length as signed, since
this saves a few instructions in the loop compared to treating it as
unsigned.  Therefore update the type to make this clear.  (It is not
actually passed any values larger than PAGE_SIZE.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
bd7e7df6e6 crypto: x86/aes-xts - improve some comments
Improve some of the comments in aes-xts-avx-x86_64.S.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
d1bb1c32f9 crypto: x86/aes-xts - make the register aliases per-function
Since aes-xts-avx-x86_64.S contains multiple functions, move the
register aliases for the parameters and local variables of the XTS
update function into the macro that generates that function.  Then add
register aliases to aes_xts_encrypt_iv() to improve readability there.
This makes aes-xts-avx-x86_64.S consistent with the GCM assembly files.

No change in the generated code.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
5b7981c1ca crypto: x86/aes-xts - use .irp when useful
Use .irp instead of repeating code.

No change in the generated code.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
95791ccd11 crypto: x86/aes-gcm - tune better for AMD CPUs
Reorganize the main loop to free up the RNDKEYLAST[0-3] registers and
use them for more cached round keys.  This improves performance by about
2% on AMD Zen 4 and Zen 5.  Intel performance remains about the same.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
Eric Biggers
3cae5a3c05 crypto: x86/aes-gcm - code size optimization
Prefer immediates of -128 to 128, since the former fits in a signed
byte, saving 3 bytes per instruction.  Also replace a vpand and vpxor
with a vpternlogd.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-21 22:46:24 +08:00
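
For the vpand+vpxor -> vpternlogd folding mentioned above, the
vpternlogd immediate is the 8-bit truth table of the desired
three-input function; (a & b) ^ c has table 0x6a.  A hedged intrinsics
sketch of the equivalence (needs AVX-512, e.g. gcc -mavx512f
-mavx512vl; illustrative, not the kernel code):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m128i a = _mm_set1_epi32(0x0ff00ff0);
        __m128i b = _mm_set1_epi32(0x12345678);
        __m128i c = _mm_set1_epi32(0x5a5a5a5a);

        /* two instructions: vpand + vpxor */
        __m128i two = _mm_xor_si128(_mm_and_si128(a, b), c);
        /* one instruction: vpternlogd with truth table 0x6a */
        __m128i one = _mm_ternarylogic_epi32(a, b, c, 0x6a);

        printf("equal: %d\n",
               _mm_movemask_epi8(_mm_cmpeq_epi8(one, two)) == 0xffff);
        return 0;
    }
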
Eric Biggers
a6185842d1 crypto: x86 - remove assignments of 0 to cra_alignmask
Struct fields are zero by default, so these lines of code have no
effect.  Remove them to reduce the number of matches that are found when
grepping for cra_alignmask.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-12-14 17:21:44 +08:00
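
This works because C guarantees that members omitted from a designated
initializer (and fields of objects with static storage) are
zero-initialized, so an explicit .cra_alignmask = 0 is dead weight.  A
tiny illustration with a stand-in struct (not the kernel's
struct crypto_alg):

    #include <assert.h>

    struct alg {
        const char   *name;
        unsigned int  alignmask;
    };

    int main(void)
    {
        struct alg a = { .name = "aes" };  /* alignmask not mentioned */
        assert(a.alignmask == 0);          /* guaranteed zero, C11 6.7.9 */
        return 0;
    }
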
Eric Biggers
ed4bc981d5 x86/crc-t10dif: expose CRC-T10DIF function through lib
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it
up to the library interface.  This allows it to be used without going
through the crypto API.  It remains usable via the crypto API too via
the shash algorithms that use the library interface.  Thus all the
arch-specific "shash" code becomes unnecessary and is removed.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01 17:23:13 -08:00
Eric Biggers
55d1ecceb8 x86/crc32: expose CRC32 functions through lib
Move the x86 CRC32 assembly code into the lib directory and wire it up
to the library interface.  This allows it to be used without going
through the crypto API.  It remains usable via the crypto API too via
the shash algorithms that use the library interface.  Thus all the
arch-specific "shash" code becomes unnecessary and is removed.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01 17:23:01 -08:00
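
For reference, the CRC32C that the library interface computes is the
reflected CRC with polynomial 0x1EDC6F41 (0x82F63B78 in reflected
form), with all-ones init and final XOR.  A bitwise user-space sketch
checked against the standard test vector (not the kernel's optimized
implementation):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32c_sw(uint32_t crc, const void *p, size_t len)
    {
        const uint8_t *buf = p;

        while (len--) {
            crc ^= *buf++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return crc;
    }

    int main(void)
    {
        const char *msg = "123456789";
        uint32_t crc = ~crc32c_sw(~0u, msg, strlen(msg));

        printf("0x%08X\n", crc);  /* expect the check value 0xE3069283 */
        return 0;
    }
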
Eric Biggers
1e6b72e60a x86/crc32: update prototype for crc32_pclmul_le_16()
- Change the len parameter from unsigned int to size_t, so that the
  library function which takes a size_t can safely use this code.

- Move the crc parameter to the front, as this is the usual convention.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01 17:23:01 -08:00
Eric Biggers
64e3586c0b x86/crc32: update prototype for crc_pcl()
- Change the len parameter from unsigned int to size_t, so that the
  library function which takes a size_t can safely use this code.

- Rename to crc32c_x86_3way() which is much clearer.

- Move the crc parameter to the front, as this is the usual convention.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-12-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01 17:23:01 -08:00
Linus Torvalds
02b2f1a7b8 This update includes the following changes:
API:
 
 - Add sig driver API.
 - Remove signing/verification from akcipher API.
 - Move crypto_simd_disabled_for_test to lib/crypto.
 - Add WARN_ON for return values from drivers that indicate memory corruption.
 
 Algorithms:
 
 - Provide crc32-arch and crc32c-arch through Crypto API.
 - Optimise crc32c code size on x86.
 - Optimise crct10dif on arm/arm64.
 - Optimise p10-aes-gcm on powerpc.
 - Optimise aegis128 on x86.
 - Output full sample from test interface in jitter RNG.
 - Retry without padata when it fails in pcrypt.
 
 Drivers:
 
 - Add support for Airoha EN7581 TRNG.
 - Add support for STM32MP25x platforms in stm32.
 - Enable iproc-r200 RNG driver on BCMBCA.
 - Add Broadcom BCM74110 RNG driver.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmc6sQsACgkQxycdCkmx
 i6dfHxAAnkI65TE6agZq9DlkEU4ZqOsxxdk0MsGIhbCUTxW3KENzu9vtKjnvg9T/
 Ou0d2J49ny87Y4zaA59Wf/Q1+gg5YSQR5kelonpfrPLkCkJjr72HZpyCHv8TTzEC
 uHHoVj9cnPIF5/yfiqQsrWT1ACip9vn+slyVPaMJV1qR6gnvnSALtsg4e/vKHkn7
 ZMaf2pZ2ROYXdB02nMK5KQcCrxD64MQle/yQepY44eYjnT+XclkqPdi6o1nUSpj/
 RFAeY0jFSTu0pj3DqT48TnU/LiiNLlFOZrGjCdEySoac63vmTtKqfYDmrRaFz4hB
 sucxbgJ3xnnYseRijtfXnxaD/IkDJln+ipGNQKAZLfOVMDCTxPdYGmOpobMTXMS+
 0sY0eAHgqr23P9pOp+sOzcAEFIqg6llAYQVWx3Zl4vpXBUuxzg6AqmHnPicnck7y
 Lw1cJhQxij2De3dG2ZL/0dgQxMjGN/YfCM8SSg6l+Xn3j4j47rqJNH2ZsmXtbJ2n
 kTkmemmWdgRR1IvgQQGsvyKs9ThkcEDW+IzW26SUv3Clvru2NSkX4ZPHbezZQf+D
 R0wMZsW3Fw7Zymerz1GIBSqdLnsyFWtIAjukDpOR6ordPgOBeDt76v6tw5vL2/II
 KYoeN1pdEEecwuhAsEvCryT5ZG4noBeNirf/ElWAfEybgcXiTks=
 =T8pa
 -----END PGP SIGNATURE-----

Merge tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add sig driver API
   - Remove signing/verification from akcipher API
   - Move crypto_simd_disabled_for_test to lib/crypto
   - Add WARN_ON for return values from drivers that indicate memory
     corruption

  Algorithms:
   - Provide crc32-arch and crc32c-arch through Crypto API
   - Optimise crc32c code size on x86
   - Optimise crct10dif on arm/arm64
   - Optimise p10-aes-gcm on powerpc
   - Optimise aegis128 on x86
   - Output full sample from test interface in jitter RNG
   - Retry without padata when it fails in pcrypt

  Drivers:
   - Add support for Airoha EN7581 TRNG
   - Add support for STM32MP25x platforms in stm32
   - Enable iproc-r200 RNG driver on BCMBCA
   - Add Broadcom BCM74110 RNG driver"

* tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits)
  crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx
  crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
  crypto: aesni - Move back to module_init
  crypto: lib/mpi - Export mpi_set_bit
  crypto: aes-gcm-p10 - Use the correct bit to test for P10
  hwrng: amd - remove reference to removed PPC_MAPLE config
  crypto: arm/crct10dif - Implement plain NEON variant
  crypto: arm/crct10dif - Macroify PMULL asm code
  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
  crypto: arm64/crct10dif - Remove obsolete chunking logic
  crypto: bcm - add error check in the ahash_hmac_init function
  crypto: caam - add error check to caam_rsa_set_priv_key_form
  hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver
  dt-bindings: rng: add binding for BCM74110 RNG
  padata: Clean up in padata_do_multithreaded()
  crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init()
  crypto: qat - Fix missing destroy_workqueue in adf_init_aer()
  crypto: rsassa-pkcs1 - Reinstate support for legacy protocols
  ...
2024-11-19 10:28:41 -08:00
Herbert Xu
dccd55892b crypto: aesni - Move back to module_init
This patch reverts commit 0fbafd06bd
("crypto: aesni - fix failing setkey for rfc4106-gcm-aesni") by
moving the aesni init function back to module_init from late_initcall.

The original patch was needed because tests were synchronous.  This
is no longer the case so there is no need to postpone the registration.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-11-15 19:52:51 +08:00
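
The shape of that change, as a skeletal module (registration bodies
elided; an illustrative sketch, not the actual aesni-intel glue):

    #include <linux/module.h>

    static int __init aesni_sketch_init(void)
    {
        /* crypto_register_*() calls go here; boot-time self-tests now
         * run asynchronously, so registering this early is safe again. */
        return 0;
    }

    static void __exit aesni_sketch_exit(void)
    {
        /* matching crypto_unregister_*() calls */
    }

    module_init(aesni_sketch_init);   /* was: late_initcall(...) */
    module_exit(aesni_sketch_exit);
    MODULE_LICENSE("GPL");
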
Eric Biggers
7cc26d4a5f crypto: x86/aegis128 - remove unneeded RETs
Remove returns that are immediately followed by another return.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
a09be0354b crypto: x86/aegis128 - remove unneeded FRAME_BEGIN and FRAME_END
Stop using FRAME_BEGIN and FRAME_END in the AEGIS assembly functions,
since all these functions are now leaf functions.  This eliminates some
unnecessary instructions.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
a0927a03e7 crypto: x86/aegis128 - take advantage of block-aligned len
Update a caller of aegis128_aesni_ad() to round down the length to a
block boundary.  After that, aegis128_aesni_ad(), aegis128_aesni_enc(),
and aegis128_aesni_dec() are only passed whole blocks.  Update the
assembly code to take advantage of that, which eliminates some unneeded
instructions.  For aegis128_aesni_enc() and aegis128_aesni_dec(), the
length is also always nonzero, so stop checking for zero length.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
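
Rounding down to a whole number of blocks is a single mask, since the
block size is a power of two.  A trivial user-space sketch of the
caller-side change described above (illustrative values):

    #include <stdio.h>

    #define AEGIS128_BLOCK_SIZE 16

    int main(void)
    {
        unsigned int assoclen = 70;
        unsigned int full = assoclen & ~(AEGIS128_BLOCK_SIZE - 1);

        printf("%u bytes -> %u whole-block bytes\n", assoclen, full);
        return 0;
    }
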
Eric Biggers
933e897431 crypto: x86/aegis128 - optimize partial block handling using SSE4.1
Optimize the code that loads and stores partial blocks, taking advantage
of SSE4.1.  The code is adapted from that in aes-gcm-aesni-x86_64.S.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
8da94b300f crypto: x86/aegis128 - improve assembly function prototypes
Adjust the prototypes of the AEGIS assembly functions:

- Use proper types instead of 'void *', when applicable.

- Move the length parameter to after the buffers it describes rather
  than before, to match the usual convention.  Also shorten its name to
  just len (which is the name used in the assembly code).

- Declare register aliases at the beginning of each function rather than
  once per file.  This was necessary because len was moved, but also it
  allows adding some aliases where raw registers were used before.

- Put assoclen and cryptlen in the correct order when declaring the
  finalization function in the .c file.

- Remove the unnecessary "crypto_" prefix.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
af2aff7caf crypto: x86/aegis128 - optimize length block preparation using SSE4.1
Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first
use case being preparing the length block in fewer instructions.

In practice this does not reduce the set of CPUs on which the code can
run, because all Intel and AMD CPUs with AES-NI also have SSE4.1.

Upgrade the existing SSE2 feature check to SSE4.1, though it seems this
check is not strictly necessary; the aesni-intel module has been getting
away with using SSE4.1 despite checking for AES-NI only.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
595bca25a6 crypto: x86/aegis128 - don't bother with special code for aligned data
Remove the AEGIS assembly code paths that were "optimized" to operate on
16-byte aligned data using movdqa, and instead just use the code paths
that use movdqu and can handle data with any alignment.

This does not reduce performance.  movdqa is basically a historical
artifact; on aligned data, movdqu and movdqa have had the same
performance since Intel Nehalem (2008) and AMD Bulldozer (2011).  And
code that requires AES-NI cannot run on CPUs older than those anyway.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
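
In intrinsic form, the unaligned load the commit keeps is
_mm_loadu_si128 (movdqu), which tolerates any address; the dropped
aligned-only path used _mm_load_si128 (movdqa), which faults on a
misaligned pointer.  A small demonstration (hypothetical user-space
test; build with -msse2):

    #include <emmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned char buf[17] = "0123456789abcdef";
        unsigned char out[16];

        /* deliberately misaligned source: movdqa/_mm_load_si128 here
         * would fault, movdqu/_mm_loadu_si128 is fine (and just as
         * fast on aligned data since Nehalem/Bulldozer) */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + 1));

        _mm_storeu_si128((__m128i *)out, v);
        printf("%.16s\n", out);
        return 0;
    }
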
Eric Biggers
b8d2e7bac3 crypto: x86/aegis128 - eliminate some indirect calls
Instead of using a struct of function pointers to decide whether to call
the encryption or decryption assembly functions, use a conditional
branch on a bool.  Force-inline the functions to avoid actually
generating the branch.  This improves performance slightly since
indirect calls are slow.  Remove the now-unnecessary CFI stubs.

Note that just force-inlining the existing functions might cause the
compiler to optimize out the indirect branches, but that would not be a
reliable way to do it and the CFI stubs would still be required.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
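
A C sketch of the pattern (illustrative helper names, not the aegis128
glue code): replace the function-pointer dispatch with a bool plus a
conditional, and force-inline so the compiler resolves the branch
wherever the direction is a constant:

    #include <stdbool.h>
    #include <stdio.h>

    #define __always_inline inline __attribute__((always_inline))

    static __always_inline int op_enc(int x) { return x + 1; }
    static __always_inline int op_dec(int x) { return x - 1; }

    /* before: a struct of function pointers cost an indirect call (and
     * CFI stubs) per use; after: one well-predicted direct branch */
    static __always_inline int crypt_one(int x, bool enc)
    {
        return enc ? op_enc(x) : op_dec(x);
    }

    int main(void)
    {
        printf("%d %d\n", crypt_one(41, true), crypt_one(43, false));
        return 0;
    }
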
Eric Biggers
ebb445f5e7 crypto: x86/aegis128 - remove no-op init and exit functions
Don't bother providing empty stubs for the init and exit methods in
struct aead_alg, since they are optional anyway.

Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
3b2f2d22fb crypto: x86/aegis128 - access 32-bit arguments as 32-bit
Fix the AEGIS assembly code to access 'unsigned int' arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero.

Note: there haven't been any reports of this bug actually causing
incorrect behavior.  Neither gcc nor clang guarantees zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.

Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Cc: stable@vger.kernel.org
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28 18:33:10 +08:00
Eric Biggers
84dd048cf8 crypto: x86/crc32c - eliminate jump table and excessive unrolling
crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
unrolled and uses a jump table to jump into the correct location.  This
optimization is misguided, as it bloats the binary code size and
introduces an indirect call.  x86_64 CPUs can predict loops well, so it
is fine to just use a loop instead.  Loop bookkeeping instructions can
compete with the crc instructions for the ALUs, but this is easily
mitigated by unrolling the loop by a smaller amount, such as 4 times.

Therefore, re-roll the loop and make related tweaks to the code.

This reduces the binary code size of crc_pcl() from 4546 bytes to 418
bytes, a 91% reduction.  In general it also makes the code faster, with
some large improvements seen when retpoline is enabled.

More detailed performance results are shown below.  They are given as
percent improvement in throughput (negative means regressed) for CPU
microarchitecture vs. input length in bytes.  E.g. an improvement from
40 GB/s to 50 GB/s would be listed as 25%.

Table 1: Results with retpoline enabled (the default):

                       |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  Intel Haswell        | 35.0% | 20.7% | 17.8% |  9.7% | -0.2% |  4.4% |
  Intel Emerald Rapids | 66.8% | 45.2% | 36.3% | 19.3% |  0.0% |  5.4% |
  AMD Zen 2            | 29.5% | 17.2% | 13.5% |  8.6% | -0.5% |  2.8% |

Table 2: Results with retpoline disabled:

                       |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  Intel Haswell        |  3.3% |  4.8% |  4.5% |  0.9% | -2.9% |  0.3% |
  Intel Emerald Rapids |  7.5% |  6.4% |  5.2% |  2.3% | -0.0% |  0.6% |
  AMD Zen 2            | 11.8% |  1.4% |  0.2% |  1.3% | -0.9% | -0.2% |

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
Eric Biggers
eebcadfa21 crypto: x86/crc32c - access 32-bit arguments as 32-bit
Fix crc32c-pcl-intel-asm_64.S to access 32-bit arguments as 32-bit
values instead of 64-bit, since the upper bits of the corresponding
64-bit registers are not guaranteed to be zero.  Also update the type of
the length argument to be unsigned int rather than int, as the assembly
code treats it as unsigned.

Note: there haven't been any reports of this bug actually causing
incorrect behavior.  Neither gcc nor clang guarantees zero-extension to
64 bits, but zero-extension is likely to happen in practice because most
instructions that operate on 32-bit registers zero-extend to 64 bits.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
Eric Biggers
84ebf9dbe6 crypto: x86/crc32c - simplify code for handling fewer than 200 bytes
The assembly code in crc32c-pcl-intel-asm_64.S is invoked only for
lengths >= 512, due to the overhead of saving and restoring FPU state.
Therefore, it is unnecessary for this code to be excessively "optimized"
for lengths < 200.  Eliminate the excessive unrolling of this part of
the code and use a more straightforward qword-at-a-time loop.

Note: the part of the code in question is not entirely redundant, as it
is still used to process any remainder mod 24, as well as any remaining
data when fewer than 200 bytes remain after at least one 3072-byte chunk.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26 14:41:59 +08:00
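
A qword-at-a-time CRC32C loop in intrinsic form, with the kind of
modest 4x unroll the re-rolled code uses to amortize loop bookkeeping.
This is a serial sketch of the idea only: the kernel's main loop
additionally splits the buffer into three independent streams to hide
crc32q latency.  Build with -msse4.2 on x86_64:

    #include <nmmintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t crc32c_qwords(uint32_t crc, const uint64_t *p, size_t n)
    {
        uint64_t c = crc;
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {   /* unrolled 4x */
            c = _mm_crc32_u64(c, p[i]);
            c = _mm_crc32_u64(c, p[i + 1]);
            c = _mm_crc32_u64(c, p[i + 2]);
            c = _mm_crc32_u64(c, p[i + 3]);
        }
        for (; i < n; i++)             /* remainder */
            c = _mm_crc32_u64(c, p[i]);
        return (uint32_t)c;
    }

    int main(void)
    {
        uint64_t buf[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

        printf("%08x\n", crc32c_qwords(~0u, buf, 8) ^ ~0u);
        return 0;
    }
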
Dr. David Alan Gilbert
528888f33d crypto: x86/cast5 - Remove unused cast5_ctr_16way
commit e2d60e2f59 ("crypto: x86/cast5 - drop CTR mode implementation")
removed the calls to cast5_ctr_16way but left the AVX implementation.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-10 17:08:02 +08:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Eric Biggers
c299d7af9d crypto: x86/aesni - update docs for aesni-intel module
Update the kconfig help and module description to reflect that VAES
instructions are now used in some cases.  Also fix XTR => XCTR.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-09-06 14:50:45 +08:00
Fangrui Song
3363c460ef crypto: x86/sha256 - Add parentheses around macros' single arguments
The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an
unexpected/undocumented behavior of the GNU assembler, which might
change in the future
(https://sourceware.org/bugzilla/show_bug.cgi?id=32073).

    M (1) (2) // 1 arg !? Future: 2 args
    M 1 + 2   // 1 arg !? Future: 3 args

    M 1 2     // 2 args

Add parentheses around the single arguments to support future GNU
assembler and LLVM integrated assembler (when the IsOperator hack from
the following link is dropped).

Link: 055006475e
Signed-off-by: Fangrui Song <maskray@google.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-24 21:36:07 +08:00
Eric Biggers
001412493e crypto: x86/aes-gcm - fix PREEMPT_RT issue in gcm_crypt()
On PREEMPT_RT, kfree() takes sleeping locks and must not be called with
preemption disabled.  Therefore, on PREEMPT_RT skcipher_walk_done() must
not be called from within a kernel_fpu_{begin,end}() pair, even when
it's the last call which is guaranteed to not allocate memory.

Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end
of the function so that it goes after the kernel_fpu_end().  To make
this work cleanly, rework the data processing loop to handle only
non-last data segments.

Fixes: b06affb1cb ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/linux-crypto/20240802102333.itejxOsJ@linutronix.de
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-10 12:25:34 +08:00
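
The structure of the fix, with user-space stubs standing in for the
kernel primitives (every name below is a stand-in; only the ordering is
the point): the final walk-done step may kfree(), and on PREEMPT_RT
kfree() can sleep, so it must run after kernel_fpu_end():

    #include <stdio.h>

    static void kernel_fpu_begin_stub(void) { puts("fpu begin: no sleeping"); }
    static void kernel_fpu_end_stub(void)   { puts("fpu end"); }
    static void process_segment(int i)      { printf("segment %d\n", i); }
    static void final_walk_done(void)       { puts("done: may kfree/sleep"); }

    int main(void)
    {
        kernel_fpu_begin_stub();
        for (int i = 0; i < 3; i++)   /* only non-last handling in here */
            process_segment(i);
        kernel_fpu_end_stub();
        final_walk_done();            /* moved after kernel_fpu_end() */
        return 0;
    }
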
Eric Biggers
e6e758fa64 crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM
Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
  of three underlying implementations according to the CPU capabilities
  detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% of the
  size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless multiplication
  whereas the AVX2 code did not.  The "AVX2" code did not actually use
  AVX2; the check for AVX2 was really a check for Intel Haswell or later
  to detect support for fast carryless multiplication.  The long source
  length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless multiplication
  by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is made
  feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration
  and uses Karatsuba multiplication to save one pclmulqdq per block but
  does not use the multiplication-less reduction.  This strikes a good
  balance across the range of CPUs on which this code runs.

- Don't spam the kernel log with an informational message on every boot.

The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:47:58 +08:00
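
The Karatsuba trick referred to above: a 128x128-bit carryless multiply
normally needs 4 pclmulqdq ops (lo*lo, lo*hi, hi*lo, hi*hi); one
multiply of the XOR-ed halves replaces the two cross products, leaving
3.  A hedged user-space sketch cross-checked against the schoolbook
product (build with -mpclmul; not the kernel's GHASH assembly):

    #include <immintrin.h>
    #include <stdio.h>
    #include <string.h>

    /* 128x128 -> 256-bit carryless multiply, 3 pclmulqdq via Karatsuba */
    static void clmul_karatsuba(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
    {
        __m128i ll  = _mm_clmulepi64_si128(a, b, 0x00);        /* a0*b0 */
        __m128i hh  = _mm_clmulepi64_si128(a, b, 0x11);        /* a1*b1 */
        __m128i as  = _mm_xor_si128(a, _mm_srli_si128(a, 8));  /* a0^a1 */
        __m128i bs  = _mm_xor_si128(b, _mm_srli_si128(b, 8));  /* b0^b1 */
        __m128i mid = _mm_clmulepi64_si128(as, bs, 0x00); /* (a0^a1)*(b0^b1) */

        mid = _mm_xor_si128(mid, _mm_xor_si128(ll, hh));  /* = a0b1 ^ a1b0 */
        *lo = _mm_xor_si128(ll, _mm_slli_si128(mid, 8));
        *hi = _mm_xor_si128(hh, _mm_srli_si128(mid, 8));
    }

    int main(void)
    {
        __m128i a = _mm_set_epi64x(0x0123456789abcdefULL, 0x0f1e2d3c4b5a6978ULL);
        __m128i b = _mm_set_epi64x(0x1122334455667788ULL, 0x0102030405060708ULL);
        __m128i lo, hi;

        clmul_karatsuba(a, b, &lo, &hi);

        /* schoolbook: 4 pclmulqdq, for comparison */
        __m128i mid = _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x01),
                                    _mm_clmulepi64_si128(a, b, 0x10));
        __m128i lo2 = _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x00),
                                    _mm_slli_si128(mid, 8));
        __m128i hi2 = _mm_xor_si128(_mm_clmulepi64_si128(a, b, 0x11),
                                    _mm_srli_si128(mid, 8));

        printf("match: %d\n",
               !memcmp(&lo, &lo2, 16) && !memcmp(&hi, &hi2, 16));
        return 0;
    }
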
Eric Biggers
b06affb1cb crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10.  There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors.  This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code.  The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved.  Any remainder is handled
using a simple 1 vector at a time loop, with masking.

Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues.  So I ended up starting from scratch.  Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.

The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine.  I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.

I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark.  (This also holds true when doing
the same tests on AMD Zen 4.)  It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate.  The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:47:58 +08:00
Jeff Johnson
3aeb1da092 crypto: x86 - add missing MODULE_DESCRIPTION() macros
On x86, make allmodconfig && make W=1 C=1 warns:

WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/crc32-pclmul.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/curve25519-x86_64.o

Add the missing MODULE_DESCRIPTION() macro invocations.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07 19:46:39 +08:00
Tony Luck
adc5167be5 crypto: x86/poly1305 - Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31 17:12:30 +08:00
Tony Luck
b2d3d79780 crypto: x86/twofish - Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31 17:12:21 +08:00
Tony Luck
6d85a058cf crypto: x86/aes-xts - switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20240520224620.9480-2-tony.luck@intel.com
2024-05-22 11:10:48 +02:00
Eric Biggers
ed265f7fd9 crypto: x86/aes-gcm - simplify GCM hash subkey derivation
Remove a redundant expansion of the AES key, and use rodata for zeroes.
Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey()
because it's used for both versions of AES-GCM, not just RFC4106.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
a0bbb1c187 crypto: x86/aes-gcm - delete unused GCM assembly code
Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused.
Only the incremental AES-GCM functions (aesni_gcm_init(),
aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used.

This saves 17 KB of object code.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
6a80586474 crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()
Since the total length processed by the loop in xts_crypt_slowpath() is
a multiple of AES_BLOCK_SIZE, just round the length down to
AES_BLOCK_SIZE even on the last step.  This doesn't change behavior, as
the last step will process a multiple of AES_BLOCK_SIZE regardless.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26 17:26:10 +08:00
Eric Biggers
543ea178fb crypto: x86/aes-xts - optimize size of instructions operating on lengths
x86_64 has the "interesting" property that the instruction size is
generally a bit shorter for instructions that operate on the 32-bit (or
less) part of registers, or registers that are in the original set of 8.

This patch adjusts the AES-XTS code to take advantage of that property
by changing the LEN parameter from size_t to unsigned int (which is all
that's needed and is what the non-AVX implementation uses) and using the
%eax register for KEYLEN.

This decreases the size of aes-xts-avx-x86_64.o by 1.2%.

Note that changing the kmovq to kmovd was going to be needed anyway to
make the AVX10/256 code really work on CPUs that don't support 512-bit
vectors (since the AVX10 spec says that 64-bit opmask instructions will
only be supported on processors that support 512-bit vectors).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
e619723a85 crypto: x86/aes-xts - eliminate a few more instructions
- For conditionally subtracting 16 from LEN when decrypting a message
  whose length isn't a multiple of 16, use the cmovnz instruction.

- Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN.

- Remove an unnecessary test instruction.

This results in slightly shorter code, both source and binary.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
2717e01fc3 crypto: x86/aes-xts - handle AES-128 and AES-192 more efficiently
Decrease the amount of code specific to the different AES variants by
"right-aligning" the sequence of round keys, and for AES-128 and AES-192
just skipping irrelevant rounds at the beginning.

This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves
the efficiency of AES-128 and AES-192.  The tradeoff is that for AES-256
some additional not-taken conditional jumps are now executed.  But these
are predicted well and are cheap on x86.

Note that the ARMv8 CE based AES-XTS implementation uses a similar
strategy to handle the different AES variants.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
ea9459ef36 crypto: x86/aesni-xts - deduplicate aesni_xts_enc() and aesni_xts_dec()
Since aesni_xts_enc() and aesni_xts_dec() are very similar, generate
them from a macro that's passed an argument enc=1 or enc=0.  This
reduces the length of aesni-intel_asm.S by 112 lines while still
producing the exact same object file in both 32-bit and 64-bit mode.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
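
A C analogue of that deduplication (the commit does it with a GAS
.macro taking enc=1 or enc=0; the toy STEP bodies below are
placeholders):

    #include <stddef.h>
    #include <stdio.h>

    /* generate two near-identical functions from one parameterized macro */
    #define DEFINE_XTS_HALF(name, STEP)                    \
        static void name(unsigned char *buf, size_t n)     \
        {                                                  \
            for (size_t i = 0; i < n; i++)                 \
                buf[i] = STEP(buf[i]);                     \
        }

    #define ENC_STEP(x) ((unsigned char)((x) + 1))
    #define DEC_STEP(x) ((unsigned char)((x) - 1))

    DEFINE_XTS_HALF(xts_enc_sketch, ENC_STEP)
    DEFINE_XTS_HALF(xts_dec_sketch, DEC_STEP)

    int main(void)
    {
        unsigned char msg[4] = "abc";

        xts_enc_sketch(msg, 3);
        xts_dec_sketch(msg, 3);
        printf("%s\n", msg);   /* round-trips back to "abc" */
        return 0;
    }
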
Eric Biggers
1d27e1f5c8 crypto: x86/aes-xts - handle CTS encryption more efficiently
When encrypting a message whose length isn't a multiple of 16 bytes,
encrypt the last full block in the main loop.  This works because only
decryption uses the last two tweaks in reverse order, not encryption.

This improves the performance of decrypting messages whose length isn't
a multiple of the AES block length, shrinks the size of
aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test
and a not-taken conditional jump) when encrypting a message whose length
*is* a multiple of the AES block length.

While it's not super useful to optimize for ciphertext stealing given
that it's rarely needed in practice, the other two benefits mentioned
above make this optimization worthwhile.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:19 +08:00
Eric Biggers
7daba20cc7 crypto: x86/sha256-ni - simplify do_4rounds
Instead of loading the message words into both MSG and \m0 and then
adding the round constants to MSG, load the message words into \m0 and
the round constants into MSG and then add \m0 to MSG.  This shortens the
source code slightly.  It changes the instructions slightly, but it
doesn't affect binary code size and doesn't seem to affect performance.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
Eric Biggers
59e62b20ac crypto: x86/sha256-ni - optimize code size
- Load the SHA-256 round constants relative to a pointer that points
  into the middle of the constants rather than to the beginning.  Since
  x86 instructions use signed offsets, this decreases the instruction
  length required to access some of the later round constants.

- Use punpcklqdq or punpckhqdq instead of longer instructions such as
  pshufd, pblendw, and palignr.  This doesn't harm performance.

The end result is that sha256_ni_transform shrinks from 839 bytes to 791
bytes, with no loss in performance.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19 18:54:18 +08:00
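
The displacement trick in the first bullet: x86 signed 8-bit
displacements cover -128..+127 bytes, and the SHA-256 K table is
exactly 64 * 4 = 256 bytes, so a pointer biased to the table's midpoint
reaches every constant with a 1-byte offset.  A user-space sketch of
the addressing idea (constants elided; not the kernel's assembly):

    #include <stdio.h>

    static const unsigned int K256[64] = { 0x428a2f98, 0x71374491 /* ... */ };

    int main(void)
    {
        const unsigned int *k_mid = K256 + 32;  /* biased base pointer */

        /* byte offsets: k_mid[-32] -> -128, k_mid[31] -> +124, all disp8 */
        printf("%08x %08x\n", k_mid[-32], k_mid[31]);
        return 0;
    }
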