Fast and Clean: Auditable high-performance assembly via constraint solving
DOI:
https://doi.org/10.46586/tches.v2024.i1.87-132
Keywords:
Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64, Helium, Neon, Kyber, Dilithium, X25519, Fast Fourier Transform, FFT, Number Theoretic Transform, NTT, Software Pipelining, Google OR-Tools
Abstract
Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice.
In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.
We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
License
Copyright (c) 2023 Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein
This work is licensed under a Creative Commons Attribution 4.0 International License.