Discussion:
af_alg: Add ability to use Linux kernel crypto API on data in memory
Bruno Haible
2018-05-06 15:12:51 UTC
It is to be expected that crypto hardware would speed up not only
sha1_stream but also sha1_buffer (where the input is in memory).

The second step is to add to 'af_alg' a function that can be used by each of
md5_buffer, sha1_buffer, etc.


2018-05-06 Bruno Haible <***@clisp.org>

af_alg: Add ability to use Linux kernel crypto API on data in memory.
* lib/af_alg.h (afalg_buffer): New declaration.
* lib/af_alg.c (afalg_buffer): New function.

diff --git a/lib/af_alg.h b/lib/af_alg.h
index 45c2c12..018fa22 100644
--- a/lib/af_alg.h
+++ b/lib/af_alg.h
@@ -37,6 +37,30 @@ extern "C" {
 
 # if USE_LINUX_CRYPTO_API
 
+/* Compute a message digest of a memory region.
+
+   The memory region starts at BUFFER and is LEN bytes long.
+
+   ALG is the message digest algorithm; see the file /proc/crypto.
+
+   RESBLOCK points to a block of HASHLEN bytes, for the result.
+   HASHLEN must be the length of the message digest, in bytes, in particular:
+
+      alg    | hashlen
+      -------+--------
+      md5    |   16
+      sha1   |   20
+      sha224 |   28
+      sha256 |   32
+      sha384 |   48
+      sha512 |   64
+
+   If successful, fill RESBLOCK and return 0.
+   Upon failure, return a negated error number.  */
+int
+afalg_buffer (const char *buffer, size_t len, const char *alg,
+              void *resblock, ssize_t hashlen);
+
 /* Compute a message digest of the contents of a file.
 
    STREAM is an open file stream.  Regular files are handled more efficiently.
@@ -60,12 +84,21 @@ extern "C" {
    If successful, fill RESBLOCK and return 0.
    Upon failure, return a negated error number.  */
 int
-afalg_stream (FILE *stream, const char *alg, void *resblock, ssize_t hashlen);
+afalg_stream (FILE *stream, const char *alg,
+              void *resblock, ssize_t hashlen);
 
 # else
 
 static inline int
-afalg_stream (FILE *stream, const char *alg, void *resblock, ssize_t hashlen)
+afalg_buffer (const char *buffer, size_t len, const char *alg,
+              void *resblock, ssize_t hashlen)
+{
+  return -EAFNOSUPPORT;
+}
+
+static inline int
+afalg_stream (FILE *stream, const char *alg,
+              void *resblock, ssize_t hashlen)
 {
   return -EAFNOSUPPORT;
 }
diff --git a/lib/af_alg.c b/lib/af_alg.c
index 0319459..08d6659 100644
--- a/lib/af_alg.c
+++ b/lib/af_alg.c
@@ -37,7 +37,73 @@
 #define BLOCKSIZE 32768
 
 int
-afalg_stream (FILE *stream, const char *alg, void *resblock, ssize_t hashlen)
+afalg_buffer (const char *buffer, size_t len, const char *alg,
+              void *resblock, ssize_t hashlen)
+{
+  /* On Linux < 4.9, the value for an empty stream is wrong (all zeroes).
+     See <https://patchwork.kernel.org/patch/9434741/>.  */
+  if (len == 0)
+    return -EAFNOSUPPORT;
+
+  int cfd = socket (AF_ALG, SOCK_SEQPACKET, 0);
+  if (cfd < 0)
+    return -EAFNOSUPPORT;
+
+  int result;
+  struct sockaddr_alg salg = {
+    .salg_family = AF_ALG,
+    .salg_type = "hash",
+  };
+  /* Avoid calling both strcpy and strlen.  */
+  for (int i = 0; (salg.salg_name[i] = alg[i]); i++)
+    if (i == sizeof salg.salg_name - 1)
+      {
+        result = -EINVAL;
+        goto out_cfd;
+      }
+
+  int ret = bind (cfd, (struct sockaddr *) &salg, sizeof salg);
+  if (ret != 0)
+    {
+      result = -EAFNOSUPPORT;
+      goto out_cfd;
+    }
+
+  int ofd = accept (cfd, NULL, 0);
+  if (ofd < 0)
+    {
+      result = -EAFNOSUPPORT;
+      goto out_cfd;
+    }
+
+  do
+    {
+      ssize_t size = (len > BLOCKSIZE ? BLOCKSIZE : len);
+      if (send (ofd, buffer, size, MSG_MORE) != size)
+        {
+          result = -EIO;
+          goto out_ofd;
+        }
+      buffer += size;
+      len -= size;
+    }
+  while (len > 0);
+
+  if (read (ofd, resblock, hashlen) != hashlen)
+    result = -EIO;
+  else
+    result = 0;
+
+ out_ofd:
+  close (ofd);
+ out_cfd:
+  close (cfd);
+  return result;
+}
+
+int
+afalg_stream (FILE *stream, const char *alg,
+              void *resblock, ssize_t hashlen)
 {
   int cfd = socket (AF_ALG, SOCK_SEQPACKET, 0);
   if (cfd < 0)
Bruno Haible
2018-05-06 16:01:50 UTC
Now, here's a draft patch for adding support for AF_ALG also for the
sha1_buffer etc. functions.

But I have a problem here: On 4 different systems, I don't get a speedup
from this patch.

To benchmark it, I use this set of commands:

$ ./gnulib-tool --create-testdir --dir=testdir --single-configure --symlink crypto/md5 crypto/sha1 crypto/sha256 crypto/sha512
$ cd testdir
$ mkdir without; (cd without; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --without-linux-crypto; make && make check)
$ mkdir with; (cd with; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --with-linux-crypto; make && make check)

$ without/gltests/bench-md5 100 1000000
real 0.391257
user 0.388
sys 0.004
$ with/gltests/bench-md5 100 1000000
real 9.800789
user 1.088
sys 8.648
$ without/gltests/bench-md5 1000 100000
real 0.289286
user 0.288
sys 0.000
$ with/gltests/bench-md5 1000 100000
real 1.220016
user 0.104
sys 1.116
$ without/gltests/bench-md5 10000 10000
real 0.270131
user 0.268
sys 0.000
$ with/gltests/bench-md5 10000 10000
real 0.375399
user 0.020
sys 0.352
$ without/gltests/bench-md5 100000 1000
real 0.280091
user 0.276
sys 0.000
$ with/gltests/bench-md5 100000 1000
real 0.295650
user 0.000
sys 0.292
$ without/gltests/bench-md5 100000 1000
real 0.276514
user 0.276
sys 0.000
$ with/gltests/bench-md5 100000 1000
real 0.292350
user 0.000
sys 0.292
$ without/gltests/bench-md5 1000000 100
real 0.261845
user 0.260
sys 0.004
$ with/gltests/bench-md5 1000000 100
real 0.265650
user 0.000
sys 0.260
[and similarly for sha1 etc.]

Tested this on
- Intel Xeon X5450
- Intel Xeon E5-2603 v3
- Intel Core i7-2600
- Intel Core m3-6Y30
On all four, no speedup is visible.

On machines without crypto instructions or crypto devices, I would expect
that
- sha1_stream gets slightly faster with than without linux-crypto
(because the copy of data from the file to user-space is optimized away).
- sha1_buffer is slightly slower with than without linux-crypto
(because of the overhead of copying the data from user to kernel space).

Whereas on machines with crypto instructions or crypto devices, I would
expect a significant benefit for both functions.

You showed us significant benefits for sha1_stream, whereas I see no benefit
for sha1_buffer. How is this possible?

In <https://en.wikipedia.org/wiki/AES_instruction_set> I read that there are
specialized instructions for AES. Does it mean that there are NO specialized
instructions for MD5, SHA-1, SHA-224 ... SHA-512? In this case, all the work
we have done is futile for Intel CPUs and only beneficial for embedded CPUs??

Can you try this comparison on the Intel Xeon you have access to, please?

Bruno
Matteo Croce
2018-05-06 16:59:45 UTC
Post by Bruno Haible
Now, here's a draft patch for adding support for AF_ALG also for the
sha1_buffer etc. functions.
But I have a problem here: On 4 different systems, I don't get a speedup
from this patch.
[...]
Can you try this comparison on the Intel Xeon you have access to, please?
Bruno
Hi Bruno,

I've checked out the latest gnulib, and after double checking that commit
761523ddea70f0456b556c09868910686751fff5 was there, I ran this:

***@turbo:~/src/gnulib/testdir$ strace -e trace=%network
with/gltests/bench-md5 10000000 100
real 1.138617
user 1.139
sys 0.000
+++ exited with 0 +++
***@turbo:~/src/gnulib/testdir$ strace -e trace=%network
with/gltests/bench-sha1 10000000 100
real 1.259929
user 1.260
sys 0.000
+++ exited with 0 +++

It seems that the kernel API is not used in this test, or am I running
it the wrong way?
--
Matteo Croce
per aspera ad upstream
Bruno Haible
2018-05-06 19:00:32 UTC
Hi Matteo,
Post by Matteo Croce
I've checked out latest gnulib, and after double checking that commit
Please take commit 55efbb1178e045d52b0f52a2160f3d943c4f8a2c plus the patch
from https://lists.gnu.org/archive/html/bug-gnulib/2018-05/msg00035.html.
Then follow these instructions:

$ ./gnulib-tool --create-testdir --dir=testdir --single-configure --symlink crypto/md5 crypto/sha1 crypto/sha256 crypto/sha512
$ cd testdir
$ mkdir without; (cd without; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --without-linux-crypto; make && make check)
$ mkdir with; (cd with; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --with-linux-crypto; make && make check)

Bruno
Matteo Croce
2018-05-06 20:14:28 UTC
Post by Bruno Haible
Hi Matteo,
Post by Matteo Croce
I've checked out latest gnulib, and after double checking that commit
Please take commit 55efbb1178e045d52b0f52a2160f3d943c4f8a2c plus the patch
from https://lists.gnu.org/archive/html/bug-gnulib/2018-05/msg00035.html.
$ ./gnulib-tool --create-testdir --dir=testdir --single-configure --symlink crypto/md5 crypto/sha1 crypto/sha256 crypto/sha512
$ cd testdir
$ mkdir without; (cd without; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --without-linux-crypto; make && make check)
$ mkdir with; (cd with; ../configure CPPFLAGS=-Wall CFLAGS=-O2 --with-linux-crypto; make && make check)
Bruno
Hi Bruno,

I have 55efbb1178e045d52b0f52a2160f3d943c4f8a2c but the patch fails to apply.

$ wget -qO- https://lists.gnu.org/archive/html/bug-gnulib/2018-05/txtdxlUGSMYnt.txt | patch -p1
patching file lib/md5.c
Hunk #1 FAILED at 221.
1 out of 1 hunk FAILED -- saving rejects to file lib/md5.c.rej
patching file lib/sha1.c
Hunk #1 FAILED at 209.
1 out of 1 hunk FAILED -- saving rejects to file lib/sha1.c.rej
patching file lib/sha256.c
Hunk #1 FAILED at 273.
Hunk #2 FAILED at 295.
2 out of 2 hunks FAILED -- saving rejects to file lib/sha256.c.rej
patching file lib/sha512.c
Hunk #1 FAILED at 281.
Hunk #2 FAILED at 303.
2 out of 2 hunks FAILED -- saving rejects to file lib/sha512.c.rej

BTW, the instructions you are referring to are for AES. For SHA1 and
other hashes, an ASM implementation of the algorithm using SSSE3 or
AVX2 is compiled into the kernel.
FYI: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/crypto/sha1_avx2_x86_64_asm.S
--
Matteo Croce
per aspera ad upstream
Bruno Haible
2018-05-06 23:15:00 UTC
Post by Matteo Croce
I have 55efbb1178e045d52b0f52a2160f3d943c4f8a2c but the patch fails to apply.
Oops, I goofed with "git diff". Here's the correct patch to test.

Bruno
Paul Eggert
2018-05-07 02:07:15 UTC
Post by Bruno Haible
Oops, I goofed with "git diff". Here's the correct patch to test.
I tried those bench-md5 benchmarks on two platforms, with somewhat more
disappointing results.

I observed a real-time slowdown ranging from 11% (large buffers) to 22x (small
buffers) on Intel Xeon E3-1225 V2 (circa 2012 CPU), Ubuntu 16.04, Linux 4.4.0,
glibc 2.23. See attached file ubuntu1604.txt.

I observed a real-time slowdown ranging from 8% (large buffers) to 43x (small
buffers) on AMD Phenom II X4 910e (circa 2010 CPU), Fedora 28, Linux 4.16.5,
glibc 2.27. See attached file fedora28.txt.

These numbers compare somewhat unfavorably with your report, where the real-time
slowdown ranged from 1.5% (large buffers) to 25x (small buffers), as reported in
<https://lists.gnu.org/r/bug-gnulib/2018-05/msg00035.html>.
Matteo Croce
2018-05-07 09:55:29 UTC
Post by Paul Eggert
Post by Bruno Haible
Oops, I goofed with "git diff". Here's the correct patch to test.
I tried those bench-md5 benchmarks on two platforms, with somewhat more
disappointing results.
I observed a real-time slowdown ranging from 11% (large buffers) to 22x
(small buffers) on Intel Xeon E3-1225 V2 (circa 2012 CPU), Ubuntu 16.04,
Linux 4.4.0, glibc 2.23. See attached file ubuntu1604.txt.
I observed a real-time slowdown ranging from 8% (large buffers) to 43x
(small buffers) on AMD Phenom II X4 910e (circa 2010 CPU), Fedora 28, Linux
4.16.5, glibc 2.27. See attached file fedora28.txt.
These numbers compare somewhat unfavorably with your report, where the
real-time slowdown ranged from 1.5% (large buffers) to 25x (small buffers),
as reported in <https://lists.gnu.org/r/bug-gnulib/2018-05/msg00035.html>.
Hi all,

I tried all of the above, and I can confirm the disappointing results
with md5 or with small buffers.
This is what happens on my machine, a Lenovo Laptop with Intel(R)
Core(TM) i7-6820HQ CPU @ 2.70GHz running Fedora 27

With large buffers, all the algos except md5 are faster:

$ without/gltests/bench-md5 1000000000 1
real 1.520719
user 1.520
sys 0.000
$ with/gltests/bench-md5 1000000000 1
real 1.684162
user 0.000
sys 1.684

$ without/gltests/bench-sha1 1000000000 1
real 1.696258
user 1.696
sys 0.000
$ with/gltests/bench-sha1 1000000000 1
real 1.072500
user 0.000
sys 1.072

$ without/gltests/bench-sha256 1000000000 1
real 4.467676
user 4.468
sys 0.000
$ with/gltests/bench-sha256 1000000000 1
real 2.527936
user 0.009
sys 2.519

$ without/gltests/bench-sha512 1000000000 1
real 2.684985
user 2.685
sys 0.000
$ with/gltests/bench-sha256 1000000000 1
real 2.546133
user 0.004
sys 2.542


While for sha1, af_alg becomes faster with buffers > 100k:

$ without/gltests/bench-sha1 100 1000000
real 0.292869
user 0.293
sys 0.000
$ with/gltests/bench-sha1 100 1000000
real 9.153545
user 0.698
sys 8.421

$ without/gltests/bench-sha1 1000 100000
real 0.190652
user 0.191
sys 0.000
$ with/gltests/bench-sha1 1000 100000
real 1.033346
user 0.071
sys 0.963

$ without/gltests/bench-sha1 10000 10000
real 0.183897
user 0.184
sys 0.000
$ with/gltests/bench-sha1 10000 10000
real 0.214090
user 0.003
sys 0.212

$ without/gltests/bench-sha1 100000 1000
real 0.181184
user 0.181
sys 0.000
$ with/gltests/bench-sha1 100000 1000
real 0.131482
user 0.002
sys 0.130

$ without/gltests/bench-sha1 1000000 100
real 0.178751
user 0.179
sys 0.000
$ with/gltests/bench-sha1 1000000 100
real 0.122498
user 0.000


sha256, instead, becomes faster with af_alg with buffers > 10k:

$ without/gltests/bench-sha256 100 1000000
real 0.617181
user 0.617
sys 0.000
$ with/gltests/bench-sha256 100 1000000
real 9.655386
user 0.703
sys 8.950

$ without/gltests/bench-sha256 1000 100000
real 0.470694
user 0.471
sys 0.000
$ with/gltests/bench-sha256 1000 100000
real 1.203199
user 0.091
sys 1.112

$ without/gltests/bench-sha256 10000 10000
real 0.459542
user 0.460
sys 0.000
$ with/gltests/bench-sha256 10000 10000
real 0.360933
user 0.003
sys 0.358

$ without/gltests/bench-sha256 100000 1000
real 0.454326
user 0.454
sys 0.000
$ with/gltests/bench-sha256 100000 1000
real 0.279998
user 0.000
sys 0.280

$ without/gltests/bench-sha256 1000000 100
real 0.451635
user 0.452
sys 0.000
$ with/gltests/bench-sha256 1000000 100
real 0.266343
user 0.001
sys 0.265

$ without/gltests/bench-sha256 10000000 10
real 0.443723
user 0.444
sys 0.000
$ with/gltests/bench-sha256 10000000 10
real 0.260270
user 0.000
sys 0.260

Keep in mind that I have the infamous patch to mitigate the Intel CPU
bug, which adds a big overhead to syscalls but will hopefully
disappear on future CPUs:

$ dmesg |grep isolation
[ 0.000000] Kernel/User page tables isolation: enabled
--
Matteo Croce
per aspera ad upstream
Matteo Croce
2018-05-07 12:10:19 UTC
Post by Matteo Croce
I tried all of the above, and I can confirm the disappointing results
with md5 or with small buffers.
[...]
I did some tests; it seems that a big part of the overhead is the
creation and binding of the kernel socket:

$ strace -r -e trace=%network,%desc with/gltests/bench-sha1 100 1
0.000785 socket(AF_ALG, SOCK_SEQPACKET, 0) = 3
0.000101 bind(3, {sa_family=AF_ALG, sa_data="hash\0... sha1\0...
"}, 88) = 0
0.000086 accept(3, NULL, NULL) = 4
0.000065 sendto(4, "\0\2\3\5\7\n\f\17\22\26\33
&-5>IUbp\201\223\246\274\323\355\t'Gj\217\267"..., 100, MSG_MORE,
NULL, 0) = 100
0.000117 read(4,
"v\3770\230\10\374\322\25\26\340\253Y\266\257D\266\30&G\354", 20) = 20

I changed the code to allocate the socket only once and then reuse it,
to see if it makes a difference.
Obviously this works only if you always use the same algo and a single
thread; it's just an experiment.

current code

$ without/gltests/bench-sha1 100 1000000
real 0.292869
user 0.293
sys 0.000
$ with/gltests/bench-sha1 100 1000000
real 9.153545
user 0.698
sys 8.421

one time alg

$ with/gltests/bench-sha1 100 1000000
real 1.365084
user 0.178
sys 1.187

An idea is to keep a cache of FDs, one per algo, and initialize them
only once per algo.
--
Matteo Croce
per aspera ad upstream
Matteo Croce
2018-05-12 12:03:12 UTC
This echoes back to Bruno's suggestion of runtime parsing of
/proc/crypto, which adds more complexity.
That's very difficult, as entries appear in /proc only after a module
is loaded.
For example if this change is endorsed by Redhat, and enabled by default
just for sha*sum in the next major version release, it will provide lots
of supporting evidence that it works well, and might be enabled globally
by default.
FYI I'm doing this as a hobbyist, as I have a home server running Debian 9.

Cheers,
--
Matteo Croce
per aspera ad upstream