Discussion:
bug#26193: [0-9] versus [[:digit:]]
Paul Eggert
2017-03-22 02:09:41 UTC
Using what is to me the more obvious [0-9] pattern takes almost 50 times as
long as using the [[:digit:]] pattern. Seems very strange.
Thanks for reporting that. In general, patterns like [a-z] can be much slower
than [[:lower:]] due to poorly-thought-out POSIX interfaces. However, [0-9] is a
special case: we can optimize such patterns safely if both ends are ASCII
digits. I installed the attached patch to Gnulib to do that; it fixes the
performance glitch you noticed, at least for me.
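A minimal C sketch of the idea behind that special case (not the actual Gnulib patch, and the helper name is invented for illustration): POSIX guarantees that the digits 0-9 are contiguous and encoded the same way in every supported charset, so a bracket range whose endpoints are both ASCII digits can be compiled down to a plain byte test even in a multibyte locale, whereas a range like a-z carries no such guarantee and has to take the slower locale-aware path.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helper: may the range lo-hi be matched byte-wise,
       bypassing the multibyte/collation machinery?  That is only known
       to be safe when both endpoints are ASCII digits.  */
    static bool
    range_is_ascii_digit_range (unsigned char lo, unsigned char hi)
    {
      return '0' <= lo && lo <= hi && hi <= '9';
    }

    int
    main (void)
    {
      printf ("[0-9]: %s\n",
              range_is_ascii_digit_range ('0', '9') ? "byte-wise" : "locale-aware");
      printf ("[a-z]: %s\n",
              range_is_ascii_digit_range ('a', 'z') ? "byte-wise" : "locale-aware");
      return 0;
    }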
Jim Meyering
2017-03-22 04:28:05 UTC
Thank you, Paul. I confirmed that that solves it for me, too, with a
multibyte locale. I didn't reproduce it initially because I was using
LC_ALL=C.
John P. Linderman
2017-03-22 12:44:23 UTC
Thanks, all. That puts the runtimes on equal footing:

+ wc conjectures
125441818 125441818 6249180939 conjectures
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.85 real 5.14 user 0.70 sys 0 pf 118 pr 0 sw 0 rb 0 wb 1 vcx 11 icx 2420 mx 0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[[:digit:]] conjectures
A[21]=11{11}:22<LP3
5.77 real 5.10 user 0.67 sys 0 pf 121 pr 0 sw 0 rb 0 wb 1 vcx 7 icx 2492 mx 0 ix 0 id 0 is
+ rusage /home/jpl/src/grep-3.0/src/grep P[0-9] conjectures
A[21]=11{11}:22<LP3
5.80 real 5.15 user 0.62 sys 0 pf 119 pr 0 sw 0 rb 0 wb 1 vcx 1001 icx 2424 mx 0 ix 0 id 0 is
Paul Eggert
2017-03-22 18:01:43 UTC
In my measurements, P[0-9] is still a tiny bit slower if one is using
glibc regex, due to a performance problem in glibc. You can work around
it by configuring --with-included-regex. It's probably not worth
worrying about, though.

By the way, using LC_ALL=C should help avoid performance problems like
these in the future, if all you're doing is something where single-byte
pattern matching suffices.
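One rough way to see both effects from C, going through the system regex interface rather than grep itself (a sketch only: the en_US.UTF-8 locale name, the subject string, and the repetition count are arbitrary choices, the locale must actually be installed, and the exact numbers will depend on the glibc version):

    #include <locale.h>
    #include <regex.h>
    #include <stdio.h>
    #include <time.h>

    /* Compile PATTERN in the current locale and time REPS matches
       against SUBJECT; returns elapsed seconds, or -1 on error.  */
    static double
    time_matches (const char *pattern, const char *subject, long reps)
    {
      regex_t re;
      struct timespec t0, t1;
      if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1.0;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < reps; i++)
        regexec (&re, subject, 0, NULL, 0);
      clock_gettime (CLOCK_MONOTONIC, &t1);
      regfree (&re);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int
    main (void)
    {
      const char *subject = "no digits here until 42 shows up";
      const long reps = 200000;

      /* Multibyte locale: [0-9] historically hit the slow path here.  */
      if (setlocale (LC_ALL, "en_US.UTF-8"))
        {
          printf ("UTF-8 [0-9]       %.3fs\n", time_matches ("[0-9]", subject, reps));
          printf ("UTF-8 [[:digit:]] %.3fs\n", time_matches ("[[:digit:]]", subject, reps));
        }

      /* C locale: single-byte matching, so both spellings are cheap.  */
      setlocale (LC_ALL, "C");
      printf ("C     [0-9]       %.3fs\n", time_matches ("[0-9]", subject, reps));
      printf ("C     [[:digit:]] %.3fs\n", time_matches ("[[:digit:]]", subject, reps));
      return 0;
    }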
John P. Linderman
2017-03-22 21:58:39 UTC
I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
dealing with Unicode. I tried a couple of LC values aimed at Unicode and the
US, but something always went pear-shaped, and I finally gave up. I am
perfectly happy to give up a tiny bit of performance to have most things
work without thinking. But a factor of 6, or 35, is not tiny, since I use grep
and friends intensely; that's how I discovered the performance problem to
begin with. Anyway, thank you for fixing my problem. I suspect that many of
us pioneers (using UNIX since 1973) have '[0-9]' wired into our fingers.
Jim Meyering
2017-03-23 01:57:05 UTC
I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:
