[bug #48055] Regex ranges and locales in gnu-awk regextype
James Youngman
2016-11-27 17:15:25 UTC
Findutils uses the regular expression implementation from gnulib. So this
problem likely also exists there, or perhaps has already been fixed there.
<http://savannah.gnu.org/bugs/?48055>
Summary: Regex ranges and locales in gnu-awk regextype
Project: findutils
Submitted by: piotrjurkiewicz
Submitted on: Mon 30 May 2016 08:12:40 AM CEST
Category: find
Severity: 3 - Normal
Item Group: Wrong result
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Release: 4.6.0
Fixed Release: None
_______________________________________________________
Starting with gawk 4.0, the traditional behaviour of regex ranges has been
brought back. This means that [a-z] matches only lowercase letters and [A-Z]
matches only uppercase letters, regardless of the locale and collation
settings.
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
$ echo ABC | LC_COLLATE=pl_PL.utf8 gawk '$0 ~ /^[a-b]/' # gawk pre-4.0
ABC
$ echo ABC | LC_COLLATE=pl_PL.utf8 gawk '$0 ~ /^[a-b]/' # gawk 4.0+
[nothing]
Findutils, however, still emulates the old behaviour of gawk in gnu-awk
mode. That is, in certain locales, the [a-z] and [A-Z] ranges each match
both lowercase and uppercase letters.
mkdir test
cd test
touch a.lower
touch b.UPPER
LC_COLLATE=pl_PL.utf8 find -regextype gnu-awk -regex '.*[a-z]{5}$'
LC_COLLATE=pl_PL.utf8 find -regextype gnu-awk -regex '.*[A-Z]{5}$'
./a.lower
./b.UPPER
Each command prints both files, instead of just the one file with the
appropriate case.
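As a possible workaround until the range behaviour is settled, the sketch below reruns the same test either with the C locale forced through LC_ALL, where ranges follow byte order, or with POSIX character classes in place of ranges. It assumes GNU find and mktemp are available. The scratch directory and the variable names are illustrative only and are not part of the original report.

```shell
#!/bin/sh
# Sketch only: assumes GNU find (for -regextype) and mktemp.
# The scratch directory is hypothetical and is removed on exit.
dir=$(mktemp -d) || exit 1
trap 'rm -rf "$dir"' EXIT
cd "$dir" || exit 1
touch a.lower b.UPPER

# Variant 1: force the C locale, so [a-z] and [A-Z] follow byte order.
lower=$(LC_ALL=C find . -regextype gnu-awk -regex '.*[a-z]{5}$')
upper=$(LC_ALL=C find . -regextype gnu-awk -regex '.*[A-Z]{5}$')

# Variant 2: keep the current locale, but use POSIX character classes
# instead of ranges.
lower_class=$(find . -regextype gnu-awk -regex '.*[[:lower:]]{5}$')
upper_class=$(find . -regextype gnu-awk -regex '.*[[:upper:]]{5}$')

printf 'range, C locale: %s / %s\n' "$lower" "$upper"
printf 'character class: %s / %s\n' "$lower_class" "$upper_class"
```

If ranges and classes behave as intended, each of the four commands should list exactly one file. A line that shows both files for a single pattern would indicate the behaviour described in this report.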
Paul Eggert
2016-11-28 00:56:21 UTC
James Youngman wrote:
> Findutils uses the regular expression implementation from gnulib. So this
> problem likely also exists there, or perhaps has already been fixed there.
I can't seem to reproduce the problem on Fedora 24, so perhaps it's been fixed
already.

$ ls
a.lower b.UPPER
$ LC_COLLATE=pl_PL.utf8 find -regextype gnu-awk -regex '.*[a-z]{5}$'
./a.lower
$ LC_COLLATE=pl_PL.utf8 find -regextype gnu-awk -regex '.*[A-Z]{5}$'
./b.UPPER
$ find --version | head -n1
find (GNU findutils) 4.6.0
