Discussion:
bug#24975: Matching issues with characters whose encoding ends in some other character
Jim Meyering
2016-11-27 23:59:05 UTC
On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas wrote:
$ locale charmap
GB18030
$ printf '\uC9\n' | grep '.*7' | hd
00000000 81 30 87 37 0a |.0.7.|
00000005
U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
[...]
Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
[...]
Same behaviour with 2.26 on Solaris 11.
Thank you for the report.
$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
5
I confirmed that the problem does not arise (i.e., no match, with an exit
status of 1) when we force the use of glibc's regex matcher by adding a
back-reference to the pattern:
$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
1
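
The difference between the two results above can be reproduced outside grep. The following is a minimal Python sketch, not part of the original thread, that assumes only Python's built-in gb18030 codec. It encodes U+00C9, then runs the same pattern once over the raw bytes and once over the decoded character; the variable names are illustrative.

```python
import re

# U+00C9 encoded in GB18030; the report's hex dump shows 81 30 87 37.
encoded = "\u00c9".encode("gb18030")
print(" ".join(f"{b:02x}" for b in encoded))

# Byte-wise view: the final byte 0x37 is also ASCII '7', so a matcher
# that ignores character boundaries finds a "7" on this line.
byte_match = re.search(rb".*7", encoded) is not None

# Character-wise view: the line holds the single character U+00C9 and no '7'.
char_match = re.search(r".*7", encoded.decode("gb18030")) is not None

print(byte_match, char_match)
```

The byte-wise search corresponds to the spurious match and the character-wise search to the expected no-match result.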
This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed the latent bug by making it so this case was no
longer handled by glibc's regexp matcher, but rather by grep's dfa.c.
I've fixed this by forcing any non-UTF8 multibyte locale to use the regex
matcher rather than the DFA matcher, with the following.
The gnulib/dfa patch makes that change, and the grep change updates to
latest gnulib, adds tests and NEWS.

I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.
Norihiro Tanaka
2016-11-28 14:47:57 UTC
Post by Jim Meyering
I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.
Hi Jim,

This is a bug in dfa's handling of the period expression in non-UTF8
locales. dfa calculates the transition for single-byte characters and the
transition for a multibyte character separately, then merges both results.
However, if the transition for single-byte characters goes back to an
initial state, we should stop matching single-byte characters.
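
As an illustration only, here is a small Python sketch of the invariant such a fix has to preserve: a byte that lies inside a multibyte character must not be treated as a single-byte character. This is not the dfa.c change. The helper names (gb18030_char_len, char_starts, has_char_7) are hypothetical, and the length rule is a simplification that assumes well-formed GB18030 input.

```python
def gb18030_char_len(data: bytes, i: int) -> int:
    """Length in bytes of the GB18030 character starting at offset i.

    Simplified rule, assuming well-formed input.
    """
    if data[i] < 0x80:
        return 1  # single-byte (ASCII) character
    if i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
        return 4  # four-byte form: second byte is in 0x30..0x39
    return 2      # two-byte form


def char_starts(data: bytes) -> set:
    """Offsets at which a character begins."""
    starts, i = set(), 0
    while i < len(data):
        starts.add(i)
        i += gb18030_char_len(data, i)
    return starts


def has_char_7(data: bytes) -> bool:
    """True only if byte 0x37 occurs as a whole character."""
    starts = char_starts(data)
    return any(b == 0x37 and i in starts for i, b in enumerate(data))


line = bytes([0x81, 0x30, 0x87, 0x37, 0x0A])  # U+00C9 followed by newline
print(0x37 in line)         # naive byte search
print(has_char_7(line))     # boundary-aware search
print(has_char_7(b"a7\n"))  # a real '7' character
```

The naive search sees the trailing 0x37 byte, while the boundary-aware search rejects it because offset 3 is not the start of a character.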

Thanks,
Norihiro
Paul Eggert
2016-11-28 16:48:29 UTC
Thanks for that DFA fix, which should be much better than the previous
workaround. I installed it into gnulib and installed the attached patch
into grep.