[Bug 1915738] [NEW] egrep: U+D56D (항) breaks ^/$ matching
Cyle Riggs
1915738 at bugs.launchpad.net
Mon Feb 15 16:44:32 UTC 2021
Public bug reported:
In theory the regular expression ^.*$ should match any and every string,
including empty strings, but this specific Korean character U+D56D (항),
which I was unlucky enough to have one of my scripts come across, breaks
the expected behavior in egrep:
$ echo '' | egrep '^.*$'; echo $?
0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1
Have I lost my mind...or should I go buy a lottery ticket? Here are some
rambling one-liners to illustrate the behavior further.
# An attempt to match the pattern ^.*$ (beginning of string, anything, end of string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1
# As you can see here a match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0
# Also using the -P flag from grep instead of -E correctly matches the original pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0
# Sending a different Korean character (U+C720) to the same original pattern works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0
# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1
# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1
# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0
# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0
# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1
# But again dropping U+D56D (항) from the input string returns egrep to the expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0
# And to make it very clear what the input is, here I'm using python to give a raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(sys.stdin.read().encode("unicode-escape")))'
b'\\uc720\\ud56d\\uc720\\n'
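The unicode-escape dump above shows code points; grep itself sees the encoded bytes. A minimal sketch of a byte-level dump, assuming a UTF-8 terminal and a standard `od` (the helper name `utf8_hex` is hypothetical and not part of the original report):

```shell
# Hypothetical helper: print the bytes of a string as lowercase hex.
utf8_hex() {
    # od -An drops the offset column, -tx1 prints one hex byte per field;
    # tr strips the spaces and newlines that od adds between fields.
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_hex '항'   # U+D56D, encodes to ed95ad in UTF-8
utf8_hex '유'   # U+C720, encodes to ec9ca0 in UTF-8
```

This only confirms what is being piped into egrep; it does not by itself explain why one of the two characters fails to match.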
# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
# My bash version
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
===========================
If somebody could explain this behavior I would appreciate it. If it
could be fixed, even better. In the meantime I think I will prefer 'grep
-P' over 'egrep' when I expect strings to contain Korean text. In this
contrived example the '^' and '$' didn't make a lot of sense, but I
thought it would be best to provide the simplest possible reproduction
case rather than spell out my full use case.
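For anyone trying to reproduce this, the comparisons above can be collected into one loop. This is only a sketch: the helper name `try_match` is hypothetical, `grep -E` is used as the equivalent of `egrep`, and the locale names `C` and `en_US.UTF-8` are assumptions (the report does not state which locale was active, and the UTF-8 locale must be installed for that row to mean anything).

```shell
# Hypothetical helper: report whether one input line matches a pattern
# under a given grep mode and locale.
try_match() {
    # $1 = grep mode flag (-E or -P), $2 = locale name, $3 = pattern, $4 = input
    if printf '%s\n' "$4" | LC_ALL="$2" grep "$1" -q -- "$3" 2>/dev/null; then
        echo match
    else
        # Also reached when grep exits with an error, e.g. a build without -P.
        echo no-match
    fi
}

for input in 'foo' '유' '항' '유항유'; do
    for loc in C en_US.UTF-8; do
        printf '%s\t%s\t-E: %s\t-P: %s\n' "$input" "$loc" \
            "$(try_match -E "$loc" '^.*$' "$input")" \
            "$(try_match -P "$loc" '^.*$' "$input")"
    done
done
```

Running it on both an affected and an unaffected system should show whether the failure depends on the locale, the matcher (-E versus -P), or both.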
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair wl
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)
** Affects: grep (Ubuntu)
Importance: Undecided
Status: New
** Tags: amd64 apport-bug focal
https://bugs.launchpad.net/bugs/1915738