[Bug 91175] Re: cut gets confused with UTF-8 characters
ar barzh paour
j_p_b at orange.fr
Thu Nov 29 08:02:17 UTC 2012
The command
echo "tañva"|cut -c1-4
outputs
tañ
instead of tañv
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=
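For illustration only: the truncated result is what you would get if the range 1-4 were applied to bytes rather than characters, since ñ occupies two bytes in UTF-8. The sketch below is not part of the original report. The helper names (word, byte_count, first_four_bytes) are hypothetical, and the word is spelled with octal escapes so the bytes do not depend on the terminal encoding.

```shell
#!/bin/sh
# "tañva" written with octal escapes: ñ is the two-byte UTF-8
# sequence 0xC3 0xB1 (octal 303 261). Hypothetical helper names.
word() { printf 'ta\303\261va\n'; }

# Number of bytes in the word, newline excluded: 5 characters, 6 bytes.
byte_count() { word | tr -d '\n' | wc -c | tr -d ' '; }

# Explicitly select bytes 1-4: t, a, and the two bytes of ñ.
first_four_bytes() { word | cut -b1-4; }

byte_count          # prints 6
first_four_bytes    # prints tañ, the same output reported for cut -c1-4
```

If cut -c1-4 and cut -b1-4 print the same thing on a given system, that suggests the -c option is counting bytes there.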
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to coreutils in Ubuntu.
https://bugs.launchpad.net/bugs/91175
Title:
cut gets confused with UTF-8 characters
Status in “coreutils” package in Ubuntu:
Triaged
Bug description:
Binary package hint: coreutils
GNU cut gets confused about character boundaries with UTF-8 encoded
files.
An example, as they (almost) say, is worth a thousand words:
nslater at hinata: ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
nslater at hinata: ~ $ cat foo.txt
She said “I think I found a bug.”
nslater at hinata: ~ $ cat foo.txt | cut --characters 10-
“I think I found a bug.”
nslater at hinata: ~ $ cat foo.txt | cut --characters 11-
��I think I found a bug.”
nslater at hinata: ~ $ cat foo.txt | cut --characters 12-
�I think I found a bug.”
nslater at hinata: ~ $ cat foo.txt | cut --characters 13-
I think I found a bug.”
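As a rough check of the transcript above, the offsets line up with byte positions: "She said " is 9 bytes, and the opening curly quote is a three-byte UTF-8 sequence occupying bytes 10 to 12. Starting at 11 or 12 would therefore begin in the middle of that sequence, and 13 is the first byte after it. The sketch below assumes nothing beyond cut -b; the function names (line, from_byte_10, from_byte_13) are hypothetical and the quotes are written as octal escapes.

```shell
#!/bin/sh
# The sample line from foo.txt, with the curly quotes as octal escapes:
# opening quote is 0xE2 0x80 0x9C (342 200 234),
# closing quote is 0xE2 0x80 0x9D (342 200 235).
line() { printf 'She said \342\200\234I think I found a bug.\342\200\235\n'; }

# Byte 10 is the first byte of the opening quote, so the quote stays intact.
from_byte_10() { line | cut -b10-; }

# Byte 13 is the first byte after the three-byte opening quote.
from_byte_13() { line | cut -b13-; }

from_byte_10    # prints “I think I found a bug.”
from_byte_13    # prints I think I found a bug.”
```

These byte-based results match the outputs shown for --characters 10- and --characters 13-, which is consistent with the character option being treated as a byte option.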
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/91175/+subscriptions