file globbing is now case insensitive?
Colin Watson
cjwatson at ubuntu.com
Sun Apr 2 23:42:29 UTC 2017
On Sun, Apr 02, 2017 at 07:09:45PM +0200, Ralf Mardorf wrote:
> On Sun, 2 Apr 2017 17:29:35 +0100, Colin Watson wrote:
> >Or just export LC_COLLATE=C.UTF-8.
>
> You don't necessarily need to export it.
Certainly, although most people who ask this question want it to work
for subprocesses as well as just the current shell.
> However, this works for Ubuntu...
>
> [root@moonstudio tmp]# grep PRETTY /etc/os-release
> PRETTY_NAME="Ubuntu 16.04.2 LTS"
> [root@moonstudio tmp]# locale -a | grep C
> C
> C.UTF-8
> [root@moonstudio tmp]# ls
> A b
> [root@moonstudio tmp]# LC_COLLATE=C echo [A-Z]*
> A
> [root@moonstudio tmp]# LC_COLLATE=C.UTF-8 echo [A-Z]*
> A
>
> ...but isn't portable. On Arch Linux...
The above constructions aren't so much unportable as simply wrong,
though! (I think perhaps they seemed to work for you because your
initial locale was set up a little differently on Ubuntu.) The reason
is a bit subtle. Shell parsing is far more complex than is good for
anyone, and so I'll simplify a lot for the sake of brevity; you can find
a more complete but also much longer description in your shell's manual
page. For instance, the "SIMPLE COMMAND EXPANSION" section in bash(1)
covers this.
A command line such as this:

  LC_COLLATE=C echo [A-Z]*

... is processed by the shell first by splitting it into words.
Variable assignments are *not* processed immediately, but are saved for
later. The other words undergo a series of expansions. The last
expansion performed is pathname expansion, which turns [A-Z]* into a
sorted list of filenames matching that pattern, with both the matching
and the sorting being locale-dependent [1]. *After* this has been done,
the right-hand side of each variable assignment itself undergoes various
expansions and the result is assigned to the variable.
[1] There are even further subtleties. Whether any of this is
locale-dependent itself depends on the shell. For example, I don't
think dash(1)'s pathname expansion behaves this way.
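The ordering described above can be seen without depending on any locale, because it applies to every kind of word expansion and not only to pathname expansion. The following is a minimal sketch using ordinary parameter expansion; the variable name "greeting" is hypothetical and chosen only for illustration.

```shell
# Hypothetical variable; any shell variable shows the same ordering.
greeting=hello

# The word "$greeting" is expanded first, while the assignment is still
# saved for later, so echo receives the old value.
prefix_result=$(greeting=goodbye echo "$greeting")

# Here the assignment is a separate command that completes before the
# echo command line is expanded, so echo receives the new value.
# (Command substitution runs in a subshell, much like parentheses do.)
subshell_result=$(greeting=goodbye; echo "$greeting")

echo "prefix form:   $prefix_result"
echo "subshell form: $subshell_result"
```

The prefix form prints "hello" and the subshell form prints "goodbye", which is the same effect that bites when the variable is LC_COLLATE and the word is a glob pattern.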
Your example therefore does not work because LC_COLLATE is not set to C
until after [A-Z]* has been expanded. If you're trying to construct
something that sets this for just a single command, then a correct
version would be:

  (LC_COLLATE=C; echo [A-Z]*)
(Of course using a subshell may have other effects. Again, it's often
simpler just to export LC_COLLATE early on and be done with it.)
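As a rough illustration of the subshell approach and of one of its "other effects", the sketch below captures the glob result from a subshell and checks that the parent shell's LC_COLLATE is left alone. The scratch directory and the variable names are assumptions made for the example. It also clears LC_ALL inside the subshell, since LC_ALL, when set, takes precedence over LC_COLLATE.

```shell
# Scratch directory holding the same two files as in the quoted example.
workdir=$(mktemp -d)
touch "$workdir/A" "$workdir/b"

collate_before=${LC_COLLATE-unset}

# Command substitution runs in a subshell, so the cd, the unset and the
# LC_COLLATE assignment all disappear when it finishes.
matches=$(
    cd "$workdir" || exit 1
    unset LC_ALL        # LC_ALL would otherwise override LC_COLLATE
    LC_COLLATE=C        # a separate command, so it takes effect ...
    echo [A-Z]*         # ... before this pattern is expanded
)

collate_after=${LC_COLLATE-unset}

echo "matches: $matches"
echo "LC_COLLATE in the parent shell: $collate_before -> $collate_after"

rm -rf "$workdir"
```

The flip side is that anything else assigned inside the subshell is lost as well, which is why the result is passed back through the command substitution's output here.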
When it comes to C.UTF-8, *that* indeed is unportable. It's become more
widespread, and it's an obvious useful extension of the C locale to the
full range of UTF-8 codepoints, so I hope eventually it'll make it into
glibc upstream and be available pretty much everywhere, but in the
meantime it's true that we have to put up with it not necessarily being
available. If you have systems where this is relevant then your best
bet is probably to go for LC_COLLATE=C and accept that odd things may
happen if you have file names containing non-ASCII characters.
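If a script has to run on systems with and without C.UTF-8, one possible approach is to probe for it and fall back to C. This is only a sketch: the helper name pick_collate_locale is hypothetical, and it assumes that `locale -a` is available and may spell the locale either "C.UTF-8" or "C.utf8".

```shell
# Hypothetical helper: print C.UTF-8 if the system lists it, else plain C.
# Some systems spell the name "C.utf8" in `locale -a`, hence the loose match.
pick_collate_locale() {
    if locale -a 2>/dev/null | grep -qiE '^C\.utf-?8$'; then
        echo C.UTF-8
    else
        echo C
    fi
}

# Set and export early, so that subprocesses see it as well.
LC_COLLATE=$(pick_collate_locale)
export LC_COLLATE
echo "using LC_COLLATE=$LC_COLLATE"
```

If the `locale` command is missing altogether, the pipeline fails and the helper falls back to C.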
Another thing you can do instead of all this, if you can rely on bash
4.3 or above, is:

  shopt -s globasciiranges
This doesn't make [A-Z] match other upper-case letters such as Á (you
really do need something like [[:upper:]] for that), but if you happen
to only care about ASCII then it may be good enough.
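A small sketch of the difference between the two patterns follows, assuming a scratch directory with files named A, b and Á. The guard around shopt is an addition for the example, so that the script is harmless in shells that are not bash or in a bash older than 4.3.

```shell
# shopt is a bash builtin and globasciiranges needs bash 4.3 or later;
# elsewhere this step is skipped or fails quietly.
if [ -n "${BASH_VERSION-}" ]; then
    shopt -s globasciiranges 2>/dev/null || true
fi

workdir=$(mktemp -d)
# \303\201 is the UTF-8 encoding of Á, written as octal escapes so the
# script itself stays plain ASCII.
accented=$(printf '\303\201')
touch "$workdir/A" "$workdir/b" "$workdir/$accented"

# With globasciiranges, [A-Z] covers only the ASCII letters A to Z.
ascii_matches=$(cd "$workdir" && echo [A-Z]*)
# Whether [[:upper:]] also picks up the accented name depends on the
# current locale's character classification.
upper_matches=$(cd "$workdir" && echo [[:upper:]]*)

echo "[A-Z]*        -> $ascii_matches"
echo "[[:upper:]]*  -> $upper_matches"

rm -rf "$workdir"
```

With the option in effect the first pattern matches only A. What the second pattern matches is left to the locale, so no particular output is claimed for it here.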
--
Colin Watson [cjwatson at ubuntu.com]