[Bug 75695] Re: huge performance hit for -i with UTF-8 locales
Peter Cordes
peter at cordes.ca
Sun Jun 23 00:10:10 UTC 2013
The performance hit of -i hasn't changed with 12.04 LTS. Will have to
check with a newer grep, I guess. Seeing e.g. 25 secs to grep -i on the
.c/.h files in a Linux source tree, 0.5 secs to grep without -i. 1.3
secs for a LANG=C grep -i. No disk I/O, files are cached.
So that's roughly a factor of 20 slowdown (25 secs vs. 1.3 secs) for
en_CA.utf8 vs. POSIX case-insensitive grepping.
Ubuntu 12.04 does set LANG=en_CA.utf8, and /usr/lib/locale now just
contains locale-archive. So I'm not seeing any system calls trying to
open nonexistent files like ahendry was.
Again, haven't yet tried with the most recent Ubuntu. This should be
trivially easy for most people to test, as it doesn't require grep to
actually match anything. (I still used the volatile s3tc pattern from
my original report when searching the Linux tree). You just need a new
version of grep, and locale support for a utf8 English locale (e.g.
en_US.utf8).
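To check the locale prerequisite, something like the following sketch could pick an installed English UTF-8 locale out of the `locale -a` listing. The helper name first_utf8_locale is made up for illustration, and it assumes glibc-style names such as en_US.utf8 or en_US.UTF-8.

```shell
# Hypothetical helper: read locale names on stdin (one per line, as printed
# by `locale -a`) and print the first English UTF-8 one, if any.
first_utf8_locale() {
    # Accept both spellings of the codeset: "utf8" and "UTF-8".
    grep -i -E '^en_.*\.utf-?8$' | head -n 1
}

# Example:
#   locale -a | first_utf8_locale
```

If it prints nothing, no English UTF-8 locale is installed and the slow case can't be reproduced on that machine as-is.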
Just run these three commands:
time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc'
time find -name '*.[ch]' | xargs grep 'volatile.*s3tc'
time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'
If the LANG=C version isn't much faster than the grep -i with your
default locale (and/or LANG=en_US.utf8 if your default for some reason
isn't slow), then the problem is fixed and grep has fast case-
insensitive utf8 matching.
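For anyone who wants the three timings as plain numbers, here is a rough sketch of a wrapper. The function name time_grep is made up for illustration. It assumes GNU date (for %N) and a find/xargs that support -print0/-0, and it sets LC_ALL rather than LANG so that an LC_ALL already present in the environment can't mask the override.

```shell
# Hypothetical helper: print wall-clock seconds for one grep pass over the
# .c/.h files under DIR.
# usage: time_grep LOCALE DIR [grep options...]
# An empty LOCALE leaves LC_ALL empty, which is treated the same as unset,
# so the run uses whatever LANG/LC_* the environment already provides.
time_grep() (
    loc=$1; dir=$2
    shift 2
    start=$(date +%s.%N)
    # grep exits non-zero when nothing matches; that doesn't matter for timing.
    find "$dir" -name '*.[ch]' -print0 |
        LC_ALL=$loc xargs -0 grep "$@" 'volatile.*s3tc' >/dev/null || true
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
)

# Example (run from the top of a source tree):
#   time_grep "" . -i    # default locale, case-insensitive
#   time_grep "" .       # default locale, case-sensitive
#   time_grep C  . -i    # C locale, case-insensitive
```

As with the commands above, run each one twice and look at the second result so the files are cached.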
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grep in Ubuntu.
https://bugs.launchpad.net/bugs/75695
Title:
huge performance hit for -i with UTF-8 locales
Status in grep:
Unknown
Status in “grep” package in Ubuntu:
Incomplete
Bug description:
On a source tree with 28MB of .c and .h files (Mesa), grep is slow with -i and fast without it with the default Ubuntu locale settings (LANG=en_US.UTF-8, no LC_ variables set). Actually, even some [Vv] style patterns are much faster with LANG=C, so this is even more like
https://bugs.launchpad.net/distros/ubuntu/+source/grep/+bug/47634
My box is a core 2 duo (2.4GHz), which makes a beast like gnome feel
almost as snappy as fluxbox :) Everything is in the disk cache, so
I/O isn't a factor. Neither is memory bandwidth. The machine was
otherwise idle. I'm running AMD64 Edgy.
peter at tesla:/usr/local/src/g965/mesa$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
... (all the same)
(times are measured for the second run in a row, so the CPU core it runs on is at full clock speed the whole time.)
time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc'
real 0m3.498s; user 0m3.483s; sys 0m0.023s
time find -name '*.[ch]' | xargs grep 'volatile.*s3tc'
real 0m0.076s; user 0m0.050s; sys 0m0.023s
Non-UTF-8 locales are just as fast with -i as without it:
time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'
real 0m0.083s; user 0m0.067s; sys 0m0.020s
time find -name '*.[ch]' | LANG=en_CA xargs grep -i 'volatile.*s3tc'
real 0m0.079s; user 0m0.050s; sys 0m0.027s
Making the pattern case-insensitive by hand with bracket expressions takes more time, but is not really slow. However, it probably doesn't match everything that grep -i would on input that isn't all 7-bit ASCII:
time find -name '*.[ch]' | xargs grep '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
real 0m0.340s; user 0m0.313s; sys 0m0.027s
The hand-written pattern is affected by locale settings, too:
time find -name '*.[ch]' | LANG=C xargs grep '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
real 0m0.096s; user 0m0.080s; sys 0m0.027s
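Typing the [Vv]-style pattern by hand is error-prone, so here is a sketch of a generator for it. The function name ci_pattern is made up for illustration. It only handles ASCII letters and doesn't know about regex syntax, so a pattern that already contains bracket expressions or backslash escapes would come out wrong.

```shell
# Hypothetical helper: expand every ASCII letter in a pattern into a
# two-character bracket expression, e.g. "v" -> "[Vv]". Everything else
# (digits, ".", "*") is copied through unchanged.
ci_pattern() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            u = toupper(c)
            l = tolower(c)
            if (u != l)
                out = out "[" u l "]"
            else
                out = out c
        }
        print out
    }'
}

# Example:
#   find -name '*.[ch]' | LANG=C xargs grep "$(ci_pattern 'volatile.*s3tc')"
```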
To manage notifications about this bug go to:
https://bugs.launchpad.net/grep/+bug/75695/+subscriptions