[Bug 75695] Re: huge performance hit for -i with UTF-8 locales

Peter Cordes peter at cordes.ca
Sun Jun 23 00:10:10 UTC 2013


The performance hit of -i hasn't changed with 12.04 LTS.  I'll have to
check with a newer grep, I guess.  I'm seeing e.g. 25 secs for grep -i
on the .c/.h files in a Linux source tree, vs. 0.5 secs for grep without
-i and 1.3 secs for a LANG=C grep -i.  No disk I/O; the files are cached.

  So that's a slowdown of about a factor of 20 for en_CA.utf8 vs. POSIX
case-insensitive grepping.

 Ubuntu 12.04 does set LANG=en_CA.utf8, and /usr/lib/locale now just
contains locale-archive.  So I'm not seeing any system calls trying to
open nonexistent files like ahendry was.

 Again, I haven't yet tried with the most recent Ubuntu.  This should be
trivially easy for most people to test, as it doesn't require grep to
actually match anything.  (I still used the volatile s3tc pattern from
my original report when searching the Linux tree.)  You just need a new
version of grep, and locale support for a utf8 English locale (e.g.
en_US.utf8).
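
 A quick way to check both prerequisites might look like the sketch
below.  The helper name is_utf8_locale is made up for illustration;
`locale charmap` and `grep --version` are standard commands, and the
locale name is only the example mentioned above.

```shell
# Hypothetical helper: succeed if the named locale resolves to a UTF-8 charmap.
# If the locale isn't installed, glibc typically falls back to the C locale,
# whose charmap is not UTF-8, so the check fails.
is_utf8_locale() {
    [ "$(LC_ALL=$1 locale charmap 2>/dev/null)" = "UTF-8" ]
}

# Usage (left as comments so nothing runs on its own):
#   grep --version | head -n 1
#   is_utf8_locale en_US.utf8 && echo "UTF-8 locale available"
```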

 Just run these 3 commands from the top of a source tree:
time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc'
time find -name '*.[ch]' | xargs grep 'volatile.*s3tc'
time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'

 If the LANG=C version isn't much faster than grep -i with your
default locale (and/or with LANG=en_US.utf8, if your default for some
reason isn't slow), then the problem is fixed and grep has fast
case-insensitive utf8 matching.
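
 For anyone who would rather have comparable numbers than eyeball three
`time` outputs, here is a minimal sketch of a timing helper.  The name
elapsed_ms is hypothetical, and it assumes GNU date (for %N) plus
find -print0 / xargs -0, as found on Ubuntu.  It sets LC_ALL rather than
LANG because LC_ALL, when present in the environment, takes precedence
over LANG, so LANG=C alone would have no effect there.

```shell
# elapsed_ms LOCALE GREP_ARGS...
# Reads NUL-separated file names on stdin, runs grep on them under LOCALE,
# discards the matches, and prints the wall-clock time in milliseconds.
# Assumes GNU date; %N (nanoseconds) is not available in every date(1).
elapsed_ms() {
    loc=$1; shift
    start=$(date +%s%N)
    LC_ALL=$loc xargs -0 grep "$@" >/dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Usage (left as comments so nothing runs on its own):
#   find . -name '*.[ch]' -print0 | elapsed_ms en_US.utf8 -i 'volatile.*s3tc'
#   find . -name '*.[ch]' -print0 | elapsed_ms C          -i 'volatile.*s3tc'
```

 Run each line twice and keep the second number, so the files are in
the page cache for every measurement.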

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grep in Ubuntu.
https://bugs.launchpad.net/bugs/75695

Title:
  huge performance hit for -i with UTF-8 locales

Status in grep:
  Unknown
Status in “grep” package in Ubuntu:
  Incomplete

Bug description:
  On a source tree with 28MB of .c and .h files (Mesa), grep is slow with -i and fast without it under the default Ubuntu locale settings (LANG=en_US.UTF-8, no LC_ variables set).  Actually, even some [Vv]-style patterns are much faster with LANG=C, so this looks even more like
  https://bugs.launchpad.net/distros/ubuntu/+source/grep/+bug/47634

   My box is a Core 2 Duo (2.4GHz), which makes a beast like GNOME feel
  almost as snappy as fluxbox :)  Everything is in the disk cache, so
  I/O isn't a factor.  Neither is memory bandwidth.  The machine was
  otherwise idle.  I'm running AMD64 Edgy.

  peter at tesla:/usr/local/src/g965/mesa$ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  ... (all the same)

  (Times are measured for the second run in a row, so the CPU core it runs on is at full clock speed the whole time.)
  time find -name '*.[ch]' | xargs grep -i 'volatile.*s3tc'
   real    0m3.498s; user    0m3.483s; sys     0m0.023s

  time find -name '*.[ch]' | xargs grep  'volatile.*s3tc'
   real    0m0.076s; user    0m0.050s; sys     0m0.023s

  
  With non-UTF-8 locales, grep -i is just as fast as grep without -i:
  time find -name '*.[ch]' | LANG=C xargs grep -i 'volatile.*s3tc'
   real    0m0.083s; user    0m0.067s; sys     0m0.020s

  time find -name '*.[ch]' | LANG=en_CA xargs grep -i 'volatile.*s3tc'
   real    0m0.079s; user    0m0.050s; sys     0m0.027s

  
   Writing the case-insensitive pattern out by hand takes more time, but is not really slow.  However, it probably doesn't match everything that grep -i would on input that isn't all 7-bit ASCII:
   time find -name '*.[ch]' | xargs grep  '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
  real    0m0.340s; user    0m0.313s; sys     0m0.027s

  The hand-written pattern is affected by locale settings, too:
  time find -name '*.[ch]' | LANG=C xargs grep  '[Vv][Oo][Ll][Aa][Tt][Ii][Ll][Ee].*[Ss]3[Tt][Cc]'
  real    0m0.096s; user    0m0.080s; sys     0m0.027s
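
  The bracketed pattern used above can be generated mechanically rather
  than typed by hand.  The sketch below is one way to do it; ci_pattern
  is a hypothetical helper name.  It only expands ASCII letters, so the
  caveat about non-ASCII input still applies, and it makes no attempt to
  handle letters that already sit inside a bracket expression or follow
  a backslash.

```shell
# ci_pattern REGEX
# Prints REGEX with every ASCII letter x replaced by [Xx].
# awk runs under LC_ALL=C so that [A-Za-z] means exactly the ASCII letters.
ci_pattern() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[A-Za-z]/)
                out = out "[" toupper(c) tolower(c) "]"
            else
                out = out c
        }
        print out
    }'
}

# Usage (left as a comment so nothing runs on its own):
#   find . -name '*.[ch]' | xargs grep "$(ci_pattern 'volatile.*s3tc')"
```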




