[Bug 1648641] [NEW] COLLATE "en_US.UTF-8" sorting takes 30x longer on newer builds
nicholas wilson
nwilson5 at gmail.com
Thu Dec 8 22:10:24 UTC 2016
Public bug reported:
(Do humor my lack of full understanding of these packages.)
I was having issues sorting with COLLATE "en_US.UTF-8" on Ubuntu 16.04
and was told it was related to glibc.
On Ubuntu 14.04 (with eglibc 2.19) I could sort a file of 2 million
lines of international text (< 40 chars per line) in 20 seconds. On 16.04
(with glibc 2.23) sorting the same file with the same COLLATE took 10+
minutes. My only theory is that glibc 2.22 added a new Unicode 7.0
library (?), but I really don't have a real grasp of what's going on here.
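For anyone comparing machines, a minimal sketch for confirming which C library version a system is running. glibc_version is a hypothetical helper name; getconf GNU_LIBC_VERSION is only answered on glibc-based systems, so the fallback message is an assumption about what is useful to print elsewhere.

```shell
# glibc_version: hypothetical helper that prints the C library version.
# GNU_LIBC_VERSION is a glibc-specific getconf variable; on other libcs
# the call fails and the fallback text is printed instead.
glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null \
        || echo "unknown (not glibc, or getconf unavailable)"
}

# Example: run this on both the 14.04 and the 16.04 machine.
# glibc_version
```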
I came upon this issue when trying to index a database table of over
400M rows. What should have taken 4 hours ran for over 24 hours and
never finished. I created a subset of that table to test sorting.
I'm not sure how to replicate it easily; I tried creating subsets to
show the issue, without success. Instead, I put 5000 lines into a
pastebin that you can try sorting yourself on 14.04 vs 16.04:
http://pastebin.com/r47uD690
If you put that into a file and run the following, you can see the
discrepancy between 14.04 and 16.04:
LC_COLLATE="en_US.UTF-8" sort /path/to/file > /dev/null
LC_COLLATE="C" has no problems (it should be way faster anyway, but
the difference between 14.04 and 16.04 is not noticeable).
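The comparison between the two collations can be scripted roughly as follows. This is a sketch, not a benchmark harness: time_sort is a hypothetical helper name, date +%s gives only one-second resolution, and LC_ALL is blanked on the assumption that it may be set in the caller's environment (when set and non-empty it overrides LC_COLLATE).

```shell
# time_sort: hypothetical helper that times one sort run.
#   $1 = collation to use (e.g. en_US.UTF-8 or C)
#   $2 = file to sort
time_sort() {
    collation=$1
    file=$2
    start=$(date +%s)
    # An empty LC_ALL is ignored, so LC_COLLATE actually takes effect.
    LC_ALL= LC_COLLATE="$collation" sort "$file" > /dev/null
    end=$(date +%s)
    echo "$collation: $((end - start)) s"
}

# Example, using the placeholder path from the report:
# time_sort en_US.UTF-8 /path/to/file
# time_sort C /path/to/file
```

Running both calls on 14.04 and on 16.04 gives four numbers to compare; only the en_US.UTF-8 figure is expected to differ much between releases, going by the report.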
On a fresh 14.04 build it takes < 1 second. On 16.04 it takes 8+
seconds. It's a small example, but the gap appeared to get even worse
the larger the file (e.g. the earlier example of 20 seconds vs 10 minutes).
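To see how the time grows with input size, one option is to sort increasingly long prefixes of the same file. This is a sketch under assumptions: scale_test is a hypothetical helper name, the line counts are whatever the caller passes in, and, since subsets did not reliably reproduce the slowdown, short prefixes may not be representative.

```shell
# scale_test: hypothetical helper that sorts the first N lines of a file
# under en_US.UTF-8 collation, for each N given.
#   $1   = file to read
#   $2.. = line counts to test
scale_test() {
    file=$1
    shift
    for n in "$@"; do
        start=$(date +%s)
        # An empty LC_ALL is ignored, so LC_COLLATE takes effect.
        head -n "$n" "$file" | LC_ALL= LC_COLLATE=en_US.UTF-8 sort > /dev/null
        end=$(date +%s)
        echo "$n lines: $((end - start)) s"
    done
}

# Example, using the placeholder path from the report:
# scale_test /path/to/file 1000 2000 5000
```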
That's about all the info I have at the moment. If you need more
information, throw me a question. I am not very technically familiar
with many of the packages involved. I'm only posting here because I was
directed to glibc as a potential cause of the sorting slowdown under
different COLLATE settings.
** Affects: glibc (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1648641