[Bug 1961697] Re: Transaction ID collisions cause slow DNS lookups in getaddrinfo
KJ Tsanaktsidis
1961697 at bugs.launchpad.net
Fri Mar 11 00:42:40 UTC 2022
It's definitely non-deterministic, unfortunately. I do have a reliable
reproduction for Bionic and Focal I can trigger on my laptop, but it's a
huge pile of proprietary Ruby code that just happens to hit all the
right timings on my machine. I can validate a -proposed package if you
need though.
The reproduction instructions basically boil down to "Have IPv6, call
getaddrinfo(), and if you're unlucky, it will take > 5 seconds and make
four DNS queries instead of two".
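
As a rough illustration of that recipe, here is a minimal sketch that
calls getaddrinfo() repeatedly and counts the lookups that take longer
than a threshold. It is not the proprietary reproduction mentioned
above. The helper names (timed_getaddrinfo, count_slow_lookups) and the
default attempt count are hypothetical, and because the bug is
timing-dependent a run with zero slow lookups proves nothing.

```python
import socket
import time


def timed_getaddrinfo(host, port=80, resolver=socket.getaddrinfo,
                      clock=time.monotonic):
    """Resolve host once; return (elapsed_seconds, results).

    With the default family (AF_UNSPEC) the resolver asks for both
    IPv4 and IPv6 addresses, which is the case described in this bug.
    """
    start = clock()
    results = resolver(host, port, type=socket.SOCK_STREAM)
    return clock() - start, results


def count_slow_lookups(host, attempts=100, threshold=5.0,
                       resolver=socket.getaddrinfo, clock=time.monotonic):
    """Count how many of `attempts` lookups took at least `threshold` seconds."""
    slow = 0
    for _ in range(attempts):
        elapsed, _results = timed_getaddrinfo(host, resolver=resolver,
                                              clock=clock)
        if elapsed >= threshold:
            slow += 1
    return slow
```

Calling count_slow_lookups("example.com") inside an affected container
would report how many lookups hit the 5-second stall. The resolver and
clock parameters exist only so the helpers can be exercised without a
network.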
The upstream glibc patch also includes a test case that could be
applied:
https://sourceware.org/git/?p=glibc.git;a=blob;f=resolv/tst-resolv-txnid-collision.c;h=611d37362f3e5e89b92766f0790459340cc071b3;hb=2dfa659a66f20facc4082207884c20e986ddecee
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to glibc in Ubuntu.
https://bugs.launchpad.net/bugs/1961697
Title:
Transaction ID collisions cause slow DNS lookups in getaddrinfo
Status in GLibC:
Fix Released
Status in glibc package in Ubuntu:
Confirmed
Status in glibc source package in Focal:
New
Bug description:
[impact]
When resolving DNS names with getaddrinfo(), I have seen it hang for 5 seconds, then retry and succeed. The issue is that glibc issues both an A and an AAAA query on the same socket, and in some circumstances they can be sent with the same DNS transaction ID as well.
[test case]
TBD
[regression potential]
TBD.
[original description]
I verified this with a packet capture; in the packet capture, I saw the A and AAAA queries for a name be made with the same DNS transaction ID, get responses, do nothing for five seconds, and then send the same DNS query again. On the glibc side, I confirmed that it's blocked waiting for the DNS response by interrupting it with gdb, even though the packet capture shows the response has well and truly arrived. I've attached a packet capture & a backtrace of the glibc hang.
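
To make the thing to look for in such a capture concrete, here is a small sketch that reads the transaction ID and question type out of raw DNS query payloads and flags an A/AAAA pair for the same name that shares an ID. The function names are hypothetical, and the parser only handles simple single-question queries without name compression. It illustrates the standard DNS wire format and does not reproduce glibc's internal response-matching logic.

```python
import struct

QTYPE_A = 1      # IPv4 address record
QTYPE_AAAA = 28  # IPv6 address record


def build_query(txid, name, qtype):
    """Build a minimal single-question DNS query (recursion desired, class IN)."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_query(packet):
    """Return (transaction_id, name, qtype) for a single-question query."""
    txid = struct.unpack_from("!H", packet, 0)[0]
    offset = 12  # the fixed DNS header is 12 bytes long
    labels = []
    while packet[offset] != 0:
        length = packet[offset]
        labels.append(packet[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    offset += 1  # skip the terminating zero-length label
    qtype, _qclass = struct.unpack_from("!HH", packet, offset)
    return txid, ".".join(labels), qtype


def has_txid_collision(packets):
    """True if two queries for the same name share an ID but differ in type."""
    seen = {}
    for packet in packets:
        txid, name, qtype = parse_query(packet)
        previous = seen.setdefault((txid, name), qtype)
        if previous != qtype:
            return True
    return False
```

For example, has_txid_collision([build_query(0x1234, "example.com", QTYPE_A), build_query(0x1234, "example.com", QTYPE_AAAA)]) reports a collision, while the same pair with distinct IDs does not.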
I believe this is the same issue reported in these places:
* In RHEL: https://bugzilla.redhat.com/show_bug.cgi?id=1904153
* Also RHEL: https://bugzilla.redhat.com/show_bug.cgi?id=1903880
* Upstream: https://sourceware.org/bugzilla/show_bug.cgi?id=26600
The environment I noticed this bug in was:
* Docker for Mac on an arm64 M1 MacBook
* Docker for Mac Linux kernel version is 5.10.76-linuxkit
* Linux is also arm64, not emulated
* Container with the buggy DNS environment is Ubuntu bionic (also arm64, not emulated)
* Glibc 2.27-3ubuntu1.4
However, one of the Red Hat reporters noticed this issue on m6-series
EC2 instances in AWS.
A patch has been provided upstream for this issue:
https://sourceware.org/pipermail/libc-alpha/2020-September/117547.html
I applied the upstream patch to glibc 2.27-3ubuntu1.4 and rebuilt the
package, and the problem went away. I've attached the exact patch I
applied, since I had to work through some conflicts.
So, I think that patch just needs to be backported to Bionic and (I
think) Focal as well. Is that reasonable?
Thanks!
To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1961697/+subscriptions