[Bug 605773] [NEW] Wrong kernel setting zone_reclaim_mode leads to performance problems

Andras Fabian 605773 at bugs.launchpad.net
Thu Jul 15 07:27:15 UTC 2010


Public bug reported:

Binary package hint: linux-image-server

--------------------------------------------------
Description:    Ubuntu 10.04 LTS
Release:        10.04
--------------------------------------------------
linux-image-server version:
  Installed: 2.6.32.22.23
--------------------------------------------------

I discovered this problem while migrating a PostgreSQL database server
from old hardware and an old OS to new hardware and a new OS. The
transition itself went smoothly, but once the server was in production
we noticed a strange problem during the nightly backups: their runtime
went up from 2 1/2 hours to 6 1/2 hours, even though the new hardware
was designed to be much more powerful (which did show in most other
tasks!).

After a longer investigation, and with the help of many knowledgeable
people on the PostgreSQL mailing list, I finally found the reason for
this slowdown. It turned out to be a problem in the VM part of the
kernel! In situations where a lot of memory was consumed for caching
(which easily happens while backing up 100 GByte DBs), congestion
occurred in the VM, which slowed down the process dramatically.

After in-depth analysis of many parts (via the /proc file system, ps,
strace etc.) and a comparison with the settings on the old machines, I
finally found the kernel setting, vm.zone_reclaim_mode, that was solely
responsible for the issue. Luckily I could construct a simple test
scenario (COPY-to-STDOUT: exporting the data from a database table via
stdout and writing it through a pipe to the file system) in which I
could reproduce the issue. Our new server had zone_reclaim_mode = 1 set,
whereas our old servers used zone_reclaim_mode = 0. By switching (via
sysctl) this value back and forth, I could easily slow the experimental
export process to a crawl, or let it run at full speed again.
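
For anyone who wants to check the setting on their own machine, here is
a minimal Python sketch (not part of the original analysis). It assumes
the value is exposed at /proc/sys/vm/zone_reclaim_mode and uses the bit
meanings given in the kernel's sysctl documentation; the function names
are made up for illustration.

```python
from pathlib import Path
from typing import List

# Assumed location of the setting on a Linux system.
DEFAULT_PATH = "/proc/sys/vm/zone_reclaim_mode"

# Bit meanings as listed in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value: int) -> List[str]:
    """Return the names of the flags that are set in the given value."""
    return [name for bit, name in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path: str = DEFAULT_PATH) -> int:
    """Read the current value; raises OSError if the file is missing."""
    return int(Path(path).read_text().strip())


def write_zone_reclaim_mode(value: int, path: str = DEFAULT_PATH) -> None:
    """Write a new value; on the real /proc file this needs root."""
    Path(path).write_text(f"{value}\n")
```

Calling write_zone_reclaim_mode(0) as root should have the same effect
as "sysctl -w vm.zone_reclaim_mode=0"; neither survives a reboot.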

The complete path of the analysis can be viewed on the PostgreSQL mailing list
(there is also a description of how the problem can be reproduced and what its many symptoms are):
http://archives.postgresql.org/pgsql-general/2010-07/msg00267.php

The conclusion to use "zone_reclaim_mode = 0" on our type of hardware was further strengthened by a very interesting thread on LKML, where the kernel developers discussed potential issues with this setting. You can read it here:
http://lkml.org/lkml/2009/5/12/586

That discussion boils down to the fact that, for reasons described
there in detail, the Linux kernel treats some modern CPU architectures
(our new servers use Core i7 generation CPUs, which are explicitly
mentioned!) as NUMA architectures. And for NUMA architectures it
automatically enables "zone_reclaim_mode = 1", even though that is
wrong here and not recommended under many circumstances. Interestingly,
most posters in the LKML thread think that it would be better to
always(!) default this value to "zone_reclaim_mode = 0" instead of
relying on an automatic decision.
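
A rough way to see whether the kernel treats a machine as NUMA is to
look at the nodes it exposes in sysfs. The sketch below is only an
illustration: it assumes the usual /sys/devices/system/node/nodeN layout
with a per-node "distance" file, and the function names are
hypothetical.

```python
import re
from pathlib import Path

# Assumed sysfs directory where the kernel lists its NUMA nodes.
DEFAULT_SYSFS_DIR = "/sys/devices/system/node"


def parse_distance_line(text):
    """Turn the whitespace-separated contents of a distance file into ints."""
    return [int(field) for field in text.split()]


def numa_node_ids(sysfs_dir=DEFAULT_SYSFS_DIR):
    """Return the sorted node numbers found, or [] if the directory is absent."""
    base = Path(sysfs_dir)
    if not base.is_dir():
        return []
    ids = []
    for entry in base.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


def numa_distances(sysfs_dir=DEFAULT_SYSFS_DIR):
    """Map each node number to its list of distances to all nodes."""
    base = Path(sysfs_dir)
    distances = {}
    for node_id in numa_node_ids(sysfs_dir):
        distance_file = base / f"node{node_id}" / "distance"
        if distance_file.is_file():
            distances[node_id] = parse_distance_line(distance_file.read_text())
    return distances
```

A machine that lists more than one node here is being treated as NUMA by
the kernel, which is the situation in which the automatic setting
described above comes into play.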

Some more detail on what zone_reclaim_mode does can also be found here:
http://www.linuxinsight.com/proc_sys_vm_zone_reclaim_mode.html

Now, I don't know why this "defaulting to 0" is still not in the
mainline kernels. That discussion from May 2009 on LKML died down, and
apparently no one felt responsible for committing the patches (even
though one of the participants had obviously already prepared some!).
BUT I would ask the Ubuntu team to maybe act on their own and provide a
fix for this issue in Ubuntu 10.04 LTS (because some reports on the net
suggest that "zone_reclaim_mode = 1" can harm performance in many
ways)! And I believe that I will not be the only PostgreSQL admin
affected by this issue!

** Affects: linux-meta (Ubuntu)
     Importance: Undecided
         Status: New

-- 
Wrong kernel setting zone_reclaim_mode leads to performance problems
https://bugs.launchpad.net/bugs/605773
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to linux-meta in ubuntu.



