[SRU][B/E/F][PATCH 1/1] hugetlbfs: take read_lock on i_mmap for PMD sharing

Gerald Yang gerald.yang at canonical.com
Mon Jun 22 15:51:35 UTC 2020


Hi Colin,

The steps to simulate the issue are as follows:
1. Add a kprobe to huge_pmd_share, with its offset set between the point
where the write lock on the mapping is taken
and the call to vma_interval_tree_foreach:
==================================================
i_mmap_lock_write(mapping);
*kprobe*
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
==================================================
and add a 2 ms delay in the kprobe handler to simulate a long search time
(a hedged sketch of such a module follows these steps)

2. Use a BPF tool to insert a kprobe before the above kprobe, and record
the timestamp when a process reaches this point

3. Use libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs.git) to
create 4 processes that trigger the VMA interval tree scan simultaneously

This lets us observe how much time each process takes to search the VMA
interval tree.
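
Below is a minimal sketch of the delay kprobe module from step 1. It is
not the exact module used for this test: the symbol name is real, but the
0x60 offset is a hypothetical placeholder; the real offset between
i_mmap_lock_write() and vma_interval_tree_foreach() must be read from the
disassembly of huge_pmd_share on the target kernel.
==================================================
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/delay.h>

/*
 * Handler fires just after i_mmap_lock_write() in huge_pmd_share and
 * busy-waits ~2 ms to simulate a long VMA interval tree search.
 */
static int delay_pre(struct kprobe *p, struct pt_regs *regs)
{
	mdelay(2);
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "huge_pmd_share",
	.offset		= 0x60,	/* hypothetical, take from disassembly */
	.pre_handler	= delay_pre,
};

static int __init delay_probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit delay_probe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(delay_probe_init);
module_exit(delay_probe_exit);
MODULE_LICENSE("GPL");
==================================================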

Without this patch, each process needs to take the write lock before
searching the tree, so only one process can search at a time and each
search takes around 2 ms to complete. The following is the output of the
bcc tool:

root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
kernel function: huge_pmd_share
  2 233713.406179438 11499  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  3 233715.392836926 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
  0 233717.378148059 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc90 HUGEPMD
  1 233719.366618585 11498  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  2 233721.589174146 11499  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  0 233723.577443027 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
  3 233725.565996868 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc80 HUGEPMD
  1 233727.788305070 11498  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  0 233729.773605019 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
  2 233731.761853316 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc80 HUGEPMD
  1 233733.988826353 11498  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  3 233735.977068987 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
  0 233737.965477337 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  1 233740.186784801 11498  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  3 233742.174938960 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
  2 233744.163177097 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c80 HUGEPMD
  1 233746.382554150 11498  shm-fork 0        ffff93d606742910        0            0 HUGEPMD
  0 233748.370968889 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
  3 233750.359224507 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  2 233752.563115019 11499  shm-fork 0        ffff93d606742910        0            0 HUGEPMD

With the patch, all processes take the read lock before searching the VMA
interval tree, so they can search simultaneously, as shown below:

root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
kernel function: pcibios_pm_ops
  6 471.183414969 1873   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  5 471.183428598 1875   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  4 471.183414908 1872   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  3 471.183414962 1871   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  6 476.111138155 1873   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  4 476.111007350 1872   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  1 476.111408005 1871   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  5 476.111808625 1875   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  5 480.421521033 1875   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  4 480.421618975 1872   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  1 480.421988461 1871   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  6 480.422808252 1873   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  4 484.732804127 1872   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  6 484.733568351 1873   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  5 484.733698494 1875   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  1 484.733473733 1871   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  4 489.037550955 1872   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  6 489.037880018 1873   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  5 489.038004825 1875   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
  1 489.038070483 1871   shm-fork 0        ffff896a271f6e00        0 7fea18c00000 HUGEPMD
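
For context on why the read lock allows this parallelism:
i_mmap_lock_read() and i_mmap_lock_write() are thin wrappers around the
i_mmap rw-semaphore in struct address_space, so readers can proceed
concurrently while a writer is exclusive. Quoted from memory of
include/linux/fs.h in kernels of this era (check your tree for the exact
definitions):
==================================================
static inline void i_mmap_lock_write(struct address_space *mapping)
{
	down_write(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_lock_read(struct address_space *mapping)
{
	down_read(&mapping->i_mmap_rwsem);
}
==================================================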

On Thu, Jun 4, 2020 at 6:13 PM Gavin Guo <gavin.guo at canonical.com> wrote:

> Hi Colin,
>
> On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king at canonical.com>
> wrote:
> >
> > On 04/06/2020 10:39, Gavin Guo wrote:
> > > From: Waiman Long <longman at redhat.com>
> > >
> > > BugLink: https://bugs.launchpad.net/bugs/1882039
> > >
> > > A customer with large SMP systems (up to 16 sockets) with application
> > > that uses large amount of static hugepages (~500-1500GB) are
> > > experiencing random multisecond delays.  These delays were caused by
> the
> > > long time it took to scan the VMA interval tree with mmap_sem held.
> > >
> > > The sharing of huge PMD does not require changes to the i_mmap at all.
> > > Therefore, we can just take the read lock and let other threads
> > > searching for the right VMA share it in parallel.  Once the right VMA
> is
> > > found, either the PMD lock (2M huge page for x86-64) or the
> > > mm->page_table_lock will be acquired to perform the actual PMD sharing.
> > >
> > > Lock contention, if present, will happen in the spinlock.  That is much
> > > better than contention in the rwsem where the time needed to scan the
> > > interval tree is indeterminate.
> > >
> > > With this patch applied, the customer is seeing significant performance
> > > improvement over the unpatched kernel.
> > >
> > > Link:
> http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> > > Signed-off-by: Waiman Long <longman at redhat.com>
> > > Suggested-by: Mike Kravetz <mike.kravetz at oracle.com>
> > > Reviewed-by: Mike Kravetz <mike.kravetz at oracle.com>
> > > Cc: Davidlohr Bueso <dave at stgolabs.net>
> > > Cc: Peter Zijlstra <peterz at infradead.org>
> > > Cc: Ingo Molnar <mingo at redhat.com>
> > > Cc: Will Deacon <will.deacon at arm.com>
> > > Cc: Matthew Wilcox <willy at infradead.org>
> > > Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> > > Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
> > > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> > > Signed-off-by: Gavin Guo <gavin.guo at canonical.com>
> > > ---
> > >  mm/hugetlb.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 2af1831596f2..9c871b47ef86 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm,
> unsigned long addr, pud_t *pud)
> > >       if (!vma_shareable(vma, addr))
> > >               return (pte_t *)pmd_alloc(mm, pud, addr);
> > >
> > > -     i_mmap_lock_write(mapping);
> > > +     i_mmap_lock_read(mapping);
> > >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> > >               if (svma == vma)
> > >                       continue;
> > > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm,
> unsigned long addr, pud_t *pud)
> > >       spin_unlock(ptl);
> > >  out:
> > >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > -     i_mmap_unlock_write(mapping);
> > > +     i_mmap_unlock_read(mapping);
> > >       return pte;
> > >  }
> > >
> >
> > I don't see any testing evidence of this working on B/E/F. Mind you this
> > looks straightforward, and that's what concerns me as a good win with
> > minimal change. What testing has been performed on this fix?
>
> I agree that this is really straightforward. However, I have asked the
> customer many times for the detailed data and this is all I have. I
> currently cannot reproduce it locally. I will keep trying and also ask
> the customer whether they can give me more details on how they performed
> the testing.
>
> >
> > Colin
>