[SRU][B/E/F][PATCH 1/1] hugetlbfs: take read_lock on i_mmap for PMD sharing

Gavin Guo gavin.guo at canonical.com
Thu Jun 4 10:12:40 UTC 2020


Hi Colin,

On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king at canonical.com> wrote:
>
> On 04/06/2020 10:39, Gavin Guo wrote:
> > From: Waiman Long <longman at redhat.com>
> >
> > BugLink: https://bugs.launchpad.net/bugs/1882039
> >
> > A customer with large SMP systems (up to 16 sockets) and an
> > application that uses a large amount of static hugepages
> > (~500-1500GB) is experiencing random multisecond delays.  These
> > delays were caused by the long time it took to scan the VMA
> > interval tree with mmap_sem held.
> >
> > The sharing of huge PMD does not require changes to the i_mmap at all.
> > Therefore, we can just take the read lock and let other threads
> > searching for the right VMA share it in parallel.  Once the right VMA is
> > found, either the PMD lock (2M huge page for x86-64) or the
> > mm->page_table_lock will be acquired to perform the actual PMD sharing.
> >
> > Lock contention, if present, will happen in the spinlock.  That is much
> > better than contention in the rwsem, where the time needed to scan the
> > interval tree is indeterminate.
> >
> > With this patch applied, the customer is seeing significant performance
> > improvement over the unpatched kernel.
> >
> > Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> > Signed-off-by: Waiman Long <longman at redhat.com>
> > Suggested-by: Mike Kravetz <mike.kravetz at oracle.com>
> > Reviewed-by: Mike Kravetz <mike.kravetz at oracle.com>
> > Cc: Davidlohr Bueso <dave at stgolabs.net>
> > Cc: Peter Zijlstra <peterz at infradead.org>
> > Cc: Ingo Molnar <mingo at redhat.com>
> > Cc: Will Deacon <will.deacon at arm.com>
> > Cc: Matthew Wilcox <willy at infradead.org>
> > Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
> > Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
> > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> > Signed-off-by: Gavin Guo <gavin.guo at canonical.com>
> > ---
> >  mm/hugetlb.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 2af1831596f2..9c871b47ef86 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >       if (!vma_shareable(vma, addr))
> >               return (pte_t *)pmd_alloc(mm, pud, addr);
> >
> > -     i_mmap_lock_write(mapping);
> > +     i_mmap_lock_read(mapping);
> >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> >               if (svma == vma)
> >                       continue;
> > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >       spin_unlock(ptl);
> >  out:
> >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > -     i_mmap_unlock_write(mapping);
> > +     i_mmap_unlock_read(mapping);
> >       return pte;
> >  }
> >
> I don't see any testing evidence of this working on B/E/F. Mind you,
> this looks straightforward, and that's what concerns me: a good win
> with minimal change. What testing has been performed on this fix?

I agree that this is really straightforward. However, I have asked the
customer many times for detailed data, and this is all I have.
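
To spell out why the relaxation is safe: the interval-tree walk only
reads i_mmap, and the actual PMD installation is still serialized by
the page table spinlock. Roughly, the post-patch flow in
huge_pmd_share() looks like the following (paraphrased and simplified
from the upstream function, not the verbatim source):

pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
        struct vm_area_struct *vma = find_vma(mm, addr);
        struct address_space *mapping = vma->vm_file->f_mapping;
        pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
        struct vm_area_struct *svma;
        pte_t *spte = NULL, *pte;
        spinlock_t *ptl;

        if (!vma_shareable(vma, addr))
                return (pte_t *)pmd_alloc(mm, pud, addr);

        /* Read-only walk: multiple faulting threads can now scan the
         * interval tree concurrently instead of serializing here. */
        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
                /* ... find another VMA with a shareable PMD page and
                 * take a reference on it via get_page() ... */
        }
        if (!spte)
                goto out;

        /* The real mutual exclusion happens on the page table
         * spinlock, whose hold time is short and bounded. */
        ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
        /* ... pud_populate() if the PUD is still empty, otherwise
         * drop the reference taken above ... */
        spin_unlock(ptl);
out:
        pte = (pte_t *)pmd_alloc(mm, pud, addr);
        i_mmap_unlock_read(mapping);
        return pte;
}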
I currently cannot reproduce the issue locally. I will keep trying,
and will also ask the customer whether they can share more details on
how they performed their testing. For reference, the kind of stress
reproducer I have in mind is sketched below.
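
This is an untested sketch: the mount point /dev/hugepages, the file
name, NPROC, and the sizes are placeholders to be scaled towards the
customer's setup (16 sockets, ~500-1500GB of hugepages).

/*
 * Stress huge_pmd_share(): many processes repeatedly map and fault
 * the same large hugetlbfs file with MAP_SHARED, so that each fault
 * walks the i_mmap interval tree while other mappings come and go.
 * Needs enough preallocated 2M hugepages to back MAP_LEN.  Note that
 * PMD sharing only triggers for ranges covering a PUD_SIZE-aligned
 * region (1GB on x86-64), hence the 2GB mapping length.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC    64
#define NITER    100
#define MAP_LEN  (2UL << 30)            /* 2GB */
#define HPAGE_SZ (2UL << 20)            /* 2MB */

int main(void)
{
        int fd = open("/dev/hugepages/pmdshare", O_CREAT | O_RDWR, 0600);

        if (fd < 0 || ftruncate(fd, MAP_LEN) < 0) {
                perror("hugetlbfs file");
                return 1;
        }
        for (int i = 0; i < NPROC; i++) {
                if (fork() == 0) {
                        for (int it = 0; it < NITER; it++) {
                                char *p = mmap(NULL, MAP_LEN,
                                               PROT_READ | PROT_WRITE,
                                               MAP_SHARED, fd, 0);
                                if (p == MAP_FAILED)
                                        _exit(1);
                                /* Touch each 2M page: the fault path
                                 * goes through huge_pmd_share(). */
                                for (size_t off = 0; off < MAP_LEN;
                                     off += HPAGE_SZ)
                                        p[off] = 1;
                                munmap(p, MAP_LEN);
                        }
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        unlink("/dev/hugepages/pmdshare");
        return 0;
}

The idea would be to compare fault latency (e.g. with bcc's
funclatency on huge_pmd_share) between the patched and unpatched
kernels.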

>
> Colin


