mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	
			
				
					
						
					
					9d5403b103
				
			
			
		
	
	
		
			1157 Commits
		
	
	
	| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|  Peter Xu | 2ea7ff1e39 | mm/hugetlb: fix race condition of uffd missing/minor handling Patch series "mm/hugetlb: Fix selftest failures with write check", v3.
Currently akpm mm-unstable fails with uffd hugetlb private mapping test
randomly on a write check.
The initial bisection of that points to the recent pmd unshare series, but
it turns out there's no direction relationship with the series but only
some timing change caused the race to start trigger.
The race should be fixed in patch 1.  Patch 2 is a trivial cleanup on the
similar race with hugetlb migrations, patch 3 comment on the write check
so when anyone read it again it'll be clear why it's there.
This patch (of 3):
After the recent rework patchset of hugetlb locking on pmd sharing,
kselftest for userfaultfd sometimes fails on hugetlb private tests with
unexpected write fault checks.
It turns out there's nothing wrong within the locking series regarding
this matter, but it could have changed the timing of threads so it can
trigger an old bug.
The real bug is when we call hugetlb_no_page() we're not with the pgtable
lock.  It means we're reading the pte values lockless.  It's perfectly
fine in most cases because before we do normal page allocations we'll take
the lock and check pte_same() again.  However before that, there are
actually two paths on userfaultfd missing/minor handling that may directly
move on with the fault process without checking the pte values.
It means for these two paths we may be generating an uffd message based on
an unstable pte, while an unstable pte can legally be anything as long as
the modifier holds the pgtable lock.
One example, which is also what happened in the failing kselftest and
caused the test failure, is that for private mappings wr-protection
changes can happen on one page.  While hugetlb_change_protection()
generally requires pte being cleared before being changed, then there can
be a race condition like:
        thread 1                              thread 2
        --------                              --------
      UFFDIO_WRITEPROTECT                     hugetlb_fault
        hugetlb_change_protection
          pgtable_lock()
          huge_ptep_modify_prot_start
                                              pte==NULL
                                              hugetlb_no_page
                                                generate uffd missing event
                                                even if page existed!!
          huge_ptep_modify_prot_commit
          pgtable_unlock()
Fix this by rechecking the pte after pgtable lock for both userfaultfd
missing & minor fault paths.
This bug should have been around starting from uffd hugetlb introduced, so
attaching a Fixes to the commit.  Also attach another Fixes to the minor
support commit for easier tracking.
Note that userfaultfd is actually fine with false positives (e.g.  caused
by pte changed), but not wrong logical events (e.g.  caused by reading a
pte during changing).  The latter can confuse the userspace, so the
strictness is very much preferred.  E.g., MISSING event should never
happen on the page after UFFDIO_COPY has correctly installed the page and
returned.
Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
Fixes:  | ||
|  Peter Xu | 515778e2d7 | mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte
marker code and trigger an warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.
Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes:  | ||
|  Andrew Morton | acfac37851 | mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static Reported-by: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | 1440f57602 | Five hotfixes - three for nilfs2, two for MM.  For are cc:stable, one is not. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0YhtwAKCRDdBJ7gKXxA juJLAQDCa0g8sfe9cTw3PT1gRnn8gWLHEkMgUWVC/aBaqYFGeQEAta+g8muv9Tpd qODv0JARH4cwONKEA24Oql+A5RnI6gQ= =QZnW -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "Five hotfixes - three for nilfs2, two for MM. For are cc:stable, one is not" * tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: fix leak of nilfs_root in case of writer thread creation failure nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level() nilfs2: fix use-after-free bug of struct nilfs_root mm/damon/core: initialize damon_target->list in damon_new_target() mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page | ||
|  Baolin Wang | fac35ba763 | mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and 1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified. So when looking up a CONT-PTE size hugetlb page by follow_page(), it will use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size hugetlb in follow_page_pte(). However this pte entry lock is incorrect for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get the correct lock, which is mm->page_table_lock. That means the pte entry of the CONT-PTE size hugetlb under current pte lock is unstable in follow_page_pte(), we can continue to migrate or poison the pte entry of the CONT-PTE size hugetlb, which can cause some potential race issues, even though they are under the 'pte lock'. For example, suppose thread A is trying to look up a CONT-PTE size hugetlb page by move_pages() syscall under the lock, however antoher thread B can migrate the CONT-PTE hugetlb page at the same time, which will cause thread A to get an incorrect page, if thread A also wants to do page migration, then data inconsistency error occurs. Moreover we have the same issue for CONT-PMD size hugetlb in follow_huge_pmd(). To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte() to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to get the correct pte entry lock to make the pte entry stable. Mike said: Support for CONT_PMD/_PTE was added with | ||
|  Mike Kravetz | bbff39cc6c | hugetlb: allocate vma lock for all sharable vmas The hugetlb vma lock was originally designed to synchronize pmd sharing. As such, it was only necessary to allocate the lock for vmas that were capable of pmd sharing. Later in the development cycle, it was discovered that it could also be used to simplify fault/truncation races as described in [1]. However, a subsequent change to allocate the lock for all vmas that use the page cache was never made. A fault/truncation race could leave pages in a file past i_size until the file is removed. Remove the previous restriction and allocate lock for all VM_MAYSHARE vmas. Warn in the unlikely event of allocation failure. [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com Fixes: "hugetlb: clean up code checking for fault/truncation races" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | ecfbd73387 | hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb file truncation/hole punch code may need to back out and take locks in order in the routine hugetlb_unmap_file_folio(). This code could race with vma freeing as pointed out in [1] and result in accessing a stale vma pointer. To address this, take the vma_lock when clearing the vma_lock->vma pointer. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ [mike.kravetz@oracle.com: address build issues] Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 131a79b474 | hugetlb: fix vma lock handling during split vma and range unmapping Patch series "hugetlb: fixes for new vma lock series". In review of the series "hugetlb: Use new vma lock for huge pmd sharing synchronization", Miaohe Lin pointed out two key issues: 1) There is a race in the routine hugetlb_unmap_file_folio when locks are dropped and reacquired in the correct order [1]. 2) With the switch to using vma lock for fault/truncate synchronization, we need to make sure lock exists for all VM_MAYSHARE vmas, not just vmas capable of pmd sharing. These two issues are addressed here. In addition, having a vma lock present in all VM_MAYSHARE vmas, uncovered some issues around vma splitting. Those are also addressed. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ This patch (of 3): The hugetlb vma lock hangs off the vm_private_data field and is specific to the vma. When vm_area_dup() is called as part of vma splitting, the vma lock pointer is copied to the new vma. This will result in issues such as double freeing of the structure. Update the hugetlb open vm_ops to allocate a new vma lock for the new vma. The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE to prevent subsequent pmd sharing. hugetlb_vma_lock_free attempted to anticipate this by checking both VM_MAYSHARE and VM_SHARED. However, if only VM_MAYSHARE was set we would miss the free. With the introduction of the vma lock, a vma can not participate in pmd sharing if vm_private_data is NULL. Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final, free the vma lock to prevent sharing. Also, update the sharing code to make sure vma lock is indeed a condition for pmd sharing. hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas. Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com Fixes: "hugetlb: add vma based lock for pmd sharing" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Xin Hao | 8346d69d8b | mm/hugetlb: add available_huge_pages() func In hugetlb.c there are several places which compare the values of 'h->free_huge_pages' and 'h->resv_huge_pages', it looks a bit messy, so add a new available_huge_pages() function to do these. Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Liu Shixin | 958f32ce83 | mm: hugetlb: fix UAF in hugetlb_handle_userfault The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
  hugetlb_no_page
    /*unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock*/
                                           vm_mmap_pgoff
                                             do_mmap
                                               mmap_region
                                                 munmap_vma_range
                                                   /* clean old vma */
        /* lock vma_lock again  <--- UAF */
    /* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes:  | ||
|  Mike Kravetz | 2b21624fc2 | hugetlb: freeze allocated pages before creating hugetlb pages When creating hugetlb pages, the hugetlb code must first allocate contiguous pages from a low level allocator such as buddy, cma or memblock. The pages returned from these low level allocators are ref counted. This creates potential issues with other code taking speculative references on these pages before they can be transformed to a hugetlb page. This issue has been addressed with methods and code such as that provided in [1]. Recent discussions about vmemmap freeing [2] have indicated that it would be beneficial to freeze all sub pages, including the head page of pages returned from low level allocators before converting to a hugetlb page. This helps avoid races if we want to replace the page containing vmemmap for the head page. There have been proposals to change at least the buddy allocator to return frozen pages as described at [3]. If such a change is made, it can be employed by the hugetlb code. However, as mentioned above hugetlb uses several low level allocators so each would need to be modified to return frozen pages. For now, we can manually freeze the returned pages. This is done in two places: 1) alloc_buddy_huge_page, only the returned head page is ref counted. We freeze the head page, retrying once in the VERY rare case where there may be an inflated ref count. 2) prep_compound_gigantic_page, for gigantic pages the current code freezes all pages except the head page. New code will simply freeze the head page as well. In a few other places, code checks for inflated ref counts on newly allocated hugetlb pages. With the modifications to freeze after allocating, this code can be removed. After hugetlb pages are freshly allocated, they are often added to the hugetlb free lists. Since these pages were previously ref counted, this was done via put_page() which would end up calling the hugetlb destructor: free_huge_page. With changes to freeze pages, we simply call free_huge_page directly to add the pages to the free list. In a few other places, freshly allocated hugetlb pages were immediately put into use, and the expectation was they were already ref counted. In these cases, we must manually ref count the page. [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/ [mike.kravetz@oracle.com: fix NULL pointer dereference] Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | fa27759af4 | hugetlb: clean up code checking for fault/truncation races With the new hugetlb vma lock in place, it can also be used to handle page fault races with file truncation. The lock is taken at the beginning of the code fault path in read mode. During truncation, it is taken in write mode for each vma which has the file mapped. The file's size (i_size) is modified before taking the vma lock to unmap. How are races handled? The page fault code checks i_size early in processing after taking the vma lock. If the fault is beyond i_size, the fault is aborted. If the fault is not beyond i_size the fault will continue and a new page will be added to the file. It could be that truncation code modifies i_size after the check in fault code. That is OK, as truncation code will soon remove the page. The truncation code will wait until the fault is finished, as it must obtain the vma lock in write mode. This patch cleans up/removes late checks in the fault paths that try to back out pages racing with truncation. As noted above, we just let the truncation code remove the pages. [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning] Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 40549ba8f8 | hugetlb: use new vma_lock for pmd sharing synchronization The new hugetlb vma lock is used to address this race:
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.
Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
  hugetlb_unmap_file_page is designed to drop locks and reacquire in the
  correct order to guarantee unmap success.
Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 8d9bfb2608 | hugetlb: add vma based lock for pmd sharing Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 12710fd696 | hugetlb: rename vma_shareable() and refactor code Rename the routine vma_shareable to vma_addr_pmd_shareable as it is checking a specific address within the vma. Refactor code to check if an aligned range is shareable as this will be needed in a subsequent patch. Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 7e1813d48d | hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache remove_huge_page removes a hugetlb page from the page cache. Change to hugetlb_delete_from_page_cache as it is a more descriptive name. huge_add_to_page_cache is global in scope, but only deals with hugetlb pages. For consistency and clarity, rename to hugetlb_add_to_page_cache. Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 3a47c54f09 | hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Commit  | ||
|  Mike Kravetz | 188a39725a | hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Patch series "hugetlb: Use new vma lock for huge pmd sharing synchronization", v2. hugetlb fault scalability regressions have recently been reported [1]. This is not the first such report, as regressions were also noted when commit | ||
|  XU pengfei | 3259914f8c | mm/hugetlb: remove unnecessary 'NULL' values from pointer Pointer variables allocate memory first, and then judge. There is no need to initialize the assignment. Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com Signed-off-by: XU pengfei <xupengfei@nfschina.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Muchun Song | a4a00b451e | mm: hugetlb: eliminate memory-less nodes handling The memory-notify-based approach aims to handle meory-less nodes, however, it just adds the complexity of code as pointed by David in thread [1]. The handling of memory-less nodes is introduced by commit | ||
|  Muchun Song | b958d4d08f | mm: hugetlb: simplify per-node sysfs creation and removal Patch series "simplify handling of per-node sysfs creation and removal",
v4.
This patch (of 2):
The following commit offload per-node sysfs creation and removal to a
kworker and did not say why it is needed.  And it also said "I don't know
that this is absolutely required".  It seems like the author was not sure
as well.  Since it only complicates the code, this patch will revert the
changes to simplify the code.
   | ||
|  Cheng Li | 14455eabd8 | mm: use nth_page instead of mem_map_offset mem_map_next To handle the discontiguous case, mem_map_next() has a parameter named
`offset`.  As a function caller, one would be confused why "get next
entry" needs a parameter named "offset".  The other drawback of
mem_map_next() is that the callers must take care of the map between
parameter "iter" and "offset", otherwise we may get an hole or duplication
during iteration.  So we use nth_page instead of mem_map_next.
And replace mem_map_offset with nth_page() per Matthew's comments.
Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Fixes:  | ||
|  Li zeming | 8eeda55fe0 | mm/hugetlb.c: remove unnecessary initialization of local `err' Link: https://lkml.kernel.org/r/20220905020918.3552-1-zeming@nfschina.com Signed-off-by: Li zeming <zeming@nfschina.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Andrew Morton | 6d751329e7 | Merge branch 'mm-hotfixes-stable' into mm-stable | ||
|  Doug Berger | 317314527d | mm/hugetlb: correct demote page offset logic With gigantic pages it may not be true that struct page structures are
contiguous across the entire gigantic page.  The nth_page macro is used
here in place of direct pointer arithmetic to correct for this.
Mike said:
: This error could cause addressing exceptions.  However, this is only
: possible in configurations where CONFIG_SPARSEMEM &&
: !CONFIG_SPARSEMEM_VMEMMAP.  Such a configuration option is rare and
: unknown to be the default anywhere.
Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
Fixes:  | ||
|  Miaohe Lin | 5e6b1bf1b5 | hugetlb: remove meaningless BUG_ON(huge_pte_none()) When code reaches here, invalid page would have been accessed if huge pte is none. So this BUG_ON(huge_pte_none()) is meaningless. Remove it. Link: https://lkml.kernel.org/r/20220901120030.63318-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | a9e1eab241 | hugetlb: add comment for subtle SetHPageVmemmapOptimized() The SetHPageVmemmapOptimized() called here seems unnecessary as it's assumed to be set when calling this function. But it's indeed cleared by above set_page_private(page, 0). Add a comment to avoid possible future confusion. Link: https://lkml.kernel.org/r/20220901120030.63318-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 29be84265f | hugetlb: kill hugetlbfs_pagecache_page() Fold hugetlbfs_pagecache_page() into its sole caller to remove some duplicated code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 12658abfc5 | hugetlb: pass NULL to kobj_to_hstate() if nid is unused We can pass NULL to kobj_to_hstate() directly when nid is unused to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | bcc665436f | hugetlb: use helper {huge_pte|pmd}_lock() Use helper huge_pte_lock and pmd_lock to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 103956805c | hugetlb: use sizeof() to get the array size It's better to use sizeof() to get the array size instead of manual calculation. Minor readability improvement. Link: https://lkml.kernel.org/r/20220901120030.63318-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 3466534131 | hugetlb: use LIST_HEAD() to define a list head Use LIST_HEAD() directly to define a list head to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | c2c3a60a85 | hugetlb: Use helper macro SZ_1K Use helper macro SZ_1K to do the size conversion to make code more consistent in this file. Minor readability improvement. Link: https://lkml.kernel.org/r/20220901120030.63318-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 263b899802 | hugetlb: make hugetlb_cma_check() static Patch series "A few cleanup patches for hugetlb", v2. This series contains a few cleanup patches to use helper functions to simplify the codes, remove unneeded nid parameter and so on. More details can be found in the respective changelogs. This patch (of 10): Make hugetlb_cma_check() static as it's only used inside mm/hugetlb.c. Link: https://lkml.kernel.org/r/20220901120030.63318-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220901120030.63318-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 3aa4ed8040 | mm/hugetlb: make detecting shared pte more reliable If the pagetables are shared, we shouldn't copy or take references. Since src could have unshared and dst shares with another vma, huge_pte_none() is thus used to determine whether dst_pte is shared. But this check isn't reliable. A shared pte could have pte none in pagetable in fact. The page count of ptep page should be checked here in order to reliably determine whether pte is shared. [lukas.bulwahn@gmail.com: remove unused local variable dst_entry in copy_hugetlb_page_range()] Link: https://lkml.kernel.org/r/20220822082525.26071-1-lukas.bulwahn@gmail.com Link: https://lkml.kernel.org/r/20220816130553.31406-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 01088a6036 | mm/hugetlb: fix sysfs group leak in hugetlb_unregister_node() The sysfs group per_node_hstate_attr_group and hstate_demote_attr_group when h->demote_order != 0 are created in hugetlb_register_node(). But these sysfs groups are not removed when unregister the node, thus sysfs group is leaked. Using sysfs_remove_group() to fix this issue. Link: https://lkml.kernel.org/r/20220816130553.31406-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Fengwei Yin <fengwei.yin@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 3a5497a2da | mm/hugetlb: fix missing call to restore_reserve_on_error() When huge_add_to_page_cache() fails, the page is freed directly without calling restore_reserve_on_error() to restore reserve for newly allocated pages not in page cache. Fix this by calling restore_reserve_on_error() when huge_add_to_page_cache fails. [linmiaohe@huawei.com: remove err == -EEXIST check and retry logic] Link: https://lkml.kernel.org/r/20220823030209.57434-4-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220816130553.31406-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | 3a6bdda0b5 | mm/hugetlb: fix WARN_ON(!kobj) in sysfs_create_group() If sysfs_create_group() fails with hstate_attr_group, hstate_kobjs[hi]
will be set to NULL.  Then it will be passed to sysfs_create_group() if
h->demote_order != 0 thus triggering WARN_ON(!kobj) check.  Fix this by
making sure hstate_kobjs[hi] != NULL when calling sysfs_create_group.
Link: https://lkml.kernel.org/r/20220816130553.31406-3-linmiaohe@huawei.com
Fixes:  | ||
|  Miaohe Lin | a43a83c79b | mm/hugetlb: fix incorrect update of max_huge_pages Patch series "A few fixup patches for hugetlb". This series contains a few fixup patches to fix incorrect update of max_huge_pages, fix WARN_ON(!kobj) in sysfs_create_group() and so on. More details can be found in the respective changelogs. This patch (of 6): There should be pages_per_huge_page(h) / pages_per_huge_page(target_hstate) pages incremented for target_hstate->max_huge_pages when page is demoted. Update max_huge_pages accordingly for consistency. Link: https://lkml.kernel.org/r/20220816130553.31406-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220816130553.31406-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Feng Tang | d2226ebd54 | mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in commit | ||
|  Miaohe Lin | ab74ef708d | mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page
cache are installed in the ptes.  But hugepage_add_new_anon_rmap is called
for them mistakenly because they're not vm_shared.  This will corrupt the
page->mapping used by page cache code.
Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com
Fixes:  | ||
|  David Hildenbrand | 1d8d14641f | mm/hugetlb: support write-faults in shared mappings If we ever get a write-fault on a write-protected page in a shared
mapping, we'd be in trouble (again).  Instead, we can simply map the page
writable.
And in fact, there is even a way right now to trigger that code via
uffd-wp ever since we stared to support it for shmem in 5.19:
--------------------------------------------------------------------------
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <errno.h>
 #include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 #include <linux/userfaultfd.h>
 #define HUGETLB_SIZE (2 * 1024 * 1024u)
 static char *map;
 int uffd;
 static int temp_setup_uffd(void)
 {
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	struct uffdio_writeprotect uffd_writeprotect;
 	struct uffdio_range uffd_range;
 	uffd = syscall(__NR_userfaultfd,
 		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
 	if (uffd < 0) {
 		fprintf(stderr, "syscall() failed: %d\n", errno);
 		return -errno;
 	}
 	uffdio_api.api = UFFD_API;
 	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
 		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
 		return -errno;
 	}
 	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
 		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
 		return -ENOSYS;
 	}
 	/* Register UFFD-WP */
 	uffdio_register.range.start = (unsigned long) map;
 	uffdio_register.range.len = HUGETLB_SIZE;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
 		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
 		return -errno;
 	}
 	/* Writeprotect a single page. */
 	uffd_writeprotect.range.start = (unsigned long) map;
 	uffd_writeprotect.range.len = HUGETLB_SIZE;
 	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
 	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
 		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
 		return -errno;
 	}
 	/* Unregister UFFD-WP without prior writeunprotection. */
 	uffd_range.start = (unsigned long) map;
 	uffd_range.len = HUGETLB_SIZE;
 	if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
 		fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
 		return -errno;
 	}
 	return 0;
 }
 int main(int argc, char **argv)
 {
 	int fd;
 	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
 	if (!fd) {
 		fprintf(stderr, "open() failed\n");
 		return -errno;
 	}
 	if (ftruncate(fd, HUGETLB_SIZE)) {
 		fprintf(stderr, "ftruncate() failed\n");
 		return -errno;
 	}
 	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
 	if (map == MAP_FAILED) {
 		fprintf(stderr, "mmap() failed\n");
 		return -errno;
 	}
 	*map = 0;
 	if (temp_setup_uffd())
 		return 1;
 	*map = 0;
 	return 0;
 }
--------------------------------------------------------------------------
Above test fails with SIGBUS when there is only a single free hugetlb page.
 # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 # ./test
 Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
 # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 # ./test
 # cat /proc/meminfo | grep HugePages_
 HugePages_Total:       2
 HugePages_Free:        1
 HugePages_Rsvd:    18446744073709551615
 HugePages_Surp:        0
Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when
unregistering and consequently keeps the PTE writeprotected.  Reason for
this is to avoid the additional overhead when unregistering.  Note that
this is the case also for !hugetlb and that we will end up with writable
PTEs that still have the uffd-wp PTE bit set once we return from
hugetlb_wp().  I'm not touching the uffd-wp PTE bit for now, because it
seems to be a generic thing -- wp_page_reuse() also doesn't clear it.
VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates
that MAP_SHARED handling was at least envisioned, but could never have
worked as expected.
While at it, make sure that we never end up in hugetlb_wp() on write
faults without VM_WRITE, because we don't support maybe_mkwrite()
semantics as commonly used in the !hugetlb case -- for example, in
wp_page_reuse().
Note that there is no need to do any kind of reservation in
hugetlb_fault() in this case ...  because we already have a hugetlb page
mapped R/O that we will simply map writable and we are not dealing with
COW/unsharing.
Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
Fixes:  | ||
|  Naoya Horiguchi | 161df60e9e | mm, hwpoison, hugetlb: support saving mechanism of raw error pages When handling memory error on a hugetlb page, the error handler tries to dissolve and turn it into 4kB pages. If it's successfully dissolved, PageHWPoison flag is moved to the raw error page, so that's all right. However, dissolve sometimes fails, then the error page is left as hwpoisoned hugepage. It's useful if we can retry to dissolve it to save healthy pages, but that's not possible now because the information about where the raw error pages is lost. Use the private field of a few tail pages to keep that information. The code path of shrinking hugepage pool uses this info to try delayed dissolve. In order to remember multiple errors in a hugepage, a singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only simple operations (adding an entry or clearing all) are required and the list is assumed not to be very long, so this simple data structure should be enough. If we failed to save raw error info, the hwpoison hugepage has errors on unknown subpage, then this new saving mechanism does not work any more, so disable saving new raw error info and freeing hwpoison hugepages. Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Naoya Horiguchi | 3a194f3f8a | mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry follow_pud_mask() does not support non-present pud entry now. As long as I tested on x86_64 server, follow_pud_mask() still simply returns no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe user-visible effect should happen. But generally we should call follow_huge_pud() for non-present pud entry for 1GB hugetlb page. Update pud_huge() and follow_huge_pud() to handle non-present pud entries. The changes are similar to previous works for pud entries commit | ||
|  Naoya Horiguchi | c0531714d6 | mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages() Patch series "mm, hwpoison: enable 1GB hugepage support", v7.
This patch (of 8):
I found a weird state of 1GB hugepage pool, caused by the following
procedure:
  - run a process reserving all free 1GB hugepages,
  - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
    /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
  - kill the reserving process.
, then all the hugepages are free *and* surplus at the same time.
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
  0
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
  3
This state is resolved by reserving and allocating the pages then freeing
them again, so this seems not to result in serious problem.  But it's a
little surprising (shrinking pool suddenly fails).
This behavior is caused by hstate_is_gigantic() check in
return_unused_surplus_pages().  This was introduced so long ago in 2008 by
commit  | ||
|  Muchun Song | 6213834c10 | mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability There is a discussion about the name of hugetlb_vmemmap_alloc/free in
thread [1].  The suggestion suggested by David is rename "alloc/free" to
"optimize/restore" to make functionalities clearer to users, "optimize"
means the function will optimize vmemmap pages, while "restore" means
restoring its vmemmap pages discared before.  This commit does this.
Another discussion is the confusion RESERVE_VMEMMAP_NR isn't used
explicitly for vmemmap_addr but implicitly for vmemmap_end in
hugetlb_vmemmap_alloc/free.  David suggested we can compute what
hugetlb_vmemmap_init() does now at runtime.  We do not need to worry for
the overhead of computing at runtime since the calculation is simple
enough and those functions are not in a hot path.  This commit has the
following improvements:
  1) The function suffixed name ("optimize/restore") is more expressive.
  2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
  3) The hugetlb_vmemmap_init() does not need to be exported anymore.
  4) A ->optimize_vmemmap_pages field in struct hstate is killed.
  5) There is only one place where checks is_power_of_2(sizeof(struct
     page)) instead of two places.
  6) Add more comments for hugetlb_vmemmap_optimize/restore().
  7) For external users, hugetlb_optimize_vmemmap_pages() is used for
     detecting if the HugeTLB's vmemmap pages is optimizable originally.
     In this commit, it is killed and we introduce a new helper
     hugetlb_vmemmap_optimizable() to replace it.  The name is more
     expressive.
Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | 6614a3c316 | - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.
  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.
  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.
  Summary:
   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
   - Some kmemleak fixes from Patrick Wang and Waiman Long
   - DAMON updates from SeongJae Park
   - memcg debug/visibility work from Roman Gushchin
   - vmalloc speedup from Uladzislau Rezki
   - more folio conversion work from Matthew Wilcox
   - enhancements for coherent device memory mapping from Alex Sierra
   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan
   - hugetlb optimizations from Mike Kravetz
   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.
   - mprotect soft-dirty checking has been improved by Peter Xu
   - Many other singleton patches all over the place"
 [ XFS merge from hell as per Darrick Wong in
   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ... | ||
|  Linus Torvalds | f00654007f | Folio changes for 6.0 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
    when running xfstests
 
  - Convert more of mpage to use folios
 
  - Remove add_to_page_cache() and add_to_page_cache_locked()
 
  - Convert find_get_pages_range() to filemap_get_folios()
 
  - Improvements to the read_cache_page() family of functions
 
  - Remove a few unnecessary checks of PageError
 
  - Some straightforward filesystem conversions to use folios
 
  - Split PageMovable users out from address_space_operations into their
    own movable_operations
 
  - Convert aops->migratepage to aops->migrate_folio
 
  - Remove nobh support (Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
 gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
 mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
 /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
 C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
 Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
 =DgUi
 -----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
   when running xfstests
 - Convert more of mpage to use folios
 - Remove add_to_page_cache() and add_to_page_cache_locked()
 - Convert find_get_pages_range() to filemap_get_folios()
 - Improvements to the read_cache_page() family of functions
 - Remove a few unnecessary checks of PageError
 - Some straightforward filesystem conversions to use folios
 - Split PageMovable users out from address_space_operations into
   their own movable_operations
 - Convert aops->migratepage to aops->migrate_folio
 - Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
  fs: remove the NULL get_block case in mpage_writepages
  fs: don't call ->writepage from __mpage_writepage
  fs: remove the nobh helpers
  jfs: stop using the nobh helper
  ext2: remove nobh support
  ntfs3: refactor ntfs_writepages
  mm/folio-compat: Remove migration compatibility functions
  fs: Remove aops->migratepage()
  secretmem: Convert to migrate_folio
  hugetlb: Convert to migrate_folio
  aio: Convert to migrate_folio
  f2fs: Convert to filemap_migrate_folio()
  ubifs: Convert to filemap_migrate_folio()
  btrfs: Convert btrfs_migratepage to migrate_folio
  mm/migrate: Add filemap_migrate_folio()
  mm/migrate: Convert migrate_page() to migrate_folio()
  nfs: Convert to migrate_folio
  btrfs: Convert btree_migratepage to migrate_folio
  mm/migrate: Convert expected_page_refs() to folio_expected_refs()
  mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
  ... | ||
|  Miaohe Lin | da9a298f5f | hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte When alloc_huge_page fails, *pagep is set to NULL without put_page first.
So the hugepage indicated by *pagep is leaked.
Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
Fixes:  | ||
|  Naoya Horiguchi | c2cb0dcce9 | mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() Originally copy_hugetlb_page_range() handles migration entries and
hwpoisoned entries in similar manner.  But recently the related code path
has more code for migration entries, and when
is_writable_migration_entry() was converted to
!is_readable_migration_entry(), hwpoison entries on source processes got
to be unexpectedly updated (which is legitimate for migration entries, but
not for hwpoison entries).  This results in unexpected serious issues like
kernel panic when forking processes with hwpoison entries in pmd.
Separate the if branch into one for hwpoison entries and one for migration
entries.
Link: https://lkml.kernel.org/r/20220704013312.2415700-3-naoya.horiguchi@linux.dev
Fixes:  | ||
|  Gang Li | dcadcf1c30 | mm, hugetlb: skip irrelevant nodes in show_free_areas() show_free_areas() allows to filter out node specific data which is irrelevant to the allocation request. But hugetlb_show_meminfo() still shows hugetlb on all nodes, which is redundant and unnecessary. Use show_mem_node_skip() to skip irrelevant nodes. And replace hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid). before-and-after sample output of OOM: before: ``` [ 214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB [ 214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig [ 214.388334] lowmem_reserve[]: 0 0 0 0 0 [ 214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20 [ 214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB ``` after: ``` [ 145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u [ 145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig [ 145.152315] lowmem_reserve[]: 0 0 0 0 0 [ 145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME) [ 145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB ``` Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | 4ddb4d91b8 | hugetlb: do not update address in huge_pmd_unshare As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [sfr@canb.auug.org.au: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Mike Kravetz | e95a985178 | hugetlb: skip to end of PT page mapping when pte not present Patch series "hugetlb: speed up linear address scanning", v2. At unmap, fork and remap time hugetlb address ranges are linearly scanned. We can optimize these scans if the ranges are sparsely populated. Also, enable page table "Lazy copy" for hugetlb at fork. NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to add an arch specific version hugetlb_mask_last_page() to take advantage of sparse address scanning improvements. Baolin Wang added the routine for arm64. Other architectures which could be optimized are: ia64, mips, parisc, powerpc, s390, sh and sparc. This patch (of 4): HugeTLB address ranges are linearly scanned during fork, unmap and remap operations. If a non-present entry is encountered, the code currently continues to the next huge page aligned address. However, a non-present entry implies that the page table page for that entry is not present. Therefore, the linear scan can skip to the end of range mapped by the page table page. This can speed operations on large sparsely populated hugetlb mappings. Create a new routine hugetlb_mask_last_page() that will return an address mask. When the mask is ORed with an address, the result will be the address of the last huge page mapped by the associated page table page. Use this mask to update addresses in routines which linearly scan hugetlb address ranges when a non-present pte is encountered. hugetlb_mask_last_page is related to the implementation of huge_pte_offset as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. This patch only provides a complete hugetlb_mask_last_page implementation when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which provide their own versions of huge_pte_offset can also provide their own version of hugetlb_mask_last_page. Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Alex Sierra | 6077c943be | mm: rename is_pinnable_page() to is_longterm_pinnable_page() Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory mapping", v9. This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE. This patch series is mostly self-contained except for a few places where it needs to update other subsystems to handle the new memory type. System stability and performance are not affected according to our ongoing testing, including xfstests. How it works: The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for this hardware page migration capability is the Frontier supercomputer project. This functionality is not AMD-specific. We expect other GPU vendors to find this functionality useful, and possibly other hardware types in the future. Our test nodes in the lab are similar to the Frontier configuration, with .5 TB of system memory plus 256 GB of device memory split across 4 GPUs, all in a single coherent address space. Page migration is expected to improve application efficiency significantly. We will report empirical results as they become available. Coherent device type pages at gup are now migrated back to system memory if they are being pinned long-term (FOLL_LONGTERM). The reason is, that long-term pinning would interfere with the device memory manager owning the device-coherent pages (e.g. evictions in TTM). These series incorporate Alistair Popple patches to do this migration from pin_user_pages() calls. hmm_gup_test has been added to hmm-test to test different get user pages calls. This series includes handling of device-managed anonymous pages returned by vm_normal_pages. Although they behave like normal pages for purposes of mapping in CPU page tables and for COW, they do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. This patch (of 14): is_pinnable_page() and folio_is_pinnable() are renamed to is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively. These functions are used in the FOLL_LONGTERM flag context. Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Qi Zheng | 18f3962953 | mm: hugetlb: kill set_huge_swap_pte_at() Commit  | ||
|  Muchun Song | dc2628f395 | mm: hugetlb: remove minimum_order variable commit  | ||
|  Baolin Wang | 8edaec0756 | mm/hugetlb: remove unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte() There is no need to update the hugetlb access flags after just setting the hugetlb page table entry by set_huge_pte_at(), since the page table entry value has no changes. Thus remove the unnecessary huge_ptep_set_access_flags() in hugetlb_mcopy_atomic_pte(). Link: https://lkml.kernel.org/r/f3e28b897b53a69967a8b98a6fdcda3be80c9229.1653616175.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Miaohe Lin | ad1ac596e8 | mm/migration: fix potential pte_unmap on an not mapped pte __migration_entry_wait and migration_entry_wait_on_locked assume pte is
always mapped from caller.  But this is not the case when it's called from
migration_entry_wait_huge and follow_huge_pmd.  Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.
Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
Fixes:  | ||
|  Miaohe Lin | 7ce82f4c3f | mm/migration: return errno when isolate_huge_page failed We might fail to isolate huge page due to e.g.  the page is under
migration which cleared HPageMigratable.  We should return errno in this
case rather than always return 1 which could confuse the user, i.e.  the
caller might think all of the memory is migrated while the hugetlb page is
left behind.  We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.
Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes:  | ||
|  Matthew Wilcox (Oracle) | d9ef44de5d | hugetlb: Convert huge_add_to_page_cache() to use a folio Remove the last caller of add_to_page_cache() Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> | ||
|  Mike Rapoport | ee65728e10 | docs: rename Documentation/vm to Documentation/mm so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn> | ||
|  Yang Yang | 662ce1dc9c | delayacct: track delays from write-protect copy Delay accounting does not track the delay of write-protect copy.  When
tasks trigger many write-protect copys(include COW and unsharing of
anonymous pages[1]), it may spend a amount of time waiting for them.  To
get the delay of tasks in write-protect copy, could help users to evaluate
the impact of using KSM or fork() or GUP.
Also update tools/accounting/getdelays.c:
    / # ./getdelays -dl -p 231
    print delayacct stats ON
    listen forever
    PID     231
    CPU             count     real total  virtual total    delay total  delay average
                     6247     1859000000     2154070021     1674255063          0.268ms
    IO              count    delay total  delay average
                        0              0              0ms
    SWAP            count    delay total  delay average
                        0              0              0ms
    RECLAIM         count    delay total  delay average
                        0              0              0ms
    THRASHING       count    delay total  delay average
                        0              0              0ms
    COMPACT         count    delay total  delay average
                        3          72758              0ms
    WPCOPY          count    delay total  delay average
                     3635      271567604              0ms
[1] commit 31cc5bc4af70("mm: support GUP-triggered unsharing of anonymous pages")
Link: https://lkml.kernel.org/r/20220409014342.2505532-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: wangyong <wang.yong12@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | 77fb622de1 | Six hotfixes.  One from Miaohe Lin is considered a minor thing so it isn't for -stable. The remainder address pre-5.19 issues and are cc:stable. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYpEC8gAKCRDdBJ7gKXxA jlukAQDCaXF7YTBjpoaAl0zhSu+5h7CawiB6cnRlq87/uJ2S4QD/eLVX3zfxI2DX YcOhc5H8BOgZ8ppD80Nv9qjmyvEWzAA= =ZFFG -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Six hotfixes. The page_table_check one from Miaohe Lin is considered a minor thing so it isn't marked for -stable. The remainder address pre-5.19 issues and are cc:stable" * tag 'mm-hotfixes-stable-2022-05-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page_table_check: fix accessing unmapped ptep kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add] mm/page_alloc: always attempt to allocate at least one page during bulk allocation hugetlb: fix huge_pmd_unshare address update zsmalloc: fix races between asynchronous zspage free and page migration Revert "mm/cma.c: remove redundant cma_mutex lock" | ||
|  Mike Kravetz | 48381273f8 | hugetlb: fix huge_pmd_unshare address update The routine huge_pmd_unshare() is passed a pointer to an address
associated with an area which may be unshared.  If unshare is successful
this address is updated to 'optimize' callers iterating over huge page
addresses.  For the optimization to work correctly, address should be
updated to the last huge page in the unmapped/unshared area.  However, in
the common case where the passed address is PUD_SIZE aligned, the address
is incorrectly updated to the address of the preceding huge page.  That
wastes CPU cycles as the unmapped/unshared range is scanned twice.
Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Peter Xu | bc70fbf269 | mm/hugetlb: handle uffd-wp during fork() Firstly, we'll need to pass in dst_vma into copy_hugetlb_page_range() because for uffd-wp it's the dst vma that matters on deciding how we should treat uffd-wp protected ptes. We should recognize pte markers during fork and do the pte copy if needed. [lkp@intel.com: vma_needs_copy can be static] Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | 05e90bd05e | mm/hugetlb: only drop uffd-wp special pte if required As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if unmapping an entire vma or synchronized such that faults can not race with the unmap operation. This requires passing zap_flags all the way to the lowest level hugetlb unmap routine: __unmap_hugepage_range. In general, unmap calls originated in hugetlbfs code will pass the ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent faults. The exception is hole punch which will first unmap without any synchronization. Later when hole punch actually removes the page from the file, it will check to see if there was a subsequent fault and if so take the hugetlb fault mutex while unmapping again. This second unmap will pass in ZAP_FLAG_DROP_MARKER. The justification of "whether to apply ZAP_FLAG_DROP_MARKER flag when unmap a hugetlb range" is (IMHO): we should never reach a state when a page fault could errornously fault in a page-cache page that was wr-protected to be writable, even in an extremely short period. That could happen if e.g. we pass ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page faults after that call and before remove_inode_hugepages() is executed, the page cache can be mapped writable again in the small racy window, that can cause unexpected data overwritten. [peterx@redhat.com: fix sparse warning] Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues] Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | 60dfaad65a | mm/hugetlb: allow uffd wr-protect none ptes Teach hugetlbfs code to wr-protect none ptes just in case the page cache existed for that pte. Meanwhile we also need to be able to recognize a uffd-wp marker pte and remove it for uffd_wp_resolve. Since at it, introduce a variable "psize" to replace all references to the huge page size fetcher. Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | c64e912c86 | mm/hugetlb: handle pte markers in page faults Allow hugetlb code to handle pte markers just like none ptes. It's mostly there, we just need to make sure we don't assume hugetlb_no_page() only handles none pte, so when detecting pte change we should use pte_same() rather than pte_none(). We need to pass in the old_pte to do the comparison. Check the original pte to see whether it's a pte marker, if it is, we should recover uffd-wp bit on the new pte to be installed, so that the next write will be trapped by uffd. Link: https://lkml.kernel.org/r/20220405014909.14761-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | 5a90d5a103 | mm/hugetlb: handle UFFDIO_WRITEPROTECT This starts from passing cp_flags into hugetlb_change_protection() so hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests. huge_pte_clear_uffd_wp() is introduced to handle the case where the UFFDIO_WRITEPROTECT is requested upon migrating huge page entries. Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | 6041c69179 | mm/hugetlb: take care of UFFDIO_COPY_MODE_WP Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() thoughout the stack. Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is with UFFDIO_COPY. Hugetlb pages are only managed by hugetlbfs, so we're safe even without setting dirty bit in the huge pte if the page is installed as read-only. However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY pte (when UFFDIO_COPY_MODE_WP bit is set), not only to match what we do with shmem, but also because the page does contain dirty data that the kernel just copied from the userspace. Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peter Xu | 166f3ecc0d | mm/hugetlb: hook page faults for uffd write protection Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults. We do this slightly earlier than hugetlb_cow() so that we can avoid taking some extra locks that we definitely don't need. Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Baolin Wang | 3d0b95cd87 | mm: hugetlb: considering PMD sharing when flushing cache/TLBs This patchset fixes some cache flushing issues if PMD sharing is possible
for hugetlb pages, which were found by code inspection.  Meanwhile Mike
found the flush_cache_page() can not cover the whole size of a hugetlb
page on some architectures [1], so I added a new patch 3 to fix this
issue, since I found only try_to_unmap_one() and try_to_migrate_one() need
to fix after some investigation.
[1] https://lore.kernel.org/linux-mm/064da3bb-5b4b-7332-a722-c5a541128705@oracle.com/
This patch (of 3):
When moving hugetlb page tables, the cache flushing is called in
move_page_tables() without considering the shared PMDs, which may be cause
cache issues on some architectures.
Thus we should move the hugetlb cache flushing into
move_hugetlb_page_tables() with considering the shared PMDs ranges,
calculated by adjust_range_if_pmd_sharing_possible().  Meanwhile also
expanding the TLBs flushing range in case of shared PMDs.
Note this is discovered via code inspection, and did not meet a real
problem in practice so far.
Link: https://lkml.kernel.org/r/cover.1651056365.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/0443c8cf20db554d3ff4b439b30e0ff26c0181dd.1651056365.git.baolin.wang@linux.alibaba.com
Fixes:  | ||
|  David Hildenbrand | b6a2619c60 | mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning Let's verify when (un)pinning anonymous pages that we always deal with exclusive anonymous pages, which guarantees that we'll have a reliable PIN, meaning that we cannot end up with the GUP pin being inconsistent with he pages mapped into the page tables due to a COW triggered by a write fault. When pinning pages, after conditionally triggering GUP unsharing of possibly shared anonymous pages, we should always only see exclusive anonymous pages. Note that anonymous pages that are mapped writable must be marked exclusive, otherwise we'd have a BUG. When pinning during ordinary GUP, simply add a check after our conditional GUP-triggered unsharing checks. As we know exactly how the page is mapped, we know exactly in which page we have to check for PageAnonExclusive(). When pinning via GUP-fast we have to be careful, because we can race with fork(): verify only after we made sure via the seqcount that we didn't race with concurrent fork() that we didn't end up pinning a possibly shared anonymous page. Similarly, when unpinning, verify that the pages are still marked as exclusive: otherwise something turned the pages possibly shared, which can result in random memory corruptions, which we really want to catch. With only the pinned pages at hand and not the actual page table entries we have to be a bit careful: hugetlb pages are always mapped via a single logical page table entry referencing the head page and PG_anon_exclusive of the head page applies. Anon THP are a bit more complicated, because we might have obtained the page reference either via a PMD or a PTE -- depending on the mapping type we either have to check PageAnonExclusive of the head page (PMD-mapped THP) or the tail page (PTE-mapped THP) applies: as we don't know and to make our life easier, check that either is set. Take care to not verify in case we're unpinning during GUP-fast because we detected concurrent fork(): we might stumble over an anonymous page that is now shared. Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | a7f2266041 | mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.
The possible ways to deal with this situation are:
 (1) Ignore and pin -- what we do right now.
 (2) Fail to pin -- which would be rather surprising to callers and
     could break user space.
 (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
     pins.
Let's implement 3) because it provides the clearest semantics and allows
for checking in unpin_user_pages() and friends for possible BUGs: when
trying to unpin a page that's no longer exclusive, clearly something went
very wrong and might result in memory corruptions that might be hard to
debug.  So we better have a nice way to spot such issues.
This change implies that whenever user space *wrote* to a private mapping
(IOW, we have an anonymous page mapped), that GUP pins will always remain
consistent: reliable R/O GUP pins of anonymous pages.
As a side note, this commit fixes the COW security issue for hugetlb with
FOLL_PIN as documented in:
  https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
instead of FOLL_PIN.
Note that follow_huge_pmd() doesn't apply because we cannot end up in
there with FOLL_PIN.
This commit is heavily based on prototype patches by Andrea.
Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | c89357e27f | mm: support GUP-triggered unsharing of anonymous pages Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.
The possible ways to deal with this situation are:
 (1) Ignore and pin -- what we do right now.
 (2) Fail to pin -- which would be rather surprising to callers and
     could break user space.
 (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
     pins.
We want to implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly something
went very wrong and might result in memory corruptions that might be hard
to debug.  So we better have a nice way to spot such issues.
To implement 3), we need a way for GUP to trigger unsharing:
FAULT_FLAG_UNSHARE.  FAULT_FLAG_UNSHARE is only applicable to R/O mapped
anonymous pages and resembles COW logic during a write fault.  However, in
contrast to a write fault, GUP-triggered unsharing will, for example,
still maintain the write protection.
Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
fault handlers for all applicable anonymous page types: ordinary pages,
THP and hugetlb.
* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
  marked exclusive in the meantime by someone else, there is nothing to do.
* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
  marked exclusive, it will try detecting if the process is the exclusive
  owner. If exclusive, it can be set exclusive similar to reuse logic
  during write faults via page_move_anon_rmap() and there is nothing
  else to do; otherwise, we either have to copy and map a fresh,
  anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
  THP.
This commit is heavily based on patches by Andrea.
Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | 8909691b6c | mm/gup: disallow follow_page(FOLL_PIN) We want to change the way we handle R/O pins on anonymous pages that might be shared: if we detect a possibly shared anonymous page -- mapped R/O and not !PageAnonExclusive() -- we want to trigger unsharing via a page fault, resulting in an exclusive anonymous page that can be pinned reliably without getting replaced via COW on the next write fault. However, the required page fault will be problematic for follow_page(): in contrast to ordinary GUP, follow_page() doesn't trigger faults internally. So we would have to end up failing a R/O pin via follow_page(), although there is something mapped R/O into the page table, which might be rather surprising. We don't seem to have follow_page(FOLL_PIN) users, and it's a purely internal MM function. Let's just make our life easier and the semantics of follow_page() clearer by just disallowing FOLL_PIN for follow_page() completely. Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | 6c287605fd | mm: remember exclusively mapped anonymous pages with PG_anon_exclusive Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the page table
entry gets write-protected.
With that information at hand, we can extend our COW logic to always reuse
anonymous pages that are exclusive.  For anonymous pages that might be
shared, the existing logic applies.
As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry.  Especially PTE vs.  PMD-mapped
anonymous pages require more thought, some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page
tables exclusively in a single process -- multiple page table locks apply.
Further, due to MADV_WIPEONFORK we might not necessarily write-protect
all PTEs, and only some subpages might be pinned.  Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and not having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP.
For simplicity, this commit mostly talks about "anonymous pages", while
it's for THP actually "the part of an anonymous folio referenced via a
page table entry".
To not spill PG_anon_exclusive code all over the mm code-base, we let the
anon rmap code to handle all PG_anon_exclusive logic it can easily handle.
If a writable, present page table entry points at an anonymous (sub)page,
that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliably
pin (FOLL_PIN) on an anonymous page references via a present page table
entry, it must only pin if PG_anon_exclusive is set for the mapped
(sub)page.
This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE, follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
reliable.
Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared.  Clearing will fail if there are GUP pins on the page:
* For fork(), this means having to copy the page and not being able to
  share it.  fork() protects against concurrent GUP using the PT lock and
  the src_mm->write_protect_seq.
* For KSM, this means sharing will fail.  For swap this means, unmapping
  will fail, For migration this means, migration will fail early.  All
  three cases protect against concurrent GUP using the PT lock and a
  proper clear/invalidate+flush of the relevant page table entry.
This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the successive write fault ends up
replacing the page instead of reusing it.  It improves the situation for
O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
fork() is *not* involved, however swapout and fork() are still
problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
users will fix the issue for them.
I. Details about basic handling
I.1. Fresh anonymous pages
page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
the mechanism fresh anonymous pages come into life (besides migration code
where we copy the page->mapping), all fresh anonymous pages will start out
as exclusive.
I.2. COW reuse handling of anonymous pages
When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it.  Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused.  If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.
Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
pages are a scarce resource.
I.3. Migration handling
try_to_migrate() has to try marking an exclusive anonymous page shared via
page_try_share_anon_rmap().  If it fails because there are GUP pins on the
page, unmap fails.  migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.
Writable migration entries implicitly point at shared anonymous pages. 
For readable migration entries that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.
When restoring a migration entry in remove_migration_pte(), information
about exlusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
I.4. Swapout handling
try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap().  If it fails because there are GUP pins on the
page, unmap fails.  For now, information about exclusivity is lost.  In
the future, we might want to remember that information in the swap entry
in some cases, however, it requires more thought, care, and a way to store
that information in swap entries.
I.5. Swapin handling
do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
I.6. THP handling
__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.
a) In case we have a readable-exclusive PMD migration entry, simply
   insert readable-exclusive PTE migration entries.
b) In case we have a present PMD entry and we don't want to freeze
   ("convert to migration entries"), simply forward PG_anon_exclusive to
   all sub-pages, no need to temporarily clear the bit.
c) In case we have a present PMD entry and want to freeze, handle it
   similar to try_to_migrate(): try marking the page shared first.  In
   case we fail, we ignore the "freeze" instruction and simply split
   ordinarily.  try_to_migrate() will properly fail because the THP is
   still mapped via PTEs.
When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.
I.7.  fork() handling fork() handling is relatively easy, because
PG_anon_exclusive is only expressive for some page table entry types.
a) Present anonymous pages
page_try_dup_anon_rmap() will mark the given subpage shared -- which will
fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
PMD to handle it on the PTE level).
Note that device exclusive entries are just a pointer at a PageAnon()
page.  fork() will first convert a device exclusive entry to a present
page table and handle it just like present anonymous pages.
b) Device private entry
Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned.
page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
fail because they cannot get pinned.
c) HW poison entries
PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.
d) Migration entries
Writable and readable-exclusive entries are converted to readable entries:
possibly shared.
I.8. mprotect() handling
mprotect() only has to properly handle the new readable-exclusive
migration entry:
When write-protecting a migration entry that points at an anonymous page,
remember the information about exclusivity via the "readable-exclusive"
migration entry type.
II. Migration and GUP-fast
Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper clear/invalidate+flush
to make the following scenario impossible:
1. try_to_migrate() places a migration entry after checking for GUP pins
   and marks the page possibly shared.
2. GUP-fast pins the page due to lack of synchronization
3. fork() converts the "writable/readable-exclusive" migration entry into a
   readable migration entry
4. Migration fails due to the GUP pin (failing to freeze the refcount)
5. Migration entries are restored. PG_anon_exclusive is lost
-> We have a pinned page that is not marked exclusive anymore.
Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.
III. Swapout and GUP-fast
Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly shared
and synchronize against GUP-fast by a proper clear/invalidate+flush to
make the following scenario impossible:
1. try_to_unmap() places a swap entry after checking for GUP pins and
   clears exclusivity information on the page.
2. GUP-fast pins the page due to lack of synchronization.
-> We have a pinned page that is not marked exclusive anymore.
If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II would
apply.  This is future work.
Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | 78fbe906cc | mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it be
shared?  We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.
Introduce a way to mark an anonymous page as exclusive, with the ultimate
goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
lose consistency with the pages mapped into the page table, resulting in
reported memory corruptions.
Most pageflags already have semantics for anonymous pages, however,
PG_mappedtodisk should never apply to pages in the swapcache, so let's
reuse that flag.
As PG_has_hwpoisoned also uses that flag on the second tail page of a
compound page, convert it to PG_error instead, which is marked as
PF_NO_TAIL, so never used for tail pages.
Use custom page flag modification functions such that we can do additional
sanity checks.  The semantics we'll put into some kernel doc in the future
are:
"
  PG_anon_exclusive is *usually* only expressive in combination with a
  page table entry. Depending on the page table entry type it might
  store the following information:
       Is what's mapped via this page table entry exclusive to the
       single process and can be mapped writable without further
       checks? If not, it might be shared and we might have to COW.
  For now, we only expect PTE-mapped THPs to make use of
  PG_anon_exclusive in subpages. For other anonymous compound
  folios (i.e., hugetlb), only the head page is logically mapped and
  holds this information.
  For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
  set on the head page. When replacing the PMD by a page table full
  of PTEs, PG_anon_exclusive, if set on the head page, will be set on
  all tail pages accordingly. Note that converting from a PTE-mapping
  to a PMD mapping using the same compound page is currently not
  possible and consequently doesn't require care.
  If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
  it should only pin if the relevant PG_anon_exclusive is set. In that
  case, the pin will be fully reliable and stay consistent with the pages
  mapped into the page table, as the bit cannot get cleared (e.g., by
  fork(), KSM) while the page is pinned. For anonymous pages that
  are mapped R/W, PG_anon_exclusive can be assumed to always be set
  because such pages cannot possibly be shared.
  The page table lock protecting the page table entry is the primary
  synchronization mechanism for PG_anon_exclusive; GUP-fast that does
  not take the PT lock needs special care when trying to clear the
  flag.
  Page table entry types and PG_anon_exclusive:
  * Present: PG_anon_exclusive applies.
  * Swap: the information is lost. PG_anon_exclusive was cleared.
  * Migration: the entry holds this information instead.
               PG_anon_exclusive was cleared.
  * Device private: PG_anon_exclusive applies.
  * Device exclusive: PG_anon_exclusive applies.
  * HW Poison: PG_anon_exclusive is stale and not changed.
  If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
  not allowed and the flag will stick around until the page is freed
  and folio->mapping is cleared.
"
We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries, page freeing code will handle that when
also invalidate page->mapping to not indicate PageAnon() anymore.  Letting
information about exclusivity stick around will be an important property
when adding sanity checks to unpinning code.
Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.
Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | fb3d824d1a | mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() ... and move the special check for pinned pages into page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages via a new pageflag, clearing it only after making sure that there are no GUP pins on the anonymous page. We really only care about pins on anonymous pages, because they are prone to getting replaced in the COW handler once mapped R/O. For !anon pages in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about that, at least not that I could come up with an example. Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we know we're dealing with anonymous pages. Also, drop the handling of pinned pages from copy_huge_pud() and add a comment if ever supporting anonymous pages on the PUD level. This is a preparation for tracking exclusivity of anonymous pages in the rmap code, and disallowing marking a page shared (-> failing to duplicate) if there are GUP pins on a page. Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  David Hildenbrand | 623a1ddfeb | mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() Let's do it just like copy_page_range(), taking the seqlock and making sure the mmap_lock is held in write mode. This allows for add a VM_BUG_ON to page_needs_cow_for_dma() and properly synchronizes concurrent fork() with GUP-fast of hugetlb pages, which will be relevant for further changes. Link: https://lkml.kernel.org/r/20220428083441.37290-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Baolin Wang | 9c8bbfaca1 | mm: hugetlb: add missing cache flushing in hugetlb_unshare_all_pmds() Missed calling flush_cache_range() before removing the sharing PMD
entrires, otherwise data consistence issue may be occurred on some
architectures whose caches are strict and require a virtual>physical
translation to exist for a virtual address.  Thus add it.
Now no architectures enabling PMD sharing will be affected, since they do
not have a VIVT cache.  That means this issue can not be happened in
practice so far.
Link: https://lkml.kernel.org/r/47441086affcabb6ecbe403173e9283b0d904b38.1650956489.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/419b0e777c9e6d1454dcd906e0f5b752a736d335.1650781755.git.baolin.wang@linux.alibaba.com
Fixes:  | ||
|  Peng Liu | 30a514002d | mm: use for_each_online_node and node_online instead of open coding Use more generic functions to deal with issues related to online nodes. The changes will make the code simplified. Link: https://lkml.kernel.org/r/20220429030218.644635-1-liupeng256@huawei.com Signed-off-by: Peng Liu <liupeng256@huawei.com> Suggested-by: Davidlohr Bueso <dave@stgolabs.net> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peng Liu | f81f6e4b5e | hugetlb: fix return value of __setup handlers When __setup() return '0', using invalid option values causes the entire kernel boot option string to be reported as Unknown. Hugetlb calls __setup() and will return '0' when set invalid parameter string. The following phenomenon is observed: cmdline: hugepagesz=1Y hugepages=1 dmesg: HugeTLB: unsupported hugepagesz=1Y HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring Unknown kernel command line parameters "hugepagesz=1Y hugepages=1" Since hugetlb will print warning/error information before return for invalid parameter string, just use return '1' to avoid print again. Link: https://lkml.kernel.org/r/20220413032915.251254-4-liupeng256@huawei.com Signed-off-by: Peng Liu <liupeng256@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Peng Liu | f87442f407 | hugetlb: fix hugepages_setup when deal with pernode Hugepages can be specified to pernode since "hugetlbfs: extend the
definition of hugepages parameter to support node allocation", but the
following problem is observed.
Confusing behavior is observed when both 1G and 2M hugepage is set
after "numa=off".
 cmdline hugepage settings:
  hugepagesz=1G hugepages=0:3,1:3
  hugepagesz=2M hugepages=0:1024,1:1024
 results:
  HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
  HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages
Furthermore, confusing behavior can be also observed when an invalid node
behind a valid node.  To fix this, never allocate any typical hugepage
when an invalid parameter is received.
Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
Fixes:  | ||
|  Peng Liu | 0a7a0f6f7f | hugetlb: fix wrong use of nr_online_nodes Patch series "hugetlb: Fix some incorrect behavior", v3. This series fix three bugs of hugetlb: 1) Invalid use of nr_online_nodes; 2) Inconsistency between 1G hugepage and 2M hugepage; 3) Useless information in dmesg. This patch (of 4): Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes. However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes". As David suggested: if (tmp >= nr_online_nodes) goto invalid; Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong. Recheck all the places that use nr_online_nodes, and repair them one by one. [liupeng256@huawei.com: v4] Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com Fixes: | ||
|  Muchun Song | 5981611d0a | mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functions Patch series "cleanup hugetlb_vmemmap". The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize" is more clear. In this series, cheanup related codes to make it more clear and expressive. This is suggested by David. This patch (of 3): The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". And some function names are prefixed with "huge_page" instead of "hugetlb", it is easily to be confused with THP. In this patch, cheanup related functions to make code more clear and expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220404074652.68024-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Jakob Koschel | 84448c8ecd | hugetlb: remove use of list iterator variable after loop In preparation to limit the scope of the list iterator to the list traversal loop, use a dedicated pointer to iterate through the list [1]. Before hugetlb_resv_map_add() was expecting a file_region struct, but in case the list iterator in add_reservation_in_range() did not exit early, the variable passed in, is not actually a valid structure. In such a case 'rg' is computed on the head element of the list and represents an out-of-bounds pointer. This still remains safe *iff* you only use the link member (as it is done in hugetlb_resv_map_add()). To avoid the type-confusion altogether and limit the list iterator to the loop, only a list_head pointer is kept to pass to hugetlb_resv_map_add(). Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20220331224323.903842-1-jakobkoschel@gmail.com Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com> Cc: Cristiano Giuffrida <c.giuffrida@vu.nl> Cc: "Bos, H.J." <h.j.bos@vu.nl> Cc: Jakob Koschel <jakobkoschel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Naoya Horiguchi | b283d983a7 | mm, hugetlb, hwpoison: separate branch for free and in-use hugepage We know that HPageFreed pages should have page refcount 0, so get_page_unless_zero() always fails and returns 0. So explicitly separate the branch based on page state for minor optimization and better readability. Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Naoya Horiguchi | 405ce05123 | mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
  CPU 1                                   CPU 2
  memory_failure_hugetlb
  struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.
  get_hwpoison_page -- page is not what we want now...
The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes:  | ||
|  Mike Kravetz | 5a317412ef | hugetlb: do not demote poisoned hugetlb pages It is possible for poisoned hugetlb pages to reside on the free lists.
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages.  There is no such check
and avoidance in the demote code path.
If a hugetlb page on the is on a free list, poison will only be set in
the head page rather then the page with the actual error.  If such a
page is demoted, then the poison flag may follow the wrong page.  A page
without error could have poison set, and a page with poison could not
have the flag set.
Check for poison before attempting to demote a hugetlb page.  Also,
return -EBUSY to the caller if only poisoned pages are on the free list.
Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Linus Torvalds | 9030fb0bb9 | Folio changes for 5.18 - Rewrite how munlock works to massively reduce the contention
    on i_mmap_rwsem (Hugh Dickins):
    https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
  - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
    https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
  - Convert GUP to use folios and make pincount available for order-1
    pages. (Matthew Wilcox)
  - Convert a few more truncation functions to use folios (Matthew Wilcox)
  - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
  - Convert rmap_walk to use folios (Matthew Wilcox)
  - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
  - Add support for creating large folios in readahead (Matthew Wilcox)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
 gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
 P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
 s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
 Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
 CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
 v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
 =n9Ad
 -----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
 - Rewrite how munlock works to massively reduce the contention on
   i_mmap_rwsem (Hugh Dickins):
     https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
 - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
   Hellwig):
     https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
 - Convert GUP to use folios and make pincount available for order-1
   pages. (Matthew Wilcox)
 - Convert a few more truncation functions to use folios (Matthew
   Wilcox)
 - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
   Wilcox)
 - Convert rmap_walk to use folios (Matthew Wilcox)
 - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
 - Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
  mm/damon: minor cleanup for damon_pa_young
  selftests/vm/transhuge-stress: Support file-backed PMD folios
  mm/filemap: Support VM_HUGEPAGE for file mappings
  mm/readahead: Switch to page_cache_ra_order
  mm/readahead: Align file mappings for non-DAX
  mm/readahead: Add large folio readahead
  mm: Support arbitrary THP sizes
  mm: Make large folios depend on THP
  mm: Fix READ_ONLY_THP warning
  mm/filemap: Allow large folios to be added to the page cache
  mm: Turn can_split_huge_page() into can_split_folio()
  mm/vmscan: Convert pageout() to take a folio
  mm/vmscan: Turn page_check_references() into folio_check_references()
  mm/vmscan: Account large folios correctly
  mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  mm/vmscan: Free non-shmem folios without splitting them
  mm/rmap: Constify the rmap_walk_control argument
  mm/rmap: Convert rmap_walk() to take a folio
  mm: Turn page_anon_vma() into folio_anon_vma()
  mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
  ... | ||
|  Nadav Amit | 824ddc601a | userfaultfd: provide unmasked address on page-fault Userfaultfd is supposed to provide the full address (i.e., unmasked) of
the faulting access back to userspace.  However, that is not the case for
quite some time.
Even running "userfaultfd_demo" from the userfaultfd man page provides the
wrong output (and contradicts the man page).  Notice that
"UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
not the first read address (0x7fc5e30b300f).
	Address returned by mmap() = 0x7fc5e30b3000
	fault_handler_thread():
	    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
	    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
		(uffdio_copy.copy returned 4096)
	Read address 0x7fc5e30b300f in main(): A
	Read address 0x7fc5e30b340f in main(): A
	Read address 0x7fc5e30b380f in main(): A
	Read address 0x7fc5e30b3c0f in main(): A
The exact address is useful for various reasons and specifically for
prefetching decisions.  If it is known that the memory is populated by
certain objects whose size is not page-aligned, then based on the faulting
address, the uffd-monitor can decide whether to prefetch and prefault the
adjacent page.
This bug has been for quite some time in the kernel: since commit
 | ||
|  David Howells | 4e936ecc01 | mm/hugetlb.c: export PageHeadHuge() Export PageHeadHuge() - it's used by folio_test_hugetlb() and thence by such as folio_file_page() and folio_contains(). Matthew suggested I use the first of those instead of doing the same calculation manually - but I can't call it from a module. Kirill suggested rearranging things to put it in a header, but that introduces header dependencies because of where constants are defined. [akpm@linux-foundation.org: s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/, per Christoph] Link: https://lkml.kernel.org/r/2494562.1646054576@warthog.procyon.org.uk Link: https://lore.kernel.org/r/163707085314.3221130.14783857863702203440.stgit@warthog.procyon.org.uk/ Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 98bc26ac77 | mm/hugetlb: use helper macro __ATTR_RW Use helper macro __ATTR_RW to define HSTATE_ATTR to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220222112731.33479-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | f9317f77a6 | hugetlb: clean up potential spectre issue warnings Recently introduced code allows numa nodes to be specified on the kernel command line for hugetlb allocations or CMA reservations. The node values are user specified and used as indicies into arrays. This generated the following smatch warnings: mm/hugetlb.c:4170 hugepages_setup() warn: potential spectre issue 'default_hugepages_in_node' [w] mm/hugetlb.c:4172 hugepages_setup() warn: potential spectre issue 'parsed_hstate->max_huge_pages_node' [w] mm/hugetlb.c:6898 cmdline_parse_hugetlb_cma() warn: potential spectre issue 'hugetlb_cma_size_in_node' [w] (local cap) Clean up by using array_index_nospec to sanitize array indicies. The routine cmdline_parse_hugetlb_cma has the same overflow/truncation issue addressed in [1]. That is also fixed with this change. [1] https://lore.kernel.org/linux-mm/20220209134018.8242-1-liuyuntao10@huawei.com/ As Michal pointed out, this is unlikely to be exploitable because it is __init code. But the patch suppresses the warnings. [mike.kravetz@oracle.com: v2] Link: https://lkml.kernel.org/r/20220218212946.35441-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220217234218.192885-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | 348923665a | mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte() folio_copy() will copy the data from one page to the target page, then
the target page will be mapped to the user space address, which might
have an alias issue with the kernel address used to copy the data from
the page to.  There are 2 ways to fix this issue.
 1) insert flush_dcache_page() after folio_copy().
 2) replace folio_copy() with copy_user_huge_page() which already
    considers the cache maintenance.
We chose 2) way to fix the issue since architectures can optimize this
situation.  It is also make backports easier.
Link: https://lkml.kernel.org/r/20220210123058.79206-5-songmuchun@bytedance.com
Fixes:  | ||
|  Anshuman Khandual | 16785bd774 | mm: merge pte_mkhuge() call into arch_make_huge_pte() Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte(). Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning. This updates generic fallback stub for arch_make_huge_pte() and available platforms definitions. This makes huge pte creation much cleaner and easier to follow. Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Matthew Wilcox (Oracle) | 822951d846 | mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head() follow_hugetlb_page() only cares about success or failure, so it doesn't need to know the type of the returned pointer, only whether it's NULL or not. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> | ||
|  Matthew Wilcox (Oracle) | 5232c63f46 | mm: Make compound_pincount always available Move compound_pincount from the third page to the second page, which means it's available for all compound pages. That lets us delete hpage_pincount_available(). On 32-bit systems, there isn't enough space for both compound_pincount and compound_nr in the second page (it would collide with page->private, which is in use for pages in the swap cache), so revert the optimisation of storing both compound_order and compound_nr on 32-bit systems. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> | ||
|  Liu Yuntao | e79ce98323 | hugetlbfs: fix a truncation issue in hugepages parameter When we specify a large number for node in hugepages parameter, it may
be parsed to another number due to truncation in this statement:
	node = tmp;
For example, add following parameter in command line:
	hugepagesz=1G hugepages=4294967297:5
and kernel will allocate 5 hugepages for node 1 instead of ignoring it.
I move the validation check earlier to fix this issue, and slightly
simplifies the condition here.
Link: https://lkml.kernel.org/r/20220209134018.8242-1-liuyuntao10@huawei.com
Fixes:  | ||
|  Aneesh Kumar K.V | db110a99d3 | mm/hugetlb: fix kernel crash with hugetlb mremap This fixes the below crash:
  kernel BUG at include/linux/mm.h:2373!
  cpu 0x5d: Vector: 700 (Program Check) at [c00000003c6e76e0]
      pc: c000000000581a54: pmd_to_page+0x54/0x80
      lr: c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0
      sp: c00000003c6e7980
     msr: 9000000000029033
    current = 0xc00000003bd8d980
    paca    = 0xc000200fff610100   irqmask: 0x03   irq_happened: 0x01
      pid   = 9349, comm = hugepage-mremap
  kernel BUG at include/linux/mm.h:2373!
    move_hugetlb_page_tables+0x4e4/0x5b0 (link register)
    move_hugetlb_page_tables+0x22c/0x5b0 (unreliable)
    move_page_tables+0xdbc/0x1010
    move_vma+0x254/0x5f0
    sys_mremap+0x7c0/0x900
    system_call_exception+0x160/0x2c0
the kernel can't use huge_pte_offset before it set the pte entry because
a page table lookup check for huge PTE bit in the page table to
differentiate between a huge pte entry and a pointer to pte page.  A
huge_pte_alloc won't mark the page table entry huge and hence kernel
should not use huge_pte_offset after a huge_pte_alloc.
Link: https://lkml.kernel.org/r/20220211063221.99293-1-aneesh.kumar@linux.ibm.com
Fixes:  | ||
|  Hugh Dickins | cea86fe246 | mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> | ||
|  Pasha Tatashin | 1eba86c096 | mm: change page type prior to adding page table entry Patch series "page table check", v3. Ensure that some memory corruptions are prevented by checking at the time of insertion of entries into user page tables that there is no illegal sharing. We have recently found a problem [1] that existed in kernel since 4.14. The problem was caused by broken page ref count and led to memory leaking from one process into another. The problem was accidentally detected by studying a dump of one process and noticing that one page contains memory that should not belong to this process. There are some other page->_refcount related problems that were recently fixed: [2], [3] which potentially could also lead to illegal sharing. In addition to hardening refcount [4] itself, this work is an attempt to prevent this class of memory corruption issues. It uses a simple state machine that is independent from regular MM logic to check for illegal sharing at time pages are inserted and removed from page tables. [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com This patch (of 4): There are a few places where we first update the entry in the user page table, and later change the struct page to indicate that this is anonymous or file page. In most places, however, we first configure the page metadata and then insert entries into the page table. Page table check, will use the information from struct page to verify the type of entry is inserted. Change the order in all places to first update struct page, and later to update page table. This means that we first do calls that may change the type of page (anon or file): page_move_anon_rmap page_add_anon_rmap do_page_add_anon_rmap page_add_new_anon_rmap page_add_file_rmap hugepage_add_anon_rmap hugepage_add_new_anon_rmap And after that do calls that add entries to the page table: set_huge_pte_at set_pte_at Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Turner <pjt@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Zhenguo Yao | 4178158ef8 | hugetlbfs: fix issue of preallocation of gigantic pages can't work Preallocation of gigantic pages can't work bacause of commit | ||
|  Nadav Amit | 13e4ad2ce8 | hugetlbfs: flush before unlock on move_hugetlb_page_tables() We must flush the TLB before releasing i_mmap_rwsem to avoid the
potential reuse of an unshared PMDs page.  This is not true in the case
of move_hugetlb_page_tables().  The last reference on the page table can
therefore be dropped before the TLB flush took place.
Prevent it by reordering the operations and flushing the TLB before
releasing i_mmap_rwsem.
Fixes:  | ||
|  Nadav Amit | a4a118f2ee | hugetlbfs: flush TLBs correctly after huge_pmd_unshare When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB
flush is missing.  This TLB flush must be performed before releasing the
i_mmap_rwsem, in order to prevent an unshared PMDs page from being
released and reused before the TLB flush took place.
Arguably, a comprehensive solution would use mmu_gather interface to
batch the TLB flushes and the PMDs page release, however it is not an
easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call
huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2)
deferring the release of the page reference for the PMDs page until
after i_mmap_rwsem is dropeed can confuse huge_pmd_unshare() into
thinking PMDs are shared when they are not.
Fix __unmap_hugepage_range() by adding the missing TLB flush, and
forcing a flush when unshare is successful.
Fixes:  | ||
|  Mina Almasry | cc30042df6 | hugetlb, userfaultfd: fix reservation restore on userfaultfd error Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
bail out using "goto out_release_unlock;" in the cases where idx >=
size, or !huge_pte_none(), the code will detect that new_pagecache_page
== false, and so call restore_reserve_on_error().  In this case I see
restore_reserve_on_error() delete the reservation, and the following
call to remove_inode_hugepages() will increment h->resv_hugepages
causing a 100% reproducible leak.
We should treat the is_continue case similar to adding a page into the
pagecache and set new_pagecache_page to true, to indicate that there is
no reservation to restore on the error path, and we need not call
restore_reserve_on_error().  Rename new_pagecache_page to
page_in_pagecache to make that clear.
Link: https://lkml.kernel.org/r/20211117193825.378528-1-almasrymina@google.com
Fixes:  | ||
|  Bui Quang Minh | afe041c2d0 | hugetlb: fix hugetlb cgroup refcounting during mremap When hugetlb_vm_op_open() is called during copy_vma(), we may take the
reference to resv_map->css.  Later, when clearing the reservation
pointer of old_vma after transferring it to new_vma, we forget to drop
the reference to resv_map->css.  This leads to a reference leak of css.
Fixes this by adding a check to drop reservation css reference in
clear_vma_resv_huge_pages()
Link: https://lkml.kernel.org/r/20211113154412.91134-1-minhquangbui99@gmail.com
Fixes:  | ||
|  Linus Torvalds | 512b7931ad | Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ... | ||
|  Zhenguo Yao | b5389086ad | hugetlbfs: extend the definition of hugepages parameter to support node allocation We can specify the number of hugepages to allocate at boot. But the hugepages is balanced in all nodes at present. In some scenarios, we only need hugepages in one node. For example: DPDK needs hugepages which are in the same node as NIC. If DPDK needs four hugepages of 1G size in node1 and system has 16 numa nodes we must reserve 64 hugepages on the kernel cmdline. But only four hugepages are used. The others should be free after boot. If the system memory is low(for example: 64G), it will be an impossible task. So extend the hugepages parameter to support specifying hugepages on a specific node. For example add following parameter: hugepagesz=1G hugepages=0:1,1:3 It will allocate 1 hugepage in node0 and 3 hugepages in node1. Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 2c0078a7d8 | hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page In commit | ||
|  Baolin Wang | 76efc67a5e | hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range() When calling hugetlb_resv_map_add(), we've guaranteed that the parameter 'to' is always larger than 'from', so it never returns a negative value from hugetlb_resv_map_add(). Thus remove the redundant VM_BUG_ON(). Link: https://lkml.kernel.org/r/2b565552f3d06753da1e8dda439c0d96d6d9a5a3.1634797639.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baolin Wang | 0739eb437f | hugetlb: remove redundant validation in has_same_uncharge_info() The callers of has_same_uncharge_info() has accessed the original file_region and new file_region, and they are impossible to be NULL now. So we can remove the file_region validation in has_same_uncharge_info() to simplify the code. Link: https://lkml.kernel.org/r/97fc68d3f8d34f63c204645e10d7a718997e50b7.1634797639.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baolin Wang | aa6d2e8cba | hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments After commit  | ||
|  Baolin Wang | 38e719ab26 | hugetlb: support node specified when using cma for gigantic hugepages Now the size of CMA area for gigantic hugepages runtime allocation is balanced for all online nodes, but we also want to specify the size of CMA per-node, or only one node in some cases, which are similar with patch [1]. For example, on some multi-nodes systems, each node's memory can be different, allocating the same size of CMA for each node is not suitable for the low-memory nodes. Meanwhile some workloads like DPDK mentioned by Zhenguo in patch [1] only need hugepages in one node. On the other hand, we have some machines with multiple types of memory, like DRAM and PMEM (persistent memory). On this system, we may want to specify all the hugepages only on DRAM node, or specify the proportion of DRAM node and PMEM node, to tuning the performance of the workloads. Thus this patch adds node format for 'hugetlb_cma' parameter to support specifying the size of CMA per-node. An example is as follows: hugetlb_cma=0:5G,2:5G which means allocating 5G size of CMA area on node 0 and node 2 respectively. And the users should use the node specific sysfs file to allocate the gigantic hugepages if specified the CMA size on that node. Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com [1] Link: https://lkml.kernel.org/r/bb790775ca60bb8f4b26956bb3f6988f74e075c7.1634261144.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 550a7d60bd | mm, hugepages: add mremap() support for hugepage backed vma Support mremap() for hugepage backed vma segment by simply repositioning page table entries. The page table entries are repositioned to the new virtual address on mremap(). Hugetlb mremap() support is of course generic; my motivating use case is a library (hugepage_text), which reloads the ELF text of executables in hugepages. This significantly increases the execution performance of said executables. Restrict the mremap operation on hugepages to up to the size of the original mapping as the underlying hugetlb reservation is not yet capable of handling remapping to a larger size. During the mremap() operation we detect pmd_share'd mappings and we unshare those during the mremap(). On access and fault the sharing is established again. Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ken Chen <kenchen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 8531fc6f52 | hugetlb: add hugetlb demote page support Demote page functionality will split a huge page into a number of huge pages of a smaller size. For example, on x86 a 1GB huge page can be demoted into 512 2M huge pages. Demotion is done 'in place' by simply splitting the huge page. Added '*_for_demote' wrappers for remove_hugetlb_page, destroy_compound_hugetlb_page and prep_compound_gigantic_page for use by demote code. [mike.kravetz@oracle.com: v4] Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@oracle.com Link: https://lkml.kernel.org/r/20211007181918.136982-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 34d9e35b13 | hugetlb: add demote bool to gigantic page routines The routines remove_hugetlb_page and destroy_compound_gigantic_page will remove a gigantic page and make the set of base pages ready to be returned to a lower level allocator. In the process of doing this, they make all base pages reference counted. The routine prep_compound_gigantic_page creates a gigantic page from a set of base pages. It assumes that all these base pages are reference counted. During demotion, a gigantic page will be split into huge pages of a smaller size. This logically involves use of the routines, remove_hugetlb_page, and destroy_compound_gigantic_page followed by prep_compound*_page for each smaller huge page. When pages are reference counted (ref count >= 0), additional speculative ref counts could be taken as described in previous commits [1] and [2]. This could result in errors while demoting a huge page. Quite a bit of code would need to be created to handle all possible issues. Instead of dealing with the possibility of speculative ref counts, avoid the possibility by keeping ref counts at zero during the demote process. Add a boolean 'demote' to the routines remove_hugetlb_page, destroy_compound_gigantic_page and prep_compound_gigantic_page. If the boolean is set, the remove and destroy routines will not reference count pages and the prep routine will not expect reference counted pages. '*_for_demote' wrappers of the routines will be added in a subsequent patch where this functionality is used. [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20210809184832.18342-3-mike.kravetz@oracle.com/ Link: https://lkml.kernel.org/r/20211007181918.136982-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | a01f43901c | hugetlb: be sure to free demoted CMA pages to CMA When huge page demotion is fully implemented, gigantic pages can be demoted to a smaller huge page size. For example, on x86 a 1G page can be demoted to 512 2M pages. However, gigantic pages can potentially be allocated from CMA. If a gigantic page which was allocated from CMA is demoted, the corresponding demoted pages needs to be returned to CMA. Use the new interface cma_pages_valid() to determine if a non-gigantic hugetlb page should be freed to CMA. Also, clear mapping field of these pages as expected by cma_release. This also requires a change to CMA region creation for gigantic pages. CMA uses a per-region bit map to track allocations. When setting up the region, you specify how many pages each bit represents. Currently, only gigantic pages are allocated/freed from CMA so the region is set up such that one bit represents a gigantic page size allocation. With demote, a gigantic page (allocation) could be split into smaller size pages. And, these smaller size pages will be freed to CMA. So, since the per-region bit map needs to be set up to represent the smallest allocation/free size, it now needs to be set to the smallest huge page size which can be freed to CMA. Unfortunately, we set up the CMA region for huge pages before we set up huge pages sizes (hstates). So, technically we do not know the smallest huge page size as this can change via command line options and architecture specific code. Therefore, at region setup time we use HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be given back to CMA. It is possible that this value is sub-optimal for some architectures/config options. If needed, this can be addressed in follow on work. Link: https://lkml.kernel.org/r/20211007181918.136982-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 79dfc69552 | hugetlb: add demote hugetlb page sysfs interfaces Patch series "hugetlb: add demote/split page functionality", v4.
The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.
When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.
In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain.  In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.
Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 GB pages on x86:
  gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
  m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
  echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
  echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
On an idle system this operation is fairly reliable and results are as
expected.  The number of 2MB pages is increased as expected and the time
of the operation is a second or two.
However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed. This is
   because the area freed by the larger page could quickly be
   fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:
  Unexpected number of 2MB pages allocated: Expected 25600, have 19944
  real    0m42.092s
  user    0m0.008s
  sys     0m41.467s
To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.
Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
 - demote_size
        Target page size for demotion, a smaller huge page size. File
        can be written to chose a smaller huge page size if multiple are
        available.
 - demote
        Writable number of hugetlb pages to be demoted
To demote 50 GB huge pages, one would:
  cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
  cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
  echo 50 > .../hugepages-1048576kB/demote
Only hugetlb pages which are free at the time of the request can be
demoted.  Demotion does not add to the complexity of surplus pages and
honors reserved huge pages.  Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted.  It is possible fewer will actually be demoted.  The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.
Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86.  Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced.  The occurrence of
these issues is dependent on other activity within the host and can not
be predicted.
This patch (of 5):
Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
  demote_size - The size in Kb that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
  Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 73c5476348 | mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h Remove __unmap_hugepage_range() from the header file, because it is only used in hugetlb.c. Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Matthew Wilcox (Oracle) | 715cbfd6c5 | mm/migrate: Add folio_migrate_copy() This is the folio equivalent of migrate_page_copy(), which is retained as a wrapper for filesystems which are not yet converted to folios. Also convert copy_huge_page() to folio_copy(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> | ||
|  Ben Widawsky | cfcaa66f80 | mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY Implement the missing huge page allocation functionality while obeying the preferred node semantics. This is similar to the implementation for general page allocation, as it uses a fallback mechanism to try multiple preferred nodes first, and then all other nodes. To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY. [akpm@linux-foundation.org: fix compiling issue when merging with other hugetlb patch] [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue] [mhocko@suse.com: suggest to remove the #ifdef CONFIG_NUMA check] [ben.widawsky@intel.com: add helpers to avoid ifdefs] Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.com [nathan@kernel.org: initialize page to NULL in alloc_buddy_huge_page_with_mpol()] Link: https://lkml.kernel.org/r/20210810200632.3812797-1-nathan@kernel.org Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.com Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Co-developed-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 09a26e8327 | hugetlb: fix hugetlb cgroup refcounting during vma split Guillaume Morin reported hitting the following WARNING followed by GPF or
NULL pointer deference either in cgroups_destroy or in the kill_css path.:
    percpu ref (css_release) <= 0 (-1) after switching to atomic
    WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
    CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G           O      5.10.60 #1
    RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
    Call Trace:
       rcu_core+0x30f/0x530
       rcu_core_si+0xe/0x10
       __do_softirq+0x103/0x2a2
       run_ksoftirqd+0x2b/0x40
       smpboot_thread_fn+0x11a/0x170
       kthread+0x10a/0x140
       ret_from_fork+0x22/0x30
Upon further examination, it was discovered that the css structure was
associated with hugetlb reservations.
For private hugetlb mappings the vma points to a reserve map that
contains a pointer to the css.  At mmap time, reservations are set up
and a reference to the css is taken.  This reference is dropped in the
vma close operation; hugetlb_vm_op_close.  However, if a vma is split no
additional reference to the css is taken yet hugetlb_vm_op_close will be
called twice for the split vma resulting in an underflow.
Fix by taking another reference in hugetlb_vm_op_open.  Note that the
reference is only taken for the owner of the reserve map.  In the more
common fork case, the pointer to the reserve map is cleared for
non-owning vmas.
Link: https://lkml.kernel.org/r/20210830215015.155224-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Mike Kravetz | e32d20c0c8 | hugetlb: before freeing hugetlb page set dtor to appropriate value When removing a hugetlb page from the pool the ref count is set to one (as
the free page has no ref count) and compound page destructor is set to
NULL_COMPOUND_DTOR.  Since a subsequent call to free the hugetlb page will
call __free_pages for non-gigantic pages and free_gigantic_page for
gigantic pages the destructor is not used.
However, consider the following race with code taking a speculative
reference on the page:
Thread 0				Thread 1
--------				--------
remove_hugetlb_page
  set_page_refcounted(page);
  set_compound_page_dtor(page,
           NULL_COMPOUND_DTOR);
					get_page_unless_zero(page)
__update_and_free_page
  __free_pages(page,
           huge_page_order(h));
		/* Note that __free_pages() will simply drop
		   the reference to the page. */
					put_page(page)
					  __put_compound_page()
					    destroy_compound_page
					      NULL_COMPOUND_DTOR
						BUG: kernel NULL pointer
						dereference, address:
						0000000000000000
To address this race, set the dtor to the normal compound page dtor for
non-gigantic pages.  The dtor for gigantic pages does not matter as
gigantic pages are changed from a compound page to 'just a group of pages'
before freeing.  Hence, the destructor is not used.
Link: https://lkml.kernel.org/r/20210809184832.18342-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | b65a4edae1 | hugetlb: drop ref count earlier after page allocation When discussing the possibility of inflated page ref counts, Muuchun Song pointed out this potential issue [1]. It is true that any code could potentially take a reference on a compound page after allocation and before it is converted to and put into use as a hugetlb page. Specifically, this could be done by any users of get_page_unless_zero. There are three areas of concern within hugetlb code. 1) When adding pages to the pool. In this case, new pages are allocated added to the pool by calling put_page to invoke the hugetlb destructor (free_huge_page). If there is an inflated ref count on the page, it will not be immediately added to the free list. It will only be added to the free list when the temporary ref count is dropped. This is deemed acceptable and will not be addressed. 2) A page is allocated for immediate use normally as a surplus page or migration target. In this case, the user of the page will also hold a reference. There is no issue as this is just like normal page ref counting. 3) A page is allocated and MUST be added to the free list to satisfy a reservation. One such example is gather_surplus_pages as pointed out by Muchun in [1]. More specifically, this case covers callers of enqueue_huge_page where the page reference count must be zero. This patch covers this third case. Three routines call enqueue_huge_page when the page reference count could potentially be inflated. They are: gather_surplus_pages, alloc_and_dissolve_huge_page and add_hugetlb_page. add_hugetlb_page is called on error paths when a huge page can not be freed due to the inability to allocate vmemmap pages. In this case, the temporairly inflated ref count is not an issue. When the ref is dropped the appropriate action will be taken. Instead of VM_BUG_ON if the ref count does not drop to zero, simply return. In gather_surplus_pages and alloc_and_dissolve_huge_page the caller expects a page (or pages) to be put on the free lists. In this case we must ensure there are no temporary ref counts. We do this by calling put_page_testzero() earlier and not using pages without a zero ref count. The temporary page flag (HPageTemporary) is used in such cases so that as soon as the inflated ref count is dropped the page will be freed. [1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20210809184832.18342-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 416d85ed3e | hugetlb: simplify prep_compound_gigantic_page ref count racing code Code in prep_compound_gigantic_page waits for a rcu grace period if it notices a temporarily inflated ref count on a tail page. This was due to the identified potential race with speculative page cache references which could only last for a rcu grace period. This is overly complicated as this situation is VERY unlikely to ever happen. Instead, just quickly return an error. Also, only print a warning in prep_compound_gigantic_page instead of multiple callers. Link: https://lkml.kernel.org/r/20210809184832.18342-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | c7b1850dfb | hugetlb: don't pass page cache pages to restore_reserve_on_error syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1]. This BUG triggers if the HPageRestoreReserve flag is set on a page in the page cache. It should never be set, as the routine huge_add_to_page_cache explicitly clears the flag after adding a page to the cache. The only code other than huge page allocation which sets the flag is restore_reserve_on_error. It will potentially set the flag in rare out of memory conditions. syzbot was injecting errors to cause memory allocation errors which exercised this specific path. The code in restore_reserve_on_error is doing the right thing. However, there are instances where pages in the page cache were being passed to restore_reserve_on_error. This is incorrect, as once a page goes into the cache reservation information will not be modified for the page until it is removed from the cache. Error paths do not remove pages from the cache, so even in the case of error, the page will remain in the cache and no reservation adjustment is needed. Modify routines that potentially call restore_reserve_on_error with a page cache page to no longer do so. Note on fixes tag: Prior to commit | ||
|  Joao Martins | d08af0a596 | mm/hugetlb: fix refs calculation from unaligned @vaddr Commit | ||
|  Alistair Popple | 4dd845b5a3 | mm/swapops: rework swap entry manipulation code Both migration and device private pages use special swap entries that are manipluated by a range of inline functions. The arguments to these are somewhat inconsistent so rework them to remove flag type arguments and to make the arguments similar for both read and write entry creation. Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 7118fc2906 | hugetlb: address ref count racing in prep_compound_gigantic_page In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative.  The root
cause of the possible race is prep_compound_gigantic_page uncondittionally
setting the ref count of pages to zero.  It does this because
prep_compound_gigantic_page is handed a 'group' of pages from an allocator
and needs to convert that group of pages to a compound page.  The ref
count of each page in this 'group' is one as set by the allocator.
However, the ref count of compound page tail pages must be zero.
The potential race comes about when ref counted pages are returned from
the allocator.  When this happens, other mm code could also take a
reference on the page.  __page_cache_add_speculative is one such example.
Therefore, prep_compound_gigantic_page can not just set the ref count of
pages to zero as it does today.  Doing so would lose the reference taken
by any other code.  This would lead to BUGs in code checking ref counts
and could possibly even lead to memory corruption.
There are two possible ways to address this issue.
1) Make all allocators of gigantic groups of pages be able to return a
   properly constructed compound page.
2) Make prep_compound_gigantic_page be more careful when constructing a
   compound page.
This patch takes approach 2.
In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
that the extra ref count will be driopped during a rcu grace period.  This
is not a performance critical code path and the wait should be
accceptable.  If the ref count is still inflated after the grace period,
then undo any modifications made and return an error.
Currently prep_compound_gigantic_page is type void and does not return
errors.  Modify the two callers to check for and handle error returns.  On
error, the caller must free the 'group' of pages as they can not be used
to form a gigantic page.  After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once.  Boot time
allocations can not be retried.
The routine prep_compound_page also unconditionally sets the ref count of
compound page tail pages to zero.  However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages.
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
remove it.
[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
Fixes:  | ||
|  Mike Kravetz | 48b8d744ea | hugetlb: remove prep_compound_huge_page cleanup Patch series "Fix prep_compound_gigantic_page ref count adjustment". These patches address the possible race between prep_compound_gigantic_page and __page_cache_add_speculative as described by Jann Horn in [1]. The first patch simply removes the unnecessary/obsolete helper routine prep_compound_huge_page to make the actual fix a little simpler. The second patch is the actual fix and has a detailed explanation in the commit message. This potential issue has existed for almost 10 years and I am unaware of anyone actually hitting the race. I did not cc stable, but would be happy to squash the patches and send to stable if anyone thinks that is a good idea. [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ This patch (of 2): I could not think of a reliable way to recreate the issue for testing. Rather, I 'simulated errors' to exercise all the error paths. The routine prep_compound_huge_page is a simple wrapper to call either prep_compound_gigantic_page or prep_compound_page. However, it is only called from gather_bootmem_prealloc which only processes gigantic pages. Eliminate the routine and call prep_compound_gigantic_page directly. Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Youquan Song <youquan.song@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 8cc5fcbb5b | mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller after
allocating a page that consumed a reservation.  In this case there may be
a fault that double consumes the reservation.  To handle this, we free the
allocated page, fix the reservations, and allocate a temporary hugetlb
page and return that to the caller.  When the caller does the copy outside
of the lock, we again check the cache, and allocate a page consuming the
reservation, and copy over the contents.
Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
        2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
Both tests succeed and produce no warnings. After the
test runs number of free/resv hugepages is correct.
[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
  Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
[almasrymina@google.com: fix allocation error check and copy func name]
  Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Christophe Leroy | 79c1c594f4 | mm/hugetlb: change parameters of arch_make_huge_pte() Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2. This series implements huge VMAP and VMALLOC on powerpc 8xx. Powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires use of hugepd tables. To allow this, the architecture provides two functions: - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what page size to use. A stub returning PAGE_SIZE is provided when the architecture doesn't provide this function. - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range() what page shift to use for a given area size. A stub returning PAGE_SHIFT is provided when the architecture doesn't provide this function. This patch (of 5): At the time being, arch_make_huge_pte() has the following prototype: pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, struct page *page, int writable); vma is used to get the pages shift or size. vma is also used on Sparc to get vm_flags. page is not used. writable is not used. In order to use this function without a vma, replace vma by shift and flags. Also remove the used parameters. Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | 774905878f | mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate All the infrastructure is ready, so we introduce nr_free_vmemmap_pages field in the hstate to indicate how many vmemmap pages associated with a HugeTLB page that can be freed to buddy allocator. And initialize it in the hugetlb_vmemmap_init(). This patch is actual enablement of the feature. There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP, so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | ad2fa3717b | mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:
 1) Failing to free a huge page triggered by the user (decrease nr_pages).
    User needs to try again later.
 2) Failing to free a surplus huge page when freed by the application.
    Try again later when freeing a huge page next time.
 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().
    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.
 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.
    Similar to 3). virito-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.
Vmemmap pages are allocated from the page freeing context.  In order for
those allocations to be not disruptive (e.g.  trigger oom killer)
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure.  GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | b65d4adbc0 | mm: hugetlb: defer freeing of HugeTLB pages In the subsequent patch, we should allocate the vmemmap pages when freeing a HugeTLB page. But update_and_free_page() can be called under any context, so we cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the actual freeing in a kworker to prevent from using GFP_ATOMIC to allocate the vmemmap pages. The __update_and_free_page() is where the call to allocate vmemmmap pages will be inserted. Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | f41f2ed43c | mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [willy@infradead.org: fix documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Naoya Horiguchi | 0ed950d1f2 | mm,hwpoison: make get_hwpoison_page() call get_any_page() __get_hwpoison_page() could fail to grab refcount by some race condition, so it's helpful if we can handle it by retrying. We already have retry logic, so make get_hwpoison_page() call get_any_page() when called from memory_failure(). As a result, get_hwpoison_page() can return negative values (i.e. error code), so some callers are also changed to handle error cases. soft_offline_page() does nothing for -EBUSY because that's enough and users in userspace can easily handle it. unpoison_memory() is also unchanged because it's broken and need thorough fixes (will be done later). Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Hugh Dickins | fe19bd3dae | mm, futex: fix shared futex pgoff on shmem huge page If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit | ||
|  Mike Kravetz | 846be08578 | mm/hugetlb: expand restore_reserve_on_error functionality The routine restore_reserve_on_error is called to restore reservation
information when an error occurs after page allocation.  The routine
alloc_huge_page modifies the mapping reserve map and potentially the
reserve count during allocation.  If code calling alloc_huge_page
encounters an error after allocation and needs to free the page, the
reservation information needs to be adjusted.
Currently, restore_reserve_on_error only takes action on pages for which
the reserve count was adjusted(HPageRestoreReserve flag).  There is
nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
modifies the reserve map during allocation even if the reserve count is
not adjusted.  This can cause issues as observed during development of
this patch [1].
One specific series of operations causing an issue is:
 - Create a shared hugetlb mapping
   Reservations for all pages created by default
 - Fault in a page in the mapping
   Reservation exists so reservation count is decremented
 - Punch a hole in the file/mapping at index previously faulted
   Reservation and any associated pages will be removed
 - Allocate a page to fill the hole
   No reservation entry, so reserve count unmodified
   Reservation entry added to map by alloc_huge_page
 - Error after allocation and before instantiating the page
   Reservation entry remains in map
 - Allocate a page to fill the hole
   Reservation entry exists, so decrement reservation count
This will cause a reservation count underflow as the reservation count
was decremented twice for the same index.
A user would observe a very large number for HugePages_Rsvd in
/proc/meminfo.  This would also likely cause subsequent allocations of
hugetlb pages to fail as it would 'appear' that all pages are reserved.
This sequence of operations is unlikely to happen, however they were
easily reproduced and observed using hacked up code as described in [1].
Address the issue by having the routine restore_reserve_on_error take
action on pages where HPageRestoreReserve is not set.  In this case, we
need to remove any reserve map entry created by alloc_huge_page.  A new
helper routine vma_del_reservation assists with this operation.
There are three callers of alloc_huge_page which do not currently call
restore_reserve_on error before freeing a page on error paths.  Add
those missing calls.
[1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Naoya Horiguchi | 25182f05ff | mm,hwpoison: fix race with hugetlb page allocation When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:
    CPU0:                           CPU1:
                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)
__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.
So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).
Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes:  | ||
|  Mina Almasry | d84cf06e3d | mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY The userfaultfd hugetlb tests cause a resv_huge_pages underflow.  This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
an index for which we already have a page in the cache.  When this
happens, we allocate a second page, double consuming the reservation,
and then fail to insert the page into the cache and return -EEXIST.
To fix this, we first check if there is a page in the cache which
already consumed the reservation, and return -EEXIST immediately if so.
There is still a rare condition where we fail to copy the page contents
AND race with a call for hugetlb_no_page() for this index and again we
will underflow resv_huge_pages.  That is fixed in a more complicated
patch not targeted for -stable.
Test:
  Hacked the code locally such that resv_huge_pages underflows produce a
  warning, then:
  ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
  ./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
Both tests succeed and produce no warnings.  After the test runs number
of free/resv hugepages is correct.
[mike.kravetz@oracle.com: changelog fixes]
Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com
Fixes:  | ||
|  Naoya Horiguchi | 0c5da35723 | hugetlb: pass head page to remove_hugetlb_page() When memory_failure() or soft_offline_page() is called on a tail page of
some hugetlb page, "BUG: unable to handle page fault" error can be
triggered.
remove_hugetlb_page() dereferences page->lru, so it's assumed that the
page points to a head page, but one of the caller,
dissolve_free_huge_page(), provides remove_hugetlb_page() with 'page'
which could be a tail page.  So pass 'head' to it, instead.
Link: https://lkml.kernel.org/r/20210526235257.2769473-1-nao.horiguchi@gmail.com
Fixes:  | ||
|  Peter Xu | 84894e1c42 | mm/hugetlb: fix cow where page writtable in child When rework early cow of pinned hugetlb pages, we moved huge_ptep_get()
upper but overlooked a side effect that the huge_ptep_get() will fetch the
pte after wr-protection.  After moving it upwards, we need explicit
wr-protect of child pte or we will keep the write bit set in the child
process, which could cause data corrution where the child can write to the
original page directly.
This issue can also be exposed by "memfd_test hugetlbfs" kselftest.
Link: https://lkml.kernel.org/r/20210503234356.9097-3-peterx@redhat.com
Fixes:  | ||
|  Ingo Molnar | f0953a1bba | mm: fix typos in comments Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | 8e3560d963 | mm: honor PF_MEMALLOC_PIN for all movable pages PF_MEMALLOC_PIN is only honored for CMA pages, extend this flag to work for any allocations from ZONE_MOVABLE by removing __GFP_MOVABLE from gfp_mask when this flag is passed in the current context. Add is_pinnable_page() to return true if page is in a pinnable page. A pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type. Link: https://lkml.kernel.org/r/20210215161349.246722-8-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | 1a08ae36cf | mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not return pages that might belong to CMA region. This is currently used for long term gup to make sure that such pins are not going to be done on any CMA pages. When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it is focusing on CMA pages too much and that there is larger class of pages that need the same treatment. MOVABLE zone cannot contain any long term pins as well so it makes sense to reuse and redefine this flag for that usecase as well. Rename the flag to PF_MEMALLOC_PIN which defines an allocation context which can only get pages suitable for long-term pins. Also rename: memalloc_nocma_save()/memalloc_nocma_restore to memalloc_pin_save()/memalloc_pin_restore() and make the new functions common. [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN] Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Axel Rasmussen | f619147104 | userfaultfd: add UFFDIO_CONTINUE ioctl This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Axel Rasmussen | 714c189108 | userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled For background, mm/userfaultfd.c provides a general mcopy_atomic implementation. But some types of memory (i.e., hugetlb and shmem) need a slightly different implementation, so they provide their own helpers for this. In other words, userfaultfd is the only caller of these functions. This patch achieves two things: 1. Don't spend time compiling code which will end up never being referenced anyway (a small build time optimization). 2. In patches later in this series, we extend the signature of these helpers with UFFD-specific state (a mode enumeration). Once this happens, we *have to* either not compile the helpers, or unconditionally define the UFFD-only state (which seems messier to me). This includes the declarations in the headers, as otherwise they'd yield warnings about implicitly defining the type of those arguments. Link: https://lkml.kernel.org/r/20210301222728.176417-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Axel Rasmussen | 7677f7fd8b | userfaultfd: add minor fault registration mode Patch series "userfaultfd: add minor fault handling", v9. Overview ======== This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS. When enabled (via the UFFDIO_API ioctl), this feature means that any hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also* get events for "minor" faults. By "minor" fault, I mean the following situation: Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory). One of the mappings is registered with userfaultfd (in minor mode), and the other is not. Via the non-UFFD mapping, the underlying pages have already been allocated & filled with some contents. The UFFD mapping has not yet been faulted in; when it is touched for the first time, this results in what I'm calling a "minor" fault. As a concrete example, when working with hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing page. We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is, userspace resolves the fault by either a) doing nothing if the contents are already correct, or b) updating the underlying contents using the second, non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA, or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are correct, carry on setting up the mapping". Use Case ======== Consider the use case of VM live migration (e.g. under QEMU/KVM): 1. While a VM is still running, we copy the contents of its memory to a target machine. The pages are populated on the target by writing to the non-UFFD mapping, using the setup described above. The VM is still running (and therefore its memory is likely changing), so this may be repeated several times, until we decide the target is "up to date enough". 2. We pause the VM on the source, and start executing on the target machine. During this gap, the VM's user(s) will *see* a pause, so it is desirable to minimize this window. 3. Between the last time any page was copied from the source to the target, and when the VM was paused, the contents of that page may have changed - and therefore the copy we have on the target machine is out of date. Although we can keep track of which pages are out of date, for VMs with large amounts of memory, it is "slow" to transfer this information to the target machine. We want to resume execution before such a transfer would complete. 4. So, the guest begins executing on the target machine. The first time it touches its memory (via the UFFD-registered mapping), userspace wants to intercept this fault. Userspace checks whether or not the page is up to date, and if not, copies the updated page from the source machine, via the non-UFFD mapping. Finally, whether a copy was performed or not, userspace issues a UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents are correct, carry on setting up the mapping". We don't have to do all of the final updates on-demand. The userfaultfd manager can, in the background, also copy over updated pages once it receives the map of which pages are up-to-date or not. Interaction with Existing APIs ============================== Because this is a feature, a registered VMA could potentially receive both missing and minor faults. I spent some time thinking through how the existing API interacts with the new feature: UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault: - For non-shared memory or shmem, -EINVAL is returned. - For hugetlb, -EFAULT is returned. UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without modifications, the existing codepath assumes a new page needs to be allocated. This is okay, since userspace must have a second non-UFFD-registered mapping anyway, thus there isn't much reason to want to use these in any case (just memcpy or memset or similar). - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned. - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case). - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns -ENOENT in that case (regardless of the kind of fault). Future Work =========== This series only supports hugetlbfs. I have a second series in flight to support shmem as well, extending the functionality. This series is more mature than the shmem support at this point, and the functionality works fully on hugetlbfs, so this series can be merged first and then shmem support will follow. This patch (of 6): This feature allows userspace to intercept "minor" faults. By "minor" faults, I mean the following situation: Let there exist two mappings (i.e., VMAs) to the same page(s). One of the mappings is registered with userfaultfd (in minor mode), and the other is not. Via the non-UFFD mapping, the underlying pages have already been allocated & filled with some contents. The UFFD mapping has not yet been faulted in; when it is touched for the first time, this results in what I'm calling a "minor" fault. As a concrete example, when working with hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing page. This commit adds the new registration mode, and sets the relevant flag on the VMAs being registered. In the hugetlb fault path, if we find that we have huge_pte_none(), but find_lock_page() does indeed find an existing page, then we have a "minor" fault, and if the VMA has the userfaultfd registration flag, we call into userfaultfd to handle it. This is implemented as a new registration mode, instead of an API feature. This is because the alternative implementation has significant drawbacks [1]. However, doing it this was requires we allocate a VM_* flag for the new registration mode. On 32-bit systems, there are no unused bits, so this feature is only supported on architectures with CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in MINOR mode on 32-bit architectures, we return -EINVAL. [1] https://lore.kernel.org/patchwork/patch/1380226/ [peterx@redhat.com: fix minor fault page leak] Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Oscar Salvador | ae37c7ff79 | mm: make alloc_contig_range handle in-use hugetlb pages alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Oscar Salvador | 369fa227c2 | mm: make alloc_contig_range handle free hugetlb pages alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contic_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.
We can do better by trying to replace such page.
Free hugepages are tricky to handle so as to no userspace application
notices disruption, we need to replace the current free hugepage with a
new one.
In order to do that, a new function called alloc_and_dissolve_huge_page is
introduced.  This function will first try to get a new fresh hugepage, and
if it succeeds, it will replace the old one in the free hugepage pool.
The free page replacement is done under hugetlb_lock, so no external users
of hugetlb will notice the change.  To allocate the new huge page, we use
alloc_buddy_huge_page(), so we do not have to deal with any counters, and
prep_new_huge_page() is not called.  This is valulable because in case we
need to free the new page, we only need to call __free_pages().
Once we know that the page to be replaced is a genuine 0-refcounted huge
page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and
__prep_account_new_huge_page() for the new huge page to properly
initialize it and increment the hstate->nr_huge_pages counter (previously
decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
enqueue_huge_page() and it is ready to be used.
There is one tricky case when page's refcount is 0 because it is in the
process of being released.  A missing PageHugeFreed bit will tell us that
freeing is in flight so we retry after dropping the hugetlb_lock.  The
race window should be small and the next retry should make a forward
progress.
E.g:
CPU0				CPU1
free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_buddy_huge_page()
				    spin_lock_irq(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock_irq(&hugetlb_lock)
				  spin_lock_irq(hugetlb_lock)
                                   1) PageHuge() == F (freed by case#1 from CPU0)
				   2) PageHuge() == T
                                       PageHugeFreed() == T
                                       - proceed with replacing the page
In the case above we retry as the window race is quite small and we have
high chances to succeed next time.
With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Oscar Salvador | d3d99fcc4e | mm,hugetlb: split prep_new_huge_page functionality Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page , and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to be able to set a hugetlb without having to worry about the counter/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Link: https://lkml.kernel.org/r/20210419075413.1064-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Oscar Salvador | 9f27b34f23 | mm,hugetlb: drop clearing of flag from prep_new_huge_page Pages allocated via the page allocator or CMA get its private field
cleared by means of post_alloc_hook().
Pages allocated during boot, that is directly from the memblock
allocator, get cleared by paging_init()-> ..  ->memmap_init_zone-> ..
->__init_single_page() before any memblock allocation.
Based on this ground, let us remove the clearing of the flag from
prep_new_huge_page() as it is not needed.  This was a leftover from
commit  | ||
|  Mike Kravetz | 9487ca60fd | hugetlb: add lockdep_assert_held() calls for hugetlb_lock After making hugetlb lock irq safe and separating some functionality done under the lock, add some lockdep_assert_held to help verify locking. Link: https://lkml.kernel.org/r/20210409205254.242291-9-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | db71ef79b5 | hugetlb: make free_huge_page irq safe Commit  | ||
|  Mike Kravetz | 10c6ec4980 | hugetlb: change free_pool_huge_page to remove_pool_huge_page free_pool_huge_page was called with hugetlb_lock held.  It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy.  free_pool_huge_page was called in a loop to
remove hugetlb pages and these loops could hold the hugetlb_lock for a
considerable time.
Create new routine remove_pool_huge_page to replace free_pool_huge_page.
remove_pool_huge_page will remove the hugetlb page, and it must be
called with the hugetlb_lock held.  It will return the removed page and
it is the responsibility of the caller to free the page to the lower
level allocators.  The hugetlb_lock is dropped before freeing to these
allocators which results in shorter lock hold times.
Add new helper routine to call update_and_free_page for a list of pages.
Note: Some changes to the routine return_unused_surplus_pages are in
need of explanation.  Commit  | ||
|  Mike Kravetz | 1121828a0c | hugetlb: call update_and_free_page without hugetlb_lock With the introduction of remove_hugetlb_page(), there is no need for update_and_free_page to hold the hugetlb lock. Change all callers to drop the lock before calling. With additional code modifications, this will allow loops which decrease the huge page pool to drop the hugetlb_lock with each page to reduce long hold times. The ugly unlock/lock cycle in free_pool_huge_page will be removed in a subsequent patch which restructures free_pool_huge_page. Link: https://lkml.kernel.org/r/20210409205254.242291-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 6eb4e88a6d | hugetlb: create remove_hugetlb_page() to separate functionality The new remove_hugetlb_page() routine is designed to remove a hugetlb
page from hugetlbfs processing.  It will remove the page from the active
or free list, update global counters and set the compound page
destructor to NULL so that PageHuge() will return false for the 'page'.
After this call, the 'page' can be treated as a normal compound page or
a collection of base size pages.
update_and_free_page no longer decrements h->nr_huge_pages{_node} as
this is performed in remove_hugetlb_page.  The only functionality
performed by update_and_free_page is to free the base pages to the lower
level allocators.
update_and_free_page is typically called after remove_hugetlb_page.
remove_hugetlb_page is to be called with the hugetlb_lock held.
Creating this routine and separating functionality is in preparation for
restructuring code to reduce lock hold times.  This commit should not
introduce any changes to functionality.
Link: https://lkml.kernel.org/r/20210409205254.242291-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 2938396771 | hugetlb: add per-hstate mutex to synchronize user adjustments The helper routine hstate_next_node_to_alloc accesses and modifies the hstate variable next_nid_to_alloc. The helper is used by the routines alloc_pool_huge_page and adjust_pool_surplus. adjust_pool_surplus is called with hugetlb_lock held. However, alloc_pool_huge_page can not be called with the hugetlb lock held as it will call the page allocator. Two instances of alloc_pool_huge_page could be run in parallel or alloc_pool_huge_page could run in parallel with adjust_pool_surplus which may result in the variable next_nid_to_alloc becoming invalid for the caller and pages being allocated on the wrong node. Both alloc_pool_huge_page and adjust_pool_surplus are only called from the routine set_max_huge_pages after boot. set_max_huge_pages is only called as the reusult of a user writing to the proc/sysfs nr_hugepages, or nr_hugepages_mempolicy file to adjust the number of hugetlb pages. It makes little sense to allow multiple adjustment to the number of hugetlb pages in parallel. Add a mutex to the hstate and use it to only allow one hugetlb page adjustment at a time. This will synchronize modifications to the next_nid_to_alloc variable. Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 262443c042 | hugetlb: no need to drop hugetlb_lock to call cma_release Now that cma_release is non-blocking and irq safe, there is no need to drop hugetlb_lock before calling. Link: https://lkml.kernel.org/r/20210409205254.242291-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | da56388c43 | mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts() A rare out of memory error would prevent removal of the reserve map region
for a page.  hugetlb_fix_reserve_counts() handles this rare case to avoid
dangling with incorrect counts.  Unfortunately, hugepage_subpool_get_pages
and hugetlb_acct_memory could possibly fail too.  We should correctly
handle these cases.
Link: https://lkml.kernel.org/r/20210410072348.20437-5-linmiaohe@huawei.com
Fixes:  | ||
|  Miaohe Lin | dddf31a49a | mm/hugeltb: clarify (chg - freed) won't go negative in hugetlb_unreserve_pages() The resv_map could be NULL since this routine can be called in the evict inode path for all hugetlbfs inodes and we will have chg = 0 in this case. But (chg - freed) won't go negative as Mike pointed out: "If resv_map is NULL, then no hugetlb pages can be allocated/associated with the file. As a result, remove_inode_hugepages will never find any huge pages associated with the inode and the passed value 'freed' will always be zero." Add a comment clarifying this to make it clear and also avoid confusion. Link: https://lkml.kernel.org/r/20210410072348.20437-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | bf3d12b9f7 | mm/hugeltb: simplify the return code of __vma_reservation_common() It's guaranteed that the vma is associated with a resv_map, i.e. either VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would have returned via !resv check above. So it's unneeded to check whether HPAGE_RESV_OWNER is set here. Simplify the return code to make it more clear. Link: https://lkml.kernel.org/r/20210410072348.20437-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | f84df0b7f1 | mm/hugeltb: remove redundant VM_BUG_ON() in region_add() Patch series "Cleanup and fixup for hugetlb", v2. This series contains cleanups to remove redundant VM_BUG_ON() and simplify the return code. Also this handles the error case in hugetlb_fix_reserve_counts() correctly. More details can be found in the respective changelogs. This patch (of 5): The same VM_BUG_ON() check is already done in the callee. Remove this extra one to simplify the code slightly. Link: https://lkml.kernel.org/r/20210410072348.20437-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210410072348.20437-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Feilong Lin <linfeilong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | d83e6c8a9b | mm/hugetlb: simplify the code when alloc_huge_page() failed in hugetlb_no_page() Rework the error handling code when alloc_huge_page() failed to remove some duplicated code and simplify the code slightly. Link: https://lkml.kernel.org/r/20210308112809.26107-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 5af1ab1d24 | mm/hugetlb: optimize the surplus state transfer code in move_hugetlb_state() We should not transfer the per-node surplus state when we do not cross the node in order to save some cpu cycles Link: https://lkml.kernel.org/r/20210308112809.26107-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 04adbc3f7b | mm/hugetlb: use some helper functions to cleanup code Patch series "Some cleanups for hugetlb". This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE, use helper function and so on. I also collect some previous patches into this series in case they are forgotten. This patch (of 5): We could use pages_per_huge_page to get the number of pages per hugepage, use get_hstate_idx to calculate hstate index, and use hstate_is_gigantic to check if a hstate is gigantic to make code more succinct. Link: https://lkml.kernel.org/r/20210308112809.26107-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210308112809.26107-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 6501fe5f16 | mm/hugetlb: remove redundant reservation check condition in alloc_huge_page() vma_resv_map(vma) checks if a reserve map is associated with the vma. The routine vma_needs_reservation() will check vma_resv_map(vma) and return 1 if no reserv map is present. map_chg is set to the return value of vma_needs_reservation(). Therefore, !vma_resv_map(vma) is redundant in the expression: map_chg || avoid_reserve || !vma_resv_map(vma); Remove the redundant check. [Thanks Mike Kravetz for reshaping this commit message!] Link: https://lkml.kernel.org/r/20210301104726.45159-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 6dfeaff93b | hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because userfaultfd-wp is always based on pgtable entries, so they cannot be shared. Walk the hugetlb range and unshare all such mappings if there is, right before UFFDIO_REGISTER will succeed and return to userspace. This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing is completely disabled for userfaultfd-wp registered range. Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 537cf30bba | mm/hugetlb: move flush_hugetlb_tlb_range() into hugetlb.h Prepare for it to be called outside of mm/hugetlb.c. Link: https://lkml.kernel.org/r/20210218231204.15474-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | c1991e0705 | hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled Huge pmd sharing could bring problem to userfaultfd. The thing is that userfaultfd is running its logic based on the special bits on page table entries, however the huge pmd sharing could potentially share page table entries for different address ranges. That could cause issues on either: - When sharing huge pmd page tables for an uffd write protected range, the newly mapped huge pmd range will also be write protected unexpectedly, or, - When we try to write protect a range of huge pmd shared range, we'll first do huge_pmd_unshare() in hugetlb_change_protection(), however that also means the UFFDIO_WRITEPROTECT could be silently skipped for the shared region, which could lead to data loss. While at it, a few other things are done altogether: - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because that's definitely something that arch code would like to use too - ARM64 currently directly check against CONFIG_ARCH_WANT_HUGE_PMD_SHARE when trying to share huge pmd. Switch to the want_pmd_share() helper. - Move vma_shareable() from huge_pmd_share() into want_pmd_share(). [peterx@redhat.com: fix build with !ARCH_WANT_HUGE_PMD_SHARE] Link: https://lkml.kernel.org/r/20210310185359.88297-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210218231202.15426-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | aec44e0f02 | hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4. This series tries to disable huge pmd unshare of hugetlbfs backed memory for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage, the idea of this series may be needed for multiple tasks (Axel's uffd minor fault series, and Mike's soft dirty series), so I picked it out from the larger series. This patch (of 4): It is a preparation work to be able to behave differently in the per architecture huge_pte_alloc() according to different VMA attributes. Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call. [peterx@redhat.com: build fix] Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Adam Ruprecht <ruprecht@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oliver Upton <oupton@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Shawn Anastasio <shawn@anastas.io> Cc: Steven Price <steven.price@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Matthew Wilcox (Oracle) | 84172f4bb7 | mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask There are only two callers of __alloc_pages() so prune the thicket of alloc_page variants by combining the two functions together. Current callers of __alloc_pages() simply add an extra 'NULL' parameter and current callers of __alloc_pages_nodemask() call __alloc_pages() instead. Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | d85aecf284 | hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings The current implementation of hugetlb_cgroup for shared mappings could
have different behavior.  Consider the following two scenarios:
 1.Assume initial css reference count of hugetlb_cgroup is 1:
  1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
      count is 2 associated with 1 file_region.
  1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
      count is 3 associated with 2 file_region.
  1.3 coalesce_file_region will coalesce these two file_regions into
      one. So css reference count is 3 associated with 1 file_region
      now.
 2.Assume initial css reference count of hugetlb_cgroup is 1 again:
  2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
      count is 2 associated with 1 file_region.
Therefore, we might have one file_region while holding one or more css
reference counts. This inconsistency could lead to imbalanced css_get()
and css_put() pair. If we do css_put one by one (i.g. hole punch case),
scenario 2 would put one more css reference. If we do css_put all
together (i.g. truncate case), scenario 1 will leak one css reference.
The imbalanced css_get() and css_put() pair would result in a non-zero
reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
directory is removed __but__ associated resource is not freed. This
might result in OOM or can not create a new hugetlb cgroup in a busy
workload ultimately.
In order to fix this, we have to make sure that one file_region must
hold exactly one css reference. So in coalesce_file_region case, we
should release one css reference before coalescence. Also only put css
reference when the entire file_region is removed.
The last thing to note is that the caller of region_add() will only hold
one reference to h_cg->css for the whole contiguous reservation region.
But this area might be scattered when there are already some
file_regions reside in it. As a result, many file_regions may share only
one h_cg->css reference. In order to ensure that one file_region must
hold exactly one css reference, we should do css_get() for each
file_region and release the reference held by caller when they are done.
[linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
  Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
Fixes:  | ||
|  Peter Xu | 4eae4efa2c | hugetlb: do early cow when page pinned on src mm This is the last missing piece of the COW-during-fork effort when there're pinned pages found. One can reference | ||
|  Peter Xu | ca6eb14d64 | mm: use is_cow_mapping() across tree where proper After is_cow_mapping() is exported in mm.h, replace some manual checks elsewhere throughout the tree but start to use the new helper. Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | ca7e0457ef | hugetlb: break earlier in add_reservation_in_range() when we can All the regions maintained in hugetlb reserved map is inclusive on "from" but exclusive on "to". We can break earlier even if rg->from==t because it already means no possible intersection. This does not need a Fixes in all cases because when it happens (rg->from==t) we'll not break out of the loop while we should, however the next thing we'd do is still add the last file_region we'd need and quit the loop in the next round. So this change is not a bugfix (since the old code should still run okay iiuc), but we'd better still touch it up to make it logically sane. Link: https://lkml.kernel.org/r/20210217233547.93892-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 2103cf9c3f | hugetlb: dedup the code to add a new file_region Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5. As reported by Gal [1], we still miss the code clip to handle early cow for hugetlb case, which is true. Again, it still feels odd to fork() after using a few huge pages, especially if they're privately mapped to me.. However I do agree with Gal and Jason in that we should still have that since that'll complete the early cow on fork effort at least, and it'll still fix issues where buffers are not well under control and not easy to apply MADV_DONTFORK. The first two patches (1-2) are some cleanups I noticed when reading into the hugetlb reserve map code. I think it's good to have but they're not necessary for fixing the fork issue. The last two patches (3-4) are the real fix. I tested this with a fork() after some vfio-pci assignment, so I'm pretty sure the page copy path could trigger well (page will be accounted right after the fork()), but I didn't do data check since the card I assigned is some random nic. https://github.com/xzpeter/linux/tree/fork-cow-pin-huge [1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/ Introduce hugetlb_resv_map_add() helper to add a new file_region rather than duplication the similar code twice in add_reservation_in_range(). Link: https://lkml.kernel.org/r/20210217233547.93892-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210217233547.93892-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Gal Pressman <galpress@amazon.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Wei Zhang <wzam@amazon.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jann Horn <jannh@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Roland Scheidegger <sroland@vmware.com> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 33b8f84a4e | mm/hugetlb: change hugetlb_reserve_pages() to type bool While reviewing a bug in hugetlb_reserve_pages, it was noticed that all callers ignore the return value. Any failure is considered an ENOMEM error by the callers. Change the function to be of type bool. The function will return true if the reservation was successful, false otherwise. Callers currently assume a zero return code indicates success. Change the callers to look for true to indicate success. No functional change, only code cleanup. Link: https://lkml.kernel.org/r/20201221192542.15732-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | ff54611762 | hugetlb: fix uninitialized subpool pointer Gerald Schaefer reported a panic on s390 in hugepage_subpool_put_pages() with linux-next 5.12.0-20210222. Call trace: hugepage_subpool_put_pages.part.0+0x2c/0x138 __free_huge_page+0xce/0x310 alloc_pool_huge_page+0x102/0x120 set_max_huge_pages+0x13e/0x350 hugetlb_sysctl_handler_common+0xd8/0x110 hugetlb_sysctl_handler+0x48/0x58 proc_sys_call_handler+0x138/0x238 new_sync_write+0x10e/0x198 vfs_write.part.0+0x12c/0x238 ksys_write+0x68/0xf8 do_syscall+0x82/0xd0 __do_syscall+0xb4/0xc8 system_call+0x72/0x98 This is a result of the change which moved the hugetlb page subpool pointer from page->private to page[1]->private. When new pages are allocated from the buddy allocator, the private field of the head page will be cleared, but the private field of subpages is not modified. Therefore, old values may remain. Fix by initializing hugetlb page subpool pointer in prep_new_huge_page(). Link: https://lkml.kernel.org/r/20210223215544.313871-1-mike.kravetz@oracle.com Fixes: f1280272ae4d ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 6c03714901 | hugetlb: convert PageHugeFreed to HPageFreed flag Use new hugetlb specific HPageFreed flag to replace the PageHugeFreed interfaces. Link: https://lkml.kernel.org/r/20210122195231.324857-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 9157c31186 | hugetlb: convert PageHugeTemporary() to HPageTemporary flag Use new hugetlb specific HPageTemporary flag to replace the PageHugeTemporary() interfaces. PageHugeTemporary does contain a PageHuge() check. However, this interface is only used within hugetlb code where we know we are dealing with a hugetlb page. Therefore, the check can be eliminated. Link: https://lkml.kernel.org/r/20210122195231.324857-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 8f251a3d5c | hugetlb: convert page_huge_active() HPageMigratable flag Use the new hugetlb page specific flag HPageMigratable to replace the page_huge_active interfaces. By it's name, page_huge_active implied that a huge page was on the active list. However, that is not really what code checking the flag wanted to know. It really wanted to determine if the huge page could be migrated. This happens when the page is actually added to the page cache and/or task page table. This is the reasoning behind the name change. The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not really necessary as we KNOW the page is a hugetlb page. Therefore, they are removed. The routine page_huge_active checked for PageHeadHuge before testing the active bit. This is unnecessary in the case where we hold a reference or lock and know it is a hugetlb head page. page_huge_active is also called without holding a reference or lock (scan_movable_pages), and can race with code freeing the page. The extra check in page_huge_active shortened the race window, but did not prevent the race. Offline code calling scan_movable_pages already deals with these races, so removing the check is acceptable. Add comment to racy code. [songmuchun@bytedance.com: remove set_page_huge_active() declaration from include/linux/hugetlb.h] Link: https://lkml.kernel.org/r/CAMZfGtUda+KoAZscU0718TN61cSFwp4zy=y2oZ=+6Z2TAZZwng@mail.gmail.com Link: https://lkml.kernel.org/r/20210122195231.324857-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | d6995da311 | hugetlb: use page.private for hugetlb specific page flags Patch series "create hugetlb flags to consolidate state", v3. While discussing a series of hugetlb fixes in [1], it became evident that the hugetlb specific page state information is stored in a somewhat haphazard manner. Code dealing with state information would be easier to read, understand and maintain if this information was stored in a consistent manner. This series uses page.private of the hugetlb head page for storing a set of hugetlb specific page flags. Routines are priovided for test, set and clear of the flags. [1] https://lore.kernel.org/r/20210106084739.63318-1-songmuchun@bytedance.com This patch (of 4): As hugetlbfs evolved, state information about hugetlb pages was added. One 'convenient' way of doing this was to use available fields in tail pages. Over time, it has become difficult to know the meaning or contents of fields simply by looking at a small bit of code. Sometimes, the naming is just confusing. For example: The PagePrivate flag indicates a huge page reservation was consumed and needs to be restored if an error is encountered and the page is freed before it is instantiated. The page.private field contains the pointer to a subpool if the page is associated with one. In an effort to make the code more readable, use page.private to contain hugetlb specific page flags. These flags will have test, set and clear functions similar to those used for 'normal' page flags. More importantly, an enum of flag values will be created with names that actually reflect their purpose. In this patch, - Create infrastructure for hugetlb specific page flag functions - Move subpool pointer to page[1].private to make way for flags Create routines with meaningful names to modify subpool field - Use new HPageRestoreReserve flag instead of PagePrivate Conversion of other state information will happen in subsequent patches. Link: https://lkml.kernel.org/r/20210122195231.324857-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210122195231.324857-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Chen Wandun | 7ecc956551 | mm/hugetlb: suppress wrong warning info when alloc gigantic page If hugetlb_cma is enabled, it will skip boot time allocation when
allocating gigantic page, that doesn't means allocation failure, so
suppress this warning info.
Link: https://lkml.kernel.org/r/20210219123909.13130-1-chenwandun@huawei.com
Fixes:  | ||
|  Mike Kravetz | dbfee5aee7 | hugetlb: fix update_and_free_page contig page struct assumption page structs are not guaranteed to be contiguous for gigantic pages.  The
routine update_and_free_page can encounter a gigantic page, yet it assumes
page structs are contiguous when setting page flags in subpages.
If update_and_free_page encounters non-contiguous page structs, we can see
“BUG: Bad page state in process …” errors.
Non-contiguous page structs are generally not an issue.  However, they can
exist with a specific kernel configuration and hotplug operations.  For
example: Configure the kernel with CONFIG_SPARSEMEM and
!CONFIG_SPARSEMEM_VMEMMAP.  Then, hotplug add memory for the area where
the gigantic page will be allocated.  Zi Yan outlined steps to reproduce
here [1].
[1] https://lore.kernel.org/linux-mm/16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com/
Link: https://lkml.kernel.org/r/20210217184926.33567-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Miaohe Lin | aca78307bf | mm/hugetlb: use helper huge_page_size() to get hugepage size We can use helper huge_page_size() to get the hugepage size directly to simplify the code slightly. [linmiaohe@huawei.com: use helper huge_page_size() to get hugepage size] Link: https://lkml.kernel.org/r/20210209021803.49211-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210208082450.15716-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 3f1b0162f6 | mm/hugetlb: remove unnecessary VM_BUG_ON_PAGE on putback_active_hugepage() All callers know they are operating on a hugetlb head page. So this VM_BUG_ON_PAGE can not catch anything useful. Link: https://lkml.kernel.org/r/20210209071151.44731-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 07e51edf83 | mm/hugetlb: use helper function range_in_vma() in page_table_shareable() We could use helper function range_in_vma() to check whether the vma is in the desired range to simplify the code. Link: https://lkml.kernel.org/r/20210204112949.43051-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Zhiyuan Dai | 578b7725d4 | mm/hugetlb.c: fix typos in comments Fix typo in comment. Link: https://lkml.kernel.org/r/1612256106-9436-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Yanfei Xu | 5291c09b3e | mm/hugetlb: remove redundant check in preparing and destroying gigantic page Gigantic page is a compound page and its order is more than 1. Thus it must be available for hpage_pincount. Let's remove the redundant check for gigantic page. Link: https://lkml.kernel.org/r/20210202112002.73170-1-yanfei.xu@windriver.com Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 6c26d31083 | mm/hugetlb: fix some comment typos Fix typos sasitfy to satisfy, reservtion to reservation, hugegpage to hugepage and uniprocesor to uniprocessor in comments. Link: https://lkml.kernel.org/r/20210128112028.64831-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Joao Martins | 82e5d378b0 | mm/hugetlb: refactor subpage recording For a given hugepage backing a VA, there's a rather ineficient loop which
is solely responsible for storing subpages in GUP @pages/@vmas array.  For
each subpage we check whether it's within range or size of @pages and keep
increment @pfn_offset and a couple other variables per subpage iteration.
Simplify this logic and minimize the cost of each iteration to just store
the output page/vma.  Instead of incrementing number of @refs iteratively,
we do it through pre-calculation of @refs and only with a tight loop for
storing pinned subpages/vmas.
Additionally, retain existing behaviour with using mem_map_offset() when
recording the subpages for configurations that don't have a contiguous
mem_map.
pinning consequently improves bringing us close to
{pin,get}_user_pages_fast:
  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
PIN_FAST_BENCHMARK: ~3.7k us
Link: https://lkml.kernel.org/r/20210128182632.24562-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Joao Martins | 0fa5bc4023 | mm/hugetlb: grab head page refcount once for group of subpages Patch series "mm/hugetlb: follow_hugetlb_page() improvements", v2. While looking at ZONE_DEVICE struct page reuse particularly the last patch[0], I found two possible improvements for follow_hugetlb_page() which is solely used for get_user_pages()/pin_user_pages(). The first patch batches page refcount updates while the second tidies up storing the subpages/vmas. Both together bring the cost of slow variant of gup() cost from ~87.6k usecs to ~5.8k usecs. libhugetlbfs tests seem to pass as well gup_test benchmarks with hugetlbfs vmas. This patch (of 2): follow_hugetlb_page() once it locks the pmd/pud, checks all its N subpages in a huge page and grabs a reference for each one. Similar to gup-fast, have follow_hugetlb_page() grab the head page refcount only after counting all its subpages that are part of the just faulted huge page. Consequently we reduce the number of atomics necessary to pin said huge page, which improves non-fast gup() considerably: - 16G with 1G huge page size gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w PIN_LONGTERM_BENCHMARK: ~87.6k us -> ~12.8k us Link: https://lkml.kernel.org/r/20210128182632.24562-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20210128182632.24562-2-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Jiapeng Zhong | c93b0a9926 | mm/hugetlb: simplify the calculation of variables Fix the following coccicheck warnings: mm/hugetlb.c:3372:20-22: WARNING !A || A && B is equivalent to !A || B. Link: https://lkml.kernel.org/r/1611643468-52233-1-git-send-email-abaci-bugfix@linux.alibaba.com Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | 1d88433bb0 | mm/hugetlb: fix use after free when subpool max_hpages accounting is not enabled If a hugetlbfs filesystem is created with the min_size option and without the size option, used_hpages is always 0 and might lead to release subpool prematurely because it indicates no pages are used now while there might be. In order to fix this issue, we should check used_hpages == 0 iff max_hpages accounting is enabled. As max_hpages accounting should be enabled in most common case, this is not worth a Cc stable. [mike.kravetz@oracle.com: new changelog] Link: https://lkml.kernel.org/r/20210126115510.53374-1-linmiaohe@huawei.com Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Miaohe Lin | c78a7f3639 | mm/hugetlb: use helper huge_page_order and pages_per_huge_page Since commit  | ||
|  Miaohe Lin | 0aa7f3544a | mm/hugetlb: avoid unnecessary hugetlb_acct_memory() call When reservation accounting remains unchanged, hugetlb_acct_memory() will do nothing except holding and releasing hugetlb_lock. We should avoid this unnecessary hugetlb_lock lock/unlock cycle which is happening on 'most' hugetlb munmap operations by check delta against 0 at the beginning of hugetlb_acct_memory. Link: https://lkml.kernel.org/r/20210115092013.61012-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Li Xinhai | a1ba9da8f0 | mm/hugetlb.c: fix unnecessary address expansion of pmd sharing The current code would unnecessarily expand the address range.  Consider
one example, (start, end) = (1G-2M, 3G+2M), and (vm_start, vm_end) =
(1G-4M, 3G+4M), the expected adjustment should be keep (1G-2M, 3G+2M)
without expand.  But the current result will be (1G-4M, 3G+4M).  Actually,
the range (1G-4M, 1G) and (3G, 3G+4M) would never been involved in pmd
sharing.
After this patch, we will check that the vma span at least one PUD aligned
size and the start,end range overlap the aligned range of vma.
With above example, the aligned vma range is (1G, 3G), so if (start, end)
range is within (1G-4M, 1G), or within (3G, 3G+4M), then no adjustment to
both start and end.  Otherwise, we will have chance to adjust start
downwards or end upwards without exceeding (vm_start, vm_end).
Mike:
: The 'adjusted range' is used for calls to mmu notifiers and cache(tlb)
: flushing.  Since the current code unnecessarily expands the range in some
: cases, more entries than necessary would be flushed.  This would/could
: result in performance degradation.  However, this is highly dependent on
: the user runtime.  Is there a combination of vma layout and calls to
: actually hit this issue?  If the issue is hit, will those entries
: unnecessarily flushed be used again and need to be unnecessarily reloaded?
Link: https://lkml.kernel.org/r/20210104081631.2921415-1-lixinhai.lxh@gmail.com
Fixes:  | ||
|  Miaohe Lin | cc2205a67d | mm/hugetlb: fix potential double free in hugetlb_register_node() error path In hugetlb_sysfs_add_hstate(), we would do kobject_put() on hstate_kobjs
when failed to create sysfs group but forget to set hstate_kobjs to NULL.
Then in hugetlb_register_node() error path, we may free it again via
hugetlb_unregister_node().
Link: https://lkml.kernel.org/r/20210107123249.36964-1-linmiaohe@huawei.com
Fixes:  | ||
|  Linus Torvalds | 7b15c27e2f | These changes fix MM (soft-)dirty bit management in the procfs code & clean up the API. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAtAgsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gOnA/9GKJblgi88Qb23YwGKp0OfCMdLfx8FJa+ dq0AB0jgzc8v2J8IITSs/qo/8o25IE9IPTjTfItn0E0jxz7Y8J16urb/vyWX6O2s jb4riT5fIRCXvhv9DooxSQerZePaOJXbHYa2BBk8yqNJPGbd5kr0SUGn3BQnBQhR 0yAfqjrzBLMGzzSO+kK0nhGQH8BJZgYu94CHNnUZJtWcIb2ZC6lzZ7Lz0zi6ueRJ 81JblV4NCOC9uy9I9odOwESu2TIxT9afq1C/6COyrKYx3sWY6xPOGQTxYZe1LITE lb57/95qc7SOIj7Y3aL4YRSVRYRihEU31qlAltwP4fEnz49qdHJOR1HQmjKVG8xs Uaa6kCYFeTKmh4SRRr8ZR/hUkebrFUT+9+db6LmBs/i4Kt09T+ZurXC4jqmUHMFn 2nYCDH6RX153V1YwcHGkr4OWaUVWZwAZl+t0zIo7o7wQdkoAD75ydecW2R3nLMN7 p1ofGPXmT8Wh4en8LngBawO/4bBuunezh4L3vpz0/EU3viK5+DRsyNKf+d+Tti28 XCe7ID0GDGq7nIzSZxuyIxmAbWJxjI+7gWT2WUudrJxJ2rUUxPQms8GsQD54IMh5 UILv9GMBNuV8iA/2c3B52ff5iFl7kp+SxVS3MRC6zTudIV9VV6bb7WpFb8FLOhsH 3sEo0qDFab4= =qcXO -----END PGP SIGNATURE----- Merge tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull tlb gather updates from Ingo Molnar: "Theses fix MM (soft-)dirty bit management in the procfs code & clean up the TLB gather API" * tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ldt: Use tlb_gather_mmu_fullmm() when freeing LDT page-tables tlb: arch: Remove empty __tlb_remove_tlb_entry() stubs tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() mm: proc: Invalidate TLB after clearing soft-dirty page state | ||
|  Muchun Song | e558464be9 | mm: hugetlb: fix missing put_page in gather_surplus_pages() The VM_BUG_ON_PAGE avoids the generation of any code, even if that
expression has side-effects when !CONFIG_DEBUG_VM.
Link: https://lkml.kernel.org/r/20210126031009.96266-1-songmuchun@bytedance.com
Fixes:  | ||
|  Muchun Song | ecbf4724e6 | mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active The page_huge_active() can be called from scan_movable_pages() which do
not hold a reference count to the HugeTLB page.  So when we call
page_huge_active() from scan_movable_pages(), the HugeTLB page can be
freed parallel.  Then we will trigger a BUG_ON which is in the
page_huge_active() when CONFIG_DEBUG_VM is enabled.  Just remove the
VM_BUG_ON_PAGE.
Link: https://lkml.kernel.org/r/20210115124942.46403-6-songmuchun@bytedance.com
Fixes:  | ||
|  Muchun Song | 0eb2df2b56 | mm: hugetlb: fix a race between isolating and freeing page There is a race between isolate_huge_page() and __free_huge_page().
  CPU0:                                     CPU1:
  if (PageHuge(page))
                                            put_page(page)
                                              __free_huge_page(page)
                                                  spin_lock(&hugetlb_lock)
                                                  update_and_free_page(page)
                                                    set_compound_page_dtor(page,
                                                      NULL_COMPOUND_DTOR)
                                                  spin_unlock(&hugetlb_lock)
    isolate_huge_page(page)
      // trigger BUG_ON
      VM_BUG_ON_PAGE(!PageHead(page), page)
      spin_lock(&hugetlb_lock)
      page_huge_active(page)
        // trigger BUG_ON
        VM_BUG_ON_PAGE(!PageHuge(page), page)
      spin_unlock(&hugetlb_lock)
When we isolate a HugeTLB page on CPU0.  Meanwhile, we free it to the
buddy allocator on CPU1.  Then, we can trigger a BUG_ON on CPU0, because
it is already freed to the buddy allocator.
Link: https://lkml.kernel.org/r/20210115124942.46403-5-songmuchun@bytedance.com
Fixes:  | ||
|  Muchun Song | 7ffddd499b | mm: hugetlb: fix a race between freeing and dissolving the page There is a race condition between __free_huge_page()
and dissolve_free_huge_page().
  CPU0:                         CPU1:
  // page_count(page) == 1
  put_page(page)
    __free_huge_page(page)
                                dissolve_free_huge_page(page)
                                  spin_lock(&hugetlb_lock)
                                  // PageHuge(page) && !page_count(page)
                                  update_and_free_page(page)
                                  // page is freed to the buddy
                                  spin_unlock(&hugetlb_lock)
      spin_lock(&hugetlb_lock)
      clear_page_huge_active(page)
      enqueue_huge_page(page)
      // It is wrong, the page is already freed
      spin_unlock(&hugetlb_lock)
The race window is between put_page() and dissolve_free_huge_page().
We should make sure that the page is already on the free list when it is
dissolved.
As a result __free_huge_page would corrupt page(s) already in the buddy
allocator.
Link: https://lkml.kernel.org/r/20210115124942.46403-4-songmuchun@bytedance.com
Fixes:  | ||
|  Muchun Song | 585fc0d287 | mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page If a new hugetlb page is allocated during fallocate it will not be
marked as active (set_page_huge_active) which will result in a later
isolate_huge_page failure when the page migration code would like to
move that page.  Such a failure would be unexpected and wrong.
Only export set_page_huge_active, just leave clear_page_huge_active as
static.  Because there are no external users.
Link: https://lkml.kernel.org/r/20210115124942.46403-3-songmuchun@bytedance.com
Fixes:  | ||
|  Will Deacon | a72afd8730 | tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com | ||
|  Will Deacon | ae8eba8b5d | tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Since commit  | ||
|  Miaohe Lin | 0eb98f1588 | mm/hugetlb: fix potential missing huge page size info The huge page size is encoded for VM_FAULT_HWPOISON errors only.  So if
we return VM_FAULT_HWPOISON, huge page size would just be ignored.
Link: https://lkml.kernel.org/r/20210107123449.38481-1-linmiaohe@huawei.com
Fixes:  | ||
|  Mike Kravetz | e7dd91c456 | mm/hugetlb: fix deadlock in hugetlb_cow error path syzbot reported the deadlock here [1]. The issue is in hugetlb cow error handling when there are not enough huge pages for the faulting task which took the original reservation. It is possible that other (child) tasks could have consumed pages associated with the reservation. In this case, we want the task which took the original reservation to succeed. So, we unmap any associated pages in children so that they can be used by the faulting task that owns the reservation. The unmapping code needs to hold i_mmap_rwsem in write mode. However, due to commit | ||
|  Joe Perches | ae7a927d27 | mm: use sysfs_emit for struct kobject * uses Patch series "mm: Convert sysfs sprintf family to sysfs_emit", v2.
Use the new sysfs_emit family and not the sprintf family.
This patch (of 5):
Use the sysfs_emit function instead of the sprintf family.
Done with cocci script as in commit  | ||
|  Dan Carpenter | 7fc2513aa2 | hugetlb: fix an error code in hugetlb_reserve_pages() Preserve the error code from region_add() instead of returning success.
Link: https://lkml.kernel.org/r/X9NGZWnZl5/Mt99R@mwanda
Fixes:  | ||
|  Oscar Salvador | 39a0feaef1 | mm,hugetlb: remove unneeded initialization hugetlb_add_hstate initializes nr_huge_pages and free_huge_pages to 0, but since hstates[] is a global variable, all its fields are defined to 0 already. Link: https://lkml.kernel.org/r/20201119112141.6452-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Liu Xiang | 0a4f3d1bb9 | mm: hugetlb: fix type of delta parameter and related local variables in gather_surplus_pages() On 64-bit machine, delta variable in hugetlb_acct_memory() may be larger than 0xffffffff, but gather_surplus_pages() can only use the low 32-bit value now. So we need to fix type of delta parameter and related local variables in gather_surplus_pages(). Link: https://lkml.kernel.org/r/1605793733-3573-1-git-send-email-liu.xiang@zlingsmart.com Reported-by: Ma Chenggong <ma.chenggong@zlingsmart.com> Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com> Signed-off-by: Pan Jiagen <pan.jiagen@zlingsmart.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Liu Xiang <liuxiang_1999@126.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Hui Su | e5dfacebe4 | mm/hugetlb.c: just use put_page_testzero() instead of page_count() We test the page reference count is zero or not here, it can be a bug here if page refercence count is not zero. So we can just use put_page_testzero() instead of page_count(). Link: https://lkml.kernel.org/r/20201007170949.GA6416@rlk Signed-off-by: Hui Su <sh_def@163.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Dmitry Safonov | dd3b614f85 | vm_ops: rename .split() callback to .may_split() Rename the callback to reflect that it's not called *on* or *after* split, but rather some time before the splitting to check if it's possible. Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Gerald Schaefer | ba9c1201be | mm/hugetlb: clear compound_nr before freeing gigantic pages Commit | ||
|  Mike Kravetz | 336bf30eb7 | hugetlbfs: fix anon huge page migration race Qian Cai reported the following BUG in [1]
  LTP: starting move_pages12
  BUG: unable to handle page fault for address: ffffffffffffffe0
  ...
  RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
  Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
anon pages as the anon_vma_lock should be held in this case.  Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
keep mapping valid while dropping page lock.
This patch does the following:
 - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
   calling try_to_unmap.
 - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
   will now simply do a 'trylock' while still holding the page lock. If
   the trylock fails, it will return NULL. This could impact the
   callers:
    - migration calling code will receive -EAGAIN and retry up to the
      hard coded limit (10).
    - memory error code will treat the page as BUSY. This will force
      killing (SIGKILL) instead of SIGBUS any mapping tasks.
   Do note that this change in behavior only happens when there is a
   race. None of the standard kernel testing suites actually hit this
   race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes:  | ||
|  Mike Kravetz | 79aa925bf2 | hugetlb_cgroup: fix reservation accounting Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and hit the warning below.  QEMU with free page hinting
uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
as free by a VM.  The reporting granularity is in pageblock granularity.
So when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
one huge page in QEMU.
  WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
  Modules linked in: ...
  CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
  Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
  RIP: 0010:page_counter_uncharge+0x4b/0x50
  ...
  Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
Investigation of the issue uncovered bugs in hugetlb cgroup reservation
accounting.  This patch addresses the found issues.
Fixes:  | ||
|  Linus Torvalds | 5a32c3413d | dma-mapping updates for 5.10 - rework the non-coherent DMA allocator
  - move private definitions out of <linux/dma-mapping.h>
  - lower CMA_ALIGNMENT (Paul Cercueil)
  - remove the omap1 dma address translation in favor of the common
    code
  - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)
  - support per-node DMA CMA areas (Barry Song)
  - increase the default seg boundary limit (Nicolin Chen)
  - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)
  - various cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl+IiPwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPKEQ//TM8vxjucnRl/pklpMin49dJorwiVvROLhQqLmdxw
 286ZKpVzYYAPc7LnNqwIBugnFZiXuHu8xPKQkIiOa2OtNDTwhKNoBxOAmOJaV6DD
 8JfEtZYeX5mKJ/Nqd2iSkIqOvCwZ9Wzii+aytJ2U88wezQr1fnyF4X49MegETEey
 FHWreSaRWZKa0MMRu9AQ0QxmoNTHAQUNaPc0PeqEtPULybfkGOGw4/ghSB7WcKrA
 gtKTuooNOSpVEHkTas2TMpcBp6lxtOjFqKzVN0ml+/nqq5NeTSDx91VOCX/6Cj76
 mXIg+s7fbACTk/BmkkwAkd0QEw4fo4tyD6Bep/5QNhvEoAriTuSRbhvLdOwFz0EF
 vhkF0Rer6umdhSK7nPd7SBqn8kAnP4vBbdmB68+nc3lmkqysLyE4VkgkdH/IYYQI
 6TJ0oilXWFmU6DT5Rm4FBqCvfcEfU2dUIHJr5wZHqrF2kLzoZ+mpg42fADoG4GuI
 D/oOsz7soeaRe3eYfWybC0omGR6YYPozZJ9lsfftcElmwSsFrmPsbO1DM5IBkj1B
 gItmEbOB9ZK3RhIK55T/3u1UWY3Uc/RVr+kchWvADGrWnRQnW0kxYIqDgiOytLFi
 JZNH8uHpJIwzoJAv6XXSPyEUBwXTG+zK37Ce769HGbUEaUrE71MxBbQAQsK8mDpg
 7fM=
 =Bkf/
 -----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
 - rework the non-coherent DMA allocator
 - move private definitions out of <linux/dma-mapping.h>
 - lower CMA_ALIGNMENT (Paul Cercueil)
 - remove the omap1 dma address translation in favor of the common code
 - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)
 - support per-node DMA CMA areas (Barry Song)
 - increase the default seg boundary limit (Nicolin Chen)
 - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)
 - various cleanups
* tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
  ARM/ixp4xx: add a missing include of dma-map-ops.h
  dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
  dma-direct: factor out a dma_direct_alloc_from_pool helper
  dma-direct check for highmem pages in dma_direct_alloc_pages
  dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>
  dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma
  dma-mapping: move dma-debug.h to kernel/dma/
  dma-mapping: remove <asm/dma-contiguous.h>
  dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>
  dma-contiguous: remove dma_contiguous_set_default
  dma-contiguous: remove dev_set_cma_area
  dma-contiguous: remove dma_declare_contiguous
  dma-mapping: split <linux/dma-mapping.h>
  cma: decrease CMA_ALIGNMENT lower limit to 2
  firewire-ohci: use dma_alloc_pages
  dma-iommu: implement ->alloc_noncoherent
  dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
  dma-mapping: add a new dma_alloc_pages API
  dma-mapping: remove dma_cache_sync
  53c700: convert to dma_alloc_noncoherent
  ... | ||
|  Linus Torvalds | fe151462bd | Driver Core patches for 5.10-rc1 Here is the "big" set of driver core patches for 5.10-rc1
 
 They include a lot of different things, all related to the driver core
 and/or some driver logic:
 	- sysfs common write functions to make it easier to audit sysfs
 	  attributes
 	- device connection cleanups and fixes
 	- devm helpers for a few functions
 	- NOIO allocations for when devices are being removed
 	- minor cleanups and fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4c4yA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylS7gCfcS+7/PE42eXxMY0z8rBX8aDMadIAn2DVEghA
 Eoh9UoMEW4g1uMKORA0c
 =CVAW
 -----END PGP SIGNATURE-----
Merge tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.10-rc1
  They include a lot of different things, all related to the driver core
  and/or some driver logic:
   - sysfs common write functions to make it easier to audit sysfs
     attributes
   - device connection cleanups and fixes
   - devm helpers for a few functions
   - NOIO allocations for when devices are being removed
   - minor cleanups and fixes
  All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
  regmap: debugfs: use semicolons rather than commas to separate statements
  platform/x86: intel_pmc_core: do not create a static struct device
  drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
  drivers core: Use sysfs_emit for shared_cpu_map_show and shared_cpu_list_show
  mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
  drivers core: Miscellaneous changes for sysfs_emit
  drivers core: Reindent a couple uses around sysfs_emit
  drivers core: Remove strcat uses around sysfs_emit and neaten
  drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
  sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
  dyndbg: use keyword, arg varnames for query term pairs
  driver core: force NOIO allocations during unplug
  platform_device: switch to simpler IDA interface
  driver core: platform: Document return type of more functions
  Revert "driver core: Annotate dev_err_probe() with __must_check"
  Revert "test_firmware: Test platform fw loading on non-EFI systems"
  iio: adc: xilinx-xadc: use devm_krealloc()
  hwmon: pmbus: use more devres helpers
  devres: provide devm_krealloc()
  syscore: Use pm_pr_dbg() for syscore_{suspend,resume}()
  ... | ||
|  Mike Kravetz | 0bf7b64e6e | hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share As a debugging aid, huge_pmd_share should make sure i_mmap_rwsem is held
if necessary.  To clarify the 'if necessary', expand the comment block at
the beginning of huge_pmd_share.
No functional change.  The added i_mmap_assert_locked() call is only
enabled if CONFIG_LOCKDEP.
Ideally, this should have been included with commit  | ||
|  Wei Yang | 6664bfc8e9 | mm/hugetlb: take the free hpage during the iteration directly Function dequeue_huge_page_node_exact() iterates the free list and return the first valid free hpage. Instead of break and check the loop variant, we could return in the loop directly. This could reduce some redundant check. [mike.kravetz@oracle.com: points out a logic error] [richard.weiyang@linux.alibaba.com: v4] Link: https://lkml.kernel.org/r/20200901014636.29737-8-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200831022351.20916-8-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 2f37511cb6 | mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page set_hugetlb_cgroup_[rsvd] just manipulate page local data, which is not necessary to be protected by hugetlb_lock. Let's take this out. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200831022351.20916-7-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 15a8d68e9d | mm/hugetlb: a page from buddy is not on any list The page allocated from buddy is not on any list, so just use list_add() is enough. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200831022351.20916-6-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 972a3da355 | mm/hugetlb: count file_region to be added when regions_needed != NULL There are only two cases of function add_reservation_in_range()
    * count file_region and return the number in regions_needed
    * do the real list operation without counting
This means it is not necessary to have two parameters to classify these
two cases.
Just use regions_needed to separate them.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: https://lkml.kernel.org/r/20200831022351.20916-5-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | d3ec7b6e09 | mm/hugetlb: use list_splice to merge two list at once Instead of add allocated file_region one by one to region_cache, we could use list_splice to merge two list at once. Also we know the number of entries in the list, increase the number directly. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200831022351.20916-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | a1ddc2e825 | mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache() We are sure to get a valid file_region, otherwise the VM_BUG_ON(resv->region_cache_count <= 0) at the very beginning would be triggered. Let's remove the redundant one. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200831022351.20916-3-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 7db5e7b67e | mm/hugetlb: not necessary to coalesce regions recursively Patch series "mm/hugetlb: code refine and simplification", v4. Following are some cleanups for hugetlb. Simple testing with tools/testing/selftests/vm/map_hugetlb passes. This patch (of 7): Per my understanding, we keep the regions ordered and would always coalesce regions properly. So the task to keep this property is just to coalesce its neighbour. Let's simplify this. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20200901014636.29737-1-richard.weiyang@linux.alibaba.com Link: https://lkml.kernel.org/r/20200831022351.20916-1-richard.weiyang@linux.alibaba.com Link: https://lkml.kernel.org/r/20200831022351.20916-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baoquan He | d79d176a30 | mm/hugetlb.c: remove the unnecessary non_swap_entry() If a swap entry tests positive for either is_[migration|hwpoison]_entry(), then its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and SWP_HWPOISON. All these types >= MAX_SWAPFILES, exactly what is asserted with non_swap_entry(). So the checking non_swap_entry() in is_hugetlb_entry_migration() and is_hugetlb_entry_hwpoisoned() is redundant. Let's remove it to optimize code. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lkml.kernel.org/r/20200723032248.24772-3-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baoquan He | 3e5c36007e | mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool Patch series "mm/hugetlb: Small cleanup and improvement", v2. This patch (of 3): Just like its neighbour is_hugetlb_entry_migration() has done. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lkml.kernel.org/r/20200723032248.24772-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20200723032248.24772-2-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Joe Perches | 7981593bf0 | mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit Convert the unbound sprintf in hugetlb_report_node_meminfo to use sysfs_emit_at so that no possible overrun of a PAGE_SIZE buf can occur. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Link: https://lore.kernel.org/r/894b351b82da6013cde7f36ff4b5493cd0ec30d0.1600285923.git.joe@perches.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | ||
|  Christoph Hellwig | 8c1c6c7588 | Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into dma-mapping-for-next Pull in the latest 5.9 tree for the commit to revert the V4L2_FLAG_MEMORY_NON_CONSISTENT uapi addition. | ||
|  Muchun Song | 17743798d8 | mm/hugetlb: fix a race between hugetlb sysctl handlers There is a race between the assignment of `table->data` and write value
to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
the other thread.
  CPU0:                                 CPU1:
                                        proc_sys_write
  hugetlb_sysctl_handler                  proc_sys_call_handler
  hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
    table->data = &tmp;                       hugetlb_sysctl_handler_common
                                                table->data = &tmp;
      proc_doulongvec_minmax
        do_proc_doulongvec_minmax           sysctl_head_finish
          __do_proc_doulongvec_minmax         unuse_table
            i = table->data;
            *i = val;  // corrupt CPU1's stack
Fix this by duplicating the `table`, and only update the duplicate of
it.  And introduce a helper of proc_hugetlb_doulongvec_minmax() to
simplify the code.
The following oops was seen:
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
     ? set_max_huge_pages+0x3da/0x4f0
     ? alloc_pool_huge_page+0x150/0x150
     ? proc_doulongvec_minmax+0x46/0x60
     ? hugetlb_sysctl_handler_common+0x1c7/0x200
     ? nr_hugepages_store+0x20/0x20
     ? copy_fd_bitmaps+0x170/0x170
     ? hugetlb_sysctl_handler+0x1e/0x20
     ? proc_sys_call_handler+0x2f1/0x300
     ? unregister_sysctl_table+0xb0/0xb0
     ? __fd_install+0x78/0x100
     ? proc_sys_write+0x14/0x20
     ? __vfs_write+0x4d/0x90
     ? vfs_write+0xef/0x240
     ? ksys_write+0xc0/0x160
     ? __ia32_sys_read+0x50/0x50
     ? __close_fd+0x129/0x150
     ? __x64_sys_write+0x43/0x50
     ? do_syscall_64+0x6c/0x200
     ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes:  | ||
|  Li Xinhai | 953f064aa6 | mm/hugetlb: try preferred node first when alloc gigantic page from cma Since commit | ||
|  Barry Song | 2281f797f5 | mm: cma: use CMA_MAX_NAME to define the length of cma name array CMA_MAX_NAME should be visible to CMA's users as they might need it to set the name of CMA areas and avoid hardcoding the size locally. So this patch moves CMA_MAX_NAME from local header file to include/linux header file and removes the hardcode in both hugetlb.c and contiguous.c. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Christoph Hellwig <hch@lst.de> | ||
|  Joonsoo Kim | bbe88753bd | mm/hugetlb: make hugetlb migration callback CMA aware new_non_cma_page() in gup.c requires to allocate the new page that is not on the CMA area. new_non_cma_page() implements it by using allocation scope APIs. However, there is a work-around for hugetlb. Normal hugetlb page allocation API for migration is alloc_huge_page_nodemask(). It consists of two steps. First is dequeing from the pool. Second is, if there is no available page on the queue, allocating by using the page allocator. new_non_cma_page() can't use this API since first step (deque) isn't aware of scope API to exclude CMA area. So, new_non_cma_page() exports hugetlb internal function for the second step, alloc_migrate_huge_page(), to global scope and uses it directly. This is suboptimal since hugetlb pages on the queue cannot be utilized. This patch tries to fix this situation by making the deque function on hugetlb CMA aware. In the deque function, CMA memory is skipped if PF_MEMALLOC_NOCMA flag is found. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Joonsoo Kim | d92bbc2719 | mm/hugetlb: unify migration callbacks There is no difference between two migration callback functions, alloc_huge_page_node() and alloc_huge_page_nodemask(), except __GFP_THISNODE handling. It's redundant to have two almost similar functions in order to handle this flag. So, this patch tries to remove one by introducing a new argument, gfp_mask, to alloc_huge_page_nodemask(). After introducing gfp_mask argument, it's caller's job to provide correct gfp_mask. So, every callsites for alloc_huge_page_nodemask() are changed to provide gfp_mask. Note that it's safe to remove a node id check in alloc_huge_page_node() since there is no caller passing NUMA_NO_NODE as a node id. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Randy Dunlap | 9e7ee40097 | mm/hugetlb.c: delete duplicated words Drop the repeated word "the" in two places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-5-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Barry Song | 29d0f41d23 | mm: hugetlb: fix the name of hugetlb CMA Once we enable CMA_DEBUGFS, we will get the below errors: directory 'cma-hugetlb' with parent 'cma' already present. We should have different names for different CMA areas. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 34ae204f18 | hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem Commit | ||
|  Muchun Song | 8ca39e6874 | mm/hugetlb: add mempolicy check in the reservation routine In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements.  But we ignore the mempolicy of MPOL_BIND
case.  If someone mmap hugetlb succeeds, but the subsequent memory
allocation may fail due to mempolicy restrictions and receives the SIGBUS
signal.  This can be reproduced by the follow steps.
 1) Compile the test case.
    cd tools/testing/selftests/vm/
    gcc map_hugetlb.c -o map_hugetlb
 2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
    system. Each node will pre-allocate one huge page.
    echo 2 > /proc/sys/vm/nr_hugepages
 3) Run test case(mmap 4MB). We receive the SIGBUS signal.
    numactl --membind=3D0 ./map_hugetlb 4
With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".
[akpm@linux-foundation.org: include sched.h for `current']
Reported-by: Jianchao Guo <guojianchao@bytedance.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Baoquan He <bhe@redhat.com>
Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 75802ca663 | mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible This is found by code observation only.
Firstly, the worst case scenario should assume the whole range was covered
by pmd sharing.  The old algorithm might not work as expected for ranges
like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the
expected range should be (0, 2g).
Since at it, remove the loop since it should not be required.  With that,
the new code should be faster too when the invalidating range is huge.
Mike said:
: With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
: adjust to (0, 1g+2m) which is incorrect.
:
: We should cc stable.  The original reason for adjusting the range was to
: prevent data corruption (getting wrong page).  Since the range is not
: always adjusted correctly, the potential for corruption still exists.
:
: However, I am fairly confident that adjust_range_if_pmd_sharing_possible
: is only gong to be called in two cases:
:
: 1) for a single page
: 2) for range == entire vma
:
: In those cases, the current code should produce the correct results.
:
: To be safe, let's just cc stable.
Fixes:  | ||
|  Mike Rapoport | ca15ca406f | mm: remove unneeded includes of <asm/pgalloc.h> Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table.  These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory.  Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                $(grep -L -w -f /tmp/xx \
                        $(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Barry Song | dbda8feadf | mm/hugetlb: avoid hardcoding while checking if cma is enabled hugetlb_cma[0] can be NULL due to various reasons, for example, node0
has no memory.  so NULL hugetlb_cma[0] doesn't necessarily mean cma is
not enabled.  gigantic pages might have been reserved on other nodes.
This patch fixes possible double reservation and CMA leak.
[akpm@linux-foundation.org: fix CONFIG_CMA=n warning]
[sfr@canb.auug.org.au: better checks before using hugetlb_cma]
  Link: http://lkml.kernel.org/r/20200721205716.6dbaa56b@canb.auug.org.au
Fixes:  | ||
|  Mike Kravetz | 1139d336ff | mm/hugetlb.c: fix pages per hugetlb calculation The routine hpage_nr_pages() was incorrectly used to calculate the number
of base pages in a hugetlb page.  hpage_nr_pages is designed to be called
for THP pages and will return HPAGE_PMD_NR for hugetlb pages of any size.
Due to the context in which hpage_nr_pages was called, it is unlikely to
produce a user visible error.  The routine with the incorrect call is only
exercised in the case of hugetlb memory error or migration.  In addition,
this would need to be on an architecture which supports huge page sizes
less than PMD_SIZE.  And, the vma containing the huge page would also need
to smaller than PMD_SIZE.
Fixes:  | ||
|  Michel Lespinasse | c1e8d7c6a7 | mmap locking API: convert mmap_sem comments Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | e31cf2f4ca | mm: don't include asm/pgtable.h if linux/mm.h is already included Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Ethon Paul | 7c8de35889 | mm/hugetlb: fix a typos in comments [akpm@linux-foundation.org: coding style fixes] Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200410163714.14085-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Linus Torvalds | ee01c4d72a | Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ... | ||
|  Li Xinhai | 8ac0b81ab1 | mm/hugetlb: avoid unnecessary check on pud and pmd entry in huge_pte_offset When huge_pte_offset() is called, the parameter sz can only be PUD_SIZE or PMD_SIZE. If sz is PUD_SIZE and code can reach pud, then *pud must be none, or normal hugetlb entry, or non-present (migration or hwpoisoned) hugetlb entry, and we can directly return pud. When sz is PMD_SIZE, pud must be none or present, and if code can reach pmd, we can directly return pmd. So after this patch the code is simplified by first check on the parameter sz, and avoid unnecessary checks in current code. Same semantics of existing code is maintained. More details about relevant commits: commit | ||
|  Mike Kravetz | c2833a5bf7 | hugetlbfs: fix changes to command line processing Previously, a check for hugepages_supported was added before processing hugetlb command line parameters. On some architectures such as powerpc, hugepages_supported() is not set to true until after command line processing. Therefore, no hugetlb command line parameters would be accepted. Remove the additional checks for hugepages_supported. In hugetlb_init, print a warning if !hugepages_supported and command line parameters were specified. Reported-by: Sandipan Das <sandipan.osd@gmail.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/b1f04f9f-fa46-c2a0-7693-4a0679d2a1ee@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 282f421438 | hugetlbfs: clean up command line processing With all hugetlb page processing done in a single file clean up code.
- Make code match desired semantics
  - Update documentation with semantics
- Make all warnings and errors messages start with 'HugeTLB:'.
- Consistently name command line parsing routines.
- Warn if !hugepages_supported() and command line parameters have
  been specified.
- Add comments to code
  - Describe some of the subtle interactions
  - Describe semantics of command line arguments
This patch also fixes issues with implicitly setting the number of
gigantic huge pages to preallocate.  Previously on X86 command line,
        hugepages=2 default_hugepagesz=1G
would result in zero 1G pages being preallocated and,
        # grep HugePages_Total /proc/meminfo
        HugePages_Total:       0
        # sysctl -a | grep nr_hugepages
        vm.nr_hugepages = 2
        vm.nr_hugepages_mempolicy = 2
        # cat /proc/sys/vm/nr_hugepages
        2
After this patch 2 gigantic pages will be preallocated and all the proc,
sysfs, sysctl and meminfo files will accurately reflect this.
To address the issue with gigantic pages, a small change in behavior was
made to command line processing.  Previously the command line,
        hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256
would result in the allocation of 256 2M huge pages.  The value 128 would
be ignored without any warning.  After this patch, 128 2M pages will be
allocated and a warning message will be displayed indicating the value of
256 is ignored.  This change in behavior is required because allocation of
implicitly specified gigantic pages must be done when the
default_hugepagesz= is encountered for gigantic pages.  Previously the
code waited until later in the boot process (hugetlb_init), to allocate
pages of default size.  However the bootmem allocator required for
gigantic allocations is not available at this time.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 3823783088 | hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate hugetlb_add_hstate() prints a warning if the hstate already exists.  This
was originally done as part of kernel command line parsing.  If
'hugepagesz=' was specified more than once, the warning
	pr_warn("hugepagesz= specified twice, ignoring\n");
would be printed.
Some architectures want to enable all huge page sizes.  They would call
hugetlb_add_hstate for all supported sizes.  However, this was done after
command line processing and as a result hstates could have already been
created for some sizes.  To make sure no warning were printed, there would
often be code like:
	if (!size_to_hstate(size)
		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT)
The only time we want to print the warning is as the result of command
line processing.  So, remove the warning from hugetlb_add_hstate and add
it to the single arch independent routine processing "hugepagesz=".  After
this, calls to size_to_hstate() in arch specific code can be removed and
hugetlb_add_hstate can be called without worrying about warning messages.
[mike.kravetz@oracle.com: fix hugetlb initialization]
  Link: http://lkml.kernel.org/r/4c36c6ce-3774-78fa-abc4-b7346bf24348@oracle.com
  Link: http://lkml.kernel.org/r/20200428205614.246260-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Mina Almasry <almasrymina@google.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-4-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200428205614.246260-4-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 359f25443a | hugetlbfs: move hugepagesz= parsing to arch independent code Now that architectures provide arch_hugetlb_valid_size(), parsing of "hugepagesz=" can be done in architecture independent code. Create a single routine to handle hugepagesz= parsing and remove all arch specific routines. We can also remove the interface hugetlb_bad_size() as this is no longer used outside arch independent code. This also provides consistent behavior of hugetlbfs command line options. The hugepagesz= option should only be specified once for a specific size, but some architectures allow multiple instances. This appears to be more of an oversight when code was added by some architectures to set up ALL huge pages sizes. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Mina Almasry <almasrymina@google.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | ae94da8981 | hugetlbfs: add arch_hugetlb_valid_size Patch series "Clean up hugetlb boot command line processing", v4. Longpeng(Mike) reported a weird message from hugetlb command line processing and proposed a solution [1]. While the proposed patch does address the specific issue, there are other related issues in command line processing. As hugetlbfs evolved, updates to command line processing have been made to meet immediate needs and not necessarily in a coordinated manner. The result is that some processing is done in arch specific code, some is done in arch independent code and coordination is problematic. Semantics can vary between architectures. The patch series does the following: - Define arch specific arch_hugetlb_valid_size routine used to validate passed huge page sizes. - Move hugepagesz= command line parsing out of arch specific code and into an arch independent routine. - Clean up command line processing to follow desired semantics and document those semantics. [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com This patch (of 3): The architecture independent routine hugetlb_default_setup sets up the default huge pages size. It has no way to verify if the passed value is valid, so it accepts it and attempts to validate at a later time. This requires undocumented cooperation between the arch specific and arch independent code. For architectures that support more than one huge page size, provide a routine arch_hugetlb_valid_size to validate a huge page size. hugetlb_default_setup can use this to validate passed values. arch_hugetlb_valid_size will also be used in a subsequent patch to move processing of the "hugepagesz=" in arch specific code to a common routine in arch independent code. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Daniel Borkmann | 0b54142e4b | Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/ | ||
|  Christoph Hellwig | 32927393dc | sysctl: pass kernel pointers to ->proc_handler Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> | ||
|  Longpeng | 3c1d7e6ccb | mm/hugetlb: fix a addressing exception caused by huge_pte_offset Our machine encountered a panic(addressing exception) after run for a
long time and the calltrace is:
    RIP: hugetlb_fault+0x307/0xbe0
    RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
    RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
    RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
    RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
    R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
    R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
    FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
      follow_hugetlb_page+0x175/0x540
      __get_user_pages+0x2a0/0x7e0
      __get_user_pages_unlocked+0x15d/0x210
      __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
      try_async_pf+0x6e/0x2a0 [kvm]
      tdp_page_fault+0x151/0x2d0 [kvm]
     ...
      kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
      kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
      do_vfs_ioctl+0x3f0/0x540
      SyS_ioctl+0xa1/0xc0
      system_call_fastpath+0x22/0x27
For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
may return a wrong 'pmdp' if there is a race.  Please look at the
following code snippet:
    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;
    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...
The following sequence would trigger this bug:
 - CPU0: sz = PUD_SIZE and *pud = 0 , continue
 - CPU0: "pud_huge(*pud)" is false
 - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
 - CPU0: "!pud_present(*pud)" is false, continue
 - CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
However, we want CPU0 to return NULL or pudp in this case.
We must make sure there is exactly one dereference of pud and pmd.
Signed-off-by: Longpeng <longpeng2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Roman Gushchin | cf11e85fc0 | mm: hugetlb: optionally allocate gigantic hugepages using cma Commit  | ||
|  Jules Irenge | 1b2a1e7bb9 | mm/hugetlb: add missing annotation for gather_surplus_pages() Sparse reports a warning at gather_surplus_pages() warning: context imbalance in hugetlb_cow() - unexpected unlock The root cause is the missing annotation at gather_surplus_pages() Add the missing __must_hold(&hugetlb_lock) Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Link: http://lkml.kernel.org/r/20200214204741.94112-7-jbi.octave@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Vlastimil Babka | d4af73e3f8 | mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() Commit  | ||
|  Mateusz Nosek | 353b2de42e | mm/hugetlb.c: clean code by removing unnecessary initialization Previously variable 'check_addr' was initialized, but was not read later before reassigning. So the initialization can be removed. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | a9b3f86740 | hugetlb: support file_region coalescing again An earlier patch in this series disabled file_region coalescing in order to hang the hugetlb_cgroup uncharge info on the file_region entries. This patch re-adds support for coalescing of file_region entries. Essentially everytime we add an entry, we call a recursive function that tries to coalesce the added region with the regions next to it. The worst case call depth for this function is 3: one to coalesce with the region next to it, one to coalesce to the region prev, and one to reach the base case. This is an important performance optimization as private mappings add their entries page by page, and we could incur big performance costs for large mappings with lots of file_region entries in their resv_map. [almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs] Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com [almasrymina@google.com: remove check_coalesce_bug debug code] Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 08cf9faf75 | hugetlb_cgroup: support noreserve mappings Support MAP_NORESERVE accounting as part of the new counter. For each hugepage allocation, at allocation time we check if there is a reservation for this allocation or not. If there is a reservation for this allocation, then this allocation was charged at reservation time, and we don't re-account it. If there is no reserevation for this allocation, we charge the appropriate hugetlb_cgroup. The hugetlb_cgroup to uncharge for this allocation is stored in page[3].private. We use new APIs added in an earlier patch to set this pointer. Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 075a61d07a | hugetlb_cgroup: add accounting for shared mappings For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter. After a call to region_chg, we charge the approprate hugetlb_cgroup, and if successful, we pass on the hugetlb_cgroup info to a follow up region_add call. When a file_region entry is added to the resv_map via region_add, we put the pointer to that cgroup in file_region->reservation_counter. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation. On region_del, which is when the hugetlb memory is unreserved, we also uncharge the file_region->reservation_counter. [akpm@linux-foundation.org: forward declare struct file_region] Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 0db9d74ed8 | hugetlb: disable region_add file_region coalescing A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch. Behavior change: Say a resv_map exists like this [0->1], [2->3], and [5->6]. Then a region_chg/add call comes in region_chg/add(f=0, t=5). Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6]. Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly. We've also modified the assumption that region_add after region_chg never fails. region_chg now pre-allocates at least 1 region for region_add. If region_add needs more regions than region_chg has allocated for it, then it may fail. [almasrymina@google.com: fix file_region entry allocations] Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | e9fe92ae0c | hugetlb_cgroup: add reservation accounting for private mappings Normally the pointer to the cgroup to uncharge hangs off the struct page, and gets queried when it's time to free the page. With hugetlb_cgroup reservations, this is not possible. Because it's possible for a page to be reserved by one task and actually faulted in by another task. The best place to put the hugetlb_cgroup pointer to uncharge for reservations is in the resv_map. But, because the resv_map has different semantics for private and shared mappings, the code patch to charge/uncharge shared and private mappings is different. This patch implements charging and uncharging for private mappings. For private mappings, the counter to uncharge is in resv_map->reservation_counter. On initializing the resv_map this is set to NULL. On reservation of a region in private mapping, the tasks hugetlb_cgroup is charged and the hugetlb_cgroup is placed is resv_map->reservation_counter. On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter. [akpm@linux-foundation.org: forward declare struct resv_map] Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 1adc4d419a | hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter. Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter. Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline. Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 87bf91d39b | hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race hugetlbfs page faults can race with truncate and hole punch operations. Current code in the page fault path attempts to handle this by 'backing out' operations if we encounter the race. One obvious omission in the current code is removing a page newly added to the page cache. This is pretty straight forward to address, but there is a more subtle and difficult issue of backing out hugetlb reservations. To handle this correctly, the 'reservation state' before page allocation needs to be noted so that it can be properly backed out. There are four distinct possibilities for reservation state: shared/reserved, shared/no-resv, private/reserved and private/no-resv. Backing out a reservation may require memory allocation which could fail so that needs to be taken into account as well. Instead of writing the required complicated code for this rare occurrence, just eliminate the race. i_mmap_rwsem is now held in read mode for the duration of page fault processing. Hold i_mmap_rwsem in write mode when modifying i_size. In this way, truncation can not proceed when page faults are being processed. In addition, i_size will not change during fault processing so a single check can be made to ensure faults are not beyond (proposed) end of file. Faults can still race with hole punch, but that race is handled by existing code and the use of hugetlb_fault_mutex. With this modification, checks for races with truncation in the page fault path can be simplified and removed. remove_inode_hugepages no longer needs to take hugetlb_fault_mutex in the case of truncation. Comments are expanded to explain reasoning behind locking. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | c0d0381ade | hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/ [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/ [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/ [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 71335f37c5 | mm/gup: allow to react to fatal signals The existing gup code does not react to the fatal signals in many code paths. For example, in one retry path of gup we're still using down_read() rather than down_read_killable(). Also, when doing page faults we don't pass in FAULT_FLAG_KILLABLE as well, which means that within the faulting process we'll wait in non-killable way as well. These were spotted by Linus during the code review of some other patches. Let's allow the gup code to react to fatal signals to improve the responsiveness of threads when during gup and being killed. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 4426e945df | mm/gup: allow VM_FAULT_RETRY for multiple times This is the gup counterpart of the change that allows the VM_FAULT_RETRY to happen for more than once. One thing to mention is that we must check the fatal signal here before retry because the GUP can be interrupted by that, otherwise we can loop forever. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Xu | 4f6da93411 | mm/gup: rename "nonblocking" to "locked" where proper Patch series "mm: Page fault enhancements", v6. This series contains cleanups and enhancements to current page fault logic. The whole idea comes from the discussion between Andrea and Linus on the bug reported by syzbot here: https://lkml.org/lkml/2017/11/2/833 Basically it does two things: (a) Allows the page fault logic to be more interactive on not only SIGKILL, but also the rest of userspace signals, and, (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more than once. For (a): with the changes we should be able to react faster when page faults are working in parallel with userspace signals like SIGSTOP and SIGCONT (and more), and with that we can remove the buggy part in userfaultfd and benefit the whole page fault mechanism on faster signal processing to reach the userspace. For (b), we should be able to allow the page fault handler to loop for even more than twice. Some context: for now since we have FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the same interrupt context, however never more than twice. This can be not only a potential cleanup to remove this assumption since AFAIU the code itself doesn't really have this twice-only limitation (though that should be a protective approach in the past), at the same time it'll greatly simplify future works like userfaultfd write-protect where it's possible to retry for more than twice (please have a look at [1] below for a possible user that might require the page fault to be handled for a third time; if we can remove the retry limitation we can simply drop that patch and those complexity). This patch (of 16): There's plenty of places around __get_user_pages() that has a parameter "nonblocking" which does not really mean that "it won't block" (because it can really block) but instead it shows whether the mmap_sem is released by up_read() during the page fault handling mostly when VM_FAULT_RETRY is returned. We have the correct naming in e.g. get_user_pages_locked() or get_user_pages_remote() as "locked", however there're still many places that are using the "nonblocking" as name. Renaming the places to "locked" where proper to better suite the functionality of the variable. While at it, fixing up some of the comments accordingly. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  John Hubbard | 47e29d32af | mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  John Hubbard | 3faa52c03f | mm/gup: track FOLL_PIN pages Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section.  (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
   bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[jhubbard@nvidia.com: add kerneldoc]
  Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
  Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Waiman Long | c77c0a8ac4 | mm/hugetlb: defer freeing of huge pages if in non-task context The following lockdep splat was observed when a certain hugetlbfs test was run: ================================ WARNING: inconsistent lock state 4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - - -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes: ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0 {SOFTIRQ-ON-W} state was registered at: lock_acquire+0x14f/0x3b0 _raw_spin_lock+0x30/0x70 __nr_hugepages_store_common+0x11b/0xb30 hugetlb_sysctl_handler_common+0x209/0x2d0 proc_sys_call_handler+0x37f/0x450 vfs_write+0x157/0x460 ksys_write+0xb8/0x170 do_syscall_64+0xa5/0x4d0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf irq event stamp: 691296 hardirqs last enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60 hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81 softirqs last enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0 softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(hugetlb_lock); <Interrupt> lock(hugetlb_lock); *** DEADLOCK *** : Call Trace: <IRQ> __lock_acquire+0x146b/0x48c0 lock_acquire+0x14f/0x3b0 _raw_spin_lock+0x30/0x70 free_huge_page+0x36f/0xaa0 bio_check_pages_dirty+0x2fc/0x5c0 clone_endio+0x17f/0x670 [dm_mod] blk_update_request+0x276/0xe50 scsi_end_request+0x7b/0x6a0 scsi_io_completion+0x1c6/0x1570 blk_done_softirq+0x22e/0x350 __do_softirq+0x23d/0xad8 irq_exit+0x23e/0x2b0 do_IRQ+0x11a/0x200 common_interrupt+0xf/0xf </IRQ> Both the hugetbl_lock and the subpool lock can be acquired in free_huge_page(). One way to solve the problem is to make both locks irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is held for a linear scan of ALL hugetlb pages during a cgroup reparentling operation. So it is just too long to have irq disabled unless we can break hugetbl_lock down into finer-grained locks with shorter lock hold times. Another alternative is to defer the freeing to a workqueue job. This patch implements the deferred freeing by adding a free_hpage_workfn() work function to do the actual freeing. The free_huge_page() call in a non-task context saves the page to be freed in the hpage_freelist linked list in a lockless manner using the llist APIs. The generic workqueue is used to process the work, but a dedicated workqueue can be used instead if it is desirable to have the huge page freed ASAP. Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of llist APIs which simplfy the code. Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Zhigang Lu | acbfb087e3 | mm/hugetlb: avoid looping to the same hugepage if !pages and !vmas When mmapping an existing hugetlbfs file with MAP_POPULATE, we find it
is very time consuming.  For example, mmapping a 128GB file takes about
50 milliseconds.  Sampling with perfevent shows it spends 99% time in
the same_page loop in follow_hugetlb_page().
samples: 205  of event 'cycles', Event count (approx.): 136686374
-  99.04%  test_mmap_huget  [kernel.kallsyms]  [k] follow_hugetlb_page
        follow_hugetlb_page
        __get_user_pages
        __mlock_vma_pages_range
        __mm_populate
        vm_mmap_pgoff
        sys_mmap_pgoff
        sys_mmap
        system_call_fastpath
        __mmap64
follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for
each hugepage, we run into the same_page loop for pages_per_huge_page()
times, but doing nothing.  With this change, it takes less then 1
millisecond to mmap a 128GB file in hugetlbfs.
Link: http://lkml.kernel.org/r/1567581712-5992-1-git-send-email-totty.lu@gmail.com
Signed-off-by: Zhigang Lu <tonnylu@tencent.com>
Reviewed-by: Haozhong Zhang <hzhongzhang@tencent.com>
Reviewed-by: Zongming Zhang <knightzhang@tencent.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 188b04a7d9 | hugetlb: remove unused hstate in hugetlb_fault_mutex_hash() The first parameter hstate in function hugetlb_fault_mutex_hash() is not used anymore. This patch removes it. [akpm@linux-foundation.org: various build fixes] [cai@lca.pw: fix a GCC compilation warning] Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | d75c6af9c8 | hugetlb: remove duplicated code Remove duplicated code between region_chg and region_add, and refactor it into a common function, add_reservation_in_range. This is mostly done because there is a follow up change in another series that disables region coalescing in region_add, and I want to make that change in one place only. It should improve maintainability anyway on its own. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190919200428.188797-3-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mina Almasry | 5c91195420 | hugetlb: region_chg provides only cache entry Current behavior is that region_chg provides both a cache entry in resv->region_cache, AND a placeholder entry in resv->regions. region_add first tries to use the placeholder, and if it finds that the placeholder has been deleted by a racing region_del call, it uses the cache entry. This behavior is completely unnecessary and is removed in this patch for a couple of reasons: 1. region_add needs to either find a cached file_region entry in resv->region_cache, or find an entry in resv->regions to expand. It does not need both. 2. region_chg adding a placeholder entry in resv->regions opens up a possible race with region_del, where region_chg adds a placeholder region in resv->regions, and this region is deleted by a racing call to region_del during region_chg execution or before region_add is called. Removing the race makes the code easier to reason about and maintain. In addition, a follow up patch in another series that disables region coalescing, which would be further complicated if the race with region_del exists. Link: http://lkml.kernel.org/r/20190919200428.188797-2-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Waiman Long | 930668c344 | hugetlbfs: take read_lock on i_mmap for PMD sharing A customer with large SMP systems (up to 16 sockets) with application that uses large amount of static hugepages (~500-1500GB) are experiencing random multisecond delays. These delays were caused by the long time it took to scan the VMA interval tree with mmap_sem held. The sharing of huge PMD does not require changes to the i_mmap at all. Therefore, we can just take the read lock and let other threads searching for the right VMA share it in parallel. Once the right VMA is found, either the PMD lock (2M huge page for x86-64) or the mm->page_table_lock will be acquired to perform the actual PMD sharing. Lock contention, if present, will happen in the spinlock. That is much better than contention in the rwsem where the time needed to scan the the interval tree is indeterminate. With this patch applied, the customer is seeing significant performance improvement over the unpatched kernel. Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 552546366a | hugetlbfs: hugetlb_fault_mutex_hash() cleanup A new clang diagnostic (-Wsizeof-array-div) warns about the calculation to determine the number of u32's in an array of unsigned longs. Suppress warning by adding parentheses. While looking at the above issue, noticed that the 'address' parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it from the definition and all callers. No functional change. Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Ilie Halip <ilie.halip@gmail.com> Cc: David Bolvansky <david.bolvansky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Anshuman Khandual | 5e27a2df03 | mm/page_alloc: add alloc_contig_pages() HugeTLB helper alloc_gigantic_page() implements fairly generic allocation method where it scans over various zones looking for a large contiguous pfn range before trying to allocate it with alloc_contig_range(). Other than deriving the requested order from 'struct hstate', there is nothing HugeTLB specific in there. This can be made available for general use to allocate contiguous memory which could not have been allocated through the buddy allocator. alloc_gigantic_page() has been split carving out actual allocation method which is then made available via new alloc_contig_pages() helper wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced with more generic term 'contig'. Allocated pages here should be freed with free_contig_range() or by calling __free_page() on each allocated page. Link: http://lkml.kernel.org/r/1571300646-32240-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  David Hildenbrand | f231fe4235 | hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. Let's make sure that we only consider online memory (managed by the buddy) that has initialized memmaps. ZONE_DEVICE is not applicable. page_zone() will call page_to_nid(), which will trigger VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be the case when an offline memory block (e.g., never onlined) is spanned by a zone. Note: As explained by Michal in [1], alloc_contig_range() will verify the range. So it boils down to the wrong access in this function. [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com Fixes: | ||
|  Mike Kravetz | f60858f9d3 | hugetlbfs: don't retry when pool page allocations start to fail When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the pages will be interleaved between all nodes of the system. If nodes are not equal, it is quite possible for one node to fill up before the others. When this happens, the code still attempts to allocate pages from the full node. This results in calls to direct reclaim and compaction which slow things down considerably. When allocating pool pages, note the state of the previous allocation for each node. If previous allocation failed, do not use the aggressive retry algorithm on successive attempts. The allocation will still succeed if there is memory available, but it will not try as hard to free up memory. Link: http://lkml.kernel.org/r/20190806014744.15446-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 4643d67e8c | hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
page migration and page fault.
If a hugetlb page can not be allocated to satisfy a page fault, the task
is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
mutex exists to prevent two tasks from trying to instantiate the same
page.  This protects against the situation where there is only one
hugetlb page, and both tasks would try to allocate.  Without the mutex,
one would fail and SIGBUS even though the other fault would be
successful.
There is a similar race between hugetlb page migration and fault.
Migration code will allocate a page for the target of the migration.  It
will then unmap the original page from all page tables.  It does this
unmap by first clearing the pte and then writing a migration entry.  The
page table lock is held for the duration of this clear and write
operation.  However, the beginnings of the hugetlb page fault code
optimistically checks the pte without taking the page table lock.  If
clear (as it can be during the migration unmap operation), a hugetlb
page allocation is attempted to satisfy the fault.  Note that the page
which will eventually satisfy this fault was already allocated by the
migration code.  However, the allocation within the fault path could
fail which would result in the task incorrectly being sent SIGBUS.
Ideally, we could take the hugetlb fault mutex in the migration code
when modifying the page tables.  However, locks must be taken in the
order of hugetlb fault mutex, page lock, page table lock.  This would
require significant rework of the migration code.  Instead, the issue is
addressed in the hugetlb fault code.  After failing to allocate a huge
page, take the page table lock and check for huge_pte_none before
returning an error.  This is the same check that must be made further in
the code even if page allocation is successful.
Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Naoya Horiguchi | faf53def3b | mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled.  That was caused by the
suboptimal code in current soft-offline code.  See the following part:
    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                            MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;
Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
has a chance that set_hwpoison_free_buddy_page() succeeds.  So that means
current code gives up offlining too early now.
dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.
This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.
[n-horiguchi@ah.jp.nec.com: v3]
  Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes:  | ||
|  Thomas Gleixner | 457c899653 | treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | ||
|  Mike Kravetz | f27a5136f7 | hugetlbfs: always use address space in inode for resv_map pointer Continuing discussion about  | ||
|  Jérôme Glisse | 7269f99993 | mm/mmu_notifier: use correct mmu_notifier events for each invalidation This updates each existing invalidation to use the correct mmu notifier event that represent what is happening to the CPU page table. See the patch which introduced the events to see the rational behind this. Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Jérôme Glisse | 6f4f13e8d9 | mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.
This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.
The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.
This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 1b426bac66 | hugetlb: use same fault hash key for shared and private mappings hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit | ||
|  Mike Kravetz | 0919e1b69a | hugetlbfs: on restore reserve error path retain subpool reservation When a huge page is allocated, PagePrivate() is set if the allocation consumed a reservation. When freeing a huge page, PagePrivate is checked. If set, it indicates the reservation should be restored. PagePrivate being set at free huge page time mostly happens on error paths. When huge page reservations are created, a check is made to determine if the mapping is associated with an explicitly mounted filesystem. If so, pages are also reserved within the filesystem. The default action when freeing a huge page is to decrement the usage count in any associated explicitly mounted filesystem. However, if the reservation is to be restored the reservation/use count within the filesystem should not be decrementd. Otherwise, a subsequent page allocation and free for the same mapping location will cause the file filesystem usage to go 'negative'. Filesystem Size Used Avail Use% Mounted on nodev 4.0G -4.0M 4.1G - /opt/hugepool To fix, when freeing a huge page do not adjust filesystem usage if PagePrivate() is set to indicate the reservation should be restored. I did not cc stable as the problem has been around since reserves were added to hugetlbfs and nobody has noticed. Link: http://lkml.kernel.org/r/20190328234704.27083-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Oscar Salvador | 2d0adf7e0d | mm/hugetlb: get rid of NODEMASK_ALLOC NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by first determining whether it should be allocated on the stack or dynamically, depending on NODES_SHIFT. Right now, it goes the dynamic path whenever the nodemask_t is above 32 bytes. Although we could bump it to a reasonable value, the largest a nodemask_t can get is 128 bytes, so since __nr_hugepages_store_common is called from a rather short stack we can just get rid of the NODEMASK_ALLOC call here. This reduces some code churn and complexity. Link: http://lkml.kernel.org/r/20190402133415.21983-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Alex Ghiti <alex@ghiti.fr> Cc: David Rientjes <rientjes@google.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | fd875dca7c | hugetlbfs: fix potential over/underflow setting node specific nr_hugepages The number of node specific huge pages can be set via a file such as: /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages When a node specific value is specified, the global number of huge pages must also be adjusted. This adjustment is calculated as the specified node specific value + (global value - current node value). If the node specific value provided by the user is large enough, this calculation could overflow an unsigned long leading to a smaller than expected number of huge pages. To fix, check the calculation for overflow. If overflow is detected, use ULONG_MAX as the requested value. This is inline with the user request to allocate as many huge pages as possible. It was also noticed that the above calculation was done outside the hugetlb_lock. Therefore, the values could be inconsistent and result in underflow. To fix, the calculation is moved within the routine set_max_huge_pages() where the lock is held. In addition, the code in __nr_hugepages_store_common() which tries to handle the case of not being able to allocate a node mask would likely result in incorrect behavior. Luckily, it is very unlikely we will ever take this path. If we do, simply return ENOMEM. Link: http://lkml.kernel.org/r/20190328220533.19884-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Alex Ghiti <alex@ghiti.fr> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Alexandre Ghiti | 4eb0716e86 | hugetlb: allow to free gigantic pages regardless of the configuration On systems without CONTIG_ALLOC activated but that support gigantic pages, boottime reserved gigantic pages can not be freed at all. This patch simply enables the possibility to hand back those pages to memory allocator. Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: David S. Miller <davem@davemloft.net> [sparc] Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Kai Shen | 2bf753e64b | mm/hugetlb.c: don't put_page in lock of hugetlb_lock spinlock recursion happened when do LTP test:
#!/bin/bash
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
The dtor returned by get_compound_page_dtor in __put_compound_page may be
the function of free_huge_page which will lock the hugetlb_lock, so don't
put_page in lock of hugetlb_lock.
 BUG: spinlock recursion on CPU#0, hugemmap05/1079
  lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, .owner_cpu: 0
 Call trace:
  dump_backtrace+0x0/0x198
  show_stack+0x24/0x30
  dump_stack+0xa4/0xcc
  spin_dump+0x84/0xa8
  do_raw_spin_lock+0xd0/0x108
  _raw_spin_lock+0x20/0x30
  free_huge_page+0x9c/0x260
  __put_compound_page+0x44/0x50
  __put_page+0x2c/0x60
  alloc_surplus_huge_page.constprop.19+0xf0/0x140
  hugetlb_acct_memory+0x104/0x378
  hugetlb_reserve_pages+0xe0/0x250
  hugetlbfs_file_mmap+0xc0/0x140
  mmap_region+0x3e8/0x5b0
  do_mmap+0x280/0x460
  vm_mmap_pgoff+0xf4/0x128
  ksys_mmap_pgoff+0xb4/0x258
  __arm64_sys_mmap+0x34/0x48
  el0_svc_common+0x78/0x130
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
Link: http://lkml.kernel.org/r/b8ade452-2d6b-0372-32c2-703644032b47@huawei.com
Fixes:  | ||
|  Linus Torvalds | 171c2bcbcb | Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull unified TLB flushing from Ingo Molnar:
 "This contains the generic mmu_gather feature from Peter Zijlstra,
  which is an all-arch unification of TLB flushing APIs, via the
  following (broad) steps:
   - enhance the <asm-generic/tlb.h> APIs to cover more arch details
   - convert most TLB flushing arch implementations to the generic
     <asm-generic/tlb.h> APIs.
   - remove leftovers of per arch implementations
  After this series every single architecture makes use of the unified
  TLB flushing APIs"
* 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm/resource: Use resource_overlaps() to simplify region_intersects()
  ia64/tlb: Eradicate tlb_migrate_finish() callback
  asm-generic/tlb: Remove tlb_table_flush()
  asm-generic/tlb: Remove tlb_flush_mmu_free()
  asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
  asm-generic/tlb: Remove arch_tlb*_mmu()
  s390/tlb: Convert to generic mmu_gather
  asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
  arch/tlb: Clean up simple architectures
  um/tlb: Convert to generic mmu_gather
  sh/tlb: Convert SH to generic mmu_gather
  ia64/tlb: Convert to generic mmu_gather
  arm/tlb: Convert to generic mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish()
  asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm()
  asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range()
  asm-generic/tlb, arch: Provide generic VIPT cache flush
  asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
  asm-generic/tlb: Provide a comment | ||
|  Linus Torvalds | 6b3a707736 | Merge branch 'page-refs' (page ref overflow) Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit | ||
|  Linus Torvalds | 8fde12ca79 | mm: prevent get_user_pages() from overflowing page refcount If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page. Reported-by: Jann Horn <jannh@google.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Peter Zijlstra | ed6a79352c | asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE Move the mmu_gather::page_size things into the generic code instead of PowerPC specific bits. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> | ||
|  Aneesh Kumar K.V | 9a4e9f3b2d | mm: update get_user_pages_longterm to migrate pages allocated from CMA region This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area. This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses CMA region for allocation of the hardware page table (hash page table) and not able to migrate pages out of CMA region results in page table allocation failures. One case where we hit this easy is when a guest using a VFIO passthrough device. VFIO locks all the guest's memory and if the guest memory is backed by CMA region, it becomes unmovable resulting in fragmenting the CMA and possibly preventing other guests from allocation a large enough hash page table. NOTE: We allocate the new page without using __GFP_THISNODE Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Aneesh Kumar K.V | 023bdd0023 | mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Follow the regular pte change protection sequence for hugetlb too. This allows the architectures to override the update sequence. Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Anshuman Khandual | 7ed2c31dab | mm/hugetlb: distinguish between migratability and movability Patch series "arm64/mm: Enable HugeTLB migration", v4.
This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series.  All permutations have been tested except for the 16GB.
           CONT PTE    PMD    CONT PMD    PUD
           --------    ---    --------    ---
  4K:         64K     2M         32M     1G
  16K:         2M    32M          1G
  64K:         2M   512M         16G
First the series adds migration support for PUD based huge pages.  It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.
The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.
This patch (of 5):
During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability.  Movability in itself is a distinct property
which should not be tied with migratability alone.
This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone.  At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.
Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Anshuman Khandual | 98fa15f34c | mm: replace all open encodings for NUMA_NO_NODE Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | cb6acd01e2 | hugetlbfs: fix races and page leaks during migration hugetlb pages should only be migrated if they are 'active'.  The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked.  Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code.  This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
then migrates the pages associated with the file from one node to
another.  When the program exits, huge page counts are as follows:
  node0
  1024    free_hugepages
  1024    nr_hugepages
  node1
  0       free_hugepages
  1024    nr_hugepages
  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
That is as expected.  2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem.  If the file is then removed, the counts become:
  node0
  1024    free_hugepages
  1024    nr_hugepages
  node1
  1024    free_hugepages
  1024    nr_hugepages
  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use.  The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field.  At migration
time, this information is not preserved.  To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration.  When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page().  A page could be migrated
while in this state.  However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page.  If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Andrea Arcangeli | 1ac25013fb | mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT hugetlb needs the same fix as faultin_nopage (which was applied in commit | ||
|  Mike Kravetz | ddeaab32a8 | hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" This reverts  | ||
|  Mike Kravetz | e7c5809779 | hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" This reverts  | ||
|  Davidlohr Bueso | fa45f1162f | mm/: remove caller signal_pending branch predictions This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | c86aa7bbfd | hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race.  One obvious omission in the
current code is removing a page newly added to the page cache.  This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations.  To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out.  There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv.  Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race.  i_mmap_rwsem is now held in read mode for the
duration of page fault processing.  Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as it can not longer happen.  Remove the dead code and
expand comments to explain reasoning.  Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz@oracle.com: incorporat suggestions from Kirill]
  Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes:  | ||
|  Mike Kravetz | b43a999005 | hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with the
  ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
  called.
[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes:  | ||
|  Jérôme Glisse | ac46d4f3c4 | mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Yongkai Wu | 8ace22bce8 | hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page() A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page) in free_huge_page(). Unfortunately, the page->mapping field was set to NULL before this test. This made it more difficult to determine the root cause of the problem. Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do trigger more information is present in the page struct. Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com Signed-off-by: Yongkai Wu <nic_w@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Andrea Arcangeli | 9e368259ad | userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails Patch series "userfaultfd shmem updates".
Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
lack of the VM_MAYWRITE check and the lack of i_size checks.
Then looking into the above we also fixed the MAP_PRIVATE case.
Hugh by source review also found a data loss source if UFFDIO_COPY is
used on shmem MAP_SHARED PROT_READ mappings (the production usages
incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
happen in those production usages like with QEMU).
The whole patchset is marked for stable.
We verified QEMU postcopy live migration with guest running on shmem
MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE.
Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
unconditionally invokes a punch hole if the guest mapping is filebacked
and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
for the anon backend).
This patch (of 5):
We internally used EFAULT to communicate with the caller, switch to
ENOENT, so EFAULT can be used as a non internal retval.
Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
Fixes:  | ||
|  Mike Kravetz | 5e41540c8a | hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:
	/*
	 * If page is mapped, it was faulted in after being
	 * unmapped in caller.  Unmap (again) now after taking
	 * the fault mutex.  The mutex will prevent faults
	 * until we finish removing the page.
	 *
	 * This race can only happen in the hole punch case.
	 * Getting here in a truncate operation is a bug.
	 */
	if (unlikely(page_mapped(page))) {
		BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:
 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.
 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.
 - Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.
 - Process B then calls mprotect() again to chage protections for the
   mapping back to their original value. pmd remains unshared.
 - Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
	dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
	/* If the pagetables are shared don't copy or take references */
	if (dst_pte == src_pte)
		continue;
Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page.  This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Mike Rapoport | 57c8a661d9 | mm: remove include/linux/bootmem.h Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | 97ad1087ef | memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of identical MEMBLOCK definitions. Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | eb31d559f1 | memblock: remove _virt from APIs returning virtual address The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 22146c3ce9 | hugetlbfs: dirty pages as they are added to pagecache Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.
This can be recreated as follows:
 fallocate -l 2M /dev/hugepages/foo
 echo 1 > /proc/sys/vm/drop_caches
 fallocate -l 2M /dev/hugepages/foo
 grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
 ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
To address this issue, dirty pages as they are added to pagecache.  This
can easily be reproduced with fallocate as shown above.  Read faulted
pages will eventually end up being marked dirty.  But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.
Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes:  | ||
|  Mike Kravetz | dff11abe28 | hugetlb: take PMD sharing into account when flushing tlb/caches When fixing an issue with PMD sharing and migration, it was discovered via code inspection that other callers of huge_pmd_unshare potentially have an issue with cache and tlb flushing. Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst case ranges for mmu notifiers. Ensure that this range is flushed if huge_pmd_unshare succeeds and unmaps a PUD_SUZE area. Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | ||
|  Mike Kravetz | 017b1660df | mm: migration: fix migration of huge PMD shared pages The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.
This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.
This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.
mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.
Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes:  | ||
|  Souptick Joarder | 2b74030354 | mm: Change return type int to vm_fault_t for fault handlers Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit  | ||
|  Naoya Horiguchi | 6bc9b56433 | mm: fix race on soft-offlining free huge pages Patch series "mm: soft-offline: fix race against page allocation".
Xishi recently reported the issue about race on reusing the target pages
of soft offlining.  Discussion and analysis showed that we need make
sure that setting PG_hwpoison should be done in the right place under
zone->lock for soft offline.  1/2 handles free hugepage's case, and 2/2
hanldes free buddy page's case.
This patch (of 2):
There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.
The process killing is caused by the following flow:
  CPU 0               CPU 1              CPU 2
  soft offline
    get_any_page
    // find the hugetlb is free
                      mmap a hugetlb file
                      page fault
                        ...
                          hugetlb_fault
                            hugetlb_no_page
                              alloc_huge_page
                              // succeed
      soft_offline_free_page
      // set hwpoison flag
                                         mmap the hugetlb file
                                         page fault
                                           ...
                                             hugetlb_fault
                                               hugetlb_no_page
                                                 find_lock_page
                                                   return VM_FAULT_HWPOISON
                                           mm_fault_error
                                             do_sigbus
                                             // kill the process
The hugetlb allocation failure comes from the following flow:
  CPU 0                          CPU 1
                                 mmap a hugetlb file
                                 // reserve all free page but don't fault-in
  soft offline
    get_any_page
    // find the hugetlb is free
      soft_offline_free_page
      // set hwpoison flag
        dissolve_free_huge_page
        // fail because all free hugepages are reserved
                                 page fault
                                   ...
                                     hugetlb_fault
                                       hugetlb_no_page
                                         alloc_huge_page
                                           ...
                                             dequeue_huge_page_node_exact
                                             // ignore hwpoisoned hugepage
                                             // and finally fail due to no-mem
The root cause of this is that current soft-offline code is written based
on an assumption that PageHWPoison flag should be set at first to avoid
accessing the corrupted data.  This makes sense for memory_failure() or
hard offline, but does not for soft offline because soft offline is about
corrected (not uncorrected) error and is safe from data lost.  This patch
changes soft offline semantics where it sets PageHWPoison flag only after
containment of the error page completes successfully.
Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Cannon Matthews | 330d6e489a | mm/hugetlb.c: don't zero 1GiB bootmem pages When using 1GiB pages during early boot, use the new memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it. Zeroing out hundreds or thousands of GiB in a single core memset() call is very slow, and can make early boot last upwards of 20-30 minutes on multi TiB machines. The memory does not need to be zero'd as the hugetlb pages are always zero'd on page fault. Tested: Booted with ~3800 1G pages, and it booted successfully in roughly the same amount of time as with 0, as opposed to the 25+ minutes it would take before. Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Kravetz | 40d18ebffb | mm/hugetlb: remove gigantic page support for HIGHMEM This reverts | ||
|  Huang Ying | 974e6d66b6 | mm, hugetlbfs: pass fault address to cow handler This is to take better advantage of the general huge page copying optimization. Where, the target subpage will be copied last to avoid the cache lines of target subpage to be evicted when copying other subpages. This works better if the address of the target subpage is available when copying huge page. So hugetlbfs page fault handlers are changed to pass that information to hugetlb_cow(). This will benefit workloads which don't access the begin of the hugetlbfs huge page after the page fault under heavy cache contention. Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Huang Ying | 5b7a1d4060 | mm, hugetlbfs: rename address to haddr in hugetlb_cow() To take better advantage of general huge page copying optimization, the target subpage address will be passed to hugetlb_cow(), then copy_user_huge_page(). So we will use both target subpage address and huge page size aligned address in hugetlb_cow(). To distinguish between them, "haddr" is used for huge page size aligned address to be consistent with Transparent Huge Page naming convention. Now, only huge page size aligned address is used in hugetlb_cow(), so the "address" is renamed to "haddr" in hugetlb_cow() in this patch. Next patch will use target subpage address in hugetlb_cow() too. The patch is just code cleanup without any functionality changes. Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Jane Chu | eec3636ad1 | ipc/shm.c add ->pagesize function to shm_vm_ops Commit | ||
|  Cannon Matthews | 520495fe96 | mm: hugetlb: yield when prepping struct pages When booting with very large numbers of gigantic (i.e. 1G) pages, the operations in the loop of gather_bootmem_prealloc, and specifically prep_compound_gigantic_page, takes a very long time, and can cause a softlockup if enough pages are requested at boot. For example booting with 3844 1G pages requires prepping (set_compound_head, init the count) over 1 billion 4K tail pages, which takes considerable time. Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to prevent this lockup. Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and no softlockup is reported, and the hugepages are reported as successfully setup. Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Kees Cook | 6da2ec5605 | treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:
        kmalloc(a * b, gfp)
with:
        kmalloc_array(a * b, gfp)
as well as handling cases of:
        kmalloc(a * b * c, gfp)
with:
        kmalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
        kmalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
        kmalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org> | ||
|  Huang Ying | 285b8dcaac | mm, hugetlbfs: pass fault address to no page handler This is to take better advantage of general huge page clearing
optimization (commit  | ||
|  Souptick Joarder | b3ec9f33ac | mm: change return type to vm_fault_t Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.
See commit  | ||
|  Jonathan Corbet | 24844fd339 | Merge branch 'mm-rst' into docs-next Mike Rapoport says: These patches convert files in Documentation/vm to ReST format, add an initial index and link it to the top level documentation. There are no contents changes in the documentation, except few spelling fixes. The relatively large diffstat stems from the indentation and paragraph wrapping changes. I've tried to keep the formatting as consistent as possible, but I could miss some places that needed markup and add some markup where it was not necessary. [jc: significant conflicts in vm/hmm.rst] | ||
|  Mike Rapoport | ad56b738c5 | docs/vm: rename documentation files to .rst Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> | ||
|  Dan Williams | 05ea88608d | mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and report the MMU page mapping size that is being enforced by
the vma.
Similar to commit  | ||
|  Dan Williams | 09135cc594 | mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize() Patch series "mm, smaps: MMUPageSize for device-dax", v3.
Similar to commit  | ||
|  Mike Kravetz | 63489f8e82 | hugetlbfs: check for pgoff value overflow A vma with vm_pgoff large enough to overflow a loff_t type when
converted to a byte offset can be passed via the remap_file_pages system
call.  The hugetlbfs mmap routine uses the byte offset to calculate
reservations and file size.
A sequence such as:
  mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
  remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
will result in the following when task exits/file closed,
  kernel BUG at mm/hugetlb.c:749!
  Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The overflowed pgoff value causes hugetlbfs to try to set up a mapping
with a negative range (end < start) that leaves invalid state which
causes the BUG.
The previous overflow fix to this code was incomplete and did not take
the remap_file_pages system call into account.
[mike.kravetz@oracle.com: v3]
  Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
[akpm@linux-foundation.org: include mmdebug.h]
[akpm@linux-foundation.org: fix -ve left shift count on sh]
Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
Fixes:  | ||
|  Michal Hocko | 4704dea36d | hugetlb: fix surplus pages accounting Dan Rue has noticed that libhugetlbfs test suite fails counter test:
  # mount_point="/mnt/hugetlb/"
  # echo 200 > /proc/sys/vm/nr_hugepages
  # mkdir -p "${mount_point}"
  # mount -t hugetlbfs hugetlbfs "${mount_point}"
  # export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
  # /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
  Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
  Base pool size: 0
  Clean...
  FAIL    Line 326: Bad HugePages_Total: expected 0, actual 1
The bug was bisected to  | ||
|  Michal Hocko | 389c8178d0 | hugetlb, mbind: fall back to default policy if vma is NULL Dan Carpenter has noticed that mbind migration callback (new_page) can get a NULL vma pointer and choke on it inside alloc_huge_page_vma which relies on the VMA to get the hstate. We used to BUG_ON this case but the BUG_+ON has been removed recently by "hugetlb, mempolicy: fix the mbind hugetlb migration". The proper way to handle this is to get the hstate from the migrated page and rely on huge_node (resp. get_vma_policy) do the right thing with null VMA. We are currently falling back to the default mempolicy in that case which is in line what THP path is doing here. Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | ebd6372358 | hugetlb, mempolicy: fix the mbind hugetlb migration do_mbind migration code relies on alloc_huge_page_noerr for hugetlb pages. alloc_huge_page_noerr uses alloc_huge_page which is a highlevel allocation function which has to take care of reserves, overcommit or hugetlb cgroup accounting. None of that is really required for the page migration because the new page is only temporal and either will replace the original page or it will be dropped. This is essentially as for other migration call paths and there shouldn't be any reason to handle mbind in a special way. The current implementation is even suboptimal because the migration might fail just because the hugetlb cgroup limit is reached, or the overcommit is saturated. Fix this by making mbind like other hugetlb migration paths. Add a new migration helper alloc_huge_page_vma as a wrapper around alloc_huge_page_nodemask with additional mempolicy handling. alloc_huge_page_noerr has no more users and it can go. Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | 0c397daea1 | mm, hugetlb: further simplify hugetlb allocation API Hugetlb allocator has several layer of allocation functions depending and the purpose of the allocation. There are two allocators depending on whether the page can be allocated from the page allocator or we need a contiguous allocator. This is currently opencoded in alloc_fresh_huge_page which is the only path that might allocate giga pages which require the later allocator. Create alloc_fresh_huge_page which hides this implementation detail and use it in all callers which hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page). This shouldn't introduce any funtional change because both migration and surplus allocators exlude giga pages explicitly. While we are at it let's do some renaming. The current scheme is not consistent and overly painfull to read and understand. Get rid of prefix underscores from most functions. There is no real reason to make names longer. * alloc_fresh_huge_page is the new layer to abstract underlying allocator * __hugetlb_alloc_buddy_huge_page becomes shorter and neater alloc_buddy_huge_page. * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put the new page directly to the pool * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code as it uses alloc_fresh_huge_page now * others lose their excessive prefix underscores to make names shorter [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()] Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | 9980d744a0 | mm, hugetlb: get rid of surplus page accounting tricks alloc_surplus_huge_page increases the pool size and the number of
surplus pages opportunistically to prevent from races with the pool size
change.  See commit  | ||
|  Michal Hocko | ab5ac90aec | mm, hugetlb: do not rely on overcommit limit during migration hugepage migration relies on __alloc_buddy_huge_page to get a new page. This has 2 main disadvantages. 1) it doesn't allow to migrate any huge page if the pool is used completely which is not an exceptional case as the pool is static and unused memory is just wasted. 2) it leads to a weird semantic when migration between two numa nodes might increase the pool size of the destination NUMA node while the page is in use. The issue is caused by per NUMA node surplus pages tracking (see free_huge_page). Address both issues by changing the way how we allocate and account pages allocated for migration. Those should temporal by definition. So we mark them that way (we will abuse page flags in the 3rd page) and update free_huge_page to free such pages to the page allocator. Page migration path then just transfers the temporal status from the new page to the old one which will be freed on the last reference. The global surplus count will never change during this path but we still have to be careful when migrating a per-node suprlus page. This is now handled in move_hugetlb_state which is called from the migration path and it copies the hugetlb specific page state and fixes up the accounting when needed Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better reflect its purpose. The new allocation routine for the migration path is __alloc_migrate_huge_page. The user visible effect of this patch is that migrated pages are really temporal and they travel between NUMA nodes as per the migration request: Before migration /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 After /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0 /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1 /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0 with the previous implementation, both nodes would have nr_hugepages:1 until the page is freed. Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | d9cc948f6f | mm, hugetlb: integrate giga hugetlb more naturally to the allocation path Gigantic hugetlb pages were ingrown to the hugetlb code as an alien specie with a lot of special casing. The allocation path is not an exception. Unnecessarily so to be honest. It is true that the underlying allocator is different but that is an implementation detail. This patch unifies the hugetlb allocation path that a prepares fresh pool pages. alloc_fresh_gigantic_page basically copies alloc_fresh_huge_page logic so we can move everything there. This will simplify set_max_huge_pages which doesn't have to care about what kind of huge page we allocate. Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | af0fb9df78 | mm, hugetlb: unify core page allocation accounting and initialization Patch series "mm, hugetlb: allocation API and migration improvements" Motivation: this is a follow up for [3] for the allocation API and [4] for the hugetlb migration. It wasn't really easy to split those into two separate patch series as they share some code. My primary motivation to touch this code is to make the gigantic pages migration working. The giga pages allocation code is just too fragile and hacked into the hugetlb code now. This series tries to move giga pages closer to the first class citizen. We are not there yet but having 5 patches is quite a lot already and it will already make the code much easier to follow. I will come with other changes on top after this sees some review. The first two patches should be trivial to review. The third patch changes the way how we migrate huge pages. Newly allocated pages are a subject of the overcommit check and they participate surplus accounting which is quite unfortunate as the changelog explains. This patch doesn't change anything wrt. giga pages. Patch #4 removes the surplus accounting hack from __alloc_surplus_huge_page. I hope I didn't miss anything there and a deeper review is really due there. Patch #5 finally unifies allocation paths and giga pages shouldn't be any special anymore. There is also some renaming going on as well. This patch (of 6): hugetlb allocator has two entry points to the page allocator - alloc_fresh_huge_page_node - __hugetlb_alloc_buddy_huge_page The two differ very subtly in two aspects. The first one doesn't care about HTLB_BUDDY_* stats and it doesn't initialize the huge page. prep_new_huge_page is not used because it not only initializes hugetlb specific stuff but because it also put_page and releases the page to the hugetlb pool which is not what is required in some contexts. This makes things more complicated than necessary. Simplify things by a) removing the page allocator entry point duplicity and only keep __hugetlb_alloc_buddy_huge_page and b) make prep_new_huge_page more reusable by removing the put_page which moves the page to the allocator pool. All current callers are updated to call put_page explicitly. Later patches will add new callers which won't need it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Michal Hocko | d6cb41cc44 | mm, hugetlb: remove hugepages_treat_as_movable sysctl hugepages_treat_as_movable has been introduced by  |