mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	
			
				
					
						
					
					901b277eaf
				
			
			
		
	
	
		
			69 Commits
		
	
	
	| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|  Aneesh Kumar K.V | 40135fc718 | mm/vmemmap: allow architectures to override how vmemmap optimization works Architectures like powerpc will like to use different page table allocators and mapping mechanisms to implement vmemmap optimization. Similar to vmemmap_populate allow architectures to implement vmemap_populate_compound_pages Link: https://lkml.kernel.org/r/20230724190759.483013-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Ryan Roberts | c33c794828 | mm: ptep_get() conversion Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Aneesh Kumar K.V | 87a7ae75d7 | mm/vmemmap/devdax: fix kernel crash when probing devdax devices commit | ||
|  Feiyang Chen | 2045a3b891 | mm/sparse-vmemmap: generalise vmemmap_populate_hugepages() Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can share its implementation. Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Acked-by: Will Deacon <will@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Feiyang Chen | 7b09f5af01 | LoongArch: add sparse memory vmemmap support Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available. Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn Signed-off-by: Min Zhou <zhoumin@loongson.cn> Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Muchun Song | 998a299788 | mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c When I first introduced vmemmap manipulation functions related to HugeTLB, I thought those functions may be reused by other modules (e.g. using similar approach to optimize vmemmap pages, unfortunately, the DAX used the same approach but does not use those functions). After two years, we didn't see any other users. So move those functions to hugetlb_vmemmap.c. Code movement without any functional change. Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Linus Torvalds | 6614a3c316 | - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.
  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.
  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.
  Summary:
   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
   - Some kmemleak fixes from Patrick Wang and Waiman Long
   - DAMON updates from SeongJae Park
   - memcg debug/visibility work from Roman Gushchin
   - vmalloc speedup from Uladzislau Rezki
   - more folio conversion work from Matthew Wilcox
   - enhancements for coherent device memory mapping from Alex Sierra
   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan
   - hugetlb optimizations from Mike Kravetz
   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.
   - mprotect soft-dirty checking has been improved by Peter Xu
   - Many other singleton patches all over the place"
 [ XFS merge from hell as per Darrick Wong in
   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ... | ||
|  Linus Torvalds | 97a77ab14f | EFI updates for v5.20 - Enable mirrored memory for arm64 - Fix up several abuses of the efivar API - Refactor the efivar API in preparation for moving the 'business logic' part of it into efivarfs - Enable ACPI PRM on arm64 -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmLhuDIACgkQw08iOZLZ jyS9IQv/Wc2nhjN50S3gfrL+68/el/hGdP/J0FK5BOOjNosG2t1ZNYZtSthXqpPH hRrTU2m6PpQUalRpFDyLiHkJvdBFQe4VmvrzBa3TIBIzyflLQPJzkWrqThPchV+B qi4lmCtTDNIEJmayewqx1wWA+QmUiyI5zJ8wrZp84LTctBPL75seVv0SB20nqai0 3/I73omB2RLVGpCpeWvb++vePXL8euFW3FEwCTM8hRboICjORTyIZPy8Y5os+3xT UgrIgVDOtn1Xwd4tK0qVwjOVA51east4Fcn3yGOrL40t+3SFm2jdpAJOO3UvyNPl vkbtjvXsIjt3/oxreKxXHLbamKyueWIfZRyCLsrg6wrr96oypPk6ID4iDCQoen/X Zf0VjM2vmvSd4YgnEIblOfSBxVg48cHJA4iVHVxFodNTrVnzGGFYPTmNKmJqo+Xn JeUILM7jlR4h/t0+cTTK3Busu24annTuuz5L5rjf4bUm6pPf4crb1yJaFWtGhlpa er233D6O =zI0R -----END PGP SIGNATURE----- Merge tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: - Enable mirrored memory for arm64 - Fix up several abuses of the efivar API - Refactor the efivar API in preparation for moving the 'business logic' part of it into efivarfs - Enable ACPI PRM on arm64 * tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits) ACPI: Move PRM config option under the main ACPI config ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64 ACPI: PRM: Change handler_addr type to void pointer efi: Simplify arch_efi_call_virt() macro drivers: fix typo in firmware/efi/memmap.c efi: vars: Drop __efivar_entry_iter() helper which is no longer used efi: vars: Use locking version to iterate over efivars linked lists efi: pstore: Omit efivars caching EFI varstore access layer efi: vars: Add thin wrapper around EFI get/set variable interface efi: vars: Don't drop lock in the middle of efivar_init() pstore: Add priv field to pstore_record for backend specific use Input: applespi - avoid efivars API and invoke EFI services directly selftests/kexec: remove broken EFI_VARS secure boot fallback check brcmfmac: Switch to appropriate helper to load EFI variable contents iwlwifi: Switch to proper EFI variable store interface media: atomisp_gmin_platform: stop abusing efivar API efi: efibc: avoid efivar API for setting variables efi: avoid efivars layer when loading SSDTs from variables efi: Correct comment on efi_memmap_alloc memblock: Disable mirror feature if kernelcore is not specified ... | ||
|  XueBing Chen | f673bd7c26 | mm: sparsemem: drop unexpected word 'a' in comments there is an unexpected word 'a' in the comments that need to be dropped Link: https://lkml.kernel.org/r/24fbdae3.c86.1819a0f31b9.Coremail.chenxuebing@jari.cn Signed-off-by: XueBing Chen <chenxuebing@jari.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Muchun Song | 39d35edee4 | mm: sparsemem: fix missing higher order allocation splitting Higher order allocations for vmemmap pages from buddy allocator must be
able to be treated as indepdenent small pages as they can be freed
individually by the caller.  There is no problem for higher order vmemmap
pages allocated at boot time since each individual small page will be
initialized at boot time.  However, it will be an issue for memory hotplug
case since those higher order vmemmap pages are allocated from buddy
allocator without initializing each individual small page's refcount.  The
system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled
if the vmemmap page is freed.
Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com
Fixes:  | ||
|  Mike Rapoport | ee65728e10 | docs: rename Documentation/vm to Documentation/mm so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn> | ||
|  Gautam Menghani | 55896f935a | mm/sparse-vmemmap.c: remove unwanted initialization in vmemmap_populate_compound_pages() Remove unnecessary initialization for the variable 'next'. This fixes the clang scan warning: Value stored to 'next' during its initialization is never read [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20220612182320.160651-1-gautammenghani201@gmail.com Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Fanjun Kong | 0b82ade6c0 | mm: use PAGE_ALIGNED instead of IS_ALIGNED <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use this macro instead of IS_ALIGNED and passing PAGE_SIZE directly. Link: https://lkml.kernel.org/r/20220526140257.1568744-1-bh1scw@gmail.com Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Ma Wupeng | abd62377c0 | mm: Limit warning message in vmemmap_verify() to once For a system only have limited mirrored memory or some numa node without mirrored memory, the per node vmemmap page_structs prefer to allocate memory from mirrored region, which will lead to vmemmap_verify() in vmemmap_populate_basepages() report lots of warning message. This patch change the frequency of "potential offnode page_structs" warning messages to only once to avoid a very long print during bootup. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20220614092156.1972846-4-mawupeng1@huawei.com Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> | ||
|  Joao Martins | 4917f55b4e | mm/sparse-vmemmap: improve memory savings for compound devmaps A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it means that pages are mapped at a given huge page alignment and utilize uses compound pages as opposed to order-0 pages. Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area which contains the head page and separate for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages. Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and when initializing compound devmap with big enough @vmemmap_shift (e.g. 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page. On compound devmaps with 2M align, this mechanism lets 6 pages be saved out of the 8 necessary PFNs necessary to set the subsection's 512 struct pages being mapped. On a 1G compound devmap it saves 4094 pages. Altmap isn't supported yet, given various restrictions in altmap pfn allocator, thus fallback to the already in use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages the motivation for altmaps for pmem gets reduced. Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Joao Martins | 2beea70a3e | mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper In preparation for describing a memmap with compound pages, move the actual pte population logic into a separate function vmemmap_populate_address() and have a new helper vmemmap_populate_range() walk through all base pages it needs to populate. While doing that, change the helper to use a pte_t* as return value, rather than an hardcoded errno of 0 or -ENOMEM. Link: https://lkml.kernel.org/r/20220420155310.9712-3-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Joao Martins | e3246d8f52 | mm/sparse-vmemmap: add a pgmap argument to section activation Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9.
This series minimizes 'struct page' overhead by pursuing a similar
approach as Muchun Song series "Free some vmemmap pages of hugetlb page"
(now merged since v5.14), but applied to devmap with @vmemmap_shift
(device-dax).  
The vmemmap dedpulication original idea (already used in HugeTLB) is to
reuse/deduplicate tail page vmemmap areas, particular the area which only
describes tail pages.  So a vmemmap page describes 64 struct pages, and
the first page for a given ZONE_DEVICE vmemmap would contain the head page
and 63 tail pages.  The second vmemmap page would contain only tail pages,
and that's what gets reused across the rest of the subsection/section. 
The bigger the page size, the bigger the savings (2M hpage -> save 6
vmemmap pages; 1G hpage -> save 4094 vmemmap pages).  
This is done for PMEM /specifically only/ on device-dax configured
namespaces, not fsdax.  In other words, a devmap with a @vmemmap_shift.
In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound devmap:
* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of
  total memory)
* with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of
  total memory)
The series is mostly summed up by patch 4, and to summarize what the
series does:
Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very
nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.
Patch 4: Patch 4 is the one that takes care of the struct page savings
(also referred to here as tail-page/vmemmap deduplication).  Much like
Muchun series, we reuse the second PTE tail page vmemmap areas across a
given @vmemmap_shift On important difference though, is that contrary to
the hugetlbfs series, there's no vmemmap for the area because we are
late-populating it as opposed to remapping a system-ram range.  IOW no
freeing of pages of already initialized vmemmap like the case for
hugetlbfs, which greatly simplifies the logic (besides not being
arch-specific).  altmap case unchanged and still goes via the
vmemmap_populate().  Also adjust the newly added docs to the device-dax
case.
[Note that device-dax is still a little behind HugeTLB in terms of
savings.  I have an additional simple patch that reuses the head vmemmap
page too, as a follow-up.  That will double the savings and namespaces
initialization.]
Patch 5: Initialize fewer struct pages depending on the page size with
DRAM backed struct pages -- because fewer pages are unique and most tail
pages (with bigger vmemmap_shift).
    NVDIMM namespace bootstrap improves from ~268-358 ms to
    ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectivally.  And struct
    page needed capacity will be 3.8x / 1071x smaller for 2M and 1G
    respectivelly.  Tested on x86 with 1.5Tb of pmem (including pinning,
    and RDMA registration/deregistration scalability with 2M MRs)
This patch (of 5):
In support of using compound pages for devmap mappings, plumb the pgmap
down to the vmemmap_populate implementation.  Note that while altmap is
retrievable from pgmap the memory hotplug code passes altmap without
pgmap[*], so both need to be independently plumbed.
So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:
	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap
Passing @pgmap allows __populate_section_memmap() to both fetch the
vmemmap_shift in which memmap metadata is created for and also to let
sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick
whether to just reuse tail pages from past onlined sections.
While at it, fix the kdoc for @altmap for sparse_add_section().
[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Muchun Song | 47010c040d | mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP* The word of "free" is not expressive enough to express the feature of optimizing vmemmap pages associated with each HugeTLB, rename this keywork to "optimize". In this patch , cheanup configs to make code more expressive. Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> | ||
|  Muchun Song | e540841734 | mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those functiongs to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP. Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | d8d55f5616 | mm: sparsemem: use page table lock to protect kernel pmd operations The init_mm.page_table_lock is used to protect kernel page tables, we can use it to serialize splitting vmemmap PMD mappings instead of mmap write lock, which can increase the concurrency of vmemmap_remap_free(). Actually, It increase the concurrency between allocations of HugeTLB pages. But it is not the only benefit. There are a lot of users of mmap read lock of init_mm. The mmap write lock is holding through vmemmap_remap_free(), removing mmap write lock usage to make it does not affect other users of mmap read lock. It is not making anything worse and always a win to move. Now the kernel page table walker does not hold the page_table_lock when walking pmd entries. There may be consistency issue of a pmd entry, because pmd entry might change from a huge pmd entry to a PTE page table. There is only one user of kernel page table walker, namely ptdump. The ptdump already considers the consistency, which use a local variable to cache the value of pmd entry. But we also need to update ->action to ACTION_CONTINUE to make sure the walker does not walk every pte entry again when concurrent thread has split the huge pmd. Link: https://lkml.kernel.org/r/20211101031651.75851-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | e7d324850b | mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page Patch series "Free the 2nd vmemmap page associated with each HugeTLB
page", v7.
This series can minimize the overhead of struct page for 2MB HugeTLB
pages significantly.  It further reduces the overhead of struct page by
12.5% for a 2MB HugeTLB compared to the previous approach, which means
2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
welcome.  Thanks.
The main implementation and details can refer to the commit log of patch
1.  In this series, I have changed the following four helpers, the
following table shows the impact of the overhead of those helpers.
	+------------------+-----------------------+
	|       APIs       | head page | tail page |
	+------------------+-----------+-----------+
	|    PageHead()    |     Y     |     N     |
	+------------------+-----------+-----------+
	|    PageTail()    |     Y     |     N     |
	+------------------+-----------+-----------+
	|  PageCompound()  |     N     |     N     |
	+------------------+-----------+-----------+
	|  compound_head() |     Y     |     N     |
	+------------------+-----------+-----------+
	Y: Overhead is increased.
	N: Overhead is _NOT_ increased.
It shows that the overhead of those helpers on a tail page don't change
between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off".  But the
overhead on a head page will be increased when "hugetlb_free_vmemmap=on"
(except PageCompound()).  So I believe that Matthew Wilcox's folio series
will help with this.
The users of PageHead() and PageTail() are much less than compound_head()
and most users of PageTail() are VM_BUG_ON(), so I have done some tests
about the overhead of compound_head() on head pages.
I have tested the overhead of calling compound_head() on a head page,
which is 2.11ns (Measure the call time of 10 million times
compound_head(), and then average).
For a head page whose address is not aligned with PAGE_SIZE or a
non-compound page, the overhead of compound_head() is 2.54ns which is
increased by 20%.  For a head page whose address is aligned with
PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by
40%.  Most pages are the former.  I do not think the overhead is
significant since the overhead of compound_head() itself is low.
This patch (of 5):
This patch minimizes the overhead of struct page for 2MB HugeTLB pages
significantly.  It further reduces the overhead of struct page by 12.5%
for a 2MB HugeTLB compared to the previous approach, which means 2GB per
1TB HugeTLB (2MB type).
After the feature of "Free sonme vmemmap pages of HugeTLB page" is
enabled, the mapping of the vmemmap addresses associated with a 2MB
HugeTLB page becomes the figure below.
     HugeTLB                    struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
remaped. However, the 2nd vmemmap page frame is also can be freed to
the buddy allocator, then we can change the mapping from the figure
above to the figure below.
    HugeTLB                    struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    2MB    |                     +-----------+                        | | |
 |           |                     |     5     | -----------------------+ | |
 |           |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
After we do this, all tail vmemmap pages (1-7) are mapped to the head
vmemmap page frame (0).  In other words, there are more than one page
struct with PG_head associated with each HugeTLB page.  We __know__ that
there is only one head page struct, the tail page structs with PG_head are
fake head page structs.  We need an approach to distinguish between those
two different types of page structs so that compound_head(), PageHead()
and PageTail() can work properly if the parameter is the tail page struct
but with PG_head.
The following code snippet describes how to distinguish between real and
fake head page struct.
	if (test_bit(PG_head, &page->flags)) {
		unsigned long head = READ_ONCE(page[1].compound_head);
		if (head & 1) {
			if (head == (unsigned long)page + 1)
				==> head page struct
			else
				==> tail page struct
		} else
			==> head page struct
	}
We can safely access the field of the @page[1] with PG_head because the
@page is a compound page composed with at least two contiguous pages.
[songmuchun@bytedance.com: restore lost comment changes]
Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Qi Zheng | ed33b5a677 | mm: remove redundant smp_wmb() The smp_wmb() which is in the __pte_alloc() is used to ensure all ptes setup is visible before the pte is made visible to other CPUs by being put into page tables. We only need this when the pte is actually populated, so move it to pmd_install(). __pte_alloc_kernel(), __p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar to this case. We can also defer smp_wmb() to the place where the pmd entry is really populated by preallocated pte. There are two kinds of user of preallocated pte, one is filemap & finish_fault(), another is THP. The former does not need another smp_wmb() because the smp_wmb() has been done by pmd_install(). Fortunately, the latter also does not need another smp_wmb() because there is already a smp_wmb() before populating the new pte when the THP uses a preallocated pte to split a huge pmd. Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mika Penttila <mika.penttila@nextfour.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | 3bc2b6a725 | mm: sparsemem: split the huge PMD mapping of vmemmap pages Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series[1]. We disable huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt coalasce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmmemap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | ad2fa3717b | mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:
 1) Failing to free a huge page triggered by the user (decrease nr_pages).
    User needs to try again later.
 2) Failing to free a surplus huge page when freed by the application.
    Try again later when freeing a huge page next time.
 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().
    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.
 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.
    Similar to 3). virito-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.
Vmemmap pages are allocated from the page freeing context.  In order for
those allocations to be not disruptive (e.g.  trigger oom killer)
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure.  GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Muchun Song | f41f2ed43c | mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [willy@infradead.org: fix documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Wei Yang | 6cda72047e | mm/sparse: only sub-section aligned range would be populated There are two code path which invoke __populate_section_memmap()
  * sparse_init_nid()
  * sparse_add_section()
For both case, we are sure the memory range is sub-section aligned.
  * we pass PAGES_PER_SECTION to sparse_init_nid()
  * we check range by check_pfn_span() before calling
    sparse_add_section()
Also, the counterpart of __populate_section_memmap(), we don't do such
calculation and check since the range is checked by check_pfn_span() in
__remove_pages().
Clear the calculation and check to keep it simple and comply with its
counterpart.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200703031828.14645-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Anshuman Khandual | 56993b4e14 | mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf() There are many instances where vmemap allocation is often switched between regular memory and device memory just based on whether altmap is available or not. vmemmap_alloc_block_buf() is used in various platforms to allocate vmemmap mappings. Lets also enable it to handle altmap based device memory allocation along with existing regular memory allocations. This will help in avoiding the altmap based allocation switch in many places. To summarize there are two different methods to call vmemmap_alloc_block_buf(). vmemmap_alloc_block_buf(size, node, NULL) /* Allocate from system RAM */ vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */ This converts altmap_alloc_block_buf() into a static function, drops it's entry from the header and updates Documentation/vm/memory-model.rst. Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jia He <justin.he@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yu Zhao <yuzhao@google.com> Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Anshuman Khandual | 1d9cfee753 | mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages() Patch series "arm64: Enable vmemmap mapping from device memory", v4. This series enables vmemmap backing memory allocation from device memory ranges on arm64. But before that, it enables vmemmap_populate_basepages() and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based alocation requests. This patch (of 3): vmemmap_populate_basepages() is used across platforms to allocate backing memory for vmemmap mapping. This is used as a standard default choice or as a fallback when intended huge pages allocation fails. This just creates entire vmemmap mapping with base pages (PAGE_SIZE). On arm64 platforms, vmemmap_populate_basepages() is called instead of the platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS is not enabled as in case for ARM64_16K_PAGES and ARM64_64K_PAGES configs. At present vmemmap_populate_basepages() does not support allocating from driver defined struct vmem_altmap while trying to create vmemmap mapping for a device memory range. It prevents ARM64_16K_PAGES and ARM64_64K_PAGES configs on arm64 from supporting device memory with vmemap_altmap request. This enables vmem_altmap support in vmemmap_populate_basepages() unlocking device memory allocation for vmemap mapping on arm64 platforms with 16K or 64K base page configs. Each architecture should evaluate and decide on subscribing device memory based base page allocation through vmemmap_populate_basepages(). Hence lets keep it disabled on all archs in order to preserve the existing semantics. A subsequent patch enables it on arm64. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jia He <justin.he@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Yu Zhao <yuzhao@google.com> Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | e31cf2f4ca | mm: don't include asm/pgtable.h if linux/mm.h is already included Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Dan Williams | e9c0a3f054 | mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() Allow sub-section sized ranges to be added to the memmap. populate_section_memmap() takes an explict pfn range rather than assuming a full section, and those parameters are plumbed all the way through to vmmemap_populate(). There should be no sub-section usage in current deployments. New warnings are added to clarify which memmap allocation paths are sub-section capable. Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | 57c8a661d9 | mm: remove include/linux/bootmem.h Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | 97ad1087ef | memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of identical MEMBLOCK definitions. Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Mike Rapoport | eb31d559f1 | memblock: remove _virt from APIs returning virtual address The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | 2a3cb8baef | mm/sparse: delete old sparse_init and enable new one Rename new_sparse_init() to sparse_init() which enables it. Delete old sparse_init() and all the code that became obsolete with. [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()] Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | afda57bc13 | mm/sparse: move buffer init/fini to the common place Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | 35fd1eb1e8 | mm/sparse: abstract sparse buffer allocations Patch series "sparse_init rewrite", v6. In sparse_init() we allocate two large buffers to temporary hold usemap and memmap for the whole machine. However, we can avoid doing that if we changed sparse_init() to operated on per-node bases instead of doing it on the whole machine beforehand. As shown by Baoquan http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com The buffers are large enough to cause machine stop to boot on small memory systems. Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. This patch (of 5): When struct pages are allocated for sparse-vmemmap VA layout, we first try to allocate one large buffer, and than if that fails allocate struct pages for each section as we go. The code that allocates buffer is uses global variables and is spread across several call sites. Cleanup the code by introducing three functions to handle the global buffer: sparse_buffer_init() initialize the buffer sparse_buffer_fini() free the remaining part of the buffer sparse_buffer_alloc() alloc from the buffer, and if buffer is empty return NULL Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well. [akpm@linux-foundation.org: use PTR_ALIGN()] [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/] Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baoquan He | c98aff6493 | mm/sparse: optimize memmap allocation during sparse_init() In sparse_init(), two temporary pointer arrays, usemap_map and map_map are allocated with the size of NR_MEM_SECTIONS. They are used to store each memory section's usemap and mem map if marked as present. With the help of these two arrays, continuous memory chunk is allocated for usemap and memmap for memory sections on one node. This avoids too many memory fragmentations. Like below diagram, '1' indicates the present memory section, '0' means absent one. The number 'n' could be much smaller than NR_MEM_SECTIONS on most of systems. |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0| ------------------------------------------------------- 0 1 2 3 4 5 i i+1 n-1 n If we fail to populate the page tables to map one section's memmap, its ->section_mem_map will be cleared finally to indicate that it's not present. After use, these two arrays will be released at the end of sparse_init(). In 4-level paging mode, each array costs 4M which can be ignorable. While in 5-level paging, they costs 256M each, 512M altogether. Kdump kernel Usually only reserves very few memory, e.g 256M. So, even thouth they are temporarily allocated, still not acceptable. In fact, there's no need to allocate them with the size of NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred to the last, the number of present memory sections are kept the same during sparse_init() until we finally clear out the memory section's ->section_mem_map if its usemap or memmap is not correctly handled. Thus in the middle whenever for_each_present_section_nr() loop is taken, the i-th present memory section is always the same one. Here only allocate usemap_map and map_map with the size of 'nr_present_sections'. For the i-th present memory section, install its usemap and memmap to usemap_map[i] and mam_map[i] during allocation. Then in the last for_each_present_section_nr() loop which clears the failed memory section's ->section_mem_map, fetch usemap and memmap from usemap_map[] and map_map[] array and set them into mem_section[] accordingly. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oscar Salvador <osalvador@techadventures.net> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Baoquan He | 07a34a8c36 | mm/sparsemem.c: defer the ms->section_mem_map clearing In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system will allocate one continuous memory chunk for mem maps on one node and populate the relevant page tables to map memory section one by one. If fail to populate for a certain mem section, print warning and its ->section_mem_map will be cleared to cancel the marking of being present. Like this, the number of mem sections marked as present could become less during sparse_init() execution. Here just defer the ms->section_mem_map clearing if failed to populate its page tables until the last for_each_present_section_nr() loop. This is in preparation for later optimizing the mem map allocation. [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar] Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Christoph Hellwig | eb8045335c | mm: merge vmem_altmap_alloc into altmap_alloc_block_buf There is no clear separation between the two, so merge them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> | ||
|  Christoph Hellwig | a8fc357b28 | mm: split altmap memory map allocation from normal case No functional changes, just untangling the call chain and document why the altmap is passed around the hotplug code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> | ||
|  Christoph Hellwig | 7b73d978a5 | mm: pass the vmem_altmap to vmemmap_populate We can just pass this on instead of having to do a radix tree lookup without proper locking a few levels into the callchain. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> | ||
|  Michal Hocko | fcdaf842bd | mm, sparse: do not swamp log with huge vmemmap allocation failures While doing memory hotplug tests under heavy memory pressure we have
noticed too many page allocation failures when allocating vmemmap memmap
backed by huge page
  kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
  [...]
  Call Trace:
    dump_trace+0x59/0x310
    show_stack_log_lvl+0xea/0x170
    show_stack+0x21/0x40
    dump_stack+0x5c/0x7c
    warn_alloc_failed+0xe2/0x150
    __alloc_pages_nodemask+0x3ed/0xb20
    alloc_pages_current+0x7f/0x100
    vmemmap_alloc_block+0x79/0xb6
    __vmemmap_alloc_block_buf+0x136/0x145
    vmemmap_populate+0xd2/0x2b9
    sparse_mem_map_populate+0x23/0x30
    sparse_add_one_section+0x68/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    add_memory_resource+0x89/0x160
    add_memory+0x6d/0xd0
    acpi_memory_device_add+0x181/0x251
    acpi_bus_attach+0xfd/0x19b
    acpi_bus_scan+0x59/0x69
    acpi_device_hotplug+0xd2/0x41f
    acpi_hotplug_work_fn+0x1a/0x23
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xbd/0xe0
    ret_from_fork+0x3f/0x70
and we do see many of those because essentially every allocation fails
for each memory section.  This is an excessive way to tell the user that
there is nothing to really worry about because we do have a fallback
mechanism to use base pages.  The only downside might be a performance
degradation due to TLB pressure.
This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
explicitly once on the first allocation failure.  This will reduce the
noise in the kernel log considerably, while we still have an indication
that a performance might be impacted.
[mhocko@kernel.org: forgot to git add the follow up fix]
  Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Pavel Tatashin | f7f99100d8 | mm: stop zeroing memory during allocation in vmemmap vmemmap_alloc_block() will no longer zero the block, so zero memory at
its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.
Replace allocators in sparse-vmemmap to use the non-zeroing version.
So, we will get the performance improvement by zeroing the memory in
parallel when struct pages are zeroed.
Add struct page zeroing as a part of initialization of other fields in
__init_single_page().
This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
                         BASE            FIX
sparse_init     11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
                  --------------------------
Total           16.124447727s   8.362381922s
sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().
[akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Greg Kroah-Hartman | b24413180f | License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | ||
|  Michal Hocko | b95046b047 | mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Commit | ||
|  Michal Hocko | dcda9b0471 | mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Kirill A. Shutemov | c2febafc67 | mm: convert generic code to 5-level paging Convert all non-architecture-specific code to 5-level paging. It's mostly mechanical adding handling one more page table level in places where we deal with pud_t. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Fabian Frederick | bd721ea73e | treewide: replace obsolete _refok by __ref There was only one use of __initdata_refok and __exit_refok
__init_refok was used 46 times against 82 for __ref.
Those definitions are obsolete since commit  | ||
|  Joe Perches | 1170532bb4 | mm: convert printk(KERN_<LEVEL> to pr_<level> Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | ||
|  Joe Perches | 756a025f00 | mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |