linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00

Author	SHA1	Message	Date
Ryan Roberts	935d4f0c6d	mm: hugetlb: add huge page size param to set_huge_pte_at() Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit `99aa77215a` ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit `8a13897fb0` ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit `18f3962953` ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: `8a13897fb0` ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-09-29 17:20:47 -07:00
Joel Fernandes (Google)	0818e739b5	mm/vmalloc: add a safer version of find_vm_area() for debug It is unsafe to dump vmalloc area information when trying to do so from some contexts. Add a safer trylock version of the same function to do a best-effort VMA finding and use it from vmalloc_dump_obj(). [applied test robot feedback on unused function fix.] [applied Uladzislau feedback on locking.] Link: https://lkml.kernel.org/r/20230904180806.1002832-1-joel@joelfernandes.org Fixes: `98f180837a` ("mm: Make mem_dump_obj() handle vmalloc() memory") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: Zhen Lei <thunder.leizhen@huaweicloud.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Zqiang <qiang.zhang1211@gmail.com> Cc: <stable@vger.kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-09-05 10:13:45 -07:00
Alexandre Ghiti	a50420c797	mm: add a call to flush_cache_vmap() in vmap_pfn() flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit `f1cb8f9beb` ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")). But actually the patch is aiming at riscv: the riscv specification allows the caching of invalid entries in the TLB, and since we recently removed the vmalloc page fault handling, we now need to emit a tlb shootdown whenever a new vmalloc mapping is emitted (https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/). That's a temporary solution, there are ways to avoid that :) Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com Fixes: `3e9a9e256b` ("mm: add a vmap_pfn function") Reported-by: Dylan Jhong <dylan@andestech.com> Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Dylan Jhong <dylan@andestech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:07:21 -07:00
Andrew Morton	63773d2b59	Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.	2023-06-23 16:58:19 -07:00
Ryan Roberts	c33c794828	mm: ptep_get() conversion Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t v; @@ - v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-19 16:19:25 -07:00
Hugh Dickins	0d1c81edc6	mm/vmalloc: vmalloc_to_page() use pte_offset_kernel() vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()), but it's intended for userspace page tables: prefer pte_offset_kernel(). Link: https://lkml.kernel.org/r/696386a-84f8-b33c-82e5-f865ed6eb39@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-19 16:19:14 -07:00
Lu Hongfei	0e4bc27111	mm/vmalloc: replace the ternary conditional operator with min() It would be better to replace the traditional ternary conditional operator with min() in zero_iter Link: https://lkml.kernel.org/r/20230609093057.27777-1-luhongfei@vivo.com Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-19 16:19:11 -07:00
Lorenzo Stoakes	95a301eefa	mm/vmalloc: do not output a spurious warning when huge vmalloc() fails In __vmalloc_area_node() we always warn_alloc() when an allocation performed by vm_area_alloc_pages() fails unless it was due to a pending fatal signal. However, huge page allocations instigated either by vmalloc_huge() or __vmalloc_node_range() (or a caller that invokes this like kvmalloc() or kvmalloc_node()) always falls back to order-0 allocations if the huge page allocation fails. This renders the warning useless and noisy, especially as all callers appear to be aware that this may fallback. This has already resulted in at least one bug report from a user who was confused by this (see link). Therefore, simply update the code to only output this warning for order-0 pages when no fatal signal is pending. Link: https://bugzilla.suse.com/show_bug.cgi?id=1211410 Link: https://lkml.kernel.org/r/20230605201107.83298-1-lstoakes@gmail.com Fixes: `80b1d8fdfa` ("mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node()") Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-19 13:19:31 -07:00
Ryan Roberts	b3f78e7498	mm: vmalloc must set pte via arch code Patch series "Fixes for pte encapsulation bypasses", v3. A series to improve the encapsulation of pte entries by disallowing non-arch code from directly dereferencing pte_t pointers. This patch (of 4): It is bad practice to directly set pte entries within a pte table. Instead all modifications must go through arch-provided helpers such as set_pte_at() to give the arch code visibility and allow it to check (and potentially modify) the operation. Link: https://lkml.kernel.org/r/20230602092949.545577-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230602092949.545577-2-ryan.roberts@arm.com Fixes: `3e9a9e256b` ("mm: add a vmap_pfn function") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:55 -07:00
Thomas Gleixner	77e50af07f	mm/vmalloc: dont purge usable blocks unnecessarily Purging fragmented blocks is done unconditionally in several contexts: 1) From drain_vmap_area_work(), when the number of lazy to be freed vmap_areas reached the threshold 2) Reclaiming vmalloc address space from pcpu_get_vm_areas() 3) _vm_unmap_aliases() #1 There is no reason to zap fragmented vmap blocks unconditionally, simply because reclaiming all lazy areas drains at least 32MB * fls(num_online_cpus()) per invocation which is plenty. #2 Reclaiming when running out of space or due to memory pressure makes a lot of sense #3 _unmap_aliases() requires to touch everything because the caller has no clue which vmap_area used a particular page last and the vmap_area lost that information too. Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the vmap area first and then cares about the flush. That in turn requires a full walk of _all_ vmap areas including the one which was just added to the purge list. But as this has to be flushed anyway this is an opportunity to combine outstanding TLB flushes and do the housekeeping of purging freed areas, but like #1 there is no real good reason to zap usable vmap blocks unconditionally. Add a @force_purge argument to the newly split out block purge function and if not true only purge fragmented blocks which have less than 1/4 of their capacity left. Rename purge_vmap_area_lazy() to reclaim_and_purge_vmap_areas() to make it clear what the function does. [lstoakes@gmail.com: correct VMAP_PURGE_THRESHOLD check] Link: https://lkml.kernel.org/r/3e92ef61-b910-4576-88e7-cf43211fd4e7@lucifer.local Link: https://lkml.kernel.org/r/20230525124504.864005691@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:41 -07:00
Thomas Gleixner	7f48121e9f	mm/vmalloc: add missing READ/WRITE_ONCE() annotations purge_fragmented_blocks() accesses vmap_block::free and vmap_block::dirty lockless for a quick check. Add the missing READ/WRITE_ONCE() annotations. Link: https://lkml.kernel.org/r/20230525124504.807356682@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:41 -07:00
Thomas Gleixner	43d7650234	mm/vmalloc: check free space in vmap_block lockless vb_alloc() unconditionally locks a vmap_block on the free list to check the free space. This can be done locklessly because vmap_block::free never increases, it's only decreased on allocations. Check the free space lockless and only if that succeeds, recheck under the lock. Link: https://lkml.kernel.org/r/20230525124504.750481992@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:41 -07:00
Thomas Gleixner	a09fad96ff	mm/vmalloc: prevent flushing dirty space over and over vmap blocks which have active mappings cannot be purged. Allocations which have been freed are accounted for in vmap_block::dirty_min/max, so that they can be detected in _vm_unmap_aliases() as potentially stale TLBs. If there are several invocations of _vm_unmap_aliases() then each of them will flush the dirty range. That's pointless and just increases the probability of full TLB flushes. Avoid that by resetting the flush range after accounting for it. That's safe versus other invocations of _vm_unmap_aliases() because this is all serialized with vmap_purge_lock. Link: https://lkml.kernel.org/r/20230525124504.692056496@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:41 -07:00
Thomas Gleixner	ca5e46c340	mm/vmalloc: avoid iterating over per CPU vmap blocks twice _vunmap_aliases() walks the per CPU xarrays to find partially unmapped blocks and then walks the per cpu free lists to purge fragmented blocks. Arguably that's waste of CPU cycles and cache lines as the full xarray walk already touches every block. Avoid this double iteration: - Split out the code to purge one block and the code to free the local purge list into helper functions. - Try to purge the fragmented blocks in the xarray walk before looking at their dirty space. Link: https://lkml.kernel.org/r/20230525124504.633469722@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:40 -07:00
Thomas Gleixner	fc1e0d9800	mm/vmalloc: prevent stale TLBs in fully utilized blocks Patch series "mm/vmalloc: Assorted fixes and improvements", v2. this series addresses the following issues: 1) Prevent the stale TLB problem related to fully utilized vmap blocks 2) Avoid the double per CPU list walk in _vm_unmap_aliases() 3) Avoid flushing dirty space over and over 4) Add a lockless quickcheck in vb_alloc() and add missing READ/WRITE_ONCE() annotations 5) Prevent overeager purging of usable vmap_blocks if not under memory/address space pressure. This patch (of 6): _vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a page are left in the system. This is required due to the lazy TLB flush mechanism in vmalloc. This is tried to achieve by walking the per CPU free lists, but those do not contain fully utilized vmap blocks because they are removed from the free list once the blocks free space became zero. When the block is not fully unmapped then it is not on the purge list either. So neither the per CPU list iteration nor the purge list walk find the block and if the page was mapped via such a block and the TLB has not yet been flushed, the guarantee of _vm_unmap_aliases() that there are no stale TLBs after returning is broken: x = vb_alloc() // Removes vmap_block from free list because vb->free became 0 vb_free(x) // Unmaps page and marks in dirty_min/max range // Block has still mappings and is not put on purge list // Page is reused vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL So instead of walking the per CPU free lists, walk the per CPU xarrays which hold pointers to _all_ active blocks in the system including those removed from the free lists. Link: https://lkml.kernel.org/r/20230525122342.109672430@linutronix.de Link: https://lkml.kernel.org/r/20230525124504.573987880@linutronix.de Fixes: `db64fe0225` ("mm: rewrite vmap layer") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:40 -07:00
Uladzislau Rezki (Sony)	fa1c77c13c	mm: vmalloc: rename addr_to_vb_xarray() function Short the name of the addr_to_vb_xarray() function to the addr_to_vb_xa(). This aligns with other internal function abbreviations. Link: https://lkml.kernel.org/r/20230331073727.6968-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Suggested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-18 16:29:48 -07:00
Uladzislau Rezki (Sony)	062eacf57a	mm: vmalloc: remove a global vmap_blocks xarray A global vmap_blocks-xarray array can be contented under heavy usage of the vm_map_ram()/vm_unmap_ram() APIs. The lock_stat shows that a "vmap_blocks.xa_lock" lock is a second in a top-list when it comes to contentions: <snip> ---------------------------------------- class name con-bounces contentions ... ---------------------------------------- vmap_area_lock: 2554079 2554276 ... -------------- vmap_area_lock 1297948 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910 vmap_area_lock 1256330 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0 vmap_area_lock 1 [<00000000c95c05a7>] find_vm_area+0x16/0x70 -------------- vmap_area_lock 1738590 [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910 vmap_area_lock 815688 [<000000009d927bf3>] free_vmap_block+0x4a/0xe0 vmap_area_lock 1 [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170 vmap_blocks.xa_lock: 862689 862698 ... ------------------- vmap_blocks.xa_lock 378418 [<00000000625a5626>] vm_map_ram+0x359/0x4a0 vmap_blocks.xa_lock 484280 [<00000000caa2ef03>] xa_erase+0xe/0x30 ------------------- vmap_blocks.xa_lock 576226 [<00000000caa2ef03>] xa_erase+0xe/0x30 vmap_blocks.xa_lock 286472 [<00000000625a5626>] vm_map_ram+0x359/0x4a0 ... <snip> that is a result of running vm_map_ram()/vm_unmap_ram() in a loop. The test creates 64(on 64 CPUs system) threads and each one maps/unmaps 1 page. After this change the "xa_lock" can be considered as a noise in the same test condition: <snip> ... &xa->xa_lock#1: 10333 10394 ... -------------- &xa->xa_lock#1 5349 [<00000000bbbc9751>] xa_erase+0xe/0x30 &xa->xa_lock#1 5045 [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0 -------------- &xa->xa_lock#1 7326 [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0 &xa->xa_lock#1 3068 [<00000000bbbc9751>] xa_erase+0xe/0x30 ... <snip> Running the test_vmalloc.sh run_test_mask=1024 nr_threads=64 nr_pages=5 shows around ~8 percent of throughput improvement of vm_map_ram() and vm_unmap_ram() APIs. This patch does not fix vmap_area_lock/free_vmap_area_lock and purge_vmap_area_lock bottle-necks, it is rather a separate rework. Link: https://lkml.kernel.org/r/20230330190639.431589-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-18 16:29:47 -07:00
Andrew Morton	f8f238ffe5	sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes	2023-04-18 14:53:49 -07:00
Alexander Potapenko	fdea03e12a	mm: kmsan: handle alloc failures in kmsan_ioremap_page_range() Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range() must also properly handle allocation/mapping failures. In the case of such, it must clean up the already created metadata mappings and return an error code, so that the error can be propagated to ioremap_page_range(). Without doing so, KMSAN may silently fail to bring the metadata for the page range into a consistent state, which will result in user-visible crashes when trying to access them. Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com Fixes: `b073d7f8ae` ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-18 14:22:13 -07:00
Alexander Potapenko	47ebd0310e	mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush() As reported by Dipanjan Das, when KMSAN is used together with kernel fault injection (or, generally, even without the latter), calls to kcalloc() or __vmap_pages_range_noflush() may fail, leaving the metadata mappings for the virtual mapping in an inconsistent state. When these metadata mappings are accessed later, the kernel crashes. To address the problem, we return a non-zero error code from kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping failure inside it, and make vmap_pages_range_noflush() return an error if KMSAN fails to allocate the metadata. This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(), as these allocation failures are not fatal anymore. Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com Fixes: `b073d7f8ae` ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-18 14:22:13 -07:00
Andrew Morton	e492cd61b9	sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes	2023-04-16 12:31:58 -07:00
Lorenzo Stoakes	4c91c07c93	mm: vmalloc: convert vread() to vread_iter() Having previously laid the foundation for converting vread() to an iterator function, pull the trigger and do so. This patch attempts to provide minimal refactoring and to reflect the existing logic as best we can, for example we continue to zero portions of memory not read, as before. Overall, there should be no functional difference other than a performance improvement in /proc/kcore access to vmalloc regions. Now we have eliminated the need for a bounce buffer in read_kcore_iter(), we dispense with it, and try to write to user memory optimistically but with faults disabled via copy_page_to_iter_nofault(). We already have preemption disabled by holding a spin lock. We continue faulting in until the operation is complete. Additionally, we must account for the fact that at any point a copy may fail (most likely due to a fault not being able to occur), we exit indicating fewer bytes retrieved than expected. [sfr@canb.auug.org.au: fix sparc64 warning] Link: https://lkml.kernel.org/r/20230320144721.663280c3@canb.auug.org.au [lstoakes@gmail.com: redo Stephen's sparc build fix] Link: https://lkml.kernel.org/r/8506cbc667c39205e65a323f750ff9c11a463798.1679566220.git.lstoakes@gmail.com [akpm@linux-foundation.org: unbreak uio.h includes] Link: https://lkml.kernel.org/r/941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-05 19:42:57 -07:00
Yafang Shao	f349b15e18	mm: vmalloc: avoid warn_alloc noise caused by fatal signal There're some suspicious warn_alloc on my test serer, for example, [13366.518837] warn_alloc: 81 callbacks suppressed [13366.518841] test_verifier: vmalloc error: size 4096, page order 0, failed to allocate pages, mode:0x500dc2(GFP_HIGHUSER\|__GFP_ZERO\|__GFP_ACCOUNT), nodemask=(null),cpuset=/,mems_allowed=0-1 [13366.522240] CPU: 30 PID: 722463 Comm: test_verifier Kdump: loaded Tainted: G W O 6.2.0+ #638 [13366.524216] Call Trace: [13366.524702] <TASK> [13366.525148] dump_stack_lvl+0x6c/0x80 [13366.525712] dump_stack+0x10/0x20 [13366.526239] warn_alloc+0x119/0x190 [13366.526783] ? alloc_pages_bulk_array_mempolicy+0x9e/0x2a0 [13366.527470] __vmalloc_area_node+0x546/0x5b0 [13366.528066] __vmalloc_node_range+0xc2/0x210 [13366.528660] __vmalloc_node+0x42/0x50 [13366.529186] ? bpf_prog_realloc+0x53/0xc0 [13366.529743] __vmalloc+0x1e/0x30 [13366.530235] bpf_prog_realloc+0x53/0xc0 [13366.530771] bpf_patch_insn_single+0x80/0x1b0 [13366.531351] bpf_jit_blind_constants+0xe9/0x1c0 [13366.531932] ? __free_pages+0xee/0x100 [13366.532457] ? free_large_kmalloc+0x58/0xb0 [13366.533002] bpf_int_jit_compile+0x8c/0x5e0 [13366.533546] bpf_prog_select_runtime+0xb4/0x100 [13366.534108] bpf_prog_load+0x6b1/0xa50 [13366.534610] ? perf_event_task_tick+0x96/0xb0 [13366.535151] ? security_capable+0x3a/0x60 [13366.535663] __sys_bpf+0xb38/0x2190 [13366.536120] ? kvm_clock_get_cycles+0x9/0x10 [13366.536643] __x64_sys_bpf+0x1c/0x30 [13366.537094] do_syscall_64+0x38/0x90 [13366.537554] entry_SYSCALL_64_after_hwframe+0x72/0xdc [13366.538107] RIP: 0033:0x7f78310f8e29 [13366.538561] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89 01 48 [13366.540286] RSP: 002b:00007ffe2a61fff8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 [13366.541031] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f78310f8e29 [13366.541749] RDX: 0000000000000080 RSI: 00007ffe2a6200b0 RDI: 0000000000000005 [13366.542470] RBP: 00007ffe2a620010 R08: 00007ffe2a6202a0 R09: 00007ffe2a6200b0 [13366.543183] R10: 00000000000f423e R11: 0000000000000206 R12: 0000000000407800 [13366.543900] R13: 00007ffe2a620540 R14: 0000000000000000 R15: 0000000000000000 [13366.544623] </TASK> [13366.545260] Mem-Info: [13366.546121] active_anon:81319 inactive_anon:20733 isolated_anon:0 active_file:69450 inactive_file:5624 isolated_file:0 unevictable:0 dirty:10 writeback:0 slab_reclaimable:69649 slab_unreclaimable:48930 mapped:27400 shmem:12868 pagetables:4929 sec_pagetables:0 bounce:0 kernel_misc_reclaimable:0 free:15870308 free_pcp:142935 free_cma:0 [13366.551886] Node 0 active_anon:224836kB inactive_anon:33528kB active_file:175692kB inactive_file:13752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:59248kB dirty:32kB writeback:0kB shmem:18252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4616kB pagetables:10664kB sec_pagetables:0kB all_unreclaimable? no [13366.555184] Node 1 active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:50352kB dirty:8kB writeback:0kB shmem:33220kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3896kB pagetables:9052kB sec_pagetables:0kB all_unreclaimable? no [13366.558262] Node 0 DMA free:15360kB boost:0kB min:304kB low:380kB high:456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [13366.560821] lowmem_reserve[]: 0 2735 31873 31873 31873 [13366.561981] Node 0 DMA32 free:2790904kB boost:0kB min:56028kB low:70032kB high:84036kB reserved_highatomic:0KB active_anon:1936kB inactive_anon:20kB active_file:396kB inactive_file:344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:5188kB local_pcp:0kB free_cma:0kB [13366.565148] lowmem_reserve[]: 0 0 29137 29137 29137 [13366.566168] Node 0 Normal free:28533824kB boost:0kB min:596740kB low:745924kB high:895108kB reserved_highatomic:28672KB active_anon:222900kB inactive_anon:33508kB active_file:175296kB inactive_file:13408kB unevictable:0kB writepending:32kB present:30408704kB managed:29837172kB mlocked:0kB bounce:0kB free_pcp:295724kB local_pcp:0kB free_cma:0kB [13366.569485] lowmem_reserve[]: 0 0 0 0 0 [13366.570416] Node 1 Normal free:32141144kB boost:0kB min:660504kB low:825628kB high:990752kB reserved_highatomic:69632KB active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB writepending:8kB present:33554432kB managed:33025372kB mlocked:0kB bounce:0kB free_pcp:270880kB local_pcp:46860kB free_cma:0kB [13366.573403] lowmem_reserve[]: 0 0 0 0 0 [13366.574015] Node 0 DMA: 04kB 08kB 016kB 032kB 064kB 0128kB 0256kB 0512kB 11024kB (U) 12048kB (M) 34096kB (M) = 15360kB [13366.575474] Node 0 DMA32: 7824kB (UME) 7568kB (UME) 73616kB (UME) 74532kB (UME) 69464kB (UME) 653128kB (UME) 595256kB (UME) 552512kB (UME) 4541024kB (UME) 3472048kB (UME) 2464096kB (UME) = 2790904kB [13366.577442] Node 0 Normal: 338564kB (UMEH) 518158kB (UMEH) 4241816kB (UMEH) 3627232kB (UMEH) 2219564kB (UMEH) 10296128kB (UMEH) 7238256kB (UMEH) 5638512kB (UEH) 53371024kB (UMEH) 35062048kB (UMEH) 14704096kB (UME) = 28533784kB [13366.580460] Node 1 Normal: 157764kB (UMEH) 374858kB (UMEH) 2950916kB (UMEH) 2142032kB (UMEH) 1481864kB (UMEH) 13051128kB (UMEH) 9918256kB (UMEH) 7374512kB (UMEH) 53971024kB (UMEH) 38872048kB (UMEH) 20024096kB (UME) = 32141240kB [13366.583027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [13366.584380] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [13366.585702] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [13366.587042] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [13366.588372] 87386 total pagecache pages [13366.589266] 0 pages in swap cache [13366.590327] Free swap = 0kB [13366.591227] Total swap = 0kB [13366.592142] 16777082 pages RAM [13366.593057] 0 pages HighMem/MovableOnly [13366.594037] 357226 pages reserved [13366.594979] 0 pages hwpoisoned This failure really confuse me as there're still lots of available pages. Finally I figured out it was caused by a fatal signal. When a process is allocating memory via vm_area_alloc_pages(), it will break directly even if it hasn't allocated the requested pages when it receives a fatal signal. In that case, we shouldn't show this warn_alloc, as it is useless. We only need to show this warning when there're really no enough pages. Link: https://lkml.kernel.org/r/20230330162625.13604-1-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-05 18:06:24 -07:00
Lorenzo Stoakes	dcc1be1190	mm: prefer xxx_page() alloc/free functions for order-0 pages Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and __free_pages(..., 0) to use alloc_page(), __get_free_page() and __free_page() respectively in core code. Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:16 -07:00
Peter Collingbourne	0a54864f8d	kasan: remove PG_skip_kasan_poison flag Code inspection reveals that PG_skip_kasan_poison is redundant with kasantag, because the former is intended to be set iff the latter is the match-all tag. It can also be observed that it's basically pointless to poison pages which have kasantag=0, because any pages with this tag would have been pointed to by pointers with match-all tags, so poisoning the pages would have little to no effect in terms of bug detection. Therefore, change the condition in should_skip_kasan_poison() to check kasantag instead, and remove PG_skip_kasan_poison and associated flags. Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:16 -07:00
Michal Hocko	e9c3cda4d8	mm, vmalloc: fix high order __GFP_NOFAIL allocations Gao Xiang has reported that the page allocator complains about high order __GFP_NOFAIL request coming from the vmalloc core: __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549 alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286 vm_area_alloc_pages mm/vmalloc.c:2989 [inline] __vmalloc_area_node mm/vmalloc.c:3057 [inline] __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227 kvmalloc_node+0x156/0x1a0 mm/util.c:606 kvmalloc include/linux/slab.h:737 [inline] kvmalloc_array include/linux/slab.h:755 [inline] kvcalloc include/linux/slab.h:760 [inline] it seems that I have completely missed high order allocation backing vmalloc areas case when implementing __GFP_NOFAIL support. This means that [k]vmalloc at al. can allocate higher order allocations with __GFP_NOFAIL which can trigger OOM killer for non-costly orders easily or cause a lot of reclaim/compaction activity if those requests cannot be satisfied. Fix the issue by falling back to zero order allocations for __GFP_NOFAIL requests if the high order request fails. Link: https://lkml.kernel.org/r/ZAXynvdNqcI0f6Us@dhcp22.suse.cz Fixes: `9376130c39` ("mm/vmalloc: add support for __GFP_NOFAIL") Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lkml.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-23 17:18:31 -07:00
Linus Torvalds	3822a7c409	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...	2023-02-23 17:09:35 -08:00
David Howells	0185846975	netfs: Add a function to extract an iterator into a scatterlist Provide a function for filling in a scatterlist from the list of pages contained in an iterator. If the iterator is UBUF- or IOBUF-type, the pages have a pin taken on them (as FOLL_PIN). If the iterator is BVEC-, KVEC- or XARRAY-type, no pin is taken on the pages and it is left to the caller to manage their lifetime. It cannot be assumed that a ref can be validly taken, particularly in the case of a KVEC iterator. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: linux-cachefs@redhat.com cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2023-02-20 17:25:43 -06:00
Baoquan He	30a7a9b17c	mm/vmalloc: skip the uninitilized vmalloc areas For areas allocated via vmalloc_xxx() APIs, it searches for unmapped area to reserve and allocates new pages to map into, please see function __vmalloc_node_range(). During the process, flag VM_UNINITIALIZED is set in vm->flags to indicate that the pages allocation and mapping haven't been done, until clear_vm_uninitialized_flag() is called to clear VM_UNINITIALIZED. For this kind of area, if VM_UNINITIALIZED is still set, let's ignore it in vread() because pages newly allocated and being mapped in that area only contains zero data. reading them out by aligned_vread() is wasting time. Link: https://lkml.kernel.org/r/20230206084020.174506-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:43 -08:00
Baoquan He	bba9697b42	mm/vmalloc: explicitly identify vm_map_ram area when shown in /proc/vmcoreinfo Now, by marking VMAP_RAM in vmap_area->flags for vm_map_ram area, we can clearly differentiate it with other vmalloc areas. So identify vm_map_area area by checking VMAP_RAM of vmap_area->flags when shown in /proc/vmcoreinfo. Meanwhile, the code comment above vm_map_ram area checking in s_show() is not needed any more, remove it here. Link: https://lkml.kernel.org/r/20230206084020.174506-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:43 -08:00
Baoquan He	06c8994626	mm/vmalloc.c: allow vread() to read out vm_map_ram areas Currently, vread can read out vmalloc areas which is associated with a vm_struct. While this doesn't work for areas created by vm_map_ram() interface because it doesn't have an associated vm_struct. Then in vread(), these areas are all skipped. Here, add a new function vmap_ram_vread() to read out vm_map_ram areas. The area created with vmap_ram_vread() interface directly can be handled like the other normal vmap areas with aligned_vread(). While areas which will be further subdivided and managed with vmap_block need carefully read out page-aligned small regions and zero fill holes. Link: https://lkml.kernel.org/r/20230206084020.174506-4-bhe@redhat.com Reported-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Tested-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:42 -08:00
Baoquan He	869176a096	mm/vmalloc.c: add flags to mark vm_map_ram area Through vmalloc API, a virtual kernel area is reserved for physical address mapping. And vmap_area is used to track them, while vm_struct is allocated to associate with the vmap_area to store more information and passed out. However, area reserved via vm_map_ram() is an exception. It doesn't have vm_struct to associate with vmap_area. And we can't recognize the vmap_area with '->vm == NULL' as a vm_map_ram() area because the normal freeing path will set va->vm = NULL before unmapping, please see function remove_vm_area(). Meanwhile, there are two kinds of handling for vm_map_ram area. One is the whole vmap_area being reserved and mapped at one time through vm_map_area() interface; the other is the whole vmap_area with VMAP_BLOCK_SIZE size being reserved, while mapped into split regions with smaller size via vb_alloc(). To mark the area reserved through vm_map_ram(), add flags field into struct vmap_area. Bit 0 indicates this is vm_map_ram area created through vm_map_ram() interface, while bit 1 marks out the type of vm_map_ram area which makes use of vmap_block to manage split regions via vb_alloc/free(). This is a preparation for later use. Link: https://lkml.kernel.org/r/20230206084020.174506-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:42 -08:00
Baoquan He	d76f99548c	mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block Patch series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas", v5. Problem: * Stephen reported vread() will skip vm_map_ram areas when reading out /proc/kcore with drgn utility. Please see below link to get more details. /proc/kcore reads 0's for vmap_block https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u Root cause: * The normal vmalloc API uses struct vmap_area to manage the virtual kernel area allocated, and associate a vm_struct to store more information and pass out. However, area reserved through vm_map_ram() interface doesn't allocate vm_struct to associate with. So the current code in vread() will skip the vm_map_ram area through 'if (!va->vm)' conditional checking. Solution: *** To mark the area reserved through vm_map_ram() interface, add field 'flags' into struct vmap_area. Bit 0 indicates this is vm_map_ram area created through vm_map_ram() interface, bit 1 marks out the type of vm_map_ram area which makes use of vmap_block to manage split regions via vb_alloc/free(). And also add bitmap field 'used_map' into struct vmap_block to mark those further subdivided regions being used to differentiate with dirty and free regions in vmap_block. With the help of above vmap_area->flags and vmap_block->used_map, we can recognize and handle vm_map_ram areas successfully. All these are done in patch 1~3. Meanwhile, do some improvement on areas related to vm_map_ram areas in patch 4, 5. And also change area flag from VM_ALLOC to VM_IOREMAP in patch 6, 7 because this will show them as 'ioremap' in /proc/vmallocinfo, and exclude them from /proc/kcore. This patch (of 7): In one vmap_block area, there could be three types of regions: region being used which is allocated through vb_alloc(), dirty region which is freed via vb_free() and free region. Among them, only used region has available data. While there's no way to track those used regions currently. Here, add bitmap field used_map into vmap_block, and set/clear it during allocation or freeing regions of vmap_block area. This is a preparation for later use. Link: https://lkml.kernel.org/r/20230206084020.174506-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20230206084020.174506-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:42 -08:00
Hyunmin Lee	7e4a32c0e8	mm/vmalloc: replace BUG_ON with a simple if statement As per the coding standards, in the event of an abnormal condition that should not occur under normal circumstances, the kernel should attempt recovery and proceed with execution, rather than halting the machine. Specifically, in the alloc_vmap_area() function, use a simple if() instead of using BUG_ON() halting the machine. Link: https://lkml.kernel.org/r/20230201115142.GA7772@min-iamroot Co-developed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Co-developed-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Hyunmin Lee <hn.min.lee@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:41 -08:00
Suren Baghdasaryan	1c71222e5f	mm: replace vma->vm_flags direct modifications with modifier calls Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-09 16:51:39 -08:00
Christoph Hellwig	9e5fa0ae52	mm: refactor va_remove_mappings Move the VM_FLUSH_RESET_PERMS to the caller and rename the function to better describe what it is doing. Link: https://lkml.kernel.org/r/20230121071051.1143058-11-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:32 -08:00
Christoph Hellwig	79311c1fe0	mm: split __vunmap vunmap only needs to find and free the vmap_area and vm_strut, so open code that there and merge the rest of the code into vfree. Link: https://lkml.kernel.org/r/20230121071051.1143058-10-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:32 -08:00
Christoph Hellwig	17d3ef432d	mm: move debug checks from __vunmap to remove_vm_area All these checks apply to the free_vm_area interface as well, so move them to the common routine. Link: https://lkml.kernel.org/r/20230121071051.1143058-9-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:32 -08:00
Christoph Hellwig	75c59ce74e	mm: use remove_vm_area in __vunmap Use the common helper to find and remove a vmap_area instead of open coding it. Link: https://lkml.kernel.org/r/20230121071051.1143058-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:32 -08:00
Christoph Hellwig	39e65b7f63	mm: move __remove_vm_area out of va_remove_mappings __remove_vm_area is the only part of va_remove_mappings that requires a vmap_area. Move the call out to the caller and only pass the vm_struct to va_remove_mappings. Link: https://lkml.kernel.org/r/20230121071051.1143058-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:31 -08:00
Christoph Hellwig	5d3d31d6fb	mm: call vfree instead of __vunmap from delayed_vfree_work This adds an extra, never taken, in_interrupt() branch, but will allow to cut down the maze of vfree helpers. Link: https://lkml.kernel.org/r/20230121071051.1143058-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:31 -08:00
Christoph Hellwig	208162f42f	mm: move vmalloc_init and free_work down in vmalloc.c Move these two functions around a bit to avoid forward declarations. Link: https://lkml.kernel.org/r/20230121071051.1143058-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:31 -08:00
Christoph Hellwig	01e2e8394a	mm: remove __vfree_deferred Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on from vfree if called from interrupt context so that the extra low-level helper can be avoided. Link: https://lkml.kernel.org/r/20230121071051.1143058-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:31 -08:00
Christoph Hellwig	f41f036b80	mm: remove __vfree __vfree is a subset of vfree that just skips a few checks, and which is only used by vfree and an error cleanup path. Fold __vfree into vfree and switch the only other caller to call vfree() instead. Link: https://lkml.kernel.org/r/20230121071051.1143058-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:30 -08:00
Christoph Hellwig	37f3605e5e	mm: reject vmap with VM_FLUSH_RESET_PERMS Patch series "cleanup vfree and vunmap". This little series untangles the vfree and vunmap code path a bit. This patch (of 10): VM_FLUSH_RESET_PERMS is just for use with vmalloc as it is tied to freeing the underlying pages. Link: https://lkml.kernel.org/r/20230121071051.1143058-1-hch@lst.de Link: https://lkml.kernel.org/r/20230121071051.1143058-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:30 -08:00
Uladzislau Rezki (Sony)	14687619e1	mm: vmalloc: replace BUG_ON() by WARN_ON_ONCE() Currently a vm_unmap_ram() functions triggers a BUG() if an area is not found. Replace it by the WARN_ON_ONCE() error message and keep machine alive instead of stopping it. The worst case is a memory leaking. Link: https://lkml.kernel.org/r/20221222190022.134380-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:48 -08:00
Uladzislau Rezki (Sony)	edd898181e	mm: vmalloc: avoid calling __find_vmap_area() twice in __vunmap() Currently the __vunmap() path calls __find_vmap_area() twice. Once on entry to check that the area exists, then inside the remove_vm_area() function which also performs a new search for the VA. In order to improvie it from a performance point of view we split remove_vm_area() into two new parts: - find_unlink_vmap_area() that does a search and unlink from tree; - __remove_vm_area() that removes without searching. In this case there is no any functional change for remove_vm_area() whereas vm_remove_mappings(), where a second search happens, switches to the __remove_vm_area() variant where the already detached VA is passed as a parameter, so there is no need to find it again. Performance wise, i use test_vmalloc.sh with 32 threads doing alloc free on a 64-CPUs-x86_64-box: perf without this patch: - 31.41% 0.50% vmalloc_test/10 [kernel.vmlinux] [k] __vunmap - 30.92% __vunmap - 17.67% _raw_spin_lock native_queued_spin_lock_slowpath - 12.33% remove_vm_area - 11.79% free_vmap_area_noflush - 11.18% _raw_spin_lock native_queued_spin_lock_slowpath 0.76% free_unref_page perf with this patch: - 11.35% 0.13% vmalloc_test/14 [kernel.vmlinux] [k] __vunmap - 11.23% __vunmap - 8.28% find_unlink_vmap_area - 7.95% _raw_spin_lock 7.44% native_queued_spin_lock_slowpath - 1.93% free_vmap_area_noflush - 0.56% _raw_spin_lock 0.53% native_queued_spin_lock_slowpath 0.60% __vunmap_range_noflush __vunmap() consumes around ~20% less CPU cycles on this test. Also, switch from find_vmap_area() to find_unlink_vmap_area() to prevent a double access to the vmap_area_lock: one for finding area, second time is for unlinking from a tree. [urezki@gmail.com: switch to find_unlink_vmap_area() in vm_unmap_ram()] Link: https://lkml.kernel.org/r/20221222190022.134380-2-urezki@gmail.com Link: https://lkml.kernel.org/r/20221222190022.134380-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:48 -08:00
Lorenzo Stoakes	80b1d8fdfa	mm: vmalloc: correct use of __GFP_NOWARN mask in __vmalloc_area_node() This function sets __GFP_NOWARN in the gfp_mask rendering the warn_alloc() invocations no-ops. Remove this and instead rely on this flag being set only for the vm_area_alloc_pages() function, ensuring it is cleared for each of the warn_alloc() calls. Link: https://lkml.kernel.org/r/20221219123659.90614-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:45 -08:00
Uladzislau Rezki (Sony)	8c4196fe81	mm: vmalloc: use trace_free_vmap_area_noflush event It is for debug purposes and is called when a vmap area gets freed. This event gives some indication about: - a start address of released area; - a current number of outstanding pages; - a maximum number of allowed outstanding pages. Link: https://lkml.kernel.org/r/20221018181053.434508-7-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-08 17:37:17 -08:00
Uladzislau Rezki (Sony)	6030fd5fd1	mm: vmalloc: use trace_purge_vmap_area_lazy event This is for debug purposes and is called when all outstanding areas are removed back to the vmap space. It gives some extra information about: - a start:end range where set of vmap ares were freed; - a number of purged areas which were backed off. [urezki@gmail.com: simplify return boolean expression] Link: https://lkml.kernel.org/r/20221020125247.5053-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20221018181053.434508-6-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-08 17:37:17 -08:00
Uladzislau Rezki (Sony)	cf243da6ab	mm: vmalloc: use trace_alloc_vmap_area event This is for debug purpose and is called when an allocation attempt occurs. This event gives some information about: - start address of allocated area; - size that is requested; - alignment that is required; - vstart/vend restriction; - if an allocation fails. Link: https://lkml.kernel.org/r/20221018181053.434508-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-08 17:37:17 -08:00
Alexander Potapenko	b073d7f8ae	mm: kmsan: maintain KMSAN metadata for page operations Insert KMSAN hooks that make the necessary bookkeeping changes: - poison page shadow and origins in alloc_pages()/free_page(); - clear page shadow and origins in clear_page(), copy_user_highpage(); - copy page metadata in copy_highpage(), wp_page_copy(); - handle vmap()/vunmap()/iounmap(); Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:03:20 -07:00
Song Liu	bd1264c37c	mm/vmalloc: extend find_vmap_lowest_match_check with extra arguments find_vmap_lowest_match() is now able to handle different roots. With DEBUG_AUGMENT_LOWEST_MATCH_CHECK enabled as: : --- a/mm/vmalloc.c : +++ b/mm/vmalloc.c : @@ -713,7 +713,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn); : /* Global kva allocator / : : -#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0 : +#define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 1 compilation failed as: mm/vmalloc.c: In function 'find_vmap_lowest_match_check': mm/vmalloc.c:1328:32: warning: passing argument 1 of 'find_vmap_lowest_match' makes pointer from integer without a cast [-Wint-conversion] 1328 \| va_1 = find_vmap_lowest_match(size, align, vstart, false); \| ^~~~ \| \| \| long unsigned int mm/vmalloc.c:1236:40: note: expected 'struct rb_root ' but argument is of type 'long unsigned int' 1236 \| find_vmap_lowest_match(struct rb_root root, unsigned long size, \| ~~~~~~~~~~~~~~~~^~~~ mm/vmalloc.c:1328:9: error: too few arguments to function 'find_vmap_lowest_match' 1328 \| va_1 = find_vmap_lowest_match(size, align, vstart, false); \| ^~~~~~~~~~~~~~~~~~~~~~ mm/vmalloc.c:1236:1: note: declared here 1236 \| find_vmap_lowest_match(struct rb_root root, unsigned long size, \| ^~~~~~~~~~~~~~~~~~~~~~ Extend find_vmap_lowest_match_check() and find_vmap_lowest_linear_match() with extra arguments to fix this. Link: https://lkml.kernel.org/r/20220906060548.1127396-1-song@kernel.org Link: https://lkml.kernel.org/r/20220831052734.3423079-1-song@kernel.org Fixes: `f9863be493` ("mm/vmalloc: extend __alloc_vmap_area() with extra arguments") Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:26:05 -07:00
Matthew Wilcox	08262ac50a	mm/vmalloc.c: support HIGHMEM pages in vmap_pages_range_noflush() If the pages being mapped are in HIGHMEM, page_address() returns NULL. This probably wasn't noticed before because there aren't currently any architectures with HAVE_ARCH_HUGE_VMALLOC and HIGHMEM, but it's simpler to call page_to_phys() and futureproofs us against such configurations existing. Link: https://lkml.kernel.org/r/Yv6qHc6e+m7TMWhi@casper.infradead.org Fixes: `121e6f3258` ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:25:56 -07:00
Uladzislau Rezki (Sony)	899c6efe58	mm/vmalloc: extend __find_vmap_area() with one more argument __find_vmap_area() finds a "vmap_area" based on passed address. It scan the specific "vmap_area_root" rb-tree. Extend the function with one extra argument, so any tree can be specified where the search has to be done. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	5d7a7c54d3	mm/vmalloc: initialize VA's list node after unlink A vmap_area can travel between different places. For example attached/detached to/from different rb-trees. In order to prevent fancy bugs, initialize a VA's list node after it is removed from the list, so it pairs with VA's rb_node which is also initialized. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	f9863be493	mm/vmalloc: extend __alloc_vmap_area() with extra arguments It implies that __alloc_vmap_area() allocates only from the global vmap space, therefore a list-head and rb-tree, which represent a free vmap space, are not passed as parameters to this function and are accessed directly from this function. Extend the __alloc_vmap_area() and other dependent functions to have a possibility to allocate from different trees making an interface common and not specific. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Uladzislau Rezki (Sony)	8eb510db21	mm/vmalloc: make link_va()/unlink_va() common to different rb_root Patch series "Reduce a vmalloc internal lock contention preparation work". This small serias is preparation work to implement per-cpu vmalloc allocation in order to reduce a high internal lock contention. This series does not introduce any functional changes, it is only about preparation. This patch (of 5): Currently link_va() and unlik_va(), in order to figure out a tree type, compares a passed root value with a global free_vmap_area_root variable to distinguish the augmented rb-tree from a regular one. It is hard coded since such functions can manipulate only with specific "free_vmap_area_root" tree that represents a global free vmap space. Make it common by introducing "_augment" versions of both internal functions, so it is possible to deal with different trees. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20220607093449.3100-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20220607093449.3100-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Andrey Konovalov	6c2f761dad	kasan: fix zeroing vmalloc memory with HW_TAGS HW_TAGS KASAN skips zeroing page_alloc allocations backing vmalloc mappings via __GFP_SKIP_ZERO. Instead, these pages are zeroed via kasan_unpoison_vmalloc() by passing the KASAN_VMALLOC_INIT flag. The problem is that __kasan_unpoison_vmalloc() does not zero pages when either kasan_vmalloc_enabled() or is_vmalloc_or_module_addr() fail. Thus: 1. Change __vmalloc_node_range() to only set KASAN_VMALLOC_INIT when __GFP_SKIP_ZERO is set. 2. Change __kasan_unpoison_vmalloc() to always zero pages when the KASAN_VMALLOC_INIT flag is set. 3. Add WARN_ON() asserts to check that KASAN_VMALLOC_INIT cannot be set in other early return paths of __kasan_unpoison_vmalloc(). Also clean up the comment in __kasan_unpoison_vmalloc. Link: https://lkml.kernel.org/r/4bc503537efdc539ffc3f461c1b70162eea31cf6.1654798516.git.andreyknvl@google.com Fixes: `23689e91fb` ("kasan, vmalloc: add vmalloc tagging for HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:39 -07:00
akpm	46a3b11253	Merge branch 'master' into mm-stable	2022-06-27 10:31:34 -07:00
Baoquan He	153090f2c6	mm/vmalloc: add code comment for find_vmap_area_exceed_addr() Its behaviour is like find_vma() which finds an area above the specified address, add comment to make it easier to understand. And also fix two places of grammer mistake/typo. Link: https://lkml.kernel.org/r/20220607105958.382076-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-06-16 19:48:29 -07:00
Baoquan He	baa468a648	mm/vmalloc: fix typo in local variable name In __purge_vmap_area_lazy(), rename local_pure_list to local_purge_list. Link: https://lkml.kernel.org/r/20220607105958.382076-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-06-16 19:48:28 -07:00
Baoquan He	753df96be5	mm/vmalloc: remove the redundant boundary check In find_va_links(), when traversing the vmap_area tree, the comparing to check if the passed in 'va' is above or below 'tmp_va' is redundant, assuming both 'va' and 'tmp_va' has ->va_start <= ->va_end. Here, to simplify the checking as code change. Link: https://lkml.kernel.org/r/20220607105958.382076-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-06-16 19:48:28 -07:00
Baoquan He	1b23ff80b3	mm/vmalloc: invoke classify_va_fit_type() in adjust_va_to_fit_type() Patch series "Cleanup patches of vmalloc", v2. Some cleanup patches found when reading vmalloc code. This patch (of 4): adjust_va_to_fit_type() checks all values of passed in fit type, including NOTHING_FIT in the else branch. However, the check of NOTHING_FIT has been done inside adjust_va_to_fit_type() and before it's called in all call sites. In fact, both of these functions are coupled tightly, since classify_va_fit_type() is doing the preparation work for adjust_va_to_fit_type(). So putting invocation of classify_va_fit_type() inside adjust_va_to_fit_type() can simplify code logic and the redundant check of NOTHING_FIT issue will go away. Link: https://lkml.kernel.org/r/20220607105958.382076-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20220607105958.382076-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-06-16 19:48:28 -07:00
Matthew Wilcox (Oracle)	993d0b287e	usercopy: Handle vm_map_ram() areas vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes us to deny usercopies from those areas. This affects XFS which uses vm_map_ram() for its directories. Fix this by calling find_vmap_area() instead of find_vm_area(). Fixes: `0aef499f31` ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org	2022-06-13 09:54:52 -07:00
Sebastian Andrzej Siewior	3f80492001	mm/vmalloc: use raw_cpu_ptr() for vmap_block_queue access The per-CPU resource vmap_block_queue is accessed via get_cpu_var(). That macro disables preemption and then loads the pointer from the current CPU. This doesn't work on PREEMPT_RT because a spinlock_t is later accessed within the preempt-disable section. There is no need to disable preemption while accessing the per-CPU struct vmap_block_queue because the list is protected with a spinlock_t. The per-CPU struct is also accessed cross-CPU in purge_fragmented_blocks(). It is possible that by using raw_cpu_ptr() the code migrates to another CPU and uses struct from another CPU. This is fine because the list is locked and the locked section is very short. Use raw_cpu_ptr() to access vmap_block_queue. Link: https://lkml.kernel.org/r/YnKx3duAB53P7ojN@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-13 07:20:18 -07:00
Li kunyu	c8db8c2628	mm: functions may simplify the use of return values p4d_clear_huge may be optimized for void return type and function usage. vunmap_p4d_range function saves a few steps here. Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com Signed-off-by: Li kunyu <kunyu@nfschina.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-13 07:20:18 -07:00
Yury Norov	4fcdcc1291	vmap(): don't allow invalid pages vmap() takes struct page *pages as one of arguments, and user may provide an invalid pointer which may lead to corrupted translation table. An example of such behaviour is erroneous usage of virt_to_page(): vaddr1 = dma_alloc_coherent() page = virt_to_page() // Wrong here ... vaddr2 = vmap(page) memset(vaddr2) // Faulting here virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel address. The problem is that vmap() populates pte with bad pfn successfully, and it's much harder to debug at memory access time. This case should be caught by DEBUG_VIRTUAL being that enabled, but it's not enabled in popular distros. Kernel already checks the pages against NULL. In the case mentioned above, however, the address is not NULL, and it's big enough so that the hardware generated Address Size Abort on arm64: [ 665.484101] Unhandled fault at 0xffff8000252cd000 [ 665.488807] Mem abort info: [ 665.491617] ESR = 0x96000043 [ 665.494675] EC = 0x25: DABT (current EL), IL = 32 bits [ 665.499985] SET = 0, FnV = 0 [ 665.503039] EA = 0, S1PTW = 0 [ 665.506167] Data abort info: [ 665.509047] ISV = 0, ISS = 0x00000043 [ 665.512882] CM = 0, WnR = 1 [ 665.515851] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000818cb000 [ 665.522550] [ffff8000252cd000] pgd=000000affcfff003, pud=000000affcffe003, pmd=0000008fad8c3003, pte=00688000a5217713 [ 665.533160] Internal error: level 3 address size fault: 96000043 [#1] SMP [ 665.539936] Modules linked in: [...] [ 665.616212] CPU: 178 PID: 13199 Comm: test Tainted: P OE 5.4.0-84-generic #94~18.04.1-Ubuntu [ 665.626806] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [ 665.636618] pstate: 80400009 (Nzcv daif +PAN -UAO) [ 665.641407] pc : __memset+0x38/0x188 [ 665.645146] lr : test+0xcc/0x3f8 [ 665.650184] sp : ffff8000359bb840 [ 665.653486] x29: ffff8000359bb840 x28: 0000000000000000 [ 665.658785] x27: 0000000000000000 x26: 0000000000231000 [ 665.664083] x25: ffff00ae660f6110 x24: ffff00ae668cb800 [ 665.669382] x23: 0000000000000001 x22: ffff00af533e5000 [ 665.674680] x21: 0000000000001000 x20: 0000000000000000 [ 665.679978] x19: ffff00ae66950000 x18: ffffffffffffffff [ 665.685276] x17: 00000000588636a5 x16: 0000000000000013 [ 665.690574] x15: ffffffffffffffff x14: 000000000007ffff [ 665.695872] x13: 0000000080000000 x12: 0140000000000000 [ 665.701170] x11: 0000000000000041 x10: ffff8000652cd000 [ 665.706468] x9 : ffff8000252cf000 x8 : ffff8000252cd000 [ 665.711767] x7 : 0303030303030303 x6 : 0000000000001000 [ 665.717065] x5 : ffff8000252cd000 x4 : 0000000000000000 [ 665.722363] x3 : ffff8000252cdfff x2 : 0000000000000001 [ 665.727661] x1 : 0000000000000003 x0 : ffff8000252cd000 [ 665.732960] Call trace: [ 665.735395] __memset+0x38/0x188 [...] Interestingly, this abort happens even if copy_from_kernel_nofault() is used, which is quite inconvenient for debugging purposes. This patch adds a pfn_valid() check into vmap() path, so that invalid mapping will not be created; WARN_ON() is used to let client code know that something goes wrong, and it's not a regular EINVAL situation. Link: https://lkml.kernel.org/r/20220422220410.1308706-1-yury.norov@gmail.com Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexey Klimov <aklimov@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-04-28 23:16:00 -07:00
Yixuan Cao	98af39d52e	mm/vmalloc: fix a comment The sentence "but the mempolcy want to alloc memory by interleaving" should be rephrased with "but the mempolicy wants to alloc memory by interleaving" where "mempolicy" is a struct name. This work is coauthored by Yinan Zhang Jiajian Ye Shenghong Han Chongxi Zhao Yuhong Feng Yongqiang Liu Link: https://lkml.kernel.org/r/20220401064543.4447-1-caoyixuan2019@email.szu.edu.cn Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-04-28 23:16:00 -07:00
Nicholas Piggin	3b8000ae18	mm/vmalloc: huge vmalloc backing pages should be split rather than compound Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range). However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index. This is not compatible with compound sub-pages, and can cause bad page state issues like BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0\|zone=0\|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: dump_stack_lvl+0x74/0xa8 (unreliable) bad_page+0x12c/0x170 free_tail_pages_check+0xe8/0x190 free_pcp_prepare+0x31c/0x4e0 free_unref_page+0x40/0x1b0 __vunmap+0x1d8/0x420 ... The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages. Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Cc: Song Liu <songliubraving@fb.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-04-22 09:20:16 -07:00
Song Liu	559089e0a9	vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP Huge page backed vmalloc memory could benefit performance in many cases. However, some users of vmalloc may not be ready to handle huge pages for various reasons: hardware constraints, potential pages split, etc. VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt-out huge pages. However, it is not easy to track down all the users that require the opt-out, as the allocation are passed different stacks and may cause issues in different layers. To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag, VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages could ask specificially. Also, remove vmalloc_no_huge() and add opt-in helper vmalloc_huge(). Fixes: `fac54e2bfb` ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP") Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/" Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-04-19 12:08:57 -07:00
Omar Sandoval	c12cd77cb0	mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore Commit `3ee48b6af4` ("mm, x86: Saving vmcore with non-lazy freeing of vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge the vmap areas instead of doing it lazily. Commit `690467c81b` ("mm/vmalloc: Move draining areas out of caller context") moved the purging from the vunmap() caller to a worker thread. Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin (possibly forever). For example, consider the following scenario: 1. Thread reads from /proc/vmcore. This eventually calls __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1. 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2 pages (one page plus the guard page) to the purge list and vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the drain_vmap_work is scheduled. 3. Thread returns from the kernel and is scheduled out. 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It frees the 2 pages on the purge list. vmap_lazy_nr is now lazy_max_pages() + 1. 5. This is still over the threshold, so it tries to purge areas again, but doesn't find anything. 6. Repeat 5. If the system is running with only one CPU (which is typicial for kdump) and preemption is disabled, then this will never make forward progress: there aren't any more pages to purge, so it hangs. If there is more than one CPU or preemption is enabled, then the worker thread will spin forever in the background. (Note that if there were already pages to be purged at the time that set_iounmap_nonlazy() was called, this bug is avoided.) This can be reproduced with anything that reads from /proc/vmcore multiple times. E.g., vmcore-dmesg /proc/vmcore. It turns out that improvements to vmap() over the years have obsoleted the need for this "optimization". I benchmarked `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed entirely: \|5.17 \|5.18+fix\|5.18+removal 4k\|40.86s\| 40.09s\| 26.73s 1M\|24.47s\| 23.98s\| 21.84s The removal was the fastest (by a wide margin with 4k reads). This patch removes set_iounmap_nonlazy(). Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com Fixes: `690467c81b` ("mm/vmalloc: Move draining areas out of caller context") Signed-off-by: Omar Sandoval <osandov@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-04-15 14:49:56 -07:00
Andrey Konovalov	f6e39794f4	kasan, vmalloc: only tag normal vmalloc allocations The kernel can use to allocate executable memory. The only supported way to do that is via __vmalloc_node_range() with the executable bit set in the prot argument. (vmap() resets the bit via pgprot_nx()). Once tag-based KASAN modes start tagging vmalloc allocations, executing code from such allocations will lead to the PC register getting a tag, which is not tolerated by the kernel. Only tag the allocations for normal kernel pages. [andreyknvl@google.com: pass KASAN_VMALLOC_PROT_NORMAL to kasan_unpoison_vmalloc()] Link: https://lkml.kernel.org/r/9230ca3d3e40ffca041c133a524191fd71969a8d.1646233925.git.andreyknvl@google.com [andreyknvl@google.com: support tagged vmalloc mappings] Link: https://lkml.kernel.org/r/2f6605e3a358cf64d73a05710cb3da356886ad29.1646233925.git.andreyknvl@google.com [andreyknvl@google.com: don't unintentionally disabled poisoning] Link: https://lkml.kernel.org/r/de4587d6a719232e83c760113e46ed2d4d8da61e.1646757322.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/fbfd9939a4dc375923c9a5c6b9e7ab05c26b8c6b.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:48 -07:00
Andrey Konovalov	23689e91fb	kasan, vmalloc: add vmalloc tagging for HW_TAGS Add vmalloc tagging support to HW_TAGS KASAN. The key difference between HW_TAGS and the other two KASAN modes when it comes to vmalloc: HW_TAGS KASAN can only assign tags to physical memory. The other two modes have shadow memory covering every mapped virtual memory region. Make __kasan_unpoison_vmalloc() for HW_TAGS KASAN: - Skip non-VM_ALLOC mappings as HW_TAGS KASAN can only tag a single mapping of normal physical memory; see the comment in the function. - Generate a random tag, tag the returned pointer and the allocation, and initialize the allocation at the same time. - Propagate the tag into the page stucts to allow accesses through page_address(vmalloc_to_page()). The rest of vmalloc-related KASAN hooks are not needed: - The shadow-related ones are fully skipped. - __kasan_poison_vmalloc() is kept as a no-op with a comment. Poisoning and zeroing of physical pages that are backing vmalloc() allocations are skipped via __GFP_SKIP_KASAN_UNPOISON and __GFP_SKIP_ZERO: __kasan_unpoison_vmalloc() does that instead. Enabling CONFIG_KASAN_VMALLOC with HW_TAGS is not yet allowed. Link: https://lkml.kernel.org/r/d19b2e9e59a9abc59d05b72dea8429dcaea739c6.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:48 -07:00
Andrey Konovalov	19f1c3acf8	kasan, vmalloc: unpoison VM_ALLOC pages after mapping Make KASAN unpoison vmalloc mappings after they have been mapped in when it's possible: for vmalloc() (indentified via VM_ALLOC) and vm_map_ram(). The reasons for this are: - For vmalloc() and vm_map_ram(): pages don't get unpoisoned in case mapping them fails. - For vmalloc(): HW_TAGS KASAN needs pages to be mapped to set tags via kasan_unpoison_vmalloc(). As a part of these changes, the return value of __vmalloc_node_range() is changed to area->addr. This is a non-functional change, as __vmalloc_area_node() returns area->addr anyway. Link: https://lkml.kernel.org/r/fcb98980e6fcd3c4be6acdcb5d6110898ef28548.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:47 -07:00
Andrey Konovalov	01d92c7f35	kasan, vmalloc, arm64: mark vmalloc mappings as pgprot_tagged HW_TAGS KASAN relies on ARM Memory Tagging Extension (MTE). With MTE, a memory region must be mapped as MT_NORMAL_TAGGED to allow setting memory tags via MTE-specific instructions. Add proper protection bits to vmalloc() allocations. These allocations are always backed by page_alloc pages, so the tags will actually be getting set on the corresponding physical memory. Link: https://lkml.kernel.org/r/983fc33542db2f6b1e77b34ca23448d4640bbb9e.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:47 -07:00
Andrey Konovalov	1d96320f8d	kasan, vmalloc: add vmalloc tagging for SW_TAGS Add vmalloc tagging support to SW_TAGS KASAN. - __kasan_unpoison_vmalloc() now assigns a random pointer tag, poisons the virtual mapping accordingly, and embeds the tag into the returned pointer. - __get_vm_area_node() (used by vmalloc() and vmap()) and pcpu_get_vm_areas() save the tagged pointer into vm_struct->addr (note: not into vmap_area->addr). This requires putting kasan_unpoison_vmalloc() after setup_vmalloc_vm[_locked](); otherwise the latter will overwrite the tagged pointer. The tagged pointer then is naturally propagateed to vmalloc() and vmap(). - vm_map_ram() returns the tagged pointer directly. As a result of this change, vm_struct->addr is now tagged. Enabling KASAN_VMALLOC with SW_TAGS is not yet allowed. Link: https://lkml.kernel.org/r/4a78f3c064ce905e9070c29733aca1dd254a74f1.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:47 -07:00
Andrey Konovalov	4aff1dc4fb	kasan, vmalloc: reset tags in vmalloc functions In preparation for adding vmalloc support to SW/HW_TAGS KASAN, reset pointer tags in functions that use pointer values in range checks. vread() is a special case here. Despite the untagging of the addr pointer in its prologue, the accesses performed by vread() are checked. Instead of accessing the virtual mappings though addr directly, vread() recovers the physical address via page_address(vmalloc_to_page()) and acceses that. And as page_address() recovers the pointer tag, the accesses get checked. Link: https://lkml.kernel.org/r/046003c5f683cacb0ba18e1079e9688bb3dca943.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:47 -07:00
Andrey Konovalov	63840de296	kasan, x86, arm64, s390: rename functions for modules shadow Rename kasan_free_shadow to kasan_free_module_shadow and kasan_module_alloc to kasan_alloc_module_shadow. These functions are used to allocate/free shadow memory for kernel modules when KASAN_VMALLOC is not enabled. The new names better reflect their purpose. Also reword the comment next to their declaration to improve clarity. Link: https://lkml.kernel.org/r/36db32bde765d5d0b856f77d2d806e838513fe84.1643047180.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-24 19:06:47 -07:00
Jiapeng Chong	c3385e8458	mm/vmalloc.c: fix "unused function" warning compute_subtree_max_size() is unused, when building with DEBUG_AUGMENT_PROPAGATE_CHECK=y. mm/vmalloc.c:785:1: warning: unused function 'compute_subtree_max_size' [-Wunused-function]. Link: https://lkml.kernel.org/r/20220129034652.75359-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:05 -07:00
Uladzislau Rezki (Sony)	c3d77172df	mm/vmalloc: eliminate an extra orig_gfp_mask That extra variable has been introduced just for keeping an original passed gfp_mask because it is updated with __GFP_NOWARN on entry, thus error handling messages were broken. Instead we can keep an original gfp_mask without modifying it and add an extra __GFP_NOWARN flag together with gfp_mask as a parameter to the vm_area_alloc_pages() function. It will make it less confused. Link: https://lkml.kernel.org/r/20220119143540.601149-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:05 -07:00
Uladzislau Rezki	9333fe98d0	mm/vmalloc: add adjust_search_size parameter Extend the find_vmap_lowest_match() function with one more parameter. It is "adjust_search_size" boolean variable, so it is possible to control an accuracy of search block if a specific alignment is required. With this patch, a search size is always adjusted, to serve a request as fast as possible because of performance reason. But there is one exception though, it is short ranges where requested size corresponds to passed vstart/vend restriction together with a specific alignment request. In such scenario an adjustment wold not lead to success allocation. Link: https://lkml.kernel.org/r/20220119143540.601149-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:05 -07:00
Uladzislau Rezki (Sony)	690467c81b	mm/vmalloc: Move draining areas out of caller context A caller initiates the drain procces from its context once the drain threshold is reached or passed. There are at least two drawbacks of doing so: a) a caller can be a high-prio or RT task. In that case it can stuck in doing the actual drain of all lazily freed areas. This is not optimal because such tasks usually are latency sensitive where the control should be returned back as soon as possible in order to drive such workloads in time. See `96e2db4561` ("mm/vmalloc: rework the drain logic") b) It is not safe to call vfree() during holding a spinlock due to the vmap_purge_lock mutex. The was a report about this from Zeal Robot <zealci@zte.com.cn> here: https://lore.kernel.org/all/20211222081026.484058-1-chi.minghao@zte.com.cn Moving the drain to the separate work context addresses those issues. v1->v2: - Added prefix "_work" to the drain worker function. v2->v3: - Remove the drain_vmap_work_in_progress. Extra queuing is expectable under heavy load but it can be disregarded because a work will bail out if nothing to be done. Link: https://lkml.kernel.org/r/20220131144058.35608-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:05 -07:00
Miaohe Lin	651d55ce09	mm/vmalloc: remove unneeded function forward declaration The forward declaration for lazy_max_pages() is unnecessary. Remove it. Link: https://lkml.kernel.org/r/20220124133752.60663-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:05 -07:00
Anshuman Khandual	16785bd774	mm: merge pte_mkhuge() call into arch_make_huge_pte() Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte(). Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning. This updates generic fallback stub for arch_make_huge_pte() and available platforms definitions. This makes huge pte creation much cleaner and easier to follow. Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:04 -07:00
Michal Hocko	30d3f01191	mm/vmalloc: be more explicit about supported gfp flags. Commit `b7d90e7a5e` ("mm/vmalloc: be more explicit about supported gfp flags") has been merged prematurely without the rest of the series and without addressed review feedback from Neil. Fix that up now. Only wording is changed slightly. Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:28 +02:00
Michal Hocko	9376130c39	mm/vmalloc: add support for __GFP_NOFAIL Dave Chinner has mentioned that some of the xfs code would benefit from kvmalloc support for __GFP_NOFAIL because they have allocations that cannot fail and they do not fit into a single page. The large part of the vmalloc implementation already complies with the given gfp flags so there is no work for those to be done. The area and page table allocations are an exception to that. Implement a retry loop for those. Add a short sleep before retrying. 1 jiffy is a completely random timeout. Ideally the retry would wait for an explicit event - e.g. a change to the vmalloc space change if the failure was caused by the space fragmentation or depletion. But there are multiple different reasons to retry and this could become much more complex. Keep the retry simple for now and just sleep to prevent from hogging CPUs. Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:28 +02:00
Michal Hocko	451769ebb7	mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc Patch series "extend vmalloc support for constrained allocations", v2. Based on a recent discussion with Dave and Neil [1] I have tried to implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of kvmalloc users easier. A requirement for NOFAIL support for kvmalloc was new to me but this seems to be really needed by the xfs code. NOFS/NOIO was a known and a long term problem which was hoped to be handled by the scope API. Those scope should have been used at the reclaim recursion boundaries both to document them and also to remove the necessity of NOFS/NOIO constrains for all allocations within that scope. Instead workarounds were developed to wrap a single allocation instead (like ceph_kvmalloc). First patch implements NOFS/NOIO support for vmalloc. The second one adds NOFAIL support and the third one bundles all together into kvmalloc and drops ceph_kvmalloc which can use kvmalloc directly now. [1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown This patch (of 4): vmalloc historically hasn't supported GFP_NO{FS,IO} requests because page table allocations do not support externally provided gfp mask and performed GFP_KERNEL like allocations. Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs to enforce NOFS and NOIO constrains implicitly to all allocators within the scope. There was a hope that those scopes would be defined on a higher level when the reclaim recursion boundary starts/stops (e.g. when a lock required during the memory reclaim is required etc.). It seems that not all NOFS/NOIO users have adopted this approach and instead they have taken a workaround approach to wrap a single [k]vmalloc allocation by a scope API. These workarounds do not serve the purpose of a better reclaim recursion documentation and reduction of explicit GFP_NO{FS,IO} usege so let's just provide them with the semantic they are asking for without a need for workarounds. Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal allocations already comply with the given gfp_mask. The only current exception is vmap_pages_range which maps kernel page tables. Infer the proper scope API based on the given gfp mask. [sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h] Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:28 +02:00
Shakeel Butt	4e5aa1f4c2	memcg: add per-memcg vmalloc stat The kvmalloc* allocation functions can fallback to vmalloc allocations and more often on long running machines. In addition the kernel does have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines, the memory.stat does not tell the complete picture which type of memory is charged to the memcg. So add a per-memcg vmalloc stat. [shakeelb@google.com: page_memcg() within rcu lock, per Muchun] Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com [akpm@linux-foundation.org: remove cast, per Muchun] [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal] Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:27 +02:00
Kefeng Wang	60115fa54a	mm: defer kmemleak object creation of module_alloc() Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN enabled(without KASAN_VMALLOC) on x86[1]. When the module area allocates memory, it's kmemleak_object is created successfully, but the KASAN shadow memory of module allocation is not ready, so when kmemleak scan the module's pointer, it will panic due to no shadow memory with KASAN check. module_alloc __vmalloc_node_range kmemleak_vmalloc kmemleak_scan update_checksum kasan_module_alloc kmemleak_ignore Note, there is no problem if KASAN_VMALLOC enabled, the modules area entire shadow memory is preallocated. Thus, the bug only exits on ARCH which supports dynamic allocation of module area per module load, for now, only x86/arm64/s390 are involved. Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of kmemleak in module_alloc() to fix this issue. [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/ [wangkefeng.wang@huawei.com: fix build] Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com [akpm@linux-foundation.org: simplify ifdefs, per Andrey] Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com Fixes: `793213a82d` ("s390/kasan: dynamic shadow mem allocation for modules") Fixes: `39d114ddc6` ("arm64: add KASAN support") Fixes: `bebf56a1b1` ("kasan: enable instrumentation of global variables") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:25 +02:00
Chen Wandun	c00b6b9610	mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation Commit `ffb29b1c25` ("mm/vmalloc: fix numa spreading for large hash tables") can cause significant performance regressions in some situations as Andrew mentioned in [1]. The main situation is vmalloc, vmalloc will allocate pages with NUMA_NO_NODE by default, that will result in alloc page one by one; In order to solve this, __alloc_pages_bulk and mempolicy should be considered at the same time. 1) If node is specified in memory allocation request, it will alloc all pages by __alloc_pages_bulk. 2) If interleaving allocate memory, it will cauculate how many pages should be allocated in each node, and use __alloc_pages_bulk to alloc pages in each node. [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60 [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: make two functions static] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Michal Hocko	b7d90e7a5e	mm/vmalloc: be more explicit about supported gfp flags The core of the vmalloc allocator __vmalloc_area_node doesn't say anything about gfp mask argument. Not all gfp flags are supported though. Be more explicit about constraints. Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Kefeng Wang	3252b1d830	kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes: Unable to handle kernel paging request at virtual address ffff7000028f2000 ... swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000 [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000 Internal error: Oops: 96000007 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62 Hardware name: linux,dummy-virt (DT) pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--) pc : kasan_check_range+0x90/0x1a0 lr : memcpy+0x88/0xf4 sp : ffff80001378fe20 ... Call trace: kasan_check_range+0x90/0x1a0 pcpu_page_first_chunk+0x3f0/0x568 setup_per_cpu_areas+0xb8/0x184 start_kernel+0x8c/0x328 The vm area used in vm_area_register_early() has no kasan shadow memory, Let's add a new kasan_populate_early_vm_area_shadow() function to populate the vm area shadow memory to fix the issue. [wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow'] Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Marco Elver <elver@google.com> [KASAN] Acked-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN] Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Kefeng Wang	0eb68437a7	vmalloc: choose a better start address in vm_area_register_early() Percpu embedded first chunk allocator is the firstly option, but it could fail on ARM64, eg, percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000 percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000 percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000 then we could get to WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838 and the system cannot boot successfully. Let's implement page mapping percpu first chunk allocator as a fallback to the embedding allocator to increase the robustness of the system. Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and KASAN_VMALLOC enabled. Tested on ARM64 qemu with cmdline "percpu_alloc=page". This patch (of 3): There are some fixed locations in the vmalloc area be reserved in ARM(see iotable_init()) and ARM64(see map_kernel()), but for pcpu_page_first_chunk(), it calls vm_area_register_early() and choose VMALLOC_START as the start address of vmap area which could be conflicted with above address, then could trigger a BUG_ON in vm_area_add_early(). Let's choose a suit start address by traversing the vmlist. Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Vasily Averin	dd544141b9	vmalloc: back off when the current task is OOM-killed Huge vmalloc allocation on heavy loaded node can lead to a global memory shortage. Task called vmalloc can have worst badness and be selected by OOM-killer, however taken fatal signal does not interrupt allocation cycle. Vmalloc repeat page allocaions again and again, exacerbating the crisis and consuming the memory freed up by another killed tasks. After a successful completion of the allocation procedure, a fatal signal will be processed and task will be destroyed finally. However it may not release the consumed memory, since the allocated object may have a lifetime unrelated to the completed task. In the worst case, this can lead to the host will panic due to "Out of memory and no killable processes..." This patch allows OOM-killer to break vmalloc cycle, makes OOM more effective and avoid host panic. It does not check oom condition directly, however, and breaks page allocation cycle when fatal signal was received. This may trigger some hidden problems, when caller does not handle vmalloc failures, or when rollaback after failed vmalloc calls own vmallocs inside. However all of these scenarios are incorrect: vmalloc does not guarantee successful allocation, it has never been called with __GFP_NOFAIL and threfore either should not be used for any rollbacks or should handle such errors correctly and not lead to critical failures. Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Uladzislau Rezki (Sony)	066fed59d8	mm/vmalloc: check various alignments when debugging Before we did not guarantee a free block with lowest start address for allocations with alignment >= PAGE_SIZE. Because an alignment overhead was included into a search length like below: length = size + align - 1; doing so we make sure that a bigger block would fit after applying an alignment adjustment. Now there is no such limitation, i.e. any alignment that user wants to apply will result to a lowest address of returned free area. Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Ping Fang <pifang@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Uladzislau Rezki (Sony)	9f531973df	mm/vmalloc: do not adjust the search size for alignment overhead We used to include an alignment overhead into a search length, in that case we guarantee that a found area will definitely fit after applying a specific alignment that user specifies. From the other hand we do not guarantee that an area has the lowest address if an alignment is >= PAGE_SIZE. It means that, when a user specifies a special alignment together with a range that corresponds to an exact requested size then an allocation will fail. This is what happens to KASAN, it wants the free block that exactly matches a specified range during onlining memory banks: [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state [root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8e/0xd0 warn_alloc.cold.90+0x8a/0x1b2 ? zone_watermark_ok_safe+0x300/0x300 ? slab_free_freelist_hook+0x85/0x1a0 ? __get_vm_area_node+0x240/0x2c0 ? kfree+0xdd/0x570 ? kmem_cache_alloc_node_trace+0x157/0x230 ? notifier_call_chain+0x90/0x160 __vmalloc_node_range+0x465/0x840 ? mark_held_locks+0xb7/0x120 Fix it by making sure that find_vmap_lowest_match() returns lowest start address with any given alignment value, i.e. for alignments bigger then PAGE_SIZE the algorithm rolls back toward parent nodes checking right sub-trees if the most left free block did not fit due to alignment overhead. Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com Fixes: `68ad4a3304` ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: Ping Fang <pifang@redhat.com> Tested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Eric Dumazet	7cc7913e8e	mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo If last va found in vmap_area_list does not have a vm pointer, vmallocinfo.s_show() returns 0, and show_purge_info() is not called as it should. Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com Fixes: `dd3b8353ba` ("mm/vmalloc: do not keep unpurged areas in the busy tree") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Pengfei Li <lpf.vector@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:37 -07:00
Eric Dumazet	51e50b3a22	mm/vmalloc: make show_numa_info() aware of hugepage mappings show_numa_info() can be slightly faster, by skipping over hugepages directly. Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:36 -07:00
Peter Zijlstra	bd1a8fb2d4	mm/vmalloc: don't allow VM_NO_GUARD on vmap() The vmalloc guard pages are added on top of each allocation, thereby isolating any two allocations from one another. The top guard of the lower allocation is the bottom guard guard of the higher allocation etc. Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of isolating separate allocations. There are only two in-tree users of this flag, neither of which use it through the exported interface. Ensure it stays this way. Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-06 13:30:36 -07:00

1 2 3 4 5 ...

566 Commits