mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-09-04 20:19:47 +08:00 
			
		
		
		
	 ee86588960
			
		
	
	
			It is enough to use a file name to cross-reference another rst document. Jon says: The right things will happen in the HTML output, readers of the plain-text will know immediately where to go, and we don't have to add the label clutter. Drop reference markup and unnecessary labels and use plain file names. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Link: https://lore.kernel.org/r/20230201094156.991542-3-rppt@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
		
			
				
	
	
		
			176 lines
		
	
	
		
			7.9 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. There could also be several contiguous ranges at
completely distinct addresses. And don't forget NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of two memory models:
FLATMEM and SPARSEMEM. Each architecture defines which
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there is a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that convert a PFN to the corresponding `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. However, the `mem_map` array
is not usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In that case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

`ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address other than 0.

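The FLATMEM conversion can be modelled in a few lines of userspace C. This
is only a sketch of the arithmetic described above, not the kernel
implementation: the `struct page` here is a one-field stand-in, the values of
`ARCH_PFN_OFFSET` and `NR_PAGES` are made up for illustration, and the helper
names `flat_pfn_to_page()` and `flat_page_to_pfn()` are hypothetical.

```c
#include <stddef.h>

/* Userspace stand-in; the kernel's struct page carries far more state. */
struct page {
	unsigned long flags;
};

/* Hypothetical values: RAM starts at 2 GiB, with 4 KiB pages. */
#define ARCH_PFN_OFFSET	0x80000UL
#define NR_PAGES	1024UL

/* One entry per page frame, holes included. */
static struct page mem_map[NR_PAGES];

/* PFN -> struct page: index mem_map by the PFN relative to the first frame. */
static inline struct page *flat_pfn_to_page(unsigned long pfn)
{
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> PFN: the offset into mem_map plus the first frame number. */
static inline unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

In this model the first page frame, `ARCH_PFN_OFFSET`, maps to `&mem_map[0]`,
and the two helpers are exact inverses of each other for any PFN covered by
the array.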
SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored in an encoded form, together
with additional information that aids section management. The section size
and the maximal number of sections are specified using the
`SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` constants defined by each
architecture that supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the
actual width of a physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

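To make the numbers concrete, the sketch below works through the section
count and the `CONFIG_SPARSEMEM_EXTREME` row layout described above. The
constants are illustrative values rather than any particular architecture's
configuration, the `struct mem_section` is a simplified two-field stand-in,
and the helper functions `section_nr_to_root()` and `section_nr_to_index()`
are hypothetical names for the row/column split.

```c
#include <stddef.h>

/* Illustrative values; real ones come from the architecture headers. */
#define PAGE_SIZE		4096UL
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define MAX_PHYSMEM_BITS	46

#define NR_MEM_SECTIONS	(1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Simplified stand-in for the kernel's struct mem_section. */
struct mem_section {
	unsigned long section_mem_map;
	void *usage;
};

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT	(PAGE_SIZE / sizeof(struct mem_section))

/* Enough rows to fit all sections, rounding up. */
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Row that holds section number nr. */
static inline unsigned long section_nr_to_root(unsigned long nr)
{
	return nr / SECTIONS_PER_ROOT;
}

/* Position of section number nr inside its row. */
static inline unsigned long section_nr_to_index(unsigned long nr)
{
	return nr % SECTIONS_PER_ROOT;
}
```

With these assumed values `NR_MEM_SECTIONS` is 2^19. Because rows can be
allocated on demand, a system with only a few populated sections does not
pay for the whole table.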
The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

Classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.

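The classic sparse lookup can be sketched as a two-step operation: the high
bits of the PFN select a section, and the low bits select a page within it.
The model below is a simplification under stated assumptions: the constants
are illustrative, `section_table`, `pages_a`, `pages_b` and
`sparse_pfn_to_page()` are hypothetical names, and the section stores a plain
pointer rather than the kernel's encoded `section_mem_map`.

```c
#include <stddef.h>

/* Illustrative values; real ones come from the architecture headers. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define DEMO_NR_SECTIONS	4UL	/* hypothetical, tiny on purpose */

struct page {
	unsigned long flags;
};

/* Simplified: a plain pointer instead of the encoded section_mem_map. */
struct mem_section {
	struct page *map;
};

/* Two populated sections; sections 1 and 3 are holes with no memory map. */
static struct page pages_a[PAGES_PER_SECTION];
static struct page pages_b[PAGES_PER_SECTION];

static struct mem_section section_table[DEMO_NR_SECTIONS] = {
	[0] = { .map = pages_a },
	[2] = { .map = pages_b },
};

/* The high bits of a PFN give the section number. */
static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* Look up the section, then index its page array with the low PFN bits. */
static inline struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr >= DEMO_NR_SECTIONS || !section_table[nr].map)
		return NULL;
	return section_table[nr].map + (pfn & (PAGES_PER_SECTION - 1));
}
```

The reverse direction is not modelled here. It needs the section number
recorded in page->flags, because a `struct page` pointer alone does not say
which section's array it belongs to.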
Sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

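With a virtually contiguous memory map, both conversions reduce to pointer
arithmetic with no section lookup. The sketch below models this in userspace.
A small static array stands in for the reserved virtual address range, and
`vmemmap_backing`, `DEMO_NR_PAGES` and the two helper names are hypothetical.

```c
#include <stddef.h>

struct page {
	unsigned long flags;
};

#define DEMO_NR_PAGES	4096UL	/* hypothetical size of the mapped range */

/* Stand-in for the virtual range that an architecture reserves. */
static struct page vmemmap_backing[DEMO_NR_PAGES];
static struct page *const vmemmap = vmemmap_backing;

/* The PFN is used directly as an index into the array. */
static inline struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* The offset of the struct page from vmemmap is the PFN. */
static inline unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

In the kernel the array may be sparsely backed: only the parts of the virtual
range that describe present memory need physical pages behind them.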
To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users need a
smaller granularity for populating the `mem_map`. Because
`ZONE_DEVICE` memory is never marked online, its memory ranges are never
exposed through the sysfs memory hotplug API on memory block boundaries.
The implementation relies on this lack of user-API constraint to allow
sub-section sized memory ranges to be specified to
:c:func:`arch_add_memory`, the top-half of memory hotplug. Sub-section
support allows for 2MB as the cross-arch common alignment granularity
for :c:func:`devm_memremap_pages`.

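The 2MB granularity can be illustrated with a small alignment check. This is
a sketch of the constraint only, not the validation performed by the kernel:
`SUBSECTION_SIZE` is set to 2MB here to match the text above, and
`range_is_subsection_aligned()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)	/* 2MB */

/*
 * A physical range is usable at sub-section granularity when it is not
 * empty and both its start and its size are multiples of 2MB.
 */
static inline bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	const uint64_t mask = SUBSECTION_SIZE - 1;

	return size != 0 && (start & mask) == 0 && (size & mask) == 0;
}
```

Under these assumptions a 2MB range starting at 4 GiB passes the check, while
the same range shifted by one 4 KiB page does not.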
The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/PCIe topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.