mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (synced 2025-09-04 20:19:47 +08:00)
	mm: frontswap: config and doc files
This patch 4 of 4 adds configuration and documentation files including a FAQ.

[v14: updated docs/FAQ to use zcache and RAMster as examples]
[v10: no change]
[v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v6: rebase to 3.0-rc1]
[v5: change config default to n]
[v4: rebase to 2.6.39]

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
parent 29f233cfff
commit 27c6aec214
Documentation/vm/frontswap.txt (normal file, 278 lines added)
| @@ -0,0 +1,278 @@ | ||||
| Frontswap provides a "transcendent memory" interface for swap pages. | ||||
| In some environments, dramatic performance savings may be obtained because | ||||
| swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. | ||||
| 
 | ||||
| (Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" | ||||
| and the only necessary changes to the core kernel for transcendent memory; | ||||
| all other supporting code -- the "backends" -- is implemented as drivers. | ||||
| See the LWN.net article "Transcendent memory in a nutshell" for a detailed | ||||
| overview of frontswap and related kernel parts: | ||||
| https://lwn.net/Articles/454795/ ) | ||||
| 
 | ||||
| Frontswap is so named because it can be thought of as the opposite of | ||||
| a "backing" store for a swap device.  The storage is assumed to be | ||||
| a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming | ||||
| to the requirements of transcendent memory (such as Xen's "tmem", or | ||||
| in-kernel compressed memory, aka "zcache", or future RAM-like devices); | ||||
| this pseudo-RAM device is not directly accessible or addressable by the | ||||
| kernel and is of unknown and possibly time-varying size.  The driver | ||||
| links itself to frontswap by calling frontswap_register_ops to set the | ||||
| frontswap_ops funcs appropriately and the functions it provides must | ||||
| conform to certain policies as follows: | ||||
| 
 | ||||
| An "init" prepares the device to receive frontswap pages associated | ||||
| with the specified swap device number (aka "type").  A "put_page" will | ||||
| copy the page to transcendent memory and associate it with the type and | ||||
| offset associated with the page. A "get_page" will copy the page, if found, | ||||
| from transcendent memory into kernel memory, but will NOT remove the page | ||||
| from transcendent memory.  An "invalidate_page" will remove the page | ||||
| from transcendent memory and an "invalidate_area" will remove ALL pages | ||||
| associated with the swap type (e.g., like swapoff) and notify the "device" | ||||
| to refuse further puts with that swap type. | ||||
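The contract described above can be sketched as a small userspace model. This is an illustration in Python, not kernel code: the class name ToyBackend, its capacity parameter and the dict-based storage are assumptions made for the example. Only the five operation names and their described behaviour come from the text.

```python
class ToyBackend:
    """Userspace model of the five frontswap backend operations."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical limit, in pages
        self.pages = {}               # (type, offset) -> page contents
        self.active_types = set()     # swap types that may still be put

    def init(self, swap_type):
        # Prepare to receive pages for this swap device number.
        self.active_types.add(swap_type)

    def put_page(self, swap_type, offset, data):
        # Copy the page in; return 0 on success, -1 if rejected.
        key = (swap_type, offset)
        if swap_type not in self.active_types:
            return -1
        if key not in self.pages and len(self.pages) >= self.capacity:
            return -1
        self.pages[key] = bytes(data)
        return 0

    def get_page(self, swap_type, offset):
        # Copy the page out if found; the stored copy is NOT removed.
        return self.pages.get((swap_type, offset))

    def invalidate_page(self, swap_type, offset):
        self.pages.pop((swap_type, offset), None)

    def invalidate_area(self, swap_type):
        # Drop every page of this type and refuse further puts for it.
        for key in [k for k in self.pages if k[0] == swap_type]:
            del self.pages[key]
        self.active_types.discard(swap_type)
```

A real backend decides acceptance however it likes; the fixed page count here is only the simplest policy that makes rejection visible.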
| 
 | ||||
| Once a page is successfully put, a matching get on the page will normally | ||||
| succeed.  So when the kernel finds itself in a situation where it needs | ||||
| to swap out a page, it first attempts to use frontswap.  If the put returns | ||||
| success, the data has been successfully saved to transcendent memory and | ||||
| a disk write and, if the data is later read back, a disk read are avoided. | ||||
| If a put returns failure, transcendent memory has rejected the data, and the | ||||
| page can be written to swap as usual. | ||||
| 
 | ||||
| If a backend chooses, frontswap can be configured as a "writethrough | ||||
| cache" by calling frontswap_writethrough().  In this mode, the reduction | ||||
| in swap device writes is lost (and also a non-trivial performance advantage) | ||||
| in order to allow the backend to arbitrarily "reclaim" space used to | ||||
| store frontswap pages to more completely manage its memory usage. | ||||
| 
 | ||||
| Note that if a page is put and the page already exists in transcendent memory | ||||
| (a "duplicate" put), either the put succeeds and the data is overwritten, | ||||
| or the put fails AND the page is invalidated.  This ensures stale data may | ||||
| never be obtained from frontswap. | ||||
| 
 | ||||
| If properly configured, monitoring of frontswap is done via debugfs in | ||||
| the /sys/kernel/debug/frontswap directory.  The effectiveness of | ||||
| frontswap can be measured (across all swap devices) with: | ||||
| 
 | ||||
| failed_puts	- how many put attempts have failed | ||||
| gets		- how many gets were attempted (all should succeed) | ||||
| succ_puts	- how many put attempts have succeeded | ||||
| invalidates	- how many invalidates were attempted | ||||
| 
 | ||||
| A backend implementation may provide additional metrics. | ||||
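As a rough sketch of how these counters might be collected from userspace, the Python below reads the four names listed above. It assumes each counter is exposed as a file of that name holding a decimal integer, that debugfs is mounted at the usual place, and that the caller has permission to read it; the function names are made up for the example.

```python
import os

COUNTERS = ("failed_puts", "gets", "succ_puts", "invalidates")


def read_frontswap_stats(directory="/sys/kernel/debug/frontswap"):
    """Return {counter: value} for each counter file that is readable."""
    stats = {}
    for name in COUNTERS:
        try:
            with open(os.path.join(directory, name)) as f:
                stats[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # not mounted, no permission, or unexpected contents
    return stats


def put_success_ratio(stats):
    """Fraction of put attempts the backend accepted, or None if no puts."""
    succeeded = stats.get("succ_puts", 0)
    total = succeeded + stats.get("failed_puts", 0)
    return succeeded / total if total else None
```

A ratio near zero suggests the backend is rejecting most pages, in which case nearly all swap traffic is still going to the real swap device.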
| 
 | ||||
| FAQ | ||||
| 
 | ||||
| 1) Where's the value? | ||||
| 
 | ||||
| When a workload starts swapping, performance falls through the floor. | ||||
| Frontswap significantly increases performance in many such workloads by | ||||
| providing a clean, dynamic interface to read and write swap pages to | ||||
| "transcendent memory" that is otherwise not directly addressable to the kernel. | ||||
| This interface is ideal when data is transformed to a different form | ||||
| and size (such as with compression) or secretly moved (as might be | ||||
| useful for write-balancing for some RAM-like devices).  Swap pages (and | ||||
| evicted page-cache pages) are a great use for this kind of slower-than-RAM- | ||||
| but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and | ||||
| cleancache) interface to transcendent memory provides a nice way to read | ||||
| and write -- and indirectly "name" -- the pages. | ||||
| 
 | ||||
| Frontswap -- and cleancache -- with a fairly small impact on the kernel, | ||||
| provide a huge amount of flexibility for more dynamic RAM | ||||
| utilization in various system configurations: | ||||
| 
 | ||||
| In the single kernel case, aka "zcache", pages are compressed and | ||||
| stored in local memory, thus increasing the total anonymous pages | ||||
| that can be safely kept in RAM.  Zcache essentially trades off CPU | ||||
| cycles used in compression/decompression for better memory utilization. | ||||
| Benchmarks have shown little or no impact when memory pressure is | ||||
| low while providing a significant performance improvement (25%+) | ||||
| on some workloads under high memory pressure. | ||||
| 
 | ||||
| "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory | ||||
| support for clustered systems.  Frontswap pages are locally compressed | ||||
| as in zcache, but then "remotified" to another system's RAM.  This | ||||
| allows RAM to be dynamically load-balanced back-and-forth as needed, | ||||
| i.e. when system A is overcommitted, it can swap to system B, and | ||||
| vice versa.  RAMster can also be configured as a memory server so | ||||
| many servers in a cluster can swap, dynamically as needed, to a single | ||||
| server configured with a large amount of RAM... without pre-configuring | ||||
| how much of the RAM is available for each of the clients! | ||||
| 
 | ||||
| In the virtual case, the whole point of virtualization is to statistically | ||||
| multiplex physical resources across the varying demands of multiple | ||||
| virtual machines.  This is really hard to do with RAM and efforts to do | ||||
| it well with no kernel changes have essentially failed (except in some | ||||
| well-publicized special-case workloads). | ||||
| Specifically, the Xen Transcendent Memory backend allows otherwise | ||||
| "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | ||||
| virtual machines, but the pages can be compressed and deduplicated to | ||||
| optimize RAM utilization.  And when guest OSes are induced to surrender | ||||
| underutilized RAM (e.g. with "selfballooning"), sudden unexpected | ||||
| memory pressure may result in swapping; frontswap allows those pages | ||||
| to be swapped to and from hypervisor RAM (if overall host system memory | ||||
| conditions allow), thus mitigating the potentially awful performance impact | ||||
| of unplanned swapping. | ||||
| 
 | ||||
| A KVM implementation is underway and has been RFC'ed to lkml.  And, | ||||
| using frontswap, investigation is also underway on the use of NVM as | ||||
| a memory extension technology. | ||||
| 
 | ||||
| 2) Sure there may be performance advantages in some situations, but | ||||
|    what's the space/time overhead of frontswap? | ||||
| 
 | ||||
| If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into | ||||
| nothingness and the only overhead is a few extra bytes per swapon'ed | ||||
| swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend" | ||||
| registers, there is one extra comparison of a global variable against | ||||
| zero for every swap page read or written.  If CONFIG_FRONTSWAP is enabled | ||||
| AND a frontswap backend registers AND the backend fails every "put" | ||||
| request (i.e. provides no memory despite claiming it might), | ||||
| CPU overhead is still negligible -- and since every frontswap fail | ||||
| precedes a swap page write-to-disk, the system is highly likely | ||||
| to be I/O bound and using a small fraction of a percent of a CPU | ||||
| will be irrelevant anyway. | ||||
| 
 | ||||
| As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend | ||||
| registers, one bit is allocated for every swap page for every swap | ||||
| device that is swapon'd.  This is added to the EIGHT bits (which | ||||
| was sixteen until about 2.6.34) that the kernel already allocates | ||||
| for every swap page for every swap device that is swapon'd.  (Hugh | ||||
| Dickins has observed that frontswap could probably steal one of | ||||
| the existing eight bits, but let's worry about that minor optimization | ||||
| later.)  For very large swap disks (which are rare) on a standard | ||||
| 4K pagesize, this is 1MB per 32GB swap. | ||||
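The 1MB figure follows directly from one bit per swap page. A quick check in Python (the function name is made up for the example):

```python
def frontswap_map_bytes(swap_bytes, page_size=4096):
    """Bytes needed for a bitmap with one bit per swap page."""
    pages = swap_bytes // page_size
    return (pages + 7) // 8   # round up to whole bytes


GiB = 1 << 30
MiB = 1 << 20
print(frontswap_map_bytes(32 * GiB) / MiB)   # 1.0
```

By the same arithmetic, the existing eight bits per swap page come to 8MB for the same 32GB of swap.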
| 
 | ||||
| When swap pages are stored in transcendent memory instead of written | ||||
| out to disk, there is a side effect that this may create more memory | ||||
| pressure that can potentially outweigh the other advantages.  A | ||||
| backend, such as zcache, must implement policies to carefully (but | ||||
| dynamically) manage memory limits to ensure this doesn't happen. | ||||
| 
 | ||||
| 3) OK, how about a quick overview of what this frontswap patch does | ||||
|    in terms that a kernel hacker can grok? | ||||
| 
 | ||||
| Let's assume that a frontswap "backend" has registered during | ||||
| kernel initialization; this registration indicates that this | ||||
| frontswap backend has access to some "memory" that is not directly | ||||
| accessible by the kernel.  Exactly how much memory it provides is | ||||
| entirely dynamic and random. | ||||
| 
 | ||||
| Whenever a swap device is swapon'd, frontswap_init() is called, | ||||
| passing the swap device number (aka "type") as a parameter. | ||||
| This notifies frontswap to expect attempts to "put" swap pages | ||||
| associated with that number. | ||||
| 
 | ||||
| Whenever the swap subsystem is readying a page to write to a swap | ||||
| device (cf. swap_writepage()), frontswap_put_page is called.  Frontswap | ||||
| consults with the frontswap backend and if the backend says it does NOT | ||||
| have room, frontswap_put_page returns -1 and the kernel swaps the page | ||||
| to the swap device as normal.  Note that the response from the frontswap | ||||
| backend is unpredictable to the kernel; it may choose to never accept a | ||||
| page, it could accept every ninth page, or it might accept every | ||||
| page.  But if the backend does accept a page, the data from the page | ||||
| has already been copied and associated with the type and offset, | ||||
| and the backend guarantees the persistence of the data.  In this case, | ||||
| frontswap sets a bit in the "frontswap_map" for the swap device | ||||
| corresponding to the page offset on the swap device to which it would | ||||
| otherwise have written the data. | ||||
| 
 | ||||
| When the swap subsystem needs to swap-in a page (swap_readpage()), | ||||
| it first calls frontswap_get_page() which checks the frontswap_map to | ||||
| see if the page was earlier accepted by the frontswap backend.  If | ||||
| it was, the page of data is filled from the frontswap backend and | ||||
| the swap-in is complete.  If not, the normal swap-in code is | ||||
| executed to obtain the page of data from the real swap device. | ||||
| 
 | ||||
| So every time the frontswap backend accepts a page, a swap device write | ||||
| and (potentially) a swap device read are replaced by a "frontswap backend | ||||
| put" and (possibly) a "frontswap backend get", which are presumably much | ||||
| faster. | ||||
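The swap-out and swap-in paths described in this answer can be modelled in a few lines of Python. This is a toy, not the kernel implementation: the SwapDevice class, the callback parameters and the I/O counters are invented for the sketch, and the backend here accepts only even offsets simply to show both paths.

```python
class SwapDevice:
    """Toy swap device with a per-offset frontswap_map bit."""

    def __init__(self, nr_pages, backend_put, backend_get):
        self.disk = [None] * nr_pages            # stand-in for the real device
        self.frontswap_map = [False] * nr_pages  # one bit per swap offset
        self.backend_put = backend_put           # returns 0 or -1
        self.backend_get = backend_get           # returns the page data
        self.disk_writes = 0
        self.disk_reads = 0

    def swap_writepage(self, offset, data):
        if self.backend_put(offset, data) == 0:
            self.frontswap_map[offset] = True    # backend now holds the data
            return
        self.frontswap_map[offset] = False
        self.disk[offset] = data                 # normal swap-out
        self.disk_writes += 1

    def swap_readpage(self, offset):
        if self.frontswap_map[offset]:
            return self.backend_get(offset)      # no disk I/O needed
        self.disk_reads += 1
        return self.disk[offset]


# A backend that happens to accept only even offsets.
store = {}


def put(offset, data):
    if offset % 2:
        return -1
    store[offset] = data
    return 0


dev = SwapDevice(4, put, store.get)
dev.swap_writepage(0, b"even")
dev.swap_writepage(1, b"odd")
print(dev.swap_readpage(0), dev.swap_readpage(1))   # b'even' b'odd'
print(dev.disk_writes, dev.disk_reads)              # 1 1
```

Two pages were swapped out and back in, but only the rejected one cost a disk write and a disk read.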
| 
 | ||||
| 4) Can't frontswap be configured as a "special" swap device that is | ||||
|    just higher priority than any real swap device (e.g. like zswap, | ||||
|    or maybe swap-over-nbd/NFS)? | ||||
| 
 | ||||
| No.  First, the existing swap subsystem doesn't allow for any kind of | ||||
| swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy, | ||||
| but this would require fairly drastic changes.  Even if it were | ||||
| rewritten, the existing swap subsystem uses the block I/O layer which | ||||
| assumes a swap device is fixed size and any page in it is linearly | ||||
| addressable.  Frontswap barely touches the existing swap subsystem, | ||||
| and works around the constraints of the block I/O subsystem to provide | ||||
| a great deal of flexibility and dynamicity. | ||||
| 
 | ||||
| For example, the acceptance of any swap page by the frontswap backend is | ||||
| entirely unpredictable. This is critical to the definition of frontswap | ||||
| backends because it grants completely dynamic discretion to the | ||||
| backend.  In zcache, one cannot know a priori how compressible a page is. | ||||
| "Poorly" compressible pages can be rejected, and "poorly" can itself be | ||||
| defined dynamically depending on current memory constraints. | ||||
| 
 | ||||
| Further, frontswap is entirely synchronous whereas a real swap | ||||
| device is, by definition, asynchronous and uses block I/O.  The | ||||
| block I/O layer is not only unnecessary, but may perform "optimizations" | ||||
| that are inappropriate for a RAM-oriented device including delaying | ||||
| the write of some pages for a significant amount of time.  Synchrony is | ||||
| required to ensure the dynamicity of the backend and to avoid thorny race | ||||
| conditions that would unnecessarily and greatly complicate frontswap | ||||
| and/or the block I/O subsystem.  That said, only the initial "put" | ||||
| and "get" operations need be synchronous.  A separate asynchronous thread | ||||
| is free to manipulate the pages stored by frontswap.  For example, | ||||
| the "remotification" thread in RAMster uses standard asynchronous | ||||
| kernel sockets to move compressed frontswap pages to a remote machine. | ||||
| Similarly, a KVM guest-side implementation could do in-guest compression | ||||
| and use "batched" hypercalls. | ||||
| 
 | ||||
| In a virtualized environment, the dynamicity allows the hypervisor | ||||
| (or host OS) to do "intelligent overcommit".  For example, it can | ||||
| choose to accept pages only until host-swapping might be imminent, | ||||
| then force guests to do their own swapping. | ||||
| 
 | ||||
| There is a downside to the transcendent memory specifications for | ||||
| frontswap:  Since any "put" might fail, there must always be a real | ||||
| slot on a real swap device to swap the page.  Thus frontswap must be | ||||
| implemented as a "shadow" to every swapon'd device with the potential | ||||
| capability of holding every page that the swap device might have held | ||||
| and the possibility that it might hold no pages at all.  This means | ||||
| that frontswap cannot contain more pages than the total of swapon'd | ||||
| swap devices.  For example, if NO swap device is configured on some | ||||
| installation, frontswap is useless.  Swapless portable devices | ||||
| can still use frontswap but a backend for such devices must configure | ||||
| some kind of "ghost" swap device and ensure that it is never used. | ||||
| 
 | ||||
| 5) Why this weird definition about "duplicate puts"?  If a page | ||||
|    has been previously successfully put, can't it always be | ||||
|    successfully overwritten? | ||||
| 
 | ||||
| Nearly always it can, but no, sometimes it cannot.  Consider an example | ||||
| where data is compressed and the original 4K page has been compressed | ||||
| to 1K.  Now an attempt is made to overwrite the page with data that | ||||
| is non-compressible and so would take the entire 4K.  But the backend | ||||
| has no more space.  In this case, the put must be rejected.  Whenever | ||||
| frontswap rejects a put that would overwrite, it also must invalidate | ||||
| the old data and ensure that it is no longer accessible.  Since the | ||||
| swap subsystem then writes the new data to the real swap device, | ||||
| this is the correct course of action to ensure coherency. | ||||
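The failing-overwrite case can be reproduced with a toy compressed store. In this Python sketch, zlib stands in for whatever compressor a backend might use, and the CompressingStore class and its 1K byte budget are assumptions chosen to mirror the example above.

```python
import os
import zlib


class CompressingStore:
    """Toy compressed page store with a fixed byte budget (hypothetical)."""

    def __init__(self, budget):
        self.budget = budget
        self.pages = {}   # offset -> compressed bytes

    def used(self):
        return sum(len(c) for c in self.pages.values())

    def put_page(self, offset, data):
        compressed = zlib.compress(data)
        self.pages.pop(offset, None)   # the old copy is dropped either way
        if self.used() + len(compressed) > self.budget:
            return -1                  # rejected; stale data is already gone
        self.pages[offset] = compressed
        return 0

    def get_page(self, offset):
        compressed = self.pages.get(offset)
        return zlib.decompress(compressed) if compressed is not None else None


store = CompressingStore(budget=1024)
print(store.put_page(7, bytes(4096)))        # 0: a page of zeros compresses well
print(store.put_page(7, os.urandom(4096)))   # -1: random data does not fit
print(store.get_page(7))                     # None: the old page was invalidated
```

Dropping the old copy before deciding is what guarantees that a later get can never return the stale zero-filled page.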
| 
 | ||||
| 6) What is frontswap_shrink for? | ||||
| 
 | ||||
| When the (non-frontswap) swap subsystem swaps out a page to a real | ||||
| swap device, that page is only taking up low-value pre-allocated disk | ||||
| space.  But if frontswap has placed a page in transcendent memory, that | ||||
| page may be taking up valuable real estate.  The frontswap_shrink | ||||
| routine allows code outside of the swap subsystem to force pages out | ||||
| of the memory managed by frontswap and back into kernel-addressable memory. | ||||
| For example, in RAMster, a "suction driver" thread will attempt | ||||
| to "repatriate" pages sent to a remote machine back to the local machine; | ||||
| this is driven using the frontswap_shrink mechanism when memory pressure | ||||
| subsides. | ||||
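Conceptually, shrinking moves pages out of the backend until it holds no more than some target. The Python below is only a model of that idea: the shrink function, the target_pages parameter and the two dicts are stand-ins invented for the sketch and say nothing about how the kernel routine is implemented.

```python
def shrink(backend_pages, kernel_memory, target_pages):
    """Move pages out of the backend until at most target_pages remain.

    backend_pages and kernel_memory are plain dicts keyed by swap offset.
    Returns the number of pages still held by the backend.
    """
    while len(backend_pages) > target_pages:
        offset, data = backend_pages.popitem()
        kernel_memory[offset] = data     # page is directly addressable again
    return len(backend_pages)
```

Calling it with a target at or above the current page count is a no-op, which matches the idea that shrinking only ever reduces what the backend holds.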
| 
 | ||||
| 7) Why does the frontswap patch create the new include file swapfile.h? | ||||
| 
 | ||||
| The frontswap code depends on some swap-subsystem-internal data | ||||
| structures that have, over the years, moved back and forth between | ||||
| static and global.  This seemed a reasonable compromise:  Define | ||||
| them as global but declare them in a new include file that isn't | ||||
| included by the large number of source files that include swap.h. | ||||
| 
 | ||||
| Dan Magenheimer, last updated April 9, 2012 | ||||
mm/Kconfig (17 lines added)
| @@ -379,3 +379,20 @@ config CLEANCACHE | ||||
| 	  in a negligible performance hit. | ||||
| 
 | ||||
| 	  If unsure, say Y to enable cleancache | ||||
| 
 | ||||
| config FRONTSWAP | ||||
| 	bool "Enable frontswap to cache swap pages if tmem is present" | ||||
| 	depends on SWAP | ||||
| 	default n | ||||
| 	help | ||||
| 	  Frontswap is so named because it can be thought of as the opposite | ||||
| 	  of a "backing" store for a swap device.  The data is stored into | ||||
| 	  "transcendent memory", memory that is not directly accessible or | ||||
| 	  addressable by the kernel and is of unknown and possibly | ||||
| 	  time-varying size.  When space in transcendent memory is available, | ||||
| 	  a significant swap I/O reduction may be achieved.  When none is | ||||
| 	  available, all frontswap calls are reduced to a single pointer- | ||||
| 	  compare-against-NULL resulting in a negligible performance hit | ||||
| 	  and swap data is stored as normal on the matching swap device. | ||||
| 
 | ||||
| 	  If unsure, say Y to enable frontswap. | ||||
|  | ||||
mm/Makefile (1 line added)
| @@ -26,6 +26,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o | ||||
| 
 | ||||
| obj-$(CONFIG_BOUNCE)	+= bounce.o | ||||
| obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o | ||||
| obj-$(CONFIG_FRONTSWAP)	+= frontswap.o | ||||
| obj-$(CONFIG_HAS_DMA)	+= dmapool.o | ||||
| obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o | ||||
| obj-$(CONFIG_NUMA) 	+= mempolicy.o | ||||
|  | ||||