dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters:

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:
  raid0		RAID0 striping (no resilience)
  raid1		RAID1 mirroring
  raid4		RAID4 with dedicated last parity disk
  raid5_n	RAID5 with dedicated last parity disk supporting takeover
		Same as raid4
		- Transitory layout
  raid5_la	RAID5 left asymmetric
		- rotating parity 0 with data continuation
  raid5_ra	RAID5 right asymmetric
		- rotating parity N with data continuation
  raid5_ls	RAID5 left symmetric
		- rotating parity 0 with data restart
  raid5_rs	RAID5 right symmetric
		- rotating parity N with data restart
  raid6_zr	RAID6 zero restart
		- rotating parity zero (left-to-right) with data restart
  raid6_nr	RAID6 N restart
		- rotating parity N (right-to-left) with data restart
  raid6_nc	RAID6 N continue
		- rotating parity N (right-to-left) with data continuation
  raid6_n_6	RAID6 with dedicated parity disks
		- parity and Q-syndrome on the last 2 disks;
		  layout for takeover from/to raid4/raid5_n
  raid6_la_6	Same as "raid5_la" plus dedicated last Q-syndrome disk
		- layout for takeover from raid5_la from/to raid6
  raid6_ra_6	Same as "raid5_ra" plus dedicated last Q-syndrome disk
		- layout for takeover from raid5_ra from/to raid6
  raid6_ls_6	Same as "raid5_ls" plus dedicated last Q-syndrome disk
		- layout for takeover from raid5_ls from/to raid6
  raid6_rs_6	Same as "raid5_rs" plus dedicated last Q-syndrome disk
		- layout for takeover from raid5_rs from/to raid6
  raid10	Various RAID10-inspired algorithms chosen by additional params
		(see raid10_format and raid10_copies below)
		- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
		- RAID1E: Integrated Adjacent Stripe Mirroring
		- RAID1E: Integrated Offset Stripe Mirroring
		- and other similar RAID10 variants

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of
    Mandatory parameters:
        <chunk_size>: Chunk size in sectors.  This parameter is often known as
		      "stripe size".  It is the only mandatory parameter and
		      is placed first.

    followed by optional parameters (in any order):
	[sync|nosync]   Force or prevent RAID initialization.

	[rebuild <idx>]	Rebuild drive number 'idx' (first drive is 0).

	[daemon_sleep <ms>]
		Interval between runs of the bitmap daemon that
		clears bits.  A longer interval means less bitmap I/O but
		resyncing after a failure is likely to take longer.

	[min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
	[max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
	[write_mostly <idx>]		   Mark drive index 'idx' write-mostly.
	[max_write_behind <sectors>]       See '--write-behind=' (man mdadm)
	[stripe_cache <sectors>]           Stripe cache size (RAID 4/5/6 only)
	[region_size <sectors>]
		The region_size multiplied by the number of regions is the
		logical size of the array.  The bitmap records the device
		synchronisation state for each region.

        [raid10_copies   <# copies>]
        [raid10_format   <near|far|offset>]
		These two options are used to alter the default layout of
		a RAID10 configuration.  The number of copies can be
		specified, but the default is 2.  There are also three
		variations to how the copies are laid down - the default
		is "near".  Near copies are what most people think of with
		respect to mirroring.  If these options are left unspecified,
		or 'raid10_copies 2' and/or 'raid10_format near' are given,
		then the layouts for 2, 3 and 4 devices are:
		2 drives         3 drives          4 drives
		--------         ----------        --------------
		A1  A1           A1  A1  A2        A1  A1  A2  A2
		A2  A2           A2  A3  A3        A3  A3  A4  A4
		A3  A3           A4  A4  A5        A5  A5  A6  A6
		A4  A4           A5  A6  A6        A7  A7  A8  A8
		..  ..           ..  ..  ..        ..  ..  ..  ..
		The 2-device layout is equivalent to 2-way RAID1.  The 4-device
		layout is what a traditional RAID10 would look like.  The
		3-device layout is what might be called a 'RAID1E - Integrated
		Adjacent Stripe Mirroring'.

		If 'raid10_copies 2' and 'raid10_format far', then the layouts
		for 2, 3 and 4 devices are:
		2 drives             3 drives             4 drives
		--------             --------------       --------------------
		A1  A2               A1   A2   A3         A1   A2   A3   A4
		A3  A4               A4   A5   A6         A5   A6   A7   A8
		A5  A6               A7   A8   A9         A9   A10  A11  A12
		..  ..               ..   ..   ..         ..   ..   ..   ..
		A2  A1               A3   A1   A2         A2   A1   A4   A3
		A4  A3               A6   A4   A5         A6   A5   A8   A7
		A6  A5               A9   A7   A8         A10  A9   A12  A11
		..  ..               ..   ..   ..         ..   ..   ..   ..

		If 'raid10_copies 2' and 'raid10_format offset', then the
		layouts for 2, 3 and 4 devices are:
		2 drives       3 drives           4 drives
		--------       ------------       -----------------
		A1  A2         A1  A2  A3         A1  A2  A3  A4
		A2  A1         A3  A1  A2         A2  A1  A4  A3
		A3  A4         A4  A5  A6         A5  A6  A7  A8
		A4  A3         A6  A4  A5         A6  A5  A8  A7
		A5  A6         A7  A8  A9         A9  A10 A11 A12
		A6  A5         A9  A7  A8         A10 A9  A12 A11
		..  ..         ..  ..  ..         ..  ..  ..  ..
		Here we see layouts closely akin to 'RAID1E - Integrated
		Offset Stripe Mirroring'.  (An illustrative raid10 table
		using these options appears after this parameter list.)

        [delta_disks <N>]
		The delta_disks option value (-251 < N < +251) triggers
		device removal (negative value) or device addition (positive
		value) to any reshape-supporting raid level 4/5/6 and 10.
		RAID levels 4/5/6 allow for addition of devices (metadata
		and data device tuples), while raid10_near and raid10_offset
		only allow for device addition.  raid10_far does not support
		any reshaping at all.
		A minimum number of devices has to be kept to preserve
		resilience, which is 3 devices for raid4/5 and 4 devices
		for raid6.

        [data_offset <sectors>]
		This option value defines the offset into each data device
		where the data starts.  This is used to provide out-of-place
		reshaping space that avoids writing over data whilst
		changing the layout of stripes, so that an interruption/crash
		may happen at any time without the risk of losing data.
		E.g. when adding devices to an existing raid set during
		forward reshaping, the out-of-place space will be allocated
		at the beginning of each raid device.  The kernel raid4/5/6/10
		MD personalities supporting such device addition will read the
		data from the existing first stripes (those spanning the smaller
		number of devices) starting at data_offset, fill up a new stripe
		spanning the larger number of devices, calculate the redundancy
		blocks (parity/Q-syndrome) and write that new stripe to offset 0.
		The same will be applied to all N-1 other new stripes.  This
		out-of-place scheme is used to change the RAID type (i.e. the
		allocation algorithm) as well, e.g. changing from raid5_ls to
		raid5_n.

	[journal_dev <dev>]
		This option adds a journal device to raid4/5/6 raid sets and
		uses it to close the 'write hole' caused by the non-atomic
		updates to the component devices, which can cause data loss
		during recovery.  The journal device is used write-through,
		thus causing writes to be throttled versus non-journaled
		raid4/5/6 sets.
		Takeover/reshape is not possible with a raid4/5/6 journal device;
		it has to be deconfigured before requesting these.

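As an illustration only (the device numbers, size and rate below are made
up, and the metadata/data device pairs follow the <#raid_devs> format
described next), a raid10 table line combining several of the optional
parameters might look like:

# RAID10 - 4 metadata/data device pairs, 2 "near" copies
# Chunk size of 512KiB, recovery throttled to 100000 kiB/sec/disk

0 1960893648 raid \
        raid10 7 1024 raid10_copies 2 raid10_format near max_recovery_rate 100000 \
        4 8:16 8:17 8:32 8:33 8:48 8:49 8:64 8:65
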
<#raid_devs>: The number of devices composing the array.
	Each device consists of two entries.  The first is the device
	containing the metadata (if any); the second is the one containing the
	data.  A maximum of 64 metadata/data device entries is supported
	up to target version 1.8.0.
	Target version 1.9.0 supports up to 253 entries, a limit enforced by
	the underlying MD kernel runtime.

	If a drive has failed or is missing at creation time, a '-' can be
	given for both the metadata and data drives for a given position.


Example Tables
--------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)

0 1960893648 raid \
        raid4 1 2048 \
        5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
#       min recovery rate at 20 kiB/sec/disk

0 1960893648 raid \
        raid4 4 2048 sync min_recovery_rate 20 \
        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
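
The following additional examples are illustrative sketches only (the
device numbers and the 'my_raid' name are hypothetical): the first shows
the '-' notation for a drive that is missing at creation time together
with a raid4/5/6 journal device, and the second shows how a table such as
the first RAID4 example above might be loaded with dmsetup.

# RAID6 - 4 data drives, 2 parity (with metadata devices)
# Chunk size of 1MiB, journal device at 9:1,
#       drive in position 2 missing at creation time ('- -')

0 1960893648 raid \
        raid6_zr 3 2048 journal_dev 9:1 \
        6 8:17 8:18 8:33 8:34 - - 8:65 8:66 8:81 8:82 8:97 8:98

# Loading the first RAID4 example as a device named 'my_raid'

dmsetup create my_raid --table \
        "0 1960893648 raid raid4 1 2048 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"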


Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above, with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity):
1: <s> <l> raid \
2:      <raid_type> <#devices> <health_chars> \
3:      <sync_ratio> <sync_action> <mismatch_cnt> <data_offset> <journal_char>

Line 1 is the standard output produced by device-mapper.
Lines 2 & 3 are produced by the raid target and are best explained by example:
        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery.  Here is a fuller description of the individual fields:
	<raid_type>     Same as the <raid_type> used to create the array.
	<health_chars>  One char for each device, indicating: 'A' = alive and
			in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
	<sync_ratio>    The ratio indicating how much of the array has undergone
			the process described by 'sync_action'.  If the
			'sync_action' is "check" or "repair", then the process
			of "resync" or "recover" can be considered complete.
	<sync_action>   One of the following possible states:
			idle    - No synchronization action is being performed.
			frozen  - The current action has been halted.
			resync  - Array is undergoing its initial synchronization
				  or is resynchronizing after an unclean shutdown
				  (possibly aided by a bitmap).
			recover - A device in the array is being rebuilt or
				  replaced.
			check   - A user-initiated full check of the array is
				  being performed.  All blocks are read and
				  checked for consistency.  The number of
				  discrepancies found is recorded in
				  <mismatch_cnt>.  No changes are made to the
				  array by this action.
			repair  - The same as "check", but discrepancies are
				  corrected.
			reshape - The array is undergoing a reshape.
	<mismatch_cnt>  The number of discrepancies found between mirror copies
			in RAID1/10 or wrong parity values found in RAID4/5/6.
			This value is valid only after a "check" of the array
			is performed.  A healthy array has a 'mismatch_cnt' of 0.
	<data_offset>   The current data offset to the start of the user data on
			each component device of a raid set (see the respective
			raid parameter supporting out-of-place reshaping).
	<journal_char>	'A' - active raid4/5/6 journal device.
			'D' - dead journal device.
			'-' - no journal device.
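
For example, for a set created under the hypothetical name 'my_raid', the
table and status lines described above can be inspected with:

        dmsetup table my_raid
        dmsetup status my_raid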


Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.)  These actions
include:
	"idle"    - Halt the current sync action.
	"frozen"  - Freeze the current sync action.
	"resync"  - Initiate/continue a resync.
	"recover" - Initiate/continue a recover process.
	"check"   - Initiate a check (i.e. a "scrub") of the array.
	"repair"  - Initiate a repair of the array.
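
For example, a scrub of the hypothetical 'my_raid' set could be started and
then monitored like this (0 is the start sector of the raid target within
the mapped device):

        dmsetup message my_raid 0 check
        dmsetup status my_raid    # <sync_action> shows "check" while the scrub runs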


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent.  Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and, if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy.  It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:
    'devices_handle_discard_safely'
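
As a sketch only (assuming dm-raid is loaded as the dm_raid module and
exposes the parameter through the usual module parameter paths; verify
the exact name on your kernel), the parameter could be enabled with:

    # at module load time
    modprobe dm-raid devices_handle_discard_safely=Y

    # or at runtime, for an already loaded module
    echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely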


Version History
---------------
1.0.0	Initial version.  Support for RAID 4/5/6
1.1.0	Added support for RAID 1
1.2.0	Handle creation of arrays that contain failed devices.
1.3.0	Added support for RAID 10
1.3.1	Allow device replacement/rebuild for RAID 10
1.3.2   Fix/improve redundancy checking for RAID10
1.4.0	Non-functional change.  Removes arg from mapping function.
1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2   Add RAID10 "far" and "offset" algorithm support.
1.5.0   Add message interface to allow manipulation of the sync_action.
	New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1   Add ability to restore transiently failed devices on resume.
1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0   Add discard support (and devices_handle_discard_safely module param).
1.7.0   Add support for MD RAID0 mappings.
1.8.0   Explicitly check for compatible flags in the superblock metadata
	and refuse to start the raid set if any are set by a newer
	target version, thus avoiding data corruption on a raid set
	with a reshape in progress.
1.9.0   Add support for RAID level takeover/reshape/region size
	and set size reduction.
1.9.1   Fix activation of existing RAID 4/10 mapped devices
1.9.2   Don't emit '- -' on the status table line in case the constructor
	fails reading a superblock.  Correctly emit 'maj:min1 maj:min2' and
	'D' on the status line.  If '- -' is passed into the constructor, emit
	'- -' on the table line and '-' as the status line health character.
1.10.0  Add support for raid4/5/6 journal device