================
RAID 4/5/6 cache
================

A RAID 4/5/6 array can include an extra disk that acts as a data cache in
addition to the normal RAID disks. The role of the RAID disks is not changed
by the cache disk; the cache disk caches data destined for the RAID disks.
The cache can be in write-through mode (supported since kernel 4.4) or
write-back mode (supported since kernel 4.10). mdadm (since version 3.4) has
an option '--write-journal' to create an array with a cache. Please refer to
the mdadm manual for details. By default (when the RAID array starts), the
cache is in write-through mode. A user can switch it to write-back mode by::

    echo "write-back" > /sys/block/md0/md/journal_mode

And switch it back to write-through mode by::

    echo "write-through" > /sys/block/md0/md/journal_mode
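
As a minimal sketch of creating a cached array, assuming /dev/sdb through
/dev/sde are the RAID member disks and /dev/nvme0n1 is the cache (journal)
device (all device names here are hypothetical), the array can be created
with mdadm's '--write-journal' option and the current cache mode read back
from sysfs::

    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde \
        --write-journal /dev/nvme0n1
    cat /sys/block/md0/md/journal_mode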

In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and able to sustain the array's write load.

write-through mode
------------------

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave the data in some stripes in an inconsistent state,
e.g. data and parity don't match. The reason is that a stripe write involves
several RAID disks, and it's possible that the writes haven't reached all
RAID disks before the unclean shutdown. We call an array degraded if it has
inconsistent data. MD tries to resync the array to bring it back to a normal
state, but until the resync completes, any system crash can cause real data
corruption in the RAID array. This problem is called the 'write hole'.

The write-through cache caches all data on the cache disk first. After the
data is safe on the cache disk, it is flushed onto the RAID disks. This
two-step write guarantees that MD can recover correct data after an unclean
shutdown even if the array is degraded. Thus the cache closes the 'write
hole'.

In write-through mode, MD reports IO completion to the upper layer (usually a
filesystem) only after the data is safe on the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.
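
If the cache disk does fail, the array keeps running but is exposed to the
'write hole' again until a new journal is attached. As a hedged sketch using
the hypothetical device names from above (with /dev/nvme1n1 as a replacement
cache device), mdadm's '--add-journal' option can typically be used to attach
a new journal to such an array::

    mdadm --manage /dev/md0 --add-journal /dev/nvme1n1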

In write-through mode, the cache disk isn't required to be big; several
hundred megabytes are enough.

write-back mode
---------------

Write-back mode fixes the 'write hole' issue too, since all write data is
cached on the cache disk. But the main goal of the write-back cache is to
speed up writes. If a write covers all RAID disks of a stripe, we call it a
full-stripe write. For non-full-stripe writes, MD must read old data before
the new parity can be calculated; these synchronous reads hurt write
throughput. Writes which are sequential but not dispatched at the same time
suffer from this overhead too. The write-back cache aggregates such data and
flushes it to the RAID disks only once it becomes a full-stripe write, which
completely avoids the overhead, so it's very helpful for some workloads. A
typical example is a workload that does sequential writes followed by fsync.

In write-back mode, MD reports IO completion to the upper layer (usually a
filesystem) right after the data hits the cache disk. The data is flushed to
the RAID disks later, after specific conditions are met, so a cache disk
failure will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache holds the
same data that is stored on the cache disk, so a power loss doesn't cause
data loss. The memory cache size has a performance impact on the array; a
large size is recommended. A user can configure the size by::

    echo "2048" > /sys/block/md0/md/stripe_cache_size
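
The current setting can be read back from the same sysfs attribute, and md
also exposes a read-only stripe_cache_active attribute showing how many
stripes are currently held in the memory cache (again assuming the
hypothetical /dev/md0 from above)::

    cat /sys/block/md0/md/stripe_cache_size
    cat /sys/block/md0/md/stripe_cache_active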

A cache disk that is too small makes the write aggregation less efficient in
this mode, depending on the workload. It's recommended to use a cache disk of
at least several gigabytes in write-back mode.

The implementation
------------------

The write-through and write-back caches use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data' and
'data' pairs. The meta data describes the data and includes a checksum and a
sequence ID for recovery identification. Data can be IO data or parity data,
and is checksummed too. The checksum is stored in the meta data ahead of the
data; it is an optimization because it lets MD write meta data and data
freely without worrying about the order. The MD superblock has a field that
points to the valid meta data at the log head.
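
Nothing about this format needs to be managed by hand. For inspection, the
array and its journal device can be examined with standard mdadm commands
(again using the hypothetical device names from above; the exact fields shown
depend on the mdadm version)::

    mdadm --detail /dev/md0
    mdadm --examine /dev/nvme0n1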

The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates parity for the IO data,
writes both the IO data and the parity to the log, writes the data and parity
to the RAID disks after they have settled down in the log, and finally
completes the IO. Reads just read from the RAID disks as usual.

In write-back mode, MD writes the IO data to the log and reports IO
completion. The data is also fully cached in memory at that time, which means
reads must query the memory cache. When certain conditions are met, MD
flushes the data to the RAID disks: it calculates parity for the data and
writes the parity into the log; after that is finished, it writes both data
and parity into the RAID disks and can then release the memory cache. The
flush conditions are that the stripe becomes a full-stripe write, that free
cache disk space is low, or that free in-kernel memory cache space is low.

After an unclean shutdown, MD performs recovery. MD reads all meta data and
data from the log; the sequence IDs and checksums help detect corrupted meta
data and data. If MD finds a stripe with data and valid parity (1 parity
block for RAID 4/5 and 2 for RAID 6), it writes the data and parity to the
RAID disks. If the parity is incomplete, it is discarded; if part of the data
is corrupted, it is discarded too. MD then loads the valid data and writes it
to the RAID disks in the normal way.
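
This recovery happens automatically when the array is started; the
administrator only has to make sure the journal device is present when the
array is assembled. As a sketch with the hypothetical device names used
above, reassembling after an unclean shutdown is simply::

    mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/nvme0n1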