€•¦oŒsphinx.addnodes”Œdocument”“”)”}”(Œ rawsource”Œ”Œchildren”]”(Œ translations”Œ LanguagesNode”“”)”}”(hhh]”(hŒ pending_xref”“”)”}”(hhh]”Œdocutils.nodes”ŒText”“”ŒChinese (Simplified)”…””}”(hhŒparent”hubaŒ attributes”}”(Œids”]”Œclasses”]”Œnames”]”Œdupnames”]”Œbackrefs”]”Œ refdomain”Œstd”Œreftype”Œdoc”Œ reftarget”Œ!/translations/zh_CN/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuŒtagname”hhh ubh)”}”(hhh]”hŒChinese (Traditional)”…””}”(hhhh2ubah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ!/translations/zh_TW/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒItalian”…””}”(hhhhFubah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ!/translations/it_IT/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒJapanese”…””}”(hhhhZubah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ!/translations/ja_JP/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒKorean”…””}”(hhhhnubah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ!/translations/ko_KR/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒSpanish”…””}”(hhhh‚ubah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ!/translations/sp_SP/staging/crc32”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubeh}”(h]”h ]”h"]”h$]”h&]”Œcurrent_language”ŒEnglish”uh1h hhŒ _document”hŒsource”NŒline”NubhŒsection”“”)”}”(hhh]”(hŒtitle”“”)”}”(hŒ!Brief tutorial on CRC computation”h]”hŒ!Brief tutorial on CRC computation”…””}”(hhªhh¨hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h¦hh£hžhhŸŒ;/var/lib/git/docbuild/linux/Documentation/staging/crc32.rst”h KubhŒ paragraph”“”)”}”(hX¶A CRC is a long-division remainder. You add the CRC to the message, and the whole thing (message+CRC) is a multiple of the given CRC polynomial. To check the CRC, you can either check that the CRC matches the recomputed value, *or* you can check that the remainder computed on the message+CRC is 0. This latter approach is used by a lot of hardware implementations, and is why so many protocols put the end-of-frame flag after the CRC.”h]”(hŒåA CRC is a long-division remainder. You add the CRC to the message, and the whole thing (message+CRC) is a multiple of the given CRC polynomial. To check the CRC, you can either check that the CRC matches the recomputed value, ”…””}”(hŒåA CRC is a long-division remainder. You add the CRC to the message, and the whole thing (message+CRC) is a multiple of the given CRC polynomial. To check the CRC, you can either check that the CRC matches the recomputed value, ”hh¹hžhhŸNh NubhŒemphasis”“”)”}”(hŒ*or*”h]”hŒor”…””}”(hhhhÄhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÂhh¹ubhŒÍ you can check that the remainder computed on the message+CRC is 0. This latter approach is used by a lot of hardware implementations, and is why so many protocols put the end-of-frame flag after the CRC.”…””}”(hŒÍ you can check that the remainder computed on the message+CRC is 0. This latter approach is used by a lot of hardware implementations, and is why so many protocols put the end-of-frame flag after the CRC.”hh¹hžhhŸNh Nubeh}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hŒHIt's actually the same long division you learned in school, except that:”h]”hŒJIt’s actually the same long division you learned in school, except that:”…””}”(hhßhhÝhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K hh£hžhubhŒ bullet_list”“”)”}”(hhh]”(hŒ list_item”“”)”}”(hŒWe’re working in binary, so the digits are only 0 and 1, and”…””}”(hhôhhöhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khhòubah}”(h]”h ]”h"]”h$]”h&]”uh1hðhhíhžhhŸh¶h Nubhñ)”}”(hŒµWhen dividing polynomials, there are no carries. Rather than add and subtract, we just xor. Thus, we tend to get a bit sloppy about the difference between adding and subtracting. ”h]”h¸)”}”(hŒ´When dividing polynomials, there are no carries. Rather than add and subtract, we just xor. Thus, we tend to get a bit sloppy about the difference between adding and subtracting.”h]”hŒ´When dividing polynomials, there are no carries. Rather than add and subtract, we just xor. Thus, we tend to get a bit sloppy about the difference between adding and subtracting.”…””}”(hjhj hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khj ubah}”(h]”h ]”h"]”h$]”h&]”uh1hðhhíhžhhŸh¶h Nubeh}”(h]”h ]”h"]”h$]”h&]”Œbullet”Œ-”uh1hëhŸh¶h Khh£hžhubh¸)”}”(hXhLike all division, the remainder is always smaller than the divisor. To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial. Since it's 33 bits long, bit 32 is always going to be set, so usually the CRC is written in hex with the most significant bit omitted. (If you're familiar with the IEEE 754 floating-point format, it's the same idea.)”h]”hXnLike all division, the remainder is always smaller than the divisor. To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial. Since it’s 33 bits long, bit 32 is always going to be set, so usually the CRC is written in hex with the most significant bit omitted. (If you’re familiar with the IEEE 754 floating-point format, it’s the same idea.)”…””}”(hj+hj)hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hXÇNote that a CRC is computed over a string of *bits*, so you have to decide on the endianness of the bits within each byte. To get the best error-detecting properties, this should correspond to the order they're actually sent. For example, standard RS-232 serial is little-endian; the most significant bit (sometimes used for parity) is sent last. And when appending a CRC word to a message, you should do it in the right order, matching the endianness.”h]”(hŒ-Note that a CRC is computed over a string of ”…””}”(hŒ-Note that a CRC is computed over a string of ”hj7hžhhŸNh NubhÃ)”}”(hŒ*bits*”h]”hŒbits”…””}”(hhhj@hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÂhj7ubhX–, so you have to decide on the endianness of the bits within each byte. To get the best error-detecting properties, this should correspond to the order they’re actually sent. For example, standard RS-232 serial is little-endian; the most significant bit (sometimes used for parity) is sent last. And when appending a CRC word to a message, you should do it in the right order, matching the endianness.”…””}”(hX”, so you have to decide on the endianness of the bits within each byte. To get the best error-detecting properties, this should correspond to the order they're actually sent. For example, standard RS-232 serial is little-endian; the most significant bit (sometimes used for parity) is sent last. And when appending a CRC word to a message, you should do it in the right order, matching the endianness.”hj7hžhhŸNh Nubeh}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hX©Just like with ordinary division, you proceed one digit (bit) at a time. Each step of the division you take one more digit (bit) of the dividend and append it to the current remainder. Then you figure out the appropriate multiple of the divisor to subtract to being the remainder back into range. In binary, this is easy - it has to be either 0 or 1, and to make the XOR cancel, it's just a copy of bit 32 of the remainder.”h]”hX«Just like with ordinary division, you proceed one digit (bit) at a time. Each step of the division you take one more digit (bit) of the dividend and append it to the current remainder. Then you figure out the appropriate multiple of the divisor to subtract to being the remainder back into range. In binary, this is easy - it has to be either 0 or 1, and to make the XOR cancel, it’s just a copy of bit 32 of the remainder.”…””}”(hj[hjYhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K"hh£hžhubh¸)”}”(hŒìWhen computing a CRC, we don't care about the quotient, so we can throw the quotient bit away, but subtract the appropriate multiple of the polynomial from the remainder and we're back to where we started, ready to process the next bit.”h]”hŒðWhen computing a CRC, we don’t care about the quotient, so we can throw the quotient bit away, but subtract the appropriate multiple of the polynomial from the remainder and we’re back to where we started, ready to process the next bit.”…””}”(hjihjghžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K)hh£hžhubh¸)”}”(hŒ7A big-endian CRC written this way would be coded like::”h]”hŒ6A big-endian CRC written this way would be coded like:”…””}”(hŒ6A big-endian CRC written this way would be coded like:”hjuhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K.hh£hžhubhŒ literal_block”“”)”}”(hŒ¡for (i = 0; i < input_bits; i++) { multiple = remainder & 0x80000000 ? CRCPOLY : 0; remainder = (remainder << 1 | next_input_bit()) ^ multiple; }”h]”hŒ¡for (i = 0; i < input_bits; i++) { multiple = remainder & 0x80000000 ? CRCPOLY : 0; remainder = (remainder << 1 | next_input_bit()) ^ multiple; }”…””}”(hhhj†ubah}”(h]”h ]”h"]”h$]”h&]”Œ xml:space”Œpreserve”uh1j„hŸh¶h K0hh£hžhubh¸)”}”(hŒoNotice how, to get at bit 32 of the shifted remainder, we look at bit 31 of the remainder *before* shifting it.”h]”(hŒZNotice how, to get at bit 32 of the shifted remainder, we look at bit 31 of the remainder ”…””}”(hŒZNotice how, to get at bit 32 of the shifted remainder, we look at bit 31 of the remainder ”hj–hžhhŸNh NubhÃ)”}”(hŒ*before*”h]”hŒbefore”…””}”(hhhjŸhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÂhj–ubhŒ shifting it.”…””}”(hŒ shifting it.”hj–hžhhŸNh Nubeh}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K5hh£hžhubh¸)”}”(hXfBut also notice how the next_input_bit() bits we're shifting into the remainder don't actually affect any decision-making until 32 bits later. Thus, the first 32 cycles of this are pretty boring. Also, to add the CRC to a message, we need a 32-bit-long hole for it at the end, so we have to add 32 extra cycles shifting in zeros at the end of every message.”h]”hXjBut also notice how the next_input_bit() bits we’re shifting into the remainder don’t actually affect any decision-making until 32 bits later. Thus, the first 32 cycles of this are pretty boring. Also, to add the CRC to a message, we need a 32-bit-long hole for it at the end, so we have to add 32 extra cycles shifting in zeros at the end of every message.”…””}”(hjºhj¸hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K8hh£hžhubh¸)”}”(hXThese details lead to a standard trick: rearrange merging in the next_input_bit() until the moment it's needed. Then the first 32 cycles can be precomputed, and merging in the final 32 zero bits to make room for the CRC can be skipped entirely. This changes the code to::”h]”hXThese details lead to a standard trick: rearrange merging in the next_input_bit() until the moment it’s needed. Then the first 32 cycles can be precomputed, and merging in the final 32 zero bits to make room for the CRC can be skipped entirely. This changes the code to:”…””}”(hXThese details lead to a standard trick: rearrange merging in the next_input_bit() until the moment it's needed. Then the first 32 cycles can be precomputed, and merging in the final 32 zero bits to make room for the CRC can be skipped entirely. This changes the code to:”hjÆhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K?hh£hžhubj…)”}”(hŒ½for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit() << 31; multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple; }”h]”hŒ½for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit() << 31; multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple; }”…””}”(hhhjÕubah}”(h]”h ]”h"]”h$]”h&]”j”j•uh1j„hŸh¶h KDhh£hžhubh¸)”}”(hŒGWith this optimization, the little-endian code is particularly simple::”h]”hŒFWith this optimization, the little-endian code is particularly simple:”…””}”(hŒFWith this optimization, the little-endian code is particularly simple:”hjãhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h KJhh£hžhubj…)”}”(hŒ®for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit(); multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple; }”h]”hŒ®for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit(); multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple; }”…””}”(hhhjòubah}”(h]”h ]”h"]”h$]”h&]”j”j•uh1j„hŸh¶h KLhh£hžhubh¸)”}”(hŒöThe most significant coefficient of the remainder polynomial is stored in the least significant bit of the binary "remainder" variable. The other details of endianness have been hidden in CRCPOLY (which must be bit-reversed) and next_input_bit().”h]”hŒúThe most significant coefficient of the remainder polynomial is stored in the least significant bit of the binary “remainder†variable. The other details of endianness have been hidden in CRCPOLY (which must be bit-reversed) and next_input_bit().”…””}”(hjhjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h KRhh£hžhubh¸)”}”(hŒÔAs long as next_input_bit is returning the bits in a sensible order, we don't *have* to wait until the last possible moment to merge in additional bits. We can do it 8 bits at a time rather than 1 bit at a time::”h]”(hŒPAs long as next_input_bit is returning the bits in a sensible order, we don’t ”…””}”(hŒNAs long as next_input_bit is returning the bits in a sensible order, we don't ”hjhžhhŸNh NubhÃ)”}”(hŒ*have*”h]”hŒhave”…””}”(hhhjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÂhjubhŒ to wait until the last possible moment to merge in additional bits. We can do it 8 bits at a time rather than 1 bit at a time:”…””}”(hŒ to wait until the last possible moment to merge in additional bits. We can do it 8 bits at a time rather than 1 bit at a time:”hjhžhhŸNh Nubeh}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h KWhh£hžhubj…)”}”(hŒûfor (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte() << 24; for (j = 0; j < 8; j++) { multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple; } }”h]”hŒûfor (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte() << 24; for (j = 0; j < 8; j++) { multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple; } }”…””}”(hhhj0ubah}”(h]”h ]”h"]”h$]”h&]”j”j•uh1j„hŸh¶h K[hh£hžhubh¸)”}”(hŒOr in little-endian::”h]”hŒOr in little-endian:”…””}”(hŒOr in little-endian:”hj>hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kchh£hžhubj…)”}”(hŒìfor (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte(); for (j = 0; j < 8; j++) { multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple; } }”h]”hŒìfor (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte(); for (j = 0; j < 8; j++) { multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple; } }”…””}”(hhhjMubah}”(h]”h ]”h"]”h$]”h&]”j”j•uh1j„hŸh¶h Kehh£hžhubh¸)”}”(hŒ{If the input is a multiple of 32 bits, you can even XOR in a 32-bit word at a time and increase the inner loop count to 32.”h]”hŒ{If the input is a multiple of 32 bits, you can even XOR in a 32-bit word at a time and increase the inner loop count to 32.”…””}”(hj]hj[hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kmhh£hžhubh¸)”}”(hŒ¯You can also mix and match the two loop styles, for example doing the bulk of a message byte-at-a-time and adding bit-at-a-time processing for any fractional bytes at the end.”h]”hŒ¯You can also mix and match the two loop styles, for example doing the bulk of a message byte-at-a-time and adding bit-at-a-time processing for any fractional bytes at the end.”…””}”(hjkhjihžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kphh£hžhubh¸)”}”(hŒóTo reduce the number of conditional branches, software commonly uses the byte-at-a-time table method, popularized by Dilip V. Sarwate, "Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM v.31 no.8 (August 1998) p. 1008-1013.”h]”hŒ÷To reduce the number of conditional branches, software commonly uses the byte-at-a-time table method, popularized by Dilip V. Sarwate, “Computation of Cyclic Redundancy Checks via Table Look-Upâ€, Comm. ACM v.31 no.8 (August 1998) p. 1008-1013.”…””}”(hjyhjwhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kthh£hžhubh¸)”}”(hXGHere, rather than just shifting one bit of the remainder to decide in the correct multiple to subtract, we can shift a byte at a time. This produces a 40-bit (rather than a 33-bit) intermediate remainder, and the correct multiple of the polynomial to subtract is found using a 256-entry lookup table indexed by the high 8 bits.”h]”hXGHere, rather than just shifting one bit of the remainder to decide in the correct multiple to subtract, we can shift a byte at a time. This produces a 40-bit (rather than a 33-bit) intermediate remainder, and the correct multiple of the polynomial to subtract is found using a 256-entry lookup table indexed by the high 8 bits.”…””}”(hj‡hj…hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kyhh£hžhubh¸)”}”(hŒI(The table entries are simply the CRC-32 of the given one-byte messages.)”h]”hŒI(The table entries are simply the CRC-32 of the given one-byte messages.)”…””}”(hj•hj“hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hŒ{When space is more constrained, smaller tables can be used, e.g. two 4-bit shifts followed by a lookup in a 16-entry table.”h]”hŒ{When space is more constrained, smaller tables can be used, e.g. two 4-bit shifts followed by a lookup in a 16-entry table.”…””}”(hj£hj¡hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hŒÀIt is not practical to process much more than 8 bits at a time using this technique, because tables larger than 256 entries use too much memory and, more importantly, too much of the L1 cache.”h]”hŒÀIt is not practical to process much more than 8 bits at a time using this technique, because tables larger than 256 entries use too much memory and, more importantly, too much of the L1 cache.”…””}”(hj±hj¯hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K„hh£hžhubh¸)”}”(hŒÚTo get higher software performance, a "slicing" technique can be used. See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm", ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf”h]”(hŒ—To get higher software performance, a “slicing†technique can be used. See “High Octane CRC Generation with the Intel Slicing-by-8 Algorithmâ€, ”…””}”(hŒTo get higher software performance, a "slicing" technique can be used. See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm", ”hj½hžhhŸNh NubhŒ reference”“”)”}”(hŒKftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf”h]”hŒKftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf”…””}”(hhhjÈhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”Œrefuri”jÊuh1jÆhj½ubeh}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kˆhh£hžhubh¸)”}”(hŒËThis does not change the number of table lookups, but does increase the parallelism. With the classic Sarwate algorithm, each table lookup must be completed before the index of the next can be computed.”h]”hŒËThis does not change the number of table lookups, but does increase the parallelism. With the classic Sarwate algorithm, each table lookup must be completed before the index of the next can be computed.”…””}”(hjßhjÝhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h KŒhh£hžhubh¸)”}”(hXâA "slicing by 2" technique would shift the remainder 16 bits at a time, producing a 48-bit intermediate remainder. Rather than doing a single lookup in a 65536-entry table, the two high bytes are looked up in two different 256-entry tables. Each contains the remainder required to cancel out the corresponding byte. The tables are different because the polynomials to cancel are different. One has non-zero coefficients from x^32 to x^39, while the other goes from x^40 to x^47.”h]”hXæA “slicing by 2†technique would shift the remainder 16 bits at a time, producing a 48-bit intermediate remainder. Rather than doing a single lookup in a 65536-entry table, the two high bytes are looked up in two different 256-entry tables. Each contains the remainder required to cancel out the corresponding byte. The tables are different because the polynomials to cancel are different. One has non-zero coefficients from x^32 to x^39, while the other goes from x^40 to x^47.”…””}”(hjíhjëhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Khh£hžhubh¸)”}”(hŒ¿Since modern processors can handle many parallel memory operations, this takes barely longer than a single table look-up and thus performs almost twice as fast as the basic Sarwate algorithm.”h]”hŒ¿Since modern processors can handle many parallel memory operations, this takes barely longer than a single table look-up and thus performs almost twice as fast as the basic Sarwate algorithm.”…””}”(hjûhjùhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K˜hh£hžhubh¸)”}”(hXJThis can be extended to "slicing by 4" using 4 256-entry tables. Each step, 32 bits of data is fetched, XORed with the CRC, and the result broken into bytes and looked up in the tables. Because the 32-bit shift leaves the low-order bits of the intermediate remainder zero, the final CRC is simply the XOR of the 4 table look-ups.”h]”hXNThis can be extended to “slicing by 4†using 4 256-entry tables. Each step, 32 bits of data is fetched, XORed with the CRC, and the result broken into bytes and looked up in the tables. Because the 32-bit shift leaves the low-order bits of the intermediate remainder zero, the final CRC is simply the XOR of the 4 table look-ups.”…””}”(hj hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h Kœhh£hžhubh¸)”}”(hŒÙBut this still enforces sequential execution: a second group of table look-ups cannot begin until the previous groups 4 table look-ups have all been completed. Thus, the processor's load/store unit is sometimes idle.”h]”hŒÛBut this still enforces sequential execution: a second group of table look-ups cannot begin until the previous groups 4 table look-ups have all been completed. Thus, the processor’s load/store unit is sometimes idle.”…””}”(hjhjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K¢hh£hžhubh¸)”}”(hXœTo make maximum use of the processor, "slicing by 8" performs 8 look-ups in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed with 64 bits of input data. What is important to note is that 4 of those 8 bytes are simply copies of the input data; they do not depend on the previous CRC at all. Thus, those 4 table look-ups may commence immediately, without waiting for the previous loop iteration.”h]”hX To make maximum use of the processor, “slicing by 8†performs 8 look-ups in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed with 64 bits of input data. What is important to note is that 4 of those 8 bytes are simply copies of the input data; they do not depend on the previous CRC at all. Thus, those 4 table look-ups may commence immediately, without waiting for the previous loop iteration.”…””}”(hj%hj#hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K¦hh£hžhubh¸)”}”(hŒvBy always having 4 loads in flight, a modern superscalar processor can be kept busy and make full use of its L1 cache.”h]”hŒvBy always having 4 loads in flight, a modern superscalar processor can be kept busy and make full use of its L1 cache.”…””}”(hj3hj1hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1h·hŸh¶h K­hh£hžhubh¸)”}”(hŒ