If a coherency error is observed either in the output of a "getsniffer" report, or in the array logs after the TRiiAGE script has been run against them, it should be cause for concern. Very simply put, a coherency error on a CLARiiON array means that when a stripe of data was read across all disks in a particular RAID group, each individual disk's checksum appeared "valid," yet when the parity was computed from the data disks it was found not to match the parity stored on the parity disk for this stripe. In an N+1 RAID group type, the CLARiiON array "recovers" from this by recomputing parity because the array does not know which component was incorrect. This action will correct the problem if parity was indeed the component that was incorrect. However, if one of the data components was in fact the incorrect piece, the corrective action taken by the array has caused the correct data to be unrecoverable, and the array will unwittingly return potentially incorrect data to a host.
In a RAID 1 or RAID 1/0 RAID type, the array synchronizes to the primary mirror if both halves of the mirror have correct checksums yet are found to be different. The logic employed here is that the most probable cause would be an incomplete write due to a power failure, in which case the write would not have been acknowledged to the host and, therefore, the host would eventually retry the write. Based upon that assumption, it should not matter which side the array synchronizes to, because the data will be overwritten by the host shortly anyway. In general, EMC has not observed coherency errors to be a serious problem on RAID 1 or RAID 1/0 RAID groups.
What are coherency events?
A coherency (COH) event indicates that although the stamps (write, time, shed) may match, the parity for a stripe does not accurately reflect the data. On R1/R10, this means that the mirrored pairs do not match. On R3/R5/R6, this means that the XOR of all the data drives does not match the contents of the parity drive(s). COH errors could be seen on R1/R10 units if a write fails before getting to both drives. They could be seen on R3/R5/R6 if a write fails before getting to all the data drives and parity drive(s) being modified. A coherency error could also be seen during a read operation if the data being read was returned (by the drive) incorrectly. This is the same type of issue that could occur with any vendor employing RAID technology and is not unique to EMC CLARiiON storage systems.
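The two checks described above can be illustrated with a minimal sketch. It is not FLARE code: the function names and the tiny two-byte blocks are assumptions made for illustration, and only single-parity XOR is modeled (RAID 6 uses a second parity that is not shown here).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_stripe_is_coherent(data_blocks, parity_block):
    """Parity RAID: the XOR of the data blocks must equal the stored parity."""
    return xor_blocks(data_blocks) == parity_block


def mirror_is_coherent(primary_block, secondary_block):
    """Mirrored RAID: both halves of the mirror must hold the same contents."""
    return primary_block == secondary_block


# Hypothetical stripe with three data blocks (illustrative values only).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
good_parity = xor_blocks(data)   # parity that matches the data
stale_parity = b"\x00\x00"       # parity that was never updated

coherent = parity_stripe_is_coherent(data, good_parity)        # True
incoherent = not parity_stripe_is_coherent(data, stale_parity)  # True
```

Note that the check only reports that data and parity disagree. It cannot say which of the two is wrong, which is why coherency errors need further analysis.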
What effect does this have on data?
This could result in an incorrectly calculated value that does not properly match with the contents of the parity drive. FLARE Verify will take the data read from the data drives and calculate new parity and write that value to the parity drive. If coherency errors are observed either in the output of a "getsniffer" report, in SP event logs or when viewing a TRiiAGE Analysis report, there should be cause for concern.
Very simply put, a coherency error on a CLARiiON array means that when a stripe of data was read across all disks in a particular RAID group, each individual disk's checksum appeared "valid," yet when the parity was computed from the data disks, it was found not to match the parity stored on the parity disk for this stripe. It does not mean that the data is incorrect. Instead, it simply means that there was a difference between calculated parity and what actually exists in parity.
RAID 6 can correct a maximum of one coherency error in the absence of any other errors on a non-degraded LUN. In some cases, RAID 6 will detect more than one coherency error.
What process is used to detect coherency events?
In situations where the COH (coherency) of the redundant information of a LUN (Logical Unit Number) is uncertain, FLARE base software performs a Verify operation to check for the consistency of information and takes corrective action as required.
Three types of events can lead to a Verify process operation.
- If a single storage processor (SP) experiences a power failure event while Write data operations are in progress to a redundant unit (such as a RAID-5 LUN), a verify operation will occur. In this case, it is necessary on the next power up to check the state of the LUN in the areas which were being written to. This will ensure that the Data and Parity sectors were left in a consistent state. If the Write data operation had modified the information on the Data disk, but had not yet updated the corresponding information on the Parity disk when the power failed, the stripe would be left in an inconsistent state.
This type of Verify process operation is done by all FLARE base software and is referred to as a Nonvol (Non-Volatile) Verify process. This name is used because the information that specifies which sectors need to be verified is maintained in non-volatile storage on each Storage Processor (SP) in the array. Because the sectors that need to be verified are known, the Nonvol Verify process occurs very quickly and is completed before the LUN becomes enabled.
- A failure of an SP while write data operations are in progress to a redundant unit (such as a RAID-5 LUN). The LUN is then trespassed and enabled by the Peer SP. This situation is the same as the case outlined above, except that the Peer SP which takes control of the LUN does not know what areas of the LUN the failed SP may have been writing to at the point of failure.
When this occurs, the Peer SP must perform a full BGV (Background Verify) operation. This name is used because the SP checks the state of the entire LUN as a background operation that proceeds while the LUN is enabled. A BGV can also be induced by power failing a single SP while Write operations are in progress, and then replacing the SP before the next power up.
It is assumed when data and parity do not match that the failure occurred as described above; the data is correct and the parity is stale. Therefore, parity needs to be updated. RAID-3, RAID-5 and RAID-6 update the parity disk(s). Mirrors (RAID-1 and RAID-10) update the secondary disk. This process does not apply to Non-Redundant RAID types ID (Individual Disk) or RAID-0.
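The consequence of this assumption can be shown with a small sketch. It is a simplified single-parity model, not the FLARE implementation, and all function names and byte values are hypothetical. It shows that rewriting parity is harmless when parity was stale, but that it makes the original data unrecoverable when a data block was the incorrect component.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def verify_parity_stripe(data_blocks, parity_block):
    """Return (parity_to_store, was_corrected).

    On a mismatch, the data is assumed correct and parity is recomputed.
    """
    expected = xor_blocks(data_blocks)
    if expected == parity_block:
        return parity_block, False
    return expected, True


def rebuild_data_block(surviving_data_blocks, parity_block):
    """Reconstruct one missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_data_blocks) + [parity_block])


# What the host originally wrote, and the parity stored for it.
original = [b"\xaa", b"\xbb", b"\xcc"]
stored_parity = xor_blocks(original)

# Hypothetical fault: data block 1 is now wrong, yet its checksum looks valid.
on_disk = [b"\xaa", b"\xb0", b"\xcc"]

# Before Verify runs, the old parity could still recover the original block.
before = rebuild_data_block([on_disk[0], on_disk[2]], stored_parity)

# Verify sees a mismatch and rewrites parity from the (incorrect) data.
new_parity, corrected = verify_parity_stripe(on_disk, stored_parity)

# After Verify, a rebuild reproduces the incorrect block instead.
after = rebuild_data_block([on_disk[0], on_disk[2]], new_parity)
```

In this example `before` equals the original block and `after` equals the incorrect one, which is the reason unexplained coherency events deserve investigation before a verify is run.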
- A READ operation (Host-based or Array-based IOs such as Sniff operations) or a Background Verify (BGV) operation is performed, and an error occurs while reading either a data or parity disk. In this case, a coherency event could result. See also emc150707 for information regarding a rare occurrence in which background verify fails to report an uncorrectable event.
How to identify coherency events
Coherency (COH) errors usually occur because of a mismatch between the parity generated by all the data drives and the parity drive. In such cases, the parity drive may be faulted. But, there are times when this is not the case and a faulty drive is causing the coherency errors and must be identified.
For example, if all drives in a RAID group log Parity Sector Reconstructed [r5_vr COH] messages except one drive and that drive is reporting Sector Reconstructed [r5_vr TS] messages, the latter drive can be identified as the problem drive. Furthermore, if it is determined that the same drive is also reporting soft media errors with bad blocks, it may be the faulty drive causing other drives to report coherency errors.
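The pattern in this example can be sketched as a simple check over events that have already been extracted from the SP logs. The input format (pairs of disk ID and message tag) and the disk IDs are assumptions made for illustration; this is not an EMC tool and it does not parse real log lines.

```python
from collections import defaultdict


def find_suspect_disk(events):
    """Return the one disk that logs TS while every other disk logs only COH.

    `events` is an iterable of (disk_id, tag) pairs, where tag is "COH" or
    "TS". Returns None when the pattern from the text is not present.
    """
    tags_by_disk = defaultdict(set)
    for disk_id, tag in events:
        tags_by_disk[disk_id].add(tag)

    ts_disks = {d for d, tags in tags_by_disk.items() if "TS" in tags}
    coh_only_disks = {d for d, tags in tags_by_disk.items() if tags == {"COH"}}

    exactly_one_ts = len(ts_disks) == 1
    rest_are_coh_only = (ts_disks | coh_only_disks) == set(tags_by_disk)
    if exactly_one_ts and coh_only_disks and rest_are_coh_only:
        return next(iter(ts_disks))
    return None


# Hypothetical events for a four-disk RAID group.
events = [
    ("0_0_0", "COH"),
    ("0_0_1", "COH"),
    ("0_0_2", "TS"),
    ("0_0_3", "COH"),
]
suspect = find_suspect_disk(events)
```

A result from a check like this is only a starting point. As noted above, soft media errors with bad blocks on the same drive would strengthen the case.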
Additionally, if COH events are due to partial writes, first view the SP logs and determine the affected LUN. Next, look back in time to when the LUN was last shut down by an SP. If the coherency errors occur between the disabling and the subsequent enabling of the LUN and these errors have the "r5_vr" or "mirror_vr" algorithm decoded as part of the extended status, then this is an expected coherency error corrected by FLARE code.
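As a rough illustration of that rule, the following sketch classifies a single COH event once its timestamp and decoded algorithm have been read from the SP log by hand. The function name, argument layout, and sample timestamps are hypothetical.

```python
from datetime import datetime

# Algorithms named in the text as indicating a verify-corrected error.
EXPECTED_ALGORITHMS = ("r5_vr", "mirror_vr")


def is_expected_partial_write_coh(event_time, algorithm, lun_disabled, lun_enabled):
    """True if the COH event falls between the LUN being disabled and
    re-enabled and carries one of the expected verify algorithms."""
    in_window = lun_disabled <= event_time <= lun_enabled
    return in_window and algorithm in EXPECTED_ALGORITHMS


# Hypothetical timestamps for one LUN.
disabled = datetime(2011, 3, 1, 10, 0, 0)
enabled = datetime(2011, 3, 1, 10, 5, 0)

expected = is_expected_partial_write_coh(
    datetime(2011, 3, 1, 10, 2, 0), "r5_vr", disabled, enabled
)
unexpected = is_expected_partial_write_coh(
    datetime(2011, 3, 2, 9, 0, 0), "r5_vr", disabled, enabled
)
```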
You must check the following to identify the root cause for a coherency error:
Run the RAID Group Health Check (RGHC) script to isolate disks in a given RAID group (RG) to make viewing of the disk errors easier.
- Event code 820 - These are recoverable media error events.
- Event code 684/689 - These are host parity sector and host sector reconstructed messages for a particular disk and are NOT an indicator that the disk is bad.
- Event code 953/957 - These are uncorrectable error events. See emc48444 for more detail.
Identify the RG type and all disks that are part of the RG that have any type of errors.
Are there any disks that exhibited errors on sectors within a stripe element (a size of 0x80) of the sectors where the coherency errors took place?
- Searching for a matching sector between a COH event and another error (e.g., TS, CRCRetry) can help identify the bad disk.
- Searching for a DH diagnostic message for a disk in the SP logs can help determine the bad disk.
- The disk that is causing the issue could be a replacement or a hot spare.
- In RAID-3 RGs, the disk that the coherency error is reported against is probably NOT at fault.
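The sector comparison can be sketched as follows. This is an illustrative helper, not part of RGHC or TRiiAGE. It assumes the sector addresses have already been collected from the logs, and it treats "within a stripe element" as a difference of at most 0x80 sectors. The disk IDs and addresses are made up.

```python
STRIPE_ELEMENT_SECTORS = 0x80  # stripe element size given in the text


def errors_near_coherency(coh_sectors, disk_errors, window=STRIPE_ELEMENT_SECTORS):
    """Find disk errors within `window` sectors of a coherency error.

    coh_sectors: list of sector addresses where COH errors were logged.
    disk_errors: dict mapping disk ID -> list of (sector, error_type).
    Returns a sorted list of (disk_id, error_type, error_sector, coh_sector).
    """
    matches = []
    for disk_id, errors in disk_errors.items():
        for sector, error_type in errors:
            for coh_sector in coh_sectors:
                if abs(sector - coh_sector) <= window:
                    matches.append((disk_id, error_type, sector, coh_sector))
    return sorted(matches)


# Hypothetical inputs: one COH error and errors on two disks.
coh_sectors = [0x1000]
disk_errors = {
    "0_0_2": [(0x1040, "TS")],        # 0x40 sectors away: a match
    "0_0_3": [(0x5000, "CRCRetry")],  # far away: not a match
}
nearby = errors_near_coherency(coh_sectors, disk_errors)
```

A disk that appears repeatedly in the result is a candidate for closer inspection, for example by checking for DH diagnostic messages against it.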
Should you seek assistance from the next level and have an Engineering ticket created?
Coherency errors with NO uncorrectable and invalidated sector events in the SP event log
Review the SP event logs for COH events without any accompanying "uncorrectable" or "invalidated" events. If found, look for known causes that may have been the source of the coherency event. Do not perform a Background Verify (BGV) as unexplained coherency events could potentially result in additional data corruption.
If the source of the coherency event cannot be determined and there are NO "uncorrectable" or "invalidated" events, obtain a new set of SPCollects and contact a Certified TS2 (send an e-mail to CLARiiON CTS). The CTS should analyze the logs. If upon further review no clear indication of the source of the coherency event can be determined, an Engineering ticket should be opened for assistance. For further information on cases that a CTS can review, see this document. If a CTS engineer cannot be reached (allow up to 15 minutes for a reply), then an Engineering Action Request (using ARS) will need to be created.
Coherency errors with uncorrectable and invalidated sector events in the SP event log
When reviewing SP event logs in which "Coherency Events" are reported as 684 or 689 events, look for other events such as "Uncorrectable Sector" and "Invalidated Sector" events. If such events are present, treat the issue as you would any other uncorrectable sector event. Begin a Background Verify operation to determine which LUNs are affected by bad sectors. Obtain a new set of SPCollects and follow the standard process for resolving uncorrectable sector events described in emc48444. This solution also contains information regarding whether an Engineering ticket is required. Any process that requires recovering data will require a customer's last known good backup. For issues involving COH events, a last known good backup is defined as one that was performed prior to the first recorded COH event.
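The definition of a last known good backup can be expressed as a short selection rule. The sketch below assumes the backup completion times and COH event times are already available as timestamps. The function name and sample dates are hypothetical.

```python
from datetime import datetime


def last_known_good_backup(backup_times, coh_event_times):
    """Return the most recent backup taken before the first COH event,
    or None if no backup predates it."""
    first_coh = min(coh_event_times)
    candidates = [t for t in backup_times if t < first_coh]
    return max(candidates) if candidates else None


# Hypothetical backup and COH event timestamps.
backups = [
    datetime(2011, 2, 20, 1, 0),
    datetime(2011, 2, 27, 1, 0),
    datetime(2011, 3, 6, 1, 0),
]
coh_events = [datetime(2011, 3, 1, 14, 30), datetime(2011, 3, 4, 8, 15)]

good_backup = last_known_good_backup(backups, coh_events)
```

With these sample values, the backup from 27 February is selected. The backup from 6 March is excluded because it was taken after the first recorded COH event.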
According to emc260859 Code: a4b Coherency error detected
The initial problem starts when a suspect drive was manually or automatically (by FLARE) proactively copied to a hot spare using FLARE Release 30 before patch 509. Although the suspect drive was already replaced and the hot spare was equalized back to a new drive, coherency events are seen from time to time.
According to ETA emc258303 ("ETA emc258303: CLARiiON Release 30: Critical Uptime and Robustness ETA bulletin for EMC CLARiiON CX4 arrays"):
"During proactive sparing, FLARE reads from the Proactive Copy (PACO) candidate and writes to the PACO spare. If there are any media errors, the data is reconstructed from the other disks in the RAID Group. This reconstructed data is not being written to the PACO spare. As a result, stale data remains on the PACO spare. A subsequent data verify will return a coherency error when the data is read. A write to the lost position will correct the issue for that stripe."
As a result, the stale data is copied back to the new disk, so the stale data remains in the RAID group. When a subsequent Background Verify or Sniff operation reads these areas, it reports a coherency error.
A write to the lost position will correct the issue for that stripe. However, it is strongly recommended to rebind the affected LUN and restore data from last known good backup.
Source: EMC Powerlink