Split-Brain
Overview
Split-brain (SB) is the state in which two or more nodes in a replication cluster act as the source node at the same time, typically as a result of actions by the administrator or the cluster management (HA) software. SB occurs when the replication connection is lost and both nodes become source nodes simultaneously without knowing each other's role and status. When SB occurs, the replication cluster is split into two replication sets, and data loss becomes possible. Upon recognizing SB, the administrator must resolve it and return replication to normal using the following procedure.
Detect
FSR can internally determine whether the nodes are in an SB state. SB is identified through a RID exchange at the time the replication connection is established. If SB is detected, the replication connection is immediately dropped and each node waits in a standalone state. The following log is written when SB occurs.
2019-12-19 14:50:03.629 WRN establishing error=split-brain compare=newer key=1 peer=node3 resource=r0 state=connected
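If the engine log is written to a file, split-brain events can be located by searching for the keyword in the message above. The log file path below is only an assumption and depends on the installation:
grep "split-brain" /var/log/fsr/fsr.log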
Resolve
Resolving SB starts with deciding which of the two nodes will be the victim. Once the victim node is chosen, run the following command on the victim node to discard its data; the SB is resolved when the connection to the other node is re-established.
fsradm connect --discard-my-data <res-id> <peer-node>
When the connection is established with --discard-my-data, the victim node resynchronizes from the other node and recovers the latest replica of the data set.
When multiple SBs occur, victim nodes do not synchronize with each other, so the situation must be resolved by establishing a connection with --discard-my-data from every victim node.
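As an illustration, using the resource and peer names shown in the log above (r0 and node3; substitute the actual resource ID and peer node name of your cluster), the victim node would discard its data and reconnect as follows:
fsradm connect --discard-my-data r0 node3
If more than one victim node exists, run the same command on each victim node against the surviving source node.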
Fault
FSR defines the following error conditions as failures; when such a failure occurs, recovery actions must be performed manually.
- Disk error
- File I/O error
Disk Error
A failure may occur in the replication target itself, for example when the volume that holds the replication target is unintentionally unmounted during operation, or when the storage medium is physically damaged. In this case, the user must restore the replication target and bring the volume device back to an operational state. Once manual recovery is complete, replication must be restarted with a full synchronization.
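A minimal sketch of the manual recovery step on Linux, assuming the replication target resides on a hypothetical volume /dev/sdb1 mounted at /data (substitute the actual device and mount point):
# check whether the replication volume is still mounted
findmnt /data || echo "replication volume is not mounted"
# remount the volume if it was unmounted unintentionally
mount /dev/sdb1 /data
Once the volume is available again, restart the affected resource and trigger a full synchronization using the fsradm commands appropriate to your deployment.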
File I/O Error
File I/O errors can occur in various situations, such as problems with a file path or permission problems tied to the account in use. Although I/O errors are not frequent, they can occur during normal service operation when the environment changes unexpectedly or when an application is not designed to handle an exceptional situation gracefully. When an I/O error occurs, the application is expected to handle the exception, and the follow-up action depends on the application. File I/O errors caused by source-side applications are therefore regarded as ordinary I/O errors that can occur at any time and are not treated as failures. However, an error in file I/O performed by the fsr engine itself is a failure: if fsr cannot perform file I/O, mirroring is effectively disabled and replication stops immediately.
When an I/O error occurs in the fsr engine, its error code is recorded in the log, and the cause of the error can be estimated from that code. The administrator must manually recover from the failure and restore normal I/O for fsr. Once the environment is back to normal, the resource is restarted and replication is resumed by performing a full synchronization.
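As an aid to estimating the cause, an OS error code taken from the log can be translated into a human-readable message. The following is a generic Linux sketch; error code 13 is only an illustration, so use the code actually recorded in the log:
python3 -c "import os; print(os.strerror(13))"
# prints: Permission denied
A permission error such as this would point toward checking the account and access rights under which the fsr engine runs.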
Check Disk
Physical errors in a disk volume are difficult to recover from because the media itself is damaged, but logical errors at the file system level can be inspected and repaired with a utility (chkdsk on Windows, fsck on Linux).
In general, it is safest to unmount the volume before running such a utility. If logical defects were detected and repaired during the check, the volume must be restarted as a replication resource and a full synchronization must be performed to keep it consistent with the target.
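A minimal example of an offline file system check on Linux, assuming the replication volume is a hypothetical device /dev/sdb1 mounted at /data:
umount /data
# check and repair the file system (options vary by file system type)
fsck -f /dev/sdb1
mount /dev/sdb1 /data
On Windows, the equivalent is running chkdsk with the /f option against the relevant drive letter while the volume is not in use. If any repairs were made, restart the resource and perform a full synchronization as described above.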