How Does Network Partitioning Affect MySQL Cluster?
Network partitioning occurs when a network outage splits the cluster into two or more groups of nodes that could each survive on their own. MySQL Cluster has to take this possibility into account whenever a node fails, because it must avoid a “split-brain” scenario in which two independently running clusters end up with different data.
A group of nodes is considered survivable if it contains at least one member from every node group in the cluster. How a split into survivable groups can occur depends on the setting of NoOfReplicas.
If NoOfReplicas is set to 2, a split into survivable groups can only occur as an even, or 50-50, split. For example, if there are 2 data nodes and one of them fails, it is treated as a possible network partition. Likewise, if there are 4 data nodes and 2 data nodes from different node groups shut down at exactly the same time, it is treated as a possible network partition.
If NoOfReplicas is set to a value greater than 2, partial failures can result in multiple survivable groups of unequal size. For example, with 6 data nodes set up with NoOfReplicas=3, a 4-2 split is possible in which both sides have members from all node groups and hence could each continue as an active cluster. In an extreme case, even a 2-2-2 split would be possible.
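The survivability rule above can be illustrated with a small Python sketch. It is not how MySQL Cluster implements the check internally; the node IDs, the assignment of nodes to node groups, and the function names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical layout: 6 data nodes with NoOfReplicas=3, giving 2 node groups
# of 3 nodes each. The node IDs are made up for this example.
NODE_GROUPS: Dict[int, Set[int]] = {
    0: {1, 2, 3},
    1: {4, 5, 6},
}


def is_survivable(nodes: Iterable[int], node_groups: Dict[int, Set[int]]) -> bool:
    """Return True if `nodes` contains at least one member of every node group."""
    alive = set(nodes)
    return all(bool(alive & members) for members in node_groups.values())


def is_split_brain_risk(partitions: List[Set[int]],
                        node_groups: Dict[int, Set[int]]) -> bool:
    """Return True if more than one partition could continue on its own."""
    survivable = [p for p in partitions if is_survivable(p, node_groups)]
    return len(survivable) > 1


# 4-2 split: both sides hold members of both node groups.
print(is_split_brain_risk([{1, 2, 4, 5}, {3, 6}], NODE_GROUPS))

# 2-2-2 split: all three parts hold members of both node groups.
print(is_split_brain_risk([{1, 4}, {2, 5}, {3, 6}], NODE_GROUPS))

# A split along node group boundaries: neither side is survivable.
print(is_split_brain_risk([{1, 2, 3}, {4, 5, 6}], NODE_GROUPS))
```

The first two calls report a split-brain risk because every part contains a member of each node group. The last call does not, since each side is missing a complete node group and so neither could continue as a cluster.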
While splits such as the above may seem very uncommon, the issue applies to any node failure, not only to network failures. The two have to be handled the same way because, to the surviving nodes in a cluster, a network failure and a node failure look identical. This means that whenever one or more nodes fail, MySQL Cluster has to assume that it might be a network partition and handle it appropriately.
When a network partition may have occurred, an arbitrator is required to resolve the possible “split-brain” situation.
What are the Requirements for MySQL Cluster to Keep Working When Losing One or More Data Nodes?
The number of data nodes the cluster can lose while still being operational depends on which data nodes are lost and whether there is an arbitrator.
The following considers a cluster consisting of 4 data nodes in 2 node groups.
At least one data node must always be available in each node group. With four data nodes in two groups, you can therefore lose two data nodes provided that they come from different groups. If both come from the same group, part of the data is no longer accessible and the whole cluster becomes unavailable. A whole node group cannot be allowed to go down because that would leave the cluster in an inconsistent state: it could not, for example, update a row that is no longer available or include it in the result of a select. To preserve data consistency, the cluster takes the drastic action of shutting down the remaining data nodes.
If one data node from each group is unavailable, it is essential that the remaining part of the cluster can be sure the cluster has not entered a split-brain scenario with both halves alive. This is where the arbitrator comes in (in your case, the management node is the arbitrator). If the management node is running, it can let the remaining two nodes know that they are the surviving nodes. If the dead nodes come online again, they will not be allowed to rejoin the cluster without synchronizing their data first.
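The decision described in this section can be summarized in a short Python sketch. It is a simplified model rather than the actual MySQL Cluster logic. The node IDs, the node group layout, the function names, and the outcome labels are hypothetical, and the sketch assumes that the surviving nodes shut down when a partition is possible but no arbitrator can be reached.

```python
from typing import Dict, Set

# Hypothetical layout: 4 data nodes with NoOfReplicas=2, giving 2 node groups.
NODE_GROUPS: Dict[int, Set[int]] = {
    0: {1, 2},
    1: {3, 4},
}


def covers_all_groups(nodes: Set[int], node_groups: Dict[int, Set[int]]) -> bool:
    """Return True if `nodes` holds at least one member of every node group."""
    return all(bool(nodes & members) for members in node_groups.values())


def outcome(failed: Set[int], node_groups: Dict[int, Set[int]],
            arbitrator_reachable: bool) -> str:
    """Return "continue" or "shutdown" for the surviving data nodes."""
    all_nodes = set().union(*node_groups.values())
    surviving = all_nodes - failed

    # A whole node group is gone: part of the data is inaccessible.
    if not covers_all_groups(surviving, node_groups):
        return "shutdown"

    # The missing nodes could form a working cluster of their own, so the
    # survivors cannot tell a node failure from a network partition.
    if covers_all_groups(failed, node_groups):
        # Assumption: without an arbitrator the survivors must stop.
        return "continue" if arbitrator_reachable else "shutdown"

    # The missing nodes could not run on their own: no arbitration needed.
    return "continue"


# One node lost from each group, management node (arbitrator) running.
print(outcome({1, 3}, NODE_GROUPS, arbitrator_reachable=True))

# Same failure, but the arbitrator cannot be reached.
print(outcome({1, 3}, NODE_GROUPS, arbitrator_reachable=False))

# Both nodes of node group 0 lost.
print(outcome({1, 2}, NODE_GROUPS, arbitrator_reachable=True))
```

In this model the arbitrator only matters in the second branch, where the failed nodes would have been able to carry on as a cluster by themselves.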