Common mode failure

Revision as of 18:44, 30 October 2007 by Jqin (talk | contribs) (1 revision(s))
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

A common mode failure occurs when events are not statistically independent. That is, one event causes multiple systems to fail.

An example is when all of the pumps for a fire sprinkler system are located in one room. If the room becomes too hot for the pumps to operate, they will all fail at essentially the same time, from one cause (the heat in the room).

The principle of redundancy states that, when events of failure of a component are statistically independent, the probabilities of their joint occurrence multiply. Thus, for instance, if the probability of failure of a component of a system is one in one thousand per year, the probability of the joint failure of two of them is one in one million per year, provided that the two events are statistically independent. This principle favors the strategy of the redundancy of components. One place this strategy is implemented is in RAID 1, where two hard disks store a computer's data redundantly.

But even so there can be many common modes: consider a RAID1 where two disks are purchased online and are installed in a computer, there can be many common modes:

  • The disks are likely to be from the same manufacturer and of the same model, therefore they share the same design flaws.
  • The disks are likely to have similar serial numbers, thus they may share any manufacturing flaws affecting production of the same batch.
  • The disks are likely to have been shipped at the same time, thus they may are likely to have suffered from the same transportation damage.
  • As installed both disks are attached to the same power supply, making them vulnerable to the same power supply issues.
  • As installed both disks are in the same case, making them vulnerable to the same overheating events.
  • They will be both attached to the same card or motherboard, and driven by the same software, which may have the same bugs.
  • Because of the very nature of RAID1, both disks will be subjected to the same workload and very closely similar access patterns, stressing them in the same way.

Also, if the events of failure of two components are maximally statistically dependent, the probability of the joint failure of both is identical to the probability of failure of them individually. In such a case, the advantages of redundancy are negated. Strategies for the avoidance of common mode failures include keeping redundant components physically isolated.

A prime example of redundancy with isolation is a nuclear power plant. The new ABWR has three divisions of Emergency Core Cooling Systems, each with its own generators and pumps and each isolated from the others. The new European Pressurized Reactor has two containment buildings, one inside the other. However, even here it is not impossible for a common mode failure to occur (for example, caused by a highly-unlikely Richter 10 earthquake).

See also