Reliabilty of infrastructure components

Infrastructure components or the combination thereof all fail at some moment in time. There are several reasons for failure, as described below.

Physical defects

Of course unavailability can arise from physical defects of parts in the infrastructure. Of course everything breaks down eventally, but mechanical parts are most likely to break first. Some examples of mechanical parts are:

Fans for cooling equipment. The fans usually have a limited lifespan, they usually break because of dust in the bearing, leading to the motor to work harder until it breaks.
Disk drives. Disk drives contain two moving parts: the motor spinning the platters and the linear motor that moves the read/write heads.
Tapes and tape drives. Tapes in itself are very vulnerable to defects as the tape is spun on and off the reels all the time. Tape drives and especially tape robots contain very senistive pieces of mechanics that can easily break.

Apart from mechanical faillures due to normal use parts also break because of external factors like ambient temperature, vibrations and aging. Most parts favor stable temperatures. When the temperature in for instance a datacenter fluctuates, parts expand and shrink, leading to for instance contact problems in connectors, printed circuit board connections or solder joints.

This effect also occurs when parts are exposed to vibrations and when parts are switched on and off frequently.

Some parts also age over time. Not only mechanical parts wear out, but also some electronic parts like large capacitors that contain fluids and transformers that vibrate due to humming. Solder joints also age over time, just like on/off switches that are used regularly.

Cables tend to fail too. The best example of this is SCSI flat cables. When confronted with a intermittend SCSI error, always first replace the cable. But not only flat cables have problems. Also network cables, especially when they are moved around much tend to fail over time. Another type of cable that is highly sensitive to mechanical stress is fiber optics cable.

Some systems like PC system boards and external disk caches are equipped with batteries. Batteries, even rechargable batteries, are known to fail often.

Another typical component to fail are oscillators used on system baords. These oscillators are also in effect mechanical parts and prone to faillure.

The best solution to this problem is to implement resilience to avoid Singe Points Of Faillures (SPOFs) as described in one of the next sections.

Environmental issues

Environmental issues can cause downtime as well. Issues with power and cooling, and external factors like fire and flooding can cause entire datacenters to fail.

Power can fail for a short or long time, and can have voltage drops or spikes. Power outages can couse downtime, and power spikes can cause power supplies to fail. The effect of these power issues can be eliminated by using an Uninterruptable Power Supply (UPS).

Cooling issues can be faillure of the air conditioning system, leading to high temperatures in the datacenter. When the temperature rises too much systems must be shut down to avoid damage.

Some external factors that can lead to unavailability are:

Earthquakes - Not much can be done about this, apart from the building quality of the datacenter
Flooding - In parts of the world where sea level is higher than land level (I live in The Netherlands at 6 metres below sea level) or where rivers can overflow easily flooding can occur. This is why I always advise to locate a datacenter at least at the second floor of a building.
Fire - Proper fire extinction systems and fire prevention systems can help avoid or minimize downtime
Smoke - Most fire related downtime is due to smoke, not fire. Even when a fire is not at the datacenter but at some other part of the building, smoke can be a good reason to shut down the entire IT infrastructure, since smoke gets sucked into components by their fans and cause damage to the components.
Terrorist attacks - There were datacenters located in the World Trade Center in New York. The 9/11 attacks casued severe downtime.

Software

Downtime caused by software includes software bugs, including errors in file systems and operating systems. After human errors software bugs are the number one reason for unavailability.

Because of the complexity of most software it is nearly impossible (and very costly) to create bug-free software. Software bugs in applications can stop an entire system (like the infamous Blue Screen of Death on Windows systems), or create downtime in other ways. Since operating systems are software too, operating systems contain bugs which can lead to corrupted file systems, network faillures or other sources for unavailability.

Bathtub curve

In most cases the availability of a components follows a so-called bathbub curve. A component failure is most likely when the components is new. In the first month of use the chance of a components failure is relatively high. Sometimes a components doesn't even work when unpacked for the first time, before it is used at all. This is what we call a DOA component - Dead On Arrival.

When the components still works after the first month it is highly likely that it will continue working without failure, until the end of its technical life cycle. This is the other end of the bathtub - the chance of failure rises enormously at the end of the life cycle of a component.

Complexity of the infrastructure

Adding more components to an overall system design can undermine high availability, even if the extra components are needed to achhieve high availability. This sounds paradoxal, but in practice I have seen such situations. Complex systems inherently have more potential failure points and are more difficult to implement correctly. Also the complex system is harder to manage, more knowledge is needed to maintain the system and errors are easily made.

Sometimes it is better to have an extra spare system than to have complex system redundancy in place. When a workstation fails, most people can work on another rmachine, and the defective machjine can be swapped in 15 minutes. This is probably a better choice than implementing high availability measures in the workstation like dual network cards, dual connections to dual network switches that can failover, failover drivers for the network card in the workstation, dual power supplies in the workstation fed via two separate cables and power outlets on two fuse boxes, etc. You get the point.

The same goes for high availability measures on other levels. I have had a very instable set of redundant ATM (Asynchronous Transfer Mode) network switches in the core of a network once. I could not get the systems to perform failover well, leading to many periodes of downtime of a few minutes each. When I removed the redundancy in the network, the network never failed again for a year. The leftover switches were loaded with a working configuration and put in the closet. If the core switch would fail, we could swap it in 10 minutes (which given that this woule not happen more than once a year - an probably fewer, led to an availability of at least 99.995%) .

This entry was posted on Tuesday 12 July 2011