From time to time, a caller asks me about the MTBF rating of an Industrial Ethernet switch. The abbreviation MTBF stands for Mean Time Between Failures and indicates the reliability of the specified equipment. It is the typical time between failures for a specified device design, that is, the typical amount of time (in hours) any of a specified set of devices will function before failing.
However, different companies define failure in different ways, depending on the nature of the equipment and its function within a system. Also, test parameters and batch size are not standardized. Essentially, higher MTBF ratings for finished goods are obtained by building equipment with components that have high individual MTBF values -- that is, better quality components.
MTBF grew out of the US military's attempts to formalize reliability assessment in the 1950s and 1960s, which resulted in the publication of MIL-HDBK-217. Various flaws with this document led to a number of revisions and, eventually, to the Army abandoning it: "... the U.S. Army has discovered that the problems with the traditional reliability prediction techniques are enormous and have canceled the use of MIL-HDBK-217 in Army specifications ..." Source: Equipment Reliability Institute's "ERI News", August, 2001 - vol. 4.
Despite criticisms of MTBF (especially as derived under MIL-HDBK-217), it remains the dominant reliability assessment tool in the commercial electronics industry. The "Telcordia SR-332" handbook is used by many non-military electronics manufacturers for generating MTBF values. It evolved as follows: In the early 1980s Bellcore (Bell Communications Research) spun off from AT&T Bell Labs. Starting in 1985, Bellcore used MIL-HDBK-217, then improved and adapted it for highly integrated commercial electronic products. In 1997 Bellcore was sold and its name was later changed to Telcordia Technologies.
At Contemporary Controls, equipment reliability is specified by MTBF values produced through the use of the Telcordia standard: Method I - Case I - Quality Level I.
Although the derivation of an MTBF value can be mathematically quite involved, the process can be generally stated as:
MTBF = (Total Operating Time) / (Sample Size)
Suppose, as a very simple example, we test five electronic components until each one fails with the following results:
After totaling the above hour counts (3000), we would divide by the sample size (5) to get an MTBF for the component:
MTBF = 3000/5 = 600 hours
The above MTBF example means that we would expect the theoretically typical component to fail after 600 hours of operation. Stated differently, if we assume that all five components were typical, we would expect all of them to fail at 600 hours, with an average failure rate of one every 120 hours (600/5). Note that every component greatly outlived the 120-hour statistical failure mark. The 1-failure-per-120-hours figure is merely a statistical artifact that only achieves significance once the group size becomes much larger than in this example.
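The arithmetic of this example can be sketched in a few lines of Python. The individual failure times below are hypothetical placeholders, chosen only so that five units total 3000 hours as in the text; they are not the original test results. The function name `mtbf_hours` is likewise just an illustrative choice.

```python
def mtbf_hours(failure_times_hours):
    """Total operating time divided by sample size (every unit run to failure)."""
    if not failure_times_hours:
        raise ValueError("need at least one failure time")
    return sum(failure_times_hours) / len(failure_times_hours)


# HYPOTHETICAL failure times in hours (assumed values, not measured data).
# They are chosen only so that the five units total 3000 hours.
hypothetical_failure_times = [400, 500, 600, 700, 800]

mtbf = mtbf_hours(hypothetical_failure_times)

# Averaged over the whole group: one failure per (MTBF / group size) hours.
group_interval = mtbf / len(hypothetical_failure_times)

print(f"MTBF: {mtbf:.0f} hours")
print(f"Group-wide failure interval: {group_interval:.0f} hours")
```

With these placeholder values the script reports an MTBF of 600 hours and a group-wide interval of 120 hours, and every individual unit lasts well beyond the 120-hour mark.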
Actual MTBF values are much, much higher than the preceding example. Indeed, some exceed 1,000,000 hours! Industrial Ethernet switches usually have MTBF ratings of about 500,000 hours. That is, of all such units tested, the typical one would fail at 500,000 hours. Likewise, all of them would fail at 500,000 hours if the entire group were composed of typical devices. Of course, no one really tests devices for such a long time: 500,000 hours is about 57 years! Actual MTBF ratings are either projections based on a record of actual product failures, or predictions made by aggregating known MTBF values from component or sub-assembly suppliers.
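As a rough illustration of the second approach, here is a minimal sketch of aggregating component MTBF values. It assumes the simplest possible model: every component has a constant failure rate and any single component failure counts as a device failure (a series system), so failure rates add. This is not the Telcordia SR-332 procedure itself, which applies further factors, and the component names and MTBF figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def predicted_system_mtbf(component_mtbfs_hours):
    """Series model (an assumption): failure rates add; MTBF is the reciprocal."""
    if not component_mtbfs_hours:
        raise ValueError("need at least one component")
    if any(m <= 0 for m in component_mtbfs_hours):
        raise ValueError("MTBF values must be positive")
    total_failure_rate = sum(1.0 / m for m in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# HYPOTHETICAL supplier figures in hours (assumed for illustration only).
hypothetical_components = {
    "power supply": 2_000_000,
    "switch chip": 1_000_000,
    "port circuitry": 2_000_000,
}

system_mtbf = predicted_system_mtbf(hypothetical_components.values())
print(f"Predicted MTBF: {system_mtbf:,.0f} hours")
print(f"About {system_mtbf / HOURS_PER_YEAR:.1f} years")
```

Under this model the finished device always has a lower MTBF than its weakest component, which is why better-quality components raise the rating of the finished goods.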
Some people like to look at the MTBF like this: If a group of 1000 Industrial Ethernet switches has an MTBF rating of 500,000 hours, we could expect all 1000 units to fail within some 57 years. But if all 1000 were placed in service over the same time period with an evenly-spread failure rate, we could statistically expect one to fail about every 21 days, based on the following calculations:
MTBF / population size = mean time between failures within the group
(500,000 hours) / (1000 switches) = 500 hours between failures within the group
(500 hours) / (24 hours per day) = 20.83 days
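The same calculation can be written as a short Python sketch. The function name `group_failure_interval_hours` is a hypothetical one chosen for illustration; the inputs are the figures from the text.

```python
HOURS_PER_DAY = 24


def group_failure_interval_hours(mtbf_hours, population):
    """Average time between failures across a group with evenly spread failures."""
    if population <= 0:
        raise ValueError("population must be positive")
    return mtbf_hours / population


# Figures from the text: 1000 switches rated at 500,000 hours MTBF.
interval_hours = group_failure_interval_hours(500_000, 1000)
interval_days = interval_hours / HOURS_PER_DAY

print(f"{interval_hours:.0f} hours between failures across the group")
print(f"{interval_days:.2f} days between failures across the group")
```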
However, the foregoing result is very misleading. Firstly, assuming an evenly spread failure record, the odds are 999 to 1 against any particular switch being the one that fails in the first 21 days! Also, various factors (some unknown) skew the typical failure model toward the MTBF value. That is, in reality the 1000 switches tend to fail (or wear out) at roughly the same time (near the MTBF value). But the averaging process yields a statistical result that predicts one failure every 21 days, even though the true lifetime of the vast majority of switches is much nearer the MTBF.
From this you can see that an MTBF rating is of no value when applied to an individual item (nobody replaces a switch every 21 days). Instead, the MTBF is a figure-of-merit that predicts the reliability of an entire group of products. What we should care about is: The greater the MTBF of the group, the more reliable a typical individual product within the group!