Why did my hard disk fail or crash so fast & for no apparent reason?

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data is based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies.

Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.

Customers replace disk drives at rates far higher than those suggested by the estimated mean time between failures (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University.

The Carnegie Mellon study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for those drives listed MTBFs of between 1 million and 1.5 million hours, which the study said should mean annual failure rates “of at most 0.88%.” However, the study showed typical annual replacement rates of between 2% and 4%, “and up to 13% observed on some systems.”
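
The 0.88% figure is simple arithmetic: a year has 8,760 hours, so an MTBF of 1 million hours works out to roughly 8,760 / 1,000,000 failures per drive per year. Below is a minimal Python sketch of that calculation. The function and variable names are made up for illustration, and the plain ratio is an assumption about how the figure was derived, chosen because it reproduces the number quoted above.

```python
# Rough sketch: convert a datasheet MTBF into an expected annual failure rate
# and compare it with the replacement rates reported in the study.
# Assumption: annual failure rate is approximated as hours-per-year / MTBF.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_failure_rate(mtbf_hours: float) -> float:
    """Approximate fraction of drives expected to fail in one year."""
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    # Datasheet MTBF values quoted in the article.
    for mtbf in (1_000_000, 1_500_000):
        print(f"MTBF {mtbf:,} hours -> {annual_failure_rate(mtbf):.2%} per year")

    # Observed annual replacement rates quoted in the article.
    datasheet_worst_case = annual_failure_rate(1_000_000)
    for observed in (0.02, 0.04, 0.13):
        ratio = observed / datasheet_worst_case
        print(f"Observed {observed:.0%} is {ratio:.1f}x the datasheet figure")
```

Run as is, this gives 0.88% for a 1 million hour MTBF and 0.58% for 1.5 million hours. Even the low end of the observed replacement rates, 2%, is more than twice what the data sheets imply.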

So what does this mean to you, the consumer who purchases hard drives and computers with hard drives?

I have over 25 years of engineering, manufacturing, and software development experience, so first let’s examine an important aspect of typical manufacturing processes, from automobiles and airplanes to hard drives and smartphones. The typical manufacturer of any end product actually produces few of the components that make up that product. Instead, it outsources the manufacture, and often the design, of almost all subcomponents, with oversight of the supplier ranging from none at all to extensive specifications, testing, and audits. The supplier picked is often the lowest bidder, although some manufacturers choose the best supplier based on value, which is a combination of price, quality, and reliability.

This system of outsourcing is often referred to as the tiered supplier base. A tier one supplier supplies directly to the manufacturer of the end product. The suppliers to the tier one supplier are tier two suppliers, and so it goes down the food chain. Technically, a hard drive manufacturer is itself a tier one supplier to the computer manufacturer. This system explains why, when the United States Government was wrestling with whether to bail out the US automobile manufacturers, people were quoted as saying that if the manufacturers were allowed to go under, hundreds of thousands of people would lose their jobs. They were referring to the employees of all the tiered suppliers.

In a system like this, the quality of the end product is only as good as the weakest link in the supply chain. Most suppliers use very complex and rigid quality control and design methods to ensure the quality of their product, but in the end it still comes down to the potential for human error. Even the most sophisticated lights-out, 24/7, computer-controlled, robotized manufacturing plant in the world is subject to human error. The person programming the robot may not have been concentrating on the task, causing the robot to place a microchip a fraction of a micrometer off target on every 100th operation. That is how your hard drive can fail while your co-worker’s identical computer is just fine.

Early failures like this are not uncommon. They are what warranties refer to as “manufacturing defects”. The industry term is Infant Mortality Failure (IMF). Warranties have a time limit because they are intended to protect you against IMFs. There are in fact different levels of IMFs. Most electronics go through some sort of test, often referred to as burn-in, which checks for an immediate failure or a failure in the first few minutes. Those failures are caused by gross manufacturing defects that lead to a catastrophic failure almost immediately.

The more bothersome IMFs are the ones that make it all the way to you, the consumer, perform flawlessly for a short period of time, and then bam, it’s dead. The manufacturers hate these failures because now your opinion of the manufacturer is tarnished. You never knew of the failures during burn-in and were happy not knowing about them, but when your hard drive dies the night before a critical deadline, you go ballistic and demand the world in compensation. The cost of this failure is long term and higher than the cost of a new hard drive. It may result in a customer lost forever. This is why I will never own another HP computer, even though they may be great computers. I got a bad one and it soured me on HP forever.

So what can you do to protect yourself?

I personally always do a lot of research before any new electronics purchase. IMFs can be a persistent problem with one manufacturer or model until the root cause of the problem is found and corrected. The cause could even be a design flaw rather than a manufacturing problem. I recently purchased a new big-screen HD TV, and I thought I wanted the top-of-the-line Panasonic 3D Plasma until I learned, through reading reviews from several sources, that the 2010 models experienced early (within 3 months) loss of black levels, and that not enough information was available to determine whether this had been fixed in the 2011 models. So I bought my second choice.

The other, more obvious thing you can do specifically with a computer hard drive is to back up your data or image your entire system. I personally use a product called Acronis True Image. I make a backup image of my entire system and then make incremental backups every night. I have it set to keep 10 past increments so I can always roll back to a recent earlier version. I back this up to a dedicated 1 TB external hard drive. What if that hard drive fails, you ask? Well, the likelihood of your computer hard drive and your external hard drive failing at the same time is remote, but I own my own business, so I have a second external hard drive that I make redundant backups on just to be safe.
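
To put a rough number on "remote", here is a small Python sketch that estimates the chance of every copy of your data failing in the same year. It is only a back-of-the-envelope illustration under loud assumptions: each drive is assumed to fail independently, and the annual rates plugged in are the 2%, 4% and 13% replacement rates from the study above, not figures for any particular drive. The function name is hypothetical.

```python
# Back-of-the-envelope sketch: chance that all copies of your data
# (internal drive plus external backup drives) fail in the same year.
# Assumption: drives fail independently of each other, which ignores shared
# risks such as a power surge, fire, or theft hitting everything at once.


def all_drives_fail_probability(annual_rate: float, drives: int) -> float:
    """Probability that every one of `drives` independent drives fails in a year."""
    return annual_rate ** drives


if __name__ == "__main__":
    # Annual replacement rates quoted from the study earlier in the article.
    for rate in (0.02, 0.04, 0.13):
        one_backup = all_drives_fail_probability(rate, 2)   # computer + 1 external
        two_backups = all_drives_fail_probability(rate, 3)  # computer + 2 externals
        print(
            f"Annual rate {rate:.0%}: "
            f"both drives {one_backup:.4%}, all three {two_backups:.4%}"
        )
```

At a 4% annual rate, the chance of the computer's drive and one external drive both failing within the same year is about 0.16%. The independence assumption is the weak point, since a single event can take out drives that sit side by side.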

I would also recommend you get a good-quality surge protector: not the kind you get at Walmart next to the extension cords, but a good-quality unit from a retailer like Best Buy or any computer supply retailer. I use a Belkin unit that cost around $40 USD.

Check this if you need some freeware to monitor your hard disk for potential failure.

The author of this guest post, Randy L. Miller, is the CEO of Alagad Incorporated and is also an active member of TWC Forums.
