Tuesday, September 21, 2010

How can the failure of a memory card in a SAN cause so much havoc?

On Aug 25, the failure of a storage area network (SAN) caused Web site outages at 26 of the 83 state agencies served by the Virginia Information Technologies Agency (VITA). The EMC SAN malfunction has been blamed on a memory card failure. A backup SAN that was supposed to act as a fail-over system then also experienced problems, according to published reports. As of the end of Aug 2010, all but three agency sites had been restored, leaving the commonwealth's Department of Motor Vehicles (DMV), the Department of Taxation and the state Board of Elections without services.

So the question is: how did this happen? The databases for the critical systems should have disaster recovery (DR) mechanisms in place to assist in their restoration. The weird part is that a single memory card failure caused so many systems to fail. Where are the redundancy and fail-over?

I've no internal knowledge of the database design, but it seems to me that the databases were designed incorrectly.

It's not wrong to have a central database that is accessed by the whole organisation. However, the database files should be kept in different disk arrays. In other words, if your organisation's central database has 5 different systems connected to it, the database files should be stored in 5 different disk arrays. This prevents the failure of one array from bringing down the whole organisation's systems.

Using the above example, a single memory card failure would have downed only 1 system. Therefore, I'm surprised that it managed to disrupt all the systems. To me, that looks like a database design failure. Either all the different tables were created in 1 database file, or the different database files were all stored in 1 disk array. The failure of that disk array resulted in a big nightmare.
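To make the point concrete, here is a minimal sketch of the two layouts. The system and array names are hypothetical placeholders, not the actual VITA setup, which I have no knowledge of. It simply lists which systems lose their database files when one array fails.

```python
def systems_down(placement, failed_array):
    """Return the systems whose database files live on the failed array."""
    return sorted(
        system for system, array in placement.items() if array == failed_array
    )

# Hypothetical layout 1: five systems, all database files on one array.
shared = {f"system{i}": "array1" for i in range(1, 6)}

# Hypothetical layout 2: five systems, each with its files on its own array.
separate = {f"system{i}": f"array{i}" for i in range(1, 6)}

print(systems_down(shared, "array1"))    # every system is affected
print(systems_down(separate, "array1"))  # only the system on array1 is affected
```

With the shared layout the failure of array1 takes down all five systems. With the separate layout it takes down only one.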

SAN storage does promise much in terms of redundancy, availability, resiliency and maintainability, but as a technology person, I will not rely totally on it. A system must be built to cater for this kind of disaster, and it should not rely too heavily on the storage tier for its disaster recovery. For example, up to now I'm a little hesitant about Virtual Tape Library (VTL) technology for the simple reason that it uses hard disks. There is a certain lifespan attached to each hard disk in terms of the number of times it can be written to. How can I be assured that the hard disk will work when required? How can I be assured that when the backup is being done on the VTL, the source or target is not corrupted in the first place, rendering my backup useless? Of course tape will also fail at some point in time, but that's like 15 to 20 years away. By that time, I should have no need of that data.
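One way to get some of that assurance, whatever the backup medium, is to compare checksums after each backup. The sketch below is only an illustration under my own assumptions: the function names are made up, and it assumes the backup can be read back as an ordinary file. It tells you whether the backup copy matches the source. It cannot tell you whether the source was already corrupted before the backup ran, so for that you would compare against a checksum recorded when the data was known to be good.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_matches(source_path, backup_path):
    """Return True if the backup copy has the same contents as the source."""
    return sha256_of(source_path) == sha256_of(backup_path)
```

Running such a check right after the backup, and again before a restore, at least catches a target that has silently gone bad.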

Maybe it's my job to be skeptical, but knowing technology, I know of too many instances where things fail. The application design itself should cater for disaster recovery and, contrary to popular opinion, how the application server and database are installed matters.
