troubleshooting Question

We've loss SQL data and I must prevent this from happening again

Avatar of jhhaley
jhhaley asked on
DatabasesStorage Software
14 Comments1 Solution676 ViewsLast Modified:
I hope you can follow through this. Thanks ahead of time for your input and suggestions.
You may have proposals I haven't even consider.

In the spring of 2005 our company installed a Xiotech SAN. We moved our SQL 2000 Enterprise Production SQL database to the SAN. The windows server connected to the SAN was a Windows 2003 Advanced Server. From the spring of 2005 to December 2005 we had zero issues and we absolutely loved it. If we needed to attach a new server it required little down time, attaching the new server to the database required less than 10 minutes. Performance was outstanding. However, I guess we were lull into a sense of false security.  

Here is what happened that opened my eyes to some of the real potentials for data corruption if extreme care is not exercised and the right procedures are not in place. All of the events occurred mid January 2006 through Feb 6 2006. Let me first provide the following two items. All the systems listed below are running Advanced Windows 2003 and SQL 2000 enterprise.

1) Our Network Admin changed the SAN drives, housing our SQL database, Tlogs, and backups, from basic to dynamic. His thinking was that this would allow on the fly expansion of the database residing on the SAN. He knew Windows 2003 Advanced Server was advertised as supporting dynamic drives so he implemented the change on the SAN.  He did not check with Xiotech, nor did he tell the DBA. He is normally very thorough, but for the last six months we've been working around the clock as we integrated new acquisitions, as well as financial, payroll, inventory and other systems into our company. He did this in late December.

2) A backup SQL Server was setup and attached to the SAN with an incorrect Qlogic HBA driver. On the surface everything looked ok, it connected the SAN fine. The server was ready as a cold spare in case there was a failure on the primary. We had elected not to go with a cluster until we were able to implement an active/active passive cluster.

Now to tell you what happened. Hopefully you get the picture.

Strange behavior started occurring in mid January. The SQL Sever begin detaching itself, for no apparent reason, from the database residing on the SAN. We also had brief spells in which our datacenter would lose connectivity from the database. This behavior started about the same time we installed the latest service pack for SQL Server 2000, this also coincided with the change from basic to dynamic drives on the SAN. Initially the network team pointed toward the service pack as the cause, even though they couldn't understand why it would cause an issue. The service pack was also not first installed on a test SQL server. The Network Admin did not tell anyone about the change he made on the drives. The SAN had been operating flawlessly and it couldn't be A SAN ISSUE.

When the server was initially rebooted, after installing the service pack, it kept rebooting for no apparent reason. Since we could not back out of the Sql Server service pack we elected to change out the SQL server. The 2nd Server (#2 above) was setup with an older Sql Server 2000 service pack and it seem to be stable. The following day the Sql Server crashed coming up with an 823 error (torn page). Corruption occurred across 4 or 22 databases on the SAN. No corruption occurred in the transaction logs. The most we lost in any database was 30 minutes, when we restored the transaction logs. We immediately called Xiotech. When we told them the driver version, they immediately pointed toward that as the issue since it was not even the correct driver for the installed card. The driver was updated and once again it seem stable. The question about the type drive (dynamic versus basic) being used on the SAN never came up.  

The following day we had another failure. That afternoon, late we had corruption across 10 of the 22 databases, loss 5 transaction logs and over two hours worth of data in three of our databases. Again it was an 823 error torn page error. I then had to point toward the SAN and the immediate response, from the Network team was it's not a SAN ISSUE. We immediately logged onto the Xiotech site and there were articles from Sept 2005 noting that you must use basic drives and not the dynamic drives.

We then created new basic drives on the SAN and copied the data from the dynamic drives to the basic drives. However, when we attempted to attach the drives from the Sql Server to the SAN, the SQL Server still considered them as dynamic drives. We had to initiate a format erasing the data that was already copied, before the Sql server would attach to the new basic drives on the SAN. For times sake we elected a low level format. We wanted to get the SQL Server back online quickly. After the format we were able to attach from the SQL Server to the newly setup basic drives. The SQL Server then recognized them as basic drives. Another copy was then initiated.

The next three days the datacenter still had time frames in which they were mysteriously disconnected from the database for brief period of times. We also had two transaction logs that were corrupted. For fear of corruption we moved all transaction logs and backups off the SAN. They are also stored on tape. The event log was indicating disk errors and the need to run CHKDSK.

I initiated a call to Xiotech. Within 30 minutes these guys were scrambling to help solve our problem. As it turns our when we did the low level format across the drives it left remnants of the dynamic headers. After following the correct procedures for creating a basic drive, formatting, attaching to the drive and copying the data we've not had any issues.

It's now back to same old LAZY routine and I don't think I want to return to this routine.  

We were very fortunate...we were able to recover from the loss of data, but it was extremely embarrassing and could have been REAL BAD. Thankfully I don't think it was apparent to our customers.  

Here is my question: What procedures would you implement to prevent the situation, described above, from happening again? Your responses may trigger other questions from me.
ASKER CERTIFIED SOLUTION
Join our community to see this answer!
Unlock 1 Answer and 14 Comments.
Start Free Trial
Learn from the best

Network and collaborate with thousands of CTOs, CISOs, and IT Pros rooting for you and your success.

Andrew Hancock - VMware vExpert
See if this solution works for you by signing up for a 7 day free trial.
Unlock 1 Answer and 14 Comments.
Try for 7 days

”The time we save is the biggest benefit of E-E to our team. What could take multiple guys 2 hours or more each to find is accessed in around 15 minutes on Experts Exchange.

-Mike Kapnisakis, Warner Bros