Solved

Unexpected server restart

Posted on 2011-02-16
Last Modified: 2012-05-11
Our production SQL Server has been restarting on its own at regular intervals, always at about the same time of day.
In the last 5 months the server has restarted 5 times, every time between 4:10 a.m. and 4:20 a.m.
A backup job runs during that window; it starts at 1:00 a.m. and finishes at 7:00 a.m. on average.
I found some suspicious entries in the Event Viewer:




Event ID:      19
Task Category: None
Level:         Warning
Keywords:      
User:          LOCAL SERVICE
Computer:      SQLSRV001.************.com
Description:
A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Memory Controller Error
-------------------------------------
Log Name:      System
Source:        volsnap
Date:          1/30/2011 11:14:33 PM
Event ID:      24
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      SQLSRV001.*********.com
Description:
There was insufficient disk space on volume E: to grow the shadow copy storage for shadow copies of E:.  As a result of this failure all shadow copies of volume E: are at risk of being deleted.
-------------------------followed by----------------
Log Name:      System
Source:        volsnap
Date:          1/30/2011 11:15:29 PM
Event ID:      35
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQLSRV001.************.com
Description:
The shadow copies of volume E: were aborted because the shadow copy storage failed to grow .


Please go through these and suggest a suitable solution. Let me know if you have any questions.
Question by:saran_2006
 
LVL 2

Expert Comment

by:storkyIV
Comment Utility
Hello,

Who manufactures your server?
I had an almost identical issue with an HP server that randomly rebooted; it turned out that the firmware and BIOS needed to be updated.
0
 
LVL 14

Expert Comment

by:JAN PAKULA
Comment Utility
Or it might be RAM or a RAM slot going bad (dust).

Jan ICT TECH MA CCNA
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks for replying.
1. The brand name is Cybertron.
2. We have 6 RAM modules installed on the board. Is there any way to figure out which of the 6 is damaged?

Thanks again.
0
 
LVL 14

Expert Comment

by:JAN PAKULA
Comment Utility
Memtest86: download it, burn it to a disc, and boot from it like a Windows install disk.
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
If the failure always occurs within a minute or two of the same time every day, I would be looking for a process or application that starts (or ends) at that time.

The last two events you have posted occurred at about 11.15pm, so may not have much to do with the main event which happens 5 hours later - hmmm... almost exactly five hours later actually, so there might be a link between them after all. The first error doesn't have a timestamp included with it unfortunately, so it might be helpful to know when it happened, and if it is recorded more than once.

The second and third events suggest that one of your volumes is too small, so that wants looking at anyway.

What events, if any, are recorded in the event logs immediately before the restart?
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks once again.

The server is in production, so I need to wait until Saturday or Sunday.

The hard disk size is 1.5 TB and the free space is 450 GB. I understand there is a restriction that only 15% can be used by VSS. Please guide me on how to increase it.

Event ID 19 occurred at:
2/1/2011 2:18:16 AM, 1/31/2011 10:42:17 PM, 1/31/2011 9:33:17 PM and 1/31/2011 9:32:19 PM.

Thanks,
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
Did the five restarts in five months happen on the same day of each month? Or an exact number of days apart, for example every 28 days?

The event ID 19 entries do seem to point to a memory problem; as they refer to a correction it may be that a module has developed a fault and the memory controller is compensating for it, in which case a memory test may not show any problems because the memory controller is masking it. Have you looked in the event log in the server BIOS to see if anything is recorded there?

The hard disk is presumably partitioned into various volumes, one of which is the E: drive. It is this volume which is mentioned in the event log, so if your 450GB of free space (I assume that you mean unused file system capacity as opposed to raw unpartitioned and unformatted space) is on, say, the C: and D: partitions then it is not available for drive E:.

Are your disks MBR- or GPT-based?
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,

First restart: 8/3/2010 at 4:10 a.m.
Second restart: 8/25/2010 at 4:06 a.m.
Third restart: 10/17/2010 at 4:07 a.m.
Fourth restart: 10/29/2010 at 4:19 a.m.
Fifth (latest) restart: 1/13/2011 at 4:10 a.m.

Can I check the BIOS logs without restarting the server?

Sorry for not making it clear,
The hard disk size is 2 TB.
C Drive size is 100 GB
E Drive size is 1.90 TB and used space is 1.30 TB .

thanks again,
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
Mmm. Those restarts do seem a bit random in terms of the intervals between them. However, I can't help feeling that the time that they occur is significant; perhaps a particular file in a particular state causes the restart.
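
The irregular spacing can be checked directly from the restart dates posted above. A minimal Python sketch, assuming the dates are written month/day/year (which 8/25/2010 implies); the function and variable names are just for illustration:

```python
from datetime import date

# Restart dates as posted in the thread (month/day/year).
restarts = [
    date(2010, 8, 3),
    date(2010, 8, 25),
    date(2010, 10, 17),
    date(2010, 10, 29),
    date(2011, 1, 13),
]

def gaps_in_days(dates):
    """Return the number of days between each pair of consecutive dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

intervals = gaps_in_days(restarts)
# Weekday names are printed too, so any day-of-week pattern would show up.
weekdays = [d.strftime("%A") for d in restarts]
print(intervals)
print(weekdays)
```

The gaps come out as 22, 53, 12 and 76 days, so there is no fixed period between restarts; only the time of day is consistent.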

If there is a way of looking at the BIOS event logs from the OS I don't know of it. I believe that some of the higher-end big-name server manufacturers offer this kind of functionality via out-of-band management interfaces, but I suspect that your server doesn't have this level of features.

I've been pondering the symptoms that are presenting, and I'm wondering if you have more than one issue here, as the memory problem seems to me to be separate from any other issues. As you have six modules I presume that they are fitted either in pairs or banks of three. If memory test utilities don't identify the module, then perhaps the best way of pinpointing the faulty one is to remove a pair or trio of memory modules (the smallest number possible) and see if the error goes away. If it reappears within, say, four days then refit them and take out another bank, and so on until you've at least narrowed it down to the smallest number of modules. This isn't ideal, I know, but short of replacing the lot of them I don't see another way of doing it; additionally, if doing it this way doesn't eliminate the error at all, then possibly the memory controller or a socket is faulty and new memory wouldn’t fix the problem anyway.
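
The elimination procedure is manual, but its logic can be sketched in a few lines of Python. This is only an illustration: the slot labels, the pairing into three banks, and the `error_recurs_without` callback (which in real life is a person watching the event log for a few days) are all assumptions, not details of your server.

```python
def find_suspect_bank(banks, error_recurs_without):
    """Remove one bank at a time; return the bank whose removal stops the error.

    banks: list of tuples of module labels (hypothetical slot names).
    error_recurs_without: callable taking the removed bank and returning True
        if Event ID 19 still appears during the observation period.
    Returns None if the error persists whichever bank is removed, which would
    point at the memory controller or a socket rather than a module.
    """
    for bank in banks:
        if not error_recurs_without(bank):
            return bank
    return None

# Illustration only: six modules in three pairs, with "B2" pretending to be faulty.
banks = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
suspect = find_suspect_bank(banks, lambda removed: "B2" not in removed)
print(suspect)
```

With pairs, this narrows the fault to two modules in at most three rounds; swapping a single module within the suspect pair would be the obvious next step.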

Regarding the event IDs 24 and 35 I’ve found a couple of links which might shed some light on what is happening. The first applies to Windows Server 2003 and so may not be helpful, but the second gives some quite detailed information about configuring Volume Shadow Copy on Windows Server 2008. As the event IDs explicitly implicate VolSnap it seems reasonable to suppose that you are using it in connection with your backup strategy. One point that the article makes is that by default Shadow Copy makes two snapshots a day, at 7am and 12pm, so just as your backup is finishing Shadow Copy is trying to take a snapshot if you haven’t configured it to do something different; there is a potential, if not actual, conflict here.

http://support.microsoft.com/kb/312067

http://www.techotopia.com/index.php/Configuring_Volume_Shadow_Copy_on_Windows_Server_2008

I don’t run Server 2008 myself at present so I can’t give you any hands-on input, but the latter article highlights a few considerations to bear in mind when setting up Shadow Copy. Forgive me if you already know all this, but I’m at the extreme edges of my own knowledge of such things and can only make general suggestions along the lines of how I would proceed if I was in your situation.
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,
I'm still trying to figure out the workflow for Backup Exec. In my case it's just copying a folder full of .BAK files, so I'm not sure whether VSS is required.
Also, AOFO (Advanced Open File Option) is enabled in my Backup Exec.
0
 

Author Comment

by:saran_2006
Comment Utility
The server restarted again at 4:18 a.m. today. The dump report shows sqlservr.exe as the cause. Any ideas, guys?
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
What other events are going on at around this time? Not just errors, but normal entries, and not just in the Applications log. Could you post the dump?
0
 

Author Comment

by:saran_2006
Comment Utility
Do you want the dump itself, or just the report?
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
The report might be sufficient. How big is the dump?

Do the other event logs show anything happening at that time?
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,
The full memory dump is 1.1 GB. The report is attached:
sqlsrv001-2522011-dumpanalysis.docx

No other process starts or ends at that time.
0

 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
Erm, I'll just have the report, thanks!

I've had a quick look at it (it's 11:30pm here, so I'll have a longer look at it tomorrow); your server is turning in a STOP 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) error, and more specifically a Double Fault, as described here:

http://support.microsoft.com/kb/137539

It looks increasingly likely that bad RAM is at the root of the problem, especially as problems with it have already been flagged up, though you should look at the other causes listed in the article as well in case any of them fits with your circumstances.
Check with your server vendor to see if they know of this issue and can advise you; there may be BIOS and/or firmware updates available that might resolve the problem. Also check that your hardware drivers are the latest available and are (preferably) WHQL certified.
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
I've just realised that the MS KB refers to Windows 2000 and XP; however, the basic principles still seem to be relevant.

There may be a Windows Server 2008-specific article on the subject, but I'm not looking for one until tomorrow sometime...
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
One other question: How much free space do you have on your C: drive, which, I presume, is the system volume for the server?
0
 

Author Comment

by:saran_2006
Comment Utility
Free space = 65 GB ,
Total 100GB.
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
How much RAM does the server have? What size is the swap file?
0
 

Author Comment

by:saran_2006
Comment Utility
RAM :12 GB
Page File Size : 12 GB
0
 
LVL 15

Accepted Solution

by:
Perarduaadastra earned 500 total points
Comment Utility
The swap file size seems a bit small, particularly as you are running SQL on the server as well.

See this KB for calculating the optimum size for your situation:

http://support.microsoft.com/kb/889654

Further advice is given in this EE question:

http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Windows_Server_2008/Q_23586571.html

which references the same KB.

If you have to reboot the server when changing the swap file size, that would be a good time to check the server BIOS event logs for memory errors...

Hope this helps.

0
 

Author Comment

by:saran_2006
Comment Utility
Thanks for the reply,
I can't confirm the size of the paging file; it shows two different values in two different places.
Also, physical memory usage is flat at 98%. Will this be a problem? Please see the screenshots for more info.

Thanks again.
[Screenshots: paging file settings, Task Manager]
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
Now we're getting somewhere.

Your page file is very nearly maxed out - if you look at the Virtual Memory screenshot, you will see that the current page file allocation is the same as the total page file size for all drives - not good!

Furthermore, the system recommendation is for almost half as much again, at 18417MB. It's apparent that leaving the system to manage the page file isn't working.

I would suggest unticking the Automatically Manage Paging File Size for all drives checkbox, selecting Custom Size, setting the Initial Size to 13312MB, and the Maximum Size to 24576MB. Be aware that making this change will almost certainly require a reboot to take effect.
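
As a sanity check on those figures, here is a short Python sketch of the arithmetic. It assumes binary units (1 GB = 1024 MB) and uses only numbers already quoted in this thread:

```python
MB_PER_GB = 1024

ram_mb = 12 * MB_PER_GB   # 12 GB of RAM, as reported earlier in the thread
recommended_mb = 18417    # system recommendation shown in the screenshot
initial_mb = 13312        # suggested Initial Size
maximum_mb = 24576        # suggested Maximum Size

def to_gb(megabytes):
    """Convert megabytes to gigabytes (binary units, 1 GB = 1024 MB)."""
    return megabytes / MB_PER_GB

print(f"recommended / RAM = {recommended_mb / ram_mb:.2f}")
print(f"initial = {to_gb(initial_mb):.0f} GB, maximum = {to_gb(maximum_mb):.0f} GB")
```

So the recommendation is roughly 1.5 times the installed RAM, and the suggested custom range runs from 13 GB to 24 GB, i.e. from just above the current size to twice the RAM.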

This will certainly relieve the page file congestion, but it would be helpful to know why the page file is so heavily used; there may be a problem with an application failing to return virtual memory to the pool when it's finished with it.

Once you've made the changes, keep an eye on page file usage; if it shoots up to 98% of the new larger allocation then there is definitely a problem with one or more running processes that needs to be fixed.

0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,
Is there a way to find out why the page file is so heavily used?
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
I'm sure that there are tools that can monitor the resources used by Windows processes and applications, but I've been fortunate in that I haven't needed to use them on the servers that I'm responsible for, perhaps because said servers are all quite a bit older than yours! I notice on your screenshot that there is a Resource Monitor button on the Performance tab of the Task Manager - this might be a good place to start.

What apps is the server running? It may be that it's simply being asked to do too much. How many users are connected? What is SQL supporting? What is the volume of data that is being moved around the network?

The restarts may be due to the backup job using up the available memory (both physical and virtual) until it runs out, and it takes until between 4:10-4:20am for this point to be reached; although it's a scheduled job the amount of data will vary, and occasionally it exceeds the threshold that the system can cope with. When the restarts occur, do they tend to be after a particularly busy working day?
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,
To check whether that's the problem, I have rescheduled the backup job to start one hour earlier. Let's see what happens next.
0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
That may be some help, but I understood from your earlier comments that the spontaneous reboot only happens now and again; this is why I wondered if there was a backup size threshold above which the system was overloaded and fell over as a result.

Perhaps a better test would be to deliberately back up say, 10% more data than would be usual even at peak periods. If this approach produced a restart every time you tried it, it would be reasonable to deduce that the increased amount of data was triggering the event.

You said earlier that you were using Backup Exec; do you have the latest version? It is possible that older versions may not work as well as they could with newer OSes.

Have you tried increasing the swap file size and monitoring its usage?
0
 

Author Comment

by:saran_2006
Comment Utility
Thanks,
I have increased the paging file size to 20 GB.
The Backup Exec 2010 version is 13.0.
I checked the backup job history and found that the restarts occurred in the middle of the job: a full backup normally comes to about 230 GB, whereas the backups interrupted by a restart failed after 110 GB or so.
The Reliability and Performance Monitor tool has a page file monitor, but it doesn't offer many options; it just shows the total usage.


0
 
LVL 15

Expert Comment

by:Perarduaadastra
Comment Utility
If you remote into the server from time to time during the backup, you should be able to get an idea of swap file usage as the job progresses.
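
One way to avoid watching by hand is to log the counter and review it afterwards. The sketch below assumes the built-in `typeperf` tool has been used to sample the `\Paging File(_Total)\% Usage` counter to a CSV file during the backup, for example with `typeperf "\Paging File(_Total)\% Usage" -si 60 -o pagefile.csv`. The file name, the 90% threshold and the sample rows are made up for illustration; they are not data from this server.

```python
import csv
import io

def samples_over(csv_text, threshold_percent):
    """Return (timestamp, percent) pairs at or above the threshold.

    Expects typeperf-style CSV: a header row, then rows of
    "timestamp","value". Rows with a blank or non-numeric value are skipped.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    hits = []
    for row in rows:
        if len(row) < 2:
            continue
        try:
            percent = float(row[1])
        except ValueError:
            continue
        if percent >= threshold_percent:
            hits.append((row[0], percent))
    return hits

# Made-up sample in the shape typeperf writes; not real data from this server.
sample = (
    '"(PDH-CSV 4.0)","\\\\SQLSRV001\\Paging File(_Total)\\% Usage"\n'
    '"02/26/2011 01:00:00.000","41.20"\n'
    '"02/26/2011 03:00:00.000","88.75"\n'
    '"02/26/2011 04:00:00.000","97.90"\n'
)
for stamp, percent in samples_over(sample, 90.0):
    print(stamp, percent)
```

If the logged usage climbs steadily from the start of the backup and peaks shortly before a restart, that would support the idea that the job is exhausting virtual memory.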
0
