Avatar of saran_2006
Unexpected server restart

Our production SQL Server has been restarting on its own, always at about the same time of day.
In the last 5 months the server has restarted 5 times, every time between 4:10 a.m. and 4:20 a.m.
We have a backup job that runs during that time; it starts at 1:00 a.m. and finishes at about 7:00 a.m. on average.
I have found some suspicious entries in the Event Viewer:

Event ID:      19
Task Category: None
Level:         Warning
User:          LOCAL SERVICE
Computer:      SQLSRV001.************.com
A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Memory Controller Error
Log Name:      System
Source:        volsnap
Date:          1/30/2011 11:14:33 PM
Event ID:      24
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      SQLSRV001.*********.com
There was insufficient disk space on volume E: to grow the shadow copy storage for shadow copies of E:.  As a result of this failure all shadow copies of volume E: are at risk of being deleted.
-------------------------followed by----------------
Log Name:      System
Source:        volsnap
Date:          1/30/2011 11:15:29 PM
Event ID:      35
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQLSRV001.************.com
The shadow copies of volume E: were aborted because the shadow copy storage failed to grow .

Please go through them and suggest a suitable solution. Let me know if you have any questions.
Avatar of storkyIV


Who manufactures your server?
I had an almost identical issue with an HP server that randomly rebooted; it turned out that the firmware and BIOS needed to be updated.
Or it might be RAM or a RAM slot going bad (dust).

Avatar of saran_2006


Thanks for replying.
1. The brand name is Cybertron.
2. We have 6 RAM modules installed on the board. Is there any method to figure out which of the 6 is damaged?

Thanks again.
MemTest86: download it, burn it to a disc, and boot from it as you would from a Windows install disk.
If the failure always occurs within a minute or two of the same time every day, I would be looking for a process or application that starts (or ends) at that time.

The last two events you have posted occurred at about 11.15pm, so may not have much to do with the main event which happens 5 hours later - hmmm... almost exactly five hours later actually, so there might be a link between them after all. The first error doesn't have a timestamp included with it unfortunately, so it might be helpful to know when it happened, and if it is recorded more than once.

The second and third events suggest that one of your volumes is too small, so that wants looking at anyway.

What events, if any, are recorded in the event logs immediately before the restart?
Thanks once again.

The server is in production, so I need to wait until Saturday or Sunday.

The hard disk size is 1.5 TB and the free space is 450 GB. I know there is a restriction that only 15% can be used by VSS. Please guide me on how to increase it.

Event ID 19 occurred at
2/1/2011 2:18:16 AM, 1/31/2011 10:42:17 PM, 1/31/2011 9:33:17 PM and 1/31/2011 9:32:19 PM.

Did the five restarts in five months happen on the same day of each month? Or an exact number of days apart, for example every 28 days?

The event ID 19 entries do seem to point to a memory problem; as they refer to a correction it may be that a module has developed a fault and the memory controller is compensating for it, in which case a memory test may not show any problems because the memory controller is masking it. Have you looked in the event log in the server BIOS to see if anything is recorded there?

The hard disk is presumably partitioned into various volumes, one of which is the E: drive. It is this volume which is mentioned in the event log, so if your 450GB of free space (I assume that you mean unused file system capacity as opposed to raw unpartitioned and unformatted space) is on, say, the C: and D: partitions then it is not available for drive E:.

Are your disks MBR- or GPT-based?

First restart: 8/3/2010 at 4:10 a.m.; second restart: 8/25/2010 at 4:06 a.m.; third restart: 10/17/2010 at 4:07 a.m.; fourth restart: 10/29/2010 at 4:19 a.m.; and the last on 1/13/2011 at 4:10 a.m.

Can I check the BIOS logs without restarting the server?

Sorry for not making it clear.
The hard disk size is 2 TB.
The C: drive size is 100 GB.
The E: drive size is 1.90 TB and the used space is 1.30 TB.

Thanks again.
Mmm. Those restarts do seem a bit random in terms of the intervals between them. However, I can't help feeling that the time that they occur is significant; perhaps a particular file in a particular state causes the restart.
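The irregular spacing is easy to confirm with a little date arithmetic. The sketch below simply takes the five restart dates quoted above and computes the gap in days between consecutive restarts and the weekday of each; the variable names are my own and nothing here is specific to the server.

```python
from datetime import date

# Restart dates as reported in the thread (month/day/year).
restarts = [
    date(2010, 8, 3),
    date(2010, 8, 25),
    date(2010, 10, 17),
    date(2010, 10, 29),
    date(2011, 1, 13),
]

# Gap in days between each consecutive pair of restarts.
intervals = [(b - a).days for a, b in zip(restarts, restarts[1:])]

# Day of the week on which each restart fell.
weekdays = [d.strftime("%A") for d in restarts]

print(intervals)
print(weekdays)
```

The gaps come out as 22, 53, 12 and 76 days, and the five restarts fall on five different weekdays, so neither a fixed interval nor a weekly schedule fits.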

If there is a way of looking at the BIOS event logs from the OS I don't know of it. I believe that some of the higher-end big-name server manufacturers offer this kind of functionality via out-of-band management interfaces, but I suspect that your server doesn't have this level of features.

I've been pondering the symptoms that are presenting, and I'm wondering if you have more than one issue here, as the memory problem seems to me to be separate from any other issues. As you have six modules I presume that they are fitted either in pairs or banks of three. If memory test utilities don't identify the module, then perhaps the best way of pinpointing the faulty one is to remove a pair or trio of memory modules (the smallest number possible) and see if the error goes away. If it reappears within, say, four days then refit them and take out another bank, and so on until you've at least narrowed it down to the smallest number of modules. This isn't ideal, I know, but short of replacing the lot of them I don't see another way of doing it; additionally, if doing it this way doesn't eliminate the error at all, then possibly the memory controller or a socket is faulty and new memory wouldn’t fix the problem anyway.
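The elimination procedure described above can be written down as a short sketch, purely to make the logic explicit. It is a manual process, so `error_recurs_without` is a hypothetical stand-in for "pull that bank, run for the observation period, and check whether event ID 19 was logged again"; the bank labels are made up for illustration.

```python
from typing import Callable, Optional, Sequence


def find_faulty_bank(
    banks: Sequence[str],
    error_recurs_without: Callable[[str], bool],
) -> Optional[str]:
    """Return the bank whose removal stops the error, or None.

    error_recurs_without(bank) is a hypothetical callback representing the
    manual step: remove that bank, wait (about four days in the post above),
    and report whether the corrected-hardware-error event appeared again.
    """
    for bank in banks:
        if not error_recurs_without(bank):
            return bank  # error stopped with this bank out: suspect it
        # Error came back: refit this bank and try the next one.
    return None  # error persists regardless: suspect controller or socket


# Illustration only: pretend bank "B" holds the faulty module.
suspect = find_faulty_bank(["A", "B", "C"], lambda removed: removed != "B")
print(suspect)
```

A result of `None` corresponds to the last case in the paragraph above, where no bank removal clears the error and the memory controller or a socket becomes the suspect.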

Regarding the event IDs 24 and 35 I’ve found a couple of links which might shed some light on what is happening. The first applies to Windows Server 2003 and so may not be helpful, but the second gives some quite detailed information about configuring Volume Shadow Copy on Windows Server 2008. As the event IDs explicitly implicate VolSnap it seems reasonable to suppose that you are using it in connection with your backup strategy. One point that the article makes is that by default Shadow Copy makes two snapshots a day, at 7am and 12pm, so just as your backup is finishing Shadow Copy is trying to take a snapshot if you haven’t configured it to do something different; there is a potential, if not actual, conflict here.

I don’t run Server 2008 myself at present so I can’t give you any hands-on input, but the latter article highlights a few considerations to bear in mind when setting up Shadow Copy. Forgive me if you already know all this, but I’m at the extreme edges of my own knowledge of such things and can only make general suggestions along the lines of how I would proceed if I was in your situation.
I'm still trying to figure out the workflow for Backup Exec. In my case it's just copying a folder full of .BAK files, so I'm not sure whether VSS is required.
Also, AOFO is enabled in my Backup Exec.
And again the server restarted at 4:18 a.m. today. The dump report shows sqlservr.exe as the reason. Any ideas, guys?
What other events are going on at around this time? Not just errors, but normal entries, and not just in the Applications log. Could you post the dump?
Do you want the dump itself or the report alone?
The report might be sufficient. How big is the dump?

Do the other event logs show anything happening at that time?
The size of the full memory dump is 1.1 GB.

No other process starts or ends at that time.
Erm, I'll just have the report, thanks!

I've had a quick look at it (it's 11:30pm here, so I'll have a longer look at it tomorrow); your server is turning in a STOP 0x7F (UNEXPECTED_KERNEL_MODE_TRAP) error, and more specifically a Double Fault, as described here:

It looks increasingly likely that bad RAM is at the root of the problem, especially as problems with it have already been flagged up, though you should look at the other causes listed in the article as well in case any of them fits with your circumstances.
Check with your server vendor to see if they know of this issue and can advise you; there may be BIOS and/or firmware updates available that might resolve the problem. Also check that your hardware drivers are the latest available and are (preferably) WHQL certified.
I've just realised that the MS KB refers to Windows Server 2000 and XP; however, the basic principles still seem to be relevant.

There may be a Windows Server 2008-specific article on the subject, but I'm not looking for one until tomorrow sometime...
One other question: How much free space do you have on your C: drive, which, I presume, is the system volume for the server?
Free space: 65 GB
Total: 100 GB
How much RAM does the server have? What size is the swap file?
RAM: 12 GB
Page file size: 12 GB
Avatar of Perarduaadastra
Thanks for the reply.
I can't confirm the size of the paging file; it displays two different values in two different places.
Also, physical memory usage is flat at 98%. Will this be a problem? Please see the pictures for more info.

Thanks again. [Two screenshots attached]
Now we're getting somewhere.

Your page file is very nearly maxed out - if you look at the Virtual Memory screenshot, you will see that the current page file allocation is the same as the total page file size for all drives - not good!

Furthermore, the system recommendation is for almost half as much again, at 18417MB. It's apparent that leaving the system to manage the page file isn't working.

I would suggest unticking the Automatically Manage Paging File Size for all drives checkbox, selecting Custom Size, setting the Initial Size to 13312MB, and the Maximum Size to 24576MB. Be aware that making this change will almost certainly require a reboot to take effect.
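For reference, here is how those figures relate to each other. This is only arithmetic on the numbers already quoted in the thread (12 GB of RAM and page file, the 18417MB recommendation, and the suggested 13312MB and 24576MB settings); the variable names are mine.

```python
MB_PER_GB = 1024

current_mb = 12 * MB_PER_GB   # 12 GB page file, as reported earlier
recommended_mb = 18417        # system recommendation from the screenshot
initial_mb = 13312            # suggested Initial Size
maximum_mb = 24576            # suggested Maximum Size

initial_gb = initial_mb / MB_PER_GB
maximum_gb = maximum_mb / MB_PER_GB
# How much larger the recommendation is than the current allocation.
ratio = round(recommended_mb / current_mb, 2)

print(initial_gb, maximum_gb, ratio)
```

So the suggested range is 13 GB to 24 GB, and the system's own recommendation is about 1.5 times the current 12 GB allocation.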

This will certainly relieve the page file congestion, but it would be helpful to know why the page file is so heavily used; there may be a problem with an application failing to return virtual memory to the pool when it's finished with it.

Once you've made the changes, keep an eye on page file usage; if it shoots up to 98% of the new larger allocation then there is definitely a problem with one or more running processes that needs to be fixed.
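If you note down the page file usage at intervals, a few lines of script will pick out the readings that hit the danger zone. This is a minimal sketch: the sample values are invented for illustration, and how you collect them is up to you (as far as I know, Performance Monitor exposes a "% Usage" counter under the Paging File object, but treat that as something to verify on your own server).

```python
def over_threshold(samples, limit_pct=98.0):
    """Return the (label, pct) readings at or above limit_pct."""
    return [(label, pct) for label, pct in samples if pct >= limit_pct]


# Hypothetical readings for illustration, not measurements from this server.
samples = [
    ("01:00", 41.0),
    ("02:00", 63.5),
    ("03:00", 88.0),
    ("04:00", 98.4),
]

flagged = over_threshold(samples)
print(flagged)
```

With these made-up readings only the 04:00 sample is flagged; on the real server, the interesting question is whether anything is flagged at all once the larger page file is in place.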

Is there a way to find out why the page file is so heavily used?
I'm sure that there are tools that can monitor the resources used by Windows processes and applications, but I've been fortunate in that I haven't needed to use them on the servers that I'm responsible for, perhaps because said servers are all quite a bit older than yours! I notice on your screenshot that there is a Resource Monitor button on the Performance tab of the Task Manager - this might be a good place to start.

What apps is the server running? It may be that it's simply being asked to do too much. How many users are connected? What is SQL supporting? What is the volume of data that is being moved around the network?

The restarts may be due to the backup job using up the available memory (both physical and virtual) until it runs out, and it takes until between 4:10 and 4:20am for this point to be reached; although it's a scheduled job the amount of data will vary, and occasionally it exceeds the threshold that the system can cope with. When the restarts occur, do they tend to be after a particularly busy working day?
To check whether that's the problem, I have rescheduled the backup job to start one hour earlier. Let us see what happens next.
That may be some help, but I understood from your earlier comments that the spontaneous reboot only happens now and again; this is why I wondered if there was a backup size threshold above which the system was overloaded and fell over as a result.

Perhaps a better test would be to deliberately back up say, 10% more data than would be usual even at peak periods. If this approach produced a restart every time you tried it, it would be reasonable to deduce that the increased amount of data was triggering the event.

You said earlier that you were using Backup Exec; do you have the latest version? It is possible that older versions may not work as well as they could with newer OSes.

Have you tried increasing the swap file size and monitoring its usage?
I have increased the paging file size to 20 GB.
The Backup Exec 2010 version is 13.0.
I checked the backup job history and found that the restarts occurred in the middle of the job: a normal full backup is about 230 GB, whereas the backups interrupted by a restart failed after 110 GB or so.
The Performance and Reliability tool has a page file monitor, but it doesn't offer many options; it just shows the total usage.
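As a rough cross-check of those job history figures against the restart time, the sketch below estimates when a job starting at 1:00 a.m. would reach 110 GB. It assumes a constant throughput over the whole six-hour window, which is almost certainly a simplification, and the date in `start` is arbitrary because only the time of day matters.

```python
from datetime import datetime, timedelta

total_gb = 230       # typical full backup size
window_hours = 6     # 1:00 a.m. start, finishing at about 7:00 a.m.
failed_at_gb = 110   # data backed up when the restart hit

# Assumption: throughput is constant for the whole job.
rate_gb_per_hour = total_gb / window_hours
hours_to_failure = failed_at_gb / rate_gb_per_hour

start = datetime(2011, 1, 1, 1, 0)  # arbitrary date; only the time matters
eta = start + timedelta(hours=hours_to_failure)

print(round(rate_gb_per_hour, 1), round(hours_to_failure, 2), eta.strftime("%H:%M"))
```

Under that assumption the job reaches 110 GB a little under three hours in, at roughly 3:52 a.m., which is in the same region as the restart times reported earlier in the thread.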

If you remote into the server from time to time during the backup, you should be able to get an idea of swap file usage as the job progresses.