Page Life Expectancy sudden drop

For a while now we have been experiencing sudden PLE drops on our production servers. The value is usually normal, over 4K, but several times a day it suddenly drops and stays low for minutes before it grows back. These drops are accompanied by spikes in wait time for that period and, obviously, by lag experienced by users. The wait time is caused by high I/O activity from storage-to-memory transfers.

We can rule out missing indexes, fragmentation, and poor queries; we know how to deal with those problems. I understand that heavy usage at times can cause this, but the drops don't necessarily follow that pattern. They can happen during regular usage, during the lunch period when fewer users are normally active, or even at night.
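
One way to check whether the drops really are unrelated to usage is to sample PLE at a fixed interval and flag every point where the value collapses relative to the previous sample, then compare those timestamps with job schedules and user activity. Below is a minimal Python sketch, assuming the samples have already been collected (for example from the `Page life expectancy` performance counter) as (timestamp, seconds) pairs. The function name, the 50% threshold, and the sample data are all hypothetical.

```python
from datetime import datetime

def find_ple_drops(samples, drop_ratio=0.5):
    """Return (timestamp, previous, current) for each sudden PLE drop.

    samples: list of (datetime, ple_seconds) pairs, ordered by time.
    drop_ratio: hypothetical threshold; a sample counts as a drop when it
    falls below drop_ratio * the previous sample.
    """
    drops = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if prev > 0 and cur < prev * drop_ratio:
            drops.append((ts, prev, cur))
    return drops

# Made-up one-minute samples, for illustration only.
samples = [
    (datetime(2015, 1, 5, 12, 0), 4200),
    (datetime(2015, 1, 5, 12, 1), 4260),
    (datetime(2015, 1, 5, 12, 2), 35),
    (datetime(2015, 1, 5, 12, 3), 95),
    (datetime(2015, 1, 5, 12, 4), 155),
]

for ts, prev, cur in find_ple_drops(samples):
    print(f"{ts:%Y-%m-%d %H:%M} PLE fell from {prev} to {cur}")
```

Only the first sample after the collapse is flagged; the slow climb back afterwards is not reported, which keeps the output to one line per incident.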

One thing we found on the net is that this may be a known SQL 2012 SP1 issue, which is supposed to be fixed by SP1 CU4:

except that in our case it happens multiple times a day. We have already scheduled the upgrade to SP2, but I thought I should ask here as well, in case someone can shed some light on the matter.

Thank you.

Some pictures here:

PLE drop:
Wait time spike at 10-minute intervals (sometimes the growth can be dramatic, to over 20K):
Rich Weissler (Professional Troublemaker^h^h^h^h^hshooter) commented:
A few thoughts:
1. After an occurrence, check for queries with high I/O rather than long run times. (I don't have particularly high hopes for that, though... I wouldn't expect that to drop PLE to 0.)
2. You don't have any reindex or statistics jobs running during production hours, do you?
3. You don't have any developers or junior DBAs on your production system who would dump your cache to 'test something'?
4. (I have to admit, though... we abandoned SP1 on SQL 2012 as quickly as we could because of the registry bloat problem... so I can't say for certain whether it's a flaw in the older service pack version...)
Zberteoc (Author) commented:
Thanks, Rich!

One thing I forgot to mention is that we have a Windows Failover Cluster configuration with 2 SQL Server nodes running AlwaysOn. The 2 nodes are virtual machines on different hosts. We use VMware for virtualization.
Rich Weissler (Professional Troublemaker^h^h^h^h^hshooter) commented:
*nod*  Those considerations add a few more quick 'confirming assumptions':
I'm assuming a failover isn't occurring.  (I assume not, 'cause that would produce lots of other red flags.)
I'm assuming the servers aren't getting into a balloon memory condition when the problem occurs.  (That's easy enough to confirm in vCenter.)
I'm also assuming there aren't any messages in the SQL ERRORLOG when the cache is turned over?
Zberteoc (Author) commented:
Your assumptions are correct. :)
Zberteoc (Author) commented:
Actually there is something, messages that say:

"AppDomain 1637 (db_name.dbo[runtime].2034) is marked for unload due to memory pressure."

They happen quite often.
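
To see whether these messages line up with the PLE drops, it may help to count them per hour and compare the busy hours with the PLE graph. Below is a minimal Python sketch. The regular expression is built from the message quoted above; the assumption that every ERRORLOG line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, the function name, and the sample lines are all hypothetical and should be checked against a real log.

```python
import re
from collections import Counter

# Pattern taken from the message text quoted above.
UNLOAD_RE = re.compile(
    r"AppDomain (\d+) \((.+?)\) is marked for unload due to memory pressure"
)

def count_unloads_per_hour(lines):
    """Count AppDomain unload messages per 'YYYY-MM-DD HH' bucket."""
    counts = Counter()
    for line in lines:
        if UNLOAD_RE.search(line):
            # Assumption: the line begins with 'YYYY-MM-DD HH:MM:SS'.
            counts[line[:13]] += 1
    return counts

# Made-up log lines, for illustration only.
sample_log = [
    "2015-01-05 12:02:11.45 spid21s     AppDomain 1637 (db_name.dbo[runtime].2034) is marked for unload due to memory pressure.",
    "2015-01-05 12:40:03.10 spid21s     AppDomain 1638 (db_name.dbo[runtime].2035) is marked for unload due to memory pressure.",
    "2015-01-05 13:15:00.00 spid7s      Some unrelated message.",
    "2015-01-05 14:05:59.87 spid21s     AppDomain 1639 (db_name.dbo[runtime].2036) is marked for unload due to memory pressure.",
]

for hour, n in sorted(count_unloads_per_hour(sample_log).items()):
    print(hour, n)
```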

The SQL nodes have 64 GB of memory, of which 55 GB is allocated to SQL Server.
Rich Weissler (Professional Troublemaker^h^h^h^h^hshooter) commented:
Confirm that VMware vCenter reports balloon memory as zero (you should be able to check the performance graphs to look over past performance) and that the memory limit isn't set lower than the configured memory.

I would normally _assume_ 9 GB of RAM is sufficient for everything else... but I also believe the memory pressure is in that 9 GB... I think the message you see in the log comes from something like a CLR component getting squeezed out of memory.  I think that message is a symptom of your problem, which is memory pressure, rather than the problem itself.  (That is, unless you have a memory dump elsewhere in the log... in which case, that would be the problem.)
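
The checks and the arithmetic above can be written down as a small sanity check. This is only a sketch in Python: the function and parameter names are hypothetical, and every input is a number read by hand from vCenter and from the SQL Server configuration; nothing here queries either system.

```python
def check_vm_memory(configured_gb, sql_max_gb, balloon_mb=0, limit_gb=None):
    """Return (headroom_gb, warnings) for a SQL Server VM.

    configured_gb: memory configured for the VM.
    sql_max_gb:    memory allocated to SQL Server.
    balloon_mb:    balloon memory reported for the VM (expected to be 0).
    limit_gb:      VM memory limit, or None if no limit is set.
    """
    warnings = []
    if balloon_mb > 0:
        warnings.append(f"balloon memory is {balloon_mb} MB, expected 0")
    if limit_gb is not None and limit_gb < configured_gb:
        warnings.append(
            f"memory limit ({limit_gb} GB) is below "
            f"configured memory ({configured_gb} GB)"
        )
    # What is left for the OS, CLR, and everything else outside SQL Server.
    headroom_gb = configured_gb - sql_max_gb
    return headroom_gb, warnings

# Figures from this thread: 64 GB configured, 55 GB allocated to SQL Server.
headroom, warnings = check_vm_memory(configured_gb=64, sql_max_gb=55)
print(headroom, warnings)
```

With the figures from the thread, the headroom works out to the 9 GB mentioned above; any ballooning or a limit below 64 GB would shrink what the guest actually has available.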

But I'd still proceed with installing SQL 2012 SP2 when you can.  It appears to include more than one memory fix.

Microsoft SQL Server 2008