Memory consumption on Solaris-10

Dip Sh
Hello,
We have two Solaris-10 x86 servers, both running on VMware, with the following details:
bad-server - 4 GB memory and 2 vCPU
good-server - 16 GB memory and 4 vCPU

An application runs on both servers; it runs queries and reads from a file. Queries are failing on bad-server, while good-server is fine. The sar reports on bad-server show that total memory utilization never goes above 30%.

On further investigation, we see that once PID 1243 (the process ID of the application) reaches about 900 MB of RSS (from prstat output), queries start failing. We attached truss to that PID and found the line below:
/1243:   1.6180 open("/export/correctaddress/data/ltravel.wrk", O_RDONLY) = 23


The open takes more than one second, which causes the query to fail, while on good-server it takes around 0.030 seconds. When this happens, the application has to be restarted, after which its RSS is about 100 MB. After a couple of hours it is 300 MB, then 500 MB, and once it gets near 900 MB, queries fail again. As a result, the application has to be restarted every 7-8 hours. Can somebody explain this, given that memory consumption never goes over 30%? We would increase its CPU and memory to match good-server, but we would expect the system to tell us if it were crossing a threshold value.
Thanks in advance.
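
To find every slow call in a longer truss capture rather than spotting one line by eye, the output can be filtered with a small script. The following is only a sketch. It assumes each line has the shape shown above, and that the number after the colon is the elapsed time of the call in seconds, which is how the post reads it. The function name `slow_calls` and the 1.0 second threshold are illustrative choices, not part of truss.

```python
import re

# Assumed line shape, taken from the truss line quoted in the post:
#   /1243:   1.6180 open("/export/.../ltravel.wrk", O_RDONLY) = 23
# The numeric field is treated as the call's elapsed time in seconds
# (an assumption; it depends on which truss timing option was used).
LINE_RE = re.compile(
    r'^\s*(?P<id>\S*):\s+(?P<secs>\d+\.\d+)\s+'
    r'(?P<call>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>.+)$'
)


def slow_calls(lines, threshold=1.0):
    """Return (seconds, syscall, args) for calls at or above threshold."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like a timed syscall
        secs = float(match.group("secs"))
        if secs >= threshold:
            found.append((secs, match.group("call"), match.group("args")))
    return found


sample = [
    '/1243:   1.6180 open("/export/correctaddress/data/ltravel.wrk", O_RDONLY) = 23',
]
for secs, call, args in slow_calls(sample):
    print(f"{secs:.4f}s {call}({args})")
```

Running this over captures taken at 100 MB and at 900 MB of RSS would show whether only the open of ltravel.wrk slows down or whether other system calls slow down as well.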

Commented:
Are both servers in the same cluster on VMware?

Author

Commented:
They are in the same cluster but in different storage groups. I got confirmation from the VMware admin that the storage policy and settings are the same for both datastores.
Commented:
Sounds like a disk (storage) issue rather than a memory issue.

Author

Commented:
I ran a write test and a read test on both servers with the "time dd ....... ..... ...." command, and both take almost the same time to complete (when there is no issue, i.e. when RSS is below 800 MB). I didn't get a chance to run dd while the problem was occurring.
So I was not able to prove whether it is a storage issue.
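
Since the dd test was never run while the problem was occurring, one option is to leave a small probe running that times the same open() call from a separate process and records slow samples with a timestamp. This is a rough sketch, assuming Python 3 is available on the server; `time_open`, `probe`, and the default values are hypothetical names and settings chosen for illustration.

```python
import os
import time


def time_open(path):
    """Return the seconds taken by a read-only open() of path."""
    start = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed


def probe(path, samples=5, interval=1.0, threshold=1.0):
    """Time open() repeatedly; return (timestamp, seconds) for slow samples."""
    slow = []
    for _ in range(samples):
        elapsed = time_open(path)
        if elapsed >= threshold:
            slow.append((time.strftime("%Y-%m-%d %H:%M:%S"), elapsed))
        time.sleep(interval)
    return slow
```

For example, `probe("/export/correctaddress/data/ltravel.wrk", samples=3600, interval=10)` would sample for about ten hours, which covers one 7-8 hour cycle. If the probe stays fast while the application's own open() is slow, that would suggest the delay is tied to the application process rather than to the storage.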
Commented:
Just a reminder: storage groups under VMware are shared with other virtual servers. When the problem occurs, many servers may be using the same storage disks, which could cause a disk performance issue.

Author

Commented:
I just checked: there are 5 VMs in each datastore. At the VMware level, as well as at the storage level, I do not see any peak in memory or CPU utilization in the past 7 days. Both datastores come from the same storage, with the same policies applied to both.
I can dig further, or possibly open a case with the storage vendor. But before that, I would like to confirm that there is no issue on the OS side, which I cannot do right now.
How do I explain that once the application reaches a certain memory utilization, it starts taking longer to open one file?

Commented:
Check the patch level on both servers. Your VMware admin should be able to show the storage performance at the time your server had issues. The 5 VMs in each datastore are not identical, so performance can vary.

Author

Commented:
Both servers are at kernel level 150401-13, update 11.
I will get both datastores checked again tomorrow morning and will update you then.

Commented:
Tomorrow is the weekend here :)

Author

Commented:
We have one more day to work :-)
You can check that once you are back.
Have a nice weekend :-)

Commented:
Author has checked, and has no more questions.
