Solved

DBLIB: Intermittent hang on DBOPEN - we are totally stumped on this one.

Posted on 2004-10-02
Medium Priority
346 Views
Last Modified: 2012-05-05
(Apologies for the long post - I thought I'd better include all the relevant information to save time.)

We have a problem where a DBLIB-based utility, written to run in a login script, occasionally hangs for no apparent reason.

This is a legacy application that cannot be rewritten because the client does not want to spend the money - believe me, we have tried - so solutions involving upgrading to OLE DB are not valid.

Problem:
----------

The application has been running successfully against an MS SQL 7.0 database for 18 months (before that, on SQL 6.5 for 18 months). The application is run at login by every user on a 10,000-user network (8,000 LAN users and 2,000 WAN users). It runs for 5-20 seconds (until it has finished updating some data) and then exits.

Yesterday the call centre had to stop logging calls after the 900th user phoned in to say their login scripts were hanging. Subsequent investigation showed it was the login utility described above that was causing the problem. On the few machines investigated, the process was running and consuming 99% of the CPU.

Some more facts
--------------------
a) This is the first process launched in the login script. It is launched using
start /w theapp.exe the parameters

So the login script will not proceed until the application has finished executing.

b) A check on the database server showed no blocks, locks or any other condition that would cause clients to hang.

c) Not all clients were hanging - some were able to connect without any hang-up.

d) According to change control logs, no update was made to workstations, servers, or any other infrastructure component within the last 2 days.

e) The problem was randomly distributed over the network and not confined to a specific subnet or area.

f) An analysis of the application trace log showed that the application got as far as the first dbopen and did not proceed. On those workstations where users just left the machine without rebooting, the log showed that after 2-3 hours the dbopen eventually returned and the rest of the application completed normally in the expected amount of time.

g) Analysis of the server logs showed there was no correlation to server load - incidents occurred when CPU was < 50% and when CPU was > 90%

h) We cannot replicate the problem - it appears to be an entirely random event requiring thousands of logins before it occurs.

i) There are no other databases on the server and no other processes access the database except the application in question.

j) The night before the problem occurred the DB admins ran a maintenance plan which included refreshing the indexes on the database (I think there are 2 - very simple indexes). This process did not complete and had to be aborted.

k) We tried bouncing the server, but the problem occurred again after the bounce.

Any ideas where to look for this problem? We know it is somewhere in the guts of DBLIB. I did search groups.google.com and found some references to the TDS packet size defaulting to 512 bytes instead of 4096 bytes on SQL2K, and also some info on dbcancel causing problems (we don't use dbcancel) - neither of which is relevant. The key fact here is that the process has been running successfully for 15 months without incident. The only thing I can think of is a faulty network or something relating to the server, but I don't know how to prove it. A cut-down sketch of the DB-Library call sequence in question is below for reference.
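
For reference, here is a minimal sketch of the kind of DB-Library call sequence the utility goes through up to the call that hangs. The server name, credentials, application name and timeout values are placeholders rather than our real ones, and the explicit dbsetlogintime / dbsettime calls are shown only because they are the usual way to bound dbopen and query waits - I am not claiming the utility currently sets them.

/* Minimal MS DB-Library for C connection path (placeholder names throughout). */
#define DBNTWIN32                  /* Win32 build of MS DB-Library            */
#include <windows.h>
#include <sqlfront.h>
#include <sqldb.h>
#include <stdio.h>

int main(void)
{
    PLOGINREC  login;
    PDBPROCESS dbproc;

    dbinit();                      /* must be the first DB-Library call       */

    /* Bound the waits explicitly - without these the client is at the
       mercy of the library defaults if the server or network misbehaves.     */
    dbsetlogintime(30);            /* max seconds dbopen() may wait           */
    dbsettime(60);                 /* max seconds to wait for query results   */

    login = dblogin();
    DBSETLUSER(login, "someuser"); /* placeholder credentials                 */
    DBSETLPWD(login, "somepwd");
    DBSETLAPP(login, "loginutil");

    /* This is the call that hangs in the field. */
    dbproc = dbopen(login, "SQLSERVER1");   /* placeholder server name        */
    dbfreelogin(login);

    if (dbproc == NULL) {
        fprintf(stderr, "dbopen failed (or timed out)\n");
        dbexit();
        return 1;
    }

    /* ... normal work: dbcmd / dbsqlexec / dbresults / dbnextrow ...         */

    dbclose(dbproc);
    dbexit();
    return 0;
}

If an explicit login timeout were in effect, dbopen would normally give up and return NULL after roughly that period instead of blocking, which is one of the things that makes the 2-3 hour return so odd.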

Question by:Julian Hansen
10 Comments
 
LVL 15

Assisted Solution

by:jdlambert1
jdlambert1 earned 750 total points
ID: 12206757
I observe and deduce the following:
1. The problem is caused by something that changed (since it used to work fine).
2. Only one change is reported to have occurred shortly before the problem developed: a failed maintenance plan.
3. The change that's causing the problem is not likely to involve the workstations, since they still work fine until there's a heavy load (>requiring thousands of logins before it occurs)
4. There has to be a reason why the maintenance plan failed.
5. The cause of the problem is either a database performance problem due to the failed maintenance plan, or the same problem that caused the maintenance plan to fail.

Recommended actions:
1. Investigate the cause of the failure of the maintenance plan.
2. Get the indexes fixed to restore the system's performance to previous levels.
3. If there's an audit trail, review exactly what was executed by the maintenance plan, and any scripts that may have been run before the maintenance plan. (Consider that if someone ran an unapproved/unlogged script with a bad reaction, they might try to hide their actions. I've known DBAs who would run untested/unapproved scripts because... they were "sure" they would fix a problem, were simple, no chance of error, change control is a waste of time, etc., etc. Often they're right, sometimes they're not, and I've never seen one admit a mistake when a rogue script broke something.)

HTH
 
LVL 59

Author Comment

by:Julian Hansen
ID: 12207013
Hi jdlambert1,

Thanks for the quick response.

We have pretty much ruled out load as a factor - according to the logs, incidents were occurring even when load was minimal (6:30 am).

I will look further into the maintenance plan failure, but it appears the reason was that some process locked a record and the plan could not complete. I must admit details are hazy on this one, so it must be followed up.

The indexes really don't have much of an effect on the data. The tables are not very large and they are static, so an index rebuild is not really necessary, but rather than tempt Mr Murphy we will try that too.

I don't think this has anything to do with a DBA FU - or any other maverick user. A workstation can have the problem while another does not at the same time of day; a workstation that experienced the problem in the morning did not have it on subsequent logins; workstations that did not have a problem in the morning did have one on subsequent logins; and some workstations did not have a problem at all. That all seems to suggest the problem lies outside of the server. Personally I suspect the network, but I don't have a plausible explanation as to why this would cause the dbopen() call to hang and chew up all the CPU.

Will follow up on recommendations above and report back.

thanks

 
LVL 34

Expert Comment

by:arbert
ID: 12208325
This really sounds like a network issue to me--since it's random and all over the place.

Have you run Profiler on SQL Server to see if the users' logins even make the request to SQL Server?
 
LVL 59

Author Comment

by:Julian Hansen
ID: 12211003
Thanks for the comments arbert,

I will have to check if a Profiler trace was done, although I do know manual monitoring of the database was done during the time of the incidents. The manual monitoring (sp_who2) showed that clients were connecting and disconnecting as expected, and no hanging or blocked processes were encountered. As I mentioned before, not all workstations were affected - many were able to connect successfully without incident.

Unfortunately, if a profiler trace was not done then we are not going to be able to do one in the future. The business has basically come back and said don't turn this thing back on unless you can guarantee that it is not going to cause a problem. As there are strict penalties involved we are not at liberty to experiment.

This is why this is not an easy question to answer - it has to be done theoretically rather than through practical experimentation using problem solving techniques.

Thanks again
 
LVL 34

Expert Comment

by:arbert
ID: 12219850
"This is why this is not an easy question to answer - it has to be done theoretically rather than through practical experimentation using problem solving techniques."

I don't think you're going to solve it then--It might be an MDAC problem on certain machines, might be network connectivity, etc...You can't test if it's not live....
 
LVL 15

Expert Comment

by:jdlambert1
ID: 12219881
Agreed. All you can do is deduce possibilities, and if there's more than one (and there is), you can't pin it down without testing.
 
LVL 59

Author Comment

by:Julian Hansen
ID: 12220064
I understand the importance of being able to test and by my previous post I did not mean we cannot do any testing - just that we cannot do any testing that could disrupt the production environment. If there is some unobtrusive test that can be run to eliminate / highlight a problem then that is an option - we are just not allowed to hang up client workstations (or even risk doing so).
 
The main reason I posted this was that, firstly, I was hoping someone out there might have come across a similar problem and solved it, and secondly, there is enough other information to draw some conclusions. For instance, we can rule out MDAC, just to take an example, because a) up until Thursday last week there were no problems for 16 months anywhere on the network, and b) workstations that experienced problems in the morning did not experience them in the afternoon and vice versa - without changing anything on the workstation - we can therefore assume it is very probably not a workstation problem. What I am trying to do is highlight the most probable cause, and so far all I have is intermittent network problems that at this stage I cannot prove.

It is a tough one - thanks for the effort guys - I do appreciate it.


 
LVL 34

Accepted Solution

by:arbert
arbert earned 750 total points
ID: 12220371
"For instance we can rule out MDAC, just to take an example, because a) up until Thursday last week there were no problems for 16 months anywhere on the network and b) workstations that experienced problems in the morning did not experience in the afternoon and vice versa - without changing anything on the workstation - we can therefore assume it is very probably not a workstation problem"


I don't think you can totally rule it out. Are you positive nothing changed on the workstations? No service packs? Hotfixes? New software installed?

"j) The night before the problem occurred the DB admins ran a maintenance plan which included refreshing the indexes on the database (I think there are 2 - very simple indexes). This process did not complete and had to be aborted."

Are we sure the indexes got put back into place? Did the admins run a CHECKDB just to make sure things are OK?
 
LVL 59

Author Comment

by:Julian Hansen
ID: 12224465
Hi arbert,

No packages were rolled to the workstations and no other changes were made. Given that 900 calls were logged in the morning, it would have to be a fairly significant change to affect so many clients. But this is not really relevant - assuming it is a problem with the workstation, then why:

a) is the problem not consistent - some workstations fail and then work;
b) can we not replicate the problem on any machine;
c) given there were no changes, did the application work for 15 months and only fail now?

Surely if it was to do with the workstation there would be some consistency in being able to replicate the problem AND we would have expected that the problem would have manifested itself sometime in the past 15 months.

CHECKDB was run and it did check out OK, but again we have to ask the question: if there was something wrong with the indexes, why the inconsistency in the occurrence of the problem? Surely, at the very least, if a workstation experiences a problem today and the indexes are to blame, then the problem should happen again and again.

Guys, thanks for the input -  I will leave the question open till Friday and then close it.
 
LVL 59

Author Comment

by:Julian Hansen
ID: 12316793
Thanks for the input guys.

Although we have not solved the problem, it is not because your suggestions were incorrect but more because we were not able to try most of them.

The client has decided to consider moving off the DBLIB solution (which is a good thing). Subsequent investigation seems to suggest some sort of corruption in the database but doesn't explain why the DBLIB functions hang or how to prevent / stop them from hanging. There does not appear to be anything we can do from a coding perspective - we pass control to the library and simply never get it back.
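
For completeness: the standard DB-Library knobs for getting control back from the library are explicit login/query timeouts plus installed error and message handlers, so that calls return FAIL/NULL to the application instead of waiting indefinitely. Whether any of this would have made a difference to this particular hang is untested - it is a sketch only, with placeholder names.

/* Handlers that hand control back to the application rather than letting
   DB-Library sit on an error or a dead connection (placeholder sketch).    */
#define DBNTWIN32
#include <windows.h>
#include <sqlfront.h>
#include <sqldb.h>
#include <stdio.h>

/* Called by DB-Library whenever it hits an error (including timeouts). */
int err_handler(PDBPROCESS dbproc, INT severity, INT dberr,
                INT oserr, LPCSTR dberrstr, LPCSTR oserrstr)
{
    fprintf(stderr, "DB-Library error %d: %s\n", dberr,
            dberrstr ? dberrstr : "(null)");
    /* INT_CANCEL tells the library to give up on the current operation and
       return FAIL/NULL to the caller rather than keep waiting.             */
    return INT_CANCEL;
}

/* Called for messages raised by the server itself. */
int msg_handler(PDBPROCESS dbproc, DBINT msgno, INT msgstate,
                INT severity, LPCSTR msgtext, LPCSTR srvname,
                LPCSTR procname, DBUSMALLINT line)
{
    fprintf(stderr, "SQL Server message %ld: %s\n", (long)msgno,
            msgtext ? msgtext : "(null)");
    return 0;
}

void install_handlers(void)
{
    dberrhandle(err_handler);
    dbmsghandle(msg_handler);
    dbsetlogintime(30);   /* cap the connection attempt (seconds)  */
    dbsettime(60);        /* cap waits for query results (seconds) */
}

install_handlers() would be called once, right after dbinit() and before the first dbopen().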
