Solved

DBLIB: Intermittent hang on DBOPEN - we are totally stumped on this one.

Posted on 2004-10-02
10
302 Views
Last Modified: 2012-05-05
(Apologies for the long post - I thought I better include all relevant information to save time)

We have a problem where a utility using DBLIB, written to run in a login script, occasionally hangs for no apparent reason.

This is a legacy application that cannot be rewritten because the client does not want to spend the money - believe me, we have tried - so solutions involving upgrading to OLE DB are not valid.

Problem:
----------

The application has been running successfully against an MS SQL 7.0 database for 18 months (before that on SQL 6.5 for 18 months). The application is run at login by every user on a 10,000-user network (8,000 on the LAN and 2,000 WAN users). It runs for 5-20 seconds (until it has finished updating some data) and then exits.

Yesterday the call centre had to stop logging calls after the 900th user phoned in to say their login scripts were hanging. Subsequent investigation showed it was the login utility described above that was causing the problem. On the few machines investigated, the process was still running and consuming 99% of the CPU.

Some more facts
--------------------
a) This is the first process launched in the login script. It is launched using
start /w theapp.exe <parameters>

So the login script will not proceed until the application has finished executing.

b) A check on the database server showed no blocks, locks or any other condition that would cause clients to hang.

c) Not all clients were hanging - some were able to connect without any hangup

d) According to change control logs, no update was made to workstations, servers, or any other infrastructure component within the last 2 days

e) The problem was randomly distributed over the network and not confined to a specific subnet or area.

f) An analysis of the application trace log showed that the application got as far as the first dbopen() and did not proceed. On those workstations where users just left the machine without rebooting, the log showed that after 2-3 hours the dbopen() eventually returned and the rest of the application completed normally in the expected amount of time.

g) Analysis of the server logs showed there was no correlation to server load - incidents occurred when CPU was < 50% and when CPU was > 90%

h) We cannot replicate the problem - it appears to be an entirely random event requiring thousands of logins before it occurs.

i) There are no other databases on the server and no other processes access the database except the application in question.

j) The night before the problem occurred the DB admins ran a maintenance plan which included refreshing the indexes on the database (I think there are 2 - very simple indexes). This process did not complete and had to be aborted.

k) We tried bouncing the server but the problem occurred again after the bounce.

Any ideas where to look for this problem? We know it is somewhere in the guts of DBLIB. I did search groups.google.com and found some references to the TDS packet size defaulting to 512 bytes instead of 4096 bytes on SQL 2000, and also some info on dbcancel causing problems (we don't use dbcancel) - neither of which is relevant. The key fact here is that the process has been running successfully for 15 months without incident. The only thing I can think of is a faulty network or something relating to the server, but I don't know how to prove it.
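One mitigation we have been considering for when we are next allowed to touch the binary: DB-Library lets the client bound how long dbopen() waits, via dbsetlogintime(), and how long later result fetches wait, via dbsettime(). Whether these timeouts would interrupt the 99%-CPU spin we are seeing is unproven - it would need verification on a test box - but the pattern is cheap. A minimal sketch (everything above open_bounded() is a stand-in stub so the sketch is self-contained; a real build includes <sqlfront.h>/<sqldb.h>, links against ntwdblib, and drops the stubs - the server/user/password names are placeholders):

```c
/* Sketch: bounding how long dbopen() may block. The typedefs and
   functions below are minimal STAND-IN STUBS so this compiles without
   the SQL Server SDK; a real build would #include <sqlfront.h> and
   <sqldb.h>, link against ntwdblib, and delete everything down to
   open_bounded(). */
#include <stddef.h>

typedef int RETCODE;
typedef struct LOGINREC LOGINREC;
typedef struct DBPROCESS DBPROCESS;
#define SUCCEED 1

static RETCODE dbsetlogintime(int seconds) { (void)seconds; return SUCCEED; }
static RETCODE dbsettime(int seconds)      { (void)seconds; return SUCCEED; }
static LOGINREC *dblogin(void) { static int slot; return (LOGINREC *)&slot; }
static void DBSETLUSER(LOGINREC *l, const char *u) { (void)l; (void)u; }
static void DBSETLPWD(LOGINREC *l, const char *p)  { (void)l; (void)p; }
static DBPROCESS *dbopen(LOGINREC *l, const char *server)
{ (void)l; (void)server; return NULL; }   /* stub: pretend login failed */

/* Open a connection with bounded waits instead of the defaults.
   Returns NULL if the login does not complete in time. */
DBPROCESS *open_bounded(const char *server, const char *user, const char *pwd)
{
    LOGINREC *login;

    dbsetlogintime(30);   /* cap the dbopen() login wait at 30 seconds */
    dbsettime(60);        /* cap later dbsqlexec()/dbresults() waits   */

    login = dblogin();
    DBSETLUSER(login, user);
    DBSETLPWD(login, pwd);

    return dbopen(login, server);
}
```

In the real utility the NULL return would be logged and the process would exit nonzero, so the login script is never held hostage by a dead connection attempt.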

Question by:Julian Hansen
10 Comments
 
LVL 15

Assisted Solution

by:jdlambert1
jdlambert1 earned 250 total points
ID: 12206757
I observe and deduce the following:
1. The problem is caused by something that changed (since it used to work fine).
2. Only one change is reported to have occurred shortly before the problem developed: a failed maintenance plan.
3. The change that's causing the problem is not likely to involve the workstations, since they still work fine until there's a heavy load ("requiring thousands of logins before it occurs")
4. There has to be a reason why the maintenance plan failed.
5. The cause of the problem is either a database performance problem due to the failed maintenance plan, or the same problem that caused the maintenance plan to fail.

Recommended actions:
1. Investigate the cause of the failure of the maintenance plan.
2. Get the indexes fixed to restore the system's performance to previous levels.
3. If there's an audit trail, review exactly what was executed by the maintenance plan, and any scripts that may have been run before the maintenance plan. (Consider that if someone ran an unapproved/unlogged script with a bad reaction, they might try to hide their actions. I've known DBAs who would run untested/unapproved scripts because they were "sure" the scripts would fix a problem, were simple, had no chance of error, change control is a waste of time, etc. Often they're right, sometimes they're not, and I've never seen one admit a mistake when a rogue script broke something.)

HTH
 
LVL 51

Author Comment

by:Julian Hansen
ID: 12207013
Hi jdlambert1,

Thanks for the quick response.

We have pretty much ruled out load as a factor - according to the logs, incidents were being reported even when load was minimal (6:30 am).

I will look further into the maintenance plan failure, but it appears the reason was that some process locked a record and the plan could not complete. I must admit details are hazy on this one, so it must be followed up.

The indexes really don't have much of an effect on the data. The tables are not very large and they are static so an index rebuild is not really necessary but rather than tempt Mr Murphy we will try that too.

I don't think this has anything to do with a DBA FU - or any other maverick user. The fact that one workstation can have a problem while another does not at the same time of day, that a workstation which experienced the problem in the morning did not have it on subsequent logins, that workstations which did not have a problem in the morning did have one on subsequent logins, and that some workstations did not have a problem at all, seems to suggest the problem lies outside of the server. Personally I suspect the network, but I don't have a plausible explanation as to why this would cause the dbopen() call to hang and chew up all the CPU.

Will follow up on recommendations above and report back.

thanks

 
LVL 34

Expert Comment

by:arbert
ID: 12208325
This really sounds like a network issue to me--since it's random and all over the place.

Have you run Profiler on the SQL Server to see if the users' logins even make a request to SQL Server?
 
LVL 51

Author Comment

by:Julian Hansen
ID: 12211003
Thanks for the comments arbert,

I will have to check if a Profiler trace was done, although I do know manual monitoring of the database was done while the incidents were occurring. The manual monitoring (sp_who2) showed that clients were connecting and disconnecting as expected, and no hanging or blocked processes were encountered. As I mentioned before, not all workstations were affected - many were able to connect successfully without incident.

Unfortunately, if a profiler trace was not done then we are not going to be able to do one in the future. The business has basically come back and said don't turn this thing back on unless you can guarantee that it is not going to cause a problem. As there are strict penalties involved we are not at liberty to experiment.

This is why this is not an easy question to answer - it has to be done theoretically, using problem-solving techniques, rather than through practical experimentation.

Thanks again
 
LVL 34

Expert Comment

by:arbert
ID: 12219850
"This is why this is not an easy question to answer - it has to be done theoretically rather than through practical experimentation using problem solving techniques."

I don't think you're going to solve it then - it might be an MDAC problem on certain machines, might be network connectivity, etc. You can't test if it's not live.

 
LVL 15

Expert Comment

by:jdlambert1
ID: 12219881
Agreed. All you can do is deduce possibilities, and if there's more than one (and there is), you can't pin it down without testing.
 
LVL 51

Author Comment

by:Julian Hansen
ID: 12220064
I understand the importance of being able to test and by my previous post I did not mean we cannot do any testing - just that we cannot do any testing that could disrupt the production environment. If there is some unobtrusive test that can be run to eliminate / highlight a problem then that is an option - we are just not allowed to hang up client workstations (or even risk doing so).
 
The main reason I posted this was that, firstly, I was hoping someone out there might have come across a similar problem and solved it, and secondly, there is enough other information to draw some conclusions. For instance, we can rule out MDAC, just to take an example, because a) up until Thursday last week there were no problems for 16 months anywhere on the network, and b) workstations that experienced problems in the morning did not experience them in the afternoon and vice versa - without anything changing on the workstation - so we can assume it is very probably not a workstation problem. What I am trying to do is identify the most probable cause, and so far all I have is intermittent network problems that at this stage I cannot prove.
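To make "unobtrusive test" concrete: since I suspect the network, one low-risk option would be a tiny probe that only attempts a TCP connection to the SQL Server listener port with a hard timeout and logs the result - it never speaks TDS or touches the database. A sketch (POSIX sockets for brevity; our workstations would need the Winsock equivalents, and the host and port in any deployment would be placeholders to fill in):

```c
/* tcp_probe.c - minimal reachability check against the SQL Server
   listener. Attempts a TCP connect with a hard timeout and reports
   success/failure; it never speaks TDS or touches the database.
   POSIX sockets are used for this sketch; the Win32 workstations in
   question would need the Winsock equivalents. */
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns 0 if a TCP connection to host:port completes within
   timeout_ms, nonzero otherwise. */
int tcp_probe(const char *host, int port, int timeout_ms)
{
    struct sockaddr_in addr;
    fd_set wfds;
    struct timeval tv;
    int fd, rc, err = 0;
    socklen_t len = sizeof(err);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1)
        return -1;                      /* bad dotted-quad address */

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return -1;
    fcntl(fd, F_SETFL, O_NONBLOCK);     /* never block in connect() */

    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc == 0) { close(fd); return 0; }               /* connected at once */
    if (errno != EINPROGRESS) { close(fd); return -1; } /* refused at once */

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(fd);
        return -1;                      /* timed out: the hang case */
    }

    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    close(fd);
    return err == 0 ? 0 : -1;           /* 0 only on a completed handshake */
}
```

Run from the login script ahead of the real utility - e.g. tcp_probe("10.1.2.3", 1433, 3000), where the address is a placeholder for our server - a log of timestamps and results across many workstations would show whether the stalls correlate with raw network reachability or only appear inside DBLIB.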

It is a tough one - thanks for the effort guys - I do appreciate it.


 
LVL 34

Accepted Solution

by:arbert
arbert earned 250 total points
ID: 12220371
"For instance we can rule out MDAC, just to take an example, because a) up until Thursday last week there were no problems for 16 months anywhere on the network and b) workstations that experienced problems in the morning did not experience in the afternoon and vice versa - without changing anything on the workstation - we can therefore assume it is very probably not a workstation problem"


I don't think you can totally rule it out. Are you positive nothing changed on the workstations? No service packs? Hotfixes? New software installed?

"j) The night before the problem occurred the DB admins ran a maintenance plan which included refreshing the indexes on the database (I think there are 2 - very simple indexes). This process did not complete and had to be aborted."

Are we sure the indexes got put back into place?  Did the Admins run a CHECKDB just to make sure things are ok?
 
LVL 51

Author Comment

by:Julian Hansen
ID: 12224465
Hi arbert,

No packages were rolled out to the workstations and no other changes were made. Given that 900 calls were logged in the morning, it would have to be a fairly significant change to affect so many clients. But this is not really relevant - assuming it is a problem with the workstation, then why

a) is the problem not consistent - some workstations fail and then work?
b) can we not replicate the problem on any machine?
c) did the application work for 15 months, given there were no changes?

Surely if it were to do with the workstation there would be some consistency in being able to replicate the problem, AND we would have expected the problem to have manifested itself sometime in the past 15 months.

CHECKDB was run and it did check out OK, but again we have to ask: if there was something wrong with the indexes, why the inconsistency in the occurrence of the problem? Surely, at the very least, if a workstation experiences a problem today and the indexes are to blame, then the problem should happen again and again.

Guys, thanks for the input -  I will leave the question open till Friday and then close it.
 
LVL 51

Author Comment

by:Julian Hansen
ID: 12316793
Thanks for the input guys.

Although we have not solved the problem, it is not because your suggestions were incorrect but because we were not able to try most of them.

The client has decided to consider moving off the DBLIB solution (which is a good thing). Subsequent investigation seems to suggest some sort of corruption in the database but doesn't explain why the DBLIB functions hang or how to prevent / stop them from hanging. There does not appear to be anything we can do from a coding perspective - we pass control to the library and simply never get it back.

