DBLIB: Intermittent hang on DBOPEN - we are totall stumped on this one.
Posted on 2004-10-02
(Apologies for the long post - I thought I better include all relevant information to save time)
We have a problem where a utility using DBLIB written to run in a login script occassionally hangs for no apparent reason.
This is a legacy application that cannot be rewritten because the client does not want to spend the money - believe me we have tried - so solutions involving upgrading to OLE-DB are not valid.
The application has been running successfully against an MS SQL 7.0 database for 18 months (before that on SQL 6.5 for 18 Months). The application is run at login by every user on a 10,000 user network (8000 on the LAN and 2000 WAN users). The application runs for 5-20 seconds (until it has finished updating some data) and then exits.
Yesterday the call centre had to stop logging calls after the 900th user phoned in to say their login scripts were hanging. Subsequent investigation should it was the login utility described above that was causing the problem. On the few machines investigated the process was running consuming 99% of the CPU.
Some more facts
a) This is the first process launched in the login script. It is launched using
start /w theapp.exe the parameters
So the login script will not proceed until the application has finished executing.
b) A check on the database server showed no blocks, locks or any other condition that would cause clients to hang.
c) Not all clients were hanging - some were able to connect without there being a hangup
d) According to change control logs no update was made either to workstations or server or any other infrastructural component within the last 2 days
e) The problem was randomly distributed over the network and not confined to a specific subnet or area.
f) An analysis of the application trace log showed that the application got as far as the first dbopn and did not proceed. On those workstations where users just left the machine without rebooting the log showed that after 2-3 hours the dbopen eventually returned and the rest of the application completed normally in the expected amount of time.
g) Analysis of the server logs showed there was no correlation to server load - incidents occurred when CPU was < 50% and when CPU was > 90%
h) We cannot replicate the problem - it appears to be an entirely random event requirng thousands of logins before it occurs.
i) There are no other databases on the server and no other processes access the database except the application in question.
j) The night before the problem occurred the DB admins ran a maintenance plan which included refreshing the indexes on the database (I think there are 2 - very simple indexes). This process did not complete and had to be aborted.
k) We tried boucing the server but the problem occurred again after the bounce.
Any ideas where to look for this problem. We know it is somewhere in the guts of DBLIB. I did search groups.google.com and found some references to TDS size defaulting to 512bytes and not 4096 bytes on SQL2K and also some info on dbcancel causing problems (we don't use dbcancel) - neither of which are relevant - the key fact here is that the process has been running successfully for 15months without incident. The only thing I can think of is a faulty network or something relating to the server but I don't know how to prove it.