MichaelKilroy asked:

RPG Socket Program Hanging on Connect()

First, I am new to socket programming. I found a great tutorial online and am reusing code from it. I am using a socket program to send hex commands to a production scoreboard. The hex commands increment/decrement counters, etc. on the board.

In my socket program, I send the binary/hex commands to the board's 8 input ports for the board to process. In this example, I am using software to send the binary command that simulates input 1 (total count). At random, the socket program returns an error saying that it could not connect to the host.

Ex:

I can run my program 70x within a minute or so and it increments the total counter 70x. Everything is fine. Then, out of nowhere, when called, the program tries to make a connection on port 4001 and it hangs. It then returns an error stating it could not make a connection to the host within the timeout period.

As for working/not working, it appears completely random. I can create the socket, send, and close the socket several times and all is fine. I may then wait 3 minutes and send, and the program hangs on the connect function. Any ideas?

thanks in advance - Adam
eval      sock = socket                       
             (AF_INET: SOCK_STREAM:IPPROTO_IP)

if        connect(sock: p_connto: addrlen) < 0   
eval      err = errno                            
callp     close(sock)                            
callp     die('connect(): '+%str(strerror(err))) 
return                                           
endif                                            

eval      rc = Send(sock: %addr(activeCmd):4:0) 
                                                
if        rc < 4                                
eval      err = errno                           
callp     close(sock)                           
callp     die(%str(strerror(err)))              
return                                          
endif                                           

eval      rc = Recv(sock: %addr(activeCmd):4:0) 
* ...e anything if no bytes returned
if        rc < 1
eval      err = errno
callp     close(sock)
callp     die(%str(strerror(err)))
return
endif


Member_2_276102

Any ideas?

First, this is network communications. Welcome to the "Why does it work except 1% of the time?" Club!

One thing you will have to get used to is that you will never know for certain (and often not even in general) why a specific connection didn't work. Get used to that idea now and save yourself a lot of worry in the future. Instead, focus on recovery.

If you are comfortable that it works correctly "70x within a minute or so" and then fails once, then use that as a basic benchmark for good performance. After a connection time-out, clean up whatever needs it and loop back to try again. You can expect the second attempt to work 98+% of the time. If it fails again, then try one more time. Only log a connection error and exit after the third try fails.
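
The three-try recovery described above can be sketched as a small retry helper. This is only an illustration in POSIX-style C, not the poster's RPG: `retry_attempts`, `attempt_fn`, and `flaky_attempt` are hypothetical names, and `flaky_attempt` is a stand-in for one full create-socket, connect, send, close cycle.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical callback type: one complete attempt, 0 on success, -1 on failure. */
typedef int (*attempt_fn)(void *ctx);

/*
 * Calls attempt() up to max_attempts times, sleeping delay_secs between
 * failed tries. Returns the 1-based number of the try that succeeded,
 * or -1 if every try failed (the caller logs the error and exits).
 */
int retry_attempts(attempt_fn attempt, void *ctx, int max_attempts,
                   unsigned delay_secs)
{
    assert(attempt != NULL && max_attempts > 0);

    for (int n = 1; n <= max_attempts; n++) {
        if (attempt(ctx) == 0)
            return n;
        fprintf(stderr, "attempt %d of %d failed\n", n, max_attempts);
        if (n < max_attempts && delay_secs > 0)
            sleep(delay_secs);   /* give the remote device time to recover */
    }
    return -1;
}

/*
 * Stand-in for a real connect/send/close cycle (assumption for the demo):
 * fails while the counter behind ctx is positive, then succeeds.
 */
int flaky_attempt(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

In a real program the callback would clean up its own socket before returning a failure, so each retry starts from a fresh descriptor.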

In that kind of situation, you might easily be seeing transient problems that will be far beyond what you want to or need to get into. Maybe the air conditioner clicked on. Maybe the old microwave in the breakroom needs to be replaced. Maybe there's a faulty network switch that's working at the edge of tolerance. Maybe there's a bandwidth issue. Maybe the remote device needs a few more seconds before it gets back to a ready state after so many connections. Maybe... Maybe......

If a short series of simple retries gets past the problem and everything else seems fine, then don't worry about it.

OTOH, if connection problems show an increasing trend over a few days or if other apps also start showing connection problems, then perhaps some fundamental networking troubleshooting is called for.

Until then, just add a bit of recovery in your basic app and go on to the next project.

Tom
MichaelKilroy (ASKER)

Tom,

Thanks for the feedback. I was actually going to (and probably will) write an async program to call the socket program. This way, when the user scans, it writes the qtys to a file. Then the socket program will process those lines and loop as many times as needed to get the connection and process the records. This will prevent the user's scan program from getting hung.

Problem is, it tries for a minute on connect() before it times out. The qtys on the scoreboard update every second. Say the connect fails 2x before getting through. I cannot have a two-minute delay when updating the qtys on the board.

The production line will think that the scanner is not updating the board.

Another thing I just realized: after the job hangs on the connection, retrying (re-calling the program) will always hang until I sign out of my AS/400 session and back in (I am calling the program interactively for testing purposes). Once signed back in, the program works again until it hangs again. It is almost as if something is left over in my job from the previous failure (I do close the socket on failure).
Gary Patterson, CISSP

Why don't you post your code and show us where the problem is?

Problem is, it tries for a minute on connect() before it times out.

One possibility -- any chance you can create a permanent connection? Have a batch NEP that relays. Your program can send commands to it through a *dtaq (or whatever), and it relays through its connection. No need to open a connection every time.
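
A rough sketch of the relay idea, in POSIX-style C rather than RPG: one long-running job keeps a single connection open and forwards each queued command over it. Everything named here is an assumption for illustration. `relay_commands`, `read_full`, and `write_full` are hypothetical, a plain file descriptor stands in for the *dtaq, and the 4-byte command length is taken from the posted code. The reply that the posted code reads after each send, and reconnecting after a dropped connection, are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CMD_LEN 4   /* the posted code sends 4-byte commands */

/* Reads exactly len bytes. Returns 1 on success, 0 on clean end-of-input
 * before any byte arrived, -1 on error or a truncated command. */
static int read_full(int fd, unsigned char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return got == 0 ? 0 : -1;
        got += (size_t)n;
    }
    return 1;
}

/* Writes exactly len bytes. Returns 0 on success, -1 on error. */
static int write_full(int fd, const unsigned char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/*
 * Forwards commands from queue_fd (stand-in for the data queue) to
 * board_fd (the already-open connection) until the queue side is closed.
 * Returns the number of commands relayed, or -1 on an I/O error.
 */
long relay_commands(int queue_fd, int board_fd)
{
    assert(queue_fd >= 0 && board_fd >= 0);

    unsigned char cmd[CMD_LEN];
    long relayed = 0;

    for (;;) {
        int rc = read_full(queue_fd, cmd, CMD_LEN);
        if (rc == 0)
            return relayed;          /* queue closed: normal shutdown */
        if (rc < 0)
            return -1;
        if (write_full(board_fd, cmd, CMD_LEN) < 0)
            return -1;               /* caller would reconnect here */
        relayed++;
    }
}
```

The point of the design is that connect() runs once when the relay starts, so the scan programs never wait on it; they only drop a command on the queue.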

But if you need to establish and re-establish your connection and you need to control the timeout, you have a little preparation to do. Communications programming is tricky. Fortunately, it's been done before and written about.

Review Scott Klement's excellent article on timeouts and see if it provides what you need.
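
For orientation, one common way to bound the wait on connect() is to switch the socket to non-blocking mode and wait for it with select(). The sketch below is POSIX-style C and is not necessarily the approach the article takes; `connect_with_timeout` and `open_connection` are hypothetical names, and the same underlying calls would need prototypes to be used from RPG.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Connects sock to addr, giving up after timeout_secs seconds.
 * Returns 0 on success, -1 on failure with errno set
 * (ETIMEDOUT when the timeout expires).
 */
int connect_with_timeout(int sock, const struct sockaddr *addr,
                         socklen_t addrlen, int timeout_secs)
{
    assert(timeout_secs > 0);

    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    int rc = connect(sock, addr, addrlen);
    if (rc < 0 && errno == EINPROGRESS) {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(sock, &wfds);
        struct timeval tv = { .tv_sec = timeout_secs, .tv_usec = 0 };

        /* The socket becomes writable when the connect finishes or fails. */
        rc = select(sock + 1, NULL, &wfds, NULL, &tv);
        if (rc == 0) {
            errno = ETIMEDOUT;
            rc = -1;
        } else if (rc > 0) {
            int so_err = 0;
            socklen_t len = sizeof so_err;
            if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &so_err, &len) < 0) {
                rc = -1;
            } else if (so_err != 0) {
                errno = so_err;      /* e.g. ECONNREFUSED */
                rc = -1;
            } else {
                rc = 0;
            }
        }
    }

    int saved = errno;
    fcntl(sock, F_SETFL, flags);     /* restore the original blocking mode */
    errno = saved;
    return rc;
}

/* Creates a TCP socket and connects it to ipv4:port within timeout_secs.
 * Returns the connected descriptor, or -1 with errno set. */
int open_connection(const char *ipv4, unsigned short port, int timeout_secs)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;
    if (connect_with_timeout(sock, (struct sockaddr *)&addr,
                             sizeof addr, timeout_secs) < 0) {
        int saved = errno;
        close(sock);                 /* never leave a half-open descriptor */
        errno = saved;
        return -1;
    }
    return sock;
}
```

A call such as `open_connection("192.0.2.10", 4001, 5)` (the address is a placeholder; 4001 is the port from the question) would then fail after about five seconds instead of a minute.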

Tom
tliotta,

Thanks for the response. I will give the article a read and see if I can implement it. As far as the batch NEP that relays, I really like the sound of that, but do not know what it is. Do you know of any tutorials or reading material I could check out to explore that option?

Sorry - looking at this on my mobile, I didn't see the code fragment.

- Gary


ASKER CERTIFIED SOLUTION
Member_2_276102

This solution is only available to members.

Tom,

Thanks for the explanation and sorry for the delay. I have solved the issue. For some reason, the socket program that connects to the server hangs when it is called multiple times within the same job, but never when called the first time in a job.

Using this info, I wrote an async program that runs all the time (except for killing itself and restarting at 5am). Each iteration, it checks for input needing to be processed. It does this by checking a file that holds scans needing to be processed. Our various scan programs update that file. If it finds records needing to be processed, it fires off the socket program via a SBMJOB command. The socket program then sends the requests by reading the scans file and ends its job. The async program then enters a DLYJOB(2) and then repeats the above logic.

By doing this, the socket program always executes in a job once, so I never get the hanging on the connect() function. I have been running it for 1 1/2 days now and the data has been 100% correct on the board when comparing to the DB2. I haven't lost any scans/packets.

Thanks for the help, info, and ideas.

Adam V
Not the answer, but very helpful information that I appreciate.

For some reason, the socket program that connects to the server hangs when it is called multiple times within the same job,...

That can almost certainly be fixed. Since you had written "70x within a minute or so", it seemed that the problem was external to your program.

But if your current state works well enough for you, there's no pressing need to do better.

I assume you have a multi-threaded *JOBQ that you submit onto to avoid waits.

Tom