Avatar of bman21
bman21

asked on

asterisk dropping calls

Hi all, I'm relatively new to Asterisk PBX. I have an issue with one of my PBX servers: I am constantly getting the following error message in the /var/log/asterisk/messages file.

 "Dec 18 11:48:44 NOTICE[23651] chan_sip.c: Peer '4006' is now UNREACHABLE!  Last qualify: 124"

All of my extensions become unreachable simultaneously, and then a few seconds later they all become reachable again (log message below).

Dec 18 11:48:54 NOTICE[23651] chan_sip.c: Peer '4006' is now REACHABLE! (145ms / 2000ms)

We are using SIP on port 5060. All of the phones that connect to this server are connected via a VPN. My first impression was that we were experiencing limited bandwidth or high packet loss, but our bandwidth is not even close to being fully utilized and the packet loss on this server's interface is less than 1%.

Any help is greatly appreciated.
Avatar of grblades
grblades

The Asterisk server polls the phones, and any phone that fails to respond within 2 seconds is marked as unreachable.

I would leave a ping running on the asterisk server to one of the phones on the other side. Then see if you get any dropped packets at the same time as the phones go offline.
Avatar of bman21
bman21

ASKER

I ran the ping against several phones and saw no packet loss.
Avatar of bman21

ASKER

What protocol does Asterisk use to poll the phones?
It uses the standard SIP protocol: each poll is a SIP OPTIONS request. The 'qualify=' parameter that enables the polling is set per peer in sip.conf.
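Because the qualify poll is a SIP request rather than an ICMP ping, a clean ping does not prove the phone answers SIP promptly. The following is a minimal sketch of a manual probe, assuming plain UDP on port 5060 and a phone that answers OPTIONS; the function names and the address in the usage note are hypothetical, and the header set is just the minimum a SIP request needs, not a copy of what Asterisk sends.

```python
import socket
import time
import uuid


def build_options_request(peer_ip, peer_port, local_ip, local_port, user="probe"):
    """Build a minimal SIP OPTIONS request as bytes (illustrative header set)."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # branch IDs must start with this cookie
    lines = [
        f"OPTIONS sip:{peer_ip}:{peer_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{peer_ip}:{peer_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:{user}@{local_ip}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(peer_ip, peer_port=5060, timeout=2.0):
    """Send one OPTIONS request; return (status_line, round_trip_ms).

    Returns (None, None) if nothing comes back within the timeout or the
    socket reports an error.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((peer_ip, peer_port))  # UDP connect, used to learn the local address
        local_ip, local_port = sock.getsockname()
        request = build_options_request(peer_ip, peer_port, local_ip, local_port)
        start = time.monotonic()
        try:
            sock.send(request)
            reply = sock.recv(4096)
        except OSError:  # includes socket.timeout
            return None, None
        elapsed_ms = (time.monotonic() - start) * 1000.0
    status_line = reply.split(b"\r\n", 1)[0].decode("ascii", "replace")
    return status_line, elapsed_ms
```

Calling `probe("10.0.0.5")` (a made-up phone address) in a loop next to the ICMP ping would show whether the SIP round trip spikes while the ping stays flat.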
Avatar of bman21

ASKER

Not sure if this is related to all my phones becoming unreachable, but I thought it might be relevant. This is what I see in the error log /var/log/asterisk/messages after I restart Asterisk.

Dec 19 12:50:06 NOTICE[32301] cdr.c: CDR simple logging enabled.
Dec 19 12:50:06 WARNING[32301] res_musiconhold.c: Unable to open pseudo channel for timing...  Sound may be choppy.
Dec 19 12:50:07 NOTICE[32301] res_odbc.c: registered database handle 'asterisk' dsn->[asterisk]
Dec 19 12:50:07 NOTICE[32301] res_odbc.c: Connecting asterisk
Dec 19 12:50:07 WARNING[32301] res_odbc.c: res_odbc: Error SQLConnect=-1 errno=0 [unixODBC][Driver Manager]Data source name not found, and no default driver specified
Dec 19 12:50:07 NOTICE[32301] res_odbc.c: res_odbc loaded.
Dec 19 12:50:07 NOTICE[32301] config.c: Registered Config Engine odbc
Dec 19 12:50:07 WARNING[32301] chan_iax2.c: Unable to open IAX timing interface: No such file or directory
Dec 19 12:50:07 WARNING[32301] chan_iax2.c: Unable to support trunking on user 'REMOTE_SERVER' without zaptel timing
Dec 19 12:50:07 WARNING[32301] chan_iax2.c: Unable to support trunking on peer 'REMOTE_SERVER' without zaptel timing

I do have a second PBX server that communicates with this server via IAX, but so far I haven't noticed any degradation in communication between the two.

Also, I have been monitoring the Asterisk database and have been getting the following notices in addition to the unreachable notices.

Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '5001' is now TOO LAGGED! (3318ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4008' is now TOO LAGGED! (3319ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4002' is now TOO LAGGED! (3319ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4003' is now TOO LAGGED! (3319ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4000' is now TOO LAGGED! (3320ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '7120' is now TOO LAGGED! (3321ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4010' is now TOO LAGGED! (3321ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4001' is now TOO LAGGED! (3321ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4009' is now TOO LAGGED! (3322ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '5011' is now TOO LAGGED! (3322ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4004' is now TOO LAGGED! (3322ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4005' is now TOO LAGGED! (3322ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '5003' is now TOO LAGGED! (3323ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '5007' is now TOO LAGGED! (3323ms / 2000ms)
Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '4006' is now TOO LAGGED! (3323ms / 2000ms)

I have also been continuously pinging several of these extensions and still haven't seen any packet loss.

Hope this provides slightly better information.
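To see how often every peer changes state in the same second, the notices can be grouped by timestamp. This is a rough sketch that assumes the log lines look like the two formats quoted in this thread; the regular expression and function names are illustrative, not part of Asterisk.

```python
import re
from collections import defaultdict

# Matches both notice formats quoted in this thread, for example:
#   Dec 18 11:48:44 NOTICE[23651] chan_sip.c: Peer '4006' is now UNREACHABLE! ...
#   Dec 19 12:59:35 NOTICE[32318]: chan_sip.c:10057 handle_response_peerpoke: Peer '5001' is now TOO LAGGED! ...
PEER_EVENT = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d\d:\d\d:\d\d)\s+NOTICE\[\d+\]:?\s+chan_sip\.c.*?"
    r"Peer '(?P<peer>[^']+)' is now (?P<state>UNREACHABLE|REACHABLE|TOO LAGGED)!"
)


def group_peer_events(lines):
    """Return {(timestamp, state): [peer, ...]} for every qualify notice found."""
    events = defaultdict(list)
    for line in lines:
        match = PEER_EVENT.search(line)
        if match:
            events[(match.group("stamp"), match.group("state"))].append(match.group("peer"))
    return dict(events)


def mass_events(lines, min_peers=3):
    """Keep only the seconds in which at least min_peers peers entered the same state."""
    return {
        key: peers
        for key, peers in group_peer_events(lines).items()
        if len(peers) >= min_peers
    }
```

Feeding it the lines of /var/log/asterisk/messages and comparing the timestamps of the mass events with firewall or bandwidth graphs is one way to check whether the outages line up with anything on the network.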

The first thing I would do is sort out the timing issue. I don't think it is the cause, but it is something you will probably need in the future if you use conference calls or various other features.
You need to install the 'zaptel' package from the same place you got asterisk from. One of the modules that is loaded is 'ztdummy' which gives you a timing source.
If you are compiling ztdummy from source rather than loading a package then you will also need to make sure you have the kernel sources installed.
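A quick way to confirm whether the timing module is actually loaded is to look for it in the kernel's module list. The sketch below assumes a Linux host where /proc/modules is readable and that the module is named 'ztdummy' as described above; the helper names are made up for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def has_timing_module(proc_modules_text, module="ztdummy"):
    """True if the named module appears in the given module listing."""
    return module in loaded_modules(proc_modules_text)


def check_host(path="/proc/modules", module="ztdummy"):
    """Read the live module list from this host and report on the module."""
    with open(path) as handle:
        return has_timing_module(handle.read(), module)
```

If `check_host()` returns False after installing zaptel, the module has probably not been loaded yet, which would match the "Unable to open" timing warnings in the startup log.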

You said you have been pinging the phones. What is the largest round-trip delay you have seen?
You may not be losing packets, but a delay like the 3.3 s in those notices is well over the 2 s limit and would cause the problem.

You could disable the feature that marks the phones as too lagged by commenting out the 'qualify=' lines in each of their entries in the sip.conf file.
If you really are getting spikes of packet delay, this will stop the phones from being marked unreachable, but you will still have audio disruption during any active calls, so I would suggest making this change only as a last resort.
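With 20 or so peers, editing each entry by hand is tedious. The following sketch comments out every active 'qualify=' line in sip.conf-style text using ';', the comment character for Asterisk configuration files. The peer entry in the usage note is a hypothetical example, and the function only transforms text, so back up the real file before writing anything back.

```python
import re

# An active qualify line: optional indentation, then "qualify", then "=".
# Lines that are already commented out, and other options such as
# qualifyfreq, do not match.
QUALIFY_LINE = re.compile(r"^(\s*)(qualify\s*=.*)$", re.IGNORECASE)


def comment_out_qualify(conf_text):
    """Return conf_text with every active qualify= line prefixed by ';'."""
    changed = [QUALIFY_LINE.sub(r"\1;\2", line) for line in conf_text.splitlines()]
    trailing_newline = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(changed) + trailing_newline
```

For a made-up entry such as `[4006]` with `type=friend`, `host=dynamic` and `qualify=yes`, only the last line changes, to `;qualify=yes`. The SIP configuration has to be reloaded (for example with `sip reload` at the Asterisk CLI) before the change takes effect.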
Avatar of bman21

ASKER

The largest round-trip delay that I have noticed is 140 ms (the average is around 50 ms).
Avatar of Ron Malmstead
As grblades said, you should definitely get the timing issue sorted out, but it's probably not the cause of the simultaneous loss of registration and call drops.

Can you tell us about your setup?
How many clients do you have, are they softphones or hardphones, what is your connection speed, and what models/brands of firewalls are you using?

Have you checked the CPU on the firewalls?

Sounds to me like there is a temporary interruption of traffic on the VPN or WAN from time to time, which can be caused by a lot of things: VPN timeout settings on the firewall, a CPU spike on the firewall, a bandwidth spike, interface flapping, packet loss at the ISP, etc.

The only thing you can really do is check all of it, so I would start with hardware and network connectivity issues.

Also, sometimes problems that seem complex are actually simple...once you figure it out.

Short story, for example: I once had a user who would periodically run uTorrent, and since I wasn't using QoS or blocking P2P, my calls would get choppy or drop. It took several days to figure out, because every time I was in the office the little punk wouldn't dare download crap on my network. He would always wait until I left, and then I would get a phone call and have to drive back to the office.
Avatar of bman21

ASKER

I do have a high-volume network. I am running a SonicWALL 4060 with more than 80 active VPN connections that stream information at high rates. It has hardware failover (which it doesn't appear to be switching to), a 2 GHz processor and 256 MB of RAM.

My datacenter isn't having a bandwidth issue, but usage definitely spikes from time to time. The end users each have a SonicWALL TZ-150 that uses a VPN to get into the system; this lets them receive streaming information and lets the phones connect to the PBX server. Their CPU and memory utilization checks out OK.

I could not find any log files or system stats relating to CPU utilization, but I did find my memory utilization (below).

Memory Partition Statistics
 status     bytes     blocks   avg block  max block
 ------   ---------  --------  ---------  ---------
current
   free    52577892    119154          -   33531872
  alloc   196462408    249888          -          -


--Cache check----------
Cache current: 291, high water 5666, added 6388314, deleted 6388023
ConnNode errors: 0, Hash List errors: 0, ConnNode cleanup errors: 0
buffer bounds check (buffer from 4e0ccd0 to 560d0d0)
checking freeBufferList (524061 elem)
checking unmappedList (0 elem)
checking connectionTable
checking freeNodeList (524269 elem)
total bounding errors: 0, connection table errors 0, nat table errors 0, conn node errors 0
--Cache check complete---------

We have 20 hardphones (Polycom SIP501) that connect via VPN to this PBX server. This server is also connected via VPN to another PBX server.

I have run constant pings to a few of these phones without any noticeable packet loss (less than 1%). I have also run a constant ping to the PBX interface and again, no noticeable packet loss.
ASKER CERTIFIED SOLUTION
Avatar of grblades
grblades
Avatar of bman21

ASKER

I may have found a possible solution to the problem. The server did not have the correct DNS servers set up in /etc/resolv.conf. Since changing it to the correct DNS settings, I have not had any further issues with all of my phones becoming unreachable at once. I have only had this "fixed" for about 5 hours and don't want to jump to conclusions, but so far so good. I still have a few phones becoming unreachable, but that is to be expected when connected over the internet.
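One plausible explanation, not confirmed anywhere in this thread, is that lookups against a wrong or unreachable DNS server block for seconds at a time and hold up whatever is waiting on them. A minimal sketch for checking this on the server: list the nameservers from resolv.conf-style text and time a lookup through the system resolver. The function names are hypothetical, and the hostname to test is whatever name the server actually needs to resolve.

```python
import socket
import time


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def time_lookup(hostname):
    """Resolve hostname once; return (seconds_taken, error_message_or_None)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
        error = None
    except socket.gaierror as exc:
        error = str(exc)
    return time.monotonic() - start, error
```

A lookup that takes several seconds, or fails only after a long wait, would suggest that the resolver settings are worth fixing before looking further at the VPN.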

I tried commenting out "qualify=yes" on a few phones. When the others became unreachable I called those numbers, and I was unable to get through to them either.

I am still working on the zaptel timing issue and would appreciate any help. I tried using yum install zaptel, but no luck. Does anyone know where to get the RPMs?

 
How did you install asterisk?
Avatar of bman21

ASKER

I didn't. I took over this position a few months ago, and the guy who built the server no longer works here and left no documentation.
If you are using a modern version of Ubuntu then it may have come as part of a package.
Otherwise you will need to download the zaptel source from www.asterisk.org, install the kernel source via your package manager and then compile it.
Avatar of bman21

ASKER

The zaptel timing issue has been fixed, but I am still getting the occasional widespread loss of connection on the phones. I set up a test phone over the weekend and called its number, leaving the line open. The call never dropped even though the Asterisk log stated that the phone became unreachable on several occasions. I am going to comment out the qualify option and see if I can reach the phones once they become unreachable.
Avatar of bman21

ASKER

OK, looks like commenting out the qualify option helps quite a bit. I get the occasional cutting in and out, but I am not getting completely dropped calls like I was earlier. Thanks for all the help; I will split points accordingly.