steveo225

asked on

Service every 600 seconds

We are running RedHat Linux AS 3.1. We wrote an app to transmit udp data via multicast over our network. The problem is, every 600 seconds some service launches and causes us to lose hundreds of consecutive packets on all machines receiving the data. We have Googled the internet and searched for various culprits, but with no luck. What services could be occurring every 600 seconds that would cause this problem? Also, it is always EXACTLY 600 seconds; we verify this through the udp packet timestamps.
Sadrul

you might want to see what crontabs you have running.

crontab -l

should give you a list of scheduled jobs. See if any are assigned to run every 10 minutes / 600 seconds.

-- Adil
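
(A quick way to check this across the whole box, assuming a standard Red Hat layout, is to list every user's crontab plus the system-wide cron locations:)

# List the crontab of every account (must be run as root)
for u in $(cut -d: -f1 /etc/passwd); do
    echo "== $u =="
    crontab -l -u "$u" 2>/dev/null
done

# System-wide cron entries
cat /etc/crontab
ls -l /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly /etc/cron.monthly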
steveo225

ASKER

there are no crontabs for any user
crontab -l  always returns "no crontab for <username>"
Have you checked
cron.d        
cron.hourly  
cron.daily    
cron.monthly  
cron.weekly

Are there services running which trigger periodically? Or that could be remotely triggered every 600 seconds?
all are empty but do exist
If it were me I would write a perl script which did a ps every second, logging the output to a file with a timestamp between each ps command. Match the udp timestamp with one from the perl script output and the offending process should be obvious.
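
(A minimal sketch of that idea as a shell loop instead of perl; /tmp/ps.log is just an example path:)

# Log a timestamped process list once a second; afterwards, grep the log
# around the time the packets were lost to see what was running.
while true; do
    date '+%s' >> /tmp/ps.log
    ps auxww >> /tmp/ps.log
    sleep 1
done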
Well, since this happens every 600 seconds, we can predict exactly when it is going to occur, so we have tried ps and top running at low intervals, like top at .1 seconds, and nothing seems to change. We have plenty of cpu available, it seems to be more of a service that is using our network that causes the packet loss. But again, we are unable to locate any service or daemon that runs every 600 seconds. Also, that is 600 seconds of real time, not cpu or system time, and it is also based on when the machine is booted.
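
(Since the interval seems tied to boot time, here is a rough sketch of waiting for the next 600-second uptime boundary and capturing around it; the interface and file names are just examples:)

# Sleep until ~5 seconds before the next multiple of 600 seconds of uptime,
# then capture 20 seconds of traffic on the suspect interface.
UP=$(cut -d. -f1 /proc/uptime)
WAIT=$(( 600 - (UP % 600) - 5 ))
[ $WAIT -lt 0 ] && WAIT=0
sleep $WAIT
tcpdump -i eth1 -n -s 0 -w /tmp/burst.pcap &
CAP=$!
sleep 20
kill $CAP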
What does a sniffer trace of the network traffic show as the src/dest, port, & protocol of burst?
Are you sure the linux box is the source?
We used tcpdump and ethereal to get what happens at that time. We get some random packets that ethereal describes as "spanning tree packets."

And yes, we are pretty sure it has to do with our linux box, or hardware in it. The network we have for this data is on a gigabit ethernet card to a router to more gigabit ethernet cards. There is no other network involved, thus the problem has to be with one of the linux boxes.
I'm sure you've already thought of this, but it wouldn't be a bug in your udp multicast app?
Spanning tree is usually some sort of protocol used by switches to talk to each other, isn't it? Our gigabit switches used to sync using spanning tree.
ASKER CERTIFIED SOLUTION
jlevie

owensleftfoot:
We seriously doubt it's our app. We originally thought this might be the problem, so we wrote a mini app: all it does is send data over multicast, nothing else, and the receiver only receives. The data is lost every 600 seconds just the same. And like I said, we can predict when this will happen, so we can have nothing running on our system (as far as apps), start our udp app about 10 seconds before we expect it to happen, and sure enough, the packets are lost right when we expected.

paranoidcookie:
We also thought that, so we ran a test where two linux boxes were connected point-to-point, and we still got this problem, although we lost fewer packets at the 600 second mark. That is why we are led to believe that some service or daemon is running every 600 seconds that is causing problems. Since there is almost no CPU utilization, we are also led to believe that whatever is running is somehow network related, and perhaps our packets are being overwritten in the send or receive buffer, since udp is not guaranteed.
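
(One way to test the buffer theory, assuming the counters are exposed the usual way on this kernel, is to compare the UDP error counters just before and just after the 600-second mark:)

# UDP statistics, including "packet receive errors"
netstat -su

# The same counters straight from the kernel
grep -A1 ^Udp: /proc/net/snmp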
By "point to point" do you mean via crossover cable or via a dumb switch?

Does the same thing happen if you use a pair of 100Mbps cards instead of the Gig cards?

Is this RHEL 2.1 or 3.0? There's no such thing as 3.1.

Is the OS fully up to date (including the kernel) and running a stock RHEL kernel?

It would be good to figure out whether this is a problem with the source or destination machine. Since it is synched to wall clock time you could figure that out by setting the clock on one end 2.5 minutes different from the other end and see when the packet loss occurs.
It's gigabit fiber, one machine directly to another

It's RedHat AS 3.1 (Advanced Server)

We fully updated some and left some un-updated to see if that fixed the problem, but there was no difference

The receiving end warns that a packet was lost during transfer by means of a checksum we added into the packets. We also ran tcpdump on both machines during the test, filtered to only our port and ip: the sender reported the proper number sent, and the receiver reported that amount minus the packets it reported lost. This seems to imply that the sending machine is at least putting all the packets into the send buffer and the receiver is getting all the packets that make it into the receive buffer, so somewhere between the two buffers the data is getting lost or corrupted. Since the connection is point-to-point now, there must be something corrupting the packets in one of the two buffers.
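
(For reference, a sketch of that kind of filtered capture and count; 230.0.0.1 and port 5000 are placeholders for the real multicast group and port:)

# On both sender and receiver, capture only our multicast stream
tcpdump -i eth1 -n -s 0 -w /tmp/mcast.pcap 'udp and dst host 230.0.0.1 and dst port 5000'

# Afterwards, count the captured packets on each machine and compare
tcpdump -n -r /tmp/mcast.pcap | wc -l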

As far as changing wall clock times on them, they are already different. The problem is that we watch the two monitors attached to the machines: when the receiver reports packets lost, the sender reports "spanning tree packets" on the network. So it's hard to tell where the error is coming from.
> Its gigabit fiber, one machine directly to another

And you saw "spanning tree packets" when no switches were in the network path? What does 'ifconfig -a' look like?

Seeing "spanning tree packets" when switches and VLAN's aren't being used would make me wonder if packets aren't getting munged on their way to the wire.
In fact, that is exactly what we are thinking. Since they are udp packets and not guaranteed, we thought perhaps another network service was attempting to send data and was corrupting the packets and this was leading ethereal to believe they were spanning tree packets, when in fact, they were just corrupted packets

As far as ifconfig -a, I am not at work again until Monday, but I will post the data as soon as I get there
SOLUTION
Here is the output of ifconfig -a (I left out eth0, which is just a built-in 10/100 that isn't enabled, and loopback)

eth1      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:35742 (34.9 Kb)
          Interrupt:57 Base address:0x3000 Memory:98080000-980a0000
> Ethernet spec  should limit that to a very low percentage of the total traffic

Actually, what we are losing is a very low percentage. In 10 minutes we send between 20 and 30 million packets at about 1KB per packet and we only lose 250 - 1000 packets. That is a low percentage, usually not too bad, but what makes this bad is that they are all consecutive, lost at exactly 600 second intervals. Losing up to 1MB of data all at once is causing a lot of problems.
If anybody can help us figure this out, I'd be happy to give another 150 points
Well, what's needed is a test that can show if the broadcast packets make it onto the wire and if they are colliding with some other traffic. And it seems to me that will require the use of a Gig-capable sniffer. I don't think we can trust the results of tcpdump or ethereal given what you've reported.
I agree, tcpdump surprisingly keeps up well, ethereal does not. However, there should be no collision of traffic, data is only being sent in one direction, and only one machine is sending any data. I am at a complete loss...
>  I agree, tcpdump surprisingly keeps up well

Right, but you are still using one of the suspect machines to gather the debug data. So any flaws in the machine will confuse the results. That's why I'm suggesting a real sniffer attached directly to the network connection. When using a real sniffer you'll see what exactly is "on the wire" and thus will be able to determine if the problem is host or network related.
Turns out linux has something that fires off every 600 seconds and flushes part of the network stack; I forget all the specifics. I had to change a kernel parameter named secret_interval. Obviously that was it, hell, it has secret in the name.
Any chance you could find and insert the spec for future reference?
The value can be found at: net.ipv4.route.secret_interval
The idea is that every net.ipv4.route.secret_interval seconds, the tcp stack or some buffer is flushed. This prevents the system from hanging in the event somebody does something malicious. Problem is, when udp is running and this thing fires off, it causes packets to be lost, quite a few actually. So we just set it to some ridiculously high number so it wouldn't happen.
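
(For future reference, a sketch of checking and raising it; the 30-day value is just an example:)

# Current value in seconds (600 by default, which matches the interval we saw)
sysctl net.ipv4.route.secret_interval
cat /proc/sys/net/ipv4/route/secret_interval

# Push it out to something huge (30 days here) so it stops interfering
sysctl -w net.ipv4.route.secret_interval=2592000

# Make it survive reboots
echo "net.ipv4.route.secret_interval = 2592000" >> /etc/sysctl.conf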

I am also tired of this question appearing in my open questions, but I do not know who to give the points to, so I am going to give some to everybody that I felt helped me in finding the solution

(I guess this is one of those things I thought would be an easy answer and should have been a 500 point question)