Service every 600 seconds

We are running RedHat Linux AS 3.1. We wrote an app to transmit udp data via multicast over our network. The problem is, every 600 seconds some service launches and causes us to lose hundreds of consecutive packets on all machines receiving the data. We have Googled the internet and searched for various culprits, but with no luck. What services could be occurring every 600 seconds that would cause this problem? Also, it is always EXACTLY 600 seconds; we verified this through the udp packet timestamps.

you might want to see what crontabs you have running.

crontab -l

should give you a list of scheduled jobs. See if any are assigned to run every 10 minutes / 600 seconds.

-- Adil
steveo225Author Commented:
there are no crontabs for any user
crontab -l  always returns "no crontab for <username>"
Have you checked the /etc/cron.* directories (/etc/cron.d, /etc/cron.hourly, /etc/cron.daily, etc.)?

Are there services running which trigger periodically? Or that could be remotely triggered every 600 seconds?

steveo225Author Commented:
all are empty but do exist
If it were me, I would write a perl script that ran ps every second, logging the output to a file with a timestamp between each ps run. Match the udp timestamp with one from the script's output and the offending process should be obvious.
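A Python equivalent of that polling idea, for anyone who prefers it to perl. This is just a sketch; the interval, log path, and ps column list are arbitrary choices, not anything from the thread:

```python
#!/usr/bin/env python3
# Poll `ps` once per second and log each snapshot with a timestamp
# separator, so process activity can later be matched against the
# udp packet timestamps. Interval and output path are placeholders.
import subprocess
import time

def snapshot():
    """Return one timestamped `ps` listing as a string."""
    ps = subprocess.run(["ps", "-eo", "pid,comm,%cpu"],
                        capture_output=True, text=True, check=True)
    return "==== %.3f ====\n%s" % (time.time(), ps.stdout)

def log_forever(path="ps-trace.log", interval=1.0):
    """Append snapshots to `path` every `interval` seconds."""
    with open(path, "a") as f:
        while True:
            f.write(snapshot())
            f.flush()
            time.sleep(interval)

if __name__ == "__main__":
    log_forever()
```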
steveo225Author Commented:
Well, since this happens every 600 seconds, we can predict exactly when it is going to occur, so we have tried running ps and top at low intervals, like top at .1 seconds, and nothing seems to change. We have plenty of cpu available; it seems to be more of a service using our network that causes the packet loss. But again, we are unable to locate any service or daemon that runs every 600 seconds. Also, that is 600 seconds of real time, not cpu or system time, and it is measured from when the machine is booted.
What does a sniffer trace of the network traffic show as the src/dest, port, & protocol of burst?
Are you sure the linux box is the source?
steveo225Author Commented:
We used tcpdump and ethereal to get what happens at that time. We get some random packets that ethereal describes as "spanning tree packets."

And yes, we are pretty sure it has to do with our linux box, or hardware in it. The network we have for this data goes from a gigabit ethernet card through a router to more gigabit ethernet cards. There is no other network involved, thus the problem has to be with one of the linux boxes.
I'm sure you've already thought of this, but could it be a bug in your udp multicast app?
Spanning tree is usually a protocol used by switches to talk to each other, isn't it? Our gigabit switches used to sync using spanning tree.
As stated, spanning tree packets are used by switches and, unless something is wrong, should not cause a problem. Likewise, switches in a VLAN environment periodically pass other data over the wire.

To see what's actually happening you need to use a pair of sniffers, one at the packet source and another at the destination. These should really be actual sniffers, not tcpdump or ethereal on the machines participating in the UDP transfer, to eliminate hardware or software issues on the Linux boxes as a cause. By comparing the simultaneous traces you'll be able to tell whether the problem lies with the sender, the receiver, or the network.

steveo225Author Commented:
We seriously doubt it's our app. We originally thought this might be the problem, so we wrote a mini app: all it does is send data over multicast, nothing else, and the receiver only receives. The data is lost every 600 seconds just the same. Like I said, we can predict when this will happen, so we have nothing else running on the system (as far as apps), start our udp app about 10 seconds before we expect the loss, and sure enough, the packets are lost right when we expected.
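For reference, a minimal sketch of such a stripped-down multicast sender/receiver pair in Python. The group address, port, and buffer size are arbitrary placeholders, not the poster's actual values:

```python
# Minimal UDP multicast sender/receiver sketch. GROUP and PORT are
# illustrative placeholders; any address in 239.0.0.0/8 would do.
import socket
import struct

GROUP, PORT = "239.1.1.1", 5007

def make_sender(ttl=1):
    """Socket for sending to the multicast group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return s  # use s.sendto(payload, (GROUP, PORT))

def make_receiver():
    """Socket bound to PORT and joined to the multicast group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    # Join the group on the default interface (0.0.0.0 = any).
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s  # use s.recvfrom(2048)
```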

We also thought that, so we ran a test where two linux boxes were connected point-to-point, and we still got the problem, though we lost fewer packets at the 600-second mark. That is why we are led to believe that some service or daemon running every 600 seconds is causing the problem. Since there is almost no cpu utilization, we also suspect that whatever is running is somehow network related, and perhaps our packets are being overwritten in the send or receive buffer, since udp is not guaranteed.
By "point to point" do you mean via crossover cable or via a dumb switch?

Does the same thing happen if you use a pair of 100Mbps cards instead of the Gig cards?

Is this RHEL 2.1 or 3.0? There's no such thing as 3.1.

Is the OS fully up to date (including the kernel) and running a stock RHEL kernel?

It would be good to figure out whether this is a problem with the source or destination machine. Since it is synched to wall clock time you could figure that out by setting the clock on one end 2.5 minutes different from the other end and see when the packet loss occurs.
steveo225Author Commented:
It's gigabit fiber, one machine directly to another

It's RedHat AS 3.1 (Advanced Server)

We fully updated some and left some un-updated to see if that fixed the problem, but there was no difference

The receiving end warns that a packet was lost during transfer by means of a checksum we added into the packets. We also ran tcpdump on both machines during the test, filtering on only the port and IP; the sender reported the proper number sent, and the receiver reported that amount minus the packets it reported lost. This seems to imply that the sending machine is at least putting all the packets into the send buffer and the receiver is getting all the packets that reach the receive buffer, so somewhere between the two buffers the data is getting lost or corrupted. Since the connection is now point-to-point, something must be corrupting the packets in one of the two buffers
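A sketch of that sequence-number loss-detection scheme in Python. The 4-byte big-endian header is an assumed layout, not the poster's actual packet format:

```python
# Loss detection via a per-datagram sequence number: the sender
# prepends a counter to each payload, and the receiver reports any
# gap in the sequence. Header layout (one 32-bit big-endian unsigned
# int) is an assumption for illustration.
import struct

HDR = struct.Struct("!I")  # 4-byte sequence number

def frame(seq, payload):
    """Prepend the sequence number to a payload."""
    return HDR.pack(seq) + payload

def detect_gaps(datagrams):
    """Yield (expected, got) for every jump in the sequence numbers."""
    expected = None
    for d in datagrams:
        (seq,) = HDR.unpack_from(d)
        if expected is not None and seq != expected:
            yield (expected, seq)
        expected = seq + 1
```

Feeding it a stream with packets 3 and 4 missing reports a single gap, which is exactly the "hundreds of consecutive packets" signature described in the question when the gap is large.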

As far as changing wall clock times on them, they are already different. The problem is that when we watch the two monitors attached to the machines, the receiver reports packets lost at the same moment the sender reports "spanning tree packets" on the network, so it's hard to tell where the error is coming from
> Its gigabit fiber, one machine directly to another

And you saw "spanning tree packets" when no switches were in the network path? What does 'ifconfig -a' look like?

Seeing "spanning tree packets" when switches and VLAN's aren't being used would make me wonder if packets aren't getting munged on their way to the wire.
steveo225Author Commented:
In fact, that is exactly what we are thinking. Since they are udp packets and not guaranteed, we thought perhaps another network service was attempting to send data and corrupting the packets, leading ethereal to believe they were spanning tree packets when in fact they were just corrupted packets

As far as ifconfig -a, I am not at work again until Monday, but I will post the data as soon as I get there
> we thought perhaps another network service was attempting to send data and was corrupting the packets

I could see another node attempting to use the wire and colliding with a packet on occasion, but the collision avoidance mechanism built into the Ethernet spec should limit that to a very low percentage of the total traffic. And the way the spec works, it would be unlikely to occur with any regularity, and probably hardly ever on a point-to-point link. Furthermore, such damage to packets would tend to render them invalid rather than morphing them into what looks like some other protocol.
steveo225Author Commented:
Here is the output of ifconfig -a (I left out eth0, which is just a built-in 10/100 that isn't enabled, and loopback)

eth1      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet addr:  Bcast:  Mask:
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:35742 (34.9 Kb)
          Interrupt:57 Base address:0x3000 Memory:98080000-980a0000
steveo225Author Commented:
> Ethernet spec  should limit that to a very low percentage of the total traffic

Actually, what we are losing is a very low percentage. In 10 minutes we send between 20 and 30 million packets at about 1KB per packet, and we only lose 250-1000 packets. That is a low percentage, usually not too bad, but what makes this bad is that they are all consecutive, lost at exactly 600-second intervals. Losing 1MB of data all at once is causing a lot of problems
steveo225Author Commented:
If anybody can help us figure this out, I'd be happy to give another 150 points
Well, what's needed is a test that can show whether the broadcast packets make it onto the wire and whether they are colliding with some other traffic. And it seems to me that will require a Gig-capable sniffer. I don't think we can trust the results of tcpdump or ethereal given what you've reported.
steveo225Author Commented:
I agree; tcpdump surprisingly keeps up well, ethereal does not. However, there should be no collision of traffic: data is only being sent in one direction, and only one machine is sending any data. I am at a complete loss...
>  I agree, tcpdump surprisingly keeps up well

Right, but you are still using one of the suspect machines to gather the debug data. So any flaws in the machine will confuse the results. That's why I'm suggesting a real sniffer attached directly to the network connection. When using a real sniffer you'll see what exactly is "on the wire" and thus will be able to determine if the problem is host or network related.
steveo225Author Commented:
Turns out Linux has a timer that fires off every 600 seconds and flushes a kernel network cache; I forget all the specifics. I had to change a kernel parameter named secret_interval. Obviously that was it, hell, it has secret in the name
Any chance you could find and insert the spec for future reference?
steveo225Author Commented:
The value can be found at: net.ipv4.route.secret_interval
The idea is that every net.ipv4.route.secret_interval seconds, the kernel flushes its routing cache. This prevents the system from hanging in the event somebody does something malicious. Problem is, when udp is running and this thing fires off, it causes packets to be lost, quite a few actually. So we just set it to some ridiculously high number so it wouldn't happen.
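For reference, the persistent form of that change is a line in /etc/sysctl.conf; the value shown here is just an illustratively large interval, not the one the poster chose:

```
# /etc/sysctl.conf -- raise the route-cache rehash interval
# (kernel default is 600 seconds). Apply with `sysctl -p`.
net.ipv4.route.secret_interval = 86400
```

The same value can be changed at runtime with `sysctl -w net.ipv4.route.secret_interval=86400` or by writing to /proc/sys/net/ipv4/route/secret_interval; note this parameter was removed from later kernels along with the routing cache itself.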

I am also tired of this question appearing in my open questions, but I do not know who to give the points to, so I am going to give some to everybody I felt helped me in finding the solution

(I guess this is one of those things I thought would be an easy answer and should have been a 500-point question)