Ubuntu routing + VLAN + iptables = hair-loss

fuats
I inherited a network that has been built up by several different people over the last 10 years.  It's pretty messy, and I'm trying to clean it up.  The current priority is getting a VLAN functioning.  The switches know about the VLAN and can handle the tagging.  Everything is routed via an Ubuntu firewall/router with several NICs.  I've found that some items documented as VLANs are not really VLANs, just addresses in the class B range with varying third octets.

Kernel is 2.6.27

eth0 - External IP 1
eth0:5 - External IP 2
eth0:6 - External IP 3
...
eth1 - 10.10.1.1 (10.10.0.0/16)
eth2 - External IP 7
eth3 - External IP 8
eth4 - 10.10.200.1 (VLAN Trunk)
vlan172 - 172.16.172.1 (172.16.172.0/24, bound to eth4)
vlan173 - 172.16.173.1 (172.16.173.0/24, bound to eth4)
vlan109 - 10.10.109.1 (10.10.109.0/24, bound to eth4)
---------------------------------------ifaces-----------------------------------
iface eth4 inet static
      address 10.10.200.1
      netmask 255.255.255.0
      vlan_raw_device eth4
iface vlan109 inet static
      address 10.10.109.1
      netmask 255.255.255.0
      vlan_raw_device eth4
iface vlan172 inet static
      address 172.16.172.1
      netmask 255.255.255.0
      vlan_raw_device eth4
iface vlan173 inet static
      address 172.16.173.1
      netmask 255.255.255.0
      vlan_raw_device eth4

-------------------Abbreviated Firewall Script -----------------------------
#!/bin/sh

# I've removed all comments, and extraneous garbage that I don't feel is pertinent.

IPTABLES=/sbin/iptables
ROUTE=/sbin/route

WANIFACE="eth0"
LANIFACE="eth1"
VTRUNK="eth4"

VLAN109="vlan109"

$IPTABLES -F
$IPTABLES -F -t nat
$IPTABLES -X
$IPTABLES -P INPUT ACCEPT
$IPTABLES -F INPUT
$IPTABLES -P FORWARD DROP
$IPTABLES -F FORWARD
$IPTABLES -P OUTPUT ACCEPT
$IPTABLES -F OUTPUT

$IPTABLES -t nat -I POSTROUTING -o $WANIFACE -s 10.10.0.0/16 -j SNAT --to <external>

$IPTABLES -t nat -I POSTROUTING -o $WANIFACE -s 172.16.173.0/24 -j SNAT --to <external>
$IPTABLES -t nat -I POSTROUTING -o $WANIFACE -s 10.10.202.0/24 -j SNAT --to <external>

$IPTABLES -t nat -I POSTROUTING -o $WANIFACE -s 172.16.172.0/24 -j SNAT --to <external>
$IPTABLES -t nat -I POSTROUTING -o $WANIFACE -s 10.10.203.0/24 -j SNAT --to <external>

$IPTABLES -A FORWARD -p gre -j ACCEPT

$IPTABLES -A FORWARD -i vlan109 -o $LANIFACE -j ACCEPT
$IPTABLES -A FORWARD -i $LANIFACE -o vlan109 -j ACCEPT

$IPTABLES -A FORWARD -i vlan172 -o $WANIFACE -j ACCEPT
$IPTABLES -A FORWARD -i $WANIFACE -o vlan172 -j ACCEPT

$IPTABLES -A FORWARD -i vlan109 -o $WANIFACE -j ACCEPT
$IPTABLES -A FORWARD -i $WANIFACE -o vlan109 -j ACCEPT

$IPTABLES -A FORWARD -i vlan173 -o $WANIFACE -j ACCEPT
$IPTABLES -A FORWARD -i $WANIFACE -o vlan173 -j ACCEPT

---------------------------------------------------------------------------

From a node (10.10.109.109), I can ping other nodes on that same switch within the 10.10.109.x range.  I can also ping from 10.10.109.109 (test laptop) to the gateway (10.10.109.1) on eth4 where VL109 is bound.  This goes through two other switches to get to the Ubuntu box...so I know the switches have their tagging act together.  I can also ping 10.10.200.1 (still eth4) from 10.10.109.109.  The traffic will not leave the router though.

From 10.10.25.100, etc. I can ping pretty much any address on the class B subnet, and hit eth4 (10.10.109.1, 10.10.200.1) with no problem.   I cannot ping through from any 10.10.x.x address to 10.10.109.2-254.

SSH'd into the router, and I can ping all nodes on the 10.10.109.x VLAN.

I've set rp_filter to 0 for vlan109, then eth4, then eth0, and finally for all.  I tried it incrementally because this is a production environment with essentially no downtime window, and I didn't want to break anything.
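When changing rp_filter one interface at a time, it helps to confirm what each interface is actually set to. The following is a minimal sketch, not part of the original script: `show_rp_filter` is a hypothetical helper name, and the optional directory argument exists only so the function can be pointed somewhere other than /proc.

```sh
#!/bin/sh
# Print "<interface>=<value>" for every rp_filter setting found.
# $1 (optional): directory to scan; defaults to the kernel's per-interface
# IPv4 settings under /proc.
show_rp_filter() {
    conf_dir="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_dir"/*/rp_filter; do
        # Skip anything unreadable (including an unmatched glob).
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        printf '%s=%s\n' "$iface" "$(cat "$f")"
    done
}

# To change one interface at runtime (needs root), for example:
#   sysctl -w net.ipv4.conf.vlan109.rp_filter=0
```

Running `show_rp_filter` with no argument on the router lists every interface, including the `all` and `default` entries, so a setting that was missed is easy to spot.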

I've made all sorts of changes to the iptables script, reloaded, and still same behavior.

iptables -nvL shows eth4 and eth0 passing traffic, and eth0 and vlan172 / vlan173 throwing packets happily, but vlan109 and eth4 are no-go.  eth4 and eth1 are chattering away fine as well.

I'm still trying different things, but have noticed I'm starting to do some of the same things I've already tried.  When it gets circular, it's time to ask for help.

Is anything jumping out at anyone out there as a cause for the problem?
Software Engineer
Distinguished Expert 2018
Commented:
You have a 24-bit netmask on eth4, so the network there is 10.10.200.x.

If there is a network behind that one, you need to add a route for the 10.10.x.x net to the next-hop router through a 10.10.200.x address.
If eth4 is actually on the 10.10.x.x network, you need to make the netmask 255.255.0.0.

You can verify whether the ping actually reaches the firewall and leaves on the other interface using tcpdump. It is quite possible that the ping does reach 10.10.109.x but the reply doesn't get back to you.

BTW, this assumes that the systems on the 10.10.x.x network HAVE a route saying that the 10.10.109.x subnet is reached via 10.10.200.1. If no such route exists on the 10.10.x.x systems, they will only try ARP resolution for the target. (Also use tcpdump to verify this.)
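The tcpdump checks suggested here could look roughly like the sketch below. It only prints the commands rather than running them, since tcpdump needs root; `trace_cmds` is a hypothetical helper name, and the interface names and target address in the usage note are taken from the question.

```sh
#!/bin/sh
# Print (do not run) tcpdump commands for following a ping across the router.
# $1: interface the ping arrives on, $2: interface it should leave on,
# $3: address of the ping target.
trace_cmds() {
    in_if="$1"
    out_if="$2"
    target="$3"
    # -n: no name lookups, -e: show link-level headers as well.
    echo "tcpdump -n -e -i $in_if icmp and host $target"
    echo "tcpdump -n -e -i $out_if icmp and host $target"
    # ARP requests for the target on the arrival side suggest the sender
    # treats the target as on-link and never hands the packet to the router.
    echo "tcpdump -n -e -i $in_if arp and host $target"
}
```

For example, `trace_cmds eth1 vlan109 10.10.109.109` prints three commands to run in separate root shells while pinging from a 10.10.x.x host. If only the third capture shows traffic, the sender is doing ARP for the target instead of routing to it.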

Some questions to ask:
Does the 10.10.109 net really need to be on a different VLAN?
If so, why is its address range a sub-range of the major network? You might think about moving it to another network range as well, e.g. 172.16.109.x (that is actually more transparent for the network).

Author

Commented:
Thanks for the response.  I'll take a look at what tcpdump is showing.  I think the last time I looked, traffic was stopping at the vlan109 gateway.  The 109 network needs to be on its own because it's apparently a very chatty system (PBX and associated voicemail and controllers).  It needs isolation, but still needs to be remotely accessible for administration.  Using iptables, I am going to make it so the administrative machines can access it, but Joe User can't wander into it.
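The admin-only access described here might be sketched as follows, in the style of the firewall script from the question. This is an assumption-laden sketch: the addresses in `ADMIN_HOSTS` are hypothetical placeholders, `allow_admins_to_pbx` is an invented name, and the rules rely on the script's existing `-P FORWARD DROP` policy to reject everyone else.

```sh
#!/bin/sh
IPTABLES="${IPTABLES:-/sbin/iptables}"
LANIFACE="eth1"
VLAN109="vlan109"
# Hypothetical admin workstations; substitute the real addresses.
ADMIN_HOSTS="10.10.25.100 10.10.25.101"

allow_admins_to_pbx() {
    # Let replies from the PBX VLAN back out for connections already allowed.
    $IPTABLES -A FORWARD -i "$VLAN109" -o "$LANIFACE" \
        -m state --state ESTABLISHED,RELATED -j ACCEPT
    # Allow only the listed admin machines to open connections into vlan109.
    for host in $ADMIN_HOSTS; do
        $IPTABLES -A FORWARD -i "$LANIFACE" -o "$VLAN109" -s "$host" -j ACCEPT
    done
    # Anything else from the LAN to vlan109 falls through to the DROP policy.
}
```

These rules would take the place of the two blanket ACCEPT rules between vlan109 and `$LANIFACE` in the script; left in, those would still let every LAN host through. Setting `IPTABLES=echo` before calling the function prints the rules instead of applying them, which is a safer first step on a production router.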

eth4 has all the VLANs, and eth1 is the primary LAN (the class B monstrosity).  Ultimately I want to make the primary LAN a class C with all the current subnets VLAN'd out, but that has to be done when it's not going to affect normal operations.  Changing the network range seems like a good alternative, because I can see some confusion arising from a VLAN whose range sits inside that of the larger primary LAN.

Author

Commented:
After trying a million different iterations of my iptables script, I finally threw together a test vlan on a 192.168.109.x network.  Put a simple forwarding (eth1 <--> testVlan) statement in, and reran the script.  Ping ran right through, and outside systems connected.

Because the main network is a class B (10.10.0.0/16), it would not route data to the class C VLAN (10.10.109.0/24) carved out of the same range: either for reasons of spoofing prevention, or because hosts treat both as the same logical network (and so ARP for the address instead of sending it to the router), or both.
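The overlap is easy to check by hand. The sketch below uses two hypothetical helper functions, `ip_to_int` and `in_subnet`, to test whether an address falls inside a given network. It assumes a shell with 64-bit arithmetic, which is the norm for dash and bash on Linux.

```sh
#!/bin/sh
# Convert a dotted-quad IPv4 address to its integer value.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_subnet ADDRESS NETWORK PREFIXLEN
# Exit status 0 if ADDRESS lies inside NETWORK/PREFIXLEN, 1 otherwise.
in_subnet() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

`in_subnet 10.10.109.109 10.10.0.0 16` succeeds, so a host configured with a /16 mask considers the PBX VLAN address local. `in_subnet 192.168.109.5 10.10.0.0 16` fails, which is consistent with the 192.168.109.x test VLAN being routed normally.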

I can't explain how happy I am to have this off my plate.  (Now to go play catch-up on all the stuff that piled in while I was fighting this problem.)  Thanks!
