I've been sitting on this problem for a few months now, due to scheduling difficulties with the customer and their MPLS/ISP vendor. The customer has a very simple network, with a Juniper Netscreen SSG-5 at the main office and another at their satellite office. They used to have a Netscreen-to-Netscreen VPN between the two sites, which worked fine. Now they have an MPLS link (higher bandwidth), which also works fine.
We are trying to set up failover from the MPLS to the VPN for site-to-site traffic, should the MPLS go down. Currently, we have the MPLS router on the same subnet as the LAN (10.0.0.0/24), plugged into bgroup0 on the Juniper (for those not familiar with the SSG line, the device can group ports in the same security zone together into a 'bgroup', which functions like a mini switch made up of the physical ports on the device assigned to that zone). The firewall is at 10.0.0.1 and the MPLS router is at 10.0.0.2, with a static route entry to push traffic across to the other office. Very simple.
To do failover, though, I need to be able to fail over to another interface (rather than just an IP). Since the MPLS router was just acting as a LAN device in bgroup0 instead of being plugged into its own interface, we couldn't do this. So we decided to break off one port on the Netscreen, add a separate dummy subnet (10.0.1.0/24), change the MPLS router's IP, and hang it off this interface. The same was done at the other site (satellite MPLS subnet 10.0.2.0/24 and satellite office LAN subnet 10.0.3.0/24).
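To make the addressing plan above easier to check, here is a rough Python sketch using the standard `ipaddress` module. The four subnets are the ones listed above; the segment labels and the helper names (`overlapping_pairs`, `segment_of`) are made up for illustration and have nothing to do with ScreenOS itself.

```python
import ipaddress

# Subnets as described in the post; the labels are illustrative only.
SEGMENTS = {
    "main LAN": ipaddress.ip_network("10.0.0.0/24"),
    "main MPLS link": ipaddress.ip_network("10.0.1.0/24"),
    "satellite MPLS link": ipaddress.ip_network("10.0.2.0/24"),
    "satellite LAN": ipaddress.ip_network("10.0.3.0/24"),
}


def overlapping_pairs(segments):
    """Return every pair of segment names whose subnets overlap."""
    names = sorted(segments)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if segments[a].overlaps(segments[b])
    ]


def segment_of(address, segments=SEGMENTS):
    """Return the label of the segment containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in segments.items():
        if ip in network:
            return name
    return None
```

For example, `segment_of("10.0.0.2")` gives `"main LAN"`, which is where the MPLS router sat before the change, and `overlapping_pairs(SEGMENTS)` comes back empty, so the plan itself has no conflicting subnets.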
So now, the traffic looks like this (sorry for the crappy Visio):
This also connects up fine, and the physical path is the same; there's just the additional logical hop of the two extra subnets. Static routes were put in place for these.
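The failover behavior we are after, seen from the main office, can be sketched roughly as below. This is a generic model of "prefer the MPLS interface, fall back to the VPN when that interface is down", not ScreenOS configuration; the interface labels (`mpls-port`, `vpn-tunnel`), the preference values, and the `pick_route` helper are all assumptions for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    destination: ipaddress.IPv4Network
    interface: str   # hypothetical interface label, not a real ScreenOS name
    preference: int  # lower value wins; the numbers are illustrative


# Two candidate routes from the main office to the satellite LAN.
ROUTES = [
    Route(ipaddress.ip_network("10.0.3.0/24"), "mpls-port", 10),
    Route(ipaddress.ip_network("10.0.3.0/24"), "vpn-tunnel", 20),
]


def pick_route(dest, routes, link_up):
    """Pick the best route to dest among routes whose interface is up.

    Longest prefix wins first, then the lowest preference value.
    Returns None if no usable route matches.
    """
    ip = ipaddress.ip_address(dest)
    candidates = [
        r for r in routes
        if ip in r.destination and link_up.get(r.interface, False)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (-r.destination.prefixlen, r.preference))
```

With both links up, `pick_route("10.0.3.25", ROUTES, {"mpls-port": True, "vpn-tunnel": True})` selects the `mpls-port` route; marking `mpls-port` as down makes the same call return the `vpn-tunnel` route. This is why the MPLS router needs its own interface: the selection keys off interface state, not just a next-hop IP.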
This configuration worked fine... at first. We could ping between the two sites, and everything seemed OK. However, once we got into more application-level stuff, such as Outlook and opening files across the MPLS, we started getting what seemed to be some kind of disconnect or drop issue. For example, I could browse a UNC share on a Windows file server, and it would go a few levels down, but then stop and not do anything... almost like it was timing out, but not. No actual error message. Outlook would open and connect, but then a minute later say that it was disconnected from the server.
Basically, everything became 'flaky' in a way that is hard to describe. It was like timeout/packet loss/dropping without causing the actual error messages you might expect to see.
I feel like I am dealing with a TTL or some other timing/packet-life sort of thing, but I am not well versed in this and am admittedly grasping at straws.
I put in a ticket with Juniper support about it, and they refused to even speculate or give me an idea of some things to check, and wanted about a million debug and log dump things from me to even start looking at it. I was hopeful since the customer's network is so simple, but apparently that's not how they operate.
The rub is that the customer's IT guy, the customer's MPLS vendor, and I all need to be on a conference call at the same time to change the setup on all of the devices to the desired (but as yet non-functioning) configuration, so that Juniper can collect these logs. That creates problems between the two sites, so the change has to be reverted before too long. So, in order to get help from Juniper, I am looking at having to coordinate myself, the customer, the MPLS vendor, AND Juniper tech support all at a specific after-hours time. Ugh.
I know this is a lot of detail; I will be happy to clarify anything.
**Short version (can't blame you): Site to site MPLS traffic doesn't work properly after adding in two extra subnet hops, can't figure out why.**