Very odd Microsoft DNS issue
Posted on 2009-05-26
This is my first posting and it's a bit long so be gentle with me :)
I encountered a very odd DNS issue the other day and am stumped as to what occurred, so I am hoping that someone here might be able to explain what happened and how to correct it. Here is our setup. We have one fibre connection and one ADSL connection. The fibre is our primary internet feed. The ADSL is for testing purposes and is provided by a different ISP. There is no bridging between the fibre and ADSL feeds.
We have two sets of DNS servers, one for our internal AD domain, the other for external DNS requests. I'll refer to these as our "private" and "public" DNS servers respectively. The public DNS servers run on a pair of Windows 2003 servers. One is located at our primary site, the other at our DR site. They are on different networks. We also use our ISP's DNS server as a third "public" DNS server. I'll refer to these as NS1, NS2 and NS3 respectively. NS1 is the primary; NS2 and NS3 are secondaries. NS1 will only allow zone transfers to NS2 and NS3. NS3 runs BIND under Linux, whereas NS1 and NS2 use the standard DNS server component that comes with Windows 2003.
The issue was this: our ISP was doing some maintenance work on our fibre feed and we were going to lose all connectivity whilst this work was done. I was asked to have a holding page up in place of our website, served from a webserver at our DR site. I planned to do this by simply changing the host record for our website on our public DNS server to point to the webserver at our DR site.
In preparation for this I changed the TTL of the web server's host record from 2hrs to 10min three hours before the scheduled cutoff time. Since that is longer than the old 2hr TTL, any cached copy of the old record should have expired by then, so most upstream DNS servers would be holding the shorter-TTL version of the host record and the IP change should have propagated out more quickly.
So, about 30min before cutoff I changed the IP address and queried the secondary DNS servers for the IP of our website, and sure enough they responded with the updated IP. I did this check from our internal network as well as our ADSL network. So, all was looking good.
The engineers cut our fibre and started their work. I soon received a phone call saying our site was not accessible. Sure enough, when I checked, the holding page was not being loaded. I did some DNS checks and discovered that NS2 and NS3 were now refusing to answer DNS queries for our domain! NS2 was actually responding with "query refused" when I used nslookup to do the query. NS3 just didn't return anything at all. This had me stumped, as it all worked before our fibre feed was cut. Once our fibre feed was reconnected, both NS2 and NS3 started answering queries for our domain again.
All three DNS servers are on different networks, so losing the fibre should not have affected NS2 or NS3. For the life of me I cannot work out why this would happen. I have checked our domain records, and NS1, NS2 and NS3 are defined as the primary and secondary DNS servers for our domain. Why the secondaries would not respond to DNS queries when the primary is off the air is beyond me. Does anyone have any ideas?
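In case anyone wants to repeat the per-server checks I did with nslookup, here is a rough Python sketch of the same idea: send one A-record query straight to each name server and report the response code (for example REFUSED) or a timeout. It is only an illustration under assumptions. The function names are made up for this post, it only reads the header of the reply, and it sends the query with recursion not requested, which is not quite what nslookup does by default.

```python
import socket
import struct

# Response codes from the DNS header (low 4 bits of the flags field).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record (recursion not requested)."""
    # Header: id, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    # Name is encoded as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def parse_flags(response):
    """Return (rcode name, authoritative-answer flag) from a DNS reply header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    _qid, flags = struct.unpack("!HH", response[:4])
    rcode = flags & 0x000F
    authoritative = bool(flags & 0x0400)  # AA bit
    return RCODES.get(rcode, "RCODE%d" % rcode), authoritative


def check_server(server_ip, name, timeout=3.0):
    """Query one server over UDP and describe how it answered."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server_ip, 53))
            response, _addr = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT (no reply)"
        except OSError as exc:
            return "ERROR (%s)" % exc
    rcode, authoritative = parse_flags(response)
    return "%s%s" % (rcode, " (authoritative)" if authoritative else "")


# Example use with placeholder addresses (NOT our real servers):
#   for ip in ("192.0.2.1", "192.0.2.2", "192.0.2.3"):
#       print(ip, check_server(ip, "www.example.com"))
```

If it behaved the way nslookup did for me, I would expect NS2 to show REFUSED and NS3 to show a timeout during the outage, but I have not run this against our servers.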
Have I made some fundamental design error in our config?
Comments/suggestions/criticisms welcome, well maybe not the criticisms :)