sara_bellum asked:

Mail to my server cannot be delivered due to DNS server failure

My primary DNS server died recently. I'll probably have to rebuild it, but that will take some time. In the meantime my alternate DNS server is running with the zone files db.mydomain.com and db.mydomain2.com in /var/cache/bind (with symbolic links pointing to them from the chroot path).

I've updated the serial number on the cached domain zone files and restarted bind. MXToolbox reports no A record for my mail server on mydomain.com; however, ZoneEdit.com (which hosts my domains for my routable IPs) validates the zone, and dig results on my NAT'd LAN look good (see below; server1 is the SOA that is down, server2 is the alternate DNS server as well as the mail and web server):

$ dig MX mydomain.com

; <<>> DiG 9.8.1-P1 <<>> MX mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43721
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;mydomain.com.                  IN      MX

;; ANSWER SECTION:
mydomain.com.            10800      IN      MX      10 server2.mydomain.com.

;; AUTHORITY SECTION:
mydomain.com.            10800      IN      NS      server1.mydomain.com.
mydomain.com.            10800      IN      NS      server2.mydomain.com.

;; ADDITIONAL SECTION:
server2.mydomain.com.      10800      IN      A      192.168.1.4
server1.mydomain.com.      10800      IN      A      192.168.1.3

;; Query time: 1 msec
;; SERVER: 192.168.1.4#53(192.168.1.4)
;; WHEN: Sun Nov  4 11:54:08 2012
;; MSG SIZE  rcvd: 121

$ dig A mail.mydomain.com
 ...

;; QUESTION SECTION:
;mail.mydomain.com.            IN      A

;; ANSWER SECTION:
mail.mydomain.com.      10800      IN      CNAME      server2.mydomain.com.
server2.mydomain.com.      10800      IN      A      192.168.1.4

;; AUTHORITY SECTION:
mydomain.com.            10800      IN      NS      server2.mydomain.com.
mydomain.com.            10800      IN      NS      server1.mydomain.com.

;; ADDITIONAL SECTION:
server1.mydomain.com.      10800      IN      A      192.168.1.3

...
 
And here's the zone file for the primary domain on my server:

$ cat db.mydomain.com
 
$ORIGIN .
$TTL 10800      ; 3 hours
mydomain.com            IN SOA  server1.mydomain.com. webmaster.mydomain.com. (
                        2012110301 ; serial
                        10800      ; refresh (3 hours)
                        3600       ; retry (1 hour)
                        604800     ; expire (1 week)
                        3600       ; minimum (1 hour)
                        )
                        NS      server1.mydomain.com.
                        NS      server2.mydomain.com.
                        MX      10 server2.mydomain.com.
$ORIGIN mydomain.com.
server2                 A       192.168.1.4
firewall                A       192.168.1.2
server1                 A       192.168.1.3
wireless                A       192.168.1.5
client1                 A       192.168.1.6
client2                 A       192.168.1.22

mydomain.com            CNAME   server2
mail                    CNAME   server2
ns                      CNAME   server1
ns2                     CNAME   server2
www                     CNAME   server2

My home page at http://www.mydomain.com opens in Google Chrome here in Alaska without any problems that I know of, although a friend in Ohio was not able to open the page. I thought I understood DNS, but apparently I'm missing something very important. Please let me know what it is, thanks.
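
One detail in the zone file above that may be worth a look: in a BIND master file, any name without a trailing dot is relative to the current $ORIGIN. The sketch below is only an illustration of that expansion rule (the helper name `absolute_name` is made up here, and it ignores escapes and other master-file details). Applied to the `mydomain.com CNAME server2` line, which sits under `$ORIGIN mydomain.com.`, it suggests that record may not name the bare domain.

```python
def absolute_name(name: str, origin: str) -> str:
    """Expand a master-file name against the current $ORIGIN.

    Simplified illustration of the relative-name rule; not a full parser.
    """
    if name == "@":
        return origin              # "@" stands for the origin itself
    if name.endswith("."):
        return name                # trailing dot: already fully qualified
    if origin == ".":
        return name + "."          # relative to the root
    return f"{name}.{origin}"      # relative name: origin is appended


# The SOA line is read while $ORIGIN is "." ...
print(absolute_name("mydomain.com", "."))              # mydomain.com.
# ... but the CNAME lines are read while $ORIGIN is "mydomain.com."
print(absolute_name("mail", "mydomain.com."))          # mail.mydomain.com.
print(absolute_name("mydomain.com", "mydomain.com."))  # mydomain.com.mydomain.com.
```

If that record was meant for the bare domain, note also that a CNAME cannot coexist with the SOA and NS records at the top of a zone, so an A record would be the usual choice there.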
Davis McCarn

You don't say when your DNS went down, but if the records were refreshed after it did, you need to wait for them to be refreshed again. DNS changes are not pushed out: resolvers at local ISPs keep serving their cached copies until the TTL expires, so it can take as long as 48 hours (and maybe longer if cybercriminals are attacking top-level DNS servers) before everyone sees the change.
Some DNS servers don't refresh their cached data as often as they're supposed to: those servers will cause issues with lookups.

This may seem like a stupid question... but the IP addresses you've listed in your original post were "changed to protect the innocent," right?  I ask because they're internal IPs, not reachable from the outside world; anybody on your network would get the intended results, but nobody else would.
It is possible that DNS needs to update, which could take up to 24 hours, but it's weird that MX Toolbox didn't still have the old record, unless the TTL expired and the record was deleted from cache.

P.S. Internal name resolution will not work when your server details are queried from outside, since the internal 192.168.1.x network is not routable and cannot be reached.
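
The point about non-routable addresses can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the helper name `is_publicly_routable` is made up for illustration, the two 192.168.1.x addresses come from the dig output in the question, and 8.8.8.8 is just a well-known public address added here for contrast.

```python
import ipaddress


def is_publicly_routable(address: str) -> bool:
    """Return True if the address is globally routable (not private, loopback, etc.)."""
    return ipaddress.ip_address(address).is_global


# Two addresses from the question's dig output, plus one public address
# (8.8.8.8) that is an assumption added purely for comparison.
for addr in ("192.168.1.4", "192.168.1.3", "8.8.8.8"):
    verdict = "routable" if is_publicly_routable(addr) else "internal only"
    print(addr, verdict)
```

An A record that an outside resolver returns for the mail host would need to pass this check before any external mail server could connect to it.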

In order for mail to be delivered, your external DNS records need to resolve, especially the MX record.

I'd suggest checking with ZoneEdit why your internal zone is being reflected....
It could be that you had a split DNS zone configured on the server that died, so it held both the internal and external zones.

Whereas the "replacement"/secondary DNS server only has the internal zone.
sara_bellum (ASKER)

Here's my status:
- All outgoing smtp works (at least my tests today are received at gmail etc)
- All internal (telnet) smtp / port 25 tests generate mail that is delivered to users, (with external emails returning an 'ok' prompt)
- I can receive mail on my laptop (Thunderbird) client for user1@mydomain.com and user2@mydomain.com (both users are on my primary domain)
- My apache home page continues to resolve to the outside world.
       
But all external to internal smtp fails, no postfix/mail log  rejections appear, nor do any appear in the firewall, so the DNS issue remains. If this were a NAT issue I'd see something in my firewall logs, which have static routes from my ISP-assigned routable IPs to the internal (non-routable) IPs on my LAN (for the servers only, the clients don't need that). So I'm still stumped. I've written ZoneEdit asking them to reply to my gmail account, nothing. I disabled ClamAV, will wait and see.
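
Since the internal port 25 tests pass and nothing shows up in the postfix or firewall logs, one quick way to separate "name does not resolve" from "port is blocked" is a plain TCP connection attempt run from a machine outside the LAN. A rough sketch, assuming only Python's standard `socket` module; the function name `port_is_reachable` is hypothetical, and a failed name lookup is reported the same way as a refused or timed-out connection.

```python
import socket


def port_is_reachable(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Name-resolution failures, refusals, and timeouts all raise OSError
    subclasses, so every one of them is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

From an outside host, something like `port_is_reachable("mail.mydomain.com")` could be compared against the same call with the routable IP. If the IP works and the name does not, the problem is in the public DNS records rather than in postfix or the firewall.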
If you were to tell us what your actual domain was (instead of "mydomain.com"), the experts here at Experts Exchange would be able to help you track down the problem.  We'd be able to query fresh DNS servers that probably haven't cached information about your server, so the problem could get tracked down easier.
Actually this is instructive, I have received a couple of rejections for mail sent / forwarded to another user at my ISP that contains an image:

"This is the mail system at host localhost.
I'm sorry to have to inform you that your message could not
be delivered to one or more recipients. It's attached below.
...  The mail system

<destination_user@myisp.net>: host smtpgate.myisp.net[1.2.3.4] said:
    554 5.7.1 spamdefang score exceeds maximum - message from 5.6.7.8
    rejected - queue ID qA622xdo048238 (in reply to end of DATA command)

Reporting-MTA: dns; localhost
X-Postfix-Queue-ID: 74F073942E3
X-Postfix-Sender: rfc822; my_user@mydomain.com
Arrival-Date: Mon,  5 Nov 2012 17:02:35 -0900 (AKST)

Final-Recipient: rfc822; destination_user@myisp.net
Original-Recipient: rfc822;destination_user@myisp.net
Action: failed
Status: 5.7.1
Remote-MTA: dns; smtpgate.myisp.net
Diagnostic-Code: smtp; 554 5.7.1 spamdefang score exceeds maximum - message
    from 5.6.7.8 rejected - queue ID qA622xdo048238 "

where 1.2.3.4 is the destination mail server and 5.6.7.8 is my own mail server (both IPs are routable). Rejection occurs whether or not I have ClamAV enabled, but the good news is that there must be a TCP handshake with the outside world for me to receive a response like this in my mailbox.
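SMTP code 554 5.7.1 means the message was refused for policy reasons (delivery not authorized). It is often seen when relaying is denied, but the text here says a spam filter rejected it.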
http://support.microsoft.com/kb/284204

I'd check with your ISP, especially considering "spamdefang score exceeds maximum".
You need to go to http://www.mxtoolbox.com and check your blacklisting status!
Still trying to slug my way through this - from a computer that uses my ISP for DNS lookup instead of server2.mydomain.com:

$ dig mydomain.com

; <<>> DiG 9.8.1-P1 <<>> mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59905
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;mydomain.com.                  IN      A

;; ANSWER SECTION:
mydomain.com.            6536      IN      A      1.2.3.4

;; AUTHORITY SECTION:
mydomain.com.            156      IN      NS      ns7.zoneedit.com.
mydomain.com.            156      IN      NS      ns1.zoneedit.com.

;; ADDITIONAL SECTION:
ns1.zoneedit.com.      4046      IN      A      69.64.67.242
ns7.zoneedit.com.      4044      IN      A      216.122.7.155

;; Query time: 22 msec
;; SERVER: 209.193.4.7#53(209.193.4.7)
;; WHEN: Tue Nov  6 17:44:12 2012
;; MSG SIZE  rcvd: 128

where 1.2.3.4 is the routable IP of server2.mydomain.com.

From this computer/on the Internet, www.mydomain.com does not resolve, so the fact that my mail server is not found is at least consistent.
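
Comparing the two dig outputs by eye gets tedious, so here is a small sketch that pulls out the header flags and the ANSWER section from dig's default text output. It assumes the layout shown in this thread; the function name `parse_dig` and the trimmed sample strings are illustrative only.

```python
import re


def parse_dig(output: str) -> dict:
    """Extract the header flags and ANSWER-section records from dig text output."""
    flags_match = re.search(r"^;; flags: ([a-z ]+);", output, re.MULTILINE)
    flags = flags_match.group(1).split() if flags_match else []
    answers = []
    in_answer = False
    for line in output.splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            if not line.strip() or line.startswith(";"):
                break  # a blank line or the next section header ends the answers
            name, ttl, _rr_class, rr_type, rdata = line.split(None, 4)
            answers.append((name, int(ttl), rr_type, rdata))
    return {"authoritative": "aa" in flags, "answers": answers}


# Trimmed copies of the two outputs posted in this thread.
INTERNAL = """\
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; ANSWER SECTION:
mydomain.com.            10800      IN      MX      10 server2.mydomain.com.

;; AUTHORITY SECTION:
"""

EXTERNAL = """\
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; ANSWER SECTION:
mydomain.com.            6536      IN      A      1.2.3.4

;; AUTHORITY SECTION:
"""

for label, text in (("internal", INTERNAL), ("external", EXTERNAL)):
    print(label, parse_dig(text))
```

The internal answer carries the `aa` flag because server2 is authoritative for its own copy of the zone, while the external answer is a cached one that points at the ZoneEdit name servers. The two views are served by different servers, which is why a clean internal dig says little about what the outside world sees.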

I thought the problem might be with the fact that I was using default settings in named.conf.options, but I reviewed the required settings for secondary name servers and  made minor changes - after restarting Bind I have the same problem: the outside world cannot find my server.

Except under very limited circumstances:
- www.mailradar.com can find me - its open relay tests show up in my mail log. I passed all the tests, and I'm not on any of the blacklists they tested.
- spammers can find me - one example from my mail log:
Nov  6 17:50:22 server2 postfix/smtpd[15713]: NOQUEUE: reject: RCPT from unknown[207.178.180.130]: 554 5.7.1 <therichsheickc@yahoo.com>: Relay access denied; from=<test@live.com> to=<therichsheickc@yahoo.com> proto=SMTP

But no one else can, I'm stumped...I could change my secondary DNS to make it a primary DNS, but the point of setting it up this way was to have redundancy, and changing the SOA will not do that.
I also tried to ping www.mydomain.com - the host is unknown.

I was using the same zone files before my SOA / primary DNS server crashed, so the problem is not the zone files. I have no idea how to debug this...
ASKER CERTIFIED SOLUTION
Leon Fester
If your domain cannot be resolved by an outside query, your main problem has nothing to do with your server configuration.  Somebody forgot to pay the registrar, or the name got stolen or cancelled.  Even if the IP of your backup server were different from the original, an IP would still be resolved, though it might not be reachable.
We've asked once; but, really need the real domain name to help you further.
If you're editing your zone files by hand, make sure you increase the serial number each time.  I believe the usual convention is a date stamp plus a revision counter (YYYYMMDDnn)... something like 2012110701 (the last two digits being a straight-up counter, not a time).  The serial has to fit in 32 bits, so a longer timestamp such as 201211070801 is too large.  Make sure also to reload the zone afterward.
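
For anyone bumping serials by hand, here is a small sketch of the date-based scheme, matching the YYYYMMDDnn form of the serial in the asker's zone file (2012110301). The function name `next_serial` is made up for illustration. The SOA serial is an unsigned 32-bit number, so the sketch also rejects values that would not fit, which is why a 12-digit timestamp cannot be used.

```python
from datetime import date

MAX_SERIAL = 2**32 - 1  # the SOA serial field is an unsigned 32-bit integer


def next_serial(current: int, today: date) -> int:
    """Return a YYYYMMDDnn-style serial that is strictly greater than `current`."""
    first_today = int(today.strftime("%Y%m%d")) * 100 + 1  # revision 01 for today
    # If the zone was already edited today, just count up from the current value.
    candidate = first_today if first_today > current else current + 1
    if candidate > MAX_SERIAL:
        raise ValueError("serial does not fit in 32 bits")
    return candidate


print(next_serial(2012110301, date(2012, 11, 7)))  # 2012110701
print(next_serial(2012110701, date(2012, 11, 7)))  # 2012110702
```

Secondaries only transfer the zone when the serial on the primary is higher than the one they hold, so the important property is that the number always increases.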

Knowing the actual domain and one or two hostnames would definitely help.
A quick test would be to open a hotmail account or similar and then send yourself an email.
Then check the Non-Delivery Report (NDR) that is generated.

The NDR will include SMTP codes that confirm the reason the mail was not delivered.
Just look for information similar to what you posted from the other user's returned email.
Diagnostic-Code: smtp; 554 5.7.1 spamdefang score exceeds maximum - message
    from 5.6.7.8 rejected - queue ID qA622xdo048238 "

SMTP Status codes explained:
http://mig5.net/content/list-mail-smtp-status-codes
http://www.ietf.org/rfc/rfc1893.txt
http://www.ietf.org/rfc/rfc1891.txt
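
To make the status codes a bit more concrete, here is a rough sketch that splits a Diagnostic-Code line like the one quoted above into its parts. The function name `parse_diagnostic` is hypothetical. The class meanings (2 = success, 4 = persistent transient failure, 5 = permanent failure) follow the enhanced status code scheme defined in RFC 1893, linked above, and its successor RFC 3463.

```python
import re

STATUS_CLASSES = {
    "2": "success",
    "4": "persistent transient failure",
    "5": "permanent failure",
}


def parse_diagnostic(diagnostic: str) -> dict:
    """Split an SMTP diagnostic into reply code, enhanced status code, and text."""
    match = re.match(
        r"\s*(?:smtp;\s*)?(\d{3})[ -](\d\.\d{1,3}\.\d{1,3})\s+(.*)",
        diagnostic,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("no SMTP reply code and enhanced status code found")
    reply_code, status, text = match.groups()
    return {
        "reply_code": int(reply_code),
        "status": status,
        "status_class": STATUS_CLASSES.get(status[0], "unknown"),
        "text": " ".join(text.split()),  # collapse the folded header whitespace
    }


sample = (
    "smtp; 554 5.7.1 spamdefang score exceeds maximum - message\n"
    "    from 5.6.7.8 rejected - queue ID qA622xdo048238"
)
print(parse_diagnostic(sample))
```

For the bounce quoted earlier, the leading 5 marks a permanent failure, and the free text after the code is where the receiving server states its actual reason.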
I think I've done everything that you've suggested, and yes, my domain(s) are registered for a 5-year period, renewable in 4 years. There's nothing wrong with ZoneEdit's MX or DNS records for my routable IPs, and there's probably nothing wrong with my internal DNS config either, at least for the files that I've reviewed that are a standard part of the bind9 configuration. I did find an Ubuntu help page that mirrored my configuration, recommending bind views for internal and external zones to control recursion and reverse lookups, so that was helpful I think.

dvt_localboy's comment about how easy it is to scan and connect to port 25 was useful, thanks.

The problem is that there is some process (I think) on the default Ubuntu 12.04 server that is requesting a dynamic IP when bind9 starts. I found this in my daily logwatch reports: "generating session key for dynamic DNS: 7 Time(s)"

I've checked all the usual packages and programs to halt these requests: /usr/bin/nsupdate has been moved to a tmp file and I removed apparmor (which I had previously removed but on upgrade it was reinstalled - it's not always apparent what packages get installed on upgrade).

I knew I needed to raise the log level on bind to see exactly what happens when bind  starts, but the recommended way to do that was within a chroot environment.  My earlier attempts to get chroot working for bind failed and this time was no exception: bind fails to start, and there are no entries in syslog (or the bind log when I point named.conf towards that file) to explain what happened; earlier errors showed a permissions problem which was supposed to be resolved if the bind log file used is owned by the same user that runs bind, but that solution also failed.

In short, I don't think you can help me. I should be writing about this in the Ubuntu help pages which have become almost impossible to post to, much less to find one's posts or any follow-ups to them.  Tomorrow is another day, let me think about this.
SOLUTION
Thanks, I did think that dynamic DNS might be an issue but my logs were not detailed enough to tell me so I wasn't sure. There were also programs installed by default to help implement dynamic DNS solutions that I disabled/purged. But yes I do have a fixed IP address assigned to eth0 so that wasn't the problem.

Today I managed to get bind9 to start in chroot, with a logfile set to a high debug level. This was much more difficult than I had anticipated, I should volunteer to update the ubuntu help pages lol.

There were some other bind9 bugs on startup that I fixed, but with the internal and external views required for a secondary name server to do reverse lookups, there are config issues I have yet to resolve:
# dig mydomain.com on my server now returns SERVFAIL, even though
# named-checkzone mydomain.com /path/to/zone/file returns +Ok for all files

FYI my config is described at http://ubuntuforums.org/showthread.php?t=1270491 
(not my post but it applies). Tomorrow I should be able to pick up my primary DNS server from the shop - if I have to recover files from the backup drive it will take time, will see.
I don't remember for certain, but I believe you have to configure the daemon so it will bind (no pun intended) to the proper network/adapter.  Maybe it's binding to the dynamic adapter instead of the static one...?
I found an Ubuntu help page that showed how you can assign multiple IP addresses to a single interface, weird...fortunately you actually have to configure that - I'm having  enough trouble with a single static IP on eth0 with multiple domains...

The hard drive on my primary DNS is fried and the backup doesn't boot, so that will take time to fix. In the meantime I've switched the functioning server to SOA and adjusted the resource records. Dig results look ok, but I still can't get icmp replies outside my LAN. This  may be part of the latency problem addressed above, will try again tomorrow.
The best website for testing is here in my experience: http://www.whois.net/dig-lookup/
As I mentioned, dig output looks good, but there are no responses to icmp requests on the ping page of this website.
My PIX firewall is now showing errors (although not consistently) that point to this:
* Deny IP due to Land Attack from 1.2.3.4 to 1.2.3.4 (the IP assigned to the outside interface of the PIX by the PPPOE connection)
The reason for this is clear enough: the PIX doesn't support wireless connections, which I require on my LAN. Some time ago I purchased a wireless router that I connected to the PIX, which is on subnet 1. The wireless router assigns wireless clients a subnet 2 IP address via DHCP. This creates a Land Attack loop on the PIX interface that does the NAT.
I wasn't using subnet 2 for testing until my primary DNS server crashed. When that happened and I reconfigured server2 as SOA, the error appeared.  The only apparent solution is to move my PPPOE connection to the wireless router and implement Linux security on the servers (which I've been able to avoid until now thanks to the PIX).

The PIX has been a great support for some years now, I'm sorry to lose it! If you have any thoughts on this, pls post them thanks.
Can't you hard-wire the DNS server into the router (or a switch connected to it)?  I've always avoided connecting servers using wireless because it simply isn't as dependable as being wired.
My config is:

DSL --> PIX firewall (wired) --> DNS/mail/web server, + subnet 1 clients
                                            --> Wireless router (hard-wired to PIX), + subnet 2 clients

I was told the PIX should work if I add a static route to subnet 2 on the DNS server so I did, and changed the DNS entries on the wireless router setup to my local DNS server (formerly, the wireless  was using my ISP DNS servers because I was having trouble routing packets between subnet 1 and subnet 2, given that the PIX is not a router). Now packets seem to be moving between both subnets and the Land Attack errors have disappeared from the firewall, which is progress.

The external problem remains though: the PIX outside interface is not capturing TCP packets I think - the debug logs are filled with build/tear-down entries for UDP packets but TCP packets are strangely absent, appearing on rare occasions when there's no host connection. So DNS is running fine for now, but no TCP packets are reaching my LAN -  if they're being rejected I can't see where or what is causing this. Will keep digging.
The firewall is fine I think; I'd like to finish this question where I started: the DNS problem that is preventing mail delivery.

My bind problems are tied to bind version 9.8 (version 9.7 was working before the primary DNS server crashed). The issue is how DNSSEC is handled - dynamic DNS is loaded by default in Bind 9.8, and cannot be disabled.

Accordingly, on bind (9.8) restart, I see this in syslog:
- generating session key for dynamic DNS
- set up managed keys zone for view _default, file 'managed-keys.bind' # these are used for managing DNSSEC
- managed-keys-zone ./IN: loading from master file managed-keys.bind failed: file not found
# but the file is in the chroot tree and updates each time I restart bind

As I mentioned, the above DNSSEC features load by default, whether DNSSEC is configured or not. If I enable DNSSEC in named.conf.options, I also see these errors:
error (network unreachable) resolving './DNSKEY/IN': 2001:7fe::53#53
error (network unreachable) resolving './NS/IN': 2001:7fe::53#53

I tried to configure DNSSEC but the guidance I followed doesn't work (the same errors are written to syslog), so I reverted to my older configuration, which has no DNSSEC/dynamic DNS and worked fine in bind 9.7. Although Ubuntu 12.04 uses bind 9.8 by default, I plan to uninstall bind and then attempt to reinstall the older version, which may fail. Any thoughts on handling DNSSEC are welcome; I will close out this question as soon as possible.
Are you sure the managed-keys.bind file is in the expected location?  It seems from the configuration that it's not...

Does the Bind daemon start at all?  Can you do local DNS lookups?
Yup, the managed-keys.bind file is in the chroot bind folder, and I copied rndc.key from /etc  there also.  On bind startup, my log file shows
> set up managed keys zone for view _default, file 'managed-keys.bind'
so that key is being read, this time at least. I'm not aware of having changed anything there, but I'm keeping notes on any lessons learned I can pass on.

I found out that I can manually disable dnssec in named.conf.options, but when I restart bind I still see this:
> generating session key for dynamic DNS
which is a dnssec feature.

I see no bind errors on this latest reboot, so I'll check this tomorrow and see what gives.
I reloaded the old OS on my original name server (ns.mydomain.com), stopped bind on the current server (ns2.mydomain.com)  and started bind on ns: bind9 starts without error but doesn't resolve anything in dig or icmp, so I'm sticking with ns2 for now.

The managed-keys files are generated by bind for dnssec - if you haven't configured dnssec keys etc, the managed-keys files are empty - the syslog error is just a warning for administrators who are using dnssec I think. I can restore them but if I do, I'm likely to see SOA authentication errors again.

Ubuntu ships with the rndc package but no files or any mention of rndc in its default bind configuration. The O'Reilly DNS and Bind book recommends using it to secure bind on the localhost. I generated a new key (to replace the older one I had) and added a controls statement in named.conf.local to accompany the include statement I was using in named.conf, again no error in the logs.

My last logwatch output is down to these errors:
generating session key for dynamic DNS: 1 Time(s)
managed-keys-zone ./IN: No DNSKEY RRSIGs found for '.': success: 1 Time(s)
managed-keys-zone ./IN: No DNSKEY RRSIGs found for 'dlv.isc.org': success: 1 Time(s)
reading built-in trusted keys from file '/etc/bind/bind.keys': 1 Time(s)
set up managed keys zone for view _default, file 'managed-keys.bind': 1 Time(s)

Interestingly, ns.mydomain.com shows none of these errors on bind start-up, so you'd think that it would have no trouble with DNS resolution. But when bind runs on it, I have no DNS resolution / dig results at all, as I mentioned.

Stopping bind on ns and starting it up on ns2 allows ns to resolve URLs like any client on my LAN. I believe ns2's apache server resolves to mydomain.com also, and smtp mail works as before, but mail delivery attempts continue to show a DNS general failure. It's possible that I'll never find the answer, but as long as I'm looking for one, I'll post what I find.
SOLUTION
ok it's time to close this; I've been running DNS without any need for change for some years now but never had to run DNS off my slave server so I wasn't sure if it was fully configured. The new OS I had loaded on the slave before I lost DNS on my domains complicated matters since there were named daemon syslog errors,  so I studied them to find which ones were benign. Then there was my firewall, which is still throwing errors I have to figure out. When everything crashes at once it can be tough to write the question, much less understand the answers.

It should have been obvious that when my dig results looked good but whois returned a SERVFAIL, there was a problem with ZoneEdit. But ZoneEdit had validated my zones, which gave me a false sense of security. At some point ZoneEdit introduced a new feature - NS servers - that I don't fully understand, but I enabled it because I thought that my routable DNS records should match my internal DNS records as much as possible.

Once I was sure that my internal DNS configuration was in order, I disabled the NS server entries on ZoneEdit and mail was delivered. So dvt_localboy was right, but there were so many problems with my LAN that I needed to have a lengthier conversation, and crazedinsanity helped me rule out issues that I thought I had. I'm now back to my original config, more or less, working on my next set of issues (DNS chroot on primary, and my firewall). Will assign points next.