Dr. Klahn

asked on

gethostbyname() returns incorrect information

This Debian Linux system recently started producing errors due to incorrect DNSBL lookups in Apache loadable modules.

a) The problem is not in Apache.  It can be reproduced in the small test program shown below.
b) The modules had been in use for over four years before this problem appeared last week.

/* Standard program for confirming Project Spamhaus DNS lookup */
/* If this fails, something's wrong somewhere in DNS */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <netdb.h>


int main(void) {

  unsigned int oct1, oct2, oct3, oct4;
  struct in_addr addr;
  struct hostent *hp = NULL;


/* =====================================================
 * Stage 1.  Lookup 2.0.0.127.sbl.spamhaus.org
 *
 * Should return 127.0.0.2
 * =====================================================
 */


  printf("\n================== FIRST TEST ===================\n\n");
  printf("Lookup 2.0.0.127.sbl.spamhaus.org\n");
  printf("Expected result success, 127.0.0.2\n\n");
  hp = gethostbyname("2.0.0.127.sbl.spamhaus.org");
  if (hp == NULL) {
    printf("Validation failed, h_errno = %d <%s>\n",
            h_errno, hstrerror(h_errno));
    return 1;
  }


  /* Copy the first address; memcpy avoids the alignment and size
   * problems of casting h_addr_list[0] to a long pointer. */
  memcpy(&addr, hp->h_addr_list[0], sizeof(addr));
  printf("Official name = <%s>\n", hp->h_name);
  printf("Address length in bytes = %d\n", hp->h_length);
  sscanf(inet_ntoa(addr), "%u.%u.%u.%u",
         &oct1, &oct2, &oct3, &oct4);
  printf("Resolved to %u.%u.%u.%u\n\n", oct1, oct2, oct3, oct4);


  if ((oct1 != 127) || (oct2 != 0) || (oct3 != 0) || (oct4 != 2)) {
    printf("Validation failed, result not 127.0.0.2\n\n");
    printf("================== FAILURE ===================\n\n");
    return 1;
  }

  printf("Validation successful\n\n");

/* =====================================================
 * Stage 2.  Lookup 1.0.0.127.sbl.spamhaus.org
 *
 * Should return failure
 * =====================================================
 */

  printf("================== SECOND TEST ===================\n\n");
  printf("Lookup 1.0.0.127.sbl.spamhaus.org\n");
  printf("Expected result failure\n\n");
  hp = gethostbyname("1.0.0.127.sbl.spamhaus.org");
  if (hp == NULL) {
    printf("Lookup failed, h_errno = %d <%s>\n\n",
            h_errno, hstrerror(h_errno));
    printf("Validation successful\n\n");
    printf("================= ALL TESTS PASS =================\n\n");
    return 0;
  }

  memcpy(&addr, hp->h_addr_list[0], sizeof(addr));
  printf("Official name = <%s>\n", hp->h_name);
  printf("Address length in bytes = %d\n", hp->h_length);
  sscanf(inet_ntoa(addr), "%u.%u.%u.%u",
         &oct1, &oct2, &oct3, &oct4);
  printf("Resolved to %u.%u.%u.%u\n\n", oct1, oct2, oct3, oct4);
  printf("Validation failed, result should not exist\n\n");
  printf("================== FAILURE ===================\n\n");
  return 1;
}



Compiling and running the test program produces unexpected (I won't say incorrect; the system has some reason for doing this) results.

root:/usr/src/mod_spamhaus/src> ./a.out

================== FIRST TEST ===================

Lookup 2.0.0.127.sbl.spamhaus.org
Expected result success, 127.0.0.2

Official name = <2.0.0.127.sbl.spamhaus.org>
Address length in bytes = 4
Resolved to 127.0.0.2

Validation successful

================== SECOND TEST ===================

Lookup 1.0.0.127.sbl.spamhaus.org
Expected result failure

Official name = <1.0.0.127.sbl.spamhaus.org.my-domain-name.com>
Address length in bytes = 4
Resolved to xxx.yyy.zzz.aaa (the host's IP address)

Validation failed, result should not exist

================== FAILURE ===================

root:/usr/src/mod_spamhaus/src>



Note the response from the second lookup -- the official name returned has the host's domain name tagged onto the tail.  This has not been seen previously on any of the systems where the test program was executed.

The problem is specific to gethostbyname: nslookup returns the correct result (no such domain), and dig also returns the correct info.

root:/usr/src/mod_spamhaus/src> nslookup 1.0.0.127.sbl.spamhaus.org
Server:         84.200.69.80
Address:        84.200.69.80#53

** server can't find 1.0.0.127.sbl.spamhaus.org: NXDOMAIN
root:/usr/src/mod_spamhaus/src> dig 1.0.0.127.sbl.spamhaus.org

; <<>> DiG 9.10.3-P4-Debian <<>> 1.0.0.127.sbl.spamhaus.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 58161
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;1.0.0.127.sbl.spamhaus.org.    IN      A

;; AUTHORITY SECTION:
sbl.spamhaus.org.       8       IN      SOA     need.to.know.only. hostmaster.spamhaus.org. 2011051822 3600 600 432000 10

;; Query time: 127 msec
;; SERVER: 84.200.69.80#53(84.200.69.80)
;; WHEN: Thu Nov 05 18:23:28 GMT 2020
;; MSG SIZE  rcvd: 119

root:/usr/src/mod_spamhaus/src>



So the question is:  What's going on in the DNS subsystem that this issue suddenly cropped up?

Further info:
The host runs a Debian that is some years old; it is not automatically updated, and very little should be changing on the system.

The DNS servers have not been changed recently.

/etc/resolv.conf contents:
# 1.1.1.1 open nameserver
# Four ones does not properly resolve spamhaus lookups
# nameserver 1.1.1.1


# OpenNIC nameservers
# May be blocked in iptables
# nameserver 172.98.193.42


# dns.watch IPv4 nameservers
nameserver 84.200.69.80    # resolver1.dns.watch
nameserver 84.200.70.40    # resolver2.dns.watch
nameserver 208.67.220.220
nameserver 208.67.222.222




David Favor

Likely culprit: the dreaded systemd-resolved service...

Provide the output of the following command for review:

netstat -pluten | egrep :53



Said differently, the cached DNS lookups are likely corrupt if systemd-resolved is running + has (as usual) lost its mind.

The solution - destroy all traces of systemd-resolved + either use no DNS caching or use dnsmasq (rock solid - works 100% of the time).
https://www.experts-exchange.com/questions/29162917/Huge-DNS-queries-from-openshift.html provides instructions for nuking systemd-resolved + replacing with dnsmasq, if this is the problem.

Note: Till I figured out this problem, I lost many hours of life I'll never get back.

Since removing systemd-resolved from all installs - machines/containers/VMs - all oddball DNS problems magically resolved.
Dr. Klahn

ASKER

The system is SysV init, not systemd.  systemd came along later.  There's no DNS caching.  It is largely vanilla when it comes to networking.

root:/usr/src/mod_spamhaus/src> netstat -pluten | egrep :53
root:/usr/src/mod_spamhaus/src>


That is a lot of code for what amounts to:  dig 2.0.0.127.sbl.spamhaus.org  to ask the DNS server directly (verifying there is no upstream issue),
and
host 2.0.0.127.sbl.spamhaus.org  to reproduce what your code does.
getent hosts 2.0.0.127.sbl.spamhaus.org is an alternative.

1.0.0.127.sbl.spamhaus.org returns domain not found...; I'm not sure where the software running on your system converts that into a local query.
$ dig 1.0.0.127.sbl.spamhaus.org          
; <<>> DiG 9.16.6 <<>> 1.0.0.127.sbl.spamhaus.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44512
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;1.0.0.127.sbl.spamhaus.org.    IN      A

;; AUTHORITY SECTION:
sbl.spamhaus.org.       10      IN      SOA     need.to.know.only. hostmaster.spamhaus.org. 2011051850 3600 600 432000 10

;; Query time: 33 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: do nov 05 19:51:39 CET 2020
;; MSG SIZE  rcvd: 119


The code shown is an example to demonstrate the problem in the module code.  It has to be done in code and not as callouts because it is not time-efficient to rely on spawned processes from within an Apache module.

1.0.0.127.sbl.spamhaus.org and 2.0.0.127.sbl.spamhaus.org are the test addresses for Project Spamhaus.  2.0.0.127.sbl.spamhaus.org should always resolve; 1.0.0.127.sbl.spamhaus.org should never resolve.  A system which fails to resolve 2.0.0.127.sbl.spamhaus.org has DNS problems; one which resolves 1.0.0.127.sbl.spamhaus.org also has DNS problems.  Google's nameservers often fail both checks, so it is necessary to test the system's ability to resolve these test FQDNs before activating the DNSRBL module.
I tried your C reproducer and for me it returns:
$ ./a.out

================== FIRST TEST ===================

Lookup 2.0.0.127.sbl.spamhaus.org
Expected result success, 127.0.0.2

Official name = <2.0.0.127.sbl.spamhaus.org>
Address length in bytes = 4
Resolved to 127.0.0.2

Validation successful

================== SECOND TEST ===================

Lookup 1.0.0.127.sbl.spamhaus.org
Expected result failure

Lookup failed, h_errno = 1 <Unknown host>

Validation successful

================= ALL TESTS PASS =================


Those are the expected results... so the problem is somewhere in your environment.
No recent changes to the firewall that redirect DNS queries for filtering?  Updates to the C runtime library?
What is the response to other NXDOMAIN returns?
ASKER CERTIFIED SOLUTION
Dr. Klahn

This solution is only available to members of Experts Exchange.
The crucial question then is the one asked before: what is the response to other NXDOMAIN returns?
I'm not following you there.  The NXDOMAIN return is proper and the code keeps going.
The question was: how does your system respond to a query returning NXDOMAIN?  (Sorry, non-native English speaker here; the earlier wording doesn't fit that well.)
Likely solution is to actually use a real local caching server like dnsmasq.

Using dnsmasq solves several problems.

1) DNS resolution is fast, since cached records are served from local entries for the duration of their TTLs.

If this is too slow, dnsmasq allows you to... how to say this... manually edit TTLs, creating your own TTL override values.

2) Dense, focused logging (unlike named's insane verbosity, or systemd-resolved's silent failures), which makes query debugging simple.

I run dnsmasq (machines/containers/VMs) in query debug mode all the time.

This way when any problem occurs, I never have to set up debugging and then wait for a recurrence.

I just refer to the logs.

3) Instantly surface stealth DNS lookups.

This is incredibly useful for quickly noticing when a site has been hacked, then looking back in time to find when the hack originally occurred, often to the minute... as most hacks attempt some minimal call-back-home probe to confirm the hack is active.

Having this detail tends to simplify hack cleansing... for projects where people refuse to practice correct security hygiene... which would keep the hack from occurring in the first place...
David, the problem lies below the level where a caching DNS server would help.  The DNS lookups are all, strictly speaking, correct.

The problem is in the behavior of the libc resolver routines, which perform that second, unwanted lookup "behind my back": when the first lookup fails, the system goes around again with ".my-domain-name.com" hung onto the tail of the supplied FQDN in an attempt to be obliging.  If that second lookup could be disabled, things would be fine.

This can be avoided by using the resolver routine res_query, but it is ugly and touchy as far as setting the options goes.  A test program is shown below.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <resolv.h>
#include <netdb.h>

#define N 4096

int main(void) {

  u_char nsbuf[N];
  u_long savedoptions;
  char dispbuf[N];
  ns_msg msg;
  ns_rr rr;
  int i;
  int l;
  int querystat;
  char name[60];

  while (1) {

    printf("Enter name: ");
    if (scanf("%59s", name) != 1)          /* Bounded read; quit on EOF */
      break;

    /* Initialize the resolver data structure if necessary */

    if ((_res.options & RES_INIT) != 0) {
      printf("res_init has already been called\n");
    } else {
      res_init();
    }

    /* We don't know what's going on elsewhere in Apache.  Something else
     * may well be using resolver routines; in fact it is likely.  To be
     * safe, save the current _res.options, set our own bits, and restore
     * them after the query.  Yeah, there's a critical section problem here.
     * Enable:   RES_PRIMARY   Query only primary
     * Disable:  RES_DEFNAMES  Append default domain name
     * Disable:  RES_DNSRCH    Search in current and parent domains
     */

    savedoptions = _res.options;
    _res.options |= RES_PRIMARY;
    _res.options &= ~RES_DEFNAMES;
    _res.options &= ~RES_DNSRCH;

    /* Query the resolver */

    errno = 0;                             /* Not cleared by res_query. */
    querystat = res_query(name, ns_c_any, ns_t_a, nsbuf, sizeof(nsbuf));
    _res.options = savedoptions;           /* Restore res_options */

    /* On failure, print info and exit now */

    if (querystat < 0) {
      printf("res_query error, return code = %d, errno = %d, h_errno = %d\n",
             querystat, errno, h_errno);
      perror("pError: ");
      herror("hError: ");
      exit(0);
    }

    /* Query was not rejected out-of-hand; parse the result buffer */

    if (ns_initparse(nsbuf, querystat, &msg) < 0) {
      printf("ns_initparse failed\n");
      exit(1);
    }

    /* How many responses? */

    l = ns_msg_count(msg, ns_s_an);
    printf("Responses received:  %d\n", l);

    /* If none, print info and exit now */

    if (l == 0) {
      printf("No responses received for %s\n", name);
      exit(0);
    }

    /* Print the responses */

    for (i = 0; i < l; i++) {
      ns_parserr(&msg, ns_s_an, i, &rr);
      ns_sprintrr(&msg, &rr, NULL, NULL, dispbuf, sizeof(dispbuf));
      printf("\t%s \n", dispbuf);
    }
  }

  return 0;
}


That 2nd (and further) "behind the back" lookup is actually expected behavior from the resolver library:
the resolver should provide the address belonging to a given name.
(Although gethostbyname is considered obsolete; getaddrinfo() is the new kid on the block, due to IPv6 semantics and structure sizes.)

gethostbyname can be configured using /etc/host.conf (BIND uses the resolver library);
you already used those options in your new code.
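For reference, a minimal sketch of the same lookup via getaddrinfo().  One detail worth verifying on the affected system (a property of the glibc resolver, not something established in this thread): a query name ending in a dot is treated as fully qualified, which should keep the search list from being appended without touching _res.options.  The resolve4() helper below is invented for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Resolve an IPv4 address for a name; returns 0 on success, or a
 * getaddrinfo error code (e.g. EAI_NONAME for NXDOMAIN).  Pass the
 * name with a trailing dot to mark it as fully qualified. */
static int resolve4(const char *fqdn, struct in_addr *out) {
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;          /* IPv4 only, like gethostbyname */
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(fqdn, NULL, &hints, &res);
    if (rc != 0)
        return rc;
    *out = ((struct sockaddr_in *)res->ai_addr)->sin_addr;
    freeaddrinfo(res);
    return 0;
}

int main(void) {
    struct in_addr a;
    /* Trailing dot: absolute name, so no search-list appending */
    if (resolve4("2.0.0.127.sbl.spamhaus.org.", &a) == 0)
        printf("Resolved to %s\n", inet_ntoa(a));
    else
        printf("Lookup failed\n");
    return 0;
}
```

Whether the trailing dot is acceptable in a module that must also handle names supplied by configuration is a separate question, of course.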
In this case it's not possible to configure the gethostbyname or getaddrinfo behavior using /etc/resolv.conf, because the code is being used in an Apache module which is used worldwide.  It would be unacceptable for the module installer to stomp on the system's settings.  (I certainly did think about it, though.)  But at the end of the day, the module should work regardless of the system's resolver settings.  So nothing was left except going down two levels and using res_query.
Do you need to know, within Apache, what the value is?

Maybe you can consider using nginx; it has a built-in resolver.
https://www.nginx.com/blog/dns-service-discovery-nginx-plus/
(depending on what you want to do with the result).

nginx might increase your throughput as well, besides scaling out by about 2x on the same hardware.

Yes, it's necessary to do real-time blackhole list lookups on various RBLs.  These are not built into Apache.  mod_spamhaus, mod_torcheck and mod_honeypot are examples of modules which must do DNS lookups.

I'm not a fan of nginx as I have 25 years invested in Apache and I now know how to bend it to my will.  In any case I need to correct the behavior in these modules because if I've seen it, somebody else surely will, and I'm the one who loosed the monster.
Using a local caching DNS server seems like something that would help a lot as well (if only to prevent over-querying any RBL).
Tor-check should not need a DNS lookup; connecting back to the system the request came from and checking whether it opens the relay port should be sufficient.
With a DNS check you need a valid, reliable and up-to-date service listing Tor exits.
Using a caching server provides no benefit in this case because the modules cache IP addresses internally.  This is much faster than doing lookups even if the resolver is local and caching, and it prevents thrashing the resolver no matter what the local configuration is.  The data space required is surprisingly small.

Nov 06  13:50:29  mod_honeypot 2.0.4
Nov 06  13:50:29    Data block 7764 bytes
Nov 06  13:50:29  mod_spamhaus 2.0.3
Nov 06  13:50:29    Data block 5432 bytes
Nov 06  13:50:29  mod_torcheck 4.0.0
Nov 06  13:50:29    Data block 5140 bytes
Nov 06  13:50:29  mod_efnetrbl 1.0.3
Nov 06  13:50:29    Data block 5432 bytes
Nov 06  13:50:29  mod_proxycheck 1.0.2
Nov 06  13:50:29    Data block 5156 bytes
Nov 06  13:50:29  mod_dronebl 1.0.A
Nov 06  13:50:29    Data block 5156 bytes



There is indeed such a service for Tor exits, implemented by the Tor Project as a DNSRBL.  It operates somewhat similarly to the Spamhaus and Project Honeypot DNSRBLs.  The documentation is at:

https://2019.www.torproject.org/projects/tordnsel.html.en

and that service is the one used by mod_torcheck.

But I fear we have wandered far afield from the original question.  If you would like to continue this discussion, and I am certainly glad to do that, I suggest we continue via PMs.