About EDNS Fallback

EDNS fallback is briefly described in RFC 6891: a requestor can detect and cache whether the remote end supports EDNS(0). This behavior avoids fallback delays in the future. According to one of ISC's documents, EDNS fallback in BIND proceeds as follows:

1) Query with EDNS, advertising size 4096, DO (DNSSEC OK) bit set
2) If no response, retry with EDNS, size 512, DO bit set
3) If no response, retry without EDNS (no DNSSEC, and buffer size maximum 512)
4) If no response, retry the query over TCP
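
The four steps above can be written down as a small retry ladder. The following is only a minimal Python sketch for illustration, not BIND's implementation: the names Attempt, FALLBACK_LADDER, resolve_with_fallback and plain_dns_only are hypothetical, the network is replaced by a callback, and the TCP step is assumed to be sent without EDNS because the list does not say.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Attempt:
    transport: str   # "udp" or "tcp"
    edns: bool       # whether an OPT record is sent
    bufsize: int     # advertised UDP payload size (512 is the plain-DNS limit)
    do_bit: bool     # DNSSEC OK flag, only meaningful with EDNS


# The four steps from the list above, in order.
FALLBACK_LADDER: List[Attempt] = [
    Attempt("udp", True, 4096, True),
    Attempt("udp", True, 512, True),
    Attempt("udp", False, 512, False),
    # Assumption: the TCP retry is modeled without EDNS; the list does not say.
    Attempt("tcp", False, 512, False),
]


def resolve_with_fallback(
    send: Callable[[Attempt], Optional[str]],
) -> Optional[Tuple[int, Attempt, str]]:
    """Walk the ladder until `send` returns a reply; None means every step timed out."""
    for step, attempt in enumerate(FALLBACK_LADDER, start=1):
        reply = send(attempt)
        if reply is not None:
            return step, attempt, reply
    return None


def plain_dns_only(attempt: Attempt) -> Optional[str]:
    """Hypothetical network path that silently drops every EDNS query."""
    return None if attempt.edns else "answer"


result = resolve_with_fallback(plain_dns_only)
print(result[0], result[1].do_bit)  # step that succeeded, and whether DO was set
```

On a path that drops every EDNS query, the first reply arrives at step 3, and by then the DO bit is gone, so the answer carries no DNSSEC records.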

The merit of EDNS fallback is that it identifies the capability of the remote server and shortens delays by reducing retries. But if an unreliable network causes packet loss, or if DNS responses are manipulated, servers that do support EDNS can be marked as EDNS-incapable, which results in SERVFAILs.
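
To make the risk concrete, here is a toy Python model of a cached per-server EDNS flag. It is a sketch under stated assumptions, not BIND's address database: EdnsCache, honest_server, forged_reply and validate are hypothetical names, the server name is a placeholder, and the marking rule (an EDNS query answered without an OPT record marks the server EDNS-incapable) is a deliberate simplification.

```python
class EdnsCache:
    """Per-server memory of EDNS capability (hypothetical, not BIND's ADB)."""

    def __init__(self) -> None:
        self.no_edns = set()

    def build_query(self, server: str, qname: str, qtype: str) -> dict:
        # EDNS (and with it the DO bit) is used unless the server is flagged.
        use_edns = server not in self.no_edns
        return {"server": server, "qname": qname, "qtype": qtype,
                "edns": use_edns, "do": use_edns}

    def note_response(self, query: dict, response: dict) -> None:
        # Simplified rule: an EDNS query answered without an OPT record
        # marks the server as EDNS-incapable.
        if query["edns"] and not response["has_opt"]:
            self.no_edns.add(query["server"])


def honest_server(query: dict) -> dict:
    """A signed zone's server: returns RRSIGs only when the DO bit is set."""
    return {"has_opt": query["edns"], "rrsig": query["do"]}


def forged_reply(query: dict) -> dict:
    """An injected on-path answer: no OPT record, no signatures."""
    return {"has_opt": False, "rrsig": False}


def validate(response: dict) -> str:
    # A validating resolver cannot accept unsigned data from a signed zone.
    return "OK" if response["rrsig"] else "SERVFAIL"


cache = EdnsCache()
server = "tld-ns.example"  # placeholder name for a TLD name server

# 1. One manipulated lookup flips the cached capability flag.
q1 = cache.build_query(server, "www.facebook.com", "DS")
cache.note_response(q1, forged_reply(q1))

# 2. A later, untouched lookup to the same server now goes out without DO.
q2 = cache.build_query(server, "example.com", "DS")
status = validate(honest_server(q2))
print(q2["do"], status)
```

In this model a single forged reply is enough: the second query is answered honestly, yet it fails validation because the resolver itself no longer asks for signatures.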

A failure case observed in a Yeti resolver

All Yeti resolvers are required to be DNSSEC-aware. One resolver in China running BIND 9.11.0-P2 (call it R1) was reported to return many SERVFAIL responses due to EDNS fallback. From the debug information, we found that the resolver went through the following sequence:

1) A client tries to resolve www.facebook.com via R1.
2) R1 gets a (modified) response and starts DNSSEC validation for facebook.com.
3) R1 queries one of the .com name servers for the DS record of www.facebook.com, but gets a modified response: www.facebook.com IN A x.x.x.x.
4) R1 tries all the other .com name servers and gets modified answers from them too.
5) R1 falls back to querying the DS record with an EDNS0 buffer size of 512 bytes, but still gets the modified response.
6) R1 falls back again, querying the DS record of www.facebook.com without the EDNS0 option, and receives the same modified response.
7) R1 cannot validate www.facebook.com, and the client gets SERVFAIL.

The damage then spreads to normal domains. When a client tries to resolve another, unaffected domain, R1 gets the correct response for the A or AAAA record, but during DNSSEC validation it sends the DS query without the EDNS0 option, so validation fails and the client gets SERVFAIL. The following log shows this process for a query for dnsv6lab.net issued after www.facebook.com.

15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: starting
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: attempting negative response validation
15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: starting
15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: attempting insecurity proof
15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: checking existence of DS at 'net'
15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: insecurity proof failed
15-Feb-2017 13:24:01.203 dnssec: info:   validating net/SOA: got insecure response; parent indicates it should be secure
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: in authvalidated
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: authvalidated: got insecurity proof failed
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: resuming nsecvalidate
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: nonexistence proof(s) not found
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: checking existence of DS at 'net'
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: checking existence of DS at 'dnsv6lab.net'
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: continuing validation would lead to deadlock: aborting validation
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: deadlock found (create_fetch)
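
When sifting through many such lines, it can help to reduce the log to the last message seen for each validation target. The snippet below is a minimal Python sketch that assumes every line has the shape shown in the log above; LINE_RE, parse_validator_log and last_message_per_target are hypothetical helper names, and other BIND log categories or formats are not handled.

```python
import re

# Matches lines like:
# 15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: starting
LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) dnssec: (?P<level>debug \d+|info):\s+"
    r"validating (?P<name>[^/\s]+)/(?P<rtype>[^:\s]+): (?P<msg>.*)$"
)


def parse_validator_log(text):
    """Return (name, rtype, level, message) tuples for every matching line."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m["name"], m["rtype"], m["level"], m["msg"]))
    return events


def last_message_per_target(events):
    """Map 'name/type' to the final message logged for that target."""
    last = {}
    for name, rtype, _level, msg in events:
        last[f"{name}/{rtype}"] = msg
    return last


sample = """\
15-Feb-2017 13:24:01.203 dnssec: debug 3:   validating net/SOA: insecurity proof failed
15-Feb-2017 13:24:01.203 dnssec: info:   validating net/SOA: got insecure response; parent indicates it should be secure
15-Feb-2017 13:24:01.203 dnssec: debug 3: validating dnsv6lab.net/DS: deadlock found (create_fetch)
"""

for target, msg in last_message_per_target(parse_validator_log(sample)).items():
    print(target, "->", msg)
```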

Evidently BIND 9 gets confused about EDNS support, and this breaks later DNSSEC lookups. The author's intuition is that all BIND 9 instances deployed in China may be affected by this issue, which would explain the low penetration of DNSSEC, and the complaints about DNSSEC, in that region. More generally, this bug may leave BIND 9 vulnerable to an on-path DoS attack against DNSSEC-aware resolvers.

Patch for this issue

After locating the problem, we contacted ISC and received the following patch to fix it.

diff --git a/lib/dns/resolver.c b/lib/dns/resolver.c
index f935a67..5ca9c47 100644
--- a/lib/dns/resolver.c
+++ b/lib/dns/resolver.c
@@ -8145,6 +8145,7 @@ resquery_response(isc_task_t *task, isc_event_t *event) {
 		dns_adb_changeflags(fctx->adb, query->addrinfo,
 				    DNS_FETCHOPT_NOEDNS0,
 				    DNS_FETCHOPT_NOEDNS0);
+#if 0
 	} else if (opt == NULL && (message->flags & DNS_MESSAGEFLAG_TC) == 0 &&
 		   !EDNSOK(query->addrinfo) &&
 		   (message->rcode == dns_rcode_noerror ||
@@ -8169,6 +8170,7 @@ resquery_response(isc_task_t *task, isc_event_t *event) {
 		dns_adb_changeflags(fctx->adb, query->addrinfo,
 				    DNS_FETCHOPT_NOEDNS0,
 				    DNS_FETCHOPT_NOEDNS0);
+#endif
 	}
 
 	/*

Conclusion

EDNS fallback was introduced with good intentions, but it can produce false positives and collateral damage when a temporary network failure or malicious manipulation occurs. When the name servers of TLDs such as .com and .net are marked EDNS-incapable, the result is a disaster for validating resolvers.

One intuitive idea is to stop marking TLD name servers as EDNS-incapable, given that 7040 of 7060 name servers (99.72%) support EDNS. Alternatively, fallback could be turned off for DS queries (the queries sent to the parent).
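
Both ideas amount to a guard in front of the code that flips the cached flag. The following Python sketch shows one possible shape for such a guard; it is not a patch for BIND, and label_count and may_mark_edns_incapable are hypothetical names. It combines the two proposals in one function for brevity, and it also exempts the root zone, which is an added assumption.

```python
def label_count(zone: str) -> int:
    """Number of labels in a domain name; the root zone "." has zero."""
    return len([label for label in zone.strip(".").split(".") if label])


def may_mark_edns_incapable(server_zone: str, qtype: str) -> bool:
    """Decide whether a failed EDNS exchange may flip the server's cached flag.

    server_zone: the zone the queried name server is authoritative for.
    qtype: the query type of the exchange that triggered the fallback.
    """
    if label_count(server_zone) <= 1:
        # TLD servers (and, by assumption, root servers) are never downgraded.
        return False
    if qtype.upper() == "DS":
        # DS queries go to the parent and are needed for validation.
        return False
    return True


for zone, qtype in [("com.", "A"), ("example.com.", "DS"), ("example.com.", "A")]:
    print(zone, qtype, may_mark_edns_incapable(zone, qtype))
```

With this guard, a forged or lost reply from a .com server could still fail the one lookup it affects, but it could no longer disable DNSSEC for every later query sent to that server.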