This blog post covers how to work around a small but annoying issue you may hit when upgrading Kubernetes/Azure AKS: the upgrade changes the default DNS server from kube-dns to coredns.

Scenario:

  • You are running a Kubernetes or AKS cluster that uses kube-dns (v1.10 or earlier)
  • You are running in a hybrid cloud scenario, or connecting to on-premise servers that require private hostname resolution
  • You are upgrading Kubernetes or AKS and coredns is the default DNS server (v1.11 or later)
  • Your cluster is configured in advanced networking mode (e.g. Azure CNI): you have set up your VNET and subnets and configured DNS servers yourself to support the hybrid cloud or on-premise connectivity scenario

Symptoms:

  • After upgrading, you experience random DNS resolution errors like this:
curl: (6) Could not resolve host: service.domain.corp

Context

In our latest project, we had two Azure Kubernetes Service (AKS) clusters: one for production, and one for non-production. The project started almost a year ago; at that time, we created an AKS v1.10 cluster.

These clusters were connected from Azure to an on-premise datacenter through ExpressRoute, where several services that we needed to integrate with were hosted.

At some point, Microsoft retired AKS v1.9, and our project's next release stage was close. So we reviewed the AKS and Kubernetes release notes, then planned and executed a cluster upgrade, starting with our non-production one.

Upgrading a cluster with Azure Kubernetes Service is quite straightforward: you log in to the Azure portal, go to the relevant screen and click a button. Alternatively, use the command line (e.g. the Azure CLI). Azure will automatically do the magic for you: create new nodes, move the workloads from the old nodes to the new ones, tear down the old nodes... and done. Cluster upgraded without business interruption!

Upgrading AKS is quite straightforward
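For reference, the CLI route looks roughly like this (the resource group, cluster name and target version below are placeholders; check the available versions first):

# List the Kubernetes versions this cluster can be upgraded to
az aks get-upgrades --resource-group my-resource-group --name my-aks-cluster --output table

# Trigger the upgrade to the chosen version
az aks upgrade --resource-group my-resource-group --name my-aks-cluster --kubernetes-version 1.11.9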

However...

During our validation process in our non-production cluster, prior to the production rollout, we found some errors in our .NET Core workloads, pointing to a DNS resolution issue:

System.ServiceModel.EndpointNotFoundException: 
There was no endpoint listening at https://service.domain.corp that could accept the message. 
This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details. 
---> System.Net.Http.HttpRequestException: No such device or address 
---> System.Net.Sockets.SocketException: No such device or address
.NET Core "No such device or address" error usually means "I cannot resolve a DNS entry"

The problem was more evident under heavy-load processes, those making multiple DNS resolutions per second. But it looked like the problem only affected on-premise hostnames, not public ones.

Given that coredns was part of the AKS upgrade, we started a troubleshooting process that involved this Kubernetes component, amongst other company-wide checks.

Troubleshooting

As the first step, we logged into one of our non-production cluster pods and started digging in to confirm our suspicions:
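Attaching to a pod is a one-liner with kubectl; the pod name below is the one from our environment, and the namespace is an assumption, so adjust both to your case:

kubectl exec -it api-844474646d-cbghx --namespace app -- /bin/bash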

root@api-844474646d-cbghx:/app# curl https://service.domain.corp
<html>
<head>
...
</head>
<body>
</body>
...
</html>

root@api-844474646d-cbghx:/app# curl https://service.domain.corp
curl: (6) Could not resolve host: service.domain.corp

root@api-844474646d-cbghx:/app# curl https://service.domain.corp
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
...
</head>
<body>
</body>
...
</html>
We confirmed we were experiencing random DNS errors
Confirmed: random DNS resolution errors were taking place, and they were really easy to reproduce.

In this scenario, we opted to:

  1. Turn coredns logs on: It is quite straightforward as well if you have access to the cluster via kubectl. The only thing you need is this YAML file:
apiVersion: v1
data:
  log.override: |
    log
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: coredns-custom
  namespace: kube-system
Important to note: In AKS, you can't edit the "coredns" yaml configuration, but you have to always modify coredns-custom. The degree of personalization of coredns in AKS is limited. And not well documented at all.
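Once the ConfigMap is ready, apply it and recycle the coredns pods so they pick up the new configuration (assuming you saved the YAML above as coredns-custom.yaml):

kubectl apply -f coredns-custom.yaml

# coredns pods in AKS still carry the k8s-app=kube-dns label, so recycling them by selector works
kubectl delete pod --namespace kube-system --selector k8s-app=kube-dns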
  2. Use kubectl logs or the VSCode Kubernetes extension to open and review your coredns logs.
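In our case, grabbing the logs from the command line was something like:

kubectl logs --namespace kube-system -l k8s-app=kube-dns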
  3. Attach to one of your pods and execute some DNS resolution actions, including nslookup and curl. In our case, some of the executions were loops to put pressure on the DNS and networking components:
for i in {1..1000}; do nslookup google.com; done
for i in {1..1000}; do nslookup service.domain.corp; done
for i in {1..1000}; do curl https://service.domain.corp; done
  • Google nslookup resolution worked fine.
  • On-premise nslookup and curl failed most of the time
  4. Review the coredns logs:
10.107.13.170:54760 - [10/May/2019:15:24:35 +0000] 49217 "A IN service.domain.corp.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd,ra 153 0.000166801s
10.107.13.170:54760 - [10/May/2019:15:24:35 +0000] 23972 "AAAA IN service.domain.corp.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd,ra 153 0.0000866s
10.107.13.170:33435 - [10/May/2019:15:24:35 +0000] 20599 "A IN service.domain.corp.cluster.local. udp 56 false 512" NXDOMAIN qr,rd,ra 149 0.000054701s
10.107.13.170:33435 - [10/May/2019:15:24:35 +0000] 45135 "AAAA IN service.domain.corp.cluster.local. udp 56 false 512" NXDOMAIN qr,rd,ra 149 0.0000365s
10.107.13.170:41076 - [10/May/2019:15:24:35 +0000] 46566 "AAAA IN service.domain.corp.app.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,rd,ra 160 0.0001446s
10.107.13.170:41076 - [10/May/2019:15:24:35 +0000] 59248 "A IN service.domain.corp.app.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,rd,ra 160 0.000213701s
10.107.13.170:32916 - [10/May/2019:15:24:35 +0000] 10920 "AAAA IN service.domain.corp.reddog.microsoft.com. udp 63 false 512" NXDOMAIN qr,rd,ra 144 0.002197313s
10.107.13.170:32916 - [10/May/2019:15:24:35 +0000] 65528 "A IN service.domain.corp.reddog.microsoft.com. udp 63 false 512" NXDOMAIN qr,rd,ra 144 0.002216013s
10.107.13.170:56515 - [10/May/2019:15:24:35 +0000] 24011 "A IN service.domain.corp. udp 42 false 512" NXDOMAIN qr,rd,ra 118 0.000069101s
10.107.13.170:56515 - [10/May/2019:15:24:35 +0000] 40180 "AAAA IN service.domain.corp. udp 42 false 512" NOERROR qr,rd,ra 155 0.000064201s
Extract of a successful curl DNS resolution. NOERROR = domain resolved
10.107.13.170:52419 - [10/May/2019:15:29:02 +0000] 47429 "AAAA IN service.domain.corp.app.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,rd,ra 160 0.000111701s
10.107.13.170:52419 - [10/May/2019:15:29:02 +0000] 60834 "A IN service.domain.corp.app.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,rd,ra 160 0.000188601s
10.107.13.170:52582 - [10/May/2019:15:29:02 +0000] 31129 "AAAA IN service.domain.corp.cluster.local. udp 56 false 512" NXDOMAIN qr,rd,ra 149 0.000097901s
10.107.13.170:52582 - [10/May/2019:15:29:02 +0000] 19637 "A IN service.domain.corp.cluster.local. udp 56 false 512" NXDOMAIN qr,rd,ra 149 0.000289502s
10.107.13.170:47075 - [10/May/2019:15:29:02 +0000] 14551 "AAAA IN service.domain.corp.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,rd,ra 153 0.000152601s
10.107.13.170:47075 - [10/May/2019:15:29:02 +0000] 42635 "A IN service.domain.corp.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,rd,ra 153 0.000229001s
10.107.13.170:52357 - [10/May/2019:15:29:02 +0000] 5554 "AAAA IN service.domain.corp.reddog.microsoft.com. udp 63 false 512" NXDOMAIN qr,rd,ra 144 0.000050401s
10.107.13.170:52357 - [10/May/2019:15:29:02 +0000] 47394 "A IN service.domain.corp.reddog.microsoft.com. udp 63 false 512" NXDOMAIN qr,rd,ra 144 0.0000831s
10.107.13.170:38727 - [10/May/2019:15:29:02 +0000] 24837 "AAAA IN service.domain.corp. udp 42 false 512" NXDOMAIN qr,rd,ra 118 0.000046801s
10.107.13.170:38727 - [10/May/2019:15:29:02 +0000] 217 "A IN service.domain.corp. udp 42 false 512" NXDOMAIN qr,rd,ra 118 0.0000368s
Extract of a failed curl DNS resolution. Our on-prem domain was not found!

As you can see above, curl tries to resolve both A and AAAA entries for all the search domains defined in our pods. In other words, to resolve "service.domain.corp", curl makes 10 DNS queries against coredns. However, coredns responds properly with either "NXDOMAIN" (the record does not exist) or "NOERROR" (record found). So the problem is elsewhere.
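The pattern makes sense once you look at the pod's /etc/resolv.conf: with four search domains and the Kubernetes default of ndots:5, a single lookup of a non-fully-qualified name fans out into five candidate names, each queried for both A and AAAA records, hence the 10 queries. Ours looked roughly like this (the nameserver IP is illustrative; it is the cluster DNS service address):

root@api-844474646d-cbghx:/app# cat /etc/resolv.conf
nameserver 10.0.0.10
search app.svc.cluster.local svc.cluster.local cluster.local reddog.microsoft.com
options ndots:5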

Was the on-premise DNS server responding incorrectly? Why was the DNS resolution error random?

The difference between coredns and kube-dns

This strange behaviour led us to think about one of the very first decisions we made during the environment setup: we had configured 4 DNS servers at VNET level:

  • 2 x on-premise corporate DNS servers
  • 2 x Google DNS servers

VNET DNS setup with 4 DNS servers: 2 on-prem and 2 from Google

At the very beginning, there was no cloud to on-premise connectivity, so we temporarily relied on manually configured Google DNS servers for public hostname resolution.

Eventually, the on-premise DNS servers became accessible and we switched the order of precedence in the VNET DNS list, which caused kube-dns to use the corporate, private ones. And kube-dns was happy with this: DNS resolution worked as we expected. It also looked like the Google servers were acting as mere "backups" in case the primary corporate DNS server failed.

However, it looks like coredns behaves in a different way. Given all the evidence we collected, the only rational explanation for this random DNS resolution error is that, under high-load scenarios, coredns uses all the DNS servers defined at VNET level. Some queries were probably going to the on-prem servers; others, to Google. And obviously, Google doesn't know how to resolve your private hostnames :)

In the end, we removed the two Google DNS servers from the VNET, restarted the cluster and ran our checks.
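For reference, trimming the VNET DNS list down to the two corporate servers is a single Azure CLI call (resource names and IPs below are placeholders):

az network vnet update --resource-group my-resource-group --name my-vnet --dns-servers 10.20.0.4 10.20.0.5

Keep in mind the nodes only pick up a VNET DNS change after they are restarted, which is why we restarted the cluster as well.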

Problem solved. No more NXDOMAIN errors.

Conclusions

  • Looks like coredns uses the DNS server list in a different way than kube-dns did. If you have to resolve both public and private hostnames, always check that only private DNS servers are on the list, or find the right configuration to route private DNS queries to your private premises (see the sketch at the end of this post).
  • Always read the release notes in detail. It is important to know what changed in order to know where to start looking!
  • Even though upgrading AKS is quite straightforward, it is always a good idea to have at least two clusters. Start by deploying the upgrade in one, thoroughly test everything after upgrading and, when you're ready, go for the production one.
  • You can't roll back an AKS upgrade, so the only way to go back is tearing down the cluster and creating a new one. Automate your infrastructure provisioning as much as you can to minimize business disruption in worst-case scenarios like this one.
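About finding "the right configuration": if removing the public resolvers from the VNET is not an option for you, coredns itself can do conditional forwarding. AKS lets you add extra server blocks through coredns-custom (data keys ending in .server), so queries for the corporate zone can be sent straight to the on-premise DNS servers. This is a minimal sketch we did not need in the end, assuming that feature is available in your AKS version and using placeholder DNS server IPs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  onprem.server: |
    domain.corp:53 {
        errors
        cache 30
        forward . 10.20.0.4 10.20.0.5
    }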