CoreDNS Health Checks Explained

Cloud Native Lab (云原生实验室) · WeChat official account · 2022-07-29 10:04

❝
This article is reposted from TinyChen's blog. Original: https://tinychen.com/20220728-dns-11-coredns-08-healthcheck/. Copyright belongs to the original author.

This article covers the two health check plugins built into CoreDNS, health and ready: how to use them and which scenarios each one suits.

The health plugin

By default, the health plugin^[1] serves a health status endpoint at the /health path on port 8080. When CoreDNS is running normally, it returns HTTP status code 200 with the body OK.

$ curl -v http://10.31.53.1:8080/health
* About to connect() to 10.31.53.1 port 8080 (#0)
*   Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8080
> Accept: */*
>
<
* Connection #0 to host 10.31.53.1 left intact
OK

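One notable extra is the lameduck option of the health plugin, which delays the exit of the coredns process by the configured duration. For example, with lameduck 10s set, coredns waits 10s after it is told to stop before the process actually exits.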

health [ADDRESS] {
    lameduck DURATION
}

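Pay particular attention to one thing: if lameduck is used in several server blocks, the delays add up. For example, if lameduck 10s is set in 10 server blocks, coredns waits 10*10=100s after it is told to stop before the process exits.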
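
The stacking arithmetic above can be checked with a few lines of Python. This is only a sketch of the behaviour as the article describes it: the helper names `parse_duration` and `total_shutdown_delay` are made up for illustration, and the parser handles only simple Go-style durations such as 10s or 1m30s, not everything CoreDNS itself accepts.

```python
import re

# Seconds per unit for the small subset of Go duration units handled here.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a simple Go-style duration such as '10s' or '1m30s' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    # Reject input that the pattern did not consume completely.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


def total_shutdown_delay(lameduck_values):
    """Sum the lameduck value of every server block (assumes the delays stack)."""
    return sum(parse_duration(value) for value in lameduck_values)


# Ten server blocks, each with "lameduck 10s".
print(total_shutdown_delay(["10s"] * 10))
```

If a long shutdown delay is not wanted, this suggests setting lameduck in a single server block only and keeping the value small.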

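There is also a minor issue: once the health plugin is enabled, its port accumulates a fairly large number of TIME_WAIT connections. The current suspicion is that the plugin queries its own port as part of its checks, and those requests leave the TIME_WAIT connections behind.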

$ netstat -nt | grep 8080 | grep -c TIME_WAIT
61
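
Note that grep 8080 matches any line containing that string, for example port 18080 or an ephemeral port such as 48080. A slightly stricter count can be sketched in Python by comparing the port in the address columns. The function name `count_time_wait` is hypothetical, and the parser assumes the usual Linux `netstat -nt` column layout (proto, recv-q, send-q, local address, foreign address, state).

```python
def count_time_wait(netstat_output, port):
    """Count TCP sockets in TIME_WAIT whose local or foreign port equals `port`."""
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # Either side may hold the TIME_WAIT socket, so check both addresses.
        if state == "TIME_WAIT" and (local.endswith(suffix) or foreign.endswith(suffix)):
            count += 1
    return count
```

To run it against a live host, the text can be captured with `subprocess.run(["netstat", "-nt"], capture_output=True, text=True).stdout` and passed in together with the port number.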

The ready plugin

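The ready plugin^[2] is somewhat similar to the health plugin. By default it reports the state of the CoreDNS server at the /ready path on port 8181, and in the normal case it also returns HTTP status code 200 with the body OK.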

$ curl -v http://10.31.53.1:8181/ready
* About to connect() to 10.31.53.1 port 8181 (#0)
*   Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8181 (#0)
> GET /ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8181
> Accept: */*
>
<
* Connection #0 to host 10.31.53.1 left intact
OK

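When the configuration of one of the plugins in CoreDNS is broken, the endpoint instead returns HTTP status code 503 together with the name of the plugin that has the problem.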

$ curl -vv http://10.31.53.1:8181/ready
* About to connect() to 10.31.53.1 port 8181 (#0)
*   Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8181 (#0)
> GET /ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8181
> Accept: */*
>
<
* Connection #0 to host 10.31.53.1 left intact
kubernetes

Meanwhile, the health plugin's endpoint still answers with status code 200 and OK:

$ curl -v http://10.31.53.1:8080/health
* About to connect() to 10.31.53.1 port 8080 (#0)
*   Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8080
> Accept: */*
>
<
* Connection #0 to host 10.31.53.1 left intact
OK

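The systemd service status makes it clear that coredns is running at this point while the kubernetes plugin is not working properly. This illustrates the difference well: the health plugin mainly looks at whether the coredns process itself is running, whereas the ready plugin also checks whether the individual plugins are working.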

$ systemctl status coredns
● coredns.service - CoreDNS
   Loaded: loaded (/usr/lib/systemd/system/coredns.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-07-28 11:52:50 CST; 8min ago
     Docs: https://coredns.io/manual/toc/
 Main PID: 14478 (coredns)
    Tasks: 13
   Memory: 23.8M
   CGroup: /system.slice/coredns.service
           └─14478 /home/coredns/sbin/coredns -dns.port=53 -conf /home/coredns/conf/corefile

Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/reload: Running configuration MD5 = e3edb2bb003af1e51a1b82bfaebba8f4
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: CoreDNS-1.8.6
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: linux/amd64, go1.17.1, 13a9191
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] 127.0.0.1:53443 - 17600 "HINFO IN 6988510158354025264.1665891352749413348.cali-cluster.tclocal. udp 78 false 512" NXDOMAIN qr,aa,rd 192 0.000385901s
Jul 28 11:57:05 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] Reloading
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] 127.0.0.1:41957 - 46173 "HINFO IN 3749714491109172199.3469953470964448055.cali-cluster.tclocal. udp 78 false 512" SERVFAIL qr,aa,rd 192 0.00012492s
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/reload: Running configuration MD5 = 2365432f92773a3434ec9ab810392378
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] Reloading complete
Jul 28 11:59:49 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/ready: Still waiting on: "kubernetes"
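
The comparison above can be turned into a small script. The following is a minimal sketch using only the Python standard library. The function names and the three state labels are chosen for illustration and are not part of CoreDNS, and the ports are the defaults mentioned in this article.

```python
import urllib.error
import urllib.request

# Ignore any proxy settings from the environment; health checks should
# connect to the server directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=2.0):
    """Return (status_code, body); status_code is None if the request failed."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode().strip()
    except urllib.error.HTTPError as err:
        # Non-2xx answers (such as 503 from /ready) arrive as HTTPError;
        # the body still names the plugins that are not ready.
        return err.code, err.read().decode().strip()
    except OSError as err:
        # Connection refused, timeout, and similar network errors.
        return None, str(err)


def classify(health_status, ready_status):
    """Map the two status codes to one of three illustrative states."""
    if health_status != 200:
        return "down"
    if ready_status != 200:
        return "alive-not-ready"
    return "serving"


def check_coredns(host, health_port=8080, ready_port=8181):
    """Probe both endpoints and return (state, body of the /ready response)."""
    health_status, _ = probe(f"http://{host}:{health_port}/health")
    ready_status, ready_body = probe(f"http://{host}:{ready_port}/ready")
    return classify(health_status, ready_status), ready_body
```

Calling `check_coredns("10.31.53.1")` in the situation shown above would be expected to report the alive-not-ready state along with the plugin name returned by /ready, since /health answers 200 while /ready answers 503.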

Summary

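The comparison above shows that either plugin is sufficient if the goal is only to check whether the program itself is running. In the default CoreDNS deployment on k8s, however, the manifest shows that the two serve different purposes. The health plugin is used for the livenessProbe, which checks whether the pod is running properly and whether it needs to be destroyed and recreated. The ready plugin is used for the readinessProbe, which checks whether coredns can be marked ready and start serving traffic.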

livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
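
To get a feel for what these numbers mean, the sketch below estimates how long a failure can go unnoticed before the kubelet acts on it. It is a rough back-of-the-envelope model, not kubelet's exact scheduling: it assumes the failure begins right after a successful probe, so that failureThreshold full periods pass and the last probe runs into its timeout. The `Probe` class and its method name are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Subset of Kubernetes probe fields, with the Kubernetes default values."""
    period_seconds: int = 10
    failure_threshold: int = 3
    timeout_seconds: int = 1
    initial_delay_seconds: int = 0

    def worst_case_detection_seconds(self):
        # failure_threshold consecutive failures, one per period, plus the
        # time the final probe may spend waiting for its timeout.
        return self.period_seconds * self.failure_threshold + self.timeout_seconds


# Values taken from the livenessProbe and readinessProbe shown above.
liveness = Probe(period_seconds=10, failure_threshold=5,
                 timeout_seconds=5, initial_delay_seconds=60)
readiness = Probe(period_seconds=10, failure_threshold=3, timeout_seconds=1)

print("liveness :", liveness.worst_case_detection_seconds(), "s")
print("readiness:", readiness.worst_case_detection_seconds(), "s")
```

Under this model the readiness probe reacts faster (about 31s) than the liveness probe (about 55s), so a CoreDNS pod with a failing plugin is taken out of the Service backends well before a restart would be considered.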

❝
For more on configuring liveness and readiness probes, see the official Kubernetes documentation^[3]:

The kubelet^[4] uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don’t interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.