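Troubleshooting a Kubernetes Cluster
Copyright notice: This is an original article by the author and may not be reproduced without the author's permission. To reproduce it, please contact the author and credit the source and author.
CSDN blog address of this article: https://blog.csdn.net/huwh_/article/details/71308301
1. Viewing system Events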
- kubectl describe pod <PodName> --namespace=<NAMESPACE>
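This command shows the Pod's configuration at creation time, its status, and its most recent Events, which are useful for troubleshooting. For example, when a Pod is stuck in Pending, the Events usually reveal the cause. Common causes are:
- No Node is available for scheduling
- Resource quota management is enabled and the Pod's target node happens to have no resources available
- The image is still being downloaded (the pull is taking too long)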
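As a rough aid for reading those Events, the sketch below sorts event text into the causes listed above. pending_cause is a hypothetical helper, and the strings it matches (FailedScheduling, ErrImagePull, ImagePullBackOff, Pulling) are event reasons and messages used by recent Kubernetes releases; older clusters may word them differently, so treat the result as a hint only.

```shell
# Classify why a Pod is Pending from its event text, read on stdin.
# Assumption: the matched strings are the reasons/messages used by recent
# Kubernetes releases; anything else falls through to "unknown".
pending_cause() {
  events=$(cat)
  case $events in
    *FailedScheduling*)
      echo "scheduling: no node could be found for the Pod" ;;
    *ErrImagePull*|*ImagePullBackOff*|*"Failed to pull image"*)
      echo "image: the pull is failing" ;;
    *Pulling*)
      echo "image: the pull is still in progress" ;;
    *)
      echo "unknown: read the events manually" ;;
  esac
}
```

Against a live cluster it could be used as kubectl describe pod <PodName> --namespace=<NAMESPACE> | pending_cause.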
kubectl describe can also inspect other k8s objects: Node, RC, Service, Namespace, and Secrets.
1.1. Pod
kubectl describe pod <PodName> --namespace=<NAMESPACE>
1.2. NODE
kubectl describe NODE
- Name: runtime2.foshan2.wae.haplat.net
- Labels: kubernetes.io/hostname=runtime2.foshan2.wae.haplat.net,namespace/test=true
- CreationTimestamp: Fri, 01 Apr 2016 17:34:16 +0800
- Phase:
- Conditions:
- Type Status LastHeartbeatTime LastTransitionTime Reason Message
- ──── ────── ───────────────── ────────────────── ────── ───────
- Ready True Sat, 08 Apr 2017 14:15:41 +0800 Sun, 26 Mar 2017 08:58:04 +0800 KubeletReady kubelet is posting ready status
- OutOfDisk False Sat, 08 Apr 2017 14:15:41 +0800 Fri, 01 Apr 2016 17:34:16 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
- Addresses: 221.5.100.100,221.5.100.100
- Capacity:
- memory: 134975102976
- pods: 40
- cpu: 32
- System Info:
- Machine ID: 120de474f77e4d75a670a74eea6d1e45
- System UUID: 1C929431-8D94-11E1-BD1D-001E6744D094
- Boot ID: 2a154beb-86e4-40e3-acce-4f83b1ea5ed2
- Kernel Version: 3.10.0-229.20.1.el7.x86_64
- OS Image: CentOS Linux 7 (Core)
- Container Runtime Version: docker://1.8.2-el7.centos
- Kubelet Version: v1.1.1-wae2-12
- Kube-Proxy Version: v1.1.1-wae2-12
- ExternalID: runtime2.foshan2.wae.haplat.net
- Non-terminated Pods: (6 in total)
- Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
- ───────── ──── ──────────── ────────── ─────────────── ─────────────
- acp acp-ui-1-4-0-16j3a 4 (12%) 4 (12%) 8589934592 (6%) 8589934592 (6%)
- acp acp-ui-1-4-1-kou20 4 (12%) 4 (12%) 8589934592 (6%) 8589934592 (6%)
- acp acp-ui-api-1-4-0-u3w4f 4 (12%) 4 (12%) 8589934592 (6%) 8589934592 (6%)
- cloud-eye cloud-eye-dim2-1-6-6-1-9g16i 4 (12%) 4 (12%) 34359738368 (25%) 34359738368 (25%)
- cloud-wst cloud-wst-ceba-1-12-0-1-hjdk2 2 (6%) 2 (6%) 8589934592 (6%) 8589934592 (6%)
- cms-fd cms-fd-schedule-3-9-1-3-wzqa1 8 (25%) 8 (25%) 21474836480 (15%) 21474836480 (15%)
- Allocated resources:
- (Total limits may be over 100%, i.e., overcommitted. More info: https://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)
- CPU Requests CPU Limits Memory Requests Memory Limits
- ──────────── ────────── ─────────────── ─────────────
- 26 (81%) 26 (81%) 90194313216 (66%) 90194313216 (66%)
- No events.
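The percentages in the Allocated resources table can be reproduced by hand, which helps when judging how much headroom a node has left. A minimal sketch: pct is a hypothetical helper (not part of kubectl), and the truncation to whole percent is inferred from the figures in the output above rather than from kubectl's documentation.

```shell
# pct USED TOTAL: integer percentage of TOTAL that USED represents (truncated).
pct() {
  echo $(( $1 * 100 / $2 ))
}

# Values copied from the node output above.
pct 26 32                      # CPU requests against 32 cores
pct 90194313216 134975102976   # memory requests against memory capacity, in bytes
```

The two calls print 81 and 66, matching the 26 (81%) and 90194313216 (66%) columns; the CPU total of 26 is also the sum of the per-Pod requests 4 + 4 + 4 + 4 + 2 + 8.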
1.3. RC
kubectl describe rc --namespace=rmp
- [root@node5 ~]# kubectl describe rc --namespace=rmp
- Name: rmp-web-2-15-3-1
- Namespace: rmp
- Image(s): registry.wae.haplat.net/rmp/rmp-web:2.15.3-1
- Selector: app=rmp-web,appVersion=2.15.3-1
- Labels: app=rmp-web,appVersion=2.15.3-1,env=product,zone=foshan2
- Replicas: 1 current / 1 desired
- Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
- No volumes.
- No events.
1.4. NAMESPACE
kubectl describe NAMESPACE
- [root@node5 ~]# kubectl describe NAMESPACE
- Name: acp
- Labels: <none>
- Status: Active
- Resource Quotas
- Resource Used Hard
- --- --- ---
- cpu 24 20
- memory 51539607552 53687091200
- persistentvolumeclaims 0 10
- pods 6 10
- replicationcontrollers 6 10
- resourcequotas 1 1
- secrets 2 10
- services 6 10
- No resource limits.
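In the output above, cpu shows 24 used against a hard limit of 20, so new Pods requesting CPU in this namespace are likely to be rejected. The sketch below picks out such rows from the Resource/Used/Hard table. quota_exhausted is a hypothetical helper, and it assumes the values are plain integers as shown here; rows with unit suffixes are skipped.

```shell
# Print "resource used/hard" for every quota row whose usage has reached
# or exceeded its hard limit. Reads the Resource/Used/Hard table on stdin.
# Assumption: values are plain integers; rows such as "memory 48Gi 50Gi"
# fail the numeric check and are ignored.
quota_exhausted() {
  awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 + 0 >= $3 + 0 { print $1, $2 "/" $3 }'
}
```

A possible invocation is kubectl describe namespace acp | quota_exhausted, which for the table above would report cpu and resourcequotas.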
1.5. Service
kubectl describe Service --namespace=rmp
- [root@node5 ~]# kubectl describe Service --namespace=rmp
- Name: rmp-web-2-15-3-1
- Namespace: rmp
- Labels: app=rmp-web,appVersion=2.15.3-1,waeEnv=product,waeZone=foshan2
- Selector: app=rmp-web,appVersion=2.15.3-1
- Type: ClusterIP
- IP: 10.254.201.163
- Port: port-l7-tcp-80 80/TCP
- Endpoints: 10.0.68.240:80
- Session Affinity: None
- No events.
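2. Viewing container logs
1. View the logs of a given Pod
kubectl logs <pod_name>
kubectl logs -f <pod_name> # stream the logs, similar to tail -f
2. View the logs of the previous (terminated) container instance of a Pod
kubectl logs -p <pod_name>
3. View the logs of a specific container in a Pod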
kubectl logs <pod_name> -c <container_name>
[root@node5 ~]# kubectl logs --help
Print the logs for a container in a pod. If the pod has only one container, the container name is optional.
Usage: kubectl logs [-f] [-p] POD [-c CONTAINER] [flags]
Aliases: logs, log
Examples:
# Return snapshot logs from pod nginx with only one container
$ kubectl logs nginx
# Return snapshot of previous terminated ruby container logs from pod web-1
$ kubectl logs -p -c ruby web-1
# Begin streaming the logs of the ruby container in pod web-1
$ kubectl logs -f -c ruby web-1
# Display only the most recent 20 lines of output in pod nginx
$ kubectl logs --tail=20 nginx
# Show all logs from pod nginx written in the last hour
$ kubectl logs --since=1h nginx
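When a container logs heavily, it can help to narrow the output to lines that look like failures. The filter below is only a sketch: error_lines is a hypothetical helper, and the keywords it matches (error, exception, fatal) are an assumption about what the application writes.

```shell
# Keep only log lines that mention an error, exception or fatal condition
# (case-insensitive) and print the last N of them (default 20).
# Assumption: the application uses these words in its failure messages.
error_lines() {
  grep -iE 'error|exception|fatal' | tail -n "${1:-20}"
}
```

It reads from stdin, so it can be chained after any of the commands above, for example kubectl logs --since=1h nginx | error_lines 50.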
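3. Viewing k8s service logs
3.1. journalctl
On Linux, the Kubernetes services are managed by systemd, and the journal captures each service's output. The logs of a Kubernetes service can therefore be viewed with systemctl status <xxx> or journalctl -u <xxx> -f.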
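The Kubernetes components, and the kinds of problems their logs relate to, are:
- kube-apiserver
- kube-controller-manager: Pod scaling and RC issues
- kube-scheduler: Pod scaling and RC issues
- kubelet: Pod lifecycle issues such as creation and stopping
- etcd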
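To look at all of these components in one pass, the journalctl commands can be generated in a loop. This is a sketch under one loud assumption: the systemd unit names are taken to be identical to the component names, which depends on how the cluster was installed. recent_log_cmds is a hypothetical helper that prints the commands instead of running them, so they can be reviewed first.

```shell
# Component names from the list above. Assumption: the systemd unit names
# match them exactly; adjust for your installation.
K8S_UNITS="kube-apiserver kube-controller-manager kube-scheduler kubelet etcd"

# Print one journalctl command per unit, limited to a recent time window.
# $1 is passed to journalctl --since (default: "1 hour ago").
recent_log_cmds() {
  since=${1:-"1 hour ago"}
  for unit in $K8S_UNITS; do
    printf 'journalctl -u %s --since "%s" --no-pager\n' "$unit" "$since"
  done
}
```

Running recent_log_cmds "10 min ago" | sh on a node would then execute the generated commands one after another.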
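3.2. Log files
Logs can also be written to, and read from, a specified log directory:
- --logtostderr=false: do not write logs to stderr
- --log-dir=/var/log/kubernetes: the directory where log files are stored
- --alsologtostderr=false: when set to true, logs are written to stderr as well as to files
- --v=0: the glog verbosity level
- --vmodule=gfs*=2,test*=4: glog per-module verbosity levels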
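4. Common problems
4.1. Pod stays in Pending
kubectl describe pod <pod_name> --namespace=<NAMESPACE>
Check the Pod's events. The usual causes are:
- The image is being downloaded but cannot be pulled (the pull takes too long) [this is the cause in most cases]
- No Node is available for scheduling
- Resource quota management is enabled and the Pod's target node happens to have no resources available
Solutions:
- Check for network problems between the Pod's host and the image registry; try pulling the image manually on that host
- Delete the Pod instance so that it is rescheduled onto another host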
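4.2. Pod keeps restarting after creation
In the output of kubectl get pods, the Pod's status flips between Running and other states, and its RESTARTS count keeps increasing.
The usual cause is that the container's start command is not a blocking command, so the container exits right after it starts.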
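The symptom can be spotted quickly by filtering on the RESTARTS column. The sketch below is a hypothetical helper, restarting_pods, that reads kubectl get pods output on stdin; it locates the NAME and RESTARTS columns from the header row, and the default threshold of 3 restarts is an arbitrary assumption.

```shell
# Print "name restarts" for Pods whose RESTARTS count is at least $1
# (default 3). Reads `kubectl get pods` output on stdin and finds the
# NAME and RESTARTS columns from the header, so an extra leading
# NAMESPACE column does not break it.
restarting_pods() {
  awk -v min="${1:-3}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") n = i
        if ($i == "RESTARTS") r = i
      }
      next
    }
    r && $r + 0 >= min + 0 { print $n, $r }
  '
}
```

For example, kubectl get pods --namespace=<NAMESPACE> | restarting_pods 5 lists the Pods that have restarted five or more times.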
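Non-blocking commands:
- The command specified by CMD is itself non-blocking
- The service is configured to start in the background
Solutions:
1. Change the command to a blocking (foreground) one, for example: zkServer.sh start-foreground
2. In the Java program's start script, remove the nohup and the & from nohup xxx &, for example: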
nohup $JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main &
becomes:
$JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main
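A common refinement, not part of the original fix, is to start the foreground command with exec so that the service replaces the start script's shell instead of running as its child. The sketch below wraps that in a hypothetical helper, run_foreground; whether it suits a given image is an assumption to verify.

```shell
# Run the given command in the foreground, replacing the current shell.
# When the start script is the container's entrypoint, the service takes
# over the script's PID, receives signals directly, and the container
# lives exactly as long as the service does.
run_foreground() {
  exec "$@"
}

# As the last line of a start script this could look like (not executed here):
#   run_foreground "$JAVA_HOME/bin/java" $JAVA_OPTS -cp "$CLASSPATH" com.cnc.open.processor.Main
```

Because exec never returns on success, nothing placed after the call in the script would run, which is why it belongs on the final line.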
References:
《Kubernetes权威指南》 (Kubernetes: The Definitive Guide)
Last updated: 2017-08-13 22:51:21