九零不老心
Published on 2019-11-21

A collection of Kubernetes errors

Force-deleting Kubernetes resources

# Solution:
    --grace-period=0 --force
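    # A minimal example; the pod name and namespace are placeholders:
    kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force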

Pod-to-pod network communication in Kubernetes is unreachable, even though the configuration all looks correct

# Solution:
    Kubernetes' default CoreDNS can run into resolution anomalies. Manually delete the CoreDNS pods; k8s automatically recreates them and network communication recovers (see the command below).
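    # A sketch of recreating the CoreDNS pods, assuming the usual k8s-app=kube-dns label in kube-system:
    kubectl -n kube-system delete pod -l k8s-app=kube-dns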

A Kubernetes cluster in an otherwise healthy HA environment becomes abnormal: connections to port 8443 fail, and the kubelet service reports "Failed creating a mirror pod for pods already exists"

# Solution:
    Restart the docker service.
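    # For example, on the affected node (assuming systemd-managed docker):
    systemctl restart docker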

failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nginx-ingress-controller-7bff4d7c6-n7g62_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c366cb025fb0d73569707170e7ab10528c74222681bf7e6df347374fe83a6b83"

# Solution:

Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"

# Solution:
    Add extra kubelet arguments:
cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"
EOF
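# Then restart kubelet so the extra arguments take effect (a sketch, assuming a systemd-managed kubelet):
systemctl restart kubelet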

Deleting a PV fails and it stays in Terminating (persistentvolume/pvc-ba6ae684-7237-43ab-91b3-750eaf16811d 30Gi RWO Delete Terminating logging/data-elasticsearch-data-0 managed-nfs-storage 2d)

# Solution:
    kubectl patch pv pvc-ba6ae684-7237-43ab-91b3-750eaf16811d -p '{"metadata":{"finalizers":null}}'
    kubectl -n logging patch pvc data-elasticsearch-data-0 -p '{"metadata":{"finalizers":null}}'
    kubectl patch pod db-74755f6698-8td72 -p '{"metadata":{"finalizers":null}}'
    Then you can delete them:
    kubectl delete pv/pvc-ba6ae684-7237-43ab-91b3-750eaf16811d
# Reference:
    https://github.com/kubernetes/kubernetes/issues/69697

Deleting the cert-manager namespace fails; it stays in Terminating

# Solution:
    kubectl get namespace cert-manager -o json > tmp.json
    Then edit tmp.json and remove "kubernetes" from spec.finalizers.
    kubectl proxy
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/cert-manager/finalize
    Or:
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json https://10.0.4.13:8443/k8s/clusters/c-xk82n/v1/namespaces/cert-manager/finalize
# Reference:
    https://github.com/kubernetes/kubernetes/issues/60807

NetworkPlugin cni failed to set up pod "elasticsearch-master-0_logging" network: open /run/flannel/subnet.env: no such file or directory

# Cause
    The user the pod starts as has no permission to access the root-owned file /run/flannel/subnet.env.

0/6 nodes are available: 3 Insufficient memory, 3 node(s) had taints that the pod didn't tolerate (the cluster has 6 nodes in total: 3 lack memory and 3 carry taints, so the pod fails to schedule)

# Solution:
    Increase the memory of the nodes (see the quick check below).
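    # To confirm which constraint applies to each node before adding memory (a sketch; `kubectl top` requires metrics-server):
    kubectl describe nodes | grep -i taints
    kubectl top nodes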

StorageClass.storage.k8s.io "managed-nfs-storage" is invalid: parameters: Forbidden: updates to parameters are forbidden.

# Solution:
    kubectl replace -f storage-class.yaml --force

lvl=eror msg="failed to search for dashboards" logger=provisioning.dashboard type=file name=istio error="database is locked"

# Solution:
    https://github.com/grafana/grafana/issues/16638

Installing prometheus-operator fails with manifest_sorter.go:175: info: skipping unknown hook: "crd-install", Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"]

# Solution:
    kubectl create namespace monitoring
    kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml
    kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheus.crd.yaml
    kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheusrule.crd.yaml
    kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/servicemonitor.crd.yaml
    kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/podmonitor.crd.yaml
    helm install prometheus-operator --namespace monitoring micro/prometheus-operator -f prometheus-operator.yaml

Filebeat output to Logstash errors with Exiting: error unpacking config data: more than one namespace configured accessing 'output' (source:'filebeat.yml')

# Solution:
    # In filebeat.yml, set the following so that only one output is enabled:
    output.file.enabled: false
    output.elasticsearch.enabled: false
    output.logstash.enabled: true

How do you get session affinity (sticky sessions) inside a Service? Configure it in the Service's YAML.

# Solution:
    # Add the following to the Service's spec:
    sessionAffinity: ClientIP
    sessionAffinityConfig:
      clientIP:
        timeoutSeconds: 10800
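    # A minimal Service sketch showing where these fields sit (the name, selector, and ports are placeholders):
    apiVersion: v1
    kind: Service
    metadata:
      name: my-web
    spec:
      selector:
        app: my-web
      ports:
        - port: 80
          targetPort: 8080
      sessionAffinity: ClientIP
      sessionAffinityConfig:
        clientIP:
          timeoutSeconds: 10800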

The previously raised issue of a container being unable to use dynamically added CPUs has been solved on my side.

# Solution:
    1. Neither `docker container update` nor restarting the container fixes it.
    2. taskset -pc <pid>  # even when set to 0-7, the pid can still only use 0-3
    # Cause: /sys/fs/cgroup/cpuset/docker/cpuset.cpus still contains 0-3. Steps to fix:
    1. echo 0-7 > /sys/fs/cgroup/cpuset/docker/cpuset.cpus  # manually change it to 0-7 (so the newly added CPU cores can be used)
    2. echo 0-7 > /sys/fs/cgroup/cpuset/docker/<container_id>/cpuset.cpus
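    # A sketch of locating the container's cpuset cgroup directory (assumes the cgroupfs driver, where the directory is named after the full container ID):
    cid=$(docker inspect --format '{{.Id}}' <container_name>)
    echo 0-7 > /sys/fs/cgroup/cpuset/docker/$cid/cpuset.cpus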

A headless Service performs no load balancing; DNS directly returns the list of backend endpoint IPs.

# Solution:
    # Manually join the pods' DNS names, e.g. nacos-0.nacos-headless.online-shop-test.svc.cluster.local,nacos-1.nacos-headless.online-shop-test.svc.cluster.local,nacos-2.nacos-headless.online-shop-test.svc.cluster.local, and hand the list to the application
    # Or have the application resolve nacos-headless itself and dynamically combine the returned IP list with the actual service port as its connection setting
    # In your StatefulSet manifest, specify (see the sketch below):
        serviceName: nacos-headless
    # The headless Service must exist before the StatefulSet and is responsible for the network identity of the set.
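    # A minimal headless Service + StatefulSet pairing, sketched with the nacos names used above (the image, labels, and port are assumptions):
    apiVersion: v1
    kind: Service
    metadata:
      name: nacos-headless
      namespace: online-shop-test
    spec:
      clusterIP: None
      selector:
        app: nacos
      ports:
        - name: client
          port: 8848
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: nacos
      namespace: online-shop-test
    spec:
      serviceName: nacos-headless
      replicas: 3
      selector:
        matchLabels:
          app: nacos
      template:
        metadata:
          labels:
            app: nacos
        spec:
          containers:
            - name: nacos
              image: nacos/nacos-server:latest
              ports:
                - containerPort: 8848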

When creating an Elasticsearch cluster with Helm and using Tencent Cloud CFS (NFS) as the persistent storage

# Problem details:
    "type": "server", "timestamp": "2021-04-22T07:07:51,641Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "uncaught exception in thread [main]", 
    "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];",
    "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:75) ~[elasticsearch-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:116) ~[elasticsearch-cli-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.cli.Command.main(Command.java:79) ~[elasticsearch-cli-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.11.2.jar:7.11.2]",
    "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:81) ~[elasticsearch-7.11.2.jar:7.11.2]",
    "Caused by: org.elasticsearch.ElasticsearchException: failed to bind service",
# Solution
    Find the directory backing the PV on the server and change its permissions (chmod -R 777 els).

When using Tencent Cloud CBS, CBS (CSI), CFS, or COS as persistent storage

# Note that with CBS or CBS (CSI), the PV size must be a multiple of 10, with a maximum of 16000 GB.
# With CFS, watch the main account's limit on the number of CFS instances that can be created, and make sure the group ownership of the NFS mount directory matches the group of the process running in the container.
# With COS, the mount options must include -oallow_other (allow other users to access the mounted folder).
# Reference:
    https://cloud.tencent.com/document/product/436/6883#.E5.B8.B8.E7.94.A8.E6.8C.82.E8.BD.BD.E9.80.89.E9.A1.B9

When starting Jenkins from the official Docker image, the first startup stays stuck on a refreshing login page

# Solution
    initContainers:
      - name: alpine
        image: alpine:latest
        imagePullPolicy: IfNotPresent
        command: ["sh", "-c", "chown -R 1000:1000 /opt; sed -i 's/updates.jenkins.io/mirrors.tuna.tsinghua.edu.cn\\/jenkins\\/updates/g' /opt/hudson.model.UpdateCenter.xml; true"]

For a Jenkins master and Jenkins agents in k8s, after installing the Jenkins Kubernetes plugin

# Configure Clouds
# Configure the Kubernetes cluster
# Configure the pod template
    (Note that a default jenkins/inbound-agent:4.3-4 image with name "jnlp" already exists; it is not shown on the page, but the pod shows it once a Jenkins job runs.)
    If you use a custom agent image, the Container Template's name must be jnlp so that it overrides the default jenkins/inbound-agent:4.3-4; otherwise the default jenkins/inbound-agent:4.3-4 is always used (see the sketch below).
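    # A sketch of the override using the plugin's raw pod YAML, assuming a hypothetical custom agent image; the container must be named jnlp:
    apiVersion: v1
    kind: Pod
    spec:
      containers:
        - name: jnlp
          image: my-registry/custom-inbound-agent:latest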

Environment variables in a Dockerfile

# ENV in a Dockerfile sets environment variables, and they are inherited by any image built FROM that image (see the sketch below).
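    # A minimal illustration, assuming a hypothetical base image tag:
    # base image Dockerfile
    FROM alpine:3.12
    ENV APP_ENV=prod
    # child image Dockerfile -- APP_ENV is inherited through FROM
    FROM my-registry/base:latest
    RUN echo "APP_ENV is $APP_ENV"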

An ingress-controller with a Service of type LoadBalancer is deployed in k8s, and a web application is exposed externally through an Ingress with a public domain:

# Problem details:
Normally, newly created pods that need to reach this web service should be configured with its in-cluster svc address.
But the customer said the new production environment would not be in this k8s cluster, so the web service's public https address was configured instead. After that change, most nodes in this cluster, and most pods on those nodes, failed to reach the domain over https,
yet every node could reach the same domain over http without problems.

In other words, https traffic from nodes inside the cluster was being intercepted within the cluster instead of going out over the normal internet route.

# Solution:
    Set externalTrafficPolicy to Cluster on the ingress-controller's svc (see the sketch below).
    For the differences between Cluster and Local for externalTrafficPolicy, see the official documentation.
    # Reference
        https://www.starbugs.net/index.php/2020/09/30/k8s%E4%B8%ADservice%E7%9A%84%E7%89%B9%E6%80%A7service-spec-externaltrafficpolicy%E5%AF%B9ingress-controller%E7%9A%84%E5%BD%B1%E5%93%8D/
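    # A sketch of the change; the namespace and Service name are assumptions and depend on how the controller was installed:
    kubectl -n ingress-nginx patch svc ingress-nginx-controller -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'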

# Part of an explanation later found from an upstream discussion:
    We're also seeing this as an issue at DigitalOcean. It's a concern not just for load-balancer TLS termination, but also for supporting proxy protocol as encouraged in ( https://kubernetes.io/docs/tutorials/services/source-ip/ ). Traffic addressed to the LB ip from within the cluster never reaches the load-balancer, and the required proxy header isn't applied, causing a protocol violation.
    The in-tree AWS service type loadbalancer supports proxy protocol and TLS termination, but because they populate status.loadbalancer.ingress.hostname rather than .ip they avoid this bug/optimization.
    We're willing to put together a PR to address this if there's interest from sig-network to accept it. We've considered a kube-proxy flag to disable the optimization, or the more complex option of extending v1.LoadBalancerIngress to include feedback from the cloud provider.

# Reference
    https://github.com/kubernetes/kubernetes/issues/66607?spm=a2c4g.11186623.2.8.765e7a47mxR9Qr
    https://help.aliyun.com/document_detail/171437.html