Force-deleting Kubernetes resources
# Solution:
Add --grace-period=0 --force to the kubectl delete command, e.g. kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
Network communication between Kubernetes containers is unreachable, even though the configuration looks correct
# Solution:
The default CoreDNS in Kubernetes can run into resolution failures. Manually delete the CoreDNS pods; Kubernetes automatically recreates them and network communication recovers.
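# A minimal sketch, assuming CoreDNS runs in kube-system with the standard k8s-app=kube-dns label:
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system delete pod -l k8s-app=kube-dns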
A normally healthy HA Kubernetes cluster becomes abnormal: connections to port 8443 fail and the kubelet service reports Failed creating a mirror pod for pods already exists
# Solution:
Restart the docker service.
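# Assuming Docker is managed by systemd:
systemctl restart docker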
failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "nginx-ingress-controller-7bff4d7c6-n7g62_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c366cb025fb0d73569707170e7ab10528c74222681bf7e6df347374fe83a6b83"
# Solution:
Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
# Solution:
Add extra kubelet arguments:
cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"
EOF
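# For the new flags to take effect, restart the kubelet afterwards (assuming a systemd-managed kubelet that reads /etc/sysconfig/kubelet):
systemctl daemon-reload
systemctl restart kubelet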
Deleting a PV fails and it stays in the Terminating state (persistentvolume/pvc-ba6ae684-7237-43ab-91b3-750eaf16811d 30Gi RWO Delete Terminating logging/data-elasticsearch-data-0 managed-nfs-storage 2d)
# Solution:
Clear the finalizers on the stuck objects:
kubectl patch pv pvc-ba6ae684-7237-43ab-91b3-750eaf16811d -p '{"metadata":{"finalizers":null}}'
kubectl patch pvc pvc-ba6ae684-7237-43ab-91b3-750eaf16811d -p '{"metadata":{"finalizers":null}}'
kubectl patch pod db-74755f6698-8td72 -p '{"metadata":{"finalizers":null}}'
Then you can delete them:
kubectl delete pv/pvc-ba6ae684-7237-43ab-91b3-750eaf16811d
# Reference:
https://github.com/kubernetes/kubernetes/issues/69697
Deleting the cert-manager namespace fails; it stays in Terminating
# Solution:
kubectl get namespace cert-manager -o json > tmp.json
Then edit tmp.json and remove "kubernetes" from the spec.finalizers list.
kubectl proxy
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/cert-manager/finalize
Or:
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json https://10.0.4.13:8443/k8s/clusters/c-xk82n/v1/namespaces/cert-manager/finalize
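# An equivalent one-liner without kubectl proxy, assuming jq is installed and kubectl supports replace --raw (v1.14+):
kubectl get namespace cert-manager -o json | jq '.spec.finalizers=[]' | kubectl replace --raw "/api/v1/namespaces/cert-manager/finalize" -f -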
# Reference:
https://github.com/kubernetes/kubernetes/issues/60807
NetworkPlugin cni failed to set up pod "elasticsearch-master-0_logging" network: open /run/flannel/subnet.env: no such file or directory
# Cause
The user the pod runs as does not have permission to access the root-owned file /run/flannel/subnet.env
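# A quick check on the affected node, assuming flannel is the CNI there and writes /run/flannel/subnet.env:
ls -l /run/flannel/subnet.env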
0/6 nodes are available: 3 Insufficient memory, 3 node(s) had taints that the pod didn't tolerate (the cluster has 6 nodes in total: 3 are short on memory and 3 are tainted, so the pod cannot be scheduled)
# Solution:
Increase the memory on the nodes.
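# To check each node's allocatable memory and current usage first (kubectl top requires metrics-server):
kubectl describe nodes | grep -A 7 Allocatable
kubectl top nodes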
StorageClass.storage.k8s.io "managed-nfs-storage" is invalid: parameters: Forbidden: updates to parameters are forbidden.
# Solution:
kubectl replace -f storage-class.yaml --force
lvl=eror msg="failed to search for dashboards" logger=provisioning.dashboard type=file name=istio error="database is locked"
# Solution:
https://github.com/grafana/grafana/issues/16638
Installing prometheus-operator fails with manifest_sorter.go:175: info: skipping unknown hook: "crd-install", Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "PrometheusRule" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"]
# Solution:
Create the required CRDs first, then install the chart:
kubectl create namespace monitoring
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheus.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheusrule.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/servicemonitor.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/podmonitor.crd.yaml
helm install prometheus-operator --namespace monitoring micro/prometheus-operator -f prometheus-operator.yaml
Filebeat output to Logstash fails with Exiting: error unpacking config data: more than one namespace configured accessing 'output' (source:'filebeat.yml')
# Solution:
# In filebeat.yml, make sure only one output is enabled:
output.file.enabled: false
output.elasticsearch.enabled: false
output.logstash.enabled: true
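# Alternatively, define only a single output block in filebeat.yml; a minimal sketch with a placeholder host/port:
output.logstash:
  hosts: ["logstash:5044"]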
How do you get session affinity (sticky sessions) inside a Service? Configure it in the Service YAML.
# Solution:
# Add the following under spec in the Service YAML:
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800
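# A complete Service sketch for context (name, selector, and ports are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: my-web
spec:
  selector:
    app: my-web
  ports:
    - port: 80
      targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800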
The earlier question about a container not being able to use dynamically added CPUs has been resolved.
# Solution:
1. Neither docker container update nor restarting the container fixes it.
2. taskset -pc <pid>  # even when set to 0-7, the process can still only use CPUs 0-3
# Cause: /sys/fs/cgroup/cpuset/docker/cpuset.cpus still contains 0-3. Steps to fix:
1. echo 0-7 > /sys/fs/cgroup/cpuset/docker/cpuset.cpus  # manually change it to 0-7 so the newly added CPU cores can be used
2. echo 0-7 > /sys/fs/cgroup/cpuset/docker/<container-id>/cpuset.cpus  # repeat for the container's own cpuset cgroup
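# The directory under /sys/fs/cgroup/cpuset/docker/ is the full container ID; it can be looked up with (container name is a placeholder):
docker inspect --format '{{.Id}}' <container-name>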
A headless Service does not provide load balancing; DNS directly returns the list of backend endpoint IPs.
# Solution:
# Either hand the application the concatenated pod DNS names, e.g. nacos-0.nacos-headless.online-shop-test.svc.cluster.local,nacos-1.nacos-headless.online-shop-test.svc.cluster.local,nacos-2.nacos-headless.online-shop-test.svc.cluster.local
# Or let the application resolve nacos-headless itself and dynamically combine the returned IP list with the actual service port as its connection configuration
# In your StatefulSet manifest try to specify:
serviceName: busy-headless
# The headless service must exist before the StatefulSet, and is responsible for the network identity of the set.
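# A minimal sketch of that headless Service (name, selector, and port are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: busy-headless
spec:
  clusterIP: None
  selector:
    app: busy
  ports:
    - port: 80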
When creating an Elasticsearch cluster with Helm and using Tencent Cloud CFS (NFS) as persistent storage:
# Problem details:
"type": "server", "timestamp": "2021-04-22T07:07:51,641Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "uncaught exception in thread [main]",
"stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];",
"at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.11.2.jar:7.11.2]",
"at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.11.2.jar:7.11.2]",
"at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:75) ~[elasticsearch-7.11.2.jar:7.11.2]",
"at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:116) ~[elasticsearch-cli-7.11.2.jar:7.11.2]",
"at org.elasticsearch.cli.Command.main(Command.java:79) ~[elasticsearch-cli-7.11.2.jar:7.11.2]",
"at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.11.2.jar:7.11.2]",
"at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:81) ~[elasticsearch-7.11.2.jar:7.11.2]",
"Caused by: org.elasticsearch.ElasticsearchException: failed to bind service",
# Solution
Find the directory on the server backing the PV and change its permissions (chmod -R 777 els).
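# A less permissive alternative, assuming the official image where Elasticsearch runs as uid/gid 1000:
chown -R 1000:1000 els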
When using Tencent Cloud CBS, CBS (CSI), CFS, or COS as persistent storage:
# Note: with CBS or CBS (CSI), the PV size must be a multiple of 10, with a maximum of 16000GB (see the sketch after this list)
# With CFS, watch the per-account limit on the number of CFS instances, and make sure the group ownership of the NFS-mounted directory matches the gid of the container's process
# With COS, the mount options must include -oallow_other (allow other users to access the mounted directory)
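# A PVC sketch illustrating the size constraint (the storage class name cbs is an assumption; adjust to your cluster):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-cbs
spec:
  storageClassName: cbs
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi   # must be a multiple of 10, up to 16000GB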
# Reference:
https://cloud.tencent.com/document/product/436/6883#.E5.B8.B8.E7.94.A8.E6.8C.82.E8.BD.BD.E9.80.89.E9.A1.B9
Starting Jenkins from the official Docker image, the first startup is stuck endlessly refreshing the login page
# Solution
initContainers:
  - name: alpine
    image: alpine:latest
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "chown -R 1000:1000 /opt; sed -i 's/updates.jenkins.io/mirrors.tuna.tsinghua.edu.cn\\/jenkins\\/updates/g' /opt/hudson.model.UpdateCenter.xml; true"]
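# This assumes the Jenkins home volume is mounted at /opt in the init container (and in the Jenkins container); a volumeMounts sketch with a placeholder volume name:
    volumeMounts:
      - name: jenkins-home
        mountPath: /opt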
Jenkins master and Jenkins agents on Kubernetes, after installing the Jenkins Kubernetes plugin
# Configure Clouds
# Configure the Kubernetes cluster
# Configure the pod template
(Note: a default container named "jnlp" using the jenkins/inbound-agent:4.3-4 image already exists; it is not shown in the UI, but the pod shows it when a Jenkins job runs.)
If you use a custom agent image, the Container Template name must be set to jnlp to override the default jenkins/inbound-agent:4.3-4; otherwise the default jenkins/inbound-agent:4.3-4 is always used.
Environment variables in a Dockerfile
# ENV in a Dockerfile sets environment variables, and they are inherited by images built FROM that image.
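# A small illustration: the second stage is built FROM the first, so it inherits APP_ENV (tags and the variable name are placeholders):
FROM alpine:3.18 AS base
ENV APP_ENV=production

FROM base
RUN echo "APP_ENV is $APP_ENV"   # prints: APP_ENV is production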
An ingress-controller is deployed with a Service of type LoadBalancer, and a web app is exposed on a public domain through Ingress:
# Problem details:
Normally, a newly created pod that needs to reach this web service should be configured with the service's in-cluster Service address.
However, the customer said the new production environment would not run in this cluster, so the web service's public HTTPS address was configured instead. After the change, HTTPS access to the domain failed from most nodes in the cluster and from most pods on those nodes,
while HTTP access to the same domain worked fine from all nodes.
HTTPS traffic from nodes inside the cluster was being hijacked inside the cluster instead of taking the normal internet route.
# Solution:
Set externalTrafficPolicy to Cluster on the ingress-controller Service.
For the difference between Cluster and Local, see the official documentation on externalTrafficPolicy.
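# One way to apply it, assuming the controller Service is named ingress-nginx-controller in the ingress-nginx namespace:
kubectl -n ingress-nginx patch svc ingress-nginx-controller -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'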
# Reference
https://www.starbugs.net/index.php/2020/09/30/k8s%E4%B8%ADservice%E7%9A%84%E7%89%B9%E6%80%A7service-spec-externaltrafficpolicy%E5%AF%B9ingress-controller%E7%9A%84%E5%BD%B1%E5%93%8D/
# Part of an explanation found later:
We're also seeing this as an issue at DigitalOcean. It's a concern not just for load-balancer TLS termination, but also for supporting proxy protocol as encouraged in ( https://kubernetes.io/docs/tutorials/services/source-ip/ ). Traffic addressed to the LB ip from within the cluster never reaches the load-balancer, and the required proxy header isn't applied, causing a protocol violation.
The in-tree AWS service type loadbalancer supports proxy protocol and TLS termination, but because they populate status.loadbalancer.ingress.hostname rather than .ip they avoid this bug/optimization.
We're willing to put together a PR to address this if there's interest from sig-network to accept it. We've considered a kube-proxy flag to disable the optimization, or the more complex option of extending v1.LoadBalancerIngress to include feedback from the cloud provider.
# Reference
https://github.com/kubernetes/kubernetes/issues/66607?spm=a2c4g.11186623.2.8.765e7a47mxR9Qr
https://help.aliyun.com/document_detail/171437.html