Troubleshooting a Calico Node That Fails to Start

We run a 3-node Kubernetes cluster that uses the Calico network plugin. The versions are as follows:

  • Kubernetes: v1.27.4
  • Calico: v3.15.1
  • CNI: v0.8.6
  • OS: Ubuntu 22.04 LTS
  • Kernel: 6.5.0-21-generic
  • containerd: 1.7.12-0ubuntu2~22.04.1

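The cluster was working normally. We tried to add a new node with the same versions, but the Calico Node pod on the new node failed to start. Running kubectl describe pod calico-node-xxxxx -n kube-system showed that the liveness probe was failing, so the pod kept restarting and eventually entered the CrashLoopBackOff state.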
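As a rough sketch of how the restart loop could be summarized programmatically, the Python below reads a Pod object (for example the JSON printed by kubectl get pod calico-node-xxxxx -n kube-system -o json) and reports the restart count and waiting reason of each container. The field names come from the Kubernetes Pod status; the function name and the sample values are hypothetical and are not taken from the affected node.

```python
def summarize_restarts(pod):
    """Return one summary dict per container found in pod["status"]."""
    summary = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # "state.waiting" is only present while the container is not running.
        waiting = cs.get("state", {}).get("waiting") or {}
        # "lastState.terminated" describes the previous, exited instance.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        summary.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
        })
    return summary


# Hypothetical sample; the numbers are illustrative only.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {
                "name": "calico-node",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 137}},
            }
        ]
    }
}

if __name__ == "__main__":
    for row in summarize_restarts(SAMPLE_POD):
        print(row)
```

In practice the dict would come from json.load on the kubectl output rather than from a hard-coded sample.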

The pod logs showed no obvious errors either. The pod usually restarted right after the following log output:

2024-11-11 03:12:57.671 [INFO][105] felix/health.go 336: Overall health status changed: live=true ready=true
+---------------------------+---------+----------------+-----------------+--------+
|         COMPONENT         | TIMEOUT |    LIVENESS    |    READINESS    | DETAIL |
+---------------------------+---------+----------------+-----------------+--------+
| CalculationGraph          | 30s     | reporting live | reporting ready |        |
| FelixStartup              | -       | reporting live | reporting ready |        |
| InternalDataplaneMainLoop | 1m30s   | reporting live | reporting ready |        |
+---------------------------+---------+----------------+-----------------+--------+

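The kubelet service reported no obvious errors either, so the investigation hit a dead end and I did not know how to proceed. Along the way I tried resetting the node and redeploying Calico Node, but the problem persisted.

I kept investigating on and off for a week without solving the problem.

Today I picked it up again and found the following in the /var/log/calico/cni/cni.log log: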

2024-11-11 02:54:08.675 [INFO][119] felix/daemon.go 398: Successfully loaded configuration. GOMAXPROCS=64 builddate="2023-08-17T14:17:29+0000" config=&config.Config{UseInternalDataplaneDriver:true, DataplaneDriver:"calico-iptables-plugin", DataplaneWatchdogTimeout:90000000000, WireguardEnabled:false, WireguardEnabledV6:false, WireguardListeningPort:51820, WireguardListeningPortV6:51821, WireguardRoutingRulePriority:99, WireguardInterfaceName:"wireguard.cali", WireguardInterfaceNameV6:"wg-v6.cali", WireguardMTU:0, WireguardMTUV6:0, WireguardHostEncryptionEnabled:false, WireguardPersistentKeepAlive:0, BPFEnabled:false, BPFDisableUnprivileged:true, BPFLogLevel:"off", BPFLogFilters:map[string]string(nil), BPFCTLBLogFilter:"", BPFDataIfacePattern:(*regexp.Regexp)(0xc000479ae0), BPFL3IfacePattern:(*regexp.Regexp)(nil), BPFConnectTimeLoadBalancingEnabled:true, BPFExternalServiceMode:"tunnel", BPFDSROptoutCIDRs:[]string(nil), BPFKubeProxyIptablesCleanupEnabled:true, BPFKubeProxyMinSyncPeriod:1000000000, BPFKubeProxyEndpointSlicesEnabled:true, BPFExtToServiceConnmark:0, BPFPSNATPorts:numorstring.Port{MinPort:0x4e20, MaxPort:0x752f, PortName:""}, BPFMapSizeNATFrontend:65536, BPFMapSizeNATBackend:262144, BPFMapSizeNATAffinity:65536, BPFMapSizeRoute:262144, BPFMapSizeConntrack:512000, BPFMapSizeIPSets:1048576, BPFMapSizeIfState:1000, BPFHostConntrackBypass:true, BPFEnforceRPF:"Loose", BPFPolicyDebugEnabled:true, BPFForceTrackPacketsFromIfaces:[]string{"docker+"}, BPFDisableGROForIfaces:(*regexp.Regexp)(nil), DebugBPFCgroupV2:"", DebugBPFMapRepinEnabled:false, DatastoreType:"kubernetes", FelixHostname:"node13.k8s.gp51.com", EtcdAddr:"127.0.0.1:2379", EtcdScheme:"http", EtcdKeyFile:"", EtcdCertFile:"", EtcdCaFile:"", EtcdEndpoints:[]string(nil), TyphaAddr:"", TyphaK8sServiceName:"", TyphaK8sNamespace:"kube-system", TyphaReadTimeout:30000000000, TyphaWriteTimeout:10000000000, TyphaKeyFile:"", TyphaCertFile:"", TyphaCAFile:"", TyphaCN:"", TyphaURISAN:"", Ipv6Support:false, 
BpfIpv6Support:false, IptablesBackend:"auto", RouteRefreshInterval:90000000000, InterfaceRefreshInterval:90000000000, DeviceRouteSourceAddress:net.IP(nil), DeviceRouteSourceAddressIPv6:net.IP(nil), DeviceRouteProtocol:3, RemoveExternalRoutes:true, IptablesRefreshInterval:90000000000, IptablesPostWriteCheckIntervalSecs:1000000000, IptablesLockFilePath:"/run/xtables.lock", IptablesLockTimeoutSecs:0, IptablesLockProbeIntervalMillis:50000000, FeatureDetectOverride:map[string]string(nil), FeatureGates:map[string]string(nil), IpsetsRefreshInterval:10000000000, MaxIpsetSize:1048576, XDPRefreshInterval:90000000000, PolicySyncPathPrefix:"", NetlinkTimeoutSecs:10000000000, MetadataAddr:"", MetadataPort:8775, OpenstackRegion:"", InterfacePrefix:"cali", InterfaceExclude:[]*regexp.Regexp{(*regexp.Regexp)(0xc000479c20)}, ChainInsertMode:"insert", DefaultEndpointToHostAction:"ACCEPT", IptablesFilterAllowAction:"ACCEPT", IptablesMangleAllowAction:"ACCEPT", IptablesFilterDenyAction:"DROP", LogPrefix:"calico-packet", LogFilePath:"", LogSeverityFile:"", LogSeverityScreen:"INFO", LogSeveritySys:"", LogDebugFilenameRegex:(*regexp.Regexp)(nil), VXLANEnabled:(*bool)(nil), VXLANPort:4789, VXLANVNI:4096, VXLANMTU:0, VXLANMTUV6:0, IPv4VXLANTunnelAddr:net.IP(nil), IPv6VXLANTunnelAddr:net.IP(nil), VXLANTunnelMACAddr:"", VXLANTunnelMACAddrV6:"", IpInIpEnabled:(*bool)(nil), IpInIpMtu:0, IpInIpTunnelAddr:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xf4, 0xe2, 0x40}, FloatingIPs:"Disabled", AllowVXLANPacketsFromWorkloads:false, AllowIPIPPacketsFromWorkloads:false, AWSSrcDstCheck:"DoNothing", ServiceLoopPrevention:"Drop", WorkloadSourceSpoofing:"Disabled", ReportingIntervalSecs:0, ReportingTTLSecs:90000000000, EndpointReportingEnabled:false, EndpointReportingDelaySecs:1000000000, IptablesMarkMask:0xffff0000, DisableConntrackInvalidCheck:false, HealthEnabled:true, HealthPort:9099, HealthHost:"localhost", HealthTimeoutOverrides:map[string]time.Duration(nil), 
PrometheusMetricsEnabled:false, PrometheusMetricsHost:"", PrometheusMetricsPort:9091, PrometheusGoMetricsEnabled:true, PrometheusProcessMetricsEnabled:true, PrometheusWireGuardMetricsEnabled:true, FailsafeInboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"tcp", Port:0x16}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x44}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, FailsafeOutboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"udp", Port:0x35}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x43}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, KubeNodePortRanges:[]numorstring.Port{numorstring.Port{MinPort:0x7530, MaxPort:0x7fff, PortName:""}}, NATPortRange:numorstring.Port{MinPort:0x0, MaxPort:0x0, PortName:""}, NATOutgoingAddress:net.IP(nil), UsageReportingEnabled:true, UsageReportingInitialDelaySecs:300000000000, UsageReportingIntervalSecs:86400000000000, ClusterGUID:"a97da56e5db6460facdb88bcaeac2b27", ClusterType:"k8s,bgp,kubeadm,kdd", CalicoVersion:"v3.27.0-0.dev-263-g253d5925780a", ExternalNodesCIDRList:[]string(nil), DebugMemoryProfilePath:"", DebugCPUProfilePath:"/tmp/felix-cpu-<timestamp>.pprof", DebugDisableLogDropping:false, DebugSimulateCalcGraphHangAfter:0, DebugSimulateDataplaneHangAfter:0, DebugPanicAfter:0, DebugSimulateDataRace:false, 
RouteSource:"CalicoIPAM", RouteTableRange:idalloc.IndexRange{Min:0, Max:0}, RouteTableRanges:[]idalloc.IndexRange(nil), RouteSyncDisabled:false, IptablesNATOutgoingInterfaceFilter:"", SidecarAccelerationEnabled:false, XDPEnabled:true, GenericXDPEnabled:false, Variant:"Calico", MTUIfacePattern:(*regexp.Regexp)(0xc000479ea0), Encapsulation:config.Encapsulation{IPIPEnabled:true, VXLANEnabled:false, VXLANEnabledV6:false}, internalOverrides:map[string]string{}, sourceToRawConfig:map[config.Source]map[string]string{0x1:map[string]string{"CalicoVersion":"v3.27.0-0.dev-263-g253d5925780a", "ClusterGUID":"a97da56e5db6460facdb88bcaeac2b27", "ClusterType":"k8s,bgp,kubeadm,kdd", "FloatingIPs":"Disabled", "LogSeverityScreen":"Info", "ReportingIntervalSecs":"0"}, 0x2:map[string]string{"IpInIpTunnelAddr":"10.244.226.64"}, 0x3:map[string]string{"LogFilePath":"None", "LogSeverityFile":"None", "LogSeveritySys":"None", "MetadataAddr":"None"}, 0x4:map[string]string{"datastoretype":"kubernetes", "defaultendpointtohostaction":"ACCEPT", "felixhostname":"node13.k8s.gp51.com", "healthenabled":"true", "ipinipmtu":"0", "ipv6support":"false", "vxlanmtu":"0", "wireguardmtu":"0"}}, rawValues:map[string]string{"CalicoVersion":"v3.27.0-0.dev-263-g253d5925780a", "ClusterGUID":"a97da56e5db6460facdb88bcaeac2b27", "ClusterType":"k8s,bgp,kubeadm,kdd", "DatastoreType":"kubernetes", "DefaultEndpointToHostAction":"ACCEPT", "FelixHostname":"node13.k8s.gp51.com", "FloatingIPs":"Disabled", "HealthEnabled":"true", "IpInIpMtu":"0", "IpInIpTunnelAddr":"10.244.226.64", "Ipv6Support":"false", "LogFilePath":"None", "LogSeverityFile":"None", "LogSeverityScreen":"Info", "LogSeveritySys":"None", "MetadataAddr":"None", "ReportingIntervalSecs":"0", "VXLANMTU":"0", "WireguardMTU":"0"}, Err:error(nil), loadClientConfigFromEnvironment:(func() (*apiconfig.CalicoAPIConfig, error))(0x162cca0), useNodeResourceUpdates:false} gitcommit="253d5925780a3207cd262279e32cf54f11db18a0" version="v3.27.0-0.dev-263-g253d5925780a"
2024-11-11 02:54:08.679 [INFO][119] felix/int_dataplane.go 356: Creating internal dataplane driver. config=intdataplane.Config{Hostname:"node13.k8s.gp51.com", NodeZone:"", IPv6Enabled:false, RuleRendererOverride:rules.RuleRenderer(nil), IPIPMTU:0, VXLANMTU:0, VXLANMTUV6:0, VXLANPort:4789, MaxIPSetSize:1048576, RouteSyncDisabled:false, IptablesBackend:"auto", IPSetsRefreshInterval:10000000000, RouteRefreshInterval:90000000000, DeviceRouteSourceAddress:net.IP(nil), DeviceRouteSourceAddressIPv6:net.IP(nil), DeviceRouteProtocol:3, RemoveExternalRoutes:true, IptablesRefreshInterval:90000000000, IptablesPostWriteCheckInterval:1000000000, IptablesInsertMode:"insert", IptablesLockFilePath:"/run/xtables.lock", IptablesLockTimeout:0, IptablesLockProbeInterval:50000000, XDPRefreshInterval:90000000000, FloatingIPsEnabled:false, Wireguard:wireguard.Config{Enabled:false, EnabledV6:false, ListeningPort:51820, ListeningPortV6:51821, FirewallMark:0, RoutingRulePriority:99, RoutingTableIndex:1, RoutingTableIndexV6:2, InterfaceName:"wireguard.cali", InterfaceNameV6:"wg-v6.cali", MTU:0, MTUV6:0, RouteSource:"CalicoIPAM", EncryptHostTraffic:false, PersistentKeepAlive:0, RouteSyncDisabled:false}, NetlinkTimeout:10000000000, RulesConfig:rules.Config{IPSetConfigV4:(*ipsets.IPVersionConfig)(0xc0006a45f0), IPSetConfigV6:(*ipsets.IPVersionConfig)(0xc0006a4690), WorkloadIfacePrefixes:[]string{"cali"}, IptablesMarkAccept:0x10000, IptablesMarkPass:0x20000, IptablesMarkScratch0:0x40000, IptablesMarkScratch1:0x80000, IptablesMarkEndpoint:0xfff00000, IptablesMarkNonCaliEndpoint:0x100000, KubeNodePortRanges:[]numorstring.Port{numorstring.Port{MinPort:0x7530, MaxPort:0x7fff, PortName:""}}, KubeIPVSSupportEnabled:true, OpenStackMetadataIP:net.IP(nil), OpenStackMetadataPort:0x2247, OpenStackSpecialCasesEnabled:false, VXLANEnabled:false, VXLANEnabledV6:false, VXLANPort:4789, VXLANVNI:4096, IPIPEnabled:true, FelixConfigIPIPEnabled:(*bool)(nil), IPIPTunnelAddress:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xf4, 0xe2, 0x40}, VXLANTunnelAddress:net.IP(nil), VXLANTunnelAddressV6:net.IP(nil), AllowVXLANPacketsFromWorkloads:false, AllowIPIPPacketsFromWorkloads:false, WireguardEnabled:false, WireguardEnabledV6:false, WireguardInterfaceName:"wireguard.cali", WireguardInterfaceNameV6:"wg-v6.cali", WireguardIptablesMark:0x0, WireguardListeningPort:51820, WireguardListeningPortV6:51821, WireguardEncryptHostTraffic:false, RouteSource:"CalicoIPAM", IptablesLogPrefix:"calico-packet", EndpointToHostAction:"ACCEPT", IptablesFilterAllowAction:"ACCEPT", IptablesMangleAllowAction:"ACCEPT", IptablesFilterDenyAction:"DROP", FailsafeInboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"tcp", Port:0x16}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x44}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, FailsafeOutboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"udp", Port:0x35}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x43}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, DisableConntrackInvalid:false, NATPortRange:numorstring.Port{MinPort:0x0, MaxPort:0x0, PortName:""}, IptablesNATOutgoingInterfaceFilter:"", NATOutgoingAddress:net.IP(nil), BPFEnabled:false, BPFForceTrackPacketsFromIfaces:[]string{"docker+"}, ServiceLoopPrevention:"Drop"}, 
IfaceMonitorConfig:ifacemonitor.Config{InterfaceExcludes:[]*regexp.Regexp{(*regexp.Regexp)(0xc000479c20)}, ResyncInterval:90000000000, NetlinkTimeout:10000000000}, StatusReportingInterval:0, ConfigChangedRestartCallback:(func())(0x27fadc0), FatalErrorRestartCallback:(func(error))(0x27faca0), PostInSyncCallback:(func())(0x27e9220), HealthAggregator:(*health.HealthAggregator)(0xc0006a35f0), WatchdogTimeout:90000000000, RouteTableManager:(*idalloc.IndexAllocator)(0xc000616740), DebugSimulateDataplaneHangAfter:0, ExternalNodesCidrs:[]string(nil), BPFEnabled:false, BPFPolicyDebugEnabled:true, BPFDisableUnprivileged:true, BPFKubeProxyIptablesCleanupEnabled:true, BPFLogLevel:"off", BPFLogFilters:map[string]string(nil), BPFCTLBLogFilter:"", BPFExtToServiceConnmark:0, BPFDataIfacePattern:(*regexp.Regexp)(0xc000479ae0), BPFL3IfacePattern:(*regexp.Regexp)(nil), XDPEnabled:true, XDPAllowGeneric:false, BPFConntrackTimeouts:conntrack.Timeouts{CreationGracePeriod:10000000000, TCPPreEstablished:20000000000, TCPEstablished:3600000000000, TCPFinsSeen:30000000000, TCPResetSeen:40000000000, UDPLastSeen:60000000000, GenericIPLastSeen:600000000000, ICMPLastSeen:5000000000}, BPFCgroupV2:"", BPFConnTimeLBEnabled:true, BPFMapRepin:false, BPFNodePortDSREnabled:false, BPFDSROptoutCIDRs:[]string(nil), BPFPSNATPorts:numorstring.Port{MinPort:0x4e20, MaxPort:0x752f, PortName:""}, BPFMapSizeRoute:262144, BPFMapSizeConntrack:512000, BPFMapSizeNATFrontend:65536, BPFMapSizeNATBackend:262144, BPFMapSizeNATAffinity:65536, BPFMapSizeIPSets:1048576, BPFMapSizeIfState:1000, BPFIpv6Enabled:false, BPFHostConntrackBypass:true, BPFEnforceRPF:"Loose", BPFDisableGROForIfaces:(*regexp.Regexp)(nil), KubeProxyMinSyncPeriod:1000000000, SidecarAccelerationEnabled:false, LookPathOverride:(func(string) (string, error))(nil), KubeClientSet:(*kubernetes.Clientset)(0xc0007829c0), FeatureDetectOverrides:map[string]string(nil), FeatureGates:map[string]string(nil), hostMTU:0, MTUIfacePattern:(*regexp.Regexp)(0xc000479ea0), 
RouteSource:"CalicoIPAM", KubernetesProvider:0x0}
2024-11-11 02:54:08.708 [INFO][119] felix/int_dataplane.go 1965: attempted to modprobe nf_conntrack_proto_sctp error=exit status 1 output=""
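With log entries as long as the ones above, it is easy to miss the one line that matters. The following is a minimal sketch, assuming the "timestamp [LEVEL][pid] source line: message" layout seen in this excerpt, that lists entries above INFO level and, separately, INFO entries that carry an error= field. The function names are hypothetical, and the WARNING level name in the test data is an assumption for illustration.

```python
import re

# Matches e.g.:
# 2024-11-11 02:54:08.708 [INFO][119] felix/int_dataplane.go 1965: message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]+)\]\[(?P<pid>\d+)\] "
    r"(?P<source>\S+ \d+): (?P<msg>.*)$"
)


def lines_above_info(text):
    """Return (level, source, message) for entries that are not INFO/DEBUG."""
    hits = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        # Wrapped continuation lines do not match the pattern and are skipped.
        if m and m.group("level") not in ("INFO", "DEBUG"):
            hits.append((m.group("level"), m.group("source"), m.group("msg")))
    return hits


def info_lines_with_error_field(text):
    """Return INFO entries whose message contains an error= field."""
    hits = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m and m.group("level") == "INFO" and re.search(r"\berror=", m.group("msg")):
            hits.append((m.group("source"), m.group("msg")))
    return hits
```

Run against the excerpt above, the second function would surface the modprobe line even though it is logged at INFO level.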

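None of these log lines is at ERROR level. The last one reports that the nf_conntrack_proto_sctp kernel module failed to load, and I assumed this was the root cause. I looked for the module and confirmed that the current kernel does not ship it. I installed the linux-modules-extra package, but the module was still missing. I then checked the other nodes, which run normally, and none of them had this kernel module either. So the module is evidently not required, and this message is not what prevents Calico Node from starting.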
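The check for the module file can be sketched in Python as below. This is only an illustration: the function name is hypothetical, and it assumes module files live under a directory such as /lib/modules/6.5.0-21-generic and are named <module>.ko, optionally with a compression suffix. A module compiled directly into the kernel would not appear as a file, so an empty result does not by itself prove the functionality is missing.

```python
import os


def find_module_files(module_name, modules_root):
    """Return sorted paths of .ko files under modules_root matching module_name."""
    # Module names treat "-" and "_" as equivalent, so normalize both sides.
    wanted = module_name.replace("-", "_")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(modules_root):
        for filename in filenames:
            stem, _dot, rest = filename.partition(".")
            # rest is "ko" for plain modules, or "ko.zst", "ko.xz", "ko.gz"
            # for compressed ones; anything else is not a module file.
            if rest.split(".")[0] == "ko" and stem.replace("-", "_") == wanted:
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    root = os.path.join("/lib/modules", os.uname().release)
    print(find_module_files("nf_conntrack_proto_sctp", root))
```

Running the same check on a healthy node and on the failing node gives a quick way to rule a missing module in or out as the difference between them.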

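Without any particular goal, I searched Google for nf_conntrack_proto_sctp ubuntu 22.04. The first result was titled Calico Node keeps restarting with CrashLoopBackOff #7951, so I opened it right away. It is an issue from August 2023, already closed, with a fairly long discussion. At the end of the thread, the author of the issue reported solving the problem themselves: the cause was the SystemdCgroup setting. I immediately checked /etc/containerd/config.toml on my node and, sure enough, SystemdCgroup was set to false. I changed it to true, restarted the containerd and kubelet services, and deleted the Calico Node pod so that it would be recreated. This time it started normally.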
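To catch this earlier on the next new node, the setting can be checked before the node joins the cluster. The sketch below is one possible approach, not the method from the issue thread: it scans the text of a containerd config for SystemdCgroup lines with a regular expression instead of a full TOML parser. The function names and the sample excerpt are illustrative assumptions.

```python
import re

# Matches an uncommented "SystemdCgroup = true|false" line, keeping the
# left-hand side (indentation, key, "=") in group 1 so it can be reused.
SETTING = re.compile(r"^([ \t]*SystemdCgroup[ \t]*=[ \t]*)(true|false)\b", re.MULTILINE)


def systemd_cgroup_values(config_text):
    """Return the boolean value of every SystemdCgroup setting found."""
    return [m.group(2) == "true" for m in SETTING.finditer(config_text)]


def enable_systemd_cgroup(config_text):
    """Return config_text with every SystemdCgroup setting switched to true."""
    return SETTING.sub(lambda m: m.group(1) + "true", config_text)


# Illustrative excerpt of a containerd config; a real file has many more keys.
SAMPLE_CONFIG = """\
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = ""
  SystemdCgroup = false
"""

if __name__ == "__main__":
    print(systemd_cgroup_values(SAMPLE_CONFIG))
    print(enable_systemd_cgroup(SAMPLE_CONFIG))
```

To use it against a real node, read /etc/containerd/config.toml into a string, inspect the result of systemd_cgroup_values, and only then decide whether to write the changed text back. The containerd and kubelet services still need to be restarted afterwards, as described above.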