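A user reported that Pods in a K8S cluster intermittently lost network connectivity. The failures appeared random: they were not tied to particular compute nodes or particular virtual machines, and they occurred at unpredictable times. Without changing any virtual machine or network resource in OpenStack, the user could restore connectivity simply by recreating the Pod. Moreover, the affected Pods had been communicating normally before the outage. Because the user resolved the first two incidents by recreating the Pod, nothing of the failure scene was preserved, and the investigation made no real progress. We therefore asked the user to leave the failed Pod in place the next time the fault occurred, so that the root cause could be investigated before service was restored. PS: Random failure is quite a challenge.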
In the abnormal state, the flow entry described above still exists but is not hit; the traffic instead hits the lower-priority entry ct_mark=0x1,reg5=0x362 actions=drop. For traffic to hit that entry, it must first have hit ct_state=+est,ip,reg5=0x362 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])), which sets the mark in the connection tracking table to 1 (invalid). Neither of these should happen under normal conditions.

    cookie=0xcccccccccccccccc, duration=8725.528s, table=72, n_packets=68693, n_bytes=12470448, idle_age=0, priority=77,ct_state=+est-rel-rpl,ip,reg5=0x362,nw_proto=4 actions=resubmit(,73)
    cookie=0xcccccccccccccccc, duration=8725.531s, table=72, n_packets=16642, n_bytes=1668970, idle_age=0, priority=50,ct_mark=0x1,reg5=0x362 actions=drop
    cookie=0xcccccccccccccccc, duration=8725.531s, table=72, n_packets=2, n_bytes=156, idle_age=8725, priority=40,ct_state=+est,ip,reg5=0x362 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[]))

Looking more closely at the flow entries (the duration values), the lifetimes of the three entries differ slightly, by 0.003 s (8725.531 s versus 8725.528 s): the entry that should have been hit was installed 0.003 s later than the entries being hit in the abnormal state. This gives a preliminary conclusion on the cause of the fault: the 0.003 s gap in flow installation interrupted the traffic. The detailed analysis is shown in the figure below.

A close reading of the Neutron code also shows that, in the flow installation sequence, _initialize_tracked_egress runs before create_flows_from_rule_and_port.

    def _initialize_tracked_egress(self, port):
        # Drop invalid packets
        self._add_flow(
            table=ovs_consts.RULES_EGRESS_TABLE,
            priority=50,
            ct_state=ovsfw_consts.OF_STATE_INVALID,
            actions='drop'
        )
        # Drop traffic for removed sg rules
        self._add_flow(
            table=ovs_consts.RULES_EGRESS_TABLE,
            priority=50,
            reg_port=port.ofport,
            ct_mark=ovsfw_consts.CT_MARK_INVALID,
            actions='drop'
        )
        ……

    def add_flows_from_rules(self, port):
        self._initialize_tracked_ingress(port)
        self._initialize_tracked_egress(port)
        LOG.debug('Creating flow rules for port %s that is port %d in OVS',
                  port.id, port.ofport)
        for rule in self._create_rules_generator_for_port(port):
            # NOTE(toshii): A better version of merge_common_rules and
            # its friend should be applied here in order to avoid
            # overlapping flows.
            flows = rules.create_flows_from_rule_and_port(rule, port)
            LOG.debug("RULGEN: Rules generated for flow %s are %s",
                      rule, flows)
            for flow in flows:
                self._accept_flow(**flow)

        self._add_non_ip_conj_flows(port)

        self.conj_ip_manager.update_flows_for_vlan(port.vlan_tag)

Security group flow entries are normally updated right after a security group is updated. Working backwards from the flow durations above and the time of the investigation gives a security group update time that matches the time OpenStack shows for the user's security group update. According to the user's feedback, the time at which the fault was noticed also coincides closely with the security group update time.
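The back-calculation from the duration values can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (parse_flows, install_lag, install_time) are made up for this example, and captured_at is a hypothetical placeholder, since the actual time at which the dump was taken is not recorded in this write-up. The three flow lines are the table 72 entries quoted earlier.

```python
import re
from datetime import datetime, timedelta

# The three table=72 entries quoted in this write-up.
DUMP = """\
cookie=0xcccccccccccccccc, duration=8725.528s, table=72, n_packets=68693, n_bytes=12470448, idle_age=0, priority=77,ct_state=+est-rel-rpl,ip,reg5=0x362,nw_proto=4 actions=resubmit(,73)
cookie=0xcccccccccccccccc, duration=8725.531s, table=72, n_packets=16642, n_bytes=1668970, idle_age=0, priority=50,ct_mark=0x1,reg5=0x362 actions=drop
cookie=0xcccccccccccccccc, duration=8725.531s, table=72, n_packets=2, n_bytes=156, idle_age=8725, priority=40,ct_state=+est,ip,reg5=0x362 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[]))
"""

# Pull the duration (seconds) and the priority out of one dump line.
FLOW_RE = re.compile(r"duration=(?P<duration>[\d.]+)s,.*?priority=(?P<priority>\d+)")


def parse_flows(dump):
    """Return a list of (priority, duration_seconds) tuples, one per flow line."""
    flows = []
    for line in dump.splitlines():
        match = FLOW_RE.search(line)
        if match:
            flows.append((int(match["priority"]), float(match["duration"])))
    return flows


def install_lag(flows):
    """Seconds by which each flow was installed after the oldest flow.

    A larger duration means the flow has existed longer, i.e. it was
    installed earlier, so the lag is (largest duration - own duration).
    """
    oldest = max(duration for _, duration in flows)
    return {priority: round(oldest - duration, 3) for priority, duration in flows}


def install_time(captured_at, duration):
    """Wall-clock time at which a flow was installed, given the dump time."""
    return captured_at - timedelta(seconds=duration)


if __name__ == "__main__":
    flows = parse_flows(DUMP)
    print(install_lag(flows))
    # Hypothetical capture time, used only to show the arithmetic.
    captured_at = datetime(2020, 1, 1, 12, 0, 0)
    for priority, duration in flows:
        print(priority, install_time(captured_at, duration).isoformat())
```

Comparing the install_time result for the earliest flow against the security group update time shown by OpenStack is the check described above. install_lag shows how far the accept flow trails the drop and mark flows.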