Software | Version |
OpenStack | Mitaka |
MySQL | 10.0.25 |
RabbitMQ | 3.3.5 |
Open vSwitch | 2.4.0 |
Docker | 1.10.3 |
Qemu-kvm | 2.2.1 |
Libvirt | 1.2.8 |
Ceph | 10.2.0 |
Linux Release | CentOS Linux release 7.1.1503 (Core) |
Linux Kernel | 3.10.0-229.7.2.el7.centos.x86_64 |
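2017/3/17: Finished testing the fix and rolled it out to production;
2017/3/15: Found the root cause, identified the triggering scenario, and settled on the fix;
2017/3/14: Found scenario 2 in which neutron creates an unbounded number of tap devices; the problem could be reproduced under simulated conditions and matched the logs from the environment;
2017/3/13: Found scenario 1 in which neutron creates an unbounded number of tap devices; the problem could be reproduced under simulated conditions, but it did not fully match the environment logs and the simulated conditions are very unlikely to occur, so it was ruled out;
09:00:00: Restored the network services (neutron-metadata-agent, neutron-dhcp-agent) on the controller node;
08:00:00: The automated script finished cleaning up all invalid tap devices; the neutron-*-agent services started successfully on the controller node and virtual network service was restored;
22:00:00: Ran a script to clean up the invalid tap devices on the controller node automatically;
21:30:00: Access to the 6 VMs that belonged to the original controller node was restored; the VMs were brought back up on other nodes (nc05, nc03);
21:00:00: The environment's update-package function was restored; neutron-metadata-agent was started successfully on another node (nc05);
20:30:00: Network allocation for newly created VMs was restored across the environment; neutron-dhcp-agent was started successfully on another node (nc05);
18:30:00: Gave up restoring the network services on the controller node and tried restoring them on other nodes instead;
18:00:00: Tried deleting the tap devices, but progress was slow; tried removing the tap-device pre-scan from the neutron-openvswitch-agent code, but abandoned the online change because it touched too many places;
17:00:00: Found 10,000+ tap devices on the controller node with link state DOWN (which caused neutron-openvswitch-agent to fail to start);
16:40:00: Network allocation for newly created VMs failed across the environment; in the confusion the responder restarted neutron-dhcp-agent again, saw the same error repeating in a loop, and the restart failed;
16:30:00: Networking failed for the 6 VMs on the controller node; suspecting that the controller node's flow tables had not been refreshed, a developer tried restarting neutron-openvswitch-agent on the controller node to refresh them, but the restart failed;
16:25:00: Developers began diagnosing the issue; a developer confirmed that VMs were created successfully, but new VMs could not reach the neutron-metadata service endpoint, so business cluster configuration failed;
16:10:00: The update-package function failed in the production environment; a project maintainer reported that updating the business application cluster in the training environment had failed;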
def _periodic_resync_helper(self):
    """Resync the dhcp state at the configured interval."""
    while True:
        # interval between checks (30 s here)
        eventlet.sleep(self.conf.resync_interval)
        # needs_resync_reasons is filled in by pending events
        if self.needs_resync_reasons:
            # be careful to avoid a race with additions to list
            # from other threads
            reasons = self.needs_resync_reasons
            self.needs_resync_reasons = collections.defaultdict(list)
            for net, r in reasons.items():
                if not net:
                    net = "*"
                LOG.debug("resync (%(network)s): %(reason)s",
                          {"reason": r, "network": net})
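1) neutron-dhcp-agent loops, checking whether a resync is pending (needs_resync_reasons);
2) Each iteration is delayed by conf.resync_interval seconds;
3) setup_dhcp_port() requests a new port (a new tap ID is generated);
4) add_veth() creates a veth device (a new tap device appears);
5) ensure_namespace() checks that the namespace exists and creates it if it does not;
6) set_netns() sets the tap device's network namespace, and this step fails;
7) Control returns to step 1 and the loop repeats;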
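1) After setting the namespace fails, there is no proper try...except path that deletes the device created earlier, so failed tap devices keep accumulating;
2) ensure_namespace() only checks whether the namespace exists; it does not check deeper properties, such as permissions, that can make the later set step fail.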
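The first cause can be illustrated with a small simulation. This is a minimal sketch, not Neutron's code: `FakeHost`, `plug_leaky`, `plug_with_cleanup` and `run_resync_cycles` are hypothetical names, and the host is modeled as an in-memory set of device names rather than real tap devices. It shows how retrying a failing set-namespace step leaves one device behind per resync cycle unless the failure path removes the device it just created.

```python
import itertools


class NamespaceError(Exception):
    """Raised when a device cannot be moved into its namespace."""


class FakeHost:
    """In-memory stand-in for a host; it only tracks device names."""

    def __init__(self, namespace_usable):
        self.devices = set()
        self.namespace_usable = namespace_usable
        self._ids = itertools.count()

    def add_device(self):
        name = "tap%d" % next(self._ids)
        self.devices.add(name)
        return name

    def delete_device(self, name):
        self.devices.discard(name)

    def set_netns(self, name):
        # Models the failing step: the namespace is visible but unusable.
        if not self.namespace_usable:
            raise NamespaceError("Invalid argument")


def plug_leaky(host):
    """Create a device and move it; the device is left behind on failure."""
    name = host.add_device()
    host.set_netns(name)
    return name


def plug_with_cleanup(host):
    """Same steps, but the new device is deleted if the move fails."""
    name = host.add_device()
    try:
        host.set_netns(name)
    except NamespaceError:
        host.delete_device(name)
        raise
    return name


def run_resync_cycles(host, plug, cycles):
    """Retry plug() once per cycle, as a periodic resync loop would."""
    for _ in range(cycles):
        try:
            plug(host)
        except NamespaceError:
            pass  # a real agent would schedule another resync here
    return len(host.devices)
```

With an unusable namespace, `run_resync_cycles(FakeHost(False), plug_leaky, 100)` leaves 100 devices behind, while the same call with `plug_with_cleanup` leaves none.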
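The neutron-*-agent services all need to operate on the host's shared network namespaces. When OpenStack is deployed in containers, each neutron-*-agent container must therefore be started with mount options such as "-v /run/netns:/run/netns:shared -v /run:/run:rw". The figure below shows the sharing and isolation relationships between two containers: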
const defaultMountFlags = syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV

// parseMountOptions parses the string and returns the flags, propagation
// flags and any mount data that it contains.
func parseMountOptions(options []string) (int, []int, string, int) {
	// ...
	propagationFlags := map[string]int{
		"private":     syscall.MS_PRIVATE,
		"shared":      syscall.MS_SHARED,
		"slave":       syscall.MS_SLAVE,
		"unbindable":  syscall.MS_UNBINDABLE,
		"rprivate":    syscall.MS_PRIVATE | syscall.MS_REC,
		"rshared":     syscall.MS_SHARED | syscall.MS_REC,
		"rslave":      syscall.MS_SLAVE | syscall.MS_REC,
		"runbindable": syscall.MS_UNBINDABLE | syscall.MS_REC,
	}
	// ...
}

// Do the mount operation followed by additional mounts required to take care
// of propagation flags.
func mountPropagate(m *configs.Mount, rootfs string, mountLabel string) error {
	var (
		dest  = m.Destination
		data  = label.FormatMountLabel(m.Data, mountLabel)
		flags = m.Flags
	)
	if libcontainerUtils.CleanPath(dest) == "/dev" {
		flags &= ^syscall.MS_RDONLY
	}

	copyUp := m.Extensions&configs.EXT_COPYUP == configs.EXT_COPYUP
	if !(copyUp || strings.HasPrefix(dest, rootfs)) {
		dest = filepath.Join(rootfs, dest)
	}
	// ...
}
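The containers share the host's network namespace, but each container has its own unshared mount namespace. Within its own mount namespace, each container shares the host's /run/netns directory, which serves as the shared entry point for operating on the virtual networks' network namespaces.
Docker handles the extra options passed in, such as private and shared, by passing different flags to the mount system call; each flag corresponds to a different mount propagation type: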
MS_SHARED: This mount point shares mount and unmount events with other mount points that are members of its “peer group”. When a mount point is added or removed under this mount point, this change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point.
MS_PRIVATE: This is the converse of a shared mount point. The mount point does not propagate events to any peers, and does not receive propagation events from any peers.
MS_SLAVE: This propagation type sits midway between shared and private. A slave mount has a master: a shared peer group whose members propagate mount and unmount events to the slave mount. However, the slave mount does not propagate events to the master peer group.
MS_UNBINDABLE: This mount point is unbindable. Like a private mount point, this mount point does not propagate events to or from peers. In addition, this mount point can’t be the source for a bind mount operation.
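To see which of these propagation types a mount actually has, the optional fields of /proc/self/mountinfo can be inspected: `shared:N` marks a shared mount, `master:N` a slave mount, `unbindable` an unbindable mount, and the absence of any such tag a private mount. The sketch below is a minimal parser built on that layout; `propagation_of` and `propagation_table` are illustrative helper names, not part of Docker or Neutron.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one mountinfo line."""
    # Everything before " - " is: mount ID, parent ID, major:minor, root,
    # mount point, mount options, then zero or more optional fields.
    head, _, _ = mountinfo_line.partition(" - ")
    fields = head.split()
    mount_point = fields[4]
    tags = {field.split(":")[0] for field in fields[6:]}
    if "unbindable" in tags:
        kind = "unbindable"
    elif "shared" in tags and "master" in tags:
        kind = "shared+slave"  # a slave that is also shared with its own peers
    elif "shared" in tags:
        kind = "shared"
    elif "master" in tags:
        kind = "slave"
    else:
        kind = "private"
    return mount_point, kind


def propagation_table(mountinfo_text):
    """Map each mount point in a mountinfo dump to its propagation type."""
    return dict(
        propagation_of(line)
        for line in mountinfo_text.splitlines()
        if line.strip()
    )
```

Feeding it the contents of /proc/self/mountinfo read inside an agent container, and looking up "/run" and "/run/netns" in the result, shows how those two paths are propagated in that container.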
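Because the /run directory was not mounted with the MS_SHARED propagation type when the container was started, the namespace files created before a container restart can no longer be deleted normally afterwards: other peers in the `peer group` created for /run/netns still hold references to them (Device or resource busy).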
umount("/run/netns/ns1", MNT_DETACH) = -1 EINVAL (Invalid argument)
unlink("/run/netns/ns1") = -1 EBUSY (Device or resource busy)
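The consequence of the failed deletion is that the namespace file is not fully removed at the filesystem level, while the actual network namespace has already been released. To a user-space program, this "half-deleted" namespace looks like this: the namespace is still visible, so the ensure_namespace() call in step 5) of Section 2.1 succeeds, but setting it fails, which corresponds to the set_netns() failure in step 6) of Section 2.1.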
setns(4, 1073741824) = -1 EINVAL (Invalid argument)
write(2, "seting the network namespace \"ns"..., 60seting the network namespace "ns1" failed: Invalid argument
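The traces above suggest one way to spot such "half-deleted" namespaces: the name still exists under /run/netns, but umount reports EINVAL, meaning the path is not a mount point in the caller's mount namespace. The sketch below rests on that assumption and compares a list of names against a mountinfo dump; `mounted_paths` and `find_stale_netns` are hypothetical helpers, and mount points containing escaped characters are not handled.

```python
import posixpath


def mounted_paths(mountinfo_text):
    """Collect the mount points listed in a mountinfo dump."""
    paths = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) > 4:
            paths.add(fields[4])  # the fifth field is the mount point
    return paths


def find_stale_netns(names, mountinfo_text, netns_dir="/run/netns"):
    """Return namespace names that have a file but no mount behind it."""
    mounted = mounted_paths(mountinfo_text)
    return sorted(
        name for name in names
        if posixpath.join(netns_dir, name) not in mounted
    )
```

In practice `names` would come from listing /run/netns and `mountinfo_text` from reading /proc/self/mountinfo, both inside the container that runs the agent. A stricter ensure_namespace() could use a check like this before reporting the namespace as ready.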
- -v /run/netns:/run/netns:shared -v /run:/run:rw
+ -v /run/netns:/run/netns:shared -v /run:/run:rw:shared
1) Create a network and a subnet
neutron net-create --shared --provider:network_type vlan --provider:physical_network physnet1 --provider:segmentation_id 108 vlan108
neutron subnet-create --name subnet108 vlan108
2) Restart the neutron_dhcp_agent container
docker restart neutron_dhcp_agent
At this point the namespace created by neutron_dhcp_agent is referenced by the docker daemon and by other containers; ownership has effectively moved, and neutron_dhcp_agent is no longer allowed to delete it
3) Delete the network's subnet
neutron subnet-delete subnet108
4) Create a new subnet for the network
neutron subnet-create --name subnet108 vlan108
1) Start two containers that share the filesystem path used to operate on network namespaces
docker run -d --name testa -it -v /run:/run:rw -v /run/netns:/run/netns:shared --privileged --net=host nova-compute:latest bash
docker run -d --name testb -it -v /run:/run:rw -v /run/netns:/run/netns:shared --privileged --net=host nova-compute:latest bash
2) Create namespace ns1 in container A
docker exec -u root testa ip netns add ns1
3) Restart container A
docker restart testa
4) Delete the previously created namespace from container A
docker exec -u root testa ip netns del ns1
Cannot remove namespace file "/var/run/netns/ns1": Device or resource busy
5) Enter the namespace from container A
docker exec -u root testa ip netns exec ns1 ip a
seting the network namespace "ns1" failed: Invalid argument
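Each mount namespace has its own independent view of the filesystem, but this isolation also causes problems. For example, when a new disk is added to the system, the original implementation required every namespace to mount the disk separately. To address this, the kernel introduced the shared subtrees feature in 2.6.15: “The key benefit of shared subtrees is to allow automatic, controlled propagation of mount and unmount events between namespaces. This means, for example, that mounting an optical disk in one mount namespace can trigger a mount of that disk in all other namespaces.” Every mount point is tagged with a propagation type, which determines whether (child) mount points created or removed beneath it are propagated to other mount points. The feature also brings potential complexity: as described in Section 2.2, the way ownership is propagated and transferred can take users by surprise.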
Docker Basics: Linux Namespace http://coolshell.cn/articles/17010.html
Mount namespace and mount propagation http://hustcat.github.io/mount-namespace-and-mount-propagation/
Shared Subtrees https://lwn.net/Articles/159077/
Mount namespaces and shared subtrees https://lwn.net/Articles/689856/