https://lwn.net/Articles/740157/
http://arthurchiao.art/blog/ebpf-and-k8s-zh/
https://elixir.bootlin.com/linux/v4.15/source/tools/bpf/bpftool/Documentation/bpftool.rst
https://elixir.bootlin.com/linux/v4.15/source/tools/bpf/bpftool
Reference book: https://files.cnblogs.com/files/codestack/OReilly-Linux-Observability-with-BPF-2019.rar
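Tencent Cloud implemented IPVS + eBPF in the kernel version it ships to improve performance, removing the performance penalty that iptables SNAT imposed on IPVS.
The patch is as follows: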
https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_bpf_proc.c#L56
https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_bpf_proc.c#L402
https://github.com/Tencent/TencentOS-kernel/commit/d1a590ff8758af2fc1055899c9bf49f3028243d5#diff-1a592df245508d5f30565cf2860fa670bf1ae15b0d69d07e5328a0b880fc1024
https://cloud.tencent.com/developer/inventory/600/article/1685332
Among Tencent Cloud's implementation details is logic in which the IPVS module inserts data into a BPF map. The key code is shown below:
    // https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_bpf_proc.c#L56
    struct bpf_map *conntrack_map;

    // https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_bpf_proc.c#L402
    static ssize_t ip_vs_bpf_write(struct file *file, const char __user *ubuf,
                                   size_t count, loff_t *ppos)
    {
        int err = 0;
        struct bpf_map *map = NULL;
        struct bpf_prog *prog1 = NULL;
        struct bpf_prog *prog2 = NULL;
        const char delim[2] = ":";
        char ids[5][20];
        char *token;
        int tag, pid;
        unsigned int mapid, progid1, progid2;
        char buf[100];
        int i = 0;
        char *s = buf;
        ...
        /* singleton:conntrack_map is assigned once,
         * and be nulled in module exit */
        if (conntrack_map) {
            pr_err("%s %d conntrack_map exists ", __func__, __LINE__);
            return -EEXIST;
        }
        ...
        err = bpf_conntrack_map_get(pid, mapid,
                                    (long long)(resolve_addrs.bpf_map_fops), &map);
        if (err != 0 || !map) {
            pr_err("%s acquire bpf_map failed ", __func__);
            return -EINVAL;
        }
        ...
        bpf_map_fd = mapid;
        ...
        // got the target bpf map via fd
        conntrack_map = map;
        ...
    }

    // https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_bpf_proc.c#L306
    static int bpf_conntrack_map_get(int pid, unsigned int fd,
                                     unsigned long long addr, struct bpf_map **map)
    {
        struct files_struct *files;
        struct task_struct *task;
        struct file *file;
        int err = 0;
        ...
        file = fcheck_files(files, fd);
        ...
        *map = file->private_data;
        if (*map)
            bpf_map_inc2(*map);
        ...
    }
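A brief explanation of ip_vs_bpf_proc.c:
- At the top of ip_vs_bpf_proc.c, a variable conntrack_map of type struct bpf_map (from the kernel's BPF subsystem) is declared. This is the BPF map object that IPVS operates on later.
- Also in ip_vs_bpf_proc.c, the function ip_vs_bpf_write() calls bpf_conntrack_map_get(). One of its arguments, mapid, is the fd of the BPF map. The function looks up the struct file behind that fd, takes its private_data as the struct bpf_map pointer, and increments the map's reference count. The result is then assigned to conntrack_map, which completes the acquisition of the target BPF map.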
Then, when the IPVS module creates a new connection, it inserts the required entries into the conntrack_map described above. The key code is shown below:
    // https://github.com/Tencent/TencentOS-kernel/blob/master/net/netfilter/ipvs/ip_vs_conn.c#L1098
    static bool ip_vs_conn_new_bpf(struct ip_vs_dest *dest, unsigned int flags,
                                   const struct ip_vs_conn_param *p, int *skip)
    {
        int i;
        struct ip_vs_service *svc;
        struct bpf_lb_conn_key key = {};
        struct bpf_lb_conn_key reply_key = {};
        struct bpf_lb_conn_value value = {};
        struct bpf_lb_conn_value reply_value = {};
        struct bpf_lb_conn_value *v;
        int inserted = 0;
        struct bpf_map *map;
        ...
        if (!bpf_mode_on)
            return true;
        ...
        // conntrack_map is the same one
        map = conntrack_map;
        ...
        lip = alloc_localip();
        reply_key.sip = key.dip;
        reply_key.sport = key.dport;
        reply_key.dip = lip;
        reply_key.dport = key.sport;
        reply_key.proto = p->protocol;
        reply_key.vip = 0;
        reply_key.vport = 0;
        reply_key.pad = 0;
        atomic_set(&reply_value.ref, 0);
        reply_value.sip = key.dip;
        reply_value.sport = key.dport;
        reply_value.dip = key.sip;
        reply_value.dport = key.sport;
        reply_value.proto = p->protocol;
        ...
        if (likely(!map->ops->map_lookup_elem(map, &reply_key))) {
            if (likely(map->ops->map_update_elem(map, &reply_key,
                                                 &reply_value, BPF_ANY) == 0)) {
                /* the common case! break the loop */
                inserted = 1;
                nf_conntrack_single_unlock(&reply_key, map->key_size);
                break;
            }
            /* if lookup ok, shall insert ok since lock is held!*/
            pr_err("map insert key failed ");
            BPF_STAT_INC(p->ipvs, BPF_NEW_INSERT);
            nf_conntrack_single_unlock(&reply_key, map->key_size);
            return false;
        }
        ...
    }
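The code above uses map_lookup_elem() and map_update_elem(), two methods for operating on a BPF map. It looks the key up first and, only if no entry exists, updates the map (inserting the new element), which completes the operation on the BPF map.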
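The lookup-then-insert pattern and the way the reply tuple is derived can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: the field names follow the excerpt, but the field widths, the fixed-size array standing in for the BPF map, and all ct_* names are hypothetical. The per-key locking that the kernel code holds around the lookup and update is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Field names follow the excerpt; the widths are assumptions. */
struct ct_key {
    uint32_t sip, dip, vip;
    uint16_t sport, dport, vport;
    uint8_t proto;
    uint8_t pad;
};

struct ct_value {
    uint32_t sip, dip;
    uint16_t sport, dport;
    uint8_t proto;
};

#define CT_SLOTS 16

/* Hypothetical toy map: linear array instead of a kernel BPF hash map. */
struct ct_map {
    int used[CT_SLOTS];
    struct ct_key keys[CT_SLOTS];
    struct ct_value vals[CT_SLOTS];
};

/* Return a pointer to the stored value, or NULL if the key is absent. */
static struct ct_value *ct_lookup(struct ct_map *m, const struct ct_key *k)
{
    for (int i = 0; i < CT_SLOTS; i++)
        if (m->used[i] && memcmp(&m->keys[i], k, sizeof(*k)) == 0)
            return &m->vals[i];
    return NULL;
}

/* Insert or overwrite (BPF_ANY-like). Returns 0 on success, -1 if full. */
static int ct_update(struct ct_map *m, const struct ct_key *k,
                     const struct ct_value *v)
{
    struct ct_value *cur = ct_lookup(m, k);

    if (cur) {
        *cur = *v;
        return 0;
    }
    for (int i = 0; i < CT_SLOTS; i++) {
        if (!m->used[i]) {
            m->used[i] = 1;
            m->keys[i] = *k;
            m->vals[i] = *v;
            return 0;
        }
    }
    return -1;
}

/* Derive the reply-direction key and value the same way the excerpt does. */
static void ct_build_reply(const struct ct_key *key, uint32_t lip,
                           uint8_t proto, struct ct_key *rk,
                           struct ct_value *rv)
{
    memset(rk, 0, sizeof(*rk)); /* also zeroes vip, vport and pad */
    memset(rv, 0, sizeof(*rv));
    rk->sip = key->dip;
    rk->sport = key->dport;
    rk->dip = lip;
    rk->dport = key->sport;
    rk->proto = proto;
    rv->sip = key->dip;
    rv->sport = key->dport;
    rv->dip = key->sip;
    rv->dport = key->sport;
    rv->proto = proto;
}

/* Returns 1 if inserted, 0 if the key already exists, -1 if the map is full. */
static int ct_insert_if_absent(struct ct_map *m, const struct ct_key *rk,
                               const struct ct_value *rv)
{
    if (ct_lookup(m, rk))
        return 0;
    return ct_update(m, rk, rv) == 0 ? 1 : -1;
}
```

Looking up before updating matters here because the update in the excerpt is called with BPF_ANY, which would silently overwrite an existing entry. Checking first lets the caller detect a collision on the reply tuple instead of clobbering another connection's state.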