using eBPF to monitor your self-hosted GPU

April 9, 2025

Why, What, and How?

Why

When you start running GPU workloads at home — whether for LLM inference, AI gateways, or robotics — you quickly realize you've got visibility gaps. Monitoring GPU utilization is easy. Understanding why performance drops under load, or when Thunderbolt/PCIe connections start throttling, is a different story.

Below is the layout of an NVIDIA 4070 Super connected to a Kubernetes node that is virtualized on Proxmox. The GPU is attached over Thunderbolt and passed through to the virtualized machine mlops-worker-00 so the VM can use it directly. We are already using dcgm-exporter, but it does not give us the metrics we need to build a thorough, real-time picture of this setup's current state.


┌─────────────────────┐           ┌────────────────────────────┐
│  Proxmox Host       │           │  Kubernetes Node (Talos)   │
│  ┌───────────────┐  │           │  ┌──────────────────────┐  │
│  │ eBPF Agent    │◀─┼── bridge ─┼─▶│ DaemonSet: eBPF      │  │
│  │ (systemd      │  │  vmbr0    │  │ gpu + net probes     │  │
│  │  container)   │  │           │  └──────────────────────┘  │
│  └───────────────┘  │           │    ┌──────────────┐        │
│                     │           │    │ Prometheus   │        │
│                     │           │    │ node_exporter│        │
│                     │           │    └──────────────┘        │
│                     │           │        ▲                   │
│                     │Thunderbolt│ Scrape │                   │
│                     │   3/4     │        │                   │
│                     │           │  ┌─────┴──────┐            │
│                     │           │  │ Grafana    │            │
│                     │           │  └────────────┘            │
└─────────────────────┘           └────────────────────────────┘

What

eBPF (extended Berkeley Packet Filter) lets you safely run custom programs in the Linux kernel without kernel modules or intrusive instrumentation. Think of it as a programmable microscope for the OS, giving you real-time insights into scheduling, I/O, syscalls, and network flows. On Talos (an immutable OS), you can't install tools like bcc or perf. Instead, you deploy eBPF agents as privileged DaemonSets in Kubernetes.

That pattern is a perfect fit for GPU nodes, where you want to correlate kernel-level metrics with DCGM GPU telemetry — all without breaking Talos immutability.
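
A minimal sketch of that deployment pattern, assuming a hypothetical agent image (swap in whichever eBPF agent you actually run). The key parts are the privileged security context and the host mounts for the BPF and debug filesystems:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ebpf-agent
  template:
    metadata:
      labels:
        app: ebpf-agent
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: agent
          image: ghcr.io/example/ebpf-agent:latest   # placeholder image
          securityContext:
            privileged: true          # needed to load BPF programs
          volumeMounts:
            - name: bpffs
              mountPath: /sys/fs/bpf
            - name: debugfs
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpffs
          hostPath:
            path: /sys/fs/bpf
        - name: debugfs
          hostPath:
            path: /sys/kernel/debug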

How?

Overview

To shed light on these unknowns, we will combine open-source tooling with some of our own, and explore what kinds of metrics we can collect and aggregate for display.

  • Cilium/Hubble → Network QoS and RTT
  • eBPF Agent → Kernel & scheduling latency
  • DCGM Exporter → GPU utilization & PCIe throughput
  • Parca → CPU flamegraphs

┌──────────────────────────────┐
│     Talos Linux Worker       │
│ ┌──────────────────────────┐ │
│ │ NVIDIA GPU (Thunderbolt) │ │
│ └──────────────────────────┘ │
│  ↑ PCIe/DMA  ↑ Network ↓     │
│ ┌──────────────────────────┐ │
│ │ eBPF Agent DaemonSet     │───► Prometheus (TCP RTT, IRQ latency)
│ └──────────────────────────┘ │
│ ┌──────────────────────────┐ │
│ │ DCGM Exporter DaemonSet  │───► Prometheus (GPU util, PCIe TX/RX)
│ └──────────────────────────┘ │
│ ┌──────────────────────────┐ │
│ │ Parca Agent (eBPF CO-RE) │───► Parca Server (CPU flamegraphs)
│ └──────────────────────────┘ │
└──────────────────────────────┘
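
If you manage the Prometheus scrape configuration by hand rather than through the Operator's PodMonitor objects, a pod-discovery job along these lines (a sketch, with a made-up job name) will pick up both DaemonSets, as long as their pods carry the usual prometheus.io/scrape annotation and expose a /metrics endpoint:

scrape_configs:
  - job_name: gpu-node-exporters
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # keep the pod name as a label so the per-pod queries later on work
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod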

Details

One example of using eBPF to derive metrics for monitoring GPU performance is network QoS (quality of service) on the routes that GPU workloads use. I have configured Cilium to enable BBR (Bottleneck Bandwidth and Round-trip propagation time) congestion control, which can drastically improve throughput and latency:

bandwidthManager:
  enabled: true
  bbr: true
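
These are Helm values; with a Helm-managed Cilium install, something like helm upgrade cilium cilium/cilium -n kube-system --reuse-values -f values.yaml should roll them out, and running cilium status inside an agent pod should report the BandwidthManager with BBR enabled. Note that BBR support in the bandwidth manager expects a fairly recent kernel (the Cilium docs call out 5.18 or newer).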

Then, you can use the annotations below to define your ingress / egress bandwidth limits so pods do not exceed 500 Mbps in either direction.

metadata:
  annotations:
    kubernetes.io/egress-bandwidth: "500M"
    kubernetes.io/ingress-bandwidth: "500M"
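
For a hypothetical inference workload, note that these are pod annotations, so on a Deployment they belong on the pod template rather than on the Deployment's own metadata:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference              # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
      annotations:
        kubernetes.io/egress-bandwidth: "500M"
        kubernetes.io/ingress-bandwidth: "500M"
    spec:
      containers:
        - name: server
          image: ghcr.io/example/llm-server:latest   # placeholder image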

How do you provide visibility into this? I am hoping to capture and use the following metrics:

  • ebpf_tc_bytes_total
  • ebpf_tc_dropped_packets_total
  • ebpf_tcp_rtt_microseconds_sum
  • ebpf_tcp_retransmissions_total
  • ebpf_sched_latency_microseconds_sum

These would roll up into queries like the following for display in Grafana.

# Latency improvement under BBR
sum by (pod) (rate(ebpf_tcp_rtt_microseconds_sum[1m]))

# Traffic shaping validation
sum by (pod) (rate(ebpf_tc_bytes_total[1m]))

If BandwidthManager and BBR are working correctly, RTT variance (jitter) will drop, retransmits will fall near zero, and aggregate throughput will stabilize at ~500 Mbps per pod (or whatever you've set).
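
A few validation queries I plan to use, built from the same metric names above (the jitter one assumes your Prometheus supports subqueries; the extra parentheses are just to keep the parser happy):

# Retransmits should sit near zero per pod
sum by (pod) (rate(ebpf_tcp_retransmissions_total[5m]))

# RTT jitter: how much the per-pod RTT rate wobbles over 10 minutes
stddev_over_time((sum by (pod) (rate(ebpf_tcp_rtt_microseconds_sum[1m])))[10m:1m])

# Per-pod throughput in Mbps, to compare against the 500M cap
sum by (pod) (rate(ebpf_tc_bytes_total[1m])) * 8 / 1e6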