
Why, What, and How?
Why
When you start running GPU workloads at home, whether for LLM inference, AI gateways, or robotics, you quickly realize you've got visibility gaps. Monitoring GPU utilization is easy. Understanding why performance drops under load, or when Thunderbolt/PCIe connections start throttling, is a different story.
Below is the layout of an NVIDIA 4070 Super connected to a Kubernetes node that is virtualized on Proxmox. The GPU is connected via Thunderbolt and set to passthrough mode so it can be utilized by the virtualized machine mlops-worker-00. We are already using dcgm-exporter, but it does not give us the metrics we need to build a thorough, real-time visualization of this setup's current state.
┌───────────────────────┐         ┌─────────────────────────────┐
│ Proxmox Host          │         │ Kubernetes Node (Talos)     │
│ ┌──────────────────┐  │         │ ┌─────────────────────────┐ │
│ │ eBPF Agent       │──┼─bridge──┼▶│ DaemonSet: eBPF         │ │
│ │ (systemd         │  │  vmbr0  │ │ gpu + net probes        │ │
│ │  container)      │  │         │ └─────────────────────────┘ │
│ └──────────────────┘  │         │ ┌──────────────┐            │
│                       │         │ │ Prometheus   │            │
│   Thunderbolt 3/4     │         │ │ node_exporter│            │
│                       │         │ └──────────────┘            │
│                       │         │        ▲                    │
│                       │         │        │ Scrape             │
│                       │         │  ┌─────┴─────┐              │
│                       │         │  │  Grafana  │              │
│                       │         │  └───────────┘              │
└───────────────────────┘         └─────────────────────────────┘
What
eBPF (extended Berkeley Packet Filter) lets you safely run custom programs in the Linux kernel without kernel modules or intrusive instrumentation. Think of it as a programmable microscope for the OS, giving you real-time insights into scheduling, I/O, syscalls, and network flows. On Talos (an immutable OS), you can't install tools like bcc or perf. Instead, you deploy eBPF agents as privileged DaemonSets in Kubernetes.
That pattern fits perfectly for GPU nodes, where you want to correlate kernel-level metrics with DCGM GPU telemetry, all without breaking Talos immutability.
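As a rough sketch, such a privileged DaemonSet looks something like the manifest below. The agent name, namespace, and image are placeholders for whatever eBPF tooling you actually run; on Talos the namespace typically also needs the privileged Pod Security profile before this will schedule.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: ebpf-agent
  template:
    metadata:
      labels:
        app: ebpf-agent
    spec:
      hostPID: true          # see host processes for scheduling probes
      hostNetwork: true      # observe the host network stack
      containers:
        - name: ebpf-agent
          image: example.com/ebpf-agent:latest   # placeholder image
          securityContext:
            privileged: true                     # required to load eBPF programs
          volumeMounts:
            - name: bpffs
              mountPath: /sys/fs/bpf
            - name: debugfs
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpffs
          hostPath:
            path: /sys/fs/bpf
        - name: debugfs
          hostPath:
            path: /sys/kernel/debug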
How?
Overview
To provide visibility into the unknown, we will use a combination of open-source tooling and personal tooling, and discover what types of metrics we can collect and aggregate for display.
- Cilium/Hubble – Network QoS and RTT
- eBPF Agent – Kernel & scheduling latency
- DCGM Exporter – GPU utilization & PCIe throughput
- Parca – CPU flamegraphs
┌────────────────────────────────┐
│ Talos Linux Worker             │
│ ┌────────────────────────────┐ │
│ │ NVIDIA GPU (Thunderbolt)   │ │
│ └────────────────────────────┘ │
│    │ PCIe/DMA      │ Network   │
│ ┌────────────────────────────┐ │
│ │ eBPF Agent DaemonSet       │─┼──► Prometheus (TCP RTT, IRQ latency)
│ └────────────────────────────┘ │
│ ┌────────────────────────────┐ │
│ │ DCGM Exporter DaemonSet    │─┼──► Prometheus (GPU util, PCIe TX/RX)
│ └────────────────────────────┘ │
│ ┌────────────────────────────┐ │
│ │ Parca Agent (eBPF CO-RE)   │─┼──► Parca Server (CPU flamegraphs)
│ └────────────────────────────┘ │
└────────────────────────────────┘
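Both exporter DaemonSets in the diagram hand their metrics to Prometheus. If you run Prometheus with a plain config file rather than the operator, the scrape jobs could look roughly like this; the pod labels app=dcgm-exporter and app=ebpf-agent are assumptions about how the DaemonSets are labeled.

scrape_configs:
  - job_name: dcgm-exporter          # GPU util, PCIe TX/RX
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter
        action: keep
  - job_name: ebpf-agent             # TCP RTT, IRQ/sched latency
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: ebpf-agent
        action: keep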
Details
One example of using eBPF to derive metrics for monitoring GPU performance is network QoS (quality of service) over GPU-utilized routes in the network. I have configured Cilium to enable BBR (Bottleneck Bandwidth and Round-trip propagation time) congestion control, which can drastically improve throughput and latency.
bandwidthManager:
  enabled: true
  bbr: true
Then, you can use the annotations below to define your ingress / egress bandwidth limits so pods do not exceed 500 Mbps in either direction.
metadata:
  annotations:
    kubernetes.io/egress-bandwidth: "500M"
    kubernetes.io/ingress-bandwidth: "500M"
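In context, a hypothetical GPU inference pod with those limits would look something like this; the pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference              # placeholder name
  annotations:
    kubernetes.io/egress-bandwidth: "500M"    # cap egress at 500 Mbps
    kubernetes.io/ingress-bandwidth: "500M"   # cap ingress at 500 Mbps
spec:
  containers:
    - name: inference
      image: example.com/llm-inference:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # land on the GPU worker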
How do you provide visibility into this? I am hoping to capture and utilize the following metrics:
- ebpf_tc_bytes_total
- ebpf_tc_dropped_packets_total
- ebpf_tcp_rtt_microseconds_sum
- ebpf_tcp_retransmissions_total
- ebpf_sched_latency_microseconds_sum
These bundle up into the following metric queries in Grafana for display.
# Latency improvement under BBR
sum by (pod) (rate(ebpf_tcp_rtt_microseconds_sum[1m]))
# Traffic shaping validation
sum by (pod) (rate(ebpf_tc_bytes_total[1m]))
If BandwidthManager and BBR are working correctly, RTT variance (jitter) will drop, retransmits will fall near zero, and aggregate throughput will stabilize at ~500 Mbps per pod (or whatever you've set).
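To check those three claims directly, a few more PromQL sketches built on the same custom metric names (again assuming the eBPF agent exposes them exactly as listed above):

# RTT jitter proxy per pod (should shrink under BBR)
stddev_over_time((sum by (pod) (rate(ebpf_tcp_rtt_microseconds_sum[1m])))[10m:1m])
# Retransmissions per second per pod (should approach zero)
sum by (pod) (rate(ebpf_tcp_retransmissions_total[5m]))
# Per-pod throughput in Mbps (should plateau near the 500M annotation)
sum by (pod) (rate(ebpf_tc_bytes_total[1m])) * 8 / 1000000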