Monitoring AI both at home and away from home

April 9, 2025

The image above is a Grafana dashboard showing metrics for the eGPU that powers all of my self-hosted LLMs. I used a nodeSelector to force the dcgm-exporter to schedule on the node the GPU is connected to. However, I would like to go beyond the metrics this exporter provides and look at other possible solutions for monitoring the GPU.
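For readers who have not pinned a workload to a node before, here is a minimal sketch of what the nodeSelector approach can look like. It builds a merge patch as a plain Python dict and prints it as JSON. The label `example.com/egpu=true` and the helper name `node_selector_patch` are hypothetical, and the sketch assumes the exporter runs as a workload with a pod template (a DaemonSet, for instance); my actual manifest may differ.

```python
import json


def node_selector_patch(label_key: str, label_value: str) -> dict:
    """Build a merge patch that adds a nodeSelector entry to a pod template.

    Applies to any workload with spec.template.spec (DaemonSet, Deployment, ...).
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    # Pods are only scheduled on nodes carrying this exact label.
                    "nodeSelector": {label_key: label_value}
                }
            }
        }
    }


# Hypothetical label: substitute whatever label the GPU node really carries.
patch = node_selector_patch("example.com/egpu", "true")

# Serialized form, suitable for passing to `kubectl patch --type merge -p`.
patch_json = json.dumps(patch)
print(patch_json)
```

Assuming the node was labeled first with `kubectl label node <node-name> example.com/egpu=true`, the printed JSON could then be applied with `kubectl patch daemonset <name> --type merge -p '<json>'`, where the workload name and namespace depend on how the exporter was installed.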

Overall, here is what we are going to cover.

  1. GPU Observability foundation: dcgm-exporter
    • Discuss the current setup with the GPU that we are monitoring
    • Go over the metrics provided
    • Cover how to generate specific metrics that target how the GPU communicates with the host machine
  2. Monitoring your AI Gateway
    • Why use an AI gateway?
    • What does observability look like?
  3. Monitoring LLM Provider Costs
  4. Monitoring GPU metrics with eBPF
    • What is eBPF and how can we use it to enhance observability with our GPU?

GPU Observability foundation

...

Monitoring your AI Gateway

...

Utilize eBPF to extend the available metrics

...