Observability - DUT and host metrics¶

The lab host runs the metrics stack. On each DUT, only the OpenWrt exporter is required (prometheus-node-exporter-lua); the orchestration host exposes its own metrics via prometheus-node-exporter. The Ansible observability role automates the following on the host:

Public HTTPS Grafana (Oracle VM, Nginx, reverse SSH tunnel), more details here.
Below: Prometheus scrape and local Grafana on the lab host.

What Ansible automates¶

Action	Detail
Packages	Installs `autossh`, `prometheus`, `grafana` (official Grafana repo), `prometheus-node-exporter` (host).
DUT tunnels	For each `observability_duts` row: unit `dut-metrics-tunnel-<name>.service` (`autossh`, forward `127.0.0.1:<local_port>` to `127.0.0.1:9100` on the DUT).
Host node exporter	Installs `prometheus-node-exporter`, binds to loopback `127.0.0.1:9100`, generates job `orchestrator-host` in `jobs.d/`.
Prometheus	Writes `/etc/prometheus/prometheus.yml` from template; validates with `promtool`. Loads jobs from *`/etc/prometheus/jobs.d/.yml`**.
Grafana	Creates Prometheus datasource (`uid: prometheus`) via provisioning. Provisions dashboards from JSON in the repo: Orchestrator Host and DUTs & gateway.
Services	Enables and starts tunnels, `prometheus`, `prometheus-node-exporter`, `grafana-server`.

Does not automate: exporter installation on the DUT (opkg) or optional Wi-Fi/hwmon collectors.

Data flow¶

flowchart LR
  DUT[DUT exporter loopback :9100]
  Proxy[labgrid-bound-connect SSH]
  Tunnel[autossh host :191XX]
  Host[host node_exporter :9100]
  Prom[Prometheus scrape localhost]
  Graf[Grafana PromQL]
  DUT -->|"-L"| Proxy
  Proxy --> Tunnel
  Tunnel --> Prom
  Host --> Prom
  Prom --> Graf

Hold "Alt" / "Option" to enable pan & zoom

Prometheus only talks to 127.0.0.1 on the host; the DUT IP on the VLAN is resolved via SSH + labgrid-bound-connect (static isolated VLAN per DUT). If the DUT VLAN changes briefly during tests, the session drops and autossh brings the forward back up.

systemd units on the host (multiple DUTs)¶

Each DUT with observability has its own tunnel unit. Prometheus and Grafana are one each. The host exposes its metrics directly (no tunnel).

flowchart LR
  Tunnels["dut-metrics-tunnel-*.service x7"]
  HostExp[node_exporter.service]
  PromSvc[prometheus.service]
  GrafSvc[grafana-server.service]
  Tunnels -->|"127.0.0.1:1910x"| PromSvc
  HostExp --> PromSvc
  PromSvc --> GrafSvc

Hold "Alt" / "Option" to enable pan & zoom

One tunnel unit per observability_duts entry. Useful commands: systemctl status dut-metrics-tunnel-<name>, journalctl -u dut-metrics-tunnel-<name> -n 30.

Web UIs¶

Service	URL	Access
Prometheus	http://127.0.0.1:9090	Host or SSH tunnel only
Grafana (local)	http://127.0.0.1:3000	Host or SSH tunnel only
Grafana (public)	https://fcefyn-testbed.duckdns.org	Internet via Oracle VPS + HTTPS

In Prometheus: Status → Targets lists each job (name = name in observability_duts). In Grafana: Prometheus datasource is provisioned by Ansible.

VPS, Certbot, and tunnel unit: grafana-public-access.md.

Grafana dashboards¶

Two dashboards:

Dashboard	Source	Description
FCEFyN Testbed - DUTs & gateway	Provisioned (JSON in repo)	DUTs + WDR3500 gateway. device variable with `label_values(up{dut!="lab-orchestrator"}, dut)`: does not include the orchestration host. All queries use `dut="$device"` and datasource `uid: prometheus`.
FCEFyN Testbed - Orchestrator Host	Provisioned (JSON in repo)	Orchestration host (~30 panels). Job `orchestrator-host`, label `dut=lab-orchestrator`.

DUTs & gateway dashboard sections¶

Section	Content
Overview	Uptime, CPU, RAM, load, disk `/`, `up`
Device info	Instant tables `node_uname_info`, `node_openwrt_info`
CPU & load	CPU by mode (stacked), load 1/5/15m
Memory	Total / available / used
Network	Traffic and packets per interface (excluding `lo`)
Disk	Usage % per mountpoint, free space
Temperature	`node_hwmon_temp_celsius`, `node_thermal_zone_temp`, CPU stats / max / ieee80211 radios
Wi-Fi	`wifi_network_*` (AP), `wifi_stations` / `wifi_station_signal_dbm` (stations, if opkg packages present)
Labels	Table of scrape labels (`firmware`, `target`, etc.) from `up{dut="$device"}`

Orchestrator Host dashboard sections¶

Section	Panels
System Overview	Uptime, CPU %, RAM %, Disk %, Swap %, Load, Processes, Open FDs
CPU	Usage stacked by mode, Load Average (1/5/15m + cores), Context Switches/Interrupts, Processes & Threads
Memory	Usage stacked (apps/buffers/cached/free), Swap
Disk / Filesystem	Usage % (bar gauge), Available Space, Inodes %
Disk I/O	Throughput (read/write), IOPS, I/O Wait Time, I/O in Progress
Network - Physical	Bandwidth (bps), Packets/s, Errors & Drops, TCP Connections
Network - VLANs	Bandwidth and packets for vlan100-108, vlan200 (collapsible)
System Internals	File Descriptors, Entropy, Sockets by Protocol, Systemd Units (active/failed), Socket Memory

For orchestration host only metrics, always use Orchestrator Host; the DUT dashboard excludes it on purpose from the device dropdown.

Deploy (Ansible)¶

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbook_testbed.yml --tags observability -K

Active DUTs are listed in ansible/roles/observability/defaults/main.yml (repo root) under observability_duts.

Adding a DUT¶

Step 1 - On the DUT (manual, once over SSH)¶

Requires Internet on the DUT (opkg feeds). See also duts-config - Internet access.

opkg update
opkg install prometheus-node-exporter-lua prometheus-node-exporter-lua-openwrt
uci set prometheus-node-exporter-lua.main.listen_interface='loopback'
uci commit prometheus-node-exporter-lua
/etc/init.d/prometheus-node-exporter-lua enable
/etc/init.d/prometheus-node-exporter-lua start

Verify: wget -qO- http://127.0.0.1:9100/metrics | head -5

The exporter listens only on loopback for security.

Optional collectors (hardware dependent):

opkg install prometheus-node-exporter-lua-hwmon prometheus-node-exporter-lua-wifi

Filesystem collector (no official package)¶

The node_filesystem_* collector is not an opkg package: upstream PR #25535 has been open since 2020. Install manually as a Lua file:

cat > /usr/lib/lua/prometheus-collectors/filesystem.lua << 'EOF'
local nix = require "nixio"

local function scrape()
  local metric_size_bytes = metric("node_filesystem_size_bytes", "gauge")
  local metric_free_bytes = metric("node_filesystem_free_bytes", "gauge")
  local metric_avail_bytes = metric("node_filesystem_avail_bytes", "gauge")
  local metric_files = metric("node_filesystem_files", "gauge")
  local metric_files_free = metric("node_filesystem_files_free", "gauge")
  local metric_readonly = metric("node_filesystem_readonly", "gauge")

  for e in io.lines("/proc/self/mounts") do
    local fields = space_split(e)
    local device, mount_point, fs_type = fields[1], fields[2], fields[3]

    if mount_point:find("/dev/?", 1) ~= 1
    and mount_point:find("/proc/?", 1) ~= 1
    and mount_point:find("/sys/?", 1) ~= 1
    and fs_type ~= "overlay" and fs_type ~= "squashfs"
    and fs_type ~= "tmpfs"   and fs_type ~= "sysfs"
    and fs_type ~= "proc"    and fs_type ~= "devtmpfs"
    and fs_type ~= "devpts"  and fs_type ~= "debugfs"
    and fs_type ~= "cgroup"  and fs_type ~= "cgroup2"
    and fs_type ~= "pstore" then
      local ok, stat = pcall(nix.fs.statvfs, mount_point)
      if ok and stat then
        local labels = { device = device, fstype = fs_type, mountpoint = mount_point }
        local ro = (nix.bit.band(stat.flag, 0x001) == 1) and 1 or 0
        metric_size_bytes(labels, stat.blocks * stat.bsize)
        metric_free_bytes(labels, stat.bfree  * stat.bsize)
        metric_avail_bytes(labels, stat.bavail * stat.bsize)
        metric_files(labels, stat.files)
        metric_files_free(labels, stat.ffree)
        metric_readonly(labels, ro)
      end
    end
  end
end

return { scrape = scrape }
EOF

After creating or editing the file, restart the service so the collector loads (without restart, wget … | grep node_filesystem is often empty):

/etc/init.d/prometheus-node-exporter-lua restart
wget -qO- http://127.0.0.1:9100/metrics | grep node_filesystem

Requires nixio (usual dependency of prometheus-node-exporter-lua; if it fails, opkg install luci-lib-nixio).

Note: Root filesystem / is overlay (filtered by design, same as standard node_exporter).

Thermal sensors by device¶

Not all SoCs expose temperature sensors in Linux. The table shows which devices report node_hwmon_temp_celsius:

DUT	CPU sensor	Wi-Fi radio sensor	Notes
openwrt-one	Yes (`thermal_thermal_zone0`)	Yes (`ieee80211_phy0`, `phy1`)	MediaTek Filogic
bananapi	Yes (`thermal_thermal_zone0`)	Yes (`ieee80211_phy0`)	MediaTek MT7988
belkin-1	No	Yes (`ieee80211_phy0`, `phy1`)	MediaTek MT7622
belkin-2	No	Yes (`ieee80211_phy0`)	MediaTek MT7622
belkin-3	No	Yes (`ieee80211_phy0`)	MediaTek MT7622
librerouter-1	No	No	IPQ4019 - not supported
gateway-wdr3500	No	No	QCA9558 (ath79) - not supported

In Grafana, temperature panels show "No data" for devices without sensors.

Step 2 - In the repo¶

Add an entry under observability_duts in ansible/roles/observability/defaults/main.yml:

observability_duts:
  - name: dut-name
    ssh_alias: dut-dut-name
    local_port: 19106
    remote_port: 9100
    labels:
      dut: dut-name
      firmware: openwrt-X.Y.Z
      target: platform-arch

Local ports (local_port) per DUT (align with duts-config when changing firmware):

`dut` (Grafana label)	Device	`local_port`	`ssh_alias`
openwrt-one	OpenWrt One	19100	`dut-openwrt-one`
belkin-1	Belkin RT3200 #1	19101	`dut-belkin-1`
belkin-2	Belkin RT3200 #2	19102	`dut-belkin-2`
belkin-3	Belkin RT3200 #3	19103	`dut-belkin-3`
bananapi	Banana Pi R4	19104	`dut-bananapi`
librerouter-1	Librerouter 1	19105	`dut-librerouter-1`

Until step 1 is done, the Prometheus target stays DOWN for that name; the tunnel may restart in a loop if the DUT is off.

Step 3 - Apply¶

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbook_testbed.yml --tags observability -K

Verification¶

systemctl status dut-metrics-tunnel-<name>
curl -sS http://127.0.0.1:<local_port>/metrics | head -5
promtool check config /etc/prometheus/prometheus.yml

In Prometheus: Status → Targets - all DUTs plus orchestrator-host should appear. In Grafana: FCEFyN Testbed - DUTs & gateway and FCEFyN Testbed - Orchestrator Host (both provisioned by Ansible).

# Host node exporter
curl -sS http://127.0.0.1:9100/metrics | head -5
systemctl status prometheus-node-exporter

Key files¶

Paths are relative to the repository root.

Path	Description
`ansible/roles/observability/defaults/main.yml`	`observability_duts`, `orchestrator_node_exporter`, `grafana_public_tunnel`, `grafana_config`
`ansible/roles/observability/templates/dut-metrics-tunnel.service.j2`	autossh unit per DUT
`ansible/roles/observability/templates/dut-scrape-job.yml.j2`	Scrape fragment per DUT
`ansible/roles/observability/templates/orchestrator-scrape-job.yml.j2`	Host scrape fragment
`ansible/roles/observability/templates/prometheus.yml.j2`	Main `prometheus.yml`
`ansible/roles/observability/templates/grafana-dashboards-provider.yml.j2`	File-based dashboard provider in Grafana
`ansible/roles/observability/files/dashboards/orchestrator-node.json`	Orchestrator host dashboard JSON
`ansible/roles/observability/files/dashboards/duts-node.json`	DUTs + gateway dashboard JSON (variable excludes `lab-orchestrator`)