Observability - DUT and host metrics¶
The lab host runs the metrics stack. Each DUT needs only the OpenWrt exporter (prometheus-node-exporter-lua); the orchestration host exposes its own metrics via prometheus-node-exporter. The Ansible observability role automates the host side, which has two parts:
- Public HTTPS Grafana (Oracle VM, Nginx, reverse SSH tunnel): see grafana-public-access.md.
- Prometheus scraping and local Grafana on the lab host: covered on this page.
What Ansible automates¶
| Action | Detail |
|---|---|
| Packages | Installs autossh, prometheus, grafana (official Grafana repo), prometheus-node-exporter (host). |
| DUT tunnels | For each observability_duts row: unit dut-metrics-tunnel-<name>.service (autossh, forward 127.0.0.1:<local_port> to 127.0.0.1:9100 on the DUT). |
| Host node exporter | Installs prometheus-node-exporter, binds to loopback 127.0.0.1:9100, generates job orchestrator-host in jobs.d/. |
| Prometheus | Writes /etc/prometheus/prometheus.yml from template; validates with promtool. Loads jobs from /etc/prometheus/jobs.d/*.yml. |
| Grafana | Creates Prometheus datasource (uid: prometheus) via provisioning. Provisions dashboards from JSON in the repo: Orchestrator Host and DUTs & gateway. |
| Services | Enables and starts tunnels, prometheus, prometheus-node-exporter, grafana-server. |
Not automated: exporter installation on the DUT (opkg) and the optional Wi-Fi/hwmon collectors.
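To make the per-DUT scrape fragments more concrete, here is a minimal sketch of what one file in jobs.d/ might contain. The helper name `render_dut_job` is hypothetical, and the layout assumes the main prometheus.yml pulls fragments in through Prometheus's `scrape_config_files` option; the actual dut-scrape-job.yml.j2 template may differ.

```shell
#!/bin/sh
# Hypothetical helper: print a Prometheus scrape fragment for one DUT.
# Assumption: fragments in jobs.d/ are loaded via `scrape_config_files`,
# so each file carries its own top-level `scrape_configs` list.
render_dut_job() {
    name=$1        # observability_duts[].name, reused as the job name
    local_port=$2  # host side of the autossh forward
    firmware=$3    # value for the `firmware` label
    target=$4      # value for the `target` label
    cat <<EOF
scrape_configs:
  - job_name: ${name}
    static_configs:
      - targets: ["127.0.0.1:${local_port}"]
        labels:
          dut: ${name}
          firmware: ${firmware}
          target: ${target}
EOF
}

render_dut_job belkin-1 19101 openwrt-X.Y.Z platform-arch
```

The job name and labels mirror the fields of an observability_duts entry, which is why Status → Targets shows one job per DUT name.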
Data flow¶
flowchart LR
DUT[DUT exporter loopback :9100]
Proxy[labgrid-bound-connect SSH]
Tunnel[autossh host :191XX]
Host[host node_exporter :9100]
Prom[Prometheus scrape localhost]
Graf[Grafana PromQL]
DUT -->|"-L"| Proxy
Proxy --> Tunnel
Tunnel --> Prom
Host --> Prom
Prom --> Graf
Prometheus only ever talks to 127.0.0.1 on the host. The DUT's address on its VLAN is reached over SSH through labgrid-bound-connect (each DUT sits on a static, isolated VLAN). If the DUT's VLAN changes briefly during tests, the SSH session drops and autossh re-establishes the forward.
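The forward itself can be pictured as a single autossh invocation. The sketch below only prints such a command; the function name `tunnel_cmd` and the keep-alive values are assumptions, and the real unit generated from dut-metrics-tunnel.service.j2 may use different options. It also assumes the SSH alias already routes through labgrid-bound-connect in the SSH configuration.

```shell
#!/bin/sh
# Sketch: print the autossh command a tunnel unit might run for one DUT.
# -M 0 disables autossh's own monitor port and relies on SSH keep-alives;
# -N opens the forward without running a remote command.
tunnel_cmd() {
    ssh_alias=$1
    local_port=$2
    remote_port=${3:-9100}   # exporter port on the DUT loopback
    printf 'autossh -M 0 -N %s %s %s -L 127.0.0.1:%s:127.0.0.1:%s %s\n' \
        '-o ServerAliveInterval=30' \
        '-o ServerAliveCountMax=3' \
        '-o ExitOnForwardFailure=yes' \
        "$local_port" "$remote_port" "$ssh_alias"
}

tunnel_cmd dut-belkin-1 19101
```

With keep-alives enabled, ssh exits when the DUT stops answering, and autossh then starts a fresh session, which matches the reconnect behaviour described above.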
systemd units on the host (multiple DUTs)¶
Each DUT with observability enabled has its own tunnel unit. Prometheus and Grafana each run as a single service. The host exposes its metrics directly (no tunnel).
flowchart LR
Tunnels["dut-metrics-tunnel-*.service x7"]
HostExp[prometheus-node-exporter.service]
PromSvc[prometheus.service]
GrafSvc[grafana-server.service]
Tunnels -->|"127.0.0.1:1910x"| PromSvc
HostExp --> PromSvc
PromSvc --> GrafSvc
One tunnel unit per observability_duts entry. Useful commands: systemctl status dut-metrics-tunnel-<name>, journalctl -u dut-metrics-tunnel-<name> -n 30.
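When several DUTs are configured, checking the units one by one gets tedious. A small loop such as the following sketch reports them all at once; the DUT list is copied from the port table on this page and the function name `tunnel_states` is hypothetical.

```shell
#!/bin/sh
# Sketch: report the state of every DUT tunnel unit in one pass.
# Adjust DUTS to match your observability_duts entries.
DUTS="openwrt-one belkin-1 belkin-2 belkin-3 bananapi librerouter-1"

tunnel_states() {
    for name in $DUTS; do
        # `systemctl is-active` prints active/inactive/failed and returns
        # non-zero unless the unit is active, hence the `|| true`.
        state=$(systemctl is-active "dut-metrics-tunnel-${name}.service" 2>/dev/null) || true
        printf '%-16s %s\n' "$name" "${state:-unknown}"
    done
}
```

Run `tunnel_states` on the lab host; any line that does not end in `active` points at a unit worth inspecting with journalctl.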
Web UIs¶
| Service | URL | Access |
|---|---|---|
| Prometheus | http://127.0.0.1:9090 | Host or SSH tunnel only |
| Grafana (local) | http://127.0.0.1:3000 | Host or SSH tunnel only |
| Grafana (public) | https://fcefyn-testbed.duckdns.org | Internet via Oracle VPS + HTTPS |
In Prometheus: Status → Targets lists each job (name = name in observability_duts). In Grafana: Prometheus datasource is provisioned by Ansible.
VPS, Certbot, and tunnel unit: grafana-public-access.md.
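Because Prometheus and the local Grafana listen only on loopback, reaching them from a workstation needs an SSH local forward. The sketch below prints one such command; `lab-host` is a placeholder SSH alias, not a name taken from this repository.

```shell
#!/bin/sh
# Sketch: forward Prometheus (9090) and local Grafana (3000) to a workstation.
# `lab-host` is a placeholder; pass your own SSH alias as the first argument.
ui_tunnel_cmd() {
    host=${1:-lab-host}
    printf 'ssh -N -L 9090:127.0.0.1:9090 -L 3000:127.0.0.1:3000 %s\n' "$host"
}

ui_tunnel_cmd
```

While the printed command runs on the workstation, the URLs in the table above work unchanged in a local browser.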
Grafana dashboards¶
Two dashboards:
| Dashboard | Source | Description |
|---|---|---|
| FCEFyN Testbed - DUTs & gateway | Provisioned (JSON in repo) | DUTs + WDR3500 gateway. The device variable is populated by label_values(up{dut!="lab-orchestrator"}, dut), so the orchestration host is excluded. All queries filter on dut="$device" and use datasource uid prometheus. |
| FCEFyN Testbed - Orchestrator Host | Provisioned (JSON in repo) | Orchestration host (~30 panels). Job orchestrator-host, label dut=lab-orchestrator. |
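To see which values the device dropdown would offer without opening Grafana, the same selector can be sent to the Prometheus HTTP API. This is a sketch under the assumption that Prometheus listens on 127.0.0.1:9090 as listed above; the function names and the PROM_URL variable are hypothetical.

```shell
#!/bin/sh
# Sketch: ask Prometheus for the values of the `dut` label, using the same
# selector as the dashboard variable.
PROM_URL=${PROM_URL:-http://127.0.0.1:9090}

# Series selector behind the `device` variable: every `up` series except the host.
device_selector() {
    printf '%s\n' 'up{dut!="lab-orchestrator"}'
}

# GET /api/v1/label/<name>/values accepts match[] selectors; -G sends the
# URL-encoded data as a query string instead of a POST body.
list_devices() {
    curl -sS -G "${PROM_URL}/api/v1/label/dut/values" \
        --data-urlencode "match[]=$(device_selector)"
}
```

Calling `list_devices` on the lab host returns a JSON document whose `data` array should contain the DUT names but not lab-orchestrator.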
DUTs & gateway dashboard sections¶
| Section | Content |
|---|---|
| Overview | Uptime, CPU, RAM, load, disk /, up |
| Device info | Instant tables node_uname_info, node_openwrt_info |
| CPU & load | CPU by mode (stacked), load 1/5/15m |
| Memory | Total / available / used |
| Network | Traffic and packets per interface (excluding lo) |
| Disk | Usage % per mountpoint, free space |
| Temperature | node_hwmon_temp_celsius, node_thermal_zone_temp, CPU stats / max / ieee80211 radios |
| Wi-Fi | wifi_network_* (AP), wifi_stations / wifi_station_signal_dbm (stations, if opkg packages present) |
| Labels | Table of scrape labels (firmware, target, etc.) from up{dut="$device"} |
Orchestrator Host dashboard sections¶
| Section | Panels |
|---|---|
| System Overview | Uptime, CPU %, RAM %, Disk %, Swap %, Load, Processes, Open FDs |
| CPU | Usage stacked by mode, Load Average (1/5/15m + cores), Context Switches/Interrupts, Processes & Threads |
| Memory | Usage stacked (apps/buffers/cached/free), Swap |
| Disk / Filesystem | Usage % (bar gauge), Available Space, Inodes % |
| Disk I/O | Throughput (read/write), IOPS, I/O Wait Time, I/O in Progress |
| Network - Physical | Bandwidth (bps), Packets/s, Errors & Drops, TCP Connections |
| Network - VLANs | Bandwidth and packets for vlan100-108, vlan200 (collapsible) |
| System Internals | File Descriptors, Entropy, Sockets by Protocol, Systemd Units (active/failed), Socket Memory |
For metrics that concern only the orchestration host, use the Orchestrator Host dashboard; the DUT dashboard deliberately omits the host from its device dropdown.
Deploy (Ansible)¶
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbook_testbed.yml --tags observability -K
Active DUTs are listed under observability_duts in ansible/roles/observability/defaults/main.yml (path relative to the repo root).
Adding a DUT¶
Step 1 - On the DUT (manual, once over SSH)¶
Requires Internet on the DUT (opkg feeds). See also duts-config - Internet access.
opkg update
opkg install prometheus-node-exporter-lua prometheus-node-exporter-lua-openwrt
uci set prometheus-node-exporter-lua.main.listen_interface='loopback'
uci commit prometheus-node-exporter-lua
/etc/init.d/prometheus-node-exporter-lua enable
/etc/init.d/prometheus-node-exporter-lua start
Verify: wget -qO- http://127.0.0.1:9100/metrics | head -5
The exporter listens only on loopback for security.
Optional collectors (hardware dependent):
opkg install prometheus-node-exporter-lua-hwmon prometheus-node-exporter-lua-wifi
Filesystem collector (no official package)¶
The node_filesystem_* collector is not an opkg package: upstream PR #25535 has been open since 2020. Install manually as a Lua file:
cat > /usr/lib/lua/prometheus-collectors/filesystem.lua << 'EOF'
local nix = require "nixio"

local function scrape()
  local metric_size_bytes = metric("node_filesystem_size_bytes", "gauge")
  local metric_free_bytes = metric("node_filesystem_free_bytes", "gauge")
  local metric_avail_bytes = metric("node_filesystem_avail_bytes", "gauge")
  local metric_files = metric("node_filesystem_files", "gauge")
  local metric_files_free = metric("node_filesystem_files_free", "gauge")
  local metric_readonly = metric("node_filesystem_readonly", "gauge")

  for e in io.lines("/proc/self/mounts") do
    local fields = space_split(e)
    local device, mount_point, fs_type = fields[1], fields[2], fields[3]
    if mount_point:find("/dev/?", 1) ~= 1
      and mount_point:find("/proc/?", 1) ~= 1
      and mount_point:find("/sys/?", 1) ~= 1
      and fs_type ~= "overlay" and fs_type ~= "squashfs"
      and fs_type ~= "tmpfs" and fs_type ~= "sysfs"
      and fs_type ~= "proc" and fs_type ~= "devtmpfs"
      and fs_type ~= "devpts" and fs_type ~= "debugfs"
      and fs_type ~= "cgroup" and fs_type ~= "cgroup2"
      and fs_type ~= "pstore" then
      local ok, stat = pcall(nix.fs.statvfs, mount_point)
      if ok and stat then
        local labels = { device = device, fstype = fs_type, mountpoint = mount_point }
        local ro = (nix.bit.band(stat.flag, 0x001) == 1) and 1 or 0
        metric_size_bytes(labels, stat.blocks * stat.bsize)
        metric_free_bytes(labels, stat.bfree * stat.bsize)
        metric_avail_bytes(labels, stat.bavail * stat.bsize)
        metric_files(labels, stat.files)
        metric_files_free(labels, stat.ffree)
        metric_readonly(labels, ro)
      end
    end
  end
end

return { scrape = scrape }
EOF
After creating or editing the file, restart the service so the collector loads (without restart, wget … | grep node_filesystem is often empty):
/etc/init.d/prometheus-node-exporter-lua restart
wget -qO- http://127.0.0.1:9100/metrics | grep node_filesystem
Requires nixio (usual dependency of prometheus-node-exporter-lua; if it fails, opkg install luci-lib-nixio).
Note: The root filesystem / is an overlay mount, so the collector filters it out by design (the standard node_exporter excludes overlay as well).
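To confirm which node_filesystem_* metric families the collector actually exports, the exposition text can be reduced to a list of distinct metric names. The following is a minimal sketch using only tools available in BusyBox; the function name `metric_names` is hypothetical.

```shell
#!/bin/sh
# Sketch: list distinct metric names with a given prefix from exposition text on stdin.
metric_names() {
    prefix=$1
    # Drop "# HELP" / "# TYPE" comment lines, cut each sample at the first
    # "{" or space to keep only the metric name, then filter and de-duplicate.
    grep -v '^#' | sed 's/[{ ].*//' | grep "^${prefix}" | sort -u
}
```

On the DUT, `wget -qO- http://127.0.0.1:9100/metrics | metric_names node_filesystem` should print the six gauge names defined in the Lua file once at least one non-filtered filesystem is mounted.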
Thermal sensors by device¶
Not all SoCs expose temperature sensors in Linux. The table shows which devices report node_hwmon_temp_celsius:
| DUT | CPU sensor | Wi-Fi radio sensor | Notes |
|---|---|---|---|
| openwrt-one | Yes (thermal_thermal_zone0) | Yes (ieee80211_phy0, phy1) | MediaTek Filogic |
| bananapi | Yes (thermal_thermal_zone0) | Yes (ieee80211_phy0) | MediaTek MT7988 |
| belkin-1 | No | Yes (ieee80211_phy0, phy1) | MediaTek MT7622 |
| belkin-2 | No | Yes (ieee80211_phy0) | MediaTek MT7622 |
| belkin-3 | No | Yes (ieee80211_phy0) | MediaTek MT7622 |
| librerouter-1 | No | No | IPQ4019 - not supported |
| gateway-wdr3500 | No | No | QCA9558 (ath79) - not supported |
In Grafana, temperature panels show "No data" for devices without sensors.
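Before expecting temperature data from a new device, it can help to look at what the kernel exposes under sysfs. The sketch below walks the hwmon class directory and prints each chip name with its first temperature input; the function name `list_hwmon_temps` is hypothetical, and the names printed are whatever the kernel reports, which may not match the exporter's labels exactly.

```shell
#!/bin/sh
# Sketch: print each hwmon chip and its first temperature reading in whole degrees Celsius.
# The optional argument overrides the sysfs root (useful for testing).
list_hwmon_temps() {
    root=${1:-/sys/class/hwmon}
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        chip=$(cat "$dir/name")
        if [ -r "$dir/temp1_input" ]; then
            # The hwmon sysfs interface reports temperatures in millidegrees Celsius.
            milli=$(cat "$dir/temp1_input")
            printf '%s %s\n' "$chip" "$((milli / 1000))"
        else
            printf '%s no-temp\n' "$chip"
        fi
    done
}
```

An empty result on the DUT suggests the SoC exposes no hwmon devices at all, in which case the Grafana temperature panels will stay empty regardless of which collectors are installed.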
Step 2 - In the repo¶
Add an entry under observability_duts in ansible/roles/observability/defaults/main.yml:
observability_duts:
- name: dut-name
ssh_alias: dut-dut-name
local_port: 19106
remote_port: 9100
labels:
dut: dut-name
firmware: openwrt-X.Y.Z
target: platform-arch
Local ports (local_port) per DUT (align with duts-config when changing firmware):
| dut (Grafana label) | Device | local_port | ssh_alias |
|---|---|---|---|
| openwrt-one | OpenWrt One | 19100 | dut-openwrt-one |
| belkin-1 | Belkin RT3200 #1 | 19101 | dut-belkin-1 |
| belkin-2 | Belkin RT3200 #2 | 19102 | dut-belkin-2 |
| belkin-3 | Belkin RT3200 #3 | 19103 | dut-belkin-3 |
| bananapi | Banana Pi R4 | 19104 | dut-bananapi |
| librerouter-1 | Librerouter 1 | 19105 | dut-librerouter-1 |
Until step 1 is done, the Prometheus target stays DOWN for that name; the tunnel may restart in a loop if the DUT is off.
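Since every DUT needs its own local_port, a quick check for collisions is useful before applying the role. The sketch below works on plain "name port" pairs fed on stdin; both function names are hypothetical, and the 19100 starting point is taken from the table above.

```shell
#!/bin/sh
# Sketch: sanity checks for local_port assignments, reading "name port" pairs on stdin.

# Print every port that is assigned to more than one DUT.
duplicate_ports() {
    awk '{ seen[$2]++ } END { for (p in seen) if (seen[p] > 1) print p }' | sort -n
}

# Print the port following the highest one in use (19100 if the list is empty).
next_free_port() {
    awk 'BEGIN { max = 19099 } $2 > max { max = $2 } END { print max + 1 }'
}
```

For example, `printf '%s\n' 'openwrt-one 19100' 'belkin-1 19101' | next_free_port` suggests a port for the next entry; remember to include any tunnels that are not listed in the table.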
Step 3 - Apply¶
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbook_testbed.yml --tags observability -K
Verification¶
systemctl status dut-metrics-tunnel-<name>
curl -sS http://127.0.0.1:<local_port>/metrics | head -5
promtool check config /etc/prometheus/prometheus.yml
In Prometheus: Status → Targets - all DUTs plus orchestrator-host should appear. In Grafana: FCEFyN Testbed - DUTs & gateway and FCEFyN Testbed - Orchestrator Host (both provisioned by Ansible).
# Host node exporter
curl -sS http://127.0.0.1:9100/metrics | head -5
systemctl status prometheus-node-exporter
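The individual checks above can be combined into one pass over all exporter endpoints. This is a sketch, with the port list copied from the table on this page plus the host exporter on 9100; the function name `probe_ports` is hypothetical.

```shell
#!/bin/sh
# Sketch: probe the host exporter and each forwarded DUT port, reporting UP or DOWN.
PORTS="9100 19100 19101 19102 19103 19104 19105"

probe_ports() {
    for port in $PORTS; do
        # --max-time keeps a dead tunnel from hanging the loop;
        # -f makes curl return non-zero on HTTP error responses.
        if curl -fsS --max-time 3 "http://127.0.0.1:${port}/metrics" >/dev/null 2>&1; then
            echo "${port} UP"
        else
            echo "${port} DOWN"
        fi
    done
}
```

A port reported as DOWN here should correspond to a DOWN target in Prometheus; the matching dut-metrics-tunnel unit and the exporter on the DUT are the first places to look.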
Key files¶
Paths are relative to the repository root.
| Path | Description |
|---|---|
| ansible/roles/observability/defaults/main.yml | observability_duts, orchestrator_node_exporter, grafana_public_tunnel, grafana_config |
| ansible/roles/observability/templates/dut-metrics-tunnel.service.j2 | autossh unit per DUT |
| ansible/roles/observability/templates/dut-scrape-job.yml.j2 | Scrape fragment per DUT |
| ansible/roles/observability/templates/orchestrator-scrape-job.yml.j2 | Host scrape fragment |
| ansible/roles/observability/templates/prometheus.yml.j2 | Main prometheus.yml |
| ansible/roles/observability/templates/grafana-dashboards-provider.yml.j2 | File-based dashboard provider in Grafana |
| ansible/roles/observability/files/dashboards/orchestrator-node.json | Orchestrator host dashboard JSON |
| ansible/roles/observability/files/dashboards/duts-node.json | DUTs + gateway dashboard JSON (variable excludes lab-orchestrator) |