# Metrics Backlog Implementation Plan

## Context

The Grafana admin dashboard at stats.irc.now has two problems: (1) existing stat panels show duplicate series because bare metric names aren't aggregated across Prometheus scrape targets, and (2) the metrics backlog in todo.txt lists 9 ideas, 8 of which are feasible now. This plan fixes the dashboard bug and implements all 8 metrics.

## Part 0: Fix Dashboard Duplicate Series Bug

**Problem**: Every stat panel uses bare metric names (e.g., `irc_now_users_total`) without aggregation. When Prometheus scrapes the same gauge from multiple targets, each target produces a separate time series, so stat panels show repeated values (visible in the screenshot: "Total 4 Total 4 Total 4...").

**Fix**: Wrap all stat/gauge panel expressions in `max()` (appropriate for gauges set to the same absolute value). Timeseries panels also need `max()` to avoid duplicate legend entries.

**File**: `deploy/monitoring/grafana-dashboard.yaml`

Panels to fix (all stat panels + timeseries with bare metrics):

- Panel 1: `irc_now_users_total` -> `max(irc_now_users_total)`
- Panel 2: `irc_now_active_users_24h` -> `max(irc_now_active_users_24h)`
- Panel 3: `increase(irc_now_logins_total[24h])` -> `sum(increase(irc_now_logins_total[24h]))`
- Panel 4: `irc_now_users_by_plan` -> `max(irc_now_users_by_plan) by (plan)`
- Panel 5: `irc_now_pastes_total` -> `max(irc_now_pastes_total)`
- Panel 6: `irc_now_images_total` -> `max(irc_now_images_total)`
- Panel 7: `irc_now_bots_total` -> `max(irc_now_bots_total)`
- Panel 8: `irc_now_bots_running` -> `max(irc_now_bots_running)`
- Panel 9: both targets -> `max(irc_now_users_total)`, `max(irc_now_active_users_24h)`
- Panel 10: all 3 targets -> `max(...)` each
- Panel 11: both targets -> `max(...)` each
- Panel 12: all 3 targets -> `max(...)` each
- Panel 15: `irc_now_mrr_cents / 100` -> `max(irc_now_mrr_cents) / 100`
- Panel 16: `irc_now_subscriptions_active` -> `max(irc_now_subscriptions_active)`
- Panel 17: `irc_now_signups_7d` -> `max(irc_now_signups_7d)`
- Panel 18: `irc_now_conversion_rate_30d * 100` -> `max(irc_now_conversion_rate_30d) * 100`
- Panel 19: `irc_now_churn_rate_30d * 100` -> `max(irc_now_churn_rate_30d) * 100`
- Panel 20: both targets -> `max(...)` each

Also fix `deploy/monitoring/grafana-dashboard-public.yaml` (same pattern).

## Part 1: Per-Network Activity (Metric #1)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_soju_metrics()`

Add a per-network delivery receipt count query after the existing Channel count (line 189). `"DeliveryReceipt"` has a `network` FK (confirmed in `migrate.rs:287`) and is reliably present on all bouncers.

```sql
SELECT COALESCE(n.name, n.addr) AS label, COUNT(dr.*)
FROM "Network" n
LEFT JOIN "DeliveryReceipt" dr ON dr.network = n.id
GROUP BY COALESCE(n.name, n.addr)
```

Metric: `irc_now_bouncer_deliveries_by_network{network="..."}` gauge

Cap at 50 networks per bouncer to prevent label explosion (skip per-network breakdown if exceeded, just log).

## Part 2: Connected Users 5-Minute Window (Metric #2)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_soju_metrics()`

Add accumulator `total_connected_5m: i64 = 0` alongside `total_active` (line 138).
Add query after the existing 24h query (line 211):

```sql
SELECT COUNT(*) FROM "User"
WHERE downstream_interacted_at > NOW() - INTERVAL '5 minutes'
```

Metric: `irc_now_bouncer_connected_users_5m` gauge (emitted after the loop with the others)

## Part 3: Onboarding Funnel (Metric #3)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after line 117

Three gauges, all 30-day rolling windows from the accounts DB events table:

**Stage 1 -- Signups (30d)**:

```sql
SELECT COUNT(*) FROM events
WHERE event_type = 'signup'
  AND created_at > NOW() - INTERVAL '30 days'
```

Metric: `irc_now_funnel_signups_30d`

**Stage 2 -- Bouncer Created (30d)**: users who signed up in last 30d AND created a bouncer after signup:

```sql
SELECT COUNT(DISTINCT s.user_sub)
FROM events s
JOIN events bc ON s.user_sub = bc.user_sub
  AND bc.event_type = 'bouncer_create'
WHERE s.event_type = 'signup'
  AND s.created_at > NOW() - INTERVAL '30 days'
  AND bc.created_at >= s.created_at
```

Metric: `irc_now_funnel_bouncer_created_30d`

**Stage 3 -- Returned (30d)**: users who signed up in last 30d AND logged in again >1 day after signup:

```sql
SELECT COUNT(DISTINCT s.user_sub)
FROM events s
WHERE s.event_type = 'signup'
  AND s.created_at > NOW() - INTERVAL '30 days'
  AND EXISTS (
    SELECT 1 FROM events e2
    WHERE e2.user_sub = s.user_sub
      AND e2.event_type = 'login'
      AND e2.created_at > s.created_at + INTERVAL '1 day'
  )
```

Metric: `irc_now_funnel_returned_30d`

Pattern: `sqlx::query_scalar::<_, i64>(...)` matching existing code style.
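The three funnel stages differ in subtle ways (Stage 1 counts signup events, Stages 2 and 3 count distinct users, and Stage 3 requires a login strictly more than one day after signup). The sketch below restates those rules over an in-memory slice so they can be unit-tested without a database. It is only an illustration: the `Event` struct, the `funnel_counts` function, and the integer-seconds timestamps are hypothetical stand-ins, not part of the codebase, and the real implementation would run the SQL above through `sqlx`.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for one row of the `events` table.
#[derive(Debug, Clone)]
struct Event {
    user_sub: &'static str,
    event_type: &'static str,
    /// Seconds since an arbitrary epoch; stands in for `created_at`.
    created_at: i64,
}

const DAY: i64 = 86_400;

/// Returns (signups, bouncer_created, returned) for the 30-day window ending at `now`.
fn funnel_counts(events: &[Event], now: i64) -> (usize, usize, usize) {
    let window_start = now - 30 * DAY;

    // Stage 1: every signup event inside the window (COUNT(*), not distinct).
    let signups: Vec<&Event> = events
        .iter()
        .filter(|e| e.event_type == "signup" && e.created_at > window_start)
        .collect();

    // Stage 2: distinct users with a bouncer_create at or after their signup.
    let bouncer_created: HashSet<&str> = signups
        .iter()
        .filter(|s| {
            events.iter().any(|e| {
                e.user_sub == s.user_sub
                    && e.event_type == "bouncer_create"
                    && e.created_at >= s.created_at
            })
        })
        .map(|s| s.user_sub)
        .collect();

    // Stage 3: distinct users with a login more than one day after signup.
    let returned: HashSet<&str> = signups
        .iter()
        .filter(|s| {
            events.iter().any(|e| {
                e.user_sub == s.user_sub
                    && e.event_type == "login"
                    && e.created_at > s.created_at + DAY
            })
        })
        .map(|s| s.user_sub)
        .collect();

    (signups.len(), bouncer_created.len(), returned.len())
}
```

A login within the first day after signup does not count toward Stage 3, and a signup older than 30 days drops out of all three stages, which matches the `WHERE` clauses in the queries.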
## Part 4: Time-to-First-Bouncer (Metric #4)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after funnel queries

```sql
SELECT AVG(EXTRACT(EPOCH FROM (bc.created_at - s.created_at)))
FROM events s
JOIN events bc ON s.user_sub = bc.user_sub
  AND bc.event_type = 'bouncer_create'
WHERE s.event_type = 'signup'
  AND s.created_at > NOW() - INTERVAL '30 days'
  AND bc.created_at >= s.created_at
```

Return type: `Option<f64>` (AVG returns NULL if no rows). Set gauge to 0.0 on None.

Metric: `irc_now_time_to_first_bouncer_seconds` gauge

## Part 5: Feature Adoption (Metric #5)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after the time-to-first-bouncer query

Three gauges (raw counts, percentages computed in Grafana):

```sql
SELECT COUNT(*) FROM users WHERE plan = 'pro' AND content_expires = false
```

Metric: `irc_now_adoption_permanent_content`

```sql
SELECT COUNT(DISTINCT user_sub) FROM events WHERE event_type = 'bouncer_create'
```

Metric: `irc_now_adoption_bouncer_users`

```sql
SELECT COUNT(DISTINCT user_sub) FROM events WHERE event_type = 'network_create'
```

Metric: `irc_now_adoption_network_users`

## Part 6: Error Rates by Service (Metric #6)

**No Rust changes.** Pure PromQL in dashboard.

New panel in `deploy/monitoring/grafana-dashboard.yaml`:

- `sum(rate(http_requests_total{namespace="irc-josie-cloud", status=~"4.."}[5m])) by (job)` -- 4xx
- `sum(rate(http_requests_total{namespace="irc-josie-cloud", status=~"5.."}[5m])) by (job)` -- 5xx

Timeseries panel, unit: reqps.

## Part 7: Storage Growth Prediction (Metric #7)

**No Rust changes.** Pure PromQL in dashboard.

Two stat panels in `deploy/monitoring/grafana-dashboard.yaml`:

- `predict_linear(max(irc_now_pastes_storage_bytes)[7d:1h], 30*86400)` -- pastes in 30d
- `predict_linear(max(irc_now_images_storage_bytes)[7d:1h], 30*86400)` -- images in 30d

Unit: bytes. Shows "N/A" until the subquery has at least two samples; treat the projection as rough until 7d of data exists.

## Part 8: Bot Execution Metrics (Metric #8)

### 8a. Instrument dispatch (hot path)

**File**: `crates/bot/src/lua/dispatch.rs`

Add `use std::time::Instant;` and `use metrics::{counter, histogram};`.

Wrap the match block with timing:

1. Capture `let start = Instant::now();` before the match at line 22
2. After the match, record duration and increment counter
3. On error, also increment error counter

Metrics (label: `handler` -- bounded to 6 values: on_message, on_join, on_part, on_kick, on_nick, on_notice):

- `irc_now_bot_script_runs_total` counter
- `irc_now_bot_script_duration_seconds` histogram
- `irc_now_bot_script_errors_total` counter

### 8b. Script load errors

**File**: `crates/bot/src/manager.rs` -- line 174

Add `metrics::counter!("irc_now_bot_script_load_errors_total").increment(1);` inside the `if let Err(e)` block.

### 8c. Background gauges

**File**: `crates/bot/src/business_metrics.rs`

Add two queries to the existing loop:

```sql
SELECT COUNT(*) FROM bot_logs
WHERE level = 'error' AND created_at > NOW() - INTERVAL '24 hours'
```

Metric: `irc_now_bot_errors_24h` gauge

```sql
SELECT COUNT(*) FROM bot_scripts WHERE enabled = true
```

Metric: `irc_now_bot_scripts_enabled` gauge

## Part 9: Dashboard Panels for New Metrics

**File**: `deploy/monitoring/grafana-dashboard.yaml`

Add 10 new panels (IDs 21-30) to the admin dashboard:

| ID | Title | Type | Key Expression |
|----|-------|------|----------------|
| 21 | HTTP Errors (4xx vs 5xx) | timeseries | the 4xx and 5xx `sum(rate(http_requests_total{...}[5m])) by (job)` queries from Part 6 |
| 22 | Pastes Storage in 30d | stat | `predict_linear(...)` |
| 23 | Images Storage in 30d | stat | `predict_linear(...)` |
| 24 | Onboarding Funnel (30d) | bargauge | signups, bouncer_created, returned |
| 25 | Time to First Bouncer | stat | `max(irc_now_time_to_first_bouncer_seconds)` |
| 26 | Connected Users (5m) | stat | `max(irc_now_bouncer_connected_users_5m)` |
| 27 | Deliveries by Network | timeseries | `max(irc_now_bouncer_deliveries_by_network) by (network)` |
| 28 | Feature Adoption | bargauge | permanent_content, bouncer_users, network_users |
| 29 | Bot Script Runs | timeseries | `rate(irc_now_bot_script_runs_total[5m])` |
| 30 | Bot Script Latency p95 | timeseries | `histogram_quantile(0.95, ...)` |

All new gauge-based panels use `max()` aggregation from the start.

## Implementation Order

1. **Part 0**: Fix dashboard duplicate series bug (quick, unblocks correct visualization)
2. **Parts 1-5**: Rust changes in `web-api/src/business_metrics.rs` (one file, one build)
3. **Part 8**: Rust changes in `bot` crate (3 files, one build)
4. **Parts 6-7, 9**: Dashboard panel additions (one `oc apply`)
5. Build and deploy

## Files to Modify

| File | Changes |
|------|---------|
| `crates/web-api/src/business_metrics.rs` | Parts 1-5: ~60 lines added to existing functions |
| `crates/bot/src/lua/dispatch.rs` | Part 8a: timing + counters around match block |
| `crates/bot/src/manager.rs` | Part 8b: 1 line added (script load error counter) |
| `crates/bot/src/business_metrics.rs` | Part 8c: 2 query blocks added (~14 lines) |
| `deploy/monitoring/grafana-dashboard.yaml` | Parts 0, 6, 7, 9: fix existing + add 10 panels |
| `deploy/monitoring/grafana-dashboard-public.yaml` | Part 0: fix duplicate series |

## Verification

1. `cargo check -p irc-now-web-api -p irc-now-bot` -- compiles
2. `cargo test -p irc-now-web-api -p irc-now-bot` -- existing tests pass
3. Build + deploy web-api and bot via `oc start-build`
4. `oc apply -f deploy/monitoring/grafana-dashboard.yaml` + public dashboard
5. Check stats.irc.now admin dashboard:
   - Stat panels show single values (not duplicated)
   - New panels appear and either show data or "N/A" / "No data" (acceptable for metrics that need time to accumulate)
6. Check `/metrics` endpoint on web-api pod for new gauge names
7. Check `/metrics` endpoint on bot pod for counter/histogram names
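
Verification steps 6 and 7 can be partly automated. The sketch below takes the text body of a `/metrics` response and reports which expected metric names are absent. It is a minimal illustration under stated assumptions: `missing_metrics` is a hypothetical helper, not existing code, it assumes the standard Prometheus text exposition format, and it treats a histogram as present when any of its `_bucket`, `_sum`, or `_count` series appears.

```rust
/// Hypothetical helper: returns the expected metric names that do not appear
/// in a Prometheus text exposition (the body of a `/metrics` response).
fn missing_metrics<'a>(exposition: &str, expected: &[&'a str]) -> Vec<&'a str> {
    // Collect the metric name of every sample line, skipping blank lines
    // and `# HELP` / `# TYPE` comments.
    let present: Vec<&str> = exposition
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split(|c: char| c == '{' || c.is_whitespace()).next())
        .collect();

    let has = |name: &str| present.iter().any(|p| *p == name);

    expected
        .iter()
        .copied()
        .filter(|name| {
            // Histograms are exposed as name_bucket / name_sum / name_count.
            let found = has(*name)
                || ["_bucket", "_sum", "_count"]
                    .iter()
                    .any(|suffix| has(format!("{}{}", name, suffix).as_str()));
            !found
        })
        .collect()
}
```

An empty result means every expected name was found. A non-empty result lists the metrics that have not been emitted yet, which may simply mean the background loop has not run since the pod started.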