# Metrics Backlog Implementation Plan

## Context

The Grafana admin dashboard at stats.irc.now needs two pieces of work: (1) existing stat panels show duplicate series because bare metric names aren't aggregated across Prometheus scrape targets, and (2) the metrics backlog in todo.txt lists 9 ideas, 8 of which are feasible now but not yet implemented. This plan fixes the dashboard bug and implements all 8 feasible metrics.

## Part 0: Fix Dashboard Duplicate Series Bug

**Problem**: Every stat panel uses bare metric names (e.g., `irc_now_users_total`) without aggregation. When Prometheus scrapes the same gauge from multiple targets, each target produces a separate time series, so stat panels show repeated values (visible in the screenshot: "Total 4 Total 4 Total 4...").

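**Fix**: Wrap every gauge expression in `max()`, which is safe because each target sets the gauge to the same absolute value. Counter-derived expressions such as `increase(...)` use `sum()` instead, on the assumption that each target is a separate process keeping its own count. Timeseries panels need the same aggregation to avoid duplicate legend entries.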

**File**: `deploy/monitoring/grafana-dashboard.yaml`

Panels to fix (all stat panels + timeseries with bare metrics):
- Panel 1: `irc_now_users_total` -> `max(irc_now_users_total)`
- Panel 2: `irc_now_active_users_24h` -> `max(irc_now_active_users_24h)`
- Panel 3: `increase(irc_now_logins_total[24h])` -> `sum(increase(irc_now_logins_total[24h]))`
- Panel 4: `irc_now_users_by_plan` -> `max(irc_now_users_by_plan) by (plan)`
- Panel 5: `irc_now_pastes_total` -> `max(irc_now_pastes_total)`
- Panel 6: `irc_now_images_total` -> `max(irc_now_images_total)`
- Panel 7: `irc_now_bots_total` -> `max(irc_now_bots_total)`
- Panel 8: `irc_now_bots_running` -> `max(irc_now_bots_running)`
- Panel 9: both targets -> `max(irc_now_users_total)`, `max(irc_now_active_users_24h)`
- Panel 10: all 3 targets -> `max(...)` each
- Panel 11: both targets -> `max(...)` each
- Panel 12: all 3 targets -> `max(...)` each
- Panel 15: `irc_now_mrr_cents / 100` -> `max(irc_now_mrr_cents) / 100`
- Panel 16: `irc_now_subscriptions_active` -> `max(irc_now_subscriptions_active)`
- Panel 17: `irc_now_signups_7d` -> `max(irc_now_signups_7d)`
- Panel 18: `irc_now_conversion_rate_30d * 100` -> `max(irc_now_conversion_rate_30d) * 100`
- Panel 19: `irc_now_churn_rate_30d * 100` -> `max(irc_now_churn_rate_30d) * 100`
- Panel 20: both targets -> `max(...)` each

Also fix `deploy/monitoring/grafana-dashboard-public.yaml` (same pattern).
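
To illustrate why `max()` is the right aggregation for these gauges, here is a minimal Rust sketch that mimics what PromQL does with one gauge reported by several targets. The instance names and values are hypothetical; the only point is that `max` collapses identical values to one, while `sum` would inflate them.

```rust
/// Equivalent of PromQL `max(metric)`: one value, however many targets report it.
/// Each tuple is (instance label, gauge value); instance names are hypothetical.
fn max_over_targets(samples: &[(&str, f64)]) -> Option<f64> {
    samples.iter().map(|(_, value)| *value).reduce(f64::max)
}

/// Equivalent of PromQL `sum(metric)`: over-counts a gauge that every
/// target already reports in full.
fn sum_over_targets(samples: &[(&str, f64)]) -> f64 {
    samples.iter().map(|(_, value)| *value).sum()
}
```

With three targets each reporting `4`, `max_over_targets` yields a single `4` (the value the stat panel should show), whereas `sum_over_targets` yields `12`.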

## Part 1: Per-Network Activity (Metric #1)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_soju_metrics()`

Add a per-network delivery receipt count query after the existing Channel count (line 189). `"DeliveryReceipt"` has a `network` FK (confirmed in `migrate.rs:287`) and is reliably present on all bouncers.

```sql
SELECT COALESCE(n.name, n.addr) AS label, COUNT(dr.*)
FROM "Network" n
LEFT JOIN "DeliveryReceipt" dr ON dr.network = n.id
GROUP BY COALESCE(n.name, n.addr)
```

Metric: `irc_now_bouncer_deliveries_by_network{network="..."}` gauge

Cap the breakdown at 50 networks per bouncer to prevent label explosion: if a bouncer exceeds the cap, skip its per-network gauges and log the skip instead.
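
The cap could be enforced with a small guard before any gauges are emitted. This is a sketch only: `MAX_NETWORK_LABELS` and `rows_within_cap` are hypothetical names, and the rows are assumed to be the `(label, count)` pairs returned by the query above.

```rust
/// Hypothetical constant for the per-bouncer label cap described above.
const MAX_NETWORK_LABELS: usize = 50;

/// Returns the rows to emit as labelled gauges, or `None` when the bouncer
/// exceeds the cap and the per-network breakdown should be skipped.
fn rows_within_cap(rows: &[(String, i64)]) -> Option<&[(String, i64)]> {
    if rows.len() > MAX_NETWORK_LABELS {
        None // caller logs the skip and emits nothing for this bouncer
    } else {
        Some(rows)
    }
}
```

Returning `None` rather than truncating keeps the series set stable: a bouncer either reports all of its networks or none, so no arbitrary subset appears on the dashboard.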

## Part 2: Connected Users 5-Minute Window (Metric #2)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_soju_metrics()`

Add accumulator `total_connected_5m: i64 = 0` alongside `total_active` (line 138). Add query after the existing 24h query (line 211):

```sql
SELECT COUNT(*) FROM "User" WHERE downstream_interacted_at > NOW() - INTERVAL '5 minutes'
```

Metric: `irc_now_bouncer_connected_users_5m` gauge (emitted after the loop with the others)
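
The accumulator pattern can be sketched as follows. The `Result` entries stand in for the per-bouncer query outcomes; skipping failed bouncers rather than aborting is an assumption about how the existing loop behaves, not something confirmed by the code.

```rust
/// Sums the per-bouncer 5-minute counts into one total.
/// `Err` entries represent bouncers whose query failed; they are skipped
/// (assumed behaviour, mirroring how `total_active` is presumed to work).
fn total_connected_5m(per_bouncer: &[Result<i64, String>]) -> i64 {
    let mut total_connected_5m: i64 = 0;
    for result in per_bouncer {
        if let Ok(count) = result {
            total_connected_5m += count;
        }
    }
    // The gauge is set once, after the loop, from this total.
    total_connected_5m
}
```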

## Part 3: Onboarding Funnel (Metric #3)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after line 117

Three gauges, all 30-day rolling windows from the accounts DB events table:

**Stage 1 -- Signups (30d)**:
```sql
SELECT COUNT(*) FROM events WHERE event_type = 'signup'
  AND created_at > NOW() - INTERVAL '30 days'
```
Metric: `irc_now_funnel_signups_30d`

**Stage 2 -- Bouncer Created (30d)**: users who signed up in last 30d AND created a bouncer after signup:
```sql
SELECT COUNT(DISTINCT s.user_sub) FROM events s
JOIN events bc ON s.user_sub = bc.user_sub AND bc.event_type = 'bouncer_create'
WHERE s.event_type = 'signup' AND s.created_at > NOW() - INTERVAL '30 days'
  AND bc.created_at >= s.created_at
```
Metric: `irc_now_funnel_bouncer_created_30d`

**Stage 3 -- Returned (30d)**: users who signed up in last 30d AND logged in again >1 day after signup:
```sql
SELECT COUNT(DISTINCT s.user_sub) FROM events s
WHERE s.event_type = 'signup' AND s.created_at > NOW() - INTERVAL '30 days'
  AND EXISTS (
    SELECT 1 FROM events e2 WHERE e2.user_sub = s.user_sub
      AND e2.event_type = 'login' AND e2.created_at > s.created_at + INTERVAL '1 day'
  )
```
Metric: `irc_now_funnel_returned_30d`

Pattern: `sqlx::query_scalar::<_, i64>(...)` matching existing code style.
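
The three funnel queries can be cross-checked against an in-memory model, which is handy for unit-testing the stage definitions without a database. The sketch below is an assumption-laden restatement of the SQL, not the production code: `Event`, `Kind`, and `funnel_30d` are hypothetical names, and timestamps are plain Unix seconds.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq)]
enum Kind { Signup, BouncerCreate, Login }

/// Hypothetical in-memory stand-in for one row of the `events` table.
struct Event { user_sub: &'static str, kind: Kind, at: i64 }

const DAY: i64 = 86_400;

#[derive(Debug, PartialEq)]
struct Funnel { signups: usize, bouncer_created: usize, returned: usize }

fn funnel_30d(events: &[Event], now: i64) -> Funnel {
    let cutoff = now - 30 * DAY;
    // Stage 1: signup events inside the 30-day window.
    let recent: Vec<&Event> = events
        .iter()
        .filter(|e| e.kind == Kind::Signup && e.at > cutoff)
        .collect();
    let mut created: HashSet<&str> = HashSet::new();
    let mut returned: HashSet<&str> = HashSet::new();
    for s in &recent {
        for e in events.iter().filter(|e| e.user_sub == s.user_sub) {
            match e.kind {
                // Stage 2: bouncer created at or after signup.
                Kind::BouncerCreate if e.at >= s.at => { created.insert(s.user_sub); }
                // Stage 3: login more than one day after signup.
                Kind::Login if e.at > s.at + DAY => { returned.insert(s.user_sub); }
                _ => {}
            }
        }
    }
    Funnel { signups: recent.len(), bouncer_created: created.len(), returned: returned.len() }
}
```

Stage 1 counts events while stages 2 and 3 count distinct users, matching `COUNT(*)` versus `COUNT(DISTINCT s.user_sub)` in the SQL.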

## Part 4: Time-to-First-Bouncer (Metric #4)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after funnel queries

```sql
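-- Average delay from signup to the user's first bouncer_create event.
-- The ::float8 cast keeps the result decodable as f64 on PostgreSQL 14+, where EXTRACT returns numeric.
SELECT AVG(EXTRACT(EPOCH FROM (fb.first_created_at - s.created_at)))::float8
FROM events s
JOIN LATERAL (
  SELECT MIN(bc.created_at) AS first_created_at FROM events bc
  WHERE bc.user_sub = s.user_sub AND bc.event_type = 'bouncer_create'
    AND bc.created_at >= s.created_at
) fb ON fb.first_created_at IS NOT NULL
WHERE s.event_type = 'signup' AND s.created_at > NOW() - INTERVAL '30 days'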
```

Return type: `Option<f64>` (AVG returns NULL if no rows). Set gauge to 0.0 on None.

Metric: `irc_now_time_to_first_bouncer_seconds` gauge
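
The `Option<f64>` handling can be sketched without a database. `mean_seconds` mimics SQL `AVG` returning NULL on an empty set, and `gauge_value` applies the fallback described above; both names are hypothetical.

```rust
/// Mimics SQL `AVG`: `None` (NULL) when there are no rows.
fn mean_seconds(delays: &[f64]) -> Option<f64> {
    if delays.is_empty() {
        None
    } else {
        Some(delays.iter().sum::<f64>() / delays.len() as f64)
    }
}

/// Value to set on the gauge: 0.0 when the query produced no average.
fn gauge_value(avg: Option<f64>) -> f64 {
    avg.unwrap_or(0.0)
}
```

One consequence of this choice is that a reading of `0` on the panel is ambiguous between "no data" and "instant bouncer creation"; the funnel signup gauge can disambiguate the two.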

## Part 5: Feature Adoption (Metric #5)

**File**: `crates/web-api/src/business_metrics.rs` -- inside `record_event_metrics()`, after the time-to-first-bouncer query

Three gauges (raw counts, percentages computed in Grafana):

```sql
SELECT COUNT(*) FROM users WHERE plan = 'pro' AND content_expires = false
```
Metric: `irc_now_adoption_permanent_content`

```sql
SELECT COUNT(DISTINCT user_sub) FROM events WHERE event_type = 'bouncer_create'
```
Metric: `irc_now_adoption_bouncer_users`

```sql
SELECT COUNT(DISTINCT user_sub) FROM events WHERE event_type = 'network_create'
```
Metric: `irc_now_adoption_network_users`

## Part 6: Error Rates by Service (Metric #6)

**No Rust changes.** Pure PromQL in dashboard.

New panel in `deploy/monitoring/grafana-dashboard.yaml`:
- `sum(rate(http_requests_total{namespace="irc-josie-cloud", status=~"4.."}[5m])) by (job)` -- 4xx
- `sum(rate(http_requests_total{namespace="irc-josie-cloud", status=~"5.."}[5m])) by (job)` -- 5xx

Timeseries panel, unit: reqps.

## Part 7: Storage Growth Prediction (Metric #7)

**No Rust changes.** Pure PromQL in dashboard.

Two stat panels in `deploy/monitoring/grafana-dashboard.yaml`:
- `predict_linear(max(irc_now_pastes_storage_bytes)[7d:1h], 30*86400)` -- pastes in 30d
- `predict_linear(max(irc_now_images_storage_bytes)[7d:1h], 30*86400)` -- images in 30d

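Unit: bytes. Panels show "N/A" until the subquery window holds at least two samples; forecasts become meaningful only once the window approaches its full 7 days of data.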
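
For intuition about what `predict_linear` computes, the sketch below fits a least-squares line through `(timestamp, value)` samples and extrapolates it. It is a simplified illustration of the technique, not Prometheus's implementation; the function name simply mirrors the PromQL one.

```rust
/// Least-squares line through `(timestamp_secs, value)` samples,
/// extrapolated to `now + horizon_secs`. Returns `None` with fewer than
/// two samples or when all samples share one timestamp.
fn predict_linear(samples: &[(f64, f64)], now: f64, horizon_secs: f64) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let (mut sx, mut sy, mut sxx, mut sxy) = (0.0, 0.0, 0.0, 0.0);
    for &(t, v) in samples {
        // Centre timestamps on `now` so the intercept is the fitted value at `now`.
        let x = t - now;
        sx += x;
        sy += v;
        sxx += x * x;
        sxy += x * v;
    }
    let denom = n * sxx - sx * sx;
    if denom == 0.0 {
        return None;
    }
    let slope = (n * sxy - sx * sy) / denom;
    let intercept = (sy - slope * sx) / n;
    Some(intercept + slope * horizon_secs)
}
```

In the panels above the horizon is `30*86400` seconds and the samples come from the `[7d:1h]` subquery, so the forecast is driven by roughly 168 hourly points once a full week has accumulated.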

## Part 8: Bot Execution Metrics (Metric #8)

### 8a. Instrument dispatch (hot path)

**File**: `crates/bot/src/lua/dispatch.rs`

Add `use std::time::Instant;` and `use metrics::{counter, histogram};`.

Wrap the match block with timing:
1. Capture `let start = Instant::now();` before the match at line 22
2. After the match, record duration and increment counter
3. On error, also increment error counter

Metrics (label: `handler` -- bounded to 6 values: on_message, on_join, on_part, on_kick, on_nick, on_notice):
- `irc_now_bot_script_runs_total` counter
- `irc_now_bot_script_duration_seconds` histogram
- `irc_now_bot_script_errors_total` counter
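
The timing wrapper can be sketched as below. To keep the example dependency-free, a plain `HashMap` stands in for the `metrics` crate recorder; `HandlerStats` and `timed_dispatch` are hypothetical names, and the closure stands in for the existing match block in `dispatch.rs`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for the metrics recorder, keyed by the `handler` label.
#[derive(Default)]
struct HandlerStats {
    runs: u64,
    errors: u64,
    durations: Vec<Duration>,
}

/// Runs one handler invocation and records run count, duration, and errors.
fn timed_dispatch<F>(
    stats: &mut HashMap<&'static str, HandlerStats>,
    handler: &'static str,
    run: F,
) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    let start = Instant::now();
    let result = run(); // stands in for the existing match block
    let entry = stats.entry(handler).or_default();
    entry.runs += 1; // irc_now_bot_script_runs_total
    entry.durations.push(start.elapsed()); // irc_now_bot_script_duration_seconds
    if result.is_err() {
        entry.errors += 1; // irc_now_bot_script_errors_total
    }
    result
}
```

Recording the duration for failed runs as well as successful ones keeps the histogram count equal to the runs counter, which makes the two series easy to reconcile on the dashboard.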

### 8b. Script load errors

**File**: `crates/bot/src/manager.rs` -- line 174

Add `metrics::counter!("irc_now_bot_script_load_errors_total").increment(1);` inside the `if let Err(e)` block.

### 8c. Background gauges

**File**: `crates/bot/src/business_metrics.rs`

Add two queries to the existing loop:

```sql
SELECT COUNT(*) FROM bot_logs WHERE level = 'error' AND created_at > NOW() - INTERVAL '24 hours'
```
Metric: `irc_now_bot_errors_24h` gauge

```sql
SELECT COUNT(*) FROM bot_scripts WHERE enabled = true
```
Metric: `irc_now_bot_scripts_enabled` gauge

## Part 9: Dashboard Panels for New Metrics

**File**: `deploy/monitoring/grafana-dashboard.yaml`

Add 10 new panels (IDs 21-30) to the admin dashboard:

| ID | Title | Type | Key Expression |
|----|-------|------|---------------|
| 21 | HTTP Errors (4xx vs 5xx) | timeseries | `sum(rate(http_requests_total{status=~"4.."}[5m])) by (job)`, plus the `5..` equivalent (see Part 6) |
| 22 | Pastes Storage in 30d | stat | `predict_linear(...)` |
| 23 | Images Storage in 30d | stat | `predict_linear(...)` |
| 24 | Onboarding Funnel (30d) | bargauge | signups, bouncer_created, returned |
| 25 | Time to First Bouncer | stat | `max(irc_now_time_to_first_bouncer_seconds)` |
| 26 | Connected Users (5m) | stat | `max(irc_now_bouncer_connected_users_5m)` |
| 27 | Deliveries by Network | timeseries | `max(irc_now_bouncer_deliveries_by_network) by (network)` |
| 28 | Feature Adoption | bargauge | permanent_content, bouncer_users, network_users |
| 29 | Bot Script Runs | timeseries | `sum(rate(irc_now_bot_script_runs_total[5m])) by (handler)` |
| 30 | Bot Script Latency p95 | timeseries | `histogram_quantile(0.95, ...)` |

All new gauge panels use `max()` aggregation from the start; the counter and histogram panels (21, 29, 30) aggregate with `sum()` over `rate()` instead.

## Implementation Order

1. **Part 0**: Fix dashboard duplicate series bug (quick, unblocks correct visualization)
2. **Parts 1-5**: Rust changes in `web-api/src/business_metrics.rs` (one file, one build)
3. **Part 8**: Rust changes in `bot` crate (3 files, one build)
4. **Parts 6-7, 9**: Dashboard panel additions (one `oc apply`)
5. Build and deploy

## Files to Modify

| File | Changes |
|------|---------|
| `crates/web-api/src/business_metrics.rs` | Parts 1-5: ~60 lines added to existing functions |
| `crates/bot/src/lua/dispatch.rs` | Part 8a: timing + counters around match block |
| `crates/bot/src/manager.rs` | Part 8b: 1 line added (script load error counter) |
| `crates/bot/src/business_metrics.rs` | Part 8c: 2 query blocks added (~14 lines) |
| `deploy/monitoring/grafana-dashboard.yaml` | Parts 0, 6, 7, 9: fix existing + add 10 panels |
| `deploy/monitoring/grafana-dashboard-public.yaml` | Part 0: fix duplicate series |

## Verification

1. `cargo check -p irc-now-web-api -p irc-now-bot` -- compiles
2. `cargo test -p irc-now-web-api -p irc-now-bot` -- existing tests pass
3. Build + deploy web-api and bot via `oc start-build`
4. `oc apply -f deploy/monitoring/grafana-dashboard.yaml -f deploy/monitoring/grafana-dashboard-public.yaml`
5. Check stats.irc.now admin dashboard:
   - Stat panels show single values (not duplicated)
   - New panels appear and either show data or "N/A" / "No data" (acceptable for metrics that need time to accumulate)
6. Check `/metrics` endpoint on web-api pod for new gauge names
7. Check `/metrics` endpoint on bot pod for counter/histogram names