MIS-Contro-Tower/snappy.md

# Snappy UX plan (Next.js)

## Goals
- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
- Reduce server and data latency for heavy pages (Overview, Reports).
- Keep data accurate while allowing a few seconds of staleness for Settings/Financial.

## Constraints
- Data-heavy pages with large payloads and expensive queries.
- Users click multiple times when no feedback is shown.

## Success targets
- Navigation feedback in <50ms (loading/skeleton/pending state).
- P95 server response of 300-500ms or better for most queries; slower cases hidden behind progressive loading.
- No multi-click queueing; one navigation at a time.

---

## Phase 1: Audit and baseline (completed)

### What was instrumented
- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
- Per-step timings inside `getOverviewData` (machines query, events query, normalize/filter).
- Client nav timing hooks were added, but no samples were captured at first because of the service env/build config (resolved later; see the Update under "Data captured").
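
The per-step timings above could be collected with a small helper like the one below. This is a minimal sketch, not the code used in `getOverviewData`: `createStepTimer`, `mark`, and `finish` are hypothetical names, and the clock is injectable only so the example can be checked deterministically.

```typescript
// Hypothetical helper: records how long each named step of a request took.
function createStepTimer(now: () => number = () => Date.now()) {
  const startedAt = now();
  let last = startedAt;
  const steps: Record<string, number> = {};

  return {
    // Stores the time elapsed since the previous mark (or since creation) under `name`.
    mark(name: string): void {
      const t = now();
      steps[name] = t - last;
      last = t;
    },
    // Returns a copy of the recorded steps plus the total elapsed time.
    finish(): { steps: Record<string, number>; totalMs: number } {
      return { steps: { ...steps }, totalMs: now() - startedAt };
    },
  };
}

// Intended use inside a data loader (the query calls are placeholders):
//   const timer = createStepTimer();
//   const machines = await loadMachines();  timer.mark("machinesQuery");
//   const events = await loadEvents();      timer.mark("eventsQuery");
//   log("perf.overview.getOverviewData", timer.finish());
```

Marking after each `await` keeps the helper synchronous, so it adds no extra promise hops to the code being measured.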

### Baseline results (from `/tmp/mis-control-tower.log`)
- Aggregate stats (cold + warm averaged)
  - Client nav (`perf.client` nav duration)
    - Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
  - Overview API (`/api/overview`) total
    - Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
  - `getOverviewData` total
    - Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
  - Machines query (inside Overview)
    - Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
  - Machines API (`/api/machines`) total
    - Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
  - Reports API (`/api/reports`) total
    - Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
  - Reports filters (`/api/reports/filters`) total
    - Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
  - Reports payload size
    - Avg: ~406KB; p50: ~406KB; p95: ~407KB.

- Overview (`/api/overview`)
  - Total: ~1.3–2.5s across samples (best ~1.2s, spikes up to ~2.5s).
  - `getOverviewData` total: ~1.15–1.36s typically; one sample ~2.4s.
  - **Machines query dominates**: ~1.12–1.33s (primary bottleneck).
  - Events query: ~5–35ms (minor).
  - Payload: ~13KB.

- Machines (`/api/machines`)
  - Total: ~1.15–1.33s per call for 3 machines.
  - **Machines query dominates**: ~1.15–1.33s.
  - Payload: ~1.6KB.

- Reports (`/api/reports`)
  - Typical total: ~170–225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
  - Query timings combined: ~130–200ms.
  - Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
  - **Payload size ~406KB** (largest).

- Reports filters (`/api/reports/filters`)
  - Typical total: ~56–68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
  - Query timings: ~30–40ms.
  - Payload: ~51B.

### Findings
- The dominant latency contributor is the **machines query** used by Overview and Machines endpoints.
- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
- Client end-to-end nav timing (`perf.client`) is now captured; p95 is ~67ms, slightly above the 50ms target.
- Baseline summaries should average cold and warm samples together for now.

### Data captured
- Logs are stored at `/tmp/mis-control-tower.log`.
- Events include: `perf.overview.api`, `perf.overview.getOverviewData`, `perf.machines.api`, `perf.reports.api`, `perf.reports.filters`.
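
The aggregate stats in the baseline (avg, p50, p95, min, max) can be derived from these events with a few lines of code. The sketch below assumes each log line has already been parsed into an object with `event` and `durationMs` fields; that record shape is an assumption, not the actual log format. It uses the nearest-rank percentile definition.

```typescript
// Assumed shape of a parsed log record (hypothetical field names).
interface PerfEvent {
  event: string;
  durationMs: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// Summarizes all samples of one event name.
function summarize(events: PerfEvent[], name: string) {
  const d = events
    .filter((e) => e.event === name)
    .map((e) => e.durationMs)
    .sort((a, b) => a - b);
  const avg = d.length ? d.reduce((sum, x) => sum + x, 0) / d.length : NaN;
  return {
    count: d.length,
    avg,
    p50: percentile(d, 50),
    p95: percentile(d, 95),
    min: d.length ? d[0] : NaN,
    max: d.length ? d[d.length - 1] : NaN,
  };
}
```

A single cold outlier pulls the average far above the median, which is why the baseline reports p50 and p95 alongside the average.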

Update
- Client nav timing is now captured via `/api/debug/perf` (`perf.client` events).
- API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.

---

## Phase 2: Instant feedback (UX)

### 1) Global route loading
- Add `app/(app)/loading.tsx` with a lightweight skeleton for the shell.
- Ensure each heavy route also has its own `loading.tsx` for targeted skeletons.

### 2) Sidebar pending state
- Use `useTransition` to mark a pending navigation.
- Disable repeated clicks and show a subtle spinner on the active item.
- Optional: debounce repeated clicks for 300-500ms.
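
The click-blocking rule can be expressed independently of React. The sketch below is a framework-free illustration of "one navigation at a time plus an optional debounce"; `createNavGuard` is a hypothetical helper, and in the sidebar the pending flag would come from `useTransition` rather than from this object.

```typescript
// Hypothetical guard: accepts a navigation only if none is pending and the
// previous accepted click is older than `debounceMs`.
function createNavGuard(debounceMs = 300, now: () => number = () => Date.now()) {
  let pendingHref: string | null = null;
  let lastAcceptedAt = -Infinity;

  return {
    // Returns true if the caller should start navigating to `href`.
    tryStart(href: string): boolean {
      const t = now();
      if (pendingHref !== null) return false; // a navigation is already in flight
      if (t - lastAcceptedAt < debounceMs) return false; // rapid repeat click
      pendingHref = href;
      lastAcceptedAt = t;
      return true;
    },
    // Call when the navigation has settled (success or failure).
    finish(): void {
      pendingHref = null;
    },
    // The href currently being navigated to, used to place the spinner.
    pending(): string | null {
      return pendingHref;
    },
  };
}
```

A click handler would call `tryStart(href)` and return early on `false`, so repeated clicks never enqueue extra navigations.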

### 3) Suspense boundaries
- Wrap the slowest sections (events, charts, tables) in `<Suspense>` with skeletons.
- Ensure initial shell renders immediately even if data is still loading.

Deliverables
- Users always see visual feedback within a single frame.
- Double-clicks do not queue up extra navigations.

Progress
- Added route-level loading skeletons for the app shell and heavy routes.
- Sidebar uses `useTransition` with a pending spinner and blocks repeat clicks.
- Added Suspense + lazy loading for the Overview timeline and Reports charts.

---

## Phase 3: Split heavy pages (Overview + Reports)

### Overview (split)
- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
- Defer: fetch full event stream and detailed KPIs via client API after initial render.
- Use an explicit "Load more" or lazy loading for event details.

Implementation sketch
- Create a `getOverviewSummary` for the initial server render.
- Create a client fetch (`/api/overview?detail=1`) for detailed events and charts.
- Replace large data arrays with preview-sized payloads.

Progress
- Overview now uses `getOverviewSummary` for first paint, and `/api/overview?detail=1` for deferred detail fetch.
- Summary responses are cached in-memory with TTL + in-flight de-dupe (`perf.overview.summary` shows cache hits).
- Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
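
The "TTL + in-flight de-dupe" behaviour of the summary cache can be sketched as follows. This is an illustration under assumptions, not the cache in the codebase: `createSummaryCache` and its parameters are hypothetical, and keys are assumed to be per-org strings.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Hypothetical in-memory cache: entries live for `ttlMs`, and concurrent
// misses for the same key share a single load.
function createSummaryCache<T>(ttlMs: number, now: () => number = () => Date.now()) {
  const entries = new Map<string, CacheEntry<T>>();
  const inFlight = new Map<string, Promise<T>>();

  // Deliberately not `async`, so concurrent callers receive the same promise object.
  function get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return Promise.resolve(hit.value);

    const pending = inFlight.get(key);
    if (pending) return pending;

    const promise = load().then(
      (value) => {
        inFlight.delete(key);
        entries.set(key, { value, expiresAt: now() + ttlMs });
        return value;
      },
      (err) => {
        inFlight.delete(key); // failed loads are not cached; the next call retries
        throw err;
      }
    );
    inFlight.set(key, promise);
    return promise;
  }

  return { get };
}
```

Without the in-flight map, a burst of requests arriving during a cold cache would each run the expensive query; with it, they all wait on the first one.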

### Reports (split)
- Render the report shell and filters immediately.
- Lazy-load heavy charts with `next/dynamic` and loading placeholders.
- Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
- Paginate any large tables or use virtualization.
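
For the pagination item, a page slice can be computed with a small pure function. This is a minimal sketch assuming rows are already in memory on the side that paginates; `paginate` and the `Page` shape are hypothetical names, and a server-side variant would translate the same arithmetic into `skip`/`take`.

```typescript
interface Page<T> {
  rows: T[];
  page: number; // 1-based, clamped to the valid range
  pageCount: number;
  total: number;
}

// Returns one page of `rows`; out-of-range page numbers are clamped.
function paginate<T>(rows: readonly T[], page: number, pageSize: number): Page<T> {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const total = rows.length;
  const pageCount = Math.max(1, Math.ceil(total / pageSize));
  const safePage = Math.min(Math.max(1, Math.floor(page)), pageCount);
  const start = (safePage - 1) * pageSize;
  return {
    rows: rows.slice(start, start + pageSize),
    page: safePage,
    pageCount,
    total,
  };
}
```

With roughly 6.2k cycle rows and a page size of 50, the UI would render 50 rows at a time instead of the full set.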

Deliverables
- Overview/Reports initial response is fast and small.
- Deep detail loads after the UI is already visible.

---

## Phase 4: Caching + data freshness

### 1) Page-level caching
- Remove `force-dynamic` where it is not required.
- Use `revalidate` on pages that can be stale for a few seconds (Settings, Financial).

### 2) Data cache for Prisma queries
- Wrap stable fetchers in `unstable_cache` with short TTL and tags (per org).
- Add manual refresh button on Settings/Financial to bypass cache when needed.

### 3) API cache headers
- Use `ETag` and `If-None-Match` where possible.
- For logged-in data, use `private` caching with short max-age.
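
The `ETag` / `If-None-Match` flow can be sketched without any framework. The example below is an assumption-laden illustration: `respondWithEtag` and `fnv1a` are hypothetical helpers, the hash is a simple non-cryptographic FNV-1a (hence the weak `W/` validator), and the 5-second `max-age` is a placeholder value.

```typescript
// FNV-1a over UTF-16 code units. Non-cryptographic; used only for change detection.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

interface CachedResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

// Builds a JSON response, or a bodyless 304 if the client's validator still matches.
function respondWithEtag(
  payload: unknown,
  ifNoneMatch: string | null,
  maxAgeSeconds = 5
): CachedResponse {
  const body = JSON.stringify(payload);
  const etag = `W/"${fnv1a(body)}"`;
  const headers: Record<string, string> = {
    ETag: etag,
    "Cache-Control": `private, max-age=${maxAgeSeconds}`,
  };
  // If-None-Match may carry a comma-separated list of validators.
  const candidates = (ifNoneMatch ?? "").split(",").map((s) => s.trim());
  if (candidates.some((c) => c === etag)) {
    return { status: 304, headers, body: null };
  }
  headers["Content-Type"] = "application/json";
  return { status: 200, headers, body };
}
```

The query still runs on a 304, so this saves transfer and parse time rather than database time; pairing it with a server-side cache covers both.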

Deliverables
- Fewer full recomputes for repeated navigations.
- Settings/Financial feel instant, but still correct.

Progress
- Added session cache + throttled `lastSeenAt` updates to reduce auth overhead spikes.
- Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
- Added refresh bypass (`?refresh=1`) and a refresh button on Financial.
- Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
- Restored `force-dynamic` on the authenticated layout to avoid static render errors from `cookies()`.
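
The throttled `lastSeenAt` update can be sketched as a per-session gate that allows at most one write per interval. `createLastSeenThrottle` is a hypothetical name and the 60-second interval in the usage comment is an assumption; the interval actually used is not stated here.

```typescript
// Hypothetical gate: returns true only when the session's last accepted write
// is older than `minIntervalMs`, so most requests skip the database update.
function createLastSeenThrottle(minIntervalMs: number, now: () => number = () => Date.now()) {
  // Grows with the number of sessions seen; a real implementation would evict old keys.
  const lastWrite = new Map<string, number>();

  return function shouldWrite(sessionId: string): boolean {
    const t = now();
    const prev = lastWrite.get(sessionId);
    if (prev !== undefined && t - prev < minIntervalMs) return false;
    lastWrite.set(sessionId, t);
    return true;
  };
}

// Intended use in the auth path (the update call is a placeholder):
//   const shouldWrite = createLastSeenThrottle(60_000);
//   if (shouldWrite(session.id)) await touchLastSeenAt(session.id);
```

This trades a small amount of `lastSeenAt` precision for removing a write from almost every authenticated request.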

---

## Phase 5: Query + payload tuning

- Reduce `select` fields to only what the UI needs on first render.
- Cap `take` sizes with clear UI controls to load more.
- Add indexes for the `orgId + ts` combinations used in `where` filters and `orderBy` clauses.
- Consider summary tables for expensive aggregations.

Progress
- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
- Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
- Overview/Machines now log `heartbeatsQuery` + `kpiQuery` to track the new bottlenecks.
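
After splitting the fetch, the base machine rows and the separately queried heartbeat/KPI rows have to be joined in application code. The sketch below shows one way to do that merge; `attachLatest` and the row shapes are hypothetical, the queries themselves are omitted, and the real code may already pick the latest row per machine in SQL.

```typescript
// Attaches the newest row (by tsServer) to each machine, or null if none exists.
function attachLatest<
  M extends { id: string },
  R extends { machineId: string; tsServer: number }
>(machines: M[], rows: R[]): Array<M & { latest: R | null }> {
  const latestByMachine = new Map<string, R>();
  for (const row of rows) {
    const current = latestByMachine.get(row.machineId);
    if (!current || row.tsServer > current.tsServer) {
      latestByMachine.set(row.machineId, row);
    }
  }
  return machines.map((m) => ({ ...m, latest: latestByMachine.get(m.id) ?? null }));
}
```

The merge is a single pass over the rows, so its cost is negligible next to the ~250-900ms query times noted above; the remaining work is in the queries.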

---

## What helped most
- Overview split + summary cache: repeat navigations are instant and detail loads later.
- Route-level loading + pending state: immediate feedback reduced double-clicks.
- Session cache + throttled `lastSeenAt` updates: reduced non-query overhead spikes.
- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
- Query shape changes: removed nested relation ordering and shifted load to targeted queries.

## Methodology / optimization strategy
- Instrument first, measure cold + warm, and store logs.
- Use timing breakdowns to find the dominant step.
- Improve perceived performance early (skeletons, pending state).
- Split payloads into summary + deferred detail.
- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.

## Validation
- Measure navigation feedback time (click to loading UI). Goal: <50ms.
- Track p95 TTFB and payload size for Overview and Reports before/after.
- Confirm that repeated clicks no longer add latency or duplicated requests.

---

## Open opportunities
- Optimize latest KPI query (index on `orgId + machineId + tsServer` or denormalize latest KPI onto `Machine`).
- Reduce Reports payload size (trim fields, paginate, or virtualize tables).
- Consider summary tables/materialized views for heavy aggregates.

## Further implementation plan (later)
1) Latest KPI/heartbeat acceleration
   - Add index for KPI lookups by server time: `@@index([orgId, machineId, tsServer])`.
   - Switch KPI “latest” ordering to `tsServer` to match index.
   - Optional: denormalize `latestHeartbeat` + `latestKpi` onto `Machine` and update on ingest.
   - Add background backfill job for legacy machines.

2) Machines + Overview caching
   - Increase summary cache TTL (30-60s) to raise hit rates.
   - Add per-org cache invalidation when a heartbeat/KPI ingests.
   - Add ETag handling to `/api/machines` (similar to overview detail).

3) Reports payload trim
   - Reduce fields in the `reports` response to the minimum the charts need.
   - Add pagination for large tables (KPIs/cycles/scrap).
   - Add “Download full dataset” endpoint separate from UI view.

4) Connection + ORM tuning
   - Enable Prisma query logging to identify slow SQL.
   - Evaluate connection pool size and cold-start behavior in serverless.
   - Move heavy aggregates to `GROUP BY` at DB level with indexes.

5) UX refinements
   - Add inline “last updated” timestamp in Overview/Reports headers.
   - Show cache-hit badges when content is served from cache.
   - Add optional “refresh” on the overview to re-fetch detail data.