pre-bemis

2026-04-22 05:04:19 +00:00
parent ac1a7900c8
commit 80d27f83b6
91 changed files with 11769 additions and 820 deletions
--- a/snappy.md
+++ b/snappy.md
@@ -0,0 +1,235 @@
+# Snappy UX plan (Next.js)
+
+## Goals
+- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
+- Reduce server and data latency for heavy pages (Overview, Reports).
+- Keep data accurate while allowing a few seconds of staleness on Settings/Financial.
+
+## Constraints
+- Data-heavy pages with large payloads and expensive queries.
+- Users click multiple times when no feedback is shown.
+
+## Success targets
+- Navigation feedback in <50ms (loading/skeleton/pending state).
+- P95 server response within 300-500ms for most queries; worst cases hidden behind progressive loading.
+- No multi-click queueing; one navigation at a time.
+
+---
+
+## Phase 1: Audit and baseline (completed)
+
+### What was instrumented
+- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
+- Per-step timings inside `getOverviewData` (machines query, events query, normalize/filter).
+- Client nav timing hooks were added but initially not captured due to service env/build config (see Update under "Data captured").
+
+### Baseline results (from `/tmp/mis-control-tower.log`)
+- Aggregate stats (cold + warm averaged)
+  - Client nav (`perf.client` nav duration)
+    - Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
+  - Overview API (`/api/overview`) total
+    - Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
+  - `getOverviewData` total
+    - Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
+  - Machines query (inside Overview)
+    - Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
+  - Machines API (`/api/machines`) total
+    - Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
+  - Reports API (`/api/reports`) total
+    - Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
+  - Reports filters (`/api/reports/filters`) total
+    - Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
+  - Reports payload size
+    - Avg: ~406KB; p50: ~406KB; p95: ~407KB.
+
+- Overview (`/api/overview`)
+  - Total: typically ~1.3–2.5s across samples (best ~1.2s).
+  - `getOverviewData` total: ~1.15–1.36s typically; one sample ~2.4s.
+  - **Machines query dominates**: ~1.12–1.33s (primary bottleneck).
+  - Events query: ~5–35ms (minor).
+  - Payload: ~13KB.
+
+- Machines (`/api/machines`)
+  - Total: ~1.15–1.33s per call for 3 machines.
+  - **Machines query dominates**: ~1.15–1.33s.
+  - Payload: ~1.6KB.
+
+- Reports (`/api/reports`)
+  - Typical total: ~170–225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
+  - Query timings combined: ~130–200ms.
+  - Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
+  - **Payload size ~406KB** (largest).
+
+- Reports filters (`/api/reports/filters`)
+  - Typical total: ~56–68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
+  - Query timings: ~30–40ms.
+  - Payload: ~51B.
+
+### Findings
+- The dominant latency contributor is the **machines query** used by Overview and Machines endpoints.
+- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
+- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
+- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
+- Client end-to-end nav timing (`perf.client`) is now captured; p50 (~51ms) sits at the 50ms target and p95 (~67ms) is slightly above it.
+- Baseline summaries should average cold and warm samples together for now.
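Review note: the gap between average and p50 in the Reports numbers is what you would expect when cold and warm samples are averaged together. A minimal sketch of a nearest-rank summary, assuming latencies are collected as plain millisecond arrays; `percentile`, `summarize`, and the sample values are hypothetical and not taken from the log.

```typescript
// Nearest-rank percentile over an ascending-sorted array of millisecond samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(rank, 1) - 1];
}

interface LatencySummary {
  avg: number;
  p50: number;
  p95: number;
  min: number;
  max: number;
}

// Summarize raw samples the same way the baseline bullets are laid out.
function summarize(samplesMs: number[]): LatencySummary {
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const p50 = percentile(sorted, 50);
  const p95 = percentile(sorted, 95);
  const total = sorted.reduce((sum, value) => sum + value, 0);
  return {
    avg: total / sorted.length,
    p50,
    p95,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

// Hypothetical data: four warm requests and one cold outlier.
const mixedSamples = [200, 210, 220, 230, 16000];
console.log(summarize(mixedSamples));
```

With a single cold outlier the average lands far above the median, so reporting p50 and p95 alongside the average keeps warm-path regressions visible.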
+
+### Data captured
+- Logs are stored at `/tmp/mis-control-tower.log`.
+- Events include: `perf.overview.api`, `perf.overview.getOverviewData`, `perf.machines.api`, `perf.reports.api`, `perf.reports.filters`.
+
+Update
+- Client nav timing is now captured via `/api/debug/perf` (`perf.client` events).
+- API timing events now break out `auth`, `preQuery`, and `postQuery` phases and, when enabled, include `coldStart` and `uptimeMs`.
+
+---
+
+## Phase 2: Instant feedback (UX)
+
+### 1) Global route loading
+- Add `app/(app)/loading.tsx` with a lightweight skeleton for the shell.
+- Ensure each heavy route also has its own `loading.tsx` for targeted skeletons.
+
+### 2) Sidebar pending state
+- Use `useTransition` to mark a pending navigation.
+- Disable repeated clicks and show a subtle spinner on the active item.
+- Optional: debounce repeated clicks for 300-500ms.
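Review note: the "one navigation at a time" rule can be illustrated without React. In the app the pending flag would come from `useTransition`; the sketch below is a framework-free stand-in, and `createNavGuard` and `Navigate` are hypothetical names.

```typescript
// A navigation function that resolves when the route change has finished.
type Navigate = (href: string) => Promise<void>;

interface NavGuard {
  // Returns true if the click started a navigation, false if it was ignored.
  click(href: string): boolean;
  // The href currently being navigated to, or null when idle.
  pending(): string | null;
}

function createNavGuard(navigate: Navigate): NavGuard {
  let pendingHref: string | null = null;
  const clear = (): void => {
    pendingHref = null;
  };

  return {
    click(href: string): boolean {
      if (pendingHref !== null) {
        return false; // a navigation is already running; drop the repeat click
      }
      pendingHref = href;
      // Release the guard whether the navigation succeeds or fails.
      navigate(href).then(clear, clear);
      return true;
    },
    pending: () => pendingHref,
  };
}
```

The value returned by `pending()` is what a sidebar item would compare against its own href to decide whether to show the spinner.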
+
+### 3) Suspense boundaries
+- Wrap the slowest sections (events, charts, tables) in `<Suspense>` with skeletons.
+- Ensure initial shell renders immediately even if data is still loading.
+
+Deliverables
+- Users always see visual feedback within a single frame.
+- Double-clicks do not queue up extra navigations.
+
+Progress
+- Added route-level loading skeletons for the app shell and heavy routes.
+- Sidebar uses `useTransition` with a pending spinner and blocks repeat clicks.
+- Added Suspense + lazy loading for the Overview timeline and Reports charts.
+
+---
+
+## Phase 3: Split heavy pages (Overview + Reports)
+
+### Overview (split)
+- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
+- Defer: fetch full event stream and detailed KPIs via client API after initial render.
+- Use an explicit "Load more" or lazy loading for event details.
+
+Implementation sketch
+- Create a `getOverviewSummary` for the initial server render.
+- Create a client fetch (`/api/overview?detail=1`) for detailed events and charts.
+- Replace large data arrays with preview-sized payloads.
+
+Progress
+- Overview now uses `getOverviewSummary` for first paint, and `/api/overview?detail=1` for deferred detail fetch.
+- Summary responses are cached in-memory with TTL + in-flight de-dupe (`perf.overview.summary` shows cache hits).
+- Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
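Review note: for readers unfamiliar with "TTL + in-flight de-dupe", here is a minimal sketch of the pattern. It is not the code from this commit; `createTtlCache`, the key format, and the injectable `now` clock are assumptions made for illustration.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Returns a getter that caches resolved values for ttlMs and shares a single
// in-flight promise between concurrent callers asking for the same key.
function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, CacheEntry<T>>();
  const inFlight = new Map<string, Promise<T>>();

  return function get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > now()) {
      return Promise.resolve(hit.value); // fresh cache hit
    }

    const pending = inFlight.get(key);
    if (pending !== undefined) {
      return pending; // someone is already loading this key
    }

    const promise = load().then(
      (value) => {
        entries.set(key, { value, expiresAt: now() + ttlMs });
        inFlight.delete(key);
        return value;
      },
      (error) => {
        inFlight.delete(key); // do not cache failures
        throw error;
      },
    );
    inFlight.set(key, promise);
    return promise;
  };
}
```

Keying by organization (for example `overview:<orgId>`) would keep tenants isolated; that key shape is likewise an assumption.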
+
+### Reports (split)
+- Render the report shell and filters immediately.
+- Lazy-load heavy charts with `next/dynamic` and loading placeholders.
+- Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
+- Paginate any large tables or use virtualization.
+
+Deliverables
+- Overview/Reports initial response is fast and small.
+- Deep detail loads after the UI is already visible.
+
+---
+
+## Phase 4: Caching + data freshness
+
+### 1) Page-level caching
+- Remove `force-dynamic` where it is not required.
+- Use `revalidate` on pages that can be stale for a few seconds (Settings, Financial).
+
+### 2) Data cache for Prisma queries
+- Wrap stable fetchers in `unstable_cache` with short TTL and tags (per org).
+- Add manual refresh button on Settings/Financial to bypass cache when needed.
+
+### 3) API cache headers
+- Use `ETag` and `If-None-Match` where possible.
+- For logged-in data, use `private` caching with short max-age.
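Review note: a minimal sketch of the `ETag` / `If-None-Match` flow with `private` caching. It is framework-free and uses an FNV-1a hash as a stand-in so the example has no dependencies; a real handler would more likely use a cryptographic digest. `respondWithEtag` and `fnv1a` are hypothetical names, and only an exact single-value match is handled (no `*` or comma-separated lists).

```typescript
interface CachedResponse {
  status: 200 | 304;
  headers: { ETag: string; "Cache-Control": string };
  body: string | null;
}

// 32-bit FNV-1a hash, used here only as a dependency-free fingerprint.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Build a response for a serialized body, honoring the client's If-None-Match.
function respondWithEtag(
  body: string,
  ifNoneMatch: string | null,
  maxAgeSeconds: number = 5,
): CachedResponse {
  const etag = `"${fnv1a(body)}"`;
  const headers = {
    ETag: etag,
    // "private" keeps shared caches from storing per-user data.
    "Cache-Control": `private, max-age=${maxAgeSeconds}`,
  };
  if (ifNoneMatch === etag) {
    return { status: 304, headers, body: null }; // client copy is still valid
  }
  return { status: 200, headers, body };
}
```

A 304 still costs the server the work of producing the body to hash, so this mainly saves transfer size; pairing it with a server-side cache is what saves the recompute.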
+
+Deliverables
+- Fewer full recomputes for repeated navigations.
+- Settings/Financial feel instant, but still correct.
+
+Progress
+- Added session cache + throttled `lastSeenAt` updates to reduce auth overhead spikes.
+- Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
+- Added refresh bypass (`?refresh=1`) and a refresh button on Financial.
+- Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
+- Restored `force-dynamic` on the authenticated layout to avoid static render errors from `cookies()`.
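Review note: the throttled `lastSeenAt` update can be sketched as a per-session write gate. This is an in-memory illustration only; `createThrottledTouch`, the `write` callback, and the interval are assumptions, and a multi-instance deployment would need shared state.

```typescript
// Wraps a write function so each session is written at most once per interval.
function createThrottledTouch(
  write: (sessionId: string, at: number) => void,
  minIntervalMs: number,
  now: () => number = Date.now,
) {
  const lastWrite = new Map<string, number>();

  // Returns true when a write was issued, false when it was skipped.
  return function touch(sessionId: string): boolean {
    const current = now();
    const previous = lastWrite.get(sessionId);
    if (previous !== undefined && current - previous < minIntervalMs) {
      return false; // written recently enough; skip the DB round trip
    }
    lastWrite.set(sessionId, current);
    write(sessionId, current);
    return true;
  };
}
```

The trade-off is that `lastSeenAt` can lag by up to the chosen interval, which is usually acceptable for an activity timestamp.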
+
+---
+
+## Phase 5: Query + payload tuning
+
+- Reduce `select` fields to only what the UI needs on first render.
+- Cap `take` sizes with clear UI controls to load more.
+- Add composite indexes for the `orgId + ts` combinations used by filtered queries that sort with `orderBy`.
+- Consider summary tables for expensive aggregations.
+
+Progress
+- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
+- Added indexes for heartbeat `tsServer` lookups and for machine ordering by `orgId + createdAt`.
+- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
+- Overview/Machines now log `heartbeatsQuery` + `kpiQuery` to track the new bottlenecks.
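Review note: once the machine fetch is split into a base query plus separate latest-heartbeat/KPI queries, the results have to be stitched back together. The sketch below shows one way to do that merge in memory; it is an assumption about the shape of the code, not a copy of it. The row types are hypothetical apart from `tsServer`, which the plan names.

```typescript
interface MachineBase {
  id: string;
  name: string;
}

interface HeartbeatRow {
  machineId: string;
  tsServer: number; // server-side timestamp in epoch milliseconds
  status: string;
}

type MachineWithHeartbeat = MachineBase & { latestHeartbeat: HeartbeatRow | null };

// Keep only the newest heartbeat per machine, in one pass over the rows.
function latestPerMachine(rows: HeartbeatRow[]): Map<string, HeartbeatRow> {
  const latest = new Map<string, HeartbeatRow>();
  for (const row of rows) {
    const seen = latest.get(row.machineId);
    if (seen === undefined || row.tsServer > seen.tsServer) {
      latest.set(row.machineId, row);
    }
  }
  return latest;
}

// Join base machines with their newest heartbeat (null when none exists).
function attachLatestHeartbeat(
  machines: MachineBase[],
  rows: HeartbeatRow[],
): MachineWithHeartbeat[] {
  const latest = latestPerMachine(rows);
  return machines.map((machine) => ({
    ...machine,
    latestHeartbeat: latest.get(machine.id) ?? null,
  }));
}
```

The same merge applies to the latest-KPI query; keeping the join in application code is what lets each query stay a simple indexed lookup.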
+
+---
+
+## What helped most
+- Overview split + summary cache: repeat navigations are instant and detail loads later.
+- Route-level loading + pending state: immediate feedback reduced double-clicks.
+- Session cache + throttled `lastSeenAt` updates: reduced non-query overhead spikes.
+- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
+- Query shape changes: removed nested relation ordering and shifted load to targeted queries.
+
+## Methodology / optimization strategy
+- Instrument first, measure cold + warm, and store logs.
+- Use timing breakdowns to find the dominant step.
+- Improve perceived performance early (skeletons, pending state).
+- Split payloads into summary + deferred detail.
+- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
+- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.
+
+## Validation
+- Measure navigation feedback time (click to loading UI). Goal: <50ms.
+- Track p95 TTFB and payload size for Overview and Reports before/after.
+- Confirm that repeated clicks no longer add latency or duplicated requests.
+
+---
+
+## Open opportunities
+- Optimize latest KPI query (index on `orgId + machineId + tsServer` or denormalize latest KPI onto `Machine`).
+- Reduce Reports payload size (trim fields, paginate, or virtualize tables).
+- Consider summary tables/materialized views for heavy aggregates.
+
+## Further implementation plan (later)
+1) Latest KPI/heartbeat acceleration
+   - Add index for KPI lookups by server time: `@@index([orgId, machineId, tsServer])`.
+   - Switch KPI “latest” ordering to `tsServer` to match index.
+   - Optional: denormalize `latestHeartbeat` + `latestKpi` onto `Machine` and update on ingest.
+   - Add background backfill job for legacy machines.
+
+2) Machines + Overview caching
+   - Increase summary cache TTL (30-60s) to raise hit rates.
+   - Add per-org cache invalidation when a heartbeat/KPI ingests.
+   - Add ETag handling to `/api/machines` (similar to overview detail).
+
+3) Reports payload trim
+   - Reduce fields in the `reports` response to the minimum the charts need.
+   - Add pagination for large tables (KPIs/cycles/scrap).
+   - Add “Download full dataset” endpoint separate from UI view.
+
+4) Connection + ORM tuning
+   - Enable Prisma query logging to identify slow SQL.
+   - Evaluate connection pool size and cold-start behavior in serverless.
+   - Move heavy aggregates to `GROUP BY` at DB level with indexes.
+
+5) UX refinements
+   - Add inline “last updated” timestamp in Overview/Reports headers.
+   - Show cache-hit badges when content is served from cache.
+   - Add optional “refresh” on the overview to re-fetch detail data.