MIS-Contro-Tower/snappy.md
2026-04-22 05:04:19 +00:00

# Snappy UX plan (Next.js)
## Goals
- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
- Reduce server and data latency for heavy pages (Overview, Reports).
- Keep data accurate while allowing slight staleness for Settings/Financial (seconds).
## Constraints
- Data-heavy pages with large payloads and expensive queries.
- Users click multiple times when no feedback is shown.
## Success targets
- Navigation feedback in <50ms (loading/skeleton/pending state).
- P95 server response under 300-500ms for most queries; worst cases hidden behind progressive loading.
- No multi-click queueing; one navigation at a time.
---
## Phase 1: Audit and baseline (completed)
### What was instrumented
- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
- Per-step timings inside `getOverviewData` (machines query, events query, normalize/filter).
- Client nav timing hooks were added but were initially not captured because of the service env/build config (resolved later; see the update under "Data captured").
### Baseline results (from `/tmp/mis-control-tower.log`)
- Aggregate stats (cold + warm averaged)
  - Client nav (`perf.client` nav duration)
    - Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
  - Overview API (`/api/overview`) total
    - Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
  - `getOverviewData` total
    - Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
  - Machines query (inside Overview)
    - Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
  - Machines API (`/api/machines`) total
    - Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
  - Reports API (`/api/reports`) total
    - Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
  - Reports filters (`/api/reports/filters`) total
    - Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
  - Reports payload size
    - Avg: ~406KB; p50: ~406KB; p95: ~407KB.
- Overview (`/api/overview`)
  - Total: ~1.3-2.5s across samples (best ~1.2s, spikes up to ~2.5s).
  - `getOverviewData` total: ~1.15-1.36s typically; one sample ~2.4s.
  - **Machines query dominates**: ~1.12-1.33s (primary bottleneck).
  - Events query: ~5-35ms (minor).
  - Payload: ~13KB.
- Machines (`/api/machines`)
  - Total: ~1.15-1.33s per call for 3 machines.
  - **Machines query dominates**: ~1.15-1.33s.
  - Payload: ~1.6KB.
- Reports (`/api/reports`)
  - Typical total: ~170-225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
  - Query timings combined: ~130-200ms.
  - Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
  - **Payload size ~406KB** (largest).
- Reports filters (`/api/reports/filters`)
  - Typical total: ~56-68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
  - Query timings: ~30-40ms.
  - Payload: ~51B.
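
The aggregate figures above (avg, p50, p95, min, max) can be reproduced from raw duration samples with a small helper. This is a minimal sketch that assumes nearest-rank percentiles; the plan does not say which percentile method produced the baseline, and `summarize` is a hypothetical name.

```typescript
// Hypothetical helper: summarize a list of durations (milliseconds).
interface Summary {
  avg: number;
  p50: number;
  p95: number;
  min: number;
  max: number;
}

// Nearest-rank percentile on an ascending-sorted array (assumed method).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(samples: number[]): Summary {
  if (samples.length === 0) {
    throw new Error("summarize() needs at least one sample");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const total = sorted.reduce((sum, value) => sum + value, 0);
  return {
    avg: total / sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}
```

A single slow sample pulls the average far above the median, which matches the Reports API baseline (avg ~3.81s against p50 ~468ms). That is one reason to read p50 and p95 together rather than the average alone.
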
### Findings
- The dominant latency contributor is the **machines query** used by Overview and Machines endpoints.
- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
- Client end-to-end nav timing (`perf.client`) is now captured; p95 is ~67ms, slightly above the 50ms target.
- Baseline summaries should average cold and warm samples together for now.
### Data captured
- Logs are stored at `/tmp/mis-control-tower.log`.
- Events include: `perf.overview.api`, `perf.overview.getOverviewData`, `perf.machines.api`, `perf.reports.api`, `perf.reports.filters`.
Update
- Client nav timing is now captured via `/api/debug/perf` (`perf.client` events).
- API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.
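
One way to produce per-step timings such as auth/preQuery/postQuery is a small step timer that records the time elapsed between marks. The sketch below is illustrative only: `createStepTimer` and the `totalMs` field are hypothetical names, and the clock is injectable purely so the helper can be exercised deterministically.

```typescript
// Hypothetical step timer; not the project's actual instrumentation.
type Clock = () => number;

interface StepTimer {
  mark(step: string): void;
  finish(): Record<string, number>;
}

function createStepTimer(now: Clock = () => Date.now()): StepTimer {
  const startedAt = now();
  let lastMark = startedAt;
  const steps: Record<string, number> = {};
  return {
    // Record the time elapsed since the previous mark under `step`.
    mark(step) {
      const t = now();
      steps[step] = (steps[step] ?? 0) + (t - lastMark);
      lastMark = t;
    },
    // Return per-step durations plus the overall total.
    finish() {
      return { ...steps, totalMs: now() - startedAt };
    },
  };
}
```

In a route handler, `mark("auth")` would follow the session lookup and `mark("postQuery")` would follow serialization, so a total that is much larger than the query step points at overhead outside the database.
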
---
## Phase 2: Instant feedback (UX)
### 1) Global route loading
- Add `app/(app)/loading.tsx` with a lightweight skeleton for the shell.
- Ensure each heavy route also has its own `loading.tsx` for targeted skeletons.
### 2) Sidebar pending state
- Use `useTransition` to mark a pending navigation.
- Disable repeated clicks and show a subtle spinner on the active item.
- Optional: debounce repeated clicks for 300-500ms.
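
The "one navigation at a time" rule boils down to a small piece of state. The following framework-free sketch shows only that logic; `createNavGuard` is a hypothetical helper, and in the actual sidebar the `isPending` flag returned by React's `useTransition` would play the role of `isPending()` here.

```typescript
// Hypothetical guard: allows one navigation at a time.
interface NavGuard {
  isPending(): boolean;
  // Returns true if navigation started, false if the click was ignored.
  navigate(href: string): boolean;
  // Call when the new route has rendered (or failed).
  settle(): void;
}

function createNavGuard(startNavigation: (href: string) => void): NavGuard {
  let pendingHref: string | null = null;
  return {
    isPending: () => pendingHref !== null,
    navigate(href) {
      if (pendingHref !== null) return false; // drop repeat clicks
      pendingHref = href;
      startNavigation(href);
      return true;
    },
    settle() {
      pendingHref = null;
    },
  };
}
```

Because repeat clicks are dropped rather than queued, a user who clicks three times while a route is loading triggers exactly one navigation.
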
### 3) Suspense boundaries
- Wrap the slowest sections (events, charts, tables) in `<Suspense>` with skeletons.
- Ensure initial shell renders immediately even if data is still loading.
Deliverables
- Users always see visual feedback within a single frame.
- Double-clicks do not queue up extra navigations.
Progress
- Added route-level loading skeletons for the app shell and heavy routes.
- Sidebar uses `useTransition` with a pending spinner and blocks repeat clicks.
- Added Suspense + lazy loading for the Overview timeline and Reports charts.
---
## Phase 3: Split heavy pages (Overview + Reports)
### Overview (split)
- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
- Defer: fetch full event stream and detailed KPIs via client API after initial render.
- Use an explicit "Load more" or lazy loading for event details.
Implementation sketch
- Create a `getOverviewSummary` for the initial server render.
- Create a client fetch (`/api/overview?detail=1`) for detailed events and charts.
- Replace large data arrays with preview-sized payloads.
Progress
- Overview now uses `getOverviewSummary` for first paint, and `/api/overview?detail=1` for deferred detail fetch.
- Summary responses are cached in-memory with TTL + in-flight de-dupe (`perf.overview.summary` shows cache hits).
- Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
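
An in-memory cache with TTL and in-flight de-dupe can be sketched by caching the promise itself, so concurrent callers share one load. This is an assumption-laden illustration, not the project's code: `TtlCache`, its `hits` counter, and the choice to start the TTL when the load begins are all hypothetical.

```typescript
// Hypothetical in-memory cache: short TTL plus in-flight de-duplication.
interface Entry<T> {
  promise: Promise<T>;
  expiresAt: number;
}

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  private ttlMs: number;
  private now: () => number;
  hits = 0;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, load: () => Promise<T>): Promise<T> {
    const cached = this.entries.get(key);
    if (cached && cached.expiresAt > this.now()) {
      this.hits += 1; // served from cache or from the in-flight request
      return cached.promise;
    }
    const promise = load();
    // The TTL starts when the load starts, not when it completes.
    this.entries.set(key, { promise, expiresAt: this.now() + this.ttlMs });
    // Do not keep failures cached: drop the entry if the load rejects.
    promise.catch(() => {
      if (this.entries.get(key)?.promise === promise) this.entries.delete(key);
    });
    return promise;
  }

  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

Keying by org (for example `summary:${orgId}`, a hypothetical key format) keeps one tenant's summary from being served to another.
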
### Reports (split)
- Render the report shell and filters immediately.
- Lazy-load heavy charts with `next/dynamic` and loading placeholders.
- Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
- Paginate any large tables or use virtualization.
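
For the pagination option, a common pattern is to fetch one row more than the page size to learn whether another page exists without running a separate count query. The sketch below uses an in-memory array as a stand-in for the database; `paginate` and `nextOffset` are hypothetical names.

```typescript
// Hypothetical page helper: read `pageSize + 1` rows to detect a next page.
interface Page<T> {
  rows: T[];
  nextOffset: number | null; // null when there is nothing left to load
}

function paginate<T>(all: T[], offset: number, pageSize: number): Page<T> {
  // Stand-in for a DB query with an offset and a limit of pageSize + 1.
  const fetched = all.slice(offset, offset + pageSize + 1);
  const hasMore = fetched.length > pageSize;
  return {
    rows: fetched.slice(0, pageSize),
    nextOffset: hasMore ? offset + pageSize : null,
  };
}
```

With a table of roughly 6.2k cycle rows, the first response then carries a single page instead of the full set, and the UI requests the next page only when `nextOffset` is not null.
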
Deliverables
- Overview/Reports initial response is fast and small.
- Deep detail loads after the UI is already visible.
---
## Phase 4: Caching + data freshness
### 1) Page-level caching
- Remove `force-dynamic` where it is not required.
- Use `revalidate` on pages that can be stale for a few seconds (Settings, Financial).
### 2) Data cache for Prisma queries
- Wrap stable fetchers in `unstable_cache` with short TTL and tags (per org).
- Add manual refresh button on Settings/Financial to bypass cache when needed.
### 3) API cache headers
- Use `ETag` and `If-None-Match` where possible.
- For logged-in data, use `private` caching with short max-age.
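
The `ETag` / `If-None-Match` exchange can be sketched without any framework: hash the serialized payload, compare it with the tag the client sent, and answer 304 with no body on a match. Everything named here is an assumption for illustration (`respondWithEtag`, the 5-second `max-age`, and the FNV-1a-style hash, which stands in for a proper digest).

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; a stand-in for a real digest.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

interface CachedResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

function respondWithEtag(
  payload: object,
  ifNoneMatch: string | null,
  maxAgeSeconds = 5, // assumed value; tune per page
): CachedResponse {
  const body = JSON.stringify(payload);
  const etag = `W/"${fnv1a(body)}"`;
  const headers = {
    ETag: etag,
    "Cache-Control": `private, max-age=${maxAgeSeconds}`,
  };
  // Weak comparison: ignore the W/ prefix on both sides.
  const strip = (tag: string) => tag.trim().replace(/^W\//, "");
  const wanted = strip(etag);
  const matches = (ifNoneMatch ?? "")
    .split(",")
    .map(strip)
    .some((tag) => tag === "*" || tag === wanted);
  return matches
    ? { status: 304, headers, body: null }
    : { status: 200, headers, body };
}
```

The `private` directive keeps shared caches from storing per-user data, while the 304 path saves the payload transfer but still costs the server the work of building the body to hash it.
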
Deliverables
- Fewer full recomputes for repeated navigations.
- Settings/Financial feel instant, but still correct.
Progress
- Added session cache + throttled `lastSeenAt` updates to reduce auth overhead spikes.
- Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
- Added refresh bypass (`?refresh=1`) and a refresh button on Financial.
- Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
- Restored `force-dynamic` on the authenticated layout to avoid static render errors from `cookies()`.
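
Throttling `lastSeenAt` means most authenticated requests skip the write entirely. A minimal sketch, assuming an in-memory record of the last write per session; `createLastSeenThrottle`, the interval, and the `write` callback are hypothetical and stand in for whatever actually persists the timestamp.

```typescript
// Hypothetical throttle: write `lastSeenAt` at most once per interval per session.
function createLastSeenThrottle(
  intervalMs: number,
  write: (sessionId: string, seenAt: number) => void,
  now: () => number = () => Date.now(),
) {
  const lastWrite = new Map<string, number>();
  // Returns true when a write was issued, false when it was skipped.
  return function touch(sessionId: string): boolean {
    const t = now();
    const previous = lastWrite.get(sessionId);
    if (previous !== undefined && t - previous < intervalMs) return false;
    lastWrite.set(sessionId, t);
    write(sessionId, t);
    return true;
  };
}
```

The trade-off is that `lastSeenAt` can lag by up to one interval, which is usually acceptable for a "last active" display.
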
---
## Phase 5: Query + payload tuning
- Reduce `select` fields to only what the UI needs on first render.
- Cap `take` sizes with clear UI controls to load more.
- Add composite indexes for the `orgId + ts` combinations used in `orderBy` and filter clauses.
- Consider summary tables for expensive aggregations.
Progress
- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
- Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
- Overview/Machines now log `heartbeatsQuery` + `kpiQuery` to track the new bottlenecks.
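
After splitting the fetch, the base machine rows and the separately queried heartbeats have to be joined in application code. The sketch below shows only that merge step under assumed row shapes; apart from `tsServer`, the field names and `attachLatestHeartbeat` are hypothetical, and the queries themselves are not shown.

```typescript
// Hypothetical row shapes; field names other than tsServer are assumptions.
interface MachineRow {
  id: string;
  name: string;
}

interface HeartbeatRow {
  machineId: string;
  tsServer: number;
  status: string;
}

interface MachineWithHeartbeat extends MachineRow {
  latestHeartbeat: HeartbeatRow | null;
}

function attachLatestHeartbeat(
  machines: MachineRow[],
  heartbeats: HeartbeatRow[],
): MachineWithHeartbeat[] {
  // Keep only the newest heartbeat per machine.
  const latest = new Map<string, HeartbeatRow>();
  for (const hb of heartbeats) {
    const current = latest.get(hb.machineId);
    if (!current || hb.tsServer > current.tsServer) {
      latest.set(hb.machineId, hb);
    }
  }
  // Machines without any heartbeat get null instead of being dropped.
  return machines.map((m) => ({
    ...m,
    latestHeartbeat: latest.get(m.id) ?? null,
  }));
}
```

The same merge shape would apply to the latest KPI rows, which lets each of the three queries be timed and indexed on its own.
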
---
## What helped most
- Overview split + summary cache: repeat navigations are instant and detail loads later.
- Route-level loading + pending state: immediate feedback reduced double-clicks.
- Session cache + throttled `lastSeenAt` updates: reduced non-query overhead spikes.
- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
- Query shape changes: removed nested relation ordering and shifted load to targeted queries.
## Methodology / optimization strategy
- Instrument first, measure cold + warm, and store logs.
- Use timing breakdowns to find the dominant step.
- Improve perceived performance early (skeletons, pending state).
- Split payloads into summary + deferred detail.
- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.
## Validation
- Measure navigation feedback time (click to loading UI). Goal: <50ms.
- Track p95 TTFB and payload size for Overview and Reports before/after.
- Confirm that repeated clicks no longer add latency or duplicated requests.
---
## Open opportunities
- Optimize latest KPI query (index on `orgId + machineId + tsServer` or denormalize latest KPI onto `Machine`).
- Reduce Reports payload size (trim fields, paginate, or virtualize tables).
- Consider summary tables/materialized views for heavy aggregates.
## Further implementation plan (later)
1) Latest KPI/heartbeat acceleration
- Add index for KPI lookups by server time: `@@index([orgId, machineId, tsServer])`.
- Switch KPI “latest” ordering to `tsServer` to match index.
- Optional: denormalize `latestHeartbeat` + `latestKpi` onto `Machine` and update on ingest.
- Add background backfill job for legacy machines.
2) Machines + Overview caching
- Increase summary cache TTL (30-60s) to raise hit rates.
- Add per-org cache invalidation when a heartbeat/KPI ingests.
- Add ETag handling to `/api/machines` (similar to overview detail).
3) Reports payload trim
- Reduce fields in the `reports` response to the minimum the charts need.
- Add pagination for large tables (KPIs/cycles/scrap).
- Add “Download full dataset” endpoint separate from UI view.
4) Connection + ORM tuning
- Enable Prisma query logging to identify slow SQL.
- Evaluate connection pool size and cold-start behavior in serverless.
- Move heavy aggregates to `GROUP BY` at DB level with indexes.
5) UX refinements
- Add inline “last updated” timestamp in Overview/Reports headers.
- Show cache-hit badges when content is served from cache.
- Add optional “refresh” on the overview to re-fetch detail data.