# Snappy UX plan (Next.js)

## Goals
- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
- Reduce server and data latency for heavy pages (Overview, Reports).
- Keep data accurate while allowing slight staleness for Settings/Financial (seconds).

## Constraints
- Data-heavy pages with large payloads and expensive queries.
- Users click multiple times when no feedback is shown.

## Success targets
- Navigation feedback in <50ms (loading/skeleton/pending state).
- P95 server response under 300-500ms for most queries; worst cases hidden behind progressive loading.
- No multi-click queueing; one navigation at a time.

---

## Phase 1: Audit and baseline (completed)

### What was instrumented
- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
- Per-step timings inside `getOverviewData` (machines query, events query, normalize/filter).
- Client nav timing hooks were added, but no events were captured at first because of the service env/build config (fixed later; see Update under "Data captured").

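
The per-step timings described above could be collected with a small mark-based timer. This is a minimal sketch, not the project's actual logger: the `StepTimer` class, the `PerfEvent` shape, and the step names are all assumptions made for illustration. The clock is injectable so the timer can be exercised with fixed values.

```typescript
// Hypothetical helper: class name, event shape, and step names are illustrative.
interface PerfEvent {
  event: string;                  // e.g. "perf.overview.getOverviewData"
  totalMs: number;                // time from construction to finish()
  steps: Record<string, number>;  // per-step durations in ms
}

class StepTimer {
  private readonly event: string;
  private readonly now: () => number;
  private readonly startedAt: number;
  private lastMark: number;
  private readonly steps: Record<string, number> = {};

  constructor(event: string, now: () => number = Date.now) {
    this.event = event;
    this.now = now;
    this.startedAt = now();
    this.lastMark = this.startedAt;
  }

  // Record the time elapsed since the previous mark (or since construction)
  // under the given step name.
  mark(step: string): void {
    const t = this.now();
    this.steps[step] = (this.steps[step] ?? 0) + (t - this.lastMark);
    this.lastMark = t;
  }

  // Build the event to hand to whatever logger is in use.
  finish(): PerfEvent {
    return {
      event: this.event,
      totalMs: this.now() - this.startedAt,
      steps: { ...this.steps },
    };
  }
}
```

In a data loader, one would call `mark("machinesQuery")` right after the machines query resolves, `mark("eventsQuery")` after the events query, and log `finish()` at the end.
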
### Baseline results (from `/tmp/mis-control-tower.log`)
- Aggregate stats (cold + warm averaged)
  - Client nav (`perf.client` nav duration)
    - Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
  - Overview API (`/api/overview`) total
    - Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
  - `getOverviewData` total
    - Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
  - Machines query (inside Overview)
    - Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
  - Machines API (`/api/machines`) total
    - Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
  - Reports API (`/api/reports`) total
    - Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
  - Reports filters (`/api/reports/filters`) total
    - Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
  - Reports payload size
    - Avg: ~406KB; p50: ~406KB; p95: ~407KB.

- Overview (`/api/overview`)
  - Total: ~1.3–2.5s across samples (best ~1.2s, spikes up to ~2.5s).
  - `getOverviewData` total: ~1.15–1.36s typically; one sample ~2.4s.
  - **Machines query dominates**: ~1.12–1.33s (primary bottleneck).
  - Events query: ~5–35ms (minor).
  - Payload: ~13KB.

- Machines (`/api/machines`)
  - Total: ~1.15–1.33s per call for 3 machines.
  - **Machines query dominates**: ~1.15–1.33s.
  - Payload: ~1.6KB.

- Reports (`/api/reports`)
  - Typical total: ~170–225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
  - Query timings combined: ~130–200ms.
  - Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
  - **Payload size ~406KB** (largest).

- Reports filters (`/api/reports/filters`)
  - Typical total: ~56–68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
  - Query timings: ~30–40ms.
  - Payload: ~51B.

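
For reference, summary figures like the ones above can be derived from raw duration samples as follows. This is a sketch only: the log does not say which percentile method was used, so the nearest-rank definition here is an assumption, and the sample values are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it. The choice of method is an assumption.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile() needs at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1] as number;
}

interface LatencySummary {
  avg: number;
  p50: number;
  p95: number;
  min: number;
  max: number;
}

function summarize(samples: readonly number[]): LatencySummary {
  const total = samples.reduce((sum, ms) => sum + ms, 0);
  return {
    avg: total / samples.length,
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    min: Math.min(...samples),
    max: Math.max(...samples),
  };
}

// Illustrative durations in ms (invented, not taken from the log).
const exampleMs = [170, 180, 200, 225, 16000];
console.log(summarize(exampleMs));
```

With these invented samples, the single 16s outlier pulls the average to 3355ms while p50 stays at 200ms, the same pattern as the Reports rows, where the average sits far above the median.
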
### Findings
- The dominant latency contributor is the **machines query** used by Overview and Machines endpoints.
- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
- Client end-to-end nav timing (`perf.client`) is now captured; p95 is ~67ms, slightly above the 50ms target.
- Baseline summaries should average cold and warm samples together for now.

### Data captured
- Logs are stored at `/tmp/mis-control-tower.log`.
- Events include: `perf.overview.api`, `perf.overview.getOverviewData`, `perf.machines.api`, `perf.reports.api`, `perf.reports.filters`.

Update
- Client nav timing is now captured via `/api/debug/perf` (`perf.client` events).
- API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.

---

## Phase 2: Instant feedback (UX)

### 1) Global route loading
- Add `app/(app)/loading.tsx` with a lightweight skeleton for the shell.
- Ensure each heavy route also has its own `loading.tsx` for targeted skeletons.

### 2) Sidebar pending state
- Use `useTransition` to mark a pending navigation.
- Disable repeated clicks and show a subtle spinner on the active item.
- Optional: debounce repeated clicks for 300-500ms.

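
The one-navigation-at-a-time rule can be isolated from React as a tiny state holder. This is a framework-independent sketch under assumptions: in the sidebar itself the pending flag would come from `useTransition`, and `NavigationGuard` is a hypothetical name that only illustrates the blocking logic.

```typescript
// Hypothetical guard: tracks at most one in-flight navigation.
class NavigationGuard {
  private pendingHref: string | null = null;

  // Returns true if the navigation may start; false if one is already pending,
  // in which case the click should be ignored.
  begin(href: string): boolean {
    if (this.pendingHref !== null) return false;
    this.pendingHref = href;
    return true;
  }

  // Call when the navigation has settled (success or failure).
  end(): void {
    this.pendingHref = null;
  }

  // With no argument: is anything pending? With an href: is that item the
  // pending one (useful for deciding which sidebar item shows the spinner)?
  isPending(href?: string): boolean {
    return href === undefined ? this.pendingHref !== null : this.pendingHref === href;
  }
}
```

A click handler would call `begin(href)` and return early on `false`; the item for which `isPending(href)` is true renders the spinner.
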
### 3) Suspense boundaries
- Wrap the slowest sections (events, charts, tables) in `<Suspense>` with skeletons.
- Ensure initial shell renders immediately even if data is still loading.

Deliverables
- Users always see visual feedback within a single frame.
- Double-clicks do not queue up extra navigations.

Progress
- Added route-level loading skeletons for the app shell and heavy routes.
- Sidebar uses `useTransition` with a pending spinner and blocks repeat clicks.
- Added Suspense + lazy loading for the Overview timeline and Reports charts.

---

## Phase 3: Split heavy pages (Overview + Reports)

### Overview (split)
- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
- Defer: fetch full event stream and detailed KPIs via client API after initial render.
- Use an explicit "Load more" or lazy loading for event details.

Implementation sketch
- Create a `getOverviewSummary` for the initial server render.
- Create a client fetch (`/api/overview?detail=1`) for detailed events and charts.
- Replace large data arrays with preview-sized payloads.

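
The summary/detail split above might reduce a full data set to a first-paint payload roughly like this. It is a sketch under assumptions: the row shapes, field names, and the `toOverviewSummary` helper are hypothetical and do not reflect the real `getOverviewSummary` or the Prisma models. The point is that the summary carries a count and a small preview instead of the full event array.

```typescript
// Hypothetical row shapes, for illustration only.
interface MachineRow {
  id: string;
  name: string;
  lastHeartbeatAt: number | null; // epoch ms of the latest heartbeat, if any
}

interface EventRow {
  id: string;
  machineId: string;
  ts: number; // epoch ms
  message: string;
}

interface OverviewSummary {
  machines: MachineRow[];
  eventCount: number;       // total number of events, not the events themselves
  recentEvents: EventRow[]; // preview-sized slice, newest first
}

// Build the small first-paint payload; the full event list is left for the
// deferred detail fetch.
function toOverviewSummary(
  machines: readonly MachineRow[],
  events: readonly EventRow[],
  previewSize = 5,
): OverviewSummary {
  const newestFirst = [...events].sort((a, b) => b.ts - a.ts);
  return {
    machines: [...machines],
    eventCount: events.length,
    recentEvents: newestFirst.slice(0, previewSize),
  };
}
```
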
Progress
- Overview now uses `getOverviewSummary` for first paint, and `/api/overview?detail=1` for deferred detail fetch.
- Summary responses are cached in-memory with TTL + in-flight de-dupe (`perf.overview.summary` shows cache hits).
- Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.

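
An in-memory cache with a TTL and in-flight de-duplication, as mentioned in the progress notes, could look roughly like the following. This is a sketch and not the project's cache: `TtlCache` is a hypothetical name, and keying by org id is an assumption. Concurrent callers for the same key share one pending promise, so a burst of requests triggers a single load.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch ms after which the entry is stale
}

// Hypothetical cache: one instance per kind of data, keyed by e.g. org id.
class TtlCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly inFlight = new Map<string, Promise<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, load: () => Promise<T>): Promise<T> {
    // 1) Fresh cached value: return it without touching the loader.
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return Promise.resolve(hit.value);
    }

    // 2) A load for this key is already running: join it.
    const pending = this.inFlight.get(key);
    if (pending !== undefined) return pending;

    // 3) Start a new load; store the result on success and always clear
    //    the in-flight marker so a failed load can be retried.
    const promise = load()
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

A failed load is not cached, so the next caller retries; whether that is the desired behavior for the summary endpoint is a design choice left open here.
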
### Reports (split)
- Render the report shell and filters immediately.
- Lazy-load heavy charts with `next/dynamic` and loading placeholders.
- Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
- Paginate any large tables or use virtualization.

Deliverables
- Overview/Reports initial response is fast and small.
- Deep detail loads after the UI is already visible.

---

## Phase 4: Caching + data freshness

### 1) Page-level caching
- Remove `force-dynamic` where it is not required.
- Use `revalidate` on pages that can be stale for a few seconds (Settings, Financial).

### 2) Data cache for Prisma queries
- Wrap stable fetchers in `unstable_cache` with short TTL and tags (per org).
- Add manual refresh button on Settings/Financial to bypass cache when needed.

### 3) API cache headers
- Use `ETag` and `If-None-Match` where possible.
- For logged-in data, use `private` caching with short max-age.

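
The `ETag` / `If-None-Match` handshake with `private` caching can be sketched as a pure function that decides between a full 200 and an empty 304. Everything here is an assumption for illustration: the function names, the 5-second `max-age`, and the use of a non-cryptographic FNV-1a hash (over UTF-16 code units) to derive a weak ETag. A real route handler would map the returned object onto its framework's response type.

```typescript
// 32-bit FNV-1a over UTF-16 code units; cheap and dependency-free, but not
// collision-resistant. Chosen for the sketch only.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

interface ConditionalResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null; // null for 304: the client reuses its cached copy
}

// Weak comparison: ignore the W/ prefix when matching validators.
function stripWeakPrefix(tag: string): string {
  return tag.replace(/^W\//, "");
}

function respondWithEtag(
  payload: unknown,
  ifNoneMatch: string | null,
  maxAgeSeconds = 5, // assumed value; "short max-age" is all the plan specifies
): ConditionalResponse {
  const body = JSON.stringify(payload);
  const etag = `W/"${fnv1a(body)}"`;
  const headers: Record<string, string> = {
    ETag: etag,
    "Cache-Control": `private, max-age=${maxAgeSeconds}`,
  };

  // If-None-Match may carry several comma-separated validators or "*".
  const candidates = (ifNoneMatch ?? "")
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
  const matches = candidates.some(
    (tag) => tag === "*" || stripWeakPrefix(tag) === stripWeakPrefix(etag),
  );

  return matches
    ? { status: 304, headers, body: null }
    : { status: 200, headers, body };
}
```
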
Deliverables
- Fewer full recomputes for repeated navigations.
- Settings/Financial feel instant, but still correct.

Progress
- Added session cache + throttled `lastSeenAt` updates to reduce auth overhead spikes.
- Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
- Added refresh bypass (`?refresh=1`) and a refresh button on Financial.
- Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
- Restored `force-dynamic` on the authenticated layout to avoid static render errors from `cookies()`.

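
Throttling `lastSeenAt` writes amounts to skipping the database update when the same session was touched recently. A minimal sketch, assuming a per-process map and a 60-second interval (neither is stated in the plan; `LastSeenThrottle` is a hypothetical name):

```typescript
// Hypothetical throttle: decides whether a session's lastSeenAt should be
// written to the database on this request.
class LastSeenThrottle {
  private readonly lastWrite = new Map<string, number>();
  private readonly minIntervalMs: number;

  constructor(minIntervalMs = 60000) { // assumed interval
    this.minIntervalMs = minIntervalMs;
  }

  // Returns true at most once per interval for a given session id and
  // remembers the time of the accepted write.
  shouldWrite(sessionId: string, nowMs: number = Date.now()): boolean {
    const previous = this.lastWrite.get(sessionId);
    if (previous !== undefined && nowMs - previous < this.minIntervalMs) {
      return false; // written recently: skip the UPDATE on this request
    }
    this.lastWrite.set(sessionId, nowMs);
    return true;
  }
}
```
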
---

## Phase 5: Query + payload tuning

- Reduce `select` fields to only what the UI needs on first render.
- Cap `take` sizes with clear UI controls to load more.
- Add indexes for `orgId + ts` combos used in orderBy filters.
- Consider summary tables for expensive aggregations.

Progress
- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
- Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
- Overview/Machines now log `heartbeatsQuery` + `kpiQuery` to track the new bottlenecks.

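
Once the machine fetch is split into a base query plus a separate heartbeat query, the two result sets have to be joined somewhere. The plan does not say how this is done; the sketch below assumes the join happens in application code, and the row shapes (apart from the `tsServer` field name) and the `attachLatestHeartbeat` helper are hypothetical.

```typescript
// Hypothetical row shapes for illustration.
interface Machine {
  id: string;
  name: string;
}

interface Heartbeat {
  machineId: string;
  tsServer: number; // server-side timestamp, epoch ms
  status: string;
}

interface MachineWithHeartbeat extends Machine {
  latestHeartbeat: Heartbeat | null;
}

// Pick the newest heartbeat per machine in one pass, then attach it to the
// base rows. Machines without any heartbeat get null.
function attachLatestHeartbeat(
  machines: readonly Machine[],
  heartbeats: readonly Heartbeat[],
): MachineWithHeartbeat[] {
  const latest = new Map<string, Heartbeat>();
  for (const hb of heartbeats) {
    const current = latest.get(hb.machineId);
    if (current === undefined || hb.tsServer > current.tsServer) {
      latest.set(hb.machineId, hb);
    }
  }
  return machines.map((machine) => ({
    ...machine,
    latestHeartbeat: latest.get(machine.id) ?? null,
  }));
}
```
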
---

## What helped most
- Overview split + summary cache: repeat navigations are instant and detail loads later.
- Route-level loading + pending state: immediate feedback reduced double-clicks.
- Session cache + throttled `lastSeenAt` updates: reduced non-query overhead spikes.
- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
- Query shape changes: removed nested relation ordering and shifted load to targeted queries.

## Methodology / optimization strategy
- Instrument first, measure cold + warm, and store logs.
- Use timing breakdowns to find the dominant step.
- Improve perceived performance early (skeletons, pending state).
- Split payloads into summary + deferred detail.
- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.

## Validation
- Measure navigation feedback time (click to loading UI). Goal: <50ms.
- Track p95 TTFB and payload size for Overview and Reports before/after.
- Confirm that repeated clicks no longer add latency or duplicated requests.

---

## Open opportunities
- Optimize latest KPI query (index on `orgId + machineId + tsServer` or denormalize latest KPI onto `Machine`).
- Reduce Reports payload size (trim fields, paginate, or virtualize tables).
- Consider summary tables/materialized views for heavy aggregates.

## Further implementation plan (later)
1) Latest KPI/heartbeat acceleration
- Add index for KPI lookups by server time: `@@index([orgId, machineId, tsServer])`.
- Switch KPI “latest” ordering to `tsServer` to match index.
- Optional: denormalize `latestHeartbeat` + `latestKpi` onto `Machine` and update on ingest.
- Add background backfill job for legacy machines.

2) Machines + Overview caching
- Increase summary cache TTL (30-60s) to raise hit rates.
- Add per-org cache invalidation when a heartbeat/KPI is ingested.
- Add ETag handling to `/api/machines` (similar to overview detail).

3) Reports payload trim
- Reduce fields in the `reports` response to the minimum the charts need.
- Add pagination for large tables (KPIs/cycles/scrap).
- Add “Download full dataset” endpoint separate from UI view.

4) Connection + ORM tuning
- Enable Prisma query logging to identify slow SQL.
- Evaluate connection pool size and cold-start behavior in serverless.
- Move heavy aggregates to `GROUP BY` at DB level with indexes.

5) UX refinements
- Add inline “last updated” timestamp in Overview/Reports headers.
- Show cache-hit badges when content is served from cache.
- Add optional “refresh” on the overview to re-fetch detail data.