10 KiB
10 KiB
Snappy UX plan (Next.js)
Goals
- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
- Reduce server and data latency for heavy pages (Overview, Reports).
- Keep data accurate while allowing slight staleness for Settings/Financial (seconds).
Constraints
- Data-heavy pages with large payloads and expensive queries.
- Users click multiple times when no feedback is shown.
Success targets
- Navigation feedback in <50ms (loading/skeleton/pending state).
- P95 server response under 300-500ms for most queries; worst cases hidden behind progressive loading.
- No multi-click queueing; one navigation at a time.
Phase 1: Audit and baseline (completed)
What was instrumented
- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
- Per-step timings inside
getOverviewData(machines query, events query, normalize/filter). - Client nav timing hooks were added but not captured due to service env/build config.
Baseline results (from /tmp/mis-control-tower.log)
-
Aggregate stats (cold + warm averaged)
- Client nav (
perf.clientnav duration)- Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
- Overview API (
/api/overview) total- Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
getOverviewDatatotal- Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
- Machines query (inside Overview)
- Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
- Machines API (
/api/machines) total- Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
- Reports API (
/api/reports) total- Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
- Reports filters (
/api/reports/filters) total- Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
- Reports payload size
- Avg: ~406KB; p50: ~406KB; p95: ~407KB.
- Client nav (
-
Overview (
/api/overview)- Total: ~1.3–2.5s across samples (best ~1.2s, spikes up to ~2.5s).
getOverviewDatatotal: ~1.15–1.36s typically; one sample ~2.4s.- Machines query dominates: ~1.12–1.33s (primary bottleneck).
- Events query: ~5–35ms (minor).
- Payload: ~13KB.
-
Machines (
/api/machines)- Total: ~1.15–1.33s per call for 3 machines.
- Machines query dominates: ~1.15–1.33s.
- Payload: ~1.6KB.
-
Reports (
/api/reports)- Typical total: ~170–225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
- Query timings combined: ~130–200ms.
- Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
- Payload size ~406KB (largest).
-
Reports filters (
/api/reports/filters)- Typical total: ~56–68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
- Query timings: ~30–40ms.
- Payload: ~51B.
Findings
- The dominant latency contributor is the machines query used by Overview and Machines endpoints.
- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
- Client end-to-end nav timing (
perf.client) is now captured; p95 is ~67ms, slightly above the 50ms target. - Baseline summaries should average cold and warm samples together for now.
Data captured
- Logs are stored at
/tmp/mis-control-tower.log. - Events include:
perf.overview.api,perf.overview.getOverviewData,perf.machines.api,perf.reports.api,perf.reports.filters.
Update
- Client nav timing is now captured via
/api/debug/perf(perf.clientevents). - API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.
Phase 2: Instant feedback (UX)
1) Global route loading
- Add
app/(app)/loading.tsxwith a lightweight skeleton for the shell. - Ensure each heavy route also has its own
loading.tsxfor targeted skeletons.
2) Sidebar pending state
- Use
useTransitionto mark a pending navigation. - Disable repeated clicks and show a subtle spinner on the active item.
- Optional: debounce repeated clicks for 300-500ms.
3) Suspense boundaries
- Wrap the slowest sections (events, charts, tables) in
<Suspense>with skeletons. - Ensure initial shell renders immediately even if data is still loading.
Deliverables
- Users always see visual feedback within a single frame.
- Double-clicks do not queue up extra navigations.
Progress
- Added route-level loading skeletons for the app shell and heavy routes.
- Sidebar uses
useTransitionwith a pending spinner and blocks repeat clicks. - Added Suspense + lazy loading for the Overview timeline and Reports charts.
Phase 3: Split heavy pages (Overview + Reports)
Overview (split)
- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
- Defer: fetch full event stream and detailed KPIs via client API after initial render.
- Use an explicit "Load more" or lazy loading for event details.
Implementation sketch
- Create a
getOverviewSummaryfor the initial server render. - Create a client fetch (
/api/overview?detail=1) for detailed events and charts. - Replace large data arrays with preview-sized payloads.
Progress
- Overview now uses
getOverviewSummaryfor first paint, and/api/overview?detail=1for deferred detail fetch. - Summary responses are cached in-memory with TTL + in-flight de-dupe (
perf.overview.summaryshows cache hits). - Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
Reports (split)
- Render the report shell and filters immediately.
- Lazy-load heavy charts with
next/dynamicand loading placeholders. - Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
- Paginate any large tables or use virtualization.
Deliverables
- Overview/Reports initial response is fast and small.
- Deep detail loads after the UI is already visible.
Phase 4: Caching + data freshness
1) Page-level caching
- Remove
force-dynamicwhere it is not required. - Use
revalidateon pages that can be stale for a few seconds (Settings, Financial).
2) Data cache for Prisma queries
- Wrap stable fetchers in
unstable_cachewith short TTL and tags (per org). - Add manual refresh button on Settings/Financial to bypass cache when needed.
3) API cache headers
- Use
ETagandIf-None-Matchwhere possible. - For logged-in data, use
privatecaching with short max-age.
Deliverables
- Fewer full recomputes for repeated navigations.
- Settings/Financial feel instant, but still correct.
Progress
- Added session cache + throttled
lastSeenAtupdates to reduce auth overhead spikes. - Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
- Added refresh bypass (
?refresh=1) and a refresh button on Financial. - Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
- Restored
force-dynamicon the authenticated layout to avoid static render errors fromcookies().
Phase 5: Query + payload tuning
- Reduce
selectfields to only what the UI needs on first render. - Cap
takesizes with clear UI controls to load more. - Add indexes for
orgId + tscombos used in orderBy filters. - Consider summary tables for expensive aggregations.
Progress
- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
- Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
- Overview/Machines now log
heartbeatsQuery+kpiQueryto track the new bottlenecks.
What helped most
- Overview split + summary cache: repeat navigations are instant and detail loads later.
- Route-level loading + pending state: immediate feedback reduced double-clicks.
- Session cache + throttled lastSeen: reduced non-query overhead spikes.
- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
- Query shape changes: removed nested relation ordering and shifted load to targeted queries.
Methodology / optimization strategy
- Instrument first, measure cold + warm, and store logs.
- Use timing breakdowns to find the dominant step.
- Improve perceived performance early (skeletons, pending state).
- Split payloads into summary + deferred detail.
- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.
Validation
- Measure navigation feedback time (click to loading UI). Goal: <50ms.
- Track p95 TTFB and payload size for Overview and Reports before/after.
- Confirm that repeated clicks no longer add latency or duplicated requests.
Open opportunities
- Optimize latest KPI query (index on
orgId + machineId + tsServeror denormalize latest KPI ontoMachine). - Reduce Reports payload size (trim fields, paginate, or virtualize tables).
- Consider summary tables/materialized views for heavy aggregates.
Further implementation plan (later)
-
Latest KPI/heartbeat acceleration
- Add index for KPI lookups by server time:
@@index([orgId, machineId, tsServer]). - Switch KPI “latest” ordering to
tsServerto match index. - Optional: denormalize
latestHeartbeat+latestKpiontoMachineand update on ingest. - Add background backfill job for legacy machines.
- Add index for KPI lookups by server time:
-
Machines + Overview caching
- Increase summary cache TTL (30-60s) to raise hit rates.
- Add per-org cache invalidation when a heartbeat/KPI ingests.
- Add ETag handling to
/api/machines(similar to overview detail).
-
Reports payload trim
- Reduce fields in
reportsresponse to the chart/minimum. - Add pagination for large tables (KPIs/cycles/scrap).
- Add “Download full dataset” endpoint separate from UI view.
- Reduce fields in
-
Connection + ORM tuning
- Enable Prisma query logging to identify slow SQL.
- Evaluate connection pool size and cold-start behavior in serverless.
- Move heavy aggregates to
GROUP BYat DB level with indexes.
-
UX refinements
- Add inline “last updated” timestamp in Overview/Reports headers.
- Show cache-hit badges when content is served from cache.
- Add optional “refresh” on the overview to re-fetch detail data.