MIS-Contro-Tower/snappy.md
2026-04-22 05:04:19 +00:00

Snappy UX plan (Next.js)

Goals

  • Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
  • Reduce server and data latency for heavy pages (Overview, Reports).
  • Keep data accurate while allowing slight staleness for Settings/Financial (seconds).

Constraints

  • Data-heavy pages with large payloads and expensive queries.
  • Users click multiple times when no feedback is shown.

Success targets

  • Navigation feedback in <50ms (loading/skeleton/pending state).
  • P95 server response within 300-500ms for most queries; worst cases hidden behind progressive loading.
  • No multi-click queueing; one navigation at a time.

Phase 1: Audit and baseline (completed)

What was instrumented

  • Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
  • Per-step timings inside getOverviewData (machines query, events query, normalize/filter).
  • Client nav timing hooks were added but initially not captured due to service env/build config (since resolved; see Update).

Baseline results (from /tmp/mis-control-tower.log)

  • Aggregate stats (cold + warm averaged)

    • Client nav (perf.client nav duration)
      • Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
    • Overview API (/api/overview) total
      • Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
    • getOverviewData total
      • Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
    • Machines query (inside Overview)
      • Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
    • Machines API (/api/machines) total
      • Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
    • Reports API (/api/reports) total
      • Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
    • Reports filters (/api/reports/filters) total
      • Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
    • Reports payload size
      • Avg: ~406KB; p50: ~406KB; p95: ~407KB.
  • Overview (/api/overview)

    • Total: ~1.3-2.5s across samples (best ~1.2s, spikes up to ~2.5s).
    • getOverviewData total: ~1.15-1.36s typically; one sample ~2.4s.
    • Machines query dominates: ~1.12-1.33s (primary bottleneck).
    • Events query: ~5-35ms (minor).
    • Payload: ~13KB.
  • Machines (/api/machines)

    • Total: ~1.15-1.33s per call for 3 machines.
    • Machines query dominates: ~1.15-1.33s.
    • Payload: ~1.6KB.
  • Reports (/api/reports)

    • Typical total: ~170-225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
    • Query timings combined: ~130-200ms.
    • Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
    • Payload size ~406KB (largest).
  • Reports filters (/api/reports/filters)

    • Typical total: ~56-68ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
    • Query timings: ~30-40ms.
    • Payload: ~51B.
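
The aggregate figures above (avg, p50, p95, min, max) can be reproduced from raw duration samples with a small helper. The sketch below is illustrative only: it assumes the nearest-rank percentile method, which may differ from whatever the actual log analysis used, and the sample values are made up rather than taken from /tmp/mis-control-tower.log.

```typescript
// Summarize duration samples (ms) into avg / p50 / p95 / min / max.
// Assumption: nearest-rank percentiles; other tools may interpolate instead.
interface Summary {
  count: number;
  avg: number;
  p50: number;
  p95: number;
  min: number;
  max: number;
}

// Expects an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(samples: number[]): Summary {
  if (samples.length === 0) {
    throw new Error("summarize() needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const total = sorted.reduce((sum, v) => sum + v, 0);
  return {
    count: sorted.length,
    avg: total / sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

// Hypothetical samples: four warm requests and one cold spike.
const demoSummary = summarize([120, 130, 125, 140, 900]);
// avg is 283 while p50 is 130: a single slow sample pulls the average up.
console.log(demoSummary);
```

This is why the Reports average (~3.81s) sits so far above its p50 (~468ms): a few cold or pre-fix outliers dominate the mean, so p50 and p95 are the more useful numbers to track.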

Findings

  • The dominant latency contributor is the machines query used by Overview and Machines endpoints.
  • Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
  • Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
  • Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
  • Client end-to-end nav timing (perf.client) is now captured; p95 is ~67ms, slightly above the 50ms target.
  • Baseline summaries should average cold and warm samples together for now.

Data captured

  • Logs are stored at /tmp/mis-control-tower.log.
  • Events include: perf.overview.api, perf.overview.getOverviewData, perf.machines.api, perf.reports.api, perf.reports.filters.

Update

  • Client nav timing is now captured via /api/debug/perf (perf.client events).
  • API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.
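
A per-phase timing breakdown like the one described can be collected with a small timer object. This is a minimal sketch, not the project's actual instrumentation: `createRequestTimer`, `mark`, and `finish` are hypothetical names, and only the phase and field names (auth, preQuery, postQuery, coldStart) come from the notes above.

```typescript
// Hypothetical helper: records how long each named phase of a request took.
// The clock is injectable so the helper can be checked deterministically.
function createRequestTimer(now: () => number = Date.now) {
  const startedAt = now();
  let lastMark = startedAt;
  const phases: Record<string, number> = {};

  return {
    // Close the current phase: elapsed time since the previous mark (or creation).
    mark(phase: string): void {
      const t = now();
      phases[phase] = t - lastMark;
      lastMark = t;
    },
    // Build one flat log record with the total and every recorded phase.
    finish(event: string, coldStart: boolean): Record<string, string | number | boolean> {
      return { event, coldStart, totalMs: now() - startedAt, ...phases };
    },
  };
}

// Usage inside a route handler (the work between marks is elided here).
const demoTimer = createRequestTimer();
demoTimer.mark("auth");      // after the session lookup
demoTimer.mark("preQuery");  // after parameter parsing and setup
demoTimer.mark("query");     // after the DB queries return
demoTimer.mark("postQuery"); // after normalizing and serializing
console.log(JSON.stringify(demoTimer.finish("perf.overview.api", false)));
```

Comparing `totalMs` with the `query` phase makes the non-query overhead noted in the findings directly visible in each log line.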

Phase 2: Instant feedback (UX)

1) Global route loading

  • Add app/(app)/loading.tsx with a lightweight skeleton for the shell.
  • Ensure each heavy route also has its own loading.tsx for targeted skeletons.

2) Sidebar pending state

  • Use useTransition to mark a pending navigation.
  • Disable repeated clicks and show a subtle spinner on the active item.
  • Optional: debounce repeated clicks for 300-500ms.
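
The re-click guard itself is independent of React. The sketch below shows only that logic, under stated assumptions: `createNavigationGuard` is a hypothetical helper, and `navigate` stands in for whatever starts the route change (in the sidebar this would be the router call wrapped by useTransition).

```typescript
// Framework-agnostic sketch: allow one navigation at a time, ignore re-clicks.
function createNavigationGuard(navigate: (href: string) => Promise<void>) {
  let pendingHref: string | null = null;

  return {
    // Href currently being navigated to, or null when idle (drives the spinner).
    get pending(): string | null {
      return pendingHref;
    },
    // Resolves to true once the navigation has run, or false if it was ignored.
    async go(href: string): Promise<boolean> {
      if (pendingHref !== null) return false; // a navigation is already running
      pendingHref = href;
      try {
        await navigate(href);
      } finally {
        pendingHref = null; // re-enable clicks even if navigation failed
      }
      return true;
    },
  };
}

// Usage: the second click arrives while the first is still pending.
const demoGuard = createNavigationGuard(async (href) => {
  console.log("navigating to", href);
});
void demoGuard.go("/reports");
void demoGuard.go("/overview"); // ignored while /reports is pending
```

Dropping the extra click, rather than queueing it, is what keeps the "one navigation at a time" target; a debounce would only delay the duplicate.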

3) Suspense boundaries

  • Wrap the slowest sections (events, charts, tables) in <Suspense> with skeletons.
  • Ensure initial shell renders immediately even if data is still loading.

Deliverables

  • Users always see visual feedback within a single frame.
  • Double-clicks do not queue up extra navigations.

Progress

  • Added route-level loading skeletons for the app shell and heavy routes.
  • Sidebar uses useTransition with a pending spinner and blocks repeat clicks.
  • Added Suspense + lazy loading for the Overview timeline and Reports charts.

Phase 3: Split heavy pages (Overview + Reports)

Overview (split)

  • First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
  • Defer: fetch full event stream and detailed KPIs via client API after initial render.
  • Use an explicit "Load more" or lazy loading for event details.

Implementation sketch

  • Create a getOverviewSummary for the initial server render.
  • Create a client fetch (/api/overview?detail=1) for detailed events and charts.
  • Replace large data arrays with preview-sized payloads.

Progress

  • Overview now uses getOverviewSummary for first paint, and /api/overview?detail=1 for deferred detail fetch.
  • Summary responses are cached in-memory with TTL + in-flight de-dupe (perf.overview.summary shows cache hits).
  • Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
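
The summary cache combines two ideas: a TTL so entries expire, and in-flight de-duplication so concurrent requests for the same key share one load. A minimal sketch of that pattern follows; `createTtlCache` and its signature are assumptions for illustration, not the project's actual cache.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Hypothetical in-memory cache with TTL and in-flight de-duplication.
function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, CacheEntry<T>>();
  const inFlight = new Map<string, Promise<T>>();

  return async function get(
    key: string,
    load: () => Promise<T>,
  ): Promise<{ value: T; hit: boolean }> {
    const cached = entries.get(key);
    if (cached && cached.expiresAt > now()) {
      return { value: cached.value, hit: true };
    }
    // Join a load already running for this key instead of starting another.
    let pending = inFlight.get(key);
    if (!pending) {
      pending = load()
        .then((value) => {
          entries.set(key, { value, expiresAt: now() + ttlMs });
          return value;
        })
        .finally(() => {
          inFlight.delete(key); // cleared on success and on failure
        });
      inFlight.set(key, pending);
    }
    return { value: await pending, hit: false };
  };
}

// Usage: two concurrent requests for the same org trigger a single load.
const demoGetSummary = createTtlCache<{ machines: number }>(10000);
const demoLoadSummary = async () => ({ machines: 3 }); // stands in for getOverviewSummary
void Promise.all([
  demoGetSummary("org-1", demoLoadSummary),
  demoGetSummary("org-1", demoLoadSummary), // joins the first load
]).then(([a, b]) => console.log(a.value, b.value));
```

Keying by org keeps one tenant's summary from being served to another, and a failed load is never cached because only the success path writes an entry.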

Reports (split)

  • Render the report shell and filters immediately.
  • Lazy-load heavy charts with next/dynamic and loading placeholders.
  • Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
  • Paginate any large tables or use virtualization.
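
Pagination for the large report tables can follow a simple offset-cursor contract. The sketch below pages an in-memory array purely to illustrate that contract; in practice the same `cursor`/`take` values would be pushed down into the database query. `paginate`, `ReportPage`, and `MAX_TAKE` are hypothetical names, and the cap of 200 is an arbitrary example value.

```typescript
interface ReportPage<T> {
  items: T[];
  nextCursor: number | null; // null when there are no more rows
}

const MAX_TAKE = 200; // hypothetical upper bound on page size

function paginate<T>(rows: T[], cursor = 0, take = 50): ReportPage<T> {
  // Clamp the page size so a client cannot request the whole table at once.
  const size = Math.min(Math.max(1, Math.floor(take)), MAX_TAKE);
  const start = Math.max(0, Math.floor(cursor));
  const items = rows.slice(start, start + size);
  const end = start + items.length;
  return { items, nextCursor: end < rows.length ? end : null };
}

// Usage: walk a hypothetical 120-row table in pages of 50.
const demoRows = Array.from({ length: 120 }, (_, i) => ({ row: i }));
let demoCursor: number | null = 0;
while (demoCursor !== null) {
  const page: ReportPage<{ row: number }> = paginate(demoRows, demoCursor, 50);
  console.log(`loaded ${page.items.length} rows`);
  demoCursor = page.nextCursor;
}
```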

Deliverables

  • Overview/Reports initial response is fast and small.
  • Deep detail loads after the UI is already visible.

Phase 4: Caching + data freshness

1) Page-level caching

  • Remove force-dynamic where it is not required.
  • Use revalidate on pages that can be stale for a few seconds (Settings, Financial).

2) Data cache for Prisma queries

  • Wrap stable fetchers in unstable_cache with short TTL and tags (per org).
  • Add manual refresh button on Settings/Financial to bypass cache when needed.

3) API cache headers

  • Use ETag and If-None-Match where possible.
  • For logged-in data, use private caching with short max-age.
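
The ETag / If-None-Match flow reduces to: hash the response body, compare it with the validator the client sent, and answer 304 with no body on a match. The sketch below keeps that decision free of framework types. It is an illustration under assumptions: `respondWithEtag` is a hypothetical helper, the FNV-1a hash is used only to avoid dependencies (a cryptographic hash would be a more typical choice), and the 5-second max-age is an example value.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; chosen only to stay dependency-free.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

interface EtagResult {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

function respondWithEtag(
  payload: unknown,
  ifNoneMatch: string | null,
  maxAgeSeconds = 5,
): EtagResult {
  const body = JSON.stringify(payload);
  const etag = `"${fnv1a(body)}"`;
  const headers: Record<string, string> = {
    ETag: etag,
    // "private" keeps per-user data out of shared caches.
    "Cache-Control": `private, max-age=${maxAgeSeconds}`,
  };
  // If-None-Match may list several validators, weak ones (W/"...") or "*".
  const candidates = (ifNoneMatch ?? "")
    .split(",")
    .map((value) => value.trim().replace(/^W\//, ""));
  if (candidates.includes(etag) || candidates.includes("*")) {
    return { status: 304, headers, body: null };
  }
  return { status: 200, headers: { ...headers, "Content-Type": "application/json" }, body };
}

// Usage: the second request echoes the ETag and gets an empty 304.
const demoFirst = respondWithEtag({ currency: "USD" }, null);
const demoRepeat = respondWithEtag({ currency: "USD" }, demoFirst.headers["ETag"] ?? null);
console.log(demoFirst.status, demoRepeat.status); // 200 304
```

Note that a 304 still requires computing the payload on the server; the saving is in transfer and client parsing, which matters most for large responses.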

Deliverables

  • Fewer full recomputes for repeated navigations.
  • Settings/Financial feel instant, but still correct.

Progress

  • Added session cache + throttled lastSeenAt updates to reduce auth overhead spikes.
  • Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
  • Added refresh bypass (?refresh=1) and a refresh button on Financial.
  • Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
  • Restored force-dynamic on the authenticated layout to avoid static render errors from cookies().

Phase 5: Query + payload tuning

  • Reduce select fields to only what the UI needs on first render.
  • Cap take sizes with clear UI controls to load more.
  • Add indexes for orgId + ts combos used in orderBy filters.
  • Consider summary tables for expensive aggregations.

Progress

  • Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
  • Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
  • Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
  • Overview/Machines now log heartbeatsQuery + kpiQuery to track the new bottlenecks.
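
Once the machine fetch is split into separate queries, the results have to be stitched back together in application code. The sketch below shows one way to do that merge. The row shapes are assumptions: `machineId` and `tsServer` appear in these notes, while `name`, `status`, and the helper names are hypothetical.

```typescript
interface MachineBase {
  id: string;
  name: string; // hypothetical field
}

interface HeartbeatRow {
  machineId: string;
  tsServer: number; // server timestamp in ms
  status: string;   // hypothetical field
}

// Keep only the newest row per machine, by server timestamp.
function latestByMachine<T extends { machineId: string; tsServer: number }>(
  rows: T[],
): Map<string, T> {
  const latest = new Map<string, T>();
  for (const row of rows) {
    const current = latest.get(row.machineId);
    if (!current || row.tsServer > current.tsServer) {
      latest.set(row.machineId, row);
    }
  }
  return latest;
}

// Attach the latest heartbeat (or null) to each machine from the base query.
function attachLatestHeartbeat(machines: MachineBase[], heartbeats: HeartbeatRow[]) {
  const latest = latestByMachine(heartbeats);
  return machines.map((machine) => ({
    ...machine,
    latestHeartbeat: latest.get(machine.id) ?? null,
  }));
}

// Usage with made-up rows: m2 has no heartbeat yet.
const demoMachines = attachLatestHeartbeat(
  [
    { id: "m1", name: "Press A" },
    { id: "m2", name: "Press B" },
  ],
  [
    { machineId: "m1", tsServer: 1000, status: "idle" },
    { machineId: "m1", tsServer: 2000, status: "running" },
  ],
);
console.log(demoMachines);
```

The same `latestByMachine` helper works for KPI rows, since it only depends on `machineId` and `tsServer`.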

What helped most

  • Overview split + summary cache: repeat navigations are instant and detail loads later.
  • Route-level loading + pending state: immediate feedback reduced double-clicks.
  • Session cache + throttled lastSeenAt updates: reduced non-query overhead spikes.
  • Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
  • Query shape changes: removed nested relation ordering and shifted load to targeted queries.

Methodology / optimization strategy

  • Instrument first, measure cold + warm, and store logs.
  • Use timing breakdowns to find the dominant step.
  • Improve perceived performance early (skeletons, pending state).
  • Split payloads into summary + deferred detail.
  • Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
  • Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.

Validation

  • Measure navigation feedback time (click to loading UI). Goal: <50ms.
  • Track p95 TTFB and payload size for Overview and Reports before/after.
  • Confirm that repeated clicks no longer add latency or duplicated requests.

Open opportunities

  • Optimize latest KPI query (index on orgId + machineId + tsServer or denormalize latest KPI onto Machine).
  • Reduce Reports payload size (trim fields, paginate, or virtualize tables).
  • Consider summary tables/materialized views for heavy aggregates.

Further implementation plan (later)

  1. Latest KPI/heartbeat acceleration

    • Add index for KPI lookups by server time: @@index([orgId, machineId, tsServer]).
    • Switch KPI “latest” ordering to tsServer to match index.
    • Optional: denormalize latestHeartbeat + latestKpi onto Machine and update on ingest.
    • Add background backfill job for legacy machines.
  2. Machines + Overview caching

    • Increase summary cache TTL (30-60s) to raise hit rates.
    • Add per-org cache invalidation when a heartbeat/KPI ingests.
    • Add ETag handling to /api/machines (similar to overview detail).
  3. Reports payload trim

    • Reduce fields in the Reports response to the minimum the charts need.
    • Add pagination for large tables (KPIs/cycles/scrap).
    • Add “Download full dataset” endpoint separate from UI view.
  4. Connection + ORM tuning

    • Enable Prisma query logging to identify slow SQL.
    • Evaluate connection pool size and cold-start behavior in serverless.
    • Move heavy aggregates to GROUP BY at DB level with indexes.
  5. UX refinements

    • Add inline “last updated” timestamp in Overview/Reports headers.
    • Show cache-hit badges when content is served from cache.
    • Add optional “refresh” on the overview to re-fetch detail data.