pre-bemis

This commit is contained in:
Marcelo
2026-04-22 05:04:19 +00:00
parent ac1a7900c8
commit 80d27f83b6
91 changed files with 11769 additions and 820 deletions

235
snappy.md Normal file
View File

@@ -0,0 +1,235 @@
# Snappy UX plan (Next.js)
## Goals
- Make every navigation feel instant (<50ms feedback) via loading UI and disabled re-clicks.
- Reduce server and data latency for heavy pages (Overview, Reports).
- Keep data accurate while allowing slight staleness for Settings/Financial (seconds).
## Constraints
- Data-heavy pages with large payloads and expensive queries.
- Users click multiple times when no feedback is shown.
## Success targets
- Navigation feedback in <50ms (loading/skeleton/pending state).
- P95 server response under 300-500ms for most queries; worst cases hidden behind progressive loading.
- No multi-click queueing; one navigation at a time.
---
## Phase 1: Audit and baseline (completed)
### What was instrumented
- Server timing + payload logging on Overview, Reports, Reports Filters, Machines APIs.
- Per-step timings inside `getOverviewData` (machines query, events query, normalize/filter).
- Client nav timing hooks were added but not captured due to service env/build config.
### Baseline results (from `/tmp/mis-control-tower.log`)
- Aggregate stats (cold + warm averaged)
- Client nav (`perf.client` nav duration)
- Avg: ~38ms; p50: ~51ms; p95: ~67ms; min: ~5ms; max: ~82ms.
- Overview API (`/api/overview`) total
- Avg: ~3.07s; p50: ~1.73s; p95: ~8.61s; min: ~1.20s; max: ~21.54s.
- `getOverviewData` total
- Avg: ~1.29s; p50: ~1.26s; p95: ~1.35s; min: ~1.15s; max: ~2.41s.
- Machines query (inside Overview)
- Avg: ~1.27s; p50: ~1.25s; p95: ~1.33s; min: ~1.13s; max: ~2.38s.
- Machines API (`/api/machines`) total
- Avg: ~1.26s; p50: ~1.25s; p95: ~1.36s; min: ~1.13s; max: ~1.52s.
- Reports API (`/api/reports`) total
- Avg: ~3.81s; p50: ~468ms; p95: ~18.14s; min: ~168ms; max: ~26.56s.
- Reports filters (`/api/reports/filters`) total
- Avg: ~4.07s; p50: ~367ms; p95: ~16.61s; min: ~57ms; max: ~23.78s.
- Reports payload size
- Avg: ~406KB; p50: ~406KB; p95: ~407KB.
- Overview (`/api/overview`)
- Total: ~1.32.5s across samples (best ~1.2s, spikes up to ~2.5s).
- `getOverviewData` total: ~1.151.36s typically; one sample ~2.4s.
- **Machines query dominates**: ~1.121.33s (primary bottleneck).
- Events query: ~535ms (minor).
- Payload: ~13KB.
- Machines (`/api/machines`)
- Total: ~1.151.33s per call for 3 machines.
- **Machines query dominates**: ~1.151.33s.
- Payload: ~1.6KB.
- Reports (`/api/reports`)
- Typical total: ~170225ms (later runs), earlier spikes up to ~16s (pre-fix or cold).
- Query timings combined: ~130200ms.
- Row counts: ~1.8k KPI rows, ~6.2k cycles, ~736 events.
- **Payload size ~406KB** (largest).
- Reports filters (`/api/reports/filters`)
- Typical total: ~5668ms (later runs), earlier spikes up to ~23s (pre-fix or cold).
- Query timings: ~3040ms.
- Payload: ~51B.
### Findings
- The dominant latency contributor is the **machines query** used by Overview and Machines endpoints.
- Reports payload is large (~406KB), which impacts UI responsiveness even when queries are moderate.
- Large outliers (multi-second totals) likely come from non-query overhead (session lookup, DB connection wait, or cold start); these need targeted checks.
- Reports and reports filters show totals that are far larger than the summed query timings, confirming significant overhead outside the measured DB queries.
- Client end-to-end nav timing (`perf.client`) is now captured; p95 is ~67ms, slightly above the 50ms target.
- Baseline summaries should average cold and warm samples together for now.
### Data captured
- Logs are stored at `/tmp/mis-control-tower.log`.
- Events include: `perf.overview.api`, `perf.overview.getOverviewData`, `perf.machines.api`, `perf.reports.api`, `perf.reports.filters`.
Update
- Client nav timing is now captured via `/api/debug/perf` (`perf.client` events).
- API timings now include auth/preQuery/postQuery with coldStart/uptimeMs when enabled.
---
## Phase 2: Instant feedback (UX)
### 1) Global route loading
- Add `app/(app)/loading.tsx` with a lightweight skeleton for the shell.
- Ensure each heavy route also has its own `loading.tsx` for targeted skeletons.
### 2) Sidebar pending state
- Use `useTransition` to mark a pending navigation.
- Disable repeated clicks and show a subtle spinner on the active item.
- Optional: debounce repeated clicks for 300-500ms.
### 3) Suspense boundaries
- Wrap the slowest sections (events, charts, tables) in `<Suspense>` with skeletons.
- Ensure initial shell renders immediately even if data is still loading.
Deliverables
- Users always see visual feedback within a single frame.
- Double-clicks do not queue up extra navigations.
Progress
- Added route-level loading skeletons for the app shell and heavy routes.
- Sidebar uses `useTransition` with a pending spinner and blocks repeat clicks.
- Added Suspense + lazy loading for the Overview timeline and Reports charts.
---
## Phase 3: Split heavy pages (Overview + Reports)
### Overview (split)
- First paint: show lightweight summary data (machines list + latest heartbeat + tiny event count).
- Defer: fetch full event stream and detailed KPIs via client API after initial render.
- Use an explicit "Load more" or lazy loading for event details.
Implementation sketch
- Create a `getOverviewSummary` for the initial server render.
- Create a client fetch (`/api/overview?detail=1`) for detailed events and charts.
- Replace large data arrays with preview-sized payloads.
Progress
- Overview now uses `getOverviewSummary` for first paint, and `/api/overview?detail=1` for deferred detail fetch.
- Summary responses are cached in-memory with TTL + in-flight de-dupe (`perf.overview.summary` shows cache hits).
- Reports charts are lazy-loaded with placeholders; heavy chart blocks render after the shell.
### Reports (split)
- Render the report shell and filters immediately.
- Lazy-load heavy charts with `next/dynamic` and loading placeholders.
- Fetch chart data on demand (per chart or on viewport with IntersectionObserver).
- Paginate any large tables or use virtualization.
Deliverables
- Overview/Reports initial response is fast and small.
- Deep detail loads after the UI is already visible.
---
## Phase 4: Caching + data freshness
### 1) Page-level caching
- Remove `force-dynamic` where it is not required.
- Use `revalidate` on pages that can be stale for a few seconds (Settings, Financial).
### 2) Data cache for Prisma queries
- Wrap stable fetchers in `unstable_cache` with short TTL and tags (per org).
- Add manual refresh button on Settings/Financial to bypass cache when needed.
### 3) API cache headers
- Use `ETag` and `If-None-Match` where possible.
- For logged-in data, use `private` caching with short max-age.
Deliverables
- Fewer full recomputes for repeated navigations.
- Settings/Financial feel instant, but still correct.
Progress
- Added session cache + throttled `lastSeenAt` updates to reduce auth overhead spikes.
- Added cached GETs with short TTL + per-org tags for Settings + Financial config/impact.
- Added refresh bypass (`?refresh=1`) and a refresh button on Financial.
- Added ETag + private cache headers for Settings + Financial config, plus private cache headers for Financial impact.
- Restored `force-dynamic` on the authenticated layout to avoid static render errors from `cookies()`.
---
## Phase 5: Query + payload tuning
- Reduce `select` fields to only what the UI needs on first render.
- Cap `take` sizes with clear UI controls to load more.
- Add indexes for `orgId + ts` combos used in orderBy filters.
- Consider summary tables for expensive aggregations.
Progress
- Split machine fetch into base + latest heartbeat/KPI queries to avoid nested relation orderBy/take on large tables.
- Added indexes for heartbeat tsServer lookup and machine ordering by orgId + createdAt.
- Machines base query dropped to low ms; new hotspots are latest heartbeat (~250-300ms) and latest KPI (~800-900ms).
- Overview/Machines now log `heartbeatsQuery` + `kpiQuery` to track the new bottlenecks.
---
## What helped most
- Overview split + summary cache: repeat navigations are instant and detail loads later.
- Route-level loading + pending state: immediate feedback reduced double-clicks.
- Session cache + throttled lastSeen: reduced non-query overhead spikes.
- Short TTL caches with refresh bypass: Settings/Financial feel instant without losing correctness.
- Query shape changes: removed nested relation ordering and shifted load to targeted queries.
## Methodology / optimization strategy
- Instrument first, measure cold + warm, and store logs.
- Use timing breakdowns to find the dominant step.
- Improve perceived performance early (skeletons, pending state).
- Split payloads into summary + deferred detail.
- Cache low-risk data with short TTL + refresh bypass and ETag for 304s.
- Tune queries with smaller selects, indexes, and safer query shapes; consider denormalizing if needed.
## Validation
- Measure navigation feedback time (click to loading UI). Goal: <50ms.
- Track p95 TTFB and payload size for Overview and Reports before/after.
- Confirm that repeated clicks no longer add latency or duplicated requests.
---
## Open opportunities
- Optimize latest KPI query (index on `orgId + machineId + tsServer` or denormalize latest KPI onto `Machine`).
- Reduce Reports payload size (trim fields, paginate, or virtualize tables).
- Consider summary tables/materialized views for heavy aggregates.
## Further implementation plan (later)
1) Latest KPI/heartbeat acceleration
- Add index for KPI lookups by server time: `@@index([orgId, machineId, tsServer])`.
- Switch KPI “latest” ordering to `tsServer` to match index.
- Optional: denormalize `latestHeartbeat` + `latestKpi` onto `Machine` and update on ingest.
- Add background backfill job for legacy machines.
2) Machines + Overview caching
- Increase summary cache TTL (30-60s) to raise hit rates.
- Add per-org cache invalidation when a heartbeat/KPI ingests.
- Add ETag handling to `/api/machines` (similar to overview detail).
3) Reports payload trim
- Reduce fields in `reports` response to the chart/minimum.
- Add pagination for large tables (KPIs/cycles/scrap).
- Add “Download full dataset” endpoint separate from UI view.
4) Connection + ORM tuning
- Enable Prisma query logging to identify slow SQL.
- Evaluate connection pool size and cold-start behavior in serverless.
- Move heavy aggregates to `GROUP BY` at DB level with indexes.
5) UX refinements
- Add inline “last updated” timestamp in Overview/Reports headers.
- Show cache-hit badges when content is served from cache.
- Add optional “refresh” on the overview to re-fetch detail data.