updated UI

2026-04-17 15:21:27 +10:00
parent 7848557002
commit 98e92a8264
14 changed files with 1566 additions and 855 deletions
@@ -0,0 +1,330 @@
+# Inventory Capture and Aggregation Optimization Plan
+
+## Summary
+Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning `inventory_hourly_*` tables and regenerating reports inline.
+
+This plan is intended to be implementation-ready for a `codex-5.3` execution pass.
+
+Execution-path decision:
+- For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
+- This decision is driven first by readability and second by current performance; it is not a claim that Go is inherently faster than a well-designed SQL implementation.
+- The SQL path is retained for compatibility, backfill, and fallback.
+- SQL remains a future optimization candidate on canonical Postgres tables.
+- SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.
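+
+A minimal Go sketch of this decision follows. `JobKind`, `AggPath`, `selectPath`, and the `sqlRolloutEnabled` flag are hypothetical names, not existing code. Automatic fallback, for example when canonical data is incomplete, is left out.
+
+```go
+package main
+
+import "fmt"
+
+// JobKind distinguishes scheduled runs from operator-initiated ones (hypothetical).
+type JobKind int
+
+const (
+	Scheduled JobKind = iota
+	ManualRepair
+	Backfill
+)
+
+// AggPath names the aggregation implementation to run.
+type AggPath string
+
+const (
+	GoPath  AggPath = "go"
+	SQLPath AggPath = "sql"
+)
+
+// selectPath keeps Go as the scheduled default. Scheduled runs switch to SQL only
+// when a hypothetical rollout flag is enabled after benchmarks justify it; manual
+// repair and backfill runs may request the SQL path explicitly.
+func selectPath(kind JobKind, requestSQL, sqlRolloutEnabled bool) AggPath {
+	if kind == Scheduled {
+		if sqlRolloutEnabled {
+			return SQLPath
+		}
+		return GoPath
+	}
+	if requestSQL {
+		return SQLPath
+	}
+	return GoPath
+}
+
+func main() {
+	fmt.Println(selectPath(Scheduled, true, false)) // go: scheduled runs ignore ad hoc SQL requests
+	fmt.Println(selectPath(Backfill, true, false))  // sql
+}
+```
+
+Keeping the choice in one function means a later SQL promotion becomes a single flag change, not an edit scattered across several jobs.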
+
+The target architecture is:
+1. `vm_hourly_stats` is the canonical hourly fact store.
+2. `vm_daily_rollup` is the canonical monthly input.
+3. Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.
+
+## Current State
+- Hourly capture already writes both per-snapshot tables and `vm_hourly_stats`.
+- Daily aggregation has mixed execution paths:
+  - SQL union path over `inventory_hourly_*`
+  - Go path over `vm_hourly_stats` or parallel table scans
+- Monthly aggregation has mixed execution paths:
+  - SQL path over daily or hourly snapshot tables
+  - Go path over `vm_daily_rollup` or hourly cache
+- Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
+- Report generation is still coupled to scheduled capture and aggregation jobs.
+- The current UI is rendered through Templ pages and shared `web2`/`web3` CSS classes, but it does not yet match the visual system described in `design.md`.
+- The currently shipped styling still uses a different blue accent, tighter radii, default system typography, and an inconsistent component hierarchy compared with the target design language.
+
+## Implementation Goals
+- Reduce hourly capture wall-clock time.
+- Reduce daily and monthly aggregation runtime.
+- Eliminate repeated historical table scans from the normal scheduled path.
+- Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
+- Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in `design.md`.
+- Make authentication and role requirements easier to understand from the UI without changing the auth model.
+- Preserve compatibility with SQLite for development and small installs.
+- Make the runtime architecture cleanly scalable for PostgreSQL production use.
+
+## Implementation Changes
+
+### 1. Hourly Capture Pipeline
+- Keep `GetAllVMsWithProps` as the primary vCenter inventory fetch path.
+- Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
+- Replace row-by-row database writes in hourly capture with batched writes.
+- For PostgreSQL:
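+  - prefer multi-row `INSERT ... ON CONFLICT` upserts into `vm_hourly_stats`, or `COPY` into a staging table followed by one `INSERT ... SELECT ... ON CONFLICT`, because plain `COPY` cannot resolve key conflicts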
+  - keep conflict handling on the canonical key
+- For SQLite:
+  - keep transactional batched insert/upsert
+  - do not attempt PostgreSQL-only ingestion patterns
+- During capture, write data to these canonical destinations first:
+  - `vm_hourly_stats`
+  - `vm_lifecycle_cache`
+  - `vcenter_totals`
+  - `vcenter_latest_totals`
+  - `vcenter_aggregate_totals` for hourly totals
+- Treat `inventory_hourly_<epoch>` as compatibility output, not as the source of truth for downstream jobs.
+- Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
+- In that reconciliation phase, update canonical cache tables first.
+- Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
+- Remove synchronous XLSX regeneration from hourly capture.
+- Scheduled capture should finish once persistence and reconciliation are complete.
+- Report generation should run after the capture path, either deferred within the job or via a follow-up stage.
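+
+The batched write path can be sketched as follows. This sketch makes two assumptions: it uses a hypothetical five-column subset of `vm_hourly_stats`, and it treats `("Vcenter","VmId","SnapshotTime")` as the canonical conflict key. PostgreSQL and SQLite 3.24+ both accept the `INSERT ... ON CONFLICT ... DO UPDATE` form, so only the placeholder style differs between them.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// HourlyRow is a hypothetical subset of a vm_hourly_stats row.
+type HourlyRow struct {
+	Vcenter      string
+	VmId         string
+	SnapshotTime int64
+	VCpuCount    int
+	RamGB        int
+}
+
+// Dialect selects the placeholder style: "?" for SQLite, "$n" for PostgreSQL.
+type Dialect int
+
+const (
+	SQLite Dialect = iota
+	Postgres
+)
+
+var cols = []string{`"Vcenter"`, `"VmId"`, `"SnapshotTime"`, `"VCpuCount"`, `"RamGB"`}
+
+// chunk splits rows into batches of at most size rows (size <= 0 falls back to 1000).
+func chunk(rows []HourlyRow, size int) [][]HourlyRow {
+	if size <= 0 {
+		size = 1000
+	}
+	var out [][]HourlyRow
+	for len(rows) > size {
+		out = append(out, rows[:size])
+		rows = rows[size:]
+	}
+	if len(rows) > 0 {
+		out = append(out, rows)
+	}
+	return out
+}
+
+// buildUpsert renders one multi-row INSERT ... ON CONFLICT statement for a batch.
+func buildUpsert(d Dialect, batch []HourlyRow) (string, []any) {
+	var sb strings.Builder
+	sb.WriteString("INSERT INTO vm_hourly_stats (" + strings.Join(cols, ",") + ") VALUES ")
+	args := make([]any, 0, len(batch)*len(cols))
+	for i, r := range batch {
+		if i > 0 {
+			sb.WriteString(",")
+		}
+		sb.WriteString("(")
+		for j := range cols {
+			if j > 0 {
+				sb.WriteString(",")
+			}
+			if d == Postgres {
+				fmt.Fprintf(&sb, "$%d", len(args)+j+1) // args not yet appended for this row
+			} else {
+				sb.WriteString("?")
+			}
+		}
+		sb.WriteString(")")
+		args = append(args, r.Vcenter, r.VmId, r.SnapshotTime, r.VCpuCount, r.RamGB)
+	}
+	sb.WriteString(` ON CONFLICT ("Vcenter","VmId","SnapshotTime") DO UPDATE SET "VCpuCount"=excluded."VCpuCount","RamGB"=excluded."RamGB"`)
+	return sb.String(), args
+}
+
+func main() {
+	for _, b := range chunk(make([]HourlyRow, 2500), 1000) {
+		q, args := buildUpsert(Postgres, b)
+		fmt.Println(len(b), len(args), len(q) > 0)
+	}
+}
+```
+
+Each statement would then run through `tx.ExecContext(ctx, q, args...)` inside a transaction. `capture_write_batch_size` multiplied by the column count must stay below PostgreSQL's limit of 65535 bind parameters per statement. SQLite's compile-time `SQLITE_MAX_VARIABLE_NUMBER` is lower by default.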
+
+### 2. Daily Aggregation
+- Make `vm_hourly_stats` the only normal scheduled input for daily aggregation.
+- Scheduled daily jobs must not build `UNION ALL` queries across `inventory_hourly_*`.
+- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
+- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
+- Performance is a secondary but still important reason: in the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
+- Treat the SQL path as non-default compatibility and fallback behavior.
+- Do not treat this as a permanent rejection of SQL.
+- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
+- Keep the current SQL union path only for:
+  - compatibility fallback
+  - manual repair
+  - backfill support where needed
+- Daily aggregation output must continue writing:
+  - `inventory_daily_summary_YYYYMMDD`
+  - `vm_daily_rollup`
+  - `snapshot_registry` daily record
+  - refreshed `vcenter_aggregate_totals` daily entries
+- Lifecycle refinement should operate on canonical lifecycle data and use snapshot-table probing only as a fallback.
+- Preserve existing daily semantics for:
+  - `SamplesPresent`
+  - `AvgIsPresent`
+  - weighted CPU/RAM/disk averages
+  - pool percentages
+  - creation/deletion time behavior
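+
+A single-pass Go sketch of daily aggregation over `vm_hourly_stats` rows follows. The field definitions are assumptions for illustration, not the existing semantics:
+- `AvgIsPresent` is the VM's share of the day's hourly snapshots.
+- CPU and RAM averages are presence-weighted, so absent hours count as zero.
+- Pool percentages are shares of the VM's present samples.
+
+The golden-result tests should pin the real definitions. Disk averages would follow the same pattern, and creation/deletion handling is omitted.
+
+```go
+package main
+
+import "fmt"
+
+// HourlySample is a hypothetical subset of a vm_hourly_stats row.
+type HourlySample struct {
+	VmId      string
+	VCpuCount float64
+	RamGB     float64
+	Pool      string
+}
+
+// DailySummary uses assumed definitions; golden tests must pin the real ones.
+type DailySummary struct {
+	SamplesPresent int
+	AvgIsPresent   float64            // share of the day's snapshots that saw the VM
+	AvgVCpu        float64            // presence-weighted: absent hours count as zero
+	AvgRamGB       float64            // presence-weighted, like AvgVCpu
+	PoolPct        map[string]float64 // percent of the VM's present samples per pool
+}
+
+// aggregateDay groups one vCenter's samples for one day in a single pass.
+// totalSnapshots is the number of hourly captures that completed that day.
+func aggregateDay(samples []HourlySample, totalSnapshots int) map[string]*DailySummary {
+	out := make(map[string]*DailySummary)
+	pools := make(map[string]map[string]int)
+	for _, s := range samples {
+		d, ok := out[s.VmId]
+		if !ok {
+			d = &DailySummary{PoolPct: map[string]float64{}}
+			out[s.VmId] = d
+			pools[s.VmId] = map[string]int{}
+		}
+		d.SamplesPresent++
+		d.AvgVCpu += s.VCpuCount // running sums; divided below
+		d.AvgRamGB += s.RamGB
+		pools[s.VmId][s.Pool]++
+	}
+	for id, d := range out {
+		// Guard against inconsistent input: presence cannot exceed the snapshot count.
+		n := totalSnapshots
+		if n < d.SamplesPresent {
+			n = d.SamplesPresent
+		}
+		d.AvgIsPresent = float64(d.SamplesPresent) / float64(n)
+		d.AvgVCpu /= float64(n)
+		d.AvgRamGB /= float64(n)
+		for pool, c := range pools[id] {
+			d.PoolPct[pool] = 100 * float64(c) / float64(d.SamplesPresent)
+		}
+	}
+	return out
+}
+
+func main() {
+	// vm-1 is present for 12 of 24 snapshots with 4 vCPUs: 9 hours in "gold", 3 in "silver".
+	var samples []HourlySample
+	for h := 0; h < 12; h++ {
+		pool := "gold"
+		if h >= 9 {
+			pool = "silver"
+		}
+		samples = append(samples, HourlySample{VmId: "vm-1", VCpuCount: 4, RamGB: 8, Pool: pool})
+	}
+	d := aggregateDay(samples, 24)["vm-1"]
+	fmt.Println(d.SamplesPresent, d.AvgIsPresent, d.AvgVCpu, d.PoolPct["gold"]) // 12 0.5 2 75
+}
+```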
+
+### 3. Monthly Aggregation
+- Make `vm_daily_rollup` the default scheduled input for monthly aggregation.
+- Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
+- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
+- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
+- Performance is a secondary but still important reason: in the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
+- Treat the SQL path as non-default compatibility and fallback behavior.
+- Do not treat this as a permanent rejection of SQL.
+- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
+- Keep hourly-based monthly aggregation only for:
+  - manual rebuilds
+  - repair/backfill workflows
+  - validation against old behavior
+- Preserve current monthly weighting semantics based on per-day sample volumes.
+- Monthly aggregation output must continue writing:
+  - `inventory_monthly_summary_YYYYMM`
+  - `snapshot_registry` monthly record
+  - refreshed `vcenter_aggregate_totals` monthly entries
+- Keep report generation behavior unchanged from the user’s perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.
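+
+The per-day sample weighting can be sketched as a weighted mean over `vm_daily_rollup` rows. `DayRollup` and its field names are hypothetical. The point is that each day's average is weighted by the samples behind it rather than averaged naively, so a partial day counts for less than a full one.
+
+```go
+package main
+
+import "fmt"
+
+// DayRollup is a hypothetical subset of a vm_daily_rollup row for one VM and day.
+type DayRollup struct {
+	Day      int     // day of month
+	Samples  int     // hourly samples behind this day's averages
+	AvgVCpu  float64
+	AvgRamGB float64
+}
+
+// MonthSummary holds sample-weighted monthly averages.
+type MonthSummary struct {
+	Samples  int
+	AvgVCpu  float64
+	AvgRamGB float64
+}
+
+// aggregateMonth weights each day's average by its sample count.
+func aggregateMonth(days []DayRollup) MonthSummary {
+	var m MonthSummary
+	var cpuSum, ramSum float64
+	for _, d := range days {
+		w := float64(d.Samples)
+		cpuSum += d.AvgVCpu * w
+		ramSum += d.AvgRamGB * w
+		m.Samples += d.Samples
+	}
+	if m.Samples > 0 {
+		m.AvgVCpu = cpuSum / float64(m.Samples)
+		m.AvgRamGB = ramSum / float64(m.Samples)
+	}
+	return m
+}
+
+func main() {
+	// A full day at 2 vCPU, then a half day at 8 vCPU after a resize.
+	m := aggregateMonth([]DayRollup{
+		{Day: 1, Samples: 24, AvgVCpu: 2, AvgRamGB: 4},
+		{Day: 2, Samples: 12, AvgVCpu: 8, AvgRamGB: 16},
+	})
+	fmt.Println(m.Samples, m.AvgVCpu, m.AvgRamGB) // 36 4 8
+}
+```
+
+In this example a plain mean of the two daily averages would report 5 vCPUs, while weighting by samples yields 4.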
+
+### 4. Storage and Schema
+- Keep these tables during migration:
+  - `inventory_hourly_*`
+  - `inventory_daily_summary_*`
+  - `inventory_monthly_summary_*`
+- Stop treating hourly snapshot tables as the normal scheduled aggregation source.
+- Preserve `snapshot_registry`, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
+- Validate or add the following indexes on `vm_hourly_stats` for PostgreSQL:
+  - `("SnapshotTime")`
+  - `("Vcenter","SnapshotTime")`
+  - `("Vcenter","VmId","SnapshotTime")`
+  - `("Vcenter","VmUuid","SnapshotTime")`
+  - a name lookup index aligned with current trace queries
+- Keep the existing trace-compatible indexes for SQLite.
+- After the canonical-path migration is stable, partition `vm_hourly_stats` by snapshot month for PostgreSQL.
+- Do not require partitioning for SQLite or tests.
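+
+A Go sketch that renders PostgreSQL DDL for the indexes and monthly partitions follows. It makes two assumptions: `"SnapshotTime"` holds UTC epoch seconds, and `vm_hourly_stats` has been recreated with `PARTITION BY RANGE ("SnapshotTime")`. PostgreSQL cannot convert an existing regular table into a partitioned one in place, so the recreation step is required. Any primary key or unique constraint on a partitioned table, including the upsert conflict key, must include `"SnapshotTime"`. The index names are hypothetical. The name lookup index is omitted because its columns are not specified here.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// monthPartitionDDL renders one monthly partition of vm_hourly_stats.
+// Assumes "SnapshotTime" holds UTC epoch seconds and the parent table is
+// PARTITION BY RANGE ("SnapshotTime"). Range upper bounds are exclusive.
+func monthPartitionDDL(year int, month time.Month) string {
+	start := time.Date(year, month, 1, 0, 0, 0, 0, time.UTC)
+	end := start.AddDate(0, 1, 0) // first instant of the next month, December included
+	return fmt.Sprintf(
+		`CREATE TABLE IF NOT EXISTS vm_hourly_stats_%04d_%02d PARTITION OF vm_hourly_stats FOR VALUES FROM (%d) TO (%d);`,
+		year, int(month), start.Unix(), end.Unix())
+}
+
+// parentIndexes are declared once on the partitioned parent; PostgreSQL 11+
+// creates a matching index on every existing and future partition.
+var parentIndexes = []string{
+	`CREATE INDEX IF NOT EXISTS vm_hourly_stats_snapshot_idx ON vm_hourly_stats ("SnapshotTime");`,
+	`CREATE INDEX IF NOT EXISTS vm_hourly_stats_vc_snapshot_idx ON vm_hourly_stats ("Vcenter", "SnapshotTime");`,
+	`CREATE INDEX IF NOT EXISTS vm_hourly_stats_vc_vmid_idx ON vm_hourly_stats ("Vcenter", "VmId", "SnapshotTime");`,
+	`CREATE INDEX IF NOT EXISTS vm_hourly_stats_vc_uuid_idx ON vm_hourly_stats ("Vcenter", "VmUuid", "SnapshotTime");`,
+}
+
+func main() {
+	for _, stmt := range parentIndexes {
+		fmt.Println(stmt)
+	}
+	// Pre-create this month and next. Step from the first of the month: adding a
+	// month to e.g. January 31 would normalize to March and skip February.
+	now := time.Now().UTC()
+	first := time.Date(now.Year(), now.Month(), 1, 0, 0, 0, 0, time.UTC)
+	for _, m := range []time.Time{first, first.AddDate(0, 1, 0)} {
+		fmt.Println(monthPartitionDDL(m.Year(), m.Month()))
+	}
+}
+```
+
+Without a `DEFAULT` partition, an insert with no matching partition fails. Generating partitions ahead of each month boundary, for example from the existing cron scheduler, avoids that failure during capture.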
+
+### 5. Compatibility Mode
+- Introduce an explicit compatibility mode for legacy snapshot tables.
+- When compatibility mode is enabled:
+  - continue writing `inventory_hourly_*`
+  - continue generating legacy-compatible daily/monthly summary tables
+  - continue registering snapshots as today
+- When compatibility mode is disabled in a later phase:
+  - scheduled jobs may skip legacy hourly table creation
+  - compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
+- Default to compatibility mode enabled during the transition.
+
+### 6. Scheduling and Job Flow
+- Refactor the scheduled pipeline into explicit stages:
+  1. capture
+  2. reconcile
+  3. register and refresh totals caches
+  4. optional report generation
+- Daily aggregation should run only against the completed prior-day hourly data.
+- Monthly aggregation should depend on daily rollup completion, not hourly history scans.
+- Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
+- Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.
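+
+The staged flow might look like the following sketch, with `settings.async_report_generation` deciding whether reports run inline or on a background worker. `Stage`, `ReportQueue`, and `runScheduled` are hypothetical names. A production version would likely persist queued report jobs so that a restart does not lose them.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"sync"
+)
+
+// Stage is one step of the scheduled pipeline (hypothetical signature).
+type Stage struct {
+	Name string
+	Run  func(vcenter string) error
+}
+
+// ReportQueue runs report generation on a background worker, off the hot path.
+type ReportQueue struct {
+	jobs chan string
+	wg   sync.WaitGroup
+}
+
+func NewReportQueue(buffer int, generate func(report string)) *ReportQueue {
+	q := &ReportQueue{jobs: make(chan string, buffer)}
+	q.wg.Add(1)
+	go func() {
+		defer q.wg.Done()
+		for report := range q.jobs {
+			generate(report)
+		}
+	}()
+	return q
+}
+
+// Close stops accepting jobs and waits until queued reports have been generated.
+func (q *ReportQueue) Close() {
+	close(q.jobs)
+	q.wg.Wait()
+}
+
+// runScheduled runs capture, reconcile, and register/refresh in order and stops at
+// the first error. A report is produced only after every stage succeeds: queued when
+// async is true (the send blocks if the buffer is full), otherwise generated inline.
+func runScheduled(vcenter string, stages []Stage, async bool, q *ReportQueue, generate func(string)) error {
+	for _, s := range stages {
+		if err := s.Run(vcenter); err != nil {
+			return fmt.Errorf("%s: %s: %w", vcenter, s.Name, err)
+		}
+	}
+	report := "hourly-" + vcenter
+	if async {
+		q.jobs <- report
+	} else {
+		generate(report)
+	}
+	return nil
+}
+
+func main() {
+	var mu sync.Mutex
+	var generated []string
+	generate := func(report string) {
+		mu.Lock()
+		defer mu.Unlock()
+		generated = append(generated, report)
+	}
+	q := NewReportQueue(8, generate)
+	ok := func(string) error { return nil }
+	stages := []Stage{{"capture", ok}, {"reconcile", ok}, {"register-totals", ok}}
+	if err := runScheduled("vc-a", stages, true, q, generate); err != nil {
+		fmt.Println(err)
+	}
+	failing := []Stage{{"capture", func(string) error { return errors.New("vCenter timeout") }}}
+	fmt.Println(runScheduled("vc-b", failing, true, q, generate)) // vc-b: capture: vCenter timeout
+	q.Close()
+	fmt.Println(generated) // [hourly-vc-a]
+}
+```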
+
+### 7. UI Refresh and Design-System Alignment
+- Use `design.md` as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
+- Introduce semantic theme tokens using `--theme_*` naming in the shared stylesheet layer.
+- Replace the current ad hoc `web2` color and radius values with tokenized equivalents for:
+  - primary text
+  - weak text
+  - CTA blue
+  - borders
+  - surfaces
+  - success states
+  - button spotlight text
+  - card and ambient shadows
+- Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
+- Keep the existing `web2` and `web3` class names if that reduces churn, but rebase them on the new token system.
+- Establish a typography strategy that follows `design.md` while remaining deployable:
+  - prefer Haas and Haas Groot Disp when licensed webfont delivery is available
+  - otherwise define a documented fallback stack with similar proportions and spacing behavior
+  - apply positive letter spacing to body, caption, and button treatments where appropriate
+- Normalize component shape language to the design brief:
+  - buttons at 12px radius
+  - cards and sections at 16px to 24px radius
+  - larger containers at 24px to 32px radius where needed
+  - avoid the current 3px to 6px rounded treatment as the default visual language
+- Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
+- Refactor shared UI structure in the Templ layer:
+  - `components/core/header.templ`
+  - `components/core/footer.templ`
+  - shared shell/header/card/button/table/form patterns used across `components/views/*`
+- Add a reusable page-shell pattern so all primary pages share:
+  - a consistent hero/header treatment
+  - action grouping
+  - content width rules
+  - section spacing
+  - responsive table overflow behavior
+- Improve the dashboard information architecture in `components/views/index.templ`:
+  - reduce the current long-form text density
+  - promote primary navigation and key operational tasks
+  - move build metadata into secondary status cards
+  - present auth requirements and role policy as a concise callout rather than dense paragraph copy
+- Improve snapshot and vCenter list pages in `components/views/snapshots.templ`:
+  - stronger table hierarchy
+  - clearer record counts and grouping
+  - more intentional page headers and return navigation
+  - responsive behavior that preserves readability on smaller screens
+- Improve the VM trace page in `components/views/vm_trace.templ`:
+  - upgrade search form layout and input styling
+  - improve chart framing and diagnostics presentation
+  - make lifecycle summary cards visually clearer
+  - preserve dense tabular detail without making the page feel purely utilitarian
+- Ensure the auth-enabled experience is visible in the UI:
+  - clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
+  - surface viewer versus admin capability differences in concise language
+  - keep Swagger and operational links accessible from the main navigation
+- Add accessibility and interaction requirements to the UI implementation:
+  - visible focus states
+  - sufficient text/background contrast
+  - keyboard-usable navigation and forms
+  - table layouts that remain readable with horizontal overflow
+  - mobile-safe spacing and tap targets
+- Keep UI changes implementation-friendly:
+  - avoid introducing a large frontend framework
+  - continue using Templ plus shared CSS and existing JS assets
+  - prefer incremental component replacement over a full frontend rewrite
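+
+The contrast requirement can be checked directly against token values. This Go sketch implements the WCAG 2.x relative-luminance and contrast-ratio formulas. The token names and hex values are hypothetical placeholders, because the real values come from `design.md`. WCAG AA requires at least 4.5:1 for normal text and 3:1 for large text.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"strconv"
+	"strings"
+)
+
+// channel converts an 8-bit sRGB channel to linear light (WCAG 2.x definition).
+func channel(v uint8) float64 {
+	c := float64(v) / 255
+	if c <= 0.03928 {
+		return c / 12.92
+	}
+	return math.Pow((c+0.055)/1.055, 2.4)
+}
+
+// luminance parses "#rrggbb" and returns its relative luminance (0 to 1).
+func luminance(hex string) (float64, error) {
+	digits := strings.TrimPrefix(hex, "#")
+	if len(digits) != 6 {
+		return 0, fmt.Errorf("expected #rrggbb, got %q", hex)
+	}
+	n, err := strconv.ParseUint(digits, 16, 32)
+	if err != nil {
+		return 0, fmt.Errorf("bad color %q: %w", hex, err)
+	}
+	r, g, b := uint8(n>>16), uint8(n>>8), uint8(n)
+	return 0.2126*channel(r) + 0.7152*channel(g) + 0.0722*channel(b), nil
+}
+
+// contrast returns the WCAG contrast ratio between two colors, from 1 to 21.
+func contrast(fg, bg string) (float64, error) {
+	l1, err := luminance(fg)
+	if err != nil {
+		return 0, err
+	}
+	l2, err := luminance(bg)
+	if err != nil {
+		return 0, err
+	}
+	if l1 < l2 {
+		l1, l2 = l2, l1
+	}
+	return (l1 + 0.05) / (l2 + 0.05), nil
+}
+
+func main() {
+	// Hypothetical token pairs; substitute the values defined from design.md.
+	pairs := map[string][2]string{
+		"--theme_text_primary on --theme_surface": {"#111111", "#ffffff"},
+		"--theme_text_weak on --theme_surface":    {"#8a8a8a", "#ffffff"},
+	}
+	for name, p := range pairs {
+		r, err := contrast(p[0], p[1])
+		if err != nil {
+			fmt.Println(name, err)
+			continue
+		}
+		fmt.Printf("%s: %.2f:1 (AA normal text: %v)\n", name, r, r >= 4.5)
+	}
+}
+```
+
+Running a check like this in CI against the token file would catch contrast regressions whenever a token value changes.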
+
+## Public Interfaces and Settings
+- No HTTP API changes are required.
+- Keep existing endpoints and report filenames stable.
+- No auth-model changes are required for the UI refresh.
+- If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
+- Add these settings:
+  - `settings.capture_write_batch_size`
+    - default: `1000`
+    - controls batched DB writes for hourly capture
+  - `settings.snapshot_table_compat_mode`
+    - default: `true`
+    - when `true`, continue writing legacy snapshot tables during migration
+  - `settings.async_report_generation`
+    - default: `true`
+    - when `true`, scheduled jobs defer XLSX generation from the hot path
+- Keep existing settings such as:
+  - `hourly_snapshot_concurrency`
+  - `monthly_aggregation_granularity`
+  - retry settings
+  - cleanup settings
+- Scheduled monthly aggregation should ignore hourly granularity unless running a manual or backfill job.
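+
+The following is a sketch of resolving the new settings with their documented defaults. JSON decoding is used only to keep the example self-contained; the real settings file format and loader may differ. Both new booleans default to `true`, so pointer fields are used to tell an omitted key apart from an explicit `false`. `hourlyTargets` is a hypothetical helper showing how compatibility mode could gate the legacy `inventory_hourly_<epoch>` write alongside the canonical destinations.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Settings mirrors the new keys. Pointer bools distinguish "unset" from false.
+type Settings struct {
+	CaptureWriteBatchSize   int   `json:"capture_write_batch_size"`
+	SnapshotTableCompatMode *bool `json:"snapshot_table_compat_mode"`
+	AsyncReportGeneration   *bool `json:"async_report_generation"`
+}
+
+// Resolved is the settings view the pipeline actually consumes.
+type Resolved struct {
+	BatchSize   int
+	CompatMode  bool
+	AsyncReport bool
+}
+
+func boolOr(p *bool, def bool) bool {
+	if p == nil {
+		return def
+	}
+	return *p
+}
+
+// resolve applies the documented defaults: 1000, true, true.
+func resolve(s Settings) Resolved {
+	r := Resolved{
+		BatchSize:   s.CaptureWriteBatchSize,
+		CompatMode:  boolOr(s.SnapshotTableCompatMode, true),
+		AsyncReport: boolOr(s.AsyncReportGeneration, true),
+	}
+	if r.BatchSize <= 0 {
+		r.BatchSize = 1000
+	}
+	return r
+}
+
+func load(raw []byte) (Resolved, error) {
+	var doc struct {
+		Settings Settings `json:"settings"`
+	}
+	if err := json.Unmarshal(raw, &doc); err != nil {
+		return Resolved{}, err
+	}
+	return resolve(doc.Settings), nil
+}
+
+// hourlyTargets lists where a capture writes; the legacy table only in compat mode.
+func hourlyTargets(r Resolved, epoch int64) []string {
+	t := []string{"vm_hourly_stats", "vm_lifecycle_cache", "vcenter_totals", "vcenter_latest_totals", "vcenter_aggregate_totals"}
+	if r.CompatMode {
+		t = append(t, fmt.Sprintf("inventory_hourly_%d", epoch))
+	}
+	return t
+}
+
+func main() {
+	r, err := load([]byte(`{"settings":{"snapshot_table_compat_mode":false}}`))
+	if err != nil {
+		fmt.Println(err)
+		return
+	}
+	fmt.Printf("%+v\n", r)                    // {BatchSize:1000 CompatMode:false AsyncReport:true}
+	fmt.Println(hourlyTargets(r, 1776400000)) // canonical destinations only
+}
+```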
+
+## Execution Order
+
+### Phase 1: Hot-Path Runtime Wins
+- Add batched hourly writes.
+- Decouple report generation from hourly capture.
+- Ensure daily scheduled aggregation reads only from `vm_hourly_stats`.
+- Ensure monthly scheduled aggregation reads only from `vm_daily_rollup`.
+- Keep compatibility tables enabled.
+- Define the UI token layer and shared component mapping before page-level redesign work begins.
+
+### Phase 2: Canonical Dataflow
+- Refactor reconciliation so canonical caches are updated first.
+- Reduce or eliminate prior-snapshot table mutations during capture.
+- Make scheduled aggregation paths canonical-only.
+- Keep fallback and repair code for legacy unions/scans.
+- Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.
+
+### Phase 3: Postgres-Ready Scale-Up
+- Validate index coverage on canonical tables.
+- Add PostgreSQL partitioning for `vm_hourly_stats`.
+- Benchmark Go and SQL aggregation paths on representative production-scale data.
+- Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
+- Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
+- If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
+- Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.
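+
+"Clear, repeatable" can be made concrete with an explicit promotion rule. The Go sketch below shows one possible rule, not an agreed one. Over paired runs on the same canonical Postgres dataset, SQL's median runtime must beat Go's by at least `minGain`, and SQL must win at least `minWinShare` of the pairs. Both thresholds are hypothetical and should be fixed before benchmarking starts.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"time"
+)
+
+// median returns the middle duration (mean of the two middle values for even n).
+func median(d []time.Duration) time.Duration {
+	s := append([]time.Duration(nil), d...)
+	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
+	n := len(s)
+	if n%2 == 1 {
+		return s[n/2]
+	}
+	return (s[n/2-1] + s[n/2]) / 2
+}
+
+// sqlClearlyWins applies the hypothetical rule to paired runs (index i of both
+// slices measured on the same dataset). Missing or unpaired evidence keeps Go.
+func sqlClearlyWins(goRuns, sqlRuns []time.Duration, minGain, minWinShare float64) bool {
+	if len(goRuns) == 0 || len(goRuns) != len(sqlRuns) {
+		return false
+	}
+	wins := 0
+	for i := range goRuns {
+		if sqlRuns[i] < goRuns[i] {
+			wins++
+		}
+	}
+	gain := 1 - float64(median(sqlRuns))/float64(median(goRuns))
+	share := float64(wins) / float64(len(goRuns))
+	return gain >= minGain && share >= minWinShare
+}
+
+func main() {
+	ms := func(v ...int) []time.Duration {
+		out := make([]time.Duration, len(v))
+		for i, x := range v {
+			out[i] = time.Duration(x) * time.Millisecond
+		}
+		return out
+	}
+	goRuns := ms(1000, 1040, 980, 1010, 995)
+	sqlRuns := ms(700, 760, 690, 720, 1100)                  // SQL loses the last pair
+	fmt.Println(sqlClearlyWins(goRuns, sqlRuns, 0.15, 0.75)) // true: 28% median gain, 4 of 5 wins
+}
+```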
+
+### Phase 4: Compatibility Reduction
+- Keep legacy table output behind `snapshot_table_compat_mode`.
+- Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
+- Retain explicit backfill and rebuild commands for compatibility tables and reports.
+- Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.
+
+## Test Plan
+
+### Correctness Tests
+- Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
+- Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
+- Include edge cases for:
+  - partial-day VM presence
+  - missing creation times
+  - deletion-time refinement
+  - pool changes
+  - CPU and RAM changes across samples
+  - VMs identified by `VmId`, `VmUuid`, and fallback name matching
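+
+The golden comparisons need a stable VM identity and a float tolerance. This sketch keys rows by `VmId`, then `VmUuid`, then lower-cased name, and reports missing, extra, and mismatched rows. The precedence, the `Summary` fields, and the tolerance are assumptions and should be aligned with the real matching logic.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"sort"
+	"strings"
+)
+
+// vmKey picks a stable identity: VmId, then VmUuid, then lower-cased name.
+// The precedence is an assumption; match it to the real lifecycle/trace logic.
+func vmKey(vcenter, vmID, vmUUID, name string) string {
+	switch {
+	case vmID != "":
+		return vcenter + "|id:" + vmID
+	case vmUUID != "":
+		return vcenter + "|uuid:" + vmUUID
+	default:
+		return vcenter + "|name:" + strings.ToLower(name)
+	}
+}
+
+// Summary is a hypothetical subset of a daily summary row.
+type Summary struct {
+	SamplesPresent int
+	AvgVCpu        float64
+	AvgRamGB       float64
+}
+
+// diffGolden reports rows that are missing, extra, or differ beyond eps.
+func diffGolden(want, got map[string]Summary, eps float64) []string {
+	var diffs []string
+	for k, w := range want {
+		g, ok := got[k]
+		switch {
+		case !ok:
+			diffs = append(diffs, "missing "+k)
+		case w.SamplesPresent != g.SamplesPresent,
+			math.Abs(w.AvgVCpu-g.AvgVCpu) > eps,
+			math.Abs(w.AvgRamGB-g.AvgRamGB) > eps:
+			diffs = append(diffs, fmt.Sprintf("mismatch %s: want %+v got %+v", k, w, g))
+		}
+	}
+	for k := range got {
+		if _, ok := want[k]; !ok {
+			diffs = append(diffs, "extra "+k)
+		}
+	}
+	sort.Strings(diffs) // deterministic order for readable test failures
+	return diffs
+}
+
+func main() {
+	k := vmKey("vc-a", "", "4201-abcd", "web01")
+	want := map[string]Summary{k: {SamplesPresent: 24, AvgVCpu: 4, AvgRamGB: 8}}
+	got := map[string]Summary{k: {SamplesPresent: 24, AvgVCpu: 4.0000000001, AvgRamGB: 8}}
+	fmt.Println(len(diffGolden(want, got, 1e-6))) // 0
+}
+```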
+
+### Integration Tests
+- Hourly capture writes `vm_hourly_stats`, lifecycle caches, and vCenter totals correctly.
+- Daily aggregation reads canonical hourly data without scanning `inventory_hourly_*`.
+- Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
+- `vcenter_aggregate_totals` remains correct for hourly, daily, and monthly views.
+- Trace and totals endpoints keep returning equivalent results before and after migration.
+- UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.
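+
+One way to enforce "without scanning `inventory_hourly_*`" in integration tests is to route the aggregation's SQL through a recording wrapper. The test then asserts that no recorded statement names a per-snapshot table. `Querier` and `Recorder` are hypothetical; in the real code, the wrapper would sit around whatever database handle the daily job receives. The check applies to the aggregation run only, because capture still writes legacy tables in compatibility mode.
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+	"sync"
+)
+
+// Querier is a hypothetical narrow interface the aggregation code would depend on.
+type Querier interface {
+	Exec(query string, args ...any) error
+}
+
+// Recorder wraps a Querier and keeps every SQL string it forwards.
+type Recorder struct {
+	Inner   Querier
+	mu      sync.Mutex
+	Queries []string
+}
+
+func (r *Recorder) Exec(query string, args ...any) error {
+	r.mu.Lock()
+	r.Queries = append(r.Queries, query)
+	r.mu.Unlock()
+	return r.Inner.Exec(query, args...)
+}
+
+// legacyHourly matches references to per-snapshot hourly tables, case-insensitively.
+var legacyHourly = regexp.MustCompile(`(?i)inventory_hourly_`)
+
+// legacyScans returns the recorded queries that touch per-snapshot hourly tables.
+func legacyScans(queries []string) []string {
+	var hits []string
+	for _, q := range queries {
+		if legacyHourly.MatchString(q) {
+			hits = append(hits, q)
+		}
+	}
+	return hits
+}
+
+// noop stands in for a real database handle.
+type noop struct{}
+
+func (noop) Exec(string, ...any) error { return nil }
+
+func main() {
+	rec := &Recorder{Inner: noop{}}
+	_ = rec.Exec(`SELECT * FROM vm_hourly_stats WHERE "SnapshotTime" >= $1 AND "SnapshotTime" < $2`, 0, 86400)
+	_ = rec.Exec(`SELECT * FROM inventory_hourly_1776400000 UNION ALL SELECT * FROM inventory_hourly_1776403600`)
+	fmt.Println(len(legacyScans(rec.Queries))) // 1: the test would fail on this run
+}
+```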
+
+### Compatibility Tests
+- When `snapshot_table_compat_mode=true`, compatibility snapshot tables still exist and are populated.
+- Reports still generate correctly from migrated data.
+- Backfill and repair flows can rebuild compatibility outputs from canonical sources.
+- UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.
+
+### Performance Tests
+- Measure per-vCenter capture duration.
+- Measure hourly write throughput.
+- Measure daily aggregation runtime.
+- Measure monthly aggregation runtime.
+- Measure report generation runtime when decoupled from scheduled jobs.
+- Capture baseline metrics before refactor and compare after each phase.
+- Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.
+
+### UI Validation
+- Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
+- Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
+- Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
+- Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.
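+
+Token usage can be checked mechanically. This line-based Go heuristic flags hex and `rgb()` colors and `px` border radii in declarations that are not themselves `--` custom-property definitions. It misses radii inside single-line rules and may need allow-lists, so it is best treated as a starting point. The class and token names in the example are hypothetical.
+
+```go
+package main
+
+import (
+	"bufio"
+	"fmt"
+	"regexp"
+	"strings"
+)
+
+var (
+	hexColor  = regexp.MustCompile(`#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b`)
+	rgbColor  = regexp.MustCompile(`\brgba?\(`)
+	rawRadius = regexp.MustCompile(`^\s*border(?:-[a-z]+)*-radius\s*:\s*[0-9.]+px`)
+)
+
+// Finding is one hard-coded value outside the token layer.
+type Finding struct {
+	Line int
+	Text string
+}
+
+// lintTokens scans CSS line by line. Token definitions (lines starting with "--")
+// and lines without a ';' (selectors, braces) are skipped.
+func lintTokens(css string) []Finding {
+	var out []Finding
+	sc := bufio.NewScanner(strings.NewReader(css))
+	for n := 1; sc.Scan(); n++ {
+		line := strings.TrimSpace(sc.Text())
+		if strings.HasPrefix(line, "--") || !strings.Contains(line, ";") {
+			continue
+		}
+		_, value, found := strings.Cut(line, ":")
+		if !found {
+			continue
+		}
+		if hexColor.MatchString(value) || rgbColor.MatchString(value) || rawRadius.MatchString(line) {
+			out = append(out, Finding{Line: n, Text: line})
+		}
+	}
+	return out
+}
+
+func main() {
+	css := `:root {
+  --theme_cta: #0a66ff;
+  --theme_radius_button: 12px;
+}
+.web2-button {
+  background: var(--theme_cta);
+  border-radius: 4px;
+  color: #ffffff;
+}`
+	for _, f := range lintTokens(css) {
+		fmt.Printf("line %d: %s\n", f.Line, f.Text) // flags lines 7 and 8
+	}
+}
+```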
+
+## Acceptance Criteria
+- Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
+- Scheduled daily aggregation no longer depends on `inventory_hourly_*` scans.
+- Scheduled monthly aggregation no longer depends on hourly-history scans.
+- Canonical caches become the source of truth for normal scheduled processing.
+- Legacy compatibility behavior remains available during migration.
+- Existing endpoints, reports, auth behavior, and operational commands continue to work.
+- The UI reflects the design direction in `design.md` through tokenized colors, typography, spacing, radius, and shadow usage.
+- The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
+- The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.
+
+## Assumptions
+- Target direction is Postgres-ready and runtime-first.
+- Existing endpoints, report filenames, and user-visible semantics must remain stable.
+- SQLite remains supported for development, tests, and smaller installs.
+- PostgreSQL is the intended scale-up target for larger environments.
+- Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.