# Inventory Capture and Aggregation Optimization Plan

## Summary
Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning `inventory_hourly_*` tables and regenerating reports inline.

This plan is intended to be implementation-ready for a `codex-5.3` execution pass.

Execution-path decision:
- For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
- This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
- SQL path is retained for compatibility, backfill, and fallback.
- SQL remains a future optimization candidate on canonical Postgres tables.
- SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.

The target architecture is:
1. `vm_hourly_stats` is the canonical hourly fact store.
2. `vm_daily_rollup` is the canonical monthly input.
3. Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.

## Current State
- Hourly capture already writes both per-snapshot tables and `vm_hourly_stats`.
- Daily aggregation has mixed execution paths:
  - SQL union path over `inventory_hourly_*`
  - Go path over `vm_hourly_stats` or parallel table scans
- Monthly aggregation has mixed execution paths:
  - SQL path over daily or hourly snapshot tables
  - Go path over `vm_daily_rollup` or hourly cache
- Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
- Report generation is still coupled to scheduled capture and aggregation jobs.
- The current UI is rendered through Templ pages and shared `web2`/`web3` CSS classes, but it does not yet match the visual system described in `design.md`.
- Current shipped styling still uses a different blue accent, tighter radii, default system typography, and inconsistent component hierarchy compared with the target design language.

## Implementation Goals
- Reduce hourly capture wall-clock time.
- Reduce daily and monthly aggregation runtime.
- Eliminate repeated historical table scans from the normal scheduled path.
- Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
- Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in `design.md`.
- Make authentication and role requirements easier to understand from the UI without changing the auth model.
- Preserve compatibility with SQLite for development and small installs.
- Make the runtime architecture cleanly scalable for PostgreSQL production use.

## Implementation Changes

### 1. Hourly Capture Pipeline
- Keep `GetAllVMsWithProps` as the primary vCenter inventory fetch path.
- Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
- Replace row-by-row database writes in hourly capture with batched writes.
- For PostgreSQL:
  - prefer multi-row insert/upsert or `COPY` into `vm_hourly_stats`
  - keep conflict handling on the canonical key
- For SQLite:
  - keep transactional batched insert/upsert
  - do not attempt PostgreSQL-only ingestion patterns
- During capture, write data to these canonical destinations first:
  - `vm_hourly_stats`
  - `vm_lifecycle_cache`
  - `vcenter_totals`
  - `vcenter_latest_totals`
  - `vcenter_aggregate_totals` for hourly totals
- Treat `inventory_hourly_<epoch>` as compatibility output, not as the source of truth for downstream jobs.
- Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
- In that reconciliation phase, update canonical cache tables first.
- Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
- Remove synchronous XLSX regeneration from hourly capture.
- Scheduled capture should finish once persistence and reconciliation are complete.
- Report generation should run after the capture path, either deferred within the job or via a follow-up stage.

### 2. Daily Aggregation
- Make `vm_hourly_stats` the only normal scheduled input for daily aggregation.
- Scheduled daily jobs must not build `UNION ALL` queries across `inventory_hourly_*`.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep the current SQL union path only for:
  - compatibility fallback
  - manual repair
  - backfill support where needed
- Daily aggregation output must continue writing:
  - `inventory_daily_summary_YYYYMMDD`
  - `vm_daily_rollup`
  - `snapshot_registry` daily record
  - refreshed `vcenter_aggregate_totals` daily entries
- Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
- Preserve existing daily semantics for:
  - `SamplesPresent`
  - `AvgIsPresent`
  - weighted CPU/RAM/disk averages
  - pool percentages
  - creation/deletion time behavior

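
To make the preserved daily fields more concrete, here is a rough sketch of a single-pass Go aggregation over one day of hourly samples. It assumes that `SamplesPresent` counts the hourly samples in which a VM appeared and that `AvgIsPresent` is that count divided by the number of snapshots in the day; the authoritative definitions are whatever the existing code does and may differ. The type and field names other than `SamplesPresent` and `AvgIsPresent` are hypothetical.

```go
package main

import "fmt"

// hourlySample is a hypothetical view of one vm_hourly_stats row.
type hourlySample struct {
	VmId      string
	VcpuCount float64
	RamGB     float64
}

// dailyRollup holds the per-VM daily output of this sketch.
type dailyRollup struct {
	SamplesPresent int     // assumed: hourly samples in which the VM was seen
	AvgIsPresent   float64 // assumed: SamplesPresent / snapshots in the day
	AvgVcpu        float64 // mean over the samples where the VM was present
	AvgRamGB       float64
}

// aggregateDay folds one day of hourly samples into per-VM rollups in a
// single pass, without touching any per-snapshot table.
func aggregateDay(samples []hourlySample, totalSnapshots int) map[string]dailyRollup {
	type sums struct {
		n         int
		vcpu, ram float64
	}
	acc := make(map[string]*sums)
	for _, s := range samples {
		a := acc[s.VmId]
		if a == nil {
			a = &sums{}
			acc[s.VmId] = a
		}
		a.n++
		a.vcpu += s.VcpuCount
		a.ram += s.RamGB
	}
	out := make(map[string]dailyRollup, len(acc))
	for id, a := range acc {
		r := dailyRollup{
			SamplesPresent: a.n,
			AvgVcpu:        a.vcpu / float64(a.n),
			AvgRamGB:       a.ram / float64(a.n),
		}
		if totalSnapshots > 0 {
			r.AvgIsPresent = float64(a.n) / float64(totalSnapshots)
		}
		out[id] = r
	}
	return out
}

func main() {
	samples := []hourlySample{
		{"vm-a", 2, 8}, {"vm-a", 4, 16}, {"vm-b", 1, 4},
	}
	for id, r := range aggregateDay(samples, 4) {
		fmt.Printf("%s %+v\n", id, r)
	}
}
```
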
### 3. Monthly Aggregation
- Make `vm_daily_rollup` the default scheduled input for monthly aggregation.
- Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep hourly-based monthly aggregation only for:
  - manual rebuilds
  - repair/backfill workflows
  - validation against old behavior
- Preserve current monthly weighting semantics based on per-day sample volumes.
- Monthly aggregation output must continue writing:
  - `inventory_monthly_summary_YYYYMM`
  - `snapshot_registry` monthly record
  - refreshed `vcenter_aggregate_totals` monthly entries
- Keep report generation behavior unchanged from the user’s perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.

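
Weighting by per-day sample volume can be illustrated with a small sketch. It assumes the monthly average of a metric is the sum of each day's average multiplied by that day's `SamplesPresent`, divided by the total sample count; this is one plausible reading of "per-day sample volumes", not a confirmed description of the current code. The `dayRollup` type and `AvgVcpu` field are hypothetical.

```go
package main

import "fmt"

// dayRollup is a hypothetical, trimmed-down vm_daily_rollup row for one VM.
type dayRollup struct {
	SamplesPresent int
	AvgVcpu        float64
}

// monthlyWeightedAvg weights each day's average by the number of samples that
// produced it, so a day with 24 samples counts three times as much as a day
// with 8. It returns the weighted average and the total sample count.
func monthlyWeightedAvg(days []dayRollup) (avg float64, samples int) {
	var weighted float64
	for _, d := range days {
		if d.SamplesPresent <= 0 {
			continue // days without samples carry no weight
		}
		weighted += d.AvgVcpu * float64(d.SamplesPresent)
		samples += d.SamplesPresent
	}
	if samples == 0 {
		return 0, 0
	}
	return weighted / float64(samples), samples
}

func main() {
	days := []dayRollup{
		{SamplesPresent: 24, AvgVcpu: 2},
		{SamplesPresent: 8, AvgVcpu: 6},
	}
	avg, n := monthlyWeightedAvg(days)
	fmt.Printf("avg=%.2f samples=%d\n", avg, n)
}
```

An unweighted mean of the two daily averages in `main` would give 4; the sample-weighted result is lower because the 24-sample day dominates.
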
### 4. Storage and Schema
- Keep these tables during migration:
  - `inventory_hourly_*`
  - `inventory_daily_summary_*`
  - `inventory_monthly_summary_*`
- Stop treating hourly snapshot tables as the normal scheduled aggregation source.
- Preserve `snapshot_registry`, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
- Validate or add the following indexes on `vm_hourly_stats` for PostgreSQL:
  - `("SnapshotTime")`
  - `("Vcenter","SnapshotTime")`
  - `("Vcenter","VmId","SnapshotTime")`
  - `("Vcenter","VmUuid","SnapshotTime")`
  - a name lookup index aligned with current trace queries
- Keep the existing trace-compatible indexes for SQLite.
- After the canonical-path migration is stable, partition `vm_hourly_stats` by snapshot month for PostgreSQL.
- Do not require partitioning for SQLite or tests.

### 5. Compatibility Mode
- Introduce an explicit compatibility mode for legacy snapshot tables.
- When compatibility mode is enabled:
  - continue writing `inventory_hourly_*`
  - continue generating legacy-compatible daily/monthly summary tables
  - continue registering snapshots as today
- When compatibility mode is disabled in a later phase:
  - scheduled jobs may skip legacy hourly table creation
  - compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
- Default to compatibility mode enabled during the transition.

### 6. Scheduling and Job Flow
- Refactor the scheduled pipeline into explicit stages:
  1. capture
  2. reconcile
  3. register and refresh totals caches
  4. optional report generation
- Daily aggregation should run only against the completed prior-day hourly data.
- Monthly aggregation should depend on daily rollup completion, not hourly history scans.
- Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
- Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.

### 7. UI Refresh and Design-System Alignment
- Use `design.md` as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
- Introduce semantic theme tokens using `--theme_*` naming in the shared stylesheet layer.
- Replace the current ad hoc `web2` color and radius values with tokenized equivalents for:
  - primary text
  - weak text
  - CTA blue
  - borders
  - surfaces
  - success states
  - button spotlight text
  - card and ambient shadows
- Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
- Keep the existing `web2` and `web3` class names if that reduces churn, but rebase them on the new token system.
- Establish a typography strategy that follows `design.md` while remaining deployable:
  - prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
  - otherwise define a documented fallback stack with similar proportions and spacing behavior
  - apply positive letter spacing to body, caption, and button treatments where appropriate
- Normalize component shape language to the design brief:
  - buttons at 12px radius
  - cards and sections at 16px to 24px radius
  - larger containers at 24px to 32px radius where needed
  - avoid the current 3px to 6px rounded treatment as the default visual language
- Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
- Refactor shared UI structure in the Templ layer:
  - `components/core/header.templ`
  - `components/core/footer.templ`
  - shared shell/header/card/button/table/form patterns used across `components/views/*`
- Add a reusable page-shell pattern so all primary pages share:
  - a consistent hero/header treatment
  - action grouping
  - content width rules
  - section spacing
  - responsive table overflow behavior
- Improve the dashboard information architecture in `components/views/index.templ`:
  - reduce the current long-form text density
  - promote primary navigation and key operational tasks
  - move build metadata into secondary status cards
  - present auth requirements and role policy as a concise callout rather than dense paragraph copy
- Improve snapshot and vCenter list pages in `components/views/snapshots.templ`:
  - stronger table hierarchy
  - clearer record counts and grouping
  - more intentional page headers and return navigation
  - responsive behavior that preserves readability on smaller screens
- Improve the VM trace page in `components/views/vm_trace.templ`:
  - upgrade search form layout and input styling
  - improve chart framing and diagnostics presentation
  - make lifecycle summary cards visually clearer
  - preserve dense tabular detail without making the page feel purely utilitarian
- Ensure the auth-enabled experience is visible in the UI:
  - clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
  - surface viewer versus admin capability differences in concise language
  - keep Swagger and operational links accessible from the main navigation
- Add accessibility and interaction requirements to the UI implementation:
  - visible focus states
  - sufficient text/background contrast
  - keyboard-usable navigation and forms
  - table layouts that remain readable with horizontal overflow
  - mobile-safe spacing and tap targets
- Keep UI changes implementation-friendly:
  - avoid introducing a large frontend framework
  - continue using Templ plus shared CSS and existing JS assets
  - prefer incremental component replacement over a full frontend rewrite

## Public Interfaces and Settings
- No HTTP API changes are required.
- Keep existing endpoints and report filenames stable.
- No auth-model changes are required for the UI refresh.
- If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
- Add these settings:
  - `settings.capture_write_batch_size`
    - default: `1000`
    - controls batched DB writes for hourly capture
  - `settings.snapshot_table_compat_mode`
    - default: `true`
    - when `true`, continue writing legacy snapshot tables during migration
  - `settings.async_report_generation`
    - default: `true`
    - when `true`, scheduled jobs defer XLSX generation from the hot path
- Keep existing settings such as:
  - `hourly_snapshot_concurrency`
  - `monthly_aggregation_granularity`
  - retry settings
  - cleanup settings
- Scheduled monthly aggregation should ignore hourly granularity unless running a manual or backfill job.

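
As a rough illustration of how the three new settings and their defaults might be represented in Go, consider the sketch below. The `Settings` struct, its field names, and the helper methods are hypothetical; only the setting keys and default values come from the list above. Falling back to `1000` for a non-positive batch size is an assumption, not documented behavior.

```go
package main

import "fmt"

// Settings mirrors the three new keys. Struct and field names are hypothetical.
type Settings struct {
	CaptureWriteBatchSize   int  // settings.capture_write_batch_size
	SnapshotTableCompatMode bool // settings.snapshot_table_compat_mode
	AsyncReportGeneration   bool // settings.async_report_generation
}

// DefaultSettings returns the documented defaults.
func DefaultSettings() Settings {
	return Settings{
		CaptureWriteBatchSize:   1000,
		SnapshotTableCompatMode: true,
		AsyncReportGeneration:   true,
	}
}

// EffectiveBatchSize guards against a zero or negative configured value by
// falling back to the documented default (an assumed policy for this sketch).
func (s Settings) EffectiveBatchSize() int {
	if s.CaptureWriteBatchSize <= 0 {
		return 1000
	}
	return s.CaptureWriteBatchSize
}

// WritesLegacyHourlyTables reports whether scheduled capture should still
// create inventory_hourly_* tables.
func (s Settings) WritesLegacyHourlyTables() bool { return s.SnapshotTableCompatMode }

// GeneratesReportsInline reports whether XLSX generation stays on the hot path.
func (s Settings) GeneratesReportsInline() bool { return !s.AsyncReportGeneration }

func main() {
	s := DefaultSettings()
	fmt.Println(s.EffectiveBatchSize(), s.WritesLegacyHourlyTables(), s.GeneratesReportsInline())
}
```
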
## Execution Order

### Phase 1: Hot-Path Runtime Wins
- Add batched hourly writes.
- Decouple report generation from hourly capture.
- Ensure daily scheduled aggregation reads only from `vm_hourly_stats`.
- Ensure monthly scheduled aggregation reads only from `vm_daily_rollup`.
- Keep compatibility tables enabled.
- Define the UI token layer and shared component mapping before page-level redesign work begins.

### Phase 2: Canonical Dataflow
- Refactor reconciliation so canonical caches are updated first.
- Reduce or eliminate prior-snapshot table mutations during capture.
- Make scheduled aggregation paths canonical-only.
- Keep fallback and repair code for legacy unions/scans.
- Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.

### Phase 3: Postgres-Ready Scale-Up
- Validate index coverage on canonical tables.
- Add PostgreSQL partitioning for `vm_hourly_stats`.
- Benchmark Go and SQL aggregation paths on representative production-scale data.
- Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
- Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
- If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
- Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.

### Phase 4: Compatibility Reduction
- Keep legacy table output behind `snapshot_table_compat_mode`.
- Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
- Retain explicit backfill and rebuild commands for compatibility tables and reports.
- Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.

## Implementation Checklist

### 0. Baseline and Guardrails
- [x] Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
- [x] Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
- [x] Add new settings with defaults and config wiring:
  - [x] `settings.capture_write_batch_size=1000`
  - [x] `settings.snapshot_table_compat_mode=true`
  - [x] `settings.async_report_generation=true`
- [x] Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
- [x] Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
- Evidence snapshot: see `phase0-baseline.md` for metrics, API/report contract snapshot, and guardrail verification.

### 1. Phase 1: Hot-Path Runtime Wins
- [x] Implement batched hourly writes for canonical tables in capture flow.
- [x] Add PostgreSQL multi-row insert/upsert path (or `COPY`) for `vm_hourly_stats`.
- [x] Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
- [x] Decouple XLSX/report generation from capture hot path via async/deferred stage.
- [x] Ensure scheduled daily aggregation reads canonical data from `vm_hourly_stats` only.
- [x] Ensure scheduled monthly aggregation reads canonical data from `vm_daily_rollup` only.
- [x] Keep legacy compatibility tables enabled during this phase.
- [x] Introduce UI token layer (`--theme_*`) and map shared component primitives before page-specific redesign.

### 2. Phase 2: Canonical Dataflow
- [x] Refactor capture/reconcile ordering so canonical caches are updated first.
- [x] Move deletion/event reconciliation to one post-capture phase per vCenter.
- [x] Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
- [x] Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
- [x] Verify `snapshot_registry` logical hourly registration remains correct without normal hourly table scans.
- [x] Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
- [x] Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.

### 3. Phase 3: Postgres-Ready Scale-Up
- [x] Validate/add canonical `vm_hourly_stats` indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
- [x] Add PostgreSQL monthly partitioning for `vm_hourly_stats` behind migration controls.
- [x] Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
  - Production-scale Postgres benchmark runs completed on 2026-04-21 via one-shot canonical benchmark (`-benchmark-aggregations`, `driver=postgres`, with `runs_per_mode=1` and `runs_per_mode=3`).
  - Run A (pre-tuning), daily window `2026-04-20T00:00:00Z` to `2026-04-21T00:00:00Z`: Go `4.000602432s` (`14881` rows) vs SQL `1h17m19.039092561s` (`14920` rows), with Go ~`1159.59x` faster.
  - Run A (pre-tuning), monthly window `2026-04-01T00:00:00Z` to `2026-05-01T00:00:00Z`: Go `3.529410947s` (`15871` rows) vs SQL `3.313037973s` (`15873` rows), near parity with SQL slightly faster (~`0.216s`, `6.1%`).
  - Run B (after PostgreSQL tuning), daily window `2026-04-21T00:00:00Z` to `2026-04-22T00:00:00Z`: Go `2.277889486s` (`14831` rows) vs SQL `1m31.273491543s` (`14839` rows), with Go still ~`40.07x` faster.
  - Run B (after PostgreSQL tuning), monthly window `2026-04-01T00:00:00Z` to `2026-05-01T00:00:00Z`: Go `3.947474215s` (`15871` rows) vs SQL `2.758716002s` (`15873` rows), with SQL ~`1.43x` faster.
  - Run C (after PostgreSQL tuning, `runs=3`), daily window `2026-04-21T00:00:00Z` to `2026-04-22T00:00:00Z`: Go avg `2.261369712s` (min `2.169537168s`, median `2.191474445s`, max `2.423097524s`, rows `14831`) vs SQL avg `1m31.738727387s` (min `1m29.960115863s`, median `1m32.068576507s`, max `1m33.187489791s`, rows `14839`), with Go ~`40.57x` faster by average.
  - Run C (after PostgreSQL tuning, `runs=3`), monthly window `2026-04-01T00:00:00Z` to `2026-05-01T00:00:00Z`: Go avg `3.705308832s` (min `3.696553751s`, median `3.70776704s`, max `3.711605706s`, rows `15871`) vs SQL avg `3.065612298s` (min `2.873749798s`, median `3.022090149s`, max `3.300996948s`, rows `15873`), with SQL ~`1.21x` faster by average (~`17.26%` faster than Go).
  - Tuning impact between Run A and Run B: daily SQL improved ~`50.83x`, daily Go improved ~`1.76x`, monthly SQL improved ~`1.20x`, and monthly Go regressed (~`0.89x` of prior speed).
  - Decision remains unchanged: keep Go as scheduled default and treat SQL as fallback/backfill until SQL shows a clear, repeatable runtime win across canonical workloads, especially on daily windows (where Go remains consistently dominant across runs).
- [x] Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
- [x] If SQL wins, roll out behind a controlled flag before any default switch.

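
The speedup factors quoted in the benchmark notes above are plain ratios of the reported durations. The sketch below shows one way to reproduce them from the recorded strings using `time.ParseDuration`; the `speedup` helper is a hypothetical name introduced for this example and is not part of the benchmark tooling.

```go
package main

import (
	"fmt"
	"time"
)

// speedup returns how many times faster the `fast` duration is than `slow`,
// both given in Go duration syntax (for example "1m31.273491543s").
func speedup(slow, fast string) (float64, error) {
	s, err := time.ParseDuration(slow)
	if err != nil {
		return 0, err
	}
	f, err := time.ParseDuration(fast)
	if err != nil {
		return 0, err
	}
	if f <= 0 {
		return 0, fmt.Errorf("non-positive duration %q", fast)
	}
	return float64(s) / float64(f), nil
}

func main() {
	// Durations copied from the benchmark notes; slow first, fast second.
	cases := []struct{ label, slow, fast string }{
		{"Run A daily, Go vs SQL", "1h17m19.039092561s", "4.000602432s"},
		{"Run B daily, Go vs SQL", "1m31.273491543s", "2.277889486s"},
		{"Run C monthly avg, SQL vs Go", "3.705308832s", "3.065612298s"},
	}
	for _, c := range cases {
		ratio, err := speedup(c.slow, c.fast)
		if err != nil {
			fmt.Println(c.label, "error:", err)
			continue
		}
		fmt.Printf("%s: %.2fx\n", c.label, ratio)
	}
}
```
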
### 4. Phase 4: Compatibility Reduction
- [x] Keep legacy outputs controlled by `snapshot_table_compat_mode`.
  - Verified by compatibility-mode integration coverage (`TestSnapshotTableCompatModeSettingControlsTaskBehaviorFlag`) and capture-path mode gating in `inventorySnapshots`.
- [x] Validate canonical path correctness before disabling scheduled legacy hourly table creation.
  - Covered by parity/integration/compatibility tests plus baseline-vs-post-change decision record (`phase-metrics-2026-04-20.md`).
- [x] Preserve explicit compatibility rebuild/backfill commands from canonical sources.
  - Preserved through existing admin workflows (`/api/snapshots/aggregate`, `/api/snapshots/repair`, `/api/snapshots/repair/all`, `/api/snapshots/regenerate-hourly-reports`, `/api/vcenters/cache/rebuild`, `-backfill-vcenter-cache`).
- [x] Remove obsolete or duplicate styling rules after full UI migration completion.
  - Removed unused selectors from shared UI stylesheet (`.web2-button-group*`, `.web2-list li`) in `dist/assets/css/web3.css`; router UI asset tests remain passing.

### 5. Validation and Quality Gates
- [x] Add golden-result tests for daily output parity (old vs new path).
- [x] Add golden-result tests for monthly output parity (old vs new path).
- [x] Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
- [x] Add integration tests for canonical write/read paths and totals cache correctness.
- [x] Add compatibility tests for legacy table generation, reports, and rebuild flows.
- [x] Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
  - Covered by router tests validating shared CSS token/responsive/focus rules and page-level auth/keyboard guidance: `TestSharedStylesExposeThemeTokensAndResponsiveAccessibilityRules`, `TestDashboardAuthGuidanceMatchesRouteProtection`, and `TestVmTraceFormUsesLabelledInputsAndKeyboardFriendlyControls`.
- [x] Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
  - Evidence and gate outcomes captured in `phase-metrics-2026-04-20.md` (baseline delta table + pass/fail decisions + benchmark snapshot).

### 6. Rollout and Documentation
- [x] Update operator docs for new settings and default behavior.
- [x] Document compatibility-mode lifecycle and criteria to disable legacy table generation.
- [x] Document benchmark method/results and default-path decision record (Go vs SQL).
- [x] Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
- Completed in `README.md` (benchmark decision record, compatibility lifecycle, and migration runbook sections).

## Test Plan

### Correctness Tests
- Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
- Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
- Include edge cases for:
  - partial-day VM presence
  - missing creation times
  - deletion-time refinement
  - pool changes
  - CPU and RAM changes across samples
  - VMs identified by `VmId`, `VmUuid`, and fallback name matching

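
A golden-result parity check of the kind described above can be reduced to a diff between the legacy-path output and the canonical-path output for the same synthetic dataset. The helper below is a minimal sketch: `diffAverages` is a hypothetical name, the per-VM `map[string]float64` shape is a simplification of the real rollup rows, and the tolerance `eps` is an assumed way to absorb floating-point noise between the SQL and Go paths.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// diffAverages reports every VM whose value differs between the legacy and
// canonical outputs by more than eps, or which exists in only one of them.
// The result is sorted so test failures are stable across runs.
func diffAverages(legacy, canonical map[string]float64, eps float64) []string {
	var diffs []string
	for id, want := range legacy {
		got, ok := canonical[id]
		switch {
		case !ok:
			diffs = append(diffs, fmt.Sprintf("%s: missing from canonical output", id))
		case math.Abs(got-want) > eps:
			diffs = append(diffs, fmt.Sprintf("%s: legacy=%g canonical=%g", id, want, got))
		}
	}
	for id := range canonical {
		if _, ok := legacy[id]; !ok {
			diffs = append(diffs, fmt.Sprintf("%s: missing from legacy output", id))
		}
	}
	sort.Strings(diffs)
	return diffs
}

func main() {
	legacy := map[string]float64{"vm-a": 3.0, "vm-b": 1.0}
	canonical := map[string]float64{"vm-a": 3.0, "vm-b": 1.5, "vm-c": 2.0}
	for _, d := range diffAverages(legacy, canonical, 1e-9) {
		fmt.Println(d)
	}
}
```

In a real test, an empty result would mean the two paths agree for the dataset, and any returned lines could be passed straight to `t.Errorf`.
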
### Integration Tests
- Hourly capture writes `vm_hourly_stats`, lifecycle caches, and vCenter totals correctly.
- Daily aggregation reads canonical hourly data without scanning `inventory_hourly_*`.
- Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
- `vcenter_aggregate_totals` remains correct for hourly, daily, and monthly views.
- Trace and totals endpoints keep returning equivalent results before and after migration.
- UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.

### Compatibility Tests
- When `snapshot_table_compat_mode=true`, compatibility snapshot tables still exist and are populated.
- Reports still generate correctly from migrated data.
- Backfill and repair flows can rebuild compatibility outputs from canonical sources.
- UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.

### Performance Tests
- Measure per-vCenter capture duration.
- Measure hourly write throughput.
- Measure daily aggregation runtime.
- Measure monthly aggregation runtime.
- Measure report generation runtime when decoupled from scheduled jobs.
- Capture baseline metrics before refactor and compare after each phase.
- Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.

### UI Validation
- Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
- Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
- Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
- Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.

## Acceptance Criteria
- Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
- Scheduled daily aggregation no longer depends on `inventory_hourly_*` scans.
- Scheduled monthly aggregation no longer depends on hourly-history scans.
- Canonical caches become the source of truth for normal scheduled processing.
- Legacy compatibility behavior remains available during migration.
- Existing endpoints, reports, auth behavior, and operational commands continue to work.
- The UI reflects the design direction in `design.md` through tokenized colors, typography, spacing, radius, and shadow usage.
- The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
- The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.

## Assumptions
- Target direction is Postgres-ready and runtime-first.
- Existing endpoints, report filenames, and user-visible semantics must remain stable.
- SQLite remains supported for development, tests, and smaller installs.
- PostgreSQL is the intended scale-up target for larger environments.
- Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.