vctp2/plan.md
nathan 916b0b5054
more tests
2026-04-20 18:38:12 +10:00

# Inventory Capture and Aggregation Optimization Plan
## Summary
Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning `inventory_hourly_*` tables and regenerating reports inline.
This plan is intended to be implementation-ready for a `codex-5.3` execution pass.
Execution-path decision:
- For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
- This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
- SQL path is retained for compatibility, backfill, and fallback.
- SQL remains a future optimization candidate on canonical Postgres tables.
- SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.
The target architecture is:
1. `vm_hourly_stats` is the canonical hourly fact store.
2. `vm_daily_rollup` is the canonical monthly input.
3. Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.
## Current State
- Hourly capture already writes both per-snapshot tables and `vm_hourly_stats`.
- Daily aggregation has mixed execution paths:
  - SQL union path over `inventory_hourly_*`
  - Go path over `vm_hourly_stats` or parallel table scans
- Monthly aggregation has mixed execution paths:
  - SQL path over daily or hourly snapshot tables
  - Go path over `vm_daily_rollup` or hourly cache
- Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
- Report generation is still coupled to scheduled capture and aggregation jobs.
- The current UI is rendered through Templ pages and shared `web2`/`web3` CSS classes, but it does not yet match the visual system described in `design.md`.
- Current shipped styling still uses a different blue accent, tighter radii, default system typography, and inconsistent component hierarchy compared with the target design language.
## Implementation Goals
- Reduce hourly capture wall-clock time.
- Reduce daily and monthly aggregation runtime.
- Eliminate repeated historical table scans from the normal scheduled path.
- Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
- Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in `design.md`.
- Make authentication and role requirements easier to understand from the UI without changing the auth model.
- Preserve compatibility with SQLite for development and small installs.
- Make the runtime architecture cleanly scalable for PostgreSQL production use.
## Implementation Changes
### 1. Hourly Capture Pipeline
- Keep `GetAllVMsWithProps` as the primary vCenter inventory fetch path.
- Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
- Replace row-by-row database writes in hourly capture with batched writes.
- For PostgreSQL:
  - prefer multi-row insert/upsert or `COPY` into `vm_hourly_stats`
  - keep conflict handling on the canonical key
- For SQLite:
  - keep transactional batched insert/upsert
  - do not attempt PostgreSQL-only ingestion patterns
- During capture, write data to these canonical destinations first:
  - `vm_hourly_stats`
  - `vm_lifecycle_cache`
  - `vcenter_totals`
  - `vcenter_latest_totals`
  - `vcenter_aggregate_totals` for hourly totals
- Treat `inventory_hourly_<epoch>` as compatibility output, not as the source of truth for downstream jobs.
- Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
- In that reconciliation phase, update canonical cache tables first.
- Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
- Remove synchronous XLSX regeneration from hourly capture.
- Scheduled capture should finish once persistence and reconciliation are complete.
- Report generation should run after the capture path, either deferred within the job or via a follow-up stage.
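
The batched-write step above can be sketched as a simple chunking helper. This is a minimal illustration, not the project's capture code: the `hourlyRow` type and `chunkRows` function are hypothetical names, and the batch size of 1000 mirrors the `settings.capture_write_batch_size` default listed later in this plan. Each returned batch would then be written in one multi-row insert/upsert (PostgreSQL) or one transaction (SQLite).

```go
package main

import "fmt"

// hourlyRow is a hypothetical stand-in for one vm_hourly_stats record.
type hourlyRow struct {
	Vcenter      string
	VmId         string
	SnapshotTime int64
}

// chunkRows splits rows into consecutive batches of at most size rows.
// A non-positive size falls back to a single batch containing every row.
func chunkRows(rows []hourlyRow, size int) [][]hourlyRow {
	if len(rows) == 0 {
		return nil
	}
	if size <= 0 {
		size = len(rows)
	}
	batches := make([][]hourlyRow, 0, (len(rows)+size-1)/size)
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		// Sub-slices share the backing array, so no rows are copied.
		batches = append(batches, rows[start:end])
	}
	return batches
}

func main() {
	rows := make([]hourlyRow, 2500)
	for i, batch := range chunkRows(rows, 1000) {
		fmt.Printf("batch %d: %d rows\n", i, len(batch))
	}
}
```

With 2500 rows and a batch size of 1000, the helper yields batches of 1000, 1000, and 500 rows.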
### 2. Daily Aggregation
- Make `vm_hourly_stats` the only normal scheduled input for daily aggregation.
- Scheduled daily jobs must not build `UNION ALL` queries across `inventory_hourly_*`.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep the current SQL union path only for:
  - compatibility fallback
  - manual repair
  - backfill support where needed
- Daily aggregation output must continue writing:
  - `inventory_daily_summary_YYYYMMDD`
  - `vm_daily_rollup`
  - `snapshot_registry` daily record
  - refreshed `vcenter_aggregate_totals` daily entries
- Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
- Preserve existing daily semantics for:
  - `SamplesPresent`
  - `AvgIsPresent`
  - weighted CPU/RAM/disk averages
  - pool percentages
  - creation/deletion time behavior
### 3. Monthly Aggregation
- Make `vm_daily_rollup` the default scheduled input for monthly aggregation.
- Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep hourly-based monthly aggregation only for:
  - manual rebuilds
  - repair/backfill workflows
  - validation against old behavior
- Preserve current monthly weighting semantics based on per-day sample volumes.
- Monthly aggregation output must continue writing:
  - `inventory_monthly_summary_YYYYMM`
  - `snapshot_registry` monthly record
  - refreshed `vcenter_aggregate_totals` monthly entries
- Keep report generation behavior unchanged from the user's perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.
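
To illustrate what "weighting based on per-day sample volumes" could look like, here is a minimal sketch. It assumes the monthly average is the sample-count-weighted mean of daily averages; the plan does not spell out the exact formula, so treat this as one plausible reading rather than the existing implementation. The `dailyRollup` type, the `AvgVcpu` field, and `weightedMonthlyAvg` are hypothetical names; `SamplesPresent` is borrowed from the daily semantics listed above.

```go
package main

import "fmt"

// dailyRollup is a hypothetical, trimmed-down view of one vm_daily_rollup row.
type dailyRollup struct {
	SamplesPresent int64   // hourly samples in which the VM was seen that day
	AvgVcpu        float64 // daily average of some metric (assumed field name)
}

// weightedMonthlyAvg returns the mean of the daily averages weighted by each
// day's sample count, so a day with few samples influences the result less.
// Days with no samples are skipped; an empty month yields 0.
func weightedMonthlyAvg(days []dailyRollup) float64 {
	var weightedSum float64
	var totalSamples int64
	for _, d := range days {
		if d.SamplesPresent <= 0 {
			continue
		}
		weightedSum += d.AvgVcpu * float64(d.SamplesPresent)
		totalSamples += d.SamplesPresent
	}
	if totalSamples == 0 {
		return 0
	}
	return weightedSum / float64(totalSamples)
}

func main() {
	days := []dailyRollup{
		{SamplesPresent: 10, AvgVcpu: 2},
		{SamplesPresent: 30, AvgVcpu: 4},
	}
	// (10*2 + 30*4) / 40 = 3.5
	fmt.Println(weightedMonthlyAvg(days))
}
```

A plain mean of the two daily values would give 3; weighting by sample volume gives 3.5 because the second day contributed three times as many samples.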
### 4. Storage and Schema
- Keep these tables during migration:
  - `inventory_hourly_*`
  - `inventory_daily_summary_*`
  - `inventory_monthly_summary_*`
- Stop treating hourly snapshot tables as the normal scheduled aggregation source.
- Preserve `snapshot_registry`, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
- Validate or add the following indexes on `vm_hourly_stats` for PostgreSQL:
  - `("SnapshotTime")`
  - `("Vcenter","SnapshotTime")`
  - `("Vcenter","VmId","SnapshotTime")`
  - `("Vcenter","VmUuid","SnapshotTime")`
  - a name lookup index aligned with current trace queries
- Keep the existing trace-compatible indexes for SQLite.
- After the canonical-path migration is stable, partition `vm_hourly_stats` by snapshot month for PostgreSQL.
- Do not require partitioning for SQLite or tests.
### 5. Compatibility Mode
- Introduce an explicit compatibility mode for legacy snapshot tables.
- When compatibility mode is enabled:
  - continue writing `inventory_hourly_*`
  - continue generating legacy-compatible daily/monthly summary tables
  - continue registering snapshots as today
- When compatibility mode is disabled in a later phase:
  - scheduled jobs may skip legacy hourly table creation
  - compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
- Default to compatibility mode enabled during the transition.
### 6. Scheduling and Job Flow
- Refactor the scheduled pipeline into explicit stages:
  1. capture
  2. reconcile
  3. register and refresh totals caches
  4. optional report generation
- Daily aggregation should run only against the completed prior-day hourly data.
- Monthly aggregation should depend on daily rollup completion, not hourly history scans.
- Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
- Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.
### 7. UI Refresh and Design-System Alignment
- Use `design.md` as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
- Introduce semantic theme tokens using `--theme_*` naming in the shared stylesheet layer.
- Replace the current ad hoc `web2` color and radius values with tokenized equivalents for:
  - primary text
  - weak text
  - CTA blue
  - borders
  - surfaces
  - success states
  - button spotlight text
  - card and ambient shadows
- Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
- Keep the existing `web2` and `web3` class names if that reduces churn, but rebase them on the new token system.
- Establish a typography strategy that follows `design.md` while remaining deployable:
  - prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
  - otherwise define a documented fallback stack with similar proportions and spacing behavior
  - apply positive letter spacing to body, caption, and button treatments where appropriate
- Normalize component shape language to the design brief:
  - buttons at 12px radius
  - cards and sections at 16px to 24px radius
  - larger containers at 24px to 32px radius where needed
  - avoid the current 3px to 6px rounded treatment as the default visual language
- Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
- Refactor shared UI structure in the Templ layer:
  - `components/core/header.templ`
  - `components/core/footer.templ`
  - shared shell/header/card/button/table/form patterns used across `components/views/*`
- Add a reusable page-shell pattern so all primary pages share:
  - a consistent hero/header treatment
  - action grouping
  - content width rules
  - section spacing
  - responsive table overflow behavior
- Improve the dashboard information architecture in `components/views/index.templ`:
  - reduce the current long-form text density
  - promote primary navigation and key operational tasks
  - move build metadata into secondary status cards
  - present auth requirements and role policy as a concise callout rather than dense paragraph copy
- Improve snapshot and vCenter list pages in `components/views/snapshots.templ`:
  - stronger table hierarchy
  - clearer record counts and grouping
  - more intentional page headers and return navigation
  - responsive behavior that preserves readability on smaller screens
- Improve the VM trace page in `components/views/vm_trace.templ`:
  - upgrade search form layout and input styling
  - improve chart framing and diagnostics presentation
  - make lifecycle summary cards visually clearer
  - preserve dense tabular detail without making the page feel purely utilitarian
- Ensure the auth-enabled experience is visible in the UI:
  - clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
  - surface viewer versus admin capability differences in concise language
  - keep Swagger and operational links accessible from the main navigation
- Add accessibility and interaction requirements to the UI implementation:
  - visible focus states
  - sufficient text/background contrast
  - keyboard-usable navigation and forms
  - table layouts that remain readable with horizontal overflow
  - mobile-safe spacing and tap targets
- Keep UI changes implementation-friendly:
  - avoid introducing a large frontend framework
  - continue using Templ plus shared CSS and existing JS assets
  - prefer incremental component replacement over a full frontend rewrite
## Public Interfaces and Settings
- No HTTP API changes are required.
- Keep existing endpoints and report filenames stable.
- No auth-model changes are required for the UI refresh.
- If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
- Add these settings:
  - `settings.capture_write_batch_size`
    - default: `1000`
    - controls batched DB writes for hourly capture
  - `settings.snapshot_table_compat_mode`
    - default: `true`
    - when `true`, continue writing legacy snapshot tables during migration
  - `settings.async_report_generation`
    - default: `true`
    - when `true`, scheduled jobs defer XLSX generation from the hot path
- Keep existing settings such as:
  - `hourly_snapshot_concurrency`
  - `monthly_aggregation_granularity`
  - retry settings
  - cleanup settings
- Scheduled monthly aggregation should ignore hourly granularity unless running a manual or backfill job.
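
One way to wire the three new settings with their defaults is sketched below. The setting names and default values come from the list above; the Go types, field names, and the `resolveSettings` function are hypothetical. Pointer fields are used for the raw input so that an explicitly configured `false` can be told apart from an unset value, which matters because two of the defaults are `true`.

```go
package main

import "fmt"

// rawSettings models values as they might arrive from a config file, where
// nil means "not set". The struct and its field names are assumptions.
type rawSettings struct {
	CaptureWriteBatchSize   *int
	SnapshotTableCompatMode *bool
	AsyncReportGeneration   *bool
}

// pipelineSettings holds the resolved values used by scheduled jobs.
type pipelineSettings struct {
	CaptureWriteBatchSize   int
	SnapshotTableCompatMode bool
	AsyncReportGeneration   bool
}

// resolveSettings applies the documented defaults (1000, true, true) and
// then overlays any explicitly configured values. A non-positive batch size
// is ignored in favor of the default.
func resolveSettings(raw rawSettings) pipelineSettings {
	out := pipelineSettings{
		CaptureWriteBatchSize:   1000,
		SnapshotTableCompatMode: true,
		AsyncReportGeneration:   true,
	}
	if raw.CaptureWriteBatchSize != nil && *raw.CaptureWriteBatchSize > 0 {
		out.CaptureWriteBatchSize = *raw.CaptureWriteBatchSize
	}
	if raw.SnapshotTableCompatMode != nil {
		out.SnapshotTableCompatMode = *raw.SnapshotTableCompatMode
	}
	if raw.AsyncReportGeneration != nil {
		out.AsyncReportGeneration = *raw.AsyncReportGeneration
	}
	return out
}

func main() {
	compatOff := false
	fmt.Printf("%+v\n", resolveSettings(rawSettings{}))
	fmt.Printf("%+v\n", resolveSettings(rawSettings{SnapshotTableCompatMode: &compatOff}))
}
```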
## Execution Order
### Phase 1: Hot-Path Runtime Wins
- Add batched hourly writes.
- Decouple report generation from hourly capture.
- Ensure daily scheduled aggregation reads only from `vm_hourly_stats`.
- Ensure monthly scheduled aggregation reads only from `vm_daily_rollup`.
- Keep compatibility tables enabled.
- Define the UI token layer and shared component mapping before page-level redesign work begins.
### Phase 2: Canonical Dataflow
- Refactor reconciliation so canonical caches are updated first.
- Reduce or eliminate prior-snapshot table mutations during capture.
- Make scheduled aggregation paths canonical-only.
- Keep fallback and repair code for legacy unions/scans.
- Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.
### Phase 3: Postgres-Ready Scale-Up
- Validate index coverage on canonical tables.
- Add PostgreSQL partitioning for `vm_hourly_stats`.
- Benchmark Go and SQL aggregation paths on representative production-scale data.
- Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
- Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
- If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
- Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.
### Phase 4: Compatibility Reduction
- Keep legacy table output behind `snapshot_table_compat_mode`.
- Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
- Retain explicit backfill and rebuild commands for compatibility tables and reports.
- Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.
## Implementation Checklist
### 0. Baseline and Guardrails
- [x] Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
- [x] Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
- [x] Add new settings with defaults and config wiring:
  - [x] `settings.capture_write_batch_size=1000`
  - [x] `settings.snapshot_table_compat_mode=true`
  - [x] `settings.async_report_generation=true`
- [x] Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
- [x] Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
- Evidence snapshot: see `phase0-baseline.md` for metrics, API/report contract snapshot, and guardrail verification.
### 1. Phase 1: Hot-Path Runtime Wins
- [x] Implement batched hourly writes for canonical tables in capture flow.
- [x] Add PostgreSQL multi-row insert/upsert path (or `COPY`) for `vm_hourly_stats`.
- [x] Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
- [x] Decouple XLSX/report generation from capture hot path via async/deferred stage.
- [x] Ensure scheduled daily aggregation reads canonical data from `vm_hourly_stats` only.
- [x] Ensure scheduled monthly aggregation reads canonical data from `vm_daily_rollup` only.
- [x] Keep legacy compatibility tables enabled during this phase.
- [x] Introduce UI token layer (`--theme_*`) and map shared component primitives before page-specific redesign.
### 2. Phase 2: Canonical Dataflow
- [x] Refactor capture/reconcile ordering so canonical caches are updated first.
- [x] Move deletion/event reconciliation to one post-capture phase per vCenter.
- [x] Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
- [x] Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
- [x] Verify `snapshot_registry` logical hourly registration remains correct without normal hourly table scans.
- [x] Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
- [x] Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.
### 3. Phase 3: Postgres-Ready Scale-Up
- [x] Validate/add canonical `vm_hourly_stats` indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
- [x] Add PostgreSQL monthly partitioning for `vm_hourly_stats` behind migration controls.
- [ ] Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
  - Benchmark harness implemented via `-benchmark-aggregations` and `-benchmark-runs`; production-scale Postgres run pending.
- [x] Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
- [x] If SQL wins, roll out behind a controlled flag before any default switch.
### 4. Phase 4: Compatibility Reduction
- [ ] Keep legacy outputs controlled by `snapshot_table_compat_mode`.
- [ ] Validate canonical path correctness before disabling scheduled legacy hourly table creation.
- [ ] Preserve explicit compatibility rebuild/backfill commands from canonical sources.
- [ ] Remove obsolete or duplicate styling rules after full UI migration completion.
### 5. Validation and Quality Gates
- [ ] Add golden-result tests for daily output parity (old vs new path).
- [ ] Add golden-result tests for monthly output parity (old vs new path).
- [x] Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
- [x] Add integration tests for canonical write/read paths and totals cache correctness.
- [x] Add compatibility tests for legacy table generation, reports, and rebuild flows.
- [ ] Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
- [ ] Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
### 6. Rollout and Documentation
- [ ] Update operator docs for new settings and default behavior.
- [ ] Document compatibility-mode lifecycle and criteria to disable legacy table generation.
- [ ] Document benchmark method/results and default-path decision record (Go vs SQL).
- [ ] Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
## Test Plan
### Correctness Tests
- Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
- Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
- Include edge cases for:
  - partial-day VM presence
  - missing creation times
  - deletion-time refinement
  - pool changes
  - CPU and RAM changes across samples
  - VMs identified by `VmId`, `VmUuid`, and fallback name matching
### Integration Tests
- Hourly capture writes `vm_hourly_stats`, lifecycle caches, and vCenter totals correctly.
- Daily aggregation reads canonical hourly data without scanning `inventory_hourly_*`.
- Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
- `vcenter_aggregate_totals` remains correct for hourly, daily, and monthly views.
- Trace and totals endpoints keep returning equivalent results before and after migration.
- UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.
### Compatibility Tests
- When `snapshot_table_compat_mode=true`, compatibility snapshot tables still exist and are populated.
- Reports still generate correctly from migrated data.
- Backfill and repair flows can rebuild compatibility outputs from canonical sources.
- UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.
### Performance Tests
- Measure per-vCenter capture duration.
- Measure hourly write throughput.
- Measure daily aggregation runtime.
- Measure monthly aggregation runtime.
- Measure report generation runtime when decoupled from scheduled jobs.
- Capture baseline metrics before refactor and compare after each phase.
- Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.
### UI Validation
- Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
- Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
- Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
- Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.
## Acceptance Criteria
- Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
- Scheduled daily aggregation no longer depends on `inventory_hourly_*` scans.
- Scheduled monthly aggregation no longer depends on hourly-history scans.
- Canonical caches become the source of truth for normal scheduled processing.
- Legacy compatibility behavior remains available during migration.
- Existing endpoints, reports, auth behavior, and operational commands continue to work.
- The UI reflects the design direction in `design.md` through tokenized colors, typography, spacing, radius, and shadow usage.
- The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
- The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.
## Assumptions
- Target direction is Postgres-ready and runtime-first.
- Existing endpoints, report filenames, and user-visible semantics must remain stable.
- SQLite remains supported for development, tests, and smaller installs.
- PostgreSQL is the intended scale-up target for larger environments.
- Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.