# Inventory Capture and Aggregation Optimization Plan

## Summary

Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning `inventory_hourly_*` tables and regenerating reports inline.

This plan is intended to be implementation-ready for a `codex-5.3` execution pass.

Execution-path decision:

- For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
- This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
- SQL path is retained for compatibility, backfill, and fallback.
- SQL remains a future optimization candidate on canonical Postgres tables.
- SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.

The target architecture is:

1. `vm_hourly_stats` is the canonical hourly fact store.
2. `vm_daily_rollup` is the canonical monthly input.
3. Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.

## Current State

- Hourly capture already writes both per-snapshot tables and `vm_hourly_stats`.
- Daily aggregation has mixed execution paths:
  - SQL union path over `inventory_hourly_*`
  - Go path over `vm_hourly_stats` or parallel table scans
- Monthly aggregation has mixed execution paths:
  - SQL path over daily or hourly snapshot tables
  - Go path over `vm_daily_rollup` or hourly cache
- Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
- Report generation is still coupled to scheduled capture and aggregation jobs.
- The current UI is rendered through Templ pages and shared `web2`/`web3` CSS classes, but it does not yet match the visual system described in `design.md`.
- Current shipped styling still uses a different blue accent, tighter radii, default system typography, and inconsistent component hierarchy compared with the target design language.

## Implementation Goals

- Reduce hourly capture wall-clock time.
- Reduce daily and monthly aggregation runtime.
- Eliminate repeated historical table scans from the normal scheduled path.
- Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
- Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in `design.md`.
- Make authentication and role requirements easier to understand from the UI without changing the auth model.
- Preserve compatibility with SQLite for development and small installs.
- Make the runtime architecture cleanly scalable for PostgreSQL production use.

## Implementation Changes

### 1. Hourly Capture Pipeline

- Keep `GetAllVMsWithProps` as the primary vCenter inventory fetch path.
- Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
- Replace row-by-row database writes in hourly capture with batched writes.
- For PostgreSQL:
  - prefer multi-row insert/upsert or `COPY` into `vm_hourly_stats`
  - keep conflict handling on the canonical key
- For SQLite:
  - keep transactional batched insert/upsert
  - do not attempt PostgreSQL-only ingestion patterns
- During capture, write data to these canonical destinations first:
  - `vm_hourly_stats`
  - `vm_lifecycle_cache`
  - `vcenter_totals`
  - `vcenter_latest_totals`
  - `vcenter_aggregate_totals` for hourly totals
- Treat `inventory_hourly_*` as compatibility output, not as the source of truth for downstream jobs.
- Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
- In that reconciliation phase, update canonical cache tables first.
- Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
- Remove synchronous XLSX regeneration from hourly capture.
- Scheduled capture should finish once persistence and reconciliation are complete.
- Report generation should run after the capture path, either deferred within the job or via a follow-up stage.

### 2. Daily Aggregation

- Make `vm_hourly_stats` the only normal scheduled input for daily aggregation.
- Scheduled daily jobs must not build `UNION ALL` queries across `inventory_hourly_*`.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep the current SQL union path only for:
  - compatibility fallback
  - manual repair
  - backfill support where needed
- Daily aggregation output must continue writing:
  - `inventory_daily_summary_YYYYMMDD`
  - `vm_daily_rollup`
  - `snapshot_registry` daily record
  - refreshed `vcenter_aggregate_totals` daily entries
- Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
- Preserve existing daily semantics for:
  - `SamplesPresent`
  - `AvgIsPresent`
  - weighted CPU/RAM/disk averages
  - pool percentages
  - creation/deletion time behavior

### 3. Monthly Aggregation

- Make `vm_daily_rollup` the default scheduled input for monthly aggregation.
- Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep hourly-based monthly aggregation only for:
  - manual rebuilds
  - repair/backfill workflows
  - validation against old behavior
- Preserve current monthly weighting semantics based on per-day sample volumes.
- Monthly aggregation output must continue writing:
  - `inventory_monthly_summary_YYYYMM`
  - `snapshot_registry` monthly record
  - refreshed `vcenter_aggregate_totals` monthly entries
- Keep report generation behavior unchanged from the user’s perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.

### 4. Storage and Schema

- Keep these tables during migration:
  - `inventory_hourly_*`
  - `inventory_daily_summary_*`
  - `inventory_monthly_summary_*`
- Stop treating hourly snapshot tables as the normal scheduled aggregation source.
- Preserve `snapshot_registry`, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
- Validate or add the following indexes on `vm_hourly_stats` for PostgreSQL:
  - `("SnapshotTime")`
  - `("Vcenter","SnapshotTime")`
  - `("Vcenter","VmId","SnapshotTime")`
  - `("Vcenter","VmUuid","SnapshotTime")`
  - a name lookup index aligned with current trace queries
- Keep the existing trace-compatible indexes for SQLite.
- After the canonical-path migration is stable, partition `vm_hourly_stats` by snapshot month for PostgreSQL.
- Do not require partitioning for SQLite or tests.

### 5. Compatibility Mode

- Introduce an explicit compatibility mode for legacy snapshot tables.
- When compatibility mode is enabled:
  - continue writing `inventory_hourly_*`
  - continue generating legacy-compatible daily/monthly summary tables
  - continue registering snapshots as today
- When compatibility mode is disabled in a later phase:
  - scheduled jobs may skip legacy hourly table creation
  - compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
- Default to compatibility mode enabled during the transition.

### 6. Scheduling and Job Flow

- Refactor the scheduled pipeline into explicit stages:
  1. capture
  2. reconcile
  3. register and refresh totals caches
  4. optional report generation
- Daily aggregation should run only against the completed prior-day hourly data.
- Monthly aggregation should depend on daily rollup completion, not hourly history scans.
- Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
- Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.

### 7. UI Refresh and Design-System Alignment

- Use `design.md` as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
- Introduce semantic theme tokens using `--theme_*` naming in the shared stylesheet layer.
- Replace the current ad hoc `web2` color and radius values with tokenized equivalents for:
  - primary text
  - weak text
  - CTA blue
  - borders
  - surfaces
  - success states
  - button spotlight text
  - card and ambient shadows
- Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
- Keep the existing `web2` and `web3` class names if that reduces churn, but rebase them on the new token system.
- Establish a typography strategy that follows `design.md` while remaining deployable:
  - prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
  - otherwise define a documented fallback stack with similar proportions and spacing behavior
  - apply positive letter spacing to body, caption, and button treatments where appropriate
- Normalize component shape language to the design brief:
  - buttons at 12px radius
  - cards and sections at 16px to 24px radius
  - larger containers at 24px to 32px radius where needed
  - avoid the current 3px to 6px rounded treatment as the default visual language
- Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
- Refactor shared UI structure in the Templ layer:
  - `components/core/header.templ`
  - `components/core/footer.templ`
  - shared shell/header/card/button/table/form patterns used across `components/views/*`
- Add a reusable page-shell pattern so all primary pages share:
  - a consistent hero/header treatment
  - action grouping
  - content width rules
  - section spacing
  - responsive table overflow behavior
- Improve the dashboard information architecture in `components/views/index.templ`:
  - reduce the current long-form text density
  - promote primary navigation and key operational tasks
  - move build metadata into secondary status cards
  - present auth requirements and role policy as a concise callout rather than dense paragraph copy
- Improve snapshot and vCenter list pages in `components/views/snapshots.templ`:
  - stronger table hierarchy
  - clearer record counts and grouping
  - more intentional page headers and return navigation
  - responsive behavior that preserves readability on smaller screens
- Improve the VM trace page in `components/views/vm_trace.templ`:
  - upgrade search form layout and input styling
  - improve chart framing and diagnostics presentation
  - make lifecycle summary cards visually clearer
  - preserve dense tabular detail without making the page feel purely utilitarian
- Ensure the auth-enabled experience is visible in the UI:
  - clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
  - surface viewer versus admin capability differences in concise language
  - keep Swagger and operational links accessible from the main navigation
- Add accessibility and interaction requirements to the UI implementation:
  - visible focus states
  - sufficient text/background contrast
  - keyboard-usable navigation and forms
  - table layouts that remain readable with horizontal overflow
  - mobile-safe spacing and tap targets
- Keep UI changes implementation-friendly:
  - avoid introducing a large frontend framework
  - continue using Templ plus shared CSS and existing JS assets
  - prefer incremental component replacement over a full frontend rewrite

## Public Interfaces and Settings

- No HTTP API changes are required.
- Keep existing endpoints and report filenames stable.
- No auth-model changes are required for the UI refresh.
- If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
- Add these settings:
  - `settings.capture_write_batch_size`
    - default: `1000`
    - controls batched DB writes for hourly capture
  - `settings.snapshot_table_compat_mode`
    - default: `true`
    - when `true`, continue writing legacy snapshot tables during migration
  - `settings.async_report_generation`
    - default: `true`
    - when `true`, scheduled jobs defer XLSX generation from the hot path
- Keep existing settings such as:
  - `hourly_snapshot_concurrency`
  - `monthly_aggregation_granularity`
  - retry settings
  - cleanup settings
- Scheduled monthly aggregation should ignore hourly granularity unless running a manual or backfill job.
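
As a rough illustration of how the new settings might be wired, the sketch below defines the three defaults listed above and uses the batch size to split captured rows into write batches. The `Settings` struct, `DefaultSettings`, and `Chunk` are hypothetical names chosen for this example, not identifiers from the codebase, and the per-batch insert/upsert statement is intentionally left out.

```go
package main

import "fmt"

// Settings mirrors the three new keys from this plan.
// The Go field names are hypothetical; only the defaults come from the plan.
type Settings struct {
	CaptureWriteBatchSize   int  // settings.capture_write_batch_size
	SnapshotTableCompatMode bool // settings.snapshot_table_compat_mode
	AsyncReportGeneration   bool // settings.async_report_generation
}

// DefaultSettings returns the defaults listed in this plan.
func DefaultSettings() Settings {
	return Settings{
		CaptureWriteBatchSize:   1000,
		SnapshotTableCompatMode: true,
		AsyncReportGeneration:   true,
	}
}

// Chunk splits rows into consecutive batches of at most size elements.
// A non-positive size falls back to the default batch size.
func Chunk[T any](rows []T, size int) [][]T {
	if size <= 0 {
		size = DefaultSettings().CaptureWriteBatchSize
	}
	var out [][]T
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		out = append(out, rows[start:end])
	}
	return out
}

func main() {
	cfg := DefaultSettings()
	rows := make([]int, 2500) // stand-in for captured VM rows
	for i, batch := range Chunk(rows, cfg.CaptureWriteBatchSize) {
		// Each batch would become one multi-row upsert (PostgreSQL)
		// or one transactional batch (SQLite).
		fmt.Printf("batch %d: %d rows\n", i, len(batch))
	}
}
```

With the default batch size, 2500 rows split into batches of 1000, 1000, and 500. Keeping the chunking independent of the database driver lets the PostgreSQL and SQLite write paths share the same batching logic.
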
## Execution Order

### Phase 1: Hot-Path Runtime Wins

- Add batched hourly writes.
- Decouple report generation from hourly capture.
- Ensure daily scheduled aggregation reads only from `vm_hourly_stats`.
- Ensure monthly scheduled aggregation reads only from `vm_daily_rollup`.
- Keep compatibility tables enabled.
- Define the UI token layer and shared component mapping before page-level redesign work begins.

### Phase 2: Canonical Dataflow

- Refactor reconciliation so canonical caches are updated first.
- Reduce or eliminate prior-snapshot table mutations during capture.
- Make scheduled aggregation paths canonical-only.
- Keep fallback and repair code for legacy unions/scans.
- Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.

### Phase 3: Postgres-Ready Scale-Up

- Validate index coverage on canonical tables.
- Add PostgreSQL partitioning for `vm_hourly_stats`.
- Benchmark Go and SQL aggregation paths on representative production-scale data.
- Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
- Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
- If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
- Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.

### Phase 4: Compatibility Reduction

- Keep legacy table output behind `snapshot_table_compat_mode`.
- Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
- Retain explicit backfill and rebuild commands for compatibility tables and reports.
- Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.

## Implementation Checklist

### 0. Baseline and Guardrails

- [x] Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
- [x] Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
- [x] Add new settings with defaults and config wiring:
  - [x] `settings.capture_write_batch_size=1000`
  - [x] `settings.snapshot_table_compat_mode=true`
  - [x] `settings.async_report_generation=true`
- [x] Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
- [x] Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
- Evidence snapshot: see `phase0-baseline.md` for metrics, API/report contract snapshot, and guardrail verification.

### 1. Phase 1: Hot-Path Runtime Wins

- [x] Implement batched hourly writes for canonical tables in capture flow.
- [x] Add PostgreSQL multi-row insert/upsert path (or `COPY`) for `vm_hourly_stats`.
- [x] Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
- [x] Decouple XLSX/report generation from capture hot path via async/deferred stage.
- [x] Ensure scheduled daily aggregation reads canonical data from `vm_hourly_stats` only.
- [x] Ensure scheduled monthly aggregation reads canonical data from `vm_daily_rollup` only.
- [x] Keep legacy compatibility tables enabled during this phase.
- [x] Introduce UI token layer (`--theme_*`) and map shared component primitives before page-specific redesign.

### 2. Phase 2: Canonical Dataflow

- [x] Refactor capture/reconcile ordering so canonical caches are updated first.
- [x] Move deletion/event reconciliation to one post-capture phase per vCenter.
- [x] Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
- [x] Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
- [x] Verify `snapshot_registry` logical hourly registration remains correct without normal hourly table scans.
- [x] Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
- [x] Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.

### 3. Phase 3: Postgres-Ready Scale-Up

- [x] Validate/add canonical `vm_hourly_stats` indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
- [x] Add PostgreSQL monthly partitioning for `vm_hourly_stats` behind migration controls.
- [x] Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
  - Production-scale Postgres run completed on 2026-04-21 via one-shot canonical benchmark (`-benchmark-aggregations` with `runs_per_mode=1`, `driver=postgres`).
  - Daily window `2026-04-20T00:00:00Z` to `2026-04-21T00:00:00Z`: Go `4.000602432s` (`14881` rows) vs SQL `1h17m19.039092561s` (`14920` rows), with Go ~`1159.59x` faster on this run.
  - Monthly window `2026-04-01T00:00:00Z` to `2026-05-01T00:00:00Z`: Go `3.529410947s` (`15871` rows) vs SQL `3.313037973s` (`15873` rows), near parity with SQL slightly faster (~`0.216s`, `6.1%`).
  - Decision remains unchanged: keep Go as scheduled default and treat SQL as fallback/backfill until SQL shows a clear, repeatable runtime win across canonical workloads.
- [x] Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
- [x] If SQL wins, roll out behind a controlled flag before any default switch.

### 4. Phase 4: Compatibility Reduction

- [x] Keep legacy outputs controlled by `snapshot_table_compat_mode`.
  - Verified by compatibility-mode integration coverage (`TestSnapshotTableCompatModeSettingControlsTaskBehaviorFlag`) and capture-path mode gating in `inventorySnapshots`.
- [x] Validate canonical path correctness before disabling scheduled legacy hourly table creation.
  - Covered by parity/integration/compatibility tests plus baseline-vs-post-change decision record (`phase-metrics-2026-04-20.md`).
- [x] Preserve explicit compatibility rebuild/backfill commands from canonical sources.
  - Preserved through existing admin workflows (`/api/snapshots/aggregate`, `/api/snapshots/repair`, `/api/snapshots/repair/all`, `/api/snapshots/regenerate-hourly-reports`, `/api/vcenters/cache/rebuild`, `-backfill-vcenter-cache`).
- [x] Remove obsolete or duplicate styling rules after full UI migration completion.
  - Removed unused selectors from shared UI stylesheet (`.web2-button-group*`, `.web2-list li`) in `dist/assets/css/web3.css`; router UI asset tests remain passing.

### 5. Validation and Quality Gates

- [x] Add golden-result tests for daily output parity (old vs new path).
- [x] Add golden-result tests for monthly output parity (old vs new path).
- [x] Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
- [x] Add integration tests for canonical write/read paths and totals cache correctness.
- [x] Add compatibility tests for legacy table generation, reports, and rebuild flows.
- [x] Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
  - Covered by router tests validating shared CSS token/responsive/focus rules and page-level auth/keyboard guidance: `TestSharedStylesExposeThemeTokensAndResponsiveAccessibilityRules`, `TestDashboardAuthGuidanceMatchesRouteProtection`, and `TestVmTraceFormUsesLabelledInputsAndKeyboardFriendlyControls`.
- [x] Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
  - Evidence and gate outcomes captured in `phase-metrics-2026-04-20.md` (baseline delta table + pass/fail decisions + benchmark snapshot).

### 6. Rollout and Documentation

- [x] Update operator docs for new settings and default behavior.
- [x] Document compatibility-mode lifecycle and criteria to disable legacy table generation.
- [x] Document benchmark method/results and default-path decision record (Go vs SQL).
- [x] Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
- Completed in `README.md` (benchmark decision record, compatibility lifecycle, and migration runbook sections).

## Test Plan

### Correctness Tests

- Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
- Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
- Include edge cases for:
  - partial-day VM presence
  - missing creation times
  - deletion-time refinement
  - pool changes
  - CPU and RAM changes across samples
  - VMs identified by `VmId`, `VmUuid`, and fallback name matching

### Integration Tests

- Hourly capture writes `vm_hourly_stats`, lifecycle caches, and vCenter totals correctly.
- Daily aggregation reads canonical hourly data without scanning `inventory_hourly_*`.
- Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
- `vcenter_aggregate_totals` remains correct for hourly, daily, and monthly views.
- Trace and totals endpoints keep returning equivalent results before and after migration.
- UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.

### Compatibility Tests

- When `snapshot_table_compat_mode=true`, compatibility snapshot tables still exist and are populated.
- Reports still generate correctly from migrated data.
- Backfill and repair flows can rebuild compatibility outputs from canonical sources.
- UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.

### Performance Tests

- Measure per-vCenter capture duration.
- Measure hourly write throughput.
- Measure daily aggregation runtime.
- Measure monthly aggregation runtime.
- Measure report generation runtime when decoupled from scheduled jobs.
- Capture baseline metrics before refactor and compare after each phase.
- Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.

### UI Validation

- Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
- Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
- Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
- Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.

## Acceptance Criteria

- Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
- Scheduled daily aggregation no longer depends on `inventory_hourly_*` scans.
- Scheduled monthly aggregation no longer depends on hourly-history scans.
- Canonical caches become the source of truth for normal scheduled processing.
- Legacy compatibility behavior remains available during migration.
- Existing endpoints, reports, auth behavior, and operational commands continue to work.
- The UI reflects the design direction in `design.md` through tokenized colors, typography, spacing, radius, and shadow usage.
- The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
- The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.

## Assumptions

- Target direction is Postgres-ready and runtime-first.
- Existing endpoints, report filenames, and user-visible semantics must remain stable.
- SQLite remains supported for development, tests, and smaller installs.
- PostgreSQL is the intended scale-up target for larger environments.
- Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.
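
For reference when writing the monthly golden-result tests, the requirement to preserve monthly weighting based on per-day sample volumes can be read as a sample-weighted mean of daily averages. The sketch below shows that reading only; it is an assumption about the intended semantics, not the project's aggregation code. `DailyRollup`, `AvgVcpu`, and `MonthlyWeightedAvg` are hypothetical names, while `SamplesPresent` follows the daily field named in this plan.

```go
package main

import "fmt"

// DailyRollup is a hypothetical, trimmed view of one vm_daily_rollup row.
type DailyRollup struct {
	AvgVcpu        float64 // daily average for one metric (illustrative)
	SamplesPresent int     // hourly samples in which the VM was present that day
}

// MonthlyWeightedAvg weights each daily average by its sample count, so a
// day with 24 samples counts three times as much as a day with 8.
// It returns 0 when no day contributed any samples.
func MonthlyWeightedAvg(days []DailyRollup) float64 {
	var weightedSum float64
	var totalSamples int
	for _, d := range days {
		if d.SamplesPresent <= 0 {
			continue // days without samples carry no weight
		}
		weightedSum += d.AvgVcpu * float64(d.SamplesPresent)
		totalSamples += d.SamplesPresent
	}
	if totalSamples == 0 {
		return 0
	}
	return weightedSum / float64(totalSamples)
}

func main() {
	days := []DailyRollup{
		{AvgVcpu: 2, SamplesPresent: 24}, // full day at 2 vCPU
		{AvgVcpu: 4, SamplesPresent: 8},  // partial day at 4 vCPU
	}
	fmt.Println(MonthlyWeightedAvg(days)) // (2*24 + 4*8) / 32 = 2.5
}
```

An unweighted mean of the two daily values would give 3, so a parity test built on a partial-day VM like this one can detect a change in weighting behavior.
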