vctp2/plan.md
nathan 2c3167a1a0
2026-04-20 19:40:01 +10:00

Inventory Capture and Aggregation Optimization Plan

Summary

Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning inventory_hourly_* tables and regenerating reports inline.

This plan is intended to be implementation-ready for a codex-5.3 execution pass.

Execution-path decision:

  • For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
  • This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
  • SQL path is retained for compatibility, backfill, and fallback.
  • SQL remains a future optimization candidate on canonical Postgres tables.
  • SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.

The target architecture is:

  1. vm_hourly_stats is the canonical hourly fact store.
  2. vm_daily_rollup is the canonical monthly input.
  3. Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.

Current State

  • Hourly capture already writes both per-snapshot tables and vm_hourly_stats.
  • Daily aggregation has mixed execution paths:
    • SQL union path over inventory_hourly_*
    • Go path over vm_hourly_stats or parallel table scans
  • Monthly aggregation has mixed execution paths:
    • SQL path over daily or hourly snapshot tables
    • Go path over vm_daily_rollup or hourly cache
  • Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
  • Report generation is still coupled to scheduled capture and aggregation jobs.
  • The current UI is rendered through Templ pages and shared web2/web3 CSS classes, but it does not yet match the visual system described in design.md.
  • Current shipped styling still uses a different blue accent, tighter radii, default system typography, and inconsistent component hierarchy compared with the target design language.

Implementation Goals

  • Reduce hourly capture wall-clock time.
  • Reduce daily and monthly aggregation runtime.
  • Eliminate repeated historical table scans from the normal scheduled path.
  • Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
  • Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in design.md.
  • Make authentication and role requirements easier to understand from the UI without changing the auth model.
  • Preserve compatibility with SQLite for development and small installs.
  • Make the runtime architecture cleanly scalable for PostgreSQL production use.

Implementation Changes

1. Hourly Capture Pipeline

  • Keep GetAllVMsWithProps as the primary vCenter inventory fetch path.
  • Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
  • Replace row-by-row database writes in hourly capture with batched writes.
  • For PostgreSQL:
    • prefer multi-row insert/upsert or COPY into vm_hourly_stats
    • keep conflict handling on the canonical key
  • For SQLite:
    • keep transactional batched insert/upsert
    • do not attempt PostgreSQL-only ingestion patterns
  • During capture, write data to these canonical destinations first:
    • vm_hourly_stats
    • vm_lifecycle_cache
    • vcenter_totals
    • vcenter_latest_totals
    • vcenter_aggregate_totals for hourly totals
  • Treat inventory_hourly_<epoch> as compatibility output, not as the source of truth for downstream jobs.
  • Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
  • In that reconciliation phase, update canonical cache tables first.
  • Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
  • Remove synchronous XLSX regeneration from hourly capture.
  • Scheduled capture should finish once persistence and reconciliation are complete.
  • Report generation should run after the capture path, either deferred within the job or via a follow-up stage.
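
The batched write path described above could look roughly like the following. This is a minimal sketch, not the project's capture code: the hourlyRow type, its four columns, the conflict key, and the chunk and buildUpsert helpers are hypothetical. Only the vm_hourly_stats table name and the 1000-row default come from this plan.

```go
package main

import (
	"fmt"
	"strings"
)

// hourlyRow is a hypothetical, trimmed-down row; the real vm_hourly_stats
// schema has more columns.
type hourlyRow struct {
	Vcenter      string
	VmId         string
	SnapshotTime int64
	VcpuCount    int64
}

// chunk splits rows into batches of at most size rows each.
func chunk(rows []hourlyRow, size int) [][]hourlyRow {
	if size <= 0 {
		size = 1000 // mirrors the planned capture_write_batch_size default
	}
	var out [][]hourlyRow
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		out = append(out, rows[start:end])
	}
	return out
}

// buildUpsert returns one multi-row PostgreSQL upsert statement and its
// arguments for a single batch. The conflict key shown is an assumption.
func buildUpsert(batch []hourlyRow) (string, []any) {
	const cols = 4
	var sb strings.Builder
	sb.WriteString(`INSERT INTO vm_hourly_stats ("Vcenter","VmId","SnapshotTime","VcpuCount") VALUES `)
	args := make([]any, 0, len(batch)*cols)
	for i, r := range batch {
		if i > 0 {
			sb.WriteString(",")
		}
		n := i * cols
		fmt.Fprintf(&sb, "($%d,$%d,$%d,$%d)", n+1, n+2, n+3, n+4)
		args = append(args, r.Vcenter, r.VmId, r.SnapshotTime, r.VcpuCount)
	}
	sb.WriteString(` ON CONFLICT ("Vcenter","VmId","SnapshotTime") DO UPDATE SET "VcpuCount" = EXCLUDED."VcpuCount"`)
	return sb.String(), args
}

func main() {
	rows := make([]hourlyRow, 2500)
	for _, b := range chunk(rows, 1000) {
		query, args := buildUpsert(b)
		fmt.Println(len(b), len(args), len(query) > 0)
	}
}
```

Each statement and argument slice would be passed to a single Exec call inside one transaction per vCenter. With the full column set, batch size multiplied by column count has to stay under PostgreSQL's 65535 bind-parameter limit; the SQLite path can reuse chunk with its own statement builder.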

2. Daily Aggregation

  • Make vm_hourly_stats the only normal scheduled input for daily aggregation.
  • Scheduled daily jobs must not build UNION ALL queries across inventory_hourly_*.
  • Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
  • Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
  • Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
  • Treat the SQL path as non-default compatibility and fallback behavior.
  • Do not treat this as a permanent rejection of SQL.
  • Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
  • Keep the current SQL union path only for:
    • compatibility fallback
    • manual repair
    • backfill support where needed
  • Daily aggregation output must continue writing:
    • inventory_daily_summary_YYYYMMDD
    • vm_daily_rollup
    • snapshot_registry daily record
    • refreshed vcenter_aggregate_totals daily entries
  • Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
  • Preserve existing daily semantics for:
    • SamplesPresent
    • AvgIsPresent
    • weighted CPU/RAM/disk averages
    • pool percentages
    • creation/deletion time behavior
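
As an illustration of a daily pass that reads only canonical hourly rows, the sketch below selects the prior day's window and derives per-VM presence figures. It rests on several assumptions that the existing code may not share: SnapshotTime is Unix seconds, day boundaries are UTC, AvgIsPresent is the fraction of the day's snapshots in which the VM appeared, and resource averages are plain means over present samples. The hourlySample and dailyRow types are hypothetical; the existing daily semantics take precedence wherever they differ.

```go
package main

import "fmt"

const secondsPerDay = 86400

// priorDayWindow returns the [start, end) Unix-second bounds of the last
// fully completed UTC day before nowUnix. UTC boundaries are an assumption.
func priorDayWindow(nowUnix int64) (start, end int64) {
	end = nowUnix - nowUnix%secondsPerDay
	return end - secondsPerDay, end
}

// hourlySample is a hypothetical subset of one vm_hourly_stats row.
type hourlySample struct {
	VmId         string
	SnapshotTime int64
	Vcpu         int64
}

// dailyRow is a hypothetical subset of one vm_daily_rollup row.
type dailyRow struct {
	SamplesPresent int
	AvgIsPresent   float64
	AvgVcpu        float64
}

// aggregateDay folds the samples inside [start, end) into one row per VM.
// totalSnapshots is the number of hourly snapshots taken in that window.
func aggregateDay(samples []hourlySample, start, end int64, totalSnapshots int) map[string]dailyRow {
	sums := map[string]int64{}
	out := map[string]dailyRow{}
	for _, s := range samples {
		if s.SnapshotTime < start || s.SnapshotTime >= end {
			continue
		}
		row := out[s.VmId]
		row.SamplesPresent++
		out[s.VmId] = row
		sums[s.VmId] += s.Vcpu
	}
	for id, row := range out {
		row.AvgVcpu = float64(sums[id]) / float64(row.SamplesPresent)
		if totalSnapshots > 0 {
			row.AvgIsPresent = float64(row.SamplesPresent) / float64(totalSnapshots)
		}
		out[id] = row
	}
	return out
}

func main() {
	start, end := priorDayWindow(3*secondsPerDay + 5000)
	samples := []hourlySample{
		{"vm-1", start, 2},
		{"vm-1", start + 3600, 4},
		{"vm-1", end, 8}, // belongs to the next day and is ignored
	}
	for id, row := range aggregateDay(samples, start, end, 24) {
		fmt.Printf("%s %+v\n", id, row)
	}
}
```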

3. Monthly Aggregation

  • Make vm_daily_rollup the default scheduled input for monthly aggregation.
  • Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
  • Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
  • Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
  • Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
  • Treat the SQL path as non-default compatibility and fallback behavior.
  • Do not treat this as a permanent rejection of SQL.
  • Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
  • Keep hourly-based monthly aggregation only for:
    • manual rebuilds
    • repair/backfill workflows
    • validation against old behavior
  • Preserve current monthly weighting semantics based on per-day sample volumes.
  • Monthly aggregation output must continue writing:
    • inventory_monthly_summary_YYYYMM
    • snapshot_registry monthly record
    • refreshed vcenter_aggregate_totals monthly entries
  • Keep report generation behavior unchanged from the user's perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.
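
The monthly weighting rule above (daily values weighted by per-day sample volume) can be illustrated as follows. This is a sketch under stated assumptions: the dailyRollup type and its two fields are hypothetical stand-ins for vm_daily_rollup columns, and only one metric is shown.

```go
package main

import "fmt"

// dailyRollup is a hypothetical subset of one vm_daily_rollup row for a VM.
type dailyRollup struct {
	Samples int     // hourly samples in which the VM was present that day
	AvgVcpu float64 // that day's average vCPU count
}

// monthlyWeightedAvg weights each day's average by its sample volume, so a
// day with 24 samples counts three times as much as a day with 8.
func monthlyWeightedAvg(days []dailyRollup) float64 {
	var weighted float64
	var total int
	for _, d := range days {
		weighted += d.AvgVcpu * float64(d.Samples)
		total += d.Samples
	}
	if total == 0 {
		return 0
	}
	return weighted / float64(total)
}

func main() {
	days := []dailyRollup{{Samples: 24, AvgVcpu: 2}, {Samples: 8, AvgVcpu: 4}}
	fmt.Println(monthlyWeightedAvg(days))
}
```

Here the weighted result is (24·2 + 8·4) / 32 = 2.5, whereas an unweighted mean of the two daily averages would give 3. That gap is the kind of drift the golden-result parity tests are meant to catch.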

4. Storage and Schema

  • Keep these tables during migration:
    • inventory_hourly_*
    • inventory_daily_summary_*
    • inventory_monthly_summary_*
  • Stop treating hourly snapshot tables as the normal scheduled aggregation source.
  • Preserve snapshot_registry, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
  • Validate or add the following indexes on vm_hourly_stats for PostgreSQL:
    • ("SnapshotTime")
    • ("Vcenter","SnapshotTime")
    • ("Vcenter","VmId","SnapshotTime")
    • ("Vcenter","VmUuid","SnapshotTime")
    • a name lookup index aligned with current trace queries
  • Keep the existing trace-compatible indexes for SQLite.
  • After the canonical-path migration is stable, partition vm_hourly_stats by snapshot month for PostgreSQL.
  • Do not require partitioning for SQLite or tests.
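
One way to keep the PostgreSQL-only DDL out of the SQLite path is to generate it from Go at migration time. The sketch below builds the index and monthly-partition statements listed above. The index and partition naming schemes are hypothetical, and the partition bounds assume SnapshotTime holds Unix seconds and that the parent table is declared with PARTITION BY RANGE ("SnapshotTime"); neither is confirmed by this plan. The name lookup index is omitted because its columns are not specified here.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// indexDDL builds a CREATE INDEX statement over quoted, case-sensitive
// column names. The idx_<table>_<cols> naming scheme is hypothetical.
func indexDDL(table string, cols []string) string {
	quoted := make([]string, len(cols))
	for i, c := range cols {
		quoted[i] = `"` + c + `"`
	}
	name := "idx_" + table + "_" + strings.ToLower(strings.Join(cols, "_"))
	return fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)",
		name, table, strings.Join(quoted, ","))
}

// monthBounds returns the [start, end) Unix seconds of a UTC calendar month.
func monthBounds(year int, month time.Month) (int64, int64) {
	start := time.Date(year, month, 1, 0, 0, 0, 0, time.UTC)
	return start.Unix(), start.AddDate(0, 1, 0).Unix()
}

// partitionDDL builds the statement for one monthly range partition.
func partitionDDL(parent string, year int, month time.Month) string {
	start, end := monthBounds(year, month)
	return fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s_%04d_%02d PARTITION OF %s FOR VALUES FROM (%d) TO (%d)",
		parent, year, int(month), parent, start, end)
}

func main() {
	for _, cols := range [][]string{
		{"SnapshotTime"},
		{"Vcenter", "SnapshotTime"},
		{"Vcenter", "VmId", "SnapshotTime"},
		{"Vcenter", "VmUuid", "SnapshotTime"},
	} {
		fmt.Println(indexDDL("vm_hourly_stats", cols))
	}
	fmt.Println(partitionDDL("vm_hourly_stats", 2026, time.April))
}
```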

5. Compatibility Mode

  • Introduce an explicit compatibility mode for legacy snapshot tables.
  • When compatibility mode is enabled:
    • continue writing inventory_hourly_*
    • continue generating legacy-compatible daily/monthly summary tables
    • continue registering snapshots as they are today
  • When compatibility mode is disabled in a later phase:
    • scheduled jobs may skip legacy hourly table creation
    • compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
  • Default to compatibility mode enabled during the transition.

6. Scheduling and Job Flow

  • Refactor the scheduled pipeline into explicit stages:
    1. capture
    2. reconcile
    3. register and refresh totals caches
    4. optional report generation
  • Daily aggregation should run only against the completed prior-day hourly data.
  • Monthly aggregation should depend on daily rollup completion, not hourly history scans.
  • Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
  • Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.

7. UI Refresh and Design-System Alignment

  • Use design.md as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
  • Introduce semantic theme tokens using --theme_* naming in the shared stylesheet layer.
  • Replace the current ad hoc web2 color and radius values with tokenized equivalents for:
    • primary text
    • weak text
    • CTA blue
    • borders
    • surfaces
    • success states
    • button spotlight text
    • card and ambient shadows
  • Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
  • Keep the existing web2 and web3 class names if that reduces churn, but rebase them on the new token system.
  • Establish a typography strategy that follows design.md while remaining deployable:
    • prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
    • otherwise define a documented fallback stack with similar proportions and spacing behavior
    • apply positive letter spacing to body, caption, and button treatments where appropriate
  • Normalize component shape language to the design brief:
    • buttons at 12px radius
    • cards and sections at 16px to 24px radius
    • larger containers at 24px to 32px radius where needed
    • avoid the current 3px to 6px rounded treatment as the default visual language
  • Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
  • Refactor shared UI structure in the Templ layer:
    • components/core/header.templ
    • components/core/footer.templ
    • shared shell/header/card/button/table/form patterns used across components/views/*
  • Add a reusable page-shell pattern so all primary pages share:
    • a consistent hero/header treatment
    • action grouping
    • content width rules
    • section spacing
    • responsive table overflow behavior
  • Improve the dashboard information architecture in components/views/index.templ:
    • reduce the current long-form text density
    • promote primary navigation and key operational tasks
    • move build metadata into secondary status cards
    • present auth requirements and role policy as a concise callout rather than dense paragraph copy
  • Improve snapshot and vCenter list pages in components/views/snapshots.templ:
    • stronger table hierarchy
    • clearer record counts and grouping
    • more intentional page headers and return navigation
    • responsive behavior that preserves readability on smaller screens
  • Improve the VM trace page in components/views/vm_trace.templ:
    • upgrade search form layout and input styling
    • improve chart framing and diagnostics presentation
    • make lifecycle summary cards visually clearer
    • preserve dense tabular detail without making the page feel purely utilitarian
  • Ensure the auth-enabled experience is visible in the UI:
    • clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
    • surface viewer versus admin capability differences in concise language
    • keep Swagger and operational links accessible from the main navigation
  • Add accessibility and interaction requirements to the UI implementation:
    • visible focus states
    • sufficient text/background contrast
    • keyboard-usable navigation and forms
    • table layouts that remain readable with horizontal overflow
    • mobile-safe spacing and tap targets
  • Keep UI changes implementation-friendly:
    • avoid introducing a large frontend framework
    • continue using Templ plus shared CSS and existing JS assets
    • prefer incremental component replacement over a full frontend rewrite

Public Interfaces and Settings

  • No HTTP API changes are required.
  • Keep existing endpoints and report filenames stable.
  • No auth-model changes are required for the UI refresh.
  • If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
  • Add these settings:
    • settings.capture_write_batch_size
      • default: 1000
      • controls batched DB writes for hourly capture
    • settings.snapshot_table_compat_mode
      • default: true
      • when true, continue writing legacy snapshot tables during migration
    • settings.async_report_generation
      • default: true
      • when true, scheduled jobs defer XLSX generation from the hot path
  • Keep existing settings such as:
    • hourly_snapshot_concurrency
    • monthly_aggregation_granularity
    • retry settings
    • cleanup settings
  • Scheduled monthly aggregation should ignore an hourly monthly_aggregation_granularity value; hourly granularity applies only to manual or backfill jobs.
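
A sketch of how the new settings and their defaults might be represented in Go follows. The Settings struct, its field names, the helper methods, and the "hourly"/"daily" granularity values are assumptions for illustration; only the three setting names and their defaults come from the list above.

```go
package main

import "fmt"

// Settings holds the settings discussed in this plan. Field names are
// hypothetical; the real config struct and its tags may differ.
type Settings struct {
	CaptureWriteBatchSize         int
	SnapshotTableCompatMode       bool
	AsyncReportGeneration         bool
	MonthlyAggregationGranularity string
}

// defaultSettings returns the defaults listed in the plan. A constructor is
// used because Go's zero value for bool is false, while both flags default
// to true.
func defaultSettings() Settings {
	return Settings{
		CaptureWriteBatchSize:         1000,
		SnapshotTableCompatMode:       true,
		AsyncReportGeneration:         true,
		MonthlyAggregationGranularity: "daily",
	}
}

// batchSize guards against zero or negative configured values.
func (s Settings) batchSize() int {
	if s.CaptureWriteBatchSize <= 0 {
		return 1000
	}
	return s.CaptureWriteBatchSize
}

// monthlyInput picks the input for monthly aggregation. Scheduled runs
// always use the daily rollup; only manual or backfill runs may honor an
// "hourly" granularity.
func (s Settings) monthlyInput(scheduled bool) string {
	if !scheduled && s.MonthlyAggregationGranularity == "hourly" {
		return "hourly"
	}
	return "daily"
}

func main() {
	s := defaultSettings()
	s.MonthlyAggregationGranularity = "hourly"
	fmt.Println(s.batchSize(), s.monthlyInput(true), s.monthlyInput(false))
}
```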

Execution Order

Phase 1: Hot-Path Runtime Wins

  • Add batched hourly writes.
  • Decouple report generation from hourly capture.
  • Ensure daily scheduled aggregation reads only from vm_hourly_stats.
  • Ensure monthly scheduled aggregation reads only from vm_daily_rollup.
  • Keep compatibility tables enabled.
  • Define the UI token layer and shared component mapping before page-level redesign work begins.

Phase 2: Canonical Dataflow

  • Refactor reconciliation so canonical caches are updated first.
  • Reduce or eliminate prior-snapshot table mutations during capture.
  • Make scheduled aggregation paths canonical-only.
  • Keep fallback and repair code for legacy unions/scans.
  • Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.

Phase 3: Postgres-Ready Scale-Up

  • Validate index coverage on canonical tables.
  • Add PostgreSQL partitioning for vm_hourly_stats.
  • Benchmark Go and SQL aggregation paths on representative production-scale data.
  • Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
  • Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
  • If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
  • Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.

Phase 4: Compatibility Reduction

  • Keep legacy table output behind snapshot_table_compat_mode.
  • Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
  • Retain explicit backfill and rebuild commands for compatibility tables and reports.
  • Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.

Implementation Checklist

0. Baseline and Guardrails

  • Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
  • Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
  • Add new settings with defaults and config wiring:
    • settings.capture_write_batch_size=1000
    • settings.snapshot_table_compat_mode=true
    • settings.async_report_generation=true
  • Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
  • Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
  • Evidence snapshot: see phase0-baseline.md for metrics, API/report contract snapshot, and guardrail verification.

1. Phase 1: Hot-Path Runtime Wins

  • Implement batched hourly writes for canonical tables in capture flow.
  • Add PostgreSQL multi-row insert/upsert path (or COPY) for vm_hourly_stats.
  • Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
  • Decouple XLSX/report generation from capture hot path via async/deferred stage.
  • Ensure scheduled daily aggregation reads canonical data from vm_hourly_stats only.
  • Ensure scheduled monthly aggregation reads canonical data from vm_daily_rollup only.
  • Keep legacy compatibility tables enabled during this phase.
  • Introduce UI token layer (--theme_*) and map shared component primitives before page-specific redesign.

2. Phase 2: Canonical Dataflow

  • Refactor capture/reconcile ordering so canonical caches are updated first.
  • Move deletion/event reconciliation to one post-capture phase per vCenter.
  • Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
  • Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
  • Verify snapshot_registry logical hourly registration remains correct without normal hourly table scans.
  • Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
  • Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.

3. Phase 3: Postgres-Ready Scale-Up

  • Validate/add canonical vm_hourly_stats indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
  • Add PostgreSQL monthly partitioning for vm_hourly_stats behind migration controls.
  • Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
    • Benchmark harness implemented via -benchmark-aggregations and -benchmark-runs; production-scale Postgres run pending.
  • Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
  • If SQL wins, roll out behind a controlled flag before any default switch.

4. Phase 4: Compatibility Reduction

  • Keep legacy outputs controlled by snapshot_table_compat_mode.
    • Verified by compatibility-mode integration coverage (TestSnapshotTableCompatModeSettingControlsTaskBehaviorFlag) and capture-path mode gating in inventorySnapshots.
  • Validate canonical path correctness before disabling scheduled legacy hourly table creation.
    • Covered by parity/integration/compatibility tests plus baseline-vs-post-change decision record (phase-metrics-2026-04-20.md).
  • Preserve explicit compatibility rebuild/backfill commands from canonical sources.
    • Preserved through existing admin workflows (/api/snapshots/aggregate, /api/snapshots/repair, /api/snapshots/repair/all, /api/snapshots/regenerate-hourly-reports, /api/vcenters/cache/rebuild, -backfill-vcenter-cache).
  • Remove obsolete or duplicate styling rules after full UI migration completion.
    • Removed unused selectors from shared UI stylesheet (.web2-button-group*, .web2-list li) in dist/assets/css/web3.css; router UI asset tests remain passing.

5. Validation and Quality Gates

  • Add golden-result tests for daily output parity (old vs new path).
  • Add golden-result tests for monthly output parity (old vs new path).
  • Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
  • Add integration tests for canonical write/read paths and totals cache correctness.
  • Add compatibility tests for legacy table generation, reports, and rebuild flows.
  • Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
    • Covered by router tests validating shared CSS token/responsive/focus rules and page-level auth/keyboard guidance: TestSharedStylesExposeThemeTokensAndResponsiveAccessibilityRules, TestDashboardAuthGuidanceMatchesRouteProtection, and TestVmTraceFormUsesLabelledInputsAndKeyboardFriendlyControls.
  • Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
    • Evidence and gate outcomes captured in phase-metrics-2026-04-20.md (baseline delta table + pass/fail decisions + benchmark snapshot).

6. Rollout and Documentation

  • Update operator docs for new settings and default behavior.
  • Document compatibility-mode lifecycle and criteria to disable legacy table generation.
  • Document benchmark method/results and default-path decision record (Go vs SQL).
  • Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
    • Completed in README.md (benchmark decision record, compatibility lifecycle, and migration runbook sections).

Test Plan

Correctness Tests

  • Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
  • Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
  • Include edge cases for:
    • partial-day VM presence
    • missing creation times
    • deletion-time refinement
    • pool changes
    • CPU and RAM changes across samples
    • VMs identified by VmId, VmUuid, and fallback name matching
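
A golden-result comparison between the old and new paths could be built around a helper like the one below. It is a sketch only: the rollupRow type, its fields, and diffRollups are hypothetical. A real parity test would cover every output column and call the helper from a Go test function.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// rollupRow is a hypothetical subset of one aggregated output row.
type rollupRow struct {
	SamplesPresent int
	AvgVcpu        float64
}

// diffRollups compares the old-path and new-path outputs keyed by VM and
// returns a sorted list of human-readable differences. Floating-point
// fields are compared within eps to tolerate summation-order differences.
func diffRollups(oldPath, newPath map[string]rollupRow, eps float64) []string {
	var diffs []string
	for id, want := range oldPath {
		got, ok := newPath[id]
		switch {
		case !ok:
			diffs = append(diffs, id+": missing from new path")
		case got.SamplesPresent != want.SamplesPresent:
			diffs = append(diffs, fmt.Sprintf("%s: SamplesPresent %d != %d",
				id, got.SamplesPresent, want.SamplesPresent))
		case math.Abs(got.AvgVcpu-want.AvgVcpu) > eps:
			diffs = append(diffs, fmt.Sprintf("%s: AvgVcpu %g != %g",
				id, got.AvgVcpu, want.AvgVcpu))
		}
	}
	for id := range newPath {
		if _, ok := oldPath[id]; !ok {
			diffs = append(diffs, id+": unexpected in new path")
		}
	}
	sort.Strings(diffs)
	return diffs
}

func main() {
	oldPath := map[string]rollupRow{"vm-1": {24, 2.0}, "vm-2": {12, 4.0}}
	newPath := map[string]rollupRow{"vm-1": {24, 2.0}, "vm-2": {11, 4.0}}
	for _, d := range diffRollups(oldPath, newPath, 1e-9) {
		fmt.Println(d)
	}
}
```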

Integration Tests

  • Hourly capture writes vm_hourly_stats, lifecycle caches, and vCenter totals correctly.
  • Daily aggregation reads canonical hourly data without scanning inventory_hourly_*.
  • Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
  • vcenter_aggregate_totals remains correct for hourly, daily, and monthly views.
  • Trace and totals endpoints keep returning equivalent results before and after migration.
  • UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.

Compatibility Tests

  • When snapshot_table_compat_mode=true, compatibility snapshot tables still exist and are populated.
  • Reports still generate correctly from migrated data.
  • Backfill and repair flows can rebuild compatibility outputs from canonical sources.
  • UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.

Performance Tests

  • Measure per-vCenter capture duration.
  • Measure hourly write throughput.
  • Measure daily aggregation runtime.
  • Measure monthly aggregation runtime.
  • Measure report generation runtime when decoupled from scheduled jobs.
  • Capture baseline metrics before refactor and compare after each phase.
  • Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.

UI Validation

  • Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
  • Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
  • Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
  • Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.

Acceptance Criteria

  • Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
  • Scheduled daily aggregation no longer depends on inventory_hourly_* scans.
  • Scheduled monthly aggregation no longer depends on hourly-history scans.
  • Canonical caches become the source of truth for normal scheduled processing.
  • Legacy compatibility behavior remains available during migration.
  • Existing endpoints, reports, auth behavior, and operational commands continue to work.
  • The UI reflects the design direction in design.md through tokenized colors, typography, spacing, radius, and shadow usage.
  • The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
  • The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.

Assumptions

  • Target direction is Postgres-ready and runtime-first.
  • Existing endpoints, report filenames, and user-visible semantics must remain stable.
  • SQLite remains supported for development, tests, and smaller installs.
  • PostgreSQL is the intended scale-up target for larger environments.
  • Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.