Inventory Capture and Aggregation Optimization Plan
Summary
Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning inventory_hourly_* tables and regenerating reports inline.
This plan is intended to be implementation-ready for a codex-5.3 execution pass.
Execution-path decision:
- For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
- This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
- SQL path is retained for compatibility, backfill, and fallback.
- SQL remains a future optimization candidate on canonical Postgres tables.
- SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.
The target architecture is:
- vm_hourly_stats is the canonical hourly fact store.
- vm_daily_rollup is the canonical monthly input.
- Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.
Current State
- Hourly capture already writes both per-snapshot tables and vm_hourly_stats.
- Daily aggregation has mixed execution paths:
- SQL union path over inventory_hourly_*
- Go path over vm_hourly_stats or parallel table scans
- Monthly aggregation has mixed execution paths:
- SQL path over daily or hourly snapshot tables
- Go path over vm_daily_rollup or hourly cache
- Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
- Report generation is still coupled to scheduled capture and aggregation jobs.
- The current UI is rendered through Templ pages and shared web2/web3 CSS classes, but it does not yet match the visual system described in design.md.
- Current shipped styling still uses a different blue accent, tighter radii, default system typography, and inconsistent component hierarchy compared with the target design language.
Implementation Goals
- Reduce hourly capture wall-clock time.
- Reduce daily and monthly aggregation runtime.
- Eliminate repeated historical table scans from the normal scheduled path.
- Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
- Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in design.md.
- Make authentication and role requirements easier to understand from the UI without changing the auth model.
- Preserve compatibility with SQLite for development and small installs.
- Make the runtime architecture cleanly scalable for PostgreSQL production use.
Implementation Changes
1. Hourly Capture Pipeline
- Keep GetAllVMsWithProps as the primary vCenter inventory fetch path.
- Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
- Replace row-by-row database writes in hourly capture with batched writes.
- For PostgreSQL:
- prefer multi-row insert/upsert or COPY into vm_hourly_stats
- keep conflict handling on the canonical key
- For SQLite:
- keep transactional batched insert/upsert
- do not attempt PostgreSQL-only ingestion patterns
- During capture, write data to these canonical destinations first:
- vm_hourly_stats
- vm_lifecycle_cache
- vcenter_totals
- vcenter_latest_totals
- vcenter_aggregate_totals for hourly totals
- Treat inventory_hourly_<epoch> as compatibility output, not as the source of truth for downstream jobs.
- Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
- In that reconciliation phase, update canonical cache tables first.
- Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
- Remove synchronous XLSX regeneration from hourly capture.
- Scheduled capture should finish once persistence and reconciliation are complete.
- Report generation should run after the capture path, either deferred within the job or via a follow-up stage.
2. Daily Aggregation
- Make vm_hourly_stats the only normal scheduled input for daily aggregation.
- Scheduled daily jobs must not build UNION ALL queries across inventory_hourly_*.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep the current SQL union path only for:
- compatibility fallback
- manual repair
- backfill support where needed
- Daily aggregation output must continue writing:
- inventory_daily_summary_YYYYMMDD
- vm_daily_rollup
- snapshot_registry daily record
- refreshed vcenter_aggregate_totals daily entries
- Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
- Preserve existing daily semantics for:
- SamplesPresent
- AvgIsPresent
- weighted CPU/RAM/disk averages
- pool percentages
- creation/deletion time behavior
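As a rough illustration of a Go-path daily fold over canonical hourly rows, the sketch below groups samples by VM and derives per-day values. The definitions used here are assumptions, not the existing semantics: SamplesPresent is taken to be the number of hourly samples in which the VM appeared, AvgIsPresent the fraction of the day's snapshots containing the VM, and the averages plain means over present samples. The hourlySample and dailyRollup types and their field names are hypothetical.

```go
package main

// hourlySample is a hypothetical slice of one vm_hourly_stats row.
type hourlySample struct {
	VmId      string
	VcpuCount float64
	RamGB     float64
}

// dailyRollup holds assumed per-VM daily outputs.
type dailyRollup struct {
	SamplesPresent int
	AvgIsPresent   float64
	AvgVcpuCount   float64
	AvgRamGB       float64
}

// aggregateDaily folds one day's hourly samples into per-VM rollups in a
// single pass. totalSnapshots is the number of hourly snapshots taken that day.
func aggregateDaily(samples []hourlySample, totalSnapshots int) map[string]dailyRollup {
	type sums struct {
		n         int
		vcpu, ram float64
	}
	acc := make(map[string]*sums)
	for _, s := range samples {
		a := acc[s.VmId]
		if a == nil {
			a = &sums{}
			acc[s.VmId] = a
		}
		a.n++
		a.vcpu += s.VcpuCount
		a.ram += s.RamGB
	}
	out := make(map[string]dailyRollup, len(acc))
	for id, a := range acc {
		r := dailyRollup{
			SamplesPresent: a.n,
			AvgVcpuCount:   a.vcpu / float64(a.n),
			AvgRamGB:       a.ram / float64(a.n),
		}
		if totalSnapshots > 0 {
			r.AvgIsPresent = float64(a.n) / float64(totalSnapshots)
		}
		out[id] = r
	}
	return out
}
```

The point of the sketch is the shape of the work: one pass over rows already stored in vm_hourly_stats, with no per-snapshot table scans. The real field definitions should be taken from the current implementation and locked in by the golden-result tests.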
3. Monthly Aggregation
- Make vm_daily_rollup the default scheduled input for monthly aggregation.
- Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
- Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
- Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
- Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
- Treat the SQL path as non-default compatibility and fallback behavior.
- Do not treat this as a permanent rejection of SQL.
- Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
- Keep hourly-based monthly aggregation only for:
- manual rebuilds
- repair/backfill workflows
- validation against old behavior
- Preserve current monthly weighting semantics based on per-day sample volumes.
- Monthly aggregation output must continue writing:
- inventory_monthly_summary_YYYYMM
- snapshot_registry monthly record
- refreshed vcenter_aggregate_totals monthly entries
- Keep report generation behavior unchanged from the user’s perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.
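The requirement to weight monthly values by per-day sample volumes can be illustrated with a small sketch. It assumes each daily rollup entry carries a sample count and a daily average; the dailyEntry type and its field names are hypothetical, and the real rollup has more columns.

```go
package main

// dailyEntry is a hypothetical slice of one vm_daily_rollup row.
type dailyEntry struct {
	SamplesPresent int
	AvgVcpuCount   float64
}

// weightedMonthlyAvg combines daily averages into one monthly average,
// weighting each day by the number of samples behind it. It also returns
// the total sample count for the month.
func weightedMonthlyAvg(days []dailyEntry) (avg float64, totalSamples int) {
	var weighted float64
	for _, d := range days {
		weighted += d.AvgVcpuCount * float64(d.SamplesPresent)
		totalSamples += d.SamplesPresent
	}
	if totalSamples == 0 {
		return 0, 0
	}
	return weighted / float64(totalSamples), totalSamples
}
```

For a VM with 24 samples averaging 2 vCPUs on one day and 8 samples averaging 4 vCPUs on another, the weighted result is 2.5, whereas a plain mean of the two daily averages would give 3. Monthly output parity depends on keeping that distinction.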
4. Storage and Schema
- Keep these tables during migration:
- inventory_hourly_*
- inventory_daily_summary_*
- inventory_monthly_summary_*
- Stop treating hourly snapshot tables as the normal scheduled aggregation source.
- Preserve snapshot_registry, but register logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
- Validate or add the following indexes on vm_hourly_stats for PostgreSQL:
- ("SnapshotTime")
- ("Vcenter","SnapshotTime")
- ("Vcenter","VmId","SnapshotTime")
- ("Vcenter","VmUuid","SnapshotTime")
- a name lookup index aligned with current trace queries
- Keep the existing trace-compatible indexes for SQLite.
- After the canonical-path migration is stable, partition vm_hourly_stats by snapshot month for PostgreSQL.
- Do not require partitioning for SQLite or tests.
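One way to keep the index list above checkable in code is to declare it once and generate idempotent DDL from it. The sketch below is an assumption about how that could look: the index names are hypothetical, and the name lookup index is left out because its exact definition depends on the current trace queries.

```go
package main

import (
	"fmt"
	"strings"
)

// indexSpec describes one index on a canonical table.
type indexSpec struct {
	Name string // hypothetical index name
	Cols []string
}

// hourlyIndexSpecs lists the column sets named in the plan for vm_hourly_stats.
func hourlyIndexSpecs() []indexSpec {
	return []indexSpec{
		{Name: "vm_hourly_stats_time_idx", Cols: []string{"SnapshotTime"}},
		{Name: "vm_hourly_stats_vc_time_idx", Cols: []string{"Vcenter", "SnapshotTime"}},
		{Name: "vm_hourly_stats_vc_vmid_time_idx", Cols: []string{"Vcenter", "VmId", "SnapshotTime"}},
		{Name: "vm_hourly_stats_vc_uuid_time_idx", Cols: []string{"Vcenter", "VmUuid", "SnapshotTime"}},
	}
}

// createIndexSQL renders an idempotent CREATE INDEX statement with
// double-quoted, case-sensitive column names.
func createIndexSQL(table string, spec indexSpec) string {
	quoted := make([]string, len(spec.Cols))
	for i, c := range spec.Cols {
		quoted[i] = `"` + c + `"`
	}
	return fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)",
		spec.Name, table, strings.Join(quoted, ", "))
}
```

Running the generated statements at startup or in a migration step would cover the "validate or add" requirement without failing when an index already exists.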
5. Compatibility Mode
- Introduce an explicit compatibility mode for legacy snapshot tables.
- When compatibility mode is enabled:
- continue writing inventory_hourly_*
- continue generating legacy-compatible daily/monthly summary tables
- continue registering snapshots as today
- When compatibility mode is disabled in a later phase:
- scheduled jobs may skip legacy hourly table creation
- compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
- Default to compatibility mode enabled during the transition.
6. Scheduling and Job Flow
- Refactor the scheduled pipeline into explicit stages:
- capture
- reconcile
- register and refresh totals caches
- optional report generation
- Daily aggregation should run only against the completed prior-day hourly data.
- Monthly aggregation should depend on daily rollup completion, not hourly history scans.
- Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
- Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.
7. UI Refresh and Design-System Alignment
- Use design.md as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
- Introduce semantic theme tokens using --theme_* naming in the shared stylesheet layer.
- Replace the current ad hoc web2 color and radius values with tokenized equivalents for:
- primary text
- weak text
- CTA blue
- borders
- surfaces
- success states
- button spotlight text
- card and ambient shadows
- Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
- Keep the existing web2 and web3 class names if that reduces churn, but rebase them on the new token system.
- Establish a typography strategy that follows design.md while remaining deployable:
- prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
- otherwise define a documented fallback stack with similar proportions and spacing behavior
- apply positive letter spacing to body, caption, and button treatments where appropriate
- Normalize component shape language to the design brief:
- buttons at 12px radius
- cards and sections at 16px to 24px radius
- larger containers at 24px to 32px radius where needed
- avoid the current 3px to 6px rounded treatment as the default visual language
- Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
- Refactor shared UI structure in the Templ layer:
- components/core/header.templ
- components/core/footer.templ
- shared shell/header/card/button/table/form patterns used across components/views/*
- Add a reusable page-shell pattern so all primary pages share:
- a consistent hero/header treatment
- action grouping
- content width rules
- section spacing
- responsive table overflow behavior
- Improve the dashboard information architecture in components/views/index.templ:
- reduce the current long-form text density
- promote primary navigation and key operational tasks
- move build metadata into secondary status cards
- present auth requirements and role policy as a concise callout rather than dense paragraph copy
- Improve snapshot and vCenter list pages in components/views/snapshots.templ:
- stronger table hierarchy
- clearer record counts and grouping
- more intentional page headers and return navigation
- responsive behavior that preserves readability on smaller screens
- Improve the VM trace page in components/views/vm_trace.templ:
- upgrade search form layout and input styling
- improve chart framing and diagnostics presentation
- make lifecycle summary cards visually clearer
- preserve dense tabular detail without making the page feel purely utilitarian
- Ensure the auth-enabled experience is visible in the UI:
- clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
- surface viewer versus admin capability differences in concise language
- keep Swagger and operational links accessible from the main navigation
- Add accessibility and interaction requirements to the UI implementation:
- visible focus states
- sufficient text/background contrast
- keyboard-usable navigation and forms
- table layouts that remain readable with horizontal overflow
- mobile-safe spacing and tap targets
- Keep UI changes implementation-friendly:
- avoid introducing a large frontend framework
- continue using Templ plus shared CSS and existing JS assets
- prefer incremental component replacement over a full frontend rewrite
Public Interfaces and Settings
- No HTTP API changes are required.
- Keep existing endpoints and report filenames stable.
- No auth-model changes are required for the UI refresh.
- If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
- Add these settings:
- settings.capture_write_batch_size
- default: 1000
- controls batched DB writes for hourly capture
- settings.snapshot_table_compat_mode
- default: true
- when true, continue writing legacy snapshot tables during migration
- settings.async_report_generation
- default: true
- when true, scheduled jobs defer XLSX generation from the hot path
- Keep existing settings such as:
- hourly_snapshot_concurrency
- monthly_aggregation_granularity
- retry settings
- cleanup settings
- Scheduled monthly aggregation should ignore hourly granularity unless running a manual or backfill job.
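Because two of the new settings default to true, config loading needs to distinguish "absent" from "false". A minimal sketch of one approach is to decode onto a struct that is pre-filled with defaults. JSON is used here only as a stand-in for whatever format the settings file actually uses, and the pipelineSettings type and loadSettings function are hypothetical names.

```go
package main

import "encoding/json"

// pipelineSettings holds the three new settings from this plan.
type pipelineSettings struct {
	CaptureWriteBatchSize   int  `json:"capture_write_batch_size"`
	SnapshotTableCompatMode bool `json:"snapshot_table_compat_mode"`
	AsyncReportGeneration   bool `json:"async_report_generation"`
}

// defaultPipelineSettings returns the defaults stated in the plan.
func defaultPipelineSettings() pipelineSettings {
	return pipelineSettings{
		CaptureWriteBatchSize:   1000,
		SnapshotTableCompatMode: true,
		AsyncReportGeneration:   true,
	}
}

// loadSettings decodes raw config on top of the defaults, so keys that are
// absent from the input keep their default values. A non-positive batch size
// is reset to the default.
func loadSettings(raw []byte) (pipelineSettings, error) {
	s := defaultPipelineSettings()
	if err := json.Unmarshal(raw, &s); err != nil {
		return pipelineSettings{}, err
	}
	if s.CaptureWriteBatchSize <= 0 {
		s.CaptureWriteBatchSize = 1000
	}
	return s, nil
}
```

Decoding into a zero-valued struct instead would silently turn an omitted snapshot_table_compat_mode into false, which is the opposite of the intended migration default.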
Execution Order
Phase 1: Hot-Path Runtime Wins
- Add batched hourly writes.
- Decouple report generation from hourly capture.
- Ensure daily scheduled aggregation reads only from vm_hourly_stats.
- Ensure monthly scheduled aggregation reads only from vm_daily_rollup.
- Keep compatibility tables enabled.
- Define the UI token layer and shared component mapping before page-level redesign work begins.
Phase 2: Canonical Dataflow
- Refactor reconciliation so canonical caches are updated first.
- Reduce or eliminate prior-snapshot table mutations during capture.
- Make scheduled aggregation paths canonical-only.
- Keep fallback and repair code for legacy unions/scans.
- Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.
Phase 3: Postgres-Ready Scale-Up
- Validate index coverage on canonical tables.
- Add PostgreSQL partitioning for vm_hourly_stats.
- Benchmark Go and SQL aggregation paths on representative production-scale data.
- Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
- Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
- If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
- Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.
Phase 4: Compatibility Reduction
- Keep legacy table output behind snapshot_table_compat_mode.
- Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
- Retain explicit backfill and rebuild commands for compatibility tables and reports.
- Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.
Implementation Checklist
0. Baseline and Guardrails
- Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
- Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
- Add new settings with defaults and config wiring:
- settings.capture_write_batch_size=1000
- settings.snapshot_table_compat_mode=true
- settings.async_report_generation=true
- Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
- Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
- Evidence snapshot: see phase0-baseline.md for metrics, API/report contract snapshot, and guardrail verification.
1. Phase 1: Hot-Path Runtime Wins
- Implement batched hourly writes for canonical tables in capture flow.
- Add PostgreSQL multi-row insert/upsert path (or COPY) for vm_hourly_stats.
- Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
- Decouple XLSX/report generation from capture hot path via async/deferred stage.
- Ensure scheduled daily aggregation reads canonical data from vm_hourly_stats only.
- Ensure scheduled monthly aggregation reads canonical data from vm_daily_rollup only.
- Keep legacy compatibility tables enabled during this phase.
- Introduce UI token layer (--theme_*) and map shared component primitives before page-specific redesign.
2. Phase 2: Canonical Dataflow
- Refactor capture/reconcile ordering so canonical caches are updated first.
- Move deletion/event reconciliation to one post-capture phase per vCenter.
- Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
- Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
- Verify snapshot_registry logical hourly registration remains correct without normal hourly table scans.
- Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
- Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.
3. Phase 3: Postgres-Ready Scale-Up
- Validate/add canonical vm_hourly_stats indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
- Add PostgreSQL monthly partitioning for vm_hourly_stats behind migration controls.
- Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
- Benchmark harness implemented via -benchmark-aggregations and -benchmark-runs; production-scale Postgres run pending.
- Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
- If SQL wins, roll out behind a controlled flag before any default switch.
4. Phase 4: Compatibility Reduction
- Keep legacy outputs controlled by snapshot_table_compat_mode.
- Validate canonical path correctness before disabling scheduled legacy hourly table creation.
- Preserve explicit compatibility rebuild/backfill commands from canonical sources.
- Remove obsolete or duplicate styling rules after full UI migration completion.
5. Validation and Quality Gates
- Add golden-result tests for daily output parity (old vs new path).
- Add golden-result tests for monthly output parity (old vs new path).
- Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
- Add integration tests for canonical write/read paths and totals cache correctness.
- Add compatibility tests for legacy table generation, reports, and rebuild flows.
- Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
- Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
6. Rollout and Documentation
- Update operator docs for new settings and default behavior.
- Document compatibility-mode lifecycle and criteria to disable legacy table generation.
- Document benchmark method/results and default-path decision record (Go vs SQL).
- Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
Test Plan
Correctness Tests
- Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
- Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
- Include edge cases for:
- partial-day VM presence
- missing creation times
- deletion-time refinement
- pool changes
- CPU and RAM changes across samples
- VMs identified by VmId, VmUuid, and fallback name matching
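A golden-result parity check between the old and new paths could be built around a small diff helper like the one sketched below. The summaryRow type and its fields are hypothetical stand-ins for the real summary columns, and the float tolerance is an assumption intended to absorb rounding differences between SQL and Go arithmetic.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// summaryRow is a hypothetical subset of one per-VM summary record.
type summaryRow struct {
	SamplesPresent int
	AvgVcpuCount   float64
}

// diffSummaries compares old-path and new-path outputs keyed by VM identity
// and returns a sorted list of human-readable differences. An empty result
// means the two outputs match within tol.
func diffSummaries(oldRows, newRows map[string]summaryRow, tol float64) []string {
	var diffs []string
	for id, o := range oldRows {
		n, ok := newRows[id]
		if !ok {
			diffs = append(diffs, fmt.Sprintf("%s: missing in new output", id))
			continue
		}
		if o.SamplesPresent != n.SamplesPresent {
			diffs = append(diffs, fmt.Sprintf("%s: SamplesPresent %d != %d", id, o.SamplesPresent, n.SamplesPresent))
		}
		if math.Abs(o.AvgVcpuCount-n.AvgVcpuCount) > tol {
			diffs = append(diffs, fmt.Sprintf("%s: AvgVcpuCount %g != %g", id, o.AvgVcpuCount, n.AvgVcpuCount))
		}
	}
	for id := range newRows {
		if _, ok := oldRows[id]; !ok {
			diffs = append(diffs, fmt.Sprintf("%s: unexpected in new output", id))
		}
	}
	sort.Strings(diffs)
	return diffs
}
```

A test would run both paths over the same synthetic dataset, including the edge cases listed above, and fail with the returned diff lines if the slice is non-empty.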
Integration Tests
- Hourly capture writes vm_hourly_stats, lifecycle caches, and vCenter totals correctly.
- Daily aggregation reads canonical hourly data without scanning inventory_hourly_*.
- Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
- vcenter_aggregate_totals remains correct for hourly, daily, and monthly views.
- Trace and totals endpoints keep returning equivalent results before and after migration.
- UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.
Compatibility Tests
- When snapshot_table_compat_mode=true, compatibility snapshot tables still exist and are populated.
- Reports still generate correctly from migrated data.
- Backfill and repair flows can rebuild compatibility outputs from canonical sources.
- UI remains functional when auth is disabled and when auth is enabled with protected API usage documented in-page.
Performance Tests
- Measure per-vCenter capture duration.
- Measure hourly write throughput.
- Measure daily aggregation runtime.
- Measure monthly aggregation runtime.
- Measure report generation runtime when decoupled from scheduled jobs.
- Capture baseline metrics before refactor and compare after each phase.
- Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.
UI Validation
- Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
- Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
- Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
- Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.
Acceptance Criteria
- Scheduled hourly capture runtime is materially reduced without changing user-visible outputs.
- Scheduled daily aggregation no longer depends on inventory_hourly_* scans.
- Scheduled monthly aggregation no longer depends on hourly-history scans.
- Canonical caches become the source of truth for normal scheduled processing.
- Legacy compatibility behavior remains available during migration.
- Existing endpoints, reports, auth behavior, and operational commands continue to work.
- The UI reflects the design direction in design.md through tokenized colors, typography, spacing, radius, and shadow usage.
- The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
- The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.
Assumptions
- Target direction is Postgres-ready and runtime-first.
- Existing endpoints, report filenames, and user-visible semantics must remain stable.
- SQLite remains supported for development, tests, and smaller installs.
- PostgreSQL is the intended scale-up target for larger environments.
- Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.