nathan/vctp2

nathan 2c3167a1a0

2026-04-20 19:40:01 +10:00

Inventory Capture and Aggregation Optimization Plan

Summary

Optimize for end-to-end runtime with a Postgres-ready design. Keep the current HTTP and report behavior intact, but shift the scheduled data pipeline so it uses canonical append-only/cache tables instead of repeatedly scanning inventory_hourly_* tables and regenerating reports inline.

This plan is intended to be implementation-ready for a codex-5.3 execution pass.

Execution-path decision:

For the current architecture and migration phases, scheduled daily and monthly aggregation default to the Go path.
This is a readability-first and current-performance decision, not a claim that Go is inherently faster than a well-designed SQL implementation.
The SQL path is retained for compatibility, backfill, and fallback.
SQL remains a future optimization candidate on canonical Postgres tables.
SQL can be promoted to default only after benchmark evidence on canonical Postgres tables shows a clear runtime advantage.

The target architecture is:

vm_hourly_stats is the canonical hourly fact store.
vm_daily_rollup is the canonical monthly input.
Per-snapshot tables and XLSX generation remain as compatibility and output concerns, not the primary execution path.

Current State

Hourly capture already writes both per-snapshot tables and vm_hourly_stats.
Daily aggregation has mixed execution paths:
- SQL union path over inventory_hourly_*
- Go path over vm_hourly_stats or parallel table scans
Monthly aggregation has mixed execution paths:
- SQL path over daily or hourly snapshot tables
- Go path over vm_daily_rollup or hourly cache
Lifecycle reconciliation updates both canonical cache tables and prior hourly snapshot tables during the hot path.
Report generation is still coupled to scheduled capture and aggregation jobs.
The current UI is rendered through Templ pages and shared web2/web3 CSS classes, but it does not yet match the visual system described in design.md.
The currently shipped styling still uses a different blue accent, tighter radii, default system typography, and an inconsistent component hierarchy compared with the target design language.

Implementation Goals

Reduce hourly capture wall-clock time.
Reduce daily and monthly aggregation runtime.
Eliminate repeated historical table scans from the normal scheduled path.
Keep user-visible HTTP APIs, reports, and auth behavior unchanged.
Improve UI clarity and consistency so the dashboard, snapshot views, and trace views reflect the design direction in design.md.
Make authentication and role requirements easier to understand from the UI without changing the auth model.
Preserve compatibility with SQLite for development and small installs.
Make the runtime architecture cleanly scalable for PostgreSQL production use.

Implementation Changes

1. Hourly Capture Pipeline

Keep GetAllVMsWithProps as the primary vCenter inventory fetch path.
Preserve single-VM property retrieval only as a fallback path when bulk retrieval is incomplete.
Replace row-by-row database writes in hourly capture with batched writes.
For PostgreSQL:
- prefer multi-row insert/upsert or COPY into vm_hourly_stats
- keep conflict handling on the canonical key
For SQLite:
- keep transactional batched insert/upsert
- do not attempt PostgreSQL-only ingestion patterns
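The batching described above can be sketched as follows. This is a minimal illustration, not the project's capture code: the `hourlyRow` type, the `VcpuCount` column, and the conflict key `("Vcenter","VmId","SnapshotTime")` are assumptions chosen to match the column names used elsewhere in this plan. The sketch only builds the statement and its arguments; executing it through `database/sql` inside a transaction is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// hourlyRow is a hypothetical, trimmed-down stand-in for one vm_hourly_stats row.
type hourlyRow struct {
	Vcenter      string
	VmId         string
	SnapshotTime int64
	VcpuCount    int // assumed column, for illustration only
}

// chunkRows splits rows into batches of at most size rows.
// A non-positive size falls back to 1000, the proposed default batch size.
func chunkRows(rows []hourlyRow, size int) [][]hourlyRow {
	if size <= 0 {
		size = 1000
	}
	var out [][]hourlyRow
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		out = append(out, rows[start:end])
	}
	return out
}

// buildUpsert returns one PostgreSQL multi-row INSERT ... ON CONFLICT statement
// plus its positional arguments. The conflict key is an assumption.
func buildUpsert(batch []hourlyRow) (string, []any) {
	const cols = 4
	placeholders := make([]string, 0, len(batch))
	args := make([]any, 0, len(batch)*cols)
	for i, r := range batch {
		base := i * cols
		placeholders = append(placeholders,
			fmt.Sprintf("($%d,$%d,$%d,$%d)", base+1, base+2, base+3, base+4))
		args = append(args, r.Vcenter, r.VmId, r.SnapshotTime, r.VcpuCount)
	}
	query := `INSERT INTO vm_hourly_stats ("Vcenter","VmId","SnapshotTime","VcpuCount") VALUES ` +
		strings.Join(placeholders, ",") +
		` ON CONFLICT ("Vcenter","VmId","SnapshotTime") DO UPDATE SET "VcpuCount" = EXCLUDED."VcpuCount"`
	return query, args
}

func main() {
	rows := make([]hourlyRow, 2500)
	for _, batch := range chunkRows(rows, 1000) {
		query, args := buildUpsert(batch)
		fmt.Println(len(batch), len(args), len(query) > 0)
	}
}
```

With four columns per row, a batch of 1000 rows uses 4000 bind parameters, which stays well below PostgreSQL's limit of 65535 parameters per statement. A real row with many more columns would need a smaller batch size or COPY.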
During capture, write data to these canonical destinations first:
- vm_hourly_stats
- vm_lifecycle_cache
- vcenter_totals
- vcenter_latest_totals
- vcenter_aggregate_totals for hourly totals
Treat inventory_hourly_<epoch> as compatibility output, not as the source of truth for downstream jobs.
Move deletion and event reconciliation to one post-capture reconciliation phase per vCenter.
In that reconciliation phase, update canonical cache tables first.
Stop updating prior hourly snapshot tables inline during the capture hot path except where compatibility mode explicitly requires it.
Remove synchronous XLSX regeneration from hourly capture.
Scheduled capture should finish once persistence and reconciliation are complete.
Report generation should run after the capture path, either deferred within the job or via a follow-up stage.

2. Daily Aggregation

Make vm_hourly_stats the only normal scheduled input for daily aggregation.
Scheduled daily jobs must not build UNION ALL queries across inventory_hourly_*.
Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current snapshot-union SQL path.
Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL union path by avoiding repeated historical table scans.
Treat the SQL path as non-default compatibility and fallback behavior.
Do not treat this as a permanent rejection of SQL.
Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
Keep the current SQL union path only for:
- compatibility fallback
- manual repair
- backfill support where needed
Daily aggregation output must continue writing:
- inventory_daily_summary_YYYYMMDD
- vm_daily_rollup
- snapshot_registry daily record
- refreshed vcenter_aggregate_totals daily entries
Lifecycle refinement should operate on canonical lifecycle data and only use snapshot-table probing as fallback.
Preserve existing daily semantics for:
- SamplesPresent
- AvgIsPresent
- weighted CPU/RAM/disk averages
- pool percentages
- creation/deletion time behavior

3. Monthly Aggregation

Make vm_daily_rollup the default scheduled input for monthly aggregation.
Scheduled monthly jobs should not scan hourly snapshot tables in the normal path.
Keep the Go aggregation path as the explicit default scheduled path for the current implementation and migration phases.
Readability is the primary reason for this default: the Go path is materially easier to follow, test, and debug than the current SQL path.
Performance is a secondary but still important reason: on the current implementation, Go is expected to outperform the existing SQL path by avoiding snapshot-table unions and hourly-history scans in the normal case.
Treat the SQL path as non-default compatibility and fallback behavior.
Do not treat this as a permanent rejection of SQL.
Only promote SQL to default if benchmark results on canonical Postgres data show a clear, repeatable improvement over the Go path.
Keep hourly-based monthly aggregation only for:
- manual rebuilds
- repair/backfill workflows
- validation against old behavior
Preserve current monthly weighting semantics based on per-day sample volumes.
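One way to read "weighting based on per-day sample volumes" is a sample-weighted mean of the daily averages. The sketch below assumes that reading; the `dailyRollup` type and the `AvgVcpu` field are hypothetical, and the existing implementation remains the authority on the exact semantics.

```go
package main

import "fmt"

// dailyRollup is a hypothetical subset of one vm_daily_rollup row.
type dailyRollup struct {
	SamplesPresent int     // hourly samples in which the VM was present that day
	AvgVcpu        float64 // assumed daily average, for illustration only
}

// weightedMonthlyAvg weights each day's average by its sample count,
// so a day with 24 samples counts twice as much as a day with 12.
func weightedMonthlyAvg(days []dailyRollup) float64 {
	var weighted float64
	var total int
	for _, d := range days {
		if d.SamplesPresent <= 0 {
			continue // days without samples contribute nothing
		}
		weighted += d.AvgVcpu * float64(d.SamplesPresent)
		total += d.SamplesPresent
	}
	if total == 0 {
		return 0
	}
	return weighted / float64(total)
}

func main() {
	days := []dailyRollup{
		{SamplesPresent: 24, AvgVcpu: 2},
		{SamplesPresent: 12, AvgVcpu: 4},
	}
	fmt.Printf("%.2f\n", weightedMonthlyAvg(days)) // prints 2.67
}
```

An unweighted mean of the two daily averages would be 3.00; the weighted result is lower because the day with more samples had the lower average. Golden-result tests should pin whichever behavior the current monthly path actually implements.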
Monthly aggregation output must continue writing:
- inventory_monthly_summary_YYYYMM
- snapshot_registry monthly record
- refreshed vcenter_aggregate_totals monthly entries
Keep report generation behavior unchanged from the user’s perspective, but do not keep it on the critical aggregation hot path if it can be deferred safely.

4. Storage and Schema

Keep these tables during migration:
- inventory_hourly_*
- inventory_daily_summary_*
- inventory_monthly_summary_*
Stop treating hourly snapshot tables as the normal scheduled aggregation source.
Preserve snapshot_registry, and keep registering logical hourly snapshots by timestamp even when downstream jobs no longer depend on hourly table scans.
Validate or add the following indexes on vm_hourly_stats for PostgreSQL:
- ("SnapshotTime")
- ("Vcenter","SnapshotTime")
- ("Vcenter","VmId","SnapshotTime")
- ("Vcenter","VmUuid","SnapshotTime")
- a name lookup index aligned with current trace queries
Keep the existing trace-compatible indexes for SQLite.
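The index list above could be expressed as data and turned into idempotent DDL, as in the sketch below. The index names are hypothetical, and the name lookup index is omitted because its column is not specified in this plan. `CREATE INDEX IF NOT EXISTS` is accepted by both PostgreSQL and SQLite.

```go
package main

import (
	"fmt"
	"strings"
)

// indexSpec describes one index; Name is a hypothetical naming choice.
type indexSpec struct {
	Name    string
	Columns []string
}

// canonicalIndexes mirrors the four column lists named in the plan.
func canonicalIndexes() []indexSpec {
	return []indexSpec{
		{Name: "idx_vm_hourly_stats_time", Columns: []string{"SnapshotTime"}},
		{Name: "idx_vm_hourly_stats_vc_time", Columns: []string{"Vcenter", "SnapshotTime"}},
		{Name: "idx_vm_hourly_stats_vc_vmid_time", Columns: []string{"Vcenter", "VmId", "SnapshotTime"}},
		{Name: "idx_vm_hourly_stats_vc_uuid_time", Columns: []string{"Vcenter", "VmUuid", "SnapshotTime"}},
	}
}

// createIndexSQL renders an idempotent CREATE INDEX statement with quoted columns.
func createIndexSQL(table string, spec indexSpec) string {
	quoted := make([]string, len(spec.Columns))
	for i, c := range spec.Columns {
		quoted[i] = `"` + c + `"`
	}
	return fmt.Sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)",
		spec.Name, table, strings.Join(quoted, ","))
}

func main() {
	for _, spec := range canonicalIndexes() {
		fmt.Println(createIndexSQL("vm_hourly_stats", spec))
	}
}
```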
After the canonical-path migration is stable, partition vm_hourly_stats by snapshot month for PostgreSQL.
Do not require partitioning for SQLite or tests.
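A sketch of how monthly partition DDL might be generated is shown below. It rests on two assumptions that this plan does not confirm: that "SnapshotTime" is stored as Unix epoch seconds, and that the parent table has been declared with `PARTITION BY RANGE ("SnapshotTime")`. The partition naming scheme is also hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// monthBounds returns the half-open range [start, end) of a calendar month
// in UTC, expressed as Unix epoch seconds.
func monthBounds(year, month int) (start, end int64) {
	first := time.Date(year, time.Month(month), 1, 0, 0, 0, 0, time.UTC)
	return first.Unix(), first.AddDate(0, 1, 0).Unix()
}

// monthPartitionSQL renders PostgreSQL declarative-partitioning DDL for one month.
// The child table name (parent_YYYY_MM) is an illustrative convention.
func monthPartitionSQL(parent string, year, month int) string {
	start, end := monthBounds(year, month)
	name := fmt.Sprintf("%s_%04d_%02d", parent, year, month)
	return fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s PARTITION OF %s FOR VALUES FROM (%d) TO (%d)",
		name, parent, start, end)
}

func main() {
	fmt.Println(monthPartitionSQL("vm_hourly_stats", 2026, 4))
}
```

PostgreSQL range partitions treat the lower bound as inclusive and the upper bound as exclusive, which matches the half-open range computed here, so adjacent months do not overlap.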

5. Compatibility Mode

Introduce an explicit compatibility mode for legacy snapshot tables.
When compatibility mode is enabled:
- continue writing inventory_hourly_*
- continue generating legacy-compatible daily/monthly summary tables
- continue registering snapshots the same way they are registered today
When compatibility mode is disabled in a later phase:
- scheduled jobs may skip legacy hourly table creation
- compatibility reports and endpoints must still work from canonical data or compatibility rebuild jobs
Default to compatibility mode enabled during the transition.

6. Scheduling and Job Flow

Refactor the scheduled pipeline into explicit stages:
1. capture
2. reconcile
3. register and refresh totals caches
4. optional report generation
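The staged flow above might be modelled as an ordered list of stages with per-stage timing, as in this sketch. The `stage` and `stageResult` types are hypothetical, and treating a failed optional stage as non-fatal is an assumption about how "optional report generation" should behave.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stage is one step of the scheduled pipeline.
type stage struct {
	Name     string
	Optional bool // a failing optional stage is recorded but does not abort the run
	Run      func() error
}

// stageResult records timing and outcome for stage-level logging.
type stageResult struct {
	Name     string
	Duration time.Duration
	Err      error
}

var errStageFailed = errors.New("stage failed")

// succeed and failStage are placeholder stage bodies for the demo.
func succeed() error   { return nil }
func failStage() error { return errStageFailed }

// runPipeline runs stages in order and stops at the first failing required stage.
func runPipeline(stages []stage) ([]stageResult, error) {
	results := make([]stageResult, 0, len(stages))
	for _, s := range stages {
		start := time.Now()
		err := s.Run()
		results = append(results, stageResult{Name: s.Name, Duration: time.Since(start), Err: err})
		if err != nil && !s.Optional {
			return results, fmt.Errorf("stage %s: %w", s.Name, err)
		}
	}
	return results, nil
}

func main() {
	stages := []stage{
		{Name: "capture", Run: succeed},
		{Name: "reconcile", Run: succeed},
		{Name: "register and refresh totals caches", Run: succeed},
		{Name: "report generation", Optional: true, Run: failStage},
	}
	results, err := runPipeline(stages)
	for _, r := range results {
		fmt.Printf("%-36s %v err=%v\n", r.Name, r.Duration, r.Err)
	}
	fmt.Println("pipeline error:", err)
}
```

Keeping the stages as data makes the ordering explicit and gives one place to attach the stage-level timing that the baseline metrics depend on.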
Daily aggregation should run only against the completed prior-day hourly data.
Monthly aggregation should depend on daily rollup completion, not hourly history scans.
Keep the current cron behavior and auth/UI behavior unchanged while internal data flow changes land.
Backfill and repair jobs should rebuild canonical caches first, then compatibility tables and reports.

7. UI Refresh and Design-System Alignment

Use design.md as the source of truth for the UI refresh, but adapt it pragmatically to this codebase rather than attempting a pixel-perfect clone.
Introduce semantic theme tokens using --theme_* naming in the shared stylesheet layer.
Replace the current ad hoc web2 color and radius values with tokenized equivalents for:
- primary text
- weak text
- CTA blue
- borders
- surfaces
- success states
- button spotlight text
- card and ambient shadows
Update the shared stylesheet source and shipped compiled assets so the new tokens flow through the delivered UI.
Keep the existing web2 and web3 class names if that reduces churn, but rebase them on the new token system.
Establish a typography strategy that follows design.md while remaining deployable:
- prefer Haas and Haas Groot Disp only if licensed webfont delivery is available
- otherwise define a documented fallback stack with similar proportions and spacing behavior
- apply positive letter spacing to body, caption, and button treatments where appropriate
Normalize component shape language to the design brief:
- buttons at 12px radius
- cards and sections at 16px to 24px radius
- larger containers at 24px to 32px radius where needed
- avoid the current 3px to 6px rounded treatment as the default visual language
Replace the current flat visual treatment with the documented blue-tinted shadow system, but keep shadows controlled and readable in data-heavy views.
Refactor shared UI structure in the Templ layer:
- components/core/header.templ
- components/core/footer.templ
- shared shell/header/card/button/table/form patterns used across components/views/*
Add a reusable page-shell pattern so all primary pages share:
- a consistent hero/header treatment
- action grouping
- content width rules
- section spacing
- responsive table overflow behavior
Improve the dashboard information architecture in components/views/index.templ:
- reduce the current long-form text density
- promote primary navigation and key operational tasks
- move build metadata into secondary status cards
- present auth requirements and role policy as a concise callout rather than dense paragraph copy
Improve snapshot and vCenter list pages in components/views/snapshots.templ:
- stronger table hierarchy
- clearer record counts and grouping
- more intentional page headers and return navigation
- responsive behavior that preserves readability on smaller screens
Improve the VM trace page in components/views/vm_trace.templ:
- upgrade search form layout and input styling
- improve chart framing and diagnostics presentation
- make lifecycle summary cards visually clearer
- preserve dense tabular detail without making the page feel purely utilitarian
Ensure the auth-enabled experience is visible in the UI:
- clarify that UI pages remain public while APIs require Bearer tokens when auth is enabled
- surface viewer versus admin capability differences in concise language
- keep Swagger and operational links accessible from the main navigation
Add accessibility and interaction requirements to the UI implementation:
- visible focus states
- sufficient text/background contrast
- keyboard-usable navigation and forms
- table layouts that remain readable with horizontal overflow
- mobile-safe spacing and tap targets
Keep UI changes implementation-friendly:
- avoid introducing a large frontend framework
- continue using Templ plus shared CSS and existing JS assets
- prefer incremental component replacement over a full frontend rewrite

Public Interfaces and Settings

No HTTP API changes are required.
Keep existing endpoints and report filenames stable.
No auth-model changes are required for the UI refresh.
If licensed fonts are not available for deployment, the implementation must ship with a documented fallback stack rather than blocking the UI work.
Add these settings:
- settings.capture_write_batch_size
  - default: 1000
  - controls batched DB writes for hourly capture
- settings.snapshot_table_compat_mode
  - default: true
  - when true, continue writing legacy snapshot tables during migration
- settings.async_report_generation
  - default: true
  - when true, scheduled jobs defer XLSX generation from the hot path
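The three new settings and their defaults could be represented as in the sketch below. The struct, the `yaml` tags, and the fallback for a non-positive batch size are assumptions; the project's actual settings loader is not shown in this plan.

```go
package main

import "fmt"

// pipelineSettings is a hypothetical container for the three new settings.
type pipelineSettings struct {
	CaptureWriteBatchSize   int  `yaml:"capture_write_batch_size"`
	SnapshotTableCompatMode bool `yaml:"snapshot_table_compat_mode"`
	AsyncReportGeneration   bool `yaml:"async_report_generation"`
}

const defaultCaptureWriteBatchSize = 1000

// defaultPipelineSettings returns the defaults stated in the plan.
func defaultPipelineSettings() pipelineSettings {
	return pipelineSettings{
		CaptureWriteBatchSize:   defaultCaptureWriteBatchSize,
		SnapshotTableCompatMode: true,
		AsyncReportGeneration:   true,
	}
}

// normalized replaces a non-positive batch size with the default.
// This fallback is an assumption, not documented behavior.
func (s pipelineSettings) normalized() pipelineSettings {
	if s.CaptureWriteBatchSize <= 0 {
		s.CaptureWriteBatchSize = defaultCaptureWriteBatchSize
	}
	return s
}

// writeLegacyTables reports whether scheduled capture should still write inventory_hourly_* tables.
func (s pipelineSettings) writeLegacyTables() bool { return s.SnapshotTableCompatMode }

// reportsInline reports whether XLSX generation stays on the hot path.
func (s pipelineSettings) reportsInline() bool { return !s.AsyncReportGeneration }

func main() {
	s := defaultPipelineSettings()
	fmt.Println(s.CaptureWriteBatchSize, s.writeLegacyTables(), s.reportsInline())
}
```

Because Go booleans default to false, a loader that starts from a zero-valued struct would silently disable compatibility mode; starting from `defaultPipelineSettings()` and overlaying configured values avoids that.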
Keep existing settings such as:
- hourly_snapshot_concurrency
- monthly_aggregation_granularity
- retry settings
- cleanup settings
Scheduled monthly aggregation should ignore an hourly monthly_aggregation_granularity setting; hourly granularity applies only to manual or backfill jobs.

Execution Order

Phase 1: Hot-Path Runtime Wins

Add batched hourly writes.
Decouple report generation from hourly capture.
Ensure daily scheduled aggregation reads only from vm_hourly_stats.
Ensure monthly scheduled aggregation reads only from vm_daily_rollup.
Keep compatibility tables enabled.
Define the UI token layer and shared component mapping before page-level redesign work begins.

Phase 2: Canonical Dataflow

Refactor reconciliation so canonical caches are updated first.
Reduce or eliminate prior-snapshot table mutations during capture.
Make scheduled aggregation paths canonical-only.
Keep fallback and repair code for legacy unions/scans.
Implement the shared page shell, navigation, button, card, table, and form refinements across the existing Templ views.

Phase 3: Postgres-Ready Scale-Up

Validate index coverage on canonical tables.
Add PostgreSQL partitioning for vm_hourly_stats.
Benchmark Go and SQL aggregation paths on representative production-scale data.
Keep Go as default unless SQL demonstrates a clear, repeatable runtime win on canonical Postgres data.
Treat the benchmark as a comparison against a canonical-table SQL implementation, not the current snapshot-union SQL path.
If SQL wins, promote SQL behind a controlled rollout flag first, then make it default.
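The "clear, repeatable runtime win" rule could be made concrete along the lines of the sketch below, which compares median run times against a minimum relative gain. This is not the existing -benchmark-aggregations harness. The helper names and the 20% threshold used in `main` are hypothetical, and the timed functions are placeholders for the real aggregation paths.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// timeRuns executes fn the given number of times and records each duration.
func timeRuns(runs int, fn func()) []time.Duration {
	if runs < 0 {
		runs = 0
	}
	out := make([]time.Duration, 0, runs)
	for i := 0; i < runs; i++ {
		start := time.Now()
		fn()
		out = append(out, time.Since(start))
	}
	return out
}

// medianDuration returns the median of ds without modifying the input slice.
func medianDuration(ds []time.Duration) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), ds...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

// sqlClearlyFaster reports whether the SQL median beats the Go median
// by at least minGain, expressed as a fraction of the Go median.
func sqlClearlyFaster(goRuns, sqlRuns []time.Duration, minGain float64) bool {
	g, s := medianDuration(goRuns), medianDuration(sqlRuns)
	if g <= 0 {
		return false
	}
	return float64(g-s)/float64(g) >= minGain
}

// msSlice converts millisecond counts to durations; a convenience for examples.
func msSlice(vals ...int) []time.Duration {
	out := make([]time.Duration, 0, len(vals))
	for _, v := range vals {
		out = append(out, time.Duration(v)*time.Millisecond)
	}
	return out
}

func main() {
	goRuns := timeRuns(5, func() { time.Sleep(2 * time.Millisecond) })  // placeholder for the Go path
	sqlRuns := timeRuns(5, func() { time.Sleep(1 * time.Millisecond) }) // placeholder for the SQL path
	fmt.Println("promote SQL behind flag:", sqlClearlyFaster(goRuns, sqlRuns, 0.20))
}
```

Using the median rather than the mean keeps a single slow outlier run from deciding the default path. Repeating the comparison across several datasets would address the "repeatable" part of the rule.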
Complete page-specific UI refinement for dashboard, snapshots, vCenter totals, and VM trace using the shared tokenized design system.

Phase 4: Compatibility Reduction

Keep legacy table output behind snapshot_table_compat_mode.
Once canonical-path validation is complete, allow disabling legacy hourly table generation in scheduled runs.
Retain explicit backfill and rebuild commands for compatibility tables and reports.
Clean up obsolete styling rules and duplicated visual patterns once the new UI system is fully adopted.

Implementation Checklist

0. Baseline and Guardrails

Capture baseline metrics for hourly capture, daily aggregation, monthly aggregation, and report generation.
Confirm current API/endpoint contract and report filename behavior with a regression snapshot.
Add new settings with defaults and config wiring:
- settings.capture_write_batch_size=1000
- settings.snapshot_table_compat_mode=true
- settings.async_report_generation=true
Add/confirm stage-level logging and timing around capture, reconcile, totals refresh, and report generation.
Document migration guardrails: no auth-model changes, SQLite support retained, compatibility mode enabled by default.
Evidence snapshot: see phase0-baseline.md for metrics, API/report contract snapshot, and guardrail verification.

1. Phase 1: Hot-Path Runtime Wins

Implement batched hourly writes for canonical tables in capture flow.
Add PostgreSQL multi-row insert/upsert path (or COPY) for vm_hourly_stats.
Keep SQLite transactional batched upsert path without PostgreSQL-only ingestion features.
Decouple XLSX/report generation from capture hot path via async/deferred stage.
Ensure scheduled daily aggregation reads canonical data from vm_hourly_stats only.
Ensure scheduled monthly aggregation reads canonical data from vm_daily_rollup only.
Keep legacy compatibility tables enabled during this phase.
Introduce UI token layer (--theme_*) and map shared component primitives before page-specific redesign.

2. Phase 2: Canonical Dataflow

Refactor capture/reconcile ordering so canonical caches are updated first.
Move deletion/event reconciliation to one post-capture phase per vCenter.
Remove prior-snapshot table mutations from capture hot path (except explicit compatibility needs).
Keep SQL union/legacy scan paths available only for fallback, repair, and backfill.
Verify snapshot_registry logical hourly registration remains correct without normal hourly table scans.
Implement shared Templ page shell improvements across header/footer/cards/buttons/tables/forms.
Refresh dashboard, snapshots, vCenter totals, and VM trace views to the tokenized design system.

3. Phase 3: Postgres-Ready Scale-Up

Validate/add canonical vm_hourly_stats indexes for snapshot time, vCenter+time, VM identity+time, and trace lookup.
Add PostgreSQL monthly partitioning for vm_hourly_stats behind migration controls.
Benchmark Go vs SQL on canonical Postgres tables using representative production-scale data.
- Benchmark harness implemented via -benchmark-aggregations and -benchmark-runs; production-scale Postgres run pending.
Keep Go as scheduled default unless SQL shows clear and repeatable runtime wins.
If SQL wins, roll out behind a controlled flag before any default switch.

4. Phase 4: Compatibility Reduction

Keep legacy outputs controlled by snapshot_table_compat_mode.
- Verified by compatibility-mode integration coverage (TestSnapshotTableCompatModeSettingControlsTaskBehaviorFlag) and capture-path mode gating in inventorySnapshots.
Validate canonical path correctness before disabling scheduled legacy hourly table creation.
- Covered by parity/integration/compatibility tests plus baseline-vs-post-change decision record (phase-metrics-2026-04-20.md).
Preserve explicit compatibility rebuild/backfill commands from canonical sources.
- Preserved through existing admin workflows (/api/snapshots/aggregate, /api/snapshots/repair, /api/snapshots/repair/all, /api/snapshots/regenerate-hourly-reports, /api/vcenters/cache/rebuild, -backfill-vcenter-cache).
Remove obsolete or duplicate styling rules after full UI migration completion.
- Removed unused selectors from shared UI stylesheet (.web2-button-group*, .web2-list li) in dist/assets/css/web3.css; router UI asset tests remain passing.

5. Validation and Quality Gates

Add golden-result tests for daily output parity (old vs new path).
Add golden-result tests for monthly output parity (old vs new path).
Add lifecycle edge-case coverage (partial presence, missing create times, deletion refinement, pool and resource changes).
Add integration tests for canonical write/read paths and totals cache correctness.
Add compatibility tests for legacy table generation, reports, and rebuild flows.
Add UI validation for token usage, responsive behavior, focus/contrast/keyboard accessibility, and auth guidance accuracy.
- Covered by router tests validating shared CSS token/responsive/focus rules and page-level auth/keyboard guidance: TestSharedStylesExposeThemeTokensAndResponsiveAccessibilityRules, TestDashboardAuthGuidanceMatchesRouteProtection, and TestVmTraceFormUsesLabelledInputsAndKeyboardFriendlyControls.
Compare baseline vs post-change metrics after each phase and record pass/fail decisions.
- Evidence and gate outcomes captured in phase-metrics-2026-04-20.md (baseline delta table + pass/fail decisions + benchmark snapshot).

6. Rollout and Documentation

Update operator docs for new settings and default behavior.
Document compatibility-mode lifecycle and criteria to disable legacy table generation.
Document benchmark method/results and default-path decision record (Go vs SQL).
Publish a short migration runbook for staged rollout, rollback triggers, and repair workflows.
- Completed in README.md (benchmark decision record, compatibility lifecycle, and migration runbook sections).

Test Plan

Correctness Tests

Add golden-result tests comparing old and new daily outputs for the same synthetic hourly dataset.
Add golden-result tests comparing old and new monthly outputs for the same synthetic daily dataset.
Include edge cases for:
- partial-day VM presence
- missing creation times
- deletion-time refinement
- pool changes
- CPU and RAM changes across samples
- VMs identified by VmId, VmUuid, and fallback name matching
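A parity check between the old and new paths might look like the sketch below, which compares per-VM summaries with a small tolerance on floating-point averages. The `vmSummary` type, the `AvgVcpu` field, and the map key are hypothetical; real golden tests would compare every output column.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// vmSummary is a hypothetical subset of one daily or monthly summary row.
type vmSummary struct {
	SamplesPresent int
	AvgIsPresent   float64
	AvgVcpu        float64 // assumed column, for illustration only
}

// sameSummary compares counts exactly and averages within tol.
func sameSummary(a, b vmSummary, tol float64) bool {
	return a.SamplesPresent == b.SamplesPresent &&
		math.Abs(a.AvgIsPresent-b.AvgIsPresent) <= tol &&
		math.Abs(a.AvgVcpu-b.AvgVcpu) <= tol
}

// diffSummaries returns the sorted keys that are missing on either side
// or whose rows differ beyond tol.
func diffSummaries(oldOut, newOut map[string]vmSummary, tol float64) []string {
	var diffs []string
	for key, o := range oldOut {
		n, ok := newOut[key]
		if !ok || !sameSummary(o, n, tol) {
			diffs = append(diffs, key)
		}
	}
	for key := range newOut {
		if _, ok := oldOut[key]; !ok {
			diffs = append(diffs, key)
		}
	}
	sort.Strings(diffs)
	return diffs
}

func main() {
	oldOut := map[string]vmSummary{
		"vc1/vm-1": {SamplesPresent: 24, AvgIsPresent: 1, AvgVcpu: 2},
		"vc1/vm-2": {SamplesPresent: 12, AvgIsPresent: 0.5, AvgVcpu: 4},
	}
	newOut := map[string]vmSummary{
		"vc1/vm-1": {SamplesPresent: 24, AvgIsPresent: 1, AvgVcpu: 2.0000000001},
		"vc1/vm-3": {SamplesPresent: 1, AvgIsPresent: 0.04, AvgVcpu: 1},
	}
	fmt.Println(diffSummaries(oldOut, newOut, 1e-6)) // prints [vc1/vm-2 vc1/vm-3]
}
```

Sorting the differing keys keeps test failures deterministic despite Go's randomized map iteration order.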

Integration Tests

Hourly capture writes vm_hourly_stats, lifecycle caches, and vCenter totals correctly.
Daily aggregation reads canonical hourly data without scanning inventory_hourly_*.
Monthly aggregation reads canonical daily rollup without scanning hourly history in the normal path.
vcenter_aggregate_totals remains correct for hourly, daily, and monthly views.
Trace and totals endpoints keep returning equivalent results before and after migration.
UI page rendering remains valid for dashboard, snapshot pages, vCenter totals, and VM trace after shared component changes.

Compatibility Tests

When snapshot_table_compat_mode=true, compatibility snapshot tables still exist and are populated.
Reports still generate correctly from migrated data.
Backfill and repair flows can rebuild compatibility outputs from canonical sources.
UI remains functional both when auth is disabled and when auth is enabled, with protected API usage documented in-page.

Performance Tests

Measure per-vCenter capture duration.
Measure hourly write throughput.
Measure daily aggregation runtime.
Measure monthly aggregation runtime.
Measure report generation runtime when decoupled from scheduled jobs.
Capture baseline metrics before refactor and compare after each phase.
Measure basic UI payload impact after the refresh so stylesheet and JS growth stay controlled.

UI Validation

Verify token usage in shared CSS so colors, radii, and shadows are not hard-coded inconsistently across pages.
Verify responsive behavior for dashboard, snapshot tables, vCenter totals, and VM trace at mobile and desktop widths.
Verify focus states, contrast, and keyboard access for links, buttons, inputs, and table navigation surfaces.
Verify that the auth guidance on the dashboard still matches actual route protection and Bearer-token behavior.

Acceptance Criteria

Scheduled hourly capture runtime is materially reduced relative to the recorded baseline metrics, without changing user-visible outputs.
Scheduled daily aggregation no longer depends on inventory_hourly_* scans.
Scheduled monthly aggregation no longer depends on hourly-history scans.
Canonical caches become the source of truth for normal scheduled processing.
Legacy compatibility behavior remains available during migration.
Existing endpoints, reports, auth behavior, and operational commands continue to work.
The UI reflects the design direction in design.md through tokenized colors, typography, spacing, radius, and shadow usage.
The dashboard, snapshot pages, vCenter totals view, and VM trace view share a coherent visual system and clearer information hierarchy.
The refreshed UI remains responsive, accessible, and compatible with the current Templ-based rendering model.

Assumptions

Target direction is Postgres-ready and runtime-first.
Existing endpoints, report filenames, and user-visible semantics must remain stable.
SQLite remains supported for development, tests, and smaller installs.
PostgreSQL is the intended scale-up target for larger environments.
Compatibility snapshot tables should remain enabled by default until canonical-path validation is complete.
