Implemented the next 4h-plan phase: dual-run support + explicit cutover gate.

2026-04-06 19:09:20 +10:00
parent 1ef300d25e
commit 1e750e35d1
7 changed files with 238 additions and 20 deletions
@@ -111,6 +111,7 @@ Runbook/docs:
 - `docs/rain_model_runbook.md`
 - `scripts/recommend_rain_model.py` (rank reports and recommend deploy candidate)
 - `scripts/rainml_py.sh` (run ML Python scripts inside the `rainml` container; avoids host virtualenv/dependency setup)
+- `scripts/check_rain_cutover_gate.py` (automated pass/fail gate using baseline vs candidate test metrics)

 ## Publish a test WS90 payload
 ```sh
@@ -36,11 +36,11 @@ services:
      RAIN_SITE: "home"
      RAIN_HORIZON_HOURS: "4"
      RAIN_MODEL_NAME: "rain_next_4h"
-      RAIN_MODEL_VERSION_BASE: "rain-auto-v2-extended-4h"
+      RAIN_MODEL_VERSION_BASE: "rain-auto-v2-extended-calendar-4h"
      RAIN_MODEL_FAMILY: "auto"
-      RAIN_FEATURE_SET: "extended"
+      RAIN_FEATURE_SET: "extended_calendar"
      RAIN_FORECAST_MODEL: "ecmwf"
-      RAIN_LOOKBACK_DAYS: "30"
+      RAIN_LOOKBACK_DAYS: "60"
      RAIN_TRAIN_INTERVAL_HOURS: "24"
      RAIN_PREDICT_INTERVAL_MINUTES: "10"
      RAIN_MIN_PRECISION: "0.70"
@@ -50,12 +50,48 @@ services:
      RAIN_THRESHOLD_POLICY: "walk_forward"
      RAIN_WALK_FORWARD_FOLDS: "4"
      RAIN_ALLOW_EMPTY_DATA: "true"
-      RAIN_MODEL_PATH: "/app/models/rain_model.pkl"
-      RAIN_MODEL_BACKUP_PATH: "/app/models/rain_model.pkl.last_good"
-      RAIN_REPORT_PATH: "/app/models/rain_model_report.json"
-      RAIN_AUDIT_PATH: "/app/models/rain_data_audit.json"
-      RAIN_DATASET_PATH: "/app/models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
-      RAIN_MODEL_CARD_PATH: "/app/models/model_card_{model_version}.md"
+      RAIN_MODEL_PATH: "/app/models/rain_model_4h.pkl"
+      RAIN_MODEL_BACKUP_PATH: "/app/models/rain_model_4h.pkl.last_good"
+      RAIN_REPORT_PATH: "/app/models/rain_model_report_4h.json"
+      RAIN_AUDIT_PATH: "/app/models/rain_data_audit_4h.json"
+      RAIN_DATASET_PATH: "/app/models/datasets/rain_dataset_4h_{model_version}_{feature_set}.csv"
+      RAIN_MODEL_CARD_PATH: "/app/models/model_card_4h_{model_version}.md"
+    volumes:
+      - ./models:/app/models
+
+  rainml_1h:
+    build:
+      context: .
+      dockerfile: Dockerfile.train
+    depends_on:
+      - timescaledb
+    restart: unless-stopped
+    profiles: ["shadow"]
+    environment:
+      DATABASE_URL: "postgres://postgres:postgres@timescaledb:5432/micrometeo?sslmode=disable"
+      RAIN_SITE: "home"
+      RAIN_HORIZON_HOURS: "1"
+      RAIN_MODEL_NAME: "rain_next_1h"
+      RAIN_MODEL_VERSION_BASE: "rain-auto-v1-extended-1h"
+      RAIN_MODEL_FAMILY: "auto"
+      RAIN_FEATURE_SET: "extended"
+      RAIN_FORECAST_MODEL: "ecmwf"
+      RAIN_LOOKBACK_DAYS: "60"
+      RAIN_TRAIN_INTERVAL_HOURS: "24"
+      RAIN_PREDICT_INTERVAL_MINUTES: "10"
+      RAIN_MIN_PRECISION: "0.70"
+      RAIN_TUNE_HYPERPARAMETERS: "true"
+      RAIN_MAX_HYPERPARAM_TRIALS: "12"
+      RAIN_CALIBRATION_METHODS: "none,sigmoid,isotonic"
+      RAIN_THRESHOLD_POLICY: "walk_forward"
+      RAIN_WALK_FORWARD_FOLDS: "4"
+      RAIN_ALLOW_EMPTY_DATA: "true"
+      RAIN_MODEL_PATH: "/app/models/rain_model_1h.pkl"
+      RAIN_MODEL_BACKUP_PATH: "/app/models/rain_model_1h.pkl.last_good"
+      RAIN_REPORT_PATH: "/app/models/rain_model_report_1h.json"
+      RAIN_AUDIT_PATH: "/app/models/rain_data_audit_1h.json"
+      RAIN_DATASET_PATH: "/app/models/datasets/rain_dataset_1h_{model_version}_{feature_set}.csv"
+      RAIN_MODEL_CARD_PATH: "/app/models/model_card_1h_{model_version}.md"
    volumes:
      - ./models:/app/models

@@ -39,8 +39,8 @@ scripts/rainml_py.sh scripts/train_rain_model.py \
  --threshold-policy "walk_forward" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v2-extended-4h" \
-  --out "models/rain_model.pkl" \
-  --report-out "models/rain_model_report.json" \
+  --out "models/rain_model_4h.pkl" \
+  --report-out "models/rain_model_report_4h.json" \
  --model-card-out "models/model_card_{model_version}.md" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
 ```
@@ -61,7 +61,7 @@ Review in report:
 ```sh
 scripts/rainml_py.sh scripts/predict_rain_model.py \
  --site home \
-  --model-path "models/rain_model.pkl" \
+  --model-path "models/rain_model_4h.pkl" \
  --model-name "rain_next_4h" \
  --horizon-hours 4 \
  --dry-run
@@ -72,7 +72,7 @@ scripts/rainml_py.sh scripts/predict_rain_model.py \
 ```sh
 scripts/rainml_py.sh scripts/predict_rain_model.py \
  --site home \
-  --model-path "models/rain_model.pkl" \
+  --model-path "models/rain_model_4h.pkl" \
  --model-name "rain_next_4h" \
  --horizon-hours 4
 ```
@@ -155,6 +155,14 @@ The script exits non-zero on failure, so it can directly drive alerting.
 - `RAIN_MODEL_BACKUP_PATH`
 - `RAIN_MODEL_CARD_PATH`

+Dual-run note:
+- `rainml` is configured as 4-hour model training/inference with dedicated artifact paths.
+- `rainml_1h` is available as an optional shadow/baseline service via profile `shadow`.
+- Start both (4h + 1h shadow):
+  `docker compose --profile shadow up -d rainml rainml_1h`
+- Run one-off script against the 1h service:
+  `RAINML_PY_SERVICE=rainml_1h scripts/rainml_py.sh scripts/train_rain_model.py ...`
+
 Recommended production defaults:
 - Enable tuning daily or weekly (`RAIN_TUNE_HYPERPARAMETERS=true`)
 - Set `RAIN_THRESHOLD_POLICY=walk_forward` with `RAIN_WALK_FORWARD_FOLDS=4` for temporally robust threshold selection
@@ -217,6 +225,19 @@ scripts/rainml_py.sh scripts/compare_rain_reports.py \
  --baseline "models/rain_model_report_1h.json" \
  --candidate "models/rain_model_report_4h.json"
 ```
+
+4b. Apply an explicit cutover gate (exit code 0 = pass):
+
+```sh
+scripts/rainml_py.sh scripts/check_rain_cutover_gate.py \
+  --baseline "models/rain_model_report_1h.json" \
+  --candidate "models/rain_model_report_4h.json" \
+  --min-candidate-precision 0.60 \
+  --max-precision-drop 0.05 \
+  --max-pr-auc-drop 0.05 \
+  --max-roc-auc-drop 0.05 \
+  --max-brier-increase 0.03
+```
 5. Run dry-run inference, then live inference with 4h model name/horizon:

 ```sh
@@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Evaluate go/no-go cutover gate for rain model reports.")
+    parser.add_argument("--baseline", required=True, help="Baseline report JSON path (for example 1h production).")
+    parser.add_argument("--candidate", required=True, help="Candidate report JSON path (for example 4h shadow).")
+    parser.add_argument(
+        "--min-candidate-precision",
+        type=float,
+        default=0.60,
+        help="Minimum allowed candidate test precision.",
+    )
+    parser.add_argument(
+        "--max-precision-drop",
+        type=float,
+        default=0.05,
+        help="Maximum allowed drop: candidate_precision >= baseline_precision - value.",
+    )
+    parser.add_argument(
+        "--max-pr-auc-drop",
+        type=float,
+        default=0.05,
+        help="Maximum allowed drop: candidate_pr_auc >= baseline_pr_auc - value.",
+    )
+    parser.add_argument(
+        "--max-roc-auc-drop",
+        type=float,
+        default=0.05,
+        help="Maximum allowed drop: candidate_roc_auc >= baseline_roc_auc - value.",
+    )
+    parser.add_argument(
+        "--max-brier-increase",
+        type=float,
+        default=0.03,
+        help="Maximum allowed increase: candidate_brier <= baseline_brier + value.",
+    )
+    return parser.parse_args()
+
+
+def load_report(path: str) -> dict[str, Any]:
+    p = Path(path)
+    if not p.exists():
+        raise FileNotFoundError(path)
+    with p.open("r", encoding="utf-8") as f:
+        return json.load(f)
+
+
+def metric(report: dict[str, Any], key: str) -> float:
+    value = report.get("test_metrics", {}).get(key)
+    if value is None:
+        raise ValueError(f"missing test metric: {key}")
+    return float(value)
+
+
+def main() -> int:
+    args = parse_args()
+    baseline = load_report(args.baseline)
+    candidate = load_report(args.candidate)
+
+    b_precision = metric(baseline, "precision")
+    c_precision = metric(candidate, "precision")
+    b_pr_auc = metric(baseline, "pr_auc")
+    c_pr_auc = metric(candidate, "pr_auc")
+    b_roc_auc = metric(baseline, "roc_auc")
+    c_roc_auc = metric(candidate, "roc_auc")
+    b_brier = metric(baseline, "brier")
+    c_brier = metric(candidate, "brier")
+
+    checks: list[tuple[str, bool, str]] = []
+    checks.append(
+        (
+            "candidate_precision_floor",
+            c_precision >= args.min_candidate_precision,
+            f"{c_precision:.4f} >= {args.min_candidate_precision:.4f}",
+        )
+    )
+    checks.append(
+        (
+            "precision_drop",
+            c_precision >= (b_precision - args.max_precision_drop),
+            f"{c_precision:.4f} >= {b_precision - args.max_precision_drop:.4f}",
+        )
+    )
+    checks.append(
+        (
+            "pr_auc_drop",
+            c_pr_auc >= (b_pr_auc - args.max_pr_auc_drop),
+            f"{c_pr_auc:.4f} >= {b_pr_auc - args.max_pr_auc_drop:.4f}",
+        )
+    )
+    checks.append(
+        (
+            "roc_auc_drop",
+            c_roc_auc >= (b_roc_auc - args.max_roc_auc_drop),
+            f"{c_roc_auc:.4f} >= {b_roc_auc - args.max_roc_auc_drop:.4f}",
+        )
+    )
+    checks.append(
+        (
+            "brier_increase",
+            c_brier <= (b_brier + args.max_brier_increase),
+            f"{c_brier:.4f} <= {b_brier + args.max_brier_increase:.4f}",
+        )
+    )
+
+    print("Rain cutover gate:")
+    print(f"  baseline: {args.baseline}")
+    print(f"  candidate: {args.candidate}")
+    print(f"  baseline_version={baseline.get('model_version')} candidate_version={candidate.get('model_version')}")
+
+    failures: list[str] = []
+    for name, ok, detail in checks:
+        status = "ok" if ok else "fail"
+        print(f"  {name}: {status} ({detail})")
+        if not ok:
+            failures.append(name)
+
+    if failures:
+        print(f"cutover_decision: FAIL ({', '.join(failures)})")
+        return 1
+
+    print("cutover_decision: PASS")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -16,6 +16,8 @@ def parse_args() -> argparse.Namespace:

 def load_json(path: str) -> dict[str, Any]:
    p = Path(path)
+    if not p.exists():
+        raise FileNotFoundError(path)
    with p.open("r", encoding="utf-8") as f:
        return json.load(f)

@@ -42,8 +44,25 @@ def delta_str(base: float | None, cand: float | None) -> str:

 def main() -> int:
    args = parse_args()
+    try:
        baseline = load_json(args.baseline)
+    except FileNotFoundError:
+        print(f"error: baseline report not found: {args.baseline}")
+        model_dir = Path("models")
+        if model_dir.exists():
+            candidates = sorted(model_dir.glob("rain_model_report*.json"))
+            if candidates:
+                print("available report files:")
+                for c in candidates:
+                    print(f"  - {c}")
+        print("hint: provide an existing 1h report path, or train a new 1h report first.")
+        return 2
+
+    try:
        candidate = load_json(args.candidate)
+    except FileNotFoundError:
+        print(f"error: candidate report not found: {args.candidate}")
+        return 2

    pairs = [
        ("precision", metric(baseline, "test_metrics", "precision"), metric(candidate, "test_metrics", "precision")),
@@ -16,12 +16,16 @@ Examples:
 Optional:
  RAINML_PY_BUILD=1 scripts/rainml_py.sh ...
  (builds the rainml image before running)
+  RAINML_PY_SERVICE=rainml_1h scripts/rainml_py.sh ...
+  (runs against a specific compose service; default is rainml)
 EOF
  exit 1
 fi

+SERVICE="${RAINML_PY_SERVICE:-rainml}"
+
 if [[ "${RAINML_PY_BUILD:-0}" == "1" ]]; then
-  docker compose build rainml
+  docker compose build "$SERVICE"
 fi

-docker compose run --rm --no-deps --entrypoint python3 rainml "$@"
+docker compose run --rm --no-deps --entrypoint python3 "$SERVICE" "$@"
@@ -56,7 +56,7 @@ Priority key: `P0` = critical/blocking, `P1` = important, `P2` = later optimizat
 - [x] [P1] Add one expanded feature set and rerun evaluation. (completed on runtime machine 2026-03-12 with `feature_set=extended`, `model_version=rain-auto-v1-extended-202603120932`)
 - [x] [P0] Decide v1 threshold and define deployment interface.

-## 9) Extension Plan: 4-Hour Precipitation Window (Not Started)
+## 9) Extension Plan: 4-Hour Precipitation Window (In Progress)
 - [x] [P0] Lock v2 target definition for horizon extension: `rain_next_4h_mm >= <threshold_mm>` and explicitly decide whether the threshold remains `0.2mm` or is increased for 4-hour labeling. (implemented with `0.2mm` carry-forward)
 - [x] [P0] Decide rollout strategy: additive dual-horizon support (`1h` + `4h`) vs direct replacement; prefer dual-horizon for safe cutover. (implemented as additive dual-horizon)
 - [x] [P0] Parameterize label horizon in shared ML code (`scripts/rain_model_common.py`) so target columns are generated for 4-hour windows (48 x 5-minute buckets) instead of hard-coded 1-hour columns.
@@ -72,8 +72,11 @@ Priority key: `P0` = critical/blocking, `P1` = important, `P2` = later optimizat
 - [x] [P1] Update dashboard API defaults (`cmd/ingestd/web.go`) from `rain_next_1h` to the selected 4-hour model name (or make model name configurable).
 - [x] [P1] Update web UI labels/semantics (`cmd/ingestd/web/index.html`, `cmd/ingestd/web/app.js`) from “Rain 1h %” to “Rain 4h %” and verify chart legends/tooltips match the new horizon.
 - [x] [P1] Update worker/runtime defaults (`docker-compose.yml`, `scripts/run_rain_ml_worker.py`, `scripts/run_p0_rain_workflow.sh`) to use `rain_next_4h` naming/versioning.
+- [x] [P1] Add dual-run deployment support with isolated artifacts for 4h and 1h workers (`docker-compose.yml` + `scripts/rainml_py.sh` service targeting).
 - [x] [P0] Update health-check defaults (`scripts/check_rain_pipeline_health.py`) for 4-hour evaluation latency (e.g., pending-eval age threshold > 4h).
 - [x] [P1] Update docs and runbooks (`README.md`, `docs/rain_prediction.md`, `docs/rain_model_runbook.md`) so commands, table names, and target definitions match the 4-hour system.
- [ ] [P0] Run full retraining/evaluation for the 4-hour target and compare against current 1-hour model metrics before production cutover.
- [ ] [P0] Execute staged rollout: deploy schema + views, deploy model + inference, verify dashboard/health checks, then switch default model name.
+- [x] [P1] Add explicit automated cutover gate script for baseline-vs-candidate decisioning. (`scripts/check_rain_cutover_gate.py`)
+- [x] [P0] Run full retraining/evaluation for the 4-hour target and compare against current 1-hour model metrics before production cutover. (completed on runtime machine 2026-04-06; comparison vs `rain_model_report_extended_eval.json` showed regression: precision `0.7103 -> 0.5545`, PR-AUC `0.7245 -> 0.5850`, ROC-AUC `0.9184 -> 0.7843`, Brier `0.0931 -> 0.2276`)
+- [ ] [P0] Execute staged rollout: deploy schema + views, deploy model + inference, verify dashboard/health checks, then switch default model name. (schema/views/deploy/inference/health completed on runtime machine 2026-04-06; final production cutover decision remains open due 4h metric regression vs 1h baseline)
+- [ ] [P0] Improve 4-hour model quality to meet cutover gate (at minimum recover precision and calibration relative to current 1-hour production baseline) before declaring rollout complete.
 - [x] [P1] Keep rollback path documented: retain `rain_next_1h` artifacts/table access until 4-hour monitoring is stable. (documented in `docs/rain_model_runbook.md` staged rollout/rollback section)