# Rain Model Runbook
Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.
## 1) One-time Setup
Apply 4-hour prediction table migration:
```sh
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/003_rain_predictions_4h.sql
```
Apply monitoring views:
```sh
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
```
## 2) Train + Evaluate
Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):
```sh
scripts/rainml_py.sh scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--horizon-hours 4 \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--tune-hyperparameters \
--max-hyperparam-trials 12 \
--calibration-methods "none,sigmoid,isotonic" \
--threshold-policy "walk_forward" \
--walk-forward-folds 4 \
--model-version "rain-auto-v2-extended-4h" \
--out "models/rain_model_4h.pkl" \
--report-out "models/rain_model_report_4h.json" \
--model-card-out "models/model_card_{model_version}.md" \
--dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```
Review these fields in the report:
- `candidate_models[*].hyperparameter_tuning`
- `candidate_models[*].calibration_comparison`
- `naive_baselines_test`
- `sliced_performance_test`
- `threshold_tuning_walk_forward`
- `walk_forward_backtest`
## 3) Deploy
1. Promote the selected artifact path to the inference worker (`RAIN_MODEL_PATH` or CLI `--model-path`).
2. Run one dry-run inference:
```sh
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4 \
--dry-run
```
3. Run live inference:
```sh
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4
```
## 4) Rollback
1. The worker now keeps a backup model at `RAIN_MODEL_BACKUP_PATH` and promotes new models only after candidate training succeeds.
2. If promotion fails or no candidate model is produced, the worker keeps the active model unchanged.
3. If inference starts and no model is available at `RAIN_MODEL_PATH` but a backup exists, the worker restores the model from the backup automatically.
4. Keep failed candidate artifacts for postmortem.
5. During 4-hour rollout stabilization, keep `predictions_rain_1h` and `rain_next_1h` model artifacts available for immediate fallback.
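
The promote-and-restore behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the worker's actual implementation: the function names `promote_model` and `restore_from_backup` are hypothetical, and it assumes the active and backup models are plain files (the paths that `RAIN_MODEL_PATH` and `RAIN_MODEL_BACKUP_PATH` point to).

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def promote_model(candidate: Optional[Path], active: Path, backup: Path) -> bool:
    """Promote a candidate model, backing up the active one first.

    Hypothetical helper: returns False and leaves the active model
    untouched when no candidate artifact was produced.
    """
    if candidate is None or not candidate.is_file():
        return False
    if active.is_file():
        # Preserve the currently active model before anything is replaced.
        shutil.copy2(active, backup)
    # Stage a copy next to the active path, then swap it in with a rename
    # (atomic when both paths are on the same filesystem). The candidate
    # file itself is left in place for later inspection.
    staged = active.with_name(active.name + ".staged")
    shutil.copy2(candidate, staged)
    os.replace(staged, active)
    return True


def restore_from_backup(active: Path, backup: Path) -> bool:
    """Copy the backup into place only if the active model is missing."""
    if active.is_file() or not backup.is_file():
        return False
    shutil.copy2(backup, active)
    return True
```

Staging the copy and renaming it avoids a window in which inference could read a half-written model file.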
## 5) Monitoring
### Feature drift
```sql
SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```
Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.
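
As a rough sketch of this heuristic, the check below scans one feature's daily z-scores for a run of consecutive breaches. It assumes the z-scores have already been fetched from the view as one value per day; the function names are hypothetical, and a missing day (`None`) is treated as breaking the streak, which is an assumption rather than documented behaviour.

```python
from typing import Optional, Sequence


def longest_breach_streak(
    zscores: Sequence[Optional[float]], threshold: float = 3.0
) -> int:
    """Return the longest run of consecutive days with |z| > threshold."""
    longest = current = 0
    for z in zscores:
        if z is not None and abs(z) > threshold:
            current += 1
            longest = max(longest, current)
        else:
            # A normal or missing day resets the streak.
            current = 0
    return longest


def should_alert(
    zscores: Sequence[Optional[float]], threshold: float = 3.0, min_days: int = 2
) -> bool:
    """Alert when the threshold is breached on min_days or more consecutive days."""
    return longest_breach_streak(zscores, threshold) >= min_days
```

Because only run length matters, the input may be ordered oldest-first or newest-first, as long as the days are contiguous.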
### Prediction drift
```sql
SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```
Alert heuristic: `predicted_positive_rate` differs from the trailing 14-day median by more than a factor of 2.
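
One possible reading of this heuristic, sketched in Python: compare today's rate with the median of the preceding days and alert when the ratio leaves the band [1/2, 2]. Treating the shift as two-sided is an assumption, and the function name and its inputs are hypothetical; the caller is expected to pass the previous 14 daily values of `predicted_positive_rate`.

```python
from statistics import median
from typing import Sequence


def positive_rate_shifted(
    history: Sequence[float], today: float, factor: float = 2.0
) -> bool:
    """Return True when today's rate is off the trailing median by > factor.

    history holds the trailing daily rates (for example the last 14 days),
    excluding today.
    """
    if not history:
        return False
    base = median(history)
    if base == 0:
        # The ratio is undefined with a zero baseline; this sketch does not
        # alert here and leaves that case to a separate check.
        return False
    ratio = today / base
    return ratio > factor or ratio < 1.0 / factor
```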
### Calibration/performance drift
```sql
SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```
Alert heuristic: a sustained Brier-score increase of more than 25% over the trailing 30-day average.
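
The runbook does not define "sustained", so the sketch below assumes it means every one of the most recent `recent_days` days (3 here, a hypothetical default) exceeds the baseline by more than 25%. The baseline is the mean of the 30 days immediately before that recent window. The function name and input shape are illustrative only.

```python
from typing import Sequence


def brier_regressed(
    daily_brier: Sequence[float],
    baseline_days: int = 30,
    recent_days: int = 3,
    max_increase: float = 0.25,
) -> bool:
    """Check for a sustained Brier-score regression.

    daily_brier must be in chronological order, oldest first.
    """
    if len(daily_brier) < baseline_days + recent_days:
        # Not enough history to form a baseline plus a recent window.
        return False
    baseline_window = daily_brier[-(baseline_days + recent_days):-recent_days]
    baseline = sum(baseline_window) / len(baseline_window)
    limit = baseline * (1.0 + max_increase)
    # "Sustained" here means every recent day is above the limit.
    return all(score > limit for score in daily_brier[-recent_days:])
```

Keeping the recent window out of the baseline prevents a regression from inflating the average it is compared against.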
## 6) Pipeline Failure Alerts
Use the health-check script in cron, systemd timer, or your alerting scheduler:
```sh
scripts/rainml_py.sh scripts/check_rain_pipeline_health.py \
--site home \
--model-name rain_next_4h \
--horizon-hours 4 \
--max-ws90-age 20m \
--max-baro-age 30m \
--max-forecast-age 3h \
--max-prediction-age 30m \
--max-pending-eval-age 6h \
--max-pending-eval-rows 200
```
The script exits non-zero on failure, so it can directly drive alerting.
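
A scheduler wrapper only needs the exit code. The sketch below shows one way to run a command and turn its exit status into a pass/fail result with captured output; `run_health_check` is a hypothetical helper, and what you do with a failure (page, email, log) is left to your alerting setup.

```python
import subprocess
from typing import List, Tuple


def run_health_check(cmd: List[str], timeout_s: float = 120.0) -> Tuple[bool, str]:
    """Run a health-check command and report (passed, combined output).

    A non-zero exit code, a timeout, or a command that cannot be started
    all count as failures.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"health check did not complete: {exc}"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


# Example call, using flags from the command above:
# ok, output = run_health_check([
#     "scripts/rainml_py.sh", "scripts/check_rain_pipeline_health.py",
#     "--site", "home", "--model-name", "rain_next_4h", "--horizon-hours", "4",
# ])
# if not ok:
#     ...  # hand `output` to your alerting channel
```

Treating timeouts and launch errors as failures means a hung or missing script also raises an alert instead of passing silently.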
## 7) Continuous Worker Defaults
`docker-compose.yml` provides these controls for `rainml`:
- `RAIN_TUNE_HYPERPARAMETERS`
- `RAIN_MAX_HYPERPARAM_TRIALS`
- `RAIN_CALIBRATION_METHODS`
- `RAIN_THRESHOLD_POLICY`
- `RAIN_WALK_FORWARD_FOLDS`
- `RAIN_ALLOW_EMPTY_DATA`
- `RAIN_HORIZON_HOURS`
- `RAIN_MODEL_BACKUP_PATH`
- `RAIN_MODEL_CARD_PATH`
Dual-run note:
- `rainml` is configured as 4-hour model training/inference with dedicated artifact paths.
- `rainml_1h` is available as an optional shadow/baseline service via profile `shadow`.
- Start both (4h + 1h shadow):
`docker compose --profile shadow up -d rainml rainml_1h`
- Run one-off script against the 1h service:
`RAINML_PY_SERVICE=rainml_1h scripts/rainml_py.sh scripts/train_rain_model.py ...`
Recommended production defaults:
- Enable tuning daily or weekly (`RAIN_TUNE_HYPERPARAMETERS=true`)
- Set `RAIN_THRESHOLD_POLICY=walk_forward` with `RAIN_WALK_FORWARD_FOLDS=4` for temporally robust threshold selection
## 8) Auto-Recommend Candidate
To compare saved training reports and pick a deployment candidate automatically:
```sh
scripts/rainml_py.sh scripts/recommend_rain_model.py \
--reports-glob "models/rain_model_report*.json" \
--require-walk-forward \
--top-k 5 \
--json-out "models/rain_model_recommendation.json"
```
## 9) Staged 4h Rollout Checklist
Run this sequence in production/staging to satisfy the 4h cutover gate:
1. Apply schema migration for 4h predictions:
```sh
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/003_rain_predictions_4h.sql
```
2. Re-apply the monitoring views (they now include 1h + 4h unions):
```sh
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
```
3. Run a full 4h training/evaluation cycle and save the report:
```sh
scripts/rainml_py.sh scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--horizon-hours 4 \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--tune-hyperparameters \
--threshold-policy "walk_forward" \
--walk-forward-folds 4 \
--model-version "rain-auto-v2-extended-4h" \
--out "models/rain_model_4h.pkl" \
--report-out "models/rain_model_report_4h.json"
```
4. Compare 4h metrics against the latest 1h benchmark report before switching dashboard defaults:
```sh
scripts/rainml_py.sh scripts/compare_rain_reports.py \
--baseline "models/rain_model_report_1h.json" \
--candidate "models/rain_model_report_4h.json"
```
4b. Apply an explicit cutover gate (exit code 0 = pass):
```sh
scripts/rainml_py.sh scripts/check_rain_cutover_gate.py \
--baseline "models/rain_model_report_1h.json" \
--candidate "models/rain_model_report_4h.json" \
--min-candidate-precision 0.60 \
--max-precision-drop 0.05 \
--max-pr-auc-drop 0.05 \
--max-roc-auc-drop 0.05 \
--max-brier-increase 0.03
```
5. Run dry-run inference, then live inference with 4h model name/horizon:
```sh
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4 \
--dry-run
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4
```
6. Validate health checks and dashboard data path for 4h:
```sh
scripts/rainml_py.sh scripts/check_rain_pipeline_health.py \
--site home \
--model-name rain_next_4h \
--horizon-hours 4 \
--max-pending-eval-age 6h
```
7. Keep 1h path live in parallel until 4h drift/calibration remains stable for at least 7 days.
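
To make the thresholds in step 4b concrete, here is a minimal sketch of how such a gate could be evaluated. It is not the logic of `check_rain_cutover_gate.py`: the metric keys (`precision`, `pr_auc`, `roc_auc`, `brier`) and the assumption that drops and increases are absolute differences (not relative) are hypothetical, as is the function name.

```python
from typing import Dict, List


def cutover_gate_failures(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    min_candidate_precision: float = 0.60,
    max_precision_drop: float = 0.05,
    max_pr_auc_drop: float = 0.05,
    max_roc_auc_drop: float = 0.05,
    max_brier_increase: float = 0.03,
) -> List[str]:
    """Return a list of failed checks; an empty list means the gate passes."""
    failures: List[str] = []
    if candidate["precision"] < min_candidate_precision:
        failures.append("candidate precision below minimum")
    # Higher is better for these metrics, so a drop is baseline - candidate.
    drop_limits = (
        ("precision", max_precision_drop),
        ("pr_auc", max_pr_auc_drop),
        ("roc_auc", max_roc_auc_drop),
    )
    for key, max_drop in drop_limits:
        if baseline[key] - candidate[key] > max_drop:
            failures.append(f"{key} dropped by more than {max_drop}")
    # Lower is better for the Brier score, so an increase is a regression.
    if candidate["brier"] - baseline["brier"] > max_brier_increase:
        failures.append(f"brier increased by more than {max_brier_increase}")
    return failures
```

A caller would map an empty list to exit code 0 and anything else to a non-zero exit, matching the pass convention in step 4b.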
### Fast rollback to 1h
If 4h performance or pipeline health regresses:
1. Set worker env back to:
`RAIN_HORIZON_HOURS=1`, `RAIN_MODEL_NAME=rain_next_1h`, and a known-good 1h model path/version.
2. Restart `rainml` service.
3. Confirm `check_rain_pipeline_health.py --horizon-hours 1 --model-name rain_next_1h` returns `ok`.
4. Keep `predictions_rain_4h` data for postmortem; do not drop tables during rollback.