bugfixes

2026-03-12 19:55:51 +11:00
parent 76851f0816
commit d1237eed44
12 changed files with 1444 additions and 82 deletions
@@ -0,0 +1,141 @@
+# Rain Model Runbook
+
+Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.
+
+## 1) One-time Setup
+
+Apply monitoring views:
+
+```sh
+docker compose exec -T timescaledb \
+  psql -U postgres -d micrometeo \
+  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
+```
+
+## 2) Train + Evaluate
+
+Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):
+
+```sh
+python scripts/train_rain_model.py \
+  --site "home" \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --feature-set "extended" \
+  --model-family "auto" \
+  --forecast-model "ecmwf" \
+  --tune-hyperparameters \
+  --max-hyperparam-trials 12 \
+  --calibration-methods "none,sigmoid,isotonic" \
+  --walk-forward-folds 4 \
+  --model-version "rain-auto-v1-extended" \
+  --out "models/rain_model.pkl" \
+  --report-out "models/rain_model_report.json" \
+  --model-card-out "models/model_card_{model_version}.md" \
+  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
+```
+
+Review these fields in the report (`models/rain_model_report.json`):
+- `candidate_models[*].hyperparameter_tuning`
+- `candidate_models[*].calibration_comparison`
+- `naive_baselines_test`
+- `walk_forward_backtest`
+
+## 3) Deploy
+
+1. Promote the selected artifact path to the inference worker (`RAIN_MODEL_PATH` or CLI `--model-path`).
+2. Run one dry-run inference:
+
+```sh
+python scripts/predict_rain_model.py \
+  --site home \
+  --model-path "models/rain_model.pkl" \
+  --model-name "rain_next_1h" \
+  --dry-run
+```
+
+3. Run live inference:
+
+```sh
+python scripts/predict_rain_model.py \
+  --site home \
+  --model-path "models/rain_model.pkl" \
+  --model-name "rain_next_1h"
+```
+
+## 4) Rollback
+
+1. Identify the last known-good model artifact in `models/`.
+2. Point deployment to that artifact (worker env `RAIN_MODEL_PATH`, or `--model-path` for manual inference).
+3. Re-run the inference command and verify that new rows are written to `predictions_rain_1h`.
+4. Keep the failed artifact/report for postmortem.
+
+## 5) Monitoring
+
+### Feature drift
+
+```sql
+SELECT *
+FROM rain_feature_drift_daily
+WHERE site = 'home'
+ORDER BY day DESC
+LIMIT 30;
+```
+
+Alert heuristic: alert when any absolute z-score exceeds 3 on two or more consecutive days.
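
The consecutive-day rule above can be sketched in Python. This is a minimal illustration, not code from the repository: it assumes the daily z-scores for a single feature have already been read from `rain_feature_drift_daily` into a list ordered oldest to newest, and the function name and parameters are hypothetical.

```python
from typing import Sequence


def has_sustained_drift(
    z_scores: Sequence[float],
    threshold: float = 3.0,
    min_consecutive_days: int = 2,
) -> bool:
    """Return True if |z| > threshold on min_consecutive_days days in a row.

    z_scores is assumed to hold one value per day, oldest first
    (hypothetical input shape, not taken from the view definition).
    """
    run = 0  # length of the current streak of out-of-range days
    for z in z_scores:
        if abs(z) > threshold:
            run += 1
            if run >= min_consecutive_days:
                return True
        else:
            run = 0  # streak broken, start counting again
    return False
```

A single spike does not trigger the alert; only an unbroken run of out-of-range days does, which is what keeps one noisy day from paging anyone.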
+
+### Prediction drift
+
+```sql
+SELECT *
+FROM rain_prediction_drift_daily
+WHERE site = 'home'
+ORDER BY day DESC
+LIMIT 30;
+```
+
+Alert heuristic: `predicted_positive_rate` changes by more than a factor of 2 (above 2x or below 0.5x) relative to its trailing 14-day median.
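
One way to express this check in Python is sketched below. It interprets the shift as a change by more than a factor of 2 in either direction, which is an assumption, and it expects the daily `predicted_positive_rate` values as a list ordered oldest to newest. The function name, parameters, and the handling of a zero baseline are hypothetical choices made for illustration.

```python
from statistics import median
from typing import Sequence


def positive_rate_shifted(
    daily_rates: Sequence[float],
    window_days: int = 14,
    factor: float = 2.0,
) -> bool:
    """Compare the latest day against the median of the preceding window."""
    if len(daily_rates) < window_days + 1:
        return False  # not enough history to form a baseline
    latest = daily_rates[-1]
    # Baseline excludes the latest day so it cannot mask its own shift.
    baseline = median(daily_rates[-(window_days + 1):-1])
    if baseline == 0:
        # Assumption: any positive rate against a zero baseline is a shift.
        return latest > 0
    ratio = latest / baseline
    return ratio > factor or ratio < 1.0 / factor
```

Using the median rather than the mean keeps a single unusually wet day inside the window from moving the baseline much.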
+
+### Calibration/performance drift
+
+```sql
+SELECT *
+FROM rain_calibration_drift_daily
+WHERE site = 'home'
+ORDER BY day DESC
+LIMIT 30;
+```
+
+Alert heuristic: Brier score stays more than 25% above its trailing 30-day average (a higher Brier score means worse probabilistic accuracy).
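
The runbook does not define "sustained", so the sketch below assumes it means three consecutive days, each above the limit. It takes daily Brier scores ordered oldest to newest; the function name, parameters, and the three-day default are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def brier_degraded(
    daily_brier: Sequence[float],
    window_days: int = 30,
    max_relative_increase: float = 0.25,
    sustained_days: int = 3,
) -> bool:
    """Return True if each of the last sustained_days scores exceeds the limit."""
    if len(daily_brier) < window_days + sustained_days:
        return False  # not enough history to form a baseline
    recent = daily_brier[-sustained_days:]
    # Baseline is the window immediately before the recent days, so a
    # degradation in progress does not inflate its own reference level.
    baseline = fmean(daily_brier[-(window_days + sustained_days):-sustained_days])
    limit = baseline * (1.0 + max_relative_increase)
    return all(score > limit for score in recent)
```

With a baseline of 0.10 the limit is 0.125, so three days at 0.13 or worse would trigger the alert while a single bad day would not.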
+
+## 6) Pipeline Failure Alerts
+
+Run the health-check script from cron, a systemd timer, or your alerting scheduler:
+
+```sh
+python scripts/check_rain_pipeline_health.py \
+  --site home \
+  --model-name rain_next_1h \
+  --max-ws90-age 20m \
+  --max-baro-age 30m \
+  --max-forecast-age 3h \
+  --max-prediction-age 30m \
+  --max-pending-eval-age 3h \
+  --max-pending-eval-rows 200
+```
+
+The script exits non-zero on failure, so it can directly drive alerting.
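
Because only the exit code matters, a thin wrapper is enough to connect the check to an alert channel. The sketch below is an assumption-laden illustration: `run_health_check` and `notify_stderr` are hypothetical helpers, and printing to stderr stands in for whatever notification mechanism is actually used. It relies only on the standard-library `subprocess.run`.

```python
import subprocess
import sys
from typing import Callable, Sequence


def notify_stderr(message: str) -> None:
    """Placeholder alert channel: write the message to stderr."""
    print(message, file=sys.stderr)


def run_health_check(
    command: Sequence[str],
    notify: Callable[[str], None] = notify_stderr,
) -> int:
    """Run command and call notify once if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Prefer stderr for details, fall back to stdout if stderr is empty.
        details = (result.stderr or result.stdout).strip()
        notify(f"health check failed (exit {result.returncode}): {details}")
    return result.returncode
```

Passing the same argument list shown in the shell command above, for example `["python", "scripts/check_rain_pipeline_health.py", "--site", "home", "--model-name", "rain_next_1h"]`, would run the real check; the wrapper returns the exit code unchanged so a scheduler can still act on it.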
+
+## 7) Continuous Worker Defaults
+
+`docker-compose.yml` provides these controls for `rainml`:
+- `RAIN_TUNE_HYPERPARAMETERS`
+- `RAIN_MAX_HYPERPARAM_TRIALS`
+- `RAIN_CALIBRATION_METHODS`
+- `RAIN_WALK_FORWARD_FOLDS`
+- `RAIN_ALLOW_EMPTY_DATA`
+- `RAIN_MODEL_CARD_PATH`
+
+Recommended production defaults:
+- Enable hyperparameter tuning (`RAIN_TUNE_HYPERPARAMETERS=true`) on a daily or weekly cadence.
+- Keep walk-forward folds at `0` in continuous mode (`RAIN_WALK_FORWARD_FOLDS=0`); run fold backtests in scheduled evaluation jobs instead.