# Rain Model Runbook

Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.

## 1) One-time Setup

Apply monitoring views:

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
```

## 2) Train + Evaluate

Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --model-card-out "models/model_card_{model_version}.md" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

Review in report:
- `candidate_models[*].hyperparameter_tuning`
- `candidate_models[*].calibration_comparison`
- `naive_baselines_test`
- `walk_forward_backtest`
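
Before reading the report in detail, it can help to confirm that the sections listed above are actually present. The following is a minimal sketch, assuming the report is a JSON object whose top-level keys match the names above and that `candidate_models` is a list of objects. The helper names `missing_report_sections` and `load_report` are hypothetical and are not part of the training script.

```python
import json
from pathlib import Path

# Sections this runbook says to review. Their inner structure is not assumed.
PER_CANDIDATE_KEYS = ("hyperparameter_tuning", "calibration_comparison")
TOP_LEVEL_KEYS = ("naive_baselines_test", "walk_forward_backtest")


def missing_report_sections(report: dict) -> list:
    """Return the names of expected report sections that are absent."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in report]
    # Assumption: candidate_models is a list of dicts, as `candidate_models[*]` suggests.
    candidates = report.get("candidate_models", [])
    if not candidates:
        missing.append("candidate_models")
    for index, candidate in enumerate(candidates):
        for key in PER_CANDIDATE_KEYS:
            if key not in candidate:
                missing.append(f"candidate_models[{index}].{key}")
    return missing


def load_report(path: str = "models/rain_model_report.json") -> dict:
    """Load the JSON report written via --report-out."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

An empty list from `missing_report_sections(load_report())` suggests the run produced every section named above; a non-empty list points at what to investigate first.
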

## 3) Deploy

1. Promote the selected artifact path to the inference worker (`RAIN_MODEL_PATH` or CLI `--model-path`).
2. Run one dry-run inference:

```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h" \
  --dry-run
```

3. Run live inference:

```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"
```

## 4) Rollback

1. Identify the last known-good model artifact in `models/`.
2. Point deployment to that artifact (the worker's `RAIN_MODEL_PATH` environment variable, or `--model-path` for manual inference).
3. Re-run the inference command and verify that rows are written to `predictions_rain_1h`.
4. Keep the failed artifact and its report for the postmortem.

## 5) Monitoring

### Feature drift

```sql
SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.
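
The heuristic above can be sketched as a small check. This assumes one feature's daily z-scores have already been extracted from `rain_feature_drift_daily` into a list; the view's column names are not shown in this runbook, so the hypothetical function `zscore_alert` takes plain numbers.

```python
def zscore_alert(daily_zscores, threshold=3.0, min_consecutive_days=2):
    """Return True if |z| > threshold on at least min_consecutive_days days in a row.

    daily_zscores: z-scores for one feature, one value per day, in day order.
    None values (days without data) break a run.
    """
    run = 0
    for z in daily_zscores:
        if z is not None and abs(z) > threshold:
            run += 1
            if run >= min_consecutive_days:
                return True
        else:
            run = 0
    return False
```

Because the check only looks for a run of consecutive days, it gives the same answer whether the rows are ordered newest-first (as in the query) or oldest-first.
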

### Prediction drift

```sql
SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: `predicted_positive_rate` shifts by more than 2x relative to its trailing 14-day median.
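
One way to express this heuristic in code is shown below. It is a sketch under two assumptions: the daily `predicted_positive_rate` values are passed oldest to newest, and a "shift" counts in both directions (more than double or less than half the median). The function name `prediction_rate_alert` is hypothetical.

```python
from statistics import median


def prediction_rate_alert(rates, window=14, factor=2.0):
    """Compare the newest rate with the median of the preceding `window` days.

    rates: daily predicted_positive_rate values, ordered oldest to newest.
    Returns False when there is not enough history or the median is zero.
    """
    if len(rates) < window + 1:
        return False
    baseline = median(rates[-(window + 1):-1])
    latest = rates[-1]
    if baseline <= 0:
        # A ratio is undefined here; handle an all-dry baseline separately.
        return False
    ratio = latest / baseline
    # Assumption: drift in either direction is worth an alert.
    return ratio > factor or ratio < 1.0 / factor
```

Since the query returns rows newest-first, reverse them before calling the function.
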

### Calibration/performance drift

```sql
SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: a sustained Brier-score increase of more than 25% over the trailing 30-day average.
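
The runbook does not define "sustained", so the sketch below picks one interpretation: each of the last `sustained_days` daily Brier scores must exceed the mean of the 30 days before them by more than 25%. Both that interpretation and the default of 3 days are assumptions, and `brier_alert` is a hypothetical name.

```python
from statistics import mean


def brier_alert(brier_scores, window=30, max_increase=0.25, sustained_days=3):
    """Return True if every one of the last `sustained_days` scores exceeds
    the mean of the `window` days before them by more than `max_increase`.

    brier_scores: daily Brier scores, ordered oldest to newest.
    """
    if len(brier_scores) < window + sustained_days:
        return False
    recent = brier_scores[-sustained_days:]
    baseline = mean(brier_scores[-(window + sustained_days):-sustained_days])
    if baseline <= 0:
        return False
    limit = baseline * (1.0 + max_increase)
    return all(score > limit for score in recent)
```

Using a fixed baseline window that ends before the recent days keeps a degradation from inflating its own reference average.
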

## 6) Pipeline Failure Alerts

Run the health-check script from cron, a systemd timer, or your alerting scheduler:

```sh
python scripts/check_rain_pipeline_health.py \
  --site home \
  --model-name rain_next_1h \
  --max-ws90-age 20m \
  --max-baro-age 30m \
  --max-forecast-age 3h \
  --max-prediction-age 30m \
  --max-pending-eval-age 3h \
  --max-pending-eval-rows 200
```

The script exits non-zero on failure, so it can directly drive alerting.
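
If your scheduler cannot act on exit codes by itself, a thin wrapper can turn a non-zero exit into a notification. The sketch below is not part of the repository: `run_health_check` and its `notify` callback are hypothetical, and the default callback simply writes to stderr.

```python
import subprocess
import sys


def _to_stderr(message):
    """Default notifier: write the message to stderr."""
    print(message, file=sys.stderr)


def run_health_check(command, notify=_to_stderr):
    """Run `command`; call `notify` with its output if the exit code is non-zero.

    command: argument list, e.g. the health-check invocation split into tokens.
    Returns the exit code so the caller can propagate it.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        output = (result.stdout + result.stderr).strip()
        notify(f"health check failed (exit {result.returncode}):\n{output}")
    return result.returncode
```

Pass the same arguments as in the command above as a list, and replace `notify` with whatever sends a page, email, or webhook in your environment.
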

## 7) Continuous Worker Defaults

`docker-compose.yml` provides these controls for `rainml`:
- `RAIN_TUNE_HYPERPARAMETERS`
- `RAIN_MAX_HYPERPARAM_TRIALS`
- `RAIN_CALIBRATION_METHODS`
- `RAIN_WALK_FORWARD_FOLDS`
- `RAIN_ALLOW_EMPTY_DATA`
- `RAIN_MODEL_CARD_PATH`

Recommended production defaults:
- Enable tuning daily or weekly (`RAIN_TUNE_HYPERPARAMETERS=true`).
- Keep walk-forward folds at `0` in continuous mode (`RAIN_WALK_FORWARD_FOLDS=0`); run fold backtests in scheduled evaluation jobs instead.