go-weatherstation/docs/rain_model_runbook.md
2026-03-12 19:55:51 +11:00

Rain Model Runbook

Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.

1) One-time Setup

Apply monitoring views:

docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql

2) Train + Evaluate

Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --model-card-out "models/model_card_{model_version}.md" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"

Review these sections in the report:

  • candidate_models[*].hyperparameter_tuning
  • candidate_models[*].calibration_comparison
  • naive_baselines_test
  • walk_forward_backtest
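
The report sections listed above can be pulled out for a quick side-by-side review. This is a minimal sketch, not part of the repository: it assumes the report is a JSON object in which candidate_models is a list of objects, as the `[*]` notation suggests. The function names `load_report` and `summarize_report` are hypothetical.

```python
import json
from pathlib import Path


def load_report(path):
    """Read a training report such as models/rain_model_report.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_report(report):
    """Collect the four sections the runbook asks reviewers to check.

    Assumption: candidate_models is a list of dicts. Missing keys come
    back as None rather than raising, so a partial report is still usable.
    """
    candidates = report.get("candidate_models") or []
    return {
        "hyperparameter_tuning": [c.get("hyperparameter_tuning") for c in candidates],
        "calibration_comparison": [c.get("calibration_comparison") for c in candidates],
        "naive_baselines_test": report.get("naive_baselines_test"),
        "walk_forward_backtest": report.get("walk_forward_backtest"),
    }
```

For example, `print(json.dumps(summarize_report(load_report("models/rain_model_report.json")), indent=2))` prints only the sections of interest.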

3) Deploy

  1. Promote the selected artifact path to the inference worker (RAIN_MODEL_PATH or CLI --model-path).
  2. Run one dry-run inference:
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h" \
  --dry-run
  3. Run live inference:
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"

4) Rollback

  1. Identify the last known-good model artifact in models/.
  2. Point the deployment at that artifact (worker env RAIN_MODEL_PATH, or --model-path for manual inference).
  3. Re-run the inference command and verify that new rows are written to predictions_rain_1h.
  4. Keep the failed artifact/report for postmortem.
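
To help with step 1, the artifacts in models/ can be listed newest first. This is only a sketch: it assumes artifacts are `.pkl` files, as in the training example, and modification time says nothing about whether a model was good. The operator still has to pick the known-good one, for instance by checking its report. `list_artifacts` is a hypothetical helper.

```python
from pathlib import Path


def list_artifacts(models_dir="models", pattern="*.pkl"):
    """Return model artifact paths sorted by modification time, newest first.

    Assumption: artifacts match *.pkl directly under models_dir.
    """
    paths = [p for p in Path(models_dir).glob(pattern) if p.is_file()]
    return sorted(paths, key=lambda p: p.stat().st_mtime, reverse=True)
```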

5) Monitoring

Feature drift

SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.
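
The heuristic above could be checked in code roughly as follows. This is a sketch under assumptions: the column layout of rain_feature_drift_daily is not shown here, so the function takes a plain list of daily z-scores for one feature, one value per day with no gaps. `has_sustained_drift` is a hypothetical name.

```python
def has_sustained_drift(daily_z, threshold=3.0, min_days=2):
    """Return True if |z| > threshold on at least min_days consecutive days.

    daily_z: z-scores for a single feature, one per day, in date order
    (ascending or descending; the run check is symmetric). None values
    are treated as "no drift" and break a run.
    """
    run = 0
    for z in daily_z:
        if z is not None and abs(z) > threshold:
            run += 1
            if run >= min_days:
                return True
        else:
            run = 0
    return False
```

Run it once per feature and alert if any feature returns True.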

Prediction drift

SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: predicted_positive_rate differs from its trailing 14-day median by more than a factor of 2.
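
A possible implementation of this check is sketched below. It reads the 2x rule as applying in both directions (more than double or less than half the median), which is an interpretation rather than something the runbook states. The function name and the handling of a zero median are also assumptions.

```python
from statistics import median


def positive_rate_shifted(rates, window=14, factor=2.0):
    """Compare the latest daily rate with the median of the preceding window.

    rates: daily predicted_positive_rate values in chronological order,
    with the day under test last. Returns False when there is not enough
    history to form a full trailing window.
    """
    if len(rates) < window + 1:
        return False
    *history, today = rates
    baseline = median(history[-window:])
    if baseline <= 0:
        # Assumption: any positive rate after an all-zero baseline is a shift.
        return today > 0
    ratio = today / baseline
    return ratio > factor or ratio < 1.0 / factor
```

Because the view query returns rows newest first, reverse them before passing the column in.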

Calibration/performance drift

SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: a sustained Brier-score increase of more than 25% over the trailing 30-day average.
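
One way to make this heuristic concrete is sketched below. The runbook does not define "sustained", so the `sustain_days=3` default is purely an illustrative assumption, as is the function name. The baseline is the mean of the 30 days immediately before the days under test, so a recent degradation does not inflate its own baseline.

```python
from statistics import fmean


def brier_degraded(scores, window=30, max_increase=0.25, sustain_days=3):
    """Flag a sustained Brier-score increase over the trailing average.

    scores: daily Brier scores in chronological order (lower is better).
    Returns True only if each of the last sustain_days scores exceeds
    the baseline mean by more than max_increase (0.25 means 25%).
    """
    if len(scores) < window + sustain_days:
        return False  # not enough history for a full baseline
    recent = scores[-sustain_days:]
    baseline = fmean(scores[-(window + sustain_days):-sustain_days])
    limit = baseline * (1.0 + max_increase)
    return all(s > limit for s in recent)
```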

6) Pipeline Failure Alerts

Run the health-check script from cron, a systemd timer, or your alerting scheduler:

python scripts/check_rain_pipeline_health.py \
  --site home \
  --model-name rain_next_1h \
  --max-ws90-age 20m \
  --max-baro-age 30m \
  --max-forecast-age 3h \
  --max-prediction-age 30m \
  --max-pending-eval-age 3h \
  --max-pending-eval-rows 200

The script exits non-zero on failure, so it can directly drive alerting.
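
If the scheduler cannot act on exit codes directly, a thin wrapper can translate them into notifications. The sketch below relies only on the documented non-zero exit behaviour. `run_check`, `notify`, and `main` are hypothetical names, and `notify` is a placeholder to be replaced with whatever alert channel is in use.

```python
import subprocess
import sys

# Same flags as the command above; thresholds are the runbook's examples.
HEALTH_CHECK_CMD = [
    sys.executable, "scripts/check_rain_pipeline_health.py",
    "--site", "home",
    "--model-name", "rain_next_1h",
    "--max-ws90-age", "20m",
    "--max-baro-age", "30m",
    "--max-forecast-age", "3h",
    "--max-prediction-age", "30m",
    "--max-pending-eval-age", "3h",
    "--max-pending-eval-rows", "200",
]


def run_check(cmd, timeout_s=120):
    """Run a command and return (ok, combined_output); ok means exit code 0."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"health check timed out after {timeout_s}s"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def notify(message):
    """Placeholder alert hook (assumption): writes to stderr."""
    print(f"ALERT: {message}", file=sys.stderr)


def main():
    """Run the health check once; return 0 when healthy, 1 otherwise."""
    ok, output = run_check(HEALTH_CHECK_CMD)
    if not ok:
        notify(output or "rain pipeline health check failed")
    return 0 if ok else 1
```

Calling `sys.exit(main())` from the scheduled entry point keeps the non-zero exit code visible to cron or systemd as well.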

7) Continuous Worker Defaults

docker-compose.yml provides these controls for rainml:

  • RAIN_TUNE_HYPERPARAMETERS
  • RAIN_MAX_HYPERPARAM_TRIALS
  • RAIN_CALIBRATION_METHODS
  • RAIN_WALK_FORWARD_FOLDS
  • RAIN_ALLOW_EMPTY_DATA
  • RAIN_MODEL_CARD_PATH

Recommended production defaults:

  • Enable tuning daily or weekly (RAIN_TUNE_HYPERPARAMETERS=true)
  • Keep walk-forward folds at 0 in continuous mode (RAIN_WALK_FORWARD_FOLDS=0) and run fold backtests in scheduled evaluation jobs