Rain Model Runbook
Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.
1) One-time Setup
Apply 4-hour prediction table migration:
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/003_rain_predictions_4h.sql
Apply monitoring views:
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
2) Train + Evaluate
Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):
scripts/rainml_py.sh scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--horizon-hours 4 \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--tune-hyperparameters \
--max-hyperparam-trials 12 \
--calibration-methods "none,sigmoid,isotonic" \
--threshold-policy "walk_forward" \
--walk-forward-folds 4 \
--model-version "rain-auto-v2-extended-4h" \
--out "models/rain_model.pkl" \
--report-out "models/rain_model_report.json" \
--model-card-out "models/model_card_{model_version}.md" \
--dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
Review in report:
- candidate_models[*].hyperparameter_tuning
- candidate_models[*].calibration_comparison
- naive_baselines_test
- sliced_performance_test
- threshold_tuning_walk_forward
- walk_forward_backtest
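Before reading the numbers, it can help to confirm that the report actually contains every section in the review list above. The following is a minimal sketch, assuming the report is a JSON object with those names as keys and candidate_models as a list of objects; the function names are hypothetical and not part of the repository.

```python
import json
from pathlib import Path

# Section names come from the review list; the JSON layout is an assumption.
TOP_LEVEL_SECTIONS = [
    "naive_baselines_test",
    "sliced_performance_test",
    "threshold_tuning_walk_forward",
    "walk_forward_backtest",
]
PER_CANDIDATE_SECTIONS = ["hyperparameter_tuning", "calibration_comparison"]


def missing_report_sections(report: dict) -> list[str]:
    """Return the names of expected sections that are absent from the report."""
    missing = [name for name in TOP_LEVEL_SECTIONS if name not in report]
    candidates = report.get("candidate_models")
    if not candidates:
        missing.append("candidate_models")
        return missing
    for index, candidate in enumerate(candidates):
        for name in PER_CANDIDATE_SECTIONS:
            if name not in candidate:
                missing.append(f"candidate_models[{index}].{name}")
    return missing


def check_report_file(path: str) -> list[str]:
    """Load a report from disk and list its missing sections."""
    return missing_report_sections(json.loads(Path(path).read_text()))
```

For example, check_report_file("models/rain_model_report.json") would return an empty list when every section is present.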
3) Deploy
- Promote the selected artifact path to the inference worker (RAIN_MODEL_PATH or CLI --model-path).
- Run one dry-run inference:
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4 \
--dry-run
- Run live inference:
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4
4) Rollback
- The worker now keeps a backup model at RAIN_MODEL_BACKUP_PATH and promotes new models only after candidate training succeeds.
- If promotion fails or no candidate model is produced, the worker keeps the active model unchanged.
- If inference starts without RAIN_MODEL_PATH but a backup exists, the worker restores from backup automatically.
- Keep failed candidate artifacts for postmortem.
- During 4-hour rollout stabilization, keep predictions_rain_1h and rain_next_1h model artifacts available for immediate fallback.
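The backup and restore behaviour described in this section can be pictured with the sketch below. It is an illustration only, not the worker's actual code: the function names are hypothetical, and it assumes the active and backup models are plain files at the RAIN_MODEL_PATH and RAIN_MODEL_BACKUP_PATH locations.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_model_path(active: Path, backup: Path) -> Optional[Path]:
    """Return a usable model path, restoring the active file from backup if needed."""
    if active.exists():
        return active
    if backup.exists():
        active.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup, active)  # restore; the backup itself stays in place
        return active
    return None  # nothing to serve; the caller decides how to fail


def promote_candidate(candidate: Optional[Path], active: Path, backup: Path) -> bool:
    """Promote a candidate only if it exists; otherwise leave the active model untouched."""
    if candidate is None or not candidate.exists():
        return False
    if active.exists():
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(active, backup)  # keep the previous model for rollback
    active.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(candidate, active)  # the candidate file is kept for postmortem
    return True
```

Copying rather than moving files keeps both the candidate and the backup on disk, which matches the advice to retain artifacts for postmortem.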
5) Monitoring
Feature drift
SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.
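A minimal sketch of this heuristic, assuming you have already extracted one feature's daily z-scores from the view as a list of floats (the column names in the view are not assumed here, and the function name is hypothetical):

```python
from typing import Sequence


def zscore_alert(daily_z: Sequence[float], threshold: float = 3.0, min_days: int = 2) -> bool:
    """True if |z| exceeds the threshold on at least min_days consecutive days."""
    run = 0
    for z in daily_z:
        # Extend the current run of out-of-range days, or reset it.
        run = run + 1 if abs(z) > threshold else 0
        if run >= min_days:
            return True
    return False
```

Because the check only looks for a consecutive run, it gives the same answer whether the rows are ordered oldest-first or newest-first, provided there are no missing days.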
Prediction drift
SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
Alert heuristic: predicted_positive_rate differs from the trailing 14-day median by more than a factor of 2.
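One way to express this heuristic is sketched below. It assumes the shift counts in both directions (more than double or less than half the median) and that history holds earlier daily predicted_positive_rate values ordered oldest to newest; the function name is hypothetical.

```python
from statistics import median
from typing import Sequence


def positive_rate_alert(
    history: Sequence[float], today: float, factor: float = 2.0, window: int = 14
) -> bool:
    """True if today's rate is more than `factor` times above or below the trailing median."""
    trailing = list(history)[-window:]
    if not trailing:
        return False  # no baseline yet
    baseline = median(trailing)
    if baseline == 0:
        return False  # ratio is undefined; handle long dry spells separately
    return today > baseline * factor or today < baseline / factor
```

A zero median is plausible during dry periods, so a production version would likely need an absolute floor in addition to the ratio test.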
Calibration/performance drift
SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
Alert heuristic: sustained Brier-score increase > 25% from trailing 30-day average.
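The runbook does not define "sustained", so the sketch below assumes it means several consecutive days (three by default) each exceeding the average of the preceding 30 days by more than 25%. The input is a list of daily Brier scores ordered oldest to newest; the function name and the sustained_days parameter are hypothetical.

```python
from typing import Sequence


def brier_alert(
    daily_brier: Sequence[float],
    baseline_days: int = 30,
    sustained_days: int = 3,
    max_increase: float = 0.25,
) -> bool:
    """True if each of the last `sustained_days` scores exceeds the average of the
    preceding `baseline_days` scores by more than `max_increase` (a fraction)."""
    scores = list(daily_brier)
    if len(scores) < baseline_days + sustained_days:
        return False  # not enough history to form a baseline
    recent = scores[-sustained_days:]
    baseline_window = scores[-(sustained_days + baseline_days):-sustained_days]
    baseline = sum(baseline_window) / len(baseline_window)
    return all(score > baseline * (1 + max_increase) for score in recent)
```

Keeping the recent days out of the baseline window prevents a degradation from inflating its own reference value.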
6) Pipeline Failure Alerts
Run the health-check script from cron, a systemd timer, or your alerting scheduler:
scripts/rainml_py.sh scripts/check_rain_pipeline_health.py \
--site home \
--model-name rain_next_4h \
--horizon-hours 4 \
--max-ws90-age 20m \
--max-baro-age 30m \
--max-forecast-age 3h \
--max-prediction-age 30m \
--max-pending-eval-age 6h \
--max-pending-eval-rows 200
The script exits non-zero on failure, so it can directly drive alerting.
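If your scheduler calls Python rather than the shell, the exit code can be consumed as in the sketch below. It relies only on the non-zero-on-failure behaviour stated above; the wrapper name and the timeout are assumptions, and what the script prints is not assumed.

```python
import subprocess
import sys
from typing import Sequence


def health_check_ok(command: Sequence[str], timeout_s: float = 120.0) -> bool:
    """Run the command and report success based only on its exit code."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # a hung or missing script is treated as a failure
    if result.returncode != 0:
        # Forward whatever the script printed so the alert carries context.
        print(result.stdout + result.stderr, file=sys.stderr)
    return result.returncode == 0
```

Pass the same arguments as the command above, split into a list, for example ["scripts/rainml_py.sh", "scripts/check_rain_pipeline_health.py", "--site", "home", "--model-name", "rain_next_4h", "--horizon-hours", "4"], and raise your alert when the function returns False.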
7) Continuous Worker Defaults
docker-compose.yml provides these controls for rainml:
- RAIN_TUNE_HYPERPARAMETERS
- RAIN_MAX_HYPERPARAM_TRIALS
- RAIN_CALIBRATION_METHODS
- RAIN_THRESHOLD_POLICY
- RAIN_WALK_FORWARD_FOLDS
- RAIN_ALLOW_EMPTY_DATA
- RAIN_HORIZON_HOURS
- RAIN_MODEL_BACKUP_PATH
- RAIN_MODEL_CARD_PATH
Recommended production defaults:
- Enable tuning daily or weekly (RAIN_TUNE_HYPERPARAMETERS=true).
- Set RAIN_THRESHOLD_POLICY=walk_forward with RAIN_WALK_FORWARD_FOLDS=4 for temporally robust threshold selection.
8) Auto-Recommend Candidate
To compare saved training reports and pick a deployment candidate automatically:
scripts/rainml_py.sh scripts/recommend_rain_model.py \
--reports-glob "models/rain_model_report*.json" \
--require-walk-forward \
--top-k 5 \
--json-out "models/rain_model_recommendation.json"
9) Staged 4h Rollout Checklist
Run this sequence in production/staging to satisfy the 4h cutover gate:
- Apply schema migration for 4h predictions:
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/003_rain_predictions_4h.sql
- Re-apply monitoring views (now include 1h + 4h unions):
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
- Run a full 4h training/evaluation cycle and save report:
scripts/rainml_py.sh scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--horizon-hours 4 \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--tune-hyperparameters \
--threshold-policy "walk_forward" \
--walk-forward-folds 4 \
--model-version "rain-auto-v2-extended-4h" \
--out "models/rain_model_4h.pkl" \
--report-out "models/rain_model_report_4h.json"
- Compare 4h metrics against the latest 1h benchmark report before switching dashboard defaults:
scripts/rainml_py.sh scripts/compare_rain_reports.py \
--baseline "models/rain_model_report_1h.json" \
--candidate "models/rain_model_report_4h.json"
- Run dry-run inference, then live inference with 4h model name/horizon:
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4 \
--dry-run
scripts/rainml_py.sh scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model_4h.pkl" \
--model-name "rain_next_4h" \
--horizon-hours 4
- Validate health checks and dashboard data path for 4h:
scripts/rainml_py.sh scripts/check_rain_pipeline_health.py \
--site home \
--model-name rain_next_4h \
--horizon-hours 4 \
--max-pending-eval-age 6h
- Keep 1h path live in parallel until 4h drift/calibration remains stable for at least 7 days.
Fast rollback to 1h
If 4h performance or pipeline health regresses:
- Set worker env back to: RAIN_HORIZON_HOURS=1, RAIN_MODEL_NAME=rain_next_1h, and a known-good 1h model path/version.
- Restart the rainml service.
- Confirm check_rain_pipeline_health.py --horizon-hours 1 --model-name rain_next_1h returns ok.
- Keep predictions_rain_4h data for postmortem; do not drop tables during rollback.