nathan/go-weatherstation

Nathan Coad 9785fc0235 improve model training

2026-03-12 20:39:44 +11:00

Rain Model Runbook

Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.

1) One-time Setup

Apply monitoring views:

docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql

2) Train + Evaluate

Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --threshold-policy "walk_forward" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --model-card-out "models/model_card_{model_version}.md" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
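The --model-card-out and --dataset-out values above contain {model_version} and {feature_set} placeholders. The sketch below shows one plausible way such templates resolve, using plain str.format. The helper name expand_output_path is hypothetical, and whether train_rain_model.py expands placeholders this way is an assumption, not something this runbook confirms.

```python
def expand_output_path(template, model_version, feature_set):
    """Fill {model_version} and {feature_set} placeholders in an output path.

    Hypothetical helper: it illustrates the placeholder convention only and
    is not taken from train_rain_model.py.
    """
    return template.format(model_version=model_version, feature_set=feature_set)


card_path = expand_output_path(
    "models/model_card_{model_version}.md",
    model_version="rain-auto-v1-extended",
    feature_set="extended",
)
dataset_path = expand_output_path(
    "models/datasets/rain_dataset_{model_version}_{feature_set}.csv",
    model_version="rain-auto-v1-extended",
    feature_set="extended",
)
```

Under that assumption, each model version gets its own model card and dataset file instead of overwriting the previous run's output.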

Review in report:

candidate_models[*].hyperparameter_tuning
candidate_models[*].calibration_comparison
naive_baselines_test
sliced_performance_test
threshold_tuning_walk_forward
walk_forward_backtest
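A quick way to confirm a report contains every section in the list above before reviewing it by hand is a small presence check. This is a minimal sketch: it assumes the report is a JSON object whose top-level keys match the names listed and that candidate_models is a list of objects. The function name missing_report_sections is hypothetical.

```python
import json
from pathlib import Path

# Section names come from the review list; treating them as top-level keys
# of the report JSON is an assumption about the report layout.
TOP_LEVEL_SECTIONS = [
    "naive_baselines_test",
    "sliced_performance_test",
    "threshold_tuning_walk_forward",
    "walk_forward_backtest",
]
PER_CANDIDATE_SECTIONS = ["hyperparameter_tuning", "calibration_comparison"]


def missing_report_sections(report):
    """Return the review-list sections that are absent from a report dict."""
    missing = [key for key in TOP_LEVEL_SECTIONS if key not in report]
    candidates = report.get("candidate_models")
    if not candidates:
        missing.append("candidate_models")
        return missing
    for index, candidate in enumerate(candidates):
        for key in PER_CANDIDATE_SECTIONS:
            if key not in candidate:
                missing.append(f"candidate_models[{index}].{key}")
    return missing


def load_report(path):
    """Read a training report from disk and parse it as JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Typical use would be missing_report_sections(load_report("models/rain_model_report.json")). An empty list means every section is present.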

3) Deploy

Point the inference worker at the selected artifact, either through the RAIN_MODEL_PATH environment variable or the --model-path CLI flag.
Run one dry-run inference:

python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h" \
  --dry-run

Run live inference:

python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"

4) Rollback

The worker keeps a backup model at RAIN_MODEL_BACKUP_PATH and promotes a new model only after candidate training succeeds.
If promotion fails or no candidate model is produced, the worker leaves the active model unchanged.
If inference starts without RAIN_MODEL_PATH but a backup exists, the worker restores from the backup automatically.
Keep failed candidate artifacts for postmortem.
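The promote-with-backup behaviour described above can be sketched as follows. This is an illustration of the pattern, not the worker's actual code. The function names are hypothetical, and "starts without RAIN_MODEL_PATH" is read here as "the active model file is missing", which is an assumption.

```python
import os
import shutil
from pathlib import Path


def promote_candidate(candidate, active, backup):
    """Promote candidate to active, keeping the previous active model as backup.

    Returns False and leaves the active model untouched when no candidate
    file exists. The candidate file itself is copied, never moved, so it
    stays available for postmortem.
    """
    candidate, active, backup = Path(candidate), Path(active), Path(backup)
    if not candidate.is_file():
        return False  # no candidate produced: keep the active model unchanged
    if active.is_file():
        shutil.copy2(active, backup)  # preserve the current model for rollback
    staged = active.with_name(active.name + ".staged")
    shutil.copy2(candidate, staged)
    os.replace(staged, active)  # atomic swap when both paths share a filesystem
    return True


def restore_from_backup(active, backup):
    """Copy the backup into place when the active model file is missing."""
    active, backup = Path(active), Path(backup)
    if active.is_file() or not backup.is_file():
        return False
    shutil.copy2(backup, active)
    return True
```

Staging the copy next to the active file and swapping it in with os.replace means a reader of the active path sees either the old model or the new one, never a partially written file.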

5) Monitoring

Feature drift

SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.
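The heuristic above can be sketched as a streak check over one feature's daily z-scores. The function name and the input shape (a plain list of floats in day order) are assumptions. The column layout of rain_feature_drift_daily is not shown in this runbook, so pulling z-scores out of the query result is left to the caller.

```python
def feature_drift_alert(z_scores, threshold=3.0, min_consecutive_days=2):
    """Return True if |z| exceeds threshold on enough consecutive days.

    z_scores: daily z-scores for one feature, in day order. A None entry
    (missing day) breaks the streak.
    """
    streak = 0
    for z in z_scores:
        if z is not None and abs(z) > threshold:
            streak += 1
            if streak >= min_consecutive_days:
                return True
        else:
            streak = 0
    return False
```

For example, feature_drift_alert([0.5, 3.2, -3.5]) fires because two adjacent days exceed the threshold in absolute value, while feature_drift_alert([3.2, 0.1, 3.4]) does not.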

Prediction drift

SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: predicted_positive_rate shifts by > 2x relative to trailing 14-day median.
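One way to read the heuristic above is to compare the latest predicted_positive_rate with the median of the 14 days before it and alert when the ratio leaves the range [0.5, 2]. Treating a 2x shift as two-sided, and the handling of a zero baseline, are assumptions made for this sketch. The function name is hypothetical.

```python
from statistics import median


def prediction_drift_alert(rates, window=14, factor=2.0):
    """Return True if the latest rate shifts by more than `factor` either way.

    rates: daily predicted_positive_rate values, oldest first, with the
    latest day last. The baseline is the median of the `window` days
    immediately before the latest day.
    """
    if len(rates) < window + 1:
        return False  # not enough history to form a baseline
    latest = rates[-1]
    baseline = median(rates[-(window + 1):-1])
    if baseline == 0:
        # Assumption: any positive rate against an all-dry baseline is a shift.
        return latest > 0
    ratio = latest / baseline
    return ratio > factor or ratio < 1.0 / factor
```

Note that the monitoring query returns rows newest first, so the rates would need to be reversed before being passed in.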

Calibration/performance drift

SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;

Alert heuristic: sustained Brier-score increase > 25% from trailing 30-day average.
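For reference, the Brier score is the mean squared difference between predicted probabilities and binary outcomes, so lower is better. The sketch below computes it and applies the heuristic above. The runbook does not define "sustained", so the 3-day default is an assumption, as are the function names and the list-of-floats input.

```python
from statistics import mean


def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probabilities, outcomes))


def calibration_drift_alert(daily_brier, window=30, max_increase=0.25,
                            sustained_days=3):
    """Return True if every recent day exceeds the trailing average by the limit.

    daily_brier: daily Brier scores, oldest first. The baseline is the mean
    of the `window` days immediately before the most recent `sustained_days`.
    """
    if len(daily_brier) < window + sustained_days:
        return False  # not enough history to form a baseline
    recent = daily_brier[-sustained_days:]
    baseline = mean(daily_brier[-(window + sustained_days):-sustained_days])
    limit = baseline * (1.0 + max_increase)
    return all(score > limit for score in recent)
```

Excluding the most recent days from the baseline keeps a developing regression from raising its own threshold.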

6) Pipeline Failure Alerts

Run the health-check script from cron, a systemd timer, or your alerting scheduler:

python scripts/check_rain_pipeline_health.py \
  --site home \
  --model-name rain_next_1h \
  --max-ws90-age 20m \
  --max-baro-age 30m \
  --max-forecast-age 3h \
  --max-prediction-age 30m \
  --max-pending-eval-age 3h \
  --max-pending-eval-rows 200

The script exits non-zero on failure, so it can directly drive alerting.
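If the scheduler cannot act on exit codes directly, a thin wrapper can turn the exit status into a boolean for custom alerting. This is a minimal sketch. The function name is hypothetical, and treating a timeout or a missing executable as a failure is a design choice made here, not behaviour of the health-check script.

```python
import subprocess
import sys


def pipeline_is_healthy(command, timeout_seconds=120):
    """Run a health-check command and return True only on exit code 0."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung check or a missing executable is treated as a failure.
        return False
    return completed.returncode == 0


# Command line mirroring the invocation above (age thresholds omitted).
HEALTH_CHECK = [
    sys.executable,
    "scripts/check_rain_pipeline_health.py",
    "--site", "home",
    "--model-name", "rain_next_1h",
]
```

A caller would run pipeline_is_healthy(HEALTH_CHECK) and send a notification when it returns False.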

7) Continuous Worker Defaults

docker-compose.yml provides these controls for rainml:

RAIN_TUNE_HYPERPARAMETERS
RAIN_MAX_HYPERPARAM_TRIALS
RAIN_CALIBRATION_METHODS
RAIN_THRESHOLD_POLICY
RAIN_WALK_FORWARD_FOLDS
RAIN_ALLOW_EMPTY_DATA
RAIN_MODEL_BACKUP_PATH
RAIN_MODEL_CARD_PATH

Recommended production defaults:

Enable tuning daily or weekly (RAIN_TUNE_HYPERPARAMETERS=true)
Keep walk-forward folds at 0 in continuous mode (RAIN_WALK_FORWARD_FOLDS=0); run fold backtests in scheduled evaluation jobs instead

To compare saved training reports and pick a deployment candidate automatically:

python scripts/recommend_rain_model.py \
  --reports-glob "models/rain_model_report*.json" \
  --require-walk-forward \
  --top-k 5 \
  --json-out "models/rain_model_recommendation.json"
