go-weatherstation/docs/rain_prediction.md
2026-03-12 19:55:51 +11:00

Rain Prediction (Next 1 Hour)

This project includes a baseline workflow for binary rain prediction:

Will we see >= 0.2 mm of rain in the next hour?

It uses local observations (WS90 + barometer), trains a logistic regression baseline, and writes model-driven predictions back to TimescaleDB.
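
As an illustration of the target definition, the label can be sketched as a forward-looking sum of rain increments. This sketch assumes a 5-minute observation cadence (12 rows per hour) and per-interval rain amounts in mm; the function name label_rain_next_1h is hypothetical, and the project's scripts may derive the label differently.

```python
from typing import List, Optional

STEPS_PER_HOUR = 12      # assumption: one observation every 5 minutes
RAIN_THRESHOLD_MM = 0.2  # target threshold stated in this document


def label_rain_next_1h(rain_mm: List[float]) -> List[Optional[int]]:
    """Return 1/0 per row, or None where the next hour is not fully observed."""
    labels: List[Optional[int]] = []
    for i in range(len(rain_mm)):
        # Only rows strictly after row i count towards "the next hour".
        future = rain_mm[i + 1 : i + 1 + STEPS_PER_HOUR]
        if len(future) < STEPS_PER_HOUR:
            labels.append(None)  # incomplete horizon: leave unlabeled
            continue
        labels.append(1 if sum(future) >= RAIN_THRESHOLD_MM else 0)
    return labels
```

Rows near the end of the window get no label because their next hour is not yet observed, which is one reason realized outcomes can only be filled in later.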

P0 Decisions (Locked)

  • Target: rain_next_1h_mm >= 0.2.
  • Primary use-case: a low-noise rain heads-up signal for the dashboard, and a candidate signal for alerting.
  • Frozen v1 training window (UTC): 2026-02-01T00:00:00Z to 2026-03-03T23:55:00Z.
  • Threshold policy: choose threshold on validation set by maximizing recall under precision >= 0.70; fallback to max-F1 if the precision constraint is unreachable.
  • Acceptance gate (test split): report and track precision, recall, ROC-AUC, PR-AUC, Brier score, and confusion matrix.
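
The threshold policy above can be sketched in plain Python as follows. This is a minimal illustration, not the code in train_rain_model.py: the function names and the tie-break on precision are assumptions, and candidate thresholds are simply the distinct predicted probabilities on the validation set.

```python
def precision_recall_f1(y_true, probs, threshold):
    """Precision, recall and F1 when predicting rain for probs >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def choose_threshold(y_true, probs, min_precision=0.70):
    """Maximize recall subject to precision >= min_precision, else max F1."""
    scored = [(t, *precision_recall_f1(y_true, probs, t)) for t in sorted(set(probs))]
    feasible = [s for s in scored if s[1] >= min_precision]
    if feasible:
        # Highest recall wins; ties go to the higher precision (assumption).
        best = max(feasible, key=lambda s: (s[2], s[1]))
        return best[0], "max_recall_at_min_precision"
    # Precision constraint unreachable: fall back to the max-F1 threshold.
    best = max(scored, key=lambda s: s[3])
    return best[0], "max_f1_fallback"
```

Choosing the threshold on the validation split only keeps the test split untouched for the acceptance gate.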

Requirements

Python 3.10+ and the following packages:

pandas
numpy
scikit-learn
psycopg2-binary
joblib

Install with:

python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt

Scripts

  • scripts/audit_rain_data.py: data quality + label quality + class balance audit.
  • scripts/train_rain_model.py: strict time-based split training and metrics report, with optional validation-only hyperparameter tuning, calibration comparison, naive baseline comparison, and walk-forward folds.
  • scripts/predict_rain_model.py: inference using saved model artifact; upserts into predictions_rain_1h.
  • scripts/run_rain_ml_worker.py: long-running worker for periodic training + prediction.
  • scripts/check_rain_pipeline_health.py: freshness/failure check for alerting.
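
A strict time-based split, as used by the training script, can be sketched like this. The row layout (dicts with a "ts" key) and the function name are assumptions for illustration; the key point is that rows are ordered by time and never shuffled, so validation and test data always lie after the training data.

```python
def time_split(rows, train_ratio=0.7, val_ratio=0.15):
    """Split rows chronologically into train / validation / test."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    n = len(ordered)
    # round() avoids off-by-one errors from float products such as 0.7 * n.
    train_end = round(n * train_ratio)
    val_end = train_end + round(n * val_ratio)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

With the ratios from the training command in this document (0.7 and 0.15), the remaining 15% of rows form the test split.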

Feature-set options:

  • baseline: original 5 local observation features.
  • extended: adds wind-direction encoding, lag/rolling stats, recent rain accumulation, and aligned forecast features from forecast_openmeteo_hourly.
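
The wind-direction encoding in the extended set is typically a sine/cosine pair, so that 359° and 1° end up close together instead of far apart. A minimal sketch, assuming the direction is given in degrees (the function name is hypothetical):

```python
import math


def encode_wind_dir(deg):
    """Map a compass direction in degrees to (sin, cos) on the unit circle."""
    rad = math.radians(deg % 360.0)
    return math.sin(rad), math.cos(rad)
```

The two outputs would correspond to features such as wind_dir_sin and wind_dir_cos.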

Model-family options (train_rain_model.py):

  • logreg: logistic regression baseline.
  • hist_gb: histogram gradient boosting (tree-based baseline).
  • auto: trains both logreg and hist_gb, picks the best validation model by PR-AUC, then ROC-AUC, then F1.
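
The ordering used by auto (PR-AUC first, then ROC-AUC, then F1) amounts to a lexicographic comparison. A small sketch, assuming each candidate is summarized as a dict of validation metrics with the hypothetical keys shown:

```python
def pick_best(candidates):
    """Pick the candidate with the best (PR-AUC, ROC-AUC, F1) on validation."""
    # Tuples compare element by element, so later metrics only break ties.
    return max(candidates, key=lambda m: (m["pr_auc"], m["roc_auc"], m["f1"]))
```

For example, two candidates with equal PR-AUC would be separated by ROC-AUC, and F1 would only matter if both of those tie.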

Usage

1) Apply schema update (existing DBs)

001_schema.sql defines the predictions_rain_1h table. On a database that was initialized before this table was added, re-apply the schema:

docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/001_schema.sql

Apply monitoring views:

docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql

2) Run data audit

export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"

python scripts/audit_rain_data.py \
  --site home \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "baseline" \
  --out "models/rain_data_audit.json"

3) Train baseline model

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --train-ratio 0.7 \
  --val-ratio 0.15 \
  --min-precision 0.70 \
  --feature-set "baseline" \
  --model-family "logreg" \
  --model-version "rain-logreg-v1" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
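
The {model_version} and {feature_set} placeholders in --dataset-out are filled in by the script. A sketch of how such a template could be expanded, assuming Python str.format-style substitution (the script's actual mechanism may differ, and render_path is a hypothetical name):

```python
def render_path(template, model_version, feature_set):
    """Fill the {model_version} and {feature_set} placeholders in a path."""
    return template.format(model_version=model_version, feature_set=feature_set)
```

Under that assumption, the command above would write its dataset snapshot to models/datasets/rain_dataset_rain-logreg-v1_baseline.csv.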

3b) Train expanded (P1) feature-set model

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "logreg" \
  --forecast-model "ecmwf" \
  --model-version "rain-logreg-v1-extended" \
  --out "models/rain_model_extended.pkl" \
  --report-out "models/rain_model_report_extended.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"

3c) Train tree-based baseline (P1)

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "hist_gb" \
  --forecast-model "ecmwf" \
  --model-version "rain-hgb-v1-extended" \
  --out "models/rain_model_hgb.pkl" \
  --report-out "models/rain_model_report_hgb.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"

3d) Auto-compare logistic vs tree baseline

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model_auto.pkl" \
  --report-out "models/rain_model_report_auto.json"

3e) Full P1 evaluation (tuning + calibration + walk-forward)

python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended-eval" \
  --out "models/rain_model_auto.pkl" \
  --report-out "models/rain_model_report_auto.json" \
  --model-card-out "models/model_card_{model_version}.md"
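
Walk-forward evaluation trains on an initial span of data, tests on the span that follows, then moves forward. The sketch below uses an expanding training window over equal-sized blocks; this is one common scheme and an assumption here, since the document does not say whether the script uses expanding or sliding windows.

```python
def walk_forward_folds(n_rows, n_folds=4):
    """Return (train_range, test_range) index pairs in chronological order."""
    block = n_rows // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough rows for the requested number of folds")
    folds = []
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The last fold absorbs any remainder rows.
        test_end = n_rows if k == n_folds else block * (k + 1)
        folds.append((range(0, train_end), range(train_end, test_end)))
    return folds
```

Each fold's test range starts exactly where its training range ends, so no fold is evaluated on data older than what it was trained on.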

4) Run inference and store prediction

python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"

5) One-command P0 workflow

export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
bash scripts/run_p0_rain_workflow.sh

6) Continuous training + prediction via Docker Compose

The rainml service in docker-compose.yml provides:

  • periodic retraining (default every 24 hours)
  • periodic prediction writes (default every 10 minutes)
  • configurable tuning/calibration behavior (RAIN_TUNE_HYPERPARAMETERS, RAIN_MAX_HYPERPARAM_TRIALS, RAIN_CALIBRATION_METHODS)
  • graceful gap handling for temporary source outages (RAIN_ALLOW_EMPTY_DATA=true)
  • optional model-card output (RAIN_MODEL_CARD_PATH)
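
The worker's cadence can be pictured as two independent timers. The sketch below is illustrative only: the interval constants mirror the defaults listed above, the helper names are hypothetical, and the set of strings accepted as "true" is an assumption about how flags such as RAIN_ALLOW_EMPTY_DATA are parsed.

```python
import os

TRAIN_INTERVAL_S = 24 * 60 * 60  # default retraining cadence (24 hours)
PREDICT_INTERVAL_S = 10 * 60     # default prediction cadence (10 minutes)


def env_flag(name, default=False, env=None):
    """Read a boolean flag from the environment (accepted values assumed)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def due_tasks(now_s, last_train_s, last_predict_s):
    """Return which tasks are due, given the time each last ran (or None)."""
    tasks = []
    if last_train_s is None or now_s - last_train_s >= TRAIN_INTERVAL_S:
        tasks.append("train")
    if last_predict_s is None or now_s - last_predict_s >= PREDICT_INTERVAL_S:
        tasks.append("predict")
    return tasks
```

A long-running loop would call due_tasks periodically, run whatever is due, and, when RAIN_ALLOW_EMPTY_DATA is enabled, skip a cycle with no source data rather than exit.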

Artifacts are persisted to ./models on the host.

docker compose up -d rainml
docker compose logs -f rainml

Output

  • Audit report: models/rain_data_audit.json
  • Training report: models/rain_model_report.json
  • Model card: models/model_card_<model_version>.md
  • Model artifact: models/rain_model.pkl
  • Dataset snapshot: models/datasets/rain_dataset_<model_version>_<feature_set>.csv
  • Prediction rows: predictions_rain_1h (probability + threshold decision + realized outcome fields once available)
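
Once realized outcomes are available for stored predictions, two of the acceptance-gate metrics (Brier score and confusion matrix) can be recomputed from probability/outcome pairs. A minimal sketch with hypothetical function names, assuming outcomes are coded as 0/1:

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def confusion_counts(y_true, probs, threshold):
    """Count TN/FP/FN/TP when predicting rain for probs >= threshold."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts
```

A lower Brier score indicates better-calibrated probabilities; the confusion counts depend on the chosen decision threshold.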

Model Features (v1 baseline)

  • pressure_trend_1h
  • humidity
  • temperature_c
  • wind_avg_m_s
  • wind_max_m_s

Model Features (extended set)

  • baseline features, plus:
  • wind_dir_sin, wind_dir_cos
  • temp_lag_5m, temp_roll_1h_mean, temp_roll_1h_std
  • humidity_lag_5m, humidity_roll_1h_mean, humidity_roll_1h_std
  • wind_avg_lag_5m, wind_avg_roll_1h_mean, wind_gust_roll_1h_max
  • pressure_lag_5m, pressure_roll_1h_mean, pressure_roll_1h_std
  • rain_last_1h_mm
  • fc_temp_c, fc_rh, fc_pressure_msl_hpa, fc_wind_m_s, fc_wind_gust_m_s, fc_precip_mm, fc_precip_prob, fc_cloud_cover
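
The lag and rolling features above can be sketched for a single series as follows. This assumes a 5-minute cadence (so a 1-hour window is 12 rows), a trailing window that includes the current row, and population standard deviation; the project's scripts may use different conventions, for example sample standard deviation.

```python
from statistics import mean, pstdev

STEPS_PER_HOUR = 12  # assumption: one observation every 5 minutes


def lag_and_rolling(values, window=STEPS_PER_HOUR):
    """Per row: previous value, trailing mean and trailing std (no look-ahead)."""
    features = []
    for i in range(len(values)):
        # Trailing window ends at the current row, so no future data leaks in.
        hist = values[max(0, i - window + 1) : i + 1]
        features.append({
            "lag_5m": values[i - 1] if i >= 1 else None,
            "roll_1h_mean": mean(hist),
            "roll_1h_std": pstdev(hist) if len(hist) > 1 else 0.0,
        })
    return features
```

Applied to the temperature series, the three outputs would correspond to temp_lag_5m, temp_roll_1h_mean, and temp_roll_1h_std.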

Notes