Rain Prediction (Next 1 Hour)
This project includes a baseline workflow for binary rain prediction:
Will we see >= 0.2 mm of rain in the next hour?
It uses local observations (WS90 + barometer), trains a logistic regression baseline, and writes model-driven predictions back to TimescaleDB.
P0 Decisions (Locked)
- Target: rain_next_1h_mm >= 0.2.
- Primary use-case: low-noise rain heads-up signal for dashboard + alert candidate.
- Frozen v1 training window (UTC): 2026-02-01T00:00:00Z to 2026-03-03T23:55:00Z.
- Threshold policy: choose threshold on validation set by maximizing recall under precision >= 0.70; fallback to max-F1 if the precision constraint is unreachable.
- Acceptance gate (test split): report and track precision, recall, ROC-AUC, PR-AUC, Brier score, and confusion matrix.
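The threshold policy above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in scripts/train_rain_model.py: the function names, the candidate grid (every distinct predicted probability), and the tie-break on precision are all hypothetical.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Compute precision, recall, and F1 when predicting rain for p >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def choose_threshold(y_true, y_prob, min_precision=0.70):
    """Maximize recall subject to precision >= min_precision; else fall back to max F1."""
    if not y_prob:
        raise ValueError("validation set is empty")
    # Assumption: candidate thresholds are the distinct predicted probabilities.
    scored = [(t, *precision_recall_f1(y_true, y_prob, t)) for t in sorted(set(y_prob))]
    feasible = [s for s in scored if s[1] >= min_precision]
    if feasible:
        # Highest recall wins; ties are broken by higher precision (an assumption).
        best = max(feasible, key=lambda s: (s[2], s[1]))
        return best[0], "max_recall_at_min_precision"
    best = max(scored, key=lambda s: s[3])
    return best[0], "max_f1_fallback"
```

For example, `choose_threshold([0, 0, 1, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])` selects 0.3, the lowest threshold that still keeps validation precision at or above 0.70.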
Requirements
Python 3.10+ and:
pandas
numpy
scikit-learn
psycopg2-binary
joblib
Install with:
python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
Scripts
- scripts/audit_rain_data.py: data quality + label quality + class balance audit.
- scripts/train_rain_model.py: strict time-based split training and metrics report, with optional validation-only hyperparameter tuning, calibration comparison, naive baseline comparison, and walk-forward folds.
- scripts/predict_rain_model.py: inference using saved model artifact; upserts into predictions_rain_1h.
- scripts/run_rain_ml_worker.py: long-running worker for periodic training + prediction.
- scripts/check_rain_pipeline_health.py: freshness/failure check for alerting.
- scripts/recommend_rain_model.py: rank saved training reports and recommend a deployment candidate.
Feature-set options:
- baseline: original 5 local observation features.
- extended: adds wind-direction encoding, lag/rolling stats, recent rain accumulation, and aligned forecast features from forecast_openmeteo_hourly.
- extended_calendar: extended plus UTC calendar seasonality features (hour_*, dow_*, month_*, is_weekend).
Model-family options (train_rain_model.py):
- logreg: logistic regression baseline.
- hist_gb: histogram gradient boosting (tree-based baseline).
- auto: trains both logreg and hist_gb, picks the best validation model by PR-AUC, then ROC-AUC, then F1.
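The auto selection order (PR-AUC, then ROC-AUC, then F1) amounts to a lexicographic comparison of validation metrics. A minimal sketch, assuming a hypothetical dictionary layout with keys `pr_auc`, `roc_auc`, and `f1`; the actual report fields may be named differently:

```python
def pick_best_model(validation_metrics):
    """Return the model name with the best (PR-AUC, ROC-AUC, F1) tuple.

    validation_metrics maps a model-family name to a dict of validation
    scores. The key names here are hypothetical, chosen for illustration.
    """
    if not validation_metrics:
        raise ValueError("no candidate models to compare")
    return max(
        validation_metrics,
        key=lambda name: (
            validation_metrics[name]["pr_auc"],   # primary criterion
            validation_metrics[name]["roc_auc"],  # first tie-breaker
            validation_metrics[name]["f1"],       # second tie-breaker
        ),
    )


candidates = {
    "logreg": {"pr_auc": 0.61, "roc_auc": 0.88, "f1": 0.55},
    "hist_gb": {"pr_auc": 0.61, "roc_auc": 0.90, "f1": 0.50},
}
best = pick_best_model(candidates)
```

In this made-up example the two models tie on PR-AUC, so ROC-AUC decides and `hist_gb` is selected even though its F1 is lower.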
Usage
1) Apply schema update (existing DBs)
001_schema.sql includes predictions_rain_1h.
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/001_schema.sql
Apply monitoring views:
docker compose exec -T timescaledb \
psql -U postgres -d micrometeo \
-f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
2) Run data audit
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
python scripts/audit_rain_data.py \
--site home \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "baseline" \
--out "models/rain_data_audit.json"
3) Train baseline model
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--train-ratio 0.7 \
--val-ratio 0.15 \
--min-precision 0.70 \
--feature-set "baseline" \
--model-family "logreg" \
--model-version "rain-logreg-v1" \
--out "models/rain_model.pkl" \
--report-out "models/rain_model_report.json" \
--dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
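With --train-ratio 0.7 and --val-ratio 0.15, the remaining 15% of rows form the test split. A strict time-based split keeps the three splits in chronological order with no shuffling, so validation and test rows are always later than training rows. The sketch below illustrates the idea; the row layout (`ts` key) and the rounding rule are assumptions, not the script's actual implementation.

```python
def time_based_split(rows, train_ratio=0.7, val_ratio=0.15):
    """Split rows chronologically into train, validation, and test lists.

    rows is a list of dicts with a sortable "ts" key (hypothetical layout).
    No shuffling is applied, so later splits never leak into earlier ones.
    """
    if not 0 < train_ratio < 1 or not 0 <= val_ratio < 1 or train_ratio + val_ratio >= 1:
        raise ValueError("ratios must leave a non-empty test share")
    ordered = sorted(rows, key=lambda r: r["ts"])
    n = len(ordered)
    train_end = round(n * train_ratio)
    val_end = train_end + round(n * val_ratio)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Twenty 5-minute buckets, deliberately given in reverse order.
rows = [{"ts": i * 300, "value": i} for i in reversed(range(20))]
train, val, test = time_based_split(rows)
```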
3b) Train extended (P1) feature-set model
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended" \
--model-family "logreg" \
--forecast-model "ecmwf" \
--model-version "rain-logreg-v1-extended" \
--out "models/rain_model_extended.pkl" \
--report-out "models/rain_model_report_extended.json" \
--dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
3b.1) Train extended + calendar (P2) feature-set model
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended_calendar" \
--model-family "auto" \
--forecast-model "ecmwf" \
--model-version "rain-auto-v1-extended-calendar" \
--out "models/rain_model_extended_calendar.pkl" \
--report-out "models/rain_model_report_extended_calendar.json"
3c) Train tree-based baseline (P1)
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended" \
--model-family "hist_gb" \
--forecast-model "ecmwf" \
--model-version "rain-hgb-v1-extended" \
--out "models/rain_model_hgb.pkl" \
--report-out "models/rain_model_report_hgb.json" \
--dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
3d) Auto-compare logistic vs tree baseline
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--model-version "rain-auto-v1-extended" \
--out "models/rain_model_auto.pkl" \
--report-out "models/rain_model_report_auto.json"
3e) Full P1 evaluation (tuning + calibration + walk-forward)
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--tune-hyperparameters \
--max-hyperparam-trials 12 \
--calibration-methods "none,sigmoid,isotonic" \
--walk-forward-folds 4 \
--model-version "rain-auto-v1-extended-eval" \
--out "models/rain_model_auto.pkl" \
--report-out "models/rain_model_report_auto.json" \
--model-card-out "models/model_card_{model_version}.md"
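Walk-forward evaluation repeatedly trains on the past and tests on the block that follows. The sketch below shows one common variant (an expanding training window over equal-sized chronological chunks) for --walk-forward-folds 4. Whether train_rain_model.py uses an expanding or a sliding window is not stated here, so treat the fold layout as an assumption.

```python
def walk_forward_folds(n_samples, n_folds=4):
    """Return (train_indices, test_indices) pairs for expanding-window folds.

    The samples are assumed to be sorted by time. The series is cut into
    n_folds + 1 chunks; fold i trains on the first i chunks and tests on
    the next one. The final fold absorbs any leftover rows.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("not enough samples for the requested folds")
    chunk = n_samples // (n_folds + 1)
    folds = []
    for i in range(1, n_folds + 1):
        train_end = chunk * i
        test_end = n_samples if i == n_folds else chunk * (i + 1)
        folds.append((range(0, train_end), range(train_end, test_end)))
    return folds


folds = walk_forward_folds(10, n_folds=4)
```

With 10 samples and 4 folds, the first fold trains on indices 0-1 and tests on 2-3, and the last fold trains on 0-7 and tests on 8-9.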
3f) Walk-forward threshold policy (more temporally robust alert threshold)
python scripts/train_rain_model.py \
--site "home" \
--start "2026-02-01T00:00:00Z" \
--end "2026-03-03T23:55:00Z" \
--feature-set "extended" \
--model-family "auto" \
--forecast-model "ecmwf" \
--threshold-policy "walk_forward" \
--walk-forward-folds 4 \
--model-version "rain-auto-v1-extended-wf-threshold" \
--out "models/rain_model_auto.pkl" \
--report-out "models/rain_model_report_auto.json"
4) Run inference and store prediction
python scripts/predict_rain_model.py \
--site home \
--model-path "models/rain_model.pkl" \
--model-name "rain_next_1h"
5) One-command P0 workflow
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
bash scripts/run_p0_rain_workflow.sh
6) Continuous training + prediction via Docker Compose
The rainml service in docker-compose.yml now runs:
- periodic retraining (default every 24 hours)
- periodic prediction writes (default every 10 minutes)
- configurable tuning/calibration behavior (RAIN_TUNE_HYPERPARAMETERS, RAIN_MAX_HYPERPARAM_TRIALS, RAIN_CALIBRATION_METHODS, RAIN_THRESHOLD_POLICY)
- graceful gap handling for temporary source outages (RAIN_ALLOW_EMPTY_DATA=true)
- automatic rollback path for last-known-good model (RAIN_MODEL_BACKUP_PATH)
- optional model-card output (RAIN_MODEL_CARD_PATH)
Artifacts are persisted to ./models on the host.
docker compose up -d rainml
docker compose logs -f rainml
Output
- Audit report: models/rain_data_audit.json
- Training report: models/rain_model_report.json
- Regime slices in training report: sliced_performance_test
- Model card: models/model_card_<model_version>.md
- Model artifact: models/rain_model.pkl
- Dataset snapshot: models/datasets/rain_dataset_<model_version>_<feature_set>.csv
- Prediction rows: predictions_rain_1h (probability + threshold decision + realized outcome fields once available)
7) Recommend deploy candidate from saved reports
python scripts/recommend_rain_model.py \
--reports-glob "models/rain_model_report*.json" \
--require-walk-forward \
--top-k 5 \
--json-out "models/rain_model_recommendation.json"
Model Features (v1 baseline)
- pressure_trend_1h
- humidity
- temperature_c
- wind_avg_m_s
- wind_max_m_s
Model Features (extended set)
- baseline features, plus:
  - wind_dir_sin, wind_dir_cos
  - temp_lag_5m, temp_roll_1h_mean, temp_roll_1h_std
  - humidity_lag_5m, humidity_roll_1h_mean, humidity_roll_1h_std
  - wind_avg_lag_5m, wind_avg_roll_1h_mean, wind_gust_roll_1h_max
  - pressure_lag_5m, pressure_roll_1h_mean, pressure_roll_1h_std
  - rain_last_1h_mm
  - fc_temp_c, fc_rh, fc_pressure_msl_hpa, fc_wind_m_s, fc_wind_gust_m_s, fc_precip_mm, fc_precip_prob, fc_cloud_cover
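Two of the feature groups above lend themselves to a short illustration: the sine/cosine wind-direction encoding and the 1-hour rolling statistics. The sketch assumes 5-minute buckets (so 1 hour is 12 buckets), a trailing window that includes the current bucket, and population standard deviation. Those choices and the function names are assumptions, not a description of the repository's code.

```python
import math
from statistics import mean, pstdev


def encode_wind_direction(degrees):
    """Map a compass bearing to (sin, cos) so that 359 and 1 degrees are neighbours."""
    rad = math.radians(degrees % 360.0)
    return math.sin(rad), math.cos(rad)


def rolling_stats(values, window=12):
    """Trailing mean and population std per position over up to `window` buckets.

    Early positions use the shorter history that is available (an assumption;
    a real pipeline might emit missing values until the window is full).
    """
    out = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        out.append((mean(recent), pstdev(recent)))
    return out


east_sin, east_cos = encode_wind_direction(90.0)
stats = rolling_stats([1.0, 2.0, 3.0], window=2)
```

The encoding avoids the artificial jump between 359 and 0 degrees that a raw bearing would present to a linear model such as logistic regression.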
Model Features (extended_calendar extras)
- hour_sin, hour_cos
- dow_sin, dow_cos
- month_sin, month_cos
- is_weekend
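These calendar extras are cyclical encodings of UTC time components. A minimal sketch of how such features could be computed follows; the periods (24 hours, 7 days, 12 months), the use of fractional hours, and the Saturday/Sunday weekend definition are assumptions made for illustration.

```python
import math
from datetime import datetime, timezone


def calendar_features(ts):
    """Return cyclical calendar features for a timezone-aware timestamp, in UTC."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)

    def cyclical(value, period):
        angle = 2.0 * math.pi * value / period
        return math.sin(angle), math.cos(angle)

    hour_sin, hour_cos = cyclical(ts.hour + ts.minute / 60.0, 24)
    dow_sin, dow_cos = cyclical(ts.weekday(), 7)      # Monday = 0
    month_sin, month_cos = cyclical(ts.month - 1, 12)  # January = 0
    return {
        "hour_sin": hour_sin, "hour_cos": hour_cos,
        "dow_sin": dow_sin, "dow_cos": dow_cos,
        "month_sin": month_sin, "month_cos": month_cos,
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }


features = calendar_features(datetime(2026, 2, 1, 6, 0, tzinfo=timezone.utc))
```

2026-02-01 is a Sunday, so `is_weekend` is 1 in this example, and 06:00 UTC sits a quarter of the way around the daily cycle.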
Notes
- Data is resampled into 5-minute buckets.
- Label is derived from incremental rain from WS90 cumulative rain_mm.
- Timestamps are handled as UTC in training/inference workflow.
- See Data issues and mitigation rules and runbook/monitoring guidance.
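The label derivation described in the notes can be sketched as follows: difference the cumulative rain_mm counter into per-bucket increments, sum the increments over the next hour (12 buckets of 5 minutes), and compare against 0.2 mm. Clamping negative differences to zero (to absorb counter resets) and leaving the trailing rows unlabeled are assumptions for this illustration, as is the function name.

```python
def rain_labels(cumulative_mm, horizon_buckets=12, threshold_mm=0.2):
    """Derive binary next-hour rain labels from a cumulative rain counter.

    cumulative_mm holds one reading per 5-minute bucket, oldest first.
    Returns 1/0 per bucket, or None where the future horizon is incomplete.
    """
    # Per-bucket rain; a negative step is treated as a counter reset (assumption).
    increments = [0.0]
    for prev, cur in zip(cumulative_mm, cumulative_mm[1:]):
        increments.append(max(cur - prev, 0.0))

    labels = []
    for i in range(len(increments)):
        future = increments[i + 1: i + 1 + horizon_buckets]
        if len(future) < horizon_buckets:
            labels.append(None)  # not enough future data to know the outcome
        else:
            labels.append(int(sum(future) >= threshold_mm))
    return labels


# Short horizon of 2 buckets to keep the example readable.
labels = rain_labels([0.0, 0.0, 0.0, 0.3, 0.3, 0.3], horizon_buckets=2)
```

Here the 0.3 mm step at the fourth reading makes the second and third buckets positive, and the last two buckets stay unlabeled because their future window is incomplete.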