# Rain Prediction (Next 1 Hour)

This project includes a baseline workflow for **binary rain prediction**:

> **Will we see >= 0.2 mm of rain in the next hour?**

It uses local observations (WS90 + barometer), trains a logistic regression baseline, and writes model-driven predictions back to TimescaleDB.

## P0 Decisions (Locked)

- Target: `rain_next_1h_mm >= 0.2`.
- Primary use-case: low-noise rain heads-up signal for dashboard + alert candidate.
- Frozen v1 training window (UTC): `2026-02-01T00:00:00Z` to `2026-03-03T23:55:00Z`.
- Threshold policy: choose threshold on validation set by maximizing recall under `precision >= 0.70`; fallback to max-F1 if the precision constraint is unreachable.
- Acceptance gate (test split): report and track `precision`, `recall`, `ROC-AUC`, `PR-AUC`, `Brier score`, and confusion matrix.

## Requirements

Python 3.10+ and:

```
pandas
numpy
scikit-learn
psycopg2-binary
joblib
```

Install with:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
```

## Scripts

- `scripts/audit_rain_data.py`: data quality + label quality + class balance audit.
- `scripts/train_rain_model.py`: strict time-based split training and metrics report.
- `scripts/predict_rain_model.py`: inference using saved model artifact; upserts into `predictions_rain_1h`.
- `scripts/run_rain_ml_worker.py`: long-running worker for periodic training + prediction.

Feature-set options:

- `baseline`: original 5 local observation features.
- `extended`: adds wind-direction encoding, lag/rolling stats, recent rain accumulation, and aligned forecast features from `forecast_openmeteo_hourly`.

Model-family options (`train_rain_model.py`):

- `logreg`: logistic regression baseline.
- `hist_gb`: histogram gradient boosting (tree-based baseline).
- `auto`: trains both `logreg` and `hist_gb`, picks the best validation model by PR-AUC, then ROC-AUC, then F1.
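The threshold policy listed under P0 Decisions (maximize recall subject to `precision >= 0.70` on the validation set, falling back to max-F1) can be illustrated with a minimal, standard-library-only sketch. The function name `select_threshold`, the tie-breaking rule, and the returned policy labels are assumptions made for illustration; the actual logic in `scripts/train_rain_model.py` may differ, for example by using scikit-learn's precision-recall utilities.

```python
from typing import List, Sequence, Tuple


def select_threshold(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    min_precision: float = 0.70,
) -> Tuple[float, str]:
    """Return (threshold, policy_used) chosen on validation data.

    Hypothetical helper: every distinct predicted probability is tried as a
    candidate threshold, and a sample is predicted positive when p >= threshold.
    """
    if len(y_true) != len(y_prob) or not y_prob:
        raise ValueError("y_true and y_prob must be non-empty and equal length")

    positives = sum(y_true)
    candidates: List[Tuple[float, float, float, float]] = []
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        # tp + fp >= 1 because t is itself one of the predicted scores.
        precision = tp / (tp + fp)
        recall = tp / positives if positives else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        candidates.append((t, precision, recall, f1))

    # Primary policy: highest recall among thresholds meeting the precision floor.
    # Assumption: ties on recall are broken by higher precision.
    feasible = [c for c in candidates if c[1] >= min_precision]
    if feasible:
        best = max(feasible, key=lambda c: (c[2], c[1]))
        return best[0], "recall_at_min_precision"

    # Fallback: the precision floor is unreachable, so use the max-F1 threshold.
    best = max(candidates, key=lambda c: c[3])
    return best[0], "max_f1"
```

Calling `select_threshold(val_labels, val_probabilities)` returns the chosen threshold together with a label for the policy that produced it; recording that label alongside the threshold shows whether the fallback was triggered.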
## Usage

### 1) Apply schema update (existing DBs)

`001_schema.sql` now includes `predictions_rain_1h`.

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/001_schema.sql
```

### 2) Run data audit

```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"

python scripts/audit_rain_data.py \
  --site home \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "baseline" \
  --out "models/rain_data_audit.json"
```

### 3) Train baseline model

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --train-ratio 0.7 \
  --val-ratio 0.15 \
  --min-precision 0.70 \
  --feature-set "baseline" \
  --model-family "logreg" \
  --model-version "rain-logreg-v1" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 3b) Train expanded (P1) feature-set model

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "logreg" \
  --forecast-model "ecmwf" \
  --model-version "rain-logreg-v1-extended" \
  --out "models/rain_model_extended.pkl" \
  --report-out "models/rain_model_report_extended.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 3c) Train tree-based baseline (P1)

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "hist_gb" \
  --forecast-model "ecmwf" \
  --model-version "rain-hgb-v1-extended" \
  --out "models/rain_model_hgb.pkl" \
  --report-out "models/rain_model_report_hgb.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 3d) Auto-compare logistic vs tree baseline

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model_auto.pkl" \
  --report-out "models/rain_model_report_auto.json"
```

### 4) Run inference and store prediction

```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"
```

### 5) One-command P0 workflow

```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
bash scripts/run_p0_rain_workflow.sh
```

### 6) Continuous training + prediction via Docker Compose

The `rainml` service in `docker-compose.yml` now runs:

- periodic retraining (default every 24 hours)
- periodic prediction writes (default every 10 minutes)

Artifacts are persisted to `./models` on the host.

```sh
docker compose up -d rainml
docker compose logs -f rainml
```

## Output

- Audit report: `models/rain_data_audit.json`
- Training report: `models/rain_model_report.json`
- Model artifact: `models/rain_model.pkl`
- Dataset snapshot: `models/datasets/rain_dataset__.csv`
- Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized outcome fields once available)

## Model Features (v1 baseline)

- `pressure_trend_1h`
- `humidity`
- `temperature_c`
- `wind_avg_m_s`
- `wind_max_m_s`

## Model Features (extended set)

- baseline features, plus:
- `wind_dir_sin`, `wind_dir_cos`
- `temp_lag_5m`, `temp_roll_1h_mean`, `temp_roll_1h_std`
- `humidity_lag_5m`, `humidity_roll_1h_mean`, `humidity_roll_1h_std`
- `wind_avg_lag_5m`, `wind_avg_roll_1h_mean`, `wind_gust_roll_1h_max`
- `pressure_lag_5m`, `pressure_roll_1h_mean`, `pressure_roll_1h_std`
- `rain_last_1h_mm`
- `fc_temp_c`, `fc_rh`, `fc_pressure_msl_hpa`, `fc_wind_m_s`, `fc_wind_gust_m_s`, `fc_precip_mm`, `fc_precip_prob`, `fc_cloud_cover`

## Notes

- Data is resampled into 5-minute buckets.
- Label is derived from incremental rain from WS90 cumulative `rain_mm`.
- Timestamps are handled as UTC in training/inference workflow.
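
The label derivation described in these notes can be sketched as follows, assuming the cumulative `rain_mm` readings have already been reduced to one value per 5-minute bucket. The function name `rain_labels`, the treatment of negative differences as counter resets, the rounding step, and the choice to label a bucket from the 12 buckets that follow it are all illustrative assumptions, not a description of the project's actual implementation.

```python
from typing import List, Optional

BUCKETS_PER_HOUR = 12  # one hour of 5-minute buckets


def rain_labels(
    cumulative_mm: List[float],
    threshold_mm: float = 0.2,
) -> List[Optional[bool]]:
    """Label each bucket: does rain in the following hour reach threshold_mm?

    Hypothetical helper. Returns None where fewer than 12 future buckets
    exist, because the outcome is not yet known for those rows.
    """
    # Convert the cumulative counter into per-bucket increments.
    increments = [0.0]
    for prev, cur in zip(cumulative_mm, cumulative_mm[1:]):
        # Assumption: a negative difference means the counter was reset,
        # so it contributes no rain rather than a negative amount.
        increments.append(max(cur - prev, 0.0))

    labels: List[Optional[bool]] = []
    for i in range(len(increments)):
        window = increments[i + 1 : i + 1 + BUCKETS_PER_HOUR]
        if len(window) < BUCKETS_PER_HOUR:
            labels.append(None)
            continue
        # Round before comparing so float noise (e.g. 0.1999999) does not
        # flip a value that is nominally at the threshold.
        labels.append(round(sum(window), 6) >= threshold_mm)
    return labels
```

Rows labeled `None` would typically be excluded from training, since their next-hour outcome lies past the end of the available data.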