# Rain Prediction (Next 1 Hour)

This project includes a baseline workflow for **binary rain prediction**:

> **Will we see >= 0.2 mm of rain in the next hour?**

It uses local observations (WS90 + barometer), trains a logistic regression
baseline, and writes model-driven predictions back to TimescaleDB.

## P0 Decisions (Locked)
- Target: `rain_next_1h_mm >= 0.2`.
- Primary use case: a low-noise rain heads-up signal for the dashboard, and a candidate signal for alerting.
- Frozen v1 training window (UTC): `2026-02-01T00:00:00Z` to `2026-03-03T23:55:00Z`.
- Threshold policy: choose threshold on validation set by maximizing recall under
  `precision >= 0.70`; fallback to max-F1 if the precision constraint is unreachable.
- Acceptance gate (test split): report and track `precision`, `recall`, `ROC-AUC`,
  `PR-AUC`, `Brier score`, and confusion matrix.
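
The threshold policy above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code in `train_rain_model.py`; the function names, the tie-breaking rule, and the returned policy strings are assumptions made for the example.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when predicting rain for y_prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def choose_threshold(
    y_true: Sequence[int], y_prob: Sequence[float], min_precision: float = 0.70
) -> Tuple[float, str]:
    """Pick a decision threshold on the validation split.

    Hypothetical helper: maximize recall among thresholds that reach
    min_precision, otherwise fall back to the max-F1 threshold.
    """
    candidates = sorted(set(y_prob))
    if not candidates:
        raise ValueError("validation set is empty")
    # Each entry is (threshold, precision, recall, f1).
    scored = [(t, *precision_recall_f1(y_true, y_prob, t)) for t in candidates]
    feasible = [s for s in scored if s[1] >= min_precision]
    if feasible:
        # Highest recall wins; ties go to the higher precision (assumption).
        best = max(feasible, key=lambda s: (s[2], s[1]))
        return best[0], "max_recall_at_min_precision"
    best = max(scored, key=lambda s: s[3])
    return best[0], "max_f1_fallback"
```

The threshold is chosen on the validation split only, so the test split stays untouched for the acceptance gate.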

## Requirements
Python 3.10+ and:

```
pandas
numpy
scikit-learn
psycopg2-binary
joblib
```

Install with:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
```

## Scripts
- `scripts/audit_rain_data.py`: data quality + label quality + class balance audit.
- `scripts/train_rain_model.py`: strict time-based split training and metrics report, with optional
  validation-only hyperparameter tuning, calibration comparison, naive baseline comparison, and walk-forward folds.
- `scripts/predict_rain_model.py`: inference using saved model artifact; upserts into
  `predictions_rain_1h`.
- `scripts/run_rain_ml_worker.py`: long-running worker for periodic training + prediction.
- `scripts/check_rain_pipeline_health.py`: freshness/failure check for alerting.
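
As a rough illustration of what a strict time-based split means here, the sketch below cuts chronologically ordered rows into train, validation, and test segments without shuffling. The function name and the row layout (`(timestamp, value)` tuples) are assumptions for the example; the ratios mirror the `--train-ratio` and `--val-ratio` flags shown under Usage.

```python
from datetime import datetime, timedelta, timezone


def time_based_split(rows, train_ratio=0.7, val_ratio=0.15):
    """Split rows chronologically into (train, val, test).

    Hypothetical helper: rows are (timestamp, payload) tuples. Nothing is
    shuffled, so every validation row is later than every training row and
    every test row is later than every validation row.
    """
    if not (0 < train_ratio < 1 and 0 < val_ratio < 1 and train_ratio + val_ratio < 1):
        raise ValueError("ratios must leave room for a test split")
    ordered = sorted(rows, key=lambda r: r[0])
    n = len(ordered)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Example: 100 five-minute buckets starting at the frozen window start.
start = datetime(2026, 2, 1, tzinfo=timezone.utc)
rows = [(start + timedelta(minutes=5 * i), i) for i in range(100)]
train, val, test = time_based_split(rows)
```

Splitting by time rather than at random keeps future observations out of the training data, which matters because adjacent 5-minute buckets are strongly correlated.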

Feature-set options:
- `baseline`: original 5 local observation features.
- `extended`: adds wind-direction encoding, lag/rolling stats, recent rain accumulation,
  and aligned forecast features from `forecast_openmeteo_hourly`.
- `extended_calendar`: `extended` plus UTC calendar seasonality features
  (`hour_*`, `dow_*`, `month_*`, `is_weekend`).

Model-family options (`train_rain_model.py`):
- `logreg`: logistic regression baseline.
- `hist_gb`: histogram gradient boosting (tree-based baseline).
- `auto`: trains both `logreg` and `hist_gb`, picks the best validation model by
  PR-AUC, then ROC-AUC, then F1.
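
The `auto` tie-breaking order can be expressed as a tuple comparison. The snippet below is only a sketch of that ordering; the metric key names (`pr_auc`, `roc_auc`, `f1`) and the function name are assumptions, not the report schema.

```python
def pick_best_model(validation_metrics):
    """Return the model family with the best validation metrics.

    Hypothetical helper: validation_metrics maps a family name to a dict with
    'pr_auc', 'roc_auc', and 'f1'. Tuples compare element by element, so
    ROC-AUC only matters when PR-AUC ties, and F1 only when both tie.
    """
    if not validation_metrics:
        raise ValueError("no candidate models")
    return max(
        validation_metrics,
        key=lambda name: (
            validation_metrics[name]["pr_auc"],
            validation_metrics[name]["roc_auc"],
            validation_metrics[name]["f1"],
        ),
    )


# Illustrative numbers only, not measured results.
example = {
    "logreg": {"pr_auc": 0.61, "roc_auc": 0.85, "f1": 0.55},
    "hist_gb": {"pr_auc": 0.61, "roc_auc": 0.88, "f1": 0.50},
}
winner = pick_best_model(example)
```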

## Usage
### 1) Apply schema update (existing DBs)
`001_schema.sql` includes `predictions_rain_1h`.

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/001_schema.sql
```

Apply monitoring views:

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
```

### 2) Run data audit
```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"

python scripts/audit_rain_data.py \
  --site home \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "baseline" \
  --out "models/rain_data_audit.json"
```

### 3) Train baseline model
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --train-ratio 0.7 \
  --val-ratio 0.15 \
  --min-precision 0.70 \
  --feature-set "baseline" \
  --model-family "logreg" \
  --model-version "rain-logreg-v1" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```
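
The `--dataset-out` value contains `{model_version}` and `{feature_set}` placeholders. A plausible way to resolve them is plain `str.format` substitution, sketched below; this is an assumption about the mechanism, and `resolve_dataset_path` is a hypothetical name.

```python
from pathlib import Path


def resolve_dataset_path(template: str, model_version: str, feature_set: str) -> Path:
    """Fill the {model_version} and {feature_set} placeholders in a path template.

    Hypothetical helper; the real script may resolve the template differently.
    """
    return Path(template.format(model_version=model_version, feature_set=feature_set))


path = resolve_dataset_path(
    "models/datasets/rain_dataset_{model_version}_{feature_set}.csv",
    model_version="rain-logreg-v1",
    feature_set="baseline",
)
```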

### 3b) Train extended (P1) feature-set model
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "logreg" \
  --forecast-model "ecmwf" \
  --model-version "rain-logreg-v1-extended" \
  --out "models/rain_model_extended.pkl" \
  --report-out "models/rain_model_report_extended.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 3b.1) Train extended + calendar (P2) feature-set model
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended_calendar" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --model-version "rain-auto-v1-extended-calendar" \
  --out "models/rain_model_extended_calendar.pkl" \
  --report-out "models/rain_model_report_extended_calendar.json"
```

### 3c) Train tree-based baseline (P1)
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "hist_gb" \
  --forecast-model "ecmwf" \
  --model-version "rain-hgb-v1-extended" \
  --out "models/rain_model_hgb.pkl" \
  --report-out "models/rain_model_report_hgb.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 3d) Auto-compare logistic vs tree baseline
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model_auto.pkl" \
  --report-out "models/rain_model_report_auto.json"
```

### 3e) Full P1 evaluation (tuning + calibration + walk-forward)
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended-eval" \
  --out "models/rain_model_auto.pkl" \
  --report-out "models/rain_model_report_auto.json" \
  --model-card-out "models/model_card_{model_version}.md"
```

### 4) Run inference and store prediction
```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"
```

### 5) One-command P0 workflow
```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
bash scripts/run_p0_rain_workflow.sh
```

### 6) Continuous training + prediction via Docker Compose
The `rainml` service in `docker-compose.yml` runs:
- periodic retraining (default every 24 hours)
- periodic prediction writes (default every 10 minutes)
- configurable tuning/calibration behavior (`RAIN_TUNE_HYPERPARAMETERS`,
  `RAIN_MAX_HYPERPARAM_TRIALS`, `RAIN_CALIBRATION_METHODS`)
- graceful gap handling for temporary source outages (`RAIN_ALLOW_EMPTY_DATA=true`)
- automatic rollback path for last-known-good model (`RAIN_MODEL_BACKUP_PATH`)
- optional model-card output (`RAIN_MODEL_CARD_PATH`)

Artifacts are persisted to `./models` on the host.

```sh
docker compose up -d rainml
docker compose logs -f rainml
```

## Output
- Audit report: `models/rain_data_audit.json`
- Training report: `models/rain_model_report.json`
- Regime slices in training report: `sliced_performance_test`
- Model card: `models/model_card_<model_version>.md`
- Model artifact: `models/rain_model.pkl`
- Dataset snapshot: `models/datasets/rain_dataset_<model_version>_<feature_set>.csv`
- Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized
  outcome fields once available)

## Model Features (v1 baseline)
- `pressure_trend_1h`
- `humidity`
- `temperature_c`
- `wind_avg_m_s`
- `wind_max_m_s`

## Model Features (extended set)
- baseline features, plus:
- `wind_dir_sin`, `wind_dir_cos`
- `temp_lag_5m`, `temp_roll_1h_mean`, `temp_roll_1h_std`
- `humidity_lag_5m`, `humidity_roll_1h_mean`, `humidity_roll_1h_std`
- `wind_avg_lag_5m`, `wind_avg_roll_1h_mean`, `wind_gust_roll_1h_max`
- `pressure_lag_5m`, `pressure_roll_1h_mean`, `pressure_roll_1h_std`
- `rain_last_1h_mm`
- `fc_temp_c`, `fc_rh`, `fc_pressure_msl_hpa`, `fc_wind_m_s`, `fc_wind_gust_m_s`,
  `fc_precip_mm`, `fc_precip_prob`, `fc_cloud_cover`
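
To make a few of these names concrete, here is a hedged pandas sketch for the wind-direction encoding, one lag, one rolling window, and the recent rain accumulation. It assumes a 5-minute `DatetimeIndex` and two hypothetical input columns, `wind_dir_deg` and `rain_inc_mm`; the real feature builder may differ in window edges and missing-data handling.

```python
import numpy as np
import pandas as pd


def add_extended_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a subset of the extended features to a 5-minute bucketed frame.

    Assumed inputs: 'temperature_c', plus the hypothetical columns
    'wind_dir_deg' (degrees) and 'rain_inc_mm' (rain per bucket).
    """
    out = df.copy()
    # Sine/cosine encoding keeps 359 degrees and 1 degree close together.
    radians = np.deg2rad(out["wind_dir_deg"])
    out["wind_dir_sin"] = np.sin(radians)
    out["wind_dir_cos"] = np.cos(radians)
    # One bucket back equals a 5-minute lag.
    out["temp_lag_5m"] = out["temperature_c"].shift(1)
    # Time-based window covering the trailing hour, current bucket included.
    rolling_temp = out["temperature_c"].rolling("60min")
    out["temp_roll_1h_mean"] = rolling_temp.mean()
    out["temp_roll_1h_std"] = rolling_temp.std()
    out["rain_last_1h_mm"] = out["rain_inc_mm"].rolling("60min").sum()
    return out
```

All of these look backward in time only, so they can be computed at inference time from data already in the database.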

## Model Features (extended_calendar extras)
- `hour_sin`, `hour_cos`
- `dow_sin`, `dow_cos`
- `month_sin`, `month_cos`
- `is_weekend`
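
The sine/cosine pairs place each calendar value on a circle so that, for example, 23:00 and 00:00 end up close together. A minimal sketch follows, assuming fractional hours, Monday as day 0, and a Saturday/Sunday weekend; those details and the function name are assumptions rather than the script's confirmed behavior.

```python
import math
from datetime import datetime, timezone


def calendar_features(ts: datetime) -> dict:
    """Cyclical UTC calendar features for one timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    hour = ts.hour + ts.minute / 60.0  # fractional hour (assumption)
    dow = ts.weekday()                 # Monday=0 ... Sunday=6
    month = ts.month - 1               # 0..11 so January maps to angle 0
    return {
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
        "dow_sin": math.sin(2 * math.pi * dow / 7),
        "dow_cos": math.cos(2 * math.pi * dow / 7),
        "month_sin": math.sin(2 * math.pi * month / 12),
        "month_cos": math.cos(2 * math.pi * month / 12),
        "is_weekend": 1 if dow >= 5 else 0,
    }


features = calendar_features(datetime(2026, 2, 1, 6, 0, tzinfo=timezone.utc))
```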

## Notes
- Data is resampled into 5-minute buckets.
- The label is derived from increments of the WS90 cumulative `rain_mm` counter.
- Timestamps are handled as UTC throughout the training and inference workflows.
- See [Data issues and mitigation rules](./rain_data_issues.md) and
  [runbook/monitoring guidance](./rain_model_runbook.md).
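
As an illustration of the bucketing and label notes above, the sketch below turns a cumulative `rain_mm` series into the `rain_next_1h_mm >= 0.2` label. It is a simplified example under stated assumptions: negative increments are treated as counter resets and clipped to zero, gaps carry the last counter value forward, and sums are rounded to 3 decimals to avoid floating-point noise at the threshold. The rules in the data-issues document take precedence over this sketch.

```python
import pandas as pd


def build_rain_label(
    rain_mm: pd.Series, threshold_mm: float = 0.2, horizon_buckets: int = 12
) -> pd.DataFrame:
    """Derive per-bucket rain and the next-hour label from a cumulative counter.

    rain_mm must have a UTC DatetimeIndex. Twelve 5-minute buckets make up the
    one-hour horizon; the current bucket is excluded from the future sum.
    """
    bucketed = rain_mm.resample("5min").last().ffill()
    # Counter resets show up as negative diffs; clip them to zero (assumption).
    inc = bucketed.diff().clip(lower=0).fillna(0.0)
    # Reverse, roll, and reverse again to get a forward-looking sum, then
    # shift by one bucket so the window covers t+5m through t+60m.
    future = (
        inc[::-1]
        .rolling(horizon_buckets, min_periods=horizon_buckets)
        .sum()[::-1]
        .shift(-1)
        .round(3)
    )
    # Keep NaN where the full future hour is not yet observed.
    label = (future >= threshold_mm).astype(float).where(future.notna())
    return pd.DataFrame(
        {"rain_inc_mm": inc, "rain_next_1h_mm": future, "rain_next_1h": label}
    )
```

Rows at the end of the window have no complete future hour, so their label stays missing and they would be dropped before training.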