feat: add rain data audit and prediction scripts

2026-03-05 08:01:54 +11:00
parent 5bfa910495
commit 96e72d7c43
13 changed files with 1004 additions and 182 deletions
@@ -1,21 +1,20 @@
 # Rain Prediction (Next 1 Hour)

-This project now includes a starter training script for a **binary rain prediction**:
+This project includes a baseline workflow for **binary rain prediction**:

 > **Will we see >= 0.2 mm of rain in the next hour?**

-It uses local observations (WS90 + barometric pressure) and trains a lightweight
-logistic regression model. This is a baseline you can iterate on as you collect
-more data.
+It uses local observations (WS90 + barometer), trains a logistic regression
+baseline, and writes model-driven predictions back to TimescaleDB.

-## What the script does
- Pulls data from TimescaleDB.
- Resamples observations to 5-minute buckets.
- Derives **pressure trend (1h)** from barometer data.
- Computes **future 1-hour rainfall** from the cumulative `rain_mm` counter.
- Trains a model and prints evaluation metrics.
-
-The output is a saved model file (optional) you can use later for inference.
+## P0 Decisions (Locked)
+- Target: `rain_next_1h_mm >= 0.2`.
+- Primary use case: a low-noise rain heads-up signal for the dashboard and a candidate alert trigger.
+- Frozen v1 training window (UTC): `2026-02-01T00:00:00Z` to `2026-03-03T23:55:00Z`.
+- Threshold policy: choose the threshold on the validation set by maximizing recall
+  subject to `precision >= 0.70`; fall back to max-F1 if no threshold meets the constraint.
+- Acceptance gate (test split): report and track `precision`, `recall`, `ROC-AUC`,
+  `PR-AUC`, `Brier score`, and confusion matrix.
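
Note on the threshold policy in the P0 decisions: the rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `scripts/train_rain_model.py`. The function name `choose_threshold`, the policy labels it returns, and the tie-breaking order (higher recall, then higher precision, then higher threshold) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def choose_threshold(
    y_true: Sequence[int],
    proba: Sequence[float],
    min_precision: float = 0.70,
) -> Tuple[float, str]:
    """Pick a decision threshold from validation labels and probabilities.

    Hypothetical helper: maximizes recall among thresholds whose precision
    is at least `min_precision`, and falls back to the max-F1 threshold
    when no candidate satisfies the constraint.
    """
    constrained: List[Tuple[float, float, float]] = []  # (recall, precision, threshold)
    by_f1: List[Tuple[float, float]] = []  # (f1, threshold)

    # Every distinct predicted probability is a candidate threshold.
    for t in sorted(set(proba)):
        tp = sum(1 for y, p in zip(y_true, proba) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, proba) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, proba) if p < t and y == 1)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        if precision >= min_precision and recall > 0:
            constrained.append((recall, precision, t))
        by_f1.append((f1, t))

    if constrained:
        # Assumed tie-break: recall first, then precision, then threshold.
        return max(constrained)[2], "recall_at_min_precision"
    return max(by_f1)[1], "max_f1"


if __name__ == "__main__":
    labels = [0, 0, 1, 0, 1, 1]
    scores = [0.10, 0.20, 0.35, 0.40, 0.80, 0.90]
    print(choose_threshold(labels, scores, min_precision=0.70))
```

Scanning only the distinct predicted probabilities keeps the search exhaustive, since precision and recall can only change at those values. On a month of 5-minute buckets the quadratic loop is still cheap; a vectorized version would be the natural next step for larger windows.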

 ## Requirements
 Python 3.10+ and:
@@ -36,67 +35,76 @@ source .venv/bin/activate
 pip install -r scripts/requirements.txt
 ```

+## Scripts
+- `scripts/audit_rain_data.py`: audits data quality, label quality, and class balance.
+- `scripts/train_rain_model.py`: trains on a strict time-based split and writes a metrics report.
+- `scripts/predict_rain_model.py`: runs inference with the saved model artifact and upserts into
+  `predictions_rain_1h`.
+
 ## Usage
+### 1) Apply schema update (existing DBs)
+`001_schema.sql` now defines the `predictions_rain_1h` table.

 ```sh
-python scripts/train_rain_model.py \
-  --db-url "postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable" \
-  --site "home" \
-  --start "2026-01-01" \
-  --end "2026-02-01" \
-  --out "models/rain_model.pkl"
+docker compose exec -T timescaledb \
+  psql -U postgres -d micrometeo \
+  -f /docker-entrypoint-initdb.d/001_schema.sql
 ```

-You can also provide the connection string via `DATABASE_URL`:
-
+### 2) Run data audit
 ```sh
 export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
-python scripts/train_rain_model.py --site home
+
+python scripts/audit_rain_data.py \
+  --site home \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --out "models/rain_data_audit.json"
+```
+
+### 3) Train baseline model
+```sh
+python scripts/train_rain_model.py \
+  --site "home" \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --train-ratio 0.7 \
+  --val-ratio 0.15 \
+  --min-precision 0.70 \
+  --model-version "rain-logreg-v1" \
+  --out "models/rain_model.pkl" \
+  --report-out "models/rain_model_report.json"
+```
+
+### 4) Run inference and store prediction
+```sh
+python scripts/predict_rain_model.py \
+  --site home \
+  --model-path "models/rain_model.pkl" \
+  --model-name "rain_next_1h"
+```
+
+### 5) One-command P0 workflow
+```sh
+export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
+bash scripts/run_p0_rain_workflow.sh
 ```

 ## Output
-The script prints metrics including:
- accuracy
- precision / recall
- ROC AUC
- confusion matrix
+- Audit report: `models/rain_data_audit.json`
+- Training report: `models/rain_model_report.json`
+- Model artifact: `models/rain_model.pkl`
+- Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized
+  outcome fields once available)

-If `joblib` is installed, it saves a model bundle:
+## Model Features (v1)
+- `pressure_trend_1h`
+- `humidity`
+- `temperature_c`
+- `wind_avg_m_s`
+- `wind_max_m_s`

-```
-models/rain_model.pkl
-```
-
-This bundle contains:
- The trained model pipeline
- The feature list used during training
-
-## Data needs / when to run
-For a reliable model, you will want:
- **At least 2-4 weeks** of observations
- A mix of rainy and non-rainy periods
-
-Training with only a few days will produce an unstable model.
-
-## Features used
-The baseline model uses:
- `pressure_trend_1h` (hPa)
- `humidity` (%)
- `temperature_c` (C)
- `wind_avg_m_s` (m/s)
- `wind_max_m_s` (m/s)
-
-These are easy to expand once you have more data (e.g. add forecast features).
-
-## Notes / assumptions
- Rain detection is based on **incremental rain** derived from the WS90
-  `rain_mm` cumulative counter.
- Pressure comes from `observations_baro`.
- All timestamps are treated as UTC.
-
-## Next improvements
-Ideas once more data is available:
- Add forecast precipitation and cloud cover as features
- Try gradient boosted trees (e.g. XGBoost / LightGBM)
- Train per-season models
- Calibrate probabilities (Platt scaling / isotonic regression)
+## Notes
+- Data is resampled into 5-minute buckets.
+- The label is derived from incremental rain computed from the WS90 cumulative `rain_mm` counter.
+- Timestamps are handled as UTC in the training and inference workflows.
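
Note on the label derivation described in the Notes: the sketch below shows one way to turn a cumulative `rain_mm` series, already resampled to 5-minute buckets, into the `rain_next_1h_mm >= 0.2` label. It is an illustration under stated assumptions, not the repository's implementation. The function names are hypothetical, negative deltas are assumed to be counter resets and clipped to zero, and the sum is rounded to 3 decimals (an assumed gauge resolution) to avoid floating-point misses at exactly 0.2 mm.

```python
from typing import List, Optional, Sequence

BUCKETS_PER_HOUR = 12      # 5-minute buckets, as stated in the README notes
RAIN_THRESHOLD_MM = 0.2    # target: rain_next_1h_mm >= 0.2


def incremental_rain(cumulative_mm: Sequence[float]) -> List[float]:
    """Per-bucket rainfall from a cumulative counter.

    increments[i] is the rain attributed to bucket i. The first bucket has
    no predecessor, so it gets 0.0. Negative deltas are treated as counter
    resets (an assumption) and clipped to 0.0.
    """
    if not cumulative_mm:
        return []
    increments = [0.0]
    for prev, cur in zip(cumulative_mm, cumulative_mm[1:]):
        delta = cur - prev
        increments.append(delta if delta > 0 else 0.0)
    return increments


def label_next_hour(cumulative_mm: Sequence[float]) -> List[Optional[bool]]:
    """Binary label per bucket: did >= 0.2 mm fall in the following hour?

    Buckets without a full hour of future data get None so they can be
    dropped before training instead of being mislabeled as dry.
    """
    inc = incremental_rain(cumulative_mm)
    labels: List[Optional[bool]] = []
    for i in range(len(inc)):
        future = inc[i + 1 : i + 1 + BUCKETS_PER_HOUR]
        if len(future) < BUCKETS_PER_HOUR:
            labels.append(None)
        else:
            # Rounding guards against float error, e.g. 10.2 - 10.0 < 0.2.
            labels.append(round(sum(future), 3) >= RAIN_THRESHOLD_MM)
    return labels


if __name__ == "__main__":
    # 13 dry buckets followed by a 0.5 mm jump in the counter.
    series = [5.0] * 13 + [5.5]
    print(label_next_hour(series))
```

Returning `None` for the trailing hour matters most at inference time: the newest buckets have features but no realized outcome yet, which is consistent with the README's note that outcome fields are filled in once available.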