feat: add rain data audit and prediction scripts
@@ -1,21 +1,20 @@
 # Rain Prediction (Next 1 Hour)

-This project now includes a starter training script for a **binary rain prediction**:
+This project includes a baseline workflow for **binary rain prediction**:

 > **Will we see >= 0.2 mm of rain in the next hour?**

-It uses local observations (WS90 + barometric pressure) and trains a lightweight
-logistic regression model. This is a baseline you can iterate on as you collect
-more data.
+It uses local observations (WS90 + barometer), trains a logistic regression
+baseline, and writes model-driven predictions back to TimescaleDB.

-## What the script does
-- Pulls data from TimescaleDB.
-- Resamples observations to 5-minute buckets.
-- Derives **pressure trend (1h)** from barometer data.
-- Computes **future 1-hour rainfall** from the cumulative `rain_mm` counter.
-- Trains a model and prints evaluation metrics.
-
-The output is a saved model file (optional) you can use later for inference.
+## P0 Decisions (Locked)
+- Target: `rain_next_1h_mm >= 0.2`.
+- Primary use-case: low-noise rain heads-up signal for dashboard + alert candidate.
+- Frozen v1 training window (UTC): `2026-02-01T00:00:00Z` to `2026-03-03T23:55:00Z`.
+- Threshold policy: choose threshold on validation set by maximizing recall under
+  `precision >= 0.70`; fallback to max-F1 if the precision constraint is unreachable.
+- Acceptance gate (test split): report and track `precision`, `recall`, `ROC-AUC`,
+  `PR-AUC`, `Brier score`, and confusion matrix.
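The threshold policy above can be sketched in a few lines of plain Python. This is an illustration only, not the code in `scripts/train_rain_model.py`: the function names `precision_recall_f1` and `choose_threshold` are hypothetical, and using every distinct predicted probability as a candidate threshold is an assumption.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Precision, recall, and F1 when predicting rain for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def choose_threshold(y_true, y_prob, min_precision=0.70):
    """Max recall subject to precision >= min_precision, else max F1.

    Assumption: candidate thresholds are the distinct validation
    probabilities, and y_prob is non-empty.
    """
    scored = [(t, *precision_recall_f1(y_true, y_prob, t))
              for t in sorted(set(y_prob))]
    feasible = [s for s in scored if s[1] >= min_precision]
    if feasible:
        # Highest recall wins; ties go to the higher precision.
        best = max(feasible, key=lambda s: (s[2], s[1]))
        return best[0], "recall_at_min_precision"
    # Precision constraint unreachable: fall back to the max-F1 threshold.
    best = max(scored, key=lambda s: s[3])
    return best[0], "max_f1_fallback"
```

Returning the selection mode alongside the threshold makes it easy to record in a report whether the precision constraint was actually met.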

 ## Requirements
 Python 3.10+ and:
@@ -36,67 +35,76 @@ source .venv/bin/activate
 pip install -r scripts/requirements.txt
 ```

+## Scripts
+- `scripts/audit_rain_data.py`: data quality + label quality + class balance audit.
+- `scripts/train_rain_model.py`: strict time-based split training and metrics report.
+- `scripts/predict_rain_model.py`: inference using saved model artifact; upserts into
+  `predictions_rain_1h`.
+
 ## Usage
+### 1) Apply schema update (existing DBs)
+`001_schema.sql` now includes `predictions_rain_1h`.

 ```sh
-python scripts/train_rain_model.py \
-  --db-url "postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable" \
-  --site "home" \
-  --start "2026-01-01" \
-  --end "2026-02-01" \
-  --out "models/rain_model.pkl"
+docker compose exec -T timescaledb \
+  psql -U postgres -d micrometeo \
+  -f /docker-entrypoint-initdb.d/001_schema.sql
 ```

-You can also provide the connection string via `DATABASE_URL`:
-
+### 2) Run data audit
 ```sh
 export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
-python scripts/train_rain_model.py --site home
+
+python scripts/audit_rain_data.py \
+  --site home \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --out "models/rain_data_audit.json"
 ```

+### 3) Train baseline model
+```sh
+python scripts/train_rain_model.py \
+  --site "home" \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --train-ratio 0.7 \
+  --val-ratio 0.15 \
+  --min-precision 0.70 \
+  --model-version "rain-logreg-v1" \
+  --out "models/rain_model.pkl" \
+  --report-out "models/rain_model_report.json"
+```
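The `--train-ratio 0.7` and `--val-ratio 0.15` flags imply a chronological train/validation/test split with the remaining 15% held out for testing. The sketch below shows one way such a split could work. It is not taken from `scripts/train_rain_model.py`: the function name `time_based_split`, the `"ts"` key, and the rounding of split points are all assumptions.

```python
def time_based_split(rows, train_ratio=0.7, val_ratio=0.15):
    """Split rows into train/val/test by time order, with no shuffling.

    Each row is assumed to be a dict with a sortable "ts" key
    (hypothetical layout). The test set is whatever remains after
    the train and validation shares.
    """
    if not (0 < train_ratio < 1 and 0 <= val_ratio < 1
            and train_ratio + val_ratio < 1):
        raise ValueError("ratios must leave a non-empty test share")
    ordered = sorted(rows, key=lambda r: r["ts"])
    n = len(ordered)
    # round() avoids off-by-one cuts caused by float products like 0.7 * n.
    train_end = round(n * train_ratio)
    val_end = round(n * (train_ratio + val_ratio))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Because every validation and test row is strictly later than the training rows, the threshold and the reported metrics are not computed on data from before the training period ends.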
+
+### 4) Run inference and store prediction
+```sh
+python scripts/predict_rain_model.py \
+  --site home \
+  --model-path "models/rain_model.pkl" \
+  --model-name "rain_next_1h"
+```
+
+### 5) One-command P0 workflow
+```sh
+export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
+bash scripts/run_p0_rain_workflow.sh
+```
+
 ## Output
-The script prints metrics including:
-- accuracy
-- precision / recall
-- ROC AUC
-- confusion matrix
+- Audit report: `models/rain_data_audit.json`
+- Training report: `models/rain_model_report.json`
+- Model artifact: `models/rain_model.pkl`
+- Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized
+  outcome fields once available)

-If `joblib` is installed, it saves a model bundle:
+## Model Features (v1)
+- `pressure_trend_1h`
+- `humidity`
+- `temperature_c`
+- `wind_avg_m_s`
+- `wind_max_m_s`
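Of the features listed, `pressure_trend_1h` is the only derived one. The sketch below shows one plausible definition on 5-minute buckets: the current pressure minus the pressure 12 buckets earlier. The exact formula used by the scripts is not shown in this README, so the definition, the function name, and the handling of missing buckets are assumptions.

```python
BUCKETS_PER_HOUR = 12  # 60 minutes / 5-minute buckets


def pressure_trend_1h(pressure_hpa):
    """Pressure change (hPa) versus the bucket one hour earlier.

    Assumed definition: p[t] - p[t - 1h]. Entries are None when either
    reading is missing or less than an hour of history exists.
    """
    trend = []
    for i, current in enumerate(pressure_hpa):
        j = i - BUCKETS_PER_HOUR
        past = pressure_hpa[j] if j >= 0 else None
        if current is None or past is None:
            trend.append(None)
        else:
            trend.append(current - past)
    return trend
```

A negative value means pressure has fallen over the past hour.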

-```
-models/rain_model.pkl
-```
-
-This bundle contains:
-- The trained model pipeline
-- The feature list used during training
-
-## Data needs / when to run
-For a reliable model, you will want:
-- **At least 2-4 weeks** of observations
-- A mix of rainy and non-rainy periods
-
-Training with only a few days will produce an unstable model.
-
-## Features used
-The baseline model uses:
-- `pressure_trend_1h` (hPa)
-- `humidity` (%)
-- `temperature_c` (C)
-- `wind_avg_m_s` (m/s)
-- `wind_max_m_s` (m/s)
-
-These are easy to expand once you have more data (e.g. add forecast features).
-
-## Notes / assumptions
-- Rain detection is based on **incremental rain** derived from the WS90
-  `rain_mm` cumulative counter.
-- Pressure comes from `observations_baro`.
-- All timestamps are treated as UTC.
-
-## Next improvements
-Ideas once more data is available:
-- Add forecast precipitation and cloud cover as features
-- Try gradient boosted trees (e.g. XGBoost / LightGBM)
-- Train per-season models
-- Calibrate probabilities (Platt scaling / isotonic regression)
+## Notes
+- Data is resampled into 5-minute buckets.
+- Label is derived from incremental rain from WS90 cumulative `rain_mm`.
+- Timestamps are handled as UTC in training/inference workflow.
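The label derivation described in the notes can be sketched as follows, assuming the cumulative counter has already been resampled to one value per 5-minute bucket. The function names are hypothetical. Treating a drop in the counter as a reset worth 0 mm, and leaving the label undefined when the next hour is not fully observed, are assumptions rather than documented behavior of the scripts.

```python
BUCKETS_PER_HOUR = 12      # 60 minutes / 5-minute buckets
RAIN_THRESHOLD_MM = 0.2    # target: rain_next_1h_mm >= 0.2


def incremental_rain(cumulative_mm):
    """Per-bucket rain from a cumulative counter.

    Assumption: a decrease means the counter was reset, so it
    contributes 0 mm instead of a negative amount.
    """
    inc = [0.0]
    for prev, cur in zip(cumulative_mm, cumulative_mm[1:]):
        inc.append(max(cur - prev, 0.0))
    return inc


def rain_next_1h_labels(cumulative_mm):
    """True/False per bucket, or None when the next hour is incomplete."""
    inc = incremental_rain(cumulative_mm)
    labels = []
    for i in range(len(inc)):
        # Rain that falls strictly after bucket i, over the next 12 buckets.
        window = inc[i + 1 : i + 1 + BUCKETS_PER_HOUR]
        if len(window) < BUCKETS_PER_HOUR:
            labels.append(None)
        else:
            labels.append(sum(window) >= RAIN_THRESHOLD_MM)
    return labels
```

The window starts at the bucket after the current one, so the label never includes rain that has already been observed at prediction time.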