# Rain Prediction (Next 1 Hour)

This project includes a baseline workflow for **binary rain prediction**:

> **Will we see >= 0.2 mm of rain in the next hour?**

It uses local observations (WS90 + barometer), trains a logistic regression
baseline, and writes model-driven predictions back to TimescaleDB.

## P0 Decisions (Locked)
- Target: `rain_next_1h_mm >= 0.2`.
- Primary use-case: low-noise rain heads-up signal for dashboard + alert candidate.
- Frozen v1 training window (UTC): `2026-02-01T00:00:00Z` to `2026-03-03T23:55:00Z`.
- Threshold policy: choose threshold on validation set by maximizing recall under
  `precision >= 0.70`; fallback to max-F1 if the precision constraint is unreachable.
- Acceptance gate (test split): report and track `precision`, `recall`, `ROC-AUC`,
  `PR-AUC`, `Brier score`, and confusion matrix.

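
The threshold policy above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `scripts/train_rain_model.py`: the function names `precision_recall_f1` and `select_threshold`, the candidate grid (every distinct predicted probability), and the tie-breaking rule are all assumptions.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Return (precision, recall, f1) when predicting rain for y_prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def select_threshold(y_true, y_prob, min_precision=0.70):
    """Maximize recall subject to precision >= min_precision; else fall back to max F1."""
    # Assumption: every distinct predicted probability is a candidate threshold.
    candidates = sorted(set(y_prob))
    scored = [(t, *precision_recall_f1(y_true, y_prob, t)) for t in candidates]
    feasible = [s for s in scored if s[1] >= min_precision]
    if feasible:
        # Assumption: ties on recall are broken by the higher precision.
        best = max(feasible, key=lambda s: (s[2], s[1]))
        return best[0], "recall_at_min_precision"
    # The precision constraint is unreachable, so use the max-F1 threshold.
    best = max(scored, key=lambda s: s[3])
    return best[0], "max_f1_fallback"


# Toy validation data (made up for illustration only).
y_val = [0, 0, 1, 1, 0, 1]
p_val = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
threshold, policy = select_threshold(y_val, p_val, min_precision=0.70)
```

Returning the policy name alongside the threshold makes it easy to record in a report whether the precision constraint was actually met.
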
## Requirements
Python 3.10+ and:

```
pandas
numpy
scikit-learn
psycopg2-binary
joblib
```

Install with:

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
```

## Scripts
- `scripts/audit_rain_data.py`: data quality + label quality + class balance audit.
- `scripts/train_rain_model.py`: strict time-based split training and metrics report.
- `scripts/predict_rain_model.py`: inference using saved model artifact; upserts into
  `predictions_rain_1h`.
- `scripts/run_rain_ml_worker.py`: long-running worker for periodic training + prediction.

Feature-set options:
- `baseline`: original 5 local observation features.
- `extended`: adds wind-direction encoding, lag/rolling stats, recent rain accumulation,
  and aligned forecast features from `forecast_openmeteo_hourly`.

## Usage
### 1) Apply schema update (existing DBs)
`001_schema.sql` now includes `predictions_rain_1h`.

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/001_schema.sql
```

### 2) Run data audit
```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"

python scripts/audit_rain_data.py \
  --site home \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "baseline" \
  --out "models/rain_data_audit.json"
```

### 3) Train baseline model
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --train-ratio 0.7 \
  --val-ratio 0.15 \
  --min-precision 0.70 \
  --feature-set "baseline" \
  --model-version "rain-logreg-v1" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

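The training step uses a strict time-based split, controlled by `--train-ratio` and `--val-ratio`. A minimal sketch of such a split follows. It assumes that whatever share is left over (0.15 with the values above) becomes the test split and that each row carries a `ts` timestamp; `chronological_split` is a hypothetical helper, not the script's actual function.

```python
from datetime import datetime, timedelta, timezone


def chronological_split(rows, train_ratio=0.7, val_ratio=0.15):
    """Split rows into train/val/test by time order, never by shuffling."""
    if train_ratio <= 0 or val_ratio <= 0 or train_ratio + val_ratio >= 1:
        raise ValueError("ratios must be positive and leave room for a test split")
    # Sort by timestamp so every validation row is later than every training
    # row, and every test row is later than every validation row.
    ordered = sorted(rows, key=lambda row: row["ts"])
    n = len(ordered)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Synthetic 5-minute rows (made up for illustration only).
start = datetime(2026, 2, 1, tzinfo=timezone.utc)
rows = [{"ts": start + timedelta(minutes=5 * i), "value": i} for i in range(200)]
train, val, test = chronological_split(rows, train_ratio=0.7, val_ratio=0.15)
```

Splitting by time rather than at random keeps future observations out of the training data, which matters when features include lags and rolling statistics.
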
### 3b) Train extended (P1) feature-set model
```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --forecast-model "ecmwf" \
  --model-version "rain-logreg-v1-extended" \
  --out "models/rain_model_extended.pkl" \
  --report-out "models/rain_model_report_extended.json" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

### 4) Run inference and store prediction
```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"
```

### 5) One-command P0 workflow
```sh
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
bash scripts/run_p0_rain_workflow.sh
```

### 6) Continuous training + prediction via Docker Compose
The `rainml` service in `docker-compose.yml` now runs:
- periodic retraining (default every 24 hours)
- periodic prediction writes (default every 10 minutes)

Artifacts are persisted to `./models` on the host.

```sh
docker compose up -d rainml
docker compose logs -f rainml
```

## Output
- Audit report: `models/rain_data_audit.json`
- Training report: `models/rain_model_report.json`
- Model artifact: `models/rain_model.pkl`
- Dataset snapshot: `models/datasets/rain_dataset_<model_version>_<feature_set>.csv`
- Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized
  outcome fields once available)

## Model Features (v1 baseline)
- `pressure_trend_1h`
- `humidity`
- `temperature_c`
- `wind_avg_m_s`
- `wind_max_m_s`

## Model Features (extended set)
- baseline features, plus:
- `wind_dir_sin`, `wind_dir_cos`
- `temp_lag_5m`, `temp_roll_1h_mean`, `temp_roll_1h_std`
- `humidity_lag_5m`, `humidity_roll_1h_mean`, `humidity_roll_1h_std`
- `wind_avg_lag_5m`, `wind_avg_roll_1h_mean`, `wind_gust_roll_1h_max`
- `pressure_lag_5m`, `pressure_roll_1h_mean`, `pressure_roll_1h_std`
- `rain_last_1h_mm`
- `fc_temp_c`, `fc_rh`, `fc_pressure_msl_hpa`, `fc_wind_m_s`, `fc_wind_gust_m_s`,
  `fc_precip_mm`, `fc_precip_prob`, `fc_cloud_cover`

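
The `wind_dir_sin` / `wind_dir_cos` pair is the usual way to encode a circular quantity, so that 359° and 1° end up close together instead of at opposite ends of a numeric range. A small sketch, assuming the direction is given in degrees; `encode_wind_direction` is an illustrative name, not necessarily what the training script uses:

```python
import math


def encode_wind_direction(direction_deg):
    """Map a wind direction in degrees to a (sin, cos) pair on the unit circle."""
    radians = math.radians(direction_deg % 360.0)
    return math.sin(radians), math.cos(radians)


north = encode_wind_direction(0.0)
just_west_of_north = encode_wind_direction(359.0)
just_east_of_north = encode_wind_direction(1.0)

# Euclidean distance between the two encodings that sit on either side of north.
gap = math.dist(just_west_of_north, just_east_of_north)
```

In raw degrees the two near-north readings differ by 358, while in the encoded space their distance is a small fraction of the circle's diameter.
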
## Notes
- Data is resampled into 5-minute buckets.
- The label is derived from incremental rain computed from the WS90 cumulative `rain_mm` value.
- Timestamps are handled as UTC in the training and inference workflow.

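
To make the label derivation concrete, here is a rough sketch in plain Python. It assumes one cumulative reading per 5-minute bucket (so 12 buckets per hour), treats a drop in the cumulative value as a counter reset contributing no rain, and sums the 12 buckets that follow the current one. Those choices, and the names `rain_increments` and `label_rain_next_1h`, are assumptions for illustration rather than the exact logic of the training script.

```python
def rain_increments(cumulative_mm):
    """Per-bucket rainfall from cumulative readings; negative jumps count as 0."""
    increments = [0.0]  # the first bucket has no previous reading to diff against
    for previous, current in zip(cumulative_mm, cumulative_mm[1:]):
        increments.append(max(current - previous, 0.0))
    return increments


def label_rain_next_1h(cumulative_mm, buckets_per_hour=12, threshold_mm=0.2):
    """Label each bucket True if the following hour's rain total >= threshold_mm."""
    increments = rain_increments(cumulative_mm)
    labels = []
    for i in range(len(increments)):
        window = increments[i + 1 : i + 1 + buckets_per_hour]
        if len(window) < buckets_per_hour:
            labels.append(None)  # the next hour is not fully observed yet
        else:
            labels.append(sum(window) >= threshold_mm)
    return labels


# Synthetic readings: dry, then 0.5 mm falls in bucket 13, then dry again.
cumulative = [10.0] * 13 + [10.5] * 13
labels = label_rain_next_1h(cumulative)
```

Buckets whose next hour is incomplete get `None` rather than `False`, so they can be dropped from training instead of being counted as dry.
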