bugfixes
35
docs/rain_data_issues.md
Normal file
@@ -0,0 +1,35 @@
# Rain Model Data Issues and Mitigations

This document captures known data-quality issues observed in the rain-model pipeline and the mitigation rules used in code.

## Issue Register

| Area | Known issue | Mitigation in code/workflow |
|---|---|---|
| WS90 rain counter (`rain_mm`) | Counter resets can produce negative deltas. | `rain_inc_raw = diff(rain_mm)` then `rain_inc = clip(lower=0)`; reset events tracked as `rain_reset`. |
| WS90 rain spikes | Isolated large 5-minute jumps may be sensor/transmission anomalies. | Spikes flagged as `rain_spike_5m` when increment >= `5.0mm/5m`; counts tracked in audit/training report. |
| Sensor gaps | Missing 5-minute buckets from WS90/barometer ingestion. | Resample to 5-minute grid; barometer interpolated with short limit (`limit=3`); gap lengths tracked by audit. |
| Out-of-order arrivals | Late MQTT events can arrive with older `ts`. | Audit reports out-of-order count by sorting on `received_at` and checking `ts` monotonicity. |
| Duplicate rows | Replays/reconnects can duplicate sensor rows. | Audit reports duplicate counts by `(ts, station_id)` for WS90 and `(ts, source)` for barometer. |
| Forecast sparsity/jitter | Hourly forecast retrieval cadence does not always align with 5-minute features. | Select latest forecast per `ts` (`DISTINCT ON` + `retrieved_at DESC`), resample to 5 minutes, short forward/backfill windows, and clip `fc_precip_prob` to `[0,1]`. |
| Local vs UTC day boundary | Daily rainfall resets can look wrong when local timezone is not respected. | Station timezone is configured via `site.timezone` and used by Wunderground uploader; model training/inference stays UTC-based for split consistency. |

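The counter-reset and spike rules in the register can be sketched in plain Python. This is a minimal illustration, not the pipeline's code: the function name `rain_increments` and the row layout are hypothetical, while the field names and the 5.0 mm threshold come from the table above.

```python
from typing import Dict, List, Sequence

SPIKE_THRESHOLD_MM = 5.0  # register value: increment >= 5.0 mm per 5-minute bucket


def rain_increments(rain_mm: Sequence[float]) -> List[Dict[str, object]]:
    """Derive per-bucket increments and quality flags from a cumulative counter."""
    if not rain_mm:
        return []
    # The first bucket has no predecessor, so it gets no increment and no flags.
    rows: List[Dict[str, object]] = [
        {"rain_inc_raw": None, "rain_inc": None, "rain_reset": False, "rain_spike_5m": False}
    ]
    for prev, cur in zip(rain_mm, rain_mm[1:]):
        raw = cur - prev      # diff(rain_mm)
        inc = max(raw, 0.0)   # clip(lower=0): a reset contributes no rain
        rows.append(
            {
                "rain_inc_raw": raw,
                "rain_inc": inc,
                "rain_reset": raw < 0,                       # negative delta marks a reset
                "rain_spike_5m": inc >= SPIKE_THRESHOLD_MM,  # flagged, not removed
            }
        )
    return rows
```

For the counter series `10.0, 10.5, 0.0, 0.5, 6.0`, the third bucket is marked as a reset with a zero increment and the last bucket is marked as a spike.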

## Audit Command

Run the audit regularly and retain the JSON reports for comparison:

```sh
python scripts/audit_rain_data.py \
  --site home \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --forecast-model "ecmwf" \
  --out "models/rain_data_audit.json"
```

## Operational Rules

- Treat large jumps in `rain_reset_count` or `rain_spike_5m_count` as data-quality incidents.
- If `gaps_5m.ws90_max_gap_minutes` or `gaps_5m.baro_max_gap_minutes` exceeds one hour, avoid model refresh until ingestion stabilizes.
- If forecast rows are unexpectedly low for an `extended` feature run, either fix forecast ingestion first or temporarily fall back to the `baseline` feature set.

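The gap rule above can be checked automatically against a saved audit report. The sketch below assumes the report nests the two gap fields under a `gaps_5m` object, as the dotted names suggest; the function name `refresh_blockers` is hypothetical.

```python
from typing import Any, Dict, List

MAX_GAP_MINUTES = 60  # "exceeds one hour" in the rule above


def refresh_blockers(report: Dict[str, Any]) -> List[str]:
    """Return the reasons, if any, to hold a model refresh based on sensor gaps."""
    gaps = report.get("gaps_5m") or {}  # assumed nesting of the gap fields
    blockers: List[str] = []
    for key in ("ws90_max_gap_minutes", "baro_max_gap_minutes"):
        value = gaps.get(key)
        if value is not None and value > MAX_GAP_MINUTES:
            blockers.append(f"gaps_5m.{key}={value} exceeds {MAX_GAP_MINUTES} minutes")
    return blockers
```

A report loaded with `json.load` from `models/rain_data_audit.json` can be passed in directly; an empty list means the gap rule does not block a refresh.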
141
docs/rain_model_runbook.md
Normal file
@@ -0,0 +1,141 @@
# Rain Model Runbook

Operational guide for training, evaluating, deploying, monitoring, and rolling back the rain model.

## 1) One-time Setup

Apply monitoring views:

```sh
docker compose exec -T timescaledb \
  psql -U postgres -d micrometeo \
  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
```

## 2) Train + Evaluate

Recommended evaluation run (includes validation-only tuning, calibration comparison, naive baselines, and walk-forward folds):

```sh
python scripts/train_rain_model.py \
  --site "home" \
  --start "2026-02-01T00:00:00Z" \
  --end "2026-03-03T23:55:00Z" \
  --feature-set "extended" \
  --model-family "auto" \
  --forecast-model "ecmwf" \
  --tune-hyperparameters \
  --max-hyperparam-trials 12 \
  --calibration-methods "none,sigmoid,isotonic" \
  --walk-forward-folds 4 \
  --model-version "rain-auto-v1-extended" \
  --out "models/rain_model.pkl" \
  --report-out "models/rain_model_report.json" \
  --model-card-out "models/model_card_{model_version}.md" \
  --dataset-out "models/datasets/rain_dataset_{model_version}_{feature_set}.csv"
```

Review in report:
- `candidate_models[*].hyperparameter_tuning`
- `candidate_models[*].calibration_comparison`
- `naive_baselines_test`
- `walk_forward_backtest`

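A quick way to confirm that a report contains every section listed above is to check for the keys before reading the numbers. This is a sketch under assumptions: it treats `candidate_models` as a list of objects (as the `[*]` notation suggests), and the function name `missing_review_sections` is hypothetical.

```python
from typing import Any, Dict, List

PER_CANDIDATE_KEYS = ("hyperparameter_tuning", "calibration_comparison")
TOP_LEVEL_KEYS = ("naive_baselines_test", "walk_forward_backtest")


def missing_review_sections(report: Dict[str, Any]) -> List[str]:
    """List the review sections that are absent (or null) in a training report."""
    missing: List[str] = []
    candidates = report.get("candidate_models") or []
    if not candidates:
        missing.append("candidate_models")
    for i, candidate in enumerate(candidates):
        for key in PER_CANDIDATE_KEYS:
            if candidate.get(key) is None:
                missing.append(f"candidate_models[{i}].{key}")
    for key in TOP_LEVEL_KEYS:
        if report.get(key) is None:
            missing.append(key)
    return missing
```

An empty result means all four review items are present in the report loaded from `models/rain_model_report.json`.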

## 3) Deploy

1. Promote the selected artifact path to the inference worker (`RAIN_MODEL_PATH` or CLI `--model-path`).
2. Run one dry-run inference:

```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h" \
  --dry-run
```

3. Run live inference:

```sh
python scripts/predict_rain_model.py \
  --site home \
  --model-path "models/rain_model.pkl" \
  --model-name "rain_next_1h"
```

## 4) Rollback

1. Identify the last known-good model artifact in `models/`.
2. Point deployment to that artifact (worker env `RAIN_MODEL_PATH` or manual inference path).
3. Re-run the inference command and verify writes in `predictions_rain_1h`.
4. Keep the failed artifact/report for postmortem.

## 5) Monitoring

### Feature drift

```sql
SELECT *
FROM rain_feature_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: any absolute z-score > 3 for 2+ consecutive days.

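The feature-drift heuristic amounts to a run-length check over daily z-scores. The sketch below assumes the z-scores for one feature have already been read from the view and ordered oldest first (the query above returns newest first, so reverse it); the function name `zscore_alert_days` is hypothetical, and the view's column names are not shown in this document.

```python
from typing import List, Sequence


def zscore_alert_days(
    z_scores: Sequence[float], threshold: float = 3.0, min_run: int = 2
) -> List[int]:
    """Return indices of days on which |z| > threshold has held for min_run days in a row."""
    alerts: List[int] = []
    run = 0  # length of the current streak of out-of-range days
    for i, z in enumerate(z_scores):
        run = run + 1 if abs(z) > threshold else 0
        if run >= min_run:
            alerts.append(i)
    return alerts
```

A single out-of-range day does not alert; the second consecutive day does, and so does each following day until the streak breaks.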

### Prediction drift

```sql
SELECT *
FROM rain_prediction_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: `predicted_positive_rate` shifts by > 2x relative to the trailing 14-day median.

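One possible reading of the prediction-drift heuristic is sketched below. It assumes a list of daily `predicted_positive_rate` values ordered oldest first, and it interprets "shifts by > 2x" as a change by more than a factor of two in either direction, which is an assumption rather than something the runbook states. The function name `positive_rate_shifted` is hypothetical.

```python
from statistics import median
from typing import Sequence


def positive_rate_shifted(
    rates: Sequence[float], window: int = 14, factor: float = 2.0
) -> bool:
    """Compare the latest daily rate with the median of the preceding `window` days."""
    if len(rates) < window + 1:
        return False  # not enough history to form a baseline
    today = rates[-1]
    baseline = median(rates[-(window + 1):-1])
    if baseline == 0:
        return today > 0  # any positives after an all-zero baseline count as a shift
    ratio = today / baseline
    return ratio > factor or ratio < 1 / factor
```

With 14 days at a rate of 0.10, a 15th day at 0.25 or at 0.04 triggers the check, while a day at 0.15 does not.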

### Calibration/performance drift

```sql
SELECT *
FROM rain_calibration_drift_daily
WHERE site = 'home'
ORDER BY day DESC
LIMIT 30;
```

Alert heuristic: sustained Brier-score increase of more than 25% over the trailing 30-day average.

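For reference, the Brier score is the mean squared difference between predicted probabilities and binary outcomes, so lower is better. The sketch below computes it and applies the alert heuristic to a list of daily scores ordered oldest first. The runbook does not define "sustained", so the `sustain_days=3` default is an assumption, as are both function names.

```python
from statistics import mean
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def brier_drifted(
    daily_brier: Sequence[float],
    window: int = 30,
    max_increase: float = 0.25,
    sustain_days: int = 3,  # assumed meaning of "sustained"
) -> bool:
    """True when each of the last `sustain_days` scores exceeds the baseline by > 25%."""
    if len(daily_brier) < window + sustain_days:
        return False  # not enough history to form a baseline
    recent = daily_brier[-sustain_days:]
    # Baseline is the `window` days immediately before the recent streak.
    baseline = mean(daily_brier[-(window + sustain_days):-sustain_days])
    limit = baseline * (1 + max_increase)
    return all(score > limit for score in recent)
```

Keeping the baseline window separate from the recent days prevents a degraded streak from raising its own reference level.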

## 6) Pipeline Failure Alerts

Run the health-check script from cron, a systemd timer, or your alerting scheduler:

```sh
python scripts/check_rain_pipeline_health.py \
  --site home \
  --model-name rain_next_1h \
  --max-ws90-age 20m \
  --max-baro-age 30m \
  --max-forecast-age 3h \
  --max-prediction-age 30m \
  --max-pending-eval-age 3h \
  --max-pending-eval-rows 200
```

The script exits non-zero on failure, so it can drive alerting directly.

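Because the exit code carries the result, a scheduler written in Python only needs to run the command and test `returncode`. The wrapper below is a minimal sketch; the function name `pipeline_is_healthy` and the 120-second timeout are assumptions, and a hang or a missing executable is treated as a failure.

```python
import subprocess
import sys
from typing import Sequence


def pipeline_is_healthy(cmd: Sequence[str], timeout_s: float = 120.0) -> bool:
    """Run a health-check command and map its exit status to a boolean."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # a hung or unlaunchable check is reported as unhealthy
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command for demonstration; replace with the health-check invocation above.
    ok = pipeline_is_healthy([sys.executable, "-c", "raise SystemExit(0)"])
    print("healthy" if ok else "unhealthy")
```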

## 7) Continuous Worker Defaults

`docker-compose.yml` provides these controls for `rainml`:
- `RAIN_TUNE_HYPERPARAMETERS`
- `RAIN_MAX_HYPERPARAM_TRIALS`
- `RAIN_CALIBRATION_METHODS`
- `RAIN_WALK_FORWARD_FOLDS`
- `RAIN_ALLOW_EMPTY_DATA`
- `RAIN_MODEL_CARD_PATH`

Recommended production defaults:
- Enable tuning daily or weekly (`RAIN_TUNE_HYPERPARAMETERS=true`)
- Keep walk-forward folds at `0` in continuous mode; run fold backtests in scheduled evaluation jobs

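To illustrate why fold backtests cost more than a single split, here is one common way to build walk-forward folds: an expanding training window followed by a fixed-size test block, always in time order. This document does not show how `train_rain_model.py` constructs its folds, so the scheme, the function name `walk_forward_folds`, and the `min_train` parameter are all assumptions.

```python
from typing import List, Tuple


def walk_forward_folds(
    n_rows: int, n_folds: int, min_train: int
) -> List[Tuple[range, range]]:
    """Build expanding-window (train, test) index ranges over time-ordered rows."""
    if n_folds <= 0:
        return []  # folds disabled, as recommended for continuous mode
    test_size = (n_rows - min_train) // n_folds
    if test_size <= 0:
        raise ValueError("not enough rows for the requested number of folds")
    folds: List[Tuple[range, range]] = []
    for k in range(n_folds):
        train_end = min_train + k * test_size
        # Train on everything before train_end, test on the block right after it.
        folds.append((range(0, train_end), range(train_end, train_end + test_size)))
    return folds
```

Each fold retrains on a longer history and is evaluated only on later rows, so four folds mean four additional fits, which is why the backtest is better suited to scheduled evaluation jobs.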
@@ -37,10 +37,12 @@ pip install -r scripts/requirements.txt

 ## Scripts
 - `scripts/audit_rain_data.py`: data quality + label quality + class balance audit.
-- `scripts/train_rain_model.py`: strict time-based split training and metrics report.
+- `scripts/train_rain_model.py`: strict time-based split training and metrics report, with optional
+  validation-only hyperparameter tuning, calibration comparison, naive baseline comparison, and walk-forward folds.
 - `scripts/predict_rain_model.py`: inference using saved model artifact; upserts into
   `predictions_rain_1h`.
 - `scripts/run_rain_ml_worker.py`: long-running worker for periodic training + prediction.
+- `scripts/check_rain_pipeline_health.py`: freshness/failure check for alerting.

 Feature-set options:
 - `baseline`: original 5 local observation features.
@@ -55,7 +57,7 @@ Model-family options (`train_rain_model.py`):

 ## Usage
 ### 1) Apply schema update (existing DBs)
-`001_schema.sql` now includes `predictions_rain_1h`.
+`001_schema.sql` includes `predictions_rain_1h`.

 ```sh
 docker compose exec -T timescaledb \
@@ -63,6 +65,14 @@ docker compose exec -T timescaledb \
   -f /docker-entrypoint-initdb.d/001_schema.sql
 ```

+Apply monitoring views:
+
+```sh
+docker compose exec -T timescaledb \
+  psql -U postgres -d micrometeo \
+  -f /docker-entrypoint-initdb.d/002_rain_monitoring_views.sql
+```
+
 ### 2) Run data audit
 ```sh
 export DATABASE_URL="postgres://postgres:postgres@localhost:5432/micrometeo?sslmode=disable"
@@ -136,6 +146,25 @@ python scripts/train_rain_model.py \
   --report-out "models/rain_model_report_auto.json"
 ```

+### 3e) Full P1 evaluation (tuning + calibration + walk-forward)
+```sh
+python scripts/train_rain_model.py \
+  --site "home" \
+  --start "2026-02-01T00:00:00Z" \
+  --end "2026-03-03T23:55:00Z" \
+  --feature-set "extended" \
+  --model-family "auto" \
+  --forecast-model "ecmwf" \
+  --tune-hyperparameters \
+  --max-hyperparam-trials 12 \
+  --calibration-methods "none,sigmoid,isotonic" \
+  --walk-forward-folds 4 \
+  --model-version "rain-auto-v1-extended-eval" \
+  --out "models/rain_model_auto.pkl" \
+  --report-out "models/rain_model_report_auto.json" \
+  --model-card-out "models/model_card_{model_version}.md"
+```
+
 ### 4) Run inference and store prediction
 ```sh
 python scripts/predict_rain_model.py \
@@ -154,6 +183,10 @@ bash scripts/run_p0_rain_workflow.sh
 The `rainml` service in `docker-compose.yml` now runs:
 - periodic retraining (default every 24 hours)
 - periodic prediction writes (default every 10 minutes)
+- configurable tuning/calibration behavior (`RAIN_TUNE_HYPERPARAMETERS`,
+  `RAIN_MAX_HYPERPARAM_TRIALS`, `RAIN_CALIBRATION_METHODS`)
+- graceful gap handling for temporary source outages (`RAIN_ALLOW_EMPTY_DATA=true`)
+- optional model-card output (`RAIN_MODEL_CARD_PATH`)

 Artifacts are persisted to `./models` on the host.

@@ -165,6 +198,7 @@ docker compose logs -f rainml
 ## Output
 - Audit report: `models/rain_data_audit.json`
 - Training report: `models/rain_model_report.json`
+- Model card: `models/model_card_<model_version>.md`
 - Model artifact: `models/rain_model.pkl`
 - Dataset snapshot: `models/datasets/rain_dataset_<model_version>_<feature_set>.csv`
 - Prediction rows: `predictions_rain_1h` (probability + threshold decision + realized
@@ -192,3 +226,5 @@ docker compose logs -f rainml
 - Data is resampled into 5-minute buckets.
 - Label is derived from incremental rain from WS90 cumulative `rain_mm`.
 - Timestamps are handled as UTC in training/inference workflow.
+- See [Data issues and mitigation rules](./rain_data_issues.md) and
+  [runbook/monitoring guidance](./rain_model_runbook.md).