docs: rewrite all documentation to reflect current state

- Remove adminer references (service was removed)
- Remove mermaid diagrams (ASCII only)
- Remove hardcoded credentials (use env var references)
- Update all Docker references to 4-container setup (app, postgres, web, pgadmin)
- Document env-based admin credentials (ADMIN_EMAIL/ADMIN_PASSWORD)
- Document parameterized queries (SQL injection fixed)
- Document FCM topic routing by stationtype+level
- Document siren stationtype=3 fix in sidesdecode.py
- Document idempotent seeder (firstOrCreate)
- Document reverse proxy setup in deployment guide
- Remove Makefile references (Docker Compose only)
2026-05-21 02:59:32 +08:00
parent c1b2a8d553
commit 6863f39a24
7 changed files with 1116 additions and 658 deletions
--- a/docs/04-DATA-PIPELINE.md
+++ b/docs/04-DATA-PIPELINE.md
@@ -1,48 +1,62 @@
+<!-- generated-by: gsd-doc-writer -->
 # Data Pipeline: Python Autoscript

 ## Overview

-The file `autoscript/sidesdecode.py` is the data ingestion pipeline that:
+`autoscript/sidesdecode.py` is the data ingestion pipeline that:

 1. Connects to an FTP server where telemetry stations upload CSV data
 2. Downloads and parses CSV files for the current day
 3. Inserts rainfall, water level, and siren data into PostgreSQL
 4. Triggers push notifications when thresholds are exceeded

-## How It Runs
+The script is designed to run on a schedule (for example, as a cron job) and processes new files that remote telemetry stations upload throughout the day.

-The script is designed to be run on a **schedule** (likely cron job), processing new data files uploaded by remote telemetry stations throughout the day.
+## Environment Variables
+
+Connection settings and credentials are read from environment variables. If a variable is unset, the default below is used:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `FTP_SERVER` | `myvscada.com` | FTP server hostname |
+| `FTP_USERNAME` | `tck` | FTP login username |
+| `FTP_PASSWORD` | *(empty)* | FTP login password |
+| `PG_HOST` | `postgres` | PostgreSQL host (`postgres` on Docker network, `localhost` on Docker host) |
+| `PG_DATABASE` | `sides_db` | PostgreSQL database name |
+| `PG_USER` | `tck` | PostgreSQL username |
+| `PG_PASSWORD` | *(empty)* | PostgreSQL password |
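
A minimal sketch of how the variables in this table might be read, assuming a plain `os.environ` lookup with the documented defaults. The function name `load_settings` and the dictionary keys are hypothetical and are not taken from `sidesdecode.py`.

```python
import os


def load_settings(env=os.environ):
    """Return connection settings, falling back to the documented defaults.

    `env` is any mapping; it defaults to the process environment so the
    function can be exercised with a plain dict.
    """
    return {
        "ftp_server": env.get("FTP_SERVER", "myvscada.com"),
        "ftp_username": env.get("FTP_USERNAME", "tck"),
        "ftp_password": env.get("FTP_PASSWORD", ""),  # empty by default
        "pg_host": env.get("PG_HOST", "postgres"),
        "pg_database": env.get("PG_DATABASE", "sides_db"),
        "pg_user": env.get("PG_USER", "tck"),
        "pg_password": env.get("PG_PASSWORD", ""),  # empty by default
    }
```

Passing a mapping instead of reading `os.environ` directly keeps the lookup easy to check, for example `load_settings({"PG_HOST": "localhost"})` when running on the Docker host.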

 ## FTP Connection

+The script connects to the configured FTP server and navigates to today's date folder:
+
 ```
-Server: myvscada.com
-Username: tck
-Password: tck6789
 Path: files/SIDES/SUCCESS/{year}/{month}/{day}/
 ```

-The script navigates to today's date folder and lists all files.
+Example path for 21 May 2026: `files/SIDES/SUCCESS/2026/05/21/`

 ### File Filtering

 - Skips files containing "rf" in the filename (Tideda format files)
- Only processes files with today's date (`yymmdd` format) in the filename
+- Only processes files with today's date in `yymmdd` format in the filename
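
The path and filter rules above can be sketched as follows. This is an illustration only: the helper names `ftp_day_path` and `should_process` are hypothetical, the `rf` check is assumed to be a case-sensitive substring match, and the filenames in the usage note are made up.

```python
from datetime import date


def ftp_day_path(day):
    """Build the FTP folder for a given day, with zero-padded month and day."""
    return f"files/SIDES/SUCCESS/{day:%Y}/{day:%m}/{day:%d}/"


def should_process(filename, day):
    """Apply the two documented filters to a single filename."""
    if "rf" in filename:  # Tideda format files are skipped
        return False
    # Keep only files whose name contains the day in yymmdd format.
    return day.strftime("%y%m%d") in filename
```

For 21 May 2026, `ftp_day_path(date(2026, 5, 21))` gives the example path shown above, and a hypothetical file named `KBLG0026_260521.csv` would pass the filter.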

 ## CSV Format

-Each line in the CSV file contains 37+ comma-separated columns. Key columns extracted:
+Each line in the CSV file contains 37+ comma-separated columns. The script requires at least 25 columns per line; shorter lines are skipped.

-| Column Index | Field | Description |
-|-------------|-------|-------------|
-| 1 | `station_id` | Station identifier (e.g., KBLG0026) |
+Key columns extracted:
+
+| Column Index | Variable | Description |
+|-------------|----------|-------------|
+| 1 | `station_id` | Station identifier (e.g., `KBLG0026`) |
 | 4 | `timestamp` | Timestamp in `yymmddHHMMSS` format |
 | 6 | `battery` | Battery voltage |
 | 15 | `wlalert` | Water level alert threshold |
 | 16 | `wlwarn` | Water level warning threshold |
 | 17 | `wldgr` | Water level danger threshold |
 | 18 | `sirenid` | Siren identifier |
-| 19 | `siren` | Siren status (`H`=Danger/High, `L`=Warning/Low, `N`=Normal) |
+| 19 | `siren` | Siren status: `H`=Danger, `L`=Warning, `N`=Normal |
 | 21 | `anncumm` | Annual cumulative rainfall |
 | 22 | `dailycumm` | Daily cumulative rainfall |
 | 23 | `hourlycumm` | Hourly rainfall |
@@ -51,39 +65,129 @@ Each line in the CSV file contains 37+ comma-separated columns. Key columns extr

 ## Data Processing Logic

+The `process_line()` function handles each CSV line. All database operations use `psycopg2` parameterized queries to prevent SQL injection.
+
 ### Rainfall Data

-1. Check if `dailycumm` or `hourlycumm` is not null
-2. Check if record already exists for this station+timestamp
-3. If new, INSERT into `rainfall` table
-4. **Threshold check**: If `hourlycumm >= 30`:
-   - `30 <= hourly < 60` → **Warning** level
-   - `hourly >= 60` → **Danger** level
-   - INSERT into `notification` table
-   - Send push notification via Laravel API
+```
+CSV line
+  |
+  v
+[dailycumm or hourlycumm not null?]
+  |-- No --> skip
+  |-- Yes
+       v
+  [record exists for station+timestamp?]
+    |-- Yes --> skip insert
+    |-- No --> INSERT INTO rainfall (stationid, timestamp, anncum, daily, hourly, currentrf, battery)
+         |
+         v
+  [hourlycumm >= 30?]
+    |-- No --> done
+    |-- Yes
+         |
+         v
+  Determine level:
+    30 <= hourly < 60 --> Warning
+    hourly >= 60       --> Danger
+         |
+         v
+  [notification exists for station+timestamp+stationtype=1?]
+    |-- Yes --> skip insert
+    |-- No --> INSERT INTO notification (stationid, timestamp, stationtype=1, level, active_time)
+         |
+         v
+  send_alert_to_laravel(station_id, level, 1)
+```

 ### Water Level Data

-1. Check if `waterlevel` is not null
-2. Check if record already exists for this station+datetime
-3. If new, INSERT into `waterlevel` table (with alert/warning/danger thresholds)
-4. **Threshold check**: If `waterlevel >= alert`:
-   - `alert <= wl < warning` → **Alert** level
-   - `warning <= wl < danger` → **Warning** level
-   - `wl >= danger` → **Danger** level
-   - INSERT into `notification` table
-   - Send push notification via Laravel API
+```
+CSV line
+  |
+  v
+[waterlevel not null?]
+  |-- No --> skip
+  |-- Yes
+       v
+  [record exists for station+datetime?]
+    |-- Yes --> skip insert
+    |-- No --> INSERT INTO waterlevel (stationid, datetime, waterlevel, alert, warning, danger)
+         |
+         v
+  [waterlevel >= wlalert?]
+    |-- No --> done
+    |-- Yes
+         |
+         v
+  Determine level:
+    alert   <= wl < warning --> Alert
+    warning <= wl < danger  --> Warning
+    wl >= danger            --> Danger
+         |
+         v
+  [notification exists for station+timestamp+stationtype=2?]
+    |-- Yes --> skip insert
+    |-- No --> INSERT INTO notification (stationid, timestamp, stationtype=2, level, active_time)
+         |
+         v
+  send_alert_to_laravel(station_id, level, 2)
+```

 ### Siren Data

-1. Check if `sirenid` is not null
-2. Check if record already exists for this station+active_time
-3. Determine level from siren status:
-   - `H` → **Danger**
-   - `L` → **Warning**
-   - `N` → **Normal**
-4. INSERT into `siren` table
-5. If level is not Normal, send push notification via Laravel API
+```
+CSV line
+  |
+  v
+[sirenid not null?]
+  |-- No --> skip
+  |-- Yes
+       v
+  Determine level from siren status:
+    H --> Danger
+    L --> Warning
+    N --> Normal
+       |
+       v
+  [record exists for station+active_time?]
+    |-- Yes --> skip insert
+    |-- No --> INSERT INTO siren (stationid, stationtype=3, active_time, level)
+         |
+         v
+  [level != Normal?]
+    |-- No --> done
+    |-- Yes --> send_alert_to_laravel(station_id, level, 3)
+```
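
The siren mapping in the flow above can be expressed as a small lookup. This is a sketch, not the script's code: `siren_level` and `siren_needs_alert` are hypothetical names, and returning `None` for an unrecognised status code is an assumption.

```python
# Siren status codes from CSV column 19.
SIREN_LEVELS = {"H": "Danger", "L": "Warning", "N": "Normal"}


def siren_level(status):
    """Map a siren status code to its level; None for unknown codes (assumed)."""
    return SIREN_LEVELS.get(status)


def siren_needs_alert(status):
    """An alert is sent only when the level is known and not Normal."""
    level = siren_level(status)
    return level is not None and level != "Normal"
```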
+
+## Station Types
+
+The `stationtype` integer identifies the data source in notifications and alerts:
+
+| stationtype | Data Source |
+|------------|-------------|
+| 1 | Rainfall |
+| 2 | Water Level |
+| 3 | Siren |
+
+## Threshold Summary
+
+### Rainfall Thresholds
+
+| Condition | Level |
+|-----------|-------|
+| `30 <= hourlycumm < 60` | Warning |
+| `hourlycumm >= 60` | Danger |
+
+### Water Level Thresholds
+
+Thresholds are per-station values from CSV columns 15-17.
+
+| Condition | Level |
+|-----------|-------|
+| `wlalert <= waterlevel < wlwarn` | Alert |
+| `wlwarn <= waterlevel < wldgr` | Warning |
+| `waterlevel >= wldgr` | Danger |
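
As a rough illustration, the two threshold tables can be written as classification functions. The function names are hypothetical, the inputs are assumed to be numeric already, and the water level version assumes `wlalert <= wlwarn <= wldgr` for each station.

```python
def rainfall_level(hourlycumm):
    """Classify hourly rainfall; None means no threshold was reached."""
    if hourlycumm >= 60:
        return "Danger"
    if hourlycumm >= 30:
        return "Warning"
    return None


def waterlevel_level(waterlevel, wlalert, wlwarn, wldgr):
    """Classify a water level against the station's own thresholds.

    Checks run from the highest threshold down, which matches the table
    as long as wlalert <= wlwarn <= wldgr (an assumption).
    """
    if waterlevel >= wldgr:
        return "Danger"
    if waterlevel >= wlwarn:
        return "Warning"
    if waterlevel >= wlalert:
        return "Alert"
    return None
```

Returning `None` below the lowest threshold corresponds to the "done" branch in the flow diagrams: no notification row and no alert.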

 ## Alert Notification Flow

@@ -99,45 +203,57 @@ def send_alert_to_laravel(stationid, level, stationtype):
    response = requests.post("https://sides.tck.com.my/api/alert", json=payload, timeout=5)
 ```

-This hits the Laravel `AlertController` which:
-1. Builds notification title/body based on station type and level
-2. Calls `FcmService::sendToTopic()` which:
-   - Reads Firebase service account credentials
-   - Gets an OAuth2 access token from Google
-   - Sends FCM message to topic (e.g., `rainfall_warning`)
-   - Push notification arrives on subscribed mobile devices
+<!-- VERIFY: Alert API endpoint is https://sides.tck.com.my/api/alert -->

-## PostgreSQL Connection
+The full notification chain:

-The script connects directly to PostgreSQL:
-
-```python
-pg_host = "192.168.0.211"
-pg_database = "sides_db"
-pg_user = "tck"
-pg_password = "projectdev##1"
+```
+sidesdecode.py
+  |  POST /api/alert  {stationid, level, stationtype}
+  v
+AlertController (Laravel)
+  |  Builds notification title/body from station type and level
+  v
+FcmService::sendToTopic()
+  |  Routes to FCM topic by stationtype and level
+  |  (e.g., rainfall_warning, rainfall_danger, waterlevel_alert, waterlevel_danger)
+  v
+Firebase Cloud Messaging
+  |  Push notification delivered to subscribed mobile devices
+  v
+Mobile App
 ```
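
To make the chain concrete, here is a hedged Python sketch of the request payload and of how a topic name could be derived. The payload keys come from the diagram above. The topic rule (`<source>_<level>` in lower case) is inferred only from the example topic names, the real routing lives in the Laravel `FcmService`, and `build_alert_payload`, `fcm_topic`, and `TOPIC_PREFIX` are hypothetical names. No siren prefix is listed in this document, so the sketch returns `None` for `stationtype=3` rather than guessing one.

```python
# Prefixes inferred from the example topics; siren (3) is intentionally absent.
TOPIC_PREFIX = {1: "rainfall", 2: "waterlevel"}


def build_alert_payload(stationid, level, stationtype):
    """JSON body for POST /api/alert, using the keys shown in the diagram."""
    return {"stationid": stationid, "level": level, "stationtype": stationtype}


def fcm_topic(stationtype, level):
    """Derive a topic name such as 'rainfall_warning' (assumed naming rule)."""
    prefix = TOPIC_PREFIX.get(stationtype)
    if prefix is None:
        return None  # prefix not documented here, so no topic is guessed
    return f"{prefix}_{level.lower()}"
```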

-**Note**: This is a hardcoded external IP, not using the Docker container. The database name is `sides_db` (different from the Docker `.env` which uses `tckdev`).
+## Deduplication
+
+Before every INSERT, the script checks for an existing record:
+
+- **Rainfall**: `SELECT COUNT(*) FROM rainfall WHERE stationid = %s AND timestamp = %s`
+- **Water Level**: `SELECT COUNT(*) FROM waterlevel WHERE stationid = %s AND datetime = %s`
+- **Siren**: `SELECT COUNT(*) FROM siren WHERE stationid = %s AND active_time = %s`
+- **Notification**: `SELECT COUNT(*) FROM notification WHERE stationid = %s AND timestamp = %s AND stationtype = %s`
+
+If a record exists, the INSERT is skipped. This makes the script safe to re-run for the same time period.
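
A minimal sketch of the check-then-insert pattern for the rainfall table, assuming a DB-API cursor such as the one `psycopg2` provides. The function name `insert_rainfall_if_new` and the `values` tuple are hypothetical. The column list follows the rainfall flow diagram earlier in this document.

```python
COUNT_SQL = "SELECT COUNT(*) FROM rainfall WHERE stationid = %s AND timestamp = %s"
INSERT_SQL = (
    "INSERT INTO rainfall "
    "(stationid, timestamp, anncum, daily, hourly, currentrf, battery) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)


def insert_rainfall_if_new(cur, stationid, timestamp, values):
    """Insert one rainfall row unless the station+timestamp pair exists.

    `values` is (anncum, daily, hourly, currentrf, battery). Returns True
    when a row was inserted and False when the insert was skipped.
    """
    # Parameters are passed separately, never formatted into the SQL string.
    cur.execute(COUNT_SQL, (stationid, timestamp))
    if cur.fetchone()[0] > 0:
        return False
    cur.execute(INSERT_SQL, (stationid, timestamp, *values))
    return True
```

The same shape would apply to the other three tables with their own key columns. Note that a count followed by an insert is not atomic, so this sketch assumes only one instance of the script runs at a time.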

 ## File Management (Commented Out)

-The script contains (commented out) functions for:
- `move_to_error_folder()` — Move malformed files to an FTP error folder
- `move_to_success_folder()` — Move processed files to a success archive folder
+Two file management functions are defined but currently commented out:

-These are currently disabled — files remain in the source folder after processing.
+- `move_to_error_folder()` -- Move malformed files to an FTP error subfolder
+- `move_to_success_folder()` -- Move processed files to a success archive subfolder

-## Log Files
+When enabled, each function creates the target FTP directory if it does not exist, uploads the file there, and deletes the original. Because both are currently disabled, processed files remain in the source FTP folder.

- `autoscript/sidesdecode.log` — Processing output
- `autoscript/sidesdecode_error.log` — Error output
+## Error Handling
+
+- Malformed lines (fewer than 25 columns) are skipped with a log message
+- Any exception during line processing triggers `conn.rollback()` to prevent partial inserts
+- Alert sending failures are caught and logged but do not halt processing
+- The script closes both FTP and database connections on exit
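
The per-line behaviour listed above might look roughly like this. It is a sketch under assumptions: `process_lines` and the `handler` callback are hypothetical names, `conn` is any object with `commit()` and `rollback()` (for example a `psycopg2` connection), and committing after each successful line is assumed rather than confirmed by the script.

```python
import logging

logger = logging.getLogger("sidesdecode")

MIN_COLUMNS = 25  # lines with fewer columns are treated as malformed


def process_lines(lines, conn, handler):
    """Run `handler` on each well-formed CSV line; return the success count."""
    processed = 0
    for raw in lines:
        fields = raw.strip().split(",")
        if len(fields) < MIN_COLUMNS:
            logger.warning("Skipping malformed line (%d columns)", len(fields))
            continue
        try:
            handler(fields)
            conn.commit()  # assumed: one commit per successful line
            processed += 1
        except Exception:
            # Undo any partial inserts for this line, then carry on.
            logger.exception("Line failed; rolling back")
            conn.rollback()
    return processed
```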

 ## Known Issues

-1. **Hardcoded credentials** — FTP and PostgreSQL credentials are embedded in the script
-2. **No deduplication beyond same-timestamp** — If the script runs twice, it skips exact duplicates but has no broader deduplication
-3. **Commented out file management** — Processed files are not moved/archived
-4. **Water level alert sends `stationtype=1`** instead of `2` (likely a bug at line 378)
-5. **No error recovery** — If the script crashes mid-processing, some data may be partially inserted
-6. **No connection pooling** — New FTP and database connections each run
+1. **No file archiving** -- `move_to_error_folder` and `move_to_success_folder` are commented out, so files are never moved after processing
+2. **No broader deduplication** -- Deduplication only matches on exact station+timestamp pairs; near-duplicate records are not detected
+3. **No connection retry** -- If FTP or PostgreSQL is unreachable, the script fails immediately with no retry logic
+4. **Partial processing risk** -- If the script crashes mid-file, lines already processed stay committed, and the remaining lines are not processed until the next run