# Pitfalls Research
**Domain:** Raspberry Pi Web Monitoring Station Interface
**Researched:** 2026-03-12
**Confidence:** HIGH
## Critical Pitfalls
### Pitfall 1: Flask Performance Degradation on Pi Zero 2 W
**What goes wrong:**
The web server takes 60+ seconds to start and holds 60-100% CPU continuously, causing severe UI lag and an unresponsive touchscreen. Even simple requests load slowly.
**Why it happens:**
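The Pi Zero 2 W has a quad-core Arm Cortex-A53 at 1GHz but only 512MB of RAM, and the kiosk browser competes with the backend for both CPU and memory. Flask's built-in development server is meant for local development, not deployment. Python's GIL limits CPU-bound concurrency within a single process. The "quick start" approach doesn't account for these resource constraints.
**How to avoid:**
- Use a production WSGI server. With Gunicorn, keep the worker count low (each worker is a separate Python process, and RAM is the tighter limit) and prefer threaded or event-driven workers for long-lived connections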
- Consider lighter alternatives: Python's built-in http.server for simple needs, or Node.js for better concurrency
- Cache static assets aggressively
- Implement server-sent events (SSE) instead of polling for real-time updates
- Pre-compile templates, minimize Python imports at startup
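
The SSE suggestion above can be sketched with the standard library alone. This is a minimal illustration, not the project's backend: `format_sse`, `reading_stream`, `make_app`, and the `read_sensor` callable are hypothetical names, and the 5-second interval is an assumed default. It shows the wire format (`event:` and `data:` lines ending in a blank line) and a WSGI app that streams it.

```python
import json
import time
from typing import Callable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None) -> bytes:
    """Encode one Server-Sent Events message, terminated by a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return ("\n".join(lines) + "\n\n").encode("utf-8")


def reading_stream(read_sensor: Callable[[], dict],
                   interval_s: float = 5.0,
                   max_events: Optional[int] = None) -> Iterator[bytes]:
    """Yield one SSE message per reading, sleeping between readings."""
    sent = 0
    while max_events is None or sent < max_events:
        yield format_sse(read_sensor(), event="reading")
        sent += 1
        if max_events is None or sent < max_events:
            time.sleep(interval_s)


def make_app(read_sensor: Callable[[], dict],
             interval_s: float = 5.0,
             max_events: Optional[int] = None):
    """Build a WSGI app that streams readings as text/event-stream."""
    def app(environ, start_response):
        start_response("200 OK", [
            ("Content-Type", "text/event-stream"),
            ("Cache-Control", "no-cache"),
        ])
        return reading_stream(read_sensor, interval_s, max_events)
    return app
```

Each open stream occupies a worker for as long as the browser stays connected, so this pattern generally needs a threaded or event-driven worker rather than a single synchronous one. On the page, `new EventSource(url)` with a listener for the `reading` event would consume it.
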
**Warning signs:**
- First page load takes >30 seconds
- CPU stays above 50% when serving pages
- Touchscreen response delay >500ms
**Phase to address:** Phase 1 (UI/Display) — Choose the right backend technology before building UI
---
### Pitfall 2: Chromium Kiosk Mode Instability
**What goes wrong:**
Chromium exits kiosk mode after system updates, monitor power cycling, or resolution changes. The screen shows a window instead of fullscreen, breaking the dedicated display experience. Session restore prompts interrupt kiosk operation.
**Why it happens:**
- Debian/Raspbian updates replace Chromium with different versions that have different flag behaviors
- Wayland/Labwc (default on Bookworm) behaves differently than X11 for kiosk flags
- Monitor power on/off triggers display mode changes that exit kiosk
- Default session restore behavior shows popup dialogs
**How to avoid:**
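- Pin the Chromium version: `sudo apt-mark hold chromium-browser` (the package is named `chromium` on some releases; confirm with `apt list --installed | grep -i chromium`)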
- Use X11 instead of Wayland for kiosk stability (configure via raspi-config)
- Add flags to prevent session restore: `--disable-session-crashed-bubble --disable-infobars --noerrdialogs`
- Implement a watchdog script that restarts Chromium if it exits or enters wrong mode
- Test with monitor power cycling during development
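
A watchdog of the kind described above could look roughly like the following sketch. It only covers the "Chromium exited" case; detecting a browser that is still running but no longer fullscreen would need a compositor-specific check that is not shown. `supervise` and `KIOSK_CMD` are hypothetical names, the URL is an assumption, and the binary path and flags are the ones already mentioned in this section plus `--kiosk`.

```python
import logging
import subprocess
import time
from typing import Callable, List, Optional

# Assumed command line: adjust the binary path and URL for the actual install.
KIOSK_CMD = [
    "/usr/bin/chromium",
    "--kiosk",
    "--noerrdialogs",
    "--disable-infobars",
    "--disable-session-crashed-bubble",
    "http://localhost:8080",
]


def supervise(cmd: List[str],
              max_restarts: Optional[int] = None,
              min_delay_s: float = 2.0,
              max_delay_s: float = 60.0,
              healthy_after_s: float = 30.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run cmd and relaunch it whenever it exits; return the launch count."""
    launches = 0
    delay = min_delay_s
    while True:
        started = time.monotonic()
        try:
            subprocess.run(cmd, check=False)  # blocks until the browser exits
        except OSError as exc:
            logging.warning("could not start %s: %s", cmd[0], exc)
        launches += 1
        # A long run means the browser was healthy, so reset the backoff.
        if time.monotonic() - started >= healthy_after_s:
            delay = min_delay_s
        if max_restarts is not None and launches > max_restarts:
            return launches
        sleep(delay)
        # Crash loops back off exponentially instead of pegging the CPU.
        delay = min(delay * 2, max_delay_s)
```

Calling `supervise(KIOSK_CMD)` from a script started by the desktop session would keep the browser alive indefinitely. A systemd unit with `Restart=always` achieves much the same result without custom code.
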
**Warning signs:**
- `/usr/bin/chromium --version` shows different version after apt upgrade
- Kiosk appears as window after system reboot
- "Restore pages?" bubble appears on boot
**Phase to address:** Phase 1 (UI/Display) — Resolve kiosk stability before considering display "complete"
---
### Pitfall 3: SD Card Corruption from Data Logging
**What goes wrong:**
After weeks of operation, the system becomes read-only or fails to boot. All sensor data and configuration are lost. The RTU stops transmitting, creating data gaps in the rainfall record.
**Why it happens:**
- Continuous CSV writing to SD card causes wear
- Power interruptions during write operations corrupt the filesystem
- No write caching strategy — every sensor reading triggers a file sync
- Logs and temp data accumulate on SD card
**How to avoid:**
- Mount `/var/log` and `/tmp` as tmpfs (RAM disks)
- Write sensor data to memory buffer, flush to SD only periodically (e.g., every 5 minutes)
- Use SQLite with WAL mode for atomic writes instead of CSV append
- Implement proper shutdown button (hardware + software) to prevent power-loss during writes
- Consider USB SSD for data storage if available
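- Disable swap: `sudo dphys-swapfile swapoff` turns it off until the next reboot; on releases that use dphys-swapfile, also run `sudo systemctl disable dphys-swapfile` to keep it off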
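
The buffering and SQLite suggestions above combine naturally. The sketch below is one way to do it, under assumptions: the class name `BufferedReadingStore`, the `readings` table layout, and the 300-second default are illustrative, not taken from the project. Readings stay in RAM and are committed to a WAL-mode database in one transaction per flush.

```python
import sqlite3
import time
from typing import List, Tuple


class BufferedReadingStore:
    """Hold readings in RAM and commit them to SQLite in periodic batches."""

    def __init__(self, db_path: str, flush_interval_s: float = 300.0):
        self.conn = sqlite3.connect(db_path)
        # WAL keeps the main database file consistent if power fails mid-write.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA synchronous=NORMAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS readings "
            "(ts REAL NOT NULL, rainfall_mm REAL NOT NULL)"
        )
        self.conn.commit()
        self.flush_interval_s = flush_interval_s
        self._buffer: List[Tuple[float, float]] = []
        self._last_flush = time.monotonic()

    def add(self, ts: float, rainfall_mm: float) -> None:
        """Buffer one reading; flush only when the interval has elapsed."""
        self._buffer.append((ts, rainfall_mm))
        if time.monotonic() - self._last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self) -> int:
        """Write all buffered rows in one transaction; return rows written."""
        rows = list(self._buffer)
        if rows:
            with self.conn:  # commits on success, rolls back on error
                self.conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        # Rows are dropped from RAM only after the commit succeeded.
        del self._buffer[:len(rows)]
        self._last_flush = time.monotonic()
        return len(rows)

    def close(self) -> None:
        self.flush()
        self.conn.close()
```

The trade-off is explicit: a power cut can lose up to one flush interval of readings, in exchange for far fewer SD card writes. Calling `close()` from the shutdown-button handler narrows that window for planned shutdowns.
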
**Warning signs:**
- `dmesg | grep -i error` shows I/O errors
- Filesystem becomes read-only randomly
- Boot failures after power outage
**Phase to address:** Phase 2 (Data/CSV) — Data persistence strategy must be designed upfront
---
### Pitfall 4: Touchscreen Unresponsiveness on the 7-inch Display
**What goes wrong:**
Touches require repeated taps to register. There's noticeable lag between touch and UI response. The 7-inch official touchscreen feels "sluggish" compared to phone touch experience.
**Why it happens:**
- Official 7-inch touchscreen has ~33Hz polling rate (30-40ms between touch events)
- Web browser input handling adds additional latency
- GPU memory may be undersized for smooth rendering
- Xorg/Wayland compositor overhead on Pi Zero
**How to avoid:**
- Allocate sufficient GPU memory: `gpu_mem=128` in config.txt
- Use hardware-accelerated rendering where possible
- Design UI with large touch targets (minimum 48px, recommend 64px for primary actions)
- Add visual feedback for touches (immediate color change before action completes)
- Avoid rapid-fire touch interactions — design for deliberate touches
- Consider `UDEV=1` environment variable for better input handling
**Warning signs:**
- Single tap requires 2-3 attempts to register
- Slider/scroll gestures feel jerky
- UI feels "mushy" — no immediate feedback
**Phase to address:** Phase 1 (UI/Display) — Test touch responsiveness early, not as afterthought
---
### Pitfall 5: Network Transmission Failures Go Undetected
**What goes wrong:**
CSV files fail to transmit to myvscada server but no alert is generated. Data accumulates locally until storage fills. The monitoring station appears operational but isn't actually reporting.
**Why it happens:**
- FTP/SFTP connections fail due to network issues and the code never retries
- No local queue for failed transmissions
- No verification that the server actually received the file
- Transmission errors are written to the log but never surfaced in the UI
**How to avoid:**
- Implement transmission queue with retry logic (exponential backoff)
- Verify file receipt via server acknowledgment or file existence check
- Show transmission status prominently on dashboard (last successful sync, pending count)
- Implement dead letter queue — alert after N failed attempts
- Log all transmission attempts with timestamps and error codes
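
The queue, backoff, and dead-letter points above fit in a small class. This is a sketch under assumptions: `TransmitQueue`, `Job`, and the default limits are hypothetical, and the `send` callable stands in for the real FTP/SFTP upload. It is expected to return `True` only after the server has confirmed receipt.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    path: str
    attempts: int = 0
    next_try: float = 0.0  # clock value before which the job is skipped


class TransmitQueue:
    """Retry failed uploads with exponential backoff and a dead-letter list."""

    def __init__(self, send: Callable[[str], bool], max_attempts: int = 5,
                 base_delay_s: float = 30.0, max_delay_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.send = send
        self.max_attempts = max_attempts
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.clock = clock
        self.pending: List[Job] = []
        self.dead: List[str] = []  # files that need operator attention

    def enqueue(self, path: str) -> None:
        self.pending.append(Job(path))

    def run_once(self) -> None:
        """Attempt every job that is due; reschedule or dead-letter failures."""
        now = self.clock()
        still_pending: List[Job] = []
        for job in self.pending:
            if job.next_try > now:
                still_pending.append(job)
                continue
            job.attempts += 1
            try:
                ok = self.send(job.path)
            except Exception as exc:  # network errors count as a failed attempt
                logging.warning("attempt %d for %s failed: %s",
                                job.attempts, job.path, exc)
                ok = False
            if ok:
                continue
            if job.attempts >= self.max_attempts:
                self.dead.append(job.path)
            else:
                delay = min(self.base_delay_s * 2 ** (job.attempts - 1),
                            self.max_delay_s)
                job.next_try = now + delay
                still_pending.append(job)
        self.pending = still_pending
```

The dashboard can read `len(queue.pending)` for the pending count and raise an alert whenever `queue.dead` is non-empty. The queue here lives in memory only; persisting it (for example in the same SQLite database as the readings) would be needed to survive a reboot.
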
**Warning signs:**
- Dashboard shows "transmitting" but files never leave local storage
- Server reports missing data but RTU shows "success"
- Network logs show connection timeouts but no UI indication
**Phase to address:** Phase 3 (Network) — Build transmission verification before considering networking "done"
---
### Pitfall 6: Real-Time Data Staleness Without Notification
**What goes wrong:**
Dashboard shows rainfall readings that are hours old. User doesn't realize data hasn't updated. The RTU appears to work but sensor polling has stopped.
**Why it happens:**
- No watchdog for sensor polling process
- Web page uses initial data load only — no auto-refresh
- Backend fails silently, continues serving stale cached data
- No "last updated" timestamp displayed
**How to avoid:**
- Always display "last updated" timestamp prominently
- Implement WebSocket or Server-Sent Events for live updates
- If using polling, show "updating..." indicator and timeout after N seconds
- Add backend health check — if sensor reader hasn't updated in X minutes, show warning
- Implement process monitoring (systemd watchdog or custom health check)
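
The backend health check above reduces to comparing the newest reading's timestamp against a threshold. A minimal sketch follows; the function name `freshness`, the state labels, and the 5- and 15-minute thresholds are assumptions standing in for the "X minutes" left open above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness(last_update: Optional[datetime],
              now: Optional[datetime] = None,
              warn_after: timedelta = timedelta(minutes=5),
              stale_after: timedelta = timedelta(minutes=15)) -> dict:
    """Classify the newest reading so the UI can tell live from stale data.

    Both datetimes must be timezone-aware; mixing naive and aware values
    raises TypeError.
    """
    if last_update is None:
        return {"state": "unknown", "age_s": None}
    now = now or datetime.now(timezone.utc)
    age = now - last_update
    if age >= stale_after:
        state = "stale"
    elif age >= warn_after:
        state = "delayed"
    else:
        state = "live"
    return {
        "state": state,
        "age_s": int(age.total_seconds()),
        "last_updated": last_update.isoformat(),
    }
```

Serving this dictionary alongside every reading lets the page render the "last updated" timestamp and switch to a warning banner without any extra request.
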
**Warning signs:**
- Timestamps on dashboard don't change for extended periods
- Rainfall values don't match physical bucket tipping
- No indication of "live" vs "stale" data
**Phase to address:** Phase 1 (UI/Display) — Data freshness is a UX issue, not just backend
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Using Flask dev server | Simple startup | High CPU, slow response | Never in production |
| Writing CSV on every reading | Simple code | SD card wear, data loss risk | Never |
| HTTP polling for updates | Simple implementation | Wastes CPU, UI lag | Only if SSE/WebSocket unavailable |
| Hardcoded IP addresses | Quick setup | Breaks when network changes | Never — use DNS/hostname |
| No transmission retry | Simpler code | Silent data loss | Never for operational data |
| Single network interface | Simple config | No resilience | Only for non-critical displays |
---
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| FTP Server | Assuming passive mode works everywhere | Test both passive and active modes over the real network path, check firewall/NAT rules |
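| SFTP | Using default SSH ciphers (slow) | Benchmark ciphers on the device; most Pi models lack AES hardware acceleration, so a software-friendly cipher such as `chacha20-poly1305@openssh.com` is often faster |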
| myvscada Server | No authentication verification | Test credentials before production |
| Sensor Hardware | Polling too frequently | Respect sensor timing, buffer readings |
| Mobile Network | No reconnection logic | Implement connection watchdog |
---
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Large CSV files | Memory exhaustion, slow transmission | Chunk files, limit records per file | At >10MB files |
| Many concurrent browser tabs | RAM exhaustion | Limit connections, close unused | At 3+ tabs on Pi Zero |
| Animated UI elements | High CPU, battery drain | Minimize animations, use CSS transforms | Always on embedded |
| Heavy JavaScript framework | Slow load, high memory | Use vanilla JS or lightweight framework | Any framework >50KB |
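
For the "Large CSV files" row, limiting records per file can be sketched as below. The function name `write_chunked_csv`, the file-name pattern, and the 10,000-row default are assumptions for illustration; the right limit depends on row size and the link. Rows are consumed lazily, so a full export never has to sit in memory.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Sequence


def write_chunked_csv(rows: Iterable[Sequence], header: Sequence[str],
                      out_dir: str, prefix: str = "readings",
                      max_rows: int = 10_000) -> List[Path]:
    """Write rows into numbered CSV files of at most max_rows records each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    fh = None
    writer = None
    count = 0
    try:
        for row in rows:
            if fh is None or count >= max_rows:
                if fh is not None:
                    fh.close()
                path = out / f"{prefix}_{len(paths):04d}.csv"
                fh = path.open("w", newline="")
                writer = csv.writer(fh)
                writer.writerow(header)  # every chunk is readable on its own
                paths.append(path)
                count = 0
            writer.writerow(row)
            count += 1
    finally:
        if fh is not None:
            fh.close()
    return paths
```
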
---
## Security Mistakes
| Mistake | Risk | Prevention |
|---------|------|------------|
| No authentication on local port 8080 | Physical access = full control | Implement session auth, even for local |
| Plain FTP for data transmission | Credential theft | Use SFTP/SCP only |
| Exposed network ports without firewall | Remote exploitation | Firewall rules, minimal exposure |
| Storing passwords in plain text | Credential exposure | Use environment variables or secure storage |
| No input validation on settings | Command injection | Validate all inputs, sanitize before use |
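
For the last row, validation means accepting only values that match an expected shape before they reach a file, a URL, or a command line. The sketch below is illustrative: the setting names `server_host` and `upload_interval_s`, the 60 to 86400 second range, and the function name are all assumptions. Validated values should still be passed to `subprocess.run` as an argument list rather than interpolated into a shell string.

```python
import re

# One DNS label: letters, digits, inner hyphens, at most 63 characters.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_HOST_RE = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def validate_settings(raw: dict) -> dict:
    """Return cleaned settings, or raise ValueError naming the bad field."""
    host = str(raw.get("server_host", "")).strip()
    if len(host) > 253 or not _HOST_RE.fullmatch(host):
        raise ValueError("server_host must be a hostname or IPv4 address")
    try:
        interval = int(raw.get("upload_interval_s"))
    except (TypeError, ValueError):
        raise ValueError("upload_interval_s must be an integer") from None
    if not 60 <= interval <= 86_400:
        raise ValueError("upload_interval_s must be between 60 and 86400")
    return {"server_host": host, "upload_interval_s": interval}
```
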
---
## UX Pitfalls
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| No confirmation for destructive actions | Accidental reset of calibration/data | Require explicit confirmation dialogs |
| Settings changes apply immediately | Unintended side effects | Use "Preview" then "Apply" pattern |
| No visual feedback for touch | User double-taps, causes errors | Immediate visual feedback on touch (the official display has no haptics) |
| Error messages are technical | Non-technical users confused | User-friendly messages, offer solutions |
| No offline indication | User trusts data when network down | Clear "offline" banner, show last update |
---
## "Looks Done But Isn't" Checklist
- [ ] **Kiosk Mode:** Verified working after `apt upgrade && reboot` — not just on fresh install
- [ ] **Touchscreen:** Tested with actual finger touches, not mouse clicks; the ~33Hz polling rate only shows up with real touch input
- [ ] **Data Transmission:** Verified file actually arrives at server — not just "sent"
- [ ] **Data Freshness:** Dashboard shows "last updated" timestamp — not just current values
- [ ] **Power Loss:** System survives unexpected power cut — test by pulling plug
- [ ] **Remote Access:** Works from external network — not just localhost
- [ ] **Memory Usage:** Verified stable over 24hr run — no gradual growth
- [ ] **Temperature:** Verified to work in the expected environmental conditions
---
## Recovery Strategies
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| SD card corruption | HIGH | Requires physical access, reimage, restore backup |
| Kiosk exit | LOW | Watchdog script auto-restarts, or manual `sudo systemctl restart kiosk` |
| Transmission failure | MEDIUM | Check queue, retry manually, investigate root cause |
| Sensor stop | MEDIUM | Restart sensor polling service, check wiring |
| Network down | LOW | Show offline indicator, auto-reconnect with backoff |
---
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Flask performance | Phase 1: UI/Display | Benchmark first page load, CPU under load |
| Kiosk instability | Phase 1: UI/Display | Test after system updates |
| SD card corruption | Phase 2: Data/CSV | Power-loss test, check for I/O errors |
| Touchscreen lag | Phase 1: UI/Display | Physical touch testing |
| Transmission failures | Phase 3: Network | Monitor queue, verify server receipt |
| Stale data | Phase 1: UI/Display | Verify timestamps update |
| Network resilience | Phase 3: Network | Test with disconnected network |
---
## Sources
- Raspberry Pi Forums: Kiosk mode issues, Chromium autostart problems
- Raspberry Pi Stack Exchange: Flask performance on Pi Zero W
- GitHub Issue #3777 (raspberrypi/linux): 7" touchscreen polling rate at 33Hz
- Hackaday: "Raspberry Pi And The Story Of SD Card Corruption"
- pidiylab.com: SD card corruption prevention, performance tuning
- raspberrytips.com: Common Raspberry Pi problems and solutions
- XDA Developers: Common Raspberry Pi mistakes (2025)
- Community reports: SFTP slow speeds on Pi 3/4, Wayland kiosk issues
---
*Pitfalls research for: Raspberry Pi Web Monitoring Station RTU*
*Researched: 2026-03-12*