Pitfalls Research

Domain: Raspberry Pi Web Monitoring Station Interface
Researched: 2026-03-12
Confidence: HIGH

Critical Pitfalls

Pitfall 1: Flask Performance Degradation on Pi Zero 2 W

What goes wrong: Web server takes 60+ seconds to start, consumes 60-100% CPU continuously, causing severe UI lag and unresponsive touchscreen. Pages load slowly even for simple requests.

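Why it happens: The Pi Zero 2 W has a quad-core Arm Cortex-A53 at 1GHz (the single ARM11 core belongs to the original Pi Zero and Zero W), but only 512 MB of RAM, which the web server shares with Chromium when the kiosk runs on the same board. Flask's built-in development server is not meant for production or embedded deployment. Python's GIL keeps a single process from using the four cores in parallel, and heavy imports slow startup. The "quick start" approach doesn't account for these resource constraints.

How to avoid:

  • Use a production WSGI server. Memory, not core count, is the tighter limit here: each Gunicorn worker is a separate Python process, so keep the worker count low (1-2) and prefer a threaded or event-driven worker class over many processes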
  • Consider lighter alternatives: Python's built-in http.server for simple needs, or Node.js for better concurrency
  • Cache static assets aggressively
  • Implement server-sent events (SSE) instead of polling for real-time updates
  • Pre-compile templates, minimize Python imports at startup
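The SSE suggestion above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names `format_sse` and `sse_stream`, the event name `reading`, and the shape of the reading dictionaries are all assumptions made for the example.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Encode one message in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one "data:" field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def sse_stream(readings: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE message per reading (hypothetical reading dicts)."""
    for reading in readings:
        yield format_sse(reading, event="reading")
```

In a Flask view, a generator like this could be returned as `Response(sse_stream(source), mimetype="text/event-stream")`, and the page would subscribe with the browser's `EventSource` API. Each open stream holds a connection for its lifetime, so the server's worker model needs to allow for that.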

Warning signs:

  • First page load takes >30 seconds
  • CPU stays above 50% when serving pages
  • Touchscreen response delay >500ms

Phase to address: Phase 1 (UI/Display) — Choose the right backend technology before building UI


Pitfall 2: Chromium Kiosk Mode Instability

What goes wrong: Chromium exits kiosk mode after system updates, monitor power cycling, or resolution changes. The screen shows a window instead of fullscreen, breaking the dedicated display experience. Session restore prompts interrupt kiosk operation.

Why it happens:

  • Debian/Raspbian updates replace Chromium with different versions that have different flag behaviors
  • Wayland/Labwc (default on Bookworm) behaves differently than X11 for kiosk flags
  • Monitor power on/off triggers display mode changes that exit kiosk
  • Default session restore behavior shows popup dialogs

How to avoid:

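  • Pin the Chromium version with apt-mark hold. The package is named chromium-browser or chromium depending on the Raspberry Pi OS release, so check which one is installed first (dpkg -l | grep chromium), then run e.g. sudo apt-mark hold chromium-browser. A held package also stops receiving security updates, so plan deliberate, tested upgrades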
  • Use X11 instead of Wayland for kiosk stability (configure via raspi-config)
  • Add flags to prevent session restore: --disable-session-crashed-bubble --disable-infobars --noerrdialogs
  • Implement a watchdog script that restarts Chromium if it exits or enters wrong mode
  • Test with monitor power cycling during development
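A watchdog of the kind suggested above could look roughly like the following sketch. It only handles the "Chromium exited" case; detecting that the browser has dropped out of fullscreen would need a separate check. The binary name `chromium`, the dashboard URL, and the helper names `kiosk_command`, `supervise`, and `launch_chromium` are assumptions for illustration.

```python
import subprocess
import time
from typing import Callable, List, Optional

# Assumptions for this sketch: adjust to the installed binary and real URL.
CHROMIUM_BIN = "chromium"
DASHBOARD_URL = "http://localhost:8080"


def kiosk_command(binary: str = CHROMIUM_BIN, url: str = DASHBOARD_URL) -> List[str]:
    """Build the browser command line with the flags listed above."""
    return [
        binary,
        "--kiosk",
        "--noerrdialogs",
        "--disable-infobars",
        "--disable-session-crashed-bubble",
        url,
    ]


def launch_chromium() -> int:
    """Start Chromium and block until it exits; return its exit code."""
    return subprocess.call(kiosk_command())


def supervise(
    launch: Callable[[], int],
    max_starts: Optional[int] = None,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Relaunch the browser whenever it exits; return the number of starts.

    max_starts=None loops forever, which is what a deployed watchdog wants.
    The delay keeps a crash loop from saturating the CPU.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        launch()  # blocks for as long as the browser is running
        starts += 1
        sleep(delay_s)
    return starts
```

Calling `supervise(launch_chromium)` from the session autostart would keep the kiosk running. A systemd unit with `Restart=always` gives the same restart behavior without a custom loop and may be the simpler choice.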

Warning signs:

  • /usr/bin/chromium --version shows different version after apt upgrade
  • Kiosk appears as window after system reboot
  • "Restore pages?" bubble appears on boot

Phase to address: Phase 1 (UI/Display) — Resolve kiosk stability before considering display "complete"


Pitfall 3: SD Card Corruption from Data Logging

What goes wrong: After weeks of operation, the system becomes read-only or fails to boot. All sensor data and configuration is lost. The RTU stops transmitting, creating data gaps in the rainfall record.

Why it happens:

  • Continuous CSV writing to SD card causes wear
  • Power interruptions during write operations corrupt the filesystem
  • No write caching strategy — every sensor reading triggers a file sync
  • Logs and temp data accumulate on SD card

How to avoid:

  • Mount /var/log and /tmp as tmpfs (RAM disks)
  • Write sensor data to memory buffer, flush to SD only periodically (e.g., every 5 minutes)
  • Use SQLite with WAL mode for atomic writes instead of CSV append
  • Implement proper shutdown button (hardware + software) to prevent power-loss during writes
  • Consider USB SSD for data storage if available
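  • Disable swap: sudo dphys-swapfile swapoff turns it off only until the next reboot; on releases that use dphys-swapfile, also run sudo systemctl disable dphys-swapfile to make the change persistent. With 512 MB of RAM and Chromium running, confirm under load that the system does not run out of memory without swap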
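The buffering and SQLite points above can be combined as in this sketch. The table layout, the class name `BufferedReadingLog`, and the 300-second default are assumptions for illustration; the trade-off is that a power cut loses whatever is still in the memory buffer, so the interval sets the maximum data loss.

```python
import sqlite3
import time
from typing import Callable, List, Tuple


class BufferedReadingLog:
    """Collect readings in RAM and write them to SQLite in periodic batches."""

    def __init__(
        self,
        db_path: str,
        flush_interval_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._conn = sqlite3.connect(db_path)
        # WAL mode appends changes to a separate log file before checkpointing.
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS readings "
            "(ts REAL NOT NULL, rain_mm REAL NOT NULL)"
        )
        self._conn.commit()
        self._buffer: List[Tuple[float, float]] = []
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._last_flush = clock()

    def add(self, ts: float, rain_mm: float) -> None:
        """Buffer one reading; flush if the interval has elapsed."""
        self._buffer.append((ts, rain_mm))
        if self._clock() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self) -> int:
        """Write all buffered rows in one transaction; return the row count."""
        rows = len(self._buffer)
        if rows:
            with self._conn:  # commits on success, rolls back on error
                self._conn.executemany(
                    "INSERT INTO readings (ts, rain_mm) VALUES (?, ?)",
                    self._buffer,
                )
            self._buffer.clear()
        self._last_flush = self._clock()
        return rows

    def count(self) -> int:
        """Return the number of rows already persisted."""
        return self._conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Calling `flush()` from the shutdown-button handler before power-off would write out the remaining buffer.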

Warning signs:

  • dmesg | grep -i error shows I/O errors
  • Filesystem becomes read-only randomly
  • Boot failures after power outage
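The "filesystem becomes read-only" sign can be checked programmatically so the dashboard can raise an alert. The sketch below parses text in the `/proc/mounts` format; the function names are assumptions for illustration.

```python
from typing import Optional, Set


def mount_options(mounts_text: str, mount_point: str = "/") -> Optional[Set[str]]:
    """Return the mount options for mount_point, or None if it is not listed.

    Each /proc/mounts line is: device mountpoint fstype options dump pass.
    If a mount point appears more than once, the last entry wins.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))
    return found


def is_read_only(mounts_text: str, mount_point: str = "/") -> bool:
    """True if mount_point is currently mounted with the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options
```

On the device this would be called as `is_read_only(open("/proc/mounts").read())`, for example from the same health check that feeds the dashboard.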

Phase to address: Phase 2 (Data/CSV) — Data persistence strategy must be designed upfront


Pitfall 4: Touchscreen Unresponsiveness on the 7-inch Display

What goes wrong: Touches require repeated taps to register. There's noticeable lag between touch and UI response. The 7-inch official touchscreen feels "sluggish" compared to phone touch experience.

Why it happens:

  • Official 7-inch touchscreen has ~33Hz polling rate (30-40ms between touch events)
  • Web browser input handling adds additional latency
  • GPU memory may be undersized for smooth rendering
  • Xorg/Wayland compositor overhead on the Pi Zero 2 W

How to avoid:

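  • Allocate sufficient GPU memory: gpu_mem=128 in config.txt (/boot/firmware/config.txt on Bookworm). With the KMS graphics driver this setting may have little effect, so confirm any improvement by measurement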
  • Use hardware-accelerated rendering where possible
  • Design UI with large touch targets (minimum 48px, recommend 64px for primary actions)
  • Add visual feedback for touches (immediate color change before action completes)
  • Avoid rapid-fire touch interactions — design for deliberate touches
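  • If the UI runs inside a balena container, consider the UDEV=1 environment variable so input devices are visible in the container; it is not a stock Raspberry Pi OS setting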

Warning signs:

  • Single tap requires 2-3 attempts to register
  • Slider/scroll gestures feel jerky
  • UI feels "mushy" — no immediate feedback

Phase to address: Phase 1 (UI/Display) — Test touch responsiveness early, not as afterthought


Pitfall 5: Network Data Transmission Failures Silently Lost

What goes wrong: CSV files fail to transmit to myvscada server but no alert is generated. Data accumulates locally until storage fills. The monitoring station appears operational but isn't actually reporting.

Why it happens:

  • FTP/SFTP connections fail due to network issues but code doesn't retry aggressively
  • No local queue for failed transmissions
  • No verification that server actually received the file
  • Transmission errors are written only to a log file and never surfaced in the UI

How to avoid:

  • Implement transmission queue with retry logic (exponential backoff)
  • Verify file receipt via server acknowledgment or file existence check
  • Show transmission status prominently on dashboard (last successful sync, pending count)
  • Implement dead letter queue — alert after N failed attempts
  • Log all transmission attempts with timestamps and error codes
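The queue, backoff, and dead-letter points above fit together roughly as follows. This is a sketch under assumptions: `send` is a hypothetical callable that uploads one file and returns True only once the server has confirmed receipt, and the default delays and attempt limit are placeholders.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional


@dataclass
class PendingFile:
    path: str
    attempts: int = 0
    next_try: float = 0.0  # earliest time (seconds) for the next attempt


class TransmissionQueue:
    """Retry failed uploads with exponential backoff and a dead-letter list."""

    def __init__(
        self,
        send: Callable[[str], bool],
        max_attempts: int = 5,
        base_delay_s: float = 30.0,
        max_delay_s: float = 3600.0,
    ) -> None:
        self._send = send
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._max_delay_s = max_delay_s
        self.pending: Deque[PendingFile] = deque()
        self.dead_letter: List[str] = []  # files that need operator attention
        self.last_success: Optional[float] = None

    def enqueue(self, path: str) -> None:
        self.pending.append(PendingFile(path))

    def process(self, now: float) -> None:
        """Attempt every file that is due at time `now` exactly once."""
        for _ in range(len(self.pending)):
            item = self.pending.popleft()
            if item.next_try > now:
                self.pending.append(item)  # not due yet, keep waiting
                continue
            try:
                ok = self._send(item.path)
            except OSError:
                ok = False  # network errors count as a failed attempt
            if ok:
                self.last_success = now
                continue
            item.attempts += 1
            if item.attempts >= self._max_attempts:
                self.dead_letter.append(item.path)
                continue
            # Delay doubles after each failure: base, 2*base, 4*base, ...
            delay = min(
                self._base_delay_s * 2 ** (item.attempts - 1), self._max_delay_s
            )
            item.next_try = now + delay
            self.pending.append(item)
```

The dashboard can then show `last_success`, `len(pending)`, and `len(dead_letter)` directly, which covers the "last successful sync, pending count" status described above.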

Warning signs:

  • Dashboard shows "transmitting" but files never leave local storage
  • Server reports missing data but RTU shows "success"
  • Network logs show connection timeouts but no UI indication

Phase to address: Phase 3 (Network) — Build transmission verification before considering networking "done"


Pitfall 6: Real-Time Data Staleness Without Notification

What goes wrong: Dashboard shows rainfall readings that are hours old. User doesn't realize data hasn't updated. The RTU appears to work but sensor polling has stopped.

Why it happens:

  • No watchdog for sensor polling process
  • Web page uses initial data load only — no auto-refresh
  • Backend fails silently, continues serving stale cached data
  • No "last updated" timestamp displayed

How to avoid:

  • Always display "last updated" timestamp prominently
  • Implement WebSocket or Server-Sent Events for live updates
  • If using polling, show "updating..." indicator and timeout after N seconds
  • Add backend health check — if sensor reader hasn't updated in X minutes, show warning
  • Implement process monitoring (systemd watchdog or custom health check)
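The backend health check above reduces to comparing the age of the last reading against thresholds. In this sketch the function name `freshness`, the status strings, and the 5- and 15-minute defaults are assumptions standing in for the "X minutes" left open above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness(
    last_update: Optional[datetime],
    now: datetime,
    warn_after: timedelta = timedelta(minutes=5),
    stale_after: timedelta = timedelta(minutes=15),
) -> str:
    """Classify the latest sensor reading as live, warning, stale, or no-data."""
    if last_update is None:
        return "no-data"
    age = now - last_update
    if age >= stale_after:
        return "stale"
    if age >= warn_after:
        return "warning"
    return "live"


# Example: a reading taken 7 minutes ago is flagged but not yet stale.
_now = datetime(2026, 3, 12, 12, 0, tzinfo=timezone.utc)
example_status = freshness(_now - timedelta(minutes=7), _now)
```

Returning this status next to the "last updated" timestamp lets the page switch its banner without doing any time arithmetic in the browser.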

Warning signs:

  • Timestamps on dashboard don't change for extended periods
  • Rainfall values don't match physical bucket tipping
  • No indication of "live" vs "stale" data

Phase to address: Phase 1 (UI/Display) — Data freshness is a UX issue, not just backend


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Using Flask dev server | Simple startup | High CPU, slow response | Never in production |
| Writing CSV on every reading | Simple code | SD card wear, data loss risk | Never |
| HTTP polling for updates | Simple implementation | Wastes CPU, UI lag | Only if SSE/WebSocket unavailable |
| Hardcoded IP addresses | Quick setup | Breaks when network changes | Never; use DNS/hostname |
| No transmission retry | Simpler code | Silent data loss | Never for operational data |
| Single network interface | Simple config | No resilience | Only for non-critical displays |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| FTP Server | Assuming passive mode works everywhere | Test with active mode, check firewall rules |
| SFTP | Leaving cipher choice unexamined when transfers are slow | The SoC has no hardware AES acceleration; measure throughput and prefer chacha20-poly1305@openssh.com over AES ciphers |
| myvscada Server | No authentication verification | Test credentials before production |
| Sensor Hardware | Polling too frequently | Respect sensor timing, buffer readings |
| Mobile Network | No reconnection logic | Implement connection watchdog |

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| Large CSV files | Memory exhaustion, slow transmission | Chunk files, limit records per file | At >10MB files |
| Many concurrent browser tabs | RAM exhaustion | Limit connections, close unused | At 3+ tabs on Pi Zero 2 W |
| Animated UI elements | High CPU, battery drain | Minimize animations, use CSS transforms | Always on embedded |
| Heavy JavaScript framework | Slow load, high memory | Use vanilla JS or lightweight framework | Any framework >50KB |

Security Mistakes

| Mistake | Risk | Prevention |
| --- | --- | --- |
| No authentication on local port 8080 | Physical access = full control | Implement session auth, even for local |
| Plain FTP for data transmission | Credential theft | Use SFTP/SCP only |
| Exposed network ports without firewall | Remote exploitation | Firewall rules, minimal exposure |
| Storing passwords in plain text | Credential exposure | Use environment variables or secure storage |
| No input validation on settings | Command injection | Validate all inputs, sanitize before use |
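For the input-validation entry, settings such as a server hostname and port can be checked against a strict pattern before they are stored or used. The sketch below is illustrative; the function names are assumptions, and the hostname rule (letters, digits, and hyphens per label, at most 63 characters per label and 253 overall) is the conventional one rather than anything specified by this project.

```python
import re

# One DNS label: starts and ends with a letter or digit, hyphens allowed inside.
_LABEL_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def validate_hostname(value: str) -> str:
    """Return value unchanged if it is a plausible hostname, else raise."""
    labels = value.split(".")
    if not 1 <= len(value) <= 253 or not all(
        _LABEL_RE.fullmatch(label) for label in labels
    ):
        raise ValueError(f"invalid hostname: {value!r}")
    return value


def validate_port(value: str) -> int:
    """Parse a TCP port from user input; accept only 1-65535."""
    if not value.isascii() or not value.isdigit():
        raise ValueError(f"invalid port: {value!r}")
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port
```

Validated values should still be passed to external programs as an argument list (for example `subprocess.run(["ping", "-c", "1", host])`) rather than interpolated into a shell string, so that validation is not the only barrier.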

UX Pitfalls

| Pitfall | User Impact | Better Approach |
| --- | --- | --- |
| No confirmation for destructive actions | Accidental reset of calibration/data | Require explicit confirmation dialogs |
| Settings changes apply immediately | Unintended side effects | Use "Preview" then "Apply" pattern |
| No visual feedback for touch | User double-taps, causes errors | Immediate visual feedback (the official touchscreen has no haptics) |
| Error messages are technical | Non-technical users confused | User-friendly messages, offer solutions |
| No offline indication | User trusts data when network down | Clear "offline" banner, show last update |

"Looks Done But Isn't" Checklist

  • Kiosk Mode: Verified working after apt upgrade && reboot — not just on fresh install
  • Touchscreen: Tested with actual finger touches, not mouse clicks (the ~33Hz touch polling rate only shows up with real touches)
  • Data Transmission: Verified file actually arrives at server — not just "sent"
  • Data Freshness: Dashboard shows "last updated" timestamp — not just current values
  • Power Loss: System survives unexpected power cut — test by pulling plug
  • Remote Access: Works from external network — not just localhost
  • Memory Usage: Verified stable over 24hr run — no gradual growth
  • Temperature: Verified works in expected environmental conditions

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| SD card corruption | HIGH | Requires physical access, reimage, restore backup |
| Kiosk exit | LOW | Watchdog script auto-restarts, or manual sudo systemctl restart kiosk |
| Transmission failure | MEDIUM | Check queue, retry manually, investigate root cause |
| Sensor stop | MEDIUM | Restart sensor polling service, check wiring |
| Network down | LOW | Show offline indicator, auto-reconnect with backoff |

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| Flask performance | Phase 1: UI/Display | Benchmark first page load, CPU under load |
| Kiosk instability | Phase 1: UI/Display | Test after system updates |
| SD card corruption | Phase 2: Data/CSV | Power-loss test, check for I/O errors |
| Touchscreen lag | Phase 1: UI/Display | Physical touch testing |
| Transmission failures | Phase 3: Network | Monitor queue, verify server receipt |
| Stale data | Phase 1: UI/Display | Verify timestamps update |
| Network resilience | Phase 3: Network | Test with disconnected network |

Sources

  • Raspberry Pi Forums: Kiosk mode issues, Chromium autostart problems
  • Raspberry Pi Stack Exchange: Flask performance on Pi Zero W
  • GitHub Issue #3777 (raspberrypi/linux): 7" touchscreen polling rate at 33Hz
  • Hackaday: "Raspberry Pi And The Story Of SD Card Corruption"
  • pidiylab.com: SD card corruption prevention, performance tuning
  • raspberrytips.com: Common Raspberry Pi problems and solutions
  • XDA Developers: Common Raspberry Pi mistakes (2025)
  • Community reports: SFTP slow speeds on Pi 3/4, Wayland kiosk issues

Pitfalls research for: Raspberry Pi Web Monitoring Station RTU Researched: 2026-03-12