# SPORE Architecture & Implementation

## System Overview
SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.
## Core Components
The system architecture consists of several key components working together:
### Network Manager
- WiFi Connection Handling: Automatic WiFi STA/AP configuration
- Hostname Configuration: MAC-based hostname generation
- Fallback Management: Automatic access point creation if WiFi connection fails
### Cluster Manager
- Node Discovery: UDP-based automatic node detection
- Member List Management: Dynamic cluster membership tracking
- Health Monitoring: Continuous node status checking
- Resource Tracking: Monitor node resources and capabilities
### API Server
- HTTP API Server: RESTful API for cluster management
- Dynamic Endpoint Registration: Automatic API endpoint discovery
- Service Registry: Track available services across the cluster
### Task Scheduler
- Cooperative Multitasking: Background task management system (`TaskManager`)
- Task Lifecycle Management: Enable/disable tasks and set intervals at runtime
- Execution Model: Tasks run in `Spore::loop()` when their interval elapses
### Node Context
- Central Context: Shared resources and configuration
- Event System: Local and cluster-wide event publishing/subscription
- Resource Management: Centralized resource allocation and monitoring
## Auto Discovery Protocol
The cluster uses a UDP-based discovery protocol for automatic node detection:
### Discovery Process
- Discovery Broadcast: Nodes periodically send UDP packets on port `udp_port` (default 4210)
- Response Handling: Nodes respond with `CLUSTER_RESPONSE:<hostname>`
- Member Management: Discovered nodes are added/updated in the cluster
- Node Info via UDP: Heartbeat triggers peers to send `CLUSTER_NODE_INFO:<hostname>:<json>`
### Protocol Details
- UDP Port: 4210 (configurable via `Config.udp_port`)
- Discovery Message: `CLUSTER_DISCOVERY`
- Response Message: `CLUSTER_RESPONSE`
- Heartbeat Message: `CLUSTER_HEARTBEAT`
- Node Info Message: `CLUSTER_NODE_INFO:<hostname>:<json>`
- Broadcast Address: 255.255.255.255
- Discovery Interval: `Config.discovery_interval_ms` (default 1000 ms)
- Listen Interval: `Config.discovery_interval_ms / 10` (default 100 ms)
- Heartbeat Interval: `Config.heartbeat_interval_ms` (default 5000 ms)
### Node Status Categories
Nodes are automatically categorized by their activity:
- ACTIVE: lastSeen age < `node_inactive_threshold_ms` (default 10 s)
- INACTIVE: lastSeen age < `node_dead_threshold_ms` (default 120 s)
- DEAD: lastSeen age ≥ `node_dead_threshold_ms`
## Task Scheduling System
The system runs several background tasks at different intervals:
### Core System Tasks
| Task | Interval (default) | Purpose |
|---|---|---|
| `discovery_send` | 1000 ms | Send UDP discovery packets |
| `cluster_listen` | 100 ms | Listen for discovery/heartbeat/node-info |
| `status_update` | 1000 ms | Update node status categories, purge dead |
| `heartbeat` | 5000 ms | Broadcast heartbeat and update local resources |
| `update_members_info` | 10000 ms | Reserved; no-op (info via UDP) |
| `print_members` | 5000 ms | Log current member list |
### Task Management Features
- Dynamic Intervals: Change execution frequency on-the-fly
- Runtime Control: Enable/disable tasks without restart
- Status Monitoring: Real-time task health tracking
- Resource Integration: View task status with system resources
## Event System
The `NodeContext` provides an event-driven architecture for system-wide communication:
### Event Subscription
```cpp
// Subscribe to events
ctx.on("node_discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster_updated", [](void* data) {
    // Handle cluster membership changes
});
```
### Event Publishing
```cpp
// Publish events
ctx.fire("node_discovered", &newNode);
ctx.fire("cluster_updated", &clusterData);
```
### Available Events
- `node_discovered`: New node added or local node refreshed
## Resource Monitoring
Each node tracks comprehensive system resources:
### System Resources
- Free Heap Memory: Available RAM in bytes
- Chip ID: Unique ESP8266 identifier
- SDK Version: ESP8266 firmware version
- CPU Frequency: Operating frequency in MHz
- Flash Chip Size: Total flash storage in bytes
### API Endpoint Registry
- Dynamic Discovery: Automatically detect available endpoints
- Method Information: HTTP method (GET, POST, etc.)
- Service Catalog: Complete service registry across cluster
### Health Metrics
- Response Time: API response latency
- Uptime: System uptime in milliseconds
- Connection Status: Network connectivity health
- Resource Utilization: Memory and CPU usage
## WiFi Fallback System
The system includes automatic WiFi fallback for robust operation:
### Fallback Process
- Primary Connection: Attempts to connect to configured WiFi network
- Connection Failure: If connection fails, creates an access point
- Hostname Generation: Automatically generates hostname from MAC address
- Service Continuity: Maintains cluster functionality in fallback mode
### Configuration
- Hostname: Derived from the MAC address (`esp-<mac>`) and assigned to `ctx.hostname`
- AP Mode: If the STA connection fails, the device switches to AP mode with the configured SSID/password
## Cluster Topology
### Node Types
- Master Node: Primary cluster coordinator (if applicable)
- Worker Nodes: Standard cluster members
- Edge Nodes: Network edge devices
### Network Architecture
- UDP broadcast-based discovery and heartbeats on local subnet
- Optional HTTP polling (disabled by default; node info exchanged via UDP)
## Data Flow
### Node Discovery
- UDP Broadcast: Nodes broadcast discovery packets on port 4210
- UDP Response: Receiving nodes respond with hostname
- Registration: Discovered nodes are added to local cluster member list
### Health Monitoring
- Periodic Checks: Cluster manager updates node status categories
- Status Collection: Each node updates resources via UDP node-info messages
### Task Management
- Scheduling: `TaskManager` executes registered tasks at configured intervals
- Execution: Tasks run cooperatively in the main loop without preemption
- Monitoring: Task status is exposed via REST (`/api/tasks/status`)
## Performance Characteristics
### Memory Usage
- Base System: ~15-20KB RAM (device dependent)
- Per Task: ~100-200 bytes per task
- Cluster Members: ~50-100 bytes per member
- API Endpoints: ~20-30 bytes per endpoint
### Network Overhead
- Discovery Packets: 64 bytes every 1 second
- Health Checks: ~200-500 bytes every 1 second
- Status Updates: ~1-2KB per node
- API Responses: Varies by endpoint (typically 100B-5KB)
### Processing Overhead
- Task Execution: Minimal overhead per task
- Event Processing: Fast event dispatch
- JSON Parsing: Efficient ArduinoJson usage
- Network I/O: Asynchronous operations
## Security Considerations
### Current Implementation
- Network Access: Local network only (no internet exposure)
- Authentication: None currently implemented; LAN-only access assumed
- Data Validation: Basic input validation
- Resource Limits: Memory and processing constraints
### Future Enhancements
- TLS/SSL: Encrypted communications
- API Keys: Authentication for API access
- Access Control: Role-based permissions
- Audit Logging: Security event tracking
## Scalability
### Cluster Size Limits
- Theoretical: ~254 nodes (usable host addresses on a /24 subnet)
- Practical: 20-50 nodes for optimal performance
- Memory Constraint: ~8KB available for member tracking
- Network Constraint: UDP packet size limits
### Performance Scaling
- Linear Scaling: Most operations scale linearly with node count
- Discovery Overhead: Increases with cluster size
- Health Monitoring: Parallel HTTP requests (when HTTP polling is enabled)
- Task Management: Independent per-node execution
## Configuration Management
### Environment Variables

```sh
# API node IP for cluster management
export API_NODE=192.168.1.100
```
### PlatformIO Configuration
The project uses PlatformIO with the following configuration:
- Framework: Arduino
- Board: ESP-01 with 1MB flash
- Upload Speed: 115200 baud
- Flash Mode: DOUT (required for ESP-01S)
### Dependencies
The project requires the following libraries:
- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking
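Pulling the board settings and dependency list together, a `platformio.ini` consistent with this document might look as follows. This is a sketch, not the project's actual file: the `[env:esp01_1m]` name is an assumption based on the `ctl.sh` target name, and `platform = espressif8266` is the standard PlatformIO platform for this chip.

```ini
; Sketch of a platformio.ini matching the settings described above.
[env:esp01_1m]
platform = espressif8266
framework = arduino
board = esp01_1m                 ; ESP-01 with 1MB flash
upload_speed = 115200
board_build.flash_mode = dout    ; DOUT, required for ESP-01S
lib_deps =
    esp32async/ESPAsyncWebServer@^3.8.0
    bblanchon/ArduinoJson@^7.4.2
    arkhipenko/TaskScheduler@^3.8.5
```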
## Development Workflow
### Building
Build the firmware for a specific target:

```sh
./ctl.sh build target esp01_1m
```
### Flashing
Flash firmware to a connected device:

```sh
./ctl.sh flash target esp01_1m
```
### Over-The-Air Updates
Update a specific node:

```sh
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```sh
./ctl.sh ota all esp01_1m
```
### Cluster Management
View cluster members:

```sh
./ctl.sh cluster members
```
## Troubleshooting
### Common Issues
- Discovery Failures: Check UDP port 4210 is not blocked
- WiFi Connection: Verify SSID/password in `Config.cpp`
- OTA Updates: Ensure sufficient flash space (1MB minimum)
- Cluster Split: Check network connectivity between nodes
### Debug Output
Enable serial monitoring to see cluster activity:

```sh
pio device monitor
```
### Performance Monitoring
- Memory Usage: Monitor free heap with `/api/node/status`
- Task Health: Check task status with `/api/tasks/status`
- Cluster Health: Monitor member status with `/api/cluster/members`
- Network Latency: Track response times in cluster data
## Related Documentation
- Task Management - Background task system
- API Reference - REST API documentation
- TaskManager API - TaskManager class reference
- OpenAPI Specification - Machine-readable API specification