
SPORE Architecture & Implementation

System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

Core Components

The system architecture consists of several key components working together:

Network Manager

  • WiFi Connection Handling: Automatic WiFi STA/AP configuration
  • Hostname Configuration: MAC-based hostname generation
  • Fallback Management: Automatic access point creation if WiFi connection fails

Cluster Manager

  • Node Discovery: UDP-based automatic node detection
  • Member List Management: Dynamic cluster membership tracking
  • Health Monitoring: Continuous node status checking
  • Resource Tracking: Monitor node resources and capabilities

API Server

  • HTTP API Server: RESTful API for cluster management
  • Dynamic Endpoint Registration: Services register endpoints via registerEndpoints(ApiServer&)
  • Service Registry: Track available services across the cluster
  • Service Lifecycle: Services register both endpoints and tasks through unified interface

Task Scheduler

  • Cooperative Multitasking: Background task management system (TaskManager)
  • Service Task Registration: Services register tasks via registerTasks(TaskManager&)
  • Task Lifecycle Management: Enable/disable tasks and set intervals at runtime
  • Execution Model: Tasks run in Spore::loop() when their interval elapses
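
The unified service interface can be illustrated with a short sketch. The registerEndpoints(ApiServer&) and registerTasks(TaskManager&) hooks are the ones named above; the service class, the api.on(...) and tasks.addTask(...) calls, and all other names are assumptions for illustration, not the actual SPORE API.

// Hypothetical service; only the two register* hooks come from this document.
class SensorService {
public:
    void registerEndpoints(ApiServer& api) {
        // Assumed signature: HTTP method, path, handler returning a body.
        api.on("GET", "/api/sensor/value", [this]() { return String(lastValue_); });
    }

    void registerTasks(TaskManager& tasks) {
        // Assumed signature: task name, interval in ms, callback.
        tasks.addTask("sensor_sample", 1000, [this]() { sample(); });
    }

private:
    void sample() { lastValue_ += 1.0f; }  // placeholder for real sampling
    float lastValue_ = 0.0f;
};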

Node Context

  • Central Context: Shared resources and configuration
  • Event System: Local and cluster-wide event publishing/subscription
  • Resource Management: Centralized resource allocation and monitoring

Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

Discovery Process

  1. Discovery Broadcast: Nodes periodically broadcast CLUSTER_DISCOVERY packets to UDP port udp_port (default 4210)
  2. Response Handling: Nodes respond with CLUSTER_RESPONSE:<hostname>
  3. Member Management: Discovered nodes are added/updated in the cluster
  4. Node Info via UDP: Heartbeat triggers peers to send NODE_UPDATE:<hostname>:<json>
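
A minimal sketch of steps 1-2 using the ESP8266 Arduino WiFiUDP API. The message strings, broadcast address, and port follow the protocol described here; the function names and buffer handling are illustrative only.

#include <ESP8266WiFi.h>
#include <WiFiUdp.h>

WiFiUDP udp;
const uint16_t kUdpPort = 4210;          // Config.udp_port default

void setupDiscovery() {
    udp.begin(kUdpPort);                 // listen for cluster traffic
}

void broadcastDiscovery() {
    // Step 1: broadcast CLUSTER_DISCOVERY on the local subnet.
    udp.beginPacket(IPAddress(255, 255, 255, 255), kUdpPort);
    udp.print("CLUSTER_DISCOVERY");
    udp.endPacket();
}

void answerDiscovery() {
    // Step 2: reply to a discovery packet with CLUSTER_RESPONSE:<hostname>.
    if (udp.parsePacket() == 0) return;
    char buf[64];
    int len = udp.read(buf, sizeof(buf) - 1);
    if (len <= 0) return;
    buf[len] = '\0';
    if (strcmp(buf, "CLUSTER_DISCOVERY") == 0) {
        udp.beginPacket(udp.remoteIP(), udp.remotePort());
        udp.print(String("CLUSTER_RESPONSE:") + WiFi.hostname());
        udp.endPacket();
    }
}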

Protocol Details

  • UDP Port: 4210 (configurable via Config.udp_port)
  • Discovery Message: CLUSTER_DISCOVERY
  • Response Message: CLUSTER_RESPONSE
  • Heartbeat Message: CLUSTER_HEARTBEAT:hostname
  • Node Update Message: NODE_UPDATE:hostname:{json}
  • Broadcast Address: 255.255.255.255
  • Listen Interval: Config.cluster_listen_interval_ms (default 10 ms)
  • Heartbeat Interval: Config.heartbeat_interval_ms (default 5000 ms)

Message Formats

  • Heartbeat: CLUSTER_HEARTBEAT:hostname
    • Sender: each node, broadcast to 255.255.255.255:udp_port on interval
    • Purpose: announce presence, prompt peers for node info, and keep liveness
  • Node Update: NODE_UPDATE:hostname:{json}
    • Sender: node receiving a heartbeat; unicast to heartbeat sender IP
    • JSON fields: hostname, ip, uptime, optional labels
    • Purpose: provide minimal node information in response to heartbeat

Heartbeat Flow

  1. A node broadcasts CLUSTER_HEARTBEAT:hostname
  2. Each receiver responds with NODE_UPDATE:hostname:{json} to the heartbeat sender IP
  3. The sender:
    • Ensures the node exists or creates it with hostname and sender IP
    • Parses JSON and updates node info, status = ACTIVE, lastSeen = now
    • Sets latency = now - lastHeartbeatSentAt (per-node, measured at heartbeat origin)
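
The receiver side of step 2 can be sketched with ArduinoJson (version 7, as listed under Dependencies). The message prefix and JSON fields mirror the formats above; the function name and the fixed port comment are illustrative.

#include <ArduinoJson.h>
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>

extern WiFiUDP udp;   // cluster UDP socket, already bound to Config.udp_port

// Build and unicast NODE_UPDATE:<hostname>:{json} to the heartbeat sender.
void replyWithNodeUpdate() {
    JsonDocument doc;                         // ArduinoJson 7 document
    doc["hostname"] = WiFi.hostname();
    doc["ip"]       = WiFi.localIP().toString();
    doc["uptime"]   = millis();

    String payload;
    serializeJson(doc, payload);

    udp.beginPacket(udp.remoteIP(), 4210 /* Config.udp_port */);
    udp.print(String("NODE_UPDATE:") + WiFi.hostname() + ":" + payload);
    udp.endPacket();
}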

Node Update Broadcasting

  1. Periodic broadcast: Each node broadcasts NODE_UPDATE:hostname:{json} every 5 seconds
  2. All receivers: Update their memberlist entry for the broadcasting node
  3. Purpose: Ensures all nodes have current information about each other

Listener Behavior

The cluster_listen task parses one UDP packet per run and dispatches by prefix to:

  • Heartbeat → add/update node and send NODE_UPDATE JSON response
  • Node Update → update node information and status

Timing and Intervals

  • UDP Port: Config.udp_port (default 4210)
  • Listen Interval: Config.cluster_listen_interval_ms (default 10 ms)
  • Heartbeat Interval: Config.heartbeat_interval_ms (default 5000 ms)

Node Status Categories

Nodes are automatically categorized by their activity:

  • ACTIVE: lastSeen age < node_inactive_threshold_ms (default 10 s)
  • INACTIVE: lastSeen age ≥ node_inactive_threshold_ms and < node_dead_threshold_ms (default 120 s)
  • DEAD: lastSeen age ≥ node_dead_threshold_ms
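
The categorization reduces to comparing the age of lastSeen against the two thresholds. The enum and function below are illustrative; the default values match the configuration fields above.

enum class NodeStatus { ACTIVE, INACTIVE, DEAD };

NodeStatus classify(unsigned long lastSeenMs, unsigned long nowMs,
                    unsigned long inactiveMs = 10000,    // node_inactive_threshold_ms
                    unsigned long deadMs     = 120000) { // node_dead_threshold_ms
    unsigned long age = nowMs - lastSeenMs;
    if (age < inactiveMs) return NodeStatus::ACTIVE;
    if (age < deadMs)     return NodeStatus::INACTIVE;
    return NodeStatus::DEAD;
}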

Task Scheduling System

The system runs several background tasks at different intervals:

Core System Tasks

| Task | Interval (default) | Purpose |
|---|---|---|
| cluster_listen | 10 ms | Listen for heartbeat/node-info messages |
| status_update | 1000 ms | Update node status categories, purge dead nodes |
| heartbeat | 5000 ms | Broadcast heartbeat and update local resources |

Task Management Features

  • Dynamic Intervals: Change execution frequency on-the-fly
  • Runtime Control: Enable/disable tasks without restart
  • Status Monitoring: Real-time task health tracking
  • Resource Integration: View task status with system resources
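
As a usage sketch of these features, the call below slows the heartbeat task at runtime. The find/setInterval/setEnabled names and the task handle are hypothetical, chosen only to illustrate runtime control; the real TaskManager API may differ.

// Hypothetical runtime control; names are illustrative, not the SPORE API.
void throttleHeartbeat(TaskManager& tasks, bool lowPower) {
    auto* task = tasks.find("heartbeat");        // assumed lookup by task name
    if (task == nullptr) return;
    task->setInterval(lowPower ? 15000 : 5000);  // change frequency on the fly
    task->setEnabled(true);                      // tasks can also be disabled without restart
}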

Event System

The NodeContext provides an event-driven architecture for system-wide communication:

Event Subscription

// Subscribe to events
ctx.on("node/discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster/updated", [](void* data) {
    // Handle cluster membership changes
});

Event Publishing

// Publish events
ctx.fire("node/discovered", &newNode);
ctx.fire("cluster/updated", &clusterData);

Available Events

  • node/discovered: New node added or local node refreshed

Resource Monitoring

Each node tracks comprehensive system resources:

System Resources

  • Free Heap Memory: Available RAM in bytes
  • Chip ID: Unique ESP8266 identifier
  • SDK Version: ESP8266 firmware version
  • CPU Frequency: Operating frequency in MHz
  • Flash Chip Size: Total flash storage in bytes
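
These values map directly onto the ESP8266 Arduino core's ESP object. The snippet below collects them into an ArduinoJson document; the JSON field names are illustrative.

#include <Arduino.h>
#include <ArduinoJson.h>

// Gather the system resources listed above.
void collectResources(JsonDocument& doc) {
    doc["free_heap"]   = ESP.getFreeHeap();       // available RAM in bytes
    doc["chip_id"]     = ESP.getChipId();         // unique ESP8266 identifier
    doc["sdk_version"] = ESP.getSdkVersion();     // SDK version string
    doc["cpu_mhz"]     = ESP.getCpuFreqMHz();     // operating frequency in MHz
    doc["flash_size"]  = ESP.getFlashChipSize();  // total flash in bytes
}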

API Endpoint Registry

  • Dynamic Discovery: Automatically detect available endpoints
  • Method Information: HTTP method (GET, POST, etc.)
  • Service Catalog: Complete service registry across cluster

Health Metrics

  • Response Time: API response latency
  • Uptime: System uptime in milliseconds
  • Connection Status: Network connectivity health
  • Resource Utilization: Memory and CPU usage

WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

Fallback Process

  1. Primary Connection: Attempts to connect to configured WiFi network
  2. Connection Failure: If connection fails, creates an access point
  3. Hostname Generation: Automatically generates hostname from MAC address
  4. Service Continuity: Maintains cluster functionality in fallback mode

Configuration

  • Hostname: Derived from MAC (esp-<mac>) and assigned to ctx.hostname
  • AP Mode: If STA connection fails, device switches to AP mode with configured SSID/password
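
A condensed sketch of the fallback sequence using standard ESP8266WiFi calls. The 15-second timeout, colon stripping, and function layout are assumptions; only the STA-then-AP behavior and the esp-<mac> hostname scheme come from this document.

#include <ESP8266WiFi.h>

// Try STA first; if the connection never comes up, open an access point.
String connectOrFallback(const char* ssid, const char* pass,
                         const char* apSsid, const char* apPass) {
    String host = String("esp-") + WiFi.macAddress();  // MAC-derived hostname
    host.replace(":", "");                             // illustrative normalization
    WiFi.hostname(host);

    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, pass);

    unsigned long start = millis();
    while (WiFi.status() != WL_CONNECTED && millis() - start < 15000) {
        delay(250);                                    // illustrative 15 s timeout
    }

    if (WiFi.status() != WL_CONNECTED) {
        WiFi.mode(WIFI_AP);                            // fallback access point
        WiFi.softAP(apSsid, apPass);
    }
    return host;                                       // becomes ctx.hostname
}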

Cluster Topology

Node Types

  • Master Node: Primary cluster coordinator (if applicable)
  • Worker Nodes: Standard cluster members
  • Edge Nodes: Network edge devices

Network Architecture

  • UDP broadcast-based discovery and heartbeats on local subnet
  • Optional HTTP polling (disabled by default; node info exchanged via UDP)

Data Flow

Node Discovery

  1. UDP Broadcast: Nodes broadcast discovery packets on port 4210
  2. UDP Response: Receiving nodes respond with hostname
  3. Registration: Discovered nodes are added to local cluster member list

Health Monitoring

  1. Periodic Checks: Cluster manager updates node status categories
  2. Status Collection: Each node updates resources via UDP node-info messages

Task Management

  1. Scheduling: TaskManager executes registered tasks at configured intervals
  2. Execution: Tasks run cooperatively in the main loop without preemption
  3. Monitoring: Task status is exposed via REST (/api/tasks/status)
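
To make the execution model concrete, here is a minimal cooperative loop in the spirit of the description above. It is not the actual TaskManager internals: each task remembers its interval and last run time, and the loop runs whichever tasks are due, one after another, without preemption.

#include <cstddef>

struct Task {
    const char* name;
    unsigned long intervalMs;
    unsigned long lastRunMs = 0;
    bool enabled = true;
    void (*run)();
};

// Called from the main loop; runs each due task once per pass.
void runDueTasks(Task* tasks, size_t count, unsigned long nowMs) {
    for (size_t i = 0; i < count; ++i) {
        Task& t = tasks[i];
        if (t.enabled && nowMs - t.lastRunMs >= t.intervalMs) {
            t.lastRunMs = nowMs;
            t.run();
        }
    }
}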

Performance Characteristics

Memory Usage

  • Base System: ~15-20KB RAM (device dependent)
  • Per Task: ~100-200 bytes per task
  • Cluster Members: ~50-100 bytes per member
  • API Endpoints: ~20-30 bytes per endpoint

Network Overhead

  • Discovery Packets: 64 bytes every 1 second
  • Health Checks: ~200-500 bytes every 1 second
  • Status Updates: ~1-2KB per node
  • API Responses: Varies by endpoint (typically 100B-5KB)

Processing Overhead

  • Task Execution: Minimal overhead per task
  • Event Processing: Fast event dispatch
  • JSON Parsing: Efficient ArduinoJson usage
  • Network I/O: Asynchronous operations

Security Considerations

Current Implementation

  • Network Access: Local network only (no internet exposure)
  • Authentication: None currently implemented; LAN-only access assumed
  • Data Validation: Basic input validation
  • Resource Limits: Memory and processing constraints

Future Enhancements

  • TLS/SSL: Encrypted communications
  • API Keys: Authentication for API access
  • Access Control: Role-based permissions
  • Audit Logging: Security event tracking

Scalability

Cluster Size Limits

  • Theoretical: Up to ~254 nodes (usable host addresses on a /24 subnet)
  • Practical: 20-50 nodes for optimal performance
  • Memory Constraint: ~8KB available for member tracking
  • Network Constraint: UDP packet size limits

Performance Scaling

  • Linear Scaling: Most operations scale linearly with node count
  • Discovery Overhead: Increases with cluster size
  • Health Monitoring: UDP heartbeat/node-info exchange; optional HTTP polling adds per-node requests
  • Task Management: Independent per-node execution

Configuration Management

SPORE implements a persistent configuration system that manages device settings across reboots and provides runtime reconfiguration capabilities.

Configuration Architecture

The configuration system consists of several key components:

  • Config Class: Central configuration management with default constants
  • LittleFS Storage: Persistent file-based storage (/config.json)
  • Runtime Updates: Live configuration changes via HTTP API
  • Automatic Persistence: Configuration changes are automatically saved

Configuration Categories

| Category | Description | Examples |
|---|---|---|
| WiFi Configuration | Network connection settings | SSID, password, timeouts |
| Network Configuration | Network service settings | UDP port, API server port |
| Cluster Configuration | Cluster management settings | Discovery intervals, heartbeat timing |
| Node Status Thresholds | Health monitoring thresholds | Active/inactive/dead timeouts |
| System Configuration | Core system settings | Restart delay, JSON document size |
| Memory Management | Resource management settings | Memory thresholds, HTTP request limits |

Configuration Lifecycle

  1. Boot Process: Load configuration from /config.json or use defaults
  2. Runtime Updates: Configuration changes via HTTP API
  3. Persistent Storage: Changes automatically saved to LittleFS
  4. Service Integration: Configuration applied to all system services
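
A sketch of step 1, assuming LittleFS and ArduinoJson 7. Only values documented elsewhere in this file (udp_port, heartbeat_interval_ms) are shown; the JSON key names and the Config member layout are assumptions.

#include <ArduinoJson.h>
#include <LittleFS.h>

// Load /config.json if it exists; otherwise the compile-time defaults stay in effect.
void loadConfig(Config& cfg) {
    if (!LittleFS.begin()) return;                 // keep defaults on FS failure

    File f = LittleFS.open("/config.json", "r");
    if (!f) return;                                // no file yet: defaults apply

    JsonDocument doc;
    if (deserializeJson(doc, f) == DeserializationError::Ok) {
        // Override only the values present in the file.
        cfg.udp_port              = doc["udp_port"]              | cfg.udp_port;
        cfg.heartbeat_interval_ms = doc["heartbeat_interval_ms"] | cfg.heartbeat_interval_ms;
    }
    f.close();
}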

Default Value Management

All default values are defined as constexpr constants in the Config class:

static constexpr const char* DEFAULT_WIFI_SSID = "shroud";
static constexpr uint16_t DEFAULT_UDP_PORT = 4210;
static constexpr unsigned long DEFAULT_HEARTBEAT_INTERVAL_MS = 5000;

This ensures:

  • Single Source of Truth: All defaults defined once
  • Type Safety: Compile-time type checking
  • Maintainability: Easy to update default values
  • Consistency: Same defaults used in setDefaults() and loadFromFile()

Environment Variables

# API node IP for cluster management
export API_NODE=192.168.1.100

PlatformIO Configuration

The project uses PlatformIO with the following configuration:

  • Framework: Arduino
  • Board: ESP-01 with 1MB flash
  • Upload Speed: 115200 baud
  • Flash Mode: DOUT (required for ESP-01S)

Dependencies

The project requires the following libraries:

  • esp32async/ESPAsyncWebServer@^3.8.0 - HTTP API server
  • bblanchon/ArduinoJson@^7.4.2 - JSON processing
  • arkhipenko/TaskScheduler@^3.8.5 - Cooperative multitasking

Development Workflow

Building

Build the firmware for specific chip:

./ctl.sh build target esp01_1m

Flashing

Flash firmware to a connected device:

./ctl.sh flash target esp01_1m

Over-The-Air Updates

Update a specific node:

./ctl.sh ota update 192.168.1.100 esp01_1m

Update all nodes in the cluster:

./ctl.sh ota all esp01_1m

Cluster Management

View cluster members:

./ctl.sh cluster members

Troubleshooting

Common Issues

  1. Discovery Failures: Check UDP port 4210 is not blocked
  2. WiFi Connection: Verify SSID/password in Config.cpp
  3. OTA Updates: Ensure sufficient flash space (1MB minimum)
  4. Cluster Split: Check network connectivity between nodes

Debug Output

Enable serial monitoring to see cluster activity:

pio device monitor

Performance Monitoring

  • Memory Usage: Monitor free heap with /api/node/status
  • Task Health: Check task status with /api/tasks/status
  • Cluster Health: Monitor member status with /api/cluster/members
  • Network Latency: Track response times in cluster data