# SPORE Architecture & Implementation

## System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

## Core Components

The system architecture consists of several key components working together:

### Network Manager

- **WiFi Connection Handling**: Automatic WiFi STA/AP configuration
- **Hostname Configuration**: MAC-based hostname generation
- **Fallback Management**: Automatic access point creation if the WiFi connection fails

### Cluster Manager

- **Node Discovery**: UDP-based automatic node detection
- **Member List Management**: Dynamic cluster membership tracking
- **Health Monitoring**: Continuous node status checking
- **Resource Tracking**: Monitor node resources and capabilities

### API Server

- **HTTP API Server**: RESTful API for cluster management
- **Dynamic Endpoint Registration**: Services register endpoints via `registerEndpoints(ApiServer&)`
- **Service Registry**: Track available services across the cluster
- **Service Lifecycle**: Services register both endpoints and tasks through a unified interface

### Task Scheduler

- **Cooperative Multitasking**: Background task management system (`TaskManager`)
- **Service Task Registration**: Services register tasks via `registerTasks(TaskManager&)`
- **Task Lifecycle Management**: Enable/disable tasks and set intervals at runtime
- **Execution Model**: Tasks run in `Spore::loop()` when their interval elapses

### Node Context

- **Central Context**: Shared resources and configuration
- **Event System**: Local and cluster-wide event publishing/subscription
- **Resource Management**: Centralized resource allocation and monitoring

## Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection.

### Discovery Process

1. **Discovery Broadcast**: Nodes periodically send heartbeat messages on port `udp_port` (default 4210)
2. **Response Handling**: Nodes respond with node update information containing their current state
3. **Member Management**: Discovered nodes are added to or updated in the cluster with current information
4. **Node Synchronization**: Periodic broadcasts ensure all nodes maintain current cluster state

### Protocol Details

- **UDP Port**: 4210 (configurable via `Config.udp_port`)
- **Heartbeat Message**: `CLUSTER_HEARTBEAT:hostname`
- **Node Update Message**: `NODE_UPDATE:hostname:{json}`
- **Broadcast Address**: 255.255.255.255
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)

### Message Formats

- **Heartbeat**: `CLUSTER_HEARTBEAT:hostname`
  - Sender: each node, broadcast to 255.255.255.255:`udp_port` on the heartbeat interval
  - Purpose: announce presence, prompt peers for node info, and maintain liveness
- **Node Update**: `NODE_UPDATE:hostname:{json}`
  - Sender: node responding to a heartbeat or broadcasting its current state
  - JSON fields: hostname, ip, uptime, optional labels
  - Purpose: provide current node information for cluster synchronization (see the construction sketch below)
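To make the two message types concrete, here is a rough sketch of how a node could assemble and broadcast them with `WiFiUDP` and ArduinoJson. It is not taken from the SPORE sources: the function names, the inline constants, and the `role` label are illustrative, and the real sending logic lives in the Cluster Manager.

```cpp
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>
#include <ArduinoJson.h>

WiFiUDP udp;                                       // udp.begin(UDP_PORT) is done by the listener setup
const uint16_t UDP_PORT = 4210;                    // Config.udp_port default
const IPAddress BROADCAST_ADDR(255, 255, 255, 255);

// Broadcast a heartbeat announcing this node's presence.
void broadcastHeartbeat(const String& hostname) {
  String msg = "CLUSTER_HEARTBEAT:" + hostname;
  udp.beginPacket(BROADCAST_ADDR, UDP_PORT);
  udp.write(msg.c_str());
  udp.endPacket();
}

// Send a NODE_UPDATE describing this node's current state.
void sendNodeUpdate(const String& hostname) {
  JsonDocument doc;                                // ArduinoJson 7
  doc["hostname"] = hostname;
  doc["ip"]       = WiFi.localIP().toString();
  doc["uptime"]   = millis();
  doc["labels"]["role"] = "sensor";                // optional labels

  String payload;
  serializeJson(doc, payload);
  String msg = "NODE_UPDATE:" + hostname + ":" + payload;

  udp.beginPacket(BROADCAST_ADDR, UDP_PORT);
  udp.write(msg.c_str());
  udp.endPacket();
}
```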
### Discovery Flow

1. **A node broadcasts** `CLUSTER_HEARTBEAT:hostname` to announce its presence
2. **Each receiver responds** with `NODE_UPDATE:hostname:{json}` containing its current node state
3. **The sender**:
   - Ensures the responding node exists in the member list, or creates it with its current IP and information
   - Parses the JSON and updates the node info, setting `status = ACTIVE` and `lastSeen = now`
   - Calculates `latency = now - lastHeartbeatSentAt` for network performance monitoring

### Node Synchronization

1. **Event-driven broadcasts**: Nodes broadcast `NODE_UPDATE:hostname:{json}` when their node information changes
2. **All receivers**: Update their member list entry for the broadcasting node
3. **Purpose**: Ensures all nodes maintain current cluster state and configuration

### Sequence Diagram

```mermaid
sequenceDiagram
    participant N1 as Node A (esp-node1)
    participant N2 as Node B (esp-node2)

    Note over N1,N2: Discovery via heartbeat broadcast
    N1->>+N2: CLUSTER_HEARTBEAT:esp-node1

    Note over N2: Node B responds with its current state
    N2->>+N1: NODE_UPDATE:esp-node2:{"hostname":"esp-node2","uptime":12345,"labels":{"role":"sensor"}}

    Note over N1: Process NODE_UPDATE response
    N1-->>N1: Update memberlist for Node B
    N1-->>N1: Set Node B status = ACTIVE
    N1-->>N1: Calculate latency for Node B

    Note over N1,N2: Event-driven node synchronization
    N1->>+N2: NODE_UPDATE:esp-node1:{"hostname":"esp-node1","uptime":12346,"labels":{"role":"controller"}}

    Note over N2: Update memberlist with latest information
    N2-->>N2: Update Node A info, maintain ACTIVE status
```

### Listener Behavior

The `cluster_listen` task parses one UDP packet per run and dispatches by message prefix:

- **Heartbeat** → add/update the responding node and send a `NODE_UPDATE` response
- **Node Update** → update the node's information and trigger member list logging

### Timing and Intervals

- **UDP Port**: `Config.udp_port` (default 4210)
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)

### Node Status Categories

Nodes are automatically categorized by their activity:

- **ACTIVE**: last seen less than `node_inactive_threshold_ms` ago (default 10 s)
- **INACTIVE**: last seen between `node_inactive_threshold_ms` and `node_dead_threshold_ms` ago (default 120 s)
- **DEAD**: last seen `node_dead_threshold_ms` or longer ago

## Task Scheduling System

The system runs several background tasks at different intervals:

### Core System Tasks

| Task | Interval (default) | Purpose |
|------|--------------------|---------|
| `cluster_listen` | 10 ms | Listen for heartbeat/node-info messages |
| `status_update` | 1000 ms | Update node status categories, purge dead nodes |
| `heartbeat` | 5000 ms | Broadcast heartbeat and update local resources |

### Task Management Features

- **Dynamic Intervals**: Change execution frequency on the fly
- **Runtime Control**: Enable/disable tasks without a restart
- **Status Monitoring**: Real-time task health tracking
- **Resource Integration**: View task status alongside system resources

## Event System

The `NodeContext` provides an event-driven architecture for system-wide communication.

### Event Subscription

```cpp
// Subscribe to events
ctx.on("node/discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster/updated", [](void* data) {
    // Handle cluster membership changes
});
```

### Event Publishing

```cpp
// Publish events
ctx.fire("node/discovered", &newNode);
ctx.fire("cluster/updated", &clusterData);
```

### Available Events

- **`node/discovered`**: New node added or local node refreshed

## Resource Monitoring

Each node tracks comprehensive system resources; a collection sketch follows the list below.

### System Resources

- **Free Heap Memory**: Available RAM in bytes
- **Chip ID**: Unique ESP8266 identifier
- **SDK Version**: ESP8266 firmware version
- **CPU Frequency**: Operating frequency in MHz
- **Flash Chip Size**: Total flash storage in bytes
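As a rough illustration (not the SPORE implementation, and with assumed JSON key names), these fields map directly onto the standard ESP8266 Arduino calls:

```cpp
#include <Arduino.h>
#include <ArduinoJson.h>

// Gather the resource fields listed above into a JSON document.
// The key names are illustrative; SPORE's actual field names may differ.
void collectResources(JsonDocument& doc) {
  doc["free_heap"]   = ESP.getFreeHeap();        // available RAM in bytes
  doc["chip_id"]     = ESP.getChipId();          // unique ESP8266 identifier
  doc["sdk_version"] = ESP.getSdkVersion();      // ESP8266 SDK/firmware version
  doc["cpu_mhz"]     = ESP.getCpuFreqMHz();      // operating frequency in MHz
  doc["flash_size"]  = ESP.getFlashChipSize();   // total flash storage in bytes
  doc["uptime_ms"]   = millis();                 // system uptime in milliseconds
}
```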
### API Endpoint Registry

- **Dynamic Discovery**: Automatically detect available endpoints
- **Method Information**: HTTP method (GET, POST, etc.)
- **Service Catalog**: Complete service registry across the cluster

### Health Metrics

- **Response Time**: API response latency
- **Uptime**: System uptime in milliseconds
- **Connection Status**: Network connectivity health
- **Resource Utilization**: Memory and CPU usage

## WiFi Fallback System

The system includes automatic WiFi fallback for robust operation.

### Fallback Process

1. **Primary Connection**: Attempts to connect to the configured WiFi network
2. **Connection Failure**: If the connection fails, creates an access point
3. **Hostname Generation**: Automatically generates a hostname from the MAC address
4. **Service Continuity**: Maintains cluster functionality in fallback mode

### Configuration

- **Hostname**: Derived from the MAC address (an `esp-` prefix plus a MAC-derived suffix) and assigned to `ctx.hostname`
- **AP Mode**: If the STA connection fails, the device switches to AP mode with the configured SSID/password (sketched below)
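A minimal sketch of this STA-to-AP fallback using the ESP8266WiFi API follows. The hostname, SSIDs, passwords, and 15-second timeout are placeholder values standing in for the real `Config` settings.

```cpp
#include <ESP8266WiFi.h>

// Join the configured network, or fall back to an access point if that fails.
// All literal values below are placeholders; SPORE takes the real ones from Config.
void connectOrFallback() {
  WiFi.mode(WIFI_STA);
  WiFi.hostname("esp-123abc");                   // MAC-derived hostname
  WiFi.begin("shroud", "example-password");

  unsigned long start = millis();
  while (WiFi.status() != WL_CONNECTED && millis() - start < 15000UL) {
    delay(250);                                  // poll until connected or timed out
  }

  if (WiFi.status() != WL_CONNECTED) {
    // STA connection failed: switch to AP mode so the node stays reachable.
    WiFi.mode(WIFI_AP);
    WiFi.softAP("spore-fallback", "example-ap-password");
  }
}
```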
## Cluster Topology

### Node Types

- **Master Node**: Primary cluster coordinator (if applicable)
- **Worker Nodes**: Standard cluster members
- **Edge Nodes**: Network edge devices

### Network Architecture

- UDP broadcast-based discovery and heartbeats on the local subnet
- Optional HTTP polling (disabled by default; node info is exchanged via UDP)

## Data Flow

### Node Discovery

1. **UDP Broadcast**: Nodes broadcast discovery packets on port 4210
2. **UDP Response**: Receiving nodes respond with their hostname
3. **Registration**: Discovered nodes are added to the local cluster member list

### Health Monitoring

1. **Periodic Checks**: The cluster manager updates node status categories
2. **Status Collection**: Each node updates resources via UDP node-info messages

### Task Management

1. **Scheduling**: `TaskManager` executes registered tasks at configured intervals
2. **Execution**: Tasks run cooperatively in the main loop without preemption
3. **Monitoring**: Task status is exposed via REST (`/api/tasks/status`)

## Performance Characteristics

### Memory Usage

- **Base System**: ~15-20 KB RAM (device dependent)
- **Per Task**: ~100-200 bytes per task
- **Cluster Members**: ~50-100 bytes per member
- **API Endpoints**: ~20-30 bytes per endpoint

### Network Overhead

- **Discovery Packets**: 64 bytes every 1 second
- **Health Checks**: ~200-500 bytes every 1 second
- **Status Updates**: ~1-2 KB per node
- **API Responses**: Varies by endpoint (typically 100 B-5 KB)

### Processing Overhead

- **Task Execution**: Minimal overhead per task
- **Event Processing**: Fast event dispatch
- **JSON Parsing**: Efficient ArduinoJson usage
- **Network I/O**: Asynchronous operations

## Security Considerations

### Current Implementation

- **Network Access**: Local network only (no internet exposure)
- **Authentication**: None currently implemented; LAN-only access assumed
- **Data Validation**: Basic input validation
- **Resource Limits**: Memory and processing constraints

### Future Enhancements

- **TLS/SSL**: Encrypted communications
- **API Keys**: Authentication for API access
- **Access Control**: Role-based permissions
- **Audit Logging**: Security event tracking

## Scalability

### Cluster Size Limits

- **Theoretical**: Up to 255 nodes (IP subnet limit)
- **Practical**: 20-50 nodes for optimal performance
- **Memory Constraint**: ~8 KB available for member tracking
- **Network Constraint**: UDP packet size limits

### Performance Scaling

- **Linear Scaling**: Most operations scale linearly with node count
- **Discovery Overhead**: Increases with cluster size
- **Health Monitoring**: Parallel HTTP requests
- **Task Management**: Independent per-node execution

## Configuration Management

SPORE implements a persistent configuration system that manages device settings across reboots and provides runtime reconfiguration capabilities.

### Configuration Architecture

The configuration system consists of several key components:

- **`Config` Class**: Central configuration management with default constants
- **LittleFS Storage**: Persistent file-based storage (`/config.json`)
- **Runtime Updates**: Live configuration changes via HTTP API
- **Automatic Persistence**: Configuration changes are automatically saved

### Configuration Categories

| Category | Description | Examples |
|----------|-------------|----------|
| **WiFi Configuration** | Network connection settings | SSID, password, timeouts |
| **Network Configuration** | Network service settings | UDP port, API server port |
| **Cluster Configuration** | Cluster management settings | Discovery intervals, heartbeat timing |
| **Node Status Thresholds** | Health monitoring thresholds | Active/inactive/dead timeouts |
| **System Configuration** | Core system settings | Restart delay, JSON document size |
| **Memory Management** | Resource management settings | Memory thresholds, HTTP request limits |

### Configuration Lifecycle

1. **Boot Process**: Load configuration from `/config.json` or use defaults
2. **Runtime Updates**: Configuration changes via HTTP API
3. **Persistent Storage**: Changes automatically saved to LittleFS (see the sketch below)
4. **Service Integration**: Configuration applied to all system services
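The following sketch shows one way this load/save cycle can look with LittleFS and ArduinoJson. It is illustrative only: the helper names and the two fields shown are assumptions, and the actual logic belongs to the `Config` class (`setDefaults()` / `loadFromFile()`).

```cpp
#include <LittleFS.h>
#include <ArduinoJson.h>

// Minimal load/save round trip for /config.json. Only two fields are shown;
// the real Config class covers the full set of categories in the table above.
bool loadConfig(uint16_t& udpPort, unsigned long& heartbeatMs) {
  if (!LittleFS.begin()) return false;
  File f = LittleFS.open("/config.json", "r");
  if (!f) return false;                            // no stored config: defaults apply

  JsonDocument doc;
  if (deserializeJson(doc, f)) { f.close(); return false; }
  f.close();

  udpPort     = doc["udp_port"] | udpPort;         // keep the default when the key is absent
  heartbeatMs = doc["heartbeat_interval_ms"] | heartbeatMs;
  return true;
}

bool saveConfig(uint16_t udpPort, unsigned long heartbeatMs) {
  JsonDocument doc;
  doc["udp_port"] = udpPort;
  doc["heartbeat_interval_ms"] = heartbeatMs;

  File f = LittleFS.open("/config.json", "w");
  if (!f) return false;
  serializeJson(doc, f);                           // persist to flash
  f.close();
  return true;
}
```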
### Default Value Management

All default values are defined as `constexpr` constants in the `Config` class:

```cpp
static constexpr const char* DEFAULT_WIFI_SSID = "shroud";
static constexpr uint16_t DEFAULT_UDP_PORT = 4210;
static constexpr unsigned long DEFAULT_HEARTBEAT_INTERVAL_MS = 5000;
```

This ensures:

- **Single Source of Truth**: All defaults defined once
- **Type Safety**: Compile-time type checking
- **Maintainability**: Easy to update default values
- **Consistency**: Same defaults used in `setDefaults()` and `loadFromFile()`

### Environment Variables

```bash
# API node IP for cluster management
export API_NODE=192.168.1.100
```

### PlatformIO Configuration

The project uses PlatformIO with the following configuration:

- **Framework**: Arduino
- **Board**: ESP-01 with 1 MB flash
- **Upload Speed**: 115200 baud
- **Flash Mode**: DOUT (required for ESP-01S)

### Dependencies

The project requires the following libraries:

- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking

## Development Workflow

### Building

Build the firmware for a specific chip:

```bash
./ctl.sh build target esp01_1m
```

### Flashing

Flash firmware to a connected device:

```bash
./ctl.sh flash target esp01_1m
```

### Over-The-Air Updates

Update a specific node:

```bash
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```bash
./ctl.sh ota all esp01_1m
```

### Cluster Management

View cluster members:

```bash
./ctl.sh cluster members
```

## Troubleshooting

### Common Issues

1. **Discovery Failures**: Check that UDP port 4210 is not blocked
2. **WiFi Connection**: Verify the SSID/password in Config.cpp
3. **OTA Updates**: Ensure sufficient flash space (1 MB minimum)
4. **Cluster Split**: Check network connectivity between nodes

### Debug Output

Enable serial monitoring to see cluster activity:

```bash
pio device monitor
```

### Performance Monitoring

- **Memory Usage**: Monitor free heap with `/api/node/status`
- **Task Health**: Check task status with `/api/tasks/status`
- **Cluster Health**: Monitor member status with `/api/cluster/members`
- **Network Latency**: Track response times in cluster data

## Related Documentation

- **[Configuration Management](./ConfigurationManagement.md)** - Persistent configuration system
- **[WiFi Configuration](./WiFiConfiguration.md)** - WiFi setup and reconfiguration process
- **[Task Management](./TaskManagement.md)** - Background task system
- **[API Reference](./API.md)** - REST API documentation
- **[TaskManager API](./TaskManager.md)** - TaskManager class reference
- **[OpenAPI Specification](../api/)** - Machine-readable API specification