# SPORE Architecture & Implementation

## System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

## Core Components

The system architecture consists of several key components working together:

### Network Manager

- **WiFi Connection Handling**: Automatic WiFi STA/AP configuration
- **Hostname Configuration**: MAC-based hostname generation
- **Fallback Management**: Automatic access point creation if the WiFi connection fails

### Cluster Manager

- **Node Discovery**: UDP-based automatic node detection
- **Member List Management**: Dynamic cluster membership tracking
- **Health Monitoring**: Continuous node status checking
- **Resource Tracking**: Monitoring of node resources and capabilities

### API Server

- **HTTP API Server**: RESTful API for cluster management
- **Dynamic Endpoint Registration**: Services register endpoints via `registerEndpoints(ApiServer&)`
- **Service Registry**: Tracks available services across the cluster
- **Service Lifecycle**: Services register both endpoints and tasks through a unified interface

### Task Scheduler

- **Cooperative Multitasking**: Background task management system (`TaskManager`)
- **Service Task Registration**: Services register tasks via `registerTasks(TaskManager&)`
- **Task Lifecycle Management**: Enable/disable tasks and set intervals at runtime
- **Execution Model**: Tasks run in `Spore::loop()` when their interval elapses (see the sketch below)
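
To make the execution model concrete, here is a minimal, self-contained sketch of the pattern. The actual `TaskManager`/`Spore::loop()` implementation differs; the interval check in the main loop is the core idea.

```cpp
// Minimal sketch of the cooperative execution model: each task records its
// interval and the time it last ran, and the main loop invokes it again once
// that interval has elapsed. Names are illustrative, not the actual SPORE classes.
#include <Arduino.h>

struct SimpleTask {
    const char* name;
    unsigned long intervalMs;   // how often the task should run
    unsigned long lastRunMs;    // millis() timestamp of the last run
    bool enabled;               // runtime enable/disable flag
    void (*callback)();         // work performed on each run
};

void heartbeatWork() { /* broadcast heartbeat, update local resources */ }

SimpleTask heartbeatTask = {"heartbeat", 5000, 0, true, heartbeatWork};

void setup() {}

void loop() {
    // Cooperative scheduling: no preemption, each task runs to completion.
    unsigned long now = millis();
    if (heartbeatTask.enabled && now - heartbeatTask.lastRunMs >= heartbeatTask.intervalMs) {
        heartbeatTask.callback();
        heartbeatTask.lastRunMs = now;
    }
}
```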

### Node Context

- **Central Context**: Shared resources and configuration
- **Event System**: Local and cluster-wide event publishing/subscription
- **Resource Management**: Centralized resource allocation and monitoring

## Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

### Discovery Process

1. **Discovery Broadcast**: Nodes periodically send heartbeat messages on port `udp_port` (default 4210)
2. **Response Handling**: Nodes respond with node update information containing their current state
3. **Member Management**: Discovered nodes are added to or updated in the member list with their current information
4. **Node Synchronization**: Periodic broadcasts ensure all nodes maintain current cluster state

### Protocol Details

- **UDP Port**: 4210 (configurable via `Config.udp_port`)
- **Heartbeat Message**: `CLUSTER_HEARTBEAT:hostname`
- **Node Update Message**: `NODE_UPDATE:hostname:{json}`
- **Broadcast Address**: 255.255.255.255
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)

### Message Formats

- **Heartbeat**: `CLUSTER_HEARTBEAT:hostname`
  - Sender: each node, broadcast to 255.255.255.255:`udp_port` on the heartbeat interval
  - Purpose: announce presence, prompt peers for node info, and signal liveness
- **Node Update**: `NODE_UPDATE:hostname:{json}` (see the sending sketch below)
  - Sender: a node responding to a heartbeat or broadcasting its current state
  - JSON fields: hostname, ip, uptime, optional labels
  - Purpose: provide current node information for cluster synchronization
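
As an illustration of these formats, the sketch below assembles and broadcasts both message types with `WiFiUDP` and ArduinoJson. It is a simplified stand-in for the ClusterManager's sending path, not the actual implementation; the `role` label is just an example value.

```cpp
// Illustrative sketch of assembling and broadcasting both message types
// with WiFiUDP and ArduinoJson; not the actual ClusterManager code.
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>
#include <ArduinoJson.h>

WiFiUDP udp;
const uint16_t UDP_PORT = 4210;                 // Config.udp_port default
const IPAddress BROADCAST(255, 255, 255, 255);

void sendHeartbeat(const String& hostname) {
    udp.beginPacket(BROADCAST, UDP_PORT);
    udp.print(String("CLUSTER_HEARTBEAT:") + hostname);
    udp.endPacket();
}

void sendNodeUpdate(const String& hostname) {
    JsonDocument doc;                           // ArduinoJson 7
    doc["hostname"] = hostname;
    doc["ip"] = WiFi.localIP().toString();
    doc["uptime"] = millis();
    doc["labels"]["role"] = "sensor";           // optional labels (example value)

    String payload;
    serializeJson(doc, payload);

    udp.beginPacket(BROADCAST, UDP_PORT);
    udp.print(String("NODE_UPDATE:") + hostname + ":" + payload);
    udp.endPacket();
}
```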

### Discovery Flow

1. **A node broadcasts** `CLUSTER_HEARTBEAT:hostname` to announce its presence
2. **Each receiver responds** with `NODE_UPDATE:hostname:{json}` containing its current node state
3. **The sender**:
   - Ensures the responding node exists in the member list, or creates it with the current IP and information
   - Parses the JSON and updates the node info, setting `status = ACTIVE` and `lastSeen = now`
   - Calculates `latency = now - lastHeartbeatSentAt` for network performance monitoring

### Node Synchronization

1. **Event-driven broadcasts**: Nodes broadcast `NODE_UPDATE:hostname:{json}` when their node information changes
2. **All receivers**: Update their memberlist entry for the broadcasting node
3. **Purpose**: Ensures all nodes maintain current cluster state and configuration

### Sequence Diagram

```mermaid
sequenceDiagram
    participant N1 as Node A (esp-node1)
    participant N2 as Node B (esp-node2)

    Note over N1,N2: Discovery via heartbeat broadcast
    N1->>N2: CLUSTER_HEARTBEAT:esp-node1

    Note over N2: Node B responds with its current state
    N2->>N1: NODE_UPDATE:esp-node2:{"hostname":"esp-node2","uptime":12345,"labels":{"role":"sensor"}}

    Note over N1: Process NODE_UPDATE response
    N1-->>N1: Update memberlist for Node B
    N1-->>N1: Set Node B status = ACTIVE
    N1-->>N1: Calculate latency for Node B

    Note over N1,N2: Event-driven node synchronization
    N1->>N2: NODE_UPDATE:esp-node1:{"hostname":"esp-node1","uptime":12346,"labels":{"role":"controller"}}

    Note over N2: Update memberlist with latest information
    N2-->>N2: Update Node A info, maintain ACTIVE status
```

### Listener Behavior

The `cluster_listen` task parses one UDP packet per run and dispatches by prefix (see the sketch below):

- **Heartbeat** → add/update the responding node and send a `NODE_UPDATE` response
- **Node Update** → update node information and trigger memberlist logging
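
A simplified sketch of that dispatch; the member-list and latency bookkeeping are represented by comments and placeholder names, not the real helpers.

```cpp
// Illustrative per-run listener logic: read at most one UDP packet and
// dispatch on the message prefix. Helper behavior is only sketched in comments.
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>
#include <ArduinoJson.h>

WiFiUDP udp;                               // bound to Config.udp_port in setup()
unsigned long lastHeartbeatSentAt = 0;     // stamped when our heartbeat is sent

void clusterListen() {
    int size = udp.parsePacket();
    if (size <= 0) return;                 // nothing to do this run

    char buf[512];
    int len = udp.read(buf, sizeof(buf) - 1);
    if (len <= 0) return;
    buf[len] = '\0';
    String msg(buf);

    if (msg.startsWith("CLUSTER_HEARTBEAT:")) {
        String peer = msg.substring(strlen("CLUSTER_HEARTBEAT:"));
        // Add/update `peer` using udp.remoteIP(), then answer with our own
        // NODE_UPDATE so the sender can refresh its member list.
        (void)peer;
    } else if (msg.startsWith("NODE_UPDATE:")) {
        int jsonStart = msg.indexOf(':', strlen("NODE_UPDATE:"));
        JsonDocument doc;
        if (deserializeJson(doc, msg.substring(jsonStart + 1)) == DeserializationError::Ok) {
            // Update member info, mark ACTIVE, record lastSeen, and compute latency.
            unsigned long latency = millis() - lastHeartbeatSentAt;
            (void)latency;
        }
    }
}
```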

### Timing and Intervals

- **UDP Port**: `Config.udp_port` (default 4210)
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)

### Node Status Categories

Nodes are automatically categorized by the time elapsed since they were last seen (see the sketch below):

- **ACTIVE**: last seen less than `node_inactive_threshold_ms` ago (default 10 s)
- **INACTIVE**: last seen at least `node_inactive_threshold_ms` but less than `node_dead_threshold_ms` ago (default 120 s)
- **DEAD**: last seen `node_dead_threshold_ms` or more ago
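
The categorization rule can be written as a small pure function; the enum and parameter names below are illustrative, with defaults matching the documented thresholds.

```cpp
// Minimal sketch of the status categorization above; NodeStatus and the
// threshold parameter names mirror the configuration keys but are illustrative.
#include <Arduino.h>

enum class NodeStatus { ACTIVE, INACTIVE, DEAD };

NodeStatus categorize(unsigned long lastSeenMs,
                      unsigned long nowMs,
                      unsigned long inactiveThresholdMs = 10000,   // node_inactive_threshold_ms
                      unsigned long deadThresholdMs = 120000) {    // node_dead_threshold_ms
    unsigned long age = nowMs - lastSeenMs;   // time since the node was last heard from
    if (age < inactiveThresholdMs) return NodeStatus::ACTIVE;
    if (age < deadThresholdMs)     return NodeStatus::INACTIVE;
    return NodeStatus::DEAD;                  // eligible for purging by status_update
}
```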

## Task Scheduling System

The system runs several background tasks at different intervals:

### Core System Tasks

| Task | Interval (default) | Purpose |
|------|--------------------|---------|
| `cluster_listen` | 10 ms | Listen for heartbeat/node-info messages |
| `status_update` | 1000 ms | Update node status categories, purge dead nodes |
| `heartbeat` | 5000 ms | Broadcast heartbeat and update local resources |

### Task Management Features

- **Dynamic Intervals**: Change execution frequency on the fly (see the sketch below)
- **Runtime Control**: Enable/disable tasks without a restart
- **Status Monitoring**: Real-time task health tracking
- **Resource Integration**: View task status alongside system resources
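
A rough sketch of what runtime control amounts to, using a trimmed-down task registry rather than the real `TaskManager` API (the function names here are assumptions):

```cpp
// Illustrative runtime control over registered tasks; the real TaskManager
// may expose different method names and storage.
#include <Arduino.h>
#include <vector>

struct SimpleTask {
    String name;
    unsigned long intervalMs;
    unsigned long lastRunMs = 0;
    bool enabled = true;
};

std::vector<SimpleTask> registry;

SimpleTask* findTask(const String& name) {
    for (auto& task : registry) {
        if (task.name == name) return &task;
    }
    return nullptr;
}

// Dynamic interval: takes effect at the next scheduling check.
bool setTaskInterval(const String& name, unsigned long intervalMs) {
    SimpleTask* task = findTask(name);
    if (task) task->intervalMs = intervalMs;
    return task != nullptr;
}

// Runtime control: pause or resume a task without rebooting the node.
bool setTaskEnabled(const String& name, bool enabled) {
    SimpleTask* task = findTask(name);
    if (task) task->enabled = enabled;
    return task != nullptr;
}
```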

## Event System

The `NodeContext` provides an event-driven architecture for system-wide communication:

### Event Subscription

```cpp
// Subscribe to events
ctx.on("node/discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster/updated", [](void* data) {
    // Handle cluster membership changes
});
```

### Event Publishing

```cpp
// Publish events
ctx.fire("node/discovered", &newNode);
ctx.fire("cluster/updated", &clusterData);
```

### Available Events

- **`node/discovered`**: New node added or local node refreshed

## Resource Monitoring

Each node tracks comprehensive system resources (gathered as sketched below):

### System Resources

- **Free Heap Memory**: Available RAM in bytes
- **Chip ID**: Unique ESP8266 identifier
- **SDK Version**: ESP8266 firmware version
- **CPU Frequency**: Operating frequency in MHz
- **Flash Chip Size**: Total flash storage in bytes
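
These values map directly onto the stock ESP8266 getters; the sketch below gathers them into a JSON document with ArduinoJson (the field names are illustrative, not the exact wire format):

```cpp
// Sketch of collecting the resources listed above using the standard
// ESP8266 getters and ArduinoJson.
#include <Arduino.h>
#include <ArduinoJson.h>

String collectResources() {
    JsonDocument doc;                              // ArduinoJson 7
    doc["free_heap"] = ESP.getFreeHeap();          // available RAM in bytes
    doc["chip_id"] = ESP.getChipId();              // unique ESP8266 identifier
    doc["sdk_version"] = ESP.getSdkVersion();      // SDK/firmware version string
    doc["cpu_freq_mhz"] = ESP.getCpuFreqMHz();     // operating frequency in MHz
    doc["flash_size"] = ESP.getFlashChipSize();    // total flash storage in bytes
    doc["uptime_ms"] = millis();                   // uptime in milliseconds

    String out;
    serializeJson(doc, out);
    return out;
}
```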

### API Endpoint Registry

- **Dynamic Discovery**: Automatically detect available endpoints
- **Method Information**: HTTP method (GET, POST, etc.)
- **Service Catalog**: Complete service registry across the cluster

### Health Metrics

- **Response Time**: API response latency
- **Uptime**: System uptime in milliseconds
- **Connection Status**: Network connectivity health
- **Resource Utilization**: Memory and CPU usage

## WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

### Fallback Process

1. **Primary Connection**: Attempts to connect to the configured WiFi network
2. **Connection Failure**: If the connection fails, creates an access point
3. **Hostname Generation**: Automatically generates a hostname from the MAC address
4. **Service Continuity**: Maintains cluster functionality in fallback mode

### Configuration

- **Hostname**: Derived from the MAC address (`esp-<mac>`) and assigned to `ctx.hostname`
- **AP Mode**: If the STA connection fails, the device switches to AP mode with the configured SSID/password (see the sketch below)
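
The fallback flow can be expressed with the stock ESP8266WiFi API. A minimal sketch, assuming an illustrative 15-second timeout; the real timeout, SSID, and password come from `Config`:

```cpp
// Sketch of the STA-with-AP-fallback flow and MAC-based hostname generation
// using the standard ESP8266WiFi API; values are illustrative.
#include <ESP8266WiFi.h>

String makeHostname() {
    String mac = WiFi.macAddress();      // e.g. "5C:CF:7F:12:34:56"
    mac.replace(":", "");
    mac.toLowerCase();
    return String("esp-") + mac;         // esp-<mac>, assigned to ctx.hostname in SPORE
}

void connectOrFallback(const char* ssid, const char* pass,
                       const char* apSsid, const char* apPass,
                       unsigned long timeoutMs = 15000) {
    WiFi.mode(WIFI_STA);
    WiFi.hostname(makeHostname());       // MAC-based hostname
    WiFi.begin(ssid, pass);

    unsigned long start = millis();
    while (WiFi.status() != WL_CONNECTED && millis() - start < timeoutMs) {
        delay(250);                      // wait for the primary connection
    }

    if (WiFi.status() != WL_CONNECTED) {
        // Primary connection failed: switch to AP mode so the cluster and
        // the HTTP API stay reachable.
        WiFi.mode(WIFI_AP);
        WiFi.softAP(apSsid, apPass);
    }
}
```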

## Cluster Topology

### Node Types

- **Master Node**: Primary cluster coordinator (if applicable)
- **Worker Nodes**: Standard cluster members
- **Edge Nodes**: Network edge devices

### Network Architecture

- UDP broadcast-based discovery and heartbeats on the local subnet
- Optional HTTP polling (disabled by default; node info exchanged via UDP)

## Data Flow

### Node Discovery

1. **UDP Broadcast**: Nodes broadcast discovery packets on port 4210
2. **UDP Response**: Receiving nodes respond with a `NODE_UPDATE` containing their hostname and state
3. **Registration**: Discovered nodes are added to the local cluster member list

### Health Monitoring

1. **Periodic Checks**: The cluster manager updates node status categories
2. **Status Collection**: Each node updates its resources via UDP node-info messages

### Task Management

1. **Scheduling**: `TaskManager` executes registered tasks at configured intervals
2. **Execution**: Tasks run cooperatively in the main loop without preemption
3. **Monitoring**: Task status is exposed via REST (`/api/tasks/status`)

## Performance Characteristics

### Memory Usage

- **Base System**: ~15-20 KB RAM (device dependent)
- **Per Task**: ~100-200 bytes per task
- **Cluster Members**: ~50-100 bytes per member
- **API Endpoints**: ~20-30 bytes per endpoint

### Network Overhead

- **Discovery Packets**: ~64 bytes per heartbeat (default every 5 s)
- **Health Checks**: ~200-500 bytes per status exchange
- **Status Updates**: ~1-2 KB per node
- **API Responses**: Varies by endpoint (typically 100 B-5 KB)

### Processing Overhead

- **Task Execution**: Minimal overhead per task
- **Event Processing**: Fast event dispatch
- **JSON Parsing**: Efficient ArduinoJson usage
- **Network I/O**: Asynchronous operations

## Security Considerations

### Current Implementation

- **Network Access**: Local network only (no internet exposure)
- **Authentication**: None currently implemented; LAN-only access assumed
- **Data Validation**: Basic input validation
- **Resource Limits**: Memory and processing constraints

### Future Enhancements

- **TLS/SSL**: Encrypted communications
- **API Keys**: Authentication for API access
- **Access Control**: Role-based permissions
- **Audit Logging**: Security event tracking

## Scalability

### Cluster Size Limits

- **Theoretical**: Up to 255 nodes (IP subnet limit)
- **Practical**: 20-50 nodes for optimal performance
- **Memory Constraint**: ~8 KB available for member tracking
- **Network Constraint**: UDP packet size limits

### Performance Scaling

- **Linear Scaling**: Most operations scale linearly with node count
- **Discovery Overhead**: Increases with cluster size
- **Health Monitoring**: Status exchanged via UDP broadcasts (HTTP polling optional)
- **Task Management**: Independent per-node execution

## Configuration Management

SPORE implements a persistent configuration system that manages device settings across reboots and provides runtime reconfiguration capabilities.

### Configuration Architecture

The configuration system consists of several key components:

- **`Config` Class**: Central configuration management with default constants
- **LittleFS Storage**: Persistent file-based storage (`/config.json`)
- **Runtime Updates**: Live configuration changes via HTTP API
- **Automatic Persistence**: Configuration changes are automatically saved

### Configuration Categories

| Category | Description | Examples |
|----------|-------------|----------|
| **WiFi Configuration** | Network connection settings | SSID, password, timeouts |
| **Network Configuration** | Network service settings | UDP port, API server port |
| **Cluster Configuration** | Cluster management settings | Discovery intervals, heartbeat timing |
| **Node Status Thresholds** | Health monitoring thresholds | Active/inactive/dead timeouts |
| **System Configuration** | Core system settings | Restart delay, JSON document size |
| **Memory Management** | Resource management settings | Memory thresholds, HTTP request limits |

### Configuration Lifecycle

1. **Boot Process**: Load configuration from `/config.json` or fall back to defaults (see the sketch below)
2. **Runtime Updates**: Configuration changes via HTTP API
3. **Persistent Storage**: Changes automatically saved to LittleFS
4. **Service Integration**: Configuration applied to all system services

### Default Value Management

All default values are defined as `constexpr` constants in the `Config` class:

```cpp
static constexpr const char* DEFAULT_WIFI_SSID = "shroud";
static constexpr uint16_t DEFAULT_UDP_PORT = 4210;
static constexpr unsigned long DEFAULT_HEARTBEAT_INTERVAL_MS = 5000;
```

This ensures:

- **Single Source of Truth**: All defaults defined once
- **Type Safety**: Compile-time type checking
- **Maintainability**: Easy to update default values
- **Consistency**: Same defaults used in `setDefaults()` and `loadFromFile()`
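
A minimal sketch of that boot-time load path with LittleFS and ArduinoJson; `Config::loadFromFile()` is the real entry point, while the struct, function, and JSON key names below are illustrative only:

```cpp
// Illustrative boot-time load: read /config.json from LittleFS and fall back
// to compile-time defaults when the file is missing or unreadable. The JSON
// key names are assumptions for this sketch.
#include <Arduino.h>
#include <LittleFS.h>
#include <ArduinoJson.h>

struct RuntimeConfig {
    String wifiSsid = "shroud";                 // DEFAULT_WIFI_SSID
    uint16_t udpPort = 4210;                    // DEFAULT_UDP_PORT
    unsigned long heartbeatIntervalMs = 5000;   // DEFAULT_HEARTBEAT_INTERVAL_MS
};

bool loadConfig(RuntimeConfig& cfg) {
    if (!LittleFS.begin()) return false;        // filesystem unavailable: keep defaults

    File file = LittleFS.open("/config.json", "r");
    if (!file) return false;                    // no config file yet: keep defaults

    JsonDocument doc;                           // ArduinoJson 7
    DeserializationError err = deserializeJson(doc, file);
    file.close();
    if (err) return false;                      // corrupt file: keep defaults

    // Override a default only when the corresponding key is present.
    if (doc["wifi_ssid"].is<const char*>())
        cfg.wifiSsid = doc["wifi_ssid"].as<const char*>();
    if (doc["udp_port"].is<uint16_t>())
        cfg.udpPort = doc["udp_port"].as<uint16_t>();
    if (doc["heartbeat_interval_ms"].is<unsigned long>())
        cfg.heartbeatIntervalMs = doc["heartbeat_interval_ms"].as<unsigned long>();
    return true;
}
```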

### Environment Variables

```bash
# API node IP for cluster management
export API_NODE=192.168.1.100
```

### PlatformIO Configuration

The project uses PlatformIO with the following configuration:

- **Framework**: Arduino
- **Board**: ESP-01 with 1MB flash
- **Upload Speed**: 115200 baud
- **Flash Mode**: DOUT (required for ESP-01S)

### Dependencies

The project requires the following libraries:

- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking

## Development Workflow

### Building

Build the firmware for a specific chip:

```bash
./ctl.sh build target esp01_1m
```

### Flashing

Flash firmware to a connected device:

```bash
./ctl.sh flash target esp01_1m
```

### Over-The-Air Updates

Update a specific node:

```bash
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```bash
./ctl.sh ota all esp01_1m
```

### Cluster Management

View cluster members:

```bash
./ctl.sh cluster members
```

## Troubleshooting

### Common Issues

1. **Discovery Failures**: Check that UDP port 4210 is not blocked
2. **WiFi Connection**: Verify the SSID/password in Config.cpp
3. **OTA Updates**: Ensure sufficient flash space (1MB minimum)
4. **Cluster Split**: Check network connectivity between nodes

### Debug Output

Enable serial monitoring to see cluster activity:

```bash
pio device monitor
```

### Performance Monitoring

- **Memory Usage**: Monitor free heap with `/api/node/status`
- **Task Health**: Check task status with `/api/tasks/status`
- **Cluster Health**: Monitor member status with `/api/cluster/members`
- **Network Latency**: Track response times in cluster data

## Related Documentation

- **[Configuration Management](./ConfigurationManagement.md)** - Persistent configuration system
- **[WiFi Configuration](./WiFiConfiguration.md)** - WiFi setup and reconfiguration process
- **[Task Management](./TaskManagement.md)** - Background task system
- **[API Reference](./API.md)** - REST API documentation
- **[TaskManager API](./TaskManager.md)** - TaskManager class reference
- **[OpenAPI Specification](../api/)** - Machine-readable API specification