
SPORE Architecture & Implementation

System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

Core Components

The system architecture consists of several key components working together:

Network Manager

  • WiFi Connection Handling: Automatic WiFi STA/AP configuration
  • Hostname Configuration: MAC-based hostname generation
  • Fallback Management: Automatic access point creation if WiFi connection fails

Cluster Manager

  • Node Discovery: UDP-based automatic node detection
  • Member List Management: Dynamic cluster membership tracking
  • Health Monitoring: Continuous node status checking
  • Resource Tracking: Monitor node resources and capabilities

API Server

  • HTTP API Server: RESTful API for cluster management
  • Dynamic Endpoint Registration: Services register endpoints via registerEndpoints(ApiServer&)
  • Service Registry: Track available services across the cluster
  • Service Lifecycle: Services register both endpoints and tasks through unified interface

Task Scheduler

  • Cooperative Multitasking: Background task management system (TaskManager)
  • Service Task Registration: Services register tasks via registerTasks(TaskManager&)
  • Task Lifecycle Management: Enable/disable tasks and set intervals at runtime
  • Execution Model: Tasks run in Spore::loop() when their interval elapses
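
The unified service interface can be illustrated with a short sketch. The registerEndpoints(ApiServer&) and registerTasks(TaskManager&) hooks are the ones named above; the service class, the api.on(...) and tasks.addTask(...) calls, and all other names are assumptions for illustration, not the actual SPORE API.

// Hypothetical service; only the two register* hooks come from this document.
class SensorService {
public:
    void registerEndpoints(ApiServer& api) {
        // Assumed signature: HTTP method, path, handler returning a body.
        api.on("GET", "/api/sensor/value", [this]() { return String(lastValue_); });
    }

    void registerTasks(TaskManager& tasks) {
        // Assumed signature: task name, interval in ms, callback.
        tasks.addTask("sensor_sample", 1000, [this]() { sample(); });
    }

private:
    void sample() { lastValue_ += 1.0f; }  // placeholder for real sampling
    float lastValue_ = 0.0f;
};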

Node Context

  • Central Context: Shared resources and configuration
  • Event System: Local and cluster-wide event publishing/subscription
  • Resource Management: Centralized resource allocation and monitoring

Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

Discovery Process

  1. Discovery Broadcast: Nodes periodically broadcast CLUSTER_DISCOVERY packets to UDP port udp_port (default 4210)
  2. Response Handling: Nodes respond with CLUSTER_RESPONSE:<hostname>
  3. Member Management: Discovered nodes are added/updated in the cluster
  4. Node Info via UDP: Heartbeat triggers peers to send NODE_UPDATE:<hostname>:<json>
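
A minimal sketch of steps 1-2 using the ESP8266 Arduino WiFiUDP API. The message strings, broadcast address, and port follow the protocol described here; the function names and buffer handling are illustrative only.

#include <ESP8266WiFi.h>
#include <WiFiUdp.h>

WiFiUDP udp;
const uint16_t kUdpPort = 4210;          // Config.udp_port default

void setupDiscovery() {
    udp.begin(kUdpPort);                 // listen for cluster traffic
}

void broadcastDiscovery() {
    // Step 1: broadcast CLUSTER_DISCOVERY on the local subnet.
    udp.beginPacket(IPAddress(255, 255, 255, 255), kUdpPort);
    udp.print("CLUSTER_DISCOVERY");
    udp.endPacket();
}

void answerDiscovery() {
    // Step 2: reply to a discovery packet with CLUSTER_RESPONSE:<hostname>.
    if (udp.parsePacket() == 0) return;
    char buf[64];
    int len = udp.read(buf, sizeof(buf) - 1);
    if (len <= 0) return;
    buf[len] = '\0';
    if (strcmp(buf, "CLUSTER_DISCOVERY") == 0) {
        udp.beginPacket(udp.remoteIP(), udp.remotePort());
        udp.print(String("CLUSTER_RESPONSE:") + WiFi.hostname());
        udp.endPacket();
    }
}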

Protocol Details

  • UDP Port: 4210 (configurable via Config.udp_port)
  • Discovery Message: CLUSTER_DISCOVERY
  • Response Message: CLUSTER_RESPONSE
  • Heartbeat Message: CLUSTER_HEARTBEAT:hostname
  • Node Update Message: NODE_UPDATE:hostname:{json}
  • Broadcast Address: 255.255.255.255
  • Listen Interval: Config.cluster_listen_interval_ms (default 10 ms)
  • Heartbeat Interval: Config.heartbeat_interval_ms (default 5000 ms)

Message Formats

  • Heartbeat: CLUSTER_HEARTBEAT:hostname
    • Sender: each node, broadcast to 255.255.255.255:udp_port on interval
    • Purpose: announce presence, prompt peers for node info, and keep liveness
  • Node Update: NODE_UPDATE:hostname:{json}
    • Sender: node receiving a heartbeat; unicast to heartbeat sender IP
    • JSON fields: hostname, ip, uptime, optional labels
    • Purpose: provide minimal node information in response to heartbeat

Heartbeat Flow

  1. A node broadcasts CLUSTER_HEARTBEAT:hostname
  2. Each receiver responds with NODE_UPDATE:hostname:{json} to the heartbeat sender IP
  3. The sender:
    • Ensures the node exists or creates it with hostname and sender IP
    • Parses JSON and updates node info, status = ACTIVE, lastSeen = now
    • Sets latency = now - lastHeartbeatSentAt (per-node, measured at heartbeat origin)
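
The receiver side of step 2 can be sketched with ArduinoJson (version 7, as listed under Dependencies). The message prefix and JSON fields mirror the formats above; the function name and the fixed port comment are illustrative.

#include <ArduinoJson.h>
#include <ESP8266WiFi.h>
#include <WiFiUdp.h>

extern WiFiUDP udp;   // cluster UDP socket, already bound to Config.udp_port

// Build and unicast NODE_UPDATE:<hostname>:{json} to the heartbeat sender.
void replyWithNodeUpdate() {
    JsonDocument doc;                         // ArduinoJson 7 document
    doc["hostname"] = WiFi.hostname();
    doc["ip"]       = WiFi.localIP().toString();
    doc["uptime"]   = millis();

    String payload;
    serializeJson(doc, payload);

    udp.beginPacket(udp.remoteIP(), 4210 /* Config.udp_port */);
    udp.print(String("NODE_UPDATE:") + WiFi.hostname() + ":" + payload);
    udp.endPacket();
}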

Node Update Broadcasting

  1. Periodic broadcast: Each node broadcasts NODE_UPDATE:hostname:{json} every 5 seconds
  2. All receivers: Update their memberlist entry for the broadcasting node
  3. Purpose: Ensures all nodes have current information about each other

Listener Behavior

The cluster_listen task parses one UDP packet per run and dispatches by prefix to:

  • Heartbeat → add/update node and send NODE_UPDATE JSON response
  • Node Update → update node information and status

Timing and Intervals

  • UDP Port: Config.udp_port (default 4210)
  • Listen Interval: Config.cluster_listen_interval_ms (default 10 ms)
  • Heartbeat Interval: Config.heartbeat_interval_ms (default 5000 ms)

Node Status Categories

Nodes are automatically categorized by their activity:

  • ACTIVE: lastSeen age < node_inactive_threshold_ms (default 10 s)
  • INACTIVE: lastSeen age ≥ node_inactive_threshold_ms and < node_dead_threshold_ms (default 120 s)
  • DEAD: lastSeen age ≥ node_dead_threshold_ms
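
The categorization reduces to comparing the age of lastSeen against the two thresholds. The enum and function below are illustrative; the default values match the configuration fields above.

enum class NodeStatus { ACTIVE, INACTIVE, DEAD };

NodeStatus classify(unsigned long lastSeenMs, unsigned long nowMs,
                    unsigned long inactiveMs = 10000,    // node_inactive_threshold_ms
                    unsigned long deadMs     = 120000) { // node_dead_threshold_ms
    unsigned long age = nowMs - lastSeenMs;
    if (age < inactiveMs) return NodeStatus::ACTIVE;
    if (age < deadMs)     return NodeStatus::INACTIVE;
    return NodeStatus::DEAD;
}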

Task Scheduling System

The system runs several background tasks at different intervals:

Core System Tasks

| Task | Interval (default) | Purpose |
|---|---|---|
| cluster_listen | 10 ms | Listen for heartbeat/node-info messages |
| status_update | 1000 ms | Update node status categories, purge dead nodes |
| heartbeat | 5000 ms | Broadcast heartbeat and update local resources |

Task Management Features

  • Dynamic Intervals: Change execution frequency on-the-fly
  • Runtime Control: Enable/disable tasks without restart
  • Status Monitoring: Real-time task health tracking
  • Resource Integration: View task status with system resources
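
As a usage sketch of these features, the call below slows the heartbeat task at runtime. The find/setInterval/setEnabled names and the task handle are hypothetical, chosen only to illustrate runtime control; the real TaskManager API may differ.

// Hypothetical runtime control; names are illustrative, not the SPORE API.
void throttleHeartbeat(TaskManager& tasks, bool lowPower) {
    auto* task = tasks.find("heartbeat");        // assumed lookup by task name
    if (task == nullptr) return;
    task->setInterval(lowPower ? 15000 : 5000);  // change frequency on the fly
    task->setEnabled(true);                      // tasks can also be disabled without restart
}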

Event System

The NodeContext provides an event-driven architecture for system-wide communication:

Event Subscription

// Subscribe to events
ctx.on("node/discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster/updated", [](void* data) {
    // Handle cluster membership changes
});

Event Publishing

// Publish events
ctx.fire("node/discovered", &newNode);
ctx.fire("cluster/updated", &clusterData);

Available Events

  • node/discovered: New node added or local node refreshed

Resource Monitoring

Each node tracks comprehensive system resources:

System Resources

  • Free Heap Memory: Available RAM in bytes
  • Chip ID: Unique ESP8266 identifier
  • SDK Version: ESP8266 firmware version
  • CPU Frequency: Operating frequency in MHz
  • Flash Chip Size: Total flash storage in bytes
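
These values map directly onto the ESP8266 Arduino core's ESP object. The snippet below collects them into an ArduinoJson document; the JSON field names are illustrative.

#include <Arduino.h>
#include <ArduinoJson.h>

// Gather the system resources listed above.
void collectResources(JsonDocument& doc) {
    doc["free_heap"]   = ESP.getFreeHeap();       // available RAM in bytes
    doc["chip_id"]     = ESP.getChipId();         // unique ESP8266 identifier
    doc["sdk_version"] = ESP.getSdkVersion();     // SDK version string
    doc["cpu_mhz"]     = ESP.getCpuFreqMHz();     // operating frequency in MHz
    doc["flash_size"]  = ESP.getFlashChipSize();  // total flash in bytes
}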

API Endpoint Registry

  • Dynamic Discovery: Automatically detect available endpoints
  • Method Information: HTTP method (GET, POST, etc.)
  • Service Catalog: Complete service registry across cluster

Health Metrics

  • Response Time: API response latency
  • Uptime: System uptime in milliseconds
  • Connection Status: Network connectivity health
  • Resource Utilization: Memory and CPU usage

WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

Fallback Process

  1. Primary Connection: Attempts to connect to configured WiFi network
  2. Connection Failure: If connection fails, creates an access point
  3. Hostname Generation: Automatically generates hostname from MAC address
  4. Service Continuity: Maintains cluster functionality in fallback mode

Configuration

  • Hostname: Derived from MAC (esp-<mac>) and assigned to ctx.hostname
  • AP Mode: If STA connection fails, device switches to AP mode with configured SSID/password
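
A condensed sketch of the fallback sequence using standard ESP8266WiFi calls. The 15-second timeout, colon stripping, and function layout are assumptions; only the STA-then-AP behavior and the esp-<mac> hostname scheme come from this document.

#include <ESP8266WiFi.h>

// Try STA first; if the connection never comes up, open an access point.
String connectOrFallback(const char* ssid, const char* pass,
                         const char* apSsid, const char* apPass) {
    String host = String("esp-") + WiFi.macAddress();  // MAC-derived hostname
    host.replace(":", "");                             // illustrative normalization
    WiFi.hostname(host);

    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, pass);

    unsigned long start = millis();
    while (WiFi.status() != WL_CONNECTED && millis() - start < 15000) {
        delay(250);                                    // illustrative 15 s timeout
    }

    if (WiFi.status() != WL_CONNECTED) {
        WiFi.mode(WIFI_AP);                            // fallback access point
        WiFi.softAP(apSsid, apPass);
    }
    return host;                                       // becomes ctx.hostname
}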

Cluster Topology

Node Types

  • Master Node: Primary cluster coordinator (if applicable)
  • Worker Nodes: Standard cluster members
  • Edge Nodes: Network edge devices

Network Architecture

  • UDP broadcast-based discovery and heartbeats on local subnet
  • Optional HTTP polling (disabled by default; node info exchanged via UDP)

Data Flow

Node Discovery

  1. UDP Broadcast: Nodes broadcast discovery packets on port 4210
  2. UDP Response: Receiving nodes respond with hostname
  3. Registration: Discovered nodes are added to local cluster member list

Health Monitoring

  1. Periodic Checks: Cluster manager updates node status categories
  2. Status Collection: Each node updates resources via UDP node-info messages

Task Management

  1. Scheduling: TaskManager executes registered tasks at configured intervals
  2. Execution: Tasks run cooperatively in the main loop without preemption
  3. Monitoring: Task status is exposed via REST (/api/tasks/status)
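
To make the execution model concrete, here is a minimal cooperative loop in the spirit of the description above. It is not the actual TaskManager internals: each task remembers its interval and last run time, and the loop runs whichever tasks are due, one after another, without preemption.

#include <cstddef>

struct Task {
    const char* name;
    unsigned long intervalMs;
    unsigned long lastRunMs = 0;
    bool enabled = true;
    void (*run)();
};

// Called from the main loop; runs each due task once per pass.
void runDueTasks(Task* tasks, size_t count, unsigned long nowMs) {
    for (size_t i = 0; i < count; ++i) {
        Task& t = tasks[i];
        if (t.enabled && nowMs - t.lastRunMs >= t.intervalMs) {
            t.lastRunMs = nowMs;
            t.run();
        }
    }
}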

Performance Characteristics

Memory Usage

  • Base System: ~15-20KB RAM (device dependent)
  • Per Task: ~100-200 bytes per task
  • Cluster Members: ~50-100 bytes per member
  • API Endpoints: ~20-30 bytes per endpoint

Network Overhead

  • Discovery Packets: 64 bytes every 1 second
  • Health Checks: ~200-500 bytes every 1 second
  • Status Updates: ~1-2KB per node
  • API Responses: Varies by endpoint (typically 100B-5KB)

Processing Overhead

  • Task Execution: Minimal overhead per task
  • Event Processing: Fast event dispatch
  • JSON Parsing: Efficient ArduinoJson usage
  • Network I/O: Asynchronous operations

Security Considerations

Current Implementation

  • Network Access: Local network only (no internet exposure)
  • Authentication: None currently implemented; LAN-only access assumed
  • Data Validation: Basic input validation
  • Resource Limits: Memory and processing constraints

Future Enhancements

  • TLS/SSL: Encrypted communications
  • API Keys: Authentication for API access
  • Access Control: Role-based permissions
  • Audit Logging: Security event tracking

Scalability

Cluster Size Limits

  • Theoretical: Up to ~254 nodes (usable host addresses on a /24 subnet)
  • Practical: 20-50 nodes for optimal performance
  • Memory Constraint: ~8KB available for member tracking
  • Network Constraint: UDP packet size limits

Performance Scaling

  • Linear Scaling: Most operations scale linearly with node count
  • Discovery Overhead: Increases with cluster size
  • Health Monitoring: UDP heartbeat/node-info exchange; optional HTTP polling adds per-node requests
  • Task Management: Independent per-node execution

Configuration Management

SPORE implements a persistent configuration system that manages device settings across reboots and provides runtime reconfiguration capabilities.

Configuration Architecture

The configuration system consists of several key components:

  • Config Class: Central configuration management with default constants
  • LittleFS Storage: Persistent file-based storage (/config.json)
  • Runtime Updates: Live configuration changes via HTTP API
  • Automatic Persistence: Configuration changes are automatically saved

Configuration Categories

| Category | Description | Examples |
|---|---|---|
| WiFi Configuration | Network connection settings | SSID, password, timeouts |
| Network Configuration | Network service settings | UDP port, API server port |
| Cluster Configuration | Cluster management settings | Discovery intervals, heartbeat timing |
| Node Status Thresholds | Health monitoring thresholds | Active/inactive/dead timeouts |
| System Configuration | Core system settings | Restart delay, JSON document size |
| Memory Management | Resource management settings | Memory thresholds, HTTP request limits |

Configuration Lifecycle

  1. Boot Process: Load configuration from /config.json or use defaults
  2. Runtime Updates: Configuration changes via HTTP API
  3. Persistent Storage: Changes automatically saved to LittleFS
  4. Service Integration: Configuration applied to all system services
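
A sketch of step 1, assuming LittleFS and ArduinoJson 7. Only values documented elsewhere in this file (udp_port, heartbeat_interval_ms) are shown; the JSON key names and the Config member layout are assumptions.

#include <ArduinoJson.h>
#include <LittleFS.h>

// Load /config.json if it exists; otherwise the compile-time defaults stay in effect.
void loadConfig(Config& cfg) {
    if (!LittleFS.begin()) return;                 // keep defaults on FS failure

    File f = LittleFS.open("/config.json", "r");
    if (!f) return;                                // no file yet: defaults apply

    JsonDocument doc;
    if (deserializeJson(doc, f) == DeserializationError::Ok) {
        // Override only the values present in the file.
        cfg.udp_port              = doc["udp_port"]              | cfg.udp_port;
        cfg.heartbeat_interval_ms = doc["heartbeat_interval_ms"] | cfg.heartbeat_interval_ms;
    }
    f.close();
}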

Default Value Management

All default values are defined as constexpr constants in the Config class:

static constexpr const char* DEFAULT_WIFI_SSID = "shroud";
static constexpr uint16_t DEFAULT_UDP_PORT = 4210;
static constexpr unsigned long DEFAULT_HEARTBEAT_INTERVAL_MS = 5000;

This ensures:

  • Single Source of Truth: All defaults defined once
  • Type Safety: Compile-time type checking
  • Maintainability: Easy to update default values
  • Consistency: Same defaults used in setDefaults() and loadFromFile()

Environment Variables

# API node IP for cluster management
export API_NODE=192.168.1.100

PlatformIO Configuration

The project uses PlatformIO with the following configuration:

  • Framework: Arduino
  • Board: ESP-01 with 1MB flash
  • Upload Speed: 115200 baud
  • Flash Mode: DOUT (required for ESP-01S)

Dependencies

The project requires the following libraries:

  • esp32async/ESPAsyncWebServer@^3.8.0 - HTTP API server
  • bblanchon/ArduinoJson@^7.4.2 - JSON processing
  • arkhipenko/TaskScheduler@^3.8.5 - Cooperative multitasking

Development Workflow

Building

Build the firmware for specific chip:

./ctl.sh build target esp01_1m

Flashing

Flash firmware to a connected device:

./ctl.sh flash target esp01_1m

Over-The-Air Updates

Update a specific node:

./ctl.sh ota update 192.168.1.100 esp01_1m

Update all nodes in the cluster:

./ctl.sh ota all esp01_1m

Cluster Management

View cluster members:

./ctl.sh cluster members

Troubleshooting

Common Issues

  1. Discovery Failures: Check UDP port 4210 is not blocked
  2. WiFi Connection: Verify SSID/password in Config.cpp
  3. OTA Updates: Ensure sufficient flash space (1MB minimum)
  4. Cluster Split: Check network connectivity between nodes

Debug Output

Enable serial monitoring to see cluster activity:

pio device monitor

Performance Monitoring

  • Memory Usage: Monitor free heap with /api/node/status
  • Task Health: Check task status with /api/tasks/status
  • Cluster Health: Monitor member status with /api/cluster/members
  • Network Latency: Track response times in cluster data