spore/docs/Architecture.md
SPORE Architecture & Implementation

System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

Core Components

The system architecture consists of several key components working together:

Network Manager

  • WiFi Connection Handling: Automatic WiFi STA/AP configuration
  • Hostname Configuration: MAC-based hostname generation
  • Fallback Management: Automatic access point creation if WiFi connection fails

Cluster Manager

  • Node Discovery: UDP-based automatic node detection
  • Member List Management: Dynamic cluster membership tracking
  • Health Monitoring: Continuous node status checking
  • Resource Tracking: Monitor node resources and capabilities

API Server

  • HTTP API Server: RESTful API for cluster management
  • Dynamic Endpoint Registration: Automatic API endpoint discovery
  • Service Registry: Track available services across the cluster

Task Scheduler

  • Cooperative Multitasking: Background task management system
  • Task Lifecycle Management: Automatic task execution and monitoring
  • Resource Optimization: Efficient task scheduling and execution

Node Context

  • Central Context: Shared resources and configuration
  • Event System: Local and cluster-wide event publishing/subscription
  • Resource Management: Centralized resource allocation and monitoring

Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

Discovery Process

  1. Discovery Broadcast: Nodes periodically send UDP packets on port 4210
  2. Response Handling: Nodes respond with their hostname and IP address
  3. Member Management: Discovered nodes are automatically added to the cluster
  4. Health Monitoring: Continuous status checking via HTTP API calls

Protocol Details

  • UDP Port: 4210 (configurable)
  • Discovery Message: CLUSTER_DISCOVERY
  • Response Message: CLUSTER_RESPONSE
  • Broadcast Address: 255.255.255.255
  • Discovery Interval: 1 second (configurable)
  • Listen Interval: 100ms (configurable)

Node Status Categories

Nodes are automatically categorized by their activity:

  • ACTIVE: Responding within 10 seconds
  • INACTIVE: No response for 10-60 seconds
  • DEAD: No response for over 60 seconds

Task Scheduling System

The system runs several background tasks at different intervals:

Core System Tasks

| Task | Interval | Purpose |
| --- | --- | --- |
| Discovery Send | 1 second | Send UDP discovery packets |
| Discovery Listen | 100ms | Listen for discovery responses |
| Status Updates | 1 second | Monitor cluster member health |
| Heartbeat | 2 seconds | Maintain cluster connectivity |
| Member Info | 10 seconds | Update detailed node information |
| Debug Output | 5 seconds | Print cluster status |

Task Management Features

  • Dynamic Intervals: Change execution frequency on-the-fly
  • Runtime Control: Enable/disable tasks without restart
  • Status Monitoring: Real-time task health tracking
  • Resource Integration: View task status with system resources

Event System

The NodeContext provides an event-driven architecture for system-wide communication:

Event Subscription

```cpp
// Subscribe to events
ctx.on("node_discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster_updated", [](void* data) {
    // Handle cluster membership changes
});
```

Event Publishing

```cpp
// Publish events
ctx.fire("node_discovered", &newNode);
ctx.fire("cluster_updated", &clusterData);
```

Available Events

  • node_discovered: New node added to cluster
  • cluster_updated: Cluster membership changed
  • resource_update: Node resources updated
  • health_check: Node health status changed

Resource Monitoring

Each node tracks comprehensive system resources:

System Resources

  • Free Heap Memory: Available RAM in bytes
  • Chip ID: Unique ESP8266 identifier
  • SDK Version: ESP8266 firmware version
  • CPU Frequency: Operating frequency in MHz
  • Flash Chip Size: Total flash storage in bytes

API Endpoint Registry

  • Dynamic Discovery: Automatically detect available endpoints
  • Method Information: HTTP method (GET, POST, etc.)
  • Service Catalog: Complete service registry across cluster

Health Metrics

  • Response Time: API response latency
  • Uptime: System uptime in milliseconds
  • Connection Status: Network connectivity health
  • Resource Utilization: Memory and CPU usage

WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

Fallback Process

  1. Primary Connection: Attempts to connect to configured WiFi network
  2. Connection Failure: If the connection fails, the node creates its own access point
  3. Hostname Generation: Automatically generates hostname from MAC address
  4. Service Continuity: Maintains cluster functionality in fallback mode

Configuration

  • SSID Format: SPORE_<MAC_LAST_4>
  • Password: Configurable fallback password
  • IP Range: 192.168.4.x subnet
  • Gateway: 192.168.4.1

Cluster Topology

Node Types

  • Master Node: Primary cluster coordinator (if applicable)
  • Worker Nodes: Standard cluster members
  • Edge Nodes: Network edge devices

Network Architecture

  • Mesh-like Structure: Nodes can communicate with each other
  • Dynamic Routing: Automatic path discovery between nodes
  • Load Distribution: Tasks distributed across available nodes
  • Fault Tolerance: Automatic failover and recovery

Data Flow

Node Discovery

  1. UDP Broadcast: Nodes broadcast discovery packets on port 4210
  2. UDP Response: Receiving nodes respond with their hostname
  3. Registration: Discovered nodes are added to local cluster member list

Health Monitoring

  1. Periodic Checks: Cluster manager polls member nodes every 1 second
  2. Status Collection: Each node returns resource usage and health metrics

Task Management

  1. Scheduling: TaskScheduler executes registered tasks at configured intervals
  2. Execution: Tasks run cooperatively, yielding control to other tasks
  3. Monitoring: Task status and results are exposed via REST API endpoints

Performance Characteristics

Memory Usage

  • Base System: ~15-20KB RAM
  • Per Task: ~100-200 bytes per task
  • Cluster Members: ~50-100 bytes per member
  • API Endpoints: ~20-30 bytes per endpoint

Network Overhead

  • Discovery Packets: 64 bytes every 1 second
  • Health Checks: ~200-500 bytes every 1 second
  • Status Updates: ~1-2KB per node
  • API Responses: Varies by endpoint (typically 100B-5KB)

Processing Overhead

  • Task Execution: Minimal overhead per task
  • Event Processing: Fast event dispatch
  • JSON Parsing: Efficient ArduinoJson usage
  • Network I/O: Asynchronous operations

Security Considerations

Current Implementation

  • Network Access: Local network only (no internet exposure)
  • Authentication: None currently implemented
  • Data Validation: Basic input validation
  • Resource Limits: Memory and processing constraints

Future Enhancements

  • TLS/SSL: Encrypted communications
  • API Keys: Authentication for API access
  • Access Control: Role-based permissions
  • Audit Logging: Security event tracking

Scalability

Cluster Size Limits

  • Theoretical: ~254 nodes (usable host addresses in a /24 subnet)
  • Practical: 20-50 nodes for optimal performance
  • Memory Constraint: ~8KB available for member tracking
  • Network Constraint: UDP packet size limits

Performance Scaling

  • Linear Scaling: Most operations scale linearly with node count
  • Discovery Overhead: Increases with cluster size
  • Health Monitoring: Parallel HTTP requests
  • Task Management: Independent per-node execution

Configuration Management

Environment Variables

```sh
# API node IP for cluster management
export API_NODE=192.168.1.100
```

PlatformIO Configuration

The project uses PlatformIO with the following configuration:

  • Framework: Arduino
  • Board: ESP-01 with 1MB flash
  • Upload Speed: 115200 baud
  • Flash Mode: DOUT (required for ESP-01S)

Dependencies

The project requires the following libraries:

  • esp32async/ESPAsyncWebServer@^3.8.0 - HTTP API server
  • bblanchon/ArduinoJson@^7.4.2 - JSON processing
  • arkhipenko/TaskScheduler@^3.8.5 - Cooperative multitasking

Development Workflow

Building

Build the firmware for specific chip:

```sh
./ctl.sh build target esp01_1m
```

Flashing

Flash firmware to a connected device:

```sh
./ctl.sh flash target esp01_1m
```

Over-The-Air Updates

Update a specific node:

```sh
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```sh
./ctl.sh ota all esp01_1m
```

Cluster Management

View cluster members:

```sh
./ctl.sh cluster members
```

Troubleshooting

Common Issues

  1. Discovery Failures: Check UDP port 4210 is not blocked
  2. WiFi Connection: Verify SSID/password in Config.cpp
  3. OTA Updates: Ensure sufficient flash space (1MB minimum)
  4. Cluster Split: Check network connectivity between nodes

Debug Output

Enable serial monitoring to see cluster activity:

```sh
pio device monitor
```

Performance Monitoring

  • Memory Usage: Monitor free heap with /api/node/status
  • Task Health: Check task status with /api/tasks/status
  • Cluster Health: Monitor member status with /api/cluster/members
  • Network Latency: Track response times in cluster data