feat: task manager endpoint, updated documentation

2025-08-22 15:47:08 +02:00
parent d7d307e3ce
commit 30a5f8b8cb
14 changed files with 2550 additions and 551 deletions

docs/Architecture.md (new file, +358 lines)
# SPORE Architecture & Implementation
## System Overview
SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.
## Core Components
The system architecture consists of several key components working together:
### Network Manager
- **WiFi Connection Handling**: Automatic WiFi STA/AP configuration
- **Hostname Configuration**: MAC-based hostname generation
- **Fallback Management**: Automatic access point creation if WiFi connection fails
### Cluster Manager
- **Node Discovery**: UDP-based automatic node detection
- **Member List Management**: Dynamic cluster membership tracking
- **Health Monitoring**: Continuous node status checking
- **Resource Tracking**: Monitor node resources and capabilities
### API Server
- **HTTP API Server**: RESTful API for cluster management
- **Dynamic Endpoint Registration**: Automatic API endpoint discovery
- **Service Registry**: Track available services across the cluster
### Task Scheduler
- **Cooperative Multitasking**: Background task management system
- **Task Lifecycle Management**: Automatic task execution and monitoring
- **Resource Optimization**: Efficient task scheduling and execution
### Node Context
- **Central Context**: Shared resources and configuration
- **Event System**: Local and cluster-wide event publishing/subscription
- **Resource Management**: Centralized resource allocation and monitoring
## Auto Discovery Protocol
The cluster uses a UDP-based discovery protocol for automatic node detection:
### Discovery Process
1. **Discovery Broadcast**: Nodes periodically send UDP packets on port 4210
2. **Response Handling**: Nodes respond with their hostname and IP address
3. **Member Management**: Discovered nodes are automatically added to the cluster
4. **Health Monitoring**: Continuous status checking via HTTP API calls
### Protocol Details
- **UDP Port**: 4210 (configurable)
- **Discovery Message**: `CLUSTER_DISCOVERY`
- **Response Message**: `CLUSTER_RESPONSE`
- **Broadcast Address**: 255.255.255.255
- **Discovery Interval**: 1 second (configurable)
- **Listen Interval**: 100ms (configurable)
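The request/response half of the protocol can be sketched as a pure message-handling function, decoupled from the UDP socket code. The function name and the `hostname:ip` response payload are illustrative assumptions; the firmware's actual wire format may differ:

```cpp
#include <string>

// Hypothetical sketch of the discovery handshake. A node that receives
// the discovery probe answers with its identity; any other payload is
// ignored. The real firmware may encode the response differently.
const std::string DISCOVERY_MSG = "CLUSTER_DISCOVERY";
const std::string RESPONSE_MSG  = "CLUSTER_RESPONSE";

// Build the reply for an incoming UDP payload, or "" if no reply is due.
std::string handleDiscoveryMessage(const std::string& payload,
                                   const std::string& hostname,
                                   const std::string& ip) {
    if (payload == DISCOVERY_MSG) {
        // Respond with our identity so the sender can add us as a member.
        return RESPONSE_MSG + ":" + hostname + ":" + ip;
    }
    return "";  // not a discovery probe; stay silent
}
```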
### Node Status Categories
Nodes are automatically categorized by their activity:
- **ACTIVE**: Last response within the past 10 seconds
- **INACTIVE**: No response for 10-60 seconds
- **DEAD**: No response for more than 60 seconds
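The thresholds above can be expressed as a small classification function (the enum and function names are assumptions, not the firmware's actual identifiers):

```cpp
#include <cstdint>

// Sketch of the documented status thresholds. `sinceLastSeenMs` is the
// time elapsed since the node last responded; the 10 s / 60 s cutoffs
// mirror the categories described above.
enum class NodeStatus { ACTIVE, INACTIVE, DEAD };

NodeStatus classifyNode(uint32_t sinceLastSeenMs) {
    if (sinceLastSeenMs <= 10000) return NodeStatus::ACTIVE;   // within 10 s
    if (sinceLastSeenMs <= 60000) return NodeStatus::INACTIVE; // 10-60 s
    return NodeStatus::DEAD;                                   // over 60 s
}
```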
## Task Scheduling System
The system runs several background tasks at different intervals:
### Core System Tasks
| Task | Interval | Purpose |
|------|----------|---------|
| **Discovery Send** | 1 second | Send UDP discovery packets |
| **Discovery Listen** | 100ms | Listen for discovery responses |
| **Status Updates** | 1 second | Monitor cluster member health |
| **Heartbeat** | 2 seconds | Maintain cluster connectivity |
| **Member Info** | 10 seconds | Update detailed node information |
| **Debug Output** | 5 seconds | Print cluster status |
### Task Management Features
- **Dynamic Intervals**: Change execution frequency on-the-fly
- **Runtime Control**: Enable/disable tasks without restart
- **Status Monitoring**: Real-time task health tracking
- **Resource Integration**: View task status with system resources
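The interval-driven pattern behind these tasks can be sketched as follows. This is a simplified stand-in, not the API of the TaskScheduler library the project actually depends on; it only illustrates how per-task intervals, runtime enable/disable, and a cooperative `tick()` loop fit together:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Minimal cooperative scheduler sketch. Each task stores its interval
// and can be retuned or disabled at runtime without a restart.
struct Task {
    uint32_t intervalMs = 0;
    uint32_t lastRunMs = 0;
    bool enabled = true;
    std::function<void()> callback;
};

struct Scheduler {
    std::vector<Task> tasks;

    // Called from the main loop: run every enabled task whose interval
    // has elapsed since its last execution.
    void tick(uint32_t nowMs) {
        for (Task& t : tasks) {
            if (t.enabled && nowMs - t.lastRunMs >= t.intervalMs) {
                t.lastRunMs = nowMs;
                t.callback();
            }
        }
    }
};
```

With this shape, a 1-second discovery-send task and a 100 ms listen task register once and are then driven entirely from the main loop; changing `intervalMs` or `enabled` at runtime changes behavior immediately.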
## Event System
The `NodeContext` provides an event-driven architecture for system-wide communication:
### Event Subscription
```cpp
// Subscribe to events
ctx.on("node_discovered", [](void* data) {
NodeInfo* node = static_cast<NodeInfo*>(data);
// Handle new node discovery
});
ctx.on("cluster_updated", [](void* data) {
// Handle cluster membership changes
});
```
### Event Publishing
```cpp
// Publish events
ctx.fire("node_discovered", &newNode);
ctx.fire("cluster_updated", &clusterData);
```
### Available Events
- **`node_discovered`**: New node added to cluster
- **`cluster_updated`**: Cluster membership changed
- **`resource_update`**: Node resources updated
- **`health_check`**: Node health status changed
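A minimal implementation of the `on()`/`fire()` interface shown above could look like this. The real `NodeContext` internals may differ; this sketch only illustrates the publish/subscribe mechanics:

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Plausible minimal event bus matching the ctx.on()/ctx.fire() usage
// shown above. Handlers receive an untyped pointer, as in the examples.
class EventBus {
public:
    using Handler = std::function<void(void*)>;

    // Register a handler for a named event.
    void on(const std::string& event, Handler h) {
        handlers_[event].push_back(std::move(h));
    }

    // Invoke all handlers registered for the event, in order.
    void fire(const std::string& event, void* data) {
        auto it = handlers_.find(event);
        if (it == handlers_.end()) return;  // no subscribers yet
        for (auto& h : it->second) h(data);
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```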
## Resource Monitoring
Each node tracks comprehensive system resources:
### System Resources
- **Free Heap Memory**: Available RAM in bytes
- **Chip ID**: Unique ESP8266 identifier
- **SDK Version**: ESP8266 firmware version
- **CPU Frequency**: Operating frequency in MHz
- **Flash Chip Size**: Total flash storage in bytes
### API Endpoint Registry
- **Dynamic Discovery**: Automatically detect available endpoints
- **Method Information**: HTTP method (GET, POST, etc.)
- **Service Catalog**: Complete service registry across cluster
### Health Metrics
- **Response Time**: API response latency
- **Uptime**: System uptime in milliseconds
- **Connection Status**: Network connectivity health
- **Resource Utilization**: Memory and CPU usage
## WiFi Fallback System
The system includes automatic WiFi fallback for robust operation:
### Fallback Process
1. **Primary Connection**: Attempts to connect to configured WiFi network
2. **Connection Failure**: If connection fails, creates an access point
3. **Hostname Generation**: Automatically generates hostname from MAC address
4. **Service Continuity**: Maintains cluster functionality in fallback mode
### Configuration
- **SSID Format**: `SPORE_<MAC_LAST_4>`
- **Password**: Configurable fallback password
- **IP Range**: 192.168.4.x subnet
- **Gateway**: 192.168.4.1
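The `SPORE_<MAC_LAST_4>` scheme can be sketched as a small helper. The function name and the colon-separated MAC input format are assumptions for illustration:

```cpp
#include <string>

// Sketch of the documented fallback SSID scheme: strip separators from
// the MAC address and append its last four hex digits to "SPORE_".
std::string fallbackSsid(const std::string& mac) {
    std::string digits;
    for (char c : mac) {
        if (c != ':') digits += c;  // drop the ':' separators
    }
    // e.g. "A1B2C3D4E5F6" -> "E5F6"
    return "SPORE_" + digits.substr(digits.size() - 4);
}
```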
## Cluster Topology
### Node Types
- **Master Node**: Primary cluster coordinator (if applicable)
- **Worker Nodes**: Standard cluster members
- **Edge Nodes**: Network edge devices
### Network Architecture
- **Mesh-like Structure**: Nodes can communicate with each other
- **Dynamic Routing**: Automatic path discovery between nodes
- **Load Distribution**: Tasks distributed across available nodes
- **Fault Tolerance**: Automatic failover and recovery
## Data Flow
### Discovery Flow
```
Node A → UDP Broadcast → Node B
Node B → HTTP Response → Node A
Node A → Add to Cluster → Update Member List
```
### Health Monitoring Flow
```
Cluster Manager → HTTP Request → Node Status
Node → JSON Response → Resource Information
Cluster Manager → Update Health → Fire Events
```
### Task Execution Flow
```
Task Scheduler → Check Intervals → Execute Tasks
Task → Update Status → API Server
API Server → JSON Response → Client
```
## Performance Characteristics
### Memory Usage
- **Base System**: ~15-20KB RAM
- **Per Task**: ~100-200 bytes per task
- **Cluster Members**: ~50-100 bytes per member
- **API Endpoints**: ~20-30 bytes per endpoint
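These figures allow a back-of-the-envelope budget check. The helper below takes the upper bound of each range; it is purely illustrative, since real usage depends on the build and runtime state:

```cpp
#include <cstdint>

// Rough RAM estimate from the documented upper bounds: ~20 KB base,
// ~200 B per task, ~100 B per cluster member, ~30 B per API endpoint.
uint32_t estimateRamBytes(uint32_t tasks, uint32_t members, uint32_t endpoints) {
    const uint32_t baseBytes   = 20 * 1024;
    const uint32_t perTask     = 200;
    const uint32_t perMember   = 100;
    const uint32_t perEndpoint = 30;
    return baseBytes + tasks * perTask + members * perMember
                     + endpoints * perEndpoint;
}
```

For example, 6 core tasks, 20 members, and 15 endpoints come to roughly 24 KB, which leaves headroom on a typical ESP8266 heap.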
### Network Overhead
- **Discovery Packets**: 64 bytes per second
- **Health Checks**: ~200-500 bytes per second
- **Status Updates**: ~1-2KB per node
- **API Responses**: Varies by endpoint (typically 100B-5KB)
### Processing Overhead
- **Task Execution**: Minimal overhead per task
- **Event Processing**: Fast event dispatch
- **JSON Parsing**: Efficient ArduinoJson usage
- **Network I/O**: Asynchronous operations
## Security Considerations
### Current Implementation
- **Network Access**: Local network only (no internet exposure)
- **Authentication**: None currently implemented
- **Data Validation**: Basic input validation
- **Resource Limits**: Memory and processing constraints
### Future Enhancements
- **TLS/SSL**: Encrypted communications
- **API Keys**: Authentication for API access
- **Access Control**: Role-based permissions
- **Audit Logging**: Security event tracking
## Scalability
### Cluster Size Limits
- **Theoretical**: Up to ~254 nodes (usable host addresses in a /24 subnet)
- **Practical**: 20-50 nodes for optimal performance
- **Memory Constraint**: ~8KB available for member tracking
- **Network Constraint**: UDP packet size limits
### Performance Scaling
- **Linear Scaling**: Most operations scale linearly with node count
- **Discovery Overhead**: Increases with cluster size
- **Health Monitoring**: Parallel HTTP requests
- **Task Management**: Independent per-node execution
## Configuration Management
### Environment Variables
```bash
# API node IP for cluster management
export API_NODE=192.168.1.100
# Cluster configuration
export CLUSTER_PORT=4210
export DISCOVERY_INTERVAL=1000
export HEALTH_CHECK_INTERVAL=1000
```
### PlatformIO Configuration
The project uses PlatformIO with the following configuration:
- **Framework**: Arduino
- **Board**: ESP-01 with 1MB flash
- **Upload Speed**: 115200 baud
- **Flash Mode**: DOUT (required for ESP-01S)
### Dependencies
The project requires the following libraries:
- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking
## Development Workflow
### Building
Build the firmware for specific chip:
```bash
./ctl.sh build target esp01_1m
```
### Flashing
Flash firmware to a connected device:
```bash
./ctl.sh flash target esp01_1m
```
### Over-The-Air Updates
Update a specific node:
```bash
./ctl.sh ota update 192.168.1.100 esp01_1m
```
Update all nodes in the cluster:
```bash
./ctl.sh ota all esp01_1m
```
### Cluster Management
View cluster members:
```bash
./ctl.sh cluster members
```
## Troubleshooting
### Common Issues
1. **Discovery Failures**: Check UDP port 4210 is not blocked
2. **WiFi Connection**: Verify SSID/password in Config.cpp
3. **OTA Updates**: Ensure sufficient flash space (1MB minimum)
4. **Cluster Split**: Check network connectivity between nodes
### Debug Output
Enable serial monitoring to see cluster activity:
```bash
pio device monitor
```
### Performance Monitoring
- **Memory Usage**: Monitor free heap with `/api/node/status`
- **Task Health**: Check task status with `/api/tasks/status`
- **Cluster Health**: Monitor member status with `/api/cluster/members`
- **Network Latency**: Track response times in cluster data
## Related Documentation
- **[Task Management](./TaskManagement.md)** - Background task system
- **[API Reference](./API.md)** - REST API documentation
- **[TaskManager API](./TaskManager.md)** - TaskManager class reference
- **[OpenAPI Specification](../api/)** - Machine-readable API specification