feat: task manager endpoint, updated documentation
358
docs/Architecture.md
Normal file
@@ -0,0 +1,358 @@
# SPORE Architecture & Implementation

## System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

## Core Components

The system architecture consists of several key components working together:

### Network Manager
- **WiFi Connection Handling**: Automatic WiFi STA/AP configuration
- **Hostname Configuration**: MAC-based hostname generation
- **Fallback Management**: Automatic access point creation if WiFi connection fails

### Cluster Manager
- **Node Discovery**: UDP-based automatic node detection
- **Member List Management**: Dynamic cluster membership tracking
- **Health Monitoring**: Continuous node status checking
- **Resource Tracking**: Monitor node resources and capabilities

### API Server
- **HTTP API Server**: RESTful API for cluster management
- **Dynamic Endpoint Registration**: Automatic API endpoint discovery
- **Service Registry**: Track available services across the cluster

### Task Scheduler
- **Cooperative Multitasking**: Background task management system
- **Task Lifecycle Management**: Automatic task execution and monitoring
- **Resource Optimization**: Efficient task scheduling and execution

### Node Context
- **Central Context**: Shared resources and configuration
- **Event System**: Local and cluster-wide event publishing/subscription
- **Resource Management**: Centralized resource allocation and monitoring

## Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

### Discovery Process

1. **Discovery Broadcast**: Nodes periodically send UDP packets on port 4210
2. **Response Handling**: Nodes respond with their hostname and IP address
3. **Member Management**: Discovered nodes are automatically added to the cluster
4. **Health Monitoring**: Continuous status checking via HTTP API calls

### Protocol Details

- **UDP Port**: 4210 (configurable)
- **Discovery Message**: `CLUSTER_DISCOVERY`
- **Response Message**: `CLUSTER_RESPONSE`
- **Broadcast Address**: 255.255.255.255
- **Discovery Interval**: 1 second (configurable)
- **Listen Interval**: 100ms (configurable)
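
The message handling implied by these details can be sketched in plain C++, independent of any socket library. The port and the two message literals come from the list above; the function names and the `CLUSTER_RESPONSE:<hostname>` reply layout are assumptions made for illustration, not the actual wire format.

```cpp
#include <string>

// Values taken from the protocol details above.
constexpr unsigned short kClusterPort = 4210;
constexpr const char* kDiscoveryMsg = "CLUSTER_DISCOVERY";
constexpr const char* kResponseMsg = "CLUSTER_RESPONSE";

// Builds the reply for an incoming datagram, or an empty string when the
// datagram is not a discovery request.
// ASSUMPTION: the reply carries the hostname after a ':' separator.
std::string replyFor(const std::string& datagram, const std::string& hostname) {
    if (datagram == kDiscoveryMsg) {
        return std::string(kResponseMsg) + ":" + hostname;
    }
    return std::string();
}

// Extracts the hostname from a reply, or returns an empty string when the
// datagram does not match the assumed reply layout.
std::string hostnameFrom(const std::string& datagram) {
    const std::string prefix = std::string(kResponseMsg) + ":";
    if (datagram.size() <= prefix.size() ||
        datagram.compare(0, prefix.size(), prefix) != 0) {
        return std::string();
    }
    return datagram.substr(prefix.size());
}
```

On a device, `replyFor` would sit behind whatever UDP receive loop runs every listen interval; keeping it free of I/O makes the parsing testable on a host machine.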

### Node Status Categories

Nodes are automatically categorized by their activity:

- **ACTIVE**: Responding within 10 seconds
- **INACTIVE**: No response for 10-60 seconds
- **DEAD**: No response for over 60 seconds
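
As a minimal sketch, the thresholds above map onto a small classification function. The enum, constant, and function names are hypothetical, and the treatment of the exact 10 second and 60 second boundaries is an assumption, since the list does not specify it.

```cpp
#include <cstdint>

enum class NodeStatus { ACTIVE, INACTIVE, DEAD };

// Thresholds from the list above, in milliseconds.
constexpr uint32_t kInactiveAfterMs = 10000;
constexpr uint32_t kDeadAfterMs = 60000;

// Classifies a node by how long it has been silent.
// ASSUMPTION: exactly 10 s counts as INACTIVE, exactly 60 s still as INACTIVE.
NodeStatus classify(uint32_t nowMs, uint32_t lastSeenMs) {
    // Unsigned subtraction stays correct across a millisecond-counter rollover.
    const uint32_t silentMs = nowMs - lastSeenMs;
    if (silentMs > kDeadAfterMs) return NodeStatus::DEAD;
    if (silentMs >= kInactiveAfterMs) return NodeStatus::INACTIVE;
    return NodeStatus::ACTIVE;
}
```

Computing the elapsed time with unsigned arithmetic matters on long-running nodes, because a 32-bit millisecond counter wraps after roughly 49.7 days.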

## Task Scheduling System

The system runs several background tasks at different intervals:

### Core System Tasks

| Task | Interval | Purpose |
|------|----------|---------|
| **Discovery Send** | 1 second | Send UDP discovery packets |
| **Discovery Listen** | 100ms | Listen for discovery responses |
| **Status Updates** | 1 second | Monitor cluster member health |
| **Heartbeat** | 2 seconds | Maintain cluster connectivity |
| **Member Info** | 10 seconds | Update detailed node information |
| **Debug Output** | 5 seconds | Print cluster status |

### Task Management Features

- **Dynamic Intervals**: Change execution frequency on-the-fly
- **Runtime Control**: Enable/disable tasks without restart
- **Status Monitoring**: Real-time task health tracking
- **Resource Integration**: View task status with system resources
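
To make the interval and runtime-control ideas concrete, here is a stand-alone sketch of a cooperative scheduler. It is not the TaskScheduler library's API and not SPORE's own task manager; `MiniScheduler` and all of its members are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Task {
    std::string name;
    uint32_t intervalMs;
    std::function<void()> run;
    bool enabled;
    uint32_t lastRunMs;
};

class MiniScheduler {
public:
    void add(const std::string& name, uint32_t intervalMs, std::function<void()> fn) {
        tasks_.push_back(Task{name, intervalMs, std::move(fn), true, 0});
    }

    // Changes a task's interval at runtime; returns false if the name is unknown.
    bool setInterval(const std::string& name, uint32_t intervalMs) {
        Task* t = find(name);
        if (t == nullptr) return false;
        t->intervalMs = intervalMs;
        return true;
    }

    // Enables or disables a task without removing it.
    bool setEnabled(const std::string& name, bool enabled) {
        Task* t = find(name);
        if (t == nullptr) return false;
        t->enabled = enabled;
        return true;
    }

    // Call from the main loop: runs every enabled task whose interval has elapsed.
    void tick(uint32_t nowMs) {
        for (auto& t : tasks_) {
            if (t.enabled && nowMs - t.lastRunMs >= t.intervalMs) {
                t.lastRunMs = nowMs;
                t.run();
            }
        }
    }

private:
    Task* find(const std::string& name) {
        for (auto& t : tasks_) {
            if (t.name == name) return &t;
        }
        return nullptr;
    }

    std::vector<Task> tasks_;
};
```

Each task runs to completion inside `tick`, so a slow task delays every other task; that is the usual trade-off of cooperative multitasking on a single-core microcontroller.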

## Event System

The `NodeContext` provides an event-driven architecture for system-wide communication:

### Event Subscription

```cpp
// Subscribe to events
ctx.on("node_discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster_updated", [](void* data) {
    // Handle cluster membership changes
});
```

### Event Publishing

```cpp
// Publish events
ctx.fire("node_discovered", &newNode);
ctx.fire("cluster_updated", &clusterData);
```

### Available Events

- **`node_discovered`**: New node added to cluster
- **`cluster_updated`**: Cluster membership changed
- **`resource_update`**: Node resources updated
- **`health_check`**: Node health status changed
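
For readers who want to see how an `on`/`fire` pair of this shape can work, the following is a minimal sketch of a publish/subscribe registry. `EventBus` is a hypothetical stand-in; it mirrors the `void*` payload convention of the snippets above but is not the `NodeContext` implementation.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

class EventBus {
public:
    using Handler = std::function<void(void*)>;

    // Registers a handler; several handlers may share one event name.
    void on(const std::string& event, Handler handler) {
        handlers_[event].push_back(std::move(handler));
    }

    // Calls every handler registered for the event, in registration order.
    // Events without subscribers are silently ignored.
    void fire(const std::string& event, void* data) {
        auto it = handlers_.find(event);
        if (it == handlers_.end()) return;
        for (auto& handler : it->second) {
            handler(data);
        }
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```

Because the payload is an untyped pointer, publisher and subscriber must agree on the concrete type for each event name, and the pointed-to object only needs to stay alive for the duration of the `fire` call in this synchronous sketch.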

## Resource Monitoring

Each node tracks comprehensive system resources:

### System Resources

- **Free Heap Memory**: Available RAM in bytes
- **Chip ID**: Unique ESP8266 identifier
- **SDK Version**: ESP8266 firmware version
- **CPU Frequency**: Operating frequency in MHz
- **Flash Chip Size**: Total flash storage in bytes

### API Endpoint Registry

- **Dynamic Discovery**: Automatically detect available endpoints
- **Method Information**: HTTP method (GET, POST, etc.)
- **Service Catalog**: Complete service registry across cluster

### Health Metrics

- **Response Time**: API response latency
- **Uptime**: System uptime in milliseconds
- **Connection Status**: Network connectivity health
- **Resource Utilization**: Memory and CPU usage

## WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

### Fallback Process

1. **Primary Connection**: Attempts to connect to configured WiFi network
2. **Connection Failure**: If connection fails, creates an access point
3. **Hostname Generation**: Automatically generates hostname from MAC address
4. **Service Continuity**: Maintains cluster functionality in fallback mode

### Configuration

- **SSID Format**: `SPORE_<MAC_LAST_4>`
- **Password**: Configurable fallback password
- **IP Range**: 192.168.4.x subnet
- **Gateway**: 192.168.4.1
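
One plausible reading of the `SPORE_<MAC_LAST_4>` format is sketched below. It assumes that `MAC_LAST_4` means the last four hexadecimal digits (the final two bytes) of the MAC address, printed in upper case; the firmware may format it differently, and `fallbackSsid` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Builds the fallback access point SSID from a 6-byte MAC address.
// ASSUMPTION: the suffix is the last two MAC bytes as four upper-case hex digits.
std::string fallbackSsid(const uint8_t mac[6]) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), "SPORE_%02X%02X",
                  static_cast<unsigned>(mac[4]),
                  static_cast<unsigned>(mac[5]));
    return std::string(buf);
}
```

For example, a MAC ending in `AB:0C` would produce the SSID `SPORE_AB0C` under this assumption.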

## Cluster Topology

### Node Types

- **Master Node**: Primary cluster coordinator (if applicable)
- **Worker Nodes**: Standard cluster members
- **Edge Nodes**: Network edge devices

### Network Architecture

- **Mesh-like Structure**: Nodes can communicate with each other
- **Dynamic Routing**: Automatic path discovery between nodes
- **Load Distribution**: Tasks distributed across available nodes
- **Fault Tolerance**: Automatic failover and recovery

## Data Flow

### Discovery Flow

```
Node A → UDP Broadcast → Node B
Node B → UDP Response → Node A
Node A → Add to Cluster → Update Member List
```

### Health Monitoring Flow

```
Cluster Manager → HTTP Request → Node Status
Node → JSON Response → Resource Information
Cluster Manager → Update Health → Fire Events
```

### Task Execution Flow

```
Task Scheduler → Check Intervals → Execute Tasks
Task → Update Status → API Server
API Server → JSON Response → Client
```

## Performance Characteristics

### Memory Usage

- **Base System**: ~15-20KB RAM
- **Per Task**: ~100-200 bytes per task
- **Cluster Members**: ~50-100 bytes per member
- **API Endpoints**: ~20-30 bytes per endpoint

### Network Overhead

- **Discovery Packets**: 64 bytes every 1 second
- **Health Checks**: ~200-500 bytes every 1 second
- **Status Updates**: ~1-2KB per node
- **API Responses**: Varies by endpoint (typically 100B-5KB)

### Processing Overhead

- **Task Execution**: Minimal overhead per task
- **Event Processing**: Fast event dispatch
- **JSON Parsing**: Efficient ArduinoJson usage
- **Network I/O**: Asynchronous operations

## Security Considerations

### Current Implementation

- **Network Access**: Local network only (no internet exposure)
- **Authentication**: None currently implemented
- **Data Validation**: Basic input validation
- **Resource Limits**: Memory and processing constraints

### Future Enhancements

- **TLS/SSL**: Encrypted communications
- **API Keys**: Authentication for API access
- **Access Control**: Role-based permissions
- **Audit Logging**: Security event tracking

## Scalability

### Cluster Size Limits

- **Theoretical**: Up to 254 nodes (usable host addresses in a typical /24 subnet)
- **Practical**: 20-50 nodes for optimal performance
- **Memory Constraint**: ~8KB available for member tracking
- **Network Constraint**: UDP packet size limits
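
The memory constraint can be turned into a rough member count using the per-member figure of ~50-100 bytes given under Memory Usage. This is back-of-the-envelope arithmetic only; it assumes that 8KB means 8192 bytes and that the whole budget goes to member records, and `maxMembers` is a hypothetical helper.

```cpp
#include <cstddef>

// Number of member records that fit in a byte budget (integer division).
constexpr std::size_t maxMembers(std::size_t budgetBytes, std::size_t bytesPerMember) {
    return bytesPerMember == 0 ? 0 : budgetBytes / bytesPerMember;
}

// Budget and per-member sizes taken from the figures in this document.
constexpr std::size_t kMemberBudgetBytes = 8 * 1024;
constexpr std::size_t kWorstCaseMembers = maxMembers(kMemberBudgetBytes, 100);  // 81
constexpr std::size_t kBestCaseMembers = maxMembers(kMemberBudgetBytes, 50);    // 163
```

Under these assumptions the budget holds roughly 80 to 160 member records, which is above the practical range of 20-50 nodes. That suggests the practical limit is set by network and processing load more than by member-tracking memory alone.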

### Performance Scaling

- **Linear Scaling**: Most operations scale linearly with node count
- **Discovery Overhead**: Increases with cluster size
- **Health Monitoring**: Parallel HTTP requests
- **Task Management**: Independent per-node execution

## Configuration Management

### Environment Variables

```bash
# API node IP for cluster management
export API_NODE=192.168.1.100

# Cluster configuration
export CLUSTER_PORT=4210
export DISCOVERY_INTERVAL=1000
export HEALTH_CHECK_INTERVAL=1000
```
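
A host-side tool that consumes these variables would typically fall back to defaults when a variable is unset or malformed. The sketch below shows one way to do that with the C++ standard library; the helper names are hypothetical, and nothing here is taken from the project's own tooling.

```cpp
#include <cstdlib>
#include <string>

// Parses a base-10 unsigned value; returns the fallback when the text is
// missing, empty, or followed by non-numeric characters.
unsigned long parseUintOr(const char* text, unsigned long fallback) {
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const unsigned long value = std::strtoul(text, &end, 10);
    return (*end == '\0') ? value : fallback;
}

// Reads a numeric environment variable such as CLUSTER_PORT.
unsigned long envUintOr(const char* name, unsigned long fallback) {
    return parseUintOr(std::getenv(name), fallback);
}

// Reads a string environment variable such as API_NODE.
std::string envStringOr(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    return (value != nullptr && *value != '\0') ? std::string(value) : fallback;
}
```

A call such as `envUintOr("CLUSTER_PORT", 4210)` then yields the documented default port whenever the variable is not exported.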

### PlatformIO Configuration

The project uses PlatformIO with the following configuration:

- **Framework**: Arduino
- **Board**: ESP-01 with 1MB flash
- **Upload Speed**: 115200 baud
- **Flash Mode**: DOUT (required for ESP-01S)

### Dependencies

The project requires the following libraries:

- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking

## Development Workflow

### Building

Build the firmware for a specific chip:

```bash
./ctl.sh build target esp01_1m
```

### Flashing

Flash firmware to a connected device:

```bash
./ctl.sh flash target esp01_1m
```

### Over-The-Air Updates

Update a specific node:

```bash
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```bash
./ctl.sh ota all esp01_1m
```

### Cluster Management

View cluster members:

```bash
./ctl.sh cluster members
```

## Troubleshooting

### Common Issues

1. **Discovery Failures**: Check UDP port 4210 is not blocked
2. **WiFi Connection**: Verify SSID/password in Config.cpp
3. **OTA Updates**: Ensure sufficient flash space (1MB minimum)
4. **Cluster Split**: Check network connectivity between nodes

### Debug Output

Enable serial monitoring to see cluster activity:

```bash
pio device monitor
```

### Performance Monitoring

- **Memory Usage**: Monitor free heap with `/api/node/status`
- **Task Health**: Check task status with `/api/tasks/status`
- **Cluster Health**: Monitor member status with `/api/cluster/members`
- **Network Latency**: Track response times in cluster data

## Related Documentation

- **[Task Management](./TaskManagement.md)** - Background task system
- **[API Reference](./API.md)** - REST API documentation
- **[TaskManager API](./TaskManager.md)** - TaskManager class reference
- **[OpenAPI Specification](../api/)** - Machine-readable API specification