# SPORE Architecture & Implementation

## System Overview

SPORE (SProcket ORchestration Engine) is a cluster engine for ESP8266 microcontrollers that provides automatic node discovery, health monitoring, and over-the-air updates in a distributed network environment.

## Core Components

The system architecture consists of several key components working together:

### Network Manager
- **WiFi Connection Handling**: Automatic WiFi STA/AP configuration
- **Hostname Configuration**: MAC-based hostname generation
- **Fallback Management**: Automatic access point creation if WiFi connection fails

### Cluster Manager
- **Node Discovery**: UDP-based automatic node detection
- **Member List Management**: Dynamic cluster membership tracking
- **Health Monitoring**: Continuous node status checking
- **Resource Tracking**: Monitor node resources and capabilities

### API Server
- **HTTP API Server**: RESTful API for cluster management
- **Dynamic Endpoint Registration**: Automatic API endpoint discovery
- **Service Registry**: Track available services across the cluster

### Task Scheduler
- **Cooperative Multitasking**: Background task management system (`TaskManager`)
- **Task Lifecycle Management**: Enable/disable tasks and set intervals at runtime
- **Execution Model**: Tasks run in `Spore::loop()` when their interval elapses

### Node Context
- **Central Context**: Shared resources and configuration
- **Event System**: Local and cluster-wide event publishing/subscription
- **Resource Management**: Centralized resource allocation and monitoring

## Auto Discovery Protocol

The cluster uses a UDP-based discovery protocol for automatic node detection:

### Discovery Process

1. **Discovery Broadcast**: Nodes periodically broadcast a `CLUSTER_DISCOVERY` packet on UDP port `Config.udp_port` (default 4210)
2. **Response Handling**: Receiving nodes reply with `CLUSTER_RESPONSE:<hostname>`
3. **Member Management**: Discovered nodes are added to, or updated in, the cluster member list
4. **Node Info via UDP**: A heartbeat prompts peers to send `CLUSTER_NODE_INFO:<hostname>:<json>`

### Protocol Details

- **UDP Port**: 4210 (configurable via `Config.udp_port`)
- **Discovery Message**: `CLUSTER_DISCOVERY`
- **Response Message**: `CLUSTER_RESPONSE`
- **Heartbeat Message**: `CLUSTER_HEARTBEAT`
- **Node Info Message**: `CLUSTER_NODE_INFO:<hostname>:<json>`
- **Broadcast Address**: 255.255.255.255
- **Discovery Interval**: `Config.discovery_interval_ms` (default 1000 ms)
- **Listen Interval**: `Config.discovery_interval_ms / 10` (default 100 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)
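
The message formats above can be told apart by the tag before the first `:`. The following is a minimal sketch of such a parser in portable C++, not the actual SPORE implementation: `MessageKind`, `ClusterMessage`, and `parseClusterMessage` are hypothetical names, and it assumes that hostnames contain no `:` and that a heartbeat may optionally carry a hostname.

```cpp
#include <cstddef>
#include <string>

// Message kinds named in the protocol list (Unknown covers everything else).
enum class MessageKind { Unknown, Discovery, Response, Heartbeat, NodeInfo };

// Hypothetical parsed form of one UDP packet.
struct ClusterMessage {
    MessageKind kind = MessageKind::Unknown;
    std::string hostname;  // empty when the packet carries none
    std::string json;      // only filled for CLUSTER_NODE_INFO
};

// Hypothetical helper: classifies a packet and extracts its fields.
ClusterMessage parseClusterMessage(const std::string& packet) {
    ClusterMessage msg;
    // The tag is everything before the first ':'; the rest is payload.
    const std::size_t sep = packet.find(':');
    const std::string tag = packet.substr(0, sep);
    const std::string payload =
        (sep == std::string::npos) ? std::string() : packet.substr(sep + 1);

    if (tag == "CLUSTER_DISCOVERY") {
        msg.kind = MessageKind::Discovery;
    } else if (tag == "CLUSTER_RESPONSE") {
        msg.kind = MessageKind::Response;
        msg.hostname = payload;
    } else if (tag == "CLUSTER_HEARTBEAT") {
        msg.kind = MessageKind::Heartbeat;
        msg.hostname = payload;  // assumption: an optional hostname may follow
    } else if (tag == "CLUSTER_NODE_INFO") {
        // Split at the first ':' only, because the JSON body contains colons.
        const std::size_t split = payload.find(':');
        if (split != std::string::npos) {
            msg.kind = MessageKind::NodeInfo;
            msg.hostname = payload.substr(0, split);
            msg.json = payload.substr(split + 1);
        }
    }
    return msg;
}
```

A malformed `CLUSTER_NODE_INFO` packet without a JSON part is reported as `Unknown` here, so the caller can drop it without touching the member list.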

### Node Status Categories

Nodes are automatically categorized by their activity:

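- **ACTIVE**: last seen less than `node_inactive_threshold_ms` ago (default 10 s)
- **INACTIVE**: last seen at least `node_inactive_threshold_ms` ago but less than `node_dead_threshold_ms` ago (default 120 s)
- **DEAD**: last seen `node_dead_threshold_ms` or more ago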
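
The categorization above amounts to two threshold comparisons on the time since a node was last seen. The sketch below illustrates this under stated assumptions: `NodeStatus`, `StatusThresholds`, and `classifyNode` are hypothetical names, and timestamps are assumed to be 32-bit `millis()`-style counters.

```cpp
#include <cstdint>

enum class NodeStatus { Active, Inactive, Dead };

// Defaults quoted in the list above; field names mirror the config keys.
struct StatusThresholds {
    std::uint32_t node_inactive_threshold_ms = 10000;
    std::uint32_t node_dead_threshold_ms = 120000;
};

// Hypothetical helper. Unsigned subtraction keeps the elapsed time correct
// across a single wraparound of the 32-bit millisecond counter.
NodeStatus classifyNode(std::uint32_t nowMs, std::uint32_t lastSeenMs,
                        const StatusThresholds& t = StatusThresholds()) {
    const std::uint32_t elapsed = nowMs - lastSeenMs;
    if (elapsed < t.node_inactive_threshold_ms) return NodeStatus::Active;
    if (elapsed < t.node_dead_threshold_ms) return NodeStatus::Inactive;
    return NodeStatus::Dead;
}
```

With the defaults, a node last seen 5 s ago is `Active`, one last seen 10 s ago is `Inactive`, and one last seen 120 s ago is `Dead`.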

## Task Scheduling System

The system runs several background tasks at different intervals:

### Core System Tasks

| Task | Interval (default) | Purpose |
|------|--------------------|---------|
| `discovery_send` | 1000 ms | Send UDP discovery packets |
| `discovery_listen` | 100 ms | Listen for discovery/heartbeat/node-info |
| `status_update` | 1000 ms | Update node status categories, purge dead |
| `heartbeat` | 5000 ms | Broadcast heartbeat and update local resources |
| `update_members_info` | 10000 ms | Reserved; currently a no-op because node info is exchanged via UDP |
| `print_members` | 5000 ms | Log current member list |

### Task Management Features

- **Dynamic Intervals**: Change execution frequency on-the-fly
- **Runtime Control**: Enable/disable tasks without restart
- **Status Monitoring**: Real-time task health tracking
- **Resource Integration**: View task status with system resources
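
To illustrate how interval-based cooperative scheduling with runtime control can work, here is a minimal sketch in portable C++. It is not the real `TaskManager`: the class `MiniTaskManager` and its methods are hypothetical, and the current time is passed in explicitly where firmware would typically use `millis()`.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a cooperative task manager.
class MiniTaskManager {
public:
    void registerTask(const std::string& name, std::uint32_t intervalMs,
                      std::function<void()> fn) {
        tasks_.push_back(Task{name, intervalMs, 0, true, std::move(fn)});
    }

    // Enable or disable a task at runtime. Returns false if the name is unknown.
    bool setEnabled(const std::string& name, bool enabled) {
        Task* task = find(name);
        if (task != nullptr) task->enabled = enabled;
        return task != nullptr;
    }

    // Change a task's interval at runtime. Returns false if the name is unknown.
    bool setInterval(const std::string& name, std::uint32_t intervalMs) {
        Task* task = find(name);
        if (task != nullptr) task->intervalMs = intervalMs;
        return task != nullptr;
    }

    // Call once per main-loop pass: runs each enabled task whose interval
    // has elapsed. Tasks run to completion; nothing is preempted.
    void tick(std::uint32_t nowMs) {
        for (Task& t : tasks_) {
            if (!t.enabled) continue;
            if (nowMs - t.lastRunMs >= t.intervalMs) {
                t.lastRunMs = nowMs;
                t.fn();
            }
        }
    }

private:
    struct Task {
        std::string name;
        std::uint32_t intervalMs;
        std::uint32_t lastRunMs;
        bool enabled;
        std::function<void()> fn;
    };

    Task* find(const std::string& name) {
        for (Task& t : tasks_) {
            if (t.name == name) return &t;
        }
        return nullptr;
    }

    std::vector<Task> tasks_;
};
```

Because every task runs inside the same loop pass, a task that blocks delays all others; this is the usual trade-off of cooperative scheduling on a single-core microcontroller.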

## Event System

The `NodeContext` provides an event-driven architecture for system-wide communication:

### Event Subscription

```cpp
// Subscribe to events
ctx.on("node_discovered", [](void* data) {
    NodeInfo* node = static_cast<NodeInfo*>(data);
    // Handle new node discovery
});

ctx.on("cluster_updated", [](void* data) {
    // Handle cluster membership changes
});
```

### Event Publishing

```cpp
// Publish events
ctx.fire("node_discovered", &newNode);
ctx.fire("cluster_updated", &clusterData);
```

### Available Events

- **`node_discovered`**: New node added or local node refreshed
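
The `on`/`fire` pattern used above can be sketched as a map from event names to handler lists. This is an illustrative stand-in, not the actual `NodeContext` code: the class name `EventBus` and the return value of `fire` are assumptions made for the example.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the event part of a node context.
class EventBus {
public:
    using Handler = std::function<void(void*)>;

    // Register a handler; several handlers may share one event name.
    void on(const std::string& event, Handler handler) {
        handlers_[event].push_back(std::move(handler));
    }

    // Invoke every handler registered for the event, in registration order.
    // Returns how many handlers ran (0 if nobody subscribed).
    std::size_t fire(const std::string& event, void* data) {
        auto it = handlers_.find(event);
        if (it == handlers_.end()) return 0;
        for (const Handler& h : it->second) h(data);
        return it->second.size();
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```

Since the payload is an untyped `void*`, publisher and subscriber must agree on the pointed-to type, and the pointer should only be treated as valid for the duration of the `fire` call.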

## Resource Monitoring

Each node tracks comprehensive system resources:

### System Resources

- **Free Heap Memory**: Available RAM in bytes
- **Chip ID**: Unique ESP8266 identifier
- **SDK Version**: ESP8266 firmware version
- **CPU Frequency**: Operating frequency in MHz
- **Flash Chip Size**: Total flash storage in bytes

### API Endpoint Registry

- **Dynamic Discovery**: Automatically detect available endpoints
- **Method Information**: HTTP method (GET, POST, etc.)
- **Service Catalog**: Complete service registry across cluster

### Health Metrics

- **Response Time**: API response latency
- **Uptime**: System uptime in milliseconds
- **Connection Status**: Network connectivity health
- **Resource Utilization**: Memory and CPU usage

## WiFi Fallback System

The system includes automatic WiFi fallback for robust operation:

### Fallback Process

1. **Primary Connection**: Attempts to connect to configured WiFi network
2. **Connection Failure**: If connection fails, creates an access point
3. **Hostname Generation**: Automatically generates hostname from MAC address
4. **Service Continuity**: Maintains cluster functionality in fallback mode

### Configuration

- **Hostname**: Derived from MAC (`esp-<mac>`) and assigned to `ctx.hostname`
- **AP Mode**: If STA connection fails, device switches to AP mode with configured SSID/password
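
As an illustration of MAC-based hostname generation, the sketch below formats six MAC bytes into an `esp-<mac>` string. The exact formatting SPORE uses is not specified here, so lowercase hex without separators is an assumption, and `hostnameFromMac` is a hypothetical helper name.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hypothetical helper: builds "esp-" followed by the MAC as 12 lowercase
// hex digits (assumed format; the firmware may format it differently).
std::string hostnameFromMac(const std::array<std::uint8_t, 6>& mac) {
    static const char kHex[] = "0123456789abcdef";
    std::string name = "esp-";
    for (std::uint8_t byte : mac) {
        name += kHex[byte >> 4];    // high nibble
        name += kHex[byte & 0x0F];  // low nibble
    }
    return name;
}
```

Deriving the name from the MAC gives each device a stable hostname without any per-device configuration.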

## Cluster Topology

### Node Types

- **Master Node**: Primary cluster coordinator (if applicable)
- **Worker Nodes**: Standard cluster members
- **Edge Nodes**: Network edge devices

### Network Architecture

- UDP broadcast-based discovery and heartbeats on local subnet
- Optional HTTP polling (disabled by default; node info exchanged via UDP)

## Data Flow

### Node Discovery
1. **UDP Broadcast**: Nodes broadcast discovery packets on port 4210
2. **UDP Response**: Receiving nodes respond with hostname
3. **Registration**: Discovered nodes are added to local cluster member list
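
The registration step, together with the purge of dead nodes performed by `status_update`, can be sketched as an add-or-refresh operation on a keyed member list. The code below is a hedged illustration only: `Member`, `MemberList`, and keying by hostname are assumptions, not the actual cluster manager data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical member record.
struct Member {
    std::string ip;
    std::uint32_t lastSeenMs;
};

// Hypothetical member list keyed by hostname.
class MemberList {
public:
    // Adds the node or refreshes its record. Returns true when it was new.
    bool upsert(const std::string& hostname, const std::string& ip,
                std::uint32_t nowMs) {
        const bool isNew = (members_.find(hostname) == members_.end());
        members_[hostname] = Member{ip, nowMs};
        return isNew;
    }

    // Removes members not seen for deadThresholdMs or longer.
    // Returns the number of members removed.
    std::size_t purgeDead(std::uint32_t nowMs, std::uint32_t deadThresholdMs) {
        std::size_t removed = 0;
        for (auto it = members_.begin(); it != members_.end();) {
            if (nowMs - it->second.lastSeenMs >= deadThresholdMs) {
                it = members_.erase(it);  // erase returns the next iterator
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t size() const { return members_.size(); }

private:
    std::map<std::string, Member> members_;
};
```

The boolean returned by `upsert` is a natural place to decide whether an event such as `node_discovered` should be fired.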

### Health Monitoring
1. **Periodic Checks**: Cluster manager updates node status categories
2. **Status Collection**: Each node updates resources via UDP node-info messages

### Task Management
1. **Scheduling**: `TaskManager` executes registered tasks at configured intervals
2. **Execution**: Tasks run cooperatively in the main loop without preemption
3. **Monitoring**: Task status is exposed via REST (`/api/tasks/status`)

## Performance Characteristics

### Memory Usage

- **Base System**: ~15-20KB RAM (device dependent)
- **Per Task**: ~100-200 bytes per task
- **Cluster Members**: ~50-100 bytes per member
- **API Endpoints**: ~20-30 bytes per endpoint

### Network Overhead

- **Discovery Packets**: 64 bytes every 1 second
- **Health Checks**: ~200-500 bytes every 1 second
- **Status Updates**: ~1-2KB per node
- **API Responses**: Varies by endpoint (typically 100B-5KB)

### Processing Overhead

- **Task Execution**: Minimal overhead per task
- **Event Processing**: Fast event dispatch
- **JSON Parsing**: Efficient ArduinoJson usage
- **Network I/O**: Asynchronous operations

## Security Considerations

### Current Implementation

- **Network Access**: Local network only (no internet exposure)
- **Authentication**: None currently implemented; LAN-only access assumed
- **Data Validation**: Basic input validation
- **Resource Limits**: Memory and processing constraints

### Future Enhancements

- **TLS/SSL**: Encrypted communications
- **API Keys**: Authentication for API access
- **Access Control**: Role-based permissions
- **Audit Logging**: Security event tracking

## Scalability

### Cluster Size Limits

- **Theoretical**: Up to 254 nodes on a /24 subnet (usable host addresses)
- **Practical**: 20-50 nodes for optimal performance
- **Memory Constraint**: ~8KB available for member tracking
- **Network Constraint**: UDP packet size limits
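
As a rough cross-check of these limits, the memory constraint can be combined with the per-member estimate of ~50-100 bytes given under Memory Usage. The helper below is a hypothetical back-of-the-envelope calculation, not a measurement; it assumes the ~8KB budget is exactly 8192 bytes.

```cpp
#include <cstddef>

// "~8KB available for member tracking", taken here as 8192 bytes (assumption).
constexpr std::size_t kMemberBudgetBytes = 8 * 1024;

// Hypothetical helper: how many member records fit in a memory budget.
constexpr std::size_t maxMembers(std::size_t budgetBytes,
                                 std::size_t bytesPerMember) {
    return bytesPerMember == 0 ? 0 : budgetBytes / bytesPerMember;
}

// Compile-time checks of the two ends of the per-member estimate.
static_assert(maxMembers(kMemberBudgetBytes, 100) == 81, "100 B per member");
static_assert(maxMembers(kMemberBudgetBytes, 50) == 163, "50 B per member");
```

Under these assumptions the member list tops out at roughly 81 to 163 entries, so memory rather than addressing is the tighter bound, which is consistent with the more conservative practical recommendation above.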

### Performance Scaling

- **Linear Scaling**: Most operations scale linearly with node count
- **Discovery Overhead**: Increases with cluster size
- **Health Monitoring**: UDP heartbeats and node-info messages; optional HTTP polling is disabled by default
- **Task Management**: Independent per-node execution

## Configuration Management

### Environment Variables

```bash
# API node IP for cluster management
export API_NODE=192.168.1.100
```

### PlatformIO Configuration

The project uses PlatformIO with the following configuration:

- **Framework**: Arduino
- **Board**: ESP-01 with 1MB flash
- **Upload Speed**: 115200 baud
- **Flash Mode**: DOUT (required for ESP-01S)

### Dependencies

The project requires the following libraries:
- `esp32async/ESPAsyncWebServer@^3.8.0` - HTTP API server
- `bblanchon/ArduinoJson@^7.4.2` - JSON processing
- `arkhipenko/TaskScheduler@^3.8.5` - Cooperative multitasking

## Development Workflow

### Building

Build the firmware for a specific chip:

```bash
./ctl.sh build target esp01_1m
```

### Flashing

Flash firmware to a connected device:

```bash
./ctl.sh flash target esp01_1m
```

### Over-The-Air Updates

Update a specific node:

```bash
./ctl.sh ota update 192.168.1.100 esp01_1m
```

Update all nodes in the cluster:

```bash
./ctl.sh ota all esp01_1m
```

### Cluster Management

View cluster members:

```bash
./ctl.sh cluster members
```

## Troubleshooting

### Common Issues

1. **Discovery Failures**: Check that UDP port 4210 is not blocked
2. **WiFi Connection**: Verify the SSID/password in `Config.cpp`
3. **OTA Updates**: Ensure sufficient flash space (1MB minimum)
4. **Cluster Split**: Check network connectivity between nodes

### Debug Output

Enable serial monitoring to see cluster activity:

```bash
pio device monitor
```

### Performance Monitoring

- **Memory Usage**: Monitor free heap with `/api/node/status`
- **Task Health**: Check task status with `/api/tasks/status`
- **Cluster Health**: Monitor member status with `/api/cluster/members`
- **Network Latency**: Track response times in cluster data

## Related Documentation

- **[Task Management](./TaskManagement.md)** - Background task system
- **[API Reference](./API.md)** - REST API documentation
- **[TaskManager API](./TaskManager.md)** - TaskManager class reference
- **[OpenAPI Specification](../api/)** - Machine-readable API specification