spore-gateway/docs/monitoring-example.md
Patrick Balsiger 3c3fb886a3 feat: mock gateway
2025-10-24 14:24:14 +02:00


# Monitoring Resources Endpoint
## Overview
The `/api/monitoring/resources` endpoint returns a point-in-time snapshot of CPU, memory, network, and flash usage for every node in the cluster, together with cluster-wide summary statistics.
## Endpoint
```
GET /api/monitoring/resources
```
## Response Format
```json
{
  "timestamp": "2025-10-24T10:30:45Z",
  "nodes": [
    {
      "timestamp": 1729763445,
      "node_ip": "192.168.1.100",
      "hostname": "spore-node-1",
      "cpu": {
        "frequency_mhz": 160,
        "usage_percent": 42.5,
        "temperature_c": 58.3
      },
      "memory": {
        "total_bytes": 98304,
        "free_bytes": 45632,
        "used_bytes": 52672,
        "usage_percent": 53.6
      },
      "network": {
        "bytes_sent": 3245678,
        "bytes_received": 5678901,
        "packets_sent": 32456,
        "packets_received": 56789,
        "rssi_dbm": -65,
        "signal_quality_percent": 75.5
      },
      "flash": {
        "total_bytes": 4194304,
        "used_bytes": 2097152,
        "free_bytes": 2097152,
        "usage_percent": 50.0
      },
      "labels": {
        "version": "1.0.0",
        "stable": "true",
        "env": "production",
        "zone": "zone-1",
        "type": "spore-node"
      }
    }
  ],
  "summary": {
    "total_nodes": 5,
    "avg_cpu_usage_percent": 38.7,
    "avg_memory_usage_percent": 51.2,
    "avg_flash_usage_percent": 52.8,
    "total_bytes_sent": 16228390,
    "total_bytes_received": 28394505
  }
}
```
## Data Fields
### CPU Metrics
- **frequency_mhz**: Current CPU frequency in MHz (80-240 MHz typical for ESP32)
- **usage_percent**: CPU utilization percentage (0-100%)
- **temperature_c**: CPU temperature in Celsius (45-65°C typical)
### Memory Metrics
- **total_bytes**: Total RAM available (64-128 KB typical)
- **free_bytes**: Free RAM available
- **used_bytes**: Used RAM
- **usage_percent**: Memory utilization percentage
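
Judging from the sample response, the memory fields are related by simple arithmetic: `used_bytes` is `total_bytes` minus `free_bytes`, and `usage_percent` is the used share of the total. The following is a minimal sketch of that relationship; the function name `memory_usage` and the rounding to one decimal place are assumptions made for illustration, not part of the gateway.

```python
def memory_usage(total_bytes: int, free_bytes: int) -> tuple[int, float]:
    """Return (used_bytes, usage_percent) derived from total and free bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_bytes = total_bytes - free_bytes
    # Assumption: percentages are rounded to one decimal, as in the sample response.
    usage_percent = round(used_bytes / total_bytes * 100, 1)
    return used_bytes, usage_percent


# Values taken from the sample response.
used, percent = memory_usage(total_bytes=98304, free_bytes=45632)
print(used, percent)  # 52672 53.6
```

The same arithmetic reproduces the flash values in the sample response (2097152 used of 4194304 gives 50.0%).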
### Network Metrics
- **bytes_sent**: Total bytes transmitted since boot
- **bytes_received**: Total bytes received since boot
- **packets_sent**: Total packets transmitted
- **packets_received**: Total packets received
- **rssi_dbm**: WiFi signal strength in dBm (-30 to -90 typical)
- **signal_quality_percent**: WiFi signal quality (0-100%)
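
This document does not say how `signal_quality_percent` is derived from `rssi_dbm`. One widely used convention is a clamped linear mapping in which -100 dBm corresponds to 0% and -50 dBm to 100%. The sketch below shows that convention only as an illustration; the function name `rssi_to_quality` is hypothetical, and the sample response (-65 dBm with 75.5%) does not follow this formula, so the gateway may well compute the value differently.

```python
def rssi_to_quality(rssi_dbm: float) -> float:
    """Map RSSI in dBm to a 0-100% quality value with a clamped linear scale."""
    # Assumed convention: -100 dBm or weaker is 0%, -50 dBm or stronger is 100%.
    quality = 2 * (rssi_dbm + 100)
    return max(0.0, min(100.0, float(quality)))


for rssi in (-30, -65, -90):
    print(rssi, rssi_to_quality(rssi))
```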
### Flash Metrics
- **total_bytes**: Total flash storage (typically 4 MB)
- **used_bytes**: Used flash storage
- **free_bytes**: Free flash storage
- **usage_percent**: Flash utilization percentage
### Node Labels
Each node carries labels that describe the firmware it runs and where it is deployed:
- **version**: Current firmware version (e.g., "1.0.0", "1.1.0", "1.2.0")
- **stable**: Whether this is a stable release ("true" or "false")
- **env**: Environment (e.g., "production", "beta")
- **zone**: Deployment zone (e.g., "zone-1", "zone-2", "zone-3")
- **type**: Node type (e.g., "spore-node")
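
Labels make it easy to narrow a response down to a subset of nodes. The sketch below filters a node list by label values; `select_nodes` and the second sample node are hypothetical, introduced only for illustration.

```python
def select_nodes(nodes: list[dict], **wanted: str) -> list[dict]:
    """Return the nodes whose labels contain every requested key/value pair."""
    return [
        node for node in nodes
        if all(node.get("labels", {}).get(key) == value for key, value in wanted.items())
    ]


nodes = [
    {"hostname": "spore-node-1",
     "labels": {"version": "1.0.0", "stable": "true", "env": "production"}},
    {"hostname": "spore-node-2",  # hypothetical second node
     "labels": {"version": "1.2.0", "stable": "false", "env": "beta"}},
]

# Label values are strings, so "stable" is compared against "true", not True.
stable = select_nodes(nodes, stable="true", env="production")
print([node["hostname"] for node in stable])  # ['spore-node-1']
```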
### Summary Statistics
Aggregate metrics across all nodes:
- **total_nodes**: Total number of nodes monitored
- **avg_cpu_usage_percent**: Average CPU usage across all nodes
- **avg_memory_usage_percent**: Average memory usage across all nodes
- **avg_flash_usage_percent**: Average flash usage across all nodes
- **total_bytes_sent**: Combined network traffic sent
- **total_bytes_received**: Combined network traffic received
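
The summary fields can be recomputed on the client from the `nodes` array, which is useful when aggregating over a filtered subset. The sketch below assumes a plain arithmetic mean rounded to one decimal place and returns 0.0 for an empty list; the document only says "average", so both choices, like the name `summarize`, are assumptions.

```python
def summarize(nodes: list[dict]) -> dict:
    """Recompute the summary block from a list of node entries."""
    count = len(nodes)

    def average(section: str) -> float:
        # Assumption: an empty cluster reports 0.0 rather than raising an error.
        if count == 0:
            return 0.0
        return round(sum(n[section]["usage_percent"] for n in nodes) / count, 1)

    return {
        "total_nodes": count,
        "avg_cpu_usage_percent": average("cpu"),
        "avg_memory_usage_percent": average("memory"),
        "avg_flash_usage_percent": average("flash"),
        "total_bytes_sent": sum(n["network"]["bytes_sent"] for n in nodes),
        "total_bytes_received": sum(n["network"]["bytes_received"] for n in nodes),
    }
```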
## Firmware Version Matching
Node labels are automatically synchronized with the firmware available in the registry:

| Version | Registry Status | Node Distribution | Environment |
|---------|-----------------|-------------------|-------------|
| 1.0.0   | Stable          | 40% of nodes      | production  |
| 1.1.0   | Stable          | 40% of nodes      | production  |
| 1.2.0   | Beta            | 20% of nodes      | beta        |

This ensures that monitoring data accurately reflects which firmware versions are deployed across the cluster.
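
To check the distribution in a live response, the `version` labels can be tallied on the client. This is a minimal sketch; `version_distribution` is a hypothetical helper name and the rounding to one decimal place is an assumption.

```python
from collections import Counter


def version_distribution(nodes: list[dict]) -> dict[str, float]:
    """Return the percentage of nodes running each firmware version."""
    counts = Counter(node["labels"]["version"] for node in nodes)
    total = sum(counts.values())
    return {
        version: round(count / total * 100, 1)
        for version, count in sorted(counts.items())
    }
```

For a five-node cluster with two nodes on 1.0.0, two on 1.1.0, and one on 1.2.0, this yields the 40/40/20 split shown in the table.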
## Use Cases
### 1. Real-time Dashboard
Display live resource usage for all nodes in a monitoring dashboard.
### 2. Alerting
Set up alerts based on thresholds:
- CPU usage > 80%
- Memory usage > 90%
- Flash usage > 95%
- WiFi signal quality < 30%
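
The thresholds above can be evaluated per node with a small table-driven check. This is a sketch rather than a full alerting system; `THRESHOLDS` and `check_alerts` are hypothetical names, and the message format is arbitrary.

```python
# Thresholds from the list above: (section, field, comparison, limit).
THRESHOLDS = [
    ("cpu", "usage_percent", ">", 80.0),
    ("memory", "usage_percent", ">", 90.0),
    ("flash", "usage_percent", ">", 95.0),
    ("network", "signal_quality_percent", "<", 30.0),
]


def check_alerts(node: dict) -> list[str]:
    """Return one message per threshold that the node violates."""
    alerts = []
    for section, field, op, limit in THRESHOLDS:
        value = node[section][field]
        breached = value > limit if op == ">" else value < limit
        if breached:
            alerts.append(
                f"{node['hostname']}: {section}.{field} = {value} ({op} {limit})"
            )
    return alerts
```

Calling `check_alerts(node)` for each entry in `data["nodes"]` produces an empty list for healthy nodes, so the results can be concatenated and forwarded to whatever notification channel is in use.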
### 3. Capacity Planning
Track resource trends to plan firmware optimizations or hardware upgrades.
### 4. Firmware Rollout Monitoring
Monitor resource usage before, during, and after firmware rollouts to detect issues.
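
One way to spot a regression during a rollout is to compare resource usage between firmware versions within a single snapshot. The sketch below groups nodes by their `version` label and averages one metric per group; `average_by_version` is a hypothetical helper, and comparing memory by default is an arbitrary choice.

```python
from collections import defaultdict


def average_by_version(nodes: list[dict], section: str = "memory") -> dict[str, float]:
    """Average usage_percent of one section, grouped by labels.version."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["labels"]["version"]].append(node[section]["usage_percent"])
    return {
        version: round(sum(values) / len(values), 1)
        for version, values in sorted(grouped.items())
    }
```

A noticeably higher average for the newest version than for the previous one is a hint to pause the rollout and investigate, though with mock data the differences are not meaningful.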
### 5. Network Health
Track WiFi signal quality and network traffic to identify connectivity issues.
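
Because `bytes_sent` and `bytes_received` are counters since boot, traffic rates have to be derived from two snapshots of the same node. The sketch below assumes the per-node `timestamp` is in Unix epoch seconds, as the sample response suggests, and treats a shrinking counter as a reboot; `throughput` is a hypothetical helper name. Since the mock gateway varies its values on each request, its counters may not increase monotonically.

```python
from typing import Optional


def throughput(previous: dict, current: dict) -> Optional[dict]:
    """Return send/receive rates in bytes per second between two samples of one node."""
    elapsed = current["timestamp"] - previous["timestamp"]
    sent = current["network"]["bytes_sent"] - previous["network"]["bytes_sent"]
    received = current["network"]["bytes_received"] - previous["network"]["bytes_received"]
    # A non-positive interval or a shrinking counter suggests a reboot or stale data.
    if elapsed <= 0 or sent < 0 or received < 0:
        return None
    return {
        "sent_bytes_per_s": sent / elapsed,
        "received_bytes_per_s": received / elapsed,
    }
```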
## Example Usage
### cURL
```bash
curl http://localhost:3001/api/monitoring/resources
```
### JavaScript (fetch)
```javascript
const response = await fetch('http://localhost:3001/api/monitoring/resources');
const data = await response.json();
console.log(`Monitoring ${data.summary.total_nodes} nodes`);
console.log(`Average CPU: ${data.summary.avg_cpu_usage_percent.toFixed(1)}%`);
console.log(`Average Memory: ${data.summary.avg_memory_usage_percent.toFixed(1)}%`);
data.nodes.forEach(node => {
  console.log(`${node.hostname} (${node.labels.version}): CPU ${node.cpu.usage_percent.toFixed(1)}%`);
});
```
### Python
```python
import requests

# Fetch the current snapshot; fail fast on network or HTTP errors.
response = requests.get('http://localhost:3001/api/monitoring/resources', timeout=5)
response.raise_for_status()
data = response.json()

print(f"Monitoring {data['summary']['total_nodes']} nodes")
print(f"Average CPU: {data['summary']['avg_cpu_usage_percent']:.1f}%")
print(f"Average Memory: {data['summary']['avg_memory_usage_percent']:.1f}%")

# One line per node: hostname, firmware version, CPU usage.
for node in data['nodes']:
    print(f"{node['hostname']} ({node['labels']['version']}): "
          f"CPU {node['cpu']['usage_percent']:.1f}%")
```
## Mock Gateway Behavior
The mock gateway generates realistic monitoring data with:
- **Dynamic values**: CPU, memory, and network metrics vary on each request
- **Realistic ranges**: Values stay within typical ESP32 hardware limits
- **Signal quality**: WiFi RSSI converted to quality percentage
- **Consistent labels**: Node labels always match firmware registry versions
- **Aggregate summaries**: Automatic calculation of cluster-wide statistics
## Integration with WebSocket
For real-time updates, consider combining this endpoint with the WebSocket connection at `/ws`, which broadcasts:
- Node status changes
- Firmware update progress
- Cluster membership changes
The monitoring endpoint provides detailed point-in-time snapshots, while the WebSocket connection provides a real-time event stream.