feat: improve cluster forming; just use heartbeat to form the cluster

2025-10-19 12:50:43 +02:00
parent ce70830678
commit 3ed44cd00f
10 changed files with 185 additions and 144 deletions

@@ -52,57 +52,50 @@ The cluster uses a UDP-based discovery protocol for automatic node detection:
- **UDP Port**: 4210 (configurable via `Config.udp_port`)
- **Discovery Message**: `CLUSTER_DISCOVERY`
- **Response Message**: `CLUSTER_RESPONSE`
- **Heartbeat Message**: `CLUSTER_HEARTBEAT`
- **Node Info Message**: `CLUSTER_NODE_INFO:<hostname>:<json>`
- **Heartbeat Message**: `CLUSTER_HEARTBEAT:hostname`
- **Node Update Message**: `NODE_UPDATE:hostname:{json}`
- **Broadcast Address**: 255.255.255.255
- **Discovery Interval**: `Config.discovery_interval_ms` (default 1000 ms)
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)
- **Node Update Broadcast Interval**: `Config.node_update_broadcast_interval_ms` (default 5000 ms)
### Message Formats
- **Discovery**: `CLUSTER_DISCOVERY`
- Sender: any node, broadcast to 255.255.255.255:`udp_port`
- Purpose: announce presence and solicit peer identification
- **Response**: `CLUSTER_RESPONSE:<hostname>`
- Sender: node receiving a discovery; unicast to requester IP
- Purpose: provide hostname so requester can register/update member
- **Heartbeat**: `CLUSTER_HEARTBEAT:<hostname>`
- **Heartbeat**: `CLUSTER_HEARTBEAT:hostname`
- Sender: each node, broadcast to 255.255.255.255:`udp_port` on interval
- Purpose: prompt peers to reply with their node info and keep liveness
- **Node Info**: `CLUSTER_NODE_INFO:<hostname>:<json>`
- Purpose: announce presence, prompt peers for node info, and keep liveness
- **Node Update**: `NODE_UPDATE:hostname:{json}`
- Sender: node receiving a heartbeat; unicast to heartbeat sender IP
- JSON fields: freeHeap, chipId, sdkVersion, cpuFreqMHz, flashChipSize, optional labels
### Discovery Flow
1. **Sender broadcasts** `CLUSTER_DISCOVERY`
2. **Each receiver responds** with `CLUSTER_RESPONSE:<hostname>` to the sender IP
3. **Sender registers/updates** the node using hostname and source IP
- JSON fields: hostname, ip, uptime, optional labels
- Purpose: provide minimal node information in response to heartbeat
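As a rough illustration of the two wire formats that remain after this change, the sketch below builds and parses `CLUSTER_HEARTBEAT:hostname` and `NODE_UPDATE:hostname:{json}` strings. The function names are hypothetical and not taken from the codebase. It assumes hostnames never contain `:` and that `uptime` is a plain number; neither is stated in the docs.

```python
import json

HEARTBEAT_PREFIX = "CLUSTER_HEARTBEAT:"
NODE_UPDATE_PREFIX = "NODE_UPDATE:"


def build_heartbeat(hostname):
    """Return the heartbeat message for the given hostname."""
    return HEARTBEAT_PREFIX + hostname


def build_node_update(hostname, ip, uptime, labels=None):
    """Return a node update message carrying the documented JSON fields."""
    payload = {"hostname": hostname, "ip": ip, "uptime": uptime}
    if labels:
        # "labels" is optional, so it is only emitted when present.
        payload["labels"] = labels
    return NODE_UPDATE_PREFIX + hostname + ":" + json.dumps(payload)


def parse_node_update(message):
    """Split a node update into (hostname, info dict).

    The JSON body itself contains colons, so only the first colon after
    the prefix is treated as the hostname/JSON separator (assumption:
    hostnames contain no colon).
    """
    if not message.startswith(NODE_UPDATE_PREFIX):
        raise ValueError("not a NODE_UPDATE message")
    rest = message[len(NODE_UPDATE_PREFIX):]
    hostname, sep, raw_json = rest.partition(":")
    if not sep:
        raise ValueError("missing JSON payload")
    return hostname, json.loads(raw_json)
```

Splitting on the first colon only is the one design choice worth noting here: a naive `split(":")` would cut the JSON payload apart.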
### Heartbeat Flow
1. **A node broadcasts** `CLUSTER_HEARTBEAT:<hostname>`
2. **Each receiver replies** with `CLUSTER_NODE_INFO:<hostname>:<json>` to the heartbeat sender IP
1. **A node broadcasts** `CLUSTER_HEARTBEAT:hostname`
2. **Each receiver responds** with `NODE_UPDATE:hostname:{json}` to the heartbeat sender IP
3. **The sender**:
- Ensures the node exists or creates it with `hostname` and sender IP
- Parses JSON and updates resources, labels, `status = ACTIVE`, `lastSeen = now`
- Parses JSON and updates node info, `status = ACTIVE`, `lastSeen = now`
- Sets `latency = now - lastHeartbeatSentAt` (per-node, measured at heartbeat origin)
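The sender-side bookkeeping in step 3 could look roughly like the sketch below. `Node`, `MemberList` and the method names are hypothetical stand-ins, not the project's actual types. Timestamps are assumed to be milliseconds from a monotonic clock, and the IP is only set when the node is first created, since the docs do not say whether it is refreshed afterwards.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    hostname: str
    ip: str
    status: str = "ACTIVE"
    last_seen: int = 0
    latency: int = 0
    info: dict = field(default_factory=dict)


class MemberList:
    def __init__(self):
        self.nodes = {}
        self.last_heartbeat_sent_at = 0

    def on_heartbeat_sent(self, now_ms):
        # Remember when the broadcast went out so replies can be timed.
        self.last_heartbeat_sent_at = now_ms

    def on_node_update(self, hostname, sender_ip, raw_json, now_ms):
        # Ensure the node exists, creating it from hostname and sender IP.
        node = self.nodes.get(hostname)
        if node is None:
            node = Node(hostname=hostname, ip=sender_ip)
            self.nodes[hostname] = node
        # Parse the JSON and refresh info, status and liveness.
        node.info = json.loads(raw_json)
        node.status = "ACTIVE"
        node.last_seen = now_ms
        # Latency is measured at the heartbeat origin, per node.
        node.latency = now_ms - self.last_heartbeat_sent_at
        return node
```

Note that a reply arriving after the next heartbeat has already gone out would be timed against the newer heartbeat under this scheme.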
### Node Update Broadcasting
1. **Periodic broadcast**: Each node broadcasts `NODE_UPDATE:hostname:{json}` every 5 seconds
2. **All receivers**: Update their memberlist entry for the broadcasting node
3. **Purpose**: Ensures all nodes have current information about each other
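A minimal sketch of the periodic broadcast, assuming a cooperative task loop that calls `tick` with the current time in milliseconds. `PeriodicTask` and `make_udp_broadcaster` are hypothetical names; the socket calls are standard Python and stand in for whatever UDP API the firmware uses.

```python
import socket


class PeriodicTask:
    """Runs an action at most once per interval when ticked."""

    def __init__(self, interval_ms, action):
        self.interval_ms = interval_ms
        self.action = action
        self.last_run_ms = None

    def tick(self, now_ms):
        # Fire on the first tick, then whenever the interval has elapsed.
        due = self.last_run_ms is None or now_ms - self.last_run_ms >= self.interval_ms
        if due:
            self.last_run_ms = now_ms
            self.action()
        return due


def make_udp_broadcaster(port=4210):
    """Return a function that broadcasts a text message to 255.255.255.255."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

    def broadcast(message):
        sock.sendto(message.encode("utf-8"), ("255.255.255.255", port))

    return broadcast
```

Wiring the two together would be something like `PeriodicTask(5000, lambda: broadcast(update_message))`, ticked from the main loop.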
### Listener Behavior
The `cluster_listen` task parses one UDP packet per run and dispatches by prefix to:
- **Discovery** → send `CLUSTER_RESPONSE`
- **Heartbeat** → send `CLUSTER_NODE_INFO` JSON
- **Response** → add/update node using provided hostname and source IP
- **Node Info** → update resources/status/labels and record latency
- **Heartbeat** → add/update node and send `NODE_UPDATE` JSON response
- **Node Update** → update node information and status
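The prefix dispatch can be sketched as follows, covering only the Heartbeat and Node Update branches. This is an illustration under assumptions, not the `cluster_listen` implementation: `dispatch`, the dict-based member list, and the injected `send(message, ip)` callback are all hypothetical.

```python
import json

HEARTBEAT_PREFIX = "CLUSTER_HEARTBEAT:"
NODE_UPDATE_PREFIX = "NODE_UPDATE:"


def dispatch(packet, sender_ip, members, local_info, send):
    """Handle one UDP packet; return the name of the branch taken, or None."""
    if packet.startswith(HEARTBEAT_PREFIX):
        hostname = packet[len(HEARTBEAT_PREFIX):]
        # Add or update the heartbeat sender in the member list.
        entry = members.setdefault(hostname, {"ip": sender_ip, "info": {}})
        entry["status"] = "ACTIVE"
        # Reply to the sender (unicast) with this node's own info.
        reply = NODE_UPDATE_PREFIX + local_info["hostname"] + ":" + json.dumps(local_info)
        send(reply, sender_ip)
        return "heartbeat"
    if packet.startswith(NODE_UPDATE_PREFIX):
        rest = packet[len(NODE_UPDATE_PREFIX):]
        # Assumes hostnames contain no colon; the JSON body may.
        hostname, _, raw_json = rest.partition(":")
        entry = members.setdefault(hostname, {"ip": sender_ip, "info": {}})
        entry["info"] = json.loads(raw_json)
        entry["status"] = "ACTIVE"
        return "node_update"
    # Unknown prefixes are ignored.
    return None
```

Passing `send` in as a parameter keeps the dispatch logic testable without opening a socket.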
### Timing and Intervals
- **UDP Port**: `Config.udp_port` (default 4210)
- **Discovery Interval**: `Config.discovery_interval_ms` (default 1000 ms)
- **Listen Interval**: `Config.cluster_listen_interval_ms` (default 10 ms)
- **Heartbeat Interval**: `Config.heartbeat_interval_ms` (default 5000 ms)
- **Node Update Broadcast Interval**: `Config.node_update_broadcast_interval_ms` (default 5000 ms)
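For reference, the defaults listed above can be collected in one place. The sketch below mirrors the documented `Config` field names in a Python dataclass; the `validate` helper is a hypothetical addition and not part of the project.

```python
from dataclasses import dataclass


@dataclass
class Config:
    udp_port: int = 4210
    discovery_interval_ms: int = 1000
    cluster_listen_interval_ms: int = 10
    heartbeat_interval_ms: int = 5000
    node_update_broadcast_interval_ms: int = 5000

    def validate(self):
        """Raise ValueError if the port or any interval is out of range."""
        if not 1 <= self.udp_port <= 65535:
            raise ValueError("udp_port must be in 1..65535")
        intervals = {
            "discovery_interval_ms": self.discovery_interval_ms,
            "cluster_listen_interval_ms": self.cluster_listen_interval_ms,
            "heartbeat_interval_ms": self.heartbeat_interval_ms,
            "node_update_broadcast_interval_ms": self.node_update_broadcast_interval_ms,
        }
        for name, value in intervals.items():
            if value <= 0:
                raise ValueError(name + " must be positive")
```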
### Node Status Categories