cluster leader sensor reports none #195
Adding something to this now. Every member of the cluster now reports a different leader:
They all report 3 as cluster members and everything works as expected. I only see the […]

Could it be linked to the fact that the Raspberry Pi Zero W instances (Office and Bedroom) are losing packets while scanning and are thus not receiving the cluster messages (UDP, I think)?
From what you provided, I can think of two reasons this might be happening:
Since I don't see the "elected" message in the logs you posted, I think the former is more likely. The latter can easily be checked: if the state is correct in the (still unofficial) API […]

Lost packets could manifest as nodes dropping out of the cluster and the cluster size shrinking; they generally shouldn't lead to a leader mismatch. If you're seeing correct behavior in the rest of the application, I don't think this is the case.
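The point about lost packets can be made concrete with a minimal TypeScript sketch of heartbeat-based membership. It assumes each node records when it last heard from every peer and prunes peers that stay silent longer than a timeout; `PeerState`, `activeMembers` and `leaderAfterPruning` are hypothetical names, not room-assistant or democracy.js APIs. Under this model, lost packets shrink the member list, and the known leader only changes if the leader itself is pruned.

```typescript
// Illustrative sketch only: PeerState, activeMembers and leaderAfterPruning are
// hypothetical names, not room-assistant or democracy.js APIs.

interface PeerState {
  id: string;
  lastSeenMs: number; // time at which the last hello from this peer arrived
}

// Peers not heard from within timeoutMs are dropped from the member list,
// which is what a "cluster size" sensor would then report.
function activeMembers(peers: PeerState[], nowMs: number, timeoutMs: number): string[] {
  return peers
    .filter((p) => nowMs - p.lastSeenMs <= timeoutMs)
    .map((p) => p.id);
}

// The known leader only becomes unknown (null, prompting a new election)
// if the leader itself is pruned; losing other peers leaves it unchanged.
function leaderAfterPruning(currentLeader: string, members: string[]): string | null {
  return members.indexOf(currentLeader) !== -1 ? currentLeader : null;
}

// Example: bedroom's hellos were lost for 5 s, longer than a 3 s timeout.
const peers: PeerState[] = [
  { id: "office", lastSeenMs: 9500 },
  { id: "bedroom", lastSeenMs: 5000 },
  { id: "living-room", lastSeenMs: 9800 },
];
const members = activeMembers(peers, 10000, 3000);
console.log(members.join(", ")); // office, living-room
console.log(leaderAfterPruning("living-room", members)); // living-room
```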
Thanks @mKeRix! Regarding MQTT, I've checked, and the statuses are in sync between what is stored in the MQTT server (and thus what is displayed in HA) and what is reported by the API of each node. However, right now I have the following in the API and HA (no reboot since my last message here):
Bedroom switched its leader from […]:

5/10/2020, 11:44:00 AM - debug - BluetoothClassicService: Received RSSI of -0.2 for XXXXXXX from Office
5/10/2020, 11:44:00 AM - info - ClusterService: Removed IP_OF_LIVING_ROOM:6425 from the cluster with id living-room
5/10/2020, 11:44:00 AM - debug - ClusterService: Saving configured peer IP_OF_LIVING_ROOM:6425 from ultimate removal
5/10/2020, 11:44:01 AM - info - ClusterService: office has been elected as leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state office for room-assistant-bedroom-status-cluster-leader
5/10/2020, 11:44:19 AM - info - ClusterService: Added IP_OF_LIVING_ROOM:6425 to the cluster with id living-room
5/10/2020, 11:44:19 AM - debug - BluetoothClassicService: Received RSSI of -0.1 for XXXXXXX from Office
5/10/2020, 11:44:24 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
... Nothing relevant after that

Office lost Living Room but doesn't have the new leader log message (although I can see several messages before 11:44 AM reporting living room as the new leader):

5/10/2020, 11:39:18 AM - info - ClusterService: living-room has been elected as leader
[...]
5/10/2020, 11:44:00 AM - info - ClusterService: Removed IP_OF_LIVING_ROOM:6425 from the cluster with id living-room
5/10/2020, 11:44:01 AM - info - EntitiesService: Refreshing entity states
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state true for room-assistant-office-bluetooth-classic-inquiries-switch
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state 3 for room-assistant-office-status-cluster-size
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state living-room for room-assistant-office-status-cluster-leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state true for room-assistant-bluetooth-classic-xxxxxxxxx-tracker
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state Office for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - debug - ClusterService: Saving configured peer IP_OF_LIVING_ROOM:6425 from ultimate removal
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {} for room-assistant-office-bluetooth-classic-inquiries-switch
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"nodes":["bedroom","living-room","office"]} for room-assistant-office-status-cluster-size
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"quorumReached":true} for room-assistant-office-status-cluster-leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {} for room-assistant-bluetooth-classic-xxxxxxxxx-tracker
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.2,"lastUpdatedAt":"2020-05-10T11:44:00.385Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - info - ClusterService: Added IP_OF_LIVING_ROOM:6425 to the cluster with id living-room
... Nothing specific after that

Living Room (wired) switched its leader at 11:44 AM from […]:

5/10/2020, 11:44:00 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.2,"lastUpdatedAt":"2020-05-10T11:44:00.406Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - info - ClusterService: office has been elected as leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state living-room for room-assistant-living-room-status-cluster-leader
5/10/2020, 11:44:12 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
5/10/2020, 11:44:13 AM - debug - BluetoothClassicService: Received RSSI of -9.4 for XXXXXXX from Living Room
5/10/2020, 11:44:19 AM - debug - BluetoothClassicService: Received RSSI of -0.1 for XXXXXXX from Office
5/10/2020, 11:44:19 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.1,"lastUpdatedAt":"2020-05-10T11:44:19.725Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:24 AM - debug - BluetoothClassicService: Received RSSI of -13.3 for XXXXXXX from Bedroom
5/10/2020, 11:44:30 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
5/10/2020, 11:44:31 AM - debug - BluetoothClassicService: Received RSSI of -10.5 for XXXXXXX from Living Room
5/10/2020, 11:44:37 AM - debug - BluetoothClassicService: Received RSSI of 0 for XXXXXXX from Office
5/10/2020, 11:44:37 AM - debug - HomeAssistantService: Sending new attributes {"distance":0,"lastUpdatedAt":"2020-05-10T11:44:37.624Z"} for room-assistant-bluetooth-classic-xxxxxxxxx

Everything is still working fine, and Living Room is the apparent leader, as it sends all the updates. So I think there is a glitch somewhere :)
Interesting, thanks for the detailed report! I will check the code that triggers the election messages and the sensor updates.
# [2.7.0](v2.6.0...v2.7.0) (2020-05-26)

### Bug Fixes

* **cluster:** resolve localhost with digResolver as well ([5b076de](5b076de)), closes [#171](#171)
* update leader status when local node is elected ([2a51603](2a51603)), closes [#195](#195)

### Features

* **bluetooth-classic:** monitor command health ([37ae1e4](37ae1e4)), closes [#194](#194)
* add update notifications ([a54cee9](a54cee9))
* allow state updates to be debounced ([3a93728](3a93728))
🎉 This issue has been resolved in version 2.7.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
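The 2.7.0 fix "update leader status when local node is elected" suggests the leader sensor was previously only updated when another node won the election. A hypothetical TypeScript sketch of a sensor that handles both cases is below; the event shapes and the `LeaderSensor` class are assumptions for illustration, not the actual room-assistant code.

```typescript
// Illustrative sketch: the event shapes and LeaderSensor class are hypothetical,
// not the actual room-assistant or democracy.js types.
type ElectionEvent =
  | { kind: "local-elected"; localId: string } // this instance won the election
  | { kind: "remote-elected"; leaderId: string }; // another instance won

class LeaderSensor {
  // Value exposed by the "cluster leader" sensor until an election is observed.
  state = "none";

  handle(event: ElectionEvent): void {
    switch (event.kind) {
      case "local-elected":
        // If this branch were missing, the elected node would keep its previous
        // value (e.g. "none") even though it is now the leader.
        this.state = event.localId;
        break;
      case "remote-elected":
        this.state = event.leaderId;
        break;
    }
  }
}

const sensor = new LeaderSensor();
sensor.handle({ kind: "remote-elected", leaderId: "living-room" });
sensor.handle({ kind: "local-elected", localId: "office" });
console.log(sensor.state); // office
```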
Hi @mKeRix, it seems the issue with "none" is now gone; however, I still have issues with the values reported by each instance. After a rolling upgrade to 2.7.1 (living-room, then bedroom, then office), the cluster leader status was as follows (all of them reporting quorum reached, and only living room sending updates):
I've restarted the service on Bedroom and now the status is (still quorum reached everywhere and status updates are only sent by living-room):
And I can see […]

Maybe there's a bug in the democracy library, or maybe it doesn't behave well with lost UDP packets because of the shared BT/Wi-Fi on the Zero W?
I've been trying to think of scenarios that would cause these issues and came up with the following one, which may point to a bug in the democracy library:
Bear in mind that I don't have a test or anything that would prove this behavior; it's just an idea. The whole thing is kinda complex, and I also have trouble wrapping my head around the conditions and the resulting code flow. Should the above be true, a quick fix could be to also broadcast the currently known leader of each node and then act if there are differences. This requires some more thought, though, and ideally a test to reproduce the behavior.
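The quick fix suggested above, broadcasting each node's currently known leader and acting on differences, could look roughly like the TypeScript sketch below. The `Hello` payload with a `knownLeader` field and the helper names are hypothetical, not democracy.js's message format; the example data is loosely modeled on the mismatched leaders reported earlier in this thread.

```typescript
// Illustrative sketch: the Hello shape and helper names are hypothetical,
// not democracy.js's actual message format.
interface Hello {
  from: string;
  knownLeader: string | null; // leader this node currently believes in
}

// Collect the distinct leaders claimed across the received hellos.
function conflictingLeaders(hellos: Hello[]): string[] {
  const leaders: string[] = [];
  for (const h of hellos) {
    if (h.knownLeader !== null && leaders.indexOf(h.knownLeader) === -1) {
      leaders.push(h.knownLeader);
    }
  }
  return leaders.sort();
}

// More than one claimed leader means the nodes disagree and should re-elect.
function needsReelection(hellos: Hello[]): boolean {
  return conflictingLeaders(hellos).length > 1;
}

const hellos: Hello[] = [
  { from: "bedroom", knownLeader: "office" },
  { from: "office", knownLeader: "living-room" },
  { from: "living-room", knownLeader: "office" },
];
console.log(conflictingLeaders(hellos).join(", ")); // living-room, office
console.log(needsReelection(hellos)); // true
```

Comparing claimed leaders in every hello would let a node notice a disagreement like this one without waiting for the current leader to time out.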
That would make sense. Another thought I had was that, based on how https://github.com/goldfire/democracy.js/blob/master/lib/democracy.js#L432 works:
Maybe increasing the timeout here https://github.com/goldfire/democracy.js/blob/master/lib/democracy.js#L81-L93 might help.
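To reason about whether a longer timeout would help, here is a back-of-the-envelope TypeScript sketch. It assumes a peer is declared down only when every hello inside one timeout window is lost and that losses are independent; the 1000 ms interval, the 3000/6000 ms timeouts and the 20% loss rate are example values, not democracy.js defaults (check the linked lines for those). A peer being declared down and re-added, as in the Removed/Added log lines above, is what can trigger a new election.

```typescript
// Illustrative model, not democracy.js code: a peer sends a hello every
// intervalMs and is considered down after timeoutMs without one. Losses are
// assumed independent with probability lossRate, which is optimistic if the
// Zero W's shared Bluetooth/Wi-Fi causes bursty loss.

// Number of consecutive hellos that fall inside one timeout window.
function hellosPerTimeout(intervalMs: number, timeoutMs: number): number {
  return Math.floor(timeoutMs / intervalMs);
}

// Probability that every hello in a window is lost, i.e. that a healthy
// peer is wrongly treated as down.
function falseRemovalProbability(intervalMs: number, timeoutMs: number, lossRate: number): number {
  return Math.pow(lossRate, hellosPerTimeout(intervalMs, timeoutMs));
}

// Example values only; check the library for its actual defaults.
console.log(falseRemovalProbability(1000, 3000, 0.2).toFixed(4)); // 0.0080
console.log(falseRemovalProbability(1000, 6000, 0.2).toFixed(6)); // 0.000064
```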
## [2.8.2](v2.8.1...v2.8.2) (2020-06-07)

### Bug Fixes

* **cluster:** ensure there is only one leader ([286adc2](286adc2)), closes [#195](#195)
🎉 This issue has been resolved in version 2.8.2 🎉

The release is available on:

Your semantic-release bot 📦🚀
Seems to be working great now :) Thanks a lot!
Indirectly, I suppose. To get to that fix, I basically built the steps I described above as a test and then adapted the code so that it would come to the correct result and not end up with two leaders. Essentially, this would happen when the nodes don't connect to each other quickly and start electing based on incomplete node lists (as you mentioned). Pair that with an inconvenient order of connections and you run into this bug.
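The failure mode described here, nodes electing from incomplete node lists, can be reproduced with a small TypeScript sketch. It assumes a simple "highest weight wins, ties broken by id" rule (`electFrom`), which is not necessarily democracy.js's exact logic, and takes living-room as the weight-50 instance from the original report, with office at 20 and bedroom at 10.

```typescript
// Illustrative sketch: electFrom is a hypothetical "highest weight wins, ties
// broken by id" rule, not democracy.js's exact election logic.
interface NodeInfo {
  id: string;
  weight: number;
}

function electFrom(view: NodeInfo[]): string {
  const sorted = view
    .slice()
    .sort((a, b) => b.weight - a.weight || a.id.localeCompare(b.id));
  return sorted[0].id;
}

// Each node elects using only the members it currently knows about.
function leadersByNode(views: Record<string, NodeInfo[]>): Record<string, string> {
  const result: Record<string, string> = {};
  for (const nodeId of Object.keys(views)) {
    result[nodeId] = electFrom(views[nodeId]);
  }
  return result;
}

// Distinct leaders across all nodes; more than one means a split-brain.
function distinctLeaders(leaders: Record<string, string>): string[] {
  const seen: string[] = [];
  for (const nodeId of Object.keys(leaders)) {
    if (seen.indexOf(leaders[nodeId]) === -1) seen.push(leaders[nodeId]);
  }
  return seen;
}

const office: NodeInfo = { id: "office", weight: 20 };
const bedroom: NodeInfo = { id: "bedroom", weight: 10 };
const livingRoom: NodeInfo = { id: "living-room", weight: 50 };

// Office and living-room have not connected to each other yet.
const partialViews = {
  office: [office, bedroom],
  bedroom: [bedroom, office],
  "living-room": [livingRoom, bedroom],
};
console.log(distinctLeaders(leadersByNode(partialViews)).join(", ")); // office, living-room

// Once every node knows every other node, they all pick the same leader.
const all = [office, bedroom, livingRoom];
const fullViews = { office: all, bedroom: all, "living-room": all };
console.log(distinctLeaders(leadersByNode(fullViews)).join(", ")); // living-room
```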
I'm still seeing this or something related. I've got 6 Pis: 5 Zero Ws and a 3B+. The 3B+ is configured with a weight of 100, and all the other Pis have a weight of 1. After rebooting them all, two clusters formed: one with all of the Zeros and one with just the 3B+ by itself. After a number of hours, neither of them seems to change cluster leaders either. Quorum is set to 4. I have seen that if I reboot any of the Zeros, they will join the media cluster.
@danpowell88 What you're seeing is a mixture of "some nodes not appearing" from the cluster troubleshooting section and #196. If you configure the peer addresses (at least the address of the leader, as the others seem to find each other successfully through mDNS) and then restart all instances at the same time, you should see the wanted outcome.
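With the figures from this report (six instances, quorum 4, weight 100 on the 3B+ and 1 on each Zero W), a small TypeScript sketch shows why the split leaves one side with quorum and the other without. It assumes quorum is a plain count of visible instances and that weight only decides which instance is preferred as leader; the ids `zero-1` to `zero-5` and `pi-3b` are hypothetical.

```typescript
// Illustrative sketch. Assumptions: quorum is a plain count of visible
// instances, and weights only decide which instance is preferred as leader.
interface Member {
  id: string;
  weight: number;
}

function hasQuorum(partition: Member[], quorum: number): boolean {
  return partition.length >= quorum;
}

// Highest weight wins; ties broken by id (illustrative rule).
function preferredLeader(partition: Member[]): string {
  let best = partition[0];
  for (const m of partition) {
    if (m.weight > best.weight || (m.weight === best.weight && m.id < best.id)) {
      best = m;
    }
  }
  return best.id;
}

const zeros: Member[] = [];
for (let i = 1; i <= 5; i++) {
  zeros.push({ id: `zero-${i}`, weight: 1 });
}
const pi3b: Member[] = [{ id: "pi-3b", weight: 100 }];

console.log(hasQuorum(zeros, 4), preferredLeader(zeros)); // true zero-1
console.log(hasQuorum(pi3b, 4)); // false
const merged = zeros.concat(pi3b);
console.log(hasQuorum(merged, 4), preferredLeader(merged)); // true pi-3b
```

"Preferred" only applies when an election actually takes place: as the original report observes, an instance with a higher weight that starts later does not necessarily take over from an existing leader.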
Describe the bug
One of my instances mostly reports `none` in its cluster leader sensor. However, it "seems" to work well and reports values. I can also see my monitored devices in the location of this instance. I'm not sure this instance would take over if one of the other instances fails. It is supposed to be the one with the highest weight (50 vs. 10 and 20), but it is not currently the leader because it was started last.
To reproduce
Not sure
Relevant logs
Nothing relevant in the logs, I think. This is the startup; a leader (`office`) is already elected and quorum is already reached (office has `weight: 20`, bedroom has `weight: 10`).
Living Room:
Relevant configuration
Expected behavior
It seems like it should report the actual cluster leader or say something in the logs if it's not working.
Environment