Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster leader sensor reports none #195

Closed
RomRider opened this issue May 10, 2020 · 13 comments
Closed

cluster leader sensor reports none #195

RomRider opened this issue May 10, 2020 · 13 comments

Comments

@RomRider
Copy link

Describe the bug
One of my instance mostly reports none in its cluster leader sensor. However, it "seems" to work well and reports values. I can also see my monitored devices in the location of this instance. Not sure this instance would take over if one of the other instance fails. It is supposed the be the one with the higher weight (50 vs. 10 and 20), but is not currently the leader because it was started last.

To reproduce
Not sure

Relevant logs
Nothing relevant in the logs I think... This is the startup, a leader (office) is already elected and quorum is already reached (office has weight: 20, bedroom has weight: 10)

Living Room:

room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new state 1 for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new state none for room-assistant-living-room-status-cluster-leader
room-assistant_1  | 5/10/2020, 12:40:34 AM - info - NestApplication: Nest application successfully started
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new attributes {"nodes":["living-room"]} for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new attributes {"quorumReached":false} for room-assistant-living-room-status-cluster-leader
room-assistant_1  | 5/10/2020, 12:40:34 AM - info - ClusterService: Added x.y.z.t:6425 to the cluster with id office
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new state 2 for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Sending new attributes {"nodes":["office","living-room"]} for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - BluetoothClassicService: Received RSSI of -10.6 for XX:XX:XX:XX:XX from Bedroom
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - EntitiesService: Adding new entity bluetooth-classic-xx-xx-xx-xx-xx-xx-tracker
room-assistant_1  | 5/10/2020, 12:40:34 AM - info - HomeAssistantService: Device tracker requires manual setup in Home Assistant with topic: room-assistant/device_tracker/bluetooth-classic-xx-xx-xx-xx-xx-xx-tracker/state
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - EntitiesService: Adding new entity bluetooth-classic-xx-xx-xx-xx-xx-xx
room-assistant_1  | 5/10/2020, 12:40:34 AM - debug - HomeAssistantService: Registering entity room-assistant-bluetooth-classic-xx-xx-xx-xx-xx-xx under homeassistant/sensor/room-assistant/bluetooth-classic-xx-xx-xx-xx-xx-xx/config
room-assistant_1  | 5/10/2020, 12:40:35 AM - info - ClusterService: Added x.y.z.t:6425 to the cluster with id bedroom
room-assistant_1  | 5/10/2020, 12:40:35 AM - debug - HomeAssistantService: Sending new state 3 for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:35 AM - debug - HomeAssistantService: Sending new attributes {"nodes":["office","bedroom","living-room"]} for room-assistant-living-room-status-cluster-size
room-assistant_1  | 5/10/2020, 12:40:39 AM - debug - BluetoothClassicService: Querying for RSSI of XX:XX:XX:XX:XX using hcitool
room-assistant_1  | 5/10/2020, 12:40:39 AM - debug - BluetoothClassicService: Received RSSI of -21 for XX:XX:XX:XX:XX from Living Room
room-assistant_1  | 5/10/2020, 12:40:46 AM - debug - BluetoothClassicService: Received RSSI of -0.1 for XX:XX:XX:XX:XX from Office
room-assistant_1  | 5/10/2020, 12:40:52 AM - debug - BluetoothClassicService: Received RSSI of -8.2 for XX:XX:XX:XX:XX from Bedroom

Relevant configuration

cluster:
  networkInterface: '{{ room_assistant_interface }}'
  port: 6425
  peerAddresses: {{ room_assistant_cluster_members }}
  autoDiscovery: false
  weight: {{ room_assistant_weight | default(10) }}
  quorum: 2
global:
  instanceName: '{{ room_assistant_name }}'
  integrations:
    - homeAssistant
    - bluetoothClassic
homeAssistant:
  mqttUrl: '{{ mqtt_server }}'
  mqttOptions:
    username: '{{ mqtt_room_assistant_username }}'
    password: '{{ mqtt_room_assistant_password }}'
bluetoothClassic:
  hciDeviceId: 0
  minRssi: -30
  interval: 6
  addresses: {{ room_assistant_bluetooth_devices }}

Expected behavior
It seems like it should report the actual cluster leader or say something in the logs if it's not working.

Environment

  • room-assistant version: 2.6.0
  • installation type: Docker
  • hardware: Pi 3B (+ 2 Zero W)
  • OS: Linux

Additional context
Add any other context about the problem here.

@RomRider RomRider added the bug label May 10, 2020
@RomRider
Copy link
Author

RomRider commented May 10, 2020

Adding something to this now. Every member of the cluster now reports a different leader:

  • Living Room reports none
  • Office reports office
  • Bedroom reports living-room

They all report 3 as cluster members and everything works as expected. I only see the Living Room instance sending updates to mqtt which probably means it is the actual leader.
HomeAssistantService: Sending new attributes {"distance":0.3,"lastUpdatedAt":"2020-05-10T11:33:20.414Z"} for room-assistant-bluetooth-classic-xx-xx-xx-xx-xx-xx

Could it be linked to the fact that the raspberry pi zero w (Office and Bedroom) are loosing packets while scanning and thus not receiving the cluster messages (UDP I think)?

@mKeRix
Copy link
Owner

mKeRix commented May 10, 2020

From what you provided I can think of two reasons this might be happening:

  • election event happens before the entity is registered/created
  • MQTT message with state update gets lost due to the QOS not being high enough

Since I don't see the "elected" message in the logs you posted I think the former is more likely. The latter can easily be checked: if the state is correct in the (still unofficial) API (http://<instance-ip>:6415/entities) then the message just got lost on the way.

Lost packets could manifest in dropping out of the cluster and the size reducing, it shouldn't lead to leader mismatch generally. If you're seeing correct behavior for the rest of the application I don't think this is the case.

@RomRider
Copy link
Author

Thanks @mKeRix

So regarding MQTT, I've checked and the status are in sync between what is stored in the MQTT server (and thus what is displayed in HA) and what is reported by the API of each node.

However, right now I have in the API and HA (no reboot since the last message I sent here):

  • Office (Zero W) reports living-room (quorum reached)
  • Living Room (3B+ Ethernet, not wifi) reports living-room (quorum reached)
  • Bedroom (Zero W) reports office 🤷 (quorum reached)

Bedroom switched its leader from living-room to office at 11:44AM:

5/10/2020, 11:44:00 AM - debug - BluetoothClassicService: Received RSSI of -0.2 for XXXXXXX from Office
5/10/2020, 11:44:00 AM - info - ClusterService: Removed IP_OF_LIVING_ROOM:6425 from the cluster with id living-room
5/10/2020, 11:44:00 AM - debug - ClusterService: Saving configured peer IP_OF_LIVING_ROOM:6425 from ultimate removal
5/10/2020, 11:44:01 AM - info - ClusterService: office has been elected as leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state office for room-assistant-bedroom-status-cluster-leader
5/10/2020, 11:44:19 AM - info - ClusterService: Added IP_OF_LIVING_ROOM:6425 to the cluster with id living-room
5/10/2020, 11:44:19 AM - debug - BluetoothClassicService: Received RSSI of -0.1 for XXXXXXX from Office
5/10/2020, 11:44:24 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
... Nothing relevant after that

Office lost living room but doesn't have the new leader log message (but I can see several messages before 11:44AM reporting living room as the new leader):

5/10/2020, 11:39:18 AM - info - ClusterService: living-room has been elected as leader
[...]
5/10/2020, 11:44:00 AM - info - ClusterService: Removed IP_OF_LIVING_ROOM:6425 from the cluster with id living-room
5/10/2020, 11:44:01 AM - info - EntitiesService: Refreshing entity states
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state true for room-assistant-office-bluetooth-classic-inqu
iries-switch
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state 3 for room-assistant-office-status-cluster-size
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state living-room for room-assistant-office-status-cluster-leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state true for room-assistant-bluetooth-classic-xxxxxxxxx-tracker
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state Office for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - debug - ClusterService: Saving configured peer IP_OF_LIVING_ROOM:6425 from ultimate removal
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {} for room-assistant-office-bluetooth-classic-inquiries-switch
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"nodes":["bedroom","living-room","office"]} for room-assistant-office-status-cluster-size
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"quorumReached":true} for room-assistant-office-status-cluster-leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {} for room-assistant-bluetooth-classic-xxxxxxxxx-tracker
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.2,"lastUpdatedAt":"2020-05-10T11:44:00.385Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - info - ClusterService: Added IP_OF_LIVING_ROOM:6425 to the cluster with id living-room
... Nothing specific after that

Living Room (Wired) switched its leader at 11:44AM from none to living-room even though it says it has elected office as the new leader...:

5/10/2020, 11:44:00 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.2,"lastUpdatedAt":"2020-05-10T11:44:00.406Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:01 AM - info - ClusterService: office has been elected as leader
5/10/2020, 11:44:01 AM - debug - HomeAssistantService: Sending new state living-room for room-assistant-living-room-status-cluster-leader
5/10/2020, 11:44:12 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
5/10/2020, 11:44:13 AM - debug - BluetoothClassicService: Received RSSI of -9.4 for XXXXXXX from Living Room
5/10/2020, 11:44:19 AM - debug - BluetoothClassicService: Received RSSI of -0.1 for XXXXXXX from Office
5/10/2020, 11:44:19 AM - debug - HomeAssistantService: Sending new attributes {"distance":0.1,"lastUpdatedAt":"2020-05-10T11:44:19.725Z"} for room-assistant-bluetooth-classic-xxxxxxxxx
5/10/2020, 11:44:24 AM - debug - BluetoothClassicService: Received RSSI of -13.3 for XXXXXXX from Bedroom
5/10/2020, 11:44:30 AM - debug - BluetoothClassicService: Querying for RSSI of XXXXXXX using hcitool
5/10/2020, 11:44:31 AM - debug - BluetoothClassicService: Received RSSI of -10.5 for XXXXXXX from Living Room
5/10/2020, 11:44:37 AM - debug - BluetoothClassicService: Received RSSI of 0 for XXXXXXX from Office
5/10/2020, 11:44:37 AM - debug - HomeAssistantService: Sending new attributes {"distance":0,"lastUpdatedAt":"2020-05-10T11:44:37.624Z"} for room-assistant-bluetooth-classic-xxxxxxxxx

Everything is still working fine and Living Room is the apparent leader as it sends all the updates.

So I think there is a glitch somewhere :)

@mKeRix
Copy link
Owner

mKeRix commented May 12, 2020

Interesting - thanks for the detailed report! I will check the code that triggers the election messages and the sensor updates.

@mKeRix mKeRix closed this as completed in 2a51603 May 26, 2020
github-actions bot pushed a commit that referenced this issue May 26, 2020
# [2.7.0](v2.6.0...v2.7.0) (2020-05-26)

### Bug Fixes

* **cluster:** resolve localhost with digResolver as well ([5b076de](5b076de)), closes [#171](#171)
* update leader status when local node is elected ([2a51603](2a51603)), closes [#195](#195)

### Features

* **bluetooth-classic:** monitor command health ([37ae1e4](37ae1e4)), closes [#194](#194)
* add update notifications ([a54cee9](a54cee9))
* allow state updates to be debounced ([3a93728](3a93728))
@github-actions
Copy link

🎉 This issue has been resolved in version 2.7.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@RomRider
Copy link
Author

RomRider commented May 27, 2020

Hi @mKeRix, it seems the issue with "none" is now gone, however, I still have issues with values reported by each instance.

After a rolling upgrade to 2.7.1 (living-room, then bedroom, then office) the status for the cluster leader was this one (all of them reporting quorum reached, and living room only sending updates):

  • Living-Room reported "living-room"
  • Bedroom reported "office" 🤷‍♂️
  • Office reported "living-room"

I've restarted the service on Bedroom and now the status is (still quorum reached everywhere and status updates are only sent by living-room):

  • Living-Room reports "living-room"
  • Bedroom now reports "living-room"
  • Office switched to "bedroom" 🤷‍♂️

And I can see room-assistant_1 | 5/27/2020, 5:12:27 PM - info - ClusterService: bedroom has been elected as leader just when Bedroom went back online in the Living-Room instance log, while still having Living-Room being the cluster leader.

Maybe there's a bug in the democracy library or maybe it doesn't behave well with lost UDP packets because of the shared BT/Wifi on the Zero W?

@mKeRix
Copy link
Owner

mKeRix commented May 27, 2020

I've been trying to think of scenarios that would cause these issues to occur and came up with the following scenario that may lead to a bug in the democracy library:

  • you start in the first state from the post above
  • bedroom is restarted, drops out of the cluster for a bit
  • while starting bedroom doesn't immediately connect to the other instances and elects itself as leader (even if it doesn't have a quorum)
  • bedroom connects to office and announces itself as leader
    • office now knows two leaders, as it wasn't the leader itself before and therefor did not trigger a new election
    • the internals of democracy just cause the last node in the list with leader state to be considered leader within office
  • bedroom connects to living room, re-election is triggered between those two
  • office does not update its leader state - as its local decision still stands

Bear in mind that I don't have a test or anything that would prove this behavior and it's just an idea. The whole thing is kinda complex and I also have trouble wrapping my head around the conditions and the result code flow. Should the above be true a quick fix could possibly be to also broadcast the currently known leader for each node and then act if there are differences there. This requires some more though though, and ideally a test to reproduce the behavior with.

@mKeRix mKeRix reopened this May 27, 2020
@mKeRix mKeRix removed the released label May 27, 2020
@RomRider
Copy link
Author

RomRider commented May 27, 2020

That would make sense, also another though I had was that based on how https://github.com/goldfire/democracy.js/blob/master/lib/democracy.js#L432 works:

  • it could have not received all the hello from the other members of the cluster
  • thus the _nodes property would be empty or partially complete straight after init (because the _nodes property of the class is not based on the peer list as far as I understand how the code is built. The _nodes contains nodes from which we received a hello, and the peer list is statically defined / or discovered through MDNS in the case of room-assistant)
  • This would lead the node to elect itself (or the one with the highest weight from the partial list it knows about) as the leader because https://github.com/goldfire/democracy.js/blob/master/lib/democracy.js#L439 only contains itself (or a partial list)

Maybe increasing the timeout here https://github.com/goldfire/democracy.js/blob/master/lib/democracy.js#L81-L93 might help.

@mKeRix mKeRix closed this as completed in 286adc2 Jun 7, 2020
github-actions bot pushed a commit that referenced this issue Jun 7, 2020
## [2.8.2](v2.8.1...v2.8.2) (2020-06-07)

### Bug Fixes

* **cluster:** ensure there is only one leader ([286adc2](286adc2)), closes [#195](#195)
@github-actions
Copy link

github-actions bot commented Jun 7, 2020

🎉 This issue has been resolved in version 2.8.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

@RomRider
Copy link
Author

RomRider commented Jun 8, 2020

Seems to be working great now :) Thanks a lot!
Out of curiosity, was it related to packet loss during initialisation? (I see you re-run the election if there is more than 1 leader)

@mKeRix
Copy link
Owner

mKeRix commented Jun 8, 2020

Indirectly I suppose. To get to that fix I basically built the steps I described above as a test and then adapted the code so that it would come to the correct result and not have 2 leaders. Essentially this would happen when the nodes don't connect to each other quickly and start electing based on incomplete node lists (as you mentioned). Pair that with an inconvenient order of connections and you run into this bug.

@danpowell88
Copy link

danpowell88 commented Jun 13, 2020

I'm still seeing this or something related.

I've got 6 pi, 5x0w and a 3b+, the 3b+ is configured with a weight of 100 and all the other pi's have a weight of 1.

After rebooting them all two clusters formed one with all of the 0's and one with just the 3b+ by itself, after a number of hours neither of them seem to be changing cluster leaders as well, quorum is set to 4

I have seen if I reboot any of the 0's they will join the media cluster

@mKeRix
Copy link
Owner

mKeRix commented Jun 13, 2020

@danpowell88 What you're seeing is a mixture of "some nodes not appearing" from the cluster troubleshooting section and #196. If you configure the peer addresses (at least the address from the leader, as the others seem to successfully find each other through MDNS) and then restart all instances at the same time you should see the wanted outcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants