Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APP-7364] [APP-7366] [RSDK-9684] [APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. #56

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Otterverse
Copy link
Member

Quick PR for easy fixes.

Needs testing for the linked Jira issues

  1. Install
  2. Upgrade/downgrade viam-agent and ensure a "disabled" systemd service remains disabled after upgrade.
  3. Confirm captive portal works for new wifi (mainly that browser doesn't throw errors) and fixes capitalization issue.
  4. Corrupt /etc/viam.json, ensure errors are clear and file is renamed
  5. Make sure it runs okay for a while (15+ minutes) with the new, shorter timeout.
  6. Kill viam-server (or otherwise make it fail) and confirm it logs and restarts quickly.

@Otterverse Otterverse requested a review from maxhorowitz January 9, 2025 20:25
@Otterverse Otterverse changed the title [ APP-7364][APP-7366][RSDK-9684][APP-7154]Add more logging, reduce monitoring loop time, misc small fixes. [ APP-7364][APP-7366][RSDK-9684][APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. Jan 9, 2025
@Otterverse Otterverse changed the title [ APP-7364][APP-7366][RSDK-9684][APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. [APP-7364] [APP-7366] [RSDK-9684] [APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. Jan 9, 2025
Copy link
Member

@maxhorowitz maxhorowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code scan ✅
Have some questions which are mainly aimed at me getting up to speed! I will manually test these fixes later today.

@@ -37,7 +37,7 @@ <h2>Smart Machine Setup</h2>
<div class="form-group">
<label for="network">Network</label>
{{if eq (len .VisibleSSIDs) 0}}
<input type="text" name="ssid" placeholder="Enter Wifi SSID" id="network" required>
<input type="text" name="ssid" placeholder="Enter Wifi SSID" id="network" required autocorrect="off" autocapitalize="off">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context these are "on" if omitted and that was not allowing you to connect with a SSID that was lowercased? Nice find if so 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micheal Lee found/suggested this fix. But yeah, I guess some phones/browsers see an empty text field, and if you type a password that's a normal "word" it auto-capitalizes it, which isn't ideal for passwords.

We don't use a "password" type field here, because I wanted this to be visible before submitting. Otherwise it's way too easy to accidentally typo and then it takes several minutes for the device to try out the thing, fail, and restart a hotspot.

if err := os.MkdirAll(filepath.Dir(serviceFilePath), 0o755); err != nil {
return errw.Wrapf(err, "creating directory %s", filepath.Dir(serviceFilePath))
}
// use this later to avoid re-enabling an existing agent service a user might have disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to confirm my understanding...there are users who have Viam agent installed but who don't want it to automatically start up on boot, so they've "disabled" auto start. When they upgrade their Viam agent, the configuration file where they've "disabled" the auto start gets overwritten such that startup on boot becomes true...and that's what you're fixing in this PR.

Copied the following snippet from the ticket - is this the final product (i.e. the user now has to manually restart Viam agent after an upgrade)?

Jan 07 16:11:09 pluto viam-agent[911]: 2025-01-07T22:11:09.496Z        INFO        viam-agent        agent/subsystem.go:355        viam-agent updated from 0.11.0 to 0.12.0
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:09.515Z        INFO        viam-agent        viamagent/viamagent.go:131        writing systemd service file to /usr/local/lib/systemd/system/viam-agent.service
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:09.515Z        INFO        viam-agent        viamagent/viamagent.go:146        enabling systemd viam-agent service
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:10.091Z        INFO        viam-agent        viamagent/viamagent.go:168        Install complete. Please (re)start the service with 'systemctl restart viam-agent' when ready.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if someone installed viam-agent, then runs systemctl disable viam-agent it won't start at boot. The problem is that, previously, when we did an upgrade, this same code got called, and it ALWAYS enabled the service again. Now it should only enable it on first/new install.

The snippet is the old behavior, where it shows "enabling systemd viam-agent service" was the problem. That should only happen on new installs now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on getting manual testing setup. Lmk if I'm moving in right direction. Using an RPi5 because agent runs on linux (and I use Mac). Should I be:

  1. pulling a config from prod (without actually starting the viam-server)
  2. check out your branch and run make
  3. run the binary that I've created which includes the changes in your PR
  4. validate the viam-agent starts at reboot of the RPi5
  5. run systemctl disable viam-agent
  6. upgrade the viam-agent on the RPi5 by doing what exactly?
  7. validate that the viam-agent service isn't running on 2nd reboot

Please let me know gaps here. Thanks

Copy link
Member

@maxhorowitz maxhorowitz Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated by disabling the agent and upgrading from a stable version

{
  "agent": {
    "viam-agent": {
      "pin_version": "0.12.0"
    },
    "viam-server": {
      "release_channel": "stable",
      "attributes": {}
    }
  }
}

to a pinned URL:

{
  "agent": {
    "viam-agent": {
      "pin_url": "file:///home/jeep/dev/agent/bin/viam-agent-custom-aarch64"
    },
    "viam-server": {
      "release_channel": "stable",
      "attributes": {}
    }
  }
}

After the upgrade and subsequent reboot, I validated the viam-agent system service wasn't running and was still disabled:

○ viam-agent.service - Viam Services Agent
     Loaded: loaded (/usr/local/lib/systemd/system/viam-agent.service; disabled; preset: enabled)
     Active: inactive (dead)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-enable and reboot the viam-agent starts up on its own 👍

● viam-agent.service - Viam Services Agent
     Loaded: loaded (/usr/local/lib/systemd/system/viam-agent.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-01-10 13:56:50 EST; 1min 22s ago
   Main PID: 758 (viam-agent)
      Tasks: 10 (limit: 9247)
        CPU: 4.953s
     CGroup: /system.slice/viam-agent.service
             └─758 /opt/viam/bin/viam-agent --config /etc/viam.json

@@ -24,7 +24,7 @@ import (
)

const (
minimalCheckInterval = time.Second * 60
minimalCheckInterval = time.Second * 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the monitoring loop checking for on a super high level? Is this monitoring for FTDC data in the viam-server? What other things are "monitored" on a regular basis via the agent ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how often it fetches a new config from the cloud. The cloud itself can say "check again after X time") but that's currently not implemented, so it falls to this.

The main loop basically: check for new config > apply changes/updates (if there) > start subsystems that should be started > check health of subsystems > repeat

So the check interval determins (roughly) the overall timing. At 5 seconds, it may have trouble keeping up if there are slow responses or other work to do, but that should be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

}
// use this later to avoid re-enabling an existing agent service a user might have disabled
_, err = os.Stat(serviceFilePath)
newInstall := err != nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth checking for a file does not exist error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This covers edge cases like file corruption too. If we can't stat, then it's best to treat it as a new install, IMHO. But a good thought! In many other cases it'd be worth differentiating.

@maxhorowitz
Copy link
Member

maxhorowitz commented Jan 13, 2025

@Otterverse - working through the cases you need tested - I'm not familiar with the wifi setup flow - how should I be testing

Confirm captive portal works for new wifi (mainly that browser doesn't throw errors) and fixes capitalization issue.

And does the following mean adding an unknown field to the cloud config? Otherwise, how would it be able to re-fetch its config if I change one of its values? Or is it cached on the robot at startup?

Corrupt /etc/viam.json, ensure errors are clear and file is renamed

@Otterverse
Copy link
Member Author

@Otterverse - working through the cases you need tested - I'm not familiar with the wifi setup flow - how should I be testing

Confirm captive portal works for new wifi (mainly that browser doesn't throw errors) and fixes capitalization issue.

After installing and using the custom version, you can delete the wifi network with nmcli and it should go into provisioning mode. When it does, you can connect to the hotspot it makes, and verify that you can use the captive portal (you should be redirected automatically if you connect with your phone as a "sign in" page) to add a wifi network and it doesn't automatically capitalize the password.

And does the following mean adding an unknown field to the cloud config? Otherwise, how would it be able to re-fetch its config if I change one of its values? Or is it cached on the robot at startup?

Corrupt /etc/viam.json, ensure errors are clear and file is renamed

You can always get the config from the link in App for the machine. Click the status bar "offline" notification to make it drop down and give details. There's a "machine credentials" button there which gives you the contents of /etc/viam.json. And to corrupt that file, I meant just make it invalid... open it up with a text editor like nano or whatever, and just delete some curly brackets or something to make it invalid json.

Ping me on slack if you have any questions. Thanks again for the work here!

@maxhorowitz
Copy link
Member

maxhorowitz commented Jan 16, 2025

RE: #56 (comment)

Findings

(1) Does agent provisioning automatically start?

Not sure that this is working as intended. I was running both the agent (with the pinned version pointing to my local agent which I built from your branch) and the server, and I used nmcli con delete **** to delete the active wlan connection - this didn't bring me directly into provisioning mode unless my process was wrong. In a separate tab, once I ran sudo systemctl restart viam-agent the agent-provisioning startup would happen and then I would be able to re-configure its wifi. Note: I was testing this on an RPi5 that was plugged into a monitor (such that I could see both viam-server and viam-agent logs with/out internet connection so I am pretty confident about my observations being true - whether or not my process was wrong is a separate story).

(2) Does the captive portal work once the viam-agent is in provisioning mode?

Yes.

(3) Does the lowercased wifi password get modified to uppercase?

No, as intended.

(4) Does the viam-server start up properly after connecting to the Wifi network provided in the captive portal?

Not really. I'm seeing a continuous loop of restarting the viam-server. The error is listen tcp :8080: bind: address already in use, and I've killed the individual process running viam-server on that port but I keep getting this same loop. I fixed by running killall on viam-server, which results in a successful restart of viam-server by the agent.

(5) Once running, corrupt /etc/viam.json and ensure the file is overwritten to clean version.

This works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants