
hw: number of vCPUs exceeding number of host cores triggers pager error #5442

atopia opened this issue Feb 5, 2025 · 13 comments

@atopia
Contributor

atopia commented Feb 5, 2025

Commit 2728853 causes the following error when running run/seoul-auto on hw:

Genode sculpt-24.10.3-58-g27288530059 <local changes>
1478 MiB RAM and 64536 caps assigned to init
[init -> seoul]   0x1000000 .. 0x10ffffff: linker area
[init -> seoul]   0x40000000 .. 0x4fffffff: stack area
[init -> seoul]   0x30000 .. 0x131fff: ld.lib.so
[init -> seoul] --- Seoul VMM starting ---
[init -> seoul]  VMM memory 27M
[init -> seoul]  using large memory attachments for guest VM.
[init -> seoul]  framebuffer 1024x768
[init -> seoul] - vmm: [0000000020000000,000000003df17000) - vm: [0000000000000000,000000001df17000) - 0+490588K
[init -> seoul]
[init -> seoul] --- Setup VM ---
[init -> seoul] VMM: physmem: 0 [0, 9a000]
[init -> seoul] VMM: physmem: 0 [100000, 1df17000]
[init -> seoul] VMM: directmem: 20000 base e0000+10000 readonly
[...]
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 0' ip=0xb24bd
[init -> seoul] VMM: create vcpu 0 affinity 0:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 1' ip=0xb24bd
[init -> seoul] VMM: create vcpu 1 affinity 0:0

Initially it appeared that the error only occurred on the qemu target. However, closer inspection revealed that the error is triggered when the number of vCPUs reaches the number of CPUs in the system: when running with the default of vcpus_to_be_used 2 and adding -smp 3 to the qemu command line, the error does not trigger, but with -smp 2 it does.
Similarly, running the scenario on a Lenovo X260 with 4 logical cores works fine with vcpus_to_be_used 3 but triggers the error with vcpus_to_be_used 4 (or more).

The runscript run/vmm_x86 works fine despite spawning two vCPUs per core on qemu with -smp 2.

atopia added the bug label Feb 5, 2025
@atopia
Contributor Author

atopia commented Feb 5, 2025

@skalk I have noticed that this error message is expected in run/smp when destroying the RAM dataspace. I have skimmed the commit but I don't yet understand the connection between the change, the number of host cores in the scenario, and the error message. The matter is not pressing but I'd appreciate it if you had a look at your convenience.

@atopia
Contributor Author

atopia commented Feb 5, 2025

While running seoul-auto on qemu on the parent of the commit does not trigger the error message (which is what prompted this investigation in the first place), it turns out that 2728853 merely exposes the error differently: when running on the X260 with vcpus_to_be_used 4, the issue also triggers without the change:

[init -> seoul] VMM: create vcpu 0 affinity 1:0
[init -> seoul] VMM: create vcpu 1 affinity 2:0
[init -> seoul] VMM: create vcpu 2 affinity 3:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 3' ip=0xb24bd
[init -> seoul] VMM: create vcpu 3 affinity 0:0

So it's still worth investigating, but not a regression in 2728853.

atopia changed the title from "hw: pager thread helping introduces regression in run/seoul-auto" to "hw: number of vCPUs exceeding number of host cores triggers pager error" Feb 5, 2025
@atopia
Contributor Author

atopia commented Feb 5, 2025

Running the same scenario with 4 vCPUs on nova does not cause the error message.

@atopia
Contributor Author

atopia commented Feb 6, 2025

The issue has been present since at least 24.11 and seems to be a race condition (i.e., it does not always trigger).

@chelmuth
Member

chelmuth commented Feb 6, 2025

@alex-ab addressed vCPU race issues in 2024-09/10 in genode-world/seoul. How do genodelabs/genode-world@0cb6a8c and genodelabs/genode-world@5d1f087 affect this issue?

@atopia
Contributor Author

atopia commented Feb 6, 2025

I previously did test with an up-to-date genode-world repo, and without the commits I reliably get the panic they are fixing on nova even with 2 vCPUs, but on hw the commits don't appear to make a difference. When reverting the commits in genode-world and testing against Genode 24.11,

  • with fewer vCPUs than host cores, I haven't seen the pager error yet
  • with as many vCPUs as X260 host cores, I see the pager error in 1/2 to 2/3 of the cases:
[init -> seoul] VMM: create vcpu 0 affinity 1:0
[init -> seoul] VMM: create vcpu 1 affinity 2:0
[init -> seoul] VMM: create vcpu 2 affinity 3:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 3' ip=0xb1f1d
[init -> seoul] VMM: create vcpu 3 affinity 0:0

The guest subsequently fails to bring up CPU#3. As mentioned before, prior to 2728853 I hadn't seen the pager error when running on qemu.

alex-ab added a commit to alex-ab/genode that referenced this issue Feb 7, 2025
@alex-ab
Member

alex-ab commented Feb 7, 2025

I crafted the debug commit cbdf3d4, which triggers the same symptom in vmm_x86 (which is much simpler to understand than seoul) on hw, so the cause is not specific to seoul at all.

[init -> vmm] vcpu 2 : created
[init -> vmm] vcpu 3 : created
Error: illegal READ at address 0x58 by pager_object: pd='init -> vmm' thread='third  ep' ip=0xb20ed
[init -> vmm] vcpu 4 : created
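
For readers without the patch at hand, here is a minimal sketch of the reproduction pattern, assuming the debug commit essentially constructs additional entrypoints pinned to CPU locations by index (the component wrapper, names, and stack size below are illustrative and not taken from cbdf3d4):

#include <base/component.h>
#include <base/entrypoint.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	enum { STACK_SIZE = 16*1024 };

	Affinity::Space const space = env.cpu().affinity_space();

	/* extra entrypoints pinned by index; on 'qemu -smp 2' index 2 wraps to CPU 0 */
	static Entrypoint ep_second(env, STACK_SIZE, "second ep",
	                            space.location_of_index(1));
	static Entrypoint ep_third (env, STACK_SIZE, "third  ep",
	                            space.location_of_index(2));

	log("entrypoints constructed");
}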

@chelmuth
Member

chelmuth commented Feb 7, 2025

That's great, thanks @alex-ab! I wonder if _ep_second(env, STACK_SIZE, "second ep", env.cpu().affinity_space().location_of_index(2)) would also trigger the failure.

@alex-ab
Member

alex-ab commented Feb 7, 2025

That's great, thanks @alex-ab! I wonder if _ep_second(env, STACK_SIZE, "second ep", env.cpu().affinity_space().location_of_index(2)) would also trigger the failure.

It does trigger.

@atopia
Contributor Author

atopia commented Feb 7, 2025

Thanks for the debug commit @alex-ab! As you probably figured already, location_of_index(2) will wrap to CPU 0 on qemu (which by default is run with -smp 2) and trigger the error. Changing the index to 0 therefore causes the same behavior. On the X260 with 4 logical cores that wrap does not happen. Changing the 3rd EP's index to 1 (thereby constructing another EP on core 1) doesn't exhibit the effect either, though.
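
For illustration, a minimal sketch of that wrapping behavior, assuming location_of_index() wraps modulo the space dimensions as observed above (the component wrapper is only for demonstration):

#include <base/component.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	Affinity::Space    const space = env.cpu().affinity_space();   /* 2x1 on 'qemu -smp 2' */
	Affinity::Location const loc   = space.location_of_index(2);   /* wraps back to CPU 0 */

	log("space ", space.width(), "x", space.height(),
	    ", index 2 maps to xpos ", loc.xpos());
}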

@chelmuth
Member

chelmuth commented Feb 7, 2025

Could you please check that the following two lines produce the same result on base-hw?

Entrypoint ep1 { env, Component::stack_size(), "ep", Affinity::Location() };
Entrypoint ep2 { env, Component::stack_size(), "ep", env.cpu().affinity_space().location_of_index(0) };
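
For reference, the check could be wired into a throw-away test component roughly like this (assumed scaffolding, not an existing test):

#include <base/component.h>
#include <base/entrypoint.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	static Entrypoint ep1 { env, Component::stack_size(), "ep",
	                        Affinity::Location() };
	static Entrypoint ep2 { env, Component::stack_size(), "ep",
	                        env.cpu().affinity_space().location_of_index(0) };

	log("both entrypoints constructed, no pager error expected");
}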

@atopia
Contributor Author

atopia commented Feb 7, 2025

No, they don't cause the error message.

@skalk
Member

skalk commented Feb 12, 2025

@skalk I have noticed that this error message is expected in run/smp when destroying the RAM dataspace. I have skimmed the commit but I don't yet understand the connection between the change, the number of host cores in the scenario, and the error message. The matter is not pressing but I'd appreciate it if you had a look at your convenience.

@atopia sorry for the late response. Just for completeness: this error message is, in general, a page-fault message. Within run/smp it is expected because we want to test that cross-core TLB shootdown is done right. So we give threads on different CPUs access to a RAM dataspace and then destroy it. Afterwards we check whether each thread faults as expected.
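
For illustration, a rough sketch of that test idea (not the actual run/smp code; constructor and attach details may differ between Genode releases, and the real test synchronizes the threads before destroying the dataspace):

#include <base/component.h>
#include <base/attached_ram_dataspace.h>
#include <base/heap.h>
#include <base/thread.h>
#include <util/reconstructible.h>

using namespace Genode;

struct Reader : Thread
{
	char volatile *ptr;

	Reader(Env &env, Affinity::Location location, char volatile *ptr)
	:
		Thread(env, "reader", 8*1024, location, Weight(), env.cpu()),
		ptr(ptr)
	{ }

	void entry() override
	{
		/* keep touching the shared dataspace from this CPU */
		for (;;) {
			char const value = *ptr;
			(void)value;
		}
	}
};

void Component::construct(Env &env)
{
	static Heap heap { env.ram(), env.rm() };
	static Constructible<Attached_ram_dataspace> ds { };

	ds.construct(env.ram(), env.rm(), 4096);

	Affinity::Space const space = env.cpu().affinity_space();

	/* one reader thread per CPU, all mapping the same RAM dataspace */
	for (unsigned i = 0; i < space.total(); i++)
		(new (heap) Reader(env, space.location_of_index(i),
		                   ds->local_addr<char>()))->start();

	/* destroying the dataspace must shoot down the stale TLB entries on
	   all CPUs, so every reader is expected to raise exactly the page-fault
	   message quoted above */
	ds.destruct();
}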
