
hw: number of vCPUs exceeding number of host cores triggers pager error #5442

atopia opened this issue Feb 5, 2025 · 13 comments

@atopia
Contributor

atopia commented Feb 5, 2025

Commit 2728853 causes the following error when running run/seoul-auto on hw:

Genode sculpt-24.10.3-58-g27288530059 <local changes>
1478 MiB RAM and 64536 caps assigned to init
[init -> seoul]   0x1000000 .. 0x10ffffff: linker area
[init -> seoul]   0x40000000 .. 0x4fffffff: stack area
[init -> seoul]   0x30000 .. 0x131fff: ld.lib.so
[init -> seoul] --- Seoul VMM starting ---
[init -> seoul]  VMM memory 27M
[init -> seoul]  using large memory attachments for guest VM.
[init -> seoul]  framebuffer 1024x768
[init -> seoul] - vmm: [0000000020000000,000000003df17000) - vm: [0000000000000000,000000001df17000) - 0+490588K
[init -> seoul]
[init -> seoul] --- Setup VM ---
[init -> seoul] VMM: physmem: 0 [0, 9a000]
[init -> seoul] VMM: physmem: 0 [100000, 1df17000]
[init -> seoul] VMM: directmem: 20000 base e0000+10000 readonly
[...]
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 0' ip=0xb24bd
[init -> seoul] VMM: create vcpu 0 affinity 0:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 1' ip=0xb24bd
[init -> seoul] VMM: create vcpu 1 affinity 0:0

Initially it appeared that the error only occurred on the qemu target. However, closer inspection revealed that the error is triggered when the number of vCPUs reaches the number of CPUs in the system: when running with the default of vcpus_to_be_used 2 and adding -smp 3 to the qemu command line, the error does not trigger, but with -smp 2 it does.
Similarly, running the scenario on a Lenovo X260 with 4 logical cores works fine with vcpus_to_be_used 3 but triggers the error with vcpus_to_be_used 4 (or more).

The runscript run/vmm_x86 works fine despite spawning two vCPUs per core on qemu with -smp 2.

atopia added the bug label Feb 5, 2025
@atopia
Contributor Author

atopia commented Feb 5, 2025

@skalk I have noticed that this error message is expected in run/smp when destroying the RAM dataspace. I have skimmed the commit but I don't yet understand the connection between the change, the number of host cores in the scenario, and the error message. The matter is not pressing but I'd appreciate it if you had a look at your convenience.

@atopia
Contributor Author

atopia commented Feb 5, 2025

While running seoul-auto on qemu on the parent of the commit does not trigger the error message (which is what prompted this investigation in the first place), it turns out that 2728853 merely exposes the error differently: when running on the X260 with vcpus_to_be_used 4, the issue also triggers without the change:

[init -> seoul] VMM: create vcpu 0 affinity 1:0
[init -> seoul] VMM: create vcpu 1 affinity 2:0
[init -> seoul] VMM: create vcpu 2 affinity 3:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 3' ip=0xb24bd
[init -> seoul] VMM: create vcpu 3 affinity 0:0

So it's still worth investigating, but not a regression in 2728853.

atopia changed the title from "hw: pager thread helping introduces regression in run/seoul-auto" to "hw: number of vCPUs exceeding number of host cores triggers pager error" Feb 5, 2025
@atopia
Contributor Author

atopia commented Feb 5, 2025

Running the same scenario with 4 vCPUs on nova does not cause the error message.

@atopia
Contributor Author

atopia commented Feb 6, 2025

The issue has been present since at least 24.11 and seems to be a race condition (i.e., it does not always trigger).

@chelmuth
Member

chelmuth commented Feb 6, 2025

@alex-ab addressed vCPU race issues in 2024-09/10 in genode-world/seoul. How do genodelabs/genode-world@0cb6a8c and genodelabs/genode-world@5d1f087 affect this issue?

@atopia
Contributor Author

atopia commented Feb 6, 2025

I previously did test with an up-to-date genode-world repo, and without the commits I reliably get the panic they are fixing on nova even with 2 vCPUs, but on hw the commits don't appear to make a difference. When reverting the commits in genode-world and testing against Genode 24.11,

  • with fewer vCPUs than host cores, I haven't seen the pager error yet
  • with as many vCPUs as X260 host cores, I see the pager error in 1/2 to 2/3 of the cases:
[init -> seoul] VMM: create vcpu 0 affinity 1:0
[init -> seoul] VMM: create vcpu 1 affinity 2:0
[init -> seoul] VMM: create vcpu 2 affinity 3:0
Error: illegal READ at address 0x58 by pager_object: pd='init -> seoul' thread='vCPU EP 3' ip=0xb1f1d
[init -> seoul] VMM: create vcpu 3 affinity 0:0

The guest subsequently fails to bring up CPU#3. As mentioned before, prior to 2728853 I hadn't seen the pager error when running on qemu.

alex-ab added a commit to alex-ab/genode that referenced this issue Feb 7, 2025
@alex-ab
Member

alex-ab commented Feb 7, 2025

I crafted the debug commit cbdf3d4, which triggers the same symptom in vmm_x86 (which is much simpler to understand than seoul) on hw, so the cause is not specific to seoul at all.

[init -> vmm] vcpu 2 : created
[init -> vmm] vcpu 3 : created
Error: illegal READ at address 0x58 by pager_object: pd='init -> vmm' thread='third  ep' ip=0xb20ed
[init -> vmm] vcpu 4 : created
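
For readers without the patch at hand, here is a minimal sketch of the reproduction pattern, assuming the debug commit essentially constructs additional entrypoints pinned to CPU locations by index (the component wrapper, names, and stack size below are illustrative and not taken from cbdf3d4):

#include <base/component.h>
#include <base/entrypoint.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	enum { STACK_SIZE = 16*1024 };

	Affinity::Space const space = env.cpu().affinity_space();

	/* extra entrypoints pinned by index; on 'qemu -smp 2' index 2 wraps to CPU 0 */
	static Entrypoint ep_second(env, STACK_SIZE, "second ep",
	                            space.location_of_index(1));
	static Entrypoint ep_third (env, STACK_SIZE, "third  ep",
	                            space.location_of_index(2));

	log("entrypoints constructed");
}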

@chelmuth
Member

chelmuth commented Feb 7, 2025

That's great, thanks @alex-ab! I wonder if _ep_second(env, STACK_SIZE, "second ep", env.cpu().affinity_space().location_of_index(2)) would also trigger the failure.

@alex-ab
Member

alex-ab commented Feb 7, 2025

That's great, thanks @alex-ab! I wonder if _ep_second(env, STACK_SIZE, "second ep", env.cpu().affinity_space().location_of_index(2)) would also trigger the failure.

It does trigger.

@atopia
Contributor Author

atopia commented Feb 7, 2025

Thanks for the debug commit @alex-ab! As you probably figured already, location_of_index(2) will wrap to CPU 0 on qemu (which by default is run with -smp 2) and trigger the error. Changing the index to 0 therefore causes the same behavior. On the X260 with 4 logical cores that wrap does not happen. Changing the 3rd EP's index to 1 (thereby constructing another EP on core 1) doesn't exhibit the effect either, though.
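
For illustration, a minimal sketch of that wrapping behavior, assuming location_of_index() wraps modulo the space dimensions as observed above (the component wrapper is only for demonstration):

#include <base/component.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	Affinity::Space    const space = env.cpu().affinity_space();   /* 2x1 on 'qemu -smp 2' */
	Affinity::Location const loc   = space.location_of_index(2);   /* wraps back to CPU 0 */

	log("space ", space.width(), "x", space.height(),
	    ", index 2 maps to xpos ", loc.xpos());
}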

@chelmuth
Member

chelmuth commented Feb 7, 2025

Could you please check that the following two lines produce the same result on base-hw?

Entrypoint ep1 { env, Component::stack_size(), "ep", Affinity::Location() };
Entrypoint ep2 { env, Component::stack_size(), "ep", env.cpu().affinity_space().location_of_index(0) };
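
For reference, the check could be wired into a throw-away test component roughly like this (assumed scaffolding, not an existing test):

#include <base/component.h>
#include <base/entrypoint.h>
#include <base/log.h>

void Component::construct(Genode::Env &env)
{
	using namespace Genode;

	static Entrypoint ep1 { env, Component::stack_size(), "ep",
	                        Affinity::Location() };
	static Entrypoint ep2 { env, Component::stack_size(), "ep",
	                        env.cpu().affinity_space().location_of_index(0) };

	log("both entrypoints constructed, no pager error expected");
}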

@atopia
Contributor Author

atopia commented Feb 7, 2025

No, they don't cause the error message.

@skalk
Member

skalk commented Feb 12, 2025

@skalk I have noticed that this error message is expected in run/smp when destroying the RAM dataspace. I have skimmed the commit but I don't yet understand the connection between the change, the number of host cores in the scenario, and the error message. The matter is not pressing but I'd appreciate it if you had a look at your convenience.

@atopia sorry for the late response. Just for completeness: this error message is, in general, a page-fault message. Within run/smp it is expected because we want to test that cross-core TLB shootdown is done right. So we give threads on different CPUs access to a RAM dataspace and then destroy it. Afterwards we check whether each thread faults as expected.
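
For illustration, a rough sketch of that test idea (not the actual run/smp code; constructor and attach details may differ between Genode releases, and the real test synchronizes the threads before destroying the dataspace):

#include <base/component.h>
#include <base/attached_ram_dataspace.h>
#include <base/heap.h>
#include <base/thread.h>
#include <util/reconstructible.h>

using namespace Genode;

struct Reader : Thread
{
	char volatile *ptr;

	Reader(Env &env, Affinity::Location location, char volatile *ptr)
	:
		Thread(env, "reader", 8*1024, location, Weight(), env.cpu()),
		ptr(ptr)
	{ }

	void entry() override
	{
		/* keep touching the shared dataspace from this CPU */
		for (;;) {
			char const value = *ptr;
			(void)value;
		}
	}
};

void Component::construct(Env &env)
{
	static Heap heap { env.ram(), env.rm() };
	static Constructible<Attached_ram_dataspace> ds { };

	ds.construct(env.ram(), env.rm(), 4096);

	Affinity::Space const space = env.cpu().affinity_space();

	/* one reader thread per CPU, all mapping the same RAM dataspace */
	for (unsigned i = 0; i < space.total(); i++)
		(new (heap) Reader(env, space.location_of_index(i),
		                   ds->local_addr<char>()))->start();

	/* destroying the dataspace must shoot down the stale TLB entries on
	   all CPUs, so every reader is expected to raise exactly the page-fault
	   message quoted above */
	ds.destruct();
}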
