NVMe keepalive full feature preview #210
Conversation
Thanks for submitting this, Yuri. I'm starting to take a look.
```rust
#[mesh(7)]
pub base_mem: u64,
#[mesh(8)]
pub pfns: Vec<u64>,
```
The queues are always linear in physical memory, so we should just need the base gpa here.
Yes, agree.
Updating my reply: when we call mmap() in LockedMemorySpawner, the VA region is contiguous but the underlying PFNs are not contiguous.
But since we're moving to the mshv driver in the fixed pool allocator, that might not be the case anymore and it may be safe to assume sequential PFNs. TODO: check.
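For illustration, a minimal sketch of that check; the helper, the `PAGE_SHIFT` value, and the inputs are hypothetical, not the PR's actual types:

```rust
/// Hypothetical helper: returns Some(base_gpa) when the saved PFNs are
/// sequential, i.e. the whole queue is describable by one base address.
fn base_gpa_if_contiguous(pfns: &[u64]) -> Option<u64> {
    const PAGE_SHIFT: u64 = 12; // assuming 4 KiB pages
    let first = *pfns.first()?;
    pfns.iter()
        .enumerate()
        .all(|(i, &pfn)| pfn == first + i as u64)
        .then(|| first << PAGE_SHIFT)
}

fn main() {
    // Sequential PFNs: saving base_mem alone would suffice.
    assert_eq!(base_gpa_if_contiguous(&[0x100, 0x101, 0x102]), Some(0x100 << 12));
    // Non-sequential (as with the mmap-backed LockedMemorySpawner): the
    // full pfns vector is still needed.
    assert_eq!(base_gpa_if_contiguous(&[0x100, 0x105]), None);
}
```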
```rust
#[mesh(5)]
pub cq_state: CompletionQueueSavedState,
#[mesh(6)]
pub sq_addr: u64,
```
These and some other fields seem redundant with stuff that's in `SubmissionQueueSavedState`/`CompletionQueueSavedState`.
Will review the redundancy and eliminate all saved VA addresses.
```rust
    pub cid: u16,
}

impl From<&[u8]> for PendingCommandSavedState {
```
Why are we doing this instead of using protobuf?
This is my implementation to re-create PendingCommand from a byte array. I did not find a better way; open to suggestions.
Why do we need to re-create it from a byte array?
`pub struct Command` defined in vm/devices/storage/nvme_spec doesn't implement protobuf. IIRC the attempt to add protobuf support was too intrusive. I found that the alternative Command -> byte array -> protobuf conversion required fewer changes.
Would you suggest adding protobuf support to struct Command instead?
Changed to protobuf as part of the CID-high-bits integration, so spec::Command now has a dependency on protobuf. If that's not OK, I'll revert.
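For context, a hedged sketch of the byte-array round trip discussed above, using the zerocopy traits this code already derives; `RawCommand` is a stand-in for `nvme_spec::Command` with an invented, abbreviated layout:

```rust
use zerocopy::{AsBytes, FromBytes, FromZeroes};

/// Stand-in for nvme_spec::Command; the real struct has more dwords.
#[repr(C)]
#[derive(AsBytes, FromBytes, FromZeroes, Debug, PartialEq)]
struct RawCommand {
    cdw0: u32,
    nsid: u32,
}

fn main() {
    let cmd = RawCommand { cdw0: 0x1234, nsid: 1 };
    // Save: command -> byte array, which mesh/protobuf can carry as bytes.
    let saved: Vec<u8> = cmd.as_bytes().to_vec();
    // Restore: byte array -> command, as the From<&[u8]> impl above does.
    let restored = RawCommand::read_from_prefix(&saved).expect("saved state too short");
    assert_eq!(restored, cmd);
}
```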
```diff
@@ -88,6 +88,8 @@ open_enum! {
     VTL0_MMIO = 5,
     /// This range is mmio for VTL2.
     VTL2_MMIO = 6,
+    /// Memory preserved during servicing.
+    VTL2_PRESERVED = 7,
```
How do these entries make it into the tables? Does this mean we tell the host about these ranges somehow?
Or does the host just know what the preserved range is? Is this some new IGVM concept?
The host knows what the preserved DMA size should be but does not calculate the ranges. The host provides the size through the device tree; the boot shim selects the top of the VTL2 memory range and marks it as reserved with the new type.
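A sketch of that carving step under stated assumptions: the range type and function are hypothetical, and `preserved_size` stands for the value read from the host's device tree property:

```rust
/// Hypothetical half-open physical address range.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MemRange {
    start: u64,
    end: u64,
}

/// Split `preserved_size` bytes off the top of the VTL2 range, mirroring
/// what the boot shim does before tagging the upper piece VTL2_PRESERVED.
fn carve_preserved(vtl2: MemRange, preserved_size: u64) -> Option<(MemRange, MemRange)> {
    let len = vtl2.end.checked_sub(vtl2.start)?;
    if preserved_size == 0 || preserved_size > len {
        return None;
    }
    let split = vtl2.end - preserved_size;
    Some((
        MemRange { start: vtl2.start, end: split }, // remains normal VTL2 RAM
        MemRange { start: split, end: vtl2.end },   // survives servicing
    ))
}

fn main() {
    // The size comes from the host via the new device tree property.
    let vtl2 = MemRange { start: 0x1_0000_0000, end: 0x1_4000_0000 };
    let (ram, preserved) = carve_preserved(vtl2, 0x40_0000).unwrap();
    assert_eq!(ram.end, preserved.start);
    assert_eq!(preserved.end, vtl2.end);
}
```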
```rust
) -> Result<Self, NamespaceError> {
    let identify = nvm::IdentifyNamespace::read_from_prefix(identify_ns)
        .unwrap_or(nvm::IdentifyNamespace::new_zeroed());
    // Restore provides Identify Namespace result to new() so there is no wait.
```
We should add an explicit `restore()` that is a non-async function, rather than have this `block_on` here.
Thanks. I was thinking of adding an explicit Namespace::restore but wanted to avoid duplicating code. If this is the preferred way, I will redesign Namespace::new to share common logic with restore.
One remaining question for me is the creation of the async rescan task when restoring a namespace. I may need help with that.
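To make the suggested split concrete, a shape-only sketch with illustrative names: `new()` stays async and queries the device, while a non-async `restore()` consumes the Identify data already held in saved state:

```rust
// Shape only; the real Namespace carries much more state.
struct IdentifyNamespace([u8; 4096]);

struct Namespace {
    nsid: u32,
    identify: IdentifyNamespace,
}

impl Namespace {
    /// First boot: issue Identify Namespace to the device and await it.
    async fn new(nsid: u32) -> Namespace {
        let identify = issue_identify(nsid).await;
        Namespace { nsid, identify }
    }

    /// Servicing restore: the Identify result is already in the saved
    /// state, so no device round trip (and no block_on) is needed.
    fn restore(nsid: u32, saved_identify: IdentifyNamespace) -> Namespace {
        Namespace { nsid, identify: saved_identify }
    }
}

/// Placeholder for the async admin command.
async fn issue_identify(_nsid: u32) -> IdentifyNamespace {
    IdentifyNamespace([0; 4096])
}

fn main() {
    let ns = Namespace::restore(1, IdentifyNamespace([0; 4096]));
    assert_eq!(ns.nsid, 1);
}
```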
```rust
            command: FromBytes::read_from_prefix(cmd.command.as_bytes()).unwrap(),
            respond: send,
        };
        pending.push((cmd.cid as usize, pending_command));
```
It seems like pending commands need to save the fact that they allocated memory, somehow.
Specifically, any PRP list allocation needs to be saved/restored. Is this done somewhere?
We also need to re-lock any guest memory/re-reserve any double buffer memory that was in use.
And of course, all this needs to be released after the pending command completes.
If a PRP entry points to the bounce buffer (the VTL2 buffer), its GPN should not change; it still points to the same physical page, which is preserved during servicing.
The same should be true if a PRP entry points to a region in VTL0 memory, since we don't touch VTL0 during servicing.
However, maybe I missed where we allocate the PRP list itself when there are multiple entries. If it is not part of the preserved memory, that change needs to be added.
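To make the PRP-list concern concrete: commands spanning more than two pages need a separate list page, and under keepalive that page must itself live in preserved memory. A hedged sketch with a hypothetical allocator trait (chaining of lists longer than one page is elided):

```rust
/// Hypothetical allocator handing out pages from the preserved region.
trait PreservedAlloc {
    /// One preserved page as (VTL2 mapping of 512 u64 entries, its GPA).
    fn alloc_page(&mut self) -> anyhow::Result<(&mut [u64; 512], u64)>;
}

/// PRP1 covers the first data page and PRP2 can name a second page
/// directly; three or more pages force PRP2 to point at a PRP list page.
/// For keepalive, that list page must itself survive servicing.
fn prp2_for(data_pages: &[u64], pool: &mut impl PreservedAlloc) -> anyhow::Result<u64> {
    match data_pages {
        [] | [_] => Ok(0), // PRP2 unused
        [_, second] => Ok(*second),
        [_, rest @ ..] => {
            let (list, list_gpa) = pool.alloc_page()?;
            list[..rest.len()].copy_from_slice(rest);
            Ok(list_gpa)
        }
    }
}

/// Trivial single-page pool for the demo below.
struct OnePage {
    page: [u64; 512],
    gpa: u64,
}

impl PreservedAlloc for OnePage {
    fn alloc_page(&mut self) -> anyhow::Result<(&mut [u64; 512], u64)> {
        Ok((&mut self.page, self.gpa))
    }
}

fn main() -> anyhow::Result<()> {
    let mut pool = OnePage { page: [0; 512], gpa: 0x9000 };
    // Three data pages: PRP2 must point at the preserved list page.
    let prp2 = prp2_for(&[0x1000, 0x2000, 0x3000], &mut pool)?;
    assert_eq!(prp2, 0x9000);
    assert_eq!(&pool.page[..2], &[0x2000, 0x3000][..]);
    Ok(())
}
```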
```rust
    }

    /// Restore queue data after servicing.
    pub fn restore(&mut self, saved_state: &QueuePairSavedState) -> anyhow::Result<()> {
```
A key thing to reason about is what happens to requests that were in flight at save. It seems that the design is that we will keep those requests in flight (i.e., we won't wait for them to complete before saving/after restoring), right? And so storvsp's existing restore path will reissue any IOs that were in flight, but they won't wait for old IOs.
I think this is a problem, because new IOs can race with the original IO and cause data loss. Here's an example of what can go wrong:
- Before servicing, the guest writes "A" to block 0x1000 with CID 1.
- Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state.
- At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.
- Due to reordering in the backend, CID 2 completes before CID 1.
- Storvsp completes the original guest IO.
- The guest issues another IO, this time writing "B" to the same block 0x1000. Our driver issues this with CID 3.
- CID 3 completes before CID 1.
- Finally, some backend code gets around to processing CID 1. It writes "A" to block 0x1000.
So the guest thought it wrote "A" and then "B", but actually what happens is "A" then "B" then "A". Corruption.
There are variants on this, where the guest changes the memory that CID 1 was using and the device DMAs it late, after the IO was completed to the guest. This could even affect reads by corrupting guest memory.
The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about. A more complicated solution would be to reason about what the in flight commands are doing and only block IOs that alias those pages. I don't think that's necessary for v1.
Did I miss somewhere where this is handled?
> The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about.

I think we cut off receiving mesh commands once the servicing save request is received; let me double-check that and at which point it is done.
Regarding draining the outstanding (in-flight) I/Os: eventually we want to bypass this, but for v1 it makes sense to drain them, for simplicity. Let me confirm the path. Draining does happen on the non-keepalive path.
I think two things are important in the conversation above:
- It's okay to drain IOs before issuing new IOs. But: unlike the non-keepalive path, the overall servicing operation should not block on outstanding IOs. This means that the replay will need to remain asynchronous.
- I agree with John's analysis. It is unsafe to complete any IOs back to the guest until all outstanding IOs (IOs in progress to the underlying NVMe device at save time) complete from the NVMe device to the HCL environment. I think it is fine to start replaying the saved IOs while the in-flight IOs are still in progress. This means you have two choices: (a) wait to begin replaying IOs until everything in flight completes, or (b) build some logic in the storvsp layer to not notify the guest until all the in-flight IOs complete. (a) seems simpler to me.
In future changes, I think we will need to consider some sort of filtering (e.g. hold new IOs from guest that overlap LBA ranges, as John suggested in an offline conversation).
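A minimal sketch of option (a), with invented names: the restored queue pair remembers which CIDs were in flight at save time and holds back all new submissions until that set drains:

```rust
use std::collections::HashSet;

/// Illustrative fragment; the real queue pair is async and event-driven.
struct RestoredQueue {
    /// CIDs that were outstanding when the saved state was taken.
    inflight_at_save: HashSet<u16>,
    /// Replayed/new submissions held back until the old IOs drain.
    deferred: Vec<u16>,
}

impl RestoredQueue {
    /// Option (a): nothing new reaches the SQ while old IOs remain.
    fn submit(&mut self, cid: u16) {
        if self.inflight_at_save.is_empty() {
            self.ring_doorbell(cid);
        } else {
            self.deferred.push(cid);
        }
    }

    /// Called for every completion read off the CQ.
    fn on_completion(&mut self, cid: u16) {
        self.inflight_at_save.remove(&cid);
        if self.inflight_at_save.is_empty() {
            // The old IOs have fully drained: release everything held back.
            for cid in std::mem::take(&mut self.deferred) {
                self.ring_doorbell(cid);
            }
        }
    }

    fn ring_doorbell(&mut self, _cid: u16) { /* write the SQ tail doorbell */ }
}

fn main() {
    let mut q = RestoredQueue {
        inflight_at_save: HashSet::from([1]),
        deferred: Vec::new(),
    };
    q.submit(2); // held back: CID 1 from before servicing is outstanding
    assert_eq!(q.deferred, vec![2]);
    q.on_completion(1); // old IO drains, CID 2 is released
    assert!(q.deferred.is_empty());
}
```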
```diff
-            addr, len,
-        )?))
+        // TODO: With the recent change it may not work. Review.
+        Ok(crate::memory::MemoryBlock::new(LockedMemory::new(len)?))
```
We need to just fail here, right? Because we can't restore the requested GPAs.
Yes. Is it OK to do it in two waves?
1. Fail here.
2. Return an error and then reinitialize everything as on first boot.
That is, will we prioritize VM health over error discovery? I mean to defer #2 as a future enhancement.
(Ignore the link; that is just a bullet point.)
If this fails, then presumably we no longer know to where a device is sending DMA. I think that means that said DMA will corrupt arbitrary memory. This is based on what I understand so far, so please feel free to correct me. But, if my understanding is correct, then I think you need to fail the restore. The VM will need to be power cycled.
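Following that reasoning, a hedged sketch of what the restore path in the diff above might do instead of allocating fresh memory; the signature and types here are illustrative:

```rust
/// Stand-in for crate::memory::MemoryBlock.
struct MemoryBlock;

/// Illustrative only: restoring at a caller-requested GPA that the
/// allocator cannot honor must fail, not silently hand back new memory,
/// because the device may still DMA into the original GPAs.
fn restore_dma_buffer(addr: u64, len: usize) -> anyhow::Result<MemoryBlock> {
    anyhow::bail!("cannot restore DMA buffer at gpa {addr:#x} (len {len:#x}); VM must be power cycled")
}

fn main() {
    assert!(restore_dma_buffer(0x1000, 0x2000).is_err());
}
```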
```diff
@@ -16,7 +16,7 @@ use zerocopy::LE;
 use zerocopy::U16;

 #[repr(C)]
-#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect)]
+#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect, Clone)]
```
Can you share what fails to compile (if anything) if you remove Clone?
It fails in namespace.rs line 85, where we restore the namespace object if the Identify structure was provided from saved state.
```rust
        len: usize,
        _pfns: &[u64],
    ) -> anyhow::Result<MemoryBlock> {
        self.create_dma_buffer(len)
```
Ditto here, you can't just ignore the caller's request.
```rust
        #[inspect(hex)]
        size_pages: u64,
    },
    Allocated {
```
I am a bit worried that we will fail to restore some allocations and will never notice until we hit memory corruption. I also don't see a mechanism where we prevent the allocator from handing out saved allocations, e.g. if one device is allocating before another device is done restoring.
I would suggest we improve the logic like this (a sketch follows the list):

- Add a parameter to `FixedPool::allocator` for whether this specific allocator is for a device that will be saved/restored.
- When allocating via a save/restore-capable allocator, remember that fact in `State`.
- As part of servicing state, save the list of ranges that were allocated as part of saved devices.
- When recreating the allocator on restore, create entries for each saved range.
- Make sure `restore` can only restore a range that was previously saved (at which point you convert it to an allocated range). Make sure `alloc` never allocates on top of a saved range.
- After all device initialization has completed, we should be able to assert that there are no remaining saved ranges. They should all be converted to allocated ranges by this point. We can fail VTL0 start if there are still ranges (or we can leak the memory forever, policy decision).
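A sketch of the proposed state machine, with illustrative names: ranges recreated from saved state start as `Saved`, `alloc` skips them, and `restore` converts exactly one matching `Saved` range to `Allocated`:

```rust
#[derive(Debug, Clone, Copy)]
enum State {
    Free { base_pfn: u64, size_pages: u64 },
    /// Recreated from servicing saved state; only restore() may claim it.
    Saved { base_pfn: u64, size_pages: u64 },
    Allocated { base_pfn: u64, size_pages: u64 },
}

struct FixedPool {
    slots: Vec<State>,
}

impl FixedPool {
    /// alloc never hands out a range that a restoring device still owns.
    fn alloc(&mut self, size_pages: u64) -> Option<u64> {
        for slot in &mut self.slots {
            if let State::Free { base_pfn, size_pages: n } = *slot {
                if n == size_pages {
                    *slot = State::Allocated { base_pfn, size_pages };
                    return Some(base_pfn);
                }
            }
        }
        None
    }

    /// restore succeeds only for a range that was previously saved.
    fn restore(&mut self, base_pfn: u64, size_pages: u64) -> anyhow::Result<u64> {
        for slot in &mut self.slots {
            if matches!(*slot, State::Saved { base_pfn: b, size_pages: n }
                if b == base_pfn && n == size_pages)
            {
                *slot = State::Allocated { base_pfn, size_pages };
                return Ok(base_pfn);
            }
        }
        anyhow::bail!("range {base_pfn:#x}+{size_pages:#x} pages was not in the saved state")
    }

    /// After all devices have restored, no Saved ranges may remain.
    fn assert_fully_restored(&self) -> anyhow::Result<()> {
        if self.slots.iter().any(|s| matches!(*s, State::Saved { .. })) {
            anyhow::bail!("saved ranges were never restored; failing VTL0 start");
        }
        Ok(())
    }
}

fn main() -> anyhow::Result<()> {
    let mut pool = FixedPool {
        slots: vec![
            State::Saved { base_pfn: 0x100, size_pages: 16 },
            State::Free { base_pfn: 0x200, size_pages: 16 },
        ],
    };
    assert_eq!(pool.alloc(16), Some(0x200)); // alloc skips the saved range
    pool.restore(0x100, 16)?;                // the restoring device reclaims it
    pool.assert_fully_restored()
}
```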
What I am getting from this: you propose to move the save/restore memory logic from the NVMe queues into the allocator itself, and to restore queues either with an already-present MemBlock or by validating that the queues restored to the same range as the allocator?
```rust
        .await
    pub async fn namespace(&mut self, nsid: u32) -> Result<Arc<Namespace>, NamespaceError> {
        // Check if namespace was already added after restore.
        if let Some(ns) = self.namespace.iter().find(|n| n.nsid() == nsid) {
```
This will break the case where the namespace is hot removed and hot added back with a different identity.
Not sure I follow. In VTL2 settings, we identify a controller's namespace by PCI ID + namespace ID.
Do you expect to have dynamic VTL2 settings in the future?
We discussed this offline, so sharing here. The existence of namespaces is controlled by hardware, and namespaces can come and go at runtime. It's possible that a namespace disappeared during the OpenHCL servicing operation. (Or, worse: disappeared and a new one came back!)
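A hedged sketch of one way to guard against that hazard: compare identity-pinning fields of the saved Identify Namespace data against a fresh Identify at restore time; the field subset here is illustrative:

```rust
/// Illustrative subset of Identify Namespace fields that pin identity.
#[derive(PartialEq)]
struct NamespaceIdentity {
    nguid: [u8; 16], // namespace globally unique identifier
    nsze: u64,       // namespace size in blocks
}

/// Reuse the cached namespace only if the device still reports the same
/// identity; otherwise treat it as a different (hot-swapped) namespace.
fn can_reuse_cached(saved: &NamespaceIdentity, fresh: &NamespaceIdentity) -> bool {
    saved == fresh
}

fn main() {
    let saved = NamespaceIdentity { nguid: [1; 16], nsze: 1 << 20 };
    let fresh = NamespaceIdentity { nguid: [1; 16], nsze: 1 << 20 };
    assert!(can_reuse_cached(&saved, &fresh));
}
```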
```diff
@@ -126,3 +134,70 @@ async fn test_nvme_driver(driver: DefaultDriver) {

     driver.shutdown().await;
 }
+
+#[async_test]
+async fn test_nvme_save_restore(driver: DefaultDriver) {
```
Thanks for adding tests here, Yuri. Would you also add some test cases for:
- Multiple namespaces per controller
- Failure to restore (at various stages)
- A save/restore with pending IO
I don't think we should block this PR on these changes, as long as there's a plan to follow up. I also understand some of these tests may be difficult to write with the current architecture. We can also use this discussion to identify those gaps.
```diff
 let manager = NvmeManager::new(
     &driver_source,
     processor_topology.vp_count(),
-    vfio_dma_buffer(&shared_vis_pages_pool),
+    vfio_dma_buffer(&shared_vis_pages_pool, fixed_mem_pool.as_ref()),
```
@chris-oo This is my implementation of it
This is a preview of the full NVMe keepalive feature implementation. The goal is to keep attached NVMe devices intact during OpenVMM servicing (e.g. reloading a new OpenVMM binary from the host OS).
The memory allocated for NVMe queues and buffers must be preserved during servicing, including the queue head/tail pointers. This memory region must be marked as reserved so the Linux kernel will not try to use it; a new device tree property is passed from the host to indicate the preserved memory size (the memory is carved out of the available VTL2 memory; it is not additional memory).
The NVMe device does not go through any resets, so from the device's point of view nothing changed.
The new OpenHCL instance gets the saved state from the host and continues from there.