NVMe keepalive full feature preview #210
Conversation
Thanks for submitting this, Yuri. I'm starting to take a look.
```rust
#[mesh(7)]
pub base_mem: u64,
#[mesh(8)]
pub pfns: Vec<u64>,
```
The queues are always linear in physical memory, so we should just need the base gpa here.
Yes, agree.
Updating my reply: when we call mmap() in LockedMemorySpawner, the VA region is contiguous but the underlying PFNs are not contiguous.
But since we're moving to the mshv driver in the fixed pool allocator, that might not be the case anymore and it may be safe to assume sequential PFNs. TODO: check.
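For illustration, a minimal sketch of that check; the helper, the `PAGE_SHIFT` value, and the inputs are hypothetical, not the PR's actual types:

```rust
/// Hypothetical helper: returns Some(base_gpa) when the saved PFNs are
/// sequential, i.e. the whole queue is describable by one base address.
fn base_gpa_if_contiguous(pfns: &[u64]) -> Option<u64> {
    const PAGE_SHIFT: u64 = 12; // assuming 4 KiB pages
    let first = *pfns.first()?;
    pfns.iter()
        .enumerate()
        .all(|(i, &pfn)| pfn == first + i as u64)
        .then(|| first << PAGE_SHIFT)
}

fn main() {
    // Sequential PFNs: saving base_mem alone would suffice.
    assert_eq!(base_gpa_if_contiguous(&[0x100, 0x101, 0x102]), Some(0x100 << 12));
    // Non-sequential (as with the mmap-backed LockedMemorySpawner): the
    // full pfns vector is still needed.
    assert_eq!(base_gpa_if_contiguous(&[0x100, 0x105]), None);
}
```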
```rust
#[mesh(5)]
pub cq_state: CompletionQueueSavedState,
#[mesh(6)]
pub sq_addr: u64,
```
These and some other fields seem redundant with stuff that's in `SubmissionQueueSavedState`/`CompletionQueueSavedState`.
Will review the redundancy and eliminate all saved VA addresses.
```rust
    pub cid: u16,
}

impl From<&[u8]> for PendingCommandSavedState {
```
Why are we doing this instead of using protobuf?
This is my implementation to re-create PendingCommand from a byte array. I did not find a better way; open to suggestions.
Why do we need to re-create it from a byte array?
`pub struct Command` defined in vm/devices/storage/nvme_spec doesn't implement protobuf. IIRC the attempt to add protobuf support was too intrusive. I found that the alternative Command -> byte array -> protobuf conversion required fewer changes.
Would you suggest adding protobuf support to struct Command instead?
Changed to protobuf as part of the CID-high-bits integration, so spec::Command now has a dependency on protobuf. If that's not OK, I'll revert.
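For context, a hedged sketch of the byte-array round trip discussed above, using the zerocopy traits this code already derives; `RawCommand` is a stand-in for `nvme_spec::Command` with an invented, abbreviated layout:

```rust
use zerocopy::{AsBytes, FromBytes, FromZeroes};

/// Stand-in for nvme_spec::Command; the real struct has more dwords.
#[repr(C)]
#[derive(AsBytes, FromBytes, FromZeroes, Debug, PartialEq)]
struct RawCommand {
    cdw0: u32,
    nsid: u32,
}

fn main() {
    let cmd = RawCommand { cdw0: 0x1234, nsid: 1 };
    // Save: command -> byte array, which mesh/protobuf can carry as bytes.
    let saved: Vec<u8> = cmd.as_bytes().to_vec();
    // Restore: byte array -> command, as the From<&[u8]> impl above does.
    let restored = RawCommand::read_from_prefix(&saved).expect("saved state too short");
    assert_eq!(restored, cmd);
}
```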
```diff
@@ -88,6 +88,8 @@ open_enum! {
     VTL0_MMIO = 5,
     /// This range is mmio for VTL2.
     VTL2_MMIO = 6,
+    /// Memory preserved during servicing.
+    VTL2_PRESERVED = 7,
```
How do these entries make it into the tables? Does this mean we tell the host about these ranges somehow?
Or does the host just know what the preserved range is? Is this some new IGVM concept?
The host knows what the preserved DMA size should be but does not calculate the ranges. The host provides the size through the device tree; the boot shim selects the top of the VTL2 memory range and marks it as reserved with the new type.
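A sketch of that carving step under stated assumptions: the range type and function are hypothetical, and `preserved_size` stands for the value read from the host's device tree property:

```rust
/// Hypothetical half-open physical address range.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MemRange {
    start: u64,
    end: u64,
}

/// Split `preserved_size` bytes off the top of the VTL2 range, mirroring
/// what the boot shim does before tagging the upper piece VTL2_PRESERVED.
fn carve_preserved(vtl2: MemRange, preserved_size: u64) -> Option<(MemRange, MemRange)> {
    let len = vtl2.end.checked_sub(vtl2.start)?;
    if preserved_size == 0 || preserved_size > len {
        return None;
    }
    let split = vtl2.end - preserved_size;
    Some((
        MemRange { start: vtl2.start, end: split }, // remains normal VTL2 RAM
        MemRange { start: split, end: vtl2.end },   // survives servicing
    ))
}

fn main() {
    // The size comes from the host via the new device tree property.
    let vtl2 = MemRange { start: 0x1_0000_0000, end: 0x1_4000_0000 };
    let (ram, preserved) = carve_preserved(vtl2, 0x40_0000).unwrap();
    assert_eq!(ram.end, preserved.start);
    assert_eq!(preserved.end, vtl2.end);
}
```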
```rust
) -> Result<Self, NamespaceError> {
    let identify = nvm::IdentifyNamespace::read_from_prefix(identify_ns)
        .unwrap_or(nvm::IdentifyNamespace::new_zeroed());
    // Restore provides Identify Namespace result to new() so there is no wait.
```
We should add an explicit `restore()` that is a non-async function, rather than have this `block_on` here.
Thanks. I was thinking of adding an explicit Namespace::restore but wanted to avoid duplicating code. If this is the preferred way, I will redesign Namespace::new to share common logic with restore.
One remaining question for me is the creation of the async rescan task when restoring a namespace. I may need help with that.
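To make the suggested split concrete, a shape-only sketch with illustrative names: `new()` stays async and queries the device, while a non-async `restore()` consumes the Identify data already held in saved state:

```rust
// Shape only; the real Namespace carries much more state.
struct IdentifyNamespace([u8; 4096]);

struct Namespace {
    nsid: u32,
    identify: IdentifyNamespace,
}

impl Namespace {
    /// First boot: issue Identify Namespace to the device and await it.
    async fn new(nsid: u32) -> Namespace {
        let identify = issue_identify(nsid).await;
        Namespace { nsid, identify }
    }

    /// Servicing restore: the Identify result is already in the saved
    /// state, so no device round trip (and no block_on) is needed.
    fn restore(nsid: u32, saved_identify: IdentifyNamespace) -> Namespace {
        Namespace { nsid, identify: saved_identify }
    }
}

/// Placeholder for the async admin command.
async fn issue_identify(_nsid: u32) -> IdentifyNamespace {
    IdentifyNamespace([0; 4096])
}

fn main() {
    let ns = Namespace::restore(1, IdentifyNamespace([0; 4096]));
    assert_eq!(ns.nsid, 1);
}
```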
```rust
            command: FromBytes::read_from_prefix(cmd.command.as_bytes()).unwrap(),
            respond: send,
        };
        pending.push((cmd.cid as usize, pending_command));
```
It seems like pending commands need to save the fact that they allocated memory, somehow.
Specifically, any PRP list allocation needs to be saved/restored. Is this done somewhere?
We also need to re-lock any guest memory/re-reserve any double buffer memory that was in use.
And of course, all this needs to be released after the pending command completes.
If a PRP entry points to the bounce buffer (the VTL2 buffer), its GPN should not change; it still points to the same physical page, which is preserved during servicing.
The same should be true if a PRP entry points to a region in VTL0 memory, since we don't touch VTL0 during servicing.
However, maybe I missed where we allocate the PRP list itself when there are multiple entries. If it is not part of the preserved memory, that change needs to be added.
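To make the PRP-list concern concrete: commands spanning more than two pages need a separate list page, and under keepalive that page must itself live in preserved memory. A hedged sketch with a hypothetical allocator trait (chaining of lists longer than one page is elided):

```rust
/// Hypothetical allocator handing out pages from the preserved region.
trait PreservedAlloc {
    /// One preserved page as (VTL2 mapping of 512 u64 entries, its GPA).
    fn alloc_page(&mut self) -> anyhow::Result<(&mut [u64; 512], u64)>;
}

/// PRP1 covers the first data page and PRP2 can name a second page
/// directly; three or more pages force PRP2 to point at a PRP list page.
/// For keepalive, that list page must itself survive servicing.
fn prp2_for(data_pages: &[u64], pool: &mut impl PreservedAlloc) -> anyhow::Result<u64> {
    match data_pages {
        [] | [_] => Ok(0), // PRP2 unused
        [_, second] => Ok(*second),
        [_, rest @ ..] => {
            let (list, list_gpa) = pool.alloc_page()?;
            list[..rest.len()].copy_from_slice(rest);
            Ok(list_gpa)
        }
    }
}

/// Trivial single-page pool for the demo below.
struct OnePage {
    page: [u64; 512],
    gpa: u64,
}

impl PreservedAlloc for OnePage {
    fn alloc_page(&mut self) -> anyhow::Result<(&mut [u64; 512], u64)> {
        Ok((&mut self.page, self.gpa))
    }
}

fn main() -> anyhow::Result<()> {
    let mut pool = OnePage { page: [0; 512], gpa: 0x9000 };
    // Three data pages: PRP2 must point at the preserved list page.
    let prp2 = prp2_for(&[0x1000, 0x2000, 0x3000], &mut pool)?;
    assert_eq!(prp2, 0x9000);
    assert_eq!(&pool.page[..2], &[0x2000, 0x3000][..]);
    Ok(())
}
```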
```rust
    }

    /// Restore queue data after servicing.
    pub fn restore(&mut self, saved_state: &QueuePairSavedState) -> anyhow::Result<()> {
```
A key thing to reason about is what happens to requests that were in flight at save. It seems that the design is that we will keep those requests in flight (i.e., we won't wait for them to complete before saving/after restoring), right? And so storvsp's existing restore path will reissue any IOs that were in flight, but they won't wait for old IOs.
I think this is a problem, because new IOs can race with the original IO and cause data loss. Here's an example of what can go wrong:
- Before servicing, the guest writes "A" to block 0x1000 with CID 1.
- Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state.
- At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.
- Due to reordering in the backend, CID 2 completes before CID 1.
- Storvsp completes the original guest IO.
- The guest issues another IO, this time writing "B" to the same block 0x1000. Our driver issues this with CID 3.
- CID 3 completes before CID 1.
- Finally, some backend code gets around to processing CID 1. It writes "A" to block 0x1000.
So the guest thought it wrote "A" and then "B", but actually what happens is "A" then "B" then "A". Corruption.
There are variants on this, where the guest changes the memory that CID 1 was using and the device DMAs it late, after the IO was completed to the guest. This could even affect reads by corrupting guest memory.
The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about. A more complicated solution would be to reason about what the in flight commands are doing and only block IOs that alias those pages. I don't think that's necessary for v1.
Did I miss somewhere where this is handled?
> The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about.

I think we cut off receiving mesh commands once the servicing save request is received; let me double-check that and at which point it is done.
Regarding draining the outstanding (in-flight) I/Os: eventually we want to bypass this, but for v1 it makes sense to drain them, for simplicity. Let me confirm the path. Draining does happen on the non-keepalive path.
I think two things are important in the conversation above:
- It's okay to drain IOs before issuing new IOs. But: unlike the non-keepalive path, the overall servicing operation should not block on outstanding IOs. This means that the replay will need to remain asynchronous.
- I agree with John's analysis. It is unsafe to complete any IOs back to the guest until all outstanding IOs (IOs in progress to the underlying NVMe device at save time) complete from the NVMe device to the HCL environment. I think it is fine to start replaying the saved IOs while the in-flight IOs are still in progress. This means you have two choices: (a) wait to begin replaying IOs until everything in flight completes, or (b) build some logic in the storvsp layer to not notify the guest until all the in-flight IOs complete. (a) seems simpler to me.
In future changes, I think we will need to consider some sort of filtering (e.g. hold new IOs from guest that overlap LBA ranges, as John suggested in an offline conversation).
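A minimal sketch of option (a), with invented names: the restored queue pair remembers which CIDs were in flight at save time and holds back all new submissions until that set drains:

```rust
use std::collections::HashSet;

/// Illustrative fragment; the real queue pair is async and event-driven.
struct RestoredQueue {
    /// CIDs that were outstanding when the saved state was taken.
    inflight_at_save: HashSet<u16>,
    /// Replayed/new submissions held back until the old IOs drain.
    deferred: Vec<u16>,
}

impl RestoredQueue {
    /// Option (a): nothing new reaches the SQ while old IOs remain.
    fn submit(&mut self, cid: u16) {
        if self.inflight_at_save.is_empty() {
            self.ring_doorbell(cid);
        } else {
            self.deferred.push(cid);
        }
    }

    /// Called for every completion read off the CQ.
    fn on_completion(&mut self, cid: u16) {
        self.inflight_at_save.remove(&cid);
        if self.inflight_at_save.is_empty() {
            // The old IOs have fully drained: release everything held back.
            for cid in std::mem::take(&mut self.deferred) {
                self.ring_doorbell(cid);
            }
        }
    }

    fn ring_doorbell(&mut self, _cid: u16) { /* write the SQ tail doorbell */ }
}

fn main() {
    let mut q = RestoredQueue {
        inflight_at_save: HashSet::from([1]),
        deferred: Vec::new(),
    };
    q.submit(2); // held back: CID 1 from before servicing is outstanding
    assert_eq!(q.deferred, vec![2]);
    q.on_completion(1); // old IO drains, CID 2 is released
    assert!(q.deferred.is_empty());
}
```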
```diff
-            addr, len,
-        )?))
+        // TODO: With the recent change it may not work. Review.
+        Ok(crate::memory::MemoryBlock::new(LockedMemory::new(len)?))
```
We need to just fail here, right? Because we can't restore the requested GPAs.
Yes. Is it OK to do it in two waves?
1. Fail here.
2. Return an error and then reinitialize everything as on first boot.
That is, will we prioritize VM health over error discovery? I mean to defer #2 as a future enhancement.
(Ignore the link; that is just a bullet point.)
If this fails, then presumably we no longer know to where a device is sending DMA. I think that means that said DMA will corrupt arbitrary memory. This is based on what I understand so far, so please feel free to correct me. But, if my understanding is correct, then I think you need to fail the restore. The VM will need to be power cycled.
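Following that reasoning, a hedged sketch of what the restore path in the diff above might do instead of allocating fresh memory; the signature and types here are illustrative:

```rust
/// Stand-in for crate::memory::MemoryBlock.
struct MemoryBlock;

/// Illustrative only: restoring at a caller-requested GPA that the
/// allocator cannot honor must fail, not silently hand back new memory,
/// because the device may still DMA into the original GPAs.
fn restore_dma_buffer(addr: u64, len: usize) -> anyhow::Result<MemoryBlock> {
    anyhow::bail!("cannot restore DMA buffer at gpa {addr:#x} (len {len:#x}); VM must be power cycled")
}

fn main() {
    assert!(restore_dma_buffer(0x1000, 0x2000).is_err());
}
```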
```diff
@@ -16,7 +16,7 @@ use zerocopy::LE;
 use zerocopy::U16;

 #[repr(C)]
-#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect)]
+#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect, Clone)]
```
Can you share what fails to compile (if anything) if you remove Clone?
It fails in namespace.rs line 85, where we restore the namespace object if the Identify structure was provided from saved state.
```rust
        len: usize,
        _pfns: &[u64],
    ) -> anyhow::Result<MemoryBlock> {
        self.create_dma_buffer(len)
```
Ditto here, you can't just ignore the caller's request.
```rust
        #[inspect(hex)]
        size_pages: u64,
    },
    Allocated {
```
I am a bit worried that we will fail to restore some allocations and will never notice until we hit memory corruption. I also don't see a mechanism where we prevent the allocator from handing out saved allocations, e.g. if one device is allocating before another device is done restoring.
I would suggest we improve the logic like this (a sketch follows the list):

- Add a parameter to `FixedPool::allocator` for whether this specific allocator is for a device that will be saved/restored.
- When allocating via a save/restore-capable allocator, remember that fact in `State`.
- As part of servicing state, save the list of ranges that were allocated as part of saved devices.
- When recreating the allocator on restore, create entries for each saved range.
- Make sure `restore` can only restore a range that was previously saved (at which point you convert it to an allocated range). Make sure `alloc` never allocates on top of a saved range.
- After all device initialization has completed, we should be able to assert that there are no remaining saved ranges. They should all be converted to allocated ranges by this point. We can fail VTL0 start if there are still ranges (or we can leak the memory forever, policy decision).
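A sketch of the proposed state machine, with illustrative names: ranges recreated from saved state start as `Saved`, `alloc` skips them, and `restore` converts exactly one matching `Saved` range to `Allocated`:

```rust
#[derive(Debug, Clone, Copy)]
enum State {
    Free { base_pfn: u64, size_pages: u64 },
    /// Recreated from servicing saved state; only restore() may claim it.
    Saved { base_pfn: u64, size_pages: u64 },
    Allocated { base_pfn: u64, size_pages: u64 },
}

struct FixedPool {
    slots: Vec<State>,
}

impl FixedPool {
    /// alloc never hands out a range that a restoring device still owns.
    fn alloc(&mut self, size_pages: u64) -> Option<u64> {
        for slot in &mut self.slots {
            if let State::Free { base_pfn, size_pages: n } = *slot {
                if n == size_pages {
                    *slot = State::Allocated { base_pfn, size_pages };
                    return Some(base_pfn);
                }
            }
        }
        None
    }

    /// restore succeeds only for a range that was previously saved.
    fn restore(&mut self, base_pfn: u64, size_pages: u64) -> anyhow::Result<u64> {
        for slot in &mut self.slots {
            if matches!(*slot, State::Saved { base_pfn: b, size_pages: n }
                if b == base_pfn && n == size_pages)
            {
                *slot = State::Allocated { base_pfn, size_pages };
                return Ok(base_pfn);
            }
        }
        anyhow::bail!("range {base_pfn:#x}+{size_pages:#x} pages was not in the saved state")
    }

    /// After all devices have restored, no Saved ranges may remain.
    fn assert_fully_restored(&self) -> anyhow::Result<()> {
        if self.slots.iter().any(|s| matches!(*s, State::Saved { .. })) {
            anyhow::bail!("saved ranges were never restored; failing VTL0 start");
        }
        Ok(())
    }
}

fn main() -> anyhow::Result<()> {
    let mut pool = FixedPool {
        slots: vec![
            State::Saved { base_pfn: 0x100, size_pages: 16 },
            State::Free { base_pfn: 0x200, size_pages: 16 },
        ],
    };
    assert_eq!(pool.alloc(16), Some(0x200)); // alloc skips the saved range
    pool.restore(0x100, 16)?;                // the restoring device reclaims it
    pool.assert_fully_restored()
}
```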
What I am getting from this: you propose to move the save/restore memory logic from the NVMe queues into the allocator itself, and to restore queues either with an already-present MemBlock or by validating that the queues restored to the same range as the allocator?
```rust
        .await
    pub async fn namespace(&mut self, nsid: u32) -> Result<Arc<Namespace>, NamespaceError> {
        // Check if namespace was already added after restore.
        if let Some(ns) = self.namespace.iter().find(|n| n.nsid() == nsid) {
```
This will break the case where the namespace is hot removed and hot added back with a different identity.
Not sure I follow. In VTL2 settings, we identify a controller's namespace by PCI ID + namespace ID.
Do you expect to have dynamic VTL2 settings in the future?
We discussed this offline, so sharing here. The existence of namespaces is controlled by hardware, and namespaces can come and go at runtime. It's possible that a namespace disappeared during the OpenHCL servicing operation. (Or, worse: disappeared and a new one came back!)
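A hedged sketch of one way to guard against that hazard: compare identity-pinning fields of the saved Identify Namespace data against a fresh Identify at restore time; the field subset here is illustrative:

```rust
/// Illustrative subset of Identify Namespace fields that pin identity.
#[derive(PartialEq)]
struct NamespaceIdentity {
    nguid: [u8; 16], // namespace globally unique identifier
    nsze: u64,       // namespace size in blocks
}

/// Reuse the cached namespace only if the device still reports the same
/// identity; otherwise treat it as a different (hot-swapped) namespace.
fn can_reuse_cached(saved: &NamespaceIdentity, fresh: &NamespaceIdentity) -> bool {
    saved == fresh
}

fn main() {
    let saved = NamespaceIdentity { nguid: [1; 16], nsze: 1 << 20 };
    let fresh = NamespaceIdentity { nguid: [1; 16], nsze: 1 << 20 };
    assert!(can_reuse_cached(&saved, &fresh));
}
```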
```diff
@@ -126,3 +134,70 @@ async fn test_nvme_driver(driver: DefaultDriver) {

     driver.shutdown().await;
 }
+
+#[async_test]
+async fn test_nvme_save_restore(driver: DefaultDriver) {
```
Thanks for adding tests here, Yuri. Would you also add some test cases for:
- Multiple namespaces per controller
- Failure to restore (at various stages)
- A save/restore with pending IO
I don't think we should block this PR on these changes, as long as there's a plan to follow up. I also understand some of these tests may be difficult to write with the current architecture. We can also use this discussion to identify those gaps.
```diff
 let manager = NvmeManager::new(
     &driver_source,
     processor_topology.vp_count(),
-    vfio_dma_buffer(&shared_vis_pages_pool),
+    vfio_dma_buffer(&shared_vis_pages_pool, fixed_mem_pool.as_ref()),
```
@chris-oo This is my implementation of it
This is a preview of the full NVMe keepalive feature implementation. The goal is to keep attached NVMe devices intact during OpenVMM servicing (e.g. reloading a new OpenVMM binary from the host OS).
The memory allocated for NVMe queues and buffers must be preserved during servicing, including the queue head/tail pointers. This memory region must be marked as reserved so the Linux kernel will not try to use it; a new device tree property is passed from the host to indicate the preserved memory size (the memory is carved out of the available VTL2 memory; it is not additional memory).
The NVMe device does not go through any resets, so from the device's point of view nothing changed.
The new OpenHCL instance gets the saved state from the host and continues from there.