Support DDA on VA-backed OHCL VM: Configurable Bounce Buffer for DMA #275
base: main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
@@ -1246,12 +1246,89 @@ async fn new_underhill_vm(

let boot_info = runtime_params.parsed_openhcl_boot();

// Determine if x2apic is supported so that the topology matches
Are these changes indeed part of your commit? Or stale from a fork?
I had to move this code up so we can calculate device_dma using processor_topology.vp_count() below.
Thanks for the PR, Juan. Overall, this is not the approach we want to take for supporting this scenario. As your changes show us, having pinning be handled on a per-device basis adds a lot of complexity and scenario-specificity to the device driver, which is supposed to be a generic thing that we could use anywhere (even outside of OpenHCL). And you can imagine that in the future, we will want to have other actions that we take as part of preparing for an assigned device DMA, e.g., locking VTL protections for the pages, mapping the pages into an IOMMU, that kind of thing. Or we will have yet more scenarios where we need to double buffer, e.g., to support unenlightened guests running in a CVM.

I think the approach that makes sense is to centralize a set of DMA operations, like an OS kernel would. Devices can then prepare some memory for DMA by calling into the DMA API, issue the device transaction, and then release the memory for DMA after the transaction completes. In some scenarios, this will do nothing. In others, it might double buffer or pin or lock the memory. But the device driver doesn't need to know the details.

I don't have a full design of what this DMA API should look like yet, but I think I am probably the person best suited to designing it. So I would suggest that you let me define a DMA API scaffolding and update the existing device drivers to use it, and then you or Bhargav can provide an implementation that does this pinning/double buffering thing in a subsequent change.
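To make the shape of such a centralized DMA API concrete, here is a purely illustrative sketch; none of these trait or type names come from the PR or from the eventual design:

```rust
// Purely illustrative sketch of a centralized DMA API; trait and type names are
// hypothetical and not part of this PR or the eventual scaffolding.
use std::ops::Range;

/// Errors a DMA preparation step might report.
#[derive(Debug)]
pub enum DmaError {
    /// The backing implementation could not prepare the memory (e.g., pinning
    /// failed or the bounce buffer was exhausted).
    PrepareFailed,
}

/// A prepared DMA transaction: the device issues the transfer against
/// `gpa_ranges()`, then calls `complete()` when the hardware is done.
pub trait DmaTransaction {
    /// Guest-physical ranges the device should target. These may point at the
    /// original guest pages (pinned/locked) or at a bounce buffer.
    fn gpa_ranges(&self) -> &[Range<u64>];

    /// Release the memory: unpin, unlock, or copy back out of the bounce buffer.
    fn complete(self: Box<Self>) -> Result<(), DmaError>;
}

/// The single entry point a device driver uses; the implementation decides
/// whether to pin, lock VTL protections, map into an IOMMU, or double buffer.
pub trait DmaClient {
    fn prepare(
        &self,
        gpa_ranges: &[Range<u64>],
        write_to_guest: bool,
    ) -> Result<Box<dyn DmaTransaction>, DmaError>;
}
```

With a shape like this, the driver only sees prepare/issue/complete; whether the bytes were pinned, locked, or bounced stays an implementation detail of the DMA client.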
meta: avoid using the term UH moving forward; prefer OHCL. (re: the PR title)
Got it. Thanks!
Thank you, John, for your detailed feedback on the approach for supporting this scenario. I understand the need to centralize DMA operations to reduce complexity and ensure the device driver remains generic. Regarding the design of the DMA API, do you need a task to track this design work? Additionally, could you provide an estimated timeline for when you expect to complete the design? Looking forward to your response.
let x2apic = if isolation.is_hardware_isolated() {
    // For hardware CVMs, always enable x2apic support at boot.
    vm_topology::processor::x86::X2ApicState::Enabled
} else if safe_x86_intrinsics::cpuid(x86defs::cpuid::CpuidFunction::VersionAndFeatures.0, 0).ecx
fyi, this will conflict with Eric's recent change that refactors this crate.
// TODO: determine actual memory usage by NVME/MANA. hardcode as 10MB
let device_dma = 10 * 1024 * 1024
Can you give a rough estimate of how much memory this will use?
I suspect that John's comments obviate this discussion, but it does seem like this is inefficient: we're allocating (what I assume is) a lot of memory for devices that may not be using it.
let mut prp_result = None;
let mut is_pinned = false;
if let Some(io_threshold) = self.io_threshold {
    if is_va_backed && self.partition.is_some() && mem.len() as u32 > io_threshold {
I would rather keep is_va_backed out of this routine. The pinning code (whatever form that takes) should take care of this.
@@ -50,6 +51,8 @@ pub enum DiskError {
    ReservationConflict,
    #[error("unsupported eject")]
    UnsupportedEject,
    #[error("failed to pin/unpin guest memory {0}")]
    Hv(HvError),
Why not just "PinFailure" instead of Hv?
@@ -206,6 +214,8 @@ pub enum RequestError {
    Memory(#[source] GuestMemoryError),
    #[error("i/o too large for double buffering")]
    TooLarge,
    #[error("hv error")]
    Hv(#[source] HvError),
Ditto - why not "PinError"?
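For illustration, a hedged sketch of the suggested rename; the enum is trimmed to the variants under discussion, and the `HvError` below is only a local stand-in so the sketch compiles on its own:

```rust
// Hedged sketch of the suggested rename; not the full disk_backend enum.
use thiserror::Error;

// Stand-in for the real hypervisor error type, defined locally only so this
// snippet is self-contained.
#[derive(Debug, Error)]
#[error("hv status {0:#x}")]
pub struct HvError(pub u16);

#[derive(Debug, Error)]
pub enum DiskError {
    #[error("unsupported eject")]
    UnsupportedEject,
    // Named for what failed (pinning) rather than which component reported it (Hv).
    #[error("failed to pin/unpin guest memory")]
    PinFailure(#[source] HvError),
}
```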
let mut prp_result = None;
let mut is_pinned = false;
if let Some(io_threshold) = self.io_threshold {
    if is_va_backed && self.partition.is_some() && mem.len() as u32 > io_threshold {
In addition, let's factor out the check for io size > io_threshold as well. The routine should either pin or buffer. It can take in the threshold as one config. It should also figure out if the memory is VA backed.
Okay, I realize I'm probably still talking about what I think John will scaffold. But I want to give the feedback here as well: this code is challenging to follow, so I'm making point suggestions on how to factor it so that (IMO) it'll be easier to read and maintain in the future.
Yeah, I believe the logic can be moved from the nvme_driver to the common DMA APIs once John's design is done. We might not need the io_threshold and can always buffer first if there's enough space. However, considering performance, adding the io_threshold could benefit larger IOs and provide flexibility for performance tuning later.
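To make the factoring suggestion concrete, here's one possible shape; every name below is hypothetical, and the real routine would presumably live in whatever DMA scaffolding gets defined:

```rust
// Hypothetical sketch of the factored-out routine: it owns the VA-backed check and
// the threshold comparison, so callers just ask it to prepare the memory.
// None of these types exist in the PR; they are illustrative only.

pub struct DmaPrepConfig {
    /// I/O sizes at or below this (in bytes) are double buffered; larger ones are pinned.
    pub io_threshold_bytes: u32,
}

pub enum PreparedIo {
    /// The I/O is staged through a bounce buffer.
    Bounced { bounce_offset: u64 },
    /// The guest pages were pinned and the device targets them directly.
    Pinned,
    /// Nothing to do (e.g., the memory is not VA backed).
    Direct,
}

impl DmaPrepConfig {
    pub fn prepare(&self, gpa: u64, len: u32) -> PreparedIo {
        // The routine decides for itself whether the range is VA backed, rather
        // than having the caller pass `is_va_backed` in.
        if !is_va_backed(gpa, len) {
            return PreparedIo::Direct;
        }
        if len <= self.io_threshold_bytes {
            // Small I/O: stage through the per-queue bounce buffer.
            PreparedIo::Bounced { bounce_offset: 0 }
        } else {
            // Large I/O: pin the guest pages for the duration of the transfer.
            PreparedIo::Pinned
        }
    }
}

// Placeholder; in reality this would query the VA-backed memory helper trait.
fn is_va_backed(_gpa: u64, _len: u32) -> bool {
    true
}
```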
@yupavlen-ms, will anything here need to change after the nvme keepalive changes go in?
@@ -228,6 +228,7 @@ fn map_disk_error(err: disk_backend::DiskError) -> NvmeError {
        disk_backend::DiskError::MemoryAccess(err) => {
            NvmeError::new(spec::Status::DATA_TRANSFER_ERROR, err)
        }
        disk_backend::DiskError::Hv(_) => spec::Status::DATA_TRANSFER_ERROR.into(),
Forgive my lack of Rust-idiom knowledge, but shouldn't this be ...
disk_backend::DiskError::Hv(_) => {
    NvmeError::new(spec::Status::DATA_TRANSFER_ERROR, err)
}
There are two ways to convert spec::Status to NvmeError: 1) through NvmeError::new(), or 2) using the implicit .into(). For this specific case, NvmeError::new() requires err to satisfy an extra trait bound that HvError does not implement, so I just use .into().
We can implement the required traits or take another approach if there's a need to keep the source error.
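For readers less familiar with the Rust idiom being discussed, a small self-contained illustration of the two conversion paths; the types below are stand-ins, not the real spec::Status or NvmeError:

```rust
// Stand-in types that only mimic the shape of spec::Status / NvmeError.
use std::error::Error;

#[derive(Clone, Copy, Debug)]
struct Status(u16);

#[derive(Debug)]
struct NvmeError {
    status: Status,
    source: Option<Box<dyn Error + Send + Sync>>,
}

impl NvmeError {
    // Path 1: keeps the source error, but requires `err` to satisfy the trait bound.
    fn new(status: Status, err: impl Error + Send + Sync + 'static) -> Self {
        Self { status, source: Some(Box::new(err)) }
    }
}

// Path 2: the implicit `.into()` conversion, which drops any source error.
impl From<Status> for NvmeError {
    fn from(status: Status) -> Self {
        Self { status, source: None }
    }
}

fn example(status: Status) -> NvmeError {
    // When the inner error type doesn't implement the bound that `new` needs,
    // converting the status alone is the path of least resistance.
    status.into()
}
```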
@@ -947,6 +947,7 @@ impl SimpleScsiDisk {
            | ScsiError::UnsupportedVpdPageCode(_)
            | ScsiError::SrbError
            | ScsiError::Disk(DiskError::InvalidInput)
            | ScsiError::Disk(DiskError::Hv(_))
This error could resolve itself on retry, right? I don't think we want to return this as an INVALID REQUEST / INVALID CDB. We can use something like insufficient resources.
I think we want to return SCSI CHECK_CONDITION, Sense Key 0x2 NOT READY, Additional Sense Code 0x04 (LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE)
You mean the pin/unpin may succeed on retry? I just followed the same logic we use today when we fail to allocate the double buffer: we return InvalidInput (in nvme_driver the error is TooLarge).
Yeah, I think it would.
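For reference, a tiny sketch of the sense data suggested above, using the standard SCSI codes from the comment; the struct is a stand-in, not the repo's actual sense-data type:

```rust
// Stand-in sense data type; the real SCSI emulator has its own definitions.
// The numeric values are the standard SCSI codes mentioned above:
// CHECK CONDITION status, sense key 0x2 (NOT READY),
// ASC 0x04 / ASCQ 0x00 (LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE).
#[derive(Debug, Clone, Copy)]
struct SenseData {
    sense_key: u8,
    additional_sense_code: u8,
    additional_sense_code_qualifier: u8,
}

const CHECK_CONDITION: u8 = 0x02;

fn sense_for_transient_pin_failure() -> (u8, SenseData) {
    (
        CHECK_CONDITION,
        SenseData {
            sense_key: 0x02,                       // NOT READY
            additional_sense_code: 0x04,           // LOGICAL UNIT NOT READY
            additional_sense_code_qualifier: 0x00, // CAUSE NOT REPORTABLE
        },
    )
}
```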
@@ -1069,6 +1070,7 @@ impl SimpleScsiDisk {
            },
            DiskError::InvalidInput
            | DiskError::MemoryAccess(_)
            | DiskError::Hv(_)
I think this is handled above, right?
oh, I get it - all matches are unreachable. Then why have them here? (sorry for the n00b question)
@@ -27,6 +27,9 @@ message Vtl2SettingsFixed {
    optional uint32 io_ring_size = 2;
    // Specify the maximum number of bounce buffer pages allowed per cpu
    optional uint32 max_bounce_buffer_pages = 3;
    optional uint64 dma_bounce_buffer_pages_per_queue = 4;
    optional uint32 dma_bounce_buffer_pages_per_io_threshold = 5;
    optional uint32 max_nvme_drivers = 6;
Again a moot point, but I want to get in the habit of asking: we need a test that ensures that not passing in these values results in acceptable behavior. I see the code handling it, but since this is our API surface: we need to be certain we aren't regressing it. Both JSON (until we delete ...) and PB.
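One possible shape for such a test, as a hedged sketch that assumes the prost-generated type is named `Vtl2SettingsFixed`; the test name and module layout are made up:

```rust
// Hedged sketch: assumes a prost-generated `Vtl2SettingsFixed` with the optional
// fields shown in the diff above; names and module paths are illustrative.
#[cfg(test)]
mod tests {
    use prost::Message;

    #[test]
    fn omitted_dma_settings_decode_to_none() {
        // An empty protobuf payload leaves every optional field unset, which is
        // exactly what a client that never sends these fields looks like.
        let fixed = super::Vtl2SettingsFixed::decode(&b""[..])
            .expect("empty message should decode");
        assert_eq!(fixed.dma_bounce_buffer_pages_per_queue, None);
        assert_eq!(fixed.dma_bounce_buffer_pages_per_io_threshold, None);
        assert_eq!(fixed.max_nvme_drivers, None);
        // The code consuming these settings should then fall back to its defaults.
    }
}
```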
This pull request introduces several enhancements to support Direct Device Assignment (DDA) on a VA-backed OHCL VM. The key changes include:

- Make dma_bounce_buffer_pages_per_queue, dma_bounce_buffer_pages_per_io_threshold, and max_nvme_drivers configurable in VTL2 settings.
- Use the calculated device_dma to create the shared pool.
- Pass dma_bounce_buffer_pages_per_queue and dma_bounce_buffer_pages_per_io_threshold from VTL2 settings to the NvmeDriver to create bounce buffers per queue.
- Each transaction will use a common VA-backed memory helper trait (pending VID support) to determine if the GPA (Guest Physical Address) is VA-backed and consider using the bounce buffer if it is.
- If the bounce buffer is full or the I/O size exceeds dma_bounce_buffer_pages_per_io_threshold, use the common VA-backed memory helper trait to pin the GPA (see the illustrative example after this list).
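As a concrete but hypothetical illustration of how the two bounce-buffer settings interact (the values below are made up, not the PR's defaults):

```rust
// Hypothetical numbers only, to illustrate the two settings. Pages are 4 KiB.
const PAGE_SIZE: u64 = 4096;

fn main() {
    let dma_bounce_buffer_pages_per_queue: u64 = 128; // hypothetical value
    let dma_bounce_buffer_pages_per_io_threshold: u64 = 8; // hypothetical value

    // Each NVMe queue gets its own bounce buffer region of this many bytes.
    let bounce_bytes_per_queue = dma_bounce_buffer_pages_per_queue * PAGE_SIZE;
    assert_eq!(bounce_bytes_per_queue, 512 * 1024); // 128 pages = 512 KiB

    // An I/O is double buffered only if it fits under the per-I/O threshold
    // (and the bounce buffer has room); otherwise the guest pages are pinned.
    let io_pages = 16u64;
    let use_bounce_buffer = io_pages <= dma_bounce_buffer_pages_per_io_threshold;
    assert!(!use_bounce_buffer); // 16 pages > 8-page threshold, so this I/O is pinned
}
```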