Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder #7120

Draft
wants to merge 2 commits into
base: main

alamb
Contributor

@alamb alamb commented Feb 11, 2025


Which issue does this PR close?

Rationale for this change

Forcing array validation when reading trusted Arrow IPC data is inefficient. Users should be able to opt out of validation when they know the data is valid.
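As a rough analogy for why validation costs time, the sketch below uses only the Rust standard library (not the arrow crates). The checked conversion has to scan every byte, while the unchecked one trusts the caller. The helper names `read_checked` and `read_unchecked` are made up for this illustration.

```rust
/// Checked path: scans every byte to confirm the input is valid UTF-8.
fn read_checked(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Unchecked path: performs no scan and trusts the caller.
///
/// # Safety
/// `bytes` must be valid UTF-8.
unsafe fn read_unchecked(bytes: &[u8]) -> &str {
    // SAFETY: validity is guaranteed by the caller of this function.
    unsafe { std::str::from_utf8_unchecked(bytes) }
}

fn main() {
    let trusted = "arrow ipc".as_bytes();
    assert_eq!(read_checked(trusted), Ok("arrow ipc"));
    // SAFETY: `trusted` came from a `&str`, so it is valid UTF-8.
    assert_eq!(unsafe { read_unchecked(trusted) }, "arrow ipc");
    // Invalid input is rejected by the checked path only.
    assert!(read_checked(&[0xff, 0xfe]).is_err());
}
```

The same trade-off applies here: skipping the check is only sound when the producer of the data is trusted.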

What changes are included in this PR?

This PR builds on an earlier PR from @totoroyyb:

  1. Add API in IPC options to skip validation
  2. Pass this flag through StreamReader/FileReader/FileDecoder
  3. Add a few more tests
  4. Add benchmarks

Are there any user-facing changes?

  1. New with_skip_validation APIs on StreamReader, FileReader and FileDecoder
  2. Improved read performance when validation is skipped

Benchmark results

| Benchmark | Default (validation) | with_skip_validation | Difference |
|---|---|---|---|
| StreamReader | 255.59 µs | 78.884 µs | 3.23x |
| StreamReader(zstd) | 4.8023 ms | 4.4646 ms | 1.07x |
| FileReader | 251.63 µs | 79.072 µs | 3.18x |
| FileDecoder + mmap | 241.10 µs | 26.900 µs | 9.2x |

Details for Mac M3

   Finished `bench` profile [optimized] target(s) in 0.27s
     Running benches/ipc_reader.rs (target/release/deps/ipc_reader-a8f04a4266073085)
Gnuplot not found, using plotters backend
arrow_ipc_reader/StreamReader/read_10
                        time:   [254.38 µs 255.59 µs 256.92 µs]
                        change: [-4.5688% -3.2519% -1.9891%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_ipc_reader/StreamReader/no_validation/read_10
                        time:   [78.548 µs 78.884 µs 79.252 µs]
                        change: [-1.5389% -0.7882% -0.0139%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_ipc_reader/StreamReader/read_10/zstd
                        time:   [4.6568 ms 4.8023 ms 4.9459 ms]
                        change: [-13.864% -11.080% -7.8912%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_ipc_reader/StreamReader/no_validation/read_10/zstd
                        time:   [4.3195 ms 4.4646 ms 4.6121 ms]
                        change: [-17.118% -14.219% -11.026%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_ipc_reader/FileReader/read_10
                        time:   [251.14 µs 251.63 µs 252.16 µs]
                        change: [-3.3146% -2.7427% -2.2098%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
arrow_ipc_reader/FileReader/no_validation/read_10
                        time:   [78.319 µs 79.072 µs 79.939 µs]
                        change: [-14.299% -13.005% -11.710%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
arrow_ipc_reader/FileReader/read_10/mmap
                        time:   [240.51 µs 241.10 µs 241.78 µs]
                        change: [-1.0442% -0.6140% -0.2258%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_ipc_reader/FileReader/no_validation/read_10/mmap
                        time:   [26.816 µs 26.900 µs 26.994 µs]
                        change: [-1.3681% -0.7452% -0.1508%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
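Each Criterion `time:` line above reports three values: the lower bound, the point estimate, and the upper bound of the confidence interval. The summary table uses the middle value. A minimal sketch of pulling those numbers out of a line follows. The helpers `to_micros` and `parse_time_line` are hypothetical and not part of Criterion or this PR.

```rust
/// Converts a value with a Criterion unit suffix into microseconds.
fn to_micros(value: f64, unit: &str) -> Option<f64> {
    match unit {
        "ns" => Some(value / 1_000.0),
        "µs" | "us" => Some(value),
        "ms" => Some(value * 1_000.0),
        "s" => Some(value * 1_000_000.0),
        _ => None,
    }
}

/// Parses a `time: [low estimate high]` line into microseconds.
/// Returns `None` for any other kind of line.
fn parse_time_line(line: &str) -> Option<[f64; 3]> {
    let rest = line.trim().strip_prefix("time:")?.trim();
    let inner = rest.strip_prefix('[')?.strip_suffix(']')?;
    // Expect three (value, unit) pairs.
    let tokens: Vec<&str> = inner.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let mut out = [0.0; 3];
    for (i, pair) in tokens.chunks(2).enumerate() {
        out[i] = to_micros(pair[0].parse::<f64>().ok()?, pair[1])?;
    }
    Some(out)
}

fn main() {
    let line = "                        time:   [254.38 µs 255.59 µs 256.92 µs]";
    if let Some([low, estimate, high]) = parse_time_line(line) {
        println!("estimate = {estimate} µs (interval {low} to {high} µs)");
    }
}
```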

Benchmarks: GCP

On a GCP c2-standard-16 (16 vCPUs, 64 GB Memory)

| Benchmark | Default (validation) | with_skip_validation | Difference |
|---|---|---|---|
| StreamReader | 665.40 µs | 419.04 µs | Cell |
| StreamReader(zstd) | 3.4684 ms | 3.2056 ms | Cell |
| FileReader | 664.86 µs | 406.10 µs | Cell |
| FileDecoder + mmap | 530.24 µs | 77.129 µs | Cell |
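The Difference column for the GCP run is not filled in. Assuming it is meant to be the same ratio as in the Mac table (default time divided by skip-validation time), it can be derived from the reported timings with a small sketch like this one. The `speedup` helper is illustrative only.

```rust
/// Speedup of the skip-validation run relative to the default run.
fn speedup(default_us: f64, skip_us: f64) -> f64 {
    default_us / skip_us
}

fn main() {
    // (benchmark, default time, skip-validation time), all in microseconds,
    // copied from the GCP timings reported in this PR.
    let rows = [
        ("StreamReader", 665.40, 419.04),
        ("StreamReader(zstd)", 3468.4, 3205.6),
        ("FileReader", 664.86, 406.10),
        ("FileDecoder + mmap", 530.24, 77.129),
    ];
    for (name, default_us, skip_us) in rows {
        println!("{name}: {:.2}x", speedup(default_us, skip_us));
    }
}
```

Under that assumption the ratios come out to roughly 1.59x, 1.08x, 1.64x and 6.87x.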

Notes: compressed ipc slows down compression by a factor of XXX
mmap is XX faster than non mmap

Raw Results (GCP)

arrow_ipc_reader/StreamReader/read_10
                        time:   [664.17 µs 665.40 µs 666.63 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
arrow_ipc_reader/StreamReader/no_validation/read_10
                        time:   [417.36 µs 419.04 µs 420.93 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
arrow_ipc_reader/StreamReader/read_10/zstd
                        time:   [3.4657 ms 3.4684 ms 3.4711 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_ipc_reader/StreamReader/no_validation/read_10/zstd
                        time:   [3.2021 ms 3.2056 ms 3.2095 ms]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
arrow_ipc_reader/FileReader/read_10
                        time:   [663.60 µs 664.86 µs 666.13 µs]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe
arrow_ipc_reader/FileReader/no_validation/read_10
                        time:   [400.17 µs 406.10 µs 411.81 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
arrow_ipc_reader/FileReader/read_10/mmap
                        time:   [529.08 µs 530.24 µs 531.62 µs]
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe
arrow_ipc_reader/FileReader/no_validation/read_10/mmap
                        time:   [77.066 µs 77.129 µs 77.203 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

github-actions bot added the arrow (Changes to the arrow crate) label Feb 11, 2025
alamb force-pushed the alamb/ipc_disable_validation branch from d4102d6 to 59b3033 February 11, 2025 22:12
alamb force-pushed the alamb/ipc_disable_validation branch from 7b85e5e to 77d3de5 February 12, 2025 11:45
@@ -1781,33 +1780,61 @@ impl PartialEq for ArrayData {
}
}

mod private {
/// A boolean flag that cannot be mutated outside of unsafe code.
/// A boolean flag that cannot be mutated outside of unsafe code.
Contributor Author

I propose making UnsafeFlag public (with added examples and more docs) so that I can use it in both crates. Alternatively, I can make a private copy of it in arrow-ipc if reviewers would rather avoid a new public API.
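For readers unfamiliar with the pattern, here is a minimal sketch of "a boolean flag that cannot be mutated outside of unsafe code". It is modeled on the doc comment above and is not copied from the arrow-data source, so the exact method names should be treated as assumptions.

```rust
mod flag {
    /// A boolean that safe code can read but only `unsafe` code can change.
    /// The field is private, so code outside this module must go through
    /// the `unsafe fn set` method to modify it.
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
    pub struct UnsafeFlag(bool);

    impl UnsafeFlag {
        /// Creates a flag set to `false`.
        pub const fn new() -> Self {
            Self(false)
        }

        /// Sets the flag.
        ///
        /// # Safety
        /// The caller takes responsibility for whatever invariant the flag
        /// guards, for example that data read without validation is valid.
        pub unsafe fn set(&mut self, val: bool) {
            self.0 = val;
        }

        /// Returns the current value.
        pub fn get(&self) -> bool {
            self.0
        }
    }
}

use flag::UnsafeFlag;

fn main() {
    let mut skip = UnsafeFlag::new();
    assert!(!skip.get());
    // `skip.0 = true;` would not compile here because the field is private.
    // SAFETY: this toy example reads no data, so no invariant is at risk.
    unsafe { skip.set(true) };
    assert!(skip.get());
}
```

The design choice is that the `unsafe` keyword appears at the point where the caller opts out of checking, while every API that merely carries the flag can stay safe.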

writer.write(&batch).unwrap();
}
writer.finish().unwrap();
let buffer = ipc_stream();
Contributor Author

I added new versions of each benchmark that work with disabled validation

StructArray::try_new(struct_fields.clone(), struct_arrays, None)?
};
Ok(Arc::new(struct_array))
self.create_struct_array(struct_node, null_buffer, struct_fields, struct_arrays)
Contributor Author

I refactored this code into its own function so it was easier to call StructArray::new_unchecked when validation is disabled
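The shape of that refactor can be sketched without the arrow types. `ToyStruct` and `create_struct` below are hypothetical stand-ins for StructArray and the new helper. Only the idea of choosing between a checked and an unchecked constructor in one place is taken from the PR.

```rust
/// Hypothetical stand-in for a struct array: a set of equal-length columns.
#[derive(Debug, PartialEq)]
struct ToyStruct {
    columns: Vec<Vec<i32>>,
}

impl ToyStruct {
    /// Checked constructor: every column must have the same length.
    fn try_new(columns: Vec<Vec<i32>>) -> Result<Self, String> {
        let expected = columns.first().map_or(0, |c| c.len());
        if let Some(bad) = columns.iter().position(|c| c.len() != expected) {
            return Err(format!(
                "column {bad} has length {}, expected {expected}",
                columns[bad].len()
            ));
        }
        Ok(Self { columns })
    }

    /// Unchecked constructor.
    ///
    /// # Safety
    /// The caller must ensure all columns have the same length. Nothing
    /// breaks in this toy, but real array code may rely on the invariant.
    unsafe fn new_unchecked(columns: Vec<Vec<i32>>) -> Self {
        Self { columns }
    }

    fn num_columns(&self) -> usize {
        self.columns.len()
    }
}

/// Single place that decides which constructor to call.
fn create_struct(columns: Vec<Vec<i32>>, skip_validation: bool) -> Result<ToyStruct, String> {
    if skip_validation {
        // SAFETY: the caller asserted the data is valid by skipping validation.
        Ok(unsafe { ToyStruct::new_unchecked(columns) })
    } else {
        ToyStruct::try_new(columns)
    }
}

fn main() {
    let ragged = vec![vec![1, 2], vec![3]];
    assert!(create_struct(ragged.clone(), false).is_err());
    assert_eq!(create_struct(ragged, true).unwrap().num_columns(), 2);
}
```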

///
/// Relies on the caller only passing a flag with `true` value if they are
/// certain that the data is valid
pub fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self {

@@ -809,6 +858,21 @@ impl FileDecoder {
self
}

/// Specifies if validation should be skipped when reading data (defaults to `false`)
Contributor Author

This is a new public API and follows the same pattern as ArrayData::skip_validation
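A sketch of how a builder-style `with_skip_validation` can be forwarded from a reader to the decoder it wraps is shown below. `ToyReader`, `ToyDecoder` and the offsets check are invented for illustration and use only the standard library. Only the method name and the `UnsafeFlag` parameter mirror the signature shown in this diff.

```rust
mod flag {
    /// Minimal flag that can only be set from `unsafe` code (sketch).
    #[derive(Debug, Clone, Copy, Default)]
    pub struct UnsafeFlag(bool);

    impl UnsafeFlag {
        /// # Safety
        /// The caller vouches for the data read while the flag is `true`.
        pub unsafe fn set(&mut self, val: bool) {
            self.0 = val;
        }

        pub fn get(&self) -> bool {
            self.0
        }
    }
}
use flag::UnsafeFlag;

/// Hypothetical decoder: validates offsets unless told to skip.
#[derive(Debug, Default)]
struct ToyDecoder {
    skip_validation: UnsafeFlag,
}

impl ToyDecoder {
    fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self {
        self.skip_validation = skip_validation;
        self
    }

    /// Returns the number of values described by `offsets`.
    fn decode(&self, offsets: &[i32]) -> Result<usize, String> {
        if !self.skip_validation.get() && offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err("offsets are not monotonically increasing".to_string());
        }
        Ok(offsets.len().saturating_sub(1))
    }
}

/// Hypothetical reader that owns a decoder and forwards the flag to it.
#[derive(Debug, Default)]
struct ToyReader {
    decoder: ToyDecoder,
}

impl ToyReader {
    fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self {
        self.decoder = self.decoder.with_skip_validation(skip_validation);
        self
    }

    fn read(&self, offsets: &[i32]) -> Result<usize, String> {
        self.decoder.decode(offsets)
    }
}

fn main() {
    let bad_offsets = [0, 5, 3];
    // Default: validation is on, so the bad offsets are rejected.
    assert!(ToyReader::default().read(&bad_offsets).is_err());

    let mut skip = UnsafeFlag::default();
    // SAFETY: this toy never dereferences the offsets.
    unsafe { skip.set(true) };
    let reader = ToyReader::default().with_skip_validation(skip);
    assert_eq!(reader.read(&bad_offsets), Ok(2));
}
```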

@@ -1177,6 +1243,16 @@ impl<R: Read + Seek> FileReader<R> {
pub fn get_mut(&mut self) -> &mut R {
&mut self.reader
}

/// Specifies if validation should be skipped when reading data (defaults to `false`)
Contributor Author

new API

@@ -1462,6 +1546,16 @@ impl<R: Read> StreamReader<R> {
pub fn get_mut(&mut self) -> &mut R {
&mut self.reader
}

/// Specifies if validation should be skipped when reading data (defaults to `false`)
Contributor Author

new API

@@ -2456,6 +2584,57 @@ mod tests {
);
}

#[test]
fn test_invalid_nested_array_ipc_read_errors() {
Contributor Author

I added some additional coverage to make sure the flag got passed when decoding multiple nesting levels
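The property this test guards can be illustrated with a toy recursive decoder: the flag has to be forwarded at every nesting level, otherwise an invalid child deep in the tree would still be validated (or silently skipped). The `Node` type and `decode` function are hypothetical and do not reflect the arrow-ipc reader internals.

```rust
/// Hypothetical nested layout: a leaf holding offsets, or a struct of children.
enum Node {
    Leaf(Vec<i32>),
    Struct(Vec<Node>),
}

/// Recursively decodes `node` and returns the number of leaves visited.
fn decode(node: &Node, skip_validation: bool) -> Result<usize, String> {
    match node {
        Node::Leaf(offsets) => {
            // Validation: offsets must never decrease.
            if !skip_validation && offsets.windows(2).any(|w| w[0] > w[1]) {
                return Err("invalid offsets in leaf".to_string());
            }
            Ok(1)
        }
        Node::Struct(children) => {
            let mut leaves = 0;
            for child in children {
                // The flag is passed down unchanged at every level.
                leaves += decode(child, skip_validation)?;
            }
            Ok(leaves)
        }
    }
}

fn main() {
    // The invalid leaf sits two levels below the root.
    let nested = Node::Struct(vec![
        Node::Leaf(vec![0, 1]),
        Node::Struct(vec![Node::Leaf(vec![0, 4, 2])]),
    ]);
    assert!(decode(&nested, false).is_err());
    assert_eq!(decode(&nested, true), Ok(2));
}
```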

@@ -2592,6 +2771,32 @@ mod tests {
);
}

#[test]
fn test_validation_of_invalid_union_array() {
Contributor Author

It turns out UnionArray has its own code path / handling, so I added new coverage for it

@@ -2602,18 +2807,18 @@ mod tests {

// IPC Stream format
let buf = write_stream(&rb); // write is ok
read_stream_skip_validation(&buf).unwrap();
Contributor Author

Now all the validation tests also verify they can read the batch back without error if validation is disabled

Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Arrow-IPC performance by avoiding Unsafe Unchecked IPC Read RecordBatch
1 participant