-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add with_skip_validation
flag to IPC StreamReader
, FileReader
and FileDecoder
#7120
base: main
Are you sure you want to change the base?
Conversation
d4102d6
to
59b3033
Compare
7b85e5e
to
77d3de5
Compare
@@ -1781,33 +1780,61 @@ impl PartialEq for ArrayData { | |||
} | |||
} | |||
|
|||
mod private { | |||
/// A boolean flag that cannot be mutated outside of unsafe code. | |||
/// A boolean flag that cannot be mutated outside of unsafe code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to make this UnsafeFlag
public (and added examples and more docs) so I could use it across the two crates. However, I can also make a private copy of it in arrow-ipc if reviewers feel it would be better to avoid a new API
writer.write(&batch).unwrap(); | ||
} | ||
writer.finish().unwrap(); | ||
let buffer = ipc_stream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added new versions of each benchmark that work with disabled validation
StructArray::try_new(struct_fields.clone(), struct_arrays, None)? | ||
}; | ||
Ok(Arc::new(struct_array)) | ||
self.create_struct_array(struct_node, null_buffer, struct_fields, struct_arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this code into its own function so it was eaiser to call StructArray::new_unchecked
when validation was disabled
/// | ||
/// Relies on the caller only passing a flag with `true` value if they are | ||
/// certain that the data is valid | ||
pub fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that RecordBatchDecoder
is not a public API:
https://docs.rs/arrow-ipc/54.1.0/arrow_ipc/reader/struct.StreamReader.html?search=RecordBatchDecoder
@@ -809,6 +858,21 @@ impl FileDecoder { | |||
self | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a new public API and follows the same pattern as ArrayData::skip_validation
@@ -1177,6 +1243,16 @@ impl<R: Read + Seek> FileReader<R> { | |||
pub fn get_mut(&mut self) -> &mut R { | |||
&mut self.reader | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
@@ -1462,6 +1546,16 @@ impl<R: Read> StreamReader<R> { | |||
pub fn get_mut(&mut self) -> &mut R { | |||
&mut self.reader | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
@@ -2456,6 +2584,57 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_invalid_nested_array_ipc_read_errors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some additional coverage to make sure the flag got passed when decoding multiple nesting levels
@@ -2592,6 +2771,32 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_validation_of_invalid_union_array() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out UnionArray had its own path / handling so new coverage added
@@ -2602,18 +2807,18 @@ mod tests { | |||
|
|||
// IPC Stream format | |||
let buf = write_stream(&rb); // write is ok | |||
read_stream_skip_validation(&buf).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all the validation tests also verify they can read the batch back without error if validation is disabled
Draft
Which issue does this PR close?
with_skip_validation
flag to IPC readers/writers #7093Rationale for this change
Forcing
Array
validation while reading arrow IPC trusted data is inefficient. Users should be able to avoid doing so if they wantWhat changes are included in this PR?
This PR builds on this PR from @totoroyyb
Are there any user-facing changes?
with_disable_validation
APIs onStreamReader
,FileReader
andFileDecoder
Benchmark results
with_skip_validation
StreamReader
StreamReader(zstd)
FileReader
FileDecoder
+mmap
Details for Mac M3
Benchmarks: GCP
On a GCP
c2-standard-16 (16 vCPUs, 64 GB Memory)
with_skip_validation
StreamReader
StreamReader(zstd)
FileReader
FileDecoder
+mmap
Notes: compressed ipc slows down compression by a factor of XXX
mmap is XX faster than non mmap
Raw Results (gpc)