Expose record boundary information in JSON decoder #7092
Conversation
A couple of notes for reviewers, in case it's helpful.
arrow-json/src/reader/mod.rs
```rust
    /// The number of unflushed records, including the partially decoded record (if any).
    pub fn len(&self) -> usize {
        self.tape_decoder.num_buffered_rows()
    }

    /// True if there are no records to flush, i.e. [`len`] is zero.
    pub fn is_empty(&self) -> bool {
```
Clippy prefers `is_empty()` over `len() == 0`, so I added both methods. I debated just exposing `num_buffered_records` instead, but `len` + `is_empty` seemed more rustic. Happy to adjust the approach based on feedback.
```diff
@@ -545,6 +545,16 @@ impl TapeDecoder {
         Ok(())
     }

+    /// The number of buffered rows, including the partially decoded row (if any).
+    pub fn num_buffered_rows(&self) -> usize {
```
For whatever reason, `TapeDecoder` seems to call them "rows" while `Decoder` calls them "records" -- so I named the new methods to match.
This makes sense to me; my understanding is that this allows deserializing a `StringArray` one value at a time, ensuring records are not split across value boundaries.
Whilst this probably has some additional overhead, I'd be curious to see it quantified, e.g. compared to the approach of not checking. I suspect it is low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.
That's a good description of what I hoped to achieve, yes.
In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always. In any case, though -- this enables the user of a …
Actually... sending in a bunch of small (not I/O optimal) strings one at a time will probably be the biggest overhead. If we didn't need boundary validation, we could probably just pass the entire underlying byte array from the …
Yes, this was the approach I was comparing to, which of course would not work for strings with non-empty nulls, or malicious input, and so is probably not a good idea -- I'm just curious 😅 (If there is a big difference, which I doubt, it might justify a more intrusive integration.)
Which issue does this PR close?
Closes #6522
Rationale for this change
The current JSON decoder gives no way to reliably identify record boundaries in the input (as distinct from mere whitespace or buffer boundaries), which makes it very difficult to correctly and efficiently parse a series of unrelated JSON values (such as from a `StringArray` column).

Examples of adversarial inputs include:

- blank strings (no rows produced)
- a single string containing multiple records (multiple rows produced)
- multiple invalid strings whose concatenation looks like a single record (one row produced)
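To make the three adversarial shapes concrete, here is a hedged, self-contained sketch (not arrow-rs code): a toy brace-depth scanner that reports how many complete top-level JSON objects a string contains and whether a record is left open at the end. It ignores braces inside string literals, so it only serves to illustrate the counting idea -- in arrow-json the real counting is done by the `TapeDecoder`.

```rust
// Toy illustration only: count complete top-level `{...}` records in a string
// and report whether an unterminated record remains at the end.
fn count_top_level_records(s: &str) -> (usize, bool) {
    let mut depth = 0usize; // current brace nesting depth
    let mut complete = 0usize; // complete top-level records seen so far
    for c in s.chars() {
        match c {
            '{' => depth += 1,
            '}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    complete += 1;
                }
            }
            _ => {}
        }
    }
    (complete, depth > 0) // (records produced, partial record pending?)
}

fn main() {
    // Blank string: no rows produced.
    assert_eq!(count_top_level_records(""), (0, false));
    // One string containing two records: two rows produced.
    assert_eq!(count_top_level_records(r#"{"a":1}{"a":2}"#), (2, false));
    // Invalid on its own, but its concatenation with a later string could
    // look like a single record -- detectable as a pending partial record.
    assert_eq!(count_top_level_records(r#"{"a":"#), (0, true));
}
```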
Such cases can be detected easily by checking the number of records parsed and whether the last record was incomplete -- but that state is not publicly accessible (it is buried in the `TapeDecoder` struct).

What changes are included in this PR?
Expose two new methods on the `TapeDecoder` struct, which support three new pub methods on `Decoder`. These expose the number of records the decoder has buffered up so far and whether the last record is partial (incomplete). This allows callers to verify, e.g., that each input value decoded to exactly one complete record.
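For illustration, a minimal sketch of the exactly-one-record check this enables. Assumptions: `len_before` is `Decoder::len()` captured before decoding a value, `len_after` is the same call afterwards, and `partial` is `Decoder::has_partial_record()`; the helper name is hypothetical and not part of the PR.

```rust
// Hypothetical helper (not part of this PR): after feeding one string value
// to the decoder, check that it contained exactly one complete record.
// Note that `len` counts the partially decoded record too, so a dangling
// partial record shows up as len_after == len_before + 1 with partial == true.
fn decoded_exactly_one_record(len_before: usize, len_after: usize, partial: bool) -> bool {
    len_after == len_before + 1 && !partial
}

fn main() {
    assert!(decoded_exactly_one_record(0, 1, false)); // one complete record: ok
    assert!(!decoded_exactly_one_record(0, 0, false)); // blank string: nothing decoded
    assert!(!decoded_exactly_one_record(0, 2, false)); // two records in one value
    assert!(!decoded_exactly_one_record(0, 1, true)); // partial record still buffered
}
```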
Also update documentation and add unit tests.
Are there any user-facing changes?
Three new public methods on `Decoder`: `has_partial_record`, `len`, and `is_empty`.