
Expose record boundary information in JSON decoder #7092

Merged (2 commits) on Feb 11, 2025

Conversation

@scovich (Contributor) commented Feb 6, 2025

Which issue does this PR close?

Closes #6522

Rationale for this change

The current JSON decoder gives no way to reliably identify record boundaries in the input (as distinct from mere whitespace or buffer boundaries), which makes it very difficult to correctly and efficiently parse a series of unrelated JSON values (such as from a StringArray column).

Examples of adversarial inputs include: blank strings (no rows produced), a single string containing multiple records (multiple rows produced), or multiple invalid strings whose concatenation looks like a single record (one row produced).

Such cases can be detected easily by checking the number of records parsed, and whether the last record was incomplete -- but that state is not publicly accessible (buried in the TapeDecoder struct).
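
To make that concrete, here is a minimal sketch of such a check (the inputs and the detect_bad_inputs helper are hypothetical, and decoder is assumed to be an arrow-json Decoder that already has the methods added by this PR):

use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn detect_bad_inputs(decoder: &mut Decoder) -> Result<(), ArrowError> {
    let before = decoder.len();

    decoder.decode(b"")?; // blank string
    assert_eq!(decoder.len(), before); // no row produced

    decoder.decode(br#"{"a": 1} {"a": 2}"#)?; // one string containing two records
    assert_eq!(decoder.len(), before + 2); // two rows produced, not one

    decoder.decode(br#"{"a": 1"#)?; // truncated record
    assert!(decoder.has_partial_record()); // the last record is incomplete

    Ok(())
}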

What changes are included in this PR?

Expose two new methods on the TapeDecoder struct, which support three new pub methods on Decoder that expose the number of records the decoder has buffered so far and whether the last record is partial (incomplete).

This allows e.g.

fn parse_one_record(decoder: &mut Decoder, bytes: &[u8]) -> Result<(), ArrowError> {
    let existing_len = decoder.len();
    let decoded_bytes = decoder.decode(bytes)?;
    assert_eq!(decoded_bytes, bytes.len()); // all bytes consumed
    assert_eq!(decoder.len(), existing_len + 1); // exactly one record produced
    assert!(!decoder.has_partial_record()); // the record was complete
    Ok(())
}
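
Building on that, a sketch of how a caller might parse a whole StringArray column value by value (parse_string_array is a hypothetical helper, not part of this PR; it reuses parse_one_record from above):

use arrow_array::{RecordBatch, StringArray};
use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn parse_string_array(
    decoder: &mut Decoder,
    strings: &StringArray,
) -> Result<Option<RecordBatch>, ArrowError> {
    for value in strings.iter().flatten() {
        // Each non-null value must decode to exactly one complete record.
        parse_one_record(decoder, value.as_bytes())?;
    }
    // Flush everything buffered so far into a single RecordBatch.
    decoder.flush()
}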

Also update documentation and add unit tests.

Are there any user-facing changes?

Three new public methods on Decoder: has_partial_record, len, and is_empty.

@github-actions bot added the arrow (Changes to the arrow crate) label on Feb 6, 2025
@scovich (Contributor, Author) left a comment

A couple of notes for reviewers, in case it's helpful.

Comment on lines 623 to 629
/// The number of unflushed records, including the partially decoded record (if any).
pub fn len(&self) -> usize {
    self.tape_decoder.num_buffered_rows()
}

/// True if there are no records to flush, i.e. [`len`] is zero.
pub fn is_empty(&self) -> bool {
@scovich (Contributor, Author) commented:

Clippy prefers is_empty() over len() == 0, so I added both methods. I debated just exposing num_buffered_records instead, but len+is_empty seemed more rustic.

Happy to adjust the approach based on feedback.

@@ -545,6 +545,16 @@ impl TapeDecoder {
    Ok(())
}

/// The number of buffered rows, including the partially decoded row (if any).
pub fn num_buffered_rows(&self) -> usize {
@scovich (Contributor, Author) commented Feb 6, 2025

For whatever reason, TapeDecoder seems to call them "rows" while Decoder calls them "records" -- so I named the new methods to match.

@scovich force-pushed the expose-json-decoder-state branch from 0df7e26 to 3ea1c49 on February 7, 2025 at 19:24
@tustvold (Contributor) left a comment

This makes sense to me, my understanding being this allows deserializing StringArray one value at a time, ensuring records are not split across value boundaries.

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

@scovich (Contributor, Author) commented Feb 10, 2025

This makes sense to me, my understanding being this allows deserializing StringArray one value at a time, ensuring records are not split across value boundaries.

That's a good description of what I hoped to achieve, yes.

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always.

In any case though -- this enables the user of a Decoder to express correctness constraints they care about, and the small performance overhead would be totally acceptable. The change doesn't impact normal parsing at all.

@scovich (Contributor, Author) commented Feb 10, 2025

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

Actually... sending in a bunch of small (not I/O-optimal) strings one at a time will probably be the biggest overhead. If we didn't need boundary validation, we could probably just pass the entire underlying byte array from the StringArray in a single call. But that's unavoidable, especially because I don't think the underlying byte array is required to be tightly packed (there could be regions of invalid bytes between strings).

@tustvold (Contributor) commented Feb 10, 2025

If we didn't need boundary validation, we could probably just pass the entire underlying byte array

Yes, this was the approach I was comparing to, which of course would not work for strings with non-empty nulls or malicious input, and so is probably not a good idea - I'm just curious 😅 (If there is a big difference, which I doubt, it might justify a more intrusive integration.)
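
For reference, a rough sketch of that bulk alternative (parse_bulk is hypothetical; it assumes StringArray::value_data() exposes the raw values buffer, and that the buffer is tightly packed with trusted JSON and has no non-empty nulls, which is exactly what the caveats above are about):

use arrow_array::StringArray;
use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn parse_bulk(decoder: &mut Decoder, strings: &StringArray) -> Result<(), ArrowError> {
    // Hand the entire underlying byte buffer to the decoder in one call and let
    // the tape decoder find record boundaries itself -- no per-value validation.
    decoder.decode(strings.value_data())?;
    Ok(())
}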

@tustvold merged commit 27d2a75 into apache:main on Feb 11, 2025 (23 checks passed)
Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues.

Parsing a string column containing JSON values into a typed array