
Expose record boundary information in JSON decoder #7092

Merged (2 commits) on Feb 11, 2025

Conversation

@scovich (Contributor) commented Feb 6, 2025

Which issue does this PR close?

Closes #6522

Rationale for this change

The current JSON decoder gives no way to reliably identify record boundaries in the input (as distinct from mere whitespace or buffer boundaries), which makes it very difficult to correctly and efficiently parse a series of unrelated JSON values (such as from a StringArray column).

Examples of adversarial inputs include: blank strings (no rows produced), a single string containing multiple records (multiple rows produced), or multiple invalid strings whose concatenation looks like a single record (one row produced).

Such cases can be detected easily by checking the number of records parsed, and whether the last record was incomplete -- but that state is not publicly accessible (buried in the TapeDecoder struct).
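
To make that concrete, here is a minimal sketch of such a check (the inputs and the detect_bad_inputs helper are hypothetical, and decoder is assumed to be an arrow-json Decoder that already has the methods added by this PR):

use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn detect_bad_inputs(decoder: &mut Decoder) -> Result<(), ArrowError> {
    let before = decoder.len();

    decoder.decode(b"")?; // blank string
    assert_eq!(decoder.len(), before); // no row produced

    decoder.decode(br#"{"a": 1} {"a": 2}"#)?; // one string containing two records
    assert_eq!(decoder.len(), before + 2); // two rows produced, not one

    decoder.decode(br#"{"a": 1"#)?; // truncated record
    assert!(decoder.has_partial_record()); // the last record is incomplete

    Ok(())
}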

What changes are included in this PR?

Expose two new methods on the TapeDecoder struct, which support three new pub methods on Decoder that expose the number of records the decoder has buffered so far and whether the last record is partial (incomplete).

This allows e.g.

fn parse_one_record(decoder: &mut Decoder, bytes: &[u8]) -> Result<(), ArrowError> {
    let existing_len = decoder.len();
    let decoded_bytes = decoder.decode(bytes)?;
    assert_eq!(decoded_bytes, bytes.len()); // all bytes consumed
    assert_eq!(decoder.len(), existing_len + 1); // exactly one record produced
    assert!(!decoder.has_partial_record()); // the record was complete
    Ok(())
}
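
Building on that, a sketch of how a caller might parse a whole StringArray column value by value (parse_string_array is a hypothetical helper, not part of this PR; it reuses parse_one_record from above):

use arrow_array::{RecordBatch, StringArray};
use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn parse_string_array(
    decoder: &mut Decoder,
    strings: &StringArray,
) -> Result<Option<RecordBatch>, ArrowError> {
    for value in strings.iter().flatten() {
        // Each non-null value must decode to exactly one complete record.
        parse_one_record(decoder, value.as_bytes())?;
    }
    // Flush everything buffered so far into a single RecordBatch.
    decoder.flush()
}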

Also update documentation and add unit tests.

Are there any user-facing changes?

Three new public methods on Decoder: has_partial_record, len, and is_empty.

@github-actions bot added the arrow (Changes to the arrow crate) label on Feb 6, 2025
@scovich (Contributor, Author) left a comment

A couple of notes for reviewers, in case it's helpful.

Comment on lines 623 to 629
/// The number of unflushed records, including the partially decoded record (if any).
pub fn len(&self) -> usize {
    self.tape_decoder.num_buffered_rows()
}

/// True if there are no records to flush, i.e. [`len`] is zero.
pub fn is_empty(&self) -> bool {
@scovich (Contributor, Author) commented:

Clippy prefers is_empty() over len() == 0, so I added both methods. I debated just exposing num_buffered_records instead, but len+is_empty seemed more rustic.

Happy to adjust the approach based on feedback.

@@ -545,6 +545,16 @@ impl TapeDecoder {
    Ok(())
}

/// The number of buffered rows, including the partially decoded row (if any).
pub fn num_buffered_rows(&self) -> usize {
@scovich (Contributor, Author) commented Feb 6, 2025

For whatever reason, TapeDecoder seems to call them "rows" while Decoder calls them "records" -- so I named the new methods to match.

@scovich force-pushed the expose-json-decoder-state branch from 0df7e26 to 3ea1c49 on February 7, 2025 at 19:24
@tustvold (Contributor) left a comment

This makes sense to me, my understanding being this allows deserializing StringArray one value at a time, ensuring records are not split across value boundaries.

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

@scovich (Contributor, Author) commented Feb 10, 2025

This makes sense to me, my understanding being this allows deserializing StringArray one value at a time, ensuring records are not split across value boundaries.

That's a good description of what I hoped to achieve, yes.

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

In the common case where all strings contain correct JSON, the check should be branch-predicted away. It's ultimately just checking two variables that should already be hot in CPU cache, if not in registers, and both branches should be not-taken almost always.

In any case though -- this enables the user of a Decoder to express correctness constraints they care about, and the small performance overhead would be totally acceptable. The change doesn't impact normal parsing at all.

@scovich (Contributor, Author) commented Feb 10, 2025

Whilst this probably has some additional overheads, I'd be curious to see these quantified, e.g. compared to the approach of not checking. I suspect these are low relative to the inherent costs of JSON decoding, and such an approach still benefits from the vectorised tape->array conversion.

Actually... sending in a bunch of small (not I/O-optimal) strings one at a time will probably be the biggest overhead. If we didn't need boundary validation, we could probably just pass the entire underlying byte array from the StringArray in a single call. But that's unavoidable, especially because I don't think the underlying byte array is required to be tightly packed (there could be regions of invalid bytes between strings).

@tustvold (Contributor) commented Feb 10, 2025

If we didn't need boundary validation, we could probably just pass the entire underlying byte array

Yes, this was the approach I was comparing to, which of course would not work for strings with non-empty nulls or malicious input, and so is probably not a good idea - I'm just curious 😅 (If there is a big difference, which I doubt, it might justify a more intrusive integration.)
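
For reference, a rough sketch of that bulk alternative (parse_bulk is hypothetical; it assumes StringArray::value_data() exposes the raw values buffer, and that the buffer is tightly packed with trusted JSON and has no non-empty nulls, which is exactly what the caveats above are about):

use arrow_array::StringArray;
use arrow_json::reader::Decoder;
use arrow_schema::ArrowError;

fn parse_bulk(decoder: &mut Decoder, strings: &StringArray) -> Result<(), ArrowError> {
    // Hand the entire underlying byte buffer to the decoder in one call and let
    // the tape decoder find record boundaries itself -- no per-value validation.
    decoder.decode(strings.value_data())?;
    Ok(())
}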

@tustvold merged commit 27d2a75 into apache:main on Feb 11, 2025 (23 checks passed)
Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues.

Parsing a string column containing JSON values into a typed array