Add support for source byte-range tracking for ByteRecord #286

AndreaOddo89 · 2022-09-15T09:22:40Z

ByteRecord (via ByteRecordInner) already exposes a position method to access line number, record number and byte offset of the start of the record.
This PR adds information to ByteRecord to track not only the byte offset of the start of the record, but also its end, via a span method returning a Span; this is useful to retrieve the original source bytes for a parsed record when e.g. reporting errors.

AndreaOddo89 · 2022-09-15T09:25:16Z

There is a clear overlap between Position and Span, as the former contains the line number and record numbers (that resemble a span), alongside half of the information encoded into Span.
We've tried to minimize PR impact by creating a separate field, but they could be somehow combined. Feedback welcome.

BurntSushi · 2022-09-15T11:07:40Z

Sorry, can you please elaborate a bit more on what you want to use this for? A code sample would be helpful I think.

New APIs, and especially new types, really need to clear a big hurdle.

AndreaOddo89 · 2022-09-15T13:25:51Z

Makes perfect sense, I'll try to clarify the specific use case.

I need to read some generally large csv files, and validate the values in each line for conformance with specific constraints.
Whenever the validation test on a specific line does not pass, I need to report also the full original line, as-is.

In order to do so, I'd need to seek into the original file and extract the source bytes, but String/ByteRecord only provides access to the location of the first byte, not of the last, nor the length to calculate it.

Any alternative strategy to identify the end of the source span requires reimplementing the CSV parsing logic, which defeats the purpose of using a csv parser in the first place.

Here is an example code where this solution applies:

use std::error::Error;
use std::io::{Read, Seek, SeekFrom};
use csv::{ReaderBuilder, StringRecord};

pub fn process_file(src: impl Read + Seek) -> Result<(), Box<dyn Error>> {
    let mut csv_reader = ReaderBuilder::new().from_reader(src);
    let mut record_buffer = StringRecord::new();

    fn validate_record(_record: &StringRecord) -> Result<(), String> {
        todo!()
    }

    while csv_reader.read_record(&mut record_buffer)? {
        if let Err(msg) = validate_record(&record_buffer) {
            let position = record_buffer.position().unwrap();
            let src_line = {
                let span = record_buffer.as_byte_record().span().unwrap();
                let mut raw_reader = csv_reader.into_inner();
                raw_reader.seek(SeekFrom::Start(span.start()))?;
                let mut raw_line_buffer = vec![0; span.len()];
                raw_reader.read_exact(&mut raw_line_buffer)?;
                String::from_utf8(raw_line_buffer)?
            };

            return Err(Box::from(format!("Validation error at record {}: {}. Source line: {}", position.record(), msg, src_line)));
        }
    }
    Ok(())
}

As the necessary state is already available in the reader, I assume the performance impact should be negligible.

… when using write_byte_record

Add support for source byte-range tracking for ByteRecord

1f8fd89

EdMcBane added 2 commits October 15, 2022 00:20

:enh: ensure only full records are flushed from Writer to inner Write…

3471c66

… when using write_byte_record

🆕 allow to peek at the inner writer of a csv Writer

e34403c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for source byte-range tracking for ByteRecord #286

Add support for source byte-range tracking for ByteRecord #286

AndreaOddo89 commented Sep 15, 2022

AndreaOddo89 commented Sep 15, 2022

BurntSushi commented Sep 15, 2022

AndreaOddo89 commented Sep 15, 2022

Add support for source byte-range tracking for ByteRecord #286

Are you sure you want to change the base?

Add support for source byte-range tracking for ByteRecord #286

Conversation

AndreaOddo89 commented Sep 15, 2022

AndreaOddo89 commented Sep 15, 2022

BurntSushi commented Sep 15, 2022

AndreaOddo89 commented Sep 15, 2022