Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for source byte-range tracking for ByteRecord #286

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

AndreaOddo89
Copy link

ByteRecord (via ByteRecordInner) already exposes a position method to access line number, record number and byte offset of the start of the record.
This PR adds information to ByteRecord to track not only the byte offset of the start of the record, but also its end, via a span method returning a Span; this is useful to retrieve the original source bytes for a parsed record when e.g. reporting errors.

@AndreaOddo89
Copy link
Author

There is a clear overlap between Position and Span, as the former contains the line number and record numbers (that resemble a span), alongside half of the information encoded into Span.
We've tried to minimize PR impact by creating a separate field, but they could be somehow combined. Feedback welcome.

@BurntSushi
Copy link
Owner

Sorry, can you please elaborate a bit more on what you want to use this for? A code sample would be helpful I think.

New APIs, and especially new types, really need to clear a big hurdle.

@AndreaOddo89
Copy link
Author

Makes perfect sense, I'll try to clarify the specific use case.

I need to read some generally large csv files, and validate the values in each line for conformance with specific constraints.
Whenever the validation test on a specific line does not pass, I need to report also the full original line, as-is.

In order to do so, I'd need to seek into the original file and extract the source bytes, but String/ByteRecord only provides access to the location of the first byte, not of the last, nor the length to calculate it.

Any alternative strategy to identify the end of the source span requires reimplementing the CSV parsing logic, which defeats the purpose of using a csv parser in the first place.

Here is an example code where this solution applies:

use std::error::Error;
use std::io::{Read, Seek, SeekFrom};
use csv::{ReaderBuilder, StringRecord};

pub fn process_file(src: impl Read + Seek) -> Result<(), Box<dyn Error>> {
    let mut csv_reader = ReaderBuilder::new().from_reader(src);
    let mut record_buffer = StringRecord::new();

    fn validate_record(_record: &StringRecord) -> Result<(), String> {
        todo!()
    }

    while csv_reader.read_record(&mut record_buffer)? {
        if let Err(msg) = validate_record(&record_buffer) {
            let position = record_buffer.position().unwrap();
            let src_line = {
                let span = record_buffer.as_byte_record().span().unwrap();
                let mut raw_reader = csv_reader.into_inner();
                raw_reader.seek(SeekFrom::Start(span.start()))?;
                let mut raw_line_buffer = vec![0; span.len()];
                raw_reader.read_exact(&mut raw_line_buffer)?;
                String::from_utf8(raw_line_buffer)?
            };

            return Err(Box::from(format!("Validation error at record {}: {}. Source line: {}", position.record(), msg, src_line)));
        }
    }
    Ok(())
}

As the necessary state is already available in the reader, I assume the performance impact should be negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants