Issue 122 streaming operation #125
…echanism if the stream doesn't support the Length operation. Addresses drewnoakes#122 by allowing for streamed parsing of metadata.
…file unless it's needed.
try
{
    // For some reason, FileStreams are faster in contiguous mode. Since this is such a common case, we
    // specifically check for it.
I think the underlying OS pre-fetches data. Many read operations are sequential. In fact, when you create a FileStream there's a ctor overload that lets you specify Sequential vs RandomAccess (IIRC), which presumably hints to the platform how the reader will behave. For the places where we create a FileStream, we could experiment to see if it matters.
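As a minimal sketch of the experiment suggested above: `FileStream` has a constructor overload taking a `FileOptions` value, and `FileOptions.SequentialScan` and `FileOptions.RandomAccess` are the two access-pattern hints. The helper name `OpenForReading` and the 4096-byte buffer size are assumptions for illustration, not something taken from this PR.

```csharp
using System.IO;

public static class FileStreamHints
{
    // Hypothetical helper: opens a file for reading and passes an
    // access-pattern hint to the platform via FileOptions.
    public static FileStream OpenForReading(string path, bool sequential)
    {
        var options = sequential
            ? FileOptions.SequentialScan   // hint: mostly front-to-back reads
            : FileOptions.RandomAccess;    // hint: reads jump around the file

        // 4096 is an assumed buffer size, chosen only for the example.
        return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                              bufferSize: 4096, options: options);
    }
}
```

Running the test suite once with each hint would show whether the hint matters on a given platform; the hint is advisory, so no difference is also a plausible outcome.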
Right, so what I didn't understand is why the performance would have changed: in the case where it previously pre-loaded the whole file by reading sequentially (on FileStreams), we still do that. Likewise, the cost of the lookup was all but eliminated by memoizing the last looked up chunk (only something like 5% of calls ended up requesting a chunk other than the last one it returned). But the new code is still ~50% slower at running the test-suite.
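The memoization described above could look roughly like the following. This is a sketch only: `ChunkCache`, the `load` delegate, and the `DictionaryLookups` counter are hypothetical names, and the PR's actual reader may be structured differently.

```csharp
using System;
using System.Collections.Generic;

public sealed class ChunkCache
{
    private readonly Dictionary<int, byte[]> _chunks = new Dictionary<int, byte[]>();
    private readonly Func<int, byte[]> _load;   // hypothetical loader for a chunk index
    private int _lastIndex = -1;
    private byte[] _lastChunk = Array.Empty<byte>();

    // Counts how often the request could not be served from the memoized chunk.
    public int DictionaryLookups { get; private set; }

    public ChunkCache(Func<int, byte[]> load)
    {
        _load = load ?? throw new ArgumentNullException(nameof(load));
    }

    public byte[] GetChunk(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));

        // Fast path: same chunk as the previous call, no dictionary access.
        if (index == _lastIndex) return _lastChunk;

        DictionaryLookups++;
        if (!_chunks.TryGetValue(index, out var chunk))
        {
            chunk = _load(index);
            _chunks[index] = chunk;
        }

        _lastIndex = index;
        _lastChunk = chunk;
        return chunk;
    }
}
```

With mostly sequential reads, repeated requests for the same chunk take the fast path, which matches the observation that only a small share of calls ask for a different chunk.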
Alas I am developing on a Mac, and the Xamarin-based toolchain here doesn't have a robust profiler (nor do I have VS Ultimate (or whatever it's called) on my Windows machine). If you do have access to a profiler, I'd be very interested to know what you see.
This looks very cool and I'm keen to take it for a proper spin and learn more about it. The failing unit tests suggest there are still some pending issues though. Have you looked into the reported problems there? The comments you've added mention a perf hit relating to dictionary lookups. If data's mostly read sequentially then perhaps one approach is to have the dictionary's values be a struct Chunk
{
public byte[] Bytes { get; }
/// <summary>Gets and sets the subsequent and adjacent chunk, if loaded.</summary>
public Chunk? Next { get; set; }
}
I haven't thought this through too deeply. Just throwing the idea out there.
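One way the linked-chunk idea might be fleshed out is sketched below. It uses a class rather than a struct, because a C# struct cannot contain a `Chunk?` field of its own type (the layout would be cyclic). `LinkedChunkStore`, `GetNext`, the `load` delegate, and the `KeyedLookups` counter are all hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed class Chunk
{
    public Chunk(byte[] bytes) { Bytes = bytes; }

    public byte[] Bytes { get; }

    /// <summary>Gets and sets the subsequent and adjacent chunk, if loaded (null otherwise).</summary>
    public Chunk Next { get; set; }
}

public sealed class LinkedChunkStore
{
    private readonly Dictionary<int, Chunk> _chunks = new Dictionary<int, Chunk>();
    private readonly Func<int, byte[]> _load;   // hypothetical loader for a chunk index

    // Counts calls to Get, i.e. requests resolved by index rather than by link.
    public int KeyedLookups { get; private set; }

    public LinkedChunkStore(Func<int, byte[]> load)
    {
        _load = load ?? throw new ArgumentNullException(nameof(load));
    }

    public Chunk Get(int index)
    {
        KeyedLookups++;
        if (_chunks.TryGetValue(index, out var chunk)) return chunk;

        chunk = new Chunk(_load(index));
        _chunks[index] = chunk;

        // Link to neighbours that are already loaded.
        if (_chunks.TryGetValue(index - 1, out var previous)) previous.Next = chunk;
        if (_chunks.TryGetValue(index + 1, out var next)) chunk.Next = next;
        return chunk;
    }

    // Sequential step: follow the link when present, otherwise fall back to Get.
    // Assumes `current` was obtained from this store at `currentIndex`.
    public Chunk GetNext(Chunk current, int currentIndex)
    {
        return current.Next ?? Get(currentIndex + 1);
    }
}
```

Whether following a reference is measurably cheaper than a dictionary lookup here would need profiling; the sketch only shows the shape of the idea.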
// If we know the length of the stream ahead of time, we can allocate a Dictionary with enough slots
// for all the chunks. We 2X it to try to avoid hash collisions.
var chunksCapacity = 2 * (_stream.Length / chunkLength);
_chunks = new Dictionary<int, byte[]>((int) chunksCapacity);
For most data streams we would be unlikely to load all the chunks in the file, so perhaps preallocating enough buckets for the entire file is unnecessary and may even hurt performance by spreading entries across more cache lines than are strictly needed. We'd need to measure performance to be sure one way or another.
I only added this after doing a series of performance tests to ensure it would be a win. It's relatively small (~5%) but it was measurable and repeatable.
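A possible middle ground between the two positions is to keep the pre-sizing but cap it, so very large streams do not allocate a huge bucket array or overflow the `int` cast. The sketch below keeps the same `2 * (length / chunkLength)` arithmetic as the diff; the `ChunkTable` class, the `ComputeCapacity` name, and the `MaxSlots` limit are assumptions for illustration and would need the same kind of measurement described above.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkTable
{
    // Assumed upper bound on pre-allocated slots; the value is illustrative only.
    public const int MaxSlots = 1 << 16;

    // Same arithmetic as the diff (2x the whole-chunk count), clamped to MaxSlots.
    public static int ComputeCapacity(long streamLength, int chunkLength)
    {
        if (streamLength < 0) throw new ArgumentOutOfRangeException(nameof(streamLength));
        if (chunkLength <= 0) throw new ArgumentOutOfRangeException(nameof(chunkLength));

        long chunkCount = streamLength / chunkLength;

        // Clamp before doubling so the multiplication cannot overflow.
        if (chunkCount >= MaxSlots / 2) return MaxSlots;
        return (int)(2 * chunkCount);
    }

    public static Dictionary<int, byte[]> Create(long streamLength, int chunkLength)
    {
        return new Dictionary<int, byte[]>(ComputeCapacity(streamLength, chunkLength));
    }
}
```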
Would also be good to create some more unit tests that drive the class with stream implementations with different
Seems like I can't respond to all of your comments inline, so let me answer here:
At present this branch solves the initial problem I set out to solve (it can now process raw files from the network without reading in the entire stream; a ~60X performance improvement for my use-case), so it will be at least a few days before I can look into this again.
Yep, this is pretty much what I was thinking. Shouldn't be too hard. Can write one set of assertions across a single data stream, then have code test it in all permutations of settings to validate all modes operate correctly. The test failures suggest that the new reader implementation is producing incorrect values. Did you compare output before/after this change on any files? This may only trigger for some kinds of streams.
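A rough sketch of how such permutations might be produced for tests: a wrapper stream that selectively disables seeking and `Length`, plus a helper that yields all four combinations over the same data. `RestrictedStream` and `AllPermutations` are hypothetical names, not types from this repository.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical test helper: wraps a stream and hides seeking and/or Length.
public sealed class RestrictedStream : Stream
{
    private readonly Stream _inner;
    private readonly bool _seekable;
    private readonly bool _lengthSupported;

    public RestrictedStream(Stream inner, bool seekable, bool lengthSupported)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _seekable = seekable;
        _lengthSupported = lengthSupported;
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _seekable;
    public override bool CanWrite => false;

    public override long Length =>
        _lengthSupported ? _inner.Length : throw new NotSupportedException();

    public override long Position
    {
        get => _seekable ? _inner.Position : throw new NotSupportedException();
        set
        {
            if (!_seekable) throw new NotSupportedException();
            _inner.Position = value;
        }
    }

    public override int Read(byte[] buffer, int offset, int count) =>
        _inner.Read(buffer, offset, count);

    public override long Seek(long offset, SeekOrigin origin) =>
        _seekable ? _inner.Seek(offset, origin) : throw new NotSupportedException();

    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) =>
        throw new NotSupportedException();

    // Yields the same data behind every seekable/length-supported combination.
    public static IEnumerable<RestrictedStream> AllPermutations(byte[] data)
    {
        foreach (var seekable in new[] { true, false })
            foreach (var lengthSupported in new[] { true, false })
                yield return new RestrictedStream(new MemoryStream(data), seekable, lengthSupported);
    }
}
```

A single set of assertions can then be run once per yielded stream, so every mode is checked against the same expected values.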
The unit tests are stand-alone and don't require the image repo. I think you said you're using Rider, so it should be possible to run them directly there. IIRC you just right click the tests project in the solution view. As for confusion regarding the images repo, you're right that it's confusing. I've opened an issue regarding this.
…s repeats the IndexedCapturingReaderTests, for each of the permutations of seekable and length supported.
Hi there, quick ping on this -- is there anything further needed here?
Hi James, this hasn't fallen off my radar but I have been busy elsewhere and this PR will take more than just a few minutes to review and think through properly. It's a core change and it seems great but I still want to give it a proper look through. I will also consider the approach in the context of the Java implementation, as it'd be nice to keep parity as much as possible. Thanks for your patience.
No problem, Drew. Thanks for the note and for being a diligent maintainer. I’ve done my best to make sure that it’s right, but I have a lot less experience with this codebase than you do, so I appreciate the attention to detail.
Hi there, just checking in on this. Totally understand if you're still swamped; just want to make sure it doesn't bit-rot.
Modified IndexedCapturingReader that only reads as much of the file as is necessary.
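To illustrate the general idea of reading only as much as necessary, here is a minimal sketch of a reader that pulls fixed-size chunks from the stream on demand and never needs `Length` or `Seek`. It is not the PR's implementation: `LazyChunkReader`, `GetByte`, and `ChunksLoaded` are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class LazyChunkReader
{
    private readonly Stream _stream;
    private readonly int _chunkLength;
    private readonly List<byte[]> _chunks = new List<byte[]>();
    private bool _endOfStream;

    public LazyChunkReader(Stream stream, int chunkLength)
    {
        _stream = stream ?? throw new ArgumentNullException(nameof(stream));
        if (chunkLength <= 0) throw new ArgumentOutOfRangeException(nameof(chunkLength));
        _chunkLength = chunkLength;
    }

    public int ChunksLoaded => _chunks.Count;

    public byte GetByte(int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));

        int chunkIndex = index / _chunkLength;

        // Read forward only until the chunk containing `index` is available.
        while (_chunks.Count <= chunkIndex && !_endOfStream)
            LoadNextChunk();

        if (chunkIndex >= _chunks.Count)
            throw new IOException("Requested index is beyond the end of the stream.");

        byte[] chunk = _chunks[chunkIndex];
        int offset = index % _chunkLength;
        if (offset >= chunk.Length)
            throw new IOException("Requested index is beyond the end of the stream.");

        return chunk[offset];
    }

    private void LoadNextChunk()
    {
        var buffer = new byte[_chunkLength];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the chunk is full or the stream ends.
        while (total < _chunkLength)
        {
            int read = _stream.Read(buffer, total, _chunkLength - total);
            if (read == 0) { _endOfStream = true; break; }
            total += read;
        }

        if (total == 0) return;                       // nothing left to capture
        if (total < _chunkLength) Array.Resize(ref buffer, total);
        _chunks.Add(buffer);
    }
}
```

Metadata usually sits near the start of a file, so a reader shaped like this would typically touch only the first few chunks of a network stream.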