-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents #303
base: main
Are you sure you want to change the base?
Conversation
The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489 Once that's merged, this PR will be updated to reflect those changes. cc: @tonybaloney |
@tonybaloney Looks like the sk PR has been merged, would you still planning to update this PR to reflect that change |
Yes, I'll wait for a new release of SK so I can test the changes |
logger: null, | ||
maxTokensPerSection: 500); | ||
List<Section> sections = []; | ||
string testContent = "".PadRight(1000, ' '); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string testContent = "".PadRight(1000, ' '); | |
string testContent = new(' ', 1_000); |
@@ -312,4 +315,89 @@ public async Task EmbedImageBlobTestAsync() | |||
await blobServiceClient.DeleteBlobContainerAsync(blobContainer); | |||
} | |||
} | |||
|
|||
[Fact] | |||
public async Task EnsureTextSplitsOnTinySectionAsync() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect a lot more test cases. What about some content with the various word splits or sentence delimiters. Also, consider using a Theory
to avoid redundant code, and parameterize all the known-inputs, and expected outputs.
@@ -16,7 +16,10 @@ | |||
using Azure.Storage.Blobs; | |||
using FluentAssertions; | |||
using Microsoft.Extensions.Logging; | |||
using MudBlazor.Services; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'), | ||
Content: allText[start..end], | ||
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)), | ||
SourceFile: blobName); | ||
SourceFile: blobName)) { yield return section; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SourceFile: blobName)) { yield return section; } | |
SourceFile: blobName)) | |
{ | |
yield return section; | |
} |
Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'), | ||
Content: sectionText, | ||
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)), | ||
SourceFile: blobName); | ||
SourceFile: blobName)) { yield return section; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SourceFile: blobName)) { yield return section; } | |
SourceFile: blobName)) | |
{ | |
yield return section; | |
} |
Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'), | ||
Content: allText, | ||
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)), | ||
SourceFile: blobName)) { yield return section; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SourceFile: blobName)) { yield return section; } | |
SourceFile: blobName)) | |
{ | |
yield return section; | |
} |
IReadOnlyList<PageDetail> pageMap, string blobName) | ||
{ | ||
const int MaxSectionLength = 1_000; | ||
const int SentenceSearchLimit = 100; | ||
const int SectionOverlap = 100; | ||
|
||
var sentenceEndings = new[] { '.', '!', '?' }; | ||
var wordBreaks = new[] { ',', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' }; | ||
var wordBreaks = new[] { ',', '、', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lift this to a class-scoped variable to avoid reallocating the same thing multiple times with each call to CreateSectionsAsync
.
Purpose
Both changes are based on improvements made to the Python sample
Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
How to Test
What to Check
Verify that the following are valid
Other Information