
Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents #303

Open · wants to merge 6 commits into main
Conversation

@tonybaloney (Contributor) commented Mar 14, 2024

Purpose

  1. Better support CJK documents with ideographic and full-width Unicode punctuation marks.
  2. Implement a recursive character-splitting algorithm to ensure that all sections are < 500 tokens (the Azure AI Search limit for this model).
  3. Also fixes #304: PDF upload will not generate or index sections if the number of characters on the page is less than 1000.

Both changes are based on improvements made to the Python sample
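
The recursive splitting approach described in the purpose can be sketched roughly as follows. This is an illustrative Python sketch (the changes were ported from the Python sample); all names and the token-counting callback are assumptions, not the PR's actual code:

```python
# Rough sketch of recursive character splitting to keep sections under a
# token limit. Names and the count_tokens callback are illustrative; the
# actual implementation is in the repo's C# text splitter.

# Separators tried in order: sentence endings (including full-width CJK
# forms), then word breaks; a hard midpoint cut is the last resort.
SENTENCE_ENDINGS = {".", "!", "?", "。", "!", "?"}
WORD_BREAKS = {",", "、", ";", ":", " ", "\t", "\n"}

def split_section(text: str, max_tokens: int, count_tokens) -> list[str]:
    """Recursively split text until every piece is within max_tokens."""
    if len(text) <= 1 or count_tokens(text) <= max_tokens:
        return [text]
    mid = len(text) // 2
    # Prefer a cut just after a sentence ending near the middle, then a
    # word break; fall back to a hard cut at the midpoint.
    for separators in (SENTENCE_ENDINGS, WORD_BREAKS):
        positions = [i for i, ch in enumerate(text) if ch in separators]
        if positions:
            cut = min(positions, key=lambda i: abs(i - mid)) + 1
            if 0 < cut < len(text):
                return (split_section(text[:cut], max_tokens, count_tokens)
                        + split_section(text[cut:], max_tokens, count_tokens))
    return (split_section(text[:mid], max_tokens, count_tokens)
            + split_section(text[mid:], max_tokens, count_tokens))
```

With a real tokenizer, count_tokens would wrap an encoder; plain `len` works as a crude character-count stand-in for demonstration.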

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

@tonybaloney (Contributor, Author)

The e734ef1 commit should fail; I added a test to prove #304.

@tonybaloney changed the title from "Add full-width and ideographic punctuation to text splitter" to "Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents" on Mar 14, 2024
@luisquintanilla (Collaborator) commented Mar 15, 2024

The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489

Once that's merged, this PR will be updated to reflect those changes.

cc: @tonybaloney

@LittleLittleCloud (Collaborator)
@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

@tonybaloney (Contributor, Author)
> @tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

Yes, I'll wait for a new release of SK so I can test the changes.

logger: null,
maxTokensPerSection: 500);
List<Section> sections = [];
string testContent = "".PadRight(1000, ' ');
Review comment (Member):

Suggested change:
- string testContent = "".PadRight(1000, ' ');
+ string testContent = new(' ', 1_000);

@@ -312,4 +315,89 @@ public async Task EmbedImageBlobTestAsync()
await blobServiceClient.DeleteBlobContainerAsync(blobContainer);
}
}

[Fact]
public async Task EnsureTextSplitsOnTinySectionAsync()
Review comment (Member):

I'd expect a lot more test cases. What about content with the various word splits or sentence delimiters? Also, consider using a Theory to avoid redundant code, and parameterize the known inputs and expected outputs.
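
The reviewer's parameterization suggestion (xUnit's [Theory]/[InlineData]) can be sketched as a table-driven test; this Python analogue uses a hypothetical fixed-width splitter as a stand-in, not the PR's code:

```python
# Table-driven version of the reviewer's suggestion: parameterize the
# known inputs and expected outputs instead of repeating test bodies.
# naive_split is a hypothetical stand-in, not the PR's splitter.

def naive_split(text: str, max_len: int) -> list[str]:
    """Hypothetical splitter: fixed-width chunks, for illustration only."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]

CASES = [
    # (input, max_len, expected_chunks)
    ("", 10, [""]),                # tiny/empty document still yields a section
    ("short", 10, ["short"]),      # under the limit: one chunk
    ("abcdefghij" * 2, 10, ["abcdefghij", "abcdefghij"]),
]

def test_split_cases():
    for text, max_len, expected in CASES:
        assert naive_split(text, max_len) == expected
```

Adding a new known-input/expected-output pair then means adding one row to the table rather than a new test method.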

@@ -16,7 +16,10 @@
using Azure.Storage.Blobs;
using FluentAssertions;
using Microsoft.Extensions.Logging;
using MudBlazor.Services;
Review comment (Member):

Why is this needed?

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: allText[start..end],
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName);
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: sectionText,
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName);
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: allText,
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

IReadOnlyList<PageDetail> pageMap, string blobName)
{
const int MaxSectionLength = 1_000;
const int SentenceSearchLimit = 100;
const int SectionOverlap = 100;

var sentenceEndings = new[] { '.', '!', '?' };
- var wordBreaks = new[] { ',', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' };
+ var wordBreaks = new[] { ',', '、', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' };
Review comment (Member):

Lift this to a class-scoped variable to avoid reallocating the same thing multiple times with each call to CreateSectionsAsync.
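
The point generalizes to any language: allocate the separator arrays once at class or module scope instead of rebuilding identical arrays on every call. A hypothetical Python sketch (names are illustrative, not the PR's code):

```python
# Module-level constant, allocated once, instead of recreating the same
# array inside every call. Mirrors the reviewer's class-scoped-field advice.
WORD_BREAKS = (",", "、", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n")

def find_word_break(text: str, start: int) -> int:
    """Index of the first word-break character at or after start, or -1."""
    for i in range(start, len(text)):
        if text[i] in WORD_BREAKS:
            return i
    return -1
```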

Successfully merging this pull request may close these issues.

PDF upload will not generate or index sections if the number of characters on the page is less than 1000
4 participants