
Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents #303

Open · wants to merge 6 commits into main
Conversation

@tonybaloney (Contributor) commented Mar 14, 2024

Purpose

  1. Better support CJK documents with ideographic and full-width Unicode punctuation marks.
  2. Implement a recursive character-splitting algorithm to ensure that all sections are < 500 tokens (the Azure AI Search limit for this model).
  3. Also fixes #304: PDF upload will not generate or index sections if the number of characters on the page is less than 1000.

Both changes are based on improvements made to the Python sample
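
The recursive splitting approach described in the purpose can be sketched roughly as follows. This is an illustrative Python sketch (the changes were ported from the Python sample); all names and the token-counting callback are assumptions, not the PR's actual code:

```python
# Rough sketch of recursive character splitting to keep sections under a
# token limit. Names and the count_tokens callback are illustrative; the
# actual implementation is in the repo's C# text splitter.

# Separators tried in order: sentence endings (including full-width CJK
# forms), then word breaks; a hard midpoint cut is the last resort.
SENTENCE_ENDINGS = {".", "!", "?", "。", "!", "?"}
WORD_BREAKS = {",", "、", ";", ":", " ", "\t", "\n"}

def split_section(text: str, max_tokens: int, count_tokens) -> list[str]:
    """Recursively split text until every piece is within max_tokens."""
    if len(text) <= 1 or count_tokens(text) <= max_tokens:
        return [text]
    mid = len(text) // 2
    # Prefer a cut just after a sentence ending near the middle, then a
    # word break; fall back to a hard cut at the midpoint.
    for separators in (SENTENCE_ENDINGS, WORD_BREAKS):
        positions = [i for i, ch in enumerate(text) if ch in separators]
        if positions:
            cut = min(positions, key=lambda i: abs(i - mid)) + 1
            if 0 < cut < len(text):
                return (split_section(text[:cut], max_tokens, count_tokens)
                        + split_section(text[cut:], max_tokens, count_tokens))
    return (split_section(text[:mid], max_tokens, count_tokens)
            + split_section(text[mid:], max_tokens, count_tokens))
```

With a real tokenizer, count_tokens would wrap an encoder; plain `len` works as a crude character-count stand-in for demonstration.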

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

@tonybaloney (Contributor, Author)

The e734ef1 commit should fail; I added a test to prove #304.

@tonybaloney changed the title from "Add full-width and ideographic punctuation to text splitter" to "Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents" on Mar 14, 2024
@luisquintanilla (Collaborator) commented Mar 15, 2024

The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489

Once that's merged, this PR will be updated to reflect those changes.

cc: @tonybaloney

@LittleLittleCloud (Collaborator)
@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

@tonybaloney (Contributor, Author)
> @tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

Yes, I'll wait for a new release of SK so I can test the changes.

logger: null,
maxTokensPerSection: 500);
List<Section> sections = [];
string testContent = "".PadRight(1000, ' ');
Review comment (Member):

Suggested change:
- string testContent = "".PadRight(1000, ' ');
+ string testContent = new(' ', 1_000);

@@ -312,4 +315,89 @@ public async Task EmbedImageBlobTestAsync()
await blobServiceClient.DeleteBlobContainerAsync(blobContainer);
}
}

[Fact]
public async Task EnsureTextSplitsOnTinySectionAsync()
Review comment (Member):

I'd expect a lot more test cases. What about content with the various word splits or sentence delimiters? Also, consider using a Theory to avoid redundant code, and parameterize the known inputs and expected outputs.
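
The reviewer's parameterization suggestion (xUnit's [Theory]/[InlineData]) can be sketched as a table-driven test; this Python analogue uses a hypothetical fixed-width splitter as a stand-in, not the PR's code:

```python
# Table-driven version of the reviewer's suggestion: parameterize the
# known inputs and expected outputs instead of repeating test bodies.
# naive_split is a hypothetical stand-in, not the PR's splitter.

def naive_split(text: str, max_len: int) -> list[str]:
    """Hypothetical splitter: fixed-width chunks, for illustration only."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]

CASES = [
    # (input, max_len, expected_chunks)
    ("", 10, [""]),                # tiny/empty document still yields a section
    ("short", 10, ["short"]),      # under the limit: one chunk
    ("abcdefghij" * 2, 10, ["abcdefghij", "abcdefghij"]),
]

def test_split_cases():
    for text, max_len, expected in CASES:
        assert naive_split(text, max_len) == expected
```

Adding a new known-input/expected-output pair then means adding one row to the table rather than a new test method.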

@@ -16,7 +16,10 @@
using Azure.Storage.Blobs;
using FluentAssertions;
using Microsoft.Extensions.Logging;
using MudBlazor.Services;
Review comment (Member):

Why is this needed?

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: allText[start..end],
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName);
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: sectionText,
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName);
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

Id: MatchInSetRegex().Replace($"{blobName}-{start}", "_").TrimStart('_'),
Content: allText,
SourcePage: BlobNameFromFilePage(blobName, FindPage(pageMap, start)),
SourceFile: blobName)) { yield return section; }
Review comment (Member):

Suggested change:
- SourceFile: blobName)) { yield return section; }
+ SourceFile: blobName))
+ {
+     yield return section;
+ }

IReadOnlyList<PageDetail> pageMap, string blobName)
{
const int MaxSectionLength = 1_000;
const int SentenceSearchLimit = 100;
const int SectionOverlap = 100;

var sentenceEndings = new[] { '.', '!', '?' };
- var wordBreaks = new[] { ',', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' };
+ var wordBreaks = new[] { ',', '、', ';', ':', ' ', '(', ')', '[', ']', '{', '}', '\t', '\n' };
Review comment (Member):

Lift this to a class-scoped variable to avoid reallocating the same thing multiple times with each call to CreateSectionsAsync.
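
The point generalizes to any language: allocate the separator arrays once at class or module scope instead of rebuilding identical arrays on every call. A hypothetical Python sketch (names are illustrative, not the PR's code):

```python
# Module-level constant, allocated once, instead of recreating the same
# array inside every call. Mirrors the reviewer's class-scoped-field advice.
WORD_BREAKS = (",", "、", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n")

def find_word_break(text: str, start: int) -> int:
    """Index of the first word-break character at or after start, or -1."""
    for i in range(start, len(text)):
        if text[i] in WORD_BREAKS:
            return i
    return -1
```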

Successfully merging this pull request may close these issues.

PDF upload will not generate or index sections if the number of characters on the page is less than 1000
4 participants