feat: Add page_number attribute to document segments and update related retrieval logic #13742

Open
wants to merge 1 commit into
base: main

@cpwan cpwan commented Feb 14, 2025

Summary

This feature was first introduced in #7749 but was then reverted because of a bug (see #8211). Since then, a couple of issues have asked for the feature to be reintroduced.

The problem with the original feature in #7749 is that it did not consider that not all documents have page number information, for example txt and md files. It also stored the page number attribute alongside the embedding, which is not the most natural place to keep it.

In this pull request, the page number is added to Document Segment, which requires a change to the database schema. The page number is then retrieved as part of the metadata.
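
The idea can be illustrated with a minimal sketch. This is not the code from this pull request: the `document_segments` table, the helper names (`create_schema`, `migrate_add_page_number`, `add_segment`, `retrieve_metadata`), and the use of SQLite are all assumptions made for illustration. It shows a nullable `page_number` column on the segment and a metadata lookup that omits the key when the source document has no pages.

```python
import sqlite3
from typing import Any, Optional


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical pre-existing segment table (not the real Dify schema).
    conn.execute(
        "CREATE TABLE document_segments ("
        "id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
    )


def migrate_add_page_number(conn: sqlite3.Connection) -> None:
    # The column is nullable so that txt/md segments can leave it empty.
    conn.execute("ALTER TABLE document_segments ADD COLUMN page_number INTEGER")


def add_segment(
    conn: sqlite3.Connection, content: str, page_number: Optional[int] = None
) -> int:
    cur = conn.execute(
        "INSERT INTO document_segments (content, page_number) VALUES (?, ?)",
        (content, page_number),
    )
    return cur.lastrowid


def retrieve_metadata(conn: sqlite3.Connection, segment_id: int) -> dict[str, Any]:
    row = conn.execute(
        "SELECT id, page_number FROM document_segments WHERE id = ?",
        (segment_id,),
    ).fetchone()
    if row is None:
        raise KeyError(segment_id)
    metadata: dict[str, Any] = {"segment_id": row[0]}
    # Only expose page_number when the source document actually had pages.
    if row[1] is not None:
        metadata["page_number"] = row[1]
    return metadata


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    migrate_add_page_number(conn)
    pdf_id = add_segment(conn, "text from a PDF page", page_number=3)
    txt_id = add_segment(conn, "text from a plain txt file")
    print(retrieve_metadata(conn, pdf_id))
    print(retrieve_metadata(conn, txt_id))
```

Keeping the column nullable is one way to cover the case that the original change in #7749 missed: segments from formats without pages simply carry no `page_number` in their metadata.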

Resolves #8502
Resolves #11891

Screenshots

[Before and After screenshots]

Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

@dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 💪 enhancement (New feature or request) labels on Feb 14, 2025
@lps-llm force-pushed the segment_with_page_no branch from 34157b3 to 00ba1ba on February 14, 2025 09:31
@crazywoola requested a review from JohnJyong on February 14, 2025 09:40
Development

Successfully merging this pull request may close these issues.

  • return page number of word docx documents upon retrieval
  • PDF page number is absent in knowledge retrieval