Words containing "ff" are not parsing properly #755

Open
rbdeveloper12 opened this issue Jan 6, 2025 · 0 comments
rbdeveloper12 commented Jan 6, 2025

When parsing PDF files, words containing "ff" are not saved correctly in the word table through pdfparser.

For example, "puff" is saved as "pu".

It only happens with words in the content, not in the title. In other words, if "stuff" is a title word, then "stuff" is saved. If "stuff" is in the content, then only "stu" is saved.

Upon further testing, it appears that two consecutive "f" characters are treated as a space, anywhere in a word.

The word "boardfflamps" is parsed and saved as 2 words, "board" and "lamps".

The word "Cablefknoss", which has only a single "f", is parsed and saved correctly as "cablefknoss".
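
One possible explanation, offered only as a guess and not confirmed anywhere in this report: body fonts in PDFs often draw "ff" as a single ligature glyph (U+FB00), and a word splitter that only accepts ASCII letters would then treat that glyph as a separator. The sketch below is not pdfparser code. The sample string and both helper functions are hypothetical, and it only shows that this assumption would reproduce the three results above.

```python
import re
import unicodedata

# Hypothetical extracted text: each "ff" arrives as the single ligature U+FB00.
# "cablefknoss" has only one "f", so no ligature is involved.
extracted = "pu\ufb00 board\ufb00lamps Cablefknoss"


def split_words(text):
    """ASCII-only tokenizer (assumed): anything outside a-z separates words."""
    return re.findall(r"[a-z]+", text.lower())


def split_words_normalized(text):
    """Same tokenizer, but NFKC first folds U+FB00 back into plain 'ff'."""
    return split_words(unicodedata.normalize("NFKC", text))


naive = split_words(extracted)
fixed = split_words_normalized(extracted)

print(naive)
print(fixed)
```

If this is what is happening, it might also fit the title case: a title set in a different font may simply not use the ligature. Checking the raw extracted text for the character U+FB00 would confirm or rule this out.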

k00ni added the bug label Jan 6, 2025