Crashes on every file i tested (more than 100) with UnicodeEncodeError error. #291

ruslankiskinov · 2025-01-17T15:47:57Z

The tool crashes with a UnicodeEncodeError on every PDF file I tested. Each file fails on a different character.
It does not try to skip the offending character; it just crashes and the output is empty, which makes the tool useless.
I tested files in Cyrillic, French, and German, plus some in English. Only extremely simple files convert successfully.

Unfortunately, I can't share those files here.

Environment:
Windows / Python 3.12
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>

Or, for an Excel file:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 296: character maps to <undefined>

I can provide an example with the manual of my SONY headphones:
https://www.sony.com/electronics/support/res/manuals/4559/45598331M.pdf

Here is the error:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xe7' in position 23: character maps to <undefined>

I don't know why it tries to use the CP1251 code page, since the file is a PDF with no Cyrillic content in it.
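
The failure does not seem to depend on the PDF itself and can be reproduced without markitdown. A minimal sketch, assuming the output stream is encoded as cp1251 as the traceback suggests; `print_with_encoding` is a hypothetical helper, not part of markitdown:

```python
import io


def print_with_encoding(text: str, encoding: str, errors: str = "strict") -> bytes:
    """Print text to an in-memory stream that uses the given encoding.

    Hypothetical helper: the stream stands in for sys.stdout on a console
    or redirect whose encoding is `encoding`.
    """
    raw = io.BytesIO()
    # newline="\n" keeps the output identical on Windows and POSIX.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


sample = "f\u00fcr"  # contains U+00FC, the character from the first traceback

try:
    print_with_encoding(sample, "cp1251")
except UnicodeEncodeError as exc:
    # cp1251 has no mapping for U+00FC, so strict encoding fails.
    print("cp1251 failed:", exc.reason)

# UTF-8 can encode the character; errors="replace" substitutes it instead of crashing.
print(print_with_encoding(sample, "utf-8"))
print(print_with_encoding(sample, "cp1251", errors="replace"))
```

The same text encodes fine as UTF-8, so the crash appears to come from the encoding chosen for the output stream rather than from the document content.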

kristofmulier · 2025-01-18T16:01:34Z

Try this as a workaround:

>chcp 65001
Active code page: 65001

>set PYTHONIOENCODING=utf-8

>markitdown my_document.pdf > my_document.md
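
The same idea can probably be applied from Python when the conversion is run from a script instead of the CLI. A rough sketch, assuming the converted text is already available as a str (for example the `result.text_content` value seen in the traceback); both helper names are hypothetical:

```python
import sys
from pathlib import Path


def force_utf8_stdout() -> None:
    """Switch sys.stdout to UTF-8 so print() does not hit a 'charmap' error."""
    # reconfigure() is available on io.TextIOWrapper since Python 3.7;
    # the guard covers replaced stdout objects that lack it.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")


def save_markdown(text: str, path: str) -> None:
    """Write text as UTF-8 regardless of the console code page."""
    Path(path).write_text(text, encoding="utf-8")
```

Calling `force_utf8_stdout()` before printing mirrors `set PYTHONIOENCODING=utf-8`, while `save_markdown(text, "my_document.md")` sidesteps the console encoding entirely by writing the file directly.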

ruslankiskinov changed the title ~~Crashes on every file i tested (more than 100) wiht unicode errors.~~ Crashes on every file i tested (more than 100) with UnicodeEncodeError error. Jan 17, 2025
