Enhance _markitdown.py to support embedding images in markdown #205
@microsoft-github-policy-service agree |
Looks promising. @MauroDruwel, can you please add some test cases? Also, sanitizing filenames is a task that will come up a lot (and may already be implemented). I'm going to ask around for advice on how to handle this broadly and robustly (without regular expressions). Filenames may also have other restrictions (e.g., length) on some OSs. |
Hi @afourney, I've added the following improvements:
Let me know if you need any further adjustments! |
Hi @afourney, is there anything left that I need to do? |
This update enhances the `DocxConverter` class by adding functionality to extract and embed images when converting DOCX files to Markdown. The changes include:
- `convert_image` method: Handles image extraction, sanitizes the filename, and saves the image as a PNG in the specified output directory.
- `convert` method: Integrates the `convert_image` method with the Mammoth library's HTML conversion process, ensuring images are extracted and included in the final output.
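
For readers following along, here is a minimal sketch of how the two pieces described above could fit together. It is not the code from this PR. The helper names `sanitize_filename` and `docx_to_html`, the allowed-character set, and the 100-character limit are all assumptions made for illustration. The sketch sanitizes without regular expressions, as suggested in the review, and writes the image bytes as-is with an extension guessed from the content type; re-encoding to PNG would need an imaging library such as Pillow and is left out.

```python
import itertools
import mimetypes
import os
import string

# Assumption: a conservative allow-list; anything else becomes "_".
SAFE_CHARS = set(string.ascii_letters + string.digits + "-_.")
MAX_NAME_LENGTH = 100  # assumed limit, well under common OS maximums


def sanitize_filename(name, default="image"):
    """Reduce a name to safe characters without using regular expressions."""
    # Drop any directory part, treating backslashes as separators too.
    base = os.path.basename(name.replace("\\", "/"))
    cleaned = "".join(c if c in SAFE_CHARS else "_" for c in base)
    # Avoid hidden files and names made only of dots or underscores.
    cleaned = cleaned.strip("._")
    return cleaned[:MAX_NAME_LENGTH] or default


def convert_image(image, output_dir, index):
    """Save one embedded image and return attributes for its <img> element."""
    os.makedirs(output_dir, exist_ok=True)
    extension = mimetypes.guess_extension(image.content_type or "") or ".bin"
    alt_text = getattr(image, "alt_text", None) or ""
    stem = os.path.splitext(sanitize_filename(alt_text))[0] or "image"
    path = os.path.join(output_dir, f"{stem}_{index}{extension}")
    # Mammoth image objects expose open(), which yields a binary stream.
    with image.open() as source, open(path, "wb") as target:
        target.write(source.read())
    return {"src": path, "alt": alt_text}


def docx_to_html(docx_path, output_dir):
    """Convert a DOCX file to HTML, saving embedded images to output_dir."""
    import mammoth  # third-party; imported here so the helpers work without it

    counter = itertools.count(1)

    def handler(image):
        return convert_image(image, output_dir, next(counter))

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file, convert_image=mammoth.images.img_element(handler)
        )
    return result.value
```

The numeric suffix keeps files from overwriting each other when several images share the same alt text or have none. The returned HTML would still need to go through the existing HTML-to-Markdown step to produce the final output.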