Support for nested <ul>/<ol> tags without <li> (not technically valid) #445
- fix invalid markdown when <ul> is nested inside an empty <li>
- add support for nested <ul>/<ol> outside of <li> (a common and widely supported usage, though not technically valid)
- empty li is allowed in html and markdown (who does this on purpose?)
- fixed edge case of numbering when <ol>s are nested without <li>
I support merging this PR. I had to check out the repo and cherry-pick these commits using |
I don't think "being simple / easy to fix" or "browsers are able to display that" is a valid argument here. The decision for this library (made a long time ago) is that it supports valid HTML and nothing more. We don't want to break this rule, as it might invite all sorts of non-standard hacks. However, we want to allow users to modify rules and extend the functionality if they want to do that. That is a long-standing task (unfortunately with very little momentum behind it at the moment). But various errors in the HTML can easily be fixed by pre-processing the DOM before passing it to Turndown. That way you don't have to make Turndown do anything non-standard. I think that would be the ideal solution for this use case. (Side note: I am not a full maintainer here, I have simply fixed a few things here and there, so my opinion is not binding.) |
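For reference, the pre-processing approach suggested above could look roughly like the sketch below. It is not part of Turndown: `isList` and `wrapOrphanLists` are hypothetical helper names, and moving an orphaned list into the preceding `<li>` is an assumption about the intended meaning (it mirrors how such lists are visually nested). The sketch relies only on standard DOM members (`children`, `previousElementSibling`, `appendChild`, `insertBefore`, `ownerDocument.createElement`).

```javascript
// Hypothetical helpers (not part of Turndown): repair lists that sit directly
// inside another list so the DOM handed to Turndown is valid HTML.
function isList(node) {
  return node.nodeName === 'UL' || node.nodeName === 'OL';
}

function wrapOrphanLists(root) {
  // Snapshot the children first, because moving nodes mutates the collection.
  for (const child of Array.from(root.children)) {
    // Repair deeper levels before touching this one.
    wrapOrphanLists(child);

    if (isList(root) && isList(child)) {
      const previous = child.previousElementSibling;
      if (previous && previous.nodeName === 'LI') {
        // Assumption: the orphaned list belongs to the item just before it.
        previous.appendChild(child);
      } else {
        // No preceding item: wrap the list in a new, otherwise empty <li>.
        const wrapper = root.ownerDocument.createElement('li');
        root.insertBefore(wrapper, child);
        wrapper.appendChild(child);
      }
    }
  }
  return root;
}
```

In a browser, one would presumably parse the HTML (for example with `DOMParser`), call `wrapOrphanLists` on the resulting body element, and then pass that element to Turndown instead of the raw string.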
I can understand saying that this shouldn't be supported, and I'll respect if that's the decision because of performance or principles or whatever, and I don't want to be difficult...but the phrase "browsers are able to display that" drastically and unfairly understates the argument. This is a de-facto standard, not a hack. As far as I can tell, absolutely every browser displays this HTML in exactly the same way by default. The lists are also clearly nested in the code. Turndown corrupts the meaning of content that has been clearly and consistently conveyed by every browser since at least the days of IE8. Browser screenshots from BrowserStack below. I think this is as strongly as I can state the case; I'm gonna back out now. Cheers! |
Oh, and even if this is not committed, you shouldn't have to cherry-pick the commits or preprocess the DOM to fix this in your own code; since Turndown does allow you to add and replace the rules as needed, you should be able to copy/paste the rules for lists and list items from the Turndown code and make the modifications there. I'm posting some code from one project for reference, but you might have to change it for your uses. Hope this helps @alexander-turner so that you don't have to basically maintain your own fork of the entire project.
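The code originally attached to this comment is not reproduced here. As a rough illustration of the rule-based approach only, a custom rule might look like the sketch below. It assumes Turndown's documented rule shape (a `filter` plus a `replacement(content, node, options)` function) and that rules added with `addRule` are consulted before the built-in ones. The names `isListNode` and `orphanNestedListRule`, and the four-space indent, are assumptions for illustration, not the PR's actual code.

```javascript
// Hypothetical rule object; the names here are illustrative, not Turndown's.
function isListNode(node) {
  return Boolean(node) && (node.nodeName === 'UL' || node.nodeName === 'OL');
}

const orphanNestedListRule = {
  // Match only lists whose direct parent is itself a list (no <li> between).
  filter: (node) => isListNode(node) && isListNode(node.parentNode),

  // `content` is the Markdown already generated for the list's children.
  replacement: (content) => {
    const indented = content
      .replace(/^\n+|\n+$/g, '') // trim surrounding blank lines
      .split('\n')
      .map((line) => (line ? '    ' + line : line)) // assumed 4-space indent
      .join('\n');
    // End with one newline so the next sibling item starts on its own line.
    return indented + '\n';
  },
};

// Assumed registration, given an existing TurndownService instance:
// turndownService.addRule('orphanNestedList', orphanNestedListRule);
```

This handles only the indentation of the orphaned list; numbering of `<ol>` items that follow it would still need a separate adjustment to the list-item rule.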
|
You might want to use something like this to preprocess the invalid HTML. Update: this lib unfortunately does not fix nested lists. Still, preprocessing the DOM is the cleaner way to do it. The more the rules look inside and around the elements they process, the less readable the code becomes and the less stable the rules get, especially for users customizing them. Also, we want to track context in the future: the rules will be provided with the path from the DOM root to the current node. That needs reasonable DOM input and a one-thing-per-rule approach. What could be added in future versions is a DOM preprocessing hook that would be called after the DOM is cloned by Turndown. By the way, where did you guys get your HTML with broken nested lists? What generates it? |
I'm opening this pull request for your re-consideration of issue #232. My reasoning is this:
Please see my comment on #232.
Note that this PR differs slightly from the code that I pasted there; I think it's unnecessary to "fix" the case with an empty <li> element (<li><ul>...</ul></li>) because the meaning is preserved in the resulting markdown.