Add flag to force html mode #673

nacnudus · 2022-07-01T17:21:54Z

This is an attempt to address lycheeverse#671 by adding a flag `--html` to parse the input as HTML. Otherwise, STDIN and local files without the `.html` suffix are parsed as plain text.

mre · 2022-07-07T11:32:34Z

Thanks for pushing this forward @nacnudus.

Some questions:

- What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the `--html` flag always override the file extension?
- Are there any other tools that have a similar feature? I know that ripgrep has a `--type` option, which allows a user to filter input files by type (but not parse/interpret the input files differently). That's the closest I could find. (A precedent would help us make the best design decision for lychee.)

We could also change the option to --input-type <format>, which is a bit more flexible. E.g. --input-type markdown would enforce the Markdown parser instead. It still would not work for different input types, though, but that might be fine.

Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension. Not pretty, but we avoid maintaining the option.

Looking forward to your thoughts.
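
For illustration, the temporary-file workaround mentioned above could be sketched roughly as follows. This is not code from lychee or from this PR; `spool_as_html`, `dir`, and `stem` are hypothetical names, and the sketch only assumes that a file ending in `.html` is picked up as HTML by the extension check.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Copy everything from `input` into `<dir>/<stem>.html` and return the path.
/// Hypothetical helper: the `.html` suffix is what makes an extension-based
/// check treat the content as HTML.
fn spool_as_html<R: Read>(mut input: R, dir: &Path, stem: &str) -> io::Result<PathBuf> {
    let path = dir.join(format!("{stem}.html"));
    let mut file = File::create(&path)?;
    // Stream the bytes through without loading the whole input into memory.
    io::copy(&mut input, &mut file)?;
    Ok(path)
}
```

A caller could pass `std::io::stdin().lock()` as `input` and `std::env::temp_dir()` as `dir`, then hand the returned path to lychee as a normal file argument.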

nacnudus · 2022-07-16T09:36:04Z

Hi, thanks for your reply, and apologies for my slowness responding.

> What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the `--html` flag always override the file extension?

Yes, `--html` would always override the file extension. I really have Unix pipes in mind, which don't have file extensions.

> Are there any other tools that have a similar feature?

ImageMagick requires an image format specifier to read an APNG image sequence; otherwise it assumes that the input is PNG and reads only the first frame.

> We could also change the option to `--input-type <format>`

I agree, that would be better.

> Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension.

That wouldn't work for STDIN, but perhaps lychee isn't the best tool for that anyway, because pipes can't benefit from lychee's multithreading, as far as I know.

lebensterben · 2022-07-20T17:41:20Z

Currently we rely on the extension to determine the file format, and unsupported formats are treated as plaintext, where we simply grab URLs.

This process is VERY inefficient due to the underlying implementation of linkify.

Instead, I think we can TRY to parse plaintext as HTML first and, should that fail, fall back to linkify.

This is helpful given that the most likely use case of lychee is to validate links in HTML rather than any other format.
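
The "try HTML first, fall back to plain text" idea can be sketched with only the standard library. Both extractors below are deliberately naive stand-ins and are assumptions of this sketch: `html_hrefs` is not a real HTML parser, `plaintext_urls` is not linkify, and "no `href` found" is used here as the signal that HTML parsing failed. All function names are hypothetical.

```rust
/// Naive stand-in for an HTML parser: collect the values of double-quoted
/// `href="..."` attributes. Returns `None` when none are found, which this
/// sketch treats as "parsing as HTML failed".
fn html_hrefs(input: &str) -> Option<Vec<String>> {
    let marker = "href=\"";
    let mut links = Vec::new();
    let mut rest = input;
    while let Some(start) = rest.find(marker) {
        let after = &rest[start + marker.len()..];
        match after.find('"') {
            Some(end) => {
                links.push(after[..end].to_string());
                rest = &after[end + 1..];
            }
            // Unterminated attribute: stop scanning.
            None => break,
        }
    }
    if links.is_empty() { None } else { Some(links) }
}

/// Naive stand-in for linkify: whitespace-separated tokens that start with
/// an http(s) scheme, with trailing punctuation trimmed.
fn plaintext_urls(input: &str) -> Vec<String> {
    input
        .split_whitespace()
        .filter(|t| t.starts_with("http://") || t.starts_with("https://"))
        .map(|t| t.trim_end_matches(|c: char| ",.;)".contains(c)).to_string())
        .collect()
}

/// Try the HTML path first; fall back to the plain-text scan.
fn extract_links(input: &str) -> Vec<String> {
    html_hrefs(input).unwrap_or_else(|| plaintext_urls(input))
}
```

Note that in this sketch only the HTML path can return a relative link such as `./docs`; the plain-text stand-in only recognises absolute URLs.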

mre · 2023-01-17T14:24:45Z

> This is helpful given that the most likely use case of lychee is to validate links in HTML rather than any other format.

I'm not sure if that's actually true. Most users use lychee-action, and I wouldn't be surprised if the majority of links are checked in Markdown files.

> This process is VERY inefficient due to the underlying implementation of linkify.

Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.

lebensterben · 2023-01-20T19:59:43Z

Plain text files are text/plain and Markdown files are text/markdown, so

> I think we can TRY to parse plaintext as html first

doesn't apply to Markdown files, and as such

> I wouldn't be surprised if the majority of links are checked in Markdown files

is irrelevant.

lebensterben · 2023-01-20T20:04:44Z

> Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.

Parsing HTML is definitely faster than using linkify on normal-sized HTML files.
This is because a parsed HTML file is structured data, and links only appear in certain tags.

HU90m · 2023-09-03T11:49:33Z

My preference would be a ripgrep/fd-style type option, specifically a `--default-type` option which allows the user to override the behaviour for a file without an extension, e.g. `--default-type html`.
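
To make the proposed semantics concrete, here is a minimal sketch of how a default type might be resolved. It is an assumption-laden illustration, not lychee's code: `FileType`, `resolve`, and the list of recognised extensions are hypothetical, and STDIN is modelled as a missing path.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileType {
    Html,
    Markdown,
    Plaintext,
}

/// Pick a parser for an input. `path` is `None` for STDIN.
/// `default_type` models the proposed `--default-type` value and is only
/// consulted when the input has no extension at all.
fn resolve(path: Option<&Path>, default_type: Option<FileType>) -> FileType {
    let ext = path.and_then(|p| p.extension()).and_then(|e| e.to_str());
    match ext {
        // An extension is present: it always wins over the default.
        Some(ext) => match ext.to_ascii_lowercase().as_str() {
            "html" | "htm" => FileType::Html,
            "md" | "markdown" => FileType::Markdown,
            // Unrecognised extensions stay plain text (illustrative list).
            _ => FileType::Plaintext,
        },
        // No extension or STDIN: use the default, else plain text.
        None => default_type.unwrap_or(FileType::Plaintext),
    }
}
```

Under this sketch, `README.md` is still parsed as Markdown even when the default is HTML, whereas the `--html` flag as originally proposed would override the extension for every input.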

mre · 2023-09-03T11:56:20Z

I like that.

Add flag to force html mode

1ef782e

nacnudus mentioned this pull request Jul 1, 2022

Relative links aren't detected from STDIN or local files #671

Open

nacnudus added 2 commits July 3, 2022 22:36

Update README section USAGE

88120e9

Lint

d87b687

nacnudus force-pushed the add-html-flag branch from 8d90c79 to d87b687 Compare July 3, 2022 22:17

mre added waiting-for-feedback request-for-comments labels Jul 11, 2022
