extraction: improve spacing in item, cell and code blocks #772

unsleepy22 · 2024-12-24T12:18:35Z

Currently, preserve_space is forced to False in the sanitize method when converting XML to text. For code blocks, especially Python code, we want to preserve spaces to keep the original formatting, since whitespace is significant in Python.
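
As a rough illustration of the behaviour described above, the sketch below shows a whitespace-normalizing helper with a `preserve_space` flag. The name `sanitize_sketch` and its logic are hypothetical; this is not trafilatura's actual `sanitize` implementation, only a minimal stand-in showing why collapsing whitespace damages Python code.

```python
import re


def sanitize_sketch(text: str, preserve_space: bool = False) -> str:
    """Hypothetical stand-in for a sanitize step (not trafilatura's real code)."""
    lines = []
    for line in text.splitlines():
        if preserve_space:
            # Keep leading indentation, only drop trailing whitespace.
            cleaned = line.rstrip()
        else:
            # Collapse every run of whitespace and trim both ends.
            cleaned = re.sub(r"\s+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


snippet = "def f(x):\n    if x:\n        return 1\n"
print(sanitize_sketch(snippet))                       # indentation is lost
print(sanitize_sketch(snippet, preserve_space=True))  # indentation is kept
```

With the flag off, the nested `if` and `return` lines end up flush left, which is no longer valid Python.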

codecov · 2024-12-24T12:24:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.29%. Comparing base (fbdffe3) to head (5f9eeb1).
Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   99.27%   99.29%   +0.01%     
==========================================
  Files          21       21              
  Lines        3601     3663      +62     
==========================================
+ Hits         3575     3637      +62     
  Misses         26       26


adbar · 2024-12-24T16:25:38Z

Hi @unsleepy22, a few comments on the PR:

- I believe this fixes "Preserve horizontal space in code blocks" (#553), could you please check?
- There is already a preserve_spaces parameter in htmlprocessing/handle_textnode, is it necessary to use a new argument across multiple functions?
- This behavior could actually become the default if there are no adverse effects, it would be easier for maintenance.

Last, if you have several PRs in store about unrelated parts of the code it would be best to send them all at once, I don't plan to maintain the software on a day-to-day basis, we could both save time by grouping requests.

unsleepy22 · 2025-01-03T16:32:32Z

> Hi @unsleepy22, a few comments on the PR:
>
> I believe this fixes Preserve horizontal space in code blocks #553, could you please check?
>
> There is already a preserve_spaces parameter in htmlprocessing/handle_textnode, is it necessary to use a new argument across multiple functions?
>
> This behavior could actually become the default if there are no adverse effects, it would be easier for maintenance
>
> Last, if you have several PRs in store about unrelated parts of the code it would be best to send them all at once, I don't plan to maintain the software on a day-to-day basis, we could both save time by grouping requests.

Sorry for the late response, I was on vacation for a few days.

Yes, this fixes "Preserve horizontal space in code blocks" (#553).
I also added code to fix #769.

The preserve_spaces option in htmlprocessing/handle_textnode does not apply globally, and to keep backward compatibility I left preserve_spaces set to False by default.

Another change I'd like to make, but which needs discussion, is adding an option such as include_metadata_in_text to keep or remove the metadata dict string from the text. This may cause some confusion since we would then have 3 options regarding metadata, and again it would exist only for backward compatibility. Another way to do this is to change the default behavior of the with_metadata option, which might introduce breaking changes. What do you think?

adbar · 2025-01-03T16:56:09Z

If preserve_spaces is nearly always set to false, wouldn't it be easier to just remove the parameter and consider that space should be preserved in certain elements only like code blocks?
I'm not sure about with_metadata, what do you want to achieve by doing this? Metadata is already added in TXT/MD format when with_metadata=True or is there a bug?

unsleepy22 · 2025-01-04T05:12:31Z

> If preserve_spaces is nearly always set to false, wouldn't it be easier to just remove the parameter and consider that space should be preserved in certain elements only like code blocks?
>
> I'm not sure about with_metadata, what do you want to achieve by doing this? Metadata is already added in TXT/MD format when with_metadata=True or is there a bug?

I'm fine with making preserve_spaces default to True. I'll change my code a bit and fix a few unit tests.

Currently, if with_metadata is set to True, the metadata dict string is prepended to the text. What I want is to exclude the metadata dict string from the text and return it within the Document instead.

adbar · 2025-01-05T17:07:30Z

OK then, you can change the code of preserve_spaces to avoid adding another parameter.

Concerning with_metadata, this is the expected result when people use extract(). To get the metadata separately from the text you could just use bare_extraction(). The text is then in .raw_text, the title in .title, etc.
If this is not enough please open an issue with an example of the output you want to get.

adbar · 2025-02-03T11:32:06Z

Hi @unsleepy22, are you available to make the final changes to the PR?

unsleepy22 · 2025-02-05T15:12:25Z

> Hi @unsleepy22, are you available to make the final changes to the PR?

Sorry, it's been quite a while. I got a bit busy with work (and a long Chinese Spring Festival vacation). I think it'll take a few more days to make the changes.

adbar · 2025-02-05T16:00:56Z

That's fine, best wishes for the Year of the Snake!

unsleepy22 · 2025-02-07T13:26:07Z

This PR fixes #769 and #553

I also forced a space after highlighted text for better compatibility with various Markdown editors, since some of them fail to render text like `**hello,**world`.
However, this PR introduces a minor defect: it adds an extra space in table cells, which looks a bit odd although it has no side effect on Markdown rendering. I'm still trying to find a better way to tackle this.
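
One possible explanation for the rendering failure, assuming the editors in question follow CommonMark's delimiter-run rules: a closing `**` must be part of a right-flanking run, and a run preceded by punctuation only qualifies if it is followed by whitespace or punctuation. The sketch below is a simplified check of that rule (ASCII punctuation only; the helper names are hypothetical and not part of trafilatura or any Markdown library).

```python
import string

PUNCT = set(string.punctuation)  # simplification: ASCII punctuation only


def is_right_flanking(text: str, start: int, end: int) -> bool:
    """Simplified CommonMark test for the delimiter run text[start:end].

    The beginning and end of the text are treated as whitespace.
    """
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    if before.isspace():
        return False
    if before not in PUNCT:
        return True
    # Preceded by punctuation: must be followed by whitespace or punctuation.
    return after.isspace() or after in PUNCT


def last_run_can_close(text: str) -> bool:
    """Check whether the last '**' in text could close strong emphasis."""
    start = text.rindex("**")
    return is_right_flanking(text, start, start + 2)


for sample in ("**hello,**world", "**hello,** world", "**hello**world"):
    print(sample, last_run_can_close(sample))
```

Under this reading, only the first sample fails to close, which matches the case reported above; adding a space after the closing delimiter or moving the comma outside the emphasis both avoid it.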

BTW, #776 looks more elegant in handling code blocks. If it also fixes #553, I think my PR would be a duplicate fix? I'm OK with merging #776 first and then simplifying my PR.

@adbar what do you think?

adbar · 2025-02-07T15:49:35Z

OK then, I will merge #776 and then you can simplify this PR.

I don't really like the idea of adding a space but if it's a common way to write markdown then why not.

unsleepy22 · 2025-02-08T11:39:41Z

I checked that #776 does not fully fix #553, so I just merged master and resolved the conflicts. @adbar would you take a look?

adbar · 2025-02-08T12:27:45Z

@unsleepy22 Everything looks good but can you please remove the additional spacing? Or do you have a format specification for this? I checked Markdown and Commonmark, the current syntax in master is correct.

unsleepy22 · 2025-02-10T08:32:00Z

> @unsleepy22 Everything looks good but can you please remove the additional spacing? Or do you have a format specification for this? I checked Markdown and Commonmark, the current syntax in master is correct.

OK all fixed, would you take a look again?

adbar · 2025-02-10T16:30:24Z

trafilatura/core.py

@@ -487,7 +488,7 @@ def extract_with_metadata(
        include_images: Take images into account (experimental).
        include_formatting: Keep structural elements related to formatting
            (only valuable if output_format is set to XML).
-        include_links: Keep links along with their targets (experimental).
+=        include_links: Keep links along with their targets (experimental).


adbar: This can be removed, right?

adbar: @unsleepy22 Could you please remove it if necessary?

unsleepy22: OK, maybe mis-typed this.

adbar · 2025-02-10T16:32:55Z

@unsleepy22 The PR looks good. However, I just ran tests on the benchmark and the extraction is really slow after your PRs, and it gets even slower after this one. I think we need to fix this.

unsleepy22 · 2025-02-11T01:30:38Z

> @unsleepy22 The PR looks good. however I just ran tests on the benchmark and the extraction is really slow after your PRs, it gets even slower after this one. I think we need to fix this.

OK I'll see to it.

adbar · 2025-02-11T10:21:25Z

Thanks, this is much better but the fast mode (without extraction fallbacks) is still slower than the full mode. Could you do something about that?

Edit: The fast mode is not only slower, now the results are the same as in full mode, so I believe extraction steps happen which shouldn't be there.

unsleepy22 · 2025-02-11T13:10:19Z

> Thanks, this is much better but the fast mode (without extraction fallbacks) is still slower than the full mode. Could you do something about that?
>
> Edit: The fast mode is not only slower, now the results are the same as in full mode, so I believe extraction steps happen which shouldn't be there.

Do you mean that fast mode in the master branch is faster than fast mode in my PR? I checked both on my M4 Max and on a 2019 MBP, and neither shows a significant performance gap (averaged over 30 runs: less than 30 ms difference on the MBP out of an average of 1800 ms per call, and less than 5 ms difference on the M4 Max out of an average of 300 ms per call).

I ran another test which shows that fast mode produces a different result from full mode (I also checked that the fast parameter is passed into the options and takes effect in trafilatura_sequence). Did I miss something?

    from trafilatura import extract
    # ZERO_CONFIG is taken from the project's unit tests (assumed here to be
    # the default config with the minimum size thresholds set to 0).

    htmlstring = ('<html><body>'
                  '<table><tr><td>aa</td><td>bb</td></tr><tr><td>aa</td><td>bb</td></tr></table>'
                  '<table><tr><td>aa</td><td>bb</td></tr><tr><td>aa</td><td>bb</td></tr></table>'
                  '<div><details><summary>Epcot Center</summary><p>Epcot is a theme park at Walt Disney World Resort featuring exciting attractions, international pavilions, award-winning fireworks and seasonal special events.</p></details>'
                  '</div><div>aaa<br/>bbb<br/>ccc</div></body></html>')
    print('\n')
    # Compare the output of fast mode (no fallbacks) with full mode.
    print(extract(htmlstring, fast=True, config=ZERO_CONFIG))
    print(extract(htmlstring, fast=False, config=ZERO_CONFIG))
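
The timing comparison mentioned above (an average over 30 runs) could be reproduced with a small harness along these lines. This is only a sketch: `average_ms`, `fast_call` and `full_call` are hypothetical names, and the two placeholder workloads stand in for calls such as `extract(htmlstring, fast=True, ...)` and `extract(htmlstring, fast=False, ...)`.

```python
import time
from statistics import mean


def average_ms(func, runs: int = 30) -> float:
    """Return the mean wall-clock time of func() in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000)
    return mean(timings)


def fast_call():
    # Placeholder workload; replace with the fast-mode extraction call.
    sum(range(10_000))


def full_call():
    # Placeholder workload; replace with the full-mode extraction call.
    sum(range(20_000))


fast_avg = average_ms(fast_call)
full_avg = average_ms(full_call)
print(f"fast: {fast_avg:.3f} ms, full: {full_avg:.3f} ms, "
      f"gap: {full_avg - fast_avg:.3f} ms")
```

Running both modes in the same process and on the same input keeps the comparison fair; a gap that is small relative to the per-call average is within noise.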

adbar · 2025-02-11T13:17:04Z

@unsleepy22 Yes, the problem is now that the "fast" mode isn't faster than the other anymore (somewhere after your first PRs). Not a huge deal but something we could address in another PR.

unsleepy22 · 2025-02-11T14:02:05Z

> @unsleepy22 Yes, the problem is now that the "fast" mode isn't faster than the other anymore (somewhere after your first PRs). Not a huge deal but something we could address in another PR.

I ran comparison_small.py locally, and it always takes 20+ seconds on every branch I tried (v1.2.2, v2.0.0, current master, my PR).
The main issue is that in the current PR fast mode is not much faster than full mode, am I understanding correctly?

trafilatura/utils.py

adbar · 2025-02-12T12:40:50Z

@unsleepy22 Yes, it has been like this since one of your recent PRs, I'm not sure which, we would have to test.

Please get rid of the superfluous lines so that I can merge this PR, you can solve the fast/full issue in another PR if you find a way.

brycepg · 2025-02-13T23:00:07Z

I'd really like to see this feature present, please merge :)

add preserve_space option to keep original format in code blocks

eed8c0b

adbar linked an issue Dec 24, 2024 that may be closed by this pull request

Preserve horizontal space in code blocks #553

Closed

CodyInnowhere added 3 commits January 3, 2025 23:58

fix adbar#769

9853e56

fix ut

204647b

fix type check

bc83c56

adbar linked an issue Jan 3, 2025 that may be closed by this pull request

Loss format or data when li contains p #769

Closed

CodyInnowhere added 3 commits February 7, 2025 20:56

set preserve_space as default option

4cb750e

fix xml_tei_tests

2f4a123

fix readworld_tests

31352d2

merge master

837358f

set preserve_space default to True and refine list extraction

c1b8a9f

unsleepy22 force-pushed the add-preserve-space-option branch from 2777f85 to c1b8a9f Compare February 10, 2025 08:24

adbar reviewed Feb 10, 2025

improve performance by replacing xpath search with iterative search

13bd44e

improve performance and move item-related checks to utils

aaf5553

adbar linked an issue Feb 11, 2025 that may be closed by this pull request

Slow extraction after recent PRs #786

Closed

adbar reviewed Feb 12, 2025

trafilatura/utils.py (outdated)

minor fixes

5f9eeb1

adbar changed the title ~~add preserve_space option to keep original format in code blocks~~ extraction: improve spacing in item, cell and code blocks Feb 17, 2025

adbar merged commit 729b737 into adbar:master Feb 17, 2025
15 checks passed
