Feature/use huggingface compatible pretokenizer #38

t-yamamura · 2022-01-20T08:18:31Z

No description provided.

eiennohito

Also please add tests for correct token mapping logic (if I am wording that correctly).

sudachitra/pretokenizer/japanese_bert_wordpiece_tokenizer.py

sudachitra/pretokenizer/sudachipy_pretokenizer.py

eiennohito · 2022-01-25T04:31:08Z

sudachitra/pretokenizer/sudachipy_pretokenizer.py

        if word_form_type != WordFormTypes.SURFACE:
-            _ = [ns.replace(ns.normalized, _word_formatter(m)) for ns, m in zip(normalized_strings, morphemes)]
+            for ns, m in zip(normalized_strings, morphemes):


`zip` is incorrect when `len(morphemes) != len(normalized_strings)`.

This must be a single loop, not two loops.

`m.surface() != ''` can also be written as `len(m) != 0`, but in this case there are already `m.begin()` and `m.end()` calls.

So it can be done with something like this:

    result = []
    for m in morphemes:
        b = m.begin()
        e = m.end()
        if b == e:  # empty token
            continue
        token = original[b:e]
        if _word_formatter is not None:
            token.replace(token.normalized, _word_formatter(m))
        result.append(token)
    return result

Note that this code tries to avoid unnecessary computations. The PreTokenizer handler is a performance-critical component, and unnecessary calls should be avoided here.
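The misalignment caused by `zip`, and the single-loop fix, can be reproduced without SudachiPy or tokenizers installed. In the sketch below, `FakeMorpheme`, `FakeNormalizedString`, `split_two_loops`, and `split_single_loop` are hypothetical stand-ins, not the project's code; they only imitate `begin()`, `end()`, slicing, and an in-place `replace`.

```python
from dataclasses import dataclass


@dataclass
class FakeMorpheme:
    """Hypothetical stand-in for a morpheme: offsets plus a normalized form."""
    b: int
    e: int
    norm: str

    def begin(self) -> int:
        return self.b

    def end(self) -> int:
        return self.e


class FakeNormalizedString:
    """Hypothetical stand-in for a sliceable, mutable normalized string."""

    def __init__(self, text: str):
        self.normalized = text

    def __getitem__(self, s: slice) -> "FakeNormalizedString":
        return FakeNormalizedString(self.normalized[s])

    def replace(self, pattern: str, content: str) -> None:
        # Mutates in place instead of returning a new object.
        self.normalized = self.normalized.replace(pattern, content)


def split_two_loops(original, morphemes, word_formatter):
    # First pass drops empty morphemes, so `tokens` can be shorter than `morphemes`.
    tokens = [original[m.begin():m.end()] for m in morphemes if m.begin() != m.end()]
    # Second pass zips against the unfiltered list: pairs shift after an empty morpheme.
    for token, m in zip(tokens, morphemes):
        token.replace(token.normalized, word_formatter(m))
    return tokens


def split_single_loop(original, morphemes, word_formatter=None):
    result = []
    for m in morphemes:
        b, e = m.begin(), m.end()
        if b == e:  # empty token: skip before creating anything
            continue
        token = original[b:e]
        if word_formatter is not None:
            token.replace(token.normalized, word_formatter(m))
        result.append(token)
    return result


text = "AB CD"
morphemes = [
    FakeMorpheme(0, 2, "ab"),
    FakeMorpheme(2, 2, ""),    # empty morpheme
    FakeMorpheme(2, 3, " "),
    FakeMorpheme(3, 5, "cd"),
]


def formatter(m):
    return m.norm


buggy = [t.normalized for t in split_two_loops(FakeNormalizedString(text), morphemes, formatter)]
fixed = [t.normalized for t in split_single_loop(FakeNormalizedString(text), morphemes, formatter)]
print(buggy)  # every token after the empty morpheme gets the wrong replacement
print(fixed)  # each token is paired with its own morpheme
```

Because the token is created and replaced under the same `b == e` check, the pairing cannot drift, and `m.surface()` is never called.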

I agree with your proposal from a performance point of view. 64df1cd

My only concern is that there is an `if` statement in each iteration.
To avoid this, we could use an if statement before the loop, but I think that would be redundant (even though the performance is almost the same).
What do you think?

The best way would probably be to emit different handler functions depending on the condition. That may be overkill (or you can try it and measure; that can be a nice learning experience as well).
My comments on performance were more about calling `m.surface()`, which creates a string object, allocating memory, on each call.
Comparison to `None` should be pretty cheap, but you can measure it.

Also, your previous version was incorrect when the lengths of the MorphemeList and the tokens do not match, so creating a token and calling `replace` should share a single condition anyway, which can't easily be written as two loops.
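The "emit different handler functions" idea can be tried and measured in isolation. The following is a rough sketch under simplifying assumptions: `handler_with_check`, `make_handler`, the tuple-based morphemes, and the plain-string tokens are hypothetical, and since timings vary by machine no numbers are claimed.

```python
import timeit


def handler_with_check(text, morphemes, formatter):
    """Single handler that tests `formatter is not None` on every iteration."""
    result = []
    for m in morphemes:
        b, e = m[0], m[1]
        if b == e:  # empty token
            continue
        token = text[b:e]
        if formatter is not None:
            token = formatter(m)  # plain strings are immutable, so rebind
        result.append(token)
    return result


def make_handler(formatter):
    """Choose the loop body once, so the per-iteration None check disappears."""
    if formatter is None:
        def surface_handler(text, morphemes):
            result = []
            for m in morphemes:
                b, e = m[0], m[1]
                if b == e:
                    continue
                result.append(text[b:e])
            return result
        return surface_handler

    def formatted_handler(text, morphemes):
        result = []
        for m in morphemes:
            if m[0] == m[1]:
                continue
            result.append(formatter(m))
        return result
    return formatted_handler


def take_normalized(m):
    # Hypothetical formatter: morphemes are (begin, end, normalized form) tuples.
    return m[2]


sample_text = "ab" * 1000
sample_morphemes = [(i, i + 2, "AB") for i in range(0, 2000, 2)]
specialized = make_handler(take_normalized)

if __name__ == "__main__":
    t_check = timeit.timeit(
        lambda: handler_with_check(sample_text, sample_morphemes, take_normalized),
        number=200,
    )
    t_spec = timeit.timeit(
        lambda: specialized(sample_text, sample_morphemes),
        number=200,
    )
    print(f"check in loop: {t_check:.4f}s, specialized: {t_spec:.4f}s")
```

Both variants return the same tokens; whether the specialized handlers are worth the duplicated loop is exactly what the measurement is meant to show.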

setup.py

t-yamamura · 2022-01-26T08:54:09Z

> Also please add tests for correct token mapping logic (if I am wording that correctly).

I added these tests (66afb37), except for the ones that can't be tested now due to (#42).

t-yamamura added 2 commits January 20, 2022 17:17

use Dictionary.pre_tokenizer()

3caef1f

add tests for JapaneseBertWordPieceTokenizer

9a7e1d9

t-yamamura requested a review from eiennohito January 20, 2022 08:18

t-yamamura self-assigned this Jan 20, 2022

add pytextspan

633eb05

eiennohito requested changes Jan 21, 2022

t-yamamura added 5 commits January 21, 2022 19:15

use morpheme indexes instead of pytextspan

1b818ea

remove pytextspan import

d43d1ad

add decorator to disable parallelism

b213c8d

disable parallelism on the outside of JapaneseBertWordPieceTokenizer

32cf99c

use for loop instead of list comprehension

5375aa1

eiennohito requested changes Jan 25, 2022

t-yamamura added 2 commits January 25, 2022 15:22

avoid using redundant loop

64df1cd

remove unused package

fb737c8

eiennohito approved these changes Jan 25, 2022

add tests for encoding properties

66afb37

t-yamamura merged commit ecda79e into main Jan 27, 2022
