Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/use huggingface compatible pretokenizer #38

Merged
merged 11 commits into from
Jan 27, 2022

Conversation

t-yamamura
Copy link
Collaborator

No description provided.

@t-yamamura t-yamamura requested a review from eiennohito January 20, 2022 08:18
@t-yamamura t-yamamura self-assigned this Jan 20, 2022
Copy link
Collaborator

@eiennohito eiennohito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please add tests for correct token mapping logic (if I am wording that correctly).

sudachitra/pretokenizer/sudachipy_pretokenizer.py Outdated Show resolved Hide resolved
sudachitra/pretokenizer/sudachipy_pretokenizer.py Outdated Show resolved Hide resolved
if word_form_type != WordFormTypes.SURFACE:
_ = [ns.replace(ns.normalized, _word_formatter(m)) for ns, m in zip(normalized_strings, morphemes)]
for ns, m in zip(normalized_strings, morphemes):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zip is incorrect when `len(morphemes) != len(normalized_strings)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be a single loop, not two loops.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.surface() != '' can also be written as len(m) != 0, but in this case there are already m.begin() and m.end() calls.

So it can be done something like this:

result = []
for m in morphemes:
  b = m.begin()
  e = m.end()
  if b == e: # empty token
    continue
  token = original[b:e]
  if _word_formatter is not None:
    token.replace(token.normalized, _word_formatter(m))
  result.append(token)
return result

Note that this code tries to avoid unnecessary computations. PreTokenizer handler is a performance-critical component and unnecessary calls should be avoided here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your proposal from a performance of view. 64df1cd

My only concern is that there is the if statement in each iteration.
To avoid this, we could use an if statement before the loop, but I think that would be redundant (even though the performance is almost the same).
What do you think?

Copy link
Collaborator

@eiennohito eiennohito Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way would be probably to emit different handler functions depending on the condition. That probably can be overkill (or you can try and measure, that can be a nice learning experience as well).
My comments on the performance were more related to calling m.surface(), which creates a string object, allocating memory, for each call.
Comparison to None should be pretty cheap, but you can measure it.

Copy link
Collaborator

@eiennohito eiennohito Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, your previous version was incorrect in case of mismatching lengths of MorphemeList and tokens so creating a token and calling replace should share a single condition anyway and can't be easily written in two loops.

setup.py Outdated Show resolved Hide resolved
@t-yamamura
Copy link
Collaborator Author

Also please add tests for correct token mapping logic (if I am wording that correctly).

I added these tests (66afb37), except for the ones that can't be tested now due to (#42).

@t-yamamura t-yamamura merged commit ecda79e into main Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants