-
-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/use huggingface compatible pretokenizer #38
Feature/use huggingface compatible pretokenizer #38
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please add tests for correct token mapping logic (if I am wording that correctly).
if word_form_type != WordFormTypes.SURFACE: | ||
_ = [ns.replace(ns.normalized, _word_formatter(m)) for ns, m in zip(normalized_strings, morphemes)] | ||
for ns, m in zip(normalized_strings, morphemes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zip is incorrect when `len(morphemes) != len(normalized_strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be a single loop, not two loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m.surface() != ''
can also be written as len(m) != 0
, but in this case there are already m.begin()
and m.end()
calls.
So it can be done something like this:
result = []
for m in morphemes:
b = m.begin()
e = m.end()
if b == e: # empty token
continue
token = original[b:e]
if _word_formatter is not None:
token.replace(token.normalized, _word_formatter(m))
result.append(token)
return result
Note that this code tries to avoid unnecessary computations. PreTokenizer handler is a performance-critical component and unnecessary calls should be avoided here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your proposal from a performance of view. 64df1cd
My only concern is that there is the if statement in each iteration.
To avoid this, we could use an if statement before the loop, but I think that would be redundant (even though the performance is almost the same).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way would be probably to emit different handler functions depending on the condition. That probably can be overkill (or you can try and measure, that can be a nice learning experience as well).
My comments on the performance were more related to calling m.surface()
, which creates a string object, allocating memory, for each call.
Comparison to None
should be pretty cheap, but you can measure it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, your previous version was incorrect in case of mismatching lengths of MorphemeList
and tokens so creating a token and calling replace should share a single condition anyway and can't be easily written in two loops.
No description provided.