Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/use huggingface compatible pretokenizer #38
Feature/use huggingface compatible pretokenizer #38
Changes from 8 commits
3caef1f
9a7e1d9
633eb05
1b818ea
d43d1ad
b213c8d
32cf99c
5375aa1
64df1cd
fb737c8
66afb37
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zip is incorrect when `len(morphemes) != len(normalized_strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be a single loop, not two loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m.surface() != ''
can also be written aslen(m) != 0
, but in this case there are alreadym.begin()
andm.end()
calls.So it can be done something like this:
Note that this code tries to avoid unnecessary computations. PreTokenizer handler is a performance-critical component and unnecessary calls should be avoided here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your proposal from a performance of view. 64df1cd
My only concern is that there is the if statement in each iteration.
To avoid this, we could use an if statement before the loop, but I think that would be redundant (even though the performance is almost the same).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way would be probably to emit different handler functions depending on the condition. That probably can be overkill (or you can try and measure, that can be a nice learning experience as well).
My comments on the performance were more related to calling
m.surface()
, which creates a string object, allocating memory, for each call.Comparison to
None
should be pretty cheap, but you can measure it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, your previous version was incorrect in case of mismatching lengths of
MorphemeList
and tokens so creating a token and calling replace should share a single condition anyway and can't be easily written in two loops.