HugTokenCraft is a user-friendly Python library that simplifies modifying the vocabulary of a PreTrainedTokenizer from HuggingFace Transformers, with no additional training required. So far it has been validated for BertTokenizer, which uses a WordPiece-based vocabulary.
While adding new tokens to a pre-trained tokenizer is relatively simple, removing tokens is not straightforward. In particular, if you remove the majority of the tokens, the special token IDs become inconsistent. HugTokenCraft makes these operations very simple.
- Create an artificial language from an existing one for language models
- Edit an existing vocabulary
- Remove tokens from a pre-trained tokenizer
- Add tokens to a pre-trained tokenizer
- Change the maximum token length
- Works even when the majority of tokens are removed
You can install HugTokenCraft using pip:
pip install hugtokencraft
Alternatively, install from source:
git clone git@github.com:MDFahimAnjum/HugTokenCraft.git
cd HugTokenCraft
python setup.py install
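To verify the installation, a quick import check is enough (a minimal sketch; the printed message is arbitrary):
#check that the package is importable
from hugtokencraft import editor
print("HugTokenCraft is ready:", editor.__name__)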
Let's take a pre-trained BertTokenizer, which has a vocabulary of roughly 30,000 tokens, and modify it to keep only 20 tokens.
#import library
from hugtokencraft import editor
from transformers import BertTokenizer
import os
#load BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#check
initial_vocab_size=len(tokenizer)
print(f"initial vocab size: {initial_vocab_size}")
#Target vocabulary
target_vocab_size=20
selected_words=editor.get_top_tokens(tokenizer,target_vocab_size)
#parameters
current_directory = os.getcwd()
# Define the path where you want to save the tokenizer
tokenizer_path = os.path.join(current_directory,"ModifiedTokenizer")
model_max_length=128
#reduce vocabulary
modified_tokenizer=editor.reduce_vocabulary(tokenizer,selected_words)
tokenizer_path=editor.save_tokenizer(modified_tokenizer,tokenizer_path,model_max_length)
modified_tokenizer=editor.load_tokenizer(type(tokenizer),tokenizer_path)
#check
new_vocab_size=len(modified_tokenizer)
print(f"new vocab size: {new_vocab_size} words")
Now let's take a pre-trained BertTokenizer and add two new tokens to it.
#import library
from hugtokencraft import editor
from transformers import BertTokenizer
import os
#load BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#check
initial_vocab_size=len(tokenizer)
print(f"initial vocab size: {initial_vocab_size}")
#Target vocabulary
selected_words_add={'hugtoken','hugtokencraft'}
#parameters
current_directory = os.getcwd()
# Define the path where you want to save the tokenizer
tokenizer_path = os.path.join(current_directory,"ModifiedTokenizer")
#expand vocabulary
modified_tokenizer=editor.expand_vocabulary(tokenizer,selected_words_add)
tokenizer_path=editor.save_tokenizer(modified_tokenizer,tokenizer_path,model_max_length=None,isreduced=False)
modified_tokenizer=editor.load_tokenizer(type(tokenizer),tokenizer_path)
#check
new_vocab_size=len(modified_tokenizer)
print(f"new vocab size: {new_vocab_size}")
You can also run the examples directly from the Jupyter notebook example_notebook.ipynb.
Obtains the k most frequently used tokens from the tokenizer vocabulary.
token_set=get_top_tokens(tokenizer,k)
Parameters:
- tokenizer: BertTokenizer
  - Pre-trained BERT tokenizer
- k: int
  - Desired number of tokens
Returns:
- token_set: set
  - Set of the k most frequent tokens
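For example, to collect the 100 most frequent tokens of a pre-trained tokenizer (a minimal sketch; the value 100 is arbitrary):
#get the 100 most frequent tokens from a pre-trained tokenizer
from hugtokencraft import editor
from transformers import BertTokenizer

tokenizer=BertTokenizer.from_pretrained("bert-base-uncased")
top_tokens=editor.get_top_tokens(tokenizer,100)
print(len(top_tokens))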
Adds a set of new tokens to the vocabulary.
modified_tokenizer=expand_vocabulary(tokenizer,tokens_to_add)
Parameters:
- tokenizer: BertTokenizer
  - Pre-trained BERT tokenizer
- tokens_to_add: set
  - Set of tokens to add
Returns:
- modified_tokenizer: BertTokenizer
  - Modified BERT tokenizer
Removes all tokens except the given set of tokens from the vocabulary.
modified_tokenizer=reduce_vocabulary(tokenizer,tokens_to_keep)
Parameters:
- tokenizer: BertTokenizer
  - Pre-trained BERT tokenizer
- tokens_to_keep: set
  - Set of tokens to keep
Returns:
- modified_tokenizer: BertTokenizer
  - Modified BERT tokenizer
Saves the modified tokenizer to disk for later use.
tokenizer_path=save_tokenizer(tokenizer,tokenizer_path,model_max_length=None,isreduced=True)
Parameters:
- tokenizer: BertTokenizer
  - Modified BERT tokenizer to save
- tokenizer_path: str
  - Location path to save the tokenizer
- model_max_length: int
  - New value of the maximum token length
  - Defaults to None, which means no change
- isreduced: bool
  - Whether the modified tokenizer was reduced
  - True if the vocabulary was reduced (default)
  - False if the vocabulary was expanded
Returns:
- tokenizer_path: str
  - Location path where the tokenizer was saved
Loads a tokenizer from a given path.
tokenizer=load_tokenizer(tokenizer_class,tokenizer_path)
Parameters:
- tokenizer_class: type
  - Class of the tokenizer (e.g., BertTokenizer)
- tokenizer_path: str
  - Location path of the saved tokenizer
Returns:
- tokenizer: tokenizer_class
  - Loaded tokenizer
Performs a simple sanity check on the tokenizer.
is_pass=validate_tokenizer(tokenizer)
Parameters:
- tokenizer: BertTokenizer
  - Pre-trained or modified BERT tokenizer
Returns:
- is_pass: bool
  - Validation result
  - True: validation passed
  - False: validation failed
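For instance, you can reload a previously saved tokenizer and confirm it is consistent (a brief sketch assuming a modified tokenizer was already saved to a ModifiedTokenizer directory in the current working directory):
#load the saved tokenizer and run the sanity check
from hugtokencraft import editor
from transformers import BertTokenizer

loaded_tokenizer=editor.load_tokenizer(BertTokenizer,"ModifiedTokenizer")
if editor.validate_tokenizer(loaded_tokenizer):
    print("tokenizer validation passed")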
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions!