From a quick read of the code, it seems that the attention mask tensor is created on the fly during inference. The mask is then broadcast to every prompt token sequence in each layer (normally there is only one sequence, but to allow batched inference we should not assume this). This causes a problem during batch inference because we cannot mask the padded tokens of prompts with different lengths. One workaround for now is to process those pad tokens as well, but that changes the output. I just want to make sure my understanding here is correct.
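For context, here is a minimal sketch of the kind of mask I mean (this is not mlx_lm's actual implementation; the helper name and shapes are my own): a causal mask built from the sequence length alone, which then gets broadcast over the batch and the heads, so it cannot encode per-sequence padding.

import mlx.core as mx

def causal_mask(T, dtype=mx.float32):
    # 0 where attention is allowed, a large negative value above the
    # diagonal; built from the sequence length T only.
    idx = mx.arange(T)
    return ((idx[:, None] < idx[None, :]) * -1e9).astype(dtype)

# The same (T, T) mask is broadcast over the batch dimension and over all
# heads, so it has no way to mark pad positions of individual sequences.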
It would be beneficial to be able to customise these mask tensors, so that we could simply fill the prompt cache with zeros for the padded tokens without affecting the output. I am not sure whether this is possible.
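To be concrete about what I would like to pass in, here is a rough sketch of a combined causal-plus-padding additive mask. The (B, 1, T, T) shape and the helper name padded_causal_mask are just my assumptions about what the attention layers could accept; it assumes left-padded prompts.

def padded_causal_mask(lengths, T, dtype=mx.float32):
    # lengths: per-sequence prompt lengths, prompts left-padded to T.
    # Returns a (B, 1, T, T) additive mask that is strongly negative both
    # above the diagonal (causal) and in columns that are pad positions.
    idx = mx.arange(T)
    causal = idx[:, None] < idx[None, :]                    # (T, T)
    pad = idx[None, :] < (T - mx.array(lengths))[:, None]   # (B, T), True at pads
    mask = (causal[None, :, :] | pad[:, None, :]) * -1e9    # (B, T, T)
    return mask[:, None, :, :].astype(dtype)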
A simple example use case:
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache
import mlx.core as mx

model, tokenizer = load('my/model/path')
pad_token_id = tokenizer.bos_token_id if tokenizer.pad_token_id is None else tokenizer.pad_token_id

# Get the lists of token ids for all the prompts
prompts = [
    'The weather is nice out there',
    'The weather is awful out there, and this is a longer prompt'
]
prompt_tokens = [tokenizer.encode(prompt) for prompt in prompts]

# Get the masks for each token in all the prompts
prompt_lens = [len(pt) for pt in prompt_tokens]
max_prompt_len = max(prompt_lens)
mask = [[-1] * (max_prompt_len - n) + tks for tks, n in zip(prompt_tokens, prompt_lens)]
mask = (mx.array(mask) != -1).astype(mx.int16)

# Pad the shorter prompts
prompt_tokens = [[pad_token_id] * (max_prompt_len - n) + tks for tks, n in zip(prompt_tokens, prompt_lens)]
prompt_tokens = mx.array(prompt_tokens)

# Make the cache
cache = make_prompt_cache(model)

# Get the logits for the next token for each prompt
logits = model(prompt_tokens, mask=mask, cache=cache)
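If something like this worked, the natural follow-up would be to read the next token off the last position of each sequence. A hypothetical continuation, assuming the mask keyword above were accepted:

# Pick the highest-scoring next token for each prompt in the batch.
next_tokens = mx.argmax(logits[:, -1, :], axis=-1)
print([tokenizer.decode([t]) for t in next_tokens.tolist()])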
I realise this might take a lot of rework, but I am just wondering whether it is possible.