How to continue a conversation with more images? #68
Comments
Not yet. It's one of the things I want to add next. My focus at the moment is on the trainer and new models (Pixtral, Llama and Molmo).
It would be awesome if you could implement this! I would be more than happy to help, review and merge the PR 🚀
+1, would love to see this implemented |
I think this will be easier and faster to do after I release prompt caching. That way you are only computing the KV cache for the last message.
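For context, this is what prompt caching buys you: the key/value states for everything already processed stay in memory, so each new turn only needs to prefill the newly appended message. A minimal sketch of that bookkeeping, assuming a hypothetical `prefill_and_generate` helper (not an existing mlx-vlm function):

```python
# Rough sketch of KV-cache reuse across chat turns.
# `prefill_and_generate` is a placeholder, not a real mlx-vlm function;
# the point is only the bookkeeping around the cache.

def prefill_and_generate(model, new_tokens, cache):
    """Placeholder: prefill `new_tokens` on top of an existing KV `cache`,
    sample a reply, and return (reply_text, updated_cache)."""
    raise NotImplementedError

class CachedChat:
    def __init__(self, model, processor):
        self.model = model
        self.processor = processor
        self.cache = None     # per-layer key/value state, reused across turns
        self.seen_tokens = 0  # how much of the conversation is already cached

    def send(self, conversation):
        # Tokenize the whole conversation, but only feed the model the suffix
        # the cache has not seen yet (typically just the latest user message).
        prompt = self.processor.apply_chat_template(
            conversation, add_generation_prompt=True
        )
        tokens = self.processor.tokenizer.encode(prompt)
        new_tokens = tokens[self.seen_tokens:]

        reply, self.cache = prefill_and_generate(self.model, new_tokens, self.cache)
        self.seen_tokens = len(tokens)
        return reply
```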
Hey guys, I thought about it, and here is an example you could use to build this use case. I will work on a more robust example, showcasing different models that support it, and add it as a chat CLI tool in the next release :) The idea is to only add the image tag to the last user message in the messages/conversation list, alongside the latest image.

```python
from mlx_vlm import load
import mlx.core as mx
from mlx_vlm.utils import generate_step, load_image
import time
model_mlx, processor = load("mlx-community/idefics2-8b-4bit")
# Image
url = "/path/to/your/image"
image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": """The image shows a colorful chameleon sitting on a vibrant flower. The chameleon has a blue body with vibrant green and red stripes, and its eyes are wide open, giving it a curious and alert expression. The flower has a mix of pink, yellow, and red petals, adding to the vividness of the scene."""},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare this image to the previous one."},
            {"type": "image"},  # used on the last user message in the list
        ],
    },
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="np"
)
pixel_values = mx.array(inputs['pixel_values'])
input_ids = mx.array(inputs['input_ids'])
mask = mx.array(inputs['attention_mask'])
max_tokens = 1000
verbose = False # Set to True to stream the output
# Get the prompt tokens and the tokenizer
prompt_tokens = mx.array(processor.tokenizer.encode(text_prompt))
tokenizer = processor.tokenizer
# Initialize timing and detokenizer
tic = time.perf_counter()
detokenizer = processor.detokenizer
detokenizer.reset()
# Generate tokens
generator = generate_step(
    input_ids,
    model_mlx,
    pixel_values,
    mask,
    temperature=0.7,
)
prompt_time = 0
for (token, prob), n in zip(generator, range(max_tokens)):
    if n == 0:
        prompt_time = time.perf_counter() - tic
        tic = time.perf_counter()
    if token == tokenizer.eos_token_id and n > 0:
        break
    detokenizer.add_token(token)
    if verbose:
        print(detokenizer.last_segment, end="", flush=True)

token_count = n + 1
detokenizer.finalize()
if verbose:
    print(detokenizer.last_segment, flush=True)
gen_time = time.perf_counter() - tic
print("=" * 10)
if token_count == 0:
    print("No tokens generated for this prompt")
prompt_tps = prompt_tokens.size / prompt_time
gen_tps = (token_count - 1) / gen_time
print(f"Prompt: {prompt_tps:.3f} tokens-per-sec")
print(f"Generation: {gen_tps:.3f} tokens-per-sec")

# Print the generated text
print(detokenizer.text)
```
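To keep the chat going from here, a sketch of the next turn would be to append the generated reply as an assistant message, add a new user message with another image tag, and rerun the same preprocessing with both images. Whether a given processor/model handles multiple images this way is model-dependent, so treat this as an assumption rather than confirmed behaviour:

```python
# Sketch of a follow-up turn, reusing the variables from the example above.
# Multi-image handling varies by model/processor, so this is an assumption,
# not confirmed mlx-vlm behaviour.

conversation.append({
    "role": "assistant",
    "content": [{"type": "text", "text": detokenizer.text}],
})
conversation.append({
    "role": "user",
    "content": [
        {"type": "text", "text": "And how does this new image differ?"},
        {"type": "image"},  # tag for the newly attached image
    ],
})

new_image = load_image("/path/to/another/image")

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt],
    images=[image, new_image],  # every image referenced so far, in order
    padding=True,
    return_tensors="np",
)
# From here, rebuild pixel_values, input_ids and mask with mx.array and call
# generate_step again exactly as above.
```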
Looks like there's new code for chat in this branch: https://github.com/Blaizzy/mlx-vlm/tree/pc/video - e.g. 810fb53 |
Yes there is :) |
It's not clear to me from looking at the code if this library supports the following pattern:
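Something along these lines, where a later user turn carries a new image (illustrative shape only; the text is placeholder):

```python
conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image"},  # first image
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "..."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Now compare it with this one"},
        {"type": "image"},  # second image, attached in a later turn
    ]},
]
```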
Is this something the library can or could do? I'm interested in being able to implement multi-step conversations where images might be attached to future messages.