initial commit of vlm #151
base: main
Conversation
- based on models from https://github.com/Blaizzy/mlx-vlm
- for #132
@@ -1,30 +1,7 @@
// Copyright © 2024 Apple Inc.

import Foundation

public enum StringOrNumber: Codable, Equatable, Sendable {
Move to LMCommon
/// Container for models that guarantees single threaded access.
Move to ModelContainer
Libraries/LLM/LLMModel.swift (outdated)
}
}
}
// TODO move? these cause some ambiguity -- how to resolve?
I was playing around with these to avoid breaking the API -- moving types into LMCommon means callers will need to import LMCommon if they refer to them. The aliases caused more trouble than I think they are worth.
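For illustration, a minimal sketch of the alias approach this describes, using `StringOrNumber` (one of the moved types) as the example:

```swift
import MLXLMCommon

// Re-export the moved type so existing callers keep compiling without
// importing MLXLMCommon themselves. If a caller also imports MLXLMCommon,
// the unqualified name `StringOrNumber` can now resolve two ways -- the
// ambiguity mentioned above.
public typealias StringOrNumber = MLXLMCommon.StringOrNumber
```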
@@ -3,6 +3,7 @@
import Foundation
@preconcurrency import Hub
import MLX
import MLXLMCommon
import MLXNN
import MLXRandom
import Tokenizers
Ultimately I would like this to move into LMCommon -- I think it can support both LLM and VLM models, but I didn't get a chance to move this yet.
import MLXNN
import MLXOptimizers
import MLXRandom
import Tokenizers

/// Layers to apply LoRA adapters to.
Move to LMCommon
return y + scale * z
}
}

/// Equivalent to `lora.py/iterate_batches()`. Used internally by ``LoRATrain``.
struct LoRABatchIterator: Sequence, IteratorProtocol {
Ideally the rest of this moves to LMCommon as well -- I think it can.
mutating func prompt(_ prompt: MLXArray)
func process(logits: MLXArray) -> MLXArray
mutating func didSample(token: MLXArray)
}
The generate / step code has been refactored a bit and can now take custom logit samplers and processors
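For example, a hedged sketch of a custom processor matching the three requirements quoted above (the conforming-type shape is an assumption; only the method signatures come from the diff):

```swift
import MLX

/// Minimal temperature-scaling logit processor.
struct TemperatureProcessor {
    let temperature: Float

    // called once with the prompt tokens -- nothing to track here
    mutating func prompt(_ prompt: MLXArray) {}

    // scale the logits before sampling
    func process(logits: MLXArray) -> MLXArray {
        logits / temperature
    }

    // called after each token is sampled -- nothing to track here
    mutating func didSample(token: MLXArray) {}
}
```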
public init(
    prompt: MLXArray, model: any LanguageModel, cache: [KVCache]? = nil,
    parameters: GenerateParameters
) throws {
This now takes either a prompt (MLXArray) or an LMInput (text + image + ...) via multiple initializers.
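A hedged usage sketch (`TokenIterator` as the enclosing type and the `input:` label are assumptions; only the `prompt:`, `model:`, `cache:`, and `parameters:` labels appear in the diff):

```swift
// prompt-only form
var textOnly = try TokenIterator(
    prompt: promptTokens, model: model, parameters: parameters)

// LMInput form (text + image + ...)
var multimodal = try TokenIterator(
    input: input, model: model, parameters: parameters)
```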
}
}

public struct LMInput {
A new union type that holds the different inputs to generate() and LanguageModel.prepare().
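A hedged sketch of the shape such a union type might take (the field and nested type names are assumptions, not read from the diff):

```swift
import MLX

public struct LMInput {
    /// tokenized text
    public struct Text {
        public let tokens: MLXArray
    }

    /// pre-processed image pixels, if present
    public struct ProcessedImage {
        public let pixels: MLXArray
    }

    public let text: Text
    public let image: ProcessedImage?
}
```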
}
}

public struct LMOutput {
Union type for the output. Some of the VLMs return additional state, which is represented here.
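Again as a hedged sketch with assumed member names:

```swift
import MLX

public struct LMOutput {
    /// next-token logits
    public let logits: MLXArray

    /// extra per-step state some VLMs produce (shape assumed)
    public struct State {
        public let additional: [String: MLXArray]
    }

    public let state: State?
}
```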
Libraries/LMCommon/Models.swift (outdated)
@@ -134,6 +135,7 @@ extension ModelConfiguration {
    extraEOSTokens: ["<|end|>"]
)

// TODO the prompt formatter is replaced by the chat template
Or is it? #150
import CoreImage
import Foundation
import MLX
This file may be deleted -- it was some notes & thoughts along the way
// Copyright © 2024 Apple Inc.

import Foundation
import MLX
Also to be deleted -- LMInput replaces this.
private let context = CIContext()

// TODO documentation
public enum MediaProcessing {
Needs documentation, but see PaliGemmaImageProvider, which implements the following SiglipImageProcessor configuration from the python transformers code:

{
    "do_convert_rgb": null,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [0.5, 0.5, 0.5],
    "image_processor_type": "SiglipImageProcessor",
    "image_seq_length": 1024,
    "image_std": [0.5, 0.5, 0.5],
    "processor_class": "PaliGemmaProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": { "height": 448, "width": 448 }
}
import MLXNN
import Tokenizers

// MARK: - Language
First cut at a port of https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/paligemma
Note: this builds, loads weights and "runs" but doesn't produce any output -- still needs to be debugged.
It should be usable as an example of the structure I think we need.
}
}

// TODO does not support multiple images -- how do we represent?
We need a protocol for the image and text processing pieces.
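Something along these lines, as a hedged sketch (the protocol and method names are assumptions):

```swift
import CoreImage
import MLX

/// Hypothetical per-model processing protocol: turn user text and
/// images into the LMInput consumed by generate().
public protocol VLMInputProcessor {
    func prepare(prompt: String, images: [CIImage]) throws -> LMInput
}
```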
image = MediaProcessing.inSRGBToneCurveSpace(image)

image = MediaProcessing.resampleBicubic(image, to: .init(width: size, height: size))
image = MediaProcessing.normalize(image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))
SiglipImageProcessor:

{
    "do_convert_rgb": null,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [0.5, 0.5, 0.5],
    "image_processor_type": "SiglipImageProcessor",
    "image_seq_length": 1024,
    "image_std": [0.5, 0.5, 0.5],
    "processor_class": "PaliGemmaProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": { "height": 448, "width": 448 }
}
}
}

private func loadConfiguration(url: URL) throws -> PaliGemma {
These next couple of functions are just stubs to let me try it out -- this will work much like the LLM models
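For illustration, a hedged sketch of what such a stub might look like (`PaliGemmaConfiguration` and the file layout are assumptions):

```swift
import Foundation

private func loadConfiguration(url: URL) throws -> PaliGemma {
    // decode config.json from the model directory and build the model --
    // a stand-in until this follows the LLM model loading path
    let data = try Data(contentsOf: url.appendingPathComponent("config.json"))
    let configuration = try JSONDecoder().decode(
        PaliGemmaConfiguration.self, from: data)
    return PaliGemma(configuration)
}
```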
private let _ropeTheta: Float?
public var ropeTheta: Float { _ropeTheta ?? 10_000 }
public let _ropeTraditional: Bool?
public var ropeTraditional: Bool { _ropeTraditional ?? false }
Rather than doing a full custom Codable implementation, I went a simpler route for default values. Less code, and cleaner (I think).
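Spelled out, the pattern looks like this (a minimal sketch; the snake_case CodingKeys mapping is assumed from the usual config.json conventions):

```swift
public struct Configuration: Codable {
    // stored optionals let the synthesized Codable treat the keys as
    // optional in the JSON...
    private let _ropeTheta: Float?
    private let _ropeTraditional: Bool?

    // ...while computed properties supply the defaults
    public var ropeTheta: Float { _ropeTheta ?? 10_000 }
    public var ropeTraditional: Bool { _ropeTraditional ?? false }

    enum CodingKeys: String, CodingKey {
        case _ropeTheta = "rope_theta"
        case _ropeTraditional = "rope_traditional"
    }
}
```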
@Option var path: URL

@MainActor
mutating func run() async throws {
Just stub code to exercise the model. This still needs the input processing layers, in particular the prompt processing. The image processing is in place but will need to be wrapped up API-wise.
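For example, the image side could be exercised along these lines (a hedged sketch reusing the MediaProcessing calls quoted above; the 448×448 size comes from the SiglipImageProcessor configuration):

```swift
import CoreImage

guard var image = CIImage(contentsOf: path) else {
    fatalError("unable to load image at \(path)")
}
// same pipeline as the PaliGemma image processing above
image = MediaProcessing.inSRGBToneCurveSpace(image)
image = MediaProcessing.resampleBicubic(image, to: .init(width: 448, height: 448))
image = MediaProcessing.normalize(
    image, mean: (0.5, 0.5, 0.5), std: (0.5, 0.5, 0.5))
```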
- llm-tool vlm --image path/to/image.jpg
Note: this is not ready for use but feel free to comment! The paligemma model loads and "runs" but doesn't produce valid output. The structure of load/evaluation is getting close.