A custom implementation of PaliGemma, a multimodal vision-language model that combines a SigLIP vision encoder with the Gemma language model.
## Features

- Custom multimodal transformer architecture
- SigLIP-based vision encoder
- Gemma language model integration
- Custom inference pipeline
- Image and text preprocessing
## Project Structure

- `inference.py`: Token generation and model inference
- `model_siglip.py`: Vision encoder implementation
- `modeling_gamma.py`: Core model architecture
- `processing_paligamma.py`: Image/text preprocessing
- `utils.py`: Model loading utilities
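Image preprocessing for a SigLIP-style encoder typically rescales raw pixel values to `[0, 1]` and then normalizes them before they reach the vision tower. A minimal, dependency-free sketch of that step — the function name and the per-channel mean/std of 0.5 are assumptions for illustration, not taken from this repository:

```python
def preprocess_pixels(image, mean=0.5, std=0.5):
    """Rescale 8-bit pixel values to [0, 1], then normalize.

    `image` is a nested list [height][width][channels] of ints in 0..255.
    With mean=std=0.5 (a common SigLIP-style choice, assumed here),
    the output lands in [-1, 1].
    """
    return [
        [[(px / 255.0 - mean) / std for px in pixel] for pixel in row]
        for row in image
    ]
```

For example, a pure-white pixel maps to 1.0 and a pure-black pixel to -1.0 under these assumed normalization constants.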
## Requirements

- Python 3.8+
- PyTorch
- CUDA-capable GPU recommended
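Since a GPU is recommended but not required, device selection usually comes down to a small fallback helper. A sketch of the kind of logic involved — the function name is illustrative, not this repository's API:

```python
import torch


def pick_device() -> torch.device:
    """Prefer an NVIDIA GPU, then Apple Silicon (MPS), then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```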
## Installation

- Clone the repository
- Install dependencies: `pip install -r requirements.txt`
- Download the weights from the HuggingFace PaliGemma repository
- Place them in the `./weights/` directory
## Usage

```shell
chmod +x launch_inference.sh
./launch_inference.sh
```
## Key Features

- Multimodal cross-attention mechanism
- Custom token generation with top-p sampling
- Flexible image preprocessing
- Support for different device types (CUDA, MPS, CPU)
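Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches `p`, then samples only from that set. A dependency-free sketch of the idea — function name, signature, and defaults are illustrative, not this repository's API:

```python
import math
import random


def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample a token index from `logits` using nucleus (top-p) sampling.

    Illustrative sketch: `logits` is a plain list of floats.
    """
    rng = rng or random.Random()
    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens in descending probability until cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize over the nucleus and draw one token from it.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a small `p`, a strongly peaked distribution collapses to greedy decoding, since the nucleus contains only the top token.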
## Limitations

- Single image per inference
- Experimental implementation
- Performance may vary from the official implementation
## License

MIT License