blip-2-vision-language

davila7

Updated 9 days ago

480 views

18,478

1,685

18,478

View on GitHub

DesignMultimodalVision-LanguageImage CaptioningVQAZero-Shot

About

BLIP-2 is a vision-language framework that connects a frozen image encoder with a large language model for multimodal tasks. Use it for zero-shot image captioning, visual question answering, or image-text retrieval without task-specific fine-tuning. It's ideal for developers needing to add state-of-the-art visual understanding to LLM-based applications.

Quick Install

Claude Code

Recommended

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin CommandAlternative

/plugin add https://github.com/davila7/claude-code-templates

Git CloneAlternative

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/blip-2-vision-language

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/multimodal-blip-2

anthropicanthropic-claudeclaudeclaude-code

Related Skills

stable-diffusion-image-generation

audiocraft-audio-generation

whisper

Other

Whisper is OpenAI's multilingual speech recognition model for transcription and translation across 99 languages. It handles tasks like speech-to-text, podcast transcription, and processing noisy or multilingual audio. Developers should use it for robust, production-ready automatic speech recognition (ASR).

View skill