gemini-vision
About
The gemini-vision skill enables Claude to use Google's Gemini API for advanced image analysis. It covers image captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use this skill when building applications that process images, answer visual questions, or detect objects in visual content.
Documentation
Gemini Vision API Skill
This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.
Quick Start
Prerequisites
- Get API Key: Obtain from Google AI Studio
- Install SDK: pip install google-genai (Python 3.9+)
- If pip is not installed, instruct the user to install it first.
API Key Configuration
The skill supports both Google AI Studio and Vertex AI endpoints.
Option 1: Google AI Studio (Default)
The skill checks for GEMINI_API_KEY in this order:
- Process environment: export GEMINI_API_KEY="your-key"
- Project root: .env
- .claude directory: .claude/.env
- .claude/skills directory: .claude/skills/.env
- Skill directory: .claude/skills/gemini-vision/.env
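The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the names `find_api_key` and `read_env_file` are hypothetical, and the .env parsing is deliberately simplified to plain KEY=value lines.

```python
import os
from pathlib import Path

# Candidate .env locations, in the order listed above (relative to the project root).
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-vision/.env",
]


def read_env_file(path, name):
    """Return the value of `name` from a simple KEY=value file, or None."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'")
    return None


def find_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for relative in ENV_FILE_CANDIDATES:
        value = read_env_file(Path(project_root) / relative, name)
        if value:
            return value
    return None
```

The first location that yields a non-empty value wins, so a key exported in the shell overrides anything stored in a .env file.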
Get your API key: Visit Google AI Studio
Option 2: Vertex AI
To use Vertex AI instead:
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional, defaults to us-central1
Or in .env file:
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
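One way these variables might be turned into a client is sketched below. The helper names `client_kwargs` and `make_client` are hypothetical, and treating GEMINI_USE_VERTEX as a case-insensitive "true" is an assumption about how the scripts read it. The `vertexai`, `project`, `location`, and `api_key` arguments are those accepted by `genai.Client` in the google-genai SDK.

```python
import os


def client_kwargs(environ=None):
    """Build keyword arguments for genai.Client from the variables above."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true":
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Same default as documented above.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


def make_client(environ=None):
    """Create the SDK client; requires `pip install google-genai`."""
    # Imported lazily so client_kwargs can be used and tested without the SDK.
    from google import genai

    return genai.Client(**client_kwargs(environ))
```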
Security: Never commit API keys to version control. Add .env to .gitignore.
Core Capabilities
Image Analysis
- Captioning: Generate descriptive text for images
- Classification: Categorize and identify image content
- Visual QA: Answer questions about image content
- Multi-image: Compare and analyze up to 3,600 images
Advanced Features (Model-Specific)
- Object Detection: Identify and locate objects with bounding boxes (Gemini 2.0+)
- Segmentation: Create pixel-level masks for objects (Gemini 2.5+)
- Document Understanding: Process PDFs with vision (up to 1,000 pages)
Supported Formats
- Images: PNG, JPEG, WEBP, HEIC, HEIF
- Documents: PDF (up to 1,000 pages)
- Size Limits:
- Inline: 20MB max total request size
- File API: For larger files
- Max images: 3,600 per request
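The limits above lend themselves to a small pre-flight check. The sketch below is illustrative only: `plan_request` is a hypothetical name, the file extension is used as a rough proxy for the real format, 20MB is interpreted as 20 × 1024 × 1024 bytes, and the prompt text (which also counts toward the inline request size) is ignored.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif", ".pdf"}
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumed reading of "20MB max total request size"
MAX_IMAGES = 3600


def plan_request(files):
    """Decide how to send files.

    `files` is a list of (path, size_in_bytes) pairs.
    Returns "inline" when everything fits in one inline request,
    otherwise "file_api". Raises ValueError for unsupported input.
    """
    if len(files) > MAX_IMAGES:
        raise ValueError(f"Too many files: {len(files)} > {MAX_IMAGES}")
    for path, _ in files:
        suffix = Path(path).suffix.lower()
        if suffix not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"Unsupported format: {path}")
    total = sum(size for _, size in files)
    return "inline" if total <= INLINE_LIMIT_BYTES else "file_api"
```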
Available Models
- gemini-2.5-pro: Most capable, segmentation + detection
- gemini-2.5-flash: Fast, efficient, segmentation + detection
- gemini-2.5-flash-lite: Lightweight, segmentation + detection
- gemini-2.0-flash: Object detection support
- gemini-1.5-pro/flash: Previous generation
Usage Examples
Basic Image Analysis
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"
# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"
# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
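For readers who prefer to call the SDK directly, a minimal sketch of an inline image request with google-genai follows. The internals of scripts/analyze-image.py are not shown in this document, so this is not its implementation. `guess_image_mime` and `analyze_image` are hypothetical names, and the default model is simply one of the models listed above.

```python
import mimetypes
from pathlib import Path

# Formats that the standard mimetypes table may not know on every platform.
MIME_OVERRIDES = {".heic": "image/heic", ".heif": "image/heif", ".webp": "image/webp"}


def guess_image_mime(path):
    """Map a file name to a MIME type for the formats listed in this document."""
    suffix = Path(path).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    mime, _ = mimetypes.guess_type("file" + suffix)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    return mime


def analyze_image(path, prompt, model="gemini-2.5-flash", api_key=None):
    """Send one local image plus a text prompt; requires `pip install google-genai`."""
    # Imported lazily so the helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    # With api_key=None the SDK is expected to fall back to the environment.
    client = genai.Client(api_key=api_key)
    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_image_mime(path),
    )
    # Image first, text second, as recommended for single-image prompts.
    response = client.models.generate_content(model=model, contents=[image_part, prompt])
    return response.text
```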
Object Detection (2.0+)
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
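Detection results usually need to be converted to pixel coordinates before drawing or cropping. The sketch below assumes the convention described in Google's image-understanding documentation: each detection carries a `box_2d` of `[ymin, xmin, ymax, xmax]` normalized to 0-1000. It also assumes the response text is raw JSON (for example, because JSON output was requested) rather than JSON wrapped in a markdown fence. `to_pixel_boxes` is a hypothetical helper name.

```python
import json


def to_pixel_boxes(response_text, image_width, image_height):
    """Convert normalized box_2d entries to pixel (left, top, right, bottom) boxes."""
    detections = json.loads(response_text)
    results = []
    for item in detections:
        ymin, xmin, ymax, xmax = item["box_2d"]
        results.append({
            "label": item.get("label", ""),
            "box": (
                round(xmin * image_width / 1000),
                round(ymin * image_height / 1000),
                round(xmax * image_width / 1000),
                round(ymax * image_height / 1000),
            ),
        })
    return results
```

Note that the model's order is y-first, while most drawing libraries expect x-first, which is why the helper reorders the values.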
Multi-Image Comparison
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
File Upload (for large files or reuse)
# Upload file
python scripts/upload-file.py path/to/large-image.jpg
# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"
File Management
# List uploaded files
python scripts/manage-files.py list
# Get file info
python scripts/manage-files.py get file-id
# Delete file
python scripts/manage-files.py delete file-id
Token Costs
Images consume tokens based on size:
- Small (≤384px both dimensions): 258 tokens
- Large: Tiled into 768×768 chunks, 258 tokens each
Token Formula:
crop_unit = floor(min(width, height) / 1.5)
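tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
Example: 960×540 image: crop_unit = floor(540 / 1.5) = 360, tiles = ceil(960 / 360) × ceil(540 / 360) = 3 × 2 = 6, total = 6 × 258 = 1,548 tokens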
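The formula can be checked with a few lines of Python. This is a sketch of the calculation as described here, with each ratio rounded up to a whole tile, which is what the 960×540 example (6 tiles) implies. `estimate_image_tokens` is a hypothetical name, and actual billing may differ.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384  # images at or below this size in both dimensions cost one tile


def estimate_image_tokens(width, height):
    """Estimate the token cost of one image from its pixel dimensions."""
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE
    # Guard against a zero crop unit for extremely thin images.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 6 tiles -> 1548
```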
Rate Limits
Limits vary by tier (Free, Tier 1, 2, 3):
- Measured in RPM (requests/min), TPM (tokens/min), RPD (requests/day)
- Applied per project, not per API key
- RPD resets at midnight Pacific
Best Practices
Image Quality
- Use clear, non-blurry images
- Verify correct image rotation
- Consider token costs when sizing
Prompting
- Be specific in instructions
- Place text after image for single-image prompts
- Use few-shot examples for better accuracy
- Specify output format (JSON, markdown, etc.)
File Management
- Use File API for files >20MB
- Use File API for repeated usage (saves tokens)
- Files auto-delete after 48 hours
- Clean up manually when done
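A small sketch of how the 48-hour lifetime and manual cleanup might be handled in code follows. The helper names and the five-minute safety margin are hypothetical. `delete_all_files` assumes a google-genai client, whose `files.list()` and `files.delete(name=...)` methods it calls.

```python
from datetime import datetime, timedelta, timezone

FILE_TTL = timedelta(hours=48)  # uploaded files auto-delete after this long


def needs_reupload(uploaded_at, now=None, safety_margin=timedelta(minutes=5)):
    """Return True when a file uploaded at `uploaded_at` is expired or about to expire."""
    now = datetime.now(timezone.utc) if now is None else now
    return now >= uploaded_at + FILE_TTL - safety_margin


def delete_all_files(client):
    """Delete every uploaded file visible to `client`; returns the deleted names."""
    # Collect names first so deletion does not interfere with paging through the list.
    names = [f.name for f in client.files.list()]
    for name in names:
        client.files.delete(name=name)
    return names
```

Recording the upload time next to the file ID makes it possible to reuse a file safely within its lifetime instead of uploading it again.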
Security
- Never expose API keys in code
- Use environment variables
- Add API key restrictions in Google Cloud Console
- Monitor usage regularly
- Rotate keys periodically
Error Handling
Common errors:
- 401: Invalid API key
- 429: Rate limit exceeded
- 400: Invalid request (check file size, format)
- 403: Permission denied (check API key restrictions)
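Of the errors above, only 429 is normally worth retrying; the others point to a configuration or input problem. The sketch below retries with exponential backoff. `call_with_retry` is a hypothetical helper. It assumes the raised exception exposes the HTTP status as a `code` attribute, which is how `google.genai.errors.APIError` appears to report it, so treat that as an assumption to verify.

```python
import time


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a 429 error wait and retry, otherwise re-raise immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            code = getattr(exc, "code", None)
            # 400, 401 and 403 will not succeed on retry, so give up at once.
            if code != 429 or attempt == max_attempts:
                raise
            # Waits 1s, 2s, 4s, ... with the default base_delay.
            sleep(base_delay * 2 ** (attempt - 1))
```

Usage would look like `call_with_retry(lambda: client.models.generate_content(...))`, where `client` is an SDK client created elsewhere.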
Additional Resources
See the references/ directory for:
- api-reference.md: Detailed API methods and endpoints
- examples.md: Comprehensive code examples
- best-practices.md: Advanced tips and optimization strategies
Implementation Guide
When implementing Gemini vision features:
- Check API key availability using the lookup order described under API Key Configuration
- If no key is found, fall back to the workspace default vision model.
- If the default model is missing or unavailable, surface a clear message to the user explaining the absence and next steps to configure either an API key or model.
- Choose appropriate model based on requirements:
- Need segmentation? Use 2.5+ models
- Need detection? Use 2.0+ models
- Need speed? Use Flash variants
- Need quality? Use Pro variants
- Validate inputs:
- Check file format (PNG, JPEG, WEBP, HEIC, HEIF, PDF)
- Verify file size (send inline when the total request is under 20MB, otherwise use the File API)
- Count images (max 3,600)
- Handle responses appropriately:
- Parse structured output if requested
- Extract bounding boxes for object detection
- Process segmentation masks if applicable
- Manage files efficiently:
- Upload large files via File API
- Reuse uploaded files when possible
- Clean up after use
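The model-selection step in the guide above can be expressed as a simple capability filter. This is only a sketch: `candidate_models` is a hypothetical helper, the table restates the capabilities listed under Available Models, and the 1.5 models are omitted because this document does not state what they support.

```python
# Capabilities as listed under Available Models.
MODELS = [
    {"name": "gemini-2.5-pro", "family": "pro", "detection": True, "segmentation": True},
    {"name": "gemini-2.5-flash", "family": "flash", "detection": True, "segmentation": True},
    {"name": "gemini-2.5-flash-lite", "family": "flash", "detection": True, "segmentation": True},
    {"name": "gemini-2.0-flash", "family": "flash", "detection": True, "segmentation": False},
]


def candidate_models(need_segmentation=False, need_detection=False, prefer="speed"):
    """Return model names that satisfy the requirements.

    prefer="speed" keeps Flash variants; prefer="quality" keeps Pro variants.
    """
    wanted_family = "pro" if prefer == "quality" else "flash"
    names = []
    for model in MODELS:
        if need_segmentation and not model["segmentation"]:
            continue
        if need_detection and not model["detection"]:
            continue
        if model["family"] != wanted_family:
            continue
        names.append(model["name"])
    return names
```

For example, asking for segmentation with a speed preference leaves the two 2.5 Flash variants, while detection alone also admits gemini-2.0-flash.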
Scripts Overview
All scripts support the API key lookup order described under API Key Configuration:
- analyze-image.py: Main script for image analysis, supports inline and File API
- upload-file.py: Upload files to Gemini File API
- manage-files.py: List, get metadata, and delete uploaded files
Run any script with --help for detailed usage instructions.
Official Documentation: https://ai.google.dev/gemini-api/docs/image-understanding
Quick Install
/plugin add https://github.com/Elios-FPT/EliosCodePracticeService/tree/main/gemini-vision
Copy and paste this command in Claude Code to install this skill.
