# gemini-vision

> Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.

- Author: Kien Ha
- Repository: kienhaminh/speed-reader
- Version: 20251229180018
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-08
- Source: https://github.com/kienhaminh/speed-reader
- Web: https://mule.run/skillshub/@@kienhaminh/speed-reader~gemini-vision:20251229180018

---

---
name: gemini-vision
description: Guide for implementing Google Gemini API image understanding - analyze images with captioning, classification, visual QA, object detection, segmentation, and multi-image comparison. Use when analyzing images, answering visual questions, detecting objects, or processing documents with vision.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
---

# Gemini Vision API Skill

This skill enables Claude to use Google's Gemini API for advanced image understanding tasks including captioning, classification, visual question answering, object detection, segmentation, and multi-image analysis.

## Quick Start

### Prerequisites

1. **Get API Key**: Obtain from [Google AI Studio](https://aistudio.google.com/apikey)
2. **Install SDK**: `pip install google-genai` (Python 3.9+)

### API Key Configuration

The skill supports both **Google AI Studio** and **Vertex AI** endpoints.

#### Option 1: Google AI Studio (Default)

The skill checks for `GEMINI_API_KEY` in this order:

1. **Process environment**: `export GEMINI_API_KEY="your-key"`
2. **Project root**: `.env`
3. **.claude directory**: `.claude/.env`
4. **.claude/skills directory**: `.claude/skills/.env`
5. **Skill directory**: `.claude/skills/gemini-vision/.env`
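
The lookup order above can be sketched as follows. This is a minimal illustration, not the code the bundled scripts use: the helper names `read_env_value` and `find_api_key` are hypothetical, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Optional

# Candidate .env locations, relative to the project root, in lookup order.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-vision/.env",
]


def read_env_value(path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None


def find_api_key(project_root: str = ".", name: str = "GEMINI_API_KEY") -> Optional[str]:
    """Check the process environment first, then each .env file in order."""
    value = os.environ.get(name)
    if value:
        return value
    for rel in ENV_FILES:
        value = read_env_value(Path(project_root) / rel, name)
        if value:
            return value
    return None
```

The first location that yields a non-empty value wins, so a key exported in the shell overrides any `.env` file.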

#### Option 2: Vertex AI

To use Vertex AI instead:

```bash
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1
```

Or in a `.env` file:
```bash
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
```

**Security**: Never commit API keys to version control. Add `.env` to `.gitignore`.
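
As a rough sketch of how these variables could map onto the `google-genai` client, the helper below builds the constructor arguments from a mapping of environment variables. The function names `client_kwargs` and `make_client` are hypothetical, and the bundled scripts may do this differently. The SDK is imported lazily, so the argument logic can be checked without it installed.

```python
import os
from typing import Mapping, Optional


def client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for genai.Client from the variables above."""
    env = os.environ if env is None else env
    if env.get("GEMINI_USE_VERTEX", "").strip().lower() == "true":
        project = env.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Same default as documented above.
            "location": env.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


def make_client(env: Optional[Mapping[str, str]] = None):
    """Create a client; requires `pip install google-genai`."""
    from google import genai

    return genai.Client(**client_kwargs(env))
```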

## Core Capabilities

### Image Analysis
- **Captioning**: Generate descriptive text for images
- **Classification**: Categorize and identify image content
- **Visual QA**: Answer questions about image content
- **Multi-image**: Compare and analyze up to 3,600 images

### Advanced Features (Model-Specific)
- **Object Detection**: Identify and locate objects with bounding boxes (Gemini 2.0+)
- **Segmentation**: Create pixel-level masks for objects (Gemini 2.5+)
- **Document Understanding**: Process PDFs with vision (up to 1,000 pages)

## Supported Formats

- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
- **Documents**: PDF (up to 1,000 pages)
- **Size Limits**:
  - Inline: 20MB max total request size
  - File API: For larger files
  - Max images: 3,600 per request
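
The limits above lend themselves to a quick pre-flight check. The sketch below is illustrative only: `plan_upload` is a hypothetical helper, it treats `.jpg` as JPEG, and it compares raw file sizes, whereas the real 20MB limit applies to the whole request (prompt and encoding overhead included), so a safety margin is advisable.

```python
import os

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif", ".pdf"}
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20MB total request size
MAX_IMAGES = 3600


def plan_upload(files):
    """Decide how to send `files`, a list of (name, size_in_bytes) pairs.

    Returns "inline" when everything fits in one request, else "file_api".
    Raises ValueError for unsupported formats or too many images.
    """
    if len(files) > MAX_IMAGES:
        raise ValueError(f"Too many images: {len(files)} > {MAX_IMAGES}")
    for name, _ in files:
        ext = os.path.splitext(name)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"Unsupported format: {name}")
    total = sum(size for _, size in files)
    return "inline" if total < INLINE_LIMIT_BYTES else "file_api"
```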

## Available Models

- **gemini-2.5-pro**: Most capable, segmentation + detection
- **gemini-2.5-flash**: Fast, efficient, segmentation + detection
- **gemini-2.5-flash-lite**: Lightweight, segmentation + detection
- **gemini-2.0-flash**: Object detection support
- **gemini-1.5-pro/flash**: Previous generation

## Usage Examples

### Basic Image Analysis

```bash
# Analyze a local image
python scripts/analyze-image.py path/to/image.jpg "What's in this image?"

# Analyze from URL
python scripts/analyze-image.py https://example.com/image.jpg "Describe this"

# Specify model
python scripts/analyze-image.py image.jpg "Caption this" --model gemini-2.5-pro
```
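
For readers who want to call the SDK directly rather than go through the script, here is a minimal sketch of an inline single-image request with `google-genai`. It is not the contents of `analyze-image.py`. The helper names are hypothetical, and the default model is just one of the models listed above.

```python
from pathlib import Path

# MIME types for the image formats listed under "Supported Formats".
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path: str) -> str:
    """Map a file extension to its MIME type, rejecting unknown formats."""
    suffix = Path(path).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Unsupported image format: {path}")
    return MIME_TYPES[suffix]


def describe_image(client, path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one local image inline, followed by the text prompt."""
    from google.genai import types  # lazy import; requires google-genai

    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=image_mime_type(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],  # image first, then the text
    )
    return response.text
```

With the SDK installed, `describe_image(genai.Client(api_key="your-key"), "image.jpg", "What's in this image?")` would return the model's text answer.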

### Object Detection (2.0+)

```bash
python scripts/analyze-image.py image.jpg "Detect all objects" --model gemini-2.0-flash
```
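
When asked for bounding boxes, Gemini reports them as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale, so they must be rescaled to the image's pixel size. The sketch below assumes the prompt asked for a JSON list of objects with `box_2d` and `label` keys; the actual shape depends on your prompt. The helper name `to_pixel_boxes` is hypothetical.

```python
import json


def to_pixel_boxes(detections_json: str, width: int, height: int):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes to pixel coordinates."""
    results = []
    for item in json.loads(detections_json):
        ymin, xmin, ymax, xmax = item["box_2d"]
        results.append({
            "label": item.get("label", ""),
            "left": round(xmin * width / 1000),
            "top": round(ymin * height / 1000),
            "right": round(xmax * width / 1000),
            "bottom": round(ymax * height / 1000),
        })
    return results
```

If the model wraps its answer in a Markdown code fence, strip the fence before parsing.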

### Multi-Image Comparison

```bash
python scripts/analyze-image.py img1.jpg img2.jpg "What's different between these?"
```

### File Upload (for large files or reuse)

```bash
# Upload file
python scripts/upload-file.py path/to/large-image.jpg

# Use uploaded file
python scripts/analyze-image.py file://file-id "Caption this"
```

### File Management

```bash
# List uploaded files
python scripts/manage-files.py list

# Get file info
python scripts/manage-files.py get file-id

# Delete file
python scripts/manage-files.py delete file-id
```
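
The same upload, use, and delete cycle can be driven from the SDK. The sketch below takes an already constructed `google-genai` client and uses its `files.upload`, `models.generate_content`, and `files.delete` methods. The function name and the `keep` flag are hypothetical, and this is not a copy of the bundled scripts.

```python
def upload_and_describe(client, path, prompt, model="gemini-2.5-flash", keep=False):
    """Upload a file through the File API, ask one question, then clean up.

    Set keep=True to leave the file on the server for later reuse.
    """
    uploaded = client.files.upload(file=path)
    try:
        response = client.models.generate_content(
            model=model,
            contents=[uploaded, prompt],  # the uploaded file stands in for inline bytes
        )
        return response.text
    finally:
        if not keep:
            # Delete explicitly rather than waiting for automatic expiry.
            client.files.delete(name=uploaded.name)
```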

## Token Costs

Images consume tokens based on size:

- **Small** (≤384px both dimensions): 258 tokens
- **Large**: Tiled into 768×768 chunks, 258 tokens each

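**Token Formula** (approximate; partial tiles round up):
```
crop_unit = floor(min(width, height) / 1.5)
tiles = ceil(width / crop_unit) × ceil(height / crop_unit)
total_tokens = tiles × 258
```

**Example**: 960×540 image: crop_unit = floor(540 / 1.5) = 360, so ceil(960 / 360) × ceil(540 / 360) = 3 × 2 = 6 tiles = 1,548 tokens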
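
The estimate can be reproduced with a few lines of Python. This sketch rounds each quotient up, which is what makes the 960×540 example come out to 6 tiles. Treat the result as an approximation: the function name is hypothetical, and the API's own token count is authoritative.

```python
import math

TOKENS_PER_TILE = 258


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate the token cost of one image from its pixel dimensions."""
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE  # small images cost a flat 258 tokens
    # Guard against a zero crop unit for extremely thin images.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 1548
```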

## Rate Limits

Limits vary by tier (Free, Tier 1, 2, 3):
- Measured in RPM (requests/min), TPM (tokens/min), RPD (requests/day)
- Applied per project, not per API key
- RPD resets at midnight Pacific

## Best Practices

### Image Quality
- Use clear, non-blurry images
- Verify correct image rotation
- Consider token costs when sizing

### Prompting
- Be specific in instructions
- Place text after image for single-image prompts
- Use few-shot examples for better accuracy
- Specify output format (JSON, markdown, etc.)

### File Management
- Use File API for files >20MB
- Use File API for repeated usage (avoids re-uploading the same data on every request)
- Files auto-delete after 48 hours
- Clean up manually when done

### Security
- Never expose API keys in code
- Use environment variables
- Add API key restrictions in Google Cloud Console
- Monitor usage regularly
- Rotate keys periodically

## Error Handling

Common errors:
- **401**: Invalid API key
- **429**: Rate limit exceeded
- **400**: Invalid request (check file size, format)
- **403**: Permission denied (check API key restrictions)
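
Of these, only 429 is usually worth retrying; the others point to a problem with the request or credentials. A minimal backoff sketch follows. It assumes the raised exception exposes the HTTP status as a `code` attribute, which you should confirm against the SDK version you use. The function name and defaults are hypothetical.

```python
import random
import time

RETRYABLE_CODES = {429}


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on rate limits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            code = getattr(exc, "code", None)  # assumed attribute, see above
            if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
                raise
            # Delay grows as 1x, 2x, 4x, ... of base_delay, plus random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Wrap a request as `call_with_backoff(lambda: client.models.generate_content(...))`; the `sleep` parameter exists so tests can run without waiting.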

## Additional Resources

See the `references/` directory for:
- **api-reference.md**: Detailed API methods and endpoints
- **examples.md**: Comprehensive code examples
- **best-practices.md**: Advanced tips and optimization strategies

## Implementation Guide

When implementing Gemini vision features:

1. **Check API key availability** using the lookup order described under API Key Configuration
   - If no key is found, fall back to the workspace **default vision model**.
   - If the default model is missing or unavailable, surface a clear message to the user explaining the absence and next steps to configure either an API key or model.
2. **Choose appropriate model** based on requirements:
   - Need segmentation? Use 2.5+ models
   - Need detection? Use 2.0+ models
   - Need speed? Use Flash variants
   - Need quality? Use Pro variants
3. **Validate inputs**:
   - Check file format (PNG, JPEG, WEBP, HEIC, HEIF, PDF)
   - Verify file size (send inline only when the total request is under 20MB; otherwise use the File API)
   - Count images (max 3,600)
4. **Handle responses** appropriately:
   - Parse structured output if requested
   - Extract bounding boxes for object detection
   - Process segmentation masks if applicable
5. **Manage files** efficiently:
   - Upload large files via File API
   - Reuse uploaded files when possible
   - Clean up after use
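
Step 2 can be expressed as a small table lookup. The sketch below encodes only the capabilities listed under Available Models; the `choose_model` helper and its `prefer` values are hypothetical, and the table would need updating as models change.

```python
# Capabilities as listed under "Available Models".
MODEL_FEATURES = {
    "gemini-2.5-pro": {"detection", "segmentation"},
    "gemini-2.5-flash": {"detection", "segmentation"},
    "gemini-2.5-flash-lite": {"detection", "segmentation"},
    "gemini-2.0-flash": {"detection"},
}


def choose_model(required=frozenset(), prefer="speed"):
    """Return the first listed model that covers `required`.

    prefer="speed" restricts the search to Flash variants,
    prefer="quality" to Pro variants.
    """
    if prefer not in ("speed", "quality"):
        raise ValueError("prefer must be 'speed' or 'quality'")
    family = "pro" if prefer == "quality" else "flash"
    for name, features in MODEL_FEATURES.items():
        if family in name and set(required) <= features:
            return name
    raise ValueError(f"No {family} model supports: {sorted(required)}")
```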

## Scripts Overview

All scripts support the API key lookup order described under API Key Configuration:

- **analyze-image.py**: Main script for image analysis, supports inline and File API
- **upload-file.py**: Upload files to Gemini File API
- **manage-files.py**: List, get metadata, and delete uploaded files

Run any script with `--help` for detailed usage instructions.

---

**Official Documentation**: https://ai.google.dev/gemini-api/docs/image-understanding