
gemini-audio


About

This skill enables developers to use the Google Gemini API's audio capabilities for both analysis and generation. It can transcribe, summarize, and analyze audio files up to 9.5 hours long, and it can generate natural speech from text with controllable TTS. Use it for processing podcasts, meetings, or any project that needs robust audio-to-text or text-to-speech functionality.

Documentation

Gemini Audio API Skill

Process audio with transcription, analysis, and understanding, plus generate natural speech using Google's Gemini API. Supports up to 9.5 hours of audio per request with multiple formats.

When to Use This Skill

Use this skill when you need to:

  • Transcribe audio files to text with timestamps
  • Summarize audio content and extract key points
  • Analyze speech, music, or environmental sounds
  • Generate speech from text with controllable voice and style
  • Process podcasts, interviews, meetings, or any audio content
  • Understand non-speech audio (birdsong, sirens, music)

Prerequisites

API Key Setup

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill automatically detects your GEMINI_API_KEY in this order (a sketch follows the list):

  1. Process environment: export GEMINI_API_KEY="your-key"
  2. Project root: .env
  3. .claude directory: .claude/.env
  4. .claude/skills directory: .claude/skills/.env
  5. Skill directory: .claude/skills/gemini-audio/.env
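
A minimal sketch of this lookup order using only the standard library (the skill's actual implementation may differ):

import os
from pathlib import Path

def find_gemini_api_key():
    # 1. Process environment takes precedence
    key = os.environ.get('GEMINI_API_KEY')
    if key:
        return key
    # 2-5. Check each .env location in the documented order
    for env_file in ('.env', '.claude/.env', '.claude/skills/.env',
                     '.claude/skills/gemini-audio/.env'):
        path = Path(env_file)
        if path.is_file():
            for line in path.read_text().splitlines():
                if line.strip().startswith('GEMINI_API_KEY='):
                    return line.split('=', 1)[1].strip()
    return None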

Get your API key: Visit Google AI Studio

Create .env file with:

GEMINI_API_KEY=your_api_key_here

Option 2: Vertex AI

To use Vertex AI instead:

# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1

Or in .env file:

GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
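
With these variables set, the google-genai client can target Vertex AI directly; a minimal sketch:

from google import genai
import os

client = genai.Client(
    vertexai=True,
    project=os.environ['VERTEX_PROJECT_ID'],
    location=os.environ.get('VERTEX_LOCATION', 'us-central1'),
)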

Python Setup

Install required package:

pip install google-genai

Quick Start

Audio Analysis (Transcription, Summarization)

from google import genai
import os

# API key auto-detected from environment
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload audio file
myfile = client.files.upload(file='podcast.mp3')

# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)

# Summarize
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize the key points in 5 bullets.', myfile]
)
print(response.text)

Using Helper Scripts

# Transcribe audio
python .claude/skills/gemini-audio/scripts/transcribe.py audio.mp3

# Summarize audio
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
  "Summarize key points"

# Analyze specific segment (timestamps in MM:SS format)
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
  "What is discussed from 02:30 to 05:15?"

# Generate speech
python .claude/skills/gemini-audio/scripts/generate-speech.py \
  "Welcome to our podcast" \
  --output welcome.wav

Audio Understanding Capabilities

Supported Formats

Format | MIME Type | Best Use
WAV | audio/wav | Uncompressed, highest quality
MP3 | audio/mp3 | Compressed, widely compatible
AAC | audio/aac | Compressed, good quality
FLAC | audio/flac | Lossless compression
OGG Vorbis | audio/ogg | Open format
AIFF | audio/aiff | Apple format
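
When you send audio as inline bytes (see Input Methods below), the MIME type must be passed explicitly. A small helper derived from the table above (the fallback value is an assumption, not part of the skill):

from pathlib import Path

AUDIO_MIME_TYPES = {
    '.wav': 'audio/wav',
    '.mp3': 'audio/mp3',
    '.aac': 'audio/aac',
    '.flac': 'audio/flac',
    '.ogg': 'audio/ogg',
    '.aiff': 'audio/aiff',
}

def audio_mime_type(filename):
    # Look up the MIME type by file extension; validate unsupported
    # formats before upload rather than relying on the fallback
    return AUDIO_MIME_TYPES.get(Path(filename).suffix.lower(), 'application/octet-stream')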

Audio Specifications

  • Maximum length: 9.5 hours per request
  • Multiple files: Unlimited count, combined max 9.5 hours
  • Token rate: 32 tokens/second (1 minute = 1,920 tokens)
  • Processing: Auto-downsampled to 16 Kbps mono
  • File size limits:
    • Inline: 20 MB max total request
    • File API: 2 GB per file, 20 GB project quota
    • Retention: 48 hours auto-delete

Analysis Features

  • Transcription: Full text with punctuation
  • Timestamps: Reference segments (MM:SS format)
  • Multi-speaker: Identify different speakers
  • Non-speech: Analyze music, sounds, ambient audio
  • Languages: Support for multiple languages
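
All of these features are driven by the prompt rather than separate endpoints. For example, a diarized transcript with timestamps, reusing client and myfile from the Quick Start (the prompt wording is illustrative):

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe this audio. Label each speaker (Speaker 1, Speaker 2, ...) '
        'and prefix each turn with an MM:SS timestamp.',
        myfile
    ]
)
print(response.text)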

Speech Generation (TTS)

Available TTS Models

Model | Quality | Speed | Cost/1M tokens
gemini-2.5-flash-native-audio-preview-09-2025 | High | Fast | $10
gemini-2.5-pro TTS mode | Premium | Slower | $20

Controllable Voice Options

  • Style: Professional, casual, narrative, conversational
  • Pace: Slow, normal, fast
  • Tone: Friendly, serious, enthusiastic
  • Accent: Natural language control

TTS Example

The response shape below follows the google-genai SDK: audio-output models return raw PCM bytes inside the response parts rather than a ready-made file, so the bytes are wrapped in a WAV container here (24 kHz, 16-bit mono):

from google.genai import types
import wave

response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents="Welcome to today's episode, in a warm, friendly tone.",
    # Audio-output models need the AUDIO response modality
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)

# Extract the raw PCM bytes from the first response part
audio_bytes = response.candidates[0].content.parts[0].inline_data.data

# Save audio output as a playable WAV file
with wave.open('output.wav', 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)   # 16-bit samples
    f.setframerate(24000)
    f.writeframes(audio_bytes)

Input Methods

Method 1: File Upload (Recommended for >20MB)

# Upload and reuse
myfile = client.files.upload(file='large-audio.mp3')

# Use file multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)

response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
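
Uploaded files auto-delete after 48 hours, but you can list and delete them earlier to stay under the 20 GB project quota; both calls are part of the google-genai Files API:

# List files currently stored under your project
for f in client.files.list():
    print(f.name, f.state)

# Delete a file as soon as you no longer need it
client.files.delete(name=myfile.name)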

Method 2: Inline Data (<20MB)

from google.genai import types

with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)

Common Use Cases

Transcription

python scripts/transcribe.py meeting.mp3 --include-timestamps

Summary with Key Points

python scripts/analyze.py interview.wav "Extract main topics and key quotes"

Speaker Identification

python scripts/analyze.py discussion.mp3 "Identify speakers and extract dialogue"

Segment Analysis

python scripts/analyze.py podcast.mp3 "Summarize content from 10:30 to 15:45"

Non-Speech Analysis

python scripts/analyze.py ambient.wav "Identify all sounds: voices, music, ambient"

Best Practices

File Management

  • Use File API for files >20MB or repeated usage
  • Files auto-delete after 48 hours
  • Manage quota (20 GB project limit)

Prompt Engineering

  • Be specific: "Transcribe from 02:30 to 03:29"
  • Use timestamps for segment analysis (MM:SS format)
  • Combine tasks: "Transcribe and summarize"
  • Provide context: "This is a medical interview"

Cost Optimization

  • Use gemini-2.5-flash ($1/1M tokens) for most tasks
  • Upgrade to gemini-2.5-pro ($3/1M tokens) for complex analysis
  • Check token count: 1 min audio = 1,920 tokens

Error Handling

  • Validate file format and size before upload
  • Implement exponential backoff for rate limits (see the sketch below)
  • Handle 48-hour file expiration
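
A minimal backoff sketch around generate_content (the broad except is for illustration; in practice, narrow it to the rate-limit errors raised by your google-genai version):

import time

def generate_with_backoff(client, model, contents, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.models.generate_content(model=model, contents=contents)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...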

Token Costs & Pricing

Audio Input (32 tokens/second):

  • 1 minute = 1,920 tokens
  • 1 hour = 115,200 tokens
  • 9.5 hours = 1,094,400 tokens
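
A quick sanity check of these numbers, and a cost estimate before uploading (the default rate assumes the Gemini 2.5 Flash input price from the list below):

def audio_input_tokens(duration_seconds):
    # Gemini bills audio input at 32 tokens per second
    return int(duration_seconds * 32)

def audio_input_cost_usd(duration_seconds, price_per_million=1.00):
    return audio_input_tokens(duration_seconds) * price_per_million / 1_000_000

print(audio_input_tokens(3600))     # 115200 tokens for 1 hour
print(audio_input_cost_usd(3600))   # ~$0.12 at Flash input pricing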

Model Pricing:

  • Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
  • Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
  • Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

TTS Pricing:

  • Flash TTS: $10/1M tokens
  • Pro TTS: $20/1M tokens

Reference Documentation

For detailed information, see:

  • references/api-reference.md - Complete API specifications
  • references/code-examples.md - Comprehensive code examples
  • references/tts-guide.md - Text-to-speech implementation guide
  • references/best-practices.md - Advanced optimization strategies

Scripts Overview

All scripts use the API key detection order described under Prerequisites:

  • transcribe.py: Generate transcripts with optional timestamps
  • analyze.py: General audio analysis with custom prompts
  • generate-speech.py: Text-to-speech generation
  • manage-files.py: Upload, list, and delete audio files

Run any script with --help for detailed usage.

Resources

Quick Install

/plugin add https://github.com/Elios-FPT/EliosCodePracticeService/tree/main/gemini-audio

Copy and paste this command into Claude Code to install this skill.

GitHub Repository

Elios-FPT/EliosCodePracticeService
Path: .claude/skills/gemini-audio
