bio-compressed-files
About
This skill enables reading and writing of compressed sequence files (gzip, bzip2, BGZF) using Biopython. It provides specific implementations for different compression formats, with BGZF supporting indexable random access. Use it when working with compressed FASTA/FASTQ files to handle decompression transparently within your bioinformatics workflows.
Quick Install
Claude Code (recommended):
/plugin add https://github.com/majiayu000/claude-skill-registry
Alternative:
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/bio-compressed-files
Documentation
Compressed Files
Handle gzip, bzip2, and BGZF compressed sequence files with Biopython.
Required Imports
import gzip
import bz2
from Bio import SeqIO
from Bio import bgzf # For BGZF (indexable compression)
Reading Compressed Files
Gzip (.gz)
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
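Important: Use 'rt' (read text) mode, not 'rb' (read binary). gzip.open() and bz2.open() default to binary mode, while SeqIO's text-based parsers such as 'fasta' and 'fastq' expect a text handle.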
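The difference between the two modes can be checked with the standard library alone. The sketch below writes a tiny, made-up FASTA file to a temporary directory (the file name and sequence are illustrative assumptions) and reads its first line back in each mode.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Illustrative file name and content, not taken from a real dataset.
    path = os.path.join(tmp_dir, "example.fasta.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(">seq1\nACGT\n")

    # Binary mode yields bytes objects.
    with gzip.open(path, "rb") as handle:
        first_binary = handle.readline()

    # Text mode yields str objects, which text-based parsers expect.
    with gzip.open(path, "rt") as handle:
        first_text = handle.readline()

print(type(first_binary).__name__)
print(type(first_text).__name__, repr(first_text))
```

Passing the binary handle to a text parser is what produces the mode-related errors listed under Common Errors.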
Bzip2 (.bz2)
with bz2.open('sequences.fasta.bz2', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id, len(record.seq))
BGZF (Block Gzip)
BGZF files are valid gzip files, so they can be read with gzip.open(); the bgzf module additionally supports random access:
# Read like normal gzip (SeqIO.parse() does not decompress a filename by itself)
with gzip.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
# Or explicitly with bgzf module
with bgzf.open('sequences.fasta.bgz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)
Writing Compressed Files
Gzip (.gz)
with gzip.open('output.fasta.gz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
Bzip2 (.bz2)
with bz2.open('output.fasta.bz2', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
BGZF (.bgz)
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
BGZF: Indexable Compression
BGZF is the only compressed format that supports SeqIO.index() and SeqIO.index_db().
BGZF (Block GZip Format) is a variant of gzip that allows random access. It's used by BAM files and tabix-indexed files.
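The idea behind that random access can be sketched with plain gzip members. This is a simplified stand-in, not the real BGZF layout (actual BGZF blocks also carry their size in an extra header field and are capped at 64 KiB); the chunk contents and variable names are illustrative assumptions.

```python
import gzip

# Three independent gzip members, standing in for BGZF blocks.
chunks = [b">seq1\nACGT\n", b">seq2\nGGCC\n", b">seq3\nTTAA\n"]
members = [gzip.compress(chunk) for chunk in chunks]

# Record the byte offset at which each member starts.
offsets = []
position = 0
for member in members:
    offsets.append(position)
    position += len(member)

blob = b"".join(members)

# Sequential readers still see one valid gzip stream.
whole = gzip.decompress(blob)

# Random access: decompress only the second member, using its offsets.
start, end = offsets[1], offsets[2]
second = gzip.decompress(blob[start:end])
print(second.decode())
```

Because every block can be decompressed on its own, an index only needs to store where each block starts, which is what ordinary single-stream gzip cannot offer.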
Create Indexable Compressed File
from Bio import SeqIO, bgzf
# Write as BGZF (can be indexed later)
records = SeqIO.parse('input.fasta', 'fasta')
with bgzf.open('output.fasta.bgz', 'wt') as handle:
    SeqIO.write(records, handle, 'fasta')
Index a BGZF File
# SeqIO.index() works with BGZF!
records = SeqIO.index('sequences.fasta.bgz', 'fasta')
seq = records['target_id'].seq
records.close()
# SeqIO.index_db() also works
records = SeqIO.index_db('index.sqlite', 'sequences.fasta.bgz', 'fasta')
Convert Gzip to BGZF
from Bio import SeqIO, bgzf
import gzip
# Read from gzip, write to BGZF
with gzip.open('input.fasta.gz', 'rt') as in_handle:
    with bgzf.open('output.fasta.bgz', 'wt') as out_handle:
        SeqIO.write(SeqIO.parse(in_handle, 'fasta'), out_handle, 'fasta')
Code Patterns
Read Gzipped FASTQ
with gzip.open('reads.fastq.gz', 'rt') as handle:
    records = list(SeqIO.parse(handle, 'fastq'))
print(f'Loaded {len(records)} reads')
Count Records in Gzipped File
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SeqIO.parse(handle, 'fasta'))
print(f'{count} sequences')
Fast Count with Low-Level Parser
from Bio.SeqIO.FastaIO import SimpleFastaParser
import gzip
with gzip.open('sequences.fasta.gz', 'rt') as handle:
    count = sum(1 for _ in SimpleFastaParser(handle))
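When Biopython is not available, a rough equivalent is to count header lines directly. The sketch below is a swapped-in technique rather than part of Biopython, and it assumes a well-formed FASTA file in which only header lines start with '>'; the helper name count_fasta_headers and the sample data are hypothetical.

```python
import gzip
import os
import tempfile

def count_fasta_headers(path):
    """Count lines starting with '>' in a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Small demonstration on a temporary file with made-up records.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(">seq1\nACGT\n>seq2\nGGCC\nTTAA\n")
    header_count = count_fasta_headers(path)

print(header_count)
```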
Convert Compressed to Uncompressed
with gzip.open('input.fasta.gz', 'rt') as in_handle:
    records = SeqIO.parse(in_handle, 'fasta')
    SeqIO.write(records, 'output.fasta', 'fasta')
Convert Uncompressed to Compressed
records = SeqIO.parse('input.fasta', 'fasta')
with gzip.open('output.fasta.gz', 'wt') as out_handle:
    SeqIO.write(records, out_handle, 'fasta')
Auto-Detect Compression
from pathlib import Path
from Bio import SeqIO
import gzip
import bz2
def open_sequence_file(filepath, fmt):
    """Yield records from a plain, gzip, BGZF, or bzip2 sequence file."""
    filepath = Path(filepath)
    suffix = filepath.suffix.lower()
    if suffix in ('.gz', '.bgz'):
        # BGZF is valid gzip, so gzip.open() reads both;
        # bgzf.open() would fail on an ordinary gzip file
        opener = gzip.open
    elif suffix == '.bz2':
        opener = bz2.open
    else:
        opener = open
    # The handle stays open while records are consumed and is closed afterwards
    with opener(filepath, 'rt') as handle:
        yield from SeqIO.parse(handle, fmt)
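File extensions are not always reliable, so a complementary approach is to inspect the first bytes of the file. The sketch below is a heuristic using only the standard library; the helper name sniff_compression and the sample file names are hypothetical, and the BGZF check assumes the 'BC' subfield comes first in the gzip extra field, which is the usual layout.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# The standard 28-byte empty BGZF block (the BGZF end-of-file marker).
BGZF_EOF = (
    b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC"
    b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)

def sniff_compression(path):
    """Guess the compression of a file from its leading bytes (heuristic)."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    # BGZF: gzip header with the extra-field flag and a 'BC' subfield.
    if head.startswith(b"\x1f\x8b\x08\x04") and head[12:14] == b"BC":
        return "bgzf"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"BZh"):
        return "bzip2"
    if head.startswith(b"\xfd7zXZ\x00"):
        return "xz"
    return "none"

# Demonstration on small temporary files with made-up content.
payload = b">seq1\nACGT\n"
samples = {
    "plain.fasta": payload,
    "plain.fasta.gz": gzip.compress(payload),
    "plain.fasta.bz2": bz2.compress(payload),
    "plain.fasta.xz": lzma.compress(payload),
    "empty.bgz": BGZF_EOF,
}
detected = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in samples.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as handle:
            handle.write(content)
        detected[name] = sniff_compression(path)

print(detected)
```

The result can then be used to choose between gzip.open(), bz2.open(), lzma.open(), or the built-in open().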
Process Large Gzipped File (Memory Efficient)
with gzip.open('large.fastq.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fastq'):
        if len(record.seq) >= 100:
            process(record)  # process() is a placeholder for your own per-record function
Compress Existing File (Raw Copy)
import gzip
import shutil
with open('sequences.fasta', 'rb') as f_in:
    with gzip.open('sequences.fasta.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
Compression Comparison
| Format | Extension | Indexable | Speed | Compression |
|---|---|---|---|---|
| Gzip | .gz | No | Fast | Good |
| BGZF | .bgz | Yes | Fast | Good |
| Bzip2 | .bz2 | No | Slow | Better |
| LZMA | .xz | No | Slowest | Best |
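The trade-offs in the table can be checked on your own data. The following is a rough sketch using only the standard library on randomly generated nucleotide data, so the numbers are illustrative and real sequence files will behave differently. BGZF is left out because it requires Biopython.

```python
import bz2
import gzip
import lzma
import random
import time

# Synthetic nucleotide data (an assumption for illustration only).
random.seed(0)
data = "".join(random.choice("ACGT") for _ in range(200_000)).encode("ascii")

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Confirm the round trip is lossless before trusting the numbers.
    round_trip_ok = decompress(packed) == data
    results[name] = (len(packed), elapsed, round_trip_ok)
    print(f"{name}: {len(packed)} bytes in {elapsed:.3f} s")
```

Timings depend on the machine and on the compression level, so treat the output as a relative comparison rather than a benchmark.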
When to Use Each Format
| Use Case | Recommended Format |
|---|---|
| Archive (no random access needed) | gzip or bzip2 |
| Need to index compressed file | BGZF |
| BAM files and tabix | BGZF (native) |
| Maximum compression | bzip2 or xz |
| Best speed | gzip or BGZF |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| TypeError: a bytes-like object is required | Used 'rb' mode | Use 'rt' for text mode |
| UnicodeDecodeError | Wrong encoding | Try gzip.open(file, 'rt', encoding='latin-1') |
| gzip.BadGzipFile | Not a gzip file | Check file extension matches actual format |
| OSError: Not a gzipped file | Corrupt or wrong format | Verify file integrity |
| SeqIO.index() fails on .gz | Regular gzip not indexable | Convert to BGZF first |
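For the "verify file integrity" case, one option is to decompress the whole file once and watch for errors. The sketch below uses only the standard library; the helper name is_readable_gzip and the sample files are hypothetical. It reports whether the file decompresses cleanly, not whether its content is valid FASTA or FASTQ.

```python
import gzip
import os
import tempfile
import zlib

def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses as gzip without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError; truncation raises EOFError.
        return False
    return True

# Demonstration: a valid file, a truncated file, and a mislabelled plain file.
payload = b">seq1\n" + b"ACGT" * 1000 + b"\n"
good = gzip.compress(payload)
samples = {
    "good.fasta.gz": good,
    "truncated.fasta.gz": good[: len(good) // 2],
    "mislabelled.fasta.gz": payload,
}
status = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in samples.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as handle:
            handle.write(content)
        status[name] = is_readable_gzip(path)

print(status)
```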
Decision Tree
Working with compressed sequence files?
├── Just reading sequentially?
│ └── Use gzip.open() or bz2.open() with 'rt' mode
├── Need to index the compressed file?
│ └── Convert to BGZF, then use SeqIO.index()
├── Writing compressed output?
│ ├── Will need to index later? → Use bgzf.open()
│ └── Just archiving? → Use gzip.open() or bz2.open()
└── Converting between formats?
└── Parse with SeqIO, write to new handle
Related Skills
- read-sequences - Core parsing functions used with compressed handles
- write-sequences - Write to compressed output files
- batch-processing - Process multiple compressed files
- alignment-files - BAM files use BGZF natively; samtools handles compression
