llama-cpp
About
llama-cpp enables efficient LLM inference on non-NVIDIA hardware including CPUs, Apple Silicon, and consumer GPUs. It's ideal for edge deployment, Macs, or any setup where CUDA is unavailable, and its GGUF quantization formats sharply reduce memory usage. On CPU it typically runs 4-10× faster than serving the same model through PyTorch, with minimal dependencies.
Documentation
llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
Quick start
Installation
# macOS/Linux
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
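Newer checkouts of llama.cpp build primarily with CMake rather than make, and the acceleration option names have changed across versions. The commands below sketch the CMake route, assuming a recent checkout where the GGML_* options apply; check the repo's build docs for your version.
# CMake build (binaries land in build/bin/)
cmake -B build
cmake --build build --config Release -j
# With Metal (enabled by default on Apple Silicon)
cmake -B build -DGGML_METAL=ON
# With CUDA
cmake -B build -DGGML_CUDA=ON
# With ROCm (option name varies by version)
cmake -B build -DGGML_HIP=ON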
Download model
# Download from HuggingFace (GGUF format)
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
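If the source checkpoint is not already in GGUF, a common two-step pipeline is to convert the HuggingFace weights to a full-precision GGUF and then quantize it with the llama-quantize tool. The paths below are illustrative; --outfile and --outtype follow the convert script's usual interface.
# Convert HF weights to an f16 GGUF
python convert_hf_to_gguf.py models/llama-2-7b-chat/ \
--outfile models/llama-2-7b-chat.f16.gguf \
--outtype f16
# Quantize the f16 GGUF down to Q4_K_M
./llama-quantize \
models/llama-2-7b-chat.f16.gguf \
models/llama-2-7b-chat.Q4_K_M.gguf \
Q4_K_M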
Run inference
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
Server mode
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
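Besides the OpenAI-compatible route, llama-server exposes native endpoints; the health check and /completion request below reflect recent builds and may differ slightly across versions.
# Health check
curl http://localhost:8080/health
# Native completion endpoint (n_predict = max tokens to generate)
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum computing",
"n_predict": 128,
"temperature": 0.7
}'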
Quantization formats
GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
Choosing quantization
# General use (balanced)
Q4_K_M # 4-bit, medium quality
# Maximum speed (more degradation)
Q2_K or Q3_K_M
# Maximum quality (slower)
Q6_K or Q8_0
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
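To compare quantization levels empirically rather than by table, the repo ships a perplexity tool: run the same evaluation text through each quantized file and prefer the smallest model whose perplexity is acceptable (lower is better). The evaluation file path below is a placeholder.
# Measure perplexity of a quantized model on an evaluation text file
./llama-perplexity \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-f wikitext-2-raw/wiki.test.raw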
Hardware acceleration
Apple Silicon (Metal)
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
NVIDIA GPUs (CUDA)
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
AMD GPUs (ROCm)
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
Common patterns
Batch processing
# Pipe a prompt file into llama-cli
# (--batch-size is the prompt-processing batch size in tokens, not a number
# of prompts; see the loop below for one generation per line)
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
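For one generation per line of prompts.txt, a plain shell loop is the simplest approach; it runs sequentially and is purely illustrative.
# Run each line of prompts.txt as a separate prompt
while IFS= read -r prompt; do
./llama-cli -m model.gguf -p "$prompt" -n 100
done < prompts.txt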
Constrained generation
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
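Grammars are plain GBNF text files, so custom constraints are easy to write by hand; the two-word yes/no grammar below is a minimal illustrative example.
# Write a minimal GBNF grammar that only allows "yes" or "no"
cat > yesno.gbnf << 'EOF'
root ::= "yes" | "no"
EOF
./llama-cli \
-m model.gguf \
-p "Is the sky blue? Answer: " \
--grammar-file yesno.gbnf \
-n 4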
Context size
# Increase the context window (the default is small; 512 in older builds)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|---|---|---|---|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|---|---|---|---|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
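Figures like these depend heavily on build flags, thread count, and layer offload, so treat them as rough guides; llama-bench in the repo is the usual way to measure your own hardware (the flags shown are the common ones and may vary by version).
# Benchmark prompt processing (-p tokens) and generation (-n tokens)
./llama-bench \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p 512 \
-n 128 \
-ngl 999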
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 / 3.1 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU
Resources
- GitHub: https://github.com/ggerganov/llama.cpp
- Models: https://huggingface.co/models?library=gguf
- Discord: https://discord.gg/llama-cpp
Quick Install
Copy and paste this command in Claude Code to install this skill:
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs/tree/main/llama-cpp
Related Skills
sglang
SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
llamaguard
LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
evaluating-llms-harness
This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
langchain
LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
