gguf-quantization

davila7

Updated 9 days ago

Design, GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization

About

This skill covers GGUF quantization for efficient model deployment on consumer hardware such as CPUs and Apple Silicon. It offers quantization levels from 2 to 8 bits and does not require a GPU. Use it when optimizing models for local inference tools or resource-constrained environments.
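As a rough illustration of what the choice of bit width means for file size, the sketch below estimates the size of a quantized model. It assumes the simple block layout used by llama.cpp's Q4_0 and Q8_0 types (32 weights plus one fp16 scale per block); other GGUF quantization types use different layouts and land at different effective bit widths. The 7B parameter count and the function names are hypothetical and are not part of the skill.

```python
# Rough size estimate for a quantized model file.
# Assumption: Q4_0 / Q8_0 style blocks, i.e. 32 weights plus one fp16 scale.
BLOCK_WEIGHTS = 32
SCALE_BYTES = 2  # one fp16 scale per block


def bits_per_weight(quant_bits: int) -> float:
    """Effective bits per weight once the per-block scale is included."""
    block_bytes = BLOCK_WEIGHTS * quant_bits / 8 + SCALE_BYTES
    return block_bytes * 8 / BLOCK_WEIGHTS


def estimate_size_gb(n_params: float, quant_bits: int) -> float:
    """Approximate size in GB (10^9 bytes), ignoring file metadata and
    any tensors that a real quantizer keeps at higher precision."""
    return n_params * bits_per_weight(quant_bits) / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    print(f"fp16 : ~{n_params * 2 / 1e9:.1f} GB")
    for bits in (8, 4):
        print(f"{bits}-bit: ~{estimate_size_gb(n_params, bits):.1f} GB "
              f"({bits_per_weight(bits)} bits per weight)")
```

The estimate is only a lower bound on memory use at inference time, since the KV cache and runtime buffers come on top of the weights.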

Quick Install

Claude Code (recommended)

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin Command (alternative)

/plugin add https://github.com/davila7/claude-code-templates

Git Clone (alternative)

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/gguf-quantization

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/optimization-gguf

anthropic, anthropic-claude, claude, claude-code

Related Skills

quantizing-models-bitsandbytes

Other

This skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, cutting memory use by roughly 50% (8-bit) to 75% (4-bit) relative to 16-bit weights with minimal accuracy loss. It's ideal for running larger models on limited GPU memory, and it supports the INT8, NF4, and FP4 formats. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.

awq-quantization

Other

AWQ is a 4-bit weight quantization technique that uses activation patterns to preserve critical weights, enabling 3x faster inference with minimal accuracy loss. It's ideal for deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.

hqq-quantization

Other

HQQ enables fast, calibration-free quantization of LLMs to 4-, 3-, or 2-bit precision, so no calibration dataset is needed. It's ideal for rapid quantization workflows and deployment with vLLM or HuggingFace Transformers. Key advantages include significantly faster quantization than methods like GPTQ and support for fine-tuning quantized models.

llama-cpp

Other

The llama-cpp skill enables efficient LLM inference on CPU, Apple Silicon, and non-NVIDIA GPUs, making it ideal for edge deployment or when CUDA is unavailable. It supports GGUF quantization for reduced memory usage and offers significant speedups over PyTorch on CPU. Use this for Macs, AMD/Intel systems, or embedded devices, but choose TensorRT-LLM for NVIDIA hardware requiring maximum throughput.