sglang

davila7

Updated 5 days ago

Meta, Inference Serving, SGLang, Structured Generation, RadixAttention, Prefix Caching, Constrained Decoding, Agents, JSON Output, Fast Inference, Production Scale

About

SGLang is a high-performance LLM serving framework that uses RadixAttention for automatic prefix caching, which speeds up structured generation and other workloads whose requests share prompt prefixes. It suits developers who need JSON or regex outputs, constrained decoding, or agentic workflows with tool calls. Use it when requests share prefixes and you need up to 5× faster inference than alternatives such as vLLM.
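To make the prefix-caching idea concrete, here is a minimal sketch in plain Python. It is not SGLang's RadixAttention implementation: `ToyPrefixCache`, `Node`, and the whitespace "tokenizer" are hypothetical names invented for illustration, and the trie stores only token identities, where a real serving engine would cache attention key/value tensors. It shows only why a second request that repeats a system prompt needs less new computation than the first.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cached token; children are keyed by the next token."""
    children: dict = field(default_factory=dict)


class ToyPrefixCache:
    """Token-level trie standing in for a tree of cached prompt prefixes.

    Hypothetical illustration only; it does not use or mirror SGLang's API.
    """

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record every token of a processed request in the trie."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())

    def process(self, tokens):
        """Return (reused, computed) token counts for one request."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
# Whitespace splitting is a stand-in for a real tokenizer.
system = "You are a helpful assistant . Answer in JSON .".split()
first = cache.process(system + "What is 2 + 2 ?".split())
second = cache.process(system + "Name a prime number .".split())
print(first, second)
```

The first request finds nothing in the cache, so all of its tokens are computed. The second reuses the ten shared system-prompt tokens and computes only its own five. The larger the shared prefix relative to the unique suffix, the more work is skipped, which is the scenario the speedup claim above refers to.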

Quick Install

Claude Code

Recommended

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin Command (Alternative)

/plugin add https://github.com/davila7/claude-code-templates

Git Clone (Alternative)

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/sglang

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/inference-serving-sglang

anthropic, anthropic-claude, claude, claude-code

Related Skills

awq-quantization

Other

AWQ is a 4-bit weight quantization technique that uses activation patterns to identify and preserve critical weights, enabling 3x faster inference with minimal accuracy loss. It is well suited to deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.


crewai-multi-agent

autogpt-agents

llama-cpp

Other

The llama-cpp skill enables efficient LLM inference on CPU, Apple Silicon, and non-NVIDIA GPUs, making it a good fit for edge deployment or when CUDA is unavailable. It supports GGUF quantization for reduced memory usage and offers significant speedups over PyTorch on CPU. Use it for Macs, AMD/Intel systems, or embedded devices; choose TensorRT-LLM instead for NVIDIA hardware that requires maximum throughput.
