sglang
About
SGLang is a high-performance LLM serving framework that uses RadixAttention for automatic prefix caching, which speeds up structured generation significantly. It suits developers who need JSON or regex-constrained outputs, constrained decoding, or agentic workflows with tool calls. Consider it for workloads with shared prefixes, where it can deliver up to 5× faster inference than alternatives such as vLLM.
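To make the shared-prefix idea concrete, here is a minimal sketch of prefix reuse using a toy token-level trie. It is not SGLang's RadixAttention implementation; `PrefixCache`, `tokens_to_compute`, and the word-level "tokens" are hypothetical names and simplifications chosen for illustration only.

```python
class PrefixCache:
    """Toy token-level trie; a stand-in for a real KV-cache index (assumption)."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def tokens_to_compute(cache, tokens):
    """Count the tokens that still need processing after prefix reuse."""
    hit = cache.match_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - hit


cache = PrefixCache()
system = ["You", "are", "a", "JSON", "assistant", "."]

# The first request pays for every token; the second reuses the shared prefix.
first = tokens_to_compute(cache, system + ["List", "three", "colors"])
second = tokens_to_compute(cache, system + ["List", "two", "animals"])
print(first, second)  # prints "9 2"
```

In this sketch the second request only processes the two tokens that differ, which is the kind of saving that prefix caching targets when many requests share a system prompt or few-shot examples.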
Quick Install
Claude Code
Recommended:
npx skills add davila7/claude-code-templates -a claude-code

Alternatives:
/plugin add https://github.com/davila7/claude-code-templates
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/sglang

Copy and paste this command in Claude Code to install this skill.
Related Skills
awq-quantization
Category: Other
AWQ is a 4-bit weight quantization technique that uses activation patterns to preserve critical weights, enabling 3× faster inference with minimal accuracy loss. It's ideal for deploying large models (7B-70B) on limited GPU memory and is particularly effective for instruction-tuned and multimodal models. This skill integrates with vLLM and Marlin kernels for optimized deployment.
crewai-multi-agent
Category: Meta
CrewAI is a lightweight multi-agent orchestration framework for building teams of specialized AI agents that collaborate autonomously on complex tasks. It enables role-based agent collaboration with memory and supports sequential or hierarchical workflows for production use. The framework is built without LangChain dependencies for lean, fast execution.
autogpt-agents
Category: Meta
AutoGPT Agents is a platform for building and deploying persistent, autonomous AI agents using visual workflows or code. It's ideal for developers creating complex, multi-step automation systems that require continuous operation or external triggers. Key features include a drag-and-drop visual builder and support for deploying agents via webhooks and schedules.
llama-cpp
Category: Other
The llama-cpp skill enables efficient LLM inference on CPU, Apple Silicon, and non-NVIDIA GPUs, making it ideal for edge deployment or when CUDA is unavailable. It supports GGUF quantization for reduced memory usage and offers significant speedups over PyTorch on CPU. Use this for Macs, AMD/Intel systems, or embedded devices, but choose TensorRT-LLM for NVIDIA hardware requiring maximum throughput.
