llamaguard

davila7

Updated 4 days ago

166 views

18,478

1,685

18,478

OtherSafety AlignmentLlamaGuardContent ModerationMetaGuardrailsSafety ClassificationInput FilteringOutput FilteringAI Safety

About

LlamaGuard is a specialized 7-8B parameter model from Meta for classifying LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and integrates with common deployment tools like vLLM and Hugging Face, as well as NeMo Guardrails. Use this skill to add a robust, dedicated moderation layer to filter unsafe content in your AI applications.

Quick Install

Claude Code

Recommended

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin CommandAlternative

/plugin add https://github.com/davila7/claude-code-templates

Git CloneAlternative

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/llamaguard

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/safety-alignment-llamaguard

anthropicanthropic-claudeclaudeclaude-code

Related Skills

nemo-guardrails

Testing

NeMo Guardrails is a runtime safety framework for LLM applications that adds programmable guardrails. It provides key safety features like jailbreak detection, input/output validation, and hallucination detection using the Colang 2.0 DSL. Use it to enforce safety and compliance rules in production LLM deployments.

View skill

clip

Other

CLIP is a vision-language model for zero-shot image classification and cross-modal retrieval, requiring no fine-tuning. It excels at general-purpose tasks like image-text matching, semantic search, and content moderation. Developers can use it for vision-language applications by providing image and text pairs for similarity scoring.

View skill

constitutional-ai

Other

Constitutional AI trains models to be harmless using a two-phase method of self-critique/revision and reinforcement learning from AI feedback (RLAIF). It's designed for safety alignment, enabling models to reduce harmful outputs without relying on human labels. Developers can use this skill to implement the core safety system that powers Claude.

View skill

constitutional-ai

Other

This skill implements Anthropic's Constitutional AI method for training harmless AI models through self-critique and revision. It provides a two-phase approach using supervised learning with AI self-critique followed by RLAIF (Reinforcement Learning from AI Feedback) for safety alignment. Use it to reduce harmful outputs in your Claude applications without requiring human-labeled harmful data.

View skill