constitutional-ai

majiayu000

Updated 4 days ago

Other

Safety Alignment, Constitutional AI, RLAIF, Self-Critique, Harmlessness, Anthropic, AI Safety, RL From AI Feedback, Claude

About

This skill implements Anthropic's Constitutional AI method for training harmless AI models through self-critique and revision. It takes a two-phase approach: supervised learning on responses the model has critiqued and revised itself, followed by RLAIF (Reinforcement Learning from AI Feedback) for safety alignment. Use it to reduce harmful outputs in your Claude applications without requiring human-labeled examples of harmful content.
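The two phases can be sketched as data-generation steps. This is a minimal illustration, not the skill's actual code: `Model` stands for any text-in, text-out callable you supply, and the principles, prompt templates, and function names are hypothetical. It only builds the training records; fine-tuning and reinforcement learning are left out.

```python
from typing import Callable, Dict, List, Tuple

# Any callable that maps a prompt string to a completion string (assumption).
Model = Callable[[str], str]

# Hypothetical principles for illustration; not the skill's constitution.
PRINCIPLES: List[str] = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response could encourage illegal activity.",
]


def critique_and_revise(
    model: Model, prompt: str, principles: List[str]
) -> Tuple[str, str]:
    """Phase 1: draft a response, then critique and revise it once per principle.

    Returns a (prompt, final_revision) pair usable as a supervised example.
    """
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique."
        )
    return prompt, response


def label_preference(
    model: Model, prompt: str, response_a: str, response_b: str, principle: str
) -> Dict[str, str]:
    """Phase 2: ask the model which response better follows a principle.

    Returns a preference record for training a preference model (RLAIF).
    """
    verdict = model(
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Principle: {principle}\n"
        "Which response is better? Answer with A or B."
    )
    # Treat anything that does not start with "A" as a vote for B.
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In practice the `(prompt, revision)` pairs would feed a supervised fine-tuning run, and the preference records would train a preference model that serves as the reward signal during reinforcement learning.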

Quick Install

Claude Code (recommended)

Primary

npx skills add majiayu000/claude-skill-registry -a claude-code

Plugin Command (alternative)

/plugin add https://github.com/majiayu000/claude-skill-registry

Git Clone (alternative)

git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/constitutional-ai

GitHub Repository

majiayu000/claude-skill-registry

Path: skills/constitutional-ai

Related Skills

instructor

Testing

Instructor is a structured output library that extracts validated data from LLM responses using Pydantic schemas. It automatically retries failed extractions and provides type-safe JSON parsing with streaming support. Use it when you need reliable, validated data extraction from LLMs like OpenAI or Anthropic.

nemo-guardrails

Testing

NeMo Guardrails is a runtime safety framework for LLM applications that adds programmable guardrails. It provides key safety features like jailbreak detection, input/output validation, and hallucination detection using the Colang 2.0 DSL. Use it to enforce safety and compliance rules in production LLM deployments.

constitutional-ai

Other

Constitutional AI trains models to be harmless using a two-phase method: self-critique and revision, then reinforcement learning from AI feedback (RLAIF). It is designed for safety alignment and lets models reduce harmful outputs without relying on human labels for harmfulness. Developers can use this skill to apply the safety training method Anthropic developed for Claude.

llamaguard

Other

LlamaGuard is a specialized 7-8B parameter model from Meta for classifying LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and integrates with common deployment tools like vLLM and Hugging Face, as well as NeMo Guardrails. Use this skill to add a robust, dedicated moderation layer to filter unsafe content in your AI applications.
