training-llms-megatron

davila7

Updated 9 days ago

273 views

18,478

1,685

18,478

DesignMegatron-CoreLarge-Scale TrainingNVIDIATensor ParallelismPipeline ParallelismModel ParallelismH100Distributed TrainingProduction

About

This skill trains massive LLMs (2B-462B parameters) using NVIDIA's Megatron-Core framework for maximum GPU efficiency. Use it when training models over 1B parameters and needing advanced parallelism like tensor, pipeline, or expert parallelism. It's a production-ready framework proven on models like Nemotron and LLaMA.

Quick Install

Claude Code

Recommended

Primary

npx skills add davila7/claude-code-templates -a claude-code

Plugin CommandAlternative

/plugin add https://github.com/davila7/claude-code-templates

Git CloneAlternative

git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/training-llms-megatron

Copy and paste this command in Claude Code to install this skill

GitHub Repository

davila7/claude-code-templates

Path: cli-tool/components/skills/ai-research/distributed-training-megatron-core

anthropicanthropic-claudeclaudeclaude-code

Related Skills

openrlhf-training

Design

OpenRLHF is a high-performance RLHF training framework for fine-tuning large language models (7B-70B+ parameters) using methods like PPO, DPO, and GRPO. It leverages Ray for distributed architecture and vLLM for accelerated inference, achieving speeds 2x faster than alternatives like DeepSpeedChat. Use this skill when you need efficient, distributed RLHF training with optimized GPU resource sharing and ZeRO-3 support.

View skill

huggingface-tokenizers

Documents

This skill provides high-performance tokenization using HuggingFace's Rust-based library, processing 1GB of text in under 20 seconds. It supports BPE, WordPiece, and Unigram algorithms while enabling custom tokenizer training and alignment tracking. Use it when you need production-fast tokenization or to build custom tokenizers integrated with the transformers ecosystem.