
crawl

majiayu000

About

This skill enables web scraping of login-protected websites using an authenticated Chrome browser via Chrome DevTools Protocol. It supports article extraction and batch processing through direct browser automation. Use it specifically when you need to access content that requires user authentication to view fully.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/majiayu000/claude-skill-registry
Git Clone (Alternative)
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/crawl

Copy and paste this command in Claude Code to install this skill

Documentation

crawl

Reference

First read and study how-to-crawl-with-chrome-dev-mcp.md.

Instructions

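This skill is only for websites whose content can be viewed in full only after logging in. While processing, do not try to generate or use other frameworks or programs to fetch the content, because content fetched that way is incomplete.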

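  1. Use a Python script to first determine whether the current environment is macOS or Windows.
  2. Start a new browser instance suited to the current operating system.
  3. Check whether the chrome-devtools MCP tool is ready; if it is not, reconnect the MCP tool.
  4. You may only use chrome-devtools to read information from the browser; call the MCP tool to complete the task the user gives you.
  5. Important: all output files and programs must be saved in the output folder under the project root.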
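Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the skill's own script: the install paths, the `build_chrome_command` helper, and the `output/chrome-profile` profile directory are assumptions, and only port 9222 comes from the MCP configuration in this document.

```python
import platform

# Assumed default install locations; adjust them for your machine.
CHROME_PATHS = {
    "Darwin": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "Windows": r"C:\Program Files\Google\Chrome\Application\chrome.exe",
}


def build_chrome_command(system=None, port=9222, profile_dir="output/chrome-profile"):
    """Return the argv list that starts Chrome with remote debugging enabled.

    `system` defaults to platform.system(), which reports "Darwin" on macOS
    and "Windows" on Windows. `profile_dir` is a hypothetical location for a
    dedicated browser profile.
    """
    system = system or platform.system()
    if system not in CHROME_PATHS:
        raise RuntimeError(f"Unsupported platform: {system}")
    return [
        CHROME_PATHS[system],
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]
```

Passing the returned list to `subprocess.Popen` would start the browser; you would then log in to the target site in that window before asking the MCP tool to read pages.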

MCP Configuration Requirements

{
  "mcpServers": {
    "chrome-devtools": {
      "type": "stdio",
      "command": "npx",
      "args": [
        "chrome-devtools-mcp@latest",
        "--browser-url=http://127.0.0.1:9222"
      ],
      "env": {}
    }
  }
}
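Before relying on the MCP server, it can help to confirm that Chrome is actually listening on the address given in `--browser-url`. The sketch below polls Chrome's `/json/version` HTTP endpoint; the helper names are hypothetical and this check is an assumption about how readiness could be tested, not part of the skill itself.

```python
import http.client
import json
import urllib.request


def devtools_version_url(browser_url="http://127.0.0.1:9222"):
    """Build the URL of Chrome's version endpoint for a debugging address."""
    return browser_url.rstrip("/") + "/json/version"


def is_devtools_ready(browser_url="http://127.0.0.1:9222", timeout=2.0):
    """Return True if a browser answers on the remote debugging address."""
    try:
        with urllib.request.urlopen(devtools_version_url(browser_url), timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    # A debuggable Chrome reports the WebSocket URL that clients attach to.
    return isinstance(info, dict) and "webSocketDebuggerUrl" in info
```

If `is_devtools_ready()` returns False, start the browser first and then reconnect the chrome-devtools MCP tool.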

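🚀 Core Features

1. Smart Browser Management

  • Automatic environment detection: identifies Windows, macOS, or Linux
  • Automatic browser launch: starts a Chrome instance suited to the detected system
  • MCP connection check: automatically verifies the Chrome DevTools MCP connection status
  • Proxy configuration support: supports automatic proxy configuration

2. Unified API Integration

  • API service management: automatically starts and manages the API service
  • Data format validation: ensures data meets the API's requirements
  • Batch data writes: writes data efficiently in batches
  • Error retry mechanism: automatically retries failed data writes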
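The batch write and retry behaviour listed above could look roughly like this. The page does not document the API, so `write_batch` is a placeholder callable supplied by the caller, and the batch size, attempt count, and exponential backoff are assumptions rather than the skill's actual settings.

```python
import time


def write_with_retry(write_batch, records, batch_size=50, max_attempts=3, base_delay=1.0):
    """Send records in batches, retrying each failed batch with backoff.

    `write_batch` is a hypothetical callable that accepts a list of records
    and raises an exception when the write fails.
    """
    written = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        for attempt in range(1, max_attempts + 1):
            try:
                write_batch(batch)
                written += len(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this batch after the final attempt
                # Wait base_delay, then 2x, then 4x, ... before retrying.
                time.sleep(base_delay * 2 ** (attempt - 1))
    return written
```

Re-raising after the last attempt keeps a persistent failure visible to the caller instead of silently dropping a batch.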

3. Article Content Extraction

The integrated article content extractor supports the following sites:

  • X/Twitter (x.com) - tweet content extraction
  • The Atlantic (theatlantic.com)
  • Medium (medium.com)
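One way to decide whether a URL belongs to a supported site is to compare its hostname against the three domains listed above. The `supported_site` helper is hypothetical; the skill's real extractor may select sites differently.

```python
from urllib.parse import urlparse

# Domains taken from the supported-site list above.
SUPPORTED_DOMAINS = ("x.com", "theatlantic.com", "medium.com")


def supported_site(url):
    """Return the matching supported domain for a URL, or None."""
    host = (urlparse(url).hostname or "").lower()
    for domain in SUPPORTED_DOMAINS:
        # Accept the bare domain and any subdomain such as www.theatlantic.com.
        if host == domain or host.endswith("." + domain):
            return domain
    return None
```

Matching on a leading dot avoids false positives such as a hypothetical `notx.com` being treated as `x.com`.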

📁 Output Directory Structure

output/
├── logs/              # execution logs
├── data/              # data files
├── snapshots/         # page snapshots
└── reports/           # execution reports
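A short sketch for creating this layout before a run, assuming the script is started from the project root. `ensure_output_dirs` is a hypothetical helper, not a function documented for crawl_manager.py.

```python
from pathlib import Path

# Subdirectory names taken from the tree above.
OUTPUT_SUBDIRS = ("logs", "data", "snapshots", "reports")


def ensure_output_dirs(project_root="."):
    """Create output/ and its subdirectories if missing; return the output path."""
    output = Path(project_root) / "output"
    for name in OUTPUT_SUBDIRS:
        (output / name).mkdir(parents=True, exist_ok=True)
    return output
```

Because `exist_ok=True` is set, calling the helper at the start of every run is safe even when the directories already exist.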

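📚 Related Documents

Document | Description | Purpose
QUICK_START.md | Quick start guide | Getting started
EXAMPLES.md | Detailed usage examples | Reference code
BEST_PRACTICES.md | Best practices guide | Advanced optimization
crawl_manager.py | Core manager | Direct use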

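🎯 Quick Start

Method 1: Use the core manager (recommended)

from .crawl_manager import extract_x_tweets

# Extract Elon Musk's 5 most recent tweets
result = extract_x_tweets("elonmusk", 5)
print(result)

Method 2: Use the standard template

# See the complete example in EXAMPLES.md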

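⚡ Performance Characteristics

  • One-step startup - automatic environment configuration
  • Smart retry - automatic error recovery
  • Data validation - ensures data quality
  • Log tracing - complete execution records
  • Batch processing - efficient data handling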

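🚨 Important Reminders

  1. Output directory: all output files must be saved under the output/ directory
  2. URL requirement: every data record must have a valid URL field
  3. Dependency check: make sure Chrome and the related dependencies are installed before use
  4. Network environment: configure proxy settings as needed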
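The URL requirement in item 2 can be checked before any data is written. The sketch below assumes each record is a dict with a `url` key; the page does not specify the field name or record format, so both are illustrative.

```python
from urllib.parse import urlparse


def has_valid_url(record):
    """Return True if record["url"] is an absolute http(s) URL.

    The "url" key is an assumed field name.
    """
    url = record.get("url")
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 hosts.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Filtering with `[r for r in records if has_valid_url(r)]` before a write keeps incomplete records out of the data files.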

📖 Detailed Documentation

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/crawl
