---
name: web-to-md
description: Use when user wants to fetch a webpage as markdown, scrape a URL, or get content from a website. This is the default skill for all web-to-markdown tasks. Tries a free service first, falls back to a paid one for JS-heavy sites.
---

# Web to Markdown - Smart Website Fetcher

Fetch any URL as clean markdown. Tries the free defuddle.md service first, and automatically falls back to webcrawlerapi (paid, JS-capable) when the returned content is insufficient.

## When to Use This Skill

Use when you need to fetch a webpage and get its content as markdown. This is the **default skill for all web-to-markdown tasks** -- it handles the routing automatically.

## How It Works

1. Fetches the URL via defuddle.md (free, ~1-2 sec)
2. Strips YAML frontmatter and measures the body content
3. If body > 500 bytes -> returns the defuddle.md result (done)
4. If body <= 500 bytes -> falls back to webcrawlerapi (paid, ~15-30 sec)

## Setup

### Option 1: Using the fetch script (recommended)

Create a shell script (e.g., `~/.claude/skills/web-to-md/fetch.sh`) with this content:

```bash
#!/usr/bin/env bash
# web-to-md: Fetch any URL as clean markdown.
# Uses defuddle.md (free) first, falls back to webcrawlerapi (paid) if insufficient content.
#
# Usage: bash fetch.sh <url>
# Output: markdown to stdout, status messages to stderr

set -euo pipefail

MIN_BODY_BYTES=500
DEFUDDLE_BASE="https://defuddle.md"
# Set this to the path of your webcrawlerapi fetch script, or remove the fallback
WEBCRAWLER_SCRIPT="YOUR_PATH/webcrawlerapi/fetch.py"

url="${1:-}"
if [[ -z "$url" ]]; then
  echo "Usage: bash fetch.sh <url>" >&2
  exit 1
fi

# Normalize URL: ensure it has a scheme for defuddle
defuddle_url="$url"
if [[ ! "$defuddle_url" =~ ^https?:// ]]; then
  defuddle_url="https://$defuddle_url"
fi

# Step 1: Try defuddle.md
echo "[web-to-md] Trying defuddle.md..." >&2
defuddle_result=$(curl -sL --max-time 15 "${DEFUDDLE_BASE}/${defuddle_url}" 2>/dev/null) || defuddle_result=""

# Step 2: Check content quality - strip YAML frontmatter and measure the body.
# Frontmatter is only recognized when the document starts with a "---" line;
# documents without frontmatter pass through unchanged.
if [[ -n "$defuddle_result" ]]; then
  body=$(echo "$defuddle_result" | awk '
    NR == 1 && /^---[[:space:]]*$/ { in_front = 1; next }
    in_front && /^---[[:space:]]*$/ { in_front = 0; next }
    !in_front { print }
  ')
  body_bytes=$(echo "$body" | tr -d '[:space:]' | wc -c | tr -d ' ')
  if [[ "$body_bytes" -gt "$MIN_BODY_BYTES" ]]; then
    echo "[web-to-md] defuddle.md returned ${body_bytes} bytes of content. Using it." >&2
    echo "$defuddle_result"
    exit 0
  else
    echo "[web-to-md] defuddle.md returned only ${body_bytes} bytes (threshold: ${MIN_BODY_BYTES}). Falling back..." >&2
  fi
else
  echo "[web-to-md] defuddle.md returned an empty response. Falling back..." >&2
fi

# Step 3: Fall back to webcrawlerapi
echo "[web-to-md] Trying webcrawlerapi..." >&2
if [[ ! -f "$WEBCRAWLER_SCRIPT" ]]; then
  echo "[web-to-md] Error: webcrawlerapi script not found at $WEBCRAWLER_SCRIPT" >&2
  echo "[web-to-md] To use the paid fallback, set up webcrawlerapi and update the WEBCRAWLER_SCRIPT path." >&2
  echo "[web-to-md] Returning the defuddle.md result (may be incomplete)." >&2
  echo "$defuddle_result"
  exit 1
fi

python3 "$WEBCRAWLER_SCRIPT" "$url"
```

Make it executable: `chmod +x ~/.claude/skills/web-to-md/fetch.sh`

### Option 2: Without the paid fallback

If you only want the free tier, you can skip the webcrawlerapi setup entirely. The script will use defuddle.md for everything and warn you when the content seems insufficient.

### Setting up the paid fallback (optional)

[WebcrawlerAPI](https://webcrawlerapi.com) is a pay-per-use service (credits don't expire). To set it up:

1. Sign up and get an API key
2. Create a Python script that calls their API with your key
3. Update `WEBCRAWLER_SCRIPT` in fetch.sh to point to it

## Usage

```bash
bash ~/.claude/skills/web-to-md/fetch.sh "URL_HERE"
```

### Examples

```bash
# Blog post (will use defuddle.md - free)
bash ~/.claude/skills/web-to-md/fetch.sh "https://example.com/blog-post"

# Meetup events page (will auto-fallback to webcrawlerapi if configured)
bash ~/.claude/skills/web-to-md/fetch.sh "https://www.meetup.com/some-group/events/"
```

## When Each Backend Gets Used

| Site type | Backend used | Cost |
|-----------|--------------|------|
| Blogs, articles, docs, wikis | defuddle.md | Free |
| JS-rendered SPAs (Meetup, dashboards) | webcrawlerapi | Paid |
| Cloudflare-protected sites | webcrawlerapi | Paid |
| Booking calendars, dynamic grids | webcrawlerapi | Paid |

## Output

- Markdown to stdout, status messages to stderr
- defuddle.md results include YAML frontmatter (title, source, domain, word_count)
- webcrawlerapi results are raw markdown (may include sidebar/footer noise)
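The URL-normalization step in fetch.sh (prepending `https://` when the input has no scheme) can be checked in isolation. A minimal sketch; the sample URLs are illustrative:

```bash
#!/usr/bin/env bash
# Mirrors the scheme normalization in fetch.sh: if the URL does not
# already start with http:// or https://, prepend https://.
normalize_url() {
  local u="$1"
  if [[ ! "$u" =~ ^https?:// ]]; then
    u="https://$u"
  fi
  printf '%s\n' "$u"
}

normalize_url "example.com/page"        # -> https://example.com/page
normalize_url "http://example.com/page" # -> http://example.com/page (unchanged)
```

Note that the regex is anchored with `^`, so a URL that merely *contains* `http://` later in the string still gets a scheme prepended.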
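The frontmatter-stripping quality check in fetch.sh can also be exercised on its own, which is useful for tuning `MIN_BODY_BYTES`. A minimal sketch; the sample document and resulting byte count are illustrative:

```bash
#!/usr/bin/env bash
# Feed a small markdown document with YAML frontmatter through the same
# awk filter fetch.sh uses, then count non-whitespace bytes in the body.
sample=$'---\ntitle: Example\n---\n# Heading\n\nSome body text.'

body=$(printf '%s\n' "$sample" | awk '
  NR == 1 && /^---[[:space:]]*$/ { in_front = 1; next }
  in_front && /^---[[:space:]]*$/ { in_front = 0; next }
  !in_front { print }
')
body_bytes=$(printf '%s' "$body" | tr -d '[:space:]' | wc -c | tr -d ' ')

echo "$body_bytes"  # -> 21 ("# Heading" and "Some body text." minus whitespace)
```

Because whitespace is stripped before counting, the threshold measures actual text, not blank lines or indentation padding.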