PhantomCrawl is a command-line web crawler for developers who need to extract data from the real web - not just simple static pages. Most scrapers break the moment they hit a site with anti-bot protection. PhantomCrawl does not.
It works by trying the fastest, cheapest method first and only escalating to more powerful techniques when necessary. This means it is fast on simple sites and capable on protected ones - all automatically.
PhantomCrawl is self-hosted and free for personal use. You bring your own AI keys (Groq is free). No subscriptions, no per-request charges, no vendor lock-in.
How it works
PhantomCrawl uses a 4-layer escalation engine. Every URL starts at Layer 1 and only moves to the next layer if the current one fails or returns no meaningful content. Most of the internet never leaves Layer 1.
Layer 1 - Stealth HTTP
Makes a direct HTTP request but disguises the connection at the TLS level using utls with the HelloChrome_120 ClientHello - the same TLS fingerprint as a real Chrome browser. Many anti-bot systems (including Cloudflare) fingerprint bots by their TLS handshake. PhantomCrawl's handshake is indistinguishable from a real browser's.
Layer 2 - Embedded Data Extraction
If Layer 1 returns HTML but the content is not useful, Layer 2 inspects the HTML for embedded data. It looks for window.__NEXT_DATA__, window.__INITIAL_STATE__, window.__NUXT__, JSON-LD structured data, and API endpoint patterns. Many modern SPAs ship their data pre-embedded in the HTML even before JavaScript runs - Layer 2 extracts it directly.
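As a rough illustration of what Layer 2 looks for (a minimal standalone sketch, not PhantomCrawl's actual code, which covers more patterns than this one), here is how a Next.js payload can be pulled out of raw HTML without executing any JavaScript:

```python
# Minimal sketch: extract the pre-embedded Next.js payload from raw HTML.
import json
import re
import urllib.request

html = urllib.request.urlopen("https://example.com").read().decode("utf-8", "replace")
match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))        # the SPA's page props as clean JSON
    print(json.dumps(data, indent=2)[:500])  # preview the structured data
else:
    print("no embedded Next.js payload on this page")
```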
Layer 2.5 - API Interception
This is what makes PhantomCrawl unique. Instead of scraping the rendered HTML from a headless browser, Layer 2.5 intercepts the actual API responses the browser receives during page load - the raw JSON from XHR and fetch calls. The result is clean structured data with no boilerplate and no parsing noise. This layer only fires when a browser client (Browserless or go-rod) is available.
Layer 3 - Full Browser Rendering
The last resort. PhantomCrawl launches a real browser - go-rod if Chrome is installed locally (detected automatically), or Browserless via API - and fully renders the page: executing JavaScript, handling SPAs, and waiting for dynamic content to load. Slower and more expensive, but it handles anything.
The escalation is automatic. You do not need to configure which layer to use. PhantomCrawl decides based on what each site returns.
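Conceptually, the escalation loop looks something like this (hypothetical stand-in functions, not PhantomCrawl's source):

```python
# Conceptual sketch of the escalation order: each stand-in returns extracted
# content or None, and the loop stops at the first layer that succeeds.
def layer1_stealth_http(url): ...
def layer2_embedded_data(url): ...
def layer2_5_api_intercept(url): ...
def layer3_full_browser(url): ...

def crawl(url):
    for layer in (layer1_stealth_http, layer2_embedded_data,
                  layer2_5_api_intercept, layer3_full_browser):
        result = layer(url)
        if result:        # "meaningful content" check, simplified
            return result
    return None           # every layer failed
```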
Installation
PhantomCrawl ships as a pre-built binary. No Go installation required. Download the binary for your platform and move it to your PATH.
Linux / macOS / Termux
# Linux
chmod +x phantomcrawl-linux-amd64
sudo mv phantomcrawl-linux-amd64 /usr/local/bin/phantomcrawl
# Termux (Android)
chmod +x phantomcrawl-linux-arm64
mv phantomcrawl-linux-arm64 $PREFIX/bin/phantomcrawl
# macOS
chmod +x phantomcrawl-darwin-arm64
sudo mv phantomcrawl-darwin-arm64 /usr/local/bin/phantomcrawl
Build from source
Requires Go 1.21+
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
go build -ldflags="-s -w" -o phantomcrawl .
Quickstart
Get your first crawl running in under 2 minutes.
phantomcrawl init
This creates a crawl.json template and a blank urls.txt in the current directory.
Open urls.txt and add one URL per line.
https://example.com
https://news.ycombinator.com
https://cloudflare.com
phantomcrawl start
Output is saved to ~/phantomcrawl/scraped/ by default.
Prefer a visual interface? Open ui.html (included in the repo) in any browser to generate your crawl.json and urls.txt without touching the terminal.
Crawl Settings
All configuration lives in crawl.json in your working directory. Here is a full example with every field:
{
"urls_file": "./urls.txt",
"batch_size": 3,
"throttle": 5,
"depth": 0,
"depth_limit": 10,
"stay_on_domain": true,
"output": "~/phantomcrawl/scraped",
"retry": {
"max_attempts": 3,
"backoff": "exponential",
"respect_retry_after": true
}
}
| Field | Type | Description |
|---|---|---|
| urls_file | string | Path to your URLs file. One URL per line. Lines starting with # are ignored. |
| batch_size | number | How many URLs to crawl concurrently per batch. Keep this low (2-5) to avoid triggering rate limits. |
| throttle | number | Seconds to wait between batches. A random jitter is applied so timing is never predictable. |
| depth | number | How deep to follow links from seed URLs. 0 means only scrape the seed URLs themselves. 1 means follow links one level deep. |
| depth_limit | number | Maximum number of child links to follow per parent URL. 0 means unlimited. Useful when a site has thousands of links on its homepage. |
| stay_on_domain | bool | When following links at depth, only follow links that stay within the seed URL's domain. Prevents crawling the entire internet. |
| output | string | Directory where scraped data is saved. Supports ~ for home directory. |
| retry.max_attempts | number | How many times to retry a failed request before giving up. |
| retry.backoff | string | "exponential" doubles the wait time between each retry. Reduces hammering on rate-limited sites. |
| retry.respect_retry_after | bool | When a server returns a Retry-After header, wait that long before retrying. |
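The retry policy in sketch form (illustrative only, not PhantomCrawl's implementation):

```python
# Exponential backoff with an honoured Retry-After header.
import random
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as err:
            if attempt == max_attempts - 1:
                raise                                # out of attempts
            retry_after = err.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)             # respect_retry_after
            else:
                delay = base_delay * 2 ** attempt    # exponential: 1s, 2s, 4s...
            time.sleep(delay + random.random())      # plus a little jitter
```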
Scrape Options
Control what data is extracted from each page.
"scrape": {
"text": true, // Extract visible text content
"links": true, // Extract all hyperlinks (resolved to absolute URLs)
"images": "links_only", // Extract image URLs
"videos": "links_only", // Extract video/iframe URLs
"documents": "links_only", // Extract PDF, DOC, XLS links
"emails": false, // Extract email addresses
"phone_numbers": false, // Extract phone numbers
"metadata": true // Extract meta tags (title, description, OG, Twitter)
}
All extracted links are automatically resolved to absolute URLs. A relative link like /about on https://example.com becomes https://example.com/about.
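This mirrors standard URL resolution. In Python, for example:

```python
from urllib.parse import urljoin

print(urljoin("https://example.com/docs/", "/about"))  # https://example.com/about
print(urljoin("https://example.com/docs/", "guide"))   # https://example.com/docs/guide
```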
AI Cleaning
Raw HTML is noisy - navigation menus, cookie banners, footers, ads. AI cleaning strips all of that and extracts only the meaningful content. PhantomCrawl supports Groq (free tier) and OpenAI.
Setup with Groq (recommended, free)
Go to console.groq.com and sign up. Free tier gives you 100,000 tokens per day.
In the same directory as your crawl.json, create a .env file:
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here
Multiple keys from different accounts are rotated automatically to maximize your daily quota.
"ai": {
"enabled": true,
"save_raw": true,
"save_cleaned": true,
"max_concurrent_cleans": 1,
"model": "llama-3.3-70b-versatile",
"provider": "groq",
"key_rotation": "random",
"keys": ["$GROQ_KEY_1", "$GROQ_KEY_2"],
"prompt": "You are a web content extractor..."
}
Groq's free tier has a daily token limit (100k tokens per day per account). If you hit it mid-crawl, PhantomCrawl will show which URLs are pending and automatically resume when you run it again - no data is lost.
AI config fields
| Field | Type | Description |
|---|---|---|
| enabled | bool | Turn AI cleaning on or off. Raw data is always saved regardless of this setting. |
| save_raw | bool | Save the full raw HTML in raw.json. Useful for debugging or reprocessing later. |
| save_cleaned | bool | Save the AI-cleaned text in cleaned.json. |
| max_concurrent_cleans | number | How many URLs to AI-clean at the same time. Keep at 1 on free tier to avoid rate limits. |
| model | string | The LLM model to use. For Groq: llama-3.3-70b-versatile. For OpenAI: gpt-4o-mini. |
| provider | string | "groq" or "openai". |
| key_rotation | string | "random" picks a key randomly. "round-robin" cycles through keys in order. |
| keys | array | List of API keys. Use $VAR_NAME to reference your .env file. |
| prompt | string | The instruction sent to the LLM. Customize this to extract specific types of content. |
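The difference between the two rotation modes, sketched (illustrative, not PhantomCrawl internals):

```python
# "random" picks any key each call; "round-robin" cycles through them in order.
import itertools
import random

keys = ["gsk_key_one", "gsk_key_two", "gsk_key_three"]

print(random.choice(keys))        # "random": any key, each time
rotation = itertools.cycle(keys)  # "round-robin": 1, 2, 3, 1, 2, ...
print(next(rotation), next(rotation))
```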
Proxy Setup
Route requests through proxies to avoid IP bans on large crawls. Proxies are tunneled at the TCP level - traffic goes through the proxy before the TLS handshake, making it look like the request originated from the proxy's IP.
PROXY_1=http://user:[email protected]:8080
PROXY_2=http://proxy2.example.com:8080
"anti_bot": {
"rotate_user_agents": true,
"request_delay_jitter": true,
"proxy": {
"enabled": true,
"key_rotation": "random",
"urls": ["$PROXY_1", "$PROXY_2"]
}
}
For testing, Webshare offers 10 free proxies. Set authentication to IP-based (no username/password needed) and format URLs as http://host:port.
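Before a large crawl, it can help to confirm a proxy actually routes your traffic. One quick check, independent of PhantomCrawl (api.ipify.org is just one of several IP-echo services):

```python
# Fetch through the proxy and confirm the echoed IP is the proxy's, not yours.
import urllib.request

proxy = urllib.request.ProxyHandler({
    "http": "http://host:port",   # substitute your proxy URL
    "https": "http://host:port",
})
opener = urllib.request.build_opener(proxy)
print(opener.open("https://api.ipify.org").read().decode())
```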
Layer 3 - Headless Browser
Layer 3 is used automatically when Layers 1, 2, and 2.5 all fail to return meaningful content. It runs a real browser to fully render the page.
Option A: Local Chrome (automatic)
If Chrome or Chromium is installed on your machine, PhantomCrawl detects it automatically and uses go-rod to control it. No configuration needed.
phantomcrawl start
# You will see:
✓ Chrome detected. Using go-rod for Layer 3.
Option B: Browserless (cloud Chrome)
If Chrome is not installed (common on servers and mobile), you can use Browserless - a cloud Chrome service. Free tier available.
BROWSERLESS_KEY=your_key_here
"layer3": {
"key_rotation": "random",
"keys": ["$BROWSERLESS_KEY"]
}
If neither Chrome nor Browserless keys are configured, Layer 3 is unavailable. PhantomCrawl will still work via Layers 1 and 2, which cover the vast majority of the web.
Environment Variables
Never put API keys directly in crawl.json. Use a .env file instead and reference variables with $VAR_NAME in your config.
# AI keys
GROQ_KEY_1=gsk_abc123...
GROQ_KEY_2=gsk_def456...
# Browser
BROWSERLESS_KEY=your_browserless_key
# Proxies
PROXY_1=http://host:port
PROXY_2=http://user:pass@host:port
Add .env and crawl.json to your .gitignore so you never accidentally commit API keys.
.env
crawl.json
urls.txt
scraped/
Commands
phantomcrawl init - Creates a crawl.json template in the current directory. Run this first in a new project folder.
phantomcrawl start - Starts crawling using crawl.json and urls.txt. Automatically resumes if a previous crawl was interrupted.
phantomcrawl reset - Clears the resume state so start treats all URLs as fresh. Does not delete your scraped data files.
Resume behavior
PhantomCrawl tracks every crawled URL in a local SQLite database at ~/.phantomcrawl/state.db. If a crawl is interrupted - whether by a network error, token quota, or manual stop - you can resume exactly where you left off by running phantomcrawl start again without resetting. Already-crawled URLs are skipped. Pending AI cleans are retried.
Output Format
Each crawled URL creates a folder under your output directory. The folder name is derived from the page title or URL slug.
~/phantomcrawl/scraped/
cloudflare.com/
cloudflare/
raw.json # Full extracted data + raw HTML
cleaned.json # AI-cleaned text + structured data
en-gb/
raw.json
cleaned.json
raw.json
Contains everything extracted from the page before AI processing.
{
"url": "https://cloudflare.com",
"title": "Cloudflare - The Web Performance & Security Company",
"content": "Extracted text content...",
"links": ["https://cloudflare.com/products", "..."],
"images": ["https://cloudflare.com/img/logo.svg"],
"documents": [],
"emails": [],
"metadata": { "description": "...", "og:image": "..." },
"layer_used": "layer1",
"crawled_at": "2026-03-14T...",
"raw": "<!DOCTYPE html>..."
}
cleaned.json
AI-processed version with clean text and all structured data included.
{
"url": "https://cloudflare.com",
"title": "Cloudflare - The Web Performance & Security Company",
"cleaned": "Cloudflare is a global network...\nProducts: CDN, DDoS protection...",
"links": ["https://cloudflare.com/products", "..."],
"images": ["https://cloudflare.com/img/logo.svg"],
"metadata": { "description": "..." },
"crawled_at": "2026-03-14T..."
}
Converting to other formats
Output is JSON, so you can convert it to any format with standard tools. In Python, for example:
import json, csv

with open('scraped/cloudflare.com/cloudflare/cleaned.json') as f:
    data = json.load(f)

print(data['cleaned'])     # AI-cleaned text
print(data['links'])       # All extracted links
print(data['metadata'])    # Page metadata

# For example, dump the extracted links to CSV:
with open('links.csv', 'w', newline='') as f:
    csv.writer(f).writerows([link] for link in data['links'])
PhantomClean
While PhantomCrawl handles extraction, PhantomClean handles quality. It processes folders in configurable concurrent batches, watches for new files in real time as your scraper runs, and exports results in JSON, CSV, XML, TXT, or HTML.
Run PhantomCrawl and PhantomClean simultaneously in two terminals. PhantomClean watches the scraped folder and cleans files as soon as PhantomCrawl drops them. No waiting for the crawl to finish.
Installation
PhantomClean ships as a pre-built binary. Download for your platform from GitHub Releases.
# Linux
chmod +x phantomclean-linux-amd64
sudo mv phantomclean-linux-amd64 /usr/local/bin/phantomclean
# Termux (Android)
chmod +x phantomclean-linux-arm64
mv phantomclean-linux-arm64 $PREFIX/bin/phantomclean
# macOS
chmod +x phantomclean-darwin-arm64
sudo mv phantomclean-darwin-arm64 /usr/local/bin/phantomclean
Input Format Requirements
PhantomClean expects scraped data in a specific folder/file structure. PhantomCrawl produces this automatically — using any other scraper requires matching this format.
scraped/
site-name/
page-title/
raw.json # required — full extracted data
cleaned.json # optional — preferred over raw.json if present
Each raw.json must contain at least one of these fields. PhantomClean checks them in this order:
| Field | Type | Description |
|---|---|---|
| content | string | Plain text content — checked first. This is where PhantomCrawl puts extracted text. |
| text | string | Fallback text field. |
| html | string | Raw HTML — last resort. PhantomClean will extract text from it. |
If none of content, text, or html are present, the file is skipped with reason no text content found.
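The selection logic amounts to something like this (a minimal sketch of the order described above):

```python
# Check content, then text, then html; skip the file if none are present.
def pick_text(record: dict) -> str | None:
    for field in ("content", "text", "html"):
        value = record.get(field)
        if value:
            return value
    return None  # file skipped: "no text content found"
```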
These fields are optional but are preserved in the cleaned output when present:
| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Page title — trademark symbols and site suffixes are stripped automatically |
| links | array | Extracted links |
| images | array | Image URLs — CDN proxy wrappers are automatically unwrapped |
| emails | array | Extracted email addresses |
| phones | array | Extracted phone numbers |
| metadata.description | string | Page meta description |
| layer_used | string | Which PhantomCrawl layer extracted this page |
| crawled_at | string | ISO timestamp of when the page was crawled |
PhantomClean is designed for PhantomCrawl output. Using a different scraper is supported as long as the JSON structure matches the format above. Mismatched field names mean PhantomClean either skips the file or produces empty output.
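If you are feeding PhantomClean from another scraper, a small adapter that writes its output into the expected layout is usually enough. A minimal sketch, with hypothetical paths and values:

```python
# Wrap another scraper's record in the folder/file layout PhantomClean expects.
import json
import pathlib

record = {
    "url": "https://example.com/page",
    "title": "Example Page",
    "content": "plain text extracted by your scraper",
}
folder = pathlib.Path("scraped/example-site/example-page")
folder.mkdir(parents=True, exist_ok=True)
(folder / "raw.json").write_text(json.dumps(record, indent=2))
```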
Configuration
Run phantomclean init to generate a cleaner.json template. All settings live here.
{
"concurrent_folders": 3, // folders processed concurrently per batch
"folder_to_clean": "./scraped", // root scraped folder (PhantomCrawl output)
"output_folder": "./organized", // where cleaned files are written
"output_file_name": "organized", // output filename without extension
"export_format": ["json"], // json, csv, xml, txt, html, all_formats
"watch_mode": true, // keep running after initial scan
"watch_debounce_seconds": 2, // seconds of silence before flushing new files
"min_word_count": 10, // skip files below this word count
"boilerplate_threshold": 3, // sentences seen this many times = boilerplate
"quality_score_minimum": 0.5, // skip files scoring below this (0.0-1.0)
"language": "en", // only keep english. use "all" to disable
"content_type": "text", // text, code, or mixed
"strip_nav_links": true, // remove navigation links from output
"overwrite": false, // re-process already-cleaned files
"resume": true, // skip already-processed files on restart
"omit_folders": [], // folder names to skip entirely
"output": {
"zip_on_complete": true,
"zip_name": "dataset-{date}-{file_count}-files"
}
}
| Field | Default | Description |
|---|---|---|
| concurrent_folders | 3 | Number of folders processed concurrently within each batch. Tune based on your machine and API rate limits. |
| watch_mode | true | After the initial scan completes, keep running and process new files as they arrive. |
| watch_debounce_seconds | 2 | When new files arrive, wait this many seconds of silence before processing them as a batch. Handles bursts from the scraper gracefully. |
| boilerplate_threshold | 3 | A sentence seen across this many files is considered boilerplate and stripped. Lower = more aggressive. |
| quality_score_minimum | 0.5 | Files scoring below this are skipped. Score is based on word count, special char ratio, avg word length, and caps ratio. |
| content_type | text | text strips everything aggressively. code preserves all syntax. mixed preserves code blocks while cleaning surrounding text. |
| language | en | Only keep files detected as English. Set to all to disable language filtering. |
| export_format | ["json"] | Output formats. Supports json, csv, xml, txt, html, all_formats. |
Cleaning Pipeline
Every file passes through 4 layers in sequence. A file is skipped if it fails the quality or word count threshold after Layer 3.
Layer 1 - Rules and regex
Applies all patterns from regex.txt to strip navigation text, legal boilerplate, social share prompts, ads, and timestamps. Also decodes HTML entities and unicode escapes, strips emojis, and removes non-text characters. Behaviour changes based on content_type: code preserves syntax, mixed preserves code blocks.
Layer 2 - Boilerplate detection
Tracks every sentence seen across all files. Sentences that appear in more than boilerplate_threshold files are considered boilerplate and removed - things like cookie consent text, footer slogans, or repeated CTAs that appear on every page of a site.
Layer 3 - Quality scoring
Scores content from 0.0 to 1.0 based on four signals: word count (below 50 words penalised), special character ratio (above 10% penalised), average word length (above 15 chars penalised), and uppercase ratio (above 50% penalised). Files below quality_score_minimum are skipped.
Layer 4 - AI cleaning
Sends content through a chain of AI providers in order (Groq → OpenAI → Anthropic by default). Each provider is retried up to max_retries times before trying the next. Large documents are automatically chunked at 2000 words so token limits are never hit. Falls back to rule-based output if all providers fail.
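As a rough sketch of how Layer 3's four signals might combine (the real weighting inside PhantomClean may differ):

```python
# Penalise each failed signal equally; PhantomClean's actual weights may differ.
def quality_score(text: str) -> float:
    words = text.split()
    score = 1.0
    if len(words) < 50:
        score -= 0.25                                 # too little text
    specials = sum(not c.isalnum() and not c.isspace() for c in text)
    if text and specials / len(text) > 0.10:
        score -= 0.25                                 # symbol-heavy noise
    if words and sum(map(len, words)) / len(words) > 15:
        score -= 0.25                                 # minified-looking "words"
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.50:
        score -= 0.25                                 # mostly uppercase
    return score

print(quality_score("word " * 100))  # 1.0: long enough, clean text
```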
AI Cascade
PhantomClean supports three AI providers in a cascade — it tries each in order and falls back to the next if one fails or hits a rate limit. If all fail, it falls back to rule-based output and marks the file as rules-only so you can retry with phantomclean clean-ai later.
"ai": {
"enabled": true,
"fallback_to_rules": true, // use rule output if all AI fails
"only_if_no_cleaned": false, // skip AI if cleaned.json already exists
"prompt_file": "prompt.txt", // path to your prompt file
"rotate": "random", // random or sequential key rotation
"timeout_seconds": 15, // per-request timeout
"max_retries": 2, // retries per provider before trying next
"providers": [
{ "provider": "groq", "model": "llama-3.3-70b-versatile", "keys": ["$GROQ_KEY_1"] },
{ "provider": "openai", "model": "gpt-4o-mini", "keys": ["$OPENAI_KEY_1"] },
{ "provider": "anthropic", "model": "claude-haiku-4-5-20251001", "keys": ["$ANTHROPIC_KEY_1"] }
]
}
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here
OPENAI_KEY_1=sk_your_key_here
ANTHROPIC_KEY_1=your_key_here
Groq is recommended as the first provider - it's free (100k tokens per day per key on the free tier), fast, and handles most documents well. Multiple Groq keys from different accounts are rotated automatically to spread token usage.
Chunking
Documents over 2000 words are automatically split into chunks before being sent to the AI. Each chunk is cleaned independently and the results are joined back together. This means even very large documents (10,000+ words) are handled without hitting token limits or timeouts.
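Word-based chunking is simple to picture (an illustrative sketch, not the exact splitter):

```python
# Split on word boundaries into ~2000-word chunks; clean each, then rejoin.
def chunk_words(text: str, size: int = 2000) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

chunks = chunk_words("lorem ipsum " * 5000)
print(len(chunks), "chunks")  # each cleaned independently, results joined
```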
Commands
phantomclean init - Creates cleaner.json, regex.txt, prompt.txt, and a .env template in the current directory.
phantomclean start - Runs the cleaning pipeline over folder_to_clean, then keeps watching for new files if watch_mode is enabled.
phantomclean reset - Clears processed-file state so start re-processes everything. Does not delete output files.
phantomclean clean-ai - Re-runs AI cleaning on files that previously fell back to rules-only output.
Output Format
Cleaned files mirror the input folder structure under output_folder.
organized/
site-name/
page-title/
organized.json # cleaned output
organized.csv # if csv in export_format
organized.txt # if txt in export_format
organized.json schema
{
"url": "https://example.com/page",
"title": "Page Title",
"content": "Cleaned text content...",
"word_count": 1842,
"quality_score": 0.94,
"language": "en",
"cleaned_at": "2026-03-18T12:00:00Z",
"ai_used": "groq/llama-3.3-70b-versatile",
"layer_used": "layer1",
"crawled_at": "2026-03-18T11:58:00Z",
"links": ["https://example.com/other"],
"images": ["https://example.com/img.jpg"],
"emails": [],
"phones": []
}
Running the Full Pipeline
PhantomCrawl and PhantomClean are designed to run simultaneously. Point PhantomClean at PhantomCrawl's output folder and both tools handle their half of the pipeline in real time.
phantomcrawl init
# Edit crawl.json — set output to ./scraped
# Add URLs to urls.txt
phantomclean init
# Edit cleaner.json — set folder_to_clean to ./scraped
# Add AI keys to .env
# Terminal 1
phantomcrawl start
# Terminal 2
phantomclean start
PhantomClean watches the scraped folder with a debounce buffer. As PhantomCrawl drops new folders, PhantomClean picks them up, batches them, and cleans them automatically.
# After the crawl is done, retry any rules-only files
phantomclean clean-ai
PhantomClean prefers cleaned.json over raw.json when both exist. If PhantomCrawl already AI-cleaned a page, PhantomClean uses that output and skips redundant AI processing.
Contributing
PhantomCrawl is source-available under the BSL license - free for personal and non-commercial use. Contributions are welcome.
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
go build ./...
go run main.go init
go run main.go start
Submit your changes at github.com/var-raphael/PhantomCrawl.