PhantomCrawl is a command-line web crawler for developers who need to extract data from the real web - not just simple static pages. Most scrapers break the moment they hit a site with anti-bot protection. PhantomCrawl does not.
It works by trying the fastest, cheapest method first and only escalating to more powerful techniques when necessary. This means it is fast on simple sites and capable on protected ones - all automatically.
PhantomCrawl is self-hosted and free for personal use. You bring your own AI keys (Groq is free). No subscriptions, no per-request charges, no vendor lock-in.
How it works
PhantomCrawl uses a 4-layer escalation engine. Every URL starts at Layer 1 and only moves to the next layer if the current one fails or returns no meaningful content. Most of the internet never leaves Layer 1.
Layer 1 - Stealth HTTP
Makes a direct HTTP request but disguises the connection at the TLS level using utls with the HelloChrome_120 ClientHello - the same TLS fingerprint as a real Chrome browser. Many anti-bot systems (including Cloudflare) fingerprint bots by their TLS handshake. PhantomCrawl's handshake is indistinguishable from a real browser's.
Layer 2 - Embedded Data Extraction
If Layer 1 returns HTML but the content is not useful, Layer 2 inspects the HTML for embedded data. It looks for window.__NEXT_DATA__, window.__INITIAL_STATE__, window.__NUXT__, JSON-LD structured data, and API endpoint patterns. Many modern SPAs ship their data pre-embedded in the HTML even before JavaScript runs - Layer 2 extracts it directly.
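As a rough illustration of what Layer 2 looks for (a minimal standalone sketch, not PhantomCrawl's actual code, which covers more patterns than this one), here is how a Next.js payload can be pulled out of raw HTML without executing any JavaScript:

```python
# Minimal sketch: extract the pre-embedded Next.js payload from raw HTML.
import json
import re
import urllib.request

html = urllib.request.urlopen("https://example.com").read().decode("utf-8", "replace")
match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))        # the SPA's page props as clean JSON
    print(json.dumps(data, indent=2)[:500])  # preview the structured data
else:
    print("no embedded Next.js payload on this page")
```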
Layer 2.5 - API Interception
This is what makes PhantomCrawl unique. Instead of scraping the rendered HTML from a headless browser, Layer 2.5 intercepts the actual API responses the browser receives during page load - the raw JSON from XHR and fetch calls. The result is clean structured data with no boilerplate and no parsing noise. This layer only fires when a browser client (Browserless or go-rod) is available.
Layer 3 - Full Browser Rendering
The last resort. PhantomCrawl launches a real browser - go-rod if Chrome is installed locally (detected automatically), or Browserless via API - and fully renders the page: executing JavaScript, handling SPAs, and waiting for dynamic content to load. Slower and more expensive, but it handles anything.
The escalation is automatic. You do not need to configure which layer to use. PhantomCrawl decides based on what each site returns.
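Conceptually, the escalation loop looks something like this (hypothetical stand-in functions, not PhantomCrawl's source):

```python
# Conceptual sketch of the escalation order: each stand-in returns extracted
# content or None, and the loop stops at the first layer that succeeds.
def layer1_stealth_http(url): ...
def layer2_embedded_data(url): ...
def layer2_5_api_intercept(url): ...
def layer3_full_browser(url): ...

def crawl(url):
    for layer in (layer1_stealth_http, layer2_embedded_data,
                  layer2_5_api_intercept, layer3_full_browser):
        result = layer(url)
        if result:        # "meaningful content" check, simplified
            return result
    return None           # every layer failed
```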
Installation
PhantomCrawl ships as a pre-built binary. No Go installation required. Download the binary for your platform and move it to your PATH.
Linux / macOS / Termux
# Linux
chmod +x phantomcrawl-linux-amd64
sudo mv phantomcrawl-linux-amd64 /usr/local/bin/phantomcrawl
# Termux (Android)
chmod +x phantomcrawl-linux-arm64
mv phantomcrawl-linux-arm64 $PREFIX/bin/phantomcrawl
# macOS
chmod +x phantomcrawl-darwin-arm64
sudo mv phantomcrawl-darwin-arm64 /usr/local/bin/phantomcrawl
Build from source
Requires Go 1.21+
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
go build -ldflags="-s -w" -o phantomcrawl .
Quickstart
Get your first crawl running in under 2 minutes.
phantomcrawl init
This creates a crawl.json template and a blank urls.txt in the current directory.
Open urls.txt and add one URL per line.
https://example.com
https://news.ycombinator.com
https://cloudflare.com
phantomcrawl start
Output is saved to ~/phantomcrawl/scraped/ by default.
Prefer a visual interface? Open ui.html (included in the repo) in any browser to generate your crawl.json and urls.txt without touching the terminal.
Crawl Settings
All configuration lives in crawl.json in your working directory. Here is a full example with every field:
{
"urls_file": "./urls.txt",
"batch_size": 3,
"throttle": 5,
"depth": 0,
"depth_limit": 10,
"stay_on_domain": true,
"output": "~/phantomcrawl/scraped",
"retry": {
"max_attempts": 3,
"backoff": "exponential",
"respect_retry_after": true
}
}
| Field | Type | Description |
|---|---|---|
| urls_file | string | Path to your URLs file. One URL per line. Lines starting with # are ignored. |
| batch_size | number | How many URLs to crawl concurrently per batch. Keep this low (2-5) to avoid triggering rate limits. |
| throttle | number | Seconds to wait between batches. A random jitter is applied so timing is never predictable. |
| depth | number | How deep to follow links from seed URLs. 0 means only scrape the seed URLs themselves. 1 means follow links one level deep. |
| depth_limit | number | Maximum number of child links to follow per parent URL. 0 means unlimited. Useful when a site has thousands of links on its homepage. |
| stay_on_domain | bool | When following links at depth, only follow links that stay within the seed URL's domain. Prevents crawling the entire internet. |
| output | string | Directory where scraped data is saved. Supports ~ for home directory. |
| retry.max_attempts | number | How many times to retry a failed request before giving up. |
| retry.backoff | string | "exponential" doubles the wait time between each retry. Reduces hammering on rate-limited sites. |
| retry.respect_retry_after | bool | When a server returns a Retry-After header, wait that long before retrying. |
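The retry policy in sketch form (illustrative only, not PhantomCrawl's implementation):

```python
# Exponential backoff with an honoured Retry-After header.
import random
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as err:
            if attempt == max_attempts - 1:
                raise                                # out of attempts
            retry_after = err.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)             # respect_retry_after
            else:
                delay = base_delay * 2 ** attempt    # exponential: 1s, 2s, 4s...
            time.sleep(delay + random.random())      # plus a little jitter
```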
Scrape Options
Control what data is extracted from each page.
"scrape": {
"text": true, // Extract visible text content
"links": true, // Extract all hyperlinks (resolved to absolute URLs)
"images": "links_only", // Extract image URLs
"videos": "links_only", // Extract video/iframe URLs
"documents": "links_only", // Extract PDF, DOC, XLS links
"emails": false, // Extract email addresses
"phone_numbers": false, // Extract phone numbers
"metadata": true // Extract meta tags (title, description, OG, Twitter)
}
All extracted links are automatically resolved to absolute URLs. A relative link like /about on https://example.com becomes https://example.com/about.
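This mirrors standard URL resolution. In Python, for example:

```python
from urllib.parse import urljoin

print(urljoin("https://example.com/docs/", "/about"))  # https://example.com/about
print(urljoin("https://example.com/docs/", "guide"))   # https://example.com/docs/guide
```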
AI Cleaning
Raw HTML is noisy - navigation menus, cookie banners, footers, ads. AI cleaning strips all of that and extracts only the meaningful content. PhantomCrawl supports Groq (free tier) and OpenAI.
Setup with Groq (recommended, free)
Go to console.groq.com and sign up. Free tier gives you 100,000 tokens per day.
In the same directory as your crawl.json, create a .env file:
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here
Multiple keys from different accounts are rotated automatically to maximize your daily quota.
"ai": {
"enabled": true,
"save_raw": true,
"save_cleaned": true,
"max_concurrent_cleans": 1,
"model": "llama-3.3-70b-versatile",
"provider": "groq",
"key_rotation": "random",
"keys": ["$GROQ_KEY_1", "$GROQ_KEY_2"],
"prompt": "You are a web content extractor..."
}
Groq's free tier has a daily token limit (100k tokens per day per account). If you hit it mid-crawl, PhantomCrawl will show which URLs are pending and automatically resume when you run it again - no data is lost.
AI config fields
| Field | Type | Description |
|---|---|---|
| enabled | bool | Turn AI cleaning on or off. Raw data is always saved regardless of this setting. |
| save_raw | bool | Save the full raw HTML in raw.json. Useful for debugging or reprocessing later. |
| save_cleaned | bool | Save the AI-cleaned text in cleaned.json. |
| max_concurrent_cleans | number | How many URLs to AI-clean at the same time. Keep at 1 on free tier to avoid rate limits. |
| model | string | The LLM model to use. For Groq: llama-3.3-70b-versatile. For OpenAI: gpt-4o-mini. |
| provider | string | "groq" or "openai". |
| key_rotation | string | "random" picks a key randomly. "round-robin" cycles through keys in order. |
| keys | array | List of API keys. Use $VAR_NAME to reference your .env file. |
| prompt | string | The instruction sent to the LLM. Customize this to extract specific types of content. |
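The difference between the two rotation modes, sketched (illustrative, not PhantomCrawl internals):

```python
# "random" picks any key each call; "round-robin" cycles through them in order.
import itertools
import random

keys = ["gsk_key_one", "gsk_key_two", "gsk_key_three"]

print(random.choice(keys))        # "random": any key, each time
rotation = itertools.cycle(keys)  # "round-robin": 1, 2, 3, 1, 2, ...
print(next(rotation), next(rotation))
```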
Proxy Setup
Route requests through proxies to avoid IP bans on large crawls. Proxies are tunneled at the TCP level - traffic goes through the proxy before the TLS handshake, making it look like the request originated from the proxy's IP.
PROXY_1=http://user:[email protected]:8080
PROXY_2=http://proxy2.example.com:8080
"anti_bot": {
"rotate_user_agents": true,
"request_delay_jitter": true,
"proxy": {
"enabled": true,
"key_rotation": "random",
"urls": ["$PROXY_1", "$PROXY_2"]
}
}
For testing, Webshare offers 10 free proxies. Set authentication to IP-based (no username/password needed) and format URLs as http://host:port.
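Before a large crawl, it can help to confirm a proxy actually routes your traffic. One quick check, independent of PhantomCrawl (api.ipify.org is just one of several IP-echo services):

```python
# Fetch through the proxy and confirm the echoed IP is the proxy's, not yours.
import urllib.request

proxy = urllib.request.ProxyHandler({
    "http": "http://host:port",   # substitute your proxy URL
    "https": "http://host:port",
})
opener = urllib.request.build_opener(proxy)
print(opener.open("https://api.ipify.org").read().decode())
```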
Layer 3 - Headless Browser
Layer 3 is used automatically when Layers 1, 2, and 2.5 all fail to return meaningful content. It runs a real browser to fully render the page.
Option A: Local Chrome (automatic)
If Chrome or Chromium is installed on your machine, PhantomCrawl detects it automatically and uses go-rod to control it. No configuration needed.
phantomcrawl start
# You will see:
✓ Chrome detected. Using go-rod for Layer 3.
Option B: Browserless (cloud Chrome)
If Chrome is not installed (common on servers and mobile), you can use Browserless - a cloud Chrome service. Free tier available.
BROWSERLESS_KEY=your_key_here
"layer3": {
"key_rotation": "random",
"keys": ["$BROWSERLESS_KEY"]
}
If neither Chrome nor Browserless keys are configured, Layer 3 is unavailable. PhantomCrawl will still work via Layers 1 and 2, which cover the vast majority of the web.
Environment Variables
Never put API keys directly in crawl.json. Use a .env file instead and reference variables with $VAR_NAME in your config.
# AI keys
GROQ_KEY_1=gsk_abc123...
GROQ_KEY_2=gsk_def456...
# Browser
BROWSERLESS_KEY=your_browserless_key
# Proxies
PROXY_1=http://host:port
PROXY_2=http://user:pass@host:port
Add .env and crawl.json to your .gitignore so you never accidentally commit API keys.
.env
crawl.json
urls.txt
scraped/
Commands
phantomcrawl init - Creates a crawl.json template in the current directory. Run this first in a new project folder.
phantomcrawl start - Starts crawling using crawl.json and urls.txt. Automatically resumes if a previous crawl was interrupted.
phantomcrawl reset - Clears the resume state so start treats all URLs as fresh. Does not delete your scraped data files.
Resume behavior
PhantomCrawl tracks every crawled URL in a local SQLite database at ~/.phantomcrawl/state.db. If a crawl is interrupted - whether by a network error, token quota, or manual stop - you can resume exactly where you left off by running phantomcrawl start again without resetting. Already-crawled URLs are skipped. Pending AI cleans are retried.
Output Format
Each crawled URL creates a folder under your output directory. The folder name is derived from the page title or URL slug.
~/phantomcrawl/scraped/
cloudflare.com/
cloudflare/
raw.json # Full extracted data + raw HTML
cleaned.json # AI-cleaned text + structured data
en-gb/
raw.json
cleaned.json
raw.json
Contains everything extracted from the page before AI processing.
{
"url": "https://cloudflare.com",
"title": "Cloudflare - The Web Performance & Security Company",
"content": "Extracted text content...",
"links": ["https://cloudflare.com/products", "..."],
"images": ["https://cloudflare.com/img/logo.svg"],
"documents": [],
"emails": [],
"metadata": { "description": "...", "og:image": "..." },
"layer_used": "layer1",
"crawled_at": "2026-03-14T...",
"raw": "<!DOCTYPE html>..."
}
cleaned.json
AI-processed version with clean text and all structured data included.
{
"url": "https://cloudflare.com",
"title": "Cloudflare - The Web Performance & Security Company",
"cleaned": "Cloudflare is a global network...\nProducts: CDN, DDoS protection...",
"links": ["https://cloudflare.com/products", "..."],
"images": ["https://cloudflare.com/img/logo.svg"],
"metadata": { "description": "..." },
"crawled_at": "2026-03-14T..."
}
Converting to other formats
Output is JSON, so you can convert it to any format with standard tools. In Python, for example:
import json, csv

with open('scraped/cloudflare.com/cloudflare/cleaned.json') as f:
    data = json.load(f)

print(data['cleaned'])     # AI-cleaned text
print(data['links'])       # All extracted links
print(data['metadata'])    # Page metadata

# For example, dump the extracted links to CSV:
with open('links.csv', 'w', newline='') as f:
    csv.writer(f).writerows([link] for link in data['links'])
PhantomClean
While PhantomCrawl handles extraction, PhantomClean handles quality. It processes folders in configurable concurrent batches, watches for new files in real time as your scraper runs, and exports results in JSON, CSV, XML, TXT, or HTML.
Run PhantomCrawl and PhantomClean simultaneously in two terminals. PhantomClean watches the scraped folder and cleans files as soon as PhantomCrawl drops them. No waiting for the crawl to finish.
Installation
PhantomClean ships as a pre-built binary. Download for your platform from GitHub Releases.
# Linux
chmod +x phantomclean-linux-amd64
sudo mv phantomclean-linux-amd64 /usr/local/bin/phantomclean
# Termux (Android)
chmod +x phantomclean-linux-arm64
mv phantomclean-linux-arm64 $PREFIX/bin/phantomclean
# macOS
chmod +x phantomclean-darwin-arm64
sudo mv phantomclean-darwin-arm64 /usr/local/bin/phantomclean
Input Format Requirements
PhantomClean expects scraped data in a specific folder/file structure. PhantomCrawl produces this automatically — using any other scraper requires matching this format.
scraped/
site-name/
page-title/
raw.json # required — full extracted data
cleaned.json # optional — preferred over raw.json if present
Each raw.json must contain at least one of these fields. PhantomClean checks them in this order:
| Field | Type | Description |
|---|---|---|
| content | string | Plain text content — checked first. This is where PhantomCrawl puts extracted text. |
| text | string | Fallback text field. |
| html | string | Raw HTML — last resort. PhantomClean will extract text from it. |
If none of content, text, or html are present, the file is skipped with reason no text content found.
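The selection logic amounts to something like this (a minimal sketch of the order described above):

```python
# Check content, then text, then html; skip the file if none are present.
def pick_text(record: dict) -> str | None:
    for field in ("content", "text", "html"):
        value = record.get(field)
        if value:
            return value
    return None  # file skipped: "no text content found"
```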
These fields are optional but are preserved in the cleaned output when present:
| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Page title — trademark symbols and site suffixes are stripped automatically |
| links | array | Extracted links |
| images | array | Image URLs — CDN proxy wrappers are automatically unwrapped |
| emails | array | Extracted email addresses |
| phones | array | Extracted phone numbers |
| metadata.description | string | Page meta description |
| layer_used | string | Which PhantomCrawl layer extracted this page |
| crawled_at | string | ISO timestamp of when the page was crawled |
PhantomClean is designed for PhantomCrawl output. Using a different scraper is supported as long as the JSON structure matches the format above. Mismatched field names mean PhantomClean either skips the file or produces empty output.
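If you are feeding PhantomClean from another scraper, a small adapter that writes its output into the expected layout is usually enough. A minimal sketch, with hypothetical paths and values:

```python
# Wrap another scraper's record in the folder/file layout PhantomClean expects.
import json
import pathlib

record = {
    "url": "https://example.com/page",
    "title": "Example Page",
    "content": "plain text extracted by your scraper",
}
folder = pathlib.Path("scraped/example-site/example-page")
folder.mkdir(parents=True, exist_ok=True)
(folder / "raw.json").write_text(json.dumps(record, indent=2))
```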
Configuration
Run phantomclean init to generate a cleaner.json template. All settings live here.
{
"concurrent_folders": 3, // folders processed concurrently per batch
"folder_to_clean": "./scraped", // root scraped folder (PhantomCrawl output)
"output_folder": "./organized", // where cleaned files are written
"output_file_name": "organized", // output filename without extension
"export_format": ["json"], // json, csv, xml, txt, html, all_formats
"watch_mode": true, // keep running after initial scan
"watch_debounce_seconds": 2, // seconds of silence before flushing new files
"min_word_count": 10, // skip files below this word count
"boilerplate_threshold": 3, // sentences seen this many times = boilerplate
"quality_score_minimum": 0.5, // skip files scoring below this (0.0-1.0)
"language": "en", // only keep english. use "all" to disable
"content_type": "text", // text, code, or mixed
"strip_nav_links": true, // remove navigation links from output
"overwrite": false, // re-process already-cleaned files
"resume": true, // skip already-processed files on restart
"omit_folders": [], // folder names to skip entirely
"output": {
"zip_on_complete": true,
"zip_name": "dataset-{date}-{file_count}-files"
}
}
| Field | Default | Description |
|---|---|---|
| concurrent_folders | 3 | Number of folders processed concurrently within each batch. Tune based on your machine and API rate limits. |
| watch_mode | true | After the initial scan completes, keep running and process new files as they arrive. |
| watch_debounce_seconds | 2 | When new files arrive, wait this many seconds of silence before processing them as a batch. Handles bursts from the scraper gracefully. |
| boilerplate_threshold | 3 | A sentence seen across this many files is considered boilerplate and stripped. Lower = more aggressive. |
| quality_score_minimum | 0.5 | Files scoring below this are skipped. Score is based on word count, special char ratio, avg word length, and caps ratio. |
| content_type | text | text strips everything aggressively. code preserves all syntax. mixed preserves code blocks while cleaning surrounding text. |
| language | en | Only keep files detected as English. Set to all to disable language filtering. |
| export_format | ["json"] | Output formats. Supports json, csv, xml, txt, html, all_formats. |
Cleaning Pipeline
Every file passes through 4 layers in sequence. A file is skipped if it fails the quality or word count threshold after Layer 3.
Layer 1 - Rules and regex
Applies all patterns from regex.txt to strip navigation text, legal boilerplate, social share prompts, ads, and timestamps. Also decodes HTML entities and unicode escapes, strips emojis, and removes non-text characters. Behaviour changes based on content_type: code preserves syntax, mixed preserves code blocks.
Layer 2 - Boilerplate detection
Tracks every sentence seen across all files. Sentences that appear in more than boilerplate_threshold files are considered boilerplate and removed - things like cookie consent text, footer slogans, or repeated CTAs that appear on every page of a site.
Layer 3 - Quality scoring
Scores content from 0.0 to 1.0 based on four signals: word count (below 50 words penalised), special character ratio (above 10% penalised), average word length (above 15 chars penalised), and uppercase ratio (above 50% penalised). Files below quality_score_minimum are skipped.
Layer 4 - AI cleaning
Sends content through a chain of AI providers in order (Groq → OpenAI → Anthropic by default). Each provider is retried up to max_retries times before trying the next. Large documents are automatically chunked at 2000 words so token limits are never hit. Falls back to rule-based output if all providers fail.
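As a rough sketch of how Layer 3's four signals might combine (the real weighting inside PhantomClean may differ):

```python
# Penalise each failed signal equally; PhantomClean's actual weights may differ.
def quality_score(text: str) -> float:
    words = text.split()
    score = 1.0
    if len(words) < 50:
        score -= 0.25                                 # too little text
    specials = sum(not c.isalnum() and not c.isspace() for c in text)
    if text and specials / len(text) > 0.10:
        score -= 0.25                                 # symbol-heavy noise
    if words and sum(map(len, words)) / len(words) > 15:
        score -= 0.25                                 # minified-looking "words"
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.50:
        score -= 0.25                                 # mostly uppercase
    return score

print(quality_score("word " * 100))  # 1.0: long enough, clean text
```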
AI Cascade
PhantomClean supports three AI providers in a cascade — it tries each in order and falls back to the next if one fails or hits a rate limit. If all fail, it falls back to rule-based output and marks the file as rules-only so you can retry with phantomclean clean-ai later.
"ai": {
"enabled": true,
"fallback_to_rules": true, // use rule output if all AI fails
"only_if_no_cleaned": false, // skip AI if cleaned.json already exists
"prompt_file": "prompt.txt", // path to your prompt file
"rotate": "random", // random or sequential key rotation
"timeout_seconds": 15, // per-request timeout
"max_retries": 2, // retries per provider before trying next
"providers": [
{ "provider": "groq", "model": "llama-3.3-70b-versatile", "keys": ["$GROQ_KEY_1"] },
{ "provider": "openai", "model": "gpt-4o-mini", "keys": ["$OPENAI_KEY_1"] },
{ "provider": "anthropic", "model": "claude-haiku-4-5-20251001", "keys": ["$ANTHROPIC_KEY_1"] }
]
}
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here
OPENAI_KEY_1=sk_your_key_here
ANTHROPIC_KEY_1=your_key_here
Groq is recommended as the first provider - it's free (100k tokens per day per key on the free tier), fast, and handles most documents well. Multiple Groq keys from different accounts are rotated automatically to spread token usage.
Chunking
Documents over 2000 words are automatically split into chunks before being sent to the AI. Each chunk is cleaned independently and the results are joined back together. This means even very large documents (10,000+ words) are handled without hitting token limits or timeouts.
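Word-based chunking is simple to picture (an illustrative sketch, not the exact splitter):

```python
# Split on word boundaries into ~2000-word chunks; clean each, then rejoin.
def chunk_words(text: str, size: int = 2000) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

chunks = chunk_words("lorem ipsum " * 5000)
print(len(chunks), "chunks")  # each cleaned independently, results joined
```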
Commands
phantomclean init - Creates cleaner.json, regex.txt, prompt.txt, and a .env template in the current directory.
phantomclean start - Runs the cleaning pipeline over folder_to_clean, then keeps watching for new files if watch_mode is enabled.
phantomclean reset - Clears processed-file state so start re-processes everything. Does not delete output files.
phantomclean clean-ai - Re-runs AI cleaning on files that previously fell back to rules-only output.
Output Format
Cleaned files mirror the input folder structure under output_folder.
organized/
site-name/
page-title/
organized.json # cleaned output
organized.csv # if csv in export_format
organized.txt # if txt in export_format
organized.json schema
{
"url": "https://example.com/page",
"title": "Page Title",
"content": "Cleaned text content...",
"word_count": 1842,
"quality_score": 0.94,
"language": "en",
"cleaned_at": "2026-03-18T12:00:00Z",
"ai_used": "groq/llama-3.3-70b-versatile",
"layer_used": "layer1",
"crawled_at": "2026-03-18T11:58:00Z",
"links": ["https://example.com/other"],
"images": ["https://example.com/img.jpg"],
"emails": [],
"phones": []
}
Running the Full Pipeline
PhantomCrawl and PhantomClean are designed to run simultaneously. Point PhantomClean at PhantomCrawl's output folder and both tools handle their half of the pipeline in real time.
phantomcrawl init
# Edit crawl.json — set output to ./scraped
# Add URLs to urls.txt
phantomclean init
# Edit cleaner.json — set folder_to_clean to ./scraped
# Add AI keys to .env
# Terminal 1
phantomcrawl start
# Terminal 2
phantomclean start
PhantomClean watches the scraped folder with a debounce buffer. As PhantomCrawl drops new folders, PhantomClean picks them up, batches them, and cleans them automatically.
# After the crawl is done, retry any rules-only files
phantomclean clean-ai
PhantomClean prefers cleaned.json over raw.json when both exist. If PhantomCrawl already AI-cleaned a page, PhantomClean uses that output and skips redundant AI processing.
Contributing
PhantomCrawl is source-available under the BSL license - free for personal and non-commercial use. Contributions are welcome.
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
go build ./...
go run main.go init
go run main.go start
Submit your changes at github.com/var-raphael/PhantomCrawl.