Open Source · v1.0.0 · BSL License

PhantomCrawl

A 4-layer web crawler with TLS fingerprinting, AI content cleaning, and anti-bot evasion. Built in Go. Ships as a single binary.

⚡ Proof: Scraped Cloudflare.com - a site protected by Cloudflare itself - at depth 1, 100+ pages, zero blocks.

PhantomCrawl is a command-line web crawler for developers who need to extract data from the real web - not just simple static pages. Most scrapers break the moment they hit a site with anti-bot protection. PhantomCrawl does not.

It works by trying the fastest, cheapest method first and only escalating to more powerful techniques when necessary. This means it is fast on simple sites and capable on protected ones - all automatically.

PhantomCrawl is self-hosted and free for personal use. You bring your own AI keys (Groq is free). No subscriptions, no per-request charges, no vendor lock-in.


Architecture

How it works

PhantomCrawl uses a 4-layer escalation engine. Every URL starts at Layer 1 and only moves to the next layer if the current one fails or returns no meaningful content. Most of the internet never leaves Layer 1.

Layer 1
Direct HTTP + TLS Fingerprinting

Makes a direct HTTP request but disguises the connection at the TLS level using uTLS with the HelloChrome_120 ClientHello - the same TLS fingerprint as a real Chrome browser. Most anti-bot systems (including Cloudflare) identify bots by their TLS handshake; PhantomCrawl's handshake is indistinguishable from a real browser's.

Covers ~90% of the web. SSR, static sites, Next.js, WordPress, etc.
Layer 2
Network Hijacking

If Layer 1 returns HTML but the content is not useful, Layer 2 inspects the HTML for embedded data. It looks for window.__NEXT_DATA__, window.__INITIAL_STATE__, window.__NUXT__, JSON-LD structured data, and API endpoint patterns. Many modern SPAs ship their data pre-embedded in the HTML even before JavaScript runs - Layer 2 extracts it directly.

Covers Next.js SPAs, Nuxt apps, and sites with embedded JSON blobs.
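The embedded-data lookup can be sketched in Python (illustrative only - the real extractor is written in Go and also handles window.__INITIAL_STATE__, window.__NUXT__, and JSON-LD):

```python
import json
import re

def extract_next_data(html):
    """Pull the JSON blob Next.js embeds in <script id="__NEXT_DATA__">."""
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html, re.DOTALL,
    )
    return json.loads(m.group(1)) if m else None

html = '''<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"title": "Hello"}}}
</script></body></html>'''

data = extract_next_data(html)
print(data["props"]["pageProps"]["title"])  # Hello
```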
Layer 2.5
XHR/Fetch Interception

This is what makes PhantomCrawl unique. Instead of scraping the rendered HTML from a headless browser, Layer 2.5 intercepts the actual API responses the browser receives during page load - the raw JSON from XHR and fetch calls. The result is clean structured data with no boilerplate, no parsing noise, and no need to extract anything. This only fires when a browser client (Browserless or go-rod) is available.

Unique to PhantomCrawl. Gets the data before it becomes HTML.
Layer 3
Full Headless Browser

The last resort. PhantomCrawl launches a real browser (go-rod if Chrome is installed locally and detected automatically, or Browserless via API) and fully renders the page - executing JavaScript, handling SPAs, waiting for dynamic content to load. Slower and more expensive, but handles anything.

Handles complex SPAs where all three previous layers fail.

The escalation is automatic. You do not need to configure which layer to use. PhantomCrawl decides based on what each site returns.
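Conceptually, the escalation engine is a loop that stops at the first layer returning meaningful content. A hypothetical Python sketch (PhantomCrawl itself is written in Go, and the is_meaningful heuristic here is illustrative, not its actual check):

```python
def is_meaningful(result):
    """Treat a result as useful if it has non-trivial text content (assumed heuristic)."""
    return result is not None and len(result.get("content", "")) > 200

def crawl(url, layers):
    """Try each layer in order; stop at the first meaningful result."""
    for name, fetch in layers:
        result = fetch(url)
        if is_meaningful(result):
            result["layer_used"] = name
            return result
    return None  # every layer failed

# Usage with stub layers: layer1 returns an empty page, layer2 succeeds.
layers = [
    ("layer1", lambda url: {"content": ""}),            # direct HTTP: nothing useful
    ("layer2", lambda url: {"content": "x" * 300}),     # embedded JSON found
    ("layer3", lambda url: {"content": "rendered..."}), # never reached
]
print(crawl("https://example.com", layers)["layer_used"])  # layer2
```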


Getting Started

Installation

PhantomCrawl ships as a pre-built binary. No Go installation required. Download the binary for your platform and move it to your PATH.

Linux / macOS / Termux

terminal
# Linux
chmod +x phantomcrawl-linux-amd64
sudo mv phantomcrawl-linux-amd64 /usr/local/bin/phantomcrawl

# Termux (Android)
chmod +x phantomcrawl-linux-arm64
mv phantomcrawl-linux-arm64 $PREFIX/bin/phantomcrawl

# macOS
chmod +x phantomcrawl-darwin-arm64
sudo mv phantomcrawl-darwin-arm64 /usr/local/bin/phantomcrawl

Build from source

Requires Go 1.21+

terminal
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
go build -ldflags="-s -w" -o phantomcrawl .

Getting Started

Quickstart

Get your first crawl running in under 2 minutes.

1
Generate your config

Creates a crawl.json template and a blank urls.txt in the current directory.

phantomcrawl init
2
Add your URLs

Open urls.txt and add one URL per line.

https://example.com
https://news.ycombinator.com
https://cloudflare.com
3
Start crawling
phantomcrawl start

Output is saved to ~/phantomcrawl/scraped/ by default.

Prefer a visual interface? Open ui.html (included in the repo) in any browser to generate your crawl.json and urls.txt without touching the terminal.


Configuration

Crawl Settings

All configuration lives in crawl.json in your working directory. Here is a full example with every field:

crawl.json
{
  "urls_file": "./urls.txt",
  "batch_size": 3,
  "throttle": 5,
  "depth": 0,
  "depth_limit": 10,
  "stay_on_domain": true,
  "output": "~/phantomcrawl/scraped",
  "retry": {
    "max_attempts": 3,
    "backoff": "exponential",
    "respect_retry_after": true
  }
}
urls_file (string): Path to your URLs file. One URL per line. Lines starting with # are ignored.
batch_size (number): How many URLs to crawl concurrently per batch. Keep this low (2-5) to avoid triggering rate limits.
throttle (number): Seconds to wait between batches. A random jitter is applied so timing is never predictable.
depth (number): How deep to follow links from seed URLs. 0 means only scrape the seed URLs themselves. 1 means follow links one level deep.
depth_limit (number): Maximum number of child links to follow per parent URL. 0 means unlimited. Useful when a site has thousands of links on its homepage.
stay_on_domain (bool): When following links at depth, only follow links that stay within the seed URL's domain. Prevents crawling the entire internet.
output (string): Directory where scraped data is saved. Supports ~ for home directory.
retry.max_attempts (number): How many times to retry a failed request before giving up.
retry.backoff (string): "exponential" doubles the wait time between each retry. Reduces hammering on rate-limited sites.
retry.respect_retry_after (bool): When a server returns a Retry-After header, wait that long before retrying.
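The batch_size/throttle interaction can be sketched as follows (the jitter range is an assumption - the docs only say a random jitter is applied):

```python
import random
import time

def throttled_batches(urls, batch_size, throttle):
    """Yield batches of URLs, sleeping throttle seconds plus jitter between them."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]
        if i + batch_size < len(urls):
            # Assumed jitter range: up to 50% of the throttle value
            time.sleep(throttle + random.uniform(0, throttle * 0.5))

# Five URLs in batches of 2 (throttle 0 here so the demo runs instantly)
print(list(throttled_batches(["a", "b", "c", "d", "e"], 2, 0)))
# [['a', 'b'], ['c', 'd'], ['e']]
```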

Configuration

Scrape Options

Control what data is extracted from each page.

crawl.json
"scrape": {
  "text": true,        // Extract visible text content
  "links": true,       // Extract all hyperlinks (resolved to absolute URLs)
  "images": "links_only", // Extract image URLs
  "videos": "links_only", // Extract video/iframe URLs
  "documents": "links_only", // Extract PDF, DOC, XLS links
  "emails": false,     // Extract email addresses
  "phone_numbers": false, // Extract phone numbers
  "metadata": true     // Extract meta tags (title, description, OG, Twitter)
}

All extracted links are automatically resolved to absolute URLs. A relative link like /about on https://example.com becomes https://example.com/about.
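The same resolution rule is available in Python's standard library, if you want to reproduce it when post-processing:

```python
from urllib.parse import urljoin

# Relative links are resolved against the page URL
print(urljoin("https://example.com/blog/post", "/about"))      # https://example.com/about
print(urljoin("https://example.com/blog/post", "../archive"))  # https://example.com/archive
```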


Configuration

AI Cleaning

Raw HTML is noisy - navigation menus, cookie banners, footers, ads. AI cleaning strips all of that and extracts only the meaningful content. PhantomCrawl supports Groq (free tier) and OpenAI.

Setup with Groq (recommended, free)

1
Create a free Groq account

Go to console.groq.com and sign up. Free tier gives you 100,000 tokens per day.

2
Create a .env file

In the same directory as your crawl.json, create a .env file:

.env
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here

Multiple keys from different accounts are rotated automatically to maximize your daily quota.

3
Configure crawl.json
crawl.json
"ai": {
  "enabled": true,
  "save_raw": true,
  "save_cleaned": true,
  "max_concurrent_cleans": 1,
  "model": "llama-3.3-70b-versatile",
  "provider": "groq",
  "key_rotation": "random",
  "keys": ["$GROQ_KEY_1", "$GROQ_KEY_2"],
  "prompt": "You are a web content extractor..."
}

Groq's free tier has a daily token limit (100k tokens per day per account). If you hit it mid-crawl, PhantomCrawl will show which URLs are pending and automatically resume when you run it again - no data is lost.

AI config fields

enabled (bool): Turn AI cleaning on or off. Raw data is always saved regardless of this setting.
save_raw (bool): Save the full raw HTML in raw.json. Useful for debugging or reprocessing later.
save_cleaned (bool): Save the AI-cleaned text in cleaned.json.
max_concurrent_cleans (number): How many URLs to AI-clean at the same time. Keep at 1 on free tier to avoid rate limits.
model (string): The LLM model to use. For Groq: llama-3.3-70b-versatile. For OpenAI: gpt-4o-mini.
provider (string): "groq" or "openai".
key_rotation (string): "random" picks a key randomly. "round-robin" cycles through keys in order.
keys (array): List of API keys. Use $VAR_NAME to reference your .env file.
prompt (string): The instruction sent to the LLM. Customize this to extract specific types of content.
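The two key_rotation modes can be sketched like this (illustrative only, not PhantomCrawl's actual implementation):

```python
import itertools
import random

def make_rotator(keys, mode):
    """Return a function that yields the next key under the given rotation mode."""
    if mode == "round-robin":
        cycle = itertools.cycle(keys)       # cycle through keys in order
        return lambda: next(cycle)
    return lambda: random.choice(keys)      # "random": pick any key each call

next_key = make_rotator(["$GROQ_KEY_1", "$GROQ_KEY_2"], "round-robin")
print(next_key(), next_key(), next_key())
# $GROQ_KEY_1 $GROQ_KEY_2 $GROQ_KEY_1
```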

Configuration

Proxy Setup

Route requests through proxies to avoid IP bans on large crawls. Proxies are tunneled at the TCP level - traffic goes through the proxy before the TLS handshake, making it look like the request originated from the proxy's IP.

.env
PROXY_1=http://user:[email protected]:8080
PROXY_2=http://proxy2.example.com:8080
crawl.json
"anti_bot": {
  "rotate_user_agents": true,
  "request_delay_jitter": true,
  "proxy": {
    "enabled": true,
    "key_rotation": "random",
    "urls": ["$PROXY_1", "$PROXY_2"]
  }
}

For testing, Webshare offers 10 free proxies. Set authentication to IP-based (no username/password needed) and format URLs as http://host:port.


Configuration

Layer 3 - Headless Browser

Layer 3 is used automatically when Layers 1, 2, and 2.5 all fail to return meaningful content. It runs a real browser to fully render the page.

Option A: Local Chrome (automatic)

If Chrome or Chromium is installed on your machine, PhantomCrawl detects it automatically and uses go-rod to control it. No configuration needed.

terminal - check if Chrome is detected
phantomcrawl start
# You will see:
✓ Chrome detected. Using go-rod for Layer 3.

Option B: Browserless (cloud Chrome)

If Chrome is not installed (common on servers and mobile), you can use Browserless - a cloud Chrome service. Free tier available.

.env
BROWSERLESS_KEY=your_key_here
crawl.json
"layer3": {
  "key_rotation": "random",
  "keys": ["$BROWSERLESS_KEY"]
}

If neither Chrome nor Browserless keys are configured, Layer 3 is unavailable. PhantomCrawl will still work via Layers 1 and 2, which cover the vast majority of the web.


Configuration

Environment Variables

Never put API keys directly in crawl.json. Use a .env file instead and reference variables with $VAR_NAME in your config.

.env
# AI keys
GROQ_KEY_1=gsk_abc123...
GROQ_KEY_2=gsk_def456...

# Browser
BROWSERLESS_KEY=your_browserless_key

# Proxies
PROXY_1=http://host:port
PROXY_2=http://user:pass@host:port
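A minimal sketch of how .env parsing and $VAR_NAME expansion could work (illustrative - PhantomCrawl's actual parser may differ):

```python
def parse_env(text):
    """Parse KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def expand(value, env):
    """Replace a $VAR_NAME reference in a config value with its .env value."""
    if isinstance(value, str) and value.startswith("$"):
        return env[value[1:]]
    return value

env = parse_env("# AI keys\nGROQ_KEY_1=gsk_abc123\n")
print(expand("$GROQ_KEY_1", env))  # gsk_abc123
```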

Add .env and crawl.json to your .gitignore so you never accidentally commit API keys.

.gitignore
.env
crawl.json
urls.txt
scraped/

Reference

Commands

phantomcrawl init: Generate a crawl.json template in the current directory. Run this first in a new project folder.
phantomcrawl start: Start crawling based on your crawl.json and urls.txt. Automatically resumes if a previous crawl was interrupted.
phantomcrawl reset: Wipe the crawl state database so the next start treats all URLs as fresh. Does not delete your scraped data files.
phantomcrawl stats: Show a detailed breakdown of all crawled URLs with timestamps, layer used, clean status, and any failures. Useful for checking what is pending after a quota hit.

Resume behavior

PhantomCrawl tracks every crawled URL in a local SQLite database at ~/.phantomcrawl/state.db. If a crawl is interrupted - whether by a network error, token quota, or manual stop - you can resume exactly where you left off by running phantomcrawl start again without resetting. Already-crawled URLs are skipped. Pending AI cleans are retried.
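The skip-already-crawled behaviour can be sketched with an in-memory SQLite database (the table schema here is assumed, not PhantomCrawl's actual one):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE crawled (url TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO crawled VALUES ('https://example.com', 'done')")

def pending(urls):
    """Return only the URLs not already marked done in the state DB."""
    done = {row[0] for row in
            db.execute("SELECT url FROM crawled WHERE status = 'done'")}
    return [u for u in urls if u not in done]

print(pending(["https://example.com", "https://example.org"]))
# ['https://example.org']
```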


Reference

Output Format

Each crawled URL creates a folder under your output directory. The folder name is derived from the page title or URL slug.

~/phantomcrawl/scraped/
  cloudflare.com/
    cloudflare/
      raw.json      # Full extracted data + raw HTML
      cleaned.json  # AI-cleaned text + structured data
    en-gb/
      raw.json
      cleaned.json

raw.json

Contains everything extracted from the page before AI processing.

{
  "url": "https://cloudflare.com",
  "title": "Cloudflare - The Web Performance & Security Company",
  "content": "Extracted text content...",
  "links": ["https://cloudflare.com/products", "..."],
  "images": ["https://cloudflare.com/img/logo.svg"],
  "documents": [],
  "emails": [],
  "metadata": { "description": "...", "og:image": "..." },
  "layer_used": "layer1",
  "crawled_at": "2026-03-14T...",
  "raw": "<!DOCTYPE html>..."
}

cleaned.json

AI-processed version with clean text and all structured data included.

{
  "url": "https://cloudflare.com",
  "title": "Cloudflare - The Web Performance & Security Company",
  "cleaned": "Cloudflare is a global network...\nProducts: CDN, DDoS protection...",
  "links": ["https://cloudflare.com/products", "..."],
  "images": ["https://cloudflare.com/img/logo.svg"],
  "metadata": { "description": "..." },
  "crawled_at": "2026-03-14T..."
}

Converting to other formats

Output is JSON. Convert to any format using standard tools:

python
import csv, json

with open('scraped/cloudflare.com/cloudflare/cleaned.json') as f:
    data = json.load(f)

print(data['cleaned'])   # AI-cleaned text
print(data['links'])     # All extracted links
print(data['metadata'])  # Page metadata

# Convert the extracted links to CSV
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])
    writer.writerows([link] for link in data['links'])


Phantom Suite

PhantomClean

Companion Tool · v1.0.0 · BSL License

The cleaning half of the pipeline

PhantomClean takes the raw scraped JSON from PhantomCrawl and runs it through a 4-layer cleaning pipeline — regex stripping, boilerplate detection, quality scoring, and a multi-provider AI cascade — producing clean, structured datasets ready for AI training or export.

While PhantomCrawl handles extraction, PhantomClean handles quality. It processes folders in configurable concurrent batches, watches for new files in real time as your scraper runs, and exports results in JSON, CSV, XML, TXT, or HTML.

Run PhantomCrawl and PhantomClean simultaneously in two terminals. PhantomClean watches the scraped folder and cleans files as soon as PhantomCrawl drops them. No waiting for the crawl to finish.


PhantomClean

Installation

PhantomClean ships as a pre-built binary. Download for your platform from GitHub Releases.

terminal
# Linux
chmod +x phantomclean-linux-amd64
sudo mv phantomclean-linux-amd64 /usr/local/bin/phantomclean

# Termux (Android)
chmod +x phantomclean-linux-arm64
mv phantomclean-linux-arm64 $PREFIX/bin/phantomclean

# macOS
chmod +x phantomclean-darwin-arm64
sudo mv phantomclean-darwin-arm64 /usr/local/bin/phantomclean

PhantomClean

Input Format Requirements

PhantomClean expects scraped data in a specific folder/file structure. PhantomCrawl produces this automatically — using any other scraper requires matching this format.

scraped/
  site-name/
    page-title/
      raw.json         # required — full extracted data
      cleaned.json     # optional — preferred over raw.json if present

Each raw.json must contain at least one of these fields. PhantomClean checks them in this order:

content (string): Plain text content — checked first. This is what PhantomCrawl puts extracted text into.
text (string): Fallback text field.
html (string): Raw HTML — last resort. PhantomClean will extract text from it.

If none of content, text, or html are present, the file is skipped with reason no text content found.
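The precedence check can be sketched as:

```python
def pick_text(record):
    """Check content, then text, then html, in PhantomClean's documented order."""
    for field in ("content", "text", "html"):
        value = record.get(field)
        if value:
            return field, value
    return None, None  # skipped: no text content found

print(pick_text({"text": "fallback", "html": "<p>x</p>"}))  # ('text', 'fallback')
print(pick_text({"title": "no body"}))                      # (None, None)
```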

These fields are optional but are preserved in the cleaned output when present:

url (string): Page URL
title (string): Page title — trademark symbols and site suffixes are stripped automatically
links (array): Extracted links
images (array): Image URLs — CDN proxy wrappers are automatically unwrapped
emails (array): Extracted email addresses
phones (array): Extracted phone numbers
metadata.description (string): Page meta description
layer_used (string): Which PhantomCrawl layer extracted this page
crawled_at (string): ISO timestamp of when the page was crawled

PhantomClean is designed for PhantomCrawl output. Using a different scraper is supported as long as the JSON structure matches the format above. Mismatched field names mean PhantomClean either skips the file or produces empty output.


PhantomClean

Configuration

Run phantomclean init to generate a cleaner.json template. All settings live here.

cleaner.json
{
  "concurrent_folders": 3,         // folders processed concurrently per batch
  "folder_to_clean": "./scraped",  // root scraped folder (PhantomCrawl output)
  "output_folder": "./organized",  // where cleaned files are written
  "output_file_name": "organized", // output filename without extension
  "export_format": ["json"],       // json, csv, xml, txt, html, all_formats
  "watch_mode": true,              // keep running after initial scan
  "watch_debounce_seconds": 2,     // seconds of silence before flushing new files
  "min_word_count": 10,            // skip files below this word count
  "boilerplate_threshold": 3,      // sentences seen this many times = boilerplate
  "quality_score_minimum": 0.5,    // skip files scoring below this (0.0-1.0)
  "language": "en",               // only keep english. use "all" to disable
  "content_type": "text",          // text, code, or mixed
  "strip_nav_links": true,         // remove navigation links from output
  "overwrite": false,             // re-process already-cleaned files
  "resume": true,                 // skip already-processed files on restart
  "omit_folders": [],              // folder names to skip entirely
  "output": {
    "zip_on_complete": true,
    "zip_name": "dataset-{date}-{file_count}-files"
  }
}
concurrent_folders (default 3): Number of folders processed concurrently within each batch. Tune based on your machine and API rate limits.
watch_mode (default true): After the initial scan completes, keep running and process new files as they arrive.
watch_debounce_seconds (default 2): When new files arrive, wait this many seconds of silence before processing them as a batch. Handles bursts from the scraper gracefully.
boilerplate_threshold (default 3): A sentence seen across this many files is considered boilerplate and stripped. Lower = more aggressive.
quality_score_minimum (default 0.5): Files scoring below this are skipped. Score is based on word count, special char ratio, avg word length, and caps ratio.
content_type (default text): text strips everything aggressively. code preserves all syntax. mixed preserves code blocks while cleaning surrounding text.
language (default en): Only keep files detected as English. Set to all to disable language filtering.
export_format (default ["json"]): Output formats. Supports json, csv, xml, txt, html, all_formats.

PhantomClean

Cleaning Pipeline

Every file passes through 4 layers in sequence. A file is skipped if it fails the quality or word count threshold after Layer 3.

Layer 1
Regex Stripping

Applies all patterns from regex.txt to strip navigation text, legal boilerplate, social share prompts, ads, and timestamps. Also decodes HTML entities and unicode escapes, strips emojis, and removes non-text characters. Behaviour changes based on content_type: code preserves syntax, mixed preserves code blocks.

Fully customisable via regex.txt. Add your own patterns.
Layer 2
Boilerplate Frequency Detection

Tracks every sentence seen across all files. Sentences that appear in more than boilerplate_threshold files are considered boilerplate and removed — things like cookie consent text, footer slogans, or repeated CTAs that appear on every page of a site.

Gets smarter as more files are processed — frequency builds up over the run.
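A batch sketch of frequency-based boilerplate stripping (PhantomClean builds frequency incrementally as the run progresses; this two-pass version is illustrative):

```python
from collections import Counter

def strip_boilerplate(docs, threshold):
    """Drop sentences that appear in more than `threshold` documents."""
    seen = Counter()
    for sentences in docs:
        for s in set(sentences):   # count each sentence once per document
            seen[s] += 1
    return [[s for s in sentences if seen[s] <= threshold]
            for sentences in docs]

docs = [["Unique intro.", "Accept all cookies."],
        ["Another page.", "Accept all cookies."],
        ["Third page.", "Accept all cookies."],
        ["Fourth page.", "Accept all cookies."]]
# "Accept all cookies." appears in 4 files > threshold 3, so it is stripped
print(strip_boilerplate(docs, 3)[0])  # ['Unique intro.']
```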
Layer 3
Quality Scoring

Scores content from 0.0 to 1.0 based on four signals: word count (below 50 words penalised), special character ratio (above 10% penalised), average word length (above 15 chars penalised), and uppercase ratio (above 50% penalised). Files below quality_score_minimum are skipped.

Catches CAPTCHA pages, error pages, and content-free files before they waste AI tokens.
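A sketch of the four-signal score. Only the thresholds above are documented; the flat 0.25 penalty per failed signal is an assumption, not PhantomClean's exact formula:

```python
def quality_score(text):
    """Score 0.0-1.0 from the four documented signals (assumed 0.25 penalty each)."""
    words = text.split()
    if not words:
        return 0.0
    score = 1.0
    if len(words) < 50:                                   # too short
        score -= 0.25
    specials = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if specials / len(text) > 0.10:                       # symbol-heavy
        score -= 0.25
    if sum(len(w) for w in words) / len(words) > 15:      # gibberish-length words
        score -= 0.25
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.50:
        score -= 0.25                                     # mostly uppercase
    return max(score, 0.0)

# A CAPTCHA-style page fails short, symbol, and caps checks
print(quality_score("VERIFY YOU ARE HUMAN!!!") < 0.5)  # True
```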
Layer 4
AI Cascade

Sends content through a chain of AI providers in order (Groq → OpenAI → Anthropic by default). Each provider is retried up to max_retries times before trying the next. Large documents are automatically chunked at 2000 words so token limits are never hit. Falls back to rule-based output if all providers fail.

Multi-provider cascade means one quota hit doesn't stop the whole run.

PhantomClean

AI Cascade

PhantomClean supports three AI providers in a cascade — it tries each in order and falls back to the next if one fails or hits a rate limit. If all fail, it falls back to rule-based output and marks the file as rules-only so you can retry with phantomclean clean-ai later.

cleaner.json
"ai": {
  "enabled": true,
  "fallback_to_rules": true,      // use rule output if all AI fails
  "only_if_no_cleaned": false,   // skip AI if cleaned.json already exists
  "prompt_file": "prompt.txt",   // path to your prompt file
  "rotate": "random",            // random or sequential key rotation
  "timeout_seconds": 15,          // per-request timeout
  "max_retries": 2,               // retries per provider before trying next
  "providers": [
    { "provider": "groq",      "model": "llama-3.3-70b-versatile", "keys": ["$GROQ_KEY_1"] },
    { "provider": "openai",    "model": "gpt-4o-mini",            "keys": ["$OPENAI_KEY_1"] },
    { "provider": "anthropic", "model": "claude-haiku-4-5-20251001", "keys": ["$ANTHROPIC_KEY_1"] }
  ]
}
.env
GROQ_KEY_1=gsk_your_key_here
GROQ_KEY_2=gsk_another_key_here
OPENAI_KEY_1=sk_your_key_here
ANTHROPIC_KEY_1=your_key_here

Groq is recommended as the first provider — it's free (500k tokens/day per key), fast, and handles most documents well. Multiple Groq keys from different accounts are rotated automatically to spread token usage.
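The cascade-with-retries behaviour can be sketched as follows (the provider call shape and the stub providers are hypothetical):

```python
def clean_with_cascade(text, providers, max_retries=2):
    """Try each provider in order, retrying before moving on; fall back to
    rule-based output if every provider fails (fallback_to_rules)."""
    for call in providers:
        for _ in range(max_retries):
            try:
                return call(text), "ai"
            except Exception:
                continue  # retry this provider, then cascade to the next
    return text.strip(), "rules-only"

def groq(text):   raise RuntimeError("429 rate limited")   # stub: quota hit
def openai(text): return text.strip().upper()              # stub: succeeds

print(clean_with_cascade("  hello  ", [groq, openai]))  # ('HELLO', 'ai')
```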

Chunking

Documents over 2000 words are automatically split into chunks before being sent to the AI. Each chunk is cleaned independently and the results are joined back together. This means even very large documents (10,000+ words) are handled without hitting token limits or timeouts.
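The 2000-word split can be sketched as:

```python
def chunk_words(text, size=2000):
    """Split text into chunks of at most `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = "word " * 4500                      # a 4,500-word document
chunks = chunk_words(doc)
print([len(c.split()) for c in chunks])   # [2000, 2000, 500]
# Each chunk is cleaned independently, then results are joined back together
```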


PhantomClean

Commands

phantomclean init: Generate cleaner.json, regex.txt, prompt.txt, and .env template in the current directory.
phantomclean start: Full batch scan of the scraped folder, then watch mode for new files. Always starts fresh — clears any interrupted state first.
phantomclean resume: Resume an interrupted run. Skips already-cleaned files, retries pending and failed ones.
phantomclean clean-ai: Retry AI cleaning on files that only received rule-based cleaning — useful after a quota hit or provider failure.
phantomclean stats: Show a full breakdown — done, skipped, failed, pending, rules-only — with file paths, scores, word counts, and timestamps.
phantomclean reset: Wipe all state. The next start re-processes everything. Does not delete output files.

PhantomClean

Output Format

Cleaned files mirror the input folder structure under output_folder.

organized/
  site-name/
    page-title/
      organized.json   # cleaned output
      organized.csv    # if csv in export_format
      organized.txt    # if txt in export_format

organized.json schema

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Cleaned text content...",
  "word_count": 1842,
  "quality_score": 0.94,
  "language": "en",
  "cleaned_at": "2026-03-18T12:00:00Z",
  "ai_used": "groq/llama-3.3-70b-versatile",
  "layer_used": "layer1",
  "crawled_at": "2026-03-18T11:58:00Z",
  "links": ["https://example.com/other"],
  "images": ["https://example.com/img.jpg"],
  "emails": [],
  "phones": []
}

Phantom Suite

Running the Full Pipeline

PhantomCrawl and PhantomClean are designed to run simultaneously. Point PhantomClean at PhantomCrawl's output folder and both tools handle their half of the pipeline in real time.

1
Set up PhantomCrawl
phantomcrawl init
# Edit crawl.json — set output to ./scraped
# Add URLs to urls.txt
2
Set up PhantomClean
phantomclean init
# Edit cleaner.json — set folder_to_clean to ./scraped
# Add AI keys to .env
3
Run both simultaneously
# Terminal 1
phantomcrawl start

# Terminal 2
phantomclean start

PhantomClean watches the scraped folder with a debounce buffer. As PhantomCrawl drops new folders, PhantomClean picks them up, batches them, and cleans them automatically.

4
Retry AI on quota hits
# After the crawl is done, retry any rules-only files
phantomclean clean-ai

PhantomClean prefers cleaned.json over raw.json when both exist. If PhantomCrawl already AI-cleaned a page, PhantomClean uses that output and skips redundant AI processing.
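The debounce behaviour from step 3 can be sketched with timestamps in place of a real file watcher (illustrative, not PhantomClean's actual watcher):

```python
def debounce(events, quiet):
    """events: sorted (timestamp, path) pairs. A gap longer than `quiet`
    seconds flushes the current batch, mirroring watch_debounce_seconds."""
    batches, batch, last = [], [], None
    for ts, path in events:
        if last is not None and ts - last > quiet:
            batches.append(batch)
            batch = []
        batch.append(path)
        last = ts
    if batch:
        batches.append(batch)
    return batches

events = [(0.0, "a"), (0.5, "b"), (5.0, "c")]  # burst of two, then a straggler
print(debounce(events, quiet=2))  # [['a', 'b'], ['c']]
```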

Open Source

Contributing

PhantomCrawl is open source under the BSL license - free for personal and non-commercial use. Contributions are welcome.

1
Fork the repo
git clone https://github.com/var-raphael/PhantomCrawl.git
cd PhantomCrawl
2
Build and test
go build ./...
go run main.go init
go run main.go start
3
Open a pull request

Submit your changes at github.com/var-raphael/PhantomCrawl.

Built by Raphael Samuel, 18, Lagos, Nigeria. Self-taught. Started coding on a phone.