CodeDox Documentation

Transform any documentation site into a searchable code database

CodeDox is an AI-powered documentation extraction and search system that crawls documentation websites, intelligently extracts code snippets with context, and provides lightning-fast search capabilities through PostgreSQL full-text search and Model Context Protocol (MCP) integration.

Why CodeDox?

The Problem

Developers spend countless hours searching through documentation sites for code examples. Documentation is scattered across different sites, formats, and versions, making it difficult to find relevant code quickly.

The Solution

CodeDox solves this by:

  • Centralizing all your documentation sources in one searchable database
  • Extracting code with intelligent context understanding
  • Providing instant search across all your documentation
  • Integrating directly with AI assistants via MCP

Key Features

Intelligent Code Extraction

  • AI-Enhanced Processing: LLM-powered language detection, title generation, and intelligent descriptions
  • Multi-Format Support: Extracts code from Markdown, HTML, and plain text files
  • Smart Context Preservation: Maintains relationships and surrounding context for better code understanding
  • Batch Processing: Efficient parallel processing with configurable concurrency

Powerful Search & Filtering

  • Full-Text Search: PostgreSQL-powered search with weighted fields (title > description > code); see the example after this list
  • Advanced Filtering: Filter by language, source, date range, snippet count, and content type
  • Fuzzy Matching: Built-in typo tolerance and similarity scoring
  • Document-Level Search: Search within specific documentation pages with highlighting
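
For a concrete feel, here's a minimal, hedged Python sketch that queries a locally running instance over HTTP. The parameter names mirror the GET /search fields documented under API Reference below; the host and port assume the default local deployment used elsewhere on this page, and the response shape is deliberately not assumed.

import requests  # third-party; install with `pip install requests`

# Query the /search endpoint of a local CodeDox deployment.
# Parameter names mirror the documented GET /search fields;
# host and port assume the default local setup.
resp = requests.get(
    "http://localhost:8000/search",
    params={
        "query": "authentication middleware",
        "source_name": "Express",
        "language": "javascript",
        "limit": 20,
        "offset": 0,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # inspect the payload; the exact response shape is not documented here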

File Upload System

  • Batch File Upload: Upload multiple markdown/text files simultaneously
  • Directory Processing: Recursive directory scanning with progress tracking
  • GitHub Repository Support: Clone and process entire repositories or specific folders
  • Smart Deduplication: Content-hash based duplicate detection across uploads and crawls (sketched after this list)
  • Mixed Sources: Seamlessly search across crawled and uploaded content
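
The deduplication idea is straightforward to sketch. The following is illustrative only (CodeDox's actual normalization and storage are internal details), but it shows how a content hash catches duplicates across uploads and crawls:

import hashlib

def content_hash(text: str) -> str:
    """Hash normalized content so trivially reformatted copies collide."""
    normalized = " ".join(text.split())  # collapse whitespace differences
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()

def is_duplicate(text: str) -> bool:
    """Return True if equivalent content was already ingested."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False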

Advanced Crawl Management

  • Health Monitoring: Real-time job health tracking with stall detection
  • Resume Capability: Automatically resume failed crawls from last checkpoint
  • Depth Control: Configurable crawl depth (0-3) with URL pattern filtering
  • Domain Intelligence: Smart domain detection and restriction
  • Recrawl Support: Update existing sources with content-hash optimization

MCP Integration

  • Multiple Transport Modes: HTTP (recommended), Server-Sent Events, and Stdio
  • 4 Core Tools: init_crawl, search_libraries, get_content, get_page_markdown
  • Token Authentication: Secure remote deployments with multiple token support
  • Pagination Support: Built-in pagination for large result sets

Modern Web Dashboard

  • Real-Time Monitoring: Live crawl progress with WebSocket updates
  • Source Management: Edit names, bulk operations, filtered deletion
  • Advanced Search UI: Multi-criteria filtering with instant results
  • Document Browser: Full markdown viewing with syntax highlighting
  • Performance Analytics: Detailed statistics and crawl health metrics

What's New in the Latest Version

GitHub Repository Processing

  • Direct Repository Clone: Process documentation without manual download
  • Selective Processing: Target specific folders within repositories
  • Private Repository Support: Token authentication for private repos
  • Pattern Filtering: Include/exclude files based on patterns

Advanced File Upload System

  • Batch Upload: Upload multiple markdown files simultaneously with progress tracking
  • Directory Processing: Recursive directory scanning with automatic file type detection
  • Smart Deduplication: Content-hash based duplicate detection across all sources

Enhanced Crawl Management

  • Health Monitoring: Real-time job health tracking with automatic stall detection
  • Resume Capability: Automatically resume failed crawls from the last successful checkpoint
  • Bulk Operations: Manage multiple crawl jobs with bulk cancel, delete, and status operations

Powerful Search & Filtering

  • Document-Level Search: Search within specific documentation pages with highlighted results
  • Advanced Filtering: Filter sources by snippet count, date range, content type, and more
  • Pagination Support: Navigate through large result sets efficiently across all endpoints

MCP Tool Enhancements

  • get_page_markdown: New tool to retrieve full documentation pages with search and chunking
  • Improved get_content: Enhanced with pagination and better library name resolution
  • Token Chunking: Intelligent content chunking for large documents with overlap support
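
For intuition, here's a minimal sketch of overlap-aware chunking. The real implementation counts model tokens; this version splits on whitespace purely to show how an overlap preserves context across chunk boundaries (the max_tokens default mirrors the get_page_markdown example below):

def chunk_tokens(text: str, max_tokens: int = 2048, overlap: int = 128):
    """Yield overlapping chunks; assumes overlap < max_tokens."""
    tokens = text.split()  # whitespace split as a stand-in for a real tokenizer
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        yield " ".join(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # final chunk reached; avoid emitting a redundant tail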

Installation

The easiest way to get started with CodeDox:

# Clone the repository
git clone https://github.com/chriswritescode-dev/codedox.git
cd codedox

# Configure environment
cp .env.example .env
# Edit .env to add your CODE_LLM_API_KEY (optional)

# Run automated setup
./docker-setup.sh

# Access the web UI at http://localhost:5173
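
If you enable the optional AI features, a minimal .env sketch might look like this (values are illustrative; CODE_LLM_BASE_URL is only needed for local or non-default endpoints, as noted under Common Questions below):

# Optional AI enhancement; leave unset to run in standalone mode
CODE_LLM_API_KEY=your-api-key
# Any OpenAI-compatible endpoint works, e.g. a local Ollama server
CODE_LLM_BASE_URL=http://localhost:11434/v1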

Manual Installation

For detailed manual installation instructions, see the README.

Quick Start Guide

1. Start Your First Crawl

# Crawl React documentation with pattern filtering
python cli.py crawl start "React" https://react.dev/docs --depth 2 \
  --url-patterns "*docs*" "*guide*" --concurrent 10

# Check crawl status and health
python cli.py crawl status <job-id>
python cli.py crawl health

2. Upload Local Documentation

# Upload single markdown file
python cli.py upload /path/to/docs.md --name "My Docs"

# Process entire documentation directory
python cli.py upload ./docs-directory --name "Local Documentation"

# Process GitHub repository
python cli.py upload-repo https://github.com/user/repo --name "Repository Docs"

# Process specific folder in repository
python cli.py upload-repo https://github.com/user/repo --path docs

3. Search and Explore Content

# Search across all sources
python cli.py search "authentication middleware"

# Search within specific source
python cli.py search "useState hook" --source "React"

# Use the advanced web UI at http://localhost:5173

4. Integrate with AI Assistants

Configure your AI assistant to use CodeDox:

{
  "mcpServers": {
    "codedox": {
      "url": "http://localhost:8000/mcp",
      "transport": "http"
    }
  }
}
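
For token-protected remote deployments (see Token Authentication under MCP Integration), many MCP clients let you attach the token as an HTTP header. The headers key below is a hedged sketch; whether and how your client supports custom headers depends on the client itself:

{
  "mcpServers": {
    "codedox": {
      "url": "https://your-host:8000/mcp",
      "transport": "http",
      "headers": {
        "Authorization": "Bearer YOUR_TOKEN"
      }
    }
  }
}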

5. Manage and Monitor

# List all crawl jobs
python cli.py crawl list

# Resume failed crawl
python cli.py crawl resume <job-id>

# Cancel running jobs
python cli.py crawl cancel <job-id>


API Reference

Search Endpoints

GET /search - Advanced code search

{
  "query": "authentication middleware",
  "source_name": "Express",
  "language": "javascript",
  "limit": 20,
  "offset": 0
}

Source Management

GET /sources - List all documentation sources
GET /sources/search - Advanced source filtering
GET /sources/{id}/documents - Browse source documents
GET /sources/{id}/snippets - Get source code snippets
DELETE /sources/bulk - Bulk source deletion
POST /sources/bulk/delete-filtered - Delete by criteria
PATCH /sources/{id} - Update source names
POST /sources/{id}/recrawl - Recrawl existing sources

Document Access

GET /documents/markdown?url={url} - Get full page markdown
GET /documents/{id}/markdown - Get document by ID
GET /documents/search - Search documents by title/URL
GET /documents/{id}/snippets - Get document code snippets

Crawl Management

POST /crawl/init - Advanced crawl configuration

{
  "name": "Vue.js Documentation",
  "start_urls": ["https://vuejs.org/docs"],
  "max_depth": 2,
  "domain_filter": "vuejs.org",
  "url_patterns": ["*docs*", "*guide*", "*api*"],
  "max_concurrent_crawls": 15,
  "max_pages": 1000
}
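
A hedged sketch of kicking off the same crawl from Python (the payload mirrors the documented body above; host, port, and the response shape are assumptions about a default local deployment):

import requests  # third-party; install with `pip install requests`

payload = {
    "name": "Vue.js Documentation",
    "start_urls": ["https://vuejs.org/docs"],
    "max_depth": 2,
    "url_patterns": ["*docs*", "*guide*"],
}
resp = requests.post("http://localhost:8000/crawl/init", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # expect a job identifier you can poll via GET /crawl/status/{job_id}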

GET /crawl/status/{job_id} - Detailed progress tracking
POST /crawl/cancel/{job_id} - Cancel running jobs
POST /crawl-jobs/bulk/cancel - Bulk job cancellation
DELETE /crawl-jobs/bulk - Bulk job deletion

Upload System

POST /upload/file - Single file upload

{
  "file": "documentation.md",
  "name": "Project Documentation",
  "title": "Custom Title"
}

POST /upload/files - Batch file upload

{
  "files": ["doc1.md", "doc2.md", "doc3.md"],
  "name": "Documentation Set"
}

POST /upload/markdown - Direct content upload

{
  "content": "# Title\n\n```python\nprint('hello')\n```",
  "name": "Code Examples"
}

MCP Tools

init_crawl - Start documentation crawling with advanced options

{
  "name": "Next.js",
  "start_urls": ["https://nextjs.org/docs"],
  "max_depth": 2,
  "domain_filter": "nextjs.org",
  "url_patterns": ["*docs*", "*guide*", "*api*"],
  "max_concurrent_crawls": 20
}

search_libraries - Find available libraries with pagination

{
  "query": "javascript",
  "limit": 20,
  "page": 1
}

get_content - Retrieve code snippets with search within library

{
  "library_id": "nextjs-docs",
  "query": "routing middleware",
  "limit": 10,
  "page": 1
}

get_page_markdown - Get full documentation page with search and chunking

{
  "url": "https://nextjs.org/docs/app/building-your-application/routing",
  "query": "middleware",
  "max_tokens": 2048,
  "chunk_index": 0
}

Architecture Overview

CodeDox uses a modular architecture designed for scalability and extensibility:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Web UI    │────▶│   FastAPI   │────▶│ PostgreSQL  │
│   (React)   │     │   Server    │     │  Database   │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                    ┌─────────────┐
                    │     MCP     │
                    │   Server    │
                    └─────────────┘
                    ┌──────┴──────┐
                    ▼             ▼
              ┌─────────┐   ┌─────────┐
              │   AI    │   │   AI    │
              │Assistant│   │Assistant│
              └─────────┘   └─────────┘

Components

  • Web UI: React-based dashboard for visual management
  • API Server: FastAPI backend handling requests and crawls
  • PostgreSQL: Full-text search and data storage
  • MCP Server: Protocol bridge for AI assistants
  • Crawler: Crawl4AI-based web scraping engine
  • Extractor: HTML and LLM-based code extraction

Performance & Optimization Features

Smart Content Management

  • Content-hash based deduplication saves 60-90% on re-crawl costs
  • Intelligent caching at multiple levels (database, search, content)
  • Batch processing with configurable concurrency limits
  • Progressive crawling with depth and pattern controls

Advanced Database Features

  • PostgreSQL full-text search with custom ranking
  • Optimized indexes on search vectors and metadata
  • Efficient pagination with cursor-based navigation
  • Bulk operations for source and job management

Health & Monitoring

  • Real-time crawl health monitoring with stall detection
  • WebSocket-based progress updates
  • Failed page tracking and retry mechanisms
  • Performance analytics and crawl statistics

Scalability Features

  • Configurable concurrent crawl sessions (up to 20)
  • Batch upload processing with size limits
  • Memory-efficient chunk processing for large documents
  • Rate limiting and polite crawling with configurable delays

Use Cases

Development Teams

  • Centralized Documentation: Aggregate all framework docs, internal guides, and code examples in one searchable database
  • Rapid Development: Find relevant code patterns instantly without browsing multiple documentation sites
  • Team Onboarding: Create searchable knowledge bases combining public docs with internal documentation
  • Code Discovery: Use advanced filtering to find examples by language, framework version, or specific patterns

AI Assistant Enhancement

  • Smart Code Context: Provide AI assistants with up-to-date, relevant code examples from actual documentation
  • Documentation-Aware Responses: Enable AI to reference specific documentation pages and code snippets
  • Custom Knowledge Bases: Build domain-specific AI tools with curated documentation sets
  • Real-Time Updates: Keep AI assistants current with latest framework changes through automated recrawling

Documentation Management

  • Batch Operations: Process large documentation sets efficiently with concurrent crawling and batch uploads
  • Content Deduplication: Automatically detect and handle duplicate content across multiple sources
  • Performance Optimization: Use intelligent caching and content-hash tracking to minimize re-processing costs

Contributing

We welcome contributions! CodeDox is open source and accepts pull requests for:

  • New documentation framework support
  • Additional extraction patterns
  • Performance improvements
  • Bug fixes and enhancements

See our GitHub repository for contribution guidelines.

Support

Common Questions

Q: Do I need an API key to use CodeDox?
A: No, CodeDox works without API keys in standalone mode. API keys enable enhanced AI features for better extraction quality.

Q: What documentation sites are supported?
A: CodeDox works with any HTML-based documentation site, including Docusaurus, VitePress, MkDocs, Sphinx, and custom sites.

Q: How much does it cost to run?
A: CodeDox itself is free and open source. Optional AI enhancement costs depend on your LLM provider (OpenAI, Anthropic, or local models). Smart deduplication reduces API costs by 60-90% on re-crawls.

Q: Can I use local LLMs?
A: Yes! CodeDox supports Ollama and any OpenAI-compatible API endpoint. Configure CODE_LLM_BASE_URL to point to your local model server.

Q: What file types can I upload?
A: Currently supports Markdown (.md), MDX (.mdx), and plain text files (.txt). Batch upload supports multiple files simultaneously with automatic duplicate detection.

Q: How does health monitoring work?
A: CodeDox automatically monitors crawl jobs for stalls, tracks progress via heartbeats, and can resume failed jobs. Use python cli.py crawl health to check job status, or view active jobs in the web UI.

Q: Can I recrawl existing sources?
A: Yes! Use the recrawl feature to update existing documentation sources. Content-hash tracking ensures only changed pages are reprocessed, saving time and API costs.

Troubleshooting

For detailed troubleshooting, see the project README and the GitHub repository.


Ready to transform your documentation into a searchable knowledge base?

Get Started with CodeDox →