CodeDox Documentation
Transform any documentation site into a searchable code database
CodeDox is an AI-powered documentation extraction and search system that crawls documentation websites, intelligently extracts code snippets with context, and provides lightning-fast search capabilities through PostgreSQL full-text search and Model Context Protocol (MCP) integration.
Why CodeDox?
The Problem
Developers spend countless hours searching through documentation sites for code examples. Documentation is scattered across different sites, formats, and versions, making it difficult to find relevant code quickly.
The Solution
CodeDox solves this by:
- Centralizing all your documentation sources in one searchable database
- Extracting code with intelligent context understanding
- Providing instant search across all your documentation
- Integrating directly with AI assistants via MCP
Key Features
Intelligent Code Extraction
- AI-Enhanced Processing: LLM-powered language detection, title generation, and intelligent descriptions
- Multi-Format Support: Extracts from Markdown, HTML, and plain text files
- Smart Context Preservation: Maintains relationships and surrounding context for better code understanding
- Batch Processing: Efficient parallel processing with configurable concurrency
Lightning-Fast Search
- Full-Text Search: PostgreSQL-powered search with weighted fields (title > description > code); see the sketch after this list
- Advanced Filtering: Filter by language, source, date range, snippet count, and content type
- Fuzzy Matching: Built-in typo tolerance and similarity scoring
- Document-Level Search: Search within specific documentation pages with highlighting
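For readers curious how the weighting works, here is a self-contained PostgreSQL sketch (illustrative only; this is not CodeDox's actual schema or query). setweight labels each field's tsvector so that title matches (A) outrank description matches (B), which in turn outrank matches inside the code body (C):

```bash
# Standalone illustration of weighted full-text ranking.
# Requires a local PostgreSQL; the document text is made up.
psql -d postgres <<'SQL'
SELECT ts_rank(
         setweight(to_tsvector('english', 'useState hook reference'), 'A') ||
         setweight(to_tsvector('english', 'managing component state'), 'B') ||
         setweight(to_tsvector('english', 'const [count, setCount] = useState(0)'), 'C'),
         plainto_tsquery('english', 'useState')
       ) AS rank;
SQL
```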
File Upload System
- Batch File Upload: Upload multiple markdown/text files simultaneously
- Directory Processing: Recursive directory scanning with progress tracking
- GitHub Repository Support: Clone and process entire repositories or specific folders
- Smart Deduplication: Content-hash based duplicate detection across uploads and crawls
- Mixed Sources: Seamlessly search across crawled and uploaded content
Advanced Crawl Management
- Health Monitoring: Real-time job health tracking with stall detection
- Resume Capability: Automatically resume failed crawls from last checkpoint
- Depth Control: Configurable crawl depth (0-3) with URL pattern filtering
- Domain Intelligence: Smart domain detection and restriction
- Recrawl Support: Update existing sources with content-hash optimization
MCP Integration
- Multiple Transport Modes: HTTP (recommended), Server-Sent Events, and Stdio
- 4 Core Tools: init_crawl, search_libraries, get_content, get_page_markdown
- Token Authentication: Secure remote deployments with multiple token support
- Pagination Support: Built-in pagination for large result sets
Modern Web Dashboard
- Real-Time Monitoring: Live crawl progress with WebSocket updates
- Source Management: Edit names, bulk operations, filtered deletion
- Advanced Search UI: Multi-criteria filtering with instant results
- Document Browser: Full markdown viewing with syntax highlighting
- Performance Analytics: Detailed statistics and crawl health metrics
What's New in Latest Version
GitHub Repository Processing
- Direct Repository Clone: Process documentation without manual download
- Selective Processing: Target specific folders within repositories
- Private Repository Support: Token authentication for private repos
- Pattern Filtering: Include/exclude files based on patterns
Advanced File Upload System
- Batch Upload: Upload multiple markdown files simultaneously with progress tracking
- Directory Processing: Recursive directory scanning with automatic file type detection
- Smart Deduplication: Content-hash based duplicate detection across all sources
Enhanced Crawl Management
- Health Monitoring: Real-time job health tracking with automatic stall detection
- Resume Capability: Automatically resume failed crawls from the last successful checkpoint
- Bulk Operations: Manage multiple crawl jobs with bulk cancel, delete, and status operations
Powerful Search & Filtering
- Document-Level Search: Search within specific documentation pages with highlighted results
- Advanced Filtering: Filter sources by snippet count, date range, content type, and more
- Pagination Support: Navigate through large result sets efficiently across all endpoints
MCP Tool Enhancements
- get_page_markdown: New tool to retrieve full documentation pages with search and chunking
- Improved get_content: Enhanced with pagination and better library name resolution
- Token Chunking: Intelligent content chunking for large documents with overlap support
Installation
Docker Setup (Recommended)
The easiest way to get started with CodeDox:
```bash
# Clone the repository
git clone https://github.com/chriswritescode-dev/codedox.git
cd codedox

# Configure environment
cp .env.example .env
# Edit .env to add your CODE_LLM_API_KEY (optional)

# Run automated setup
./docker-setup.sh

# Access the web UI at http://localhost:5173
```
Manual Installation
For detailed manual installation instructions, see the README.
Quick Start Guide
1. Start Your First Crawl
```bash
# Crawl React documentation with pattern filtering
python cli.py crawl start "React" https://react.dev/docs --depth 2 \
  --url-patterns "*docs*" "*guide*" --concurrent 10

# Check crawl status and health
python cli.py crawl status <job-id>
python cli.py crawl health
```
2. Upload Local Documentation
```bash
# Upload single markdown file
python cli.py upload /path/to/docs.md --name "My Docs"

# Process entire documentation directory
python cli.py upload ./docs-directory --name "Local Documentation"

# Process GitHub repository
python cli.py upload-repo https://github.com/user/repo --name "Repository Docs"

# Process specific folder in repository
python cli.py upload-repo https://github.com/user/repo --path docs
```
3. Search and Explore Content
```bash
# Search across all sources
python cli.py search "authentication middleware"

# Search within specific source
python cli.py search "useState hook" --source "React"

# Use the advanced web UI at http://localhost:5173
```
4. Integrate with AI Assistants
Configure your AI assistant to use CodeDox:
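The exact configuration depends on your MCP client: key names vary between assistants, and the endpoint path and port below are assumptions, so check your client's documentation and your CodeDox deployment. A minimal sketch for an HTTP-transport client:

```json
{
  "mcpServers": {
    "codedox": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

With HTTP transport, multiple assistants can share a single running CodeDox server.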
5. Manage and Monitor
```bash
# List all crawl jobs
python cli.py crawl list

# Resume failed crawl
python cli.py crawl resume <job-id>

# Cancel running jobs
python cli.py crawl cancel <job-id>
```
Documentation Structure
Core Documentation
- Features - Detailed feature documentation
- Upload Guide - Directory and file upload instructions
- API Reference - Complete API documentation
API Reference
Search Endpoints
GET /search - Advanced code search
```json
{
  "query": "authentication middleware",
  "source_name": "Express",
  "language": "javascript",
  "limit": 20,
  "offset": 0
}
```
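An illustrative call to this endpoint, with the parameters above passed as query parameters (host and port are assumptions; adjust to your deployment):

```bash
# Hypothetical base URL; parameter names match the body shown above.
curl -G "http://localhost:8000/search" \
  --data-urlencode "query=authentication middleware" \
  --data-urlencode "source_name=Express" \
  --data-urlencode "language=javascript" \
  --data-urlencode "limit=20"
```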
Source Management
GET /sources - List all documentation sources
GET /sources/search - Advanced source filtering
GET /sources/{id}/documents - Browse source documents
GET /sources/{id}/snippets - Get source code snippets
DELETE /sources/bulk - Bulk source deletion
POST /sources/bulk/delete-filtered - Delete by criteria
PATCH /sources/{id} - Update source names
POST /sources/{id}/recrawl - Recrawl existing sources
Document Access
GET /documents/markdown?url={url} - Get full page markdown
GET /documents/{id}/markdown - Get document by ID
GET /documents/search - Search documents by title/URL
GET /documents/{id}/snippets - Get document code snippets
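Because the markdown endpoint takes a full URL as a query parameter, that URL must be percent-encoded. An illustrative call (host and port are assumptions):

```bash
# --data-urlencode handles percent-encoding of the nested URL.
curl -G "http://localhost:8000/documents/markdown" \
  --data-urlencode "url=https://react.dev/docs"
```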
Crawl Management
POST /crawl/init - Advanced crawl configuration
```json
{
  "name": "Vue.js Documentation",
  "start_urls": ["https://vuejs.org/docs"],
  "max_depth": 2,
  "domain_filter": "vuejs.org",
  "url_patterns": ["*docs*", "*guide*", "*api*"],
  "max_concurrent_crawls": 15,
  "max_pages": 1000
}
```
GET /crawl/status/{job_id} - Detailed progress tracking
POST /crawl/cancel/{job_id} - Cancel running jobs
POST /crawl-jobs/bulk/cancel - Bulk job cancellation
DELETE /crawl-jobs/bulk - Bulk job deletion
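To start a crawl over HTTP with the /crawl/init body shown above (a sketch; host and port are assumptions, adjust to your deployment):

```bash
curl -X POST "http://localhost:8000/crawl/init" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Vue.js Documentation",
        "start_urls": ["https://vuejs.org/docs"],
        "max_depth": 2,
        "url_patterns": ["*docs*", "*guide*"]
      }'
```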
Upload System
POST /upload/file - Single file upload
POST /upload/files - Batch file upload
POST /upload/markdown - Direct content upload
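An illustrative multipart upload (the form field name, host, and port are assumptions, not confirmed API details):

```bash
# "file" is a hypothetical multipart field name; check the API reference.
curl -X POST "http://localhost:8000/upload/file" \
  -F "file=@./docs/getting-started.md"
```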
MCP Tools
init_crawl - Start documentation crawling with advanced options
```json
{
  "name": "Next.js",
  "start_urls": ["https://nextjs.org/docs"],
  "max_depth": 2,
  "domain_filter": "nextjs.org",
  "url_patterns": ["*docs*", "*guide*", "*api*"],
  "max_concurrent_crawls": 20
}
```
search_libraries - Find available libraries with pagination
get_content - Retrieve code snippets with search within library
get_page_markdown - Get full documentation page with search and chunking
```json
{
  "url": "https://nextjs.org/docs/app/building-your-application/routing",
  "query": "middleware",
  "max_tokens": 2048,
  "chunk_index": 0
}
```
Architecture Overview
CodeDox uses a modular architecture designed for scalability and extensibility:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Web UI    │────▶│   FastAPI   │────▶│ PostgreSQL  │
│   (React)   │     │   Server    │     │  Database   │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │     MCP     │
                    │   Server    │
                    └─────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
               ┌─────────┐   ┌─────────┐
               │   AI    │   │   AI    │
               │Assistant│   │Assistant│
               └─────────┘   └─────────┘
```
Components
- Web UI: React-based dashboard for visual management
- API Server: FastAPI backend handling requests and crawls
- PostgreSQL: Full-text search and data storage
- MCP Server: Protocol bridge for AI assistants
- Crawler: Crawl4AI-based web scraping engine
- Extractor: HTML and LLM-based code extraction
Performance & Optimization Features
Smart Content Management
- Content-hash based deduplication saves 60-90% on re-crawl costs (see the sketch below)
- Intelligent caching at multiple levels (database, search, content)
- Batch processing with configurable concurrency limits
- Progressive crawling with depth and pattern controls
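The content-hash idea in miniature (a sketch, not CodeDox internals): hash each fetched page, compare against the stored hash, and skip expensive re-extraction when nothing changed.

```bash
# Illustrative content-hash deduplication; the URL and paths are placeholders.
url="https://example.com/docs/page"
cache=".cache/page.hash"
new_hash=$(curl -s "$url" | sha256sum | cut -d' ' -f1)
if [ -f "$cache" ] && [ "$new_hash" = "$(cat "$cache")" ]; then
  echo "content unchanged, skipping re-extraction"
else
  echo "$new_hash" > "$cache"
  # ...re-run extraction for this page...
fi
```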
Advanced Database Features
- PostgreSQL full-text search with custom ranking
- Optimized indexes on search vectors and metadata
- Efficient pagination with cursor-based navigation
- Bulk operations for source and job management

Health & Monitoring
- Real-time crawl health monitoring with stall detection
- WebSocket-based progress updates
- Failed page tracking and retry mechanisms
- Performance analytics and crawl statistics

Scalability Features
- Configurable concurrent crawl sessions (up to 20)
- Batch upload processing with size limits
- Memory-efficient chunk processing for large documents
- Rate limiting and polite crawling with configurable delays
Use Cases
Development Teams
- Centralized Documentation: Aggregate all framework docs, internal guides, and code examples in one searchable database
- Rapid Development: Find relevant code patterns instantly without browsing multiple documentation sites
- Team Onboarding: Create searchable knowledge bases combining public docs with internal documentation
- Code Discovery: Use advanced filtering to find examples by language, framework version, or specific patterns
AI Assistant Enhancement
- Smart Code Context: Provide AI assistants with up-to-date, relevant code examples from actual documentation
- Documentation-Aware Responses: Enable AI to reference specific documentation pages and code snippets
- Custom Knowledge Bases: Build domain-specific AI tools with curated documentation sets
- Real-Time Updates: Keep AI assistants current with latest framework changes through automated recrawling
Documentation Management
- Batch Operations: Process large documentation sets efficiently with concurrent crawling and batch uploads
- Content Deduplication: Automatically detect and handle duplicate content across multiple sources
- Performance Optimization: Use intelligent caching and content-hash tracking to minimize re-processing costs
Contributing
We welcome contributions! CodeDox is open source and accepts pull requests for:
- New documentation framework support
- Additional extraction patterns
- Performance improvements
- Bug fixes and enhancements
See our GitHub repository for contribution guidelines.
Support
- Issues: GitHub Issues
- Author: Chris Scott - chriswritescode.dev
- License: MIT
Getting Help
Common Questions
Q: Do I need an API key to use CodeDox?
A: No, CodeDox works without API keys in standalone mode. API keys enable enhanced AI features for better extraction quality.

Q: What documentation sites are supported?
A: CodeDox works with any HTML-based documentation site, including Docusaurus, VitePress, MkDocs, Sphinx, and custom sites.

Q: How much does it cost to run?
A: CodeDox itself is free and open source. Optional AI enhancement costs depend on your LLM provider (OpenAI, Anthropic, or local models). Smart deduplication reduces API costs by 60-90% on re-crawls.
Q: Can I use local LLMs?
A: Yes! CodeDox supports Ollama and any OpenAI-compatible API endpoint. Configure CODE_LLM_BASE_URL to point to your local model server.
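For example, with an Ollama server exposing its OpenAI-compatible API (the port and key value below are illustrative; some local servers ignore the key entirely), your .env might contain:

```bash
# Illustrative .env values for a local OpenAI-compatible endpoint
CODE_LLM_BASE_URL=http://localhost:11434/v1   # Ollama's OpenAI-compatible API
CODE_LLM_API_KEY=ollama                       # placeholder; often unused locally
```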
Q: What file types can I upload?
A: Currently supports Markdown (.md), MDX (.mdx), and plain text files (.txt). Batch upload supports multiple files simultaneously with automatic duplicate detection.
Q: How does health monitoring work?
A: CodeDox automatically monitors crawl jobs for stalls, tracks progress via heartbeats, and can resume failed jobs. Use python cli.py crawl health to check job status or use the web UI to view active jobs.
Q: Can I recrawl existing sources?
A: Yes! Use the recrawl feature to update existing documentation sources. Content-hash tracking ensures only changed pages are reprocessed, saving time and API costs.
Ready to transform your documentation into a searchable knowledge base?