CodeDox Features
Overview
CodeDox is a comprehensive documentation extraction and search system that transforms any documentation website into a searchable code database. This document provides detailed information about each feature, configuration options, and best practices.
Table of Contents
- Web Crawling
- Code Extraction
- Search Capabilities
- MCP Integration
- Web Dashboard
- Performance Optimization
- Upload Features
- API Features
Web Crawling
Depth-Controlled Crawling
CodeDox implements intelligent crawling with configurable depth levels (0-3) to control how deep into a documentation site the crawler explores.
Depth Levels:
- Level 0: Only crawl the provided URLs (no following links)
- Level 1: Crawl provided URLs and their direct links
- Level 2: Crawl two levels deep from starting URLs
- Level 3: Maximum depth for comprehensive documentation capture
Usage Example:
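A depth-limited crawl might look like the sketch below; the depth flag name is an assumption here (the MCP tool calls the same setting max_depth), so check the CLI help for the exact option:

```bash
# Crawl the provided URL and follow links one level deep
# (--depth is an assumed flag name; verify with: python cli.py crawl start --help)
python cli.py crawl start "Next.js" https://nextjs.org/docs --depth 1
```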
URL Pattern Filtering
Control which pages are crawled using glob patterns to focus on relevant documentation.
Configuration Options:
- --url-patterns: Include only URLs matching these patterns
- --domain: Restrict crawling to specific domains
Examples:
# Only crawl documentation pages
python cli.py crawl start "Next.js" https://nextjs.org \
--url-patterns "*docs*" "*guide*" "*api*"
# Restrict to specific domain
python cli.py crawl start "Vue" https://vuejs.org \
--domain "vuejs.org"
Concurrent Crawling
Optimize crawl speed with configurable concurrent sessions while respecting server resources.
Configuration:
- Default: 3 concurrent pages
- Maximum: 20 concurrent pages
- Automatic rate limiting: 1 second delay between requests
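These limits correspond to the crawling variables listed under Environment Variables below, for example:

```bash
# .env - crawling limits (values shown are examples)
CRAWL_DELAY=1.0             # seconds to wait between requests
MAX_CONCURRENT_CRAWLS=5     # concurrent pages; the crawler caps this at 20
```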
Crawl Management
Available Commands:
- crawl start: Initiate a new crawl job
- crawl status: Check progress of a specific job
- crawl list: View all crawl jobs
- crawl cancel: Stop a running crawl
- crawl resume: Continue a failed crawl
- crawl health: Monitor crawler health
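A typical session, sketched below with an assumed job-ID argument (the exact argument format may differ):

```bash
# Start a crawl, then monitor it
python cli.py crawl start "Vue" https://vuejs.org --domain "vuejs.org"
python cli.py crawl list
python cli.py crawl status <job-id>   # job ID comes from the list output (assumed form)
python cli.py crawl health
```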
Code Extraction
HTML-Based Extraction
The core extraction engine uses BeautifulSoup to accurately identify and extract code blocks from various documentation frameworks.
Supported Frameworks:
- Docusaurus
- VitePress
- MkDocs
- Sphinx
- GitBook
- Custom documentation sites
Extraction Process:
- Identifies code blocks using 20+ CSS selectors
- Extracts surrounding context and titles
- Captures filename hints from attributes
- Removes UI elements (copy buttons, line numbers)
- Preserves original formatting
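As a rough illustration of the selector-based approach, a minimal sketch might look like the following; the selectors and cleanup rules are examples, not the project's actual list:

```python
from bs4 import BeautifulSoup

# Illustrative selectors only; CodeDox uses 20+ covering Docusaurus, VitePress, etc.
CODE_SELECTORS = ["pre code", "div.highlight pre", "div[class*='codeBlock'] pre"]

def extract_code_blocks(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for selector in CODE_SELECTORS:
        for node in soup.select(selector):
            # Drop UI chrome such as copy buttons before reading the text
            for button in node.select("button"):
                button.decompose()
            heading = node.find_previous(["h1", "h2", "h3"])
            blocks.append({
                "code": node.get_text(),
                "context": heading.get_text(strip=True) if heading else None,
            })
    return blocks
```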
AI Enhancement
Optional LLM integration enhances extracted code with intelligent metadata.
AI Features:
- Language Detection: Identifies programming language from syntax
- Title Generation: Creates descriptive titles for code blocks
- Description Generation: Produces 20-60 word summaries
- Relationship Mapping: Identifies related code blocks
Configuration:
# Enable AI enhancement
CODE_ENABLE_LLM_EXTRACTION=true
CODE_LLM_API_KEY=your-api-key
CODE_LLM_EXTRACTION_MODEL=gpt-4o-mini
# Use local LLM (Ollama, etc.)
CODE_LLM_BASE_URL=http://localhost:11434/v1
CODE_LLM_EXTRACTION_MODEL=qwen2.5-coder:32b
Language Support
CodeDox automatically detects and categorizes code in 50+ programming languages:
Popular Languages:
- JavaScript/TypeScript
- Python
- Java/Kotlin
- C/C++/C#
- Go
- Rust
- Ruby
- PHP
- Swift
- And many more...
Search Capabilities
PostgreSQL Full-Text Search
Lightning-fast search powered by PostgreSQL's native full-text search with weighted fields.
Search Features:
- Weighted Fields: Title (A) > Description (B) > Code (C)
- Fuzzy Matching: Handles typos and variations
- Language Filters: Search within specific languages
- Source Filters: Limit search to specific documentation sources
Search Syntax:
# Basic search
python cli.py search "authentication"
# With source filter
python cli.py search "middleware" --source "Express"
# Limit results
python cli.py search "database connection" --limit 20
Advanced Search Options
Web UI Search Features:
- Real-time search as you type
- Syntax highlighting in results
- Filter by language, source, or date
- Sort by relevance or recency
- Pagination for large result sets
MCP Integration
Model Context Protocol Support
Native integration with MCP allows AI assistants to access CodeDox directly.
Available Tools:
- init_crawl: Start documentation crawling
{
"name": "React",
"start_urls": ["https://react.dev/docs"],
"max_depth": 2,
"domain_filter": "react.dev"
}
- search_libraries: Find available documentation sources
- get_content: Retrieve code snippets
- get_page_markdown: Get full markdown content of a documentation page
MCP Transport Modes
HTTP Mode (Recommended):
- Endpoint: http://localhost:8000/mcp
- Transport: Streamable HTTP
- Authentication: Optional token-based
SSE Mode (Legacy):
- Endpoint: http://localhost:8000/mcp/v1/sse
- Transport: Server-Sent Events
Stdio Mode (Standalone):
- Command: python cli.py serve --mcp
- Transport: Standard input/output
Authentication
Secure remote deployments with token-based authentication.
Configuration:
# Single token
MCP_AUTH_ENABLED=true
MCP_AUTH_TOKEN=your-secure-token
# Multiple tokens
MCP_AUTH_TOKENS=token1,token2,token3
Usage:
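With HTTP transport, the token travels with each request; a bearer header is a reasonable assumption, but check your MCP client's configuration options for the exact scheme:

```bash
# Assumed usage: send the configured token as a bearer header to the MCP endpoint
curl -H "Authorization: Bearer your-secure-token" http://localhost:8000/mcp
```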
Web Dashboard
Dashboard Features
Real-time Statistics:
- Total snippets extracted
- Active documentation sources
- Running crawl jobs
- Recent activity tracking
Source Management:
- View all documentation sources
- Browse extracted snippets
- Re-crawl with updated settings
- Delete sources and associated data
- Edit source names
Crawl Monitoring:
- Live progress tracking
- Success/failure rates
- Pages crawled counter
- Error logs
User Interface
Technology Stack:
- React 18 with TypeScript
- Tailwind CSS for styling
- Vite for fast development
- ShadcnUI components
Key Components:
- Quick search bar
- Source explorer with tabs (Documents/Snippets)
- Snippet viewer with syntax highlighting
- Crawl job manager
- Pagination controls
- Dialog components for actions
Performance Optimization
Content Deduplication
Intelligent hash-based system prevents duplicate processing.
How It Works:
- Generates MD5 hash of page content
- Compares with stored hashes
- Skips unchanged content
- Only processes modified pages
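Conceptually, the check looks something like this sketch (function and variable names are illustrative):

```python
import hashlib

def should_process(page_content: str, stored_hashes: set[str]) -> bool:
    """Return True only when the page content changed since the last crawl."""
    content_hash = hashlib.md5(page_content.encode("utf-8")).hexdigest()
    if content_hash in stored_hashes:
        return False  # unchanged page: skip re-extraction and any LLM calls
    stored_hashes.add(content_hash)
    return True
```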
Benefits:
- Reduces API costs by 60-90% on re-crawls
- Faster update cycles
- Lower database storage
Caching Strategy
Implementation:
- Database query result caching
- Content hash caching
- Search result optimization
Database Optimization
Performance Features:
- Indexes on search vectors
- Optimized full-text search
- Connection pooling
- Efficient pagination
Upload Features
Markdown Upload
Process local documentation files directly. This is useful for quickly ingesting documentation downloaded from GitHub without crawling the live website.
Supported Formats:
- Markdown (.md)
- MDX (.mdx)
- Plain text with code blocks
Directory Upload:
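The upload subcommand below is a sketch; the exact command name and flags may differ in your CLI version:

```bash
# Assumed subcommand: ingest every .md/.mdx file under ./docs as one source
python cli.py upload ./docs --name "My Project Docs"
```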
Batch Processing
Features:
- Recursive directory scanning
- Automatic language detection
- Progress tracking
- Bulk file processing
API Features
RESTful Endpoints
Core Endpoints:
- GET /health: System health check
- POST /search: Search code snippets
- GET /snippets/{id}: Get a specific snippet
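As a sketch, searching over HTTP with POST /search might look like this; the request and response field names are assumptions that mirror the CLI options:

```python
import requests

# Field names (query/source/limit, results) are assumed, not confirmed here
response = requests.post(
    "http://localhost:8000/search",
    json={"query": "middleware", "source": "Express", "limit": 20},
)
response.raise_for_status()
for snippet in response.json().get("results", []):
    print(snippet)
```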
Source Management:
- GET /api/sources: List documentation sources
- GET /api/sources/{source_id}: Get source details
- GET /api/sources/{source_id}/snippets: Get snippets for a source
- DELETE /api/sources/{source_id}: Delete a source
- POST /api/sources/{source_id}/recrawl: Re-crawl a source
- PUT /api/sources/{source_id}: Update source details
Crawl Management:
- POST /crawl/init: Start a new crawl
- GET /crawl/status/{job_id}: Check crawl status
- POST /crawl/cancel/{job_id}: Cancel a running job
- GET /crawl/jobs: List all crawl jobs
Statistics:
- GET /api/statistics/dashboard: Dashboard statistics
WebSocket Support
Real-time updates for crawl progress and system events.
Events:
- crawl_progress: Live crawl status updates
- job_complete: Crawl completion notifications
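A minimal listener sketch, assuming a /ws endpoint and a type field on each event (neither is documented here, so adjust to your deployment):

```python
import asyncio
import json
import websockets  # pip install websockets

async def watch_crawl():
    # The WebSocket path and event schema are assumptions
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "crawl_progress":
                print("progress:", event)
            elif event.get("type") == "job_complete":
                print("job finished:", event)

asyncio.run(watch_crawl())
```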
Configuration
Environment Variables
Essential Settings:
# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=codedox
DB_USER=postgres
DB_PASSWORD=postgres
# Code Extraction
CODE_ENABLE_LLM_EXTRACTION=true
CODE_LLM_API_KEY=your-key
CODE_LLM_EXTRACTION_MODEL=gpt-4o-mini
CODE_LLM_BASE_URL=https://api.openai.com/v1
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
API_CORS_ORIGINS=http://localhost:5173
# MCP Settings
MCP_AUTH_ENABLED=false
MCP_AUTH_TOKEN=your-token
# Crawling
CRAWL_DELAY=1.0
MAX_CONCURRENT_CRAWLS=20
Docker Configuration
Docker Compose Services:
- PostgreSQL database
- API server
- Frontend development server
Setup:
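A typical setup with the bundled compose file:

```bash
# Start PostgreSQL, the API server, and the frontend dev server
docker-compose up -d

# Follow API logs to confirm the stack came up cleanly
docker-compose logs -f api
```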
Best Practices
Crawling Guidelines
- Start Small: Begin with depth 1 for initial crawls
- Use Patterns: Filter URLs to relevant documentation
- Monitor Progress: Check crawl health regularly
- Respect Servers: Keep default rate limiting
Search Optimization
- Use Specific Queries: More descriptive searches yield better results
- Filter by Source: Narrow searches when possible
- Leverage Language Filters: Specify programming language
- Use Pagination: Navigate through large result sets
API Usage
- Handle Errors: Implement proper error handling
- Use Pagination: Don't request all results at once
- Monitor Health: Check the /health endpoint
Troubleshooting
Common Issues
No Code Extracted:
- Verify crawl completed successfully
- Check HTML structure compatibility
- Review crawler logs for errors
- Ensure LLM API key is valid (if using AI mode)
Poor Search Results:
- Enable LLM extraction for better quality
- Re-crawl with different depth settings
- Check search query syntax
Database Connection Issues:
- Verify PostgreSQL is running
- Check database credentials in .env
- Ensure database exists and is initialized
Debug Mode
Enable detailed logging for troubleshooting:
# View API logs
docker-compose logs -f api
# View crawler logs
tail -f logs/codedox.log
# Check database
psql -U postgres -d codedox
Integration Examples
With Claude Desktop
Configure MCP server in Claude Desktop settings:
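A stdio-based entry is the simplest option; the snippet below assumes CodeDox is cloned locally and reuses the serve --mcp command shown earlier (adjust the path for your machine):

```json
{
  "mcpServers": {
    "codedox": {
      "command": "python",
      "args": ["/path/to/codedox/cli.py", "serve", "--mcp"]
    }
  }
}
```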
With Custom AI Applications
import requests
# Search for code
response = requests.post(
"http://localhost:8000/mcp/execute/get_content",
json={
"library_id": "react-docs",
"query": "component lifecycle"
}
)
snippets = response.json()
Support
For issues, questions, or feature requests:
- GitHub Issues: github.com/chriswritescode-dev/codedox/issues
- Author: Chris Scott - chriswritescode.dev