Architecture

System Overview

CodeDox is organized into independent modules (crawler, extraction pipeline, database layer, API server, and frontend) so that each part can be scaled and maintained on its own.

Core Components

Crawler System

  • Crawl4AI Integration: Headless browser automation
  • Job Manager: Async job queue and state management
  • Page Crawler: Individual page processing
  • URL Management: Deduplication and pattern filtering
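
The deduplication and pattern filtering named under URL Management can be sketched with the standard library alone. This is a minimal illustration, not CodeDox's actual code: `normalize_url`, `UrlFrontier`, and the glob-style include/exclude patterns are hypothetical names and conventions chosen for the example.

```python
from __future__ import annotations

from fnmatch import fnmatch
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)        # drop "#section" anchors
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"   # treat "/docs/" and "/docs" alike
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlFrontier:
    """Remembers seen URLs and applies include/exclude glob patterns.

    Hypothetical helper for illustration only.
    """

    def __init__(self, include: list[str], exclude: list[str] | None = None) -> None:
        self.include = include
        self.exclude = exclude or []
        self._seen: set[str] = set()

    def accept(self, url: str) -> bool:
        """Return True only the first time an allowed URL is offered."""
        canonical = normalize_url(url)
        if canonical in self._seen:
            return False  # duplicate of a URL already queued
        if not any(fnmatch(canonical, pattern) for pattern in self.include):
            return False  # outside the crawl scope
        if any(fnmatch(canonical, pattern) for pattern in self.exclude):
            return False  # explicitly filtered out
        self._seen.add(canonical)
        return True
```

Normalizing before the membership check is what lets `https://Example.com/docs/intro/#setup` and `https://example.com/docs/intro` count as the same page.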

Extraction Pipeline

  • HTML Parser: BeautifulSoup-based code block detection
  • LLM Processor: Optional AI enhancement for metadata
  • Content Hasher: Change detection and deduplication
  • Language Detector: Automatic language identification
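
As a rough sketch of how a content hasher can drive change detection, the snippet below hashes a normalized form of each code block and compares it with a previously stored digest. The choice of SHA-256, the whitespace normalization, and the function names are assumptions made for the example; the source does not say which algorithm CodeDox uses.

```python
from __future__ import annotations

import hashlib


def content_hash(code: str) -> str:
    """Hash a snippet after removing whitespace noise that does not change it."""
    # Strip trailing spaces per line and unify line endings.
    lines = [line.rstrip() for line in code.splitlines()]
    # Drop leading and trailing blank lines, keep indentation intact.
    normalized = "\n".join(lines).strip("\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(stored_hash: str | None, new_code: str) -> bool:
    """True when the snippet is new or differs from the stored version."""
    return stored_hash != content_hash(new_code)
```

Storing the digest next to each snippet means a re-crawl can skip unchanged blocks, and identical digests from different pages can be treated as duplicates.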

Database Layer

  • PostgreSQL: Primary data store
  • Full-Text Search: pg_trgm and tsvector indexing
  • Schema Migrations: Version-controlled schema changes
  • Connection Pooling: Efficient connection management

API Server

  • FastAPI: High-performance async API
  • MCP Endpoints: Model Context Protocol integration
  • WebSocket Support: Real-time crawl updates
  • CORS Middleware: Cross-origin resource sharing

Frontend

  • React + TypeScript: Type-safe UI components
  • Vite: Fast development and building
  • TailwindCSS: Utility-first styling
  • React Query: Server state management

Data Flow

Documentation Site → Crawl4AI → HTML Parser → LLM Extractor (optional) → PostgreSQL → Search Index → API/MCP → AI Assistant

Directory Structure

codedox/
├── src/
│   ├── api/           # FastAPI routes
│   ├── crawler/       # Crawling logic
│   ├── database/      # DB models and queries
│   ├── extraction/    # Code extraction
│   └── mcp_server/    # MCP implementation
├── frontend/          # React application
├── tests/             # Test suites
└── docs/              # Documentation

Key Design Decisions

PostgreSQL over Vector DB

  • Simpler infrastructure
  • Better for keyword search
  • Lower operational overhead
  • Built-in full-text search

LLM-Optional Design

  • Works without API keys
  • HTML extraction as fallback
  • Cost-effective for testing
  • Progressive enhancement
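
One way to picture the LLM-optional design is an extractor that always runs the HTML step and applies an enrichment step only when one is supplied. The sketch below is illustrative: it uses the standard library's `HTMLParser` as a dependency-free stand-in for the BeautifulSoup-based parser, and `Snippet`, `PreBlockParser`, `extract_snippets`, and the `enrich` callback are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Optional


@dataclass
class Snippet:
    code: str
    title: Optional[str] = None  # filled in only when an enricher runs


class PreBlockParser(HTMLParser):
    """Collects the text inside <pre> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []
        self._depth = 0
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_snippets(
    html: str,
    enrich: Callable[[Snippet], Snippet] | None = None,
) -> list[Snippet]:
    """HTML extraction always runs; enrichment is layered on when available."""
    parser = PreBlockParser()
    parser.feed(html)
    parser.close()
    snippets = [Snippet(code=block) for block in parser.blocks]
    if enrich is None:  # no API key configured: return the plain HTML results
        return snippets
    return [enrich(snippet) for snippet in snippets]
```

Calling `extract_snippets(html)` with no second argument exercises the fallback path, so tests and local runs need no credentials.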

MCP Integration

  • Direct AI assistant access
  • Standardized protocol
  • HTTP and stdio support
  • Tool-based interface

Performance Optimizations

Crawling

  • Concurrent page processing
  • Domain-based rate limiting
  • Content hash caching
  • Incremental updates

Search

  • Indexed search vectors
  • Weighted field ranking
  • Query result caching
  • Pagination support
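
Query result caching combined with pagination could look roughly like the following time-based cache, keyed by query text and page number. This is a sketch under assumptions: `TtlCache`, the 30-second lifetime in the usage note, and the key shape are invented for illustration, and the source does not describe how CodeDox caches results.

```python
from __future__ import annotations

import time
from typing import Callable, Generic, Hashable, TypeVar

V = TypeVar("V")


class TtlCache(Generic[V]):
    """Keeps computed values for a fixed number of seconds."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[Hashable, tuple[float, V]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], V]) -> V:
        """Return a fresh cached value, or run `compute` and remember it."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # entry has not expired yet
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value
```

A caller might use `cache.get_or_compute(("useState", 1), run_query)` with `TtlCache(ttl_seconds=30)`, so that repeated requests for the same query and page skip the database until the entry expires.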

API

  • Async request handling
  • Connection pooling
  • Response compression
  • Static file caching

Security Considerations

  • Input validation
  • SQL injection prevention
  • Rate limiting
  • Optional authentication
  • Environment variable secrets
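
SQL injection prevention usually comes down to parameterized queries, where user input travels separately from the SQL text. The sketch below demonstrates the idea with the standard library's `sqlite3` as a stand-in so it runs anywhere. CodeDox uses PostgreSQL, whose drivers have their own placeholder syntax, and the `snippets` table and `find_snippets` function are hypothetical.

```python
import sqlite3


def find_snippets(conn: sqlite3.Connection, language: str) -> list:
    """Look up snippets by language using a bound parameter."""
    # The "?" placeholder sends the value separately from the SQL text,
    # so quotes or SQL keywords inside `language` are treated as plain data.
    rows = conn.execute(
        "SELECT code FROM snippets WHERE language = ? ORDER BY id",
        (language,),
    ).fetchall()
    return [row[0] for row in rows]


# In-memory demo database (illustrative schema, not CodeDox's real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snippets (id INTEGER PRIMARY KEY, language TEXT, code TEXT)"
)
conn.executemany(
    "INSERT INTO snippets (language, code) VALUES (?, ?)",
    [("python", "print('hi')"), ("go", 'fmt.Println("hi")')],
)
```

An input such as `python' OR '1'='1` matches no rows here, because it is compared literally against the `language` column instead of being spliced into the query.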

Monitoring

  • Structured logging
  • Crawl job tracking
  • Performance metrics
  • Error reporting
  • Health endpoints
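
Structured logging can be approximated with a small JSON formatter on top of Python's standard `logging` module. This is only a sketch: the `JsonFormatter` class and the convention of passing a `context` dictionary through `extra` are assumptions for the example, not CodeDox's documented logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # to attach fields such as a crawl job identifier.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With this in place, `build_logger("crawler").info("crawled %d pages", 3, extra={"context": {"job_id": "job-1"}})` emits a single machine-parseable line that a log aggregator can filter by `job_id`.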