GitHub Repository Processing
CodeDox can directly process markdown documentation from GitHub repositories without manual downloading. This feature allows you to extract code snippets from entire repositories or specific folders.
Overview
The GitHub processor:
- Clones repositories efficiently using shallow cloning
- Processes all markdown files in the repository
- Extracts code blocks with intelligent context
- Generates source URLs for each file
- Supports both public and private repositories
- Automatically cleans up temporary files
Key Features
Repository Cloning
- Shallow Cloning: Only fetches the latest commit for efficiency
- Sparse Checkout: Can clone specific folders to save bandwidth
- Branch Selection: Process any branch (defaults to main)
- Token Authentication: Support for private repositories
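As a rough illustration of how shallow cloning, sparse checkout, and branch selection can combine, the sketch below builds the git commands that such a clone would likely need. The function name `build_clone_commands` and its parameters are hypothetical and not part of CodeDox; only the git flags themselves (`--depth`, `--branch`, `--filter=blob:none`, `--sparse`, `sparse-checkout set`) are standard git options.

```python
from typing import List, Optional


def build_clone_commands(
    repo_url: str,
    dest: str,
    branch: str = "main",
    path: Optional[str] = None,
) -> List[List[str]]:
    """Return git argument lists for a shallow (and optionally sparse) clone.

    Hypothetical helper for illustration; CodeDox may clone differently.
    """
    # --depth 1 fetches only the latest commit of the selected branch.
    clone = ["git", "clone", "--depth", "1", "--branch", branch]
    if path:
        # Defer blob downloads and start with a sparse working tree.
        clone += ["--filter=blob:none", "--sparse"]
    clone += [repo_url, dest]

    commands = [clone]
    if path:
        # Limit the working tree to the requested folder.
        commands.append(["git", "-C", dest, "sparse-checkout", "set", path])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the commands separately from running them keeps the logic easy to inspect without touching the network.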
File Processing
- Markdown Detection: Automatically finds all markdown files (.md, .markdown, .mdx, .mdown, .mkd, .mdwn)
- Pattern Filtering: Include or exclude files based on glob patterns
- Smart Exclusion: Automatically skips common non-documentation folders (node_modules, .git, vendor, etc.)
- Source URL Generation: Creates GitHub blob URLs for each processed file
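The detection and filtering rules above can be sketched in a few lines of Python. This is a minimal approximation, not CodeDox's actual implementation: `find_markdown_files` is a hypothetical name, and patterns are matched with `fnmatch`, whose `*` also matches `/`, so `**` gets no special treatment here and the real glob semantics may differ.

```python
import fnmatch
import os
from typing import Iterator, Sequence

MARKDOWN_EXTENSIONS = (".md", ".markdown", ".mdx", ".mdown", ".mkd", ".mdwn")
# Directory names this document lists as excluded by default.
DEFAULT_EXCLUDED_DIRS = {
    "node_modules", ".git", ".github", "vendor", "dist", "build",
    "target", ".tox", ".pytest_cache", "__pycache__",
}


def find_markdown_files(
    root: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> Iterator[str]:
    """Yield markdown paths relative to root, using '/' as the separator."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in DEFAULT_EXCLUDED_DIRS)
        for name in sorted(filenames):
            if not name.lower().endswith(MARKDOWN_EXTENSIONS):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root).replace(os.sep, "/")
            # With no include patterns, every markdown file is a candidate.
            if include and not any(fnmatch.fnmatchcase(rel, p) for p in include):
                continue
            if any(fnmatch.fnmatchcase(rel, p) for p in exclude):
                continue
            yield rel
```

Pruning `dirnames` in place avoids walking large folders such as `node_modules` at all, which matters more for speed than filtering their files afterwards.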
Integration
- Unified Search: GitHub-sourced content is searchable alongside crawled documentation
- Progress Tracking: Real-time updates during processing
- LLM Enhancement: Optional AI-powered code understanding
Usage
Command Line Interface
Basic Usage
# Process entire repository
python cli.py upload-repo https://github.com/user/repo
# Process with custom name
python cli.py upload-repo https://github.com/user/repo --name "Project Docs"
# Process specific branch
python cli.py upload-repo https://github.com/user/repo --branch develop
Advanced Options
# Process specific folder
python cli.py upload-repo https://github.com/user/repo --path docs
# Include specific patterns
python cli.py upload-repo https://github.com/user/repo \
--include "docs/**/*.md" \
--include "examples/**/*.md"
# Exclude patterns
python cli.py upload-repo https://github.com/user/repo \
--exclude "**/test/*.md" \
--exclude "**/node_modules/**"
# Keep cloned repository (don't cleanup)
python cli.py upload-repo https://github.com/user/repo --no-cleanup
Private Repositories
# Using environment variable
export GITHUB_TOKEN=ghp_your_token_here
python cli.py upload-repo https://github.com/user/private-repo
# Using command line option
python cli.py upload-repo https://github.com/user/private-repo \
--token ghp_your_token_here
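Since a token can come from either the `--token` option or the `GITHUB_TOKEN` environment variable, a tool has to pick one when both are set. The sketch below assumes the command-line value takes precedence; that ordering is an assumption, and `resolve_github_token` is a hypothetical helper rather than a CodeDox function.

```python
import os
from typing import Mapping, Optional


def resolve_github_token(
    cli_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick a token: explicit CLI value first, then GITHUB_TOKEN (assumed order)."""
    env = os.environ if env is None else env
    for candidate in (cli_token, env.get("GITHUB_TOKEN")):
        # Ignore unset and whitespace-only values.
        if candidate and candidate.strip():
            return candidate.strip()
    # No token available: only public repositories can be cloned.
    return None
```

Passing the token through the environment variable keeps it out of shell history and process listings, which is generally the safer of the two options.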
API Endpoint
The GitHub processor is integrated with the upload system and uses the same job tracking:
# Check job status
curl http://localhost:8000/crawl/status/<job-id>
# List all jobs
curl http://localhost:8000/crawl/list
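For scripted use, the status endpoint can be polled until the job finishes. The sketch below is illustrative only: the response schema is not described here, so the `status` field and the terminal state names are assumptions, and `wait_for_job` and `fetch_status` are hypothetical helpers. Only the `/crawl/status/<job-id>` path comes from the commands above.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8000"
# Assumed terminal states; the real API may use different names.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_status(job_id: str) -> Dict[str, Any]:
    """GET /crawl/status/<job-id> and decode the JSON body."""
    url = f"{BASE_URL}/crawl/status/{job_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_status,
    interval: float = 2.0,
    max_polls: int = 150,
) -> Dict[str, Any]:
    """Poll until the assumed 'status' field reaches a terminal state."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` parameter is injectable so the polling loop can be exercised without a running server.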
Web Interface
GitHub repository processing is integrated into the Upload page:
- Navigate to the Upload section
- Select "GitHub Repository" tab
- Enter repository URL
- Configure options (branch, path, patterns)
- Click "Process Repository"
Configuration
Environment Variables
# GitHub personal access token for private repos
GITHUB_TOKEN=ghp_your_token_here
# LLM settings for code extraction
CODE_LLM_API_KEY=your-api-key
CODE_LLM_EXTRACTION_MODEL=gpt-4o-mini
Default Exclusions
The following directories are automatically excluded:
- node_modules
- .git
- .github
- vendor
- dist
- build
- target
- .tox
- .pytest_cache
- __pycache__
Examples
Process Documentation Repository
# Process React documentation
python cli.py upload-repo https://github.com/facebook/react \
--path docs \
--name "React Documentation"
# Process Next.js examples
python cli.py upload-repo https://github.com/vercel/next.js \
--path examples \
--include "**/*.md" \
--name "Next.js Examples"
Process Private Company Docs
# With token authentication
export GITHUB_TOKEN=ghp_your_token
python cli.py upload-repo https://github.com/company/internal-docs \
--name "Internal Documentation" \
--branch main
Process Multiple Versions
# Process different versions from branches
python cli.py upload-repo https://github.com/user/project \
--branch v1.0 \
--name "Project v1.0 Docs"
python cli.py upload-repo https://github.com/user/project \
--branch v2.0 \
--name "Project v2.0 Docs"
How It Works
Processing Pipeline
- Repository Cloning
  - Creates temporary directory
  - Performs shallow clone with depth=1
  - Optionally checks out specific branch
  - Applies sparse checkout if path specified
- File Discovery
  - Walks repository directory tree
  - Identifies markdown files by extension
  - Applies include/exclude patterns
  - Skips default excluded directories
- Content Extraction
  - Reads each markdown file
  - Generates GitHub blob URL for source tracking
  - Extracts code blocks using markdown parser
  - Optionally enhances with LLM analysis
- Storage
  - Creates source record in database
  - Stores extracted code snippets
  - Maintains source URLs for reference
  - Updates search indexes
- Cleanup
  - Removes temporary clone directory
  - Reports processing statistics
  - Returns job ID for tracking
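To make the Content Extraction step more concrete, here is a simplified sketch of pulling fenced code blocks out of a markdown string. The pipeline uses a markdown parser; this example substitutes a regular expression for brevity, so it will miss cases a real parser handles (indented code blocks, tilde fences, nested fences). The name `extract_code_blocks` and the returned dictionary shape are hypothetical.

```python
import re
from typing import Dict, List

# Matches a fence opened by three backticks, an optional language tag,
# and everything up to the next line that starts with three backticks.
FENCE_RE = re.compile(
    r"^`{3}[ \t]*(?P<lang>[\w+#.-]*)[^\n]*\n(?P<code>.*?)^`{3}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_code_blocks(markdown: str) -> List[Dict[str, str]]:
    """Return one dict per fenced block with its language tag and code."""
    return [
        {
            # Untagged fences are labeled "text" (an arbitrary choice).
            "language": match.group("lang") or "text",
            "code": match.group("code").rstrip("\n"),
        }
        for match in FENCE_RE.finditer(markdown)
    ]
```

In a fuller version, the text surrounding each match could be kept as context for the optional LLM enhancement step.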
Source URL Generation
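For each file, CodeDox generates a GitHub blob URL that combines the repository URL, the branch, and the file path, for example https://github.com/user/repo/blob/main/docs/guide.md.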
This allows users to navigate back to the original documentation.
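A minimal sketch of how such a source URL could be assembled is shown below. The function name `build_blob_url` is hypothetical; the URL layout follows GitHub's standard blob URLs, as in https://github.com/user/repo/blob/main/docs/guide.md.

```python
from urllib.parse import quote


def build_blob_url(repo_url: str, branch: str, file_path: str) -> str:
    """Build a GitHub blob URL for a file inside a cloned repository."""
    base = repo_url.rstrip("/")
    # Clone URLs often end in ".git"; browser URLs do not.
    if base.endswith(".git"):
        base = base[: -len(".git")]
    # Normalize Windows separators and drop any leading slash.
    rel = file_path.replace("\\", "/").lstrip("/")
    # Percent-encode spaces and other unsafe characters, keeping "/" intact.
    return f"{base}/blob/{quote(branch, safe='/')}/{quote(rel, safe='/')}"
```

For example, `build_blob_url("https://github.com/user/repo.git", "main", "docs/guide.md")` yields the URL given above.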
Performance Considerations
Optimization Tips
- Use Specific Paths: Clone only the documentation folder instead of the entire repository
- Apply Patterns: Use include/exclude patterns to process only relevant files
- Shallow Cloning: Default shallow clone reduces download time
- Concurrent Processing: Multiple files are processed in parallel
Resource Usage
- Disk Space: Temporary clone requires space (automatically cleaned up)
- Network: Initial clone bandwidth depends on repository size
- Processing Time: Depends on number of markdown files and LLM usage
Limitations
Current Limitations
- Only processes markdown files (not source code files directly)
- Requires git to be installed on the system
- Maximum repository size limited by available disk space
- Rate limiting applies when using GitHub API
Planned Enhancements
- Support for other Git hosting platforms (GitLab, Bitbucket)
- Direct source code file processing
- Incremental updates (only process changes)
- Webhook integration for automatic updates
Troubleshooting
Common Issues
Authentication Failed
Solution: Ensure your GitHub token has the necessary permissions:
- For public repos: No token needed
- For private repos: Token needs the repo scope
Path Not Found
Solution: Verify the path exists in the repository and matches the branch
No Markdown Files Found
Solution: Check that the repository contains markdown files and adjust include patterns
Debug Mode
Enable debug logging to see detailed processing information:
Best Practices
Repository Selection
- Documentation-First Repos: Best results with dedicated documentation repositories
- Monorepos: Use --path to target documentation folders
- Version Control: Process specific branches for version-specific documentation
Pattern Usage
- Be Specific: Use precise patterns to avoid processing irrelevant files
- Test Patterns: Use --no-cleanup to inspect which files would be processed
- Combine Patterns: Use multiple include/exclude patterns for fine control
Performance
- Batch Processing: Process multiple repositories in sequence rather than in parallel
- Off-Peak Hours: Run large processing jobs during low-usage periods
- Monitor Progress: Use job status endpoint to track processing
Integration with MCP Tools
Once processed, GitHub repository content is accessible through MCP tools:
# Search for content from GitHub sources
{
  "tool": "get_content",
  "arguments": {
    "library_id": "project-docs",
    "query": "authentication"
  }
}
# Get full markdown from GitHub source
{
  "tool": "get_page_markdown",
  "arguments": {
    "url": "https://github.com/user/repo/blob/main/docs/guide.md"
  }
}
See Also
- Upload Feature Documentation - General upload functionality
- Code Extraction - How code blocks are processed
- Search Features - Finding processed content
- MCP Integration - Using content with AI assistants