Ananke2 is a comprehensive multi-modal knowledge extraction framework designed for processing and analyzing academic papers and technical documents. The framework supports:
- Multi-modal information processing (text, math expressions, logic expressions, code snippets)
- Knowledge graph extraction with semantic triple tracking
- Vector embeddings for semantic search
- Multi-language support (English, Chinese, French, German, Japanese)
- Document chunking with structured sentence parsing
- Asynchronous task processing with Redis queue
- Graph Database (Neo4j)
  - Knowledge graph storage
  - Entity and relationship management
  - Semantic triple storage with hit count tracking
  - Graph-based querying capabilities
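For illustration, here is a minimal sketch of how a semantic triple with a hit count might be upserted into Neo4j using the official `neo4j` Python driver. The labels, relationship type, and property names are assumptions for the sketch, not Ananke2's actual schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def upsert_triple(tx, subj: str, rel: str, obj: str):
    # MERGE the triple; create it with hit_count = 1, or bump the count if it already exists.
    tx.run(
        """
        MERGE (s:Entity {name: $subj})
        MERGE (o:Entity {name: $obj})
        MERGE (s)-[r:RELATION {type: $rel}]->(o)
        ON CREATE SET r.hit_count = 1
        ON MATCH SET r.hit_count = r.hit_count + 1
        """,
        subj=subj, rel=rel, obj=obj,
    )

with driver.session() as session:
    session.execute_write(upsert_triple, "Transformer", "INTRODUCED_IN", "Attention Is All You Need")
driver.close()
```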
- Vector Database (ChromaDB)
  - Embedding storage for semantic search
  - Multi-modal content vectorization
  - Similarity search capabilities
  - Document chunk embeddings
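To give a rough idea of what this layer does, here is a sketch using the `chromadb` client directly; the collection name and metadata fields are illustrative, not Ananke2's internals.

```python
import chromadb

# Persist chunk embeddings on disk and run a similarity query.
client = chromadb.PersistentClient(path="/path/to/chroma")
collection = client.get_or_create_collection("document_chunks")

collection.add(
    ids=["chunk-001"],
    documents=["Quantum error correction protects logical qubits from noise."],
    metadatas=[{"document_id": "2301.00001", "language": "en"}],
)

results = collection.query(query_texts=["quantum computing applications"], n_results=5)
print(results["documents"])
```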
- Structured Database (MySQL)
  - Traditional data storage
  - Metadata management
  - Document structure information
  - Processing task status tracking
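The following is a sketch of the kind of task-status bookkeeping this layer covers, written with `mysql-connector-python`; the table name, columns, and credentials are purely illustrative.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="ananke", password="secret", database="ananke2")
cur = conn.cursor()

# Hypothetical table for tracking document processing tasks.
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS processing_tasks (
        id INT AUTO_INCREMENT PRIMARY KEY,
        document_id VARCHAR(64) NOT NULL,
        status ENUM('pending', 'running', 'failed', 'done') DEFAULT 'pending',
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    )
    """
)
cur.execute("UPDATE processing_tasks SET status = %s WHERE document_id = %s", ("done", "2301.00001"))
conn.commit()
conn.close()
```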
- Redis Task Queue
  - Asynchronous task management
  - Document processing workflow
  - Retry mechanism for failed tasks
  - Task status monitoring
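Below is a minimal sketch of the enqueue/dequeue-with-retry pattern such a queue implements, written directly against `redis-py`; the queue name, task payload, and retry limit are assumptions rather than Ananke2's actual conventions.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Enqueue a hypothetical task payload.
r.rpush("ananke2:tasks", json.dumps({"document_id": "2301.00001", "retries": 0}))

def handle(task: dict) -> None:
    """Placeholder for the actual document-processing step."""
    print("processing", task["document_id"])

# Worker side: pop a task and requeue it (up to 3 times) if processing fails.
_, raw = r.blpop("ananke2:tasks")
task = json.loads(raw)
try:
    handle(task)
except Exception:
    if task["retries"] < 3:
        task["retries"] += 1
        r.rpush("ananke2:tasks", json.dumps(task))
```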
- Multi-modal Content Extraction
  - PDF document parsing
  - LaTeX content processing
  - Mathematical expression extraction
  - Code snippet identification
  - Logic expression parsing
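To give a flavor of what mathematical-expression and code-snippet extraction means here, a toy regex pass over LaTeX source is shown below; the real extractors are more involved, and this is only an illustration.

```python
import re

latex_source = r"""
The loss is $L = -\sum_i y_i \log p_i$.
\begin{verbatim}
def softmax(x): ...
\end{verbatim}
"""

# Inline math between $...$ and display math between \[...\]
math_exprs = re.findall(r"\$(.+?)\$|\\\[(.+?)\\\]", latex_source, re.DOTALL)
# Code snippets inside verbatim environments
code_snippets = re.findall(r"\\begin\{verbatim\}(.*?)\\end\{verbatim\}", latex_source, re.DOTALL)

print(math_exprs)
print(code_snippets)
```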
- Document Chunking
  - Structured sentence parsing
  - Context-aware chunking
  - Cross-reference preservation
  - Multi-language chunk handling
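A simplified sketch of sentence-based chunking with overlapping context follows; the size and overlap values are arbitrary, not Ananke2's defaults.

```python
import re

def chunk_sentences(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of at most max_chars, repeating the last
    `overlap` sentences of each chunk at the start of the next for context."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry trailing context into the next chunk
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```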
- Knowledge Graph Extraction
  - Entity identification
  - Relationship extraction
  - Semantic triple generation
  - Hit count tracking
  - Multi-language entity linking
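As a sketch of what semantic triple generation with hit count tracking amounts to, the data shapes below are illustrative rather than Ananke2's actual classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    object: str

def count_triples(extracted: list[Triple]) -> dict[Triple, int]:
    """Deduplicate triples, keeping a hit count of how often each was seen."""
    counts: dict[Triple, int] = {}
    for t in extracted:
        counts[t] = counts.get(t, 0) + 1
    return counts

triples = [
    Triple("BERT", "BASED_ON", "Transformer"),
    Triple("BERT", "BASED_ON", "Transformer"),
    Triple("GPT", "BASED_ON", "Transformer"),
]
print(count_triples(triples))  # hit counts per unique triple
```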
- Python 3.12+
- Docker and Docker Compose
- Python venv module
- Clone the repository:
git clone https://github.com/ceresman/ananke2.git
cd ananke2
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
- Configure pip to use the BFSU mirror (optional; a PyPI mirror hosted in China):
pip config set global.index-url https://mirrors.bfsu.edu.cn/pypi/web/simple
- Install dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt # For development
- Configure environment:
cp .env.example .env
# Edit .env with your credentials and API keys
- Start services:
docker-compose up -d
- If you encounter SSL errors with the BFSU mirror, try:
pip install --trusted-host mirrors.bfsu.edu.cn -r requirements.txt
- For permission errors on Unix/Linux:
python -m venv .venv --system-site-packages
Key environment variables in .env:
# Database Configuration
NEO4J_URI=bolt://0.0.0.0:7687
CHROMA_PERSIST_DIRECTORY=/path/to/chroma
MYSQL_HOST=0.0.0.0
# Redis Configuration
REDIS_HOST=0.0.0.0
REDIS_PORT=6379
# API Keys
QWEN_API_KEY=your-api-key
# Network Configuration
HOST=0.0.0.0
PORT=8000
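These variables are typically loaded into the process environment before the services are used, for example with `python-dotenv`; whether Ananke2 loads them this way internally is an assumption.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

neo4j_uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
redis_host = os.getenv("REDIS_HOST", "localhost")
qwen_api_key = os.environ["QWEN_API_KEY"]  # required; fail loudly if missing
```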
from ananke2.tasks.document import process_document
# Process an arXiv paper
document_id = await process_document(arxiv_id="2301.00001")
# Process a local PDF
document_id = await process_document(file_path="/path/to/paper.pdf")
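Because these snippets use `await`, a plain script would drive them from an event loop, for example:

```python
import asyncio
from ananke2.tasks.document import process_document

async def main() -> None:
    # Process an arXiv paper and print the resulting document ID.
    document_id = await process_document(arxiv_id="2301.00001")
    print(document_id)

asyncio.run(main())
```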
from ananke2.database.graph import Neo4jInterface

# Instantiate the interface (hypothetical: assumes a no-argument constructor
# that reads connection settings from .env)
graph_db = Neo4jInterface()

# Query entities by type
entities = await graph_db.search({
    "type": "TECHNOLOGY",
    "limit": 10
})

# Get related entities for a previously retrieved entity ID
related = await graph_db.get_relations(entity_id)
from ananke2.database.vector import ChromaInterface

# Instantiate the interface (hypothetical: assumes a no-argument constructor
# that reads its persist directory from .env)
vector_db = ChromaInterface()

# Search for semantically similar documents
results = await vector_db.search({
    "query": "quantum computing applications",
    "limit": 5
})
from ananke2.tasks.document import process_document
# Process documents in different languages
doc_zh = await process_document(file_path="paper_zh.pdf", language="zh")
doc_fr = await process_document(file_path="paper_fr.pdf", language="fr")
- `POST /api/v1/documents`: Upload and process new documents
- `GET /api/v1/documents/{id}`: Retrieve document information
- `GET /api/v1/entities`: Query knowledge graph entities
- `GET /api/v1/search`: Semantic search across documents
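A sketch of calling these endpoints with `requests`, assuming the service listens on the HOST/PORT configured in .env; the upload field name, query parameter, and response fields are guesses, not documented values.

```python
import requests

BASE = "http://localhost:8000/api/v1"

# Upload a PDF for processing (the "file" field name is an assumption)
with open("/path/to/paper.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/documents", files={"file": f})
doc_id = resp.json().get("id")  # response shape is an assumption

# Semantic search across processed documents
hits = requests.get(f"{BASE}/search", params={"query": "quantum computing applications"}).json()
```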
- `document_status`: Real-time document processing status
- `extraction_progress`: Knowledge extraction progress updates
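A sketch of subscribing to these events with the `websockets` library; the WebSocket URL and message format are assumptions and should be checked against the server's actual routes.

```python
import asyncio
import json
import websockets

async def watch_status(document_id: str) -> None:
    # Hypothetical endpoint path; check the server's routing for the real one.
    async with websockets.connect(f"ws://localhost:8000/ws/documents/{document_id}") as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") in ("document_status", "extraction_progress"):
                print(event)

asyncio.run(watch_status("2301.00001"))
```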
Run the test suite:
python -m pytest tests/
Format and lint the code:
black .
isort .
flake8
Apache License 2.0
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.