👉 See our detailed Approach Explanation for methodology, design choices, and technical insights.
Persona-Driven Document Intelligence System is a production-grade, fully offline solution for extracting persona- and task-specific insights from unstructured PDF collections. Designed for enterprise deployment, it delivers fast, accurate, and reproducible results on CPU-only infrastructure.
- Heuristic PDF Parsing: Accurate section detection using font size, style, and context (PyMuPDF)
- Persona & Task-Aware Querying: Dynamic, domain-agnostic query construction
- State-of-the-Art Embeddings: all-mpnet-base-v2 transformer for semantic understanding
- Diversity-Driven Ranking: Multi-document representation, cosine similarity, and diversity algorithms
- Extractive Summarization: Sentence-level, fact-grounded summaries
- Parallel Processing: Multi-threaded, optimized for 8-core CPUs
- Memory-Efficient Caching: Model and embedding cache for sub-3GB RAM usage
- Containerized & Offline: Dockerized, 100% offline after build, reproducible anywhere
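To make the heuristic parsing concrete, here is a minimal sketch of font-size- and style-based heading detection with PyMuPDF. The function name, size ratio, word limit, and bold-flag rule are illustrative assumptions, not the exact production heuristics.

```python
import fitz  # PyMuPDF

def detect_headings(pdf_path, size_ratio=1.15, max_words=12):
    """Flag short spans whose font size or weight stands out from the page's body text (illustrative heuristic)."""
    doc = fitz.open(pdf_path)
    headings = []
    for page_num, page in enumerate(doc, start=1):
        spans = [
            span
            for block in page.get_text("dict")["blocks"] if block.get("type") == 0
            for line in block["lines"]
            for span in line["spans"]
        ]
        if not spans:
            continue
        sizes = sorted(s["size"] for s in spans)
        body_size = sizes[len(sizes) // 2]  # median span size approximates the body-text size
        for s in spans:
            text = s["text"].strip()
            is_bold = bool(s["flags"] & 16)  # bit 4 of the span flags marks bold text
            if text and len(text.split()) <= max_words and (s["size"] >= body_size * size_ratio or is_bold):
                headings.append({"page": page_num, "text": text, "size": s["size"]})
    doc.close()
    return headings
```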
```mermaid
flowchart TD
    A[PDF Collection] --> B[Section Detection & Parsing]
    B --> C[Persona/Task Query Formulation]
    C --> D[Semantic Embedding - all-mpnet-base-v2]
    D --> E[Relevance & Diversity Ranking]
    E --> F[Extractive Summarization]
    F --> G[Persona-Specific Output]
```
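A minimal sketch of the embedding and ranking stages, assuming a generic MMR-style trade-off between relevance to the persona/task query and redundancy among already-selected sections (the `diversity` weight and selection loop are illustrative, not the system's exact scoring):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # CPU-only inference

def rank_sections(query, sections, top_k=5, diversity=0.3):
    """Rank section texts by cosine similarity to the query while penalizing redundancy."""
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    section_embs = model.encode(sections, convert_to_tensor=True, normalize_embeddings=True)
    relevance = util.cos_sim(query_emb, section_embs)[0]   # similarity of each section to the query
    pairwise = util.cos_sim(section_embs, section_embs)    # section-to-section similarity matrix

    selected, candidates = [], list(range(len(sections)))
    while candidates and len(selected) < top_k:
        def mmr(i):
            redundancy = max((pairwise[i][j].item() for j in selected), default=0.0)
            return (1 - diversity) * relevance[i].item() - diversity * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [(sections[i], relevance[i].item()) for i in selected]
```

A query such as "Travel Planner: plan a 4-day trip for a group of college friends" would then surface the most relevant, non-redundant sections across the collection.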
- Python 3.9 (slim, AMD64)
- sentence-transformers: Embedding models
- NLTK: Tokenization
- NumPy: Vector math
- PyTorch (CPU): Inference backend
- PyMuPDF (fitz): PDF parsing
- Docker: Containerization
Model Footprint: ~570 MB (well under 1 GB)
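The extractive summarization step combines NLTK sentence tokenization with the same embedding model; the sketch below assumes sentences are scored purely by cosine similarity to the persona/task query, which is a simplification.

```python
import nltk
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)  # tokenizer data, fetched at build time so runtime stays offline
model = SentenceTransformer("all-mpnet-base-v2", device="cpu")

def summarize(section_text, query, max_sentences=3):
    """Return the sentences most similar to the query, kept in their original order."""
    sentences = nltk.sent_tokenize(section_text)
    if len(sentences) <= max_sentences:
        return section_text
    embs = model.encode([query] + sentences, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(embs[0], embs[1:])[0]  # query vs. each sentence
    top = sorted(scores.argsort(descending=True)[:max_sentences].tolist())
    return " ".join(sentences[i] for i in top)
```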
```bash
# Build the image
docker build --platform linux/amd64 -t cpt-adobe-1b:cpt .

# Run against a collection
docker run --rm --cpus=8 --memory=16g \
  -v "$(pwd)/Collection_1:/app/data" cpt-adobe-1b:cpt

# Alternative mount syntax using ${PWD}
docker run --rm --cpus=8 --memory=16g \
  -v "${PWD}/Collection_1:/app/data" cpt-adobe-1b:cpt
```

Input Folder Structure:
```
Collection_1/
├── challenge1b_input.json   # Input configuration
└── PDFs/                    # PDF documents
    ├── document1.pdf
    └── document2.pdf
```
Output:
Collection_1/challenge1b_output.json, containing the persona- and task-specific extracted sections and summaries
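A run only needs the persona, the task, and the PDF list from challenge1b_input.json. The loader below is a hypothetical sketch: the key names persona.role and job_to_be_done.task are assumptions and should be checked against the actual input schema.

```python
import json
from pathlib import Path

def load_job(collection_dir):
    """Read persona, task, and PDF paths for one collection (key names are assumed, not verified)."""
    cfg = json.loads((Path(collection_dir) / "challenge1b_input.json").read_text(encoding="utf-8"))
    persona = cfg["persona"]["role"]        # assumed field layout
    task = cfg["job_to_be_done"]["task"]    # assumed field layout
    pdf_paths = sorted((Path(collection_dir) / "PDFs").glob("*.pdf"))
    return persona, task, pdf_paths
```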
- Build Time: 5–8 min (model download & caching)
- Execution Time: 10–50 sec (varies by collection size)
- Memory Usage: ~2–3 GB peak
- CPU Utilization: Optimized for 8 cores
- Offline Operation: 100% offline after build (no network required)
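The multi-threaded execution mentioned above can be approximated with a standard thread pool; the worker function and pool sizing here are illustrative rather than the project's actual scheduler.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_collection(pdf_paths, parse_fn, max_workers=None):
    """Parse a collection of PDFs concurrently, sized for the 8-core CPU target."""
    workers = max_workers or min(8, os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_fn, pdf_paths))
```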