👉 See our detailed Approach Explanation for methodology, design choices, and technical insights.
Persona-Driven Document Intelligence System is a production-grade, fully offline solution for extracting persona- and task-specific insights from unstructured PDF collections. Designed for enterprise deployment, it delivers fast, accurate, and reproducible results on CPU-only infrastructure.
- Heuristic PDF Parsing: Accurate section detection using font size, style, and context (PyMuPDF)
- Persona & Task-Aware Querying: Dynamic, domain-agnostic query construction
- State-of-the-Art Embeddings: all-mpnet-base-v2 transformer for semantic understanding
- Diversity-Driven Ranking: Multi-document representation, cosine similarity, and diversity algorithms
- Extractive Summarization: Sentence-level, fact-grounded summaries
- Parallel Processing: Multi-threaded, optimized for 8-core CPUs
- Memory-Efficient Caching: Model and embedding cache for sub-3GB RAM usage
- Containerized & Offline: Dockerized, 100% offline after build, reproducible anywhere
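To make the heuristic parsing concrete, here is a minimal sketch of font-size- and style-based heading detection with PyMuPDF. The function name, size ratio, word limit, and bold-flag rule are illustrative assumptions, not the exact production heuristics.

```python
import fitz  # PyMuPDF

def detect_headings(pdf_path, size_ratio=1.15, max_words=12):
    """Flag short spans whose font size or weight stands out from the page's body text (illustrative heuristic)."""
    doc = fitz.open(pdf_path)
    headings = []
    for page_num, page in enumerate(doc, start=1):
        spans = [
            span
            for block in page.get_text("dict")["blocks"] if block.get("type") == 0
            for line in block["lines"]
            for span in line["spans"]
        ]
        if not spans:
            continue
        sizes = sorted(s["size"] for s in spans)
        body_size = sizes[len(sizes) // 2]  # median span size approximates the body-text size
        for s in spans:
            text = s["text"].strip()
            is_bold = bool(s["flags"] & 16)  # bit 4 of the span flags marks bold text
            if text and len(text.split()) <= max_words and (s["size"] >= body_size * size_ratio or is_bold):
                headings.append({"page": page_num, "text": text, "size": s["size"]})
    doc.close()
    return headings
```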
```mermaid
flowchart TD
    A[PDF Collection] --> B[Section Detection & Parsing]
    B --> C[Persona/Task Query Formulation]
    C --> D[Semantic Embedding - all-mpnet-base-v2]
    D --> E[Relevance & Diversity Ranking]
    E --> F[Extractive Summarization]
    F --> G[Persona-Specific Output]
```
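A minimal sketch of the embedding and ranking stages, assuming a generic MMR-style trade-off between relevance to the persona/task query and redundancy among already-selected sections (the `diversity` weight and selection loop are illustrative, not the system's exact scoring):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # CPU-only inference

def rank_sections(query, sections, top_k=5, diversity=0.3):
    """Rank section texts by cosine similarity to the query while penalizing redundancy."""
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    section_embs = model.encode(sections, convert_to_tensor=True, normalize_embeddings=True)
    relevance = util.cos_sim(query_emb, section_embs)[0]   # similarity of each section to the query
    pairwise = util.cos_sim(section_embs, section_embs)    # section-to-section similarity matrix

    selected, candidates = [], list(range(len(sections)))
    while candidates and len(selected) < top_k:
        def mmr(i):
            redundancy = max((pairwise[i][j].item() for j in selected), default=0.0)
            return (1 - diversity) * relevance[i].item() - diversity * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [(sections[i], relevance[i].item()) for i in selected]
```

A query such as "Travel Planner: plan a 4-day trip for a group of college friends" would then surface the most relevant, non-redundant sections across the collection.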
- Python 3.9 (slim, AMD64)
- sentence-transformers: Embedding models
- NLTK: Tokenization
- NumPy: Vector math
- PyTorch (CPU): Inference backend
- PyMuPDF (fitz): PDF parsing
- Docker: Containerization
Model Footprint: ~570 MB (well under 1 GB)
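The extractive summarization step combines NLTK sentence tokenization with the same embedding model; the sketch below assumes sentences are scored purely by cosine similarity to the persona/task query, which is a simplification.

```python
import nltk
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)  # tokenizer data, fetched at build time so runtime stays offline
model = SentenceTransformer("all-mpnet-base-v2", device="cpu")

def summarize(section_text, query, max_sentences=3):
    """Return the sentences most similar to the query, kept in their original order."""
    sentences = nltk.sent_tokenize(section_text)
    if len(sentences) <= max_sentences:
        return section_text
    embs = model.encode([query] + sentences, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(embs[0], embs[1:])[0]  # query vs. each sentence
    top = sorted(scores.argsort(descending=True)[:max_sentences].tolist())
    return " ".join(sentences[i] for i in top)
```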
```bash
# Build the image
docker build --platform linux/amd64 -t cpt-adobe-1b:cpt .

# Run against a collection
docker run --rm --cpus=8 --memory=16g \
  -v "$(pwd)/Collection_1:/app/data" cpt-adobe-1b:cpt

# Alternative mount syntax using ${PWD}
docker run --rm --cpus=8 --memory=16g \
  -v "${PWD}/Collection_1:/app/data" cpt-adobe-1b:cpt
```

Input Folder Structure:
```
Collection_1/
├── challenge1b_input.json   # Input configuration
└── PDFs/                    # PDF documents
    ├── document1.pdf
    └── document2.pdf
```
Output:
Collection_1/challenge1b_output.json, containing the persona- and task-specific extracted sections and summaries
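A run only needs the persona, the task, and the PDF list from challenge1b_input.json. The loader below is a hypothetical sketch: the key names persona.role and job_to_be_done.task are assumptions and should be checked against the actual input schema.

```python
import json
from pathlib import Path

def load_job(collection_dir):
    """Read persona, task, and PDF paths for one collection (key names are assumed, not verified)."""
    cfg = json.loads((Path(collection_dir) / "challenge1b_input.json").read_text(encoding="utf-8"))
    persona = cfg["persona"]["role"]        # assumed field layout
    task = cfg["job_to_be_done"]["task"]    # assumed field layout
    pdf_paths = sorted((Path(collection_dir) / "PDFs").glob("*.pdf"))
    return persona, task, pdf_paths
```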
- Build Time: 5–8 min (model download & caching)
- Execution Time: 10–50 sec (varies by collection size)
- Memory Usage: ~2–3 GB peak
- CPU Utilization: Optimized for 8 cores
- Offline Operation: 100% offline after build (no network required)
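The multi-threaded execution mentioned above can be approximated with a standard thread pool; the worker function and pool sizing here are illustrative rather than the project's actual scheduler.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_collection(pdf_paths, parse_fn, max_workers=None):
    """Parse a collection of PDFs concurrently, sized for the 8-core CPU target."""
    workers = max_workers or min(8, os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_fn, pdf_paths))
```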