Skip to content

StrungPattern-coder/cpt-adobe-1b

Repository files navigation

Persona-Driven Document Intelligence System

👉 See our detailed Approach Explanation for methodology, design choices, and technical insights.

Python Docker Offline CPU Only


Table of Contents


Overview

Persona-Driven Document Intelligence System is a production-grade, fully offline solution for extracting persona- and task-specific insights from unstructured PDF collections. Designed for enterprise deployment, it delivers fast, accurate, and reproducible results on CPU-only infrastructure.


Features

  • Heuristic PDF Parsing: Accurate section detection using font size, style, and context (PyMuPDF)
  • Persona & Task-Aware Querying: Dynamic, domain-agnostic query construction
  • State-of-the-Art Embeddings: all-mpnet-base-v2 transformer for semantic understanding
  • Diversity-Driven Ranking: Multi-document representation, cosine similarity, and diversity algorithms
  • Extractive Summarization: Sentence-level, fact-grounded summaries
  • Parallel Processing: Multi-threaded, optimized for 8-core CPUs
  • Memory-Efficient Caching: Model and embedding cache for sub-3GB RAM usage
  • Containerized & Offline: Dockerized, 100% offline after build, reproducible anywhere

Architecture

flowchart TD
    A[PDF Collection] --> B[Section Detection & Parsing]
    B --> C[Persona/Task Query Formulation]
    C --> D[Semantic Embedding - all-mpnet-base-v2]
    D --> E[Relevance & Diversity Ranking]
    E --> F[Extractive Summarization]
    F --> G[Persona-Specific Output]
Loading

Tech Stack

  • Python 3.9 (slim, AMD64)
  • sentence-transformers: Embedding models
  • NLTK: Tokenization
  • NumPy: Vector math
  • PyTorch (CPU): Inference backend
  • PyMuPDF (fitz): PDF parsing
  • Docker: Containerization

Model Footprint: ~570 MB (well under 1GB)


Setup & Usage

1. Build Docker Image

docker build --platform linux/amd64 -t cpt-adobe-1b:cpt .

2. Run Analysis

Linux/Mac:

docker run --rm --cpus=8 --memory=16g \
  -v "$(pwd)/Collection_1:/app/data" cpt-adobe-1b:cpt

Windows PowerShell:

docker run --rm --cpus=8 --memory=16g \
  -v "${PWD}/Collection_1:/app/data" cpt-adobe-1b:cpt

Input & Output

Input Folder Structure:

Collection_1/
├── challenge1b_input.json    # Input configuration
└── PDFs/                    # PDF documents
    ├── document1.pdf
    └── document2.pdf

Output:

  • Collection_1/challenge1b_output.json — Persona/task-specific extracted sections and summaries

Performance

  • Build Time: 5–8 min (model download & caching)
  • Execution Time: 10–50 sec (varies by collection size)
  • Memory Usage: ~2–3 GB peak
  • CPU Utilization: Optimized for 8 cores
  • Offline Operation: 100% offline after build (no network required)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published