A comprehensive, automation-friendly system for analyzing, deduplicating, and reorganizing large-scale file repositories with minimal human intervention and maximum safety.
- Multi-threaded scanning of local, network, and cloud-mounted drives
- Rich metadata extraction (size, timestamps, MIME types, access patterns)
- Progressive hashing strategy (quick hash for pre-filtering, full hash for verification)
- SQLite-based indexing for fast queries and analysis
- Multi-stage detection pipeline (sketched in Python after this feature list):
- Fast pre-filtering by size and extension
- Quick hash matching (first 1MB)
- Full content verification (SHA-256)
- Fuzzy filename matching for near-duplicates
- Intelligent canonical file selection based on configurable priorities
- Safe quarantine approach (move to staging area, not immediate deletion)
- Interactive HTML reports with statistics and insights
- JSON exports for automation and integration
- CSV exports for spreadsheet analysis
- Automatic insight generation and recommendations
- Visualizations of storage distribution, duplicates, and trends
- Template-based target structure design
- Flexible classification rules by file type
- Standardized naming conventions
- Lifecycle management (active -> archive -> cold storage)
- Backward compatibility with optional symlinks
- Dry-run preview before execution
- Batch processing with progress tracking
- Checkpoint-based rollback capability
- Full audit trail with detailed logging
- Conflict resolution and error handling
- Multi-source library architecture: Manage files across local drives, network shares, and cloud storage
- Auto-detection: Automatically find mounted OneDrive, Google Drive, iCloud, and Proton Drive folders
- Native API support: Direct integration with OneDrive via Microsoft Graph API
- Two-way sync: Pull files from cloud, classify them, push organized files back
- Secure authentication: OAuth 2.0 with encrypted token storage using system keyring
- DistilBERT v2 classifier: Trained on 77k+ files with 96.7% accuracy
- Automatic document classification: Categorize files by content type
- Confidence scoring: Know when to trust automated decisions
- Incremental learning: Improve classification with user feedback
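The detection pipeline above can be illustrated in a few lines of Python. This is a minimal sketch of the general technique (size pre-filter, 1 MB quick hash, full SHA-256 verification), not CogniSys's actual implementation:

```python
# Sketch of a multi-stage duplicate pipeline: cheap filters first, full
# hashes only for the few files that survive them. Illustrative only.
import hashlib
from collections import defaultdict
from pathlib import Path

QUICK_BYTES = 1024 * 1024  # quick hash covers only the first 1 MB

def quick_hash(path: Path) -> str:
    """Hash the first 1 MB -- a cheap pre-filter, not proof of equality."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        h.update(f.read(QUICK_BYTES))
    return h.hexdigest()

def full_hash(path: Path, chunk: int = 1 << 20) -> str:
    """Full SHA-256, streamed so large files never load into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    # Stage 1: group by size (metadata only, eliminates most files)
    by_size: dict[int, list[Path]] = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)
    # Stage 2: quick hash within same-size groups
    by_quick: dict[str, list[Path]] = defaultdict(list)
    for group in by_size.values():
        if len(group) > 1:
            for p in group:
                by_quick[quick_hash(p)].append(p)
    # Stage 3: full hash to confirm true duplicates
    confirmed: dict[str, list[Path]] = defaultdict(list)
    for group in by_quick.values():
        if len(group) > 1:
            for p in group:
                confirmed[full_hash(p)].append(p)
    return [g for g in confirmed.values() if len(g) > 1]
```

Each stage runs only on files that collided in the previous one, so the expensive full hash touches a small fraction of the repository.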
- Python 3.10 or higher
- Windows, macOS, or Linux
```
git clone https://github.com/FleithFeming/cognisys-core.git
cd cognisys-core
pip install -r requirements.txt
pip install -e .
```

For the cloud features (OAuth authentication and encrypted token storage), also install the optional dependencies:

```
pip install msal keyring cryptography
```

Verify the installation:

```
cognisys --help
```

Run your first scan:

```
cognisys scan --roots "C:\Users\Documents" --roots "C:\Projects"
```

This creates a session and indexes all files with metadata and hashes.
Output:

```
[SUCCESS] Scan completed!
Session ID: 20251120-143022-a8f3
Files scanned: 12,453
Folders scanned: 1,234
Total size: 45.20 GB
```
```
cognisys analyze --session 20251120-143022-a8f3
```

Runs the deduplication pipeline and identifies patterns.
Output:

```
[SUCCESS] Analysis complete!
Duplicate groups: 1,048
Duplicate files: 6,219
Wasted space: 28.40 GB
```
```
cognisys report --session 20251120-143022-a8f3 --format html --format json --format csv
```

Creates comprehensive reports in multiple formats.

Outputs:

- `reports/20251120-143022-a8f3_report.html` - Interactive dashboard
- `reports/20251120-143022-a8f3_report.json` - Machine-readable data
- `reports/files_inventory.csv` - Full file listing
```
cognisys plan --session 20251120-143022-a8f3 --structure cognisys/config/new_structure.yml
```

Generates a migration plan based on target structure.
Output:

```
[SUCCESS] Migration plan created!
Plan ID: plan-20251120-150000-x9a2

Actions by type:
  move: 38,456 files (120.50 GB)
  archive: 2,345 files (3.20 GB)
```
```
cognisys dry-run --plan plan-20251120-150000-x9a2
```

Shows sample actions without making any changes.
Output:
[PREVIEW] Sample actions:
1. MOVE: C:\Users\Documents\Report (1).docx
TO: C:\Repository\Quarantine\Duplicates_2025-11-20\Report (1).docx
REASON: Duplicate (group dup-a8f3c1d2)
2. MOVE: C:\Users\Downloads\client_data.xlsx
TO: C:\Repository\Active\Projects\ProjectA\Data\2025-10-20_ProjectA_ClientData_v01.xlsx
REASON: Reorganize to code structure
[INFO] Total actions: 40,801
[INFO] Total data: 123.70 GB
```
# Approve the plan
cognisys approve --plan plan-20251120-150000-x9a2

# Execute the migration
cognisys execute --plan plan-20251120-150000-x9a2
```

Executes the migration with full audit logging and rollback capability.
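Because each step is a plain CLI call, the whole quick start can be scripted. Here is a minimal sketch using only the commands and flags documented above; passing a fixed --session-id avoids having to parse the scan output:

```python
# Chain the quick-start commands via subprocess. Plan/approve/execute are
# deliberately left manual: review the dry run before moving files.
import subprocess
from datetime import datetime

session = datetime.now().strftime("%Y%m%d-%H%M%S") + "-auto"

def run(*args: str) -> None:
    subprocess.run(["cognisys", *args], check=True)  # raise on non-zero exit

run("scan", "--roots", r"C:\Users\Documents", "--session-id", session)
run("analyze", "--session", session)
run("report", "--session", session, "--format", "json")
```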
Edit cognisys/config/scan_config.yml to customize:
- Root paths to scan
- Exclusion patterns (e.g., `*.tmp`, `node_modules`)
- Threading and performance settings
- Hashing strategy
- Access tracking options
Edit cognisys/config/analysis_rules.yml to configure:
- Duplicate detection thresholds
- Canonical file selection priorities
- Orphaned file criteria
- Preferred paths for keeping files
Edit cognisys/config/new_structure.yml to define:
- Repository organization (Active, Archive, Reference, etc.)
- File classification rules by type
- Naming conventions
- Lifecycle policies
```
cognisys scan --roots <path1> --roots <path2> [options]

Options:
  --config PATH       Scan configuration file
  --db PATH           Database path (default: db/cognisys.db)
  --session-id TEXT   Custom session ID
```

```
cognisys analyze --session <session-id> [options]

Options:
  --rules PATH   Analysis rules file
  --db PATH      Database path
```

```
cognisys report --session <session-id> [options]

Options:
  --output PATH     Output directory (default: reports)
  --format FORMAT   Output format(s): html, json, csv (multiple allowed)
  --db PATH         Database path
```

```
cognisys plan --session <session-id> [options]

Options:
  --structure PATH   Target structure config
  --output TEXT      Plan ID (auto-generated if not provided)
  --db PATH          Database path
```

```
# Preview changes
cognisys dry-run --plan <plan-id>

# Approve plan
cognisys approve --plan <plan-id>

# Execute migration
cognisys execute --plan <plan-id>

Options:
  --db PATH   Database path
```

```
# List all configured sources
cognisys source list

# Add a local source
cognisys source add my_docs --type local --path "C:\Users\Documents"

# Add a cloud API source (requires authentication)
cognisys source add onedrive_docs --type cloud_api --provider onedrive --path /Documents

# Detect cloud folders automatically
cognisys source detect
cognisys source detect --add   # Add detected folders as sources

# Show source status
cognisys source status
```

```
# Detect mounted cloud folders
cognisys cloud detect

# Authenticate with OneDrive
cognisys cloud auth --provider onedrive --client-id <your-client-id>
cognisys cloud auth --provider onedrive --device-code   # For headless environments

# Check cloud connection status
cognisys cloud status

# Sync files with cloud source
cognisys cloud sync <source_name> --direction pull
cognisys cloud sync <source_name> --direction push
cognisys cloud sync <source_name> --dry-run

# Log out from cloud providers
cognisys cloud logout
```

```
# Show classification statistics
cognisys reclassify stats

# Reclassify 'unknown' files using pattern matching
cognisys reclassify unknown             # Dry run
cognisys reclassify unknown --execute   # Apply changes
cognisys reclassify unknown --verbose   # Show detailed output

# Reclassify NULL document_type files (patterns + ML)
cognisys reclassify null                     # Dry run
cognisys reclassify null --execute           # Apply changes
cognisys reclassify null --no-ml             # Patterns only
cognisys reclassify null --confidence 0.80   # Higher threshold

# Re-evaluate all files (full reclassification)
cognisys reclassify all --execute   # Requires confirmation

# Show files with low classification confidence
cognisys reclassify low-confidence --threshold 0.5 --limit 100

Options:
  --db PATH            Database path (default: .cognisys/file_registry.db)
  --execute            Apply changes (default is dry-run)
  --confidence FLOAT   Confidence threshold (default: 0.70)
  --use-ml/--no-ml     Enable/disable ML model fallback
  --batch-size INT     Batch size for updates (default: 100)
  --verbose, -v        Show detailed output
```

```
# List all scan sessions
cognisys list-sessions

# Show help for any command
cognisys <command> --help
```
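For reference, the `--device-code` flow shown above maps onto Microsoft's msal library roughly as follows. This is an independent sketch of the underlying OAuth exchange, not CogniSys's internal code, and the Graph scope is an assumption:

```python
# Device-code sign-in against Microsoft Graph with msal. CogniSys wires this
# up for you via `cognisys cloud auth`; shown here only to demystify the flow.
import msal

app = msal.PublicClientApplication(
    "<your-client-id>",  # same value passed to --client-id
    authority="https://login.microsoftonline.com/common",
)

flow = app.initiate_device_flow(scopes=["Files.ReadWrite"])  # scope assumed
print(flow["message"])  # tells the user which URL to visit and which code to enter

result = app.acquire_token_by_device_flow(flow)  # blocks until sign-in completes
if "access_token" in result:
    print("Authenticated with Microsoft Graph")
else:
    print("Authentication failed:", result.get("error_description"))
```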
```
CogniSys
├─ Storage Layer      -> Multi-source abstraction (local, network, cloud)
│  ├─ LocalFileSource -> Local filesystem operations
│  ├─ OneDriveSource  -> Microsoft Graph API integration
│  └─ SyncManager     -> Two-way cloud synchronization
├─ Cloud Integration  -> Provider detection and authentication
│  ├─ CloudFolderDetector -> Auto-detect mounted cloud folders
│  └─ OAuth Authenticator -> Secure token management
├─ Scanning Engine    -> Multi-threaded file traversal and indexing
├─ Analysis Engine    -> Deduplication and pattern detection
├─ ML Classifier      -> DistilBERT-based document classification
├─ Reporting Engine   -> Statistics, insights, and visualizations
├─ Migration Planner  -> Rule-based reorganization planning
└─ Migration Executor -> Safe execution with rollback support
```
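The Storage Layer's multi-source abstraction implies a common interface that the scanner can consume regardless of backend. A hypothetical sketch; the real class signatures in cognisys may differ:

```python
# Hypothetical Storage Layer interface: one contract for local, network,
# and cloud sources so the scanning engine never cares about the backend.
import os
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterator

class FileSource(ABC):
    @abstractmethod
    def list_files(self, root: str) -> Iterator[str]:
        """Yield every file path under root."""

    @abstractmethod
    def open(self, path: str) -> BinaryIO:
        """Open a file for reading so it can be hashed or classified."""

class LocalFileSource(FileSource):
    def list_files(self, root: str) -> Iterator[str]:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                yield os.path.join(dirpath, name)

    def open(self, path: str) -> BinaryIO:
        return open(path, "rb")  # the builtin, not this method
```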
```
+------------------------------------------------------------------+
|                     CogniSys Source Library                      |
+------------------------------------------------------------------+
|                                                                  |
|   LOCAL SOURCES            CLOUD SOURCES (Mounted)               |
|   +----------------+       +----------------+                    |
|   | Downloads      |       | OneDrive       |                    |
|   | Documents      |       | Google Drive   |                    |
|   | Projects       |       | iCloud         |                    |
|   +----------------+       +----------------+                    |
|                                                                  |
|   CLOUD SOURCES (API)      NETWORK SOURCES                       |
|   +----------------+       +----------------+                    |
|   | OneDrive API   |       | NAS/SMB        |                    |
|   | (coming soon)  |       | Network Share  |                    |
|   +----------------+       +----------------+                    |
|                                                                  |
+------------------------------------------------------------------+
                                 |
                                 v
                       +--------------------+
                       |  Unified Registry  |
                       | (file_registry.db) |
                       +--------------------+
```
1. Source Library -> Configure local, network, and cloud sources
2. Scan -> Index files with metadata and hashes from all sources
3. Analyze -> Detect duplicates and patterns across sources
4. Classify -> ML-powered document categorization
5. Report -> Generate insights and recommendations
6. Plan -> Create migration strategy
7. Execute -> Apply changes with safety checks
8. Sync -> Push organized files back to cloud (optional)
SQLite database with tables for:

- `files` - File metadata and hashes
- `folders` - Folder hierarchy and stats
- `sources` - Configured source library (local, network, cloud)
- `cloud_providers` - Authenticated cloud provider credentials
- `scan_history` - Per-source scan tracking
- `duplicate_groups` - Duplicate sets with canonical selection
- `duplicate_members` - Individual duplicates with priority scores
- `scan_sessions` - Scan metadata and configuration
- `migration_plans` - Reorganization plans
- `migration_actions` - Individual file operations
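Because the registry is plain SQLite, ad-hoc analysis is a query away. The table names below come from the list above, but the column names are assumptions; inspect the real schema with .schema before relying on them:

```python
# List the ten largest duplicate groups by member count.
# Column names (g.id, m.group_id, m.file_id) are hypothetical.
import sqlite3

con = sqlite3.connect("db/cognisys.db")
rows = con.execute(
    """
    SELECT g.id, COUNT(m.file_id) AS copies
    FROM duplicate_groups g
    JOIN duplicate_members m ON m.group_id = g.id
    GROUP BY g.id
    ORDER BY copies DESC
    LIMIT 10
    """
).fetchall()
for group_id, copies in rows:
    print(f"{group_id}: {copies} copies")
con.close()
```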
- Dry-run preview before any changes
- Explicit approval required for execution
- Duplicates moved to quarantine, not deleted
- Audit trail for all operations
- Checkpoint creation before execution
- JSON-based rollback data (see the sketch after this list)
- Automatic rollback on errors
- Manual rollback support
- Graceful handling of permission errors
- Conflict resolution for naming collisions
- Batch processing with retry logic
- Detailed error logging
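The JSON-based rollback above amounts to journaling every action before performing the next one, then replaying the journal in reverse. A minimal sketch of the idea, not CogniSys's actual checkpoint format:

```python
# JSON-journaled file moves: checkpoint after each action, undo in reverse.
import json
import shutil
from pathlib import Path

JOURNAL = Path("checkpoint.json")

def execute(moves: list[tuple[str, str]]) -> None:
    done: list[list[str]] = []
    try:
        for src, dst in moves:
            Path(dst).parent.mkdir(parents=True, exist_ok=True)
            shutil.move(src, dst)
            done.append([src, dst])
            JOURNAL.write_text(json.dumps(done))  # checkpoint after every action
    except OSError:
        rollback()  # automatic rollback on errors
        raise

def rollback() -> None:
    if not JOURNAL.exists():
        return
    for src, dst in reversed(json.loads(JOURNAL.read_text())):
        shutil.move(dst, src)  # restore files to their original locations
```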
- Consolidate scattered documents and media
- Eliminate duplicate photos and downloads
- Organize by project or date
- Archive old files
- Audit network drives for redundancy
- Standardize naming conventions
- Implement retention policies
- Reclaim storage space
- Organize media libraries
- Version control for documents
- Metadata enrichment
- Searchable archives
- Organize code repositories
- Clean up build artifacts
- Manage dependencies
- Archive legacy projects
Add custom file categories and rules in configuration files.
Define templates with variables like `{YYYY-MM-DD}_{ProjectName}_{Version}`.
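Such templates can be rendered with plain string substitution. A hypothetical renderer; CogniSys's own template engine and placeholder syntax may differ:

```python
# Render a naming template like {YYYY-MM-DD}_{ProjectName}_{Version}.
from datetime import date

def render(template: str, project: str, version: int) -> str:
    return (template
            .replace("{YYYY-MM-DD}", date.today().isoformat())
            .replace("{ProjectName}", project)
            .replace("{Version}", f"v{version:02d}"))

print(render("{YYYY-MM-DD}_{ProjectName}_{Version}.xlsx", "ProjectA", 1))
# -> e.g. 2025-11-20_ProjectA_v01.xlsx
```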
Automate transitions: Active -> Archive -> Cold Storage based on access patterns.
JSON exports and SQLite database enable integration with other tools and scripts.
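For example, a nightly job could read the JSON report and alert when reclaimable space crosses a threshold. The keys used here are illustrative, not the report's guaranteed schema:

```python
# Consume the machine-readable report (key names are assumptions --
# inspect a generated report for the real structure).
import json

with open("reports/20251120-143022-a8f3_report.json") as f:
    report = json.load(f)

wasted_gb = report.get("duplicates", {}).get("wasted_space_gb", 0)
if wasted_gb > 10:
    print(f"Dedup recommended: {wasted_gb:.1f} GB reclaimable")
```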
- Increase thread count in config for faster scanning
- Use incremental scans to update existing indexes
- Split very large scans across multiple sessions
- Run with appropriate privileges for network drives
- Check exclusion patterns to skip system folders
- Review error logs for specific access issues
- Adjust batch size in configuration
- Use checkpoint intervals for long-running scans
- Close and reopen database connections periodically
- Scanning: Use 4-8 threads for balanced performance
- Hashing: Skip very large files (>10GB) or disable full hashing
- Analysis: Enable fuzzy matching only when needed (expensive)
- Migration: Execute during off-peak hours for large operations
Contributions are welcome! Areas for enhancement:

- Web-based dashboard UI
- Additional cloud providers (S3, Azure Blob, Google Cloud Storage)
- Multi-user collaboration features
- Real-time monitoring and alerting
- Google Drive native API support

Already shipped:

- Cloud storage integration (OneDrive, Google Drive, iCloud mounted-folder support)
- Machine learning for smarter classification (DistilBERT v2, 96.7% accuracy)
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Full documentation is available in the docs/ directory:
- Quick Start Guide - Get started in 5 minutes
- Architecture Overview - System design and components
- CLI Reference - Complete command reference
- Workflow Guide - End-to-end processing
For issues, questions, or feature requests:
- Create an issue in the repository
- Review documentation for guides and references
- Review logs in the `logs/` directory
- Check configuration files for syntax errors
Built with Python, SQLite, Click, and PyYAML.
CogniSys - Bringing order to digital chaos.