PDF Finder with OCR

A tool to search for text in PDF files using multiple methods, including OCR (Optical Character Recognition).

Purpose

This project solves a common problem: searching through dozens of PDF files quickly. For example, to find a specific transaction in your credit card history across multiple statements:

Download all your PDF statements from your bank
Place them in the pdf folder
Run ./run.sh "search term" to locate the exact page and file where the term appears

I use this tool regularly to search through financial documents, receipts, and statements, and thought it would be worth sharing.

Features

Basic text extraction using PyPDF2
Advanced text extraction with pdfplumber (better at handling complex layouts)
OCR-based text extraction using Tesseract (can read text from images/scans)
Table data extraction for structured content
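
As a rough illustration of the table extraction feature, the sketch below uses pdfplumber's `extract_tables()` and flattens each table into plain text that can then be searched. The helper names `table_to_text` and `tables_as_text` are hypothetical and not taken from pdf_finder.py, so the actual script may handle tables differently.

```python
from typing import List, Optional

# A pdfplumber table is a list of rows; each row is a list of cell strings (or None).
Table = List[List[Optional[str]]]


def table_to_text(table: Table) -> str:
    """Flatten one table into text: one line per row, cells separated by spaces."""
    lines = []
    for row in table:
        # Skip cells that are None, empty, or whitespace only.
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        lines.append(" ".join(cells))
    return "\n".join(lines)


def tables_as_text(path: str) -> List[str]:
    """Return the flattened table text for each page of the PDF at `path`."""
    import pdfplumber  # imported lazily so table_to_text works without it

    with pdfplumber.open(path) as pdf:
        return [
            "\n".join(table_to_text(table) for table in page.extract_tables())
            for page in pdf.pages
        ]
```

Flattening rows this way keeps the cells of a row on one line, so a search term such as a date followed by an amount can still match even when the PDF stores the cells separately.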

Requirements

Python 3.6+
Tesseract OCR engine
poppler (for pdf2image)

Quick Installation

The easiest way to install is using the provided installation script:

# Clone the repository
git clone https://github.com/yourusername/pdf-finder.git
cd pdf-finder

# Run the installation script
./install.sh

The installation script will:

Create a Python virtual environment
Install all required Python dependencies
Install Tesseract OCR and Poppler if on a supported system (macOS or Debian/Ubuntu)
Create a convenient run script for daily use

Manual Installation

If you prefer to install manually:

Clone the repository
Create a virtual environment: python -m venv venv
Activate the virtual environment:
- macOS/Linux: source venv/bin/activate
- Windows: venv\Scripts\activate
Install Python dependencies: pip install -r requirements.txt
Install external dependencies:

macOS
```
brew install tesseract poppler
```
Linux (Ubuntu/Debian)
```
sudo apt-get install tesseract-ocr poppler-utils
```
Windows
- Download and install Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
- Download and install poppler from http://blog.alivate.com.au/poppler-windows/
- Add both to your PATH
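
After a manual install, it can help to confirm that the external tools are actually reachable. The following is a minimal sketch, not part of this repository, that looks for the `tesseract` executable and for `pdftoppm` (one of the Poppler programs that pdf2image calls) on your PATH. The name `missing_tools` is made up for this example.

```python
import shutil
from typing import Callable, List, Optional

# Executable name -> human-readable dependency name.
# pdftoppm ships with Poppler and is used by pdf2image to render pages.
REQUIRED_TOOLS = {"tesseract": "Tesseract OCR", "pdftoppm": "Poppler"}


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the names of required tools that cannot be found on PATH."""
    return [name for exe, name in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Poppler are both on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.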

Usage

Place your PDF files in the pdf directory

Run the script with your search term:

# Quick usage
./run.sh "your search term"

# Or with manual activation
source venv/bin/activate
python pdf_finder.py "your search term"

The script searches using all three methods (basic extraction, advanced extraction, and OCR) and reports the file and page where your term was found.
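
To make the workflow concrete, here is a small sketch of how the PDFs in the pdf directory could be collected and how a match might be reported. It assumes the files sit directly inside that directory. The function names and the report format are hypothetical, and the real output of pdf_finder.py may look different.

```python
from pathlib import Path
from typing import List


def find_pdfs(directory: str = "pdf") -> List[Path]:
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def format_hit(path: Path, page: int, method: str) -> str:
    """Build a one-line report entry (hypothetical format)."""
    return f"{path.name}: page {page} (found via {method})"
```

The suffix check is case-insensitive, so files named `statement.PDF` are picked up as well.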

Example Use Cases

Finding specific transactions in bank or credit card statements
Searching through tax documents for specific amounts or references
Locating mentions of certain terms across multiple research papers
Finding information in scanned documents that aren't text-searchable

How It Works

The tool uses three different approaches to find text in PDFs:

PyPDF2: Fast basic text extraction
pdfplumber: More advanced extraction that handles tables and complex layouts
Tesseract OCR: Converts PDF pages to images and applies OCR to read text from scanned documents or images

This multi-method approach helps find text that might be missed by any single method alone.
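
The three approaches above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in pdf_finder.py: it assumes PyPDF2 2.x or later (which provides `PdfReader`), plus pdfplumber, pdf2image, and pytesseract. All function names here are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional


def extract_with_pypdf2(path: str) -> List[str]:
    """Return one text string per page using PyPDF2 (fast, basic)."""
    from PyPDF2 import PdfReader  # lazy import: only needed when this method runs

    return [page.extract_text() or "" for page in PdfReader(path).pages]


def extract_with_pdfplumber(path: str) -> List[str]:
    """Return one text string per page using pdfplumber (layout-aware)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_ocr(path: str) -> List[str]:
    """Render each page to an image with pdf2image, then OCR it with Tesseract."""
    import pytesseract
    from pdf2image import convert_from_path

    return [pytesseract.image_to_string(image) for image in convert_from_path(path)]


EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {
    "PyPDF2": extract_with_pypdf2,
    "pdfplumber": extract_with_pdfplumber,
    "Tesseract OCR": extract_with_ocr,
}


def pages_containing(page_texts: Iterable[str], term: str) -> List[int]:
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [n for n, text in enumerate(page_texts, start=1) if needle in text.lower()]


def search_pdf(
    path: str,
    term: str,
    extractors: Optional[Dict[str, Callable[[str], List[str]]]] = None,
) -> Dict[str, List[int]]:
    """Run every extraction method and map its name to the matching page numbers."""
    results: Dict[str, List[int]] = {}
    for name, extract in (extractors or EXTRACTORS).items():
        try:
            results[name] = pages_containing(extract(path), term)
        except Exception:
            # One failing method (for example, OCR not installed) should not stop the rest.
            results[name] = []
    return results
```

Keeping the results separate per method makes it easy to see which approach found a term, for instance when only OCR matches on a scanned statement.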

Troubleshooting

If macOS Preview can find text that this tool cannot, it might be because:

The PDF contains text in images that requires better OCR
The text uses a special font or encoding
The PDF has complex formatting or structure

If Tesseract OCR is installed correctly but still does not find text, you might need to:

Improve the image quality before OCR
Try different Tesseract parameters or language settings
Use a cloud-based OCR service for better results
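
The first two suggestions can be tried with a few lines of Python. The sketch below is one possible approach, assuming Pillow, pdf2image, and pytesseract are installed. It renders pages at a higher DPI, converts them to high-contrast grayscale, and passes a page segmentation mode and language to Tesseract. The function names and default values are illustrative choices, not settings taken from this project.

```python
from typing import List


def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract option string.

    psm is the page segmentation mode (0-13), oem the OCR engine mode (0-3).
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def preprocess(image, scale: int = 2):
    """Convert a PIL image to grayscale, stretch its contrast, and upscale it."""
    from PIL import Image, ImageOps  # lazy import: Pillow is only needed here

    gray = ImageOps.autocontrast(ImageOps.grayscale(image))
    return gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)


def ocr_pdf(path: str, lang: str = "eng", dpi: int = 300, psm: int = 6) -> List[str]:
    """OCR every page of the PDF at `path` and return one string per page."""
    import pytesseract
    from pdf2image import convert_from_path

    config = tesseract_config(psm=psm)
    return [
        pytesseract.image_to_string(preprocess(image), lang=lang, config=config)
        for image in convert_from_path(path, dpi=dpi)
    ]
```

Mode 6 treats the page as a single uniform block of text, while mode 11 looks for sparse text and may suit receipts better. A combined language such as `lang="eng+deu"` only works if the matching Tesseract language data is installed.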

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Created by Aemal Sayer
