A tool to search for text in PDF files using multiple methods, including OCR (Optical Character Recognition).
This project was created to solve a common problem: searching through dozens of PDF files quickly and efficiently. For example, to find a specific transaction in your credit card history across multiple statements:
- Download all your PDF statements from your bank
- Place them in the /pdf folder
- Run ./run.sh "search term" to locate the exact page and file where the term appears
I personally use this tool all the time to search through financial documents, receipts, and statements, and thought it would be valuable to share with the world.
- Basic text extraction using PyPDF2
- Advanced text extraction with pdfplumber (better at handling complex layouts)
- OCR-based text extraction using Tesseract (can read text from images/scans)
- Table data extraction for structured content
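As a rough illustration of the table feature: pdfplumber's `page.extract_tables()` returns each table as a list of rows, with each row a list of cell strings (or `None` for empty cells). The sketch below shows one way such rows could be searched. `find_in_table` and `search_tables` are hypothetical helpers, not necessarily how pdf_finder.py does it.

```python
from typing import List, Optional, Tuple

Table = List[List[Optional[str]]]


def find_in_table(table: Table, term: str) -> List[Tuple[int, List[str]]]:
    """Return (row_index, cleaned_row) for every row containing the term."""
    needle = term.lower()
    hits = []
    for index, row in enumerate(table):
        # pdfplumber uses None for empty cells, so replace those with "".
        cells = [cell or "" for cell in row]
        if any(needle in cell.lower() for cell in cells):
            hits.append((index, cells))
    return hits


def search_tables(pdf_path: str, term: str):
    """Yield (page_number, row_index, row) for matching table rows in a PDF."""
    import pdfplumber  # third-party; imported lazily so find_in_table works without it

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                for row_index, row in find_in_table(table, term):
                    yield page_number, row_index, row
```

Returning the whole row rather than just the matching cell keeps the context (date, description, amount) together, which is usually what you want when looking up a transaction.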
- Python 3.6+
- Tesseract OCR engine
- poppler (for pdf2image)
The easiest way to install is using the provided installation script:
# Clone the repository
git clone https://github.com/yourusername/pdf-finder.git
cd pdf-finder
# Run the installation script
./install.sh
The installation script will:
- Create a Python virtual environment
- Install all required Python dependencies
- Install Tesseract OCR and Poppler if on a supported system (macOS or Debian/Ubuntu)
- Create a convenient run script for daily use
If you prefer to install manually:
- Clone the repository
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - macOS/Linux: source venv/bin/activate
  - Windows: venv\Scripts\activate
- Install Python dependencies:
  pip install -r requirements.txt
- Install external dependencies:
  - macOS: brew install tesseract poppler
  - Debian/Ubuntu: apt-get install tesseract-ocr poppler-utils
  - Windows:
    - Download and install Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
    - Download and install poppler from http://blog.alivate.com.au/poppler-windows/
    - Add both to your PATH
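After a manual install, it can help to confirm that the external tools are actually reachable on your PATH. A minimal sketch, assuming the executables are named `tesseract` and `pdftoppm` (the Poppler tool that pdf2image calls by default). `missing_tools` is a hypothetical helper and not part of this repository.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Assumed executable names: tesseract (OCR engine) and pdftoppm (from Poppler).
    missing = missing_tools(["tesseract", "pdftoppm"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Poppler look available.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.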
- Place your PDF files in the pdf directory
- Run the script with your search term:
  # Quick usage
  ./run.sh "your search term"
  # Or with manual activation
  source venv/bin/activate
  python pdf_finder.py "your search term"
- The script will search using all three methods (basic extraction, advanced extraction, and OCR) and display where your term was found
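The workflow above boils down to walking the pdf directory and reporting a file name and page number for each hit. A minimal sketch of that loop, with the text extraction passed in as a function so that any of the three methods could be plugged in. `list_pdfs`, `search_directory`, and the `extract_pages` parameter are illustrative names, not the actual interface of pdf_finder.py.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def list_pdfs(directory: str) -> List[Path]:
    """Return the PDF files directly inside `directory`, sorted by name."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def search_directory(
    directory: str,
    term: str,
    extract_pages: Callable[[Path], Iterable[str]],
) -> List[Tuple[str, int]]:
    """Return (file_name, page_number) for every page whose text contains the term."""
    needle = term.lower()
    hits = []
    for pdf_path in list_pdfs(directory):
        # Page numbers start at 1 to match what a PDF viewer shows.
        for page_number, text in enumerate(extract_pages(pdf_path), start=1):
            if needle in text.lower():
                hits.append((pdf_path.name, page_number))
    return hits
```

The comparison is case-insensitive here, which is an assumption; an exact-case search would simply drop the two `.lower()` calls.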
- Finding specific transactions in bank or credit card statements
- Searching through tax documents for specific amounts or references
- Locating mentions of certain terms across multiple research papers
- Finding information in scanned documents that aren't text-searchable
The tool uses three different approaches to find text in PDFs:
- PyPDF2: Fast basic text extraction
- pdfplumber: More advanced extraction that handles tables and complex layouts
- Tesseract OCR: Converts PDF pages to images and applies OCR to read text from scanned documents or images
This multi-method approach helps find text that might be missed by any single method alone.
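The three approaches can be pictured as interchangeable page extractors whose hits are merged. The sketch below is one possible arrangement under that assumption, not the project's actual code. The function names are hypothetical, and the third-party imports (PyPDF2, pdfplumber, pdf2image, pytesseract) happen lazily inside each extractor.

```python
from typing import Callable, Dict, Iterable, List, Set


def extract_with_pypdf2(path: str) -> List[str]:
    from PyPDF2 import PdfReader  # recent PyPDF2 releases; older ones used PdfFileReader

    return [page.extract_text() or "" for page in PdfReader(path).pages]


def extract_with_pdfplumber(path: str) -> List[str]:
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_ocr(path: str) -> List[str]:
    import pytesseract
    from pdf2image import convert_from_path

    # Render each page to an image, then let Tesseract read it.
    return [pytesseract.image_to_string(image) for image in convert_from_path(path)]


def search_all_methods(
    path: str,
    term: str,
    extractors: Dict[str, Callable[[str], Iterable[str]]],
) -> Dict[int, Set[str]]:
    """Map each matching page number to the set of methods that found the term."""
    needle = term.lower()
    found: Dict[int, Set[str]] = {}
    for method, extract in extractors.items():
        for page_number, text in enumerate(extract(path), start=1):
            if needle in text.lower():
                found.setdefault(page_number, set()).add(method)
    return found
```

Calling `search_all_methods(path, term, {"pypdf2": extract_with_pypdf2, "pdfplumber": extract_with_pdfplumber, "ocr": extract_with_ocr})` would then show which methods agree on each page. A page reported only by the OCR extractor is likely a scan with no embedded text layer.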
If macOS Preview can find text that this tool cannot, it might be because:
- The PDF contains text in images that requires better OCR
- The text uses a special font or encoding
- The PDF has complex formatting or structure
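One concrete instance of the font and encoding problem is ligatures: some PDFs store "fi" as the single character "ﬁ", so a plain substring search for "finance" fails. A minimal sketch of a normalization step that could be applied to both the extracted text and the search term. This is an assumption about a possible fix, not something the tool is known to do, and both function names are hypothetical.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Fold ligatures, case, and whitespace so that matching is more forgiving."""
    # NFKC replaces compatibility characters such as the "fi" ligature (U+FB01)
    # and the non-breaking space (U+00A0) with their plain equivalents.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return " ".join(folded.split())


def contains(text: str, term: str) -> bool:
    """Return True if the normalized term occurs in the normalized text."""
    return normalize_for_search(term) in normalize_for_search(text)
```

Normalization of this kind only helps when the extracted characters are valid Unicode. PDFs whose fonts lack a usable character mapping generally still need OCR.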
If Tesseract OCR is installed correctly but still does not find the text, you might need to:
- Improve the image quality before OCR
- Try different Tesseract parameters or language settings
- Use a cloud-based OCR service for better results
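The first two suggestions can be tried directly from Python. A sketch, assuming pdf2image, Pillow, and pytesseract are installed: render pages at a higher DPI, convert them to grayscale, and pass a language and page segmentation mode to Tesseract. The default values here (300 DPI, `eng`, `--psm 6`) are illustrative starting points, not settings taken from this project, and both function names are hypothetical.

```python
def tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract option string (engine mode and page segmentation mode)."""
    return "--oem {} --psm {}".format(oem, psm)


def ocr_pdf(path: str, dpi: int = 300, lang: str = "eng", psm: int = 6):
    """Return OCR text per page, rendering at `dpi` and converting to grayscale."""
    import pytesseract  # third-party, imported lazily
    from pdf2image import convert_from_path

    texts = []
    for image in convert_from_path(path, dpi=dpi):
        gray = image.convert("L")  # grayscale often gives Tesseract cleaner input
        texts.append(
            pytesseract.image_to_string(gray, lang=lang, config=tesseract_config(psm))
        )
    return texts
```

Languages other than English need the matching Tesseract language data installed before they can be passed as `lang`.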
This project is licensed under the MIT License - see the LICENSE file for details.
Created by Aemal Sayer