Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Updated Aug 23, 2025 - Python
Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs
Use TradeRepublic in the terminal and mass-download all documents
JavaScript bindings for MuPDF
Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Translate many large PDF reports for free using Python.
Extract presentation slides from videos with accurate timestamps
This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
A Python + C implementation for image-based PDF page layout analysis and content extraction.
Anyparser TypeScript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.
A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.
PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.
A Python tool to extract schedule data from PDF timetables and output it as GTFS.
Scalable PDF Extraction using Multimodal GPT-4o
A web application that scrapes ML research papers from arXiv and generates summaries using either the OpenAI or Claude API.
End-to-End Data Pipeline for Tourism Industry Analysis
A PDF reader application powered by AI, allowing users to upload PDF documents and extract meaningful information using advanced NLP models. Built with Streamlit, Transformers, and LangChain, this app provides a seamless interface for interacting with and analyzing PDF content.